The world’s leading publication for data science, AI, and ML professionals.

4 Tricky Data Manipulation Operations With Python Pandas

Pushing the limits

Photo by Joppe Spaa on Unsplash
Photo by Joppe Spaa on Unsplash

Pandas is one of the most popular data analysis and manipulation libraries in the Data Science ecosystem. It is so for a few reasons.

One of the reasons is that Pandas makes it possible to do complex operations simply and intuitively. You can achieve much more than you would anticipate in one line of Pandas code.

In this article, we will go over 4 data manipulation operations that seem a little complex at first. After we are done, you will be amazed at how simple it is to complete them with Pandas.

Since real-life datasets can be quite messy, data manipulation or cleaning is a major part of the data science workflow.

I prepared a sample dataset that demonstrates some issues we are likely to encounter in real-life datasets.

Note: This story is originally published on datasciencehowto.com.

I will also share the code that creates the data frame just in case you want to practice on your own.

df = pd.DataFrame({
    "last_date": ["2021-03-14", "2021-01-13", 
                  "2021-09-15", "2021-01-01", 
                  "2020-12-10","2021-04-02"],
    "price": ["$950.0", "$450.5", "$1,100.5", 
              "$360.0", "$850.0", "$440.5"],
    "duration": ["2 days", "1 week", "5 days", 
                 "1 month", "10 days", "2 weeks"],
    "amenities": ["tv, dishwasher", "TV, washer", 
                  "dishwasher, washer", "washer", 
                  "tv", "dishwasher"]
})
df
df (image by author)
df (image by author)

The values are of different types such as date, number, text, and list. However, they are stored with the object data type.

df.dtypes
last_date    object
price        object
duration     object
amenities    object
dtype: object

We should always have the data stored with the proper data type. Otherwise, we won’t be able to use some data type-specific operations.


Trick 1: Make the price proper

The price column seems for us but not for the software. "$360.0" is a sequence of characters and does not imply a quantity or numerical value.

Let’s update it to a format that the computer can actually understand it is 360.

df["price_cleaned"] = df["price"].str[1:].str.replace(",","").astype("float")

The first step is to take the value starting from the second character because the first one is the "$".

Then we remove the comma used as the thousands separator by using the replace function under the str accessor.

The final step is to change the data type to float.

Here is the updated data frame:

df (image by author)
df (image by author)

Trick 2: Has TV?

The amenities are listed in the amenities column. Assume I’m particularly interested in TV and want to create a column that indicates if there is a TV.

There are multiple ways of performing this task. One of the efficient ways is to use a list comprehension especially if you are working with a large data frame.

We need to take into account that "TV" and "tv" are the same.

df["has_tv"] = pd.Series(['tv' in row.lower() for row in df.amenities]).replace({False:0, True:1})

Let’s elaborate on the list comprehension first:

['tv' in row.lower() for row in df.amenities]
  • It takes all the rows in the amenities column.
  • Converts it to lower case so that we catch both "TV" and "tv".
  • Checks if "tv" exists in the row.

This list comprehension returns True or False but I want to convert them to 1 or 0, respectively.

The replace function of Pandas can easily accomplish this task so I convert the list to Pandas series and then apply the replace function.

The data frame now looks like this:

df (image by author)
df (image by author)

Trick 3: How old is the entry?

There is a column called "last date" in the data frame. I’m more interested in how old this entry or ad is. Thus, I need to calculate the number of days between today and the last date.

The following code creates a column called "since last date" that contains the information I need.

df["since_last_date"] = (pd.to_datetime("2021-11-25") - df["last_date"].astype("datetime64")).dt.days

We can calculate the difference between the two dates as long as they have the proper data type. Thus, I update the data type of the last date column using the "astype" function.

The "to_datetime" function takes a string and converts it to date.

The output of this subtraction operation is as follows:

0   256 days
1   316 days
2    71 days
3   328 days
4   350 days
5   237 days
Name: last_date, dtype: timedelta64[ns]

We can extract the quantity from this output using the "days" method under the "dt" accessor.

df (image by author)
df (image by author)

Trick 4: Standard duration?

The duration column represents a length of time but in different units. I think you would agree that it is not very useful in this format.

Let’s create a new duration column that shows the duration as the number of days.

What we need to do is:

  • Take the second word and convert it to the number of days.
  • Multiply this value with the quantity shown in the first word.
df["duration_days"] = df["duration"].str.split(" ", expand=True)[1].replace(
    {"week": 7, "weeks": 7, "days": 1, "day": 1, "month": 30}
).astype("int") * df["duration"].str.split(" ", expand=True)[0].astype("int")

The code seems complicated but it becomes very clear when we go over it step by step.

The following part takes the second word in the duration column (e.g. week, weeks, month)

df["duration"].str.split(" ", expand=True)[1]

The replace function replaces the words with the appropriate number of days.

replace(
    {"week": 7, "weeks": 7, "days": 1, "day": 1, "month": 30}
)

Then the "astype" function converts the data type to integer.


The second part of the multiplication takes the first word in the duration column and converts the data type to integer.

df["duration"].str.split(" ", expand=True)[0].astype("int")

Multiplying these two values gives us a standard duration in terms of the number of days.

df (image by author)
df (image by author)

You do not have to do all of these operations at once. Another option is to create a column that has the unit such as month, week, day, and so on. Then, you can update the value accordingly.


Pandas provides numerous functions and methods that simplify and expedite the data analysis and manipulation processes.

Considering the amount of time spent on cleaning the raw data, Pandas is a highly important tool for both data scientists and data analysts.

You can become a Medium member to unlock full access to my writing, plus the rest of Medium. If you do so using the following link, I will receive a portion of your membership fee at no additional cost to you.

Join Medium with my referral link – Soner Yıldırım


Thank you for reading. Please let me know if you have any feedback.


Related Articles