
3 Python Operations for Solving Specific Data Processing Tasks Efficiently

Leverage the flexibility of Pandas and Python

Photo by Federico Beccari on Unsplash

Raw data almost never arrives in the format you prefer or require. Your workflow starts with getting it into that format, which takes up a substantial amount of your time.

Thankfully, there are lots of tools available to us that expedite this process. As these tools evolve, they get better at solving even very specific tasks efficiently. Pandas has been around for quite a long time and has become one of the most widely used data analysis and cleaning tools.

The built-in functionalities of Python also make it easy to handle data operations. It’s no surprise that Python is the dominant language in the data science ecosystem.

In this article, we’ll go over three specific cases and learn how to leverage the flexibility of Python and Pandas to solve them.


1. Expand date ranges

We’re likely to encounter this task when working with time series data. Say we have a dataset that shows the lifecycle of products at different stores, as shown below:

(image by author)

For some other downstream tasks, we need to convert this dataset into the following format:

(image by author)

We basically create a separate row for each date between the start and end dates. This is also known as expanding the data. We’ll use some Pandas and built-in Python functions to complete this task.

Let’s create a sample dataset with mock data in this format in case you want to practice along.

import pandas as pd

lifecycle = pd.DataFrame({
    "store_id": [1130, 1130, 1130, 1460, 1460],
    "product_id": [103, 104, 112, 130, 160],
    "start_date": ["2022-10-01", "2022-09-14", "2022-07-20", "2022-06-30", "2022-12-10"],
    "end_date": ["2022-10-15", "2022-11-06", "2022-09-10", "2022-07-20", "2023-01-10"]
})

The first step is to make sure the data is stored with the proper data types.

lifecycle.dtypes

# output
store_id       int64
product_id     int64
start_date    object
end_date      object
dtype: object

We need to convert the dates to datetime data type to be able to use the date_range function in the following step.

lifecycle = lifecycle.astype(
   {"start_date": "datetime64[ns]", "end_date": "datetime64[ns]"}
)

lifecycle.dtypes

# output
store_id       int64
product_id     int64
start_date    datetime64[ns]
end_date      datetime64[ns]
dtype: object
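An equivalent alternative to astype is pd.to_datetime, which also exposes an errors parameter for handling malformed dates. Here is a minimal sketch on a tiny frame:

```python
import pandas as pd

df = pd.DataFrame({
    "start_date": ["2022-10-01"],
    "end_date": ["2022-10-15"],
})

# convert both columns; errors="raise" (the default) fails loudly on bad input
df["start_date"] = pd.to_datetime(df["start_date"])
df["end_date"] = pd.to_datetime(df["end_date"])

print(df.dtypes)  # both columns are now datetime64[ns]
```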

To expand the data, we first need to get all the dates between the start and end dates, which can be done using the date_range function.

Here is an example that expands the dates in the first row of our DataFrame.

pd.date_range(lifecycle.start_date[0], lifecycle.end_date[0])

# output
DatetimeIndex(['2022-10-01', '2022-10-02', '2022-10-03', '2022-10-04',
               '2022-10-05', '2022-10-06', '2022-10-07', '2022-10-08',
               '2022-10-09', '2022-10-10', '2022-10-11', '2022-10-12',
               '2022-10-13', '2022-10-14', '2022-10-15'],
              dtype='datetime64[ns]', freq='D')

We need to do this for all the rows, which can be done in a few different ways. We’ll use a list comprehension and save the output in a new column called "date".

lifecycle.loc[:, "date"] = [
    pd.date_range(x, y) for x, y in zip(lifecycle["start_date"], lifecycle["end_date"])
]

The zip function creates an iterator of tuples from the given iterables. The iterables in our case are the start date and end date columns. The list comprehension allows for applying the date_range function to each pair of dates on this iterator.
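As a quick illustration of what zip produces here, consider two short date lists standing in for the start and end date columns:

```python
starts = ["2022-10-01", "2022-09-14"]
ends = ["2022-10-15", "2022-11-06"]

# zip pairs up the i-th elements of each iterable
pairs = list(zip(starts, ends))
print(pairs)
# [('2022-10-01', '2022-10-15'), ('2022-09-14', '2022-11-06')]
```

Each tuple is one (start, end) pair, which the list comprehension then feeds into date_range.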

Here is the updated version of the DataFrame:

(image by author)

The final step is to create a separate row for each date in the date column. The explode function is the perfect fit for this task. After expanding the dates, we won’t need the start date and end date columns so let’s also drop them.

lifecycle = lifecycle.explode(column="date")

# drop the start and end date columns
lifecycle = lifecycle.drop(["start_date", "end_date"], axis=1)

lifecycle.head()
(image by author)

That’s it! We got the data in the desired format.


2. Extract multiple items from a list

Let’s say we need to extract some items from a long list, and all we have is the indices of these items.

If it were a single item, we could just pass its index as shown below:

names = ["Jane", "John", "Max", "Ashley", "Max"]

# get second item
names[1]

# output
'John'

If it were a slice of the list (e.g. the first 3 items, or the items between the second and fifth), we could specify it with start and end indices.

names = ["Jane", "John", "Max", "Ashley", "Max"]

# get the first three items
names[:3]

# output
['Jane', 'John', 'Max']

What if we need many items at arbitrary index positions? It’d be tedious to write a loop and extract them one by one. A much better approach is the itemgetter function from the built-in operator module.

We have a scores list that contains 40 items.

scores = [1225, 598, 922, 1565, 225, 1173, 1658, 1112, 1339, 1521, 737, 
          1374, 1676, 1183, 1598, 1948, 1123, 1185, 560, 1335, 1867, 270, 
          1901, 1089, 905, 1282, 1205, 873, 1860, 456, 1645, 1499, 525, 
          1696, 1703, 1157, 1791, 1930, 1268, 1760]

We need to extract items whose index values are given in a separate list.

items_to_extract = [0, 3, 4, 8, 12, 23, 30, 32, 36]

For demonstration purposes the list is short, but think of this technique when working with large data structures.

Here is how to use the itemgetter function to extract items with the given index values from the scores list.

from operator import itemgetter

items_requested = itemgetter(*items_to_extract)(scores)

print(items_requested)
# output
(1225, 1565, 225, 1339, 1676, 1089, 1645, 525, 1791)

It returns a tuple of items, but we can easily convert it to a list with the list constructor: list(items_requested).
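One edge case to keep in mind: when itemgetter is built from a single index, it returns the bare item rather than a one-element tuple, so code that always expects a sequence needs a small guard. A quick sketch with toy values:

```python
from operator import itemgetter

scores = [10, 20, 30, 40]

single = itemgetter(2)(scores)       # a bare item, not a tuple
multiple = itemgetter(0, 3)(scores)  # a tuple of items

print(single)          # 30
print(multiple)        # (10, 40)
print(list(multiple))  # [10, 40]
```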


3. Sort DataFrame by frequency

The sort_values function of Pandas is used for sorting the rows of the DataFrame. We can sort by the values in a single column or multiple columns.

Let’s first create a sample DataFrame.

df = pd.DataFrame({
    "category": ["A", "A", "C", "B", "B", "B", "C", "B", "C", "B", "B", "C"],
    "issue_level": [5, 3, 5, 1, 1, 1, 2, 4, 2, 3, 5, 2],
    "month": ["Jan", "Mar", "Feb", "Mar", "Feb", "June", "July", "Dec", "Mar", "Feb", "Apr", "Feb"],
    "year": [2023, 2022, 2022, 2023, 2021, 2023, 2020, 2021, 2022, 2022, 2023, 2022]
})
df

(image by author)

We want to sort the rows by the category column. The order should be based on the frequency of values (i.e. the most frequent value is the first and so on).

The value_counts function gives us the frequency of values in a column:

df["category"].value_counts()

# output
B    6
C    4
A    2
Name: category, dtype: int64

So the order in the category column should be B, C, A.

There are different ways of solving this task but one of the simplest ones is to group the rows by category and use the transform function to create a frequency column. We can then use this column for sorting the rows.

df.loc[:, "frequency"] = df.groupby("category")["category"].transform("count")

After creating this column, the DataFrame looks like this:

df

(image by author)

We can now sort the rows by the frequency column. Make sure to reset the index after sorting. Otherwise, the original index values of rows will remain.

df = df.sort_values(by="frequency", ascending=False).reset_index(drop=True)

Here is the final DataFrame:

df

(image by author)

We can choose to drop the frequency column if it’s only used for sorting the rows. Another thing to note here is that if there are multiple categories with the same frequency value, we can specify an additional criterion for sorting (e.g. first by frequency and then by category name alphabetically). In that case, both columns are passed to the sort_values function as a list.
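For instance, passing both columns sorts first by frequency (descending) and then by category name (ascending) within ties. Here is a minimal sketch on a small frame where every category is tied, so the alphabetical tie-breaker determines the final order:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["C", "A", "B", "C", "A", "B"],
    "issue_level": [5, 3, 1, 2, 4, 2],
})

# frequency of each category (every category appears twice here)
df["frequency"] = df.groupby("category")["category"].transform("count")

# sort by frequency descending, then category name ascending
df = df.sort_values(
    by=["frequency", "category"], ascending=[False, True]
).reset_index(drop=True)

print(df["category"].tolist())  # ['A', 'A', 'B', 'B', 'C', 'C']
```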


Final words

Python is the most popular language in the data science ecosystem. Its intuitive and easy-to-learn syntax plays a key role in making it the first choice of programming language for data science. Another reason we use Python so much is its ready-to-use solutions for even highly specific cases. In this article, we covered three such cases. I hope you find them helpful and put them to use somewhere in your data science journey.

Thank you for reading. Please let me know if you have any feedback.

