
Pandas is a must-have Python library if you’re working with data. Whether you’re a programmer, data scientist, analyst, or researcher, you’ll find it makes handling structured data much easier. It gives you flexible, intuitive tools to work with even the most complex datasets.
Mastering Pandas can significantly boost your productivity and streamline your workflow. In this article, we'll explore 7 essential tips that will help you leverage the library's full potential and tackle data challenges more effectively.
To illustrate the following tips, I'll be using a dataset from Kaggle's Airbnb listings. You can fetch the dataset here. (License: CC0: Public Domain) This dataset comprises three CSV files: `calendar.csv`, `listings.csv`, and `reviews.csv`. These files contain information about property availability, detailed listing attributes, and user reviews, respectively. By working with real-world data, I'll demonstrate how Pandas can efficiently handle and analyze complex, multi-file datasets typically encountered in data science projects.
1. Reading data in chunks
We often encounter scenarios where the data is larger than the available memory (RAM). In such cases, it's a good idea to read the data from the file in chunks so that the system doesn't run out of memory. This approach lets you process data incrementally, reducing memory usage and potentially speeding up your workflow.
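A minimal sketch of what this can look like using the `chunksize` parameter of `read_csv()` (the per-chunk aggregation here is just an illustrative example):

```python
import pandas as pd

# Read reviews.csv in chunks of 100,000 rows instead of loading it all at once
chunks = pd.read_csv("reviews.csv", chunksize=100_000)

partial_counts = []
for chunk in chunks:
    # Process each chunk independently, e.g. count reviews per listing
    partial_counts.append(chunk.groupby("listing_id").size())

# Combine the per-chunk results into a single Series
review_counts = pd.concat(partial_counts).groupby(level=0).sum()
```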
2. Render a progress bar while using the apply() function
I have always been a user of the [tqdm](https://pypi.org/project/tqdm/) library. For those who don't know about `tqdm`, it is a popular choice for displaying progress bars during iterative operations. It provides a seamless way to track the progress of loops and other iterable processes, which is particularly helpful when working with large datasets. However, I encountered a challenge in applying `tqdm` to Pandas' `apply()` functions, which are commonly used to perform element-wise operations on DataFrame columns or rows.
To my delight, I discovered that `tqdm` does, in fact, support Pandas' `apply()` methods. All it takes is a small addition to your code. Instead of using the standard `apply()` function, you can simply replace it with `progress_apply()`. This will automatically render a progress bar, giving you valuable insights into the execution of your `apply()` operations. Before that, just make sure to import `tqdm` and add a line of code below it to integrate it with pandas.
For demonstration, let's go back to the `listings.csv` dataset and convert the column `last_review` from type `string` to `datetime`. I'll use the `progress_apply()` function instead, which will create a progress bar.
If you don't have `tqdm` installed, you can use the command below to install it:

```
pip3 install tqdm
```
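Putting it together, a sketch of the conversion described above might look like this:

```python
import pandas as pd
from tqdm import tqdm

# Register progress_apply() / progress_map() on pandas objects
tqdm.pandas()

listings = pd.read_csv("listings.csv")

# Same as apply(), but renders a progress bar while it runs
listings["last_review"] = listings["last_review"].progress_apply(pd.to_datetime)
```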

3. Effortlessly Populate Multiple Columns in Pandas Using apply() with result_type="expand"
While using the `apply()` function, I often encountered scenarios where I needed to return multiple values from a function simultaneously and store them in separate columns. That's when I discovered the `result_type="expand"` parameter, which lets me do exactly that.
It was such a time-saver! I no longer had to create separate functions and iterate over the same set of values just to store them in separate columns. I could just create a single function and return multiple values.
In the example below, I have demonstrated the same. I'm returning two values, the day and month name, based on the review date. These values are then populated simultaneously in the `Day` and `Month` columns respectively.
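Here is a small sketch of that pattern, assuming the review date lives in a `date` column of `reviews.csv`:

```python
import pandas as pd

reviews = pd.read_csv("reviews.csv", parse_dates=["date"])

def day_and_month(row):
    # Return two values at once: the weekday name and the month name of the review date
    return row["date"].day_name(), row["date"].month_name()

# result_type="expand" spreads the returned tuple across the two new columns
reviews[["Day", "Month"]] = reviews.apply(day_and_month, axis=1, result_type="expand")
```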

4. Processing a DataFrame in Parallel
I have often faced scenarios where certain processes take too long to complete. When a DataFrame has millions of rows and you need to iterate through each row to extract certain information, it can get slow very quickly.
This is where I use the multiprocessing module. I split the DataFrame (using `np.array_split()`) into multiple chunks (depending on the number of cores available in your system) and process each chunk in parallel. This is especially useful when the system has multiple cores, which most modern systems do.
Once the chunks are processed and the desired output for each chunk is obtained, it can be concatenated back to a single DataFrame.
Let’s use the reviews dataset for this one. It has more than 480k reviews by different users across listings.
For demonstration purposes, we will create a function that simulates the time required for sentiment prediction of a review, assuming each prediction takes 0.1 seconds.
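Something along these lines, where the `time.sleep()` call simply stands in for a real model:

```python
import time
import pandas as pd

reviews = pd.read_csv("reviews.csv")

def predict_sentiment(review_text):
    # Placeholder for an actual sentiment model; each call "costs" 0.1 seconds
    time.sleep(0.1)
    return "positive"

# Rough sequential estimate: ~480,000 reviews * 0.1 s each
print(f"Estimated sequential time: {len(reviews) * 0.1 / 3600:.1f} hours")
```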

As you can see, it will take more than 13 hours if we make predictions one by one!
Let’s speed this up. First, let’s split the reviews DataFrame into multiple batches so that each batch is processed by a separate core. We can then create a multiprocessing Pool to distribute the computation over multiple cores.
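A sketch of that approach, reusing `reviews` and `predict_sentiment()` from the snippet above and assuming the review text is in a `comments` column:

```python
import numpy as np
import pandas as pd
from multiprocessing import Pool

n_cores = 64  # fine to exceed the physical core count here, since the work is mostly sleeping

def process_batch(batch):
    # Run the (simulated) sentiment prediction on one chunk of the DataFrame
    return batch.assign(sentiment=batch["comments"].apply(predict_sentiment))

if __name__ == "__main__":
    # Split the DataFrame into one batch per worker
    batches = np.array_split(reviews, n_cores)

    with Pool(n_cores) as pool:
        processed = pool.map(process_batch, batches)

    # Stitch the processed chunks back into a single DataFrame
    reviews = pd.concat(processed)
```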

Using multiprocessing, we have managed to bring down the prediction time required from more than 13 hours to less than 13 minutes!
Each batch consists of 7521 reviews and there are a total of 64 batches. In this scenario, I was able to set `n_cores` higher than the actual number of cores my system has. This is because during the execution of `time.sleep(0.1)` the CPU remains idle, so the processes can interleave with each other. If your process is CPU-intensive, it is recommended to keep `n_cores` at or below the actual number of cores your system has.
5. Perform Complex Merging with merge() and indicator=True
Merging is quite a common operation for anyone who deals with data. However, it can sometimes be hard to tell whether any particular data points were lost during the merging process. This can happen for a plethora of reasons, the worst being malformed or faulty data.
This is where the `indicator=True` parameter comes in handy. When you enable it, the merge adds a new column named `_merge` that takes one of three values, indicating where each row's key was found.
- `left_only` – indicates that the row's key only exists in the left DataFrame and no match was found in the right DataFrame
- `right_only` – indicates that the row's key only exists in the right DataFrame and no match was found in the left DataFrame
- `both` – indicates that the row's key exists in both DataFrames and the data was successfully merged.
These values can also come in handy as a filter to apply during data manipulation tasks.
In the example below, I'll be performing an outer merge between the reviews and listings DataFrames.
I'm using the parameter `indicator=True` to identify which listings have no reviews. Ideally, listings with `_merge` set to `"left_only"` will be the ones missing reviews.
```python
merged_df = listings.merge(
    reviews,
    left_on="id",
    right_on="listing_id",
    how="outer",
    indicator=True
)

merged_df.shape
```

Below are some examples of listings with no reviews.
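One way to pull those rows out is to filter on the `_merge` column (the columns selected below are just for illustration):

```python
# Listings whose key never matched a review end up flagged as "left_only"
missing_reviews = merged_df[merged_df["_merge"] == "left_only"]
missing_reviews[["id", "name", "_merge"]].head()
```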

6. Segmenting Data into Price Brackets Using pd.cut()
`pd.cut()` is a powerful function that you can use when you need to segment data into multiple bins. It can also act as a way to convert continuous values into categorical ones.
One such scenario is demonstrated in the example below, where we segment the price of each listing into multiple price brackets (bins).
We can set a predetermined set of price brackets: "$0 – $100", "$101 – $250", "$251 – $500", "$500 – $1000", and "$1000+".
```python
# Create bins and a label for each bin
bins = [0, 100, 250, 500, 1000, float('inf')]
labels = ["$0 - $100", "$101 - $250", "$251 - $500", "$500 - $1000", "$1000+"]
listings["price_bucket"] = pd.cut(listings["price"], bins=bins, labels=labels)
```
Please note that the number of labels (5) is one less than the number of bin edges (6). This is because each bin is defined by two consecutive edges, so six edges produce five bins.
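To see how the listings are distributed across these brackets, you can count them per bin, for example:

```python
# Count how many listings fall into each price bracket
listings["price_bucket"].value_counts().sort_index()
```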

We can see that the majority of the listings (3400) lie in the $101–$250 range, with the fewest listings (29) in the $1000+ range.
7. Cross-Tabulation Analysis of Prices and Room Types
Using the above data, we can go one step further and perform a cross-tabulation between the price brackets and room types available for those price brackets.
Here's where `pd.crosstab()` comes into play. It can be used to perform a simple cross-tabulation between `price_bucket` and `room_type`.
```python
room_type_breakup = pd.crosstab(
    listings["price_bucket"],
    listings["room_type"],
    margins=True  # This will add the row and column totals
)
```

The resulting table shows the distribution of room types across the different price brackets. The last row and column contain the row and column totals; set `margins=False` if they are not desired.
By now, you've learned several powerful Pandas techniques that will significantly improve your data processing workflow. From managing large datasets with chunked reading to speeding up operations through parallel processing, these advanced methods will help you handle complex data tasks more efficiently.
Please do let me know if you recently found any other interesting techniques that improved your workflow or if you found a more efficient way to handle the above techniques!