
5 Pandas Tricks You Should Start Using in 2021

Start the New Year with one of the best New Year's resolutions: Learn more pandas.

Photo by Michael Payne on Unsplash

Pandas needs no introduction, as it has become the de facto tool for Data Analysis in Python. As a Data Scientist, I use pandas daily, and it never ceases to amaze me with better ways of achieving my goals.

For pandas newbies – Pandas provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

The name pandas is derived from the term "panel data", an econometrics term for datasets that include observations over multiple time periods for the same individuals.

In this article, I’m going to show you 5 pandas tricks that will make you more productive when:

  • joining data with pandas DataFrames,
  • working with dates and times,
  • making Excel reports,
  • storing DataFrames to disk.

Let’s begin.


1. Date Ranges

Photo by Waldemar Brandt on Unsplash

When fetching the data from a database or an external API, you usually have to specify the start date and the end date for your query.

Operating with dates can be tricky, but don’t worry, pandas has us covered!

Pandas has a date_range function, which returns dates incremented by days, months, years, etc. It also works with times.

Let’s say you need a date range incremented by days.

import pandas as pd

date_from = "2019-01-01"
date_to = "2019-01-12"
date_range = pd.date_range(date_from, date_to, freq="D")
date_range
The output of a pandas date_range function.

Let’s transform the generated date_range to start and end dates, which can be passed directly to an API or a database query.

for i, (date_from, date_to) in enumerate(zip(date_range[:-1], date_range[1:]), 1):
    date_from = date_from.date().isoformat()
    date_to = date_to.date().isoformat()
    print("%d. date_from: %s, date_to: %s" % (i, date_from, date_to))
# the output with date_from and date_to
1. date_from: 2019-01-01, date_to: 2019-01-02
2. date_from: 2019-01-02, date_to: 2019-01-03
3. date_from: 2019-01-03, date_to: 2019-01-04
4. date_from: 2019-01-04, date_to: 2019-01-05
5. date_from: 2019-01-05, date_to: 2019-01-06
6. date_from: 2019-01-06, date_to: 2019-01-07
7. date_from: 2019-01-07, date_to: 2019-01-08
8. date_from: 2019-01-08, date_to: 2019-01-09
9. date_from: 2019-01-09, date_to: 2019-01-10
10. date_from: 2019-01-10, date_to: 2019-01-11
11. date_from: 2019-01-11, date_to: 2019-01-12
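The freq argument is not limited to days. As a quick sketch of two other common frequency aliases (the dates here are just illustrative), "H" generates hourly timestamps and "M" generates month-end dates:

```python
import pandas as pd

# hourly timestamps instead of daily dates
hourly = pd.date_range("2019-01-01", periods=4, freq="H")
print(hourly.tolist())

# month-end dates for the first quarter of 2019
monthly = pd.date_range("2019-01-01", "2019-03-31", freq="M")
print(monthly.tolist())
```

The same pattern works for weekly ("W"), business-day ("B"), yearly, and many other offset aliases listed in the pandas documentation.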

2. Merge with indicator

Photo by pine watt on Unsplash

Pandas has a merge function, which does exactly as the name suggests: it merges two DataFrames with a database-style join.

Merging two datasets is the process of bringing two datasets together into one and aligning the rows from each based on common columns.

If you’re like me, you’ve used the merge function many times. But what surprised me is that I didn’t know about the indicator argument.

The indicator argument adds a _merge column to the merged DataFrame, which tells you which DataFrame the row came from:

  • left DataFrame,
  • right DataFrame,
  • or both DataFrames.

The _merge column can be very useful when working with bigger datasets to validate the merge operation.

Let’s define two DataFrames and merge them with the merge indicator.

left = pd.DataFrame({"key": ["key1", "key2", "key3", "key4"], "value_l": [1, 2, 3, 4]})
Rows in the left DataFrame
right = pd.DataFrame({"key": ["key3", "key2", "key1", "key6"], "value_r": [3, 2, 1, 6]})
Rows in the right DataFrame
df_merge = left.merge(right, on='key', how='left', indicator=True)
Entries in the df_merge DataFrame with the _merge column

We can use the _merge column to check if there is an expected number of rows with values from both DataFrames.

df_merge._merge.value_counts()
# output of values count function
both          3
left_only     1
right_only    0
Name: _merge, dtype: int64
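The _merge column also makes it easy to keep only the rows without a match, an "anti-join". A minimal sketch, repeating the DataFrames from above so the snippet is self-contained:

```python
import pandas as pd

left = pd.DataFrame({"key": ["key1", "key2", "key3", "key4"], "value_l": [1, 2, 3, 4]})
right = pd.DataFrame({"key": ["key3", "key2", "key1", "key6"], "value_r": [3, 2, 1, 6]})

df_merge = left.merge(right, on="key", how="left", indicator=True)

# keep only the rows that have no match in the right DataFrame
unmatched = df_merge[df_merge["_merge"] == "left_only"].drop(columns="_merge")
print(unmatched["key"].tolist())  # ['key4']
```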

3. Nearest merge

Photo by Tim Johnson on Unsplash

When working with financial data, like stocks or cryptocurrency, we may need to combine quotes (price changes) with actual trades.

Let’s say we would like to merge each trade with a quote that occurred a few milliseconds before it. How can we achieve this with pandas?

Pandas has a merge_asof function, which enables merging DataFrames by the nearest key (a timestamp in our example). The quotes and trades datasets are taken from the pandas documentation examples.

The quotes DataFrame contains price changes for different stocks. Usually, there are many more quotes than trades.

quotes = pd.DataFrame(
    [
        ["2016-05-25 13:30:00.023", "GOOG", 720.50, 720.93],
        ["2016-05-25 13:30:00.023", "MSFT", 51.95, 51.96],
        ["2016-05-25 13:30:00.030", "MSFT", 51.97, 51.98],
        ["2016-05-25 13:30:00.041", "MSFT", 51.99, 52.00],
        ["2016-05-25 13:30:00.048", "GOOG", 720.50, 720.93],
        ["2016-05-25 13:30:00.049", "AAPL", 97.99, 98.01],
        ["2016-05-25 13:30:00.072", "GOOG", 720.50, 720.88],
        ["2016-05-25 13:30:00.075", "MSFT", 52.01, 52.03],
    ],
    columns=["timestamp", "ticker", "bid", "ask"],
)
quotes['timestamp'] = pd.to_datetime(quotes['timestamp'])
Entries in the quotes DataFrame

The trades DataFrame contains trades of different stocks.

trades = pd.DataFrame(
    [
        ["2016-05-25 13:30:00.023", "MSFT", 51.95, 75],
        ["2016-05-25 13:30:00.038", "MSFT", 51.95, 155],
        ["2016-05-25 13:30:00.048", "GOOG", 720.77, 100],
        ["2016-05-25 13:30:00.048", "GOOG", 720.92, 100],
        ["2016-05-25 13:30:00.048", "AAPL", 98.00, 100],
    ],
    columns=["timestamp", "ticker", "price", "quantity"],
)
trades['timestamp'] = pd.to_datetime(trades['timestamp'])
Entries in the trades DataFrame

We merge trades and quotes by ticker, where the latest quote can be at most 10 ms behind the trade.

If the quote is more than 10 ms behind the trade, or there isn’t any quote at all, the bid and ask for that trade will be null (the AAPL ticker in this example).

df_merge = pd.merge_asof(trades, quotes, on="timestamp", by='ticker', tolerance=pd.Timedelta('10ms'), direction='backward')
Entries in the df_merge DataFrame
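Besides "backward" (the default, used above), merge_asof also accepts direction="forward" and direction="nearest". A minimal, self-contained sketch with simple numeric keys (illustrative data, not the quotes/trades example):

```python
import pandas as pd

left = pd.DataFrame({"t": [1, 5, 10], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"t": [1, 2, 3, 6, 7], "right_val": [1, 2, 3, 6, 7]})

# backward (the default): take the last right row whose key is <= the left key
backward = pd.merge_asof(left, right, on="t")

# forward: take the first right row whose key is >= the left key
forward = pd.merge_asof(left, right, on="t", direction="forward")

# nearest: take the right row with the closest key in either direction
nearest = pd.merge_asof(left, right, on="t", direction="nearest")

print(backward["right_val"].tolist())  # [1, 3, 7]
print(nearest["right_val"].tolist())   # [1, 6, 7]
```

Note that both DataFrames must be sorted by the merge key, and unmatched rows (e.g. t=10 with direction="forward") get nulls, just like the AAPL trade above.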

4. Create an Excel report

Photo by Lukas Blazek on Unsplash

We can create Excel reports directly in pandas (with the help of the XlsxWriter library).

This is a major time saver – no more saving a DataFrame to CSV and then copy-pasting and formatting it in Excel. We can also add all kinds of charts, etc.

Let’s define a DataFrame and then create an Excel report with it.

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["a", "b", "c"])

The code snippet below creates an Excel report.

report_name = 'example_report.xlsx'
sheet_name = 'Sheet1'
writer = pd.ExcelWriter(report_name, engine='xlsxwriter')
df.to_excel(writer, sheet_name=sheet_name, index=False)
# don't call writer.save() yet; we add a chart to the report below

We can also add charts to the Excel report.

We need to define the type of the chart (line chart in our example) and the data series for the chart (the data series needs to be in the Excel spreadsheet).

# define the workbook
workbook = writer.book
worksheet = writer.sheets[sheet_name]
# create a chart line object
chart = workbook.add_chart({'type': 'line'})
# configure the series of the chart from the spreadsheet
# using a list of values instead of category/value formulas:
#     [sheetname, first_row, first_col, last_row, last_col]
chart.add_series({
    'categories': [sheet_name, 1, 0, 3, 0],
    'values':     [sheet_name, 1, 1, 3, 1],
})
# configure the chart axes
chart.set_x_axis({'name': 'Index', 'position_axis': 'on_tick'})
chart.set_y_axis({'name': 'Value', 'major_gridlines': {'visible': False}})
# place the chart on the worksheet
worksheet.insert_chart('E2', chart)
# output the excel file
writer.save()
Excel report created with pandas

5. Save disk space

Photo by 铮 夏 on Unsplash

When working on multiple Data Science projects, you usually end up with many preprocessed datasets from different experiments. The small SSD on a laptop can get cluttered quickly.

Pandas enables us to compress a dataset when saving it and to read it back in compressed form.

Let’s create a big pandas DataFrame with random numbers.

import numpy as np
df = pd.DataFrame(np.random.randn(50000, 300))
DataFrame with random numbers.

When we save this file as CSV, it takes almost 300 MB on the hard drive.

df.to_csv('random_data.csv', index=False)

We can reduce the file size to 136 MB by using the gz extension instead of csv, as pandas infers gzip compression from the file extension.

df.to_csv('random_data.gz', index=False)

We don’t lose any functionality with this, as it is also easy to read the gzipped data back into a DataFrame.

df = pd.read_csv('random_data.gz')
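You can verify the size reduction yourself by comparing the files on disk. A small sketch with a reduced DataFrame so it runs quickly (the file names here are just placeholders); it also shows that compression can be requested explicitly instead of relying on the extension:

```python
import os

import numpy as np
import pandas as pd

# a smaller DataFrame so the example runs quickly
df = pd.DataFrame(np.random.randn(1000, 30))

df.to_csv("random_sample.csv", index=False)
# equivalent to using a .gz extension, but stated explicitly
df.to_csv("random_sample.csv.gz", index=False, compression="gzip")

csv_size = os.path.getsize("random_sample.csv")
gz_size = os.path.getsize("random_sample.csv.gz")
print(csv_size, gz_size)  # the gzipped file is noticeably smaller
```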

Conclusion

Photo by Johannes Plenio on Unsplash

In case you would like to run these examples, you can download the Jupyter notebook.

These tricks help me be more productive with pandas every day. Let me know in the comments which pandas trick is your favorite.

