EDA with Polars: Step-by-Step Guide for Pandas Users (Part 1)

Level up your data analysis with Polars

Photo by Mitul Grover on Unsplash

Introduction

Every once in a while, a tool comes along that significantly changes the way data analysis is done. I believe that Polars is one of those tools, so in this series of posts I’ll take a deep dive into the library, compare it to a more well-known and established library – Pandas, and showcase the analysis workflow using an example dataset.

What is Polars?

Polars is a blazingly fast DataFrame library written in Rust. Lucky for us (data scientists and analysts), it has a very well-documented Python wrapper that exposes a complete set of features to wrangle data and build data pipelines. Here are the main advantages I’ve seen after switching to Polars:

  • Much faster pre-processing operations
  • Ability to handle larger than RAM datasets
  • Better quality of code due to the need to properly structure data pipelines

You can see the full set of benefits in this user guide and the speed comparisons in this H2O benchmark.

Switching from Pandas

At first glance, Pandas and Polars seem quite similar, e.g. methods like .read_csv() or .head() are shared between them, so you can perform basic exploratory operations without any changes. But the more you work with the library, the more you’ll notice how different the two libraries are. From syntax to the way of thinking, switching to Polars is no easy task. That’s why I hope that these posts will help you get started.

Setup

To follow along with the project, make sure to pull this GitHub repo with the latest notebook. The data used in this project can be downloaded from Kaggle (CC0: Public Domain). It’s a dataset about YouTube’s top trending videos and should provide enough complexity for this series of posts. You’ll also need Pandas and Polars installed, which can be done with pip (e.g. pip install pandas polars).

Now that everything is set up, let’s jump into the project! The main goal here is to get you more familiar with Polars, so make sure to follow along or practice the concepts on your own dataset. Once again, you can find the notebook with all the code used here in the GitHub repo.

Data Processing

Reading Data

Reading data is going to be familiar to Pandas users since Polars uses exactly the same method names. Let’s read in the statistics for United Kingdom videos for further analysis and print out the DataFrame shapes.

import pandas as pd
import polars as pl

# Define file path
path = './youtube/GBvideos.csv'

# Read in using Polars
df_pl = pl.read_csv(path)
print(df_pl.shape)
>>> (38916, 16)

# Read in using Pandas
df_pd = pd.read_csv(path)
print(df_pd.shape)
>>> (38916, 16)

If you time both of these read operations, you’ll have your first "wow" moment with Polars. On my laptop, Polars reads the file in ~110 ms and Pandas reads it in ~270 ms. That’s a 2.5x speed-up, and you’ll frequently see read/write speed-ups much larger than this (especially with larger files).
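
If you want to reproduce the comparison yourself, here’s a rough single-run timing sketch (in the notebook, %timeit gives more reliable numbers, and the exact figures will vary by machine):

import time

# Rough single-run timing of the CSV reads
start = time.perf_counter()
pl.read_csv(path)
print(f"Polars read: {time.perf_counter() - start:.3f} s")

start = time.perf_counter()
pd.read_csv(path)
print(f"Pandas read: {time.perf_counter() - start:.3f} s")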

Common Exploratory Methods

What’s the first thing you do when you read in the data? I assume you print out the head (or a sample), check the data types, shapes, etc. Polars shares a lot of these high-level methods with Pandas, so you can explore the first rows using the .head() method and the last rows using the .tail() method.

print(df_pl.head(2))
print(df_pl.tail(2))
print(df_pl.sample(2))
First two rows. Screenshot by author.

While you’ll be able to see the outputs, the formatting is not ideal. You can change how the outputs get displayed using Config. For example, to make the printouts wider you can set the maximum number of characters per row to 200 like so: pl.Config.set_tbl_width_chars(200). Then the output is going to look more pleasant.

Wide format. Screenshot by author.

There are a lot of other parameters you can adjust (e.g. hide column data types), so make sure to check out the Config documentation.
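
For reference, here’s a small sketch of a few Config settings I find handy (the exact setter names below are assumptions worth double-checking against the Config docs for your Polars version):

# Adjust how DataFrames are printed (setter names may vary slightly between versions)
pl.Config.set_tbl_width_chars(200)              # wider printouts
pl.Config.set_tbl_rows(10)                      # maximum number of rows to print
pl.Config.set_tbl_hide_column_data_types(True)  # hide the data type row in printouts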

Selecting Columns

By Data Types

You might have seen from the printout that Polars has a slightly different set of data types for its columns. Numeric columns are usually assigned types of Int32, Int64, Float32, or Float64, and categorical columns are usually assigned the Utf8 type. To select columns by data type, the so-called selectors can be used together with the .select() method. These selectors are a relatively new addition to the API and they give us a more intuitive way of selecting columns. For example, below you can see the code to select all the numeric and all the categorical columns.

import polars.selectors as cs

# Polars feature selection
numeric_data_pl = df_pl.select(cs.numeric())
categorical_data_pl = df_pl.select(cs.string())

# Pandas feature selection
numeric_data_pd = df_pd.select_dtypes(include="number")
categorical_data_pd = df_pd.select_dtypes(exclude="number")
Numerical and Categorical Features. Screenshot by author.

By Column Name

If you want to select a column by name, you can use the same .select() method, but now you’ll need to provide the column’s name wrapped in pl.col(). To select multiple columns, just provide the names in a list, very similar to Pandas.

subset_pl = df_pl.select(
    pl.col("likes"), 
    pl.col("views"),
    pl.col("comment_count")
)
subset_pd = df_pd[["likes", "views", "comment_count"]]

# This will also work
subset_pl_2 = df_pl[["likes", "views", "comment_count"]]

It should be noted that df_pl[["likes", "views", "comment_count"]] will also work due to syntactic sugar implemented by Polars. Still, it’s a good practice to write the full statements, so I recommend you get some practice writing it both ways.

Selecting Rows

To select specific rows you’ll need to use the .filter() method. Notice that Polars doesn’t have an index, which means that commands like .iloc are not available. Let’s find out how many rows in the dataset have fewer than 1000 views. This number should be very small because it’s highly unlikely that an unpopular video gets into the Trending tab.

filtered_pl = df_pl.filter(pl.col("views") < 1000)
filtered_pl.shape
>>> (6, 16)

filtered_pd = df_pd[df_pd['views'] < 1000]
filtered_pd.shape
>>> (6, 16)

Data Quality Checks

For a more complicated use case, let’s perform basic data quality checks. When doing a data quality check, it’s always a good idea to look at the number of missing rows per column and the number of static columns. In Pandas, this can be done very simply using built-in checks and aggregations.

missing = df_pd.isna().sum()
missing = missing[missing > 0]
static = df_pd.nunique() == 1
static = static[static]

print("Missing rows:")
print(missing)
print("nStatic Columns:")
print(static)

>>> Missing rows:
>>> description    612
>>> dtype: int64
>>> Static Columns:
>>> Series([], dtype: bool)

With Polars this part is a bit more involved and requires chaining a few methods.

missing = (
    df_pl.select(pl.all().is_null().sum())
    .melt(value_name="missing")
    .filter(pl.col("missing") > 0)
)
static = (
    df_pl.select(pl.all().n_unique())
    .melt(value_name="unique")
    .filter(pl.col("unique") == 1)
)
print("Missing columns:")
print(missing)

print("nStatic columns:")
print(static)
Missing columns:
shape: (0, 2)
┌──────────┬─────────┐
│ variable ┆ missing │
│ ---      ┆ ---     │
│ str      ┆ u32     │
╞══════════╪═════════╡
└──────────┴─────────┘

Static columns:
shape: (0, 2)
┌──────────┬────────┐
│ variable ┆ unique │
│ ---      ┆ ---    │
│ str      ┆ u32    │
╞══════════╪════════╡
└──────────┴────────┘

Let’s break down the code to calculate the number of missing rows.

  • df_pl.select(pl.all()) repeats the specified operation for all columns
  • .is_null().sum() sums up the number of Null values (Polars representation of NA)
  • .melt() transforms wide DataFrame into long format
  • .filter(pl.col("missing") > 0) filters out the columns where no rows are missing

You can see that chaining these operations is quite easy, and the flow is quite similar to PySpark. Even though the code is a bit more involved, the execution is ~4x faster in Polars.

What is surprising is that the results of the data quality checks don’t match up. With Pandas, there are 612 missing rows in the description column, whereas with Polars we don’t see this. This is because Polars reads missing strings as empty strings "", so they don’t appear in the null counts. You can easily replace these strings with Null values if you wish, e.g. with the .replace() method.
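
As an illustration, here’s one way to do that with a when/then expression inside .with_columns() (a minimal sketch; the df_pl_nullable name is just for this example and isn’t used later):

# Sketch: convert empty description strings into proper nulls
df_pl_nullable = df_pl.with_columns(
    pl.when(pl.col("description") == "")
    .then(pl.lit(None, dtype=pl.Utf8))
    .otherwise(pl.col("description"))
    .alias("description")
)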

Data Pre-Processing

There are two steps that need to be done to prepare the data:

  • Convert date columns into datetime format
  • Replace category IDs with the actual category names

To do this, we’ll need to use the .with_columns() method because it returns the entire DataFrame with the altered columns. The .select() method wouldn’t work here, because it would only return the processed columns. In addition, we’ll need to use the .str namespace, which is very similar to Pandas. This namespace has all the operations available for strings, e.g. .str.contains() or .str.to_lowercase() (see all here). If you’re interested in working with strings in Polars, check out this post after finishing with this one.

# Pandas datetime conversion
df_pd['publish_time'] = pd.to_datetime(df_pd['publish_time'])
df_pd['trending_date'] = pd.to_datetime(
    df_pd['trending_date'], format='%y.%d.%m'
)

# Polars datetime conversion
df_pl = df_pl.with_columns(
    pl.col('trending_date').str.to_date(format='%y.%d.%m'),
    pl.col('publish_time').str.to_datetime()
)
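
As a quick aside, the .str namespace mentioned above works the same way for other string operations. A minimal sketch (the mentions_official flag is purely illustrative and isn’t used later in the project):

# Sketch: flag video titles that mention "official", e.g. official music videos
title_flags = df_pl.select(
    pl.col("title")
    .str.to_lowercase()
    .str.contains("official")
    .alias("mentions_official")
)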

To replace a category ID with its name, we’ll simply apply the .map_dict() method, which is similar to .map() in Pandas. In Polars, .map() only works with functions, so keep this in mind.

import json

# Load ID to Category mapping
with open('./youtube/US_category_id.json', 'r') as f:
    categories = json.load(f)

id_to_category = {}
for c in categories['items']:
    id_to_category[int(c['id'])] = c['snippet']['title']

# Pandas mapping
df_pd['category_id'] = df_pd['category_id'].map(id_to_category)

# Polars mapping
df_pl = df_pl.with_columns(pl.col("category_id").map_dict(id_to_category))

Now that the data is ready, let’s finally do some analysis!

Basic Exploratory Data Analysis

This section will cover some of the most important techniques when doing EDA, namely univariate data analysis, aggregates, and visualisations.

Univariate Data Analysis

Univariate data analysis is the simplest analysis you can do, yet it’s crucial. Looking at one variable at a time can give you a better sense of the data and can guide your further explorations.

Categorical Columns

Since we’ve mapped the category IDs to their actual names in the data pre-processing section, let’s see their distribution using .value_counts().

# Polars value counts
category_counts = df_pl['category_id'].value_counts(sort=True).head()
print(category_counts)
┌──────────────────┬────────┐
│ category_id      ┆ counts │
│ ---              ┆ ---    │
│ str              ┆ u32    │
╞══════════════════╪════════╡
│ Music            ┆ 13754  │
│ Entertainment    ┆ 9124   │
│ People &amp; Blogs   ┆ 2926   │
│ Film &amp; Animation ┆ 2577   │
│ Howto &amp; Style    ┆ 1928   │
└──────────────────┴────────┘

The value_counts() operation is also present in Polars, just don’t forget to set sort=True if you want the same behaviour as in Pandas. Next, we can use this information to create a basic bar plot. Plotting with Polars is relatively simple, although there are no built-in plotting methods yet, unlike in Pandas. Some plotting libraries don’t take Polars Series as inputs, but lucky for us it’s quite easy to convert a Series to a commonly accepted format – a Python list, a NumPy array, or a Pandas Series.

import matplotlib.pyplot as plt
import seaborn as sns

# Convert to Python list
list_data = df_pl["category_id"].to_list()
# Convert to NumPy
numpy_data = df_pl["category_id"].to_numpy()
# Convert to Pandas
pandas_data = df_pl["category_id"].to_pandas()
# Barplot
sns.barplot(
    y=category_counts["category_id"].to_numpy(),
    x=category_counts["counts"].to_numpy(),
    color="#306e81",
)
plt.title("Category Counts in Trending Data")
plt.xlabel("Counts")
plt.show()
Image by author.

Looks like Music is by far the most frequent category in YouTube trends, followed by Entertainment, which is not surprising. On the other hand, people making content about Shows, Non-Profit, Travel, and Autos & Vehicles will have a much harder time getting into Trending.

Numerical Columns

Univariate analysis of numerical features can be performed using the .describe() method, which behaves very similarly to Pandas. We can also plot a histogram of log views. The log transformation is done to handle heavy outliers like the video with 424 million views. Interestingly, the minimum number of views a video in Trending has is just 851.

views_stats = df_pl.select(pl.col("views")).describe()
print(views_stats)

sns.histplot(df_pl['views'].log())
plt.title("Log Views Distribution in Trending")
plt.show()
Screenshot and image by author.

There are many more columns to explore, so I’d recommend exploring them on your own since you now have the tools. After you’re done, let’s move on to more complicated forms of analysis.

Multivariate Data Analysis

First things first, which channels appear most frequently on the trending page? We could again use .value_counts(), but let’s use the .groupby().agg() method instead since it’s much more flexible and will be useful going forward. I’m going to group by the channel title and count the number of rows using the .count() method.

channel_popularity = (
    df_pl.groupby(pl.col("channel_title"))
    .agg(pl.count().alias("trending_count"))
    .sort(pl.col("trending_count"), descending=True)
)

print(channel_popularity.head())
shape: (5, 2)
┌───────────────────────────────────┬────────────────┐
│ channel_title                     ┆ trending_count │
│ ---                               ┆ ---            │
│ str                               ┆ u32            │
╞═══════════════════════════════════╪════════════════╡
│ The Tonight Show Starring Jimmy … ┆ 208            │
│ Jimmy Kimmel Live                 ┆ 207            │
│ TheEllenShow                      ┆ 207            │
│ Saturday Night Live               ┆ 206            │
│ WWE                               ┆ 205            │
└───────────────────────────────────┴────────────────┘

Similar to Pandas, inside .groupby() you need to specify the name of the column that you want to create aggregates for. Notice that for Polars you should wrap the column name in pl.col(), although it will also work without it due to the implemented syntactic sugar. Inside .agg() you usually provide the column name that you want to aggregate, but in this case I’m using pl.count() since I want to count rows. Notice that you can rename any aggregate/column you create using the .alias() method.

Let’s create a few other statistics namely:

  • Number of unique trendy videos
  • Total number of views, likes and comments
  • Average number of views, likes, and comments

channel_stats_pl = df_pl.groupby("channel_title").agg(
    pl.count().alias("trending_count"), # number of occurrences in the dataset
    pl.col("title").n_unique().alias("number_of_trending_videos"), # number of unique trending videos
    pl.col("views").sum().alias("total_views"), # total number of views
    pl.col("likes").sum().alias("total_likes"), # total number of likes
    pl.col("comment_count").sum().alias("total_comments"), # total number of comments
    pl.col("views").mean().alias("average_views"), # average number of views
    pl.col("likes").mean().alias("average_likes"), # average number of likes
    pl.col("comment_count").mean().alias("average_comments"), # average number of comments
)
print(channel_stats_pl.sample(5))
Image by author.

Looks like everything is working as expected. The same aggregates could be implemented in Pandas using its .agg() method.

channel_stats_pd = df_pd.groupby("channel_title").agg(
    trending_count=pd.NamedAgg(column="title", aggfunc="count"),
    number_of_trending_videos=pd.NamedAgg(column="title", aggfunc="nunique"),
    total_views=pd.NamedAgg(column="views", aggfunc="sum"),
    average_views=pd.NamedAgg(column="views", aggfunc="mean"),
    total_likes=pd.NamedAgg(column="likes", aggfunc="sum"),
    average_likes=pd.NamedAgg(column="likes", aggfunc="mean"),
    total_comments=pd.NamedAgg(column="comment_count", aggfunc="sum"),
    average_comments=pd.NamedAgg(column="comment_count", aggfunc="mean"),
)

The results should be the same, but the execution time is ~10x faster in Polars.

These aggregates are usually a good first step in the analysis, and they might be useful down the line (e.g. in a dashboard), so it would be nice to refactor the aggregation code into a function.

def make_aggregates(df: pl.DataFrame, groupby: str, agg_features: list[str]) -> pl.DataFrame:
    # Aggregates that measure popularity using video counts
    popularity_aggs = [
        pl.count().alias("trending_count"),
        pl.col("title").n_unique().alias("number_of_trending_videos"),
    ]
    # Aggregates that measure popularity using metrics of the videos
    metrics_agg = []
    for agg in agg_features:
        if agg not in df.columns:
            print(f"{agg} not in the dataframe. Skipping...")
        else:
            metrics_agg.append(pl.col(agg).sum().alias(f"total_{agg}"))
            metrics_agg.append(pl.col(agg).mean().alias(f"average_{agg}"))

    stats = df.groupby(groupby).agg(popularity_aggs + metrics_agg)
    stats = stats.sort("trending_count", descending=True)
    return stats

Above you can see what this aggregation function could look like. Notice that we can store the required aggregate operations in a list before passing them into Polars’ .agg() method, which is quite powerful. Now, this function can be applied not only to the channel_title column, but also e.g. to category_id.

channel_aggs = make_aggregates(
    df = df_pl, 
    groupby = "channel_title", 
    agg_features = ["views", "likes", "comment_count"]
)
category_aggs = make_aggregates(
    df = df_pl, 
    groupby = "category_id", 
    agg_features = ["views", "likes", "comment_count"]
)

print("Top Channels")
print(channel_aggs.head())

print("nTop Categories")
print(category_aggs.head())
Image by author.

In the screenshot above you can see the top trending channels and the top trending categories. These aggregates could then be put into a dashboard or used for further analysis.

Writing Data

Saving the DataFrames to disk is actually quite easy. All you need to do is make sure that the folder you’re writing to exists. Otherwise, the process is very similar to Pandas, except you use the .write_parquet() method and not .to_parquet() (I make this mistake at least once a day).
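
For example, here’s a quick way to make sure the output folder exists before writing (a small sketch using the standard library; the ./data folder matches the paths below):

from pathlib import Path

# Create the output folder if it doesn't already exist
Path("./data").mkdir(parents=True, exist_ok=True)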

channel_aggs.write_parquet("./data/channel_aggs.parquet")
category_aggs.write_parquet("./data/category_aggs.parquet")

Conclusion

Good job for making it this far! Overall, you’ve seen how to do the following in Polars:

  • Read in data
  • Investigate the DataFrames
  • Perform basic data quality checks
  • Select required columns/rows
  • Perform basic cleaning
  • Perform basic univariate analysis
  • Perform basic multivariate analysis
  • Write out the DataFrames to Parquet files

This is a great start, so let’s wrap up the first part of the series here. Make sure to practice on your own dataset because I firmly believe that the best way to learn is by practicing, practicing, and practicing. Thank you for reading and see you in the next part!

What Next?

If you’ve got the basics of Polars down, it’s time to move on to something more advanced. Part 2 of this series covers more complicated aggregates and analytical functions, which are essential for any data professional.

EDA with Polars: Step-by-Step Guide to Aggregate and Analytic Functions (Part 2)
