
Introduction
Every once in a while, a tool comes along that significantly alters the way data analysis is done. I believe Polars is one of those tools, so in this series of posts I'll take a deep dive into the library, compare it to the more well-known and established Pandas, and showcase the analysis workflow using an example dataset.
What is Polars?
Polars is a blazingly fast DataFrame library written in Rust. Luckily for us (data scientists/analysts), it has a very well-documented Python wrapper that exposes a complete set of features to wrangle data and build data pipelines. Here are the main advantages I've seen after switching to Polars:
- Much faster pre-processing operations
- Ability to handle larger than RAM datasets
- Better quality of code due to the need to properly structure data pipelines
You can see the full set of benefits in this user guide and the speed comparisons in this H2O benchmark.
Switching from Pandas
At first glance, Pandas and Polars seem quite similar, e.g. methods like .read_csv() or .head() are shared between them, so you can perform basic exploratory operations without any changes. But the more you work with the library, the more you'll notice how different the two libraries are. From the syntax to the way of thinking, switching to Polars is no easy task. That's why I hope that these posts will help you get started.
Setup
To follow along with the project, make sure to pull this GitHub repo with the latest notebook. The data used in this project can be downloaded from Kaggle (CC0: Public Domain). It's a dataset about YouTube's top trending videos and should provide enough complexity for this series of posts. Also, you'll need Pandas and Polars installed, which can be done with pip for both packages.
Now that everything is set up, let's jump into the project! The main goal here is to get you more familiar with Polars, so make sure to follow along or practice the concepts on your own dataset. Once again, in the GitHub repo you can find the notebook with all the code used here.
Data Processing
Reading Data
Reading data is going to be familiar to Pandas users since Polars uses exactly the same method names. Let's read in the statistics for United Kingdom videos for further analysis and print out the DataFrame shapes.
import pandas as pd
import polars as pl

# Define file path
path = './youtube/GBvideos.csv'
# Read in using Polars
df_pl = pl.read_csv(path)
print(df_pl.shape)
>>> (38916, 16)
# Read in using Pandas
df_pd = pd.read_csv(path)
print(df_pd.shape)
>>> (38916, 16)
If you time both of these read operations, you'll have your first "wow" moment with Polars. On my laptop, Polars reads the file in ~110 ms while Pandas reads it in ~270 ms. That's a ~2.5x speedup, and you'll frequently see read/write speedups much larger than this (especially with larger files).
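If you want to reproduce the comparison yourself, here is a minimal timing sketch using time.perf_counter (in a notebook you could also use %%timeit; the exact numbers will depend on your machine and the file):
import time
import pandas as pd
import polars as pl

path = './youtube/GBvideos.csv'

start = time.perf_counter()
pl.read_csv(path)
print(f"Polars read: {time.perf_counter() - start:.3f} s")

start = time.perf_counter()
pd.read_csv(path)
print(f"Pandas read: {time.perf_counter() - start:.3f} s")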
Common Exploratory Methods
What's the first thing you do when you read in the data? I assume you print out the head (or a sample), check the data types, shapes, etc. Polars shares a lot of these high-level methods with Pandas, so you can explore the first rows using the .head() method and the last rows using the .tail() method.
print(df_pl.head(2))
print(df_pl.tail(2))
print(df_pl.sample(2))

While you'll be able to see the outputs, their formatting is not ideal. You can change how the outputs get displayed using the Config. For example, to make the printouts wider you can set the maximum number of characters per row to 200 like so: pl.Config.set_tbl_width_chars(200). Then the output is going to look more pleasant.

There are a lot of other parameters you can adjust (e.g. hiding column data types), so make sure to check out the Config documentation.
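As a brief illustration, here are a couple of settings I find handy (a minimal sketch; the exact option names can vary between Polars versions, so treat them as examples and check the Config docs):
# Show wider tables and more rows in printouts
pl.Config.set_tbl_width_chars(200)
pl.Config.set_tbl_rows(10)
# Hide the data type row under the column names (option name assumed; see the Config docs)
pl.Config.set_tbl_hide_column_data_types(True)

print(df_pl.head(2))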
Selecting Columns
By Data Types
You might have seen from the printout that Polars has a slightly different set of data types for its columns. Numeric columns are usually assigned types like Int32, Int64, Float32, or Float64, and text (categorical) columns are assigned the Utf8 type. To select columns by data type, the so-called selectors can be used together with the .select() method. These selectors are a relatively new addition to the API and they give us a more intuitive way of selecting columns. For example, below you can see the code to select all the numeric and all the categorical columns.
import polars.selectors as cs
# Polars feature selection
numeric_data_pl = df_pl.select(cs.numeric())
categorical_data_pl = df_pl.select(cs.string())
# Pandas feature selection
numeric_data_pd = df_pd.select_dtypes(include="number")
categorical_data_pd = df_pd.select_dtypes(exclude="number")


By Column Name
If you want to select a column by name, you can use the same .select() method, but now you'll need to provide the column's name wrapped in pl.col(). To select multiple columns, just provide several column names, very similar to Pandas.
subset_pl = df_pl.select(
    pl.col("likes"),
    pl.col("views"),
    pl.col("comment_count")
)
subset_pd = df_pd[["likes", "views", "comment_count"]]
# This will also work
subset_pl_2 = df_pl[["likes", "views", "comment_count"]]
It should be noted that df_pl[["likes", "views", "comment_count"]] will also work due to syntactic sugar implemented by Polars. Still, it's good practice to write the full statements, so I recommend you get some practice writing it both ways.
Selecting Rows
To select specific rows you'll need to use the .filter() method. Notice that Polars doesn't have an index, which means that commands like .iloc are not available. Let's find out how many rows in the dataset have fewer than 1000 views. This number should be very small because it's highly unlikely that an unpopular video gets into the Trending tab.
filtered_pl = df_pl.filter(pl.col("views") < 1000)
filtered_pl.shape
>>> (6, 16)
filtered_pd = df_pd[df_pd['views'] < 1000]
filtered_pd.shape
>>> (6, 16)
Data Quality Checks
For a more complicated use case, let's perform basic data quality checks. When doing a data quality check, it's always a good idea to look at the number of missing rows per column and the number of static columns. In Pandas, this can be done very simply using built-in checks and aggregations.
missing = df_pd.isna().sum()
missing = missing[missing > 0]
static = df_pd.nunique() == 1
static = static[static]
print("Missing rows:")
print(missing)
print("nStatic Columns:")
print(static)
>>> Missing rows:
>>> description 612
>>> dtype: int64
>>> Static Columns:
>>> Series([], dtype: bool)
With Polars this part is a bit more involved and requires chaining a few methods.
missing = (
    df_pl.select(pl.all().is_null().sum())
    .melt(value_name="missing")
    .filter(pl.col("missing") > 0)
)
static = (
    df_pl.select(pl.all().n_unique())
    .melt(value_name="unique")
    .filter(pl.col("unique") == 1)
)
print("Missing columns:")
print(missing)
print("nStatic columns:")
print(static)
Missing columns:
shape: (0, 2)
┌──────────┬─────────┐
│ variable ┆ missing │
│ --- ┆ --- │
│ str ┆ u32 │
╞══════════╪═════════╡
└──────────┴─────────┘
Static columns:
shape: (0, 2)
┌──────────┬────────┐
│ variable ┆ unique │
│ --- ┆ --- │
│ str ┆ u32 │
╞══════════╪════════╡
└──────────┴────────┘
Let's break down the code used to calculate the number of missing rows:
- df_pl.select(pl.all()) repeats the specified operation for all columns
- .is_null().sum() sums up the number of Null values (Polars' representation of NA)
- .melt() transforms the wide DataFrame into long format
- .filter(pl.col("missing") > 0) filters out the columns where no rows are missing
You can see that chaining these operations is easy, and the flow is quite similar to PySpark. Even though the code is a bit more involved, the execution is ~4x faster in Polars.
What is surprising is that the results of the data quality checks don't match up. With Pandas, there are 612 missing rows in the description column, whereas with Polars we don't see this. This is because Polars treats missing strings as empty strings "", so they don't appear in the null counts. You can easily replace these strings with Null values if you wish, e.g. using the .replace() method; a sketch of this clean-up follows below.
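As a minimal sketch of that clean-up, here is one way to do it with a when/then expression (an alternative to .replace(); the description column name comes from this dataset, and the result is assigned to a new variable to keep the original DataFrame intact):
# Turn empty description strings into proper nulls so they show up in null counts
df_pl_clean = df_pl.with_columns(
    pl.when(pl.col("description") == "")
    .then(pl.lit(None, dtype=pl.Utf8))
    .otherwise(pl.col("description"))
    .alias("description")
)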
Data Pre-Processing
There are two steps that need to be done to prepare the data:
- Convert the date columns into datetime format
- Replace category IDs with the actual category names
To do this, we'll need to use the .with_columns() method because it returns the entire DataFrame with the altered columns. The .select() method wouldn't work in this case because it would only return the processed columns. In addition, we'll need to use the .str namespace, which is very similar to the one in Pandas. This namespace has all the operations available for strings, e.g. .str.contains() or .str.to_lowercase() (see all of them here). If you're interested in working with strings in Polars, check out this post after finishing this one.
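Before the actual conversion, here is a quick illustration of the .str namespace on this dataset (a minimal sketch; the two operations shown are just examples and aren't needed for the pre-processing itself):
# A couple of string operations from the .str namespace
str_examples = df_pl.select(
    pl.col("title").str.to_lowercase().alias("title_lower"),
    pl.col("title").str.contains("official").alias("mentions_official"),
)
print(str_examples.head())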
# Pandas datetime conversion
df_pd['publish_time'] = pd.to_datetime(df_pd['publish_time'])
df_pd['trending_date'] = pd.to_datetime(
    df_pd['trending_date'], format='%y.%d.%m'
)
# Polars datetime conversion
df_pl = df_pl.with_columns(
    pl.col('trending_date').str.to_date(format='%y.%d.%m'),
    pl.col('publish_time').str.to_datetime()
)
To replace the category IDs, we'll simply apply the .map_dict() method, which is similar to .map() in Pandas. In Polars, .map() only works with functions, so keep this in mind.
import json

# Load ID to Category mapping
with open('./youtube/US_category_id.json', 'r') as f:
    categories = json.load(f)

id_to_category = {}
for c in categories['items']:
    id_to_category[int(c['id'])] = c['snippet']['title']

# Pandas mapping
df_pd['category_id'] = df_pd['category_id'].map(id_to_category)
# Polars mapping
df_pl = df_pl.with_columns(pl.col("category_id").map_dict(id_to_category))
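As a side note, here is what .map() with a function looks like, since it's easy to confuse with .map_dict() (a minimal sketch; note that newer Polars versions rename these methods, e.g. .map() to .map_batches()):
# .map() applies a Python function to the whole Series at once
views_in_thousands = df_pl.select(
    pl.col("views").map(lambda s: s / 1_000).alias("views_thousands")
)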
Now that the data is ready, let’s finally do some analysis!
Basic Exploratory Data Analysis
This section will cover some of the most important techniques in EDA, namely univariate data analysis, aggregates, and visualisations.
Univariate Data Analysis
Univariate data analysis is the simplest analysis you can do, yet it's crucial. Looking at one variable at a time gives you a better sense of the data and can guide your further explorations.
Categorical Columns
Since we've mapped the category IDs to their actual names in the data pre-processing section, let's see their distribution using .value_counts().
# Polars value counts
category_counts = df_pl['category_id'].value_counts(sort=True).head()
print(category_counts)
┌──────────────────┬────────┐
│ category_id ┆ counts │
│ --- ┆ --- │
│ str ┆ u32 │
╞══════════════════╪════════╡
│ Music ┆ 13754 │
│ Entertainment ┆ 9124 │
│ People & Blogs ┆ 2926 │
│ Film & Animation ┆ 2577 │
│ Howto & Style ┆ 1928 │
└──────────────────┴────────┘
The value_counts() operation is also present in Polars, just don't forget to set sort=True if you want the same behaviour as in Pandas. Next, we can use this information to create a basic bar plot. Plotting with Polars is relatively simple, although there are no built-in plotting methods so far like in Pandas. Some plotting libraries don't take Polars Series as inputs, but luckily for us it's quite easy to convert a Series to commonly accepted formats – Python lists, NumPy arrays, and Pandas Series.
import matplotlib.pyplot as plt
import seaborn as sns

# Convert to Python list
list_data = df_pl["category_id"].to_list()
# Convert to NumPy
numpy_data = df_pl["category_id"].to_numpy()
# Convert to Pandas
pandas_data = df_pl["category_id"].to_pandas()

# Barplot
sns.barplot(
    y=category_counts["category_id"].to_numpy(),
    x=category_counts["counts"].to_numpy(),
    color="#306e81",
)
plt.title("Category Counts in Trending Data")
plt.xlabel("Counts")
plt.show()

Looks like Music is by far the most frequent category in YouTube trends, followed by Entertainment, which is not surprising. On the other hand, creators making content for Shows, Non-Profit, Travel, and Autos & Vehicles will have a much harder time getting into Trending.
Numerical Columns
Univariate analysis of numerical features can be performed using the .describe() method, which behaves very similarly to Pandas. Also, we can plot a histogram of log views. The log transformation is done to handle heavy outliers, like the video with 424 million views. Interestingly, the minimum number of views a trending video has is just 851.
views_stats = df_pl.select(pl.col("views")).describe()
print(views_stats)
sns.histplot(df_pl['views'].log())
plt.title("Log Views Distribution in Trending")
plt.show()


There are many more columns to explore, and I'd recommend you go through them on your own since you now have the tools. After you're done, let's move on to more complicated forms of analysis.
Multivariate Data Analysis
First things first, which channels appear most frequently on the trending page? We could again use .value_counts(), but let's use the .groupby().agg() approach instead since it's much more flexible and will be useful going forward. I'm going to group by the channel title and count the number of rows using the .count() method.
channel_popularity = (
    df_pl.groupby(pl.col("channel_title"))
    .agg(pl.count().alias("trending_count"))
    .sort(pl.col("trending_count"), descending=True)
)
print(channel_popularity.head())
shape: (5, 2)
┌───────────────────────────────────┬────────────────┐
│ channel_title ┆ trending_count │
│ --- ┆ --- │
│ str ┆ u32 │
╞═══════════════════════════════════╪════════════════╡
│ The Tonight Show Starring Jimmy ... ┆ 208 │
│ Jimmy Kimmel Live ┆ 207 │
│ TheEllenShow ┆ 207 │
│ Saturday Night Live ┆ 206 │
│ WWE ┆ 205 │
└───────────────────────────────────┴────────────────┘
Similar to Pandas, inside .groupby() you need to specify the name of the column that you want to create aggregates for. Notice that for Polars you can wrap the column name with pl.col(), but it will also work without it thanks to the implemented syntactic sugar. Inside .agg() you usually need to provide the column name that you want to aggregate, but in this case I'm using pl.count() since I want to count rows. Notice that you can rename any aggregate/column you create using the .alias() method.
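For instance, the snippet below is equivalent to the aggregation above, relying on the string shorthand instead of pl.col() (a small sketch for comparison):
# Same aggregation, using plain column names instead of pl.col()
channel_popularity_alt = (
    df_pl.groupby("channel_title")
    .agg(pl.count().alias("trending_count"))
    .sort("trending_count", descending=True)
)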
Let's create a few other statistics, namely:
- Number of unique trending videos
- Total number of views, likes, and comments
- Average number of views, likes, and comments
channel_stats_pl = df_pl.groupby("channel_title").agg(
    pl.count().alias("trending_count"),  # number of occurrences in the dataset
    pl.col("title").n_unique().alias("number_of_trending_videos"),  # number of unique trending videos
    pl.col("views").sum().alias("total_views"),  # total number of views
    pl.col("likes").sum().alias("total_likes"),  # total number of likes
    pl.col("comment_count").sum().alias("total_comments"),  # total number of comments
    pl.col("views").mean().alias("average_views"),  # average number of views
    pl.col("likes").mean().alias("average_likes"),  # average number of likes
    pl.col("comment_count").mean().alias("average_comments"),  # average number of comments
)
print(channel_stats_pl.sample(5))
print(channel_stats_pl.sample(5))

Looks like everything is working as expected. The same aggregates can be implemented in Pandas using its .agg() method.
channel_stats_pd = df_pd.groupby("channel_title").agg(
    trending_count=pd.NamedAgg(column="title", aggfunc="count"),
    number_of_trending_videos=pd.NamedAgg(column="title", aggfunc="nunique"),
    total_views=pd.NamedAgg(column="views", aggfunc="sum"),
    average_views=pd.NamedAgg(column="views", aggfunc="mean"),
    total_likes=pd.NamedAgg(column="likes", aggfunc="sum"),
    average_likes=pd.NamedAgg(column="likes", aggfunc="mean"),
    total_comments=pd.NamedAgg(column="comment_count", aggfunc="sum"),
    average_comments=pd.NamedAgg(column="comment_count", aggfunc="mean"),
)
The results should be the same, but the execution time is ~10x faster in Polars.
These aggregates are usually a good first step in the analysis, and they might be useful down the line (e.g. in the dashboard), so it would be nice to refactor the aggregation code into a function.
def make_aggregates(df: pl.DataFrame, groupby: str, agg_features: list[str]) -> pl.DataFrame:
    # Aggregates that measure popularity using video counts
    popularity_aggs = [
        pl.count().alias("trending_count"),
        pl.col("title").n_unique().alias("number_of_trending_videos"),
    ]
    # Aggregates that measure popularity using metrics of the videos
    metrics_agg = []
    for agg in agg_features:
        if agg not in df.columns:
            print(f"{agg} not in the dataframe. Skipping...")
        else:
            metrics_agg.append(pl.col(agg).sum().alias(f"total_{agg}"))
            metrics_agg.append(pl.col(agg).mean().alias(f"average_{agg}"))
    stats = df.groupby(groupby).agg(popularity_aggs + metrics_agg)
    stats = stats.sort("trending_count", descending=True)
    return stats
Above you can see what this aggregation function could look like. Notice that we can store the required aggregate expressions in a list before passing them into Polars' .agg() method, which is quite powerful. Now this function can be applied not only to the channel_title column, but also e.g. to category_id.
channel_aggs = make_aggregates(
    df=df_pl,
    groupby="channel_title",
    agg_features=["views", "likes", "comment_count"]
)
category_aggs = make_aggregates(
    df=df_pl,
    groupby="category_id",
    agg_features=["views", "likes", "comment_count"]
)
print("Top Channels")
print(channel_aggs.head())
print("\nTop Categories")
print(category_aggs.head())

From the screenshot above you can see the top trending channels and the top trending categories. These aggregates could later be put into a dashboard or used for further analysis.
Writing Data
Saving the DataFrames to disk is actually quite easy. All you need to do is make sure that the folder you're writing to exists. Otherwise, the process is very similar to Pandas, except you use the .write_parquet() method and not .to_parquet() (I make this mistake at least once a day).
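If the output folder might not exist yet, you can create it first; a minimal sketch using pathlib (the ./data path simply matches the one used below):
from pathlib import Path

# Make sure the output directory exists before writing the Parquet files
Path("./data").mkdir(parents=True, exist_ok=True)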
channel_aggs.write_parquet("./data/channel_aggs.parquet")
category_aggs.write_parquet("./data/category_aggs.parquet")
Conclusion
Good job for making it this far! Overall, you’ve seen how to do the following in Polars:
- Read in data
- Investigate the DataFrames
- Perform basic data quality checks
- Select required columns/rows
- Perform basic cleaning
- Perform basic univariate analysis
- Perform basic multivariate analysis
- Write out the DataFrames to Parquet files
This is a great start, so let’s wrap up the first part of this series. Make sure to practice it on your own dataset because I firmly believe that the best way to learn is by practicing, practicing, and practicing. Thank you for reading and see you in the next part!
What Next?
Now that you've got the basics of Polars, it's time to move on to something more advanced. Part 2 of this series covers more complicated aggregates and analytical functions which are essential for any data professional.
EDA with Polars: Step-by-Step Guide to Aggregate and Analytic Functions (Part 2)