In the first part of this series, we cleaned and manipulated a dataset that contains trending video statistics in the US. In this article, we will analyze and visualize the data to draw insights from it. We will use the Pandas and Seaborn libraries for data analysis and visualization.
After cleaning and some preprocessing, the dataset contains 14 columns as below:
us.columns
Index(['trending_date', 'title', 'channel_title', 'category_id',
'publish_time', 'tags', 'views', 'likes','dislikes','comment_count',
'comments_disabled', 'ratings_disabled', 'video_error_or_removed',
'time_diff'],
dtype='object')
The time_diff column indicates the time between when a video is published and when it becomes trending. Let’s first calculate the average value of this column.
us.time_diff.mean()
Timedelta('16 days 05:21:53.236220664')
The average time is 16 days and 5 hours. On its own, this value does not tell us much about the time difference. For instance, the time_diff column might contain mostly low values and a few very high values that bring the average up.
In order to obtain a thorough overview of the distribution of the time difference, we also need to calculate other descriptive statistics such as the median and mode. Another solution is to visualize the distribution of this column, which makes it easier to understand.
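As a minimal sketch of that idea, the snippet below computes the mean, median, and mode of a timedelta Series. The values are made up for illustration and are not taken from the trending videos dataset; they only show how a single large value pulls the mean away from the median.

```python
import pandas as pd

# Made-up time differences (NOT from the actual dataset):
# mostly small values plus one very large value.
time_diff = pd.Series(pd.to_timedelta([1, 2, 2, 3, 4, 200], unit="D"))

mean_diff = time_diff.mean()      # pulled up by the single large value
median_diff = time_diff.median()  # much less sensitive to the extreme value
mode_diff = time_diff.mode()      # most frequent value(s), returned as a Series

print(mean_diff)
print(median_diff)
print(mode_diff.iloc[0])
```

On the real data, the same three calls could be applied to us.time_diff to check whether the mean is far above the median.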
The data type of the time_diff column is timedelta. We need to convert it to a numerical variable to plot its distribution. One way is to divide the time_diff column by another timedelta interval. For instance, we can convert it to hours as below.
us['time_diff_hour'] = us['time_diff'] / pd.Timedelta('1 hour')
us['time_diff_hour'].mean()
389.36
us['time_diff_hour'].mean() / 24
16.22
When we take the mean and divide by 24, we get the same value as the mean of the time_diff column.
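Dividing by a Timedelta is only one way to do this conversion. As a small sketch on made-up values (not the article's data), the snippet below compares it with the dt.total_seconds accessor, which should give the same number of hours.

```python
import pandas as pd

# Made-up timedelta values for illustration only.
td = pd.Series(pd.to_timedelta(["1 days 06:00:00", "0 days 12:00:00", "16 days 05:21:53"]))

# Option 1: divide by a one-hour Timedelta (the approach used in the article).
hours_div = td / pd.Timedelta("1 hour")

# Option 2: convert to seconds first, then to hours.
hours_sec = td.dt.total_seconds() / 3600

print(hours_div.round(2).tolist())
print(hours_sec.round(2).tolist())
```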
One type of visualization that gives us an overview of the distribution is the box plot.
sns.boxplot(data=us, y='time_diff_hour')

There are some extreme values that distort the plot. The average value of this column is about 389 hours, but we observe outliers as high as 100000 hours. Let’s see if we can afford to eliminate the outliers.
len(us)
40949
len(us[us.time_diff_hour > 600])
588
The number of rows in which the time difference is more than 600 hours is 588, which is very small compared to the total number of rows in the dataset. Thus, we can drop these outliers.
us = us[us.time_diff_hour <= 600]
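To put the counts above in perspective, the sketch below computes the share of rows that exceed 600 hours from the two numbers reported in the article. The second part shows a quantile-based cutoff as an alternative to a hand-picked threshold; it uses made-up values, and the 0.9 quantile is an arbitrary choice for the toy data, not a recommendation for the real dataset.

```python
import pandas as pd

# Row counts reported in the article.
total_rows = 40949
rows_above_600 = 588
share = rows_above_600 / total_rows
print(f"Share of rows above 600 hours: {share:.2%}")

# Alternative (assumption, not the article's method): derive the cutoff
# from a quantile instead of choosing it by hand. Toy values only.
hours = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 5000], dtype=float)
cutoff = hours.quantile(0.9)
trimmed = hours[hours <= cutoff]
print(cutoff, len(trimmed))
```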
We have eliminated a substantial amount of the outliers. Another type of visualization for checking the distribution is the histogram, which divides the value range of a continuous variable into discrete bins and counts the number of observations in each bin.
The displot function of Seaborn can be used to create a histogram as below.
sns.displot(data=us, x='time_diff_hour', kind='hist',
aspect=1.5, bins=20)

Most of the values are around 100 hours, so a video is likely to become trending approximately 4 days after it is published.
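The binning behind a histogram can also be inspected without a plot. The sketch below uses NumPy's histogram function on made-up values (not the article's data) to show how the value range is split into equal-width bins and how the observations are counted per bin.

```python
import numpy as np

# Made-up values for illustration only.
values = np.array([0, 10, 20, 30, 40, 60, 90, 100], dtype=float)

# Split the range [0, 100] into 4 equal-width bins and count per bin.
counts, edges = np.histogram(values, bins=4)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {n}")
```

Note that every bin is closed on the left and open on the right, except the last one, which also includes its right edge.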
I wonder which channels have the most trending videos. We can easily see the top 10 using the value_counts function of pandas.
us.channel_title.value_counts()[:10]
ESPN 203
The Tonight Show Starring Jimmy Fallon 197
TheEllenShow 193
Vox 193
Netflix 193
The Late Show with Stephen Colbert 187
Jimmy Kimmel Live 186
Late Night with Seth Meyers 183
Screen Junkies 182
NBA 181
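As a small sketch on made-up channel names, the snippet below shows two variations that may be handy here: head as an alternative to slicing, and the normalize parameter of value_counts, which returns shares instead of raw counts.

```python
import pandas as pd

# Made-up channel names for illustration only.
channels = pd.Series(["A", "B", "A", "C", "A", "B"], name="channel_title")

# value_counts sorts by count in descending order, so head(2) is the top 2.
top = channels.value_counts().head(2)

# normalize=True returns the fraction of rows per channel.
share = channels.value_counts(normalize=True)

print(top)
print(share)
```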
We can also compare the average views of trending videos published by these channels. It would be interesting to check if the order of average views is the same as the order of trending video count.
The groupby function of pandas with multiple aggregate functions will give us what we need.
us['views_mil'] = us['views'] / 1000000
# Parentheses let the method chain span multiple lines
(us[['channel_title','views_mil']].groupby('channel_title')
    .agg(['mean','count'])
    .sort_values(by=('views_mil','count'), ascending=False)[:10])

The Screen Junkies channel has the highest average, which is about 1.75 million views per video. ESPN has the second lowest average although it has the highest number of trending videos.
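The comparison between the two orderings can be made explicit by sorting the same summary twice. The sketch below uses named aggregation, which gives flat column names instead of a two-level column index. The data and the column names mean_views_mil and video_count are made up for illustration.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "channel_title": ["A", "A", "A", "B", "B", "C"],
    "views": [1_000_000, 2_000_000, 3_000_000, 8_000_000, 4_000_000, 500_000],
})

summary = (
    df.assign(views_mil=df["views"] / 1_000_000)
      .groupby("channel_title")
      .agg(
          mean_views_mil=("views_mil", "mean"),  # hypothetical column name
          video_count=("views_mil", "count"),    # hypothetical column name
      )
)

# Same table, ordered two ways.
by_count = summary.sort_values("video_count", ascending=False)
by_mean = summary.sort_values("mean_views_mil", ascending=False)

print(by_count)
print(by_mean)
```

In this toy data, channel A has the most videos but channel B has the highest average, which is the kind of mismatch observed between ESPN and Screen Junkies.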
We can also find out how the number of trending videos changes over time. The first step is to group the observations (i.e. rows) by date. Then we will sort them by date to have a proper time series.
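daily = (
    us[['trending_date']].value_counts()
    .reset_index(name='count')  # name the count column explicitly
    .sort_values(by='trending_date')
    .reset_index(drop=True)
)

The daily dataframe contains each date and the number of videos that are trending on that date. The count column is named explicitly because its default name differs across pandas versions (0 in older releases, count in 2.0 and later). We can now generate a line plot based on the daily dataframe.
sns.relplot(data=daily, x='trending_date', y='count',
kind='line', aspect=2.5)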
We use the relplot function of Seaborn and choose the line plot by using the kind parameter. The aspect parameter adjusts the ratio of the width and height of the visualization.

We observe an interesting trend. Most of the values are between 190 and 200 with a few exceptional days.
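The daily counts can also be built with groupby and size, which returns the groups already sorted by date. This is a sketch on made-up dates, not the article's data, and the column name count is chosen here for illustration.

```python
import pandas as pd

# Made-up trending dates for illustration only.
df = pd.DataFrame({
    "trending_date": pd.to_datetime([
        "2018-01-02", "2018-01-01", "2018-01-02",
        "2018-01-03", "2018-01-01", "2018-01-02",
    ])
})

# groupby sorts the group keys by default, so no separate sort is needed.
daily = df.groupby("trending_date").size().reset_index(name="count")

print(daily)
```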
Let’s also find the trending video with the highest number of views. There are multiple ways to accomplish this task. We will sort the dataframe by the views column in descending order and display the title and views of the first row.
us.sort_values(by='views', ascending=False)[['title','views']].iloc[0,:]
title childish gambino this is america official video
views 217750076
The trending video with the highest number of views has been viewed over 200 million times.
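Two of the other ways to find the row with the maximum value are idxmax and nlargest, both of which avoid sorting the whole dataframe. The sketch below uses made-up titles and view counts.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "title": ["video a", "video b", "video c"],
    "views": [120, 950, 430],
})

# Option 1: look up the row label of the maximum and select that row.
top_by_idxmax = df.loc[df["views"].idxmax(), ["title", "views"]]

# Option 2: take the single largest row by views.
top_by_nlargest = df.nlargest(1, "views")[["title", "views"]].iloc[0]

print(top_by_idxmax)
print(top_by_nlargest)
```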
Conclusion
We have unveiled some findings about the statistics of trending videos published on YouTube in 2017 and 2018. YouTube has further increased its popularity since 2018, so the statistics might be very different now.
However, the main focus of these two articles is to practice data analysis and visualization with Pandas and Seaborn. There is, of course, much more we can do with this dataset. Feel free to explore it on your own using Pandas and Seaborn or any other library you prefer.
Thank you for reading. Please let me know if you have any feedback.