
I watch videos on YouTube more than I watch TV, and I strongly believe that lots of people do the same. Some commentators that I used to watch on TV are now hosting shows on YouTube.
The first and foremost motivation is, of course, monetary. People can make a substantial amount of money on YouTube, and how much you make depends on how much your videos are watched.
In this pair of articles, we will analyze trending video statistics in the US. The first part covers data cleaning and manipulation with Pandas, because the dataset is not in the most convenient format for analysis. In the second part, we will analyze and visualize the data to extract valuable insights, using the Seaborn library to create the visualizations.
The first step is to import the libraries and read the dataset into a dataframe.
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(style='darkgrid')
us = pd.read_csv("/home/soner/Downloads/datasets/USvideos.csv")
us.shape
(40949, 16)
The dataset contains 16 columns and approximately 40 thousand rows.
Here is the list of 16 columns.
us.columns
Index(['video_id', 'trending_date', 'title', 'channel_title', 'category_id',
       'publish_time', 'tags', 'views', 'likes', 'dislikes', 'comment_count',
       'thumbnail_link', 'comments_disabled', 'ratings_disabled',
       'video_error_or_removed', 'description'],
      dtype='object')
Some of the columns are redundant for our analysis so we will drop them.
us.drop(['video_id','thumbnail_link', 'description'], axis=1, inplace=True)
us.shape
(40949, 13)
Three columns have been dropped, leaving 13. The axis parameter indicates whether to drop columns or rows, and the inplace parameter saves the changes in the dataframe.
We should always check the data types of the columns because some functions and methods can only be used with certain data types. For instance, dates must be stored with the datetime data type so that we can use the datetime functionality of pandas.
us.dtypes
trending_date object
title object
channel_title object
category_id int64
publish_time object
tags object
views int64
likes int64
dislikes int64
comment_count int64
comments_disabled bool
ratings_disabled bool
video_error_or_removed bool
dtype: object
The trending date and publish time columns have the object data type, which needs to be changed to datetime. Let's take a look at these columns first.

The publish time column can easily be converted with the astype function because it is in a proper format.
us['publish_time'] = us['publish_time'].astype('datetime64[ns]')
For the trending date column, we need some reformatting. The first two characters represent the last two digits of year (e.g. 17 is 2017). Since the data belongs to years 2017 and 2018, a simple solution is to concatenate the column with ’20’.
us['trending_date'] = '20' + us['trending_date']
The order in the trending date column is year, day, and month, which is not the standard format. However, we can easily handle it using the format parameter of the to_datetime function.
us['trending_date'] = pd.to_datetime(us['trending_date'], format = "%Y.%d.%m")
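As a quick sanity check of the format string, here is a tiny self-contained sketch with made-up values (not rows from the actual dataset):

```python
import pandas as pd

# Toy year.day.month strings, mimicking the trending_date format
s = pd.Series(["2017.14.11", "2018.02.01"])

# %Y = 4-digit year, %d = day, %m = month, separated by dots
parsed = pd.to_datetime(s, format="%Y.%d.%m")

print(parsed.dt.strftime("%Y-%m-%d").tolist())  # ['2017-11-14', '2018-01-02']
```

Note how "2017.14.11" is parsed as November 14, 2017: the format parameter tells pandas exactly which position holds the day and which holds the month.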
Here is how these two columns look now.
us[['trending_date','publish_time']].head()
Let’s also check the data types.
us[['trending_date','publish_time']].dtypes
trending_date datetime64[ns]
publish_time datetime64[ns]
dtype: object
The difference between the time a video is published and its trending time might be a useful piece of information for the analysis. Since both columns have datetime data type, we can easily calculate the time difference.
us['time_diff'] = us['trending_date'] - us['publish_time']
us['time_diff'][:5]
0 0 days 06:46:59
1 0 days 16:30:00
2 1 days 04:54:36
3 0 days 12:59:56
4 1 days 05:58:19
Name: time_diff, dtype: timedelta64[ns]
We calculated the difference using the minus operator and assigned it to a new column called "time_diff". The data type of this column is timedelta, which is the standard pandas type for time intervals.
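The dt accessor also works on timedelta columns, exposing their components. A small sketch with a toy series (not the actual dataset):

```python
import pandas as pd

# Toy timedelta series, similar in shape to the time_diff column
diffs = pd.Series(pd.to_timedelta(["0 days 06:46:59", "1 days 04:54:36"]))

# Whole days as integers
print(diffs.dt.days.tolist())                               # [0, 1]

# Total duration expressed in hours
print((diffs.dt.total_seconds() / 3600).round(2).tolist())  # [6.78, 28.91]
```

This kind of conversion is handy when a plot or a model needs the time difference as a single number rather than a timedelta object.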
The publish time column contains both the date and time. We can easily access each part by using the date and time methods under the dt accessor.
print(us['publish_time'].dt.date[0])
2017-11-13
print(us['publish_time'].dt.time[0])
17:13:01
We should also do some preprocessing on textual data. For instance, we might need to find the most common words in the trending video titles.
us['title'][:5]
0 WE WANT TO TALK ABOUT OUR MARRIAGE
1 The Trump Presidency: Last Week Tonight with J...
2 Racist Superman | Rudy Mancuso, King Bach & Le...
3 Nickelback Lyrics: Real or Fake?
4 I Dare You: GOING BALD!?
Name: title, dtype: object
Some titles are written in all capital letters whereas others are a mixture of upper and lower case letters. Let's convert them all to lower case.
The lower method of the str accessor will do the job for us.
us['title'] = us['title'].str.lower()
us['title'][:5]
0 we want to talk about our marriage
1 the trump presidency: last week tonight with j...
2 racist superman | rudy mancuso, king bach & le...
3 nickelback lyrics: real or fake?
4 i dare you: going bald!?
Name: title, dtype: object
It is better to also remove the punctuation so that we have cleaner textual data. There are many ways to remove punctuation; one of the simplest is to use the replace method of the str accessor and perform a regex-based replacement.
us['title'] = us['title'].str.replace(r'[^\w\s]+', '', regex=True)
us['title'][:5]
0 we want to talk about our marriage
1 the trump presidency last week tonight with jo...
2 racist superman rudy mancuso king bach lele ...
3 nickelback lyrics real or fake
4 i dare you going bald
Name: title, dtype: object
You may be confused by the dots at the end of the second and third lines. They are not part of the strings in those cells; the values are simply longer than the number of characters pandas displays under the current display settings.
Here is the title in the second line.
us['title'][1]
'the trump presidency last week tonight with john oliver hbo'
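If you prefer to see the full strings in the output itself, the truncation threshold can be adjusted through the display.max_colwidth option. A self-contained sketch with a toy series:

```python
import pandas as pd

# A toy title longer than the default column width
titles = pd.Series(["the trump presidency last week tonight with john oliver hbo"])

# Truncate long values: the display ends with "..."
pd.set_option("display.max_colwidth", 30)
print(titles)

# Show full strings, however long
pd.set_option("display.max_colwidth", None)
print(titles)
```

Changing the option only affects how the dataframe is printed; the stored strings are never modified.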
We are done with data cleaning and manipulation, so let's do a couple of simple calculations. For instance, we may want to know the average time difference between when a video is published and when it becomes trending.
The data type of the time_diff column is timedelta, which supports aggregate functions, so we can apply the mean function.
us.time_diff.mean()
Timedelta('16 days 05:21:53.236220664')
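Other aggregate functions work on timedelta columns just as directly. A toy illustration (made-up values, not the actual dataset):

```python
import pandas as pd

# Toy timedelta series with one large outlier
td = pd.Series(pd.to_timedelta(["1 days", "3 days", "20 days"]))

print(td.mean())    # Timedelta('8 days 00:00:00')
print(td.median())  # Timedelta('3 days 00:00:00'), less affected by outliers
```

Comparing the mean with the median is a cheap way to spot whether a few extreme values, such as old videos that trend years after publishing, are inflating the average.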
The average time is 16 days and 5 hours, which is shorter than I would have anticipated.
The dataset contains approximately 40 thousand rows. I wonder whether most of the videos were uploaded by different channels. One way to gain some insight is to check the number of unique channels.
us.channel_title.nunique()
2207
There are 2207 unique channels in the dataset which means some of them have many trending videos.
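To see which channels dominate, value_counts gives the number of rows per channel. A self-contained sketch with a toy channel column (not the actual dataset):

```python
import pandas as pd

# Toy channel_title column: channel "A" appears most often
channels = pd.Series(["A", "B", "A", "C", "A", "B"], name="channel_title")

print(channels.nunique())       # 3 unique channels
print(channels.value_counts())  # A: 3, B: 2, C: 1
```

On the real dataset, us.channel_title.value_counts().head(10) would list the ten channels with the most trending entries.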
Conclusion
We have done some data cleaning and manipulation on the YouTube trending video statistics dataset available on Kaggle. In the second article, we will focus on the data analysis part and try to obtain some insights into what makes a video trend.
I think it would be a highly challenging task to predict whether a video will trend. However, we can explore the similarities between trending videos.
Stay tuned for the next article!
Thank you for reading. Please let me know if you have any feedback.