Sql is a language used for managing data in relational databases that store data in tabular form with labelled rows and columns. We query data from a relational database with the select statement of SQL. The select statement is highly versatile and flexible in terms of data transformation and filtering operations.
In that sense, SQL can be considered as a data analysis tool. The advantage of using SQL for data transformation and filtering is that we only retrieve the data we need. It is more practical and efficient than retrieving all the data and then applying these operations.
In this article, we will use SQL statements and functions to analyze YouTube trending video statistics. The dataset is available on Kaggle. I created an SQL table that contains a small part of this dataset.
Note: I’m using MySQL as the database management system. Although SQL syntax is mostly the same for all database management systems, there might be small differences.
The table is called "trending" and it has the following structure.

We have the dates a video is published and becomes trending. We also have the title and channel of the video. The views and likes are the other two features the dataset contains.
Regarding all these features (i.e. columns) we can do a bunch of different operations. For instance, a simple one can be finding the top 5 channels in terms of the number of trending videos.
mysql> select channel_title, count(*) as number_of_videos
-> from trending
-> group by channel_title
-> order by number_of_videos desc
-> limit 5;
+-----------------+------------------+
| channel_title | number_of_videos |
+-----------------+------------------+
| Washington Post | 28 |
| Netflix | 28 |
| ESPN | 27 |
| TED-Ed | 27 |
| CNN | 27 |
+-----------------+------------------+
We select the channel title column and count the number of rows. The "as" keyword is used to assign a new name to the aggregated columns. The group by clause is used to group the videos (i.e. rows) based on channels. Finally, we sort the results in descending order using the order by clause and display the first 5.
The number of videos seems to be too low because I only included the ones published in January, 2018.
We may want to see the title of the most-viewed video.
mysql> select title, views
-> from trending
-> where views = (select max(views) from trending);

The query above contains a nested select statement. It is used with the where clause to find the desired condition which is the maximum values in the views column.
The most-viewed video in this table has been watched almost 60 million times.
SQL provides many different options for filtering the data. In the previous example, we found out that the most-viewed video belongs to Bruno Mars. We can filter the titles to only see the videos belong to Bruno Mars.
mysql> select distinct(title)
-> from trending
-> where title like "%Bruno Mars%";

We do not have to provide the exact value for filtering if we use the like keyword. The "%" represents any character so "%Bruno Mars%" represents any value that contains the "Bruno Mars" phrase. The distinct keyword is used to remove the duplicates.
If we are not sure about characters being lower or uppercase, we can convert all the characters to lower or upper case before filtering.
mysql> select distinct(lower(title))
-> from trending
-> where title like "%bruno mars%";

The dataset contains the published date of videos and when they become trending. We can calculate the average time it takes for a video to become trending.
Before calculating the difference, we need to extract the date part from the publish time column because it contains both the date and time.
mysql> select trending_date, publish_time
-> from trending
-> limit 3;
+---------------+---------------------+
| trending_date | publish_time |
+---------------+---------------------+
| 2018-01-02 | 2018-01-01 15:30:03 |
| 2018-01-02 | 2018-01-01 01:05:59 |
| 2018-01-02 | 2018-01-01 14:21:14 |
+---------------+---------------------+
The date function extracts the date part and the datediff function calculates the difference. Thus, we can calculate the average difference as follows:
mysql> select avg(datediff(trending_date, date(publish_time)))
-> as avg_diff
-> from trending;
+----------+
| avg_diff |
+----------+
| 3.9221 |
+----------+
The datediff functions takes two dates separated by a comma and calculates the difference. It takes 3.92 days on average for a video to become trending.
We can also calculate the average difference for videos that are published in a specific time period. We just need to add a where clause for filtering.
mysql> select avg(datediff(trending_date, date(publish_time))) as avg_diff
-> from trending
-> where hour(publish_time) > 20;
+----------+
| avg_diff |
+----------+
| 4.4825 |
+----------+
We extract the hour value from publish time and use it in the where clause for filtering.
SQL provides functions for data aggregation which can be implemented in the select statement. For instance, we can calculate the average ratio of likes over views of videos published by Netflix.
mysql> select avg(likes / views)
-> from trending
-> where channel_title = "Netflix";
+--------------------+
| avg(likes / views) |
+--------------------+
| 0.01816295 |
+--------------------+
The average value is close to 0.02 so Netflix videos have approximately 2 percent like over view ratio.
Let’s write a slightly more complicated query and calculate the average video views of channels that published more than 25 videos. We will also sort the results in descending order by the averages.
mysql> select channel_title, avg(views) as avg_views,
-> count(title) as number_of_videos
-> from trending
-> group by channel_title
-> having number_of_videos > 25
-> order by avg_views desc;

The retrieved data contains 3 columns. One is the channel title column and the other two are aggregated columns. We filter the channels based on the number of videos.
You may have noticed that we used the "having" clause instead of the "where" clause for filtering. The "having" clause is used for filtering based on aggregated columns.
Conclusion
We have done some examples to analyze the YouTube trending video statistics. The examples clearly demonstrate that SQL can also be used as a data analysis tool.
I think SQL is a required skill for data scientists or analysts. We should at least know how to query a relational database. Complex queries that perform data transformation and manipulation operations while retrieving the data have the potential to save memory. They also ease the tasks that need to be done afterwards.
Thank you for reading. Please let me know if you have any feedback.