Applying NLP Techniques to Understand the Latest Blackpink Comeback

This is the first in a series of articles covering textual data collection, data preprocessing, and sentiment analysis. In this article specifically, I will talk about why I wanted to collect comments from Blackpink’s latest music video, "How You Like That", and then walk you through how you can build your own dataset of YouTube comments from any video you want.
If you would like to cut to the chase and start collecting comments immediately, you can follow the script at my repo:
Otherwise, let’s get started!
For this project, I was interested in analyzing YouTube comments on Blackpink’s latest music video, "How You Like That", released on June 26th, 2020.
By breaking records such as the most-viewed YouTube video in 24 hours, the most-viewed K-Pop act in 24 hours, and the fastest video to reach 200 million views, Blackpink has more than shown the world that it is a group to be reckoned with. However, prior to "How You Like That", the group’s last album, Kill This Love, was released more than a year earlier, on April 4th, 2019. Between then and June 26th, 2020, only one more song, a collaboration with Lady Gaga called "Sour Candy", was released. Now that Blackpink has graced its fans with more content featuring just themselves, I was interested in how BLINKs (the official name of Blackpink’s fandom) were responding to it.
What are fans saying about the four members and their agency, YG? What are their sentiments toward Blackpink’s latest song and the individual members? Do these sentiments differ across languages?
These guiding questions motivated me to apply sentiment analysis to comments from their music video. I chose YouTube as a data source not only because it is a popular social media platform, but also because it is the second largest search engine, with 3 billion searches per month as of 2017. This makes it a valuable resource for entertainment companies to promote their artists’ new singles and albums. In addition, since each video is accompanied by a comment section, these promotional videos also become a forum for fans to directly engage with artists and other fans.
Beyond those points, I also wanted to challenge myself with building and cleaning a dataset from scratch. I encourage you to do the same for the following reasons:
- Exposure to web scraping and using APIs: knowing how to gather data will be incredibly useful for augmenting an existing dataset or creating a new one to address questions and hypotheses you may have about a topic.
- Greater customization over data: you have greater control over which features you’ll include in your custom dataset, and you can change them up to fit your needs as you analyze your data.
- Practice data cleaning techniques: oftentimes, publicly available datasets have been cleaned and pruned to some extent. In contrast, YouTube comments, and social media texts in general, are difficult to work with due to slang, abbreviations, misspellings, emojis, irony, and sarcasm. Cleaning these kinds of texts will force you to consider the effectiveness and consequences of each technique.
For this project, I chose to familiarize myself with APIs by querying comments through YouTube’s Data API. The following sections will walk you through how I collected comments of interest. Some understanding of Python is assumed. I’ve also included a short introduction to APIs and JSON. If you’re already familiar with them, you can skip directly to the Data Collection section.
Quick Primer on API and JSON
What is an API?

API is short for Application Programming Interface. Its role is to send a user’s request to a service provider and then return the results generated by the service provider back to the user. GeeksforGeeks uses the example of searching for a hotel room online: the API sends the user’s request to the hotel booking website and returns the most relevant data from that website to the user. In that sense, APIs, especially those published by large companies, offer tools for users to obtain data of interest.
What is JSON?

According to w3schools, JSON, short for JavaScript Object Notation, is a lightweight format for storing and transporting data. Its syntax is very similar to that of Python dictionaries: a JSON object is denoted by curly braces, and its data is stored in key:value pairs separated by commas.
This data format is important to know since it’s the most common format for responses from APIs. For example, the response provided by the YouTube Data API is a JSON object.
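To see the parallel with Python dictionaries, here is a small, made-up JSON string parsed with Python’s built-in json module:

import json

# a tiny, made-up JSON string: curly braces and comma-separated key:value pairs
raw = '{"videoId": "ioNng23DkIM", "likeCount": 42, "isPublic": true}'

data = json.loads(raw)        # parse the JSON string into a Python dict
print(data['videoId'])        # ioNng23DkIM
print(data['likeCount'] + 1)  # 43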
Data Collection
A Quick Tutorial on Setting Up YouTube API Credentials
- Head over to Google Developer’s Console and create a new project.

- Once you’ve set up a new project, select + ENABLE APIS AND SERVICES

- Search for YouTube Data API v3 and click on Enable.

- Then return to Credentials. You can do so by clicking on the hamburger menu, ☰

- Select + CREATE CREDENTIALS, and then API Key.

According to the developer docs, we do not need user authorization to retrieve information about a public YouTube channel, so an API key is all we’ll need to collect comments off a video.
- Finally, install the Google API Client for Python.
pip install --upgrade google-api-python-client
If you’re curious, you can read more about setting up Google APIs with Python here:
Using YouTube Data API v3 to Query YouTube Comments
Once we have our credentials set up, we can now start collecting comments! We’ll first build the service for calling the YouTube API:
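A minimal sketch of such a helper might look like the one below; it assumes the API key is stored as a plain string in a local file (swap in whatever loading logic matches how you store your key):

from googleapiclient.discovery import build

def build_service(api_key_path):
    # assumes the file contains nothing but the raw API key string
    with open(api_key_path, 'r') as f:
        api_key = f.read().strip()
    # create a client for version 3 of the YouTube Data API
    return build('youtube', 'v3', developerKey=api_key)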
Now let’s take a look at the resource of interest. In order to obtain all YouTube comments on a specific video, we’ll need to send a request for commentThreads. An example commentThreads request in Python will look like the following:
# you only need to build the service once
service = build_service('path/to/apikey.json')
response = service.commentThreads().list(
    part='snippet',
    maxResults=100,
    textFormat='plainText',
    order='time',
    videoId='ioNng23DkIM'
).execute()
Of the parameters listed above, two are required: part, and exactly one of allThreadsRelatedToChannelId, channelId, id, and videoId.
For the part parameter, we need to pass a comma-separated list consisting of any combination of id, snippet, and replies. The snippet keyword will return basic details about the comment thread and the thread’s top-level comment, while replies contains a list of replies to the top-level comment.
The second required parameter is a filter, and we can choose between allThreadsRelatedToChannelId, channelId, id, and videoId. Since I was interested in just the YouTube comments on Blackpink’s How You Like That, I chose to filter by videoId.
A video’s ID can be obtained from its YouTube link. Links will generally look like this:
https://www.youtube.com/watch?v=ioNng23DkIM
The video ID in this case is ioNng23DkIM; in general, the video ID immediately follows ‘?v=’.
But sometimes a link may look like the following, such as when you obtain a link through the share option on a video:
https://youtu.be/ioNng23DkIM
In that case, the ID will be directly after ‘youtu.be’.
We can handle both cases with the following function (although this is unnecessary if you’re manually sourcing YouTube video links; in that case, you can just copy the ID portion of the link).
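A minimal sketch of that function (the name get_video_id and the parsing details are illustrative) might look like:

def get_video_id(url):
    # short share links: https://youtu.be/<video_id>
    if 'youtu.be/' in url:
        return url.split('youtu.be/')[1].split('?')[0]
    # standard links: https://www.youtube.com/watch?v=<video_id>
    if 'v=' in url:
        return url.split('v=')[1].split('&')[0]
    raise ValueError('Could not find a video ID in ' + url)

print(get_video_id('https://www.youtube.com/watch?v=ioNng23DkIM'))  # ioNng23DkIM
print(get_video_id('https://youtu.be/ioNng23DkIM'))                 # ioNng23DkIM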
Deciding on the Items of Interest
For this project, I was only interested in top level comments, the number of replies and likes, and whether the commenter also rated (liked) the video, so I passed just the string ‘snippet’ to parameter part.
After running the code above, you’ll get a JSON response that looks like this:
{
  "kind": "youtube#commentThreadListResponse",
  "etag": etag,
  "nextPageToken": string,
  "pageInfo": {
    "totalResults": integer,
    "resultsPerPage": integer
  },
  "items": [
    commentThread Resource
  ]
}
The items of interest are nextPageToken and items. Let’s talk about items first. The key items contains a list of commentThreads, and each commentThread consists of the following:
{
  "kind": "youtube#commentThread",
  "etag": etag,
  "id": string,
  "snippet": {
    "channelId": string,
    "videoId": string,
    "topLevelComment": comments Resource,
    "canReply": boolean,
    "totalReplyCount": unsigned integer,
    "isPublic": boolean
  },
  "replies": {
    "comments": [
      comments Resource
    ]
  }
}
Since I chose to pass only the string snippet to the part parameter, I will only get the snippet portion of the JSON resource above. The snippet is a dictionary containing keys and corresponding values for channelId, videoId, topLevelComment, canReply, totalReplyCount, and isPublic.
Among these resources, I chose to save the values of topLevelComment and totalReplyCount. However, we still have not accessed the actual text content of the topLevelComment. We can extract the text, the number of likes the top-level comment has received, and whether the commenter has also rated the video by indexing into the topLevelComment object. It is a comment resource, which looks like this:
{
  "kind": "youtube#comment",
  "etag": etag,
  "id": string,
  "snippet": {
    "authorDisplayName": string,
    "authorProfileImageUrl": string,
    "authorChannelUrl": string,
    "authorChannelId": {
      "value": string
    },
    "channelId": string,
    "videoId": string,
    "textDisplay": string,
    "textOriginal": string,
    "parentId": string,
    "canRate": boolean,
    "viewerRating": string,
    "likeCount": unsigned integer,
    "moderationStatus": string,
    "publishedAt": datetime,
    "updatedAt": datetime
  }
}
We can index into the response as follows (note that items is a list, so we first index into a single comment thread, here the first one):
comment = response['items'][0]['snippet']['topLevelComment']['snippet']['textDisplay']
Putting it all together, we can use the code snippet below to get the data points of interest.
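A sketch along those lines, looping over the items list from the response above (the list names here are just for illustration), could look like:

comments, replies_count, likes_count, viewer_ratings = [], [], [], []

for item in response['items']:
    top_comment = item['snippet']['topLevelComment']['snippet']
    comments.append(top_comment['textDisplay'])                # the comment text
    likes_count.append(top_comment['likeCount'])               # likes on the comment
    viewer_ratings.append(top_comment['viewerRating'])         # whether the commenter rated the video
    replies_count.append(item['snippet']['totalReplyCount'])   # number of replies to the comment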
If you’re interested in additional data points, such as the time at which a comment was updated, you can write something like:
updated_at = item['snippet']['topLevelComment']['snippet']['updatedAt']
The other value of interest in the commentThreads resource was the nextPageToken. Each time we submit a request, we get maxResults comments back in the items list, and maxResults can only be set between 1 and 100. Thus, if a video has more than 100 comments, we’ll need to make the API call several times. The nextPageToken lets us start directly at the next page of comments instead of starting from the beginning again. We just need to modify our API call a bit:
response = service.commentThreads().list(
    part='snippet',
    maxResults=100,
    textFormat='plainText',
    order='time',
    videoId='ioNng23DkIM',
    pageToken=response['nextPageToken']
).execute()
Note that we don’t need a nextPageToken for our very first service call. Instead, we use the nextPageToken obtained from the current JSON response for our next call to the service object.
Putting It All Together
The function below will help us get comments off a YouTube video:
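Here is a sketch of that function, reconstructed to match the description below; the numbered comments line up with the steps referenced in the next paragraph, and details such as the exact CSV columns are assumptions:

import csv  # 1 import the necessary libraries

def get_comments_of_video_id(service, video_id, csv_filename):  # 2 extra csv_filename parameter
    # 3 lists to hold the features of interest
    comments, replies_count, likes_count, viewer_ratings = [], [], [], []

    with open(csv_filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['comment', 'reply_count', 'like_count', 'viewer_rating'])

        # 4 request the first page of comment threads for the video
        response = service.commentThreads().list(
            part='snippet',
            maxResults=100,
            textFormat='plainText',
            order='time',
            videoId=video_id
        ).execute()

        while True:
            for item in response['items']:
                # 5 index into the JSON response for the data points of interest
                top_comment = item['snippet']['topLevelComment']['snippet']
                comment = top_comment['textDisplay']
                reply_count = item['snippet']['totalReplyCount']
                like_count = top_comment['likeCount']
                rating = top_comment['viewerRating']

                # 6 save the data points to the lists
                comments.append(comment)
                replies_count.append(reply_count)
                likes_count.append(like_count)
                viewer_ratings.append(rating)

                # 7 write this comment's features to the csv file line-by-line
                writer.writerow([comment, reply_count, like_count, rating])

            # 8 check whether there is another page of comments
            if 'nextPageToken' in response:
                response = service.commentThreads().list(
                    part='snippet',
                    maxResults=100,
                    textFormat='plainText',
                    order='time',
                    videoId=video_id,
                    pageToken=response['nextPageToken']
                ).execute()
            else:
                break

    # 9 return the data points of interest in dictionary form
    return {
        'comments': comments,
        'reply_counts': replies_count,
        'like_counts': likes_count,
        'viewer_ratings': viewer_ratings
    }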
Feel free to change the function as you see fit! After importing the necessary libraries (#1), I changed the parameters of the function to include an extra variable, csv_filename (#2). Lists to hold features of interest, code to index for those data points, and code to save the data points to lists are outlined in #3, #5, and #6. I then saved the desired features of each item in the JSON response line-by-line to the csv file (#7). After we check every item in the JSON response, we check if there’s a nextPageToken (#8). If not, we’ll return our data points of interest in dictionary form (#9).
Next Steps
There is a lot more we can do to make this program more modular. For example, instead of hard-coding lists for each feature (#2, #5), we can write a function that takes in a list of keywords and returns a dictionary containing the relevant information for each given keyword. We can also write a dictionary that maps long, involved indexing, such as the one for updated_at, to a shorthand. For example:
shorthand = {
    'updated_at': item['snippet']['topLevelComment']['snippet']['updatedAt']
}
This will involve some work the first time around but will simplify things down the line. Fortunately, these functions (and more) are already available in the wrapper library youtube-data-api.
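As a rough sketch of that keyword-based idea (the function name and key paths below are illustrative), we could map feature names to key paths and walk each path for every item:

from functools import reduce

# map a short feature name to its key path inside a commentThread item
FEATURE_PATHS = {
    'text': ['snippet', 'topLevelComment', 'snippet', 'textDisplay'],
    'like_count': ['snippet', 'topLevelComment', 'snippet', 'likeCount'],
    'updated_at': ['snippet', 'topLevelComment', 'snippet', 'updatedAt'],
    'reply_count': ['snippet', 'totalReplyCount'],
}

def extract_features(item, feature_paths=FEATURE_PATHS):
    # follow each key path down through the nested dictionaries
    return {name: reduce(lambda d, key: d[key], path, item)
            for name, path in feature_paths.items()}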
However, if you’d like to just collect comments out-of-the-box, my repo contains instructions on how to run the provided script get_comments_of_video_id.py.
Note that Google does impose a daily quota on the number of API calls you can make. This quota is set at around 10,000 units per day, which worked out to roughly 250,000 comments I could collect in one day. To work around this limitation, I created two API keys to collect more comments.
Wrapping Up
In this article, we took a look at how to collect YouTube comments from a video of interest using the YouTube Data API (v3). In my next article, we’ll follow the classical NLP pipeline to preprocess our data for sentiment analysis.
Thank you for following me on this data science journey!