Predicting the number of likes on Instagram

Corentin Dugué
Towards Data Science
14 min read · May 9, 2017


In this article, we will show our approach to predicting the number of likes of an Instagram post. We will first go over how we collected the dataset and analyze the data. Then, we will build a base model with XGBoost that does not look at the image. Next, we will use Natural Language Processing (NLP) to extract features from the text. Finally, a Convolutional Neural Net (CNN) is developed to extract features from the image.

Github link: https://github.com/gvsi/datascience-finalproject

1. Motivation

  • Goal: Predict the number of likes of a given Instagram post.
  • Context: Social media influencers get paid by digital marketers to promote a product or service.
  • Application: Use the developed model in online marketing to find the influencers that yield the most impressions for a given post.

An example of such a post is below: this influencer is going on a trip, and in this picture she placed a product, a deodorant, most likely getting paid by a company to do so.

Post by Léa Camilleri: https://www.instagram.com/p/BRNlFUBAG5i/

2. Data

Building the dataset

One of the challenges associated with the project is aggregating sufficient and relevant data for the use case we are tackling.

The first step in building the dataset was collecting a list of Instagram influencers, that is, users who, as part of their occupation, make marketed Instagram posts.

Finding this kind of information turned out to be a challenge on its own, as such lists are not easily available to the public.

Eventually we came across the Iconosquare Index Influencers, a list of 2000+ Instagram influencers. This index can be found at https://influence.iconosquare.com.

We had to crawl the 70+ pages of influencers to obtain their handles in a usable list.

Now for the real challenge: scraping the Instagram profiles of these users. This involves reading the metadata from each profile (number of followers/following, number of posts, profile description) and crawling the 17 latest posts from each user (image in JPG format, number of likes, number of comments, timestamp, caption text).

The final result is a JSON file for each user that looks like this:

JSON file obtained for a given user
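To give an idea of the structure, a mock-up is below. The field names and values are illustrative, not our exact schema:

    {
      "handle": "example_user",
      "followers": 245000,
      "following": 312,
      "posts_count": 830,
      "bio": "Travel, food and video.",
      "website": "https://www.youtube.com/...",
      "posts": [
        {
          "image": "abc123.jpg",
          "likes": 18230,
          "comments": 204,
          "timestamp": 1488726000,
          "caption": "New trip, new vlog! #ad"
        }
      ]
    }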

The scraper

Instagram's API has a limit of 60 requests per hour to its backend servers, which makes it practically useless for any real application or large-scale data gathering. The alternative to the official API is crawling each page programmatically.

The scraper we used was coded with Selenium, a framework aimed at building functional tests for web applications. In this context, Selenium is used to crawl webpages and gather data from them.

The scraper first linearly scans the latest posts of a user, then opens each post to retrieve more granular information about each image.
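A minimal sketch of this flow with Selenium's Python bindings; the XPath selectors are hypothetical, since Instagram's markup changes often:

    from selenium import webdriver

    driver = webdriver.Chrome()

    def scrape_profile(handle, n_posts=17):
        driver.get("https://www.instagram.com/%s/" % handle)
        # Post permalinks on a profile page contain '/p/'
        anchors = driver.find_elements_by_xpath("//a[contains(@href, '/p/')]")
        post_urls = [a.get_attribute("href") for a in anchors][:n_posts]
        posts = []
        for url in post_urls:
            driver.get(url)  # open each post for more granular data
            # Instagram exposes likes/caption in the og:description meta tag;
            # again, the exact selector is a guess and changes often
            meta = driver.find_element_by_xpath("//meta[@property='og:description']")
            posts.append({"url": url, "meta": meta.get_attribute("content")})
        return posts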

Here’s a demo of the scraper in action:

Scraper in action

In total, 16,539 images from 972 Instagram influencers were collected to train and test our model.

Dataset analysis

Below are some brief summary statistics about the data collected. The number of likes has a high standard deviation of 61,224.20. There is a high disparity: the mean is 24,416.38, yet 75% of the posts have fewer than 18,359 likes.

Dataset summary

The following histogram confirms the values shown in the table: the majority of posts have fewer than 200k likes. For the application we are interested in, it makes sense to remove users with a very high number of followers (more than 1,000,000) and a high average number of likes (above 200k), as they are no longer influencers but celebrities/stars. This also helps reduce the error of our model.

We filtered our dataset to keep only Instagramers with an average number of likes below 200k and fewer than 1,000,000 followers. After filtering, we kept 746 Instagramers corresponding to 12,678 posts. The standard deviation of the number of likes went down to 9,999.27 and the mean to 8,306.93. The histogram of the number of likes after filtering is shown above on the right.
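A minimal pandas sketch of this filtering step, assuming a posts DataFrame with one row per post; the column names 'user', 'likes', and 'followers' are ours for illustration:

    import pandas as pd

    # Average likes and follower count per user
    avg_likes = posts.groupby("user")["likes"].mean()
    followers = posts.groupby("user")["followers"].first()

    # Keep users below both thresholds, then keep only their posts
    keep = avg_likes[(avg_likes < 200000) & (followers < 1000000)].index
    filtered = posts[posts["user"].isin(keep)]
    print(filtered["likes"].describe())  # mean and std after filtering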

A high number of followers does not necessarily mean a big influence

We plot the number of likes versus the number of followers.

There is an increasing trend; however, it is not strong, and there is a lot of noise. The number of followers alone is probably not a good metric to predict the number of likes of a given post, since users with a large number of followers do not necessarily get a high number of likes.

3. Models

We will first develop a base model containing some basic features obtained from the dataset. Then, we will add features obtained from Natural Language Processing (NLP) and finally, add features generated from a Convolutional Neural Net.

To compare the different models we will use two performance metrics: Root Mean Square Error (RMSE) and the R² value.
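Both metrics are straightforward to compute with scikit-learn, RMSE being the square root of the mean squared error; y_test and y_pred below are placeholders for true and predicted like counts:

    import numpy as np
    from sklearn.metrics import mean_squared_error, r2_score

    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)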

A. Base model

The base model consists of the following features:

Given features:

  • Number of followers
  • Number of following
  • Number of posts

Extracted features:

  • Website: we classified the website provided in the user description into categories: YouTube, Facebook, Twitter, Blog, Music, and Other. We then one-hot-encoded these categories.
  • Day of the week: using the timestamp of each post, we one-hot-encoded the day of the week it was published.

Generated feature:

  • Average number of likes: we observed that a high number of followers does not necessarily yield a high number of likes, as inactive and fake followers distort the relationship. A better metric appeared to be the average number of likes of each user. The rest of the features are then expected to shift the prediction up or down relative to this average (see the sketch below).
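A sketch of how these features could be assembled with pandas; the DataFrame and column names are ours for illustration, and 'timestamp' is assumed to be a datetime column:

    import pandas as pd

    # df: one row per post
    X = df[["followers", "following", "num_posts"]].copy()

    # One-hot-encode the website category and the day of the week
    X = X.join(pd.get_dummies(df["website_category"], prefix="site"))
    X = X.join(pd.get_dummies(df["timestamp"].dt.dayofweek, prefix="day"))

    # Average number of likes of each user, our strongest feature
    X["avg_likes"] = df.groupby("handle")["likes"].transform("mean")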

From the features generated we found that:

Website

11% of the users had a YouTube channel or video as their website, 4% a Facebook profile, 2% a blog, and 1% a music-related website (SoundCloud or Spotify). Overall, 88% of the users had a website in their profile.

Day of the week

The day with the highest share of uploaded pictures is Sunday with 18%, followed by Saturday and Thursday with 15%, Friday and Wednesday with 14%, and finally Monday with 11%. Only two days stand out: Sunday has more posts and Monday fewer.

XGB

Applying an XGBoost model to these features with the parameters max_depth=4, learning_rate=0.01, and n_estimators=596, we obtain an RMSE of 2876.17 and an R² of 0.92.
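In code, this looks roughly as follows; the parameters are the ones above, and the train/test split is assumed to exist already:

    import numpy as np
    import xgboost as xgb
    from sklearn.metrics import mean_squared_error, r2_score

    model_xgb = xgb.XGBRegressor(max_depth=4, learning_rate=0.01,
                                 n_estimators=596)
    model_xgb.fit(X_train, y_train)

    y_pred = model_xgb.predict(X_test)
    print(np.sqrt(mean_squared_error(y_test, y_pred)))  # RMSE
    print(r2_score(y_test, y_pred))                     # R²

    xgb.plot_importance(model_xgb)  # F-score plot discussed below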

From the feature importance plot given by XGBoost, we can see that the average number of likes dominates the XGB model: its F score of 4084 is bigger than those of all the other features combined. Among the basic features, the numbers of posts, following, and followers have low F scores. Among the extracted features, Saturday appears to have a bigger impact than the other days, perhaps suggesting that posting on Saturday yields a higher number of likes. Recall from earlier that the distribution of posts over the week is relatively uniform, with more posts published on Sundays and fewer on Mondays.

B. NLP

Our dataset contained several text fields. First, there was the user biography, which users write to introduce themselves and what their profile is about. The second field was the caption accompanying every post, including any hashtags the user added. The third field contained all the users mentioned in the post. Lastly, there was a field for the location where the post occurred.

Since we did not have much time, we decided to take a bag-of-words approach. First, we cleaned up the data into a workable state. Then, we removed all punctuation, stopwords, and emojis from the text (more on emojis later). We then tokenized the text and used scikit-learn's CountVectorizer to vectorize it. We also tried scikit-learn's TF-IDF vectorizer; however, CountVectorizer seemed to work better. When we applied this process to the post captions, we ended up with over 7,000 unique words, resulting in a very sparse 16k by 7k matrix. To reduce the sparsity of the matrix, we used only the top words in the text instead of all of them. Thus, we were able to work with the top 100, 200, 300, ... words when evaluating our features.
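The top-words trick corresponds to CountVectorizer's max_features argument, which keeps only the N most frequent tokens. A minimal sketch, where 'captions' is a placeholder list of caption strings:

    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(stop_words="english", max_features=500)
    caption_features = vectorizer.fit_transform(captions)  # sparse n_posts x 500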

Emojis were interesting to examine because they are very popular on Instagram. Unlike the plain text data, emojis had to be processed differently; because they are unicode characters, we were able to write our own CountVectorizer-like function for them. Then, as with the caption text, we took a subset of the top emojis as features based on their occurrence counts.
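A sketch of one way such an emoji counter could work; the unicode ranges below are a rough heuristic, not our exact implementation:

    from collections import Counter

    def extract_emojis(text):
        # Most common emoji live in these unicode blocks (approximate)
        return [ch for ch in text
                if 0x1F300 <= ord(ch) <= 0x1FAFF or 0x2600 <= ord(ch) <= 0x27BF]

    counts = Counter()
    for caption in captions:
        counts.update(extract_emojis(caption))

    # Keep the emojis that occur more than 175 times as features
    top_emojis = [e for e, n in counts.most_common() if n > 175]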

To decide which subset of the NLP features to use in our final model, we tried predicting the number of likes of a post from different subsets of the NLP data, and chose the subset that reduced our RMSE the most. After testing a couple of subsets, we found the best to be: the top 500 words/hashtags from the post captions, and the emojis that show up more than 175 times (37 emojis).

Here is an example of some of the top 500 words (showing the subset that are longer than 5 characters):

account actually adventure almost already always amazing another anyone anything architecture around available awesome barcelona beautiful beautifuldestinations beauty become behind believe berlin better birthday breakfast california camera challenge chance change chocolate coachella coffee collection coming comment community conservation couple course delicious dengan design details different easter enough europe everyone everything excited experience explore family fashion favorite favourite featured feeling finally fitness follow followers forget france friday friend friends geographic getting gracias grateful guides hashtag healthy images important incredible inspiration inspired instagram island journey liketk liketkit liketoknow little living location london looking lovely madwhips magazine makeup making mercedes millionaire moment moments monday morning mother mountain mountains natgeo natgeocreative national natural nature nothing official online outfit people perfect performance photograph photographer photography photos picture places please porque pretty prints profile project really recipe remember saturday season series sharing someone something sometimes special sponsored spring started stories summer sunday sunset support taking thanks thephotosociety things though thought together tomorrow tonight travel trying turkey vacations wanted wearing weekend without wolfmillionaire wonderful working workout yesterday youtube

Here are the top 37 emojis:

Top 37 emojis

When we appended these features to the base-model features, our RMSE actually worsened. These extra NLP features took away importance from the average-likes feature, so the Base + NLP model relied less on it to predict likes. Our RMSE went up to 2895.90 and our R² dropped to 0.9163. Since there are over 500 features, the feature importance plot for this model is hard to read.

However, using this snippet:

    vec = model_xgb.feature_importances_
    for i in range(len(vec)):
        if vec[i] > .01:
            print(X_train.columns[i])

We found that the most important NLP features are the words: ada, amazing, amg, code, el, going, lo, make, man, thank, time.

C. Transfer learning

Another one of the approaches we took is related to image processing and computer vision.

The aim is to identify features related to images that may be meaningful in determining the final number of likes.

While we believe the features in the base model are the most meaningful ones in this regression problem, we would consider even the slightest improvement from image features a great success.

Transfer learning is the process of taking a pre-trained deep ConvNet and using it as a starting point to build a model that takes image features into account.

There are two ways to use pretrained models, and we tackled both:

  • Fine-tuning the ConvNet with Inception v3: taking the pre-trained weights of the ConvNet, removing the last fully connected layers, and expanding the network as needed. We then fine-tune the weights of the pretrained network by continuing the backpropagation. We removed the last two fully connected layers of Inception v3, added three extra layers, and retrained it on our data points with a GPU on an EC2 machine.
  • Fixed feature extractor with VGG 19: removing the last fully connected layer and treating the rest of the ConvNet as a fixed feature extractor for the new dataset. VGG 19 is an image classification network pretrained on the 14+ million images of the ImageNet dataset. Removing the final classifier layer yields a feature vector of size 4096 for each image, which we can use as (sparse) features in our base model; a sketch follows this list.
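A sketch of the fixed-feature-extractor path using Keras' pretrained VGG 19; 'post.jpg' is a placeholder file name, and 'fc2' is the 4096-unit layer that remains once the 1000-way classifier is removed:

    import numpy as np
    from keras.applications.vgg19 import VGG19, preprocess_input
    from keras.models import Model
    from keras.preprocessing import image

    base = VGG19(weights="imagenet")
    extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

    img = image.load_img("post.jpg", target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    features = extractor.predict(x)  # feature vector of shape (1, 4096)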

Results

Unfortunately, neither approach yielded a significant improvement over our base model. We attribute this to the fact that these deep networks were trained with a classification task in mind, which may not adapt well to a regression problem like ours. In addition, our base model uses XGBoost to fit the rest of the features, and XGBoost generally has mixed results with very sparse feature matrices.

D. Convolutional Regression Neural Network

Motivation

We speculate that the transfer learning approach was not successful because the transferred networks were trained for classification problems, whereas we have a regression objective. Our next idea was therefore to build and train our own convolutional neural network (CNN), designed specifically for regression, with Google's TensorFlow machine learning platform.

Design

Like all CNNs, our design starts with c convolutional layers that get flattened into f fully connected layers. The fully connected layers feed into a final layer of one ReLU-activated neuron that outputs the number of likes. Our first thought was to concatenate the numerical meta features we also used in our base model (number of followers, number following, etc.) as control parameters for the image analysis. We reasoned this would help because, regardless of the image, the number of likes a post gets is heavily influenced by the user posting it. For example, a very popular user may post a picture of a mundane object, say a chair, and still receive a relatively large number of likes.

Architecture

Choosing the architecture of a neural network is somewhat of an artistic process. In any machine learning problem, the model designer must choose the hyperparameters in just the right way to give the model enough freedom to capture the patterns in the data while constraining it enough not to fall victim to overfitting. In neural network design, this problem is compounded by the fact that everything is a hyperparameter: the number of layers, the layer sizes, the number of channels, the convolutional filter size, the filter stride, etc. To combat this, we wrote our code so that the model initialization arguments took a compact form describing the architecture of the network. This allowed us to test many different architectures very quickly without rewriting any code. Our criteria were that the model would not quickly begin to overfit after a few training epochs and that the ReLU activations were not too sparse.

The architecture we settled on is shown below. It takes in batches of 256 x 256 images, and has 4 convolutional layers and 5 fully connected layers. For the convolutional layers, the number at the top of each box represents the output shape per channel, and the number at the bottom represents the number of channels. For the fully connected layers, the number is simply the number of neurons in the layer.
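A Keras-style sketch of a network of this shape; the channel counts and filter sizes below are illustrative, not our final hyperparameters:

    from keras.models import Sequential
    from keras.layers import Conv2D, Dense, Flatten

    model = Sequential([
        Conv2D(16, (5, 5), strides=2, activation="relu",
               input_shape=(256, 256, 3)),
        Conv2D(32, (5, 5), strides=2, activation="relu"),
        Conv2D(64, (3, 3), strides=2, activation="relu"),
        Conv2D(64, (3, 3), strides=2, activation="relu"),
        Flatten(),
        Dense(256, activation="relu"),
        Dense(128, activation="relu"),
        Dense(64, activation="relu"),
        Dense(8, activation="relu"),   # this layer later doubles as features
        Dense(1, activation="relu"),   # regression output: number of likes
    ])
    model.compile(optimizer="adam", loss="mse")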

Results

The CNN was trained as a standalone model. To integrate it into the whole model, we took the 8 outputs of its last fully connected layer (the layer before the output neuron) per image and used them as features in the main XGBoost. Unfortunately, as with the NLP and transfer learning approaches, our test RMSE increased, from 2876 to 2964.

Redesign

We reasoned that the rise in RMSE could be attributed to the user meta features being concatenated to the fully connected layers. Because these features are so much more predictive than the image itself, the network learned to disregard the image and essentially train like a standard neural network on the meta features. The next step was simply to remove these features from the network and train on the images alone. When training on images alone, the standalone model expectedly had a far worse RMSE. However, when using it for feature extraction and integrating with the joint model, we saw a decrease in RMSE over the base model, from 2876 to 2837. When analyzing the important features of the joint XGBoost regression, we found that it did consider one of the image features important. Plotting this feature across all the samples, we see that it takes the shape of a Gaussian distribution. From this we speculate that it captures something continuous about the images, such as contrast ratio or brightness, as opposed to a discrete property such as "is there a human in this photograph?". Below, the second feature, labeled '1_', comes from the convolutional neural network.
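A sketch of this feature-extraction step, reusing the model from the sketch above; 'images', 'X_base', 'y', and 'model_xgb' are placeholders for the image batch, the base feature matrix, the targets, and the XGBoost regressor:

    import numpy as np
    from keras.models import Model

    # Take the 8-unit layer (second to last) as an image embedding
    embedder = Model(inputs=model.input, outputs=model.layers[-2].output)
    img_feats = embedder.predict(images)        # shape (n_posts, 8)

    X_joint = np.hstack([X_base, img_feats])    # append to base features
    model_xgb.fit(X_joint, y)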

Here is the distribution of the feature of interest across all the sample posts.

Histogram of NN feature 1_

Conclusion

Adding different sub-models like NLP or a CNN on top of our base model shows that it is extremely difficult to extract extra predictive power out of an Instagram post beyond the user-specific features. Additionally, even with the base model, users who get likes far outside the typical range are greatly underestimated, throwing off the RMSE or requiring filtering. The lack of post sensitivity in our model is probably attributable to the fact that each user has a following with tastes unique enough that our model could not generalize these preferences at a broad scale. Possible solutions would be to create user-specific models, or to gather massive amounts of data (millions of posts as opposed to thousands) in an attempt to accurately generalize all of Instagram. Lastly, all of our models and submodels could have been tuned and tweaked more optimally in the hope of better post sensitivity, but our development time constraints highly limited our ability to find creative solutions to this highly complex problem.

By Corentin Dugué, Giovanni Alcantara, Joseph Shalabi and Sahil Shah.
