How to get more likes on your blogs (1/2)

Alvira Swalin
Towards Data Science
5 min read · Feb 12, 2018

Unravelling the mystery of claps on Medium blogs using data analytics

As an amateur blogger, I always wonder what makes a good blog “good”. Is it just the content that matters, or do we have to focus on other aspects as well? To solve this mystery I am deploying the only tool I can think of: data!

This is the first of two blogs written by Neerja Doshi and me to understand the factors influencing likes/claps on Medium blogs. This first part covers web scraping, feature extraction and exploratory data analysis. In the second blog, we apply data science techniques, including some machine learning algorithms, to determine the importance of the features and, hopefully, to predict the claps.

Disclaimer: The purpose of this blog is to apply data science to gain useful insights out of curiosity. We are in no way undermining the importance of the actual content of a blog. It is and will always be the most important factor in determining the quality of a blog. Also, “Claps” is only one of the indicators of the usefulness or quality of a blog. There can be other indicators, such as “Views”. However, we have used only “Claps” as an indicator.

Method

Scraping

We built a web scraper in Python (thanks, Beautiful Soup & Selenium!) that grabbed about 600 blog posts. To maintain consistency, we scraped only blogs related to Data Science & Artificial Intelligence. I am not going to include the Python script here (our humble attempt to keep the blog clean), but the extra-curious can find it on this GitHub link.

Raw Data Scraped from Blogs

Feature Engineering

# Imports needed by the feature-engineering loop
from datetime import datetime

import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()  # requires the NLTK "vader_lexicon" resource
data = pd.DataFrame()
for i in range(data_scrapped.shape[0]):
    # Title: length and absolute VADER compound score ("emotional quotient")
    data.loc[i, "title_length"] = len(data_scrapped.loc[i, "title"])
    data.loc[i, "title_emot_quotient"] = abs(sid.polarity_scores(data_scrapped.loc[i, "title"])["compound"])
    # Body: image count, word count (paragraphs + bullets) and reading time
    data.loc[i, "ct_image"] = len(data_scrapped.loc[i, "images"])
    text = " ".join(data_scrapped.loc[i, "para"] + data_scrapped.loc[i, "bullets"]).split()
    data.loc[i, "ct_words"] = len(text)
    data.loc[i, "read_time"] = int(data_scrapped.loc[i, "read_time"])
    # Additional: age of the post in days, TDS tag flag and tag count
    data.loc[i, "days_passed"] = (datetime.now() - data_scrapped.loc[i, "datePublished"]).days
    data.loc[i, "featured_in_tds"] = "Towards Data Science" in data_scrapped.loc[i, "tags"]
    data.loc[i, "ct_tags"] = len(data_scrapped.loc[i, "tags"])
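To make the feature definitions concrete, here is a minimal sketch that computes the non-sentiment features for a single post using only the standard library. The `post` record, its values and the `extract_features` helper are hypothetical, and a fixed reference date stands in for `datetime.now()` so the result is reproducible. The sentiment feature is left out because it needs NLTK.

```python
from datetime import datetime


def extract_features(post, now):
    """Compute the non-sentiment features for one scraped post (hypothetical helper)."""
    # Join paragraphs and bullets, then split on whitespace to count words.
    words = " ".join(post["para"] + post["bullets"]).split()
    return {
        "title_length": len(post["title"]),  # characters in the title string
        "ct_image": len(post["images"]),
        "ct_words": len(words),
        "read_time": int(post["read_time"]),
        "days_passed": (now - post["datePublished"]).days,
        "featured_in_tds": "Towards Data Science" in post["tags"],
        "ct_tags": len(post["tags"]),
    }


# Made-up record for illustration only; not taken from the scraped data.
post = {
    "title": "A short example title",
    "images": ["img1.png", "img2.png"],
    "para": ["First paragraph of text.", "Second paragraph."],
    "bullets": ["one bullet point"],
    "read_time": "5",
    "datePublished": datetime(2018, 1, 1),
    "tags": ["Data Science", "Towards Data Science", "Machine Learning"],
}

features = extract_features(post, now=datetime(2018, 2, 12))
print(features)
```

Passing the reference date in as an argument, rather than calling `datetime.now()` inside the function, keeps `days_passed` stable when the analysis is re-run later.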

Data Summary

Visualization

Scatter plot of Claps vs Features

Correlation Plot of Claps vs Features

Correlation Plot
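Each cell of a correlation plot like this is a pairwise correlation coefficient. As a rough illustration, assuming the Pearson coefficient (the article does not say which one was used), it can be computed as below. The `pearson` helper and the sample numbers are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values for illustration only; not the scraped data.
read_time = [3, 5, 6, 8, 11]
claps = [40, 90, 150, 300, 900]

r = pearson(read_time, claps)
print(round(r, 2))
```

With pandas, `DataFrame.corr()` returns the same kind of pairwise coefficients for every pair of numeric columns at once.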

Most commonly used words

import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS

# Combine wordcloud's built-in stop words with NLTK's English list
stop = STOPWORDS.union(set(stopwords.words('english')))
# `text` must be a single string containing the blog text
wordcloud = WordCloud(relative_scaling=1.0, stopwords=stop, width=1500, height=800).generate(text)
plt.figure(figsize=(18, 16))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
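A word cloud sizes each word roughly by how often it appears once stop words are removed. The counting step behind it can be sketched with the standard library as follows. The sample text and the tiny stop-word set are made up for illustration and are far smaller than the `STOPWORDS` and NLTK lists used above.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word set, for illustration only.
stop_words = {"the", "a", "of", "and", "to", "in", "is", "for"}

sample_text = (
    "Data science is the study of data. "
    "Machine learning is a part of data science and learning from data."
)

# Lowercase, keep alphabetic tokens, and drop stop words before counting.
tokens = [w for w in re.findall(r"[a-z]+", sample_text.lower()) if w not in stop_words]
word_counts = Counter(tokens)

print(word_counts.most_common(3))
```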

Some Interesting Observations

  1. The mean number of claps is much higher than the median. This shows that most blogs get a low number of claps and that the distribution is right-skewed.
  2. Most titles are neutral in terms of their emotional quotient, and there is no visible trend between claps and the emotional content value of the title.
  3. Reading time, number of words and number of images are positively correlated with the number of claps, which is what we usually expect (longer blogs gain more likes). Also, words_count/img_count is negatively correlated with claps, likely because images capture more attention than text.
  4. It’s clear from the above plot that blogs with a greater number of tags get more claps. So don’t forget to add tags next time.
  5. The average length of a title is 6–7 words. However, titles with 10 words have the highest average number of claps (outliers were removed when creating the following graph).

  6. The number of claps increases with the number of days up to a certain value. From the graph it looks like days_passed does not really matter after 150 days, because the claps have saturated.

  7. The average reading time of a blog is 6–7 minutes. However, blogs with a reading time of 11 minutes have the highest average number of claps (outliers were removed when creating the following graph).
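Observations 1, 5 and 7 rest on two simple computations: comparing the mean with the median, and averaging claps within each group (for example, per title word count). Here is a minimal sketch with made-up numbers, not the scraped data:

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (title_word_count, claps) pairs for illustration only.
posts = [(5, 20), (6, 35), (6, 50), (7, 40), (7, 60), (10, 400), (10, 1200)]

claps = [c for _, c in posts]
# A mean well above the median signals a right-skewed distribution.
print(mean(claps), median(claps))

# Average claps for each title length.
groups = defaultdict(list)
for n_words, c in posts:
    groups[n_words].append(c)
avg_claps = {n: mean(cs) for n, cs in groups.items()}

# Title length with the highest average number of claps.
best_length = max(avg_claps, key=avg_claps.get)
print(best_length, avg_claps[best_length])
```

A few very popular posts are enough to pull the mean far above the median, which is why group averages like these are sensitive to outliers.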

I hope you enjoyed reading it! For more interesting observations, please read the second blog. For the exploratory analysis we used only 600 blogs, but for the machine learning techniques we used a larger dataset of 4,000 blogs. Let’s see how many of these inferences still hold true.

The link to the second blog can be found here.

LinkedIn: www.linkedin.com/in/alvira-swalin

Resources:

  1. Web scraping is inspired by the Data Acquisition lecture at USF by Terence Parr
  2. https://stackoverflow.com/questions/21006940/how-to-load-all-entries-in-an-infinite-scroll-at-once-to-parse-the-html-in-pytho
