How I used Natural Language Processing to extract context from news headlines

Published in Towards Data Science · 5 min read · Apr 12, 2018

Recently I came across a really amazing dataset on Kaggle (https://www.kaggle.com/therohk/india-headlines-news-dataset). This is one of the rare times you get to see data in an Indian context. The data consists of about 2.5 million news headlines published in a national Indian daily, the ‘Times of India’. I thought it would be really nice to analyse this data and extract some insights from it. So one fine evening, I decided to pull an all-nighter, gleaning anything interesting from over 2.5 million news headlines.

Part 1: Getting to know the data

So, I began exploring this dataset. What do you do if you have a lot of text and want to see what general trends exist in the data? You start with simple word frequencies! I ended up counting the most common unigrams, bigrams and trigrams and discovering some insights. Below is one example, a very simple frequency plot of tokens.
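For readers who want to try this kind of counting themselves, here is a minimal sketch using only the standard library. It is not the code behind the plots in this post: the whitespace tokenizer and the helper names `ngrams` and `ngram_counts` are assumptions made for illustration, and the sample headlines are a few of those quoted later in the post.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(headlines, n):
    """Count n-grams over all headlines without crossing headline boundaries."""
    counts = Counter()
    for headline in headlines:
        # Assumption: lowercase + whitespace split stands in for a real tokenizer
        tokens = headline.lower().split()
        counts.update(ngrams(tokens, n))
    return counts


if __name__ == "__main__":
    sample = [
        "FBI woos 16-year-old indian author",
        "FBI woos 16-year-old Indian author",
        "10-year-old girl missing",
    ]
    for n in (1, 2, 3):
        print(n, ngram_counts(sample, n).most_common(3))
```

On the real data, `headlines` would be the `headline_text` column, and you would probably want to drop stop words before counting.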

Part 2: Hitting the brick wall

From this visualization, I could easily figure out that Shah Rukh Khan grabs a lot of headlines, and that the BJP, as a political outfit, manages to maintain its presence quite prominently alongside Bollywood stars! So far so good. But this was when I was about to hit a brick wall with my analysis.

So I thought, why not continue creating frequency plots of tokens from different points of view? I figured it would be a good idea to create a frequency plot of common bigram tokens across the years. Essentially, I wanted to find the most frequent bigram tokens in 2001 (the first year of data available), then the most frequent bigram tokens in 2002, then the frequent tokens common to both 2001 and 2002, and to continue accumulating these common tokens across the years. This is the plot I ended up with.
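The year-by-year accumulation can be read in more than one way. The sketch below takes one reading, a running intersection of each year's top-k bigrams, and it is an assumption rather than the exact code behind the plot. The function names, the cutoff `k` and the toy headlines are all hypothetical.

```python
from collections import Counter


def top_bigrams(headlines, k):
    """Return the k most frequent bigrams in a list of headlines, as a set."""
    counts = Counter()
    for headline in headlines:
        tokens = headline.lower().split()  # naive tokenizer (assumption)
        counts.update(zip(tokens, tokens[1:]))
    return {" ".join(pair) for pair, _ in counts.most_common(k)}


def common_across_years(headlines_by_year, k):
    """Map each year to the bigrams that were in the top k in every year so far."""
    common = None
    survivors = {}
    for year in sorted(headlines_by_year):
        top = top_bigrams(headlines_by_year[year], k)
        common = top if common is None else common & top
        survivors[year] = set(common)
    return survivors


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the dataset
    toy = {
        2001: ["year old man held", "year old girl missing"],
        2002: ["year old boy rescued", "temple year old"],
    }
    print(common_across_years(toy, k=1))
```

A bigram that survives to the last year has been among the most frequent in every single year, which is why something as generic as ‘year old’ can end up on top.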

Here you can see that the most frequent bigram common across the years is ‘year old’. But what does it mean? In what context was it used? Sadly, frequency plots can only take us so far. They mostly fail to tell us about the context. This was my brick wall! For a moment I thought: it’s 2 o’clock in the morning, let me go to sleep!

But then I remembered Randy Pausch:

So, I slogged on… And finally it dawned on me….

Part 3: Climbing the brick wall

I won’t lie, I think I dozed off a bit and dreamt of my grammar classes in high school. What kind of information does a noun, a verb or an adjective convey in a given sentence? Are newspaper headlines not sentences?

All I then needed to do was pull out all the headlines where the token “year old” occurred and then find out which nouns and verbs co-occur with this token. But how do you do that? You tag each word in every headline with its part of speech (POS). POS tagging is a standard Natural Language Processing technique, and most NLP libraries provide it. I chose spaCy, and this is the small code snippet I had to write:

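import matplotlib.pyplot as plt
import pandas as pd
import spacy

# Assumptions: `data` is the headlines DataFrame loaded earlier, and the model
# name below is a stand-in (the post does not say which spaCy model was used).
nlp = spacy.load("en_core_web_sm")

# Keep only headlines that contain both "year" and "old" as whole words
index = data['headline_text'].str.match(r'(?=.*\byear\b)(?=.*\bold\b).*$')
texts = data['headline_text'].loc[index].tolist()

noun = []
verb = []
# The original call also passed n_threads=16; it is dropped here because
# newer spaCy releases no longer accept that argument.
for doc in nlp.pipe(texts, batch_size=10000):
    for c in doc:
        if c.pos_ == "NOUN":
            noun.append(c.text)
        elif c.pos_ == "VERB":
            verb.append(c.text)

# Plot the ten most frequent nouns and verbs side by side
plt.subplot(1, 2, 1)
pd.Series(noun).value_counts().head(10).plot(kind="bar", figsize=(20, 5))
plt.title("Top 10 Nouns in context of 'Year Old'", fontsize=30)
plt.xticks(size=20, rotation=80)
plt.yticks([])
plt.subplot(1, 2, 2)
pd.Series(verb).value_counts().head(10).plot(kind="bar", figsize=(20, 5))
plt.title("Top 10 Verbs in context of 'Year Old'", fontsize=30)
plt.xticks(size=20, rotation=80)
plt.yticks([])
plt.show()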

To create this plot:

And lo and behold! I had the context associated with the token “year old”. This token was used in news headlines reporting violent acts and crimes, mostly against women.

Just to be sure that I was right about my conclusions, I looked at the actual news headlines where ‘year old’ was mentioned, and this is what I saw:

['10-year-old girl missing',
 'Relative kills 9-year-old',
 '59-year-old widow murdered',
 'Spunky 60-year-old woman prevents burglary',
 "75-year-old woman done to death in B'lore",
 'Encroachments threaten 900-year-old temple',
 '3 nabbed for 5-year-old robbery',
 '25-year-old man electrocuted',
 '5-year-old boy run over',
 'Killers of 88-year-old woman arrested',
 '21-year-old held in theft case',
 "60-year-old gets two years' RI for rape attempt",
 'STRAIGHT ANSWERSBRSwati Aneja 13 year old schoolgirl on what I Day means to her',
 'Robbers stab 9-year-old',
 "Eight year old's brush with 'commissions'",
 'By Ganesha; what 81-year-old Deryck does BEST',
 'Six-year-old girl raped; murdered',
 'FBI woos 16-year-old indian author',
 'Six-year old raped murdered in Patiala cantonment',
 'FBI woos 16-year-old Indian author']
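If you are wondering why hyphenated forms like ‘10-year-old’ show up in this list, it comes down to the regular expression used in the filter. Here is a small sketch that checks the same pattern with the standard `re` module (pandas' `str.match` is built on `re.match`). The helper name `mentions_year_old` is made up for illustration, and the last three test strings are hypothetical rather than taken from the dataset.

```python
import re

# Same pattern as in the filter: two lookaheads require "year" and "old"
# as whole words, in any order, anywhere in the headline.
PATTERN = re.compile(r'(?=.*\byear\b)(?=.*\bold\b).*$')


def mentions_year_old(headline):
    """Return True if the headline has both 'year' and 'old' as whole words."""
    return PATTERN.match(headline) is not None


if __name__ == "__main__":
    for text in [
        "10-year-old girl missing",                  # hyphens count as word boundaries
        "Eight year old's brush with 'commissions'",
        "old fort reopens after a year",             # hypothetical: order does not matter
        "Two years old and already reading",         # hypothetical: 'years' is not 'year'
        "Year Old traditions revived",               # hypothetical: pattern is case-sensitive
    ]:
        print(mentions_year_old(text), "|", text)
```

Note that the pattern does not require the two words to be adjacent, so it can also pick up headlines where ‘year’ and ‘old’ appear far apart.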

Phew! That was some work. I continued working on this dataset, looking for more such stories. You can view my work in progress in this Kaggle Kernel: https://www.kaggle.com/gunnvant/what-india-talks-about-a-visual-essay

Don’t forget to clap if you liked this post. Also, if you are on Kaggle, it would be great if you could upvote my kernel.

Written by Gunnvant Saini