Leveraging NLP and ML to Consume News Better—and then Making an App out of It

Here is how I leveraged Machine Learning, NLP, and Twitter to have my own digital newspaper delivered daily.

Adithya L Narayanan
Towards Data Science


Photo by The New York Public Library on Unsplash

His aunt and uncle exchanged looks of outrage.
“Listening to the news! Again?”
“Well, it changes every day, you see,” said Harry.

I like reading the news — a lot. I endlessly scroll through Twitter and Google News daily in the quest to read better news. And when I think of it, it is for no particular reason.

One fine day, my digital wellbeing tracker read that I spent over two hours on Twitter. I know what I was doing: hunting for news.

When push came to shove and I realized my need to cut down on spending time on Twitter, I found a workaround to still find my daily dose of news, without losing all of the fun that Twitter brings along with it. I built a news aggregator app!

Beyond just news.

Reading news and watching videos are not standalone events. Context is essential when consuming any content, and for news that context comes largely from the opinions the text carries and the reactions it evokes.

Public Pulse

News is entangled with opinion at all times, be it that of the journalist or of the people consuming it. Understanding the intricacies of these opinions helps in understanding news itself. For instance, a news piece or an opinion piece with a large number of negative responses may not have been well received. This may warrant additional digging into what it speaks of.

With or without this exact motivation, many of us read about public perception of things that interest us. This, I believe, is a big reason why comments, replies, and threads exist, often publicly, on popular social media platforms (be it Twitter, Facebook, YouTube, or Reddit). I do this especially when I read news articles, to gauge what people think of what is happening, but it only adds to the endless scrolling.

To solve this, I decided to leverage NLP to spoon-feed me the gist. For every news tweet the app shortlists, it collects the replies to that tweet and calculates their polarity (sentiment) using NLTK’s VADER and spaCy-textblob. By judging whether positive or negative replies dominate the thread, the app determines the emotion of the reaction to the news article; in other words, it judges the polarity of the discourse surrounding it.
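As a rough illustration of that last step, here is a minimal sketch of how per-reply polarity scores could be turned into a verdict for a whole thread. The function names are hypothetical, and the ±0.05 cut-off is the convention commonly used with VADER's compound score, not a value the app is confirmed to use. The VADER helper needs `nltk` and its `vader_lexicon` data, so it is imported lazily and the rest of the block runs without it.

```python
from collections import Counter


def label_reply(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a coarse label (threshold is an assumption)."""
    if polarity >= threshold:
        return "positive"
    if polarity <= -threshold:
        return "negative"
    return "neutral"


def thread_sentiment(polarities, threshold=0.05):
    """Return whichever of positive/negative dominates; neutral replies are ignored."""
    counts = Counter(label_reply(p, threshold) for p in polarities)
    positive, negative = counts["positive"], counts["negative"]
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "mixed"  # tie, or no opinionated replies at all


def vader_polarities(replies):
    """Score reply texts with NLTK's VADER (requires nltk and the vader_lexicon data)."""
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return [analyzer.polarity_scores(text)["compound"] for text in replies]
```

With replies in hand, `thread_sentiment(vader_polarities(replies))` would give one label per news tweet. Counting labels rather than averaging scores keeps a single very angry reply from outweighing several mildly positive ones; averaging would be an equally defensible choice.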

Should news be opinionated?

“Anchors having an opinion isn’t a new phenomenon. Murrow had one and that was the end of McCarthy. Cronkite had one, and that was the end of Vietnam.”

Charlie Skinner, The Newsroom.

There are diverse schools of thought on whether a news source should stick to just providing facts or if it should shape the multilogue that accompanies news. I believe in the latter, but I will leave that debate for a different date.

Regardless of which side of this debate you are on, it is always good to know how much subjectivity the news text carries.

Measuring the subjectivity of the news text itself is hard, because the full text often sits behind paywalls or expensive APIs. What we can obtain quickly is the tweet from the news source’s verified handle that carries the article. Assuming, reasonably, that the tweet’s text is a fair but abridged version of the article’s content, a subjectivity measure of the tweet can be a good substitute. The app obtains the subjectivity of these news tweets with spaCy-textblob as well.

All that is fine. What was all that app talk?

I decided I had had enough of scrolling through Twitter all day to consume news reported by different agencies. I wanted to be able to read this news, but in as little time as possible and in an organized fashion. Here is what I did:

Gather the tweets from official handles (@nytimes, @washingtonpost, etc.)

I used Twitter’s API to mine about 1,000 such tweets daily, each linking to an article from a reputed news source.

Make them newspaper-y.

In short, categorize the news. It would be mentally taxing to take in news across categories in quick succession without some formal structure.

So, these tweets were run through an ensemble of classifiers (Multinomial Naïve Bayes, Logistic Regression, and Support Vector Classifier). When they come out, they are grouped by the category of news they belong to: business, politics, sports, entertainment, or health. The classifiers were trained on data mined using NewsAPI, combined with two annotated datasets obtained from Kaggle. The ensemble categorizes tweets with more than 80% accuracy.
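For readers who want to try something similar, here is a minimal sketch of such an ensemble. The article does not say which library was used or how the three classifiers' votes were combined, so scikit-learn, TF-IDF features, a linear kernel for the SVC, and hard (majority) voting are all assumptions, and the function names are hypothetical. `majority_vote` shows in plain Python what hard voting does; `build_ensemble` imports scikit-learn lazily.

```python
from collections import Counter


def majority_vote(labels):
    """Hard voting: return the label most classifiers agree on.

    On a tie, the label that appears first in `labels` wins.
    """
    return Counter(labels).most_common(1)[0][0]


def build_ensemble():
    """Build a TF-IDF + hard-voting text classifier (requires scikit-learn)."""
    from sklearn.ensemble import VotingClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    voter = VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("lr", LogisticRegression(max_iter=1000)),
            ("svc", LinearSVC()),
        ],
        voting="hard",  # each model casts one vote per tweet
    )
    return make_pipeline(TfidfVectorizer(), voter)
```

Given lists of training texts and category labels, `model = build_ensemble()` followed by `model.fit(texts, labels)` and `model.predict(new_tweets)` would return one category per tweet.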

Find the top stories.

The app then shortlists the top five stories in each category, for a grand total of twenty-five stories. It does this simply by picking the tweets with the highest engagement (favorites and retweets).
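The shortlisting step can be sketched in a few lines. The tweet dictionaries and their keys (`category`, `favorites`, `retweets`) are hypothetical, and adding favorites to retweets is just one assumed way of combining the two engagement counts.

```python
from collections import defaultdict


def top_stories(tweets, per_category=5):
    """Return the most-engaged tweets in each category, best first."""
    by_category = defaultdict(list)
    for tweet in tweets:
        by_category[tweet["category"]].append(tweet)

    def engagement(tweet):
        # Assumption: favorites and retweets are weighted equally.
        return tweet["favorites"] + tweet["retweets"]

    return {
        category: sorted(group, key=engagement, reverse=True)[:per_category]
        for category, group in by_category.items()
    }
```

With five categories and `per_category=5`, this yields at most twenty-five stories; a category with fewer than five tweets on a given day simply contributes fewer.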

Create the context.

Now for the not-so-secret ingredients. Each of these tweets is run through a subjectivity analyzer that scores, from 0 to 1, how subjective the tweet’s language is.

Following this, we also collect all the replies (this is a lie: we collect the replies for the tweets where the Twitter API’s rate limit doesn’t kick us out) and assign a polarity score to each reply using two different sentiment analyzers.

To know more about collecting replies, read: Mining replies to Tweets: A Walkthrough

Once this is done, we know whether the thread of conversation surrounding a tweet linking a news story is positive or negative.

We present this information.

Send it to press.

We have our news, we have our context.

To prep our readers, we also add some context beyond the stories themselves, so that they are aware of what else is happening and of how much of the day’s news these stories account for.

We present two additional visualizations. A pie chart depicts the top 10 trending named entities for the day and how often each appears (out of the ~1000 news tweets with named entities). A bar chart shows the number of news stories in each category, indicating which kind of news is prevalent in the current climate.
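The numbers behind the entity pie chart can be computed with a simple counter. This is a minimal sketch that assumes named entities have already been extracted upstream (for example with a named-entity recognizer) into one list of strings per tweet; the function name is hypothetical, and counting each entity at most once per tweet is one interpretation of "how often they appear".

```python
from collections import Counter


def top_entities(entities_per_tweet, n=10):
    """Return (entity, tweet_count, share) for the n most frequent entities.

    Tweets with no entities are left out of the denominator, and an entity
    mentioned twice in the same tweet is counted once.
    """
    with_entities = [set(ents) for ents in entities_per_tweet if ents]
    counts = Counter(ent for ents in with_entities for ent in ents)
    total = len(with_entities)
    return [(ent, count, count / total) for ent, count in counts.most_common(n)]
```

The bar chart's data is simpler still: `Counter(tweet["category"] for tweet in tweets)` gives the number of stories per category, assuming each tweet carries its predicted category under a `category` key.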

To tie it all together, we present all of this in a web application, hosted here:

bit.ly/northeastws

We present the information by sticking as close to Twitter’s UI as possible, using embedded tweets and showing every statistic we computed and every visual we produced alongside them.

A quick sneak peek of what the app looks like

Conclusion

There are several features I hope to add to this application in the future, from analyzing the text of the actual news article for toxicity and hyper-partisanship to analyzing the thumbnail image to compute its polarity.

But for now, if you would like to read news without endlessly scrolling through social media all day, and want to satisfy your curiosity about what other people think of this news, I would love to welcome you to this one-stop shop with me.

For any comments, feedback, questions, chats, etc., you can find me right here.

Adithya Narayanan is a Graduate Student at the University at Buffalo. He specializes in Operations Research and has a background in Computer Science. Reading news and commenting on it are his natural instincts.

He has an alter ego, who, when bored, loves creating digital content of all kinds but primarily prefers a pen and paper. He has created digital content for several larger-than-life sporting entities such as LaLiga Santander, Roland-Garros, and NBA Basketball School.

This is his LinkedIn & email.


Works towards mastering Data Sciences and Ops Research in the free time he gets between watching “Billions” and reading the news. His alter ego likes writing.