
Discovering Powerful Data: An NLP Goldmine

Using the Free Guardian News API with Python to Unlock a Plethora of Tagged Data


Designed by Freepik – www.freepik.com

Anyone who has ever tackled an NLP problem knows the headache of finding labelled data. But which industry involves a lot of writing, with the work then labelled nicely on its website? That’s right: news articles. So let’s take advantage of this. The Guardian has an immensely powerful free API – 5,000 calls per day and almost 2 million pieces of content. All of it is tagged with The Guardian’s own labelling: Environment, Money, Environment/Energy, etc. Someone has already done all the hard work for us.

Request access to the API and you have a plethora of data for NLP. The tags make perfect labels, which makes the data ideal for training NLP classifiers. And it’s not just NLP: the data suits any sort of analysis, be it volume analysis, bag-of-words analysis, and so on.

First, request access to it here: https://open-platform.theguardian.com/access/

In this article, we cover:

1) Pulling the data into Python

2) Pulling in more than 200 results in one function

3) Pulling in multiple tags

4) End-to-end code

5) Potential future uses of the data

In light of the energy crisis, we’re going to explore the use case of finding news articles to do with energy, i.e. articles carrying energy-industry tags.

If you have trouble pulling in your own data, there are fully functioning code snippets throughout the article; they work provided you replace ‘YOUR API KEY’ with your own API key.

Pulling the Data into Python

Before pulling in any data, be sure to register and get your API key from https://open-platform.theguardian.com/access/.

As seen in the documentation, the API supports the typical call methods we can use with the requests package; in our case, we use the GET method and place the query variables directly into the URL string. Since we are focusing on tags/labelling, to utilise the work The Guardian has put into manually tagging each article, our example will query by tag. However, querying by search keyword can also be done.

Since we are going to query by tag, we first need to find out what tags exist. According to the API documentation, we can get a list of tags in the form of a JSON file with the GET method. Remember, though, that The Guardian has hundreds of tags, so it’s best to filter down by searching for the kind of tag you want.

Below we search for tags to do with energy.

Note: Replace YOUR API KEY with your API key

import requests

tag_json = requests.get('https://content.guardianapis.com/tags?q=energy'
                        '&page-size=200&api-key=YOUR API KEY').json()

What comes back is a JSON with all the tags matching the keyword energy. Each result carries an id field, which is the exact value we query by later; below is a quick way to list them, followed by the few tags I picked and put into a list.
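A minimal sketch of that inspection, assuming the tag_json response from above (the response and results fields come from the API’s JSON):

for tag in tag_json['response']['results']:
    print(tag['id'])  # e.g. 'environment/energy'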

tag_list = ['environment/energy', 'money/energy', 'technology/energy', 'science/energy', 'environment/energy-storage',
           'business/energy-industry', 'environment/energy-monitoring', 'environment/renewableenergy',
           'sustainable-business/hubs-energy-efficiency']

Let’s take the first tag in that list: I’m going to query the article data for the particular tag “environment/energy”. I’ve set a page size of 200, which seems to be about the maximum you can get per call.
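Here’s a minimal sketch of that call against the search endpoint (the response, total, and results fields come from the API’s JSON; remember to substitute your own API key):

response = requests.get('https://content.guardianapis.com/search?tag=environment/energy'
                        '&page-size=200&api-key=YOUR API KEY').json()['response']

print(response['total'])         # total number of articles carrying this tag
print(len(response['results']))  # at most 200 articles returned per call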

Note: “page” in an API call denotes the page number. E.g. with a page size of 200, requesting page 1 pulls the first 200 results and page 2 pulls the next 200.

The problem is, you can only query about 200 results at a time, so let’s discuss how to get around this.

Pulling in More than 200 Results in One Function

To do this, we can simply write functions around a for loop. Luckily, the API tells us how many results exist, so we can plan for how much data we need: if there are more than 200 rows to fetch, an if statement and a for loop can pull every page.

We can create three functions, sketched below:

1) Query one page of results

2) Check how much data exists, call every additional page, and collect it all into one list

3) Convert the list of JSON responses into a DataFrame
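A minimal sketch of those three functions – the names query_page, get_all_results, and results_to_dataframe are my own; the response, results, and pages fields come from the API’s JSON:

import requests
import pandas as pd

API_KEY = 'YOUR API KEY'  # replace with your own key

def query_page(tag, page):
    # Query one page (up to 200 results) of articles for a given tag
    url = ('https://content.guardianapis.com/search?tag=' + tag
           + '&page-size=200&page=' + str(page) + '&api-key=' + API_KEY)
    return requests.get(url).json()['response']

def get_all_results(tag):
    # Check how many pages exist for a tag, then pull every page into one list
    first_page = query_page(tag, 1)
    all_results = first_page['results']
    for page in range(2, first_page['pages'] + 1):
        all_results.extend(query_page(tag, page)['results'])
    return all_results

def results_to_dataframe(results, tag):
    # Convert the JSON article records into a DataFrame,
    # keeping the tag as a label column for training later
    df = pd.DataFrame(results)
    df['tag'] = tag
    return df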

With these three functions, you can now call as many articles as you’d like – as long as you don’t exceed 5,000 calls in a day. At 200 results per call, that’s up to 1,000,000 articles!

Pulling in Multiple Tags

The one final piece of functionality we want is pulling in multiple tags. Earlier, I created a list of tags that interest me; the final piece of the puzzle is to run a for loop over those tags, as sketched below.
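A sketch of that loop, reusing the helper functions and tag_list from above and concatenating everything into one labelled DataFrame (webTitle and webPublicationDate are fields returned by the API):

frames = []
for tag in tag_list:
    # pull every page of results for this tag, labelled with the tag itself
    frames.append(results_to_dataframe(get_all_results(tag), tag))

articles_df = pd.concat(frames, ignore_index=True)
print(articles_df[['webTitle', 'webPublicationDate', 'tag']].head())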

End to End Code

Here’s the end-to-end code to copy and paste if you’d like:

Don’t forget to change your API key.
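One possible consolidated version of everything above – the same sketch, under the same assumptions (my own function names, ‘YOUR API KEY’ as a placeholder):

import requests
import pandas as pd

API_KEY = 'YOUR API KEY'  # replace with your own key

tag_list = ['environment/energy', 'money/energy', 'technology/energy', 'science/energy',
            'environment/energy-storage', 'business/energy-industry',
            'environment/energy-monitoring', 'environment/renewableenergy',
            'sustainable-business/hubs-energy-efficiency']

def query_page(tag, page):
    # Query one page (up to 200 results) of articles for a given tag
    url = ('https://content.guardianapis.com/search?tag=' + tag
           + '&page-size=200&page=' + str(page) + '&api-key=' + API_KEY)
    return requests.get(url).json()['response']

def get_all_results(tag):
    # Check how many pages exist for a tag, then pull every page into one list
    first_page = query_page(tag, 1)
    all_results = first_page['results']
    for page in range(2, first_page['pages'] + 1):
        all_results.extend(query_page(tag, page)['results'])
    return all_results

def results_to_dataframe(results, tag):
    # Convert the JSON article records into a DataFrame, labelled with the tag
    df = pd.DataFrame(results)
    df['tag'] = tag
    return df

frames = [results_to_dataframe(get_all_results(tag), tag) for tag in tag_list]
articles_df = pd.concat(frames, ignore_index=True)
print(articles_df[['webTitle', 'webPublicationDate', 'tag']].head())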

Sample output:

Potential Future Uses

Tableau Dashboard to monitor news

To stay on top of news in the energy industry, I created a Tableau dashboard showing the volume of news over time; when I click on a bar, a list of the underlying articles is output below it.

NLP Training (With Google AutoML)

I said earlier that this tagged data is perfect for training NLP models. You can pull datasets with particular labels and use them to train your model. I would recommend Google AutoML, for which you can find a tutorial here. Or just stick with Python and scikit-learn.
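If you stay in Python, a minimal scikit-learn sketch might look like the following, using the articles_df built earlier with webTitle as the text and tag as the label (training on headlines alone is a simplification):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# headlines as text, Guardian tags as labels
X_train, X_test, y_train, y_test = train_test_split(
    articles_df['webTitle'], articles_df['tag'], test_size=0.2, random_state=42)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print('Held-out accuracy:', round(model.score(X_test, y_test), 2))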

I hope this helped – Happy data pulling.

If you’ve enjoyed this article, please leave a clap and a follow for support!


