NLP on Edinburgh Fringe 2019 data

Web scraping and text analysis of the events taking place during the Edinburgh Fringe

Aug 5, 2019 · 7 min read

In this post, we dive into the basics of scraping websites, cleaning text data, and Natural Language Processing (NLP). I’ve based the context of this project around the Edinburgh Fringe, the world’s largest arts festival, currently taking place between the 2nd and 26th of August.

Getting data into Python from web scraping

For my analysis, I wanted to acquire text about all of the events taking place during the Fringe festival. This text was spread across several webpages, and extracting it manually would have taken a long time.

This is where the Python library requests helps us out, as it can be used to make HTTP requests to a particular webpage. The response to an HTTP request contains the page's text data, but it comes back as one long wall of HTML, and this is where the library BeautifulSoup comes in: it parses the returned HTML so that we can efficiently extract only the content we want.

The below code snippet demonstrates making a request to a webpage and parsing the response through BeautifulSoup:

import requests
from bs4 import BeautifulSoup
# URL to query
url = 'https://url_to_query'
# Scrape html from the URL
response = requests.get(url)
# Use html parser on webpage text
soup = BeautifulSoup(response.text, 'html.parser')

Within the returned variable soup, we can search for particular classes in the HTML using commands such as soup.find_all(class_='CLASS_NAME'). Using this approach, I acquired data for 5,254 festival events, including the event name, a brief description of the event, ticket prices, and the number of shows being performed.
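
As an illustration, the snippet below sketches how individual event fields might be pulled out of the parsed HTML; the tag and class names are placeholders rather than the real Fringe markup, which you would find by inspecting the page source.

# A minimal sketch of extracting event fields from the parsed HTML.
# The tag and class names below are placeholders, not the actual Fringe markup.
events = []
for card in soup.find_all('div', class_='event-card'):
    name = card.find('h3', class_='event-title').get_text(strip=True)
    description = card.find('p', class_='event-description').get_text(strip=True)
    events.append({'name': name, 'description': description})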

Exploring the Fringe data

After getting a data set, typically the next step is to explore what you have. I was interested to know what the distribution of ticket prices was across the events. A restricted view of this data is shown below:

Distribution of events by ticket price (view limited to cost up to £60)

From the chart, we can see that over 25% of all events are free to attend, with £5–£20 covering a large portion of the distribution. A closer look at the data revealed shows costing more than £60, most of which were technical masterclasses or food and drink tasting sessions. What I was really interested in, though, was the types of shows taking place, for which we need to start working with our text data.
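
For reference, a histogram like the one above can be sketched with pandas and matplotlib, assuming the scraped prices ended up in a numeric column; the column name df['price'] is an assumption for illustration.

import matplotlib.pyplot as plt

# Assumes a numeric ticket-price column; df['price'] is an illustrative name
df.loc[df['price'] <= 60, 'price'].hist(bins=30)
plt.xlabel('Ticket price (£)')
plt.ylabel('Number of events')
plt.show()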

Cleaning

When using text for data science projects, the data will almost always require some level of cleaning before it is passed to any models we want to apply. With my data, I combined the event name and description text into a single field called df['text']. The following code shows a few of the steps that were taken to clean our text:

import string
import pandas

def remove_punctuation(s):
    s = ''.join([i for i in s if i not in frozenset(string.punctuation)])
    return s

# Transform the text to lowercase
df['text'] = df['text'].str.lower()
# Remove the newline characters
df['text'] = df['text'].replace('\n', ' ', regex=True)
# Remove punctuation
df['text'] = df['text'].apply(remove_punctuation)

Bag Of Words (BOW)

With each row in our DataFrame now containing a field of cleaned text data, we can proceed to inspect the language behind it. One of the simpler methods is called Bag of Words (BOW), which creates a vocabulary of all the unique words occurring in our data. We can do this using CountVectorizer imported from sklearn. As part of setting up the vectorizer, we include the parameter stop_words='english', which removes common words from the dataset such as the, and, of, to, in, etc. The following code performs this on our text:

from sklearn.feature_extraction.text import CountVectorizer

# Initialise our CountVectorizer object
vectorizer = CountVectorizer(analyzer='word', tokenizer=None, preprocessor=None, stop_words='english')

# fit_transform fits our data and learns the vocabulary whilst transforming the data into feature vectors
word_bag = vectorizer.fit_transform(df.text)

# print the shape of our feature matrix
word_bag.shape

The final line prints the shape of the matrix we have created, which in this case was (5254, 26869). This is a sparse matrix with one row per event and one column per unique word in our corpus, recording how often each word appears in each event's text. One benefit of this matrix is that we can use it to find the most common words in our dataset; the following code snippet shows how:

# Get and display the top 10 most frequent words
freqs = [(word, word_bag.getcol(idx).sum()) for word, idx in vectorizer.vocabulary_.items()]
freqs_sort = sorted(freqs, key = lambda x: -x[1])
for i in freqs_sort[:10]:
    print(i)

The top ten most frequent words in the Fringe data are:

('comedy', 1411)
('fringe', 1293)
('new', 927)
('theatre', 852)
('edinburgh', 741)
('world', 651)
('life', 612)
('festival', 561)
('music', 557)
('join', 534)

This is to be expected and doesn’t tell us much about the depth of what is on offer. Let’s try mapping how events might be connected to each other.

TF-IDF and Cosine Similarities

When working within NLP, we are often aiming to understand what a particular string of text is about by looking at the words that make up that text. One measure of how important a word might be is its term frequency (TF); this is how frequently a word occurs in a document. However, there are words that can occur many times but may not be important; some of these are the stopwords that have already been removed.

Another approach that we can take is to look at a term's inverse document frequency (IDF), which decreases the weight of commonly used words and increases the weight of words that rarely appear across a collection of documents. Combining the two gives TF-IDF, a method for emphasising words that occur frequently in a given observation while de-emphasising words that occur frequently across many observations. This technique is great at determining which words will make good features. For this project, we are going to use the TF-IDF vectorizer from scikit-learn. We can use the following code to fit our text to the TF-IDF model:

from sklearn.feature_extraction.text import TfidfVectorizer as TFIV

vctr = TFIV(min_df=2,
            max_features=None,
            strip_accents='unicode',
            analyzer='word',
            token_pattern=r'\w{1,}',
            ngram_range=(1, 2),
            use_idf=True,
            smooth_idf=True,
            sublinear_tf=True,
            stop_words='english')

X = vctr.fit_transform(df.text)

From the matrix X, we can understand each event's text as a vector of word weights. To find similar events, we are going to use cosine similarity, which again we can import from sklearn. The following snippet demonstrates how to perform this for a single event (X[5]), and the output shows the most related events and their similarity scores (a score of 1 would mean identical text).

from sklearn.metrics.pairwise import linear_kernel

cosine_similarities = linear_kernel(X[5], X).flatten()
related_docs_indices = cosine_similarities.argsort()[:-5:-1]
cos = cosine_similarities[related_docs_indices]
print(related_docs_indices)
print(cos)

[5 33 696 1041]
[1. 0.60378536 0.18632652 0.14713335]

Repeating this process across all of the events leads to the creation of a map of all events and how similar events link together.
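
As a sketch of how that map can be built, one option is to compute the full pairwise cosine-similarity matrix and keep only the pairs above a threshold, writing them out as an edge list that Gephi can import. The 0.2 threshold matches the graph below, while the output filename and column headers are just illustrative choices.

import csv
from sklearn.metrics.pairwise import cosine_similarity

# Full pairwise cosine-similarity matrix between all event vectors
sim = cosine_similarity(X)

# Keep pairs above the threshold and write a Gephi-friendly edge list
with open('edges.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Source', 'Target', 'Weight'])
    for i in range(sim.shape[0]):
        for j in range(i + 1, sim.shape[0]):
            if sim[i, j] > 0.2:
                writer.writerow([i, j, sim[i, j]])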

Gephi network graph showing cosine similarities >0.2 for all events, coloured by LDA topic

What is interesting about the network graph above is the vast number of events that have no relationships to others in the network. These are scattered around the edge of the network and highlight the degree of originality on show at the Fringe festival. In the middle, we can see a few clusters with a high degree of overlap between events. In the final stage of the project, we model these clusters and try to assign themes to them.

Topic Modelling

Latent Dirichlet Allocation (LDA) is a model that generates topics based on word frequency. We use it here for finding particular themes in our Fringe data about the types of events taking place. The code below shows how to get started with this approach:

from sklearn.decomposition import LatentDirichletAllocation

vectorizer = CountVectorizer(analyzer='word',
                             min_df=20,
                             token_pattern=r'\w{1,}',
                             ngram_range=(2, 4),
                             preprocessor=None,
                             stop_words='english')

word_bag = vectorizer.fit_transform(df.text)

lda = LatentDirichletAllocation(n_components=25,
                                max_iter=50,
                                learning_method='online',
                                learning_offset=40.,
                                random_state=0).fit(word_bag)

names = vectorizer.get_feature_names()

for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d:" % (topic_idx))
    print(" ".join([names[i] for i in topic.argsort()[:-5 - 1:-1]]))

The above code is just an example of running the LDA model and printing the output topics and their most important words. From the output of the LDA, I used Principal Coordinate Analysis to create a 2D projection of our topics:

Topic modelling from LDA method

Being able to plot the topics from running LDA is useful for observing where topics are overlapping and tuning the model. It also provides a method for understanding the different types of events taking place during the festival such as music, comedy, poetry, dance, and writing.
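
The projection step itself isn't shown above. As a rough sketch, one way to produce it is to measure pairwise distances between the topic-word distributions and then apply classical multidimensional scaling, which is the same computation as Principal Coordinate Analysis; the choice of Jensen-Shannon distance here is an assumption.

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Normalise topic-word weights into probability distributions
topics = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]

# Pairwise Jensen-Shannon distances between topics (one reasonable choice)
D = squareform(pdist(topics, metric='jensenshannon'))

# Principal Coordinate Analysis (classical MDS): double-centre the squared
# distances and take the top two eigenvectors as 2D coordinates
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1][:2]
coords = eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))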

Closing thoughts

Our analysis demonstrates that, with the basics of NLP methods and techniques, we can interrogate a small dataset of text to gain insight. The BOW model, despite being simple, provides a quick view of the vocabulary in the text and the words used most often, although it wasn't much of a surprise for our data that words such as 'comedy', 'fringe', and 'edinburgh' were the most common.

Expanding on this, TF-IDF provides a way to start thinking about sentences rather than individual words, with the addition of cosine similarity providing a method for grouping observations. This showed us the extent of original events at the Fringe, many of which share little text with any other event.

Finally, with LDA we get our model to produce topics where groups of events are allocated to a theme. This allows us to gauge the main types of events happening throughout the festival.

There are several other NLP methods worth investigating. As part of the project, I also used Word2Vec and t-SNE; however, the findings from these have not been presented here.
