
Using machine learning to analyse qualitative survey data

Instant analysis and visualisation of open-ended data with a mix of NLP, sentiment analysis, neural networks and topic modelling

Image from pexels.com

Using Latent Dirichlet Allocation to analyse qualitative survey data

Surveys are one of the most widely used methods in market research and data collection. While the key output tends to be quantitative data, open-ended questions are frequently asked to elicit a broad range of responses. These verbatim answers can often help to answer the ‘why’ behind the numbers.

However, analysis of open-ended survey data is hard work. It can take hours or even days to go through verbatim responses in a large survey dataset. Not only that, it is almost exclusively done through human coding. As a result, qualitative responses are often ignored or just used to supplement the narrative by pulling out a handful of verbatim quotes.

This led me to the question – Is there a better way to reveal insights in qualitative survey data?

Using a mixture of natural language processing, neural networks, sentiment analysis and topic modelling, I created a model that takes in a dataset and automatically returns the key themes in the data. And this took just 10 hours of my time.

TL;DR: In ten hours, I created a model that can automatically extract key themes from a dataset of any size. On my test dataset of ‘just’ 3,000 responses, the entire analysis below (including the visualisations) can be generated in less than 30 seconds.

Read on if you’re keen to geek out with me – I promise it’s worth your two minutes.

I used a publicly available dataset containing 3,000 responses from a community survey in Austin, Texas. At the end of the survey, respondents were given an option to provide written comments in response to the following question: "If there was ONE thing you could share with the Mayor regarding the City of Austin (any comment, suggestion, etc.), what would it be?"

I thought this would be an interesting data science challenge to tackle given the wide range of possible responses.

Here’s my machine learning pipeline –


Step 1. Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a popular Natural Language Processing (NLP) tool that can automatically identify topics in a corpus. LDA assumes each topic is made of a bag of words with certain probabilities, and each document is made of a bag of topics with certain probabilities. The goal of LDA is to learn the word and topic distributions underlying the corpus. Gensim is an NLP package that is particularly suited to LDA and other topic-modelling and word-embedding algorithms, so I used it to implement my project.

import gensim
from gensim import corpora

# Create Dictionary mapping each token to an integer id
id2word = corpora.Dictionary(processed_data)
# Create Corpus: term-document frequency (bag of words per response)
corpus = [id2word.doc2bow(text) for text in processed_data]
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=4,
                                            random_state=42,
                                            update_every=1,
                                            chunksize=10,
                                            per_word_topics=True)

After some preprocessing of the data to remove common words, I was able to obtain topics from respondent feedback that mainly revolved around –

1) Cost of living

2) Utilities

3) Traffic

4) Miscellaneous issues (classified as topic 0)

Looking ‘under the hood’, we can see the most representative sentences for each topic identified by the model.
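The post doesn’t show how these representative sentences were pulled out; one plain-Python way, given per-document topic distributions from the trained model (e.g. lda_model.get_document_topics(bow)), is to keep the highest-probability document per topic. The function name and example distributions below are my own illustration:

```python
def representative_docs(doc_topics, texts, num_topics):
    # doc_topics: per-document topic distributions, e.g. the output of
    # lda_model.get_document_topics(bow) for each bag-of-words document
    best = {t: (0.0, None) for t in range(num_topics)}
    for dist, text in zip(doc_topics, texts):
        for topic, prob in dist:
            if prob > best[topic][0]:
                best[topic] = (prob, text)
    return best

# Hypothetical topic distributions for two short responses
doc_topics = [[(0, 0.9), (1, 0.1)], [(1, 0.8), (2, 0.2)]]
texts = ["fix the roads", "lower utility bills"]
best = representative_docs(doc_topics, texts, num_topics=3)
```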

Naturally, no qualitative analysis can ever be complete without a word cloud 🙂 Here we can see the most representative words for each topic.

Getting a bit geeky now – I also trained a word2vec neural network and projected the top words from the LDA topics onto the word2vec space. We can visualise the topic clusters in 2D using the t-SNE (t-distributed stochastic neighbour embedding) algorithm. This lets us see how the model has separated the four topics – utilities issues appear to be the least commonly mentioned, and there is some overlap between cost of living and utilities issues.

import numpy as np
import pandas as pd
import matplotlib.colors as mcolors
from sklearn.manifold import TSNE
from bokeh.plotting import figure, show, output_notebook

# Getting topic weights for each document
topic_weights = []
for i, row_list in enumerate(lda_model[corpus]):
    topic_weights.append([w for _, w in row_list[0]])
# Generating an array of topic weights
arr = pd.DataFrame(topic_weights).fillna(0).values
# Dominant topic in each document
topic_num = np.argmax(arr, axis=1)
# t-SNE dimension reduction to 2D
tsne_model = TSNE(n_components=2, verbose=1, random_state=0, angle=.99, init='pca')
tsne_lda = tsne_model.fit_transform(arr)
# Plotting the topic clusters with Bokeh
output_notebook()
n_topics = 4
mycolors = np.array([color for name, color in mcolors.TABLEAU_COLORS.items()])
plot = figure(title="t-SNE Clustering of {} LDA Topics".format(n_topics),
              plot_width=900, plot_height=700)
plot.scatter(x=tsne_lda[:,0], y=tsne_lda[:,1], color=mycolors[topic_num])
show(plot)

And finally, I used pyLDAvis to create an interactive visualisation of this topic model. It can be used to chart the most salient words per topic, and to see how well separated the topics are.

import pyLDAvis
import pyLDAvis.gensim

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary=lda_model.id2word)
vis

Though this dataset was somewhat lacking in features to split the data on, we can see that the topics are fairly evenly split across the different council districts. After dropping topic 0 (miscellaneous), there is evidence of a salience hierarchy between cost of living, traffic and utilities. In nine of the ten districts, utilities is the most talked-about issue for improvement. It would have been interesting to have demographic data to explore differences in prominent topics amongst citizens.
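The per-district split can be computed with a simple pandas group-by, assuming a dataframe with one row per response carrying its council district and the dominant LDA topic assigned earlier (the column names and toy data here are my own illustration):

```python
import pandas as pd

# Hypothetical dataframe: one row per response, with its council district
# and the dominant LDA topic assigned to that response
df = pd.DataFrame({
    "district": [1, 1, 2, 2, 2],
    "dominant_topic": ["traffic", "cost_of_living", "traffic", "traffic", "utilities"],
})

# Share of each topic within each district, as percentages
shares = (
    df.groupby("district")["dominant_topic"]
      .value_counts(normalize=True)
      .mul(100)
      .rename("percent")
      .reset_index()
)
```

Plotting `shares` as a stacked bar chart per district reproduces the kind of comparison described above.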


Step 2. Sentiment Analysis

I was also interested in analysing the sentiment of the open-ended survey comments and merging that with my topic model. I used the VADER library to assign sentiment scores, and defined a topic’s rating as the percentage of respondents who gave a positive comment when they mentioned that topic.

I used my LDA model to determine the topic composition of each sentence in a response. If a sentence was dominated by one topic with a weight of 60% or more, I considered the sentence as belonging to that topic. I then classified the sentiment of the sentence as positive or negative, and finally counted the percentage of positive sentences within each topic.
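That aggregation step can be sketched in plain Python, with the VADER compound scores assumed precomputed (e.g. via SentimentIntensityAnalyzer().polarity_scores(sentence)['compound']); the function name and example tuples are my own illustration:

```python
def topic_positivity(sentences, dominance_threshold=0.6):
    # sentences: list of (dominant_topic, topic_weight, vader_compound) tuples.
    # A sentence counts toward a topic only if that topic's weight meets the
    # threshold; a compound score above 0 is treated as a positive sentence.
    counts, positives = {}, {}
    for topic, weight, compound in sentences:
        if weight < dominance_threshold:
            continue
        counts[topic] = counts.get(topic, 0) + 1
        if compound > 0:
            positives[topic] = positives.get(topic, 0) + 1
    return {t: 100.0 * positives.get(t, 0) / counts[t] for t in counts}

# Hypothetical sentence-level results: (topic, dominance, VADER compound)
scores = topic_positivity([(1, 0.9, 0.5), (1, 0.8, -0.3),
                           (2, 0.7, 0.4), (2, 0.5, -0.9)])
```

Note that the last sentence is discarded entirely because no topic reaches the 60% dominance threshold.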

Here’s a quick look at the sentiment distribution across the dataset. With such an open-ended question, it would have been easy for respondents to simply complain. But given that 70% of responses carry positive sentiment, we can infer that the sample tends towards constructive criticism of their home city.

In this particular dataset, there was not too much sentiment variance across topics (as visualised below). However, it would be interesting to apply it in a commercial market research context to see the differences in sentiment.

In conclusion, I am confident that this is an efficient and scalable way to analyse survey verbatims.

Some caveats – since this is a ‘quick and dirty’ method, it cannot be expected to fully replace human analysis. Additionally, LDA requires us to choose the number of topics in advance, which can be limiting. There is room for further improvement with more customised NLP depending on the subject matter, as well as better hyperparameter tuning of the LDA model. However, the upside is clear –

Machine learning can help reduce human bias and save hours of analysis time in order to get major themes from your data. This particular method can easily handle large datasets and return actionable results within a matter of seconds.


Get in touch if you’d like to learn more, or if you have any suggestions for improvement. The entire project is up on my GitHub repository (feel free to check out my other projects while you’re there!)

