Analysing survey data with Python and Jupyter Notebooks

The Notebook environment is perfect for the ad-hoc nature of working with survey data

Published in Towards Data Science, Oct 8, 2019

Surveys are the amoeba of the data world. Not because they eat your brain (not literally, anyway) but because they are ever changing in their shape and structure.

Even the surveys that are meant to stay the same – the studies conducted at regular intervals to track our feelings towards everything from politicians to toilet brushes – are in constant flux. Clients, in their lovable creativity, discard questions, invent new ones and come up with endless new aggregations to perform on the results.

In this article we will show why Jupyter Notebooks, an interactive scripting and storytelling environment, combined with the Python library Quantipy, are the perfect toolset to tame the brain-eaters and bend them to our will.

A screenshot of a Jupyter Lab setup optimised for analysing survey data with the Python open source library Quantipy. The window on the right shows a list of all the variables in the data for quick lookup.

Importing data from SPSS (or CSVs, or ddfs)

Survey data is often stored as data on the one hand and metadata on the other. In SPSS, users can view the data (on the left in the image below) or the variable information, the metadata (on the right of the image).

SPSS splits the data and the variable information (the metadata) into two separate views. The open source Python library Quantipy can read SPSS and other survey-specific file types and remove the pain of mapping the numeric results in the data to human-readable labels.

Out of the box, Python is not great at handling the richness of survey metadata; it is mostly used with transactional data, which is much simpler in structure. This means that after we have run a calculation using standard Python libraries, we have to map variable labels from the metadata onto our results (making it clear that the percentage figure against ‘1’ in our output actually refers to ‘males’). This gets even trickier when dealing with proprietary data formats (such as Unicom Intelligence’s ddf/mdd files) that Python has no native understanding of.
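
To make the mapping step concrete, here is a minimal sketch using plain pandas. The responses and the `gender_labels` dictionary are made up for illustration; in a real project the labels would first have to be extracted from the SPSS metadata.

```python
import pandas as pd

# Hypothetical responses, stored as numeric codes the way SPSS keeps them
responses = pd.DataFrame({"gender": [1, 2, 2, 1, 2, 2, 2, 1]})

# Hypothetical value labels that would normally come from the metadata
gender_labels = {1: "Male", 2: "Female"}

# Share of each code among respondents, in percent
shares = responses["gender"].value_counts(normalize=True).sort_index() * 100

# Without this step the output rows are labelled 1 and 2
labelled = shares.rename(index=gender_labels)
print(labelled)
```

Every result table needs this extra renaming step, and it has to be repeated for every variable in the survey.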

Quantipy, however, provides ready-made readers for SPSS and Unicom Intelligence (previously called Dimensions) files, and converts the data and metadata to a pandas DataFrame and a metadata schema of its own.

Quick exploration of the results

We start by naming and creating a dataset (a dataset is just the combination of data and metadata):

import quantipy as qp
dataset = qp.DataSet("Customer satisfaction wave 1")
dataset.read_spss("customer_satisfaction.sav")

Quantipy stores the responses in a pandas DataFrame (dataset._data) and the metadata in a Python dictionary (dataset._meta). We can look up the metadata for a specific variable under the columns key of that dictionary.

dataset._data.head()
dataset._meta['columns']['gender']

The data part looks the same as it does in SPSS, but the JSON metadata schema is unique to Quantipy. It supports multiple languages and variables (columns) of various types: single choice, delimited sets and grids.
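
As a rough illustration of what such a metadata entry might contain, the dictionary below is hand-written to resemble a single-choice variable. The exact keys, the `en-GB` language code and the labels are assumptions rather than output from a real file, and the helper `value_labels` is hypothetical.

```python
# Hand-written stand-in for dataset._meta['columns']['gender'] (assumed shape)
gender_meta = {
    "name": "gender",
    "type": "single",
    "text": {"en-GB": "What is your gender?"},
    "values": [
        {"value": 0, "text": {"en-GB": "Male"}},
        {"value": 1, "text": {"en-GB": "Female"}},
    ],
}


def value_labels(column_meta, language="en-GB"):
    """Return a {code: label} mapping for one variable in one language."""
    return {v["value"]: v["text"][language] for v in column_meta["values"]}


labels = value_labels(gender_meta)
print(labels)
```

Keeping the label text keyed by language is what would let the same data be reported in more than one language.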

Quantipy provides a convenient shorthand for looking at the metadata:

dataset.meta('gender')

So, let’s look at how many men and women took our customer satisfaction survey.

# the pct parameter decides whether we show percentages or counts
dataset.crosstab("gender", pct=True)

Wait, we know that only 59% of our customers are women, but they are 64.1% of the respondents. We need to weight our responses so that our results are truly representative of our client base.

Weighting answers to correctly represent the population

We’ll use Quantipy’s implementation of the RIM weighting algorithm (it has other weighting options), which allows us to weight using an arbitrary number of target variables. In this case, we’re only using one: gender.
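
RIM weighting is also known as raking or iterative proportional fitting. The sketch below shows the general idea in plain pandas for two target variables. It is not Quantipy's implementation: the function `rake`, the sample data and the `age_group` targets are all invented for illustration.

```python
import pandas as pd


def rake(df, targets, max_iter=100, tol=1e-6):
    """Rake weights so each column's weighted shares match its targets.

    targets: {column: {code: proportion}}, proportions summing to 1.
    """
    weights = pd.Series(1.0, index=df.index)
    for _ in range(max_iter):
        max_shift = 0.0
        for column, target in targets.items():
            # Current weighted share of each category in this column
            current = weights.groupby(df[column]).sum() / weights.sum()
            # Scale every respondent by target share / current share
            factors = pd.Series(target) / current
            max_shift = max(max_shift, (factors - 1).abs().max())
            weights = weights * df[column].map(factors)
        if max_shift < tol:
            break
    # Rescale so that the average weight is 1
    return weights * len(df) / weights.sum()


# Made-up sample: 64 women (1) and 36 men (0), with an invented age split
sample = pd.DataFrame({
    "gender": [1] * 64 + [0] * 36,
    "age_group": [1 if i % 3 == 0 else 0 for i in range(100)],
})
targets = {
    "gender": {0: 0.41, 1: 0.59},
    "age_group": {0: 0.60, 1: 0.40},
}

weights = rake(sample, targets)
gender_share = weights.groupby(sample["gender"]).sum() / weights.sum()
age_share = weights.groupby(sample["age_group"]).sum() / weights.sum()
print(gender_share.round(3))
print(age_share.round(3))
```

Each pass adjusts one variable at a time, which slightly disturbs the others, so the loop repeats until the adjustments become negligible.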

# Quantipy's Rim class implements the RIM weighting algorithm
scheme = qp.Rim('gender weights')
# Men should be 41% and women 59%
scheme.set_targets({'gender': {0: 41, 1: 59}})
dataset.weight(scheme,
               unique_key='RespondentId',
               weight_name="Weight",
               inplace=True)

When Quantipy runs the weights, it generates a weight report, which allows us to double check the weighting isn’t doing anything crazy. The maximum weight here is 1.14, which we are happy with. We now have a new variable in our dataset called ‘Weight’ which we can use in our calculations.
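
The 1.14 figure can be sanity-checked by hand. With a single target variable, the weight for each group is its target share divided by its observed share. The short calculation below uses the percentages quoted earlier and assumes that code 0 is men and code 1 is women, and that there are only these two groups (so men are 35.9% of respondents).

```python
# Observed shares among respondents, in percent (women 64.1, so men 35.9)
observed = {0: 35.9, 1: 64.1}
# Target shares in the customer base, in percent
target = {0: 41.0, 1: 59.0}

# One weighting factor per group: target share / observed share
factors = {code: target[code] / observed[code] for code in target}
for code, factor in factors.items():
    print(code, round(factor, 2))
```

Men are weighted up by a factor of about 1.14 and women down to about 0.92, which agrees with the maximum weight in the report.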

dataset.crosstab('gender', pct=True, w='Weight')

Now we can see that the 41% of our client base who are male aren’t under-represented in the results.

Quick visualisation of our results

The fact that we’re in the Jupyter Lab environment means that we can easily visualise the results. We look at the price variable, which shows how happy customers are with the pricing.

dataset.crosstab('price', 'gender', w='Weight', pct=True)
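
For readers curious about what a weighted, percentage-based crosstab involves, here is a rough pandas equivalent. It is a sketch, not Quantipy's implementation: the data, the `price` codes and the weights are invented, and we assume the percentages are calculated within each column.

```python
import pandas as pd

# Invented respondent-level data: price satisfaction (1-3), gender and a weight
df = pd.DataFrame({
    "price":  [1, 2, 3, 1, 2, 3, 3, 2],
    "gender": [0, 0, 0, 1, 1, 1, 1, 1],
    "Weight": [1.14, 1.14, 1.14, 0.92, 0.92, 0.92, 0.92, 0.92],
})

# Sum the weights in each price x gender cell, then normalise each column
table = pd.crosstab(
    df["price"],
    df["gender"],
    values=df["Weight"],
    aggfunc="sum",
    normalize="columns",
) * 100
print(table.round(1))
```

The only difference from an unweighted table is that each respondent contributes their weight to a cell instead of a count of one.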

Now let’s visualise this to get a better idea of the results. For the sake of brevity, we won’t go into the details, but you can see how easy it is to visualise the data using standard functionality.

import blender_notebook.plot as plots
plots.barplot(dataset.crosstab('price', 
                               'gender', 
                               w='Weight', 
                               pct=True), 
              stacked=False)
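
If the `blender_notebook` helper is not available, a similar grouped bar chart can be drawn with pandas' built-in plotting, which requires matplotlib. This is a hedged alternative rather than the code behind the chart in this article, and the figures and labels below are invented.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import pandas as pd

# Invented column percentages standing in for the crosstab result
table = pd.DataFrame(
    {"Male": [20.0, 45.0, 35.0], "Female": [25.0, 40.0, 35.0]},
    index=["Unhappy", "Neutral", "Happy"],
)

# One group of bars per answer, one bar per gender within each group
ax = table.plot.bar(stacked=False)
ax.set_ylabel("% of respondents")
```

In a notebook the chart renders inline below the cell, so no explicit save or show call is needed.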

Automating our work for future projects

Doing our work in Jupyter Notebooks adds a further benefit: It allows us to automate our work by making our Notebooks reusable. Building on ideas from Netflix’s data science team (see: Beyond Interactive: Notebook Innovation at Netflix), we’ve previously covered how data processing tasks such as cleaning, weighting and recoding can be automated and also how to automate the generation of PowerPoint decks. These are all underpinned by the fact that our work is done in Notebooks.

Jupyter Notebooks are taking the data science world by storm and this time, survey data analysis shouldn’t be left behind. Analysing survey data with Notebooks removes a whole world of pain from both the data processing professional and the researcher and can do wonders for the productivity of both.

Geir Freysson is co-founder of Datasmoothie, a platform that specialises in survey data analysis and visualisation. If you’re interested in using open source software for survey data analysis sign up to our newsletter, called Unprompted Awareness.