Automating survey data analysis with open source software

Part one: The data processing workflow

Geir Freysson
Towards Data Science

--

You’ve confessed to the friendly voice on the phone that you’re voting for the Monster Raving Loony Party; you clicked on Pepsi rather than Coke in that pop-up online quiz; you said you were going to highly recommend the car dealership (but then didn’t) in the follow-up customer satisfaction survey. When people all over the world express their feelings and opinions to researchers in fields such as politics, medicine, branding and human resources, where does the data end up? How is it analysed?

Market research is still waiting for the open source revolution

In mainstream data science, open source software runs the world: if you’re serious, you’re probably, at least in part, using Python or R. Data scientists at Netflix have the company’s data at their fingertips in Jupyter Notebooks. Airbnb open-sourced the workflow automation software it built to automatically estimate the lifetime value of a listed home as soon as it is registered.

In market research, the plethora of tools that makes data-science practitioners so powerful is still largely absent. Market researchers are in danger of becoming the Luddites of the data world, reliant on proprietary software such as SPSS, Merlin and Quantum.

Jupyter Notebooks

A Jupyter Notebook showing the results of an opinion poll in both table and chart form using the open source Quantipy library.

This post is the first in a series illustrating how open source software can massively increase the productivity, quality and security of analysing survey research data, using tools such as Jupyter Notebooks and various other open source libraries.

Jupyter Notebooks (hereafter referred to simply as notebooks) is a project started in 2014 to create a consistent set of open source tools for scientific research, reproducible workflows, data narratives and analysis. It is an easy-to-use, web-based platform, and it so happens that the technology is also perfect for survey data.

We aim to encourage the industry to hold onto its collective clogs and embrace the coming data revolution.

Automating the data processing workflow

We presume that we have already collected the data. For the purposes of these articles, data processing includes tasks such as cleaning, recoding, merging and weighting of data. Using notebooks, we can arrange these tasks into a production line or workflow. Much of the grunt work is automated, but the process is adaptable: when any part of the workflow changes, it should be easy to run all the other parts, in a defined sequence, with a click of a button.

We will be using the open source library Quantipy (the Python 3 port is here) and start by reading the data into a Quantipy DataSet (the data was previously converted from UNICOM mdd/ddf files, but more on that in a later post).

import quantipy as qp

# "filename" holds the base name of the converted data files
dataset = qp.DataSet("opinion poll")
dataset.read_quantipy(
    path_meta="../data/interim/{}.json".format(filename),
    path_data="../data/interim/{}.csv".format(filename)
)
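Quantipy keeps two files per dataset: a JSON file holding the meta (variable definitions, labels, answer codes) and a CSV file holding the respondent records. For a quick look at the raw files before loading them, here is a minimal sketch using the standard json library and pandas (the "columns" key is an assumption based on the usual Quantipy meta layout, and filename is the same placeholder as above):

import json
import pandas as pd

# peek at the converted files directly
with open("../data/interim/{}.json".format(filename)) as f:
    meta = json.load(f)
records = pd.read_csv("../data/interim/{}.csv".format(filename))

print(list(meta.get("columns", {}))[:10])  # variable definitions live in the meta
print(records.head())                      # one row per respondent in the case data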

We will create three notebooks, each performing one of three tasks: cleaning, recoding and weighting. Together they form a pipeline. Before the pipeline is run, the data is converted (with a notebook) so that the process always starts with the original data. In a later pipeline, we will be able to replace the data conversion notebook with one that reads directly from a database. This will allow us to provide clients with live updates.

The data processing workflow is run across multiple Jupyter Notebooks. This will allow us to run the entire workflow with the papermill library when anything changes and share specific parts of the process with other projects.

In a later post, we will cover the automatic creation of the above files, but for now, we’ll create them by hand.

Now that we have the dataset (Quantipy uses two data structures: the meta, which describes the data, and the data itself, which holds the actual records), we can write the notebooks the pipeline will be made up of. We start with a simple cleaning step.

# file: 1_clean.ipynb
dataset = dataset.filter('completed', {'SurveyCompletionStatus': 6})

The cleaning is simple in this case, but this is also the place where a user could, for example, use machine learning to filter out fraudulent answers.
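As an illustration, here is a hypothetical sketch using scikit-learn's IsolationForest on made-up behavioural signals; the column names and values are purely illustrative and would come from your own fieldwork paradata, not the survey above.

import pandas as pd
from sklearn.ensemble import IsolationForest

# hypothetical paradata: how long each respondent took and how often they
# picked the same answer across grid questions ("straight-lining")
paradata = pd.DataFrame({
    "seconds_to_complete": [620, 540, 35, 700, 28],
    "straight_lined_grids": [0, 1, 7, 0, 8],
})

# IsolationForest marks outliers with -1; treat those respondents as suspect
model = IsolationForest(contamination=0.2, random_state=0)
paradata["suspect"] = model.fit_predict(paradata) == -1
print(paradata[paradata["suspect"]])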

Next, we recode the data. We want to weight the regions according to England, Scotland, Northern Ireland and Wales, but currently the variable has too many region codes to apply that weighting scheme, as the metadata shows.
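One way to see those codes is to ask the DataSet for the variable's metadata; the call below follows Quantipy's DataSet.meta helper, though the exact output depends on your version:

# list the answer codes and labels currently attached to the Regions variable
print(dataset.meta('Regions'))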

So, we want to derive a new variable by grouping the England answers but keeping the others.

# file: 2_recode.ipynb
# mapper arguments are: new code, new label, {variable: old codes}
# frange is a Quantipy helper that expands a range string into a list of codes
mapper = [
    (1, 'England', {'Regions': frange('3-5, 7-12')}),
    (2, 'Scotland', {'Regions': [2]}),
    (3, 'Wales', {'Regions': [6]}),
    (4, 'Northern Ireland', {'Regions': [1]}),
    (5, 'Other', {'Regions': [0]})
]
# arguments: variable name, type (single response), label, mapper
dataset.derive('Regions_recoded', 'single', dataset.text('Regions'), mapper)

We are also going to weight the dataset by age, so that we can correctly represent the voting intentions of different age groups. Recoding a single-response variable like region is different from recoding a number, so to recode the age variable we use the DataSet's band method.

# file: 2_recode.ipynb
age_bands = [
    {"Younger than 18": (0, 17)},
    (18, 24),
    (25, 34),
    (35, 44),
    (45, 54),
    (55, 64),
    {"65+": (65, 200)}
]
dataset.band(name='Age', bands=age_bands,
             new_name='Age_recoded', label="Age groups")

We are now ready to apply the weights. We have a weighting scheme to make sure our results represent the correct gender, age and region distribution.

# file: 3_weight.ipynb
# target distributions (in percent) for gender, age group and region
age_and_gender_targets = {
    'Gender': {
        1: 49,
        2: 51
    },
    'Age_recoded': {
        1: 21.3,
        2: 9.4,
        3: 13.4,
        4: 14,
        5: 13.7,
        6: 11.7,
        7: 16.5
    },
    'Regions_recoded': {
        1: 84.2,
        2: 8.2,
        3: 4.7,
        4: 2.8,
        5: 0.1
    }
}
scheme = qp.Rim('gender, age, region')
scheme.set_targets(targets=age_and_gender_targets)
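To actually attach the weights, the scheme has to be run against the dataset. Here is a minimal sketch of that step, assuming Quantipy's DataSet.weight method accepts a Rim scheme and can print a report; the weight_name and unique_key arguments are assumptions, so adjust them to your data.

# file: 3_weight.ipynb
# apply the rim weighting scheme and print the weighting report
dataset.weight(scheme,
               weight_name='weight',    # assumed name of the new weight column
               unique_key='identity',   # assumed respondent id column
               report=True)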

The weighting report shows us that the weighting is sensible (someone has done a good job of getting a representative sample): no response is weighted up by more than 10% (the maximum weight factor).

Finally, we store our pipeline in a directory called data_processing (more on our folder structure in a future post), which means that whenever something changes we can run the notebooks in sequence to bring the data up to date. We provide users with a notebook that calls a script that runs all the data processing notebooks, shown below.

The run_data_processing script is a for loop that takes every notebook in the data_processing directory and runs it with the Python library papermill.
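Here is a minimal sketch of such a script, assuming the notebooks live in a data_processing folder and sort into the right order by their numeric prefix (the output paths are illustrative):

# run_data_processing.py: execute every notebook in the pipeline, in order
import glob
import papermill as pm

for notebook in sorted(glob.glob("data_processing/*.ipynb")):
    # papermill writes an executed copy of each notebook, so the originals stay clean
    output = notebook.replace(".ipynb", ".executed.ipynb")
    pm.execute_notebook(notebook, output)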

Many organisations have vastly more complicated recoding and weighting schemes and open source software can handle all of it — we’ve just kept it simple here for brevity.

In upcoming posts, we’ll cover how the above can be part of a standard, “cookie cutter” folder structure; how the automatic generation of PowerPoint files, tables and online dashboards can be integrated into the process; and how graphical user interfaces can be added to make it accessible to users who need quick access to results but don’t have time to write code.

Geir Freysson is the co-founder of Datasmoothie, a platform that specialises in survey data analysis and visualisation.

--


Co-founder of Datasmoothie. I also maintain the open source survey data library Quantipy and its enterprise equivalent, Tally.