Why You Should Store Your Survey Data in a Graph Database

Klaus Blass
Towards Data Science
Nov 29, 2020


Image by Author

After having spent many months and possibly millions of dollars collecting data through a survey or census, researchers still face the problem of understanding and cleaning the data. Representing the data as a graph in a graph database is a novel way to tackle this. Here I’ll show the benefits of keeping the data in a graph database, using the example of a population survey from a Health and Demographic Surveillance System (HDSS) in Western Africa.

Let’s assume you have just conducted a survey using one of the major CAPI (Computer Assisted Personal Interview) platforms, such as Survey Solutions, the World Bank’s CAPI system. Now you have the collected data in a dozen (or several dozen) files waiting to be cleaned and analyzed.

Loading them into a database allows you to keep your data in an organized structure where you can

  • inspect them easily
  • perform non-destructive cleaning operations (without throwing original data away)
  • generate specific datasets for analysis with statistical packages
  • achieve Data Integration with other data sources (climate data, census data, etc.)
  • integrate data from future surveys
  • allow user-specific access to the data (demographers may see certain data, economists other data, administrators can update data, etc.)

In relational databases (the ones that usually have “SQL” in their name) data are kept in tables, the layout (schema) and format of which must be precisely defined beforehand. There will be tables of households, members, assets, etc. which reference each other.

In a graph database, such as Neo4j, data are represented as graphs or networks of nodes (households, members, etc.) and edges or relationships between these nodes. There is no need to define their format beforehand; the schema develops as you add new data.
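To make the idea of a schema that grows with the data concrete, here is a minimal sketch in plain Python. It is not Neo4j code: the `PropertyGraph` class, the member name, the direction of the `MEMBER` link and the `CULTIVATES` relationship name are all assumptions made for illustration.

```python
class PropertyGraph:
    """Tiny in-memory property graph: labelled nodes and typed relationships."""

    def __init__(self):
        self.nodes = {}   # node id -> {"label": str, "props": dict}
        self.edges = []   # (start id, relationship type, end id, props)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}

    def add_edge(self, start, rel_type, end, **props):
        self.edges.append((start, rel_type, end, props))

    def schema(self):
        """Derive the schema from the data: which labels are linked by which relationship types."""
        found = set()
        for start, rel_type, end, _ in self.edges:
            found.add((self.nodes[start]["label"], rel_type, self.nodes[end]["label"]))
        return found


g = PropertyGraph()
g.add_node("hh1", "Household", id="40000010AA")
g.add_node("m1", "Member", name="A")          # hypothetical member
g.add_edge("m1", "MEMBER", "hh1")             # direction is an assumption
schema_before = g.schema()

# A new kind of node needs no table definition or migration step
g.add_node("f1", "Field", name="CHAMP DE MIL")
g.add_edge("hh1", "CULTIVATES", "f1")         # hypothetical relationship name
schema_after = g.schema()
```

The schema is simply read off the data that is there, so `schema_after` contains the new Household-to-Field link without anything having been declared up front.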

You can also run graph algorithms on your data, such as Shortest Path, Nearest Neighbors, Community Detection, Link Prediction, etc.

When should you use a graph database?

When the real world described by the data looks like a network of concepts connected by different types of relationships.

When you need to integrate your data with data from other sources and with other formats (other surveys, climate data, remote sensing, etc.).

Here is an example:

A household with its members in a compound in the village of Dara (Screenshot by Author)

This diagram is actually the standard output produced by a simple Neo4j database query:

MATCH (v:Village)--(c:Compound)-[:IN_COMPOUND]-(hh:Household {id: "40000010AA"})-[:MEMBER]-(m:Member)
RETURN v,c,hh,m

Hovering over a node displays all the properties associated with the node (e.g. for a member node: name, birthdate, etc.):

(Screenshot by Author)

Double-clicking a node expands the diagram to show all other nodes connected to it. So double-clicking the household node, I get:

  • Assets & animals
  • Dwelling characteristics
  • Interview details (interviewer, date of interview, etc.)

Nodes connected to a single household node (Screenshot by Author)

By the way, the relationships between the nodes can also have properties! Here the OWNS link connecting the household with an asset or animal stores the quantity owned and the survey round when it was reported. For future rounds we just add additional links and have the complete history of household possessions at our fingertips.
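As a rough illustration of how properties on relationships give you a history, here is a small sketch in plain Python. The assets, quantities and the property names `quantity` and `round` are made-up example values, not data from the survey.

```python
from collections import defaultdict

# Each OWNS relationship: (household id, asset name, properties stored on the relationship).
# All values below are invented for illustration.
owns = [
    ("40000010AA", "goat", {"quantity": 3, "round": 1}),
    ("40000010AA", "bicycle", {"quantity": 1, "round": 1}),
    ("40000010AA", "goat", {"quantity": 5, "round": 2}),
]


def ownership_history(relationships, household_id):
    """Return {asset: [(round, quantity), ...]} for one household, sorted by round."""
    history = defaultdict(list)
    for hh, asset, props in relationships:
        if hh == household_id:
            history[asset].append((props["round"], props["quantity"]))
    return {asset: sorted(entries) for asset, entries in history.items()}


history = ownership_history(owns, "40000010AA")
```

Adding a later round means appending one more relationship per asset; nothing already stored has to change.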

If you want to see all the data you can just switch from the diagram to a table view. And, of course, you can save the data to a CSV or JSON file for further processing by a statistical package.

But the graph view (the diagram) gives you valuable insights into your data. Take a typical family composition and you can actually visualize the family nuclei without having asked any family nucleus questions in the interview:

(Screenshot by Author)

You can see 3 disjoint subgraphs: one comprising most of the family, then a mother with two children, and finally a single member. This is because the survey didn’t collect all the relationships between members, only the relationship of each one with the household head and, in a children-under-five section, the mother and father of the children. (We didn’t create links for uncle, grandchild, son-in-law, etc. of the household head.)

The household head and his wife are the only persons connected with a “SPOUSE” link (it wasn’t asked for the others).

We can clearly identify their four children (B, D, K, and J), three of whom have children themselves, thus constituting their own family nuclei. About the fourth child (J) we don’t know whether he has a spouse or (older) children of his own in the household, because we didn’t ask in the interview. He may very well be the father in the isolated 3-member nucleus.

Because we integrated relationship information from various sections in the survey we get this full picture with a simple query. Try this with a relational database.
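The disjoint subgraphs described above are what graph people call connected components. Here is a minimal sketch of finding them in plain Python. The member names other than B, D, K and J, and the exact set of links, are assumptions chosen to mirror the family in the screenshot.

```python
from collections import defaultdict


def family_components(members, links):
    """Group members into connected components, ignoring link direction."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, components = set(), []
    for member in members:
        if member in seen:
            continue
        # Depth-first walk from an unvisited member collects one component
        stack, component = [member], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical household: head, wife, four children, three grandchildren,
# a mother with two children, and one member without any recorded link.
members = ["Head", "Wife", "B", "D", "K", "J", "B1", "D1", "K1",
           "Mother", "C1", "C2", "Single"]
links = [
    ("Head", "Wife"),                                             # SPOUSE
    ("B", "Head"), ("D", "Head"), ("K", "Head"), ("J", "Head"),   # child of head
    ("B1", "B"), ("D1", "D"), ("K1", "K"),                        # under-five section
    ("C1", "Mother"), ("C2", "Mother"),
]
components = family_components(members, links)
sizes = sorted(len(c) for c in components)
```

With these example links, the household falls into three groups: the main family, the mother with her two children, and the single member.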

Adding new surveys

Now say you do another survey for a sample of the population, for instance an Agricultural Production Survey (APS).

Without having to modify your database schema you can integrate the new data categories: fields and crops.
The database schema will automatically reflect the new node types and relationships.

(Screenshot by Author)

We can see that the household cultivated 2 crops on 2 fields:

Pearl millet on both fields, corn only on one of them.
So the “CHAMP DE MIL” field is intercropped.
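The reasoning here (a field with more than one crop is intercropped) can be sketched in a few lines of plain Python. The name "FIELD 2" is a placeholder, since only one of the two field names is mentioned in the text.

```python
from collections import defaultdict

# (field name, crop name) pairs, one per field-crop link in the graph.
# "FIELD 2" is a placeholder name for the second field.
plantings = [
    ("CHAMP DE MIL", "pearl millet"),
    ("CHAMP DE MIL", "corn"),
    ("FIELD 2", "pearl millet"),
]


def intercropped_fields(pairs):
    """Return the names of fields that carry more than one distinct crop."""
    crops_by_field = defaultdict(set)
    for field, crop in pairs:
        crops_by_field[field].add(crop)
    return sorted(field for field, crops in crops_by_field.items() if len(crops) > 1)


result = intercropped_fields(plantings)
```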

Clicking on a field node shows its characteristics, such as area and water source:

(Screenshot by Author)

and for crops the quantities harvested, sold, bought and lost due to pests:

(Screenshot by Author)

In yet another survey we might collect health data.
The following graph shows morbidity sections for 4 members, the reported disease(s), their type (chronic or acute) and the person who responded to the questions.

(Screenshot by Author)

Clicking a morbidity node (the small blue ones) also shows properties such as

  • the severity of the illness (impact on daily life)
  • whether the person was afraid to die
  • how many hours/days (during the previous 30 days) the person couldn’t perform his daily activities or attend school

(Screenshot by Author)

Clicking (or hovering over) a disease node displays details like the reported symptoms.

Data Integration with other sources

Example: TRMM Daily Precipitation data from NASA

These data come in a binary format called NetCDF, which has to be unpacked with a program in a language like Matlab, R or Python. You end up with one file for each day, with precipitation recorded in mm at a spatial resolution of 0.25°, that is, on a grid of evenly spaced points roughly 28 km apart near the equator. We can also aggregate them to obtain monthly and yearly data.
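The aggregation step can be sketched as follows, assuming the daily files have already been unpacked into a plain Python dictionary. The data layout, the grid coordinates and the precipitation values are all assumptions for illustration, not the actual TRMM format.

```python
from collections import defaultdict
from datetime import date


def monthly_totals(daily):
    """Sum daily precipitation (mm) per grid point into (year, month) totals.

    `daily` maps a date to {(longitude, latitude): mm}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for day, grid in daily.items():
        for point, mm in grid.items():
            totals[(day.year, day.month)][point] += mm
    return {month: dict(points) for month, points in totals.items()}


# Two neighbouring grid points, 0.25 degrees apart (invented values)
daily = {
    date(2019, 8, 1): {(-3.875, 12.625): 10.0, (-3.625, 12.625): 0.0},
    date(2019, 8, 2): {(-3.875, 12.625): 2.5, (-3.625, 12.625): 4.0},
    date(2019, 9, 1): {(-3.875, 12.625): 1.0, (-3.625, 12.625): 0.0},
}
monthly = monthly_totals(daily)
```

Yearly totals follow the same pattern, keyed on `day.year` alone.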

Now we create a TRMM node and store the data in an array as a property of that node. A user-defined function can then retrieve and interpolate the data for any given location:

klaus.precipitation(c.longitude, c.latitude, trmm.precipitation, 8, 2019)

will retrieve the interpolated precipitation for a location during August of 2019. (User-defined functions are a very powerful feature of Neo4j. They extend the query language and can make use of Java’s full functionality.)
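To give an idea of what such an interpolation might do, here is a sketch in plain Python using bilinear interpolation between the four surrounding grid points. This is an assumption: the text does not say which interpolation method the actual function uses, and the `bilinear` function, its parameters and the grid values are hypothetical.

```python
import math


def bilinear(lon, lat, grid, lon0, lat0, step=0.25):
    """Interpolate a gridded value at (lon, lat).

    grid[row][col] holds the value at longitude lon0 + col * step
    and latitude lat0 + row * step.
    """
    x = (lon - lon0) / step
    y = (lat - lat0) / step
    col, row = math.floor(x), math.floor(y)
    # Clamp so that a point on the upper edge still has a neighbouring cell
    col = min(max(col, 0), len(grid[0]) - 2)
    row = min(max(row, 0), len(grid) - 2)
    fx, fy = x - col, y - row
    bottom = grid[row][col] * (1 - fx) + grid[row][col + 1] * fx
    top = grid[row + 1][col] * (1 - fx) + grid[row + 1][col + 1] * fx
    return bottom * (1 - fy) + top * fy


# Monthly precipitation (mm) at four neighbouring grid points (invented values)
grid = [[100.0, 120.0],
        [140.0, 160.0]]
value = bilinear(-3.75, 12.75, grid, lon0=-3.875, lat0=12.625)
```

The example location sits exactly in the middle of the four grid points, so the result is their average. Points outside the grid are extrapolated from the nearest cell, which a real implementation would probably want to reject instead.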

Conclusion

While you can work with the raw datasets returned by your survey or load them into an xyzSQL database, I think the effort spent on building a graph database will be amply rewarded when you start analysing, maintaining and integrating your data with other data sources, not to mention the ease of inspecting the data.

You have millions of data points in your survey? Don’t worry, Neo4j can handle billions of nodes. By the way, there is a free community version and an online sandbox where you can start to try things out right away.

At the recent NODES 2020 conference I gave a short presentation about technical details for loading A Survey As A Graph.

Survey and IT Consultant at the Development Data Group at the World Bank, assisting National Statistical Offices, NGOs and universities worldwide with surveys.