The secret to analysing large, complex datasets quickly and productively? Constraint.

Thomas Manandhar-Richardson
Towards Data Science
5 min read · Oct 24, 2019

TL;DR: Sometimes you have so much data that you can waste hours exploring without answering the important questions. I share 5 tips on how to analyse large, complex datasets productively by constraining yourself.

The joys and stresses of too much data. Photo is me!

Researchers these days often have access to a lot of data. You might be analysing the European Social Survey, with hundreds of variables and tens of thousands of respondents. Or perhaps you collected tons of data yourself: you might run a study where participants do a huge battery of tests, leaving you with heaps of data per person.

How do you go about analysing this beautiful mess?

Well, I’ll tell you what not to do:

Dive in head first and start correlating and plotting and building multilevel models but then realise that you might have to control for some variables and you get a result but then it goes away with covariates B and C but not when you also add A and then you find something seemingly robust but you’ve tested like 15 things by now so maybe it's a false positive so you learn about cross validation and what the hell am I even doing here and when did it become Friday?!?

But seriously, sometimes we have more data than we know what to do with. There are worse problems, I’ll admit, but this has been a problem for me during my PhD. I’ve lost entire days digging around in big datasets. Leaving aside the problem of p-hacking (I generally try to validate any result I find by replicating it on a new dataset, and so should you!), this can really eat into your time.
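To put a rough number on the false-positive worry, here is a minimal sketch of the chance of getting at least one false positive when you run several tests. It assumes the tests are independent and all use the same significance level, which real analyses on a single dataset rarely satisfy exactly, so treat it as an order-of-magnitude guide. The function names are my own, not from any library.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n_tests independent
    tests, each run at significance level alpha, when no real effect exists."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the overall false-positive rate at or
    below alpha (the Bonferroni correction, which is conservative)."""
    return alpha / n_tests


if __name__ == "__main__":
    for n in (1, 5, 15):
        rate = familywise_error_rate(n)
        print(f"{n:>2} tests: {rate:.1%} chance of at least one false positive")
```

Under those assumptions, 15 tests at the usual 0.05 level give you roughly a 54% chance of at least one spurious "finding", which is why a result that turns up on test number 15 deserves suspicion.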

You can also get suckered into this when your goal is to “get to know the data inside and out”. We often hear statisticians recommending that we do more exploratory data analysis, plot our data etc. This is valuable, but it’s not necessary to know every corner of the European Social Survey. It’s often not productive to see how a model is influenced by 10 different covariates that are barely related to the dependent variable. You have to draw the line somewhere and say “enough”!

Graph by the author

I’m a big fan of the idea that constraints on our choices can be liberating. If that sounds paradoxical, these great articles [here and here] will explain it. I’ve found that when it comes to data, more options aren’t always better.

Solutions

1. Be question driven
Before looking at your data, decide what your primary questions are. These are the questions you will set out to answer, after which you will stop (at least until you’ve written the paper/chapter). What are the questions that, once answered, could be publishable? Not to be too cynical, but in academia, papers are the name of the game. Papers aren’t usually a collection of interesting observations stapled together. They are focussed, and you should be too.

Do all the analysis required to answer a paper-sized question in a satisfactory manner (i.e. by checking the robustness of the result, ruling out alternative explanations), and then stop. All other questions in the data are unimportant or less important. You can always come back to them later.

2. Discuss the data with colleagues or lab members
Talking them through your data and analyses will clarify both in your mind. The idea of gaining a deeper understanding of something by explaining it to others is often called the Feynman technique, and I’m a big fan. Your colleagues may also be able to advise on what in the data is interesting and what isn’t.

Remember, you’re trying to produce knowledge that is valuable to your field. Therefore, it makes sense to ask a few members of said field if they think your analysis is valuable.

3. Keep detailed notes of what you try
With so many t-tests and regression models flying about, you might end up forgetting you ran some analysis previously and end up running it again! Also, sometimes models don’t yield insight in isolation, but in combination with other models.

For example, you run model A, and predictor A1 shows an effect that you don’t understand. So you see if A1 correlates with a few other things and that might explain it. You build model B and eureka! Now record what you did. Interpret each test before moving onto the next one. Don’t just go running tests willy-nilly without thinking about what they mean. Running tests for the hell of it is a great way to find false positives. If you keep notes and interpret what’s going on as you go, every test truly does teach you something about the data.
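If a notebook or text file feels too loose, the note-keeping habit can be sketched in a few lines of code. This is only an illustration of the idea, not a tool I'm claiming anyone uses: the `AnalysisLog` class, its fields, and the example entry are all hypothetical, and the result text is a placeholder rather than a real finding.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class Entry:
    question: str        # what you were trying to find out
    analysis: str        # a short, consistent description of the test or model
    result: str          # what came out
    interpretation: str  # what you think it means
    day: str = field(default_factory=lambda: date.today().isoformat())


class AnalysisLog:
    def __init__(self) -> None:
        self.entries: list[Entry] = []

    def already_ran(self, analysis: str) -> bool:
        """True if an analysis with exactly this description was logged before."""
        return any(e.analysis == analysis for e in self.entries)

    def record(self, question: str, analysis: str, result: str,
               interpretation: str) -> Entry:
        # Refuse entries with no interpretation: interpret before moving on.
        if not interpretation.strip():
            raise ValueError("Interpret the result before moving on.")
        entry = Entry(question, analysis, result, interpretation)
        self.entries.append(entry)
        return entry

    def to_jsonl(self) -> str:
        """One JSON object per line, easy to append to a file or search later."""
        return "\n".join(json.dumps(asdict(e)) for e in self.entries)


log = AnalysisLog()
log.record(
    question="Why does predictor A1 show an effect in model A?",
    analysis="model A: outcome ~ A1 + A2",
    result="placeholder: A1 effect I did not expect",
    interpretation="Unclear so far; check what else A1 correlates with.",
)
```

Checking `log.already_ran("model A: outcome ~ A1 + A2")` before fitting anything is a cheap way to avoid re-running the same model, and the forced interpretation field is the part that stops the willy-nilly testing.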

4. Set a time limit on analysis
Set yourself an amount of time to do the analysis in. Deadlines, even self-imposed ones, are great for clarifying the mind. Having years to do a PhD is not an excuse to waste time. Any data stuff not done after this time will have to wait for another day. It’s not a big deal if you miss something important, as your supervisor/colleagues/reviewers on the eventual paper will find it. You can even come back to the dataset and take a few more days on it later when you’ve got some more time. This also has the advantage that you return to the data refreshed. Give yourself a time limit. Don’t just wander leisurely through the data.

5. Pre-registration
Constrain what analyses you’ll do in a formal way. Write down what analyses you’ll do before you look at the data. AsPredicted is a good tool to help you with this. Pre-registration is brilliant for many other reasons: give it a Google, read this or ask your local Open Science Methods Police for more information (if you’re at the University of Manchester, that may well be me!).
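For a personal, informal version of the same constraint, you could freeze your plan in code before opening the data. The sketch below is my own illustration and has nothing to do with how AsPredicted works: the plan entries, variable names, and function names are all hypothetical. It fingerprints the plan so later edits are detectable, and labels anything outside the plan as exploratory.

```python
import hashlib

# Hypothetical plan, written down before looking at the data.
PLAN = (
    "H1: outcome ~ predictor_a + age + gender",
    "H2: outcome ~ predictor_b + age + gender",
)


def plan_fingerprint(plan: tuple[str, ...]) -> str:
    """SHA-256 of the plan text; any later change to the plan changes it."""
    text = "\n".join(plan)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def label(analysis: str, plan: tuple[str, ...] = PLAN) -> str:
    """Planned analyses are confirmatory; everything else is exploratory."""
    return "confirmatory" if analysis in plan else "exploratory"


# Store this somewhere dated (an email to yourself, a commit message).
FINGERPRINT = plan_fingerprint(PLAN)

if __name__ == "__main__":
    print(label("H1: outcome ~ predictor_a + age + gender"))
    print(label("outcome ~ predictor_c"))
```

This is no substitute for a real pre-registration, since nobody else can verify it, but it does force you to be honest with yourself about which results you planned to look for and which you stumbled on.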

Conclusion

Data is beautiful, and lots of data is simply sublime, but be wary of the pitfalls. I hope the lessons I’ve learned are useful to you! As with all my blog posts, I’ll update it if other people point out stupid arguments, spelling errors or say things that I think would be valuable for others to read. Cheers!

If you liked this post, follow me on Twitter

Data scientist at https://peak.ai/. Interested in AI explainability, AB testing, causal inference and recommenders