A Data Scientific Method

How to take a pragmatic and goal-driven approach to data science

Peter Turner
Towards Data Science

--

The main aim of data science is simple: to extract value from data. That value takes different forms in different contexts, but it usually comes in the form of better decisions.

Data Science Should Ensure that Data Results in Better Decisions (CREDIT: Author on Canva)

As we venture further into the 21st century, the role that data plays in the decisions we make is growing ever larger. This is because of the sheer increase in the volume of available data, as well as improvements in the tools we can use to store, process, and analyze it.

However, in the grander scheme of things, the field of data science is still in its infancy. It has emerged at the intersection of several other disciplines: statistics, business intelligence, and computer science, to name a few. As those fields continue to evolve rapidly, so too does data science.

Therefore, it is important to formulate a clear approach to making better decisions from data, so that it can be applied methodically to new problems. Sure, the process may start out as ‘firing shots into the dark’, but you at least want your shots to become more accurate over time; you want your process to improve with each project.

At Gaussian Engineering we have done a number of projects with the express aim of extracting value from data. This post documents some of what we have learned and formulates a process for doing data science. One inspiration for our approach is the tried-and-tested scientific method…

The Scientific Method

The scientific method is a procedure that has characterized the field of natural science since the 17th century; it consists of a series of systematic steps which ultimately aim to either validate or reject a statement (a hypothesis).

The Phases of the Scientific Method (CREDIT: Author on Canva)

The steps go something like this:

1. Observe → Make an observation

2. Question → Ask questions about the observation, gather information

3. Hypothesize → Form a hypothesis, a statement that attempts to explain the observation, and make some predictions based on it

4. Test → Test the hypothesis (and predictions) using a reproducible experiment

5. Conclude → Analyze the results and draw conclusions, thereby accepting or rejecting the hypothesis

6. Redo → Reproduce the experiment enough times to ensure there is no inconsistency between observations/results and theory

As an example, imagine that you have just gotten home from school or work; you turn on your bedroom light, and nothing happens! How could we use the Scientific Method to determine the problem?

The Scientific Method Applied to Everyday Life (CREDIT: Author on Canva)

Science is a methodology for increasing understanding, and the scientific method can be seen as an iterative way to standardize the process of conducting experiments, so that experiments produce more valuable, reliable results and, therefore, better understanding.

In a similar manner, we would like a standardized methodology for data science; that is, a method that prioritizes obtaining information relevant to the goal of the analysis.

“If it disagrees with experiment, it’s wrong. In that simple statement is the key to science.” — Richard P. Feynman

The Data Scientific Method

The Gaussian Data Scientific Method (CREDIT: Author on Canva)

At our organization, Gaussian Engineering, we have arrived at a method that we feel works well for our projects. Like the scientific method, it is made up of six stages:

1. Identify

2. Understand

3. Process

4. Analyze

5. Conclude

6. Communicate

These stages are explained in more detail below, along with some of the tools/methodologies we use during each one (our team programs in Python and uses various open-source tooling, so excuse my bias in this area).

Identify

The “identify” stage is concerned with the formulation of the goal of the data science project; it could also be called the “planning” stage.

We find it immeasurably helpful to get a very clear sense of what we are trying to achieve through analyzing the dataset in question. To borrow a term from the PAS 55 Physical Asset Management Standard, we try to ensure that our team has a ‘clear line of sight’ to the overall objectives of the project.

During this stage, we ask questions like:

  • What decisions need to be made from this data?
  • What questions do we wish to answer?
  • What level of confidence would we be happy with in the answers?
  • Can we formulate hypotheses relating to these questions? What are they?
  • How much time do we have for the exploration?
  • What decisions would the stakeholder like to make from this data?
  • What would the ideal result look like?
  • How are we to export and present the final results?

Some useful tools/methodologies for the ‘identify’ stage:

  • Workshops/brainstorming sessions
  • A designated space to keep related documents and findings together (SharePoint site, Dropbox folder, etc.)
  • Slack or Microsoft Teams (digital platforms for collaboration)
  • Trello or Asana (applications to help with Project Management)

Understand

The "understand" stage is all about getting a general feel for the data itself.

Before you start losing yourself in the details, diving into the various data sources, filtering on various fields, and walking the fine line between ‘value-added work’ and ‘analysis paralysis’, it is useful to ensure that your team has a bigger-picture understanding of what is there. We do this after having spent a good amount of time establishing a line of sight to our project goals, so as to keep them fresh in our minds during this next phase.

During this stage, we ask questions like:

  • What is the size of the data?
  • How many files are there?
  • To what extent does the data originate from different sources?
  • Automated exports or manual spreadsheets?
  • Does the data have consistent formats (dates, locations etc.)?
  • What is the overall data quality, in terms of the six dimensions of data quality?
  • What is the level of cleaning required?
  • What do the various fields mean?
  • Are there areas in which bias could be an issue?

Our Take on the Six Dimensions of Data Quality (CREDIT: Author on Canva)

Understanding aspects of your data, such as its overall size, can help you decide how to go about your analysis. For smaller data you may wish to do all of your analysis in memory, using tools like Python, Jupyter, and Pandas, or R; for larger data you may be better off moving it into an indexed SQL database (for larger data still, Hadoop and/or Apache Spark become options).
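
As a rough illustration of this first look, a minimal Pandas sketch (using made-up stand-in data rather than a real export) might be:

```python
import numpy as np
import pandas as pd

# Stand-in for data loaded from a client export, e.g. df = pd.read_csv("export.csv")
df = pd.DataFrame({
    "recorded_at": ["2021-03-01", "2021-03-02", None],
    "site": ["A", "B", "B"],
    "reading": [12.1, np.nan, 9.7],
})

print(df.shape)            # number of rows and columns
print(df.dtypes)           # field types; dates often arrive as plain strings
print(df.isnull().mean())  # fraction of missing values per field
print(df.describe())       # summary statistics for numeric fields
```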

What is also particularly fun about this stage is that, if you have a clear line of sight to your goal, then as you gain a better understanding of the data you can determine which aspects of it are most important for the analysis; these are the areas to which most of your effort can be directed first. This is especially helpful in projects with strict time constraints.

Some useful tools/methodologies for the ‘understand’ stage:

  • Workshops/brainstorming sessions
  • Python
  • Jupyter Notebook (allows for sharing of documents containing live code, equations, and visualizations)
  • Numpy and Pandas (Python libraries)
  • Matplotlib and Seaborn (Python visualization libraries that can help with the viewing of missing data)
  • R (a programming language geared towards statistics)

Using Python to Visualize Missing Data with a Heatmap (Yellow is indicative of Missing Data) (CREDIT: Author on Jupyter Notebook)
Code Snippet for Heatmap (‘df’ stands for ‘DataFrame’, a Pandas data structure)

(The above heatmap was generated with random data)
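
Since the snippet above is embedded as an image, here is a minimal sketch of the same idea, assuming nothing more than a randomly generated DataFrame with some values removed:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Build a random DataFrame and punch holes in it to simulate missing data
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((50, 8)), columns=list("ABCDEFGH"))
df = df.mask(rng.random(df.shape) < 0.1)  # roughly 10% of values become NaN

# Yellow cells mark missing values; purple cells mark present values
sns.heatmap(df.isnull(), cbar=False, cmap="viridis")
plt.title("Missing data per column")
plt.show()
```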

Process

The ‘process’ stage is all about getting your data into a state that is ready for analysis.

The words ‘cleaning’, ‘wrangling’ and ‘munging’ come to mind.

A useful principle to bear in mind here is the Pareto Principle, or ‘80/20 rule’:

“for many events, roughly 80% of the effects come from 20% of the causes” — Vilfredo Pareto

The Pareto Principle or 80/20 Rule (CREDIT: Author on Canva)

The ‘process’ stage can often take up the most time. In light of the Pareto Principle, it is important to prioritize which aspects of the data you devote the most time to; focus on what you think is most important first, and come back to secondary fields only if necessary and if there is time to do so.

During this stage, we may do any or all of the following:

  • Combine all data into a single, indexed database (we use PostgreSQL)
  • Identify and remove data that is of no relevance to the defined project goal
  • Identify and remove duplicates
  • Ensure that important data is consistent in terms of format (dates, times, locations)
  • Drop data that is clearly not in line with reality (outliers that are unlikely to be real data)
  • Fix structural errors (typos, inconsistent capitalization)
  • Handle missing data (NaNs and nulls — either by dropping or interpolation, depending on the scenario)

The purpose of this stage is really to make your life easier during the analysis stage; processing data usually takes a long time and can be relatively tedious work, but the results are well worth the effort.
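
To make a few of these steps concrete, here is a small Pandas sketch using made-up data and hypothetical column names; it is an illustration, not our exact pipeline:

```python
import numpy as np
import pandas as pd

# A hypothetical raw extract with the kinds of problems listed above
df = pd.DataFrame({
    "date": ["2021-01-05", "2021-01-06", "not a date", "2021-01-07", "2021-01-07"],
    "city": [" cape town", "CAPE TOWN", "Durban ", "durban", "durban"],
    "temperature": [21.5, np.nan, 23.0, 19.0, 19.0],
})

df = df.drop_duplicates()                                  # remove exact duplicates
df["date"] = pd.to_datetime(df["date"], errors="coerce")   # unparseable dates become NaT
df["city"] = df["city"].str.strip().str.title()            # fix capitalization and stray whitespace
df = df.dropna(subset=["date"])                            # drop rows missing a critical field
df["temperature"] = df["temperature"].interpolate()        # fill numeric gaps by interpolation
print(df)
```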

Some useful tools/methodologies for the ‘process’ stage:

Analyze

This stage is concerned with the actual analysis of the data; it is the process of inspecting, exploring, and modeling data to find patterns and relationships that were previously unknown.

In the data value chain, this stage (along with the previous stage) is where the most significant value is added to the data itself. It is the transformative stage that changes the data into (potentially) usable information.

In this stage you may want to visualize your data quickly, attempting to identify specific relationships between different fields. You may want to explore how fields vary by location, or over time.

Ideally, in the identify stage, you would have come up with several questions relating to what you would like to get out of this data, and perhaps have even stated several hypotheses — this is then the stage where you implement models to confirm or reject these hypotheses.

During this stage, we may do any or all of the following:

  • If there is time-based data, explore whether there exist trends in certain fields over time — usually using a time-based visualization software such as Superset or Grafana
  • If there is location-based data, explore the relationships of certain fields by area — usually using mapping software such as Leaflet JS, and spatial querying (we use PostgreSQL with PostGIS)
  • Explore correlations (r values) between different fields
  • Classify text using natural language processing methods (such as the bag of words model)
  • Implement various machine learning techniques in order to identify trends between multiple variables/fields — regression analyses can be useful
  • If there are many variables/fields, dimensionality reduction techniques (like Principal Component Analysis) can be used to reduce these to a smaller subset of variables that retain most of the information (see the sketch after this list)
  • Deep learning and neural networks have much potential, especially for much larger, structured datasets (though we have not yet made substantial use of this)
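
As a rough sketch of two of these steps (correlations and PCA), using scikit-learn and randomly generated stand-in data:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for a cleaned, numeric-only DataFrame from the 'process' stage
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 6)), columns=[f"field_{i}" for i in range(6)])

# Pairwise correlations (r values) between fields
print(df.corr().round(2))

# PCA: reduce six fields to three components that retain most of the variance
scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=3)
components = pca.fit_transform(scaled)
print(pca.explained_variance_ratio_)
```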

The ‘analyze’ stage is really where the rubber meets the road; it also shows off the sexier side of data science.

Visualizing the Distribution of Two Variables Using Seaborn’s Jointplot (CREDIT: Author on Jupyter Notebook)
Code Snippet for Jointplot
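
As with the heatmap earlier, the snippet itself is an image; a minimal equivalent using Seaborn’s jointplot on randomly generated data could look like this:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Two loosely related random variables to stand in for real fields
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.8, size=500)
df = pd.DataFrame({"x": x, "y": y})

# Joint distribution with marginal histograms
sns.jointplot(data=df, x="x", y="y", kind="scatter")
plt.show()
```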

Some useful tools/methodologies for the ‘analyze’ stage:

(Note that we are leaving the visualization tools for the last section)

  • MySQL, SQLite or PostgreSQL (for querying, including spatial querying; for SQLite, see SpatiaLite)
  • JetBrains DataGrip (database IDE) and PyCharm (Python IDE)
  • Datasette (a tool for exploring and publishing data)
  • Jupyter Notebook (allows for sharing of documents containing live code, equations, and visualizations)
  • SciPy (Python library for advanced calculations)
  • NumPy & Pandas (Python data analyses/manipulation libraries)
  • Scikit-Learn (Python machine learning library)
  • TensorFlow (Python machine learning library generally used for deep learning and neural networks)
  • Keras (Python library for fast experimentation with neural networks)

Conclude

This stage is concerned with drawing solid, valuable conclusions from the results of the analyses phase. This is the phase in which you can formulate clear answers to your questions; it is the phase in which you can either prove or disprove your hypotheses. It is also the stage in which you can use your conclusions, to generate actionable items to aid in the pursuit of the goal (if appropriate).

We usually aim to create a list of conclusions or ‘findings’ that have come out of the analyses and a subsequent list of recommended actions based on these findings. The actions should be listed with your target audience in mind: they want to know succinctly what was found and what they can do with/about it.

In this phase we may do any or all of the following:

  • Cross-check findings with the original questions (‘identify’ phase) and determine which we have answered
  • Reject or accept the various hypotheses from the ‘identify’ phase
  • Prioritize conclusions/findings: which are most important to communicate to stakeholders, and which are of most significance?
  • Attempt to weave conclusions together into some form of story
  • Identify follow-up questions
  • Identify high-priority areas in which action will yield the most valuable results (Pareto Principle)
  • Develop recommendations and/or actions based on conclusions (especially in high-priority areas)

Some useful tools/methodologies for the ‘conclude’ stage:

Communicate

Arguably the most important step in the Data Scientific Method is the ‘communicate’ phase; this is the phase in which you ensure that your client/audience/stakeholders understand the conclusions that you have drawn from their data.

They should also be presented with these conclusions in a way that allows them to act; if you do not recommend specific actions, present the conclusions so that they stimulate ideas for action.

This is the phase in which you package your findings and conclusions in beautiful, easy-to-understand visualizations, presentations, reports and/or applications.

A Geographic Visualization Using Apache Superset (CREDIT: Apache Superset)

In this phase we may do any or all of the following:

  • If there is time-based data, create sexy time-series visualizations using packages like Grafana or Superset
  • If there is spatial data, create sexy map visualizations using packages like Leaflet JS, Plotly or Superset
  • Create statistical plots using D3.js, Matplotlib or Seaborn
  • Embed various visualizations into dashboards, and ensure these are shareable/portable (whether hosted or built as an application) — Superset is a great way to do this within an organization
  • Develop interactive visualizations using D3.js or Plotly (see the sketch after this list)
  • Develop interactive applications or SPAs (Single Page Applications) using web technologies such as Angular, Vue.js or React (or just vanilla JavaScript!), and link these up to the data using libraries such as Psycopg2 for PostgreSQL
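
As a small example of the interactive-visualization item above, a minimal sketch using Plotly Express with made-up monthly counts might look like this:

```python
import numpy as np
import pandas as pd
import plotly.express as px

# Hypothetical monthly counts; in practice these would come from the analysis stage
df = pd.DataFrame({
    "month": pd.date_range("2020-01-01", periods=12, freq="MS"),
    "count": np.random.default_rng(2).integers(50, 150, size=12),
})

# An interactive time-series chart that can be exported as a standalone HTML file
fig = px.line(df, x="month", y="count", title="Monthly record count")
fig.write_html("monthly_count.html")
fig.show()
```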

Some useful tools/methodologies for the ‘communicate’ stage:

  • Grafana (for time-series)
  • Apache Superset (exploration and visualization platform; allows for the creation of shareable dashboards; great for a variety of data sources, including SQL databases)
  • Matplotlib, Seaborn, and Bokeh (Python visualization libraries — Seaborn is more for statistical visualization, and is built on top of Matplotlib)
  • D3.js (A JavaScript library that directly links HTML to data, allowing for beautiful, interactive, and highly customizable in-browser visualizations)
  • Leaflet.js (A JavaScript library for creating interactive maps)
  • Plotly, Altair, and Pygal (Python libraries for interactive visualizations)
  • Jinja 2 (Python, HTML templating library — similar to Django templates)
  • Psycopg2 (PostgreSQL driver to facilitate database connections through Python)
  • Angular, Vue.js and React (SPA libraries/JavaScript frameworks)
  • Microsoft Office (Excel, Word, and PowerPoint) — for reporting

Information Should Result In Action

Now, it is all very well to go through the process as stated thus far; after all, it should result in some sound information. Really, though, to realize any benefits of this data, something should be done with the information obtained from it!

Like the Scientific Method, ours is an iterative process, which should incorporate action…

So, to modify our diagram slightly:

The Data Scientific Method with Feedback and Action Loop (CREDIT: Author on Canva)

Oftentimes we may also go through the six stages, and there won’t be time for action before we iterate once more. We may communicate findings that immediately incite further questions — and we may then dive right into another cycle. Over the long term, though, action is essential for making the entire exercise a valuable one.

In our organization, each new data science project consists of several of these cycles. Communication of results often sparks new discussions and opens up new questions and avenues for exploration; if our conclusions result in actions that yield favorable results, we know we are doing something right.

“Without data, you’re just another person with an opinion.” — W. Edwards Deming

For the original version of this article, click here. Thanks to Jaco Du Plessis for putting together the original steps for the Data Scientific Method (View on his GitHub account)
