Sidestepping the pitfalls of an organically developing data system

Data trains, a different approach to understanding data systems

Opinion

A simple explanation.

Visualizing Player Pathway, Nick Coleman, Tableau Public

Introduction.

Improving the management of data projects will save you and your organisation time and money. Data systems are important because they allow us to get information from one point to another. In many organisations they are built organically: a problem presents itself and a solution is provided. Over time this develops into a complex system. But this process of step-by-step development can lead to problems that ultimately limit the effectiveness of the system.

There are many reasons why a data project might fall into this trap, but the one that dictates the rest is poor management. This can be caused by a lack of understanding of how data works among the people in management positions. If you think you are one of these people, I’m going to explain data in a way that I hope helps you understand the key pitfalls, giving you the foresight to manage more effectively.

The explanation.

Trains and their infrastructure are viewed as permanent and hard to change. They are all about getting something from one place to another, along set and repeatable routes. Let’s use trains to tell the story of how a data system is developed. We all understand how they work, and data systems are similar in many ways.

Most data projects are started by the need to solve a problem. In this explanation there is you: you are part of an organisation and you are at a given location. This is a long way from another part of the organisation, which is in a different location. The problem: you need to know what’s going on at the other part of your organisation to run the business effectively. This means you need to get information from that location to where you are. This is the purpose of the data system: transporting the data which holds information from one point to another, allowing you to make better decisions.

Image by author

Let’s put this problem into our train metaphor. Imagine our two locations are stations and a train carries the cargo (data) from one to the other. You build a trainline from what we’ll call the collection point to you at what we’ll call the destination station. This allows a train to carry its cargo from the collection point to you at the destination (figure 2). The destination is now a station where all the cargo is supplied to you. You can inspect it, make reports, interpret the data and unlock the information within it.

Image by author

This solves your business problem. However, there are some constraints. Trains don’t run constantly; they have timetables. Data systems are similar: you need to decide how often you want fresh data and set a schedule. It’s also a one-way system: you can transport cargo from the collection point to the destination station but not the other way round. Congratulations, you have built a crude one-dimensional, one-way data system.

What happens when we change the problem definition?

Let us move this on a step. Imagine the train is transporting two types of cargo: apples and pears. These represent different fields (columns) in your data. Each provides a different bit of information to be used by you at the destination station. In your crude system all the loose apples and pears are dumped in separate waggons and delivered to you at the destination station. You can figure out how many apples or pears you have by counting them all individually. This is time-consuming and you don’t like doing it. Anyway, your business is in selling bags of apples and pears, not single ones.

So, you ask an engineer to build a sorting station, located on the trainline before it gets to you at the destination station (figure 3). This puts all the apples and pears in bags. Now, instead of saying you have 17,534 apples you can say you have 342 bags of apples. Well done, you have aggregated your data. This improves efficiency but sets us up for a future problem.
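As a rough illustration of what the sorting station does, the sketch below counts loose fruit by type and reports whole bags plus leftovers. The bag size of 50 and the name `bag_up` are assumptions made for this example; the story does not specify them.

```python
from collections import Counter

BAG_SIZE = 50  # assumed bag capacity, not given in the story


def bag_up(cargo, bag_size=BAG_SIZE):
    """Aggregate loose fruit into (full bags, leftover items) per fruit type."""
    counts = Counter(cargo)  # one count per cargo type
    return {fruit: divmod(n, bag_size) for fruit, n in counts.items()}


# Made-up cargo: 120 loose apples and 75 loose pears.
loose_cargo = ["apple"] * 120 + ["pear"] * 75
print(bag_up(loose_cargo))  # {'apple': (2, 20), 'pear': (1, 25)}
```

Once the cargo is bagged, the individual items are no longer visible at the destination; only the totals are.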

Image by author

The first pitfall. It turns out there are two types of apples. You didn’t know this before, and now all the apples are put into the same bags at the sorting station, which means they reach you all mixed up. This is a problem: you can’t figure out how many apples you have of each type at the destination station. Now you have to ask the engineer to go back and modify the sorting station.

If spotted early, it’s not a big issue, but you are still going back and redoing work already done. This represents a real-world data problem where, at the start of the project, we didn’t fully realise what type of data would be important to solving the problem. Better communication would have informed you that there is more than one type of apple, and you could have made a decision as to whether this would be relevant or not. This also demonstrates how a small change in the definition of the problem can have large ramifications for system design.
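The pitfall can be shown in a few lines. In this sketch the variety names are invented for illustration; the point is only that grouping by fruit alone throws the variety away, while grouping by fruit and variety keeps it.

```python
from collections import Counter

# Each record is (fruit, variety); the variety names are made up.
deliveries = [
    ("apple", "red"), ("apple", "green"), ("apple", "red"),
    ("pear", "conference"),
]

# Original sorting station: groups by fruit only, so variety is discarded.
by_fruit = Counter(fruit for fruit, _ in deliveries)

# Modified sorting station: groups by fruit and variety together.
by_fruit_and_variety = Counter(deliveries)

print(by_fruit["apple"])                         # 3
print(by_fruit_and_variety[("apple", "red")])    # 2
print(by_fruit_and_variety[("apple", "green")])  # 1
```

The first count cannot be split back into the second and third after the fact, which is why the fix has to happen at the sorting station rather than at the destination.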

Database creation.

Image by author

At the destination station, all the cargo is gone at the end of each day, and by the next day you have forgotten what happened the day before. You need to know how many apples and pears you produced each day compared to previous days. To do this you need to keep a record of all the past cargo and add all new cargo to it. This creates your database (figure 4). You need to make a note of the date of each delivery to know what happened on what day. You now also have date as a field in your dataset.
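A minimal sketch of this idea, using Python’s built-in `sqlite3` module: each day’s cargo is appended to a table together with its delivery date, rather than replacing what was there before. The table name, column names and dates are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    "CREATE TABLE deliveries (delivery_date TEXT, fruit TEXT, bags INTEGER)"
)


def record_delivery(delivery_date, cargo):
    """Append one day's cargo to the history instead of overwriting it."""
    conn.executemany(
        "INSERT INTO deliveries VALUES (?, ?, ?)",
        [(delivery_date, fruit, bags) for fruit, bags in cargo.items()],
    )


# Two made-up days of deliveries.
record_delivery("2024-01-01", {"apple": 10, "pear": 4})
record_delivery("2024-01-02", {"apple": 12, "pear": 3})

# Compare apple deliveries across days.
apple_history = conn.execute(
    "SELECT delivery_date, bags FROM deliveries "
    "WHERE fruit = 'apple' ORDER BY delivery_date"
).fetchall()
print(apple_history)  # [('2024-01-01', 10), ('2024-01-02', 12)]
```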

Adding new data types.

It turns out that the organisation also owns a dairy farm. You like what you’ve done with the apples and pears and the information you are getting at the destination station. You can see the benefit of knowing the same information about the dairy farm.

The trainline and train already exist so you think it will be simple, but you can’t use the carriages built for transporting apples and pears to transport milk. The same is true of different types of data. They can travel on the same train but need different carriages. So, you build a new carriage and add it to the train. This carries the milk to the destination, but before it gets there it goes through the sorting station, which also needs to be modified to handle the new cargo type. Finally, you get the train running all the way to the destination station with the new cargo type.

This demonstrates what happens when we need to add more fields and data types to our system after it has been built. It also illustrates what happens when we need to modify an already built system, and how the effort is amplified as the complexity of the system increases.
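One way to picture the modification is a sorting station that only knows the cargo types it has been given a handler for. Everything named here (`HANDLERS`, `bag_fruit`, `bottle_milk`, the bag and bottle sizes) is hypothetical and chosen only to make the sketch runnable.

```python
def bag_fruit(quantity):
    """Fruit is counted in items and packed into bags of 50 (assumed size)."""
    return {"unit": "bags", "amount": quantity // 50}


def bottle_milk(quantity):
    """Milk is measured in litres and packed into 2-litre bottles (assumed size)."""
    return {"unit": "bottles", "amount": quantity // 2}


# The sorting station only knows the cargo types registered here.
HANDLERS = {"apple": bag_fruit, "pear": bag_fruit}


def sort_cargo(cargo_type, quantity):
    """Process cargo with its registered handler, or fail loudly."""
    if cargo_type not in HANDLERS:
        raise ValueError(f"sorting station cannot handle {cargo_type!r}")
    return HANDLERS[cargo_type](quantity)


# Before the modification, milk is rejected.
try:
    sort_cargo("milk", 40)
except ValueError as err:
    print(err)  # sorting station cannot handle 'milk'

# The modification: register a handler for the new cargo type.
HANDLERS["milk"] = bottle_milk
print(sort_cargo("milk", 40))  # {'unit': 'bottles', 'amount': 20}
```

Every stage between the source and the destination needs an equivalent change, which is why the cost grows with the number of stages.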

Timing.

Adding milk to the train causes one unexpected change. Milk goes off quicker than apples and pears so the schedule has to be updated to run more frequently. This represents the time demand of different data types and sources. The information held in some data is only valuable if delivered within a certain time. One solution is to build a different line which runs more regularly.
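To make the timing constraint concrete, here is a small sketch in which each cargo type has a freshness window. The windows (seven days for apples, one day for milk) and the dates are invented for the example.

```python
from datetime import datetime, timedelta

# Assumed freshness windows: how old data may be before it loses its value.
MAX_AGE = {
    "apple": timedelta(days=7),
    "milk": timedelta(days=1),
}


def is_fresh(cargo_type, collected_at, now):
    """Return True if the cargo is still within its freshness window."""
    return now - collected_at <= MAX_AGE[cargo_type]


def required_interval():
    """A shared train must run at least as often as the shortest window."""
    return min(MAX_AGE.values())


now = datetime(2024, 1, 10, 9, 0)
collected = datetime(2024, 1, 8, 9, 0)  # two days earlier

print(is_fresh("apple", collected, now))  # True
print(is_fresh("milk", collected, now))   # False
print(required_interval())                # 1 day, 0:00:00
```

With a single shared line, the most time-sensitive cargo sets the timetable for everything; a separate, faster line lets the slower cargo keep its original schedule.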

Design oversight, missing data.

Let us imagine business is going really well for your organisation and we now have a new collection location. In the real world this can represent an already existing part of your organisation or a new part. It is similar to the first location in that it processes apples and pears. Now you want to see the information from this new location at the destination station. You decide to build a similar line which delivers the cargo to the sorting station, where it can be grouped together with the cargo from the other lines and transported in an easy-to-use way to the destination station.

Image by author

Sometime later you notice that some of the apples are arriving damaged. This represents a problem with the business that we need to identify and fix. You turn to the information system and try to figure out what’s happening. You know when it started because the date is logged but you don’t know which location the damaged apples are coming from. You would need to know the source location of the apples and that’s not currently included in the data.

This represents an oversight in the system design, with a key bit of data being absent. Without it, the system is useless in this scenario. To fix it you have to go right back to the source and make sure the name of the collection point is included in the data.
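The sketch below shows what becomes possible once the collection point travels with the data. The records, the field names and the location names `orchard_a` and `orchard_b` are hypothetical.

```python
from collections import Counter

# Hypothetical records that include the collection point as a field.
records = [
    {"date": "2024-03-01", "source": "orchard_a", "fruit": "apple", "damaged": False},
    {"date": "2024-03-01", "source": "orchard_b", "fruit": "apple", "damaged": True},
    {"date": "2024-03-02", "source": "orchard_b", "fruit": "apple", "damaged": True},
    {"date": "2024-03-02", "source": "orchard_a", "fruit": "apple", "damaged": False},
]


def damaged_by_source(rows):
    """Count damaged deliveries per collection point."""
    return Counter(row["source"] for row in rows if row["damaged"])


print(damaged_by_source(records))  # Counter({'orchard_b': 2})
```

Without the `source` field the same records would only say that damage started on 2024-03-01, not where it came from.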

Summary.

You can see how a better understanding of the end requirements of the system will greatly improve your ability to mitigate potential pitfalls during development. We also need to acknowledge that knowing all the future requirements of a system is a luxury we rarely have. Asking more questions at the start, improving our understanding of the data and spending more time on planning will save you time in the long run.

The dream….

Now we are really flying, and you have a hunch you think will help the business. You’ve noticed there is sometimes a surplus of apples and pears and sometimes you run out. This is affecting the organisation’s profits. You reckon this depends on the number of people at the destination consuming the apples and pears. If we know how many people are at the destination location, then we can match the output of apples and pears to fit. To do this we introduce another data source and type, which tells us how many people are at the destination location. It arrives at the destination location on its own line but is also directed back to the other locations. This allows them to alter production to fit the consumption. You have now built a loop system where data can be used to improve your business.
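A deliberately simple sketch of the loop: headcount data from the destination is turned into a production target for the collection points. The figure of two pieces of fruit per person and the function name `plan_production` are assumptions, not part of the story.

```python
def plan_production(headcount, per_person=2):
    """Set the next output target from the number of consumers.

    per_person is an assumed average daily consumption.
    """
    return headcount * per_person


# Made-up headcounts arriving at the destination, then sent back to the farms.
daily_headcount = [100, 80, 120]
targets = [plan_production(n) for n in daily_headcount]
print(targets)  # [200, 160, 240]
```

In practice the relationship between headcount and demand would have to be estimated from the historical data already held in the database.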

Image by author