Why less is more when it comes to data

Big data may be a trend, but more isn't necessarily better.

Opinion

"Less is more," claim the British. "In der Beschränkung zeigt sich der Meister" ("It is in limitation that the master shows himself"), the Germans say.

When it comes to the usefulness of data, we should keep these sayings in mind. Heeding them could be vital to succeeding, or even surviving, in the unfolding data-driven economy.

Making effective use of data is a struggle for many organisations. Over the years, IT structures have evolved into a complex collection of systems that barely support processes, let alone satisfy informational demand. This is an important part of the reason why many organisations are unable to supply reliable data to feed algorithms and make decisions.

Data management programmes aim to solve this problem, albeit with sometimes disappointing results. The programmes often come down to repair after the fact and therefore risk being a one-off. Creating and maintaining high-quality data requires more work than mere patching. When the basics aren’t right, it’s like filling a bucket full of holes.

Photo by Pedro da Silva on Unsplash

Succeeding at data management is hard enough, but there is another, frequently overlooked factor: more than 80% of organisational information is stored in an unstructured form. This enormous amount of data piles up and just keeps growing.

It is a vital source of insights, yet control over all this data remains very limited. The majority of it is, simply put, garbage. The challenge is to remove the garbage and keep only the data that is relevant. Because less is in fact more.

Business leaders cannot afford not to take up this challenge. For starters, the cost of hosting, maintaining and backing up this data is unnecessarily high and will only keep growing. Another factor is that old data needs to be removed to comply with laws and regulations. But by far the most important argument is that polluted unstructured data leads to bad decision making.

Imagine hiring someone based on the wrong CV. And then accidentally sending that new employee a permanent contract instead of a temporary one. In a data-driven economy, this is not acceptable.

The good news?

There is a way to clean up old data in a controlled and responsible manner. Modern indexing and categorisation technology makes it possible to remove redundant, obsolete and trivial (ROT) data and to identify the data that is valuable. These insights are added as metadata¹ to create structure in what was once an enormous, unmanageable pile of data. Over the years, I have developed and tested a five-level data retention funnel that offers a proven approach. At every level of the funnel, a little more ROT data is removed.
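To make the idea of labelling ROT data and recording the result as metadata a little more concrete, here is a minimal Python sketch. It is not the five-level funnel itself, whose levels are not spelled out here. The `Document` class, the cutoff date, the file suffixes and the three rules are all hypothetical assumptions chosen for illustration; real rules would come from an organisation's retention policy.

```python
from dataclasses import dataclass, field
from datetime import date
import hashlib


@dataclass
class Document:
    """Hypothetical stand-in for an indexed file."""
    path: str
    content: bytes
    last_modified: date
    metadata: dict = field(default_factory=dict)


# Assumed thresholds, for illustration only.
OBSOLETE_BEFORE = date(2015, 1, 1)
TRIVIAL_SUFFIXES = (".tmp", ".bak")


def tag_rot(documents):
    """Add a 'rot' metadata label to each document (None means: keep)."""
    seen_hashes = set()
    for doc in documents:
        digest = hashlib.sha256(doc.content).hexdigest()
        if digest in seen_hashes:
            label = "redundant"   # exact duplicate of an earlier document
        elif doc.last_modified < OBSOLETE_BEFORE:
            label = "obsolete"    # older than the assumed cutoff
        elif doc.path.endswith(TRIVIAL_SUFFIXES) or not doc.content:
            label = "trivial"     # temporary or empty files
        else:
            label = None
        seen_hashes.add(digest)
        doc.metadata["sha256"] = digest
        doc.metadata["rot"] = label
    return documents


def keep_valuable(documents):
    """Return only the documents that were not labelled as ROT."""
    return [d for d in documents if d.metadata.get("rot") is None]
```

Each rule here is deliberately crude. In practice, categorisation technology would also look at content and document class (for example, CV or contract) before anything is removed.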

Photo by Anton on Unsplash

Dealing with the hoarding mentality

It’s not unusual for organisations to fear losing valuable data. A good way to overcome this fear is to store data in a hidden form for a limited period of time. End users can request specific data during this grace period, and it is made available to them if it is required for appropriate business reasons. The reasons are documented and used to enrich data retention policies. It is important that restored data still passes through the steps of the data retention funnel, so that it is correctly enriched with metadata. At the end of the funnel, only valuable data remains.
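The grace-period idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular product: the `Quarantine` class, its method names and the 90-day period are hypothetical, and the funnel step that enriches restored data with metadata is left out.

```python
from datetime import date, timedelta

# Assumed length of the grace period; no specific duration is prescribed.
GRACE_PERIOD = timedelta(days=90)


class Quarantine:
    """Hypothetical holding area for data that is hidden before deletion."""

    def __init__(self):
        self.hidden = {}        # path -> date on which it was hidden
        self.restore_log = []   # (path, business reason) pairs

    def hide(self, path, today):
        self.hidden[path] = today

    def request(self, path, reason, today):
        """Restore a hidden item if it is still in its grace period
        and a business reason is given; document the reason."""
        hidden_on = self.hidden.get(path)
        if hidden_on is None:
            return False
        if today - hidden_on > GRACE_PERIOD or not reason.strip():
            return False
        del self.hidden[path]
        self.restore_log.append((path, reason))
        return True

    def purge(self, today):
        """Drop items whose grace period has expired and return their paths."""
        expired = [p for p, d in self.hidden.items() if today - d > GRACE_PERIOD]
        for p in expired:
            del self.hidden[p]
        return expired
```

The `restore_log` is the part that matters most for policy: the documented reasons show which kinds of data people actually ask for, and that can feed back into the retention rules.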

The bad news?

This is hard work. There is no doubt that a structured data-cleaning approach helps organisations to improve the management of unstructured data, thereby enabling better decision making. But it’s the data professionals tackling this who do the heavy lifting. Cleaning the bits and the (tera)bytes is hard and unglamorous work. Not many enjoy being a digital cleaning lady. But somebody’s got to do it. I think it’s extremely rewarding to witness how it contributes to new opportunities. And it’s an excellent way to live up to the promise of a popular Dutch saying: "Opgeruimd staat netjes" (roughly, "tidied up looks neat").

Footnotes

1 Metadata is data about data: it provides information about a specific data object’s content. Examples are the author, the creation date or the document class (e.g. a CV or contract).

