Noise is Not an Enemy of Data. Noise IS Data.

Henry Kim
Towards Data Science
3 min read · Mar 27, 2017

In analyzing data, there are three components.

First, there is the reality — this is where the data comes from.

Second, there is the data: the bits of reality filtered through the data collection process. Data is a subset of reality, but not the whole thing.

Third, there is the model, which may be constructed deductively from first principles or inductively from the data on hand. This is what we think the reality looks like and, as a general matter, is never complete.

There are two stages where the model can deviate from reality: the first is where reality becomes data, and the second is where the model is fitted to the data. Noise, as we term it, usually crops up in the second stage. Noise is, by definition, where the data does not match the model.

On one hand, a great deal of noise means that the model does not match up nicely with the sliver of reality that the data represents. As far as we can tell, the model is not very good and should not be taken for granted as a reflection of reality. In other words, the price you see is NOT a true reflection of the real value, and you may have to bargain and haggle, so to speak.

On the other hand, noise is rarely truly random, except in the convenient assumptions of statistics. It shows where your model can improve, even within the small sliver of reality that is the data. A model that generates a lot of noise is not necessarily a good thing in itself, especially if your goal is to rely on the model to “commodify” the information (to tell you what the right price is, rather than which things you need to haggle over). But sometimes there is no obvious “right price” and haggle you must, and if so, you might as well know what the situation is. Understanding the patterns in the noise, EVEN if they do not point directly at what you want to know, can only help enlighten, if in directions that you may not have expected a priori.
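To make the claim that noise is rarely truly random concrete, here is a minimal sketch, not taken from the original post. It assumes a made-up "reality" with a curved term, fits a straight line to it, and then checks whether the leftover residuals (the "noise") still carry a pattern. The data-generating formula, the helper names `fit_line` and `correlation`, and every number in it are illustrative assumptions.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def correlation(us, vs):
    """Pearson correlation between two equal-length sequences."""
    mu, mv = statistics.fmean(us), statistics.fmean(vs)
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    var_u = sum((u - mu) ** 2 for u in us)
    var_v = sum((v - mv) ** 2 for v in vs)
    return cov / (var_u * var_v) ** 0.5


rng = random.Random(0)

# Assumed "reality": a curved relationship plus genuinely random error.
xs = [i / 10 for i in range(-50, 51)]
ys = [1.0 + 2.0 * x + 0.5 * x * x + rng.gauss(0, 1) for x in xs]

# The model: a straight line, which cannot represent the curvature.
a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Least squares leaves residuals uncorrelated with x by construction,
# so the "noise" looks harmless from the model's own point of view...
corr_x = correlation(residuals, xs)

# ...but it lines up strongly with x squared, the piece the model left out.
corr_x2 = correlation(residuals, [x * x for x in xs])

print(f"corr(residual, x)   = {corr_x:.3f}")
print(f"corr(residual, x^2) = {corr_x2:.3f}")
```

In this toy setup the residuals are mostly the missing curved term rather than random error, which is the sense in which the noise points at where the model can improve.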

This is before you deal with the first stage where the theory may diverge from reality: data generation and data collection are rarely a one-to-one process. Easy, predictable, and routine data is far more readily available than complex, messy, and strange data, if only by the nature of data availability. The bias against noise in matching models against data, unfortunately, can only add to this sampling bias. Since odd data are rarely crucial in shaping outcomes except on rare occasions, evaluating the impact of quirks in sample selection is not easy until those occasions arrive, with serious consequences (financial meltdown and such). This needs to be kept in mind in the course of analysis.
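The sampling-bias point can be illustrated with a small simulation. This is a sketch under invented assumptions, not a description of any real collection process: the "reality" is mostly routine values with occasional large shocks, and the hypothetical `collected` filter records routine values every time but extreme ones only 20% of the time.

```python
import random
import statistics

rng = random.Random(1)

# Assumed "reality": 95% routine values, 5% large shocks.
reality = [
    rng.gauss(0, 1) if rng.random() > 0.05 else rng.gauss(0, 10)
    for _ in range(20000)
]


def collected(value):
    """Hypothetical collection filter: routine values are always recorded,
    extreme ones (|value| >= 3) only 20% of the time."""
    return abs(value) < 3 or rng.random() < 0.2


data = [v for v in reality if collected(v)]

spread_reality = statistics.pstdev(reality)
spread_data = statistics.pstdev(data)

print(f"kept {len(data)} of {len(reality)} observations")
print(f"spread in reality: {spread_reality:.2f}")
print(f"spread in data:    {spread_data:.2f}")
```

Only a small share of observations is lost, yet the collected data looks noticeably calmer than the reality it came from, because the losses are concentrated in the strange cases.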

Noise, defined as instances where the data, or even reality, deviates from the model, is not a waste. That’s where all the potential future insights and opportunities for growth and learning lie. Yes, you do want to reduce noise in the long run, because that is the sign that you are learning, but it should be approached with respect and taken seriously. Sometimes you might even want to blow up the noise by deliberately oversampling the messy parts of the data, so that you can investigate them more deeply, at least in the short to medium term.
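One possible way to "blow up the noise" is to resample the data with weights proportional to how badly the model misses each record. The sketch below is only one reading of that idea: the function `oversample_noisy`, the toy records, and the residual values are all hypothetical, and weighting by absolute residual is an assumption rather than a method the post prescribes.

```python
import random


def oversample_noisy(records, residuals, k, seed=0):
    """Draw k records with replacement, weighting each by |residual|.

    Records the model fits poorly are drawn more often, so the messy
    part of the data is deliberately over-represented in the sample.
    """
    rng = random.Random(seed)
    # A tiny constant keeps perfectly fitted records drawable.
    weights = [abs(r) + 1e-9 for r in residuals]
    return rng.choices(records, weights=weights, k=k)


# Toy data: ten records, nine fitted well and one fitted badly.
records = list(range(10))
residuals = [0.1] * 9 + [5.0]

sample = oversample_noisy(records, residuals, k=1000)
share_of_messy_record = sample.count(9) / len(sample)

print(f"share in original data: {1 / len(records):.2f}")
print(f"share after oversampling: {share_of_messy_record:.2f}")
```

A sample built this way is useful for studying the messy cases up close, but it is intentionally unrepresentative, so it should not be used to estimate how common those cases are.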
