The Noisy Elephant

Can’t get more data? Less noisy data might do the trick

Lydia Nemec
Towards Data Science


Figure 1 | Image by DimaDim_art from Pixabay.

In more traditional industries like manufacturing or health care, machine learning is only beginning to unfold its potential to add value. The key for those industries will be to switch from model-centric towards data-centric machine learning development.[1] As Andrew Ng (co-founder of Coursera and deeplearning.ai, head of Google Brain [2]) puts it, these industries need to embrace a “data-centric” perspective on machine learning, where the focus is on data quality, not quantity.[3]

In this blog post, we will explore the effect of noise (quality) and dataset size (quantity) on Gaussian process regression.[5] We will see that, instead of increasing data quantity, improving data quality can yield the same improvement in fit quality. I will proceed in three steps. First, I will introduce the dataset. Second, I will define the noise to be simulated and added to the data. Third, I will explore the influence of dataset size and noise on the accuracy of the regression model. The plots and numerical experiments were generated using Julia. The code can be found on GitHub. Unless stated otherwise, the figures were generated by the author using this code.

1. The John von Neumann Elephant

To explore the relationship between dataset size and noise, we use the von Neumann elephant [6] shown in Fig. 2 as a toy dataset.

Note: John von Neumann (1903–1957) was a Hungarian-born mathematician. He made major contributions to a number of fields including mathematics, physics, computer science and statistics. Freeman Dyson recalls that, in a 1953 meeting, Enrico Fermi criticized Dyson's work by quoting von Neumann: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk" [7]

Figure 2 | Plot of the perimeter of the John von Neumann elephant parametrised by J. Mayer et al. [6]

The perimeter of the elephant (Fig. 2) is described by a set of points (x(t), y(t)), where t is a parameter. Interpreting t as time, J. Mayer et al. [6] expanded x(t) and y(t) separately as Fourier series

Equation 1 | A Fourier expansion
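Schematically, the expansion for x(t) has the form (and analogously for y(t))

x(t) = \sum_{k=0}^{5} \left[ A_k^{x} \cos(kt) + B_k^{x} \sin(kt) \right]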

where the upper indices (x, y) denote the x and y expansion, and the lower indices k indicate the kth term in the Fourier expansion. Table 1 lists the coefficients (A, B) as found by J. Mayer et al. The values listed in Table 1 also include the wiggle parameter (wiggle coeff. = 40) and the coordinates of the eye, xₑ = yₑ = 20 [6].

Table 1 | Fourier expansion coefficients to generate the von Neumann elephant. [6]

In full generality, we need 24 real coefficients to draw the elephant, since k ranges from k=0 to k=5 with four coefficients per k. However, J. Mayer et al. found that most coefficients can be set to zero, leaving only eight non-zero parameters. If each pair of coefficients is further summarised into one complex number, the elephant contour (and trunk wiggle) is indeed encoded in a set of four (plus one) complex parameters.

Figure 3 | Parametric plot of the von Neumann elephant (top left) and the Fourier Series expansions x(t) (bottom left) and y(t) (top right).

In the following, we will use the curves x(t) and y(t) with t ∈ [−π, π] for our experiments (shown in Fig. 3).
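A minimal sketch of how the two curves can be evaluated in Julia is shown below; the coefficient vectors are placeholders and must be filled with the values from Table 1 (the full code is on GitHub).

```julia
# Evaluate a truncated Fourier series (Eq. 1) with cosine coefficients A and sine coefficients B.
fourier(t, A, B) = sum(A[k+1] * cos(k * t) + B[k+1] * sin(k * t) for k in 0:length(A)-1)

# Placeholder coefficient vectors for k = 0, …, 5; replace with the Table 1 values [6].
Ax, Bx = zeros(6), zeros(6)
Ay, By = zeros(6), zeros(6)

ts = range(-π, π; length = 1000)        # parameter t ∈ [−π, π]
xs = [fourier(t, Ax, Bx) for t in ts]   # x(t)
ys = [fourier(t, Ay, By) for t in ts]   # y(t)
```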

2. The Noise

For noise, we use random numbers drawn from a uniform distribution, a normal distribution, or a skewed normal distribution. The noise is generated by a pseudorandom number generator; we use Julia's default generator, which is based on the Xoshiro algorithm.

2.1 Uniform Distribution

When sampling from a continuous uniform distribution, every value in the interval [a, b] is equally likely. Figure 4 shows the curves x(t) and y(t) including the uniformly distributed noise, together with its histogram. In Figure 4, the random numbers range from a=-1.5 to b=1.5.
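As a rough sketch (illustrative, not the exact code from the repository), such noise can be generated in Julia like this:

```julia
using Random

rng = Xoshiro(42)                   # Julia's default RNG family, fixed seed for reproducibility
a, b = -1.5, 1.5
N = 1000
Δ = a .+ (b - a) .* rand(rng, N)    # uniform noise on [a, b]
```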

Figure 4 | The curves x(t) and y(t) plus the uniformly distributed noise, and the histogram of the noise (Δx, Δy). The noise is centered around 0 and ranges between -1.5 and 1.5.

2.2 Standard Normal Distribution

The normal distribution (also called the Gaussian distribution) is a continuous probability distribution for a real-valued random variable; the standard normal distribution is the special case with zero mean and unit variance. The general form of the normalised probability density function (pdf) is given by Eq. 2

Equation 2 | General form of the normalised probability density function
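Explicitly, the pdf in Eq. 2 reads

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)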

where the parameter μ is the mean (or expectation) value and σ² is the variance. The normal distribution is a symmetric distribution with mean, median and mode being equal. One of the reasons the normal distribution is important in statistics is the central limit theorem. It states that, under some conditions, the sampling distribution of the average of many independent random variables with finite mean and variance approaches a normal distribution as the number of contributing random variables tends to infinity.[8] Physical quantities that are expected to be the sum of many independent processes, like measurement errors, are often normally distributed.[9] Therefore, noise can often be approximated by a normal distribution.

Figure 5 | The data x(t) and y(t) plus the normally distributed noise, and the histogram of the noise (Δx, Δy). The mean of the noise is 0 and the standard deviation is σ=2.

Figure 5 shows the data curves x(t) and y(t) including noise generated by the normal distribution. In the example (Fig. 5), the mean of the noise is μ=0, and the standard deviation is σ=2.
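Again as an illustrative sketch, normally distributed noise with these parameters can be drawn with randn:

```julia
using Random

rng = Xoshiro(42)
σ = 2.0
N = 1000
Δ = σ .* randn(rng, N)   # normally distributed noise with μ = 0 and σ = 2
```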

2.3 Skewed Normal Distribution

The skewed normal distribution is an asymmetric generalisation of the normal distribution. It can be used to model asymmetric noise, where one tail is longer than the other. In a skewed normal distribution, the mean and the median are in general different. The general form of the skewed normal probability density function (pdf), as shown in Eq. 3, is, up to normalisation, a product of the standard normal pdf Φ(x’) and the corresponding cumulative distribution function ψ(α x’), which is closely related to the error function.

Equation 3 | The skewed normal probability density function
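With the substitution x' = (x − ξ)/ω, the pdf can be written as

f(x) = \frac{2}{\omega}\, \Phi(x')\, \psi(\alpha x'), \qquad \psi(u) = \frac{1}{2}\left[ 1 + \operatorname{erf}\!\left( \frac{u}{\sqrt{2}} \right) \right]

with Φ the standard normal pdf (Eq. 2 with μ=0 and σ=1) and ψ its cumulative distribution function,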

where the location is given by ξ, the scale by ω, and the parameter α defines the skewness. For α=0 the skewing factor is constant and Eq. 3 reduces to a normal distribution (Eq. 2) with mean ξ and standard deviation ω. Often, the parameter α is called the shape parameter because it regulates the shape of the pdf. The distribution is right-skewed if α>0 and left-skewed if α<0.

Figure 6 | The curves x(t) and y(t) plus noise drawn from the skewed normal distribution, and the histogram of the noise (Δx, Δy). The location is ξ=0, the scale ω=3, and the shape α=4.

Figure 6 shows the data curves x(t) and y(t) including noise generated by the skewed normal distribution. The noise was generated using the parameters location ξ=0, scale ω=3, and shape α=4.
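For illustration, such noise can be drawn with, e.g., the SkewNormal distribution from the Distributions.jl package (a minimal sketch):

```julia
using Random, Distributions

rng = Xoshiro(42)
dist = SkewNormal(0.0, 3.0, 4.0)   # location ξ = 0, scale ω = 3, shape α = 4
N = 1000
Δ = rand(rng, dist, N)             # skew-normally distributed noise
```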

3. First Experiment: Dataset Size and Regression Quality

For the first experiment, let us use the data y(t) and add noise generated by the normal distribution with μ=0 and σ=2 (see Fig. 5). For this example, we take a dataset with N=1000 data points as described above, from which we sample random selections of 10, 50, 100, and 500 data points, as shown in Fig. 7. To fit the sampled points, we use Gaussian process regression.
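For illustration, such a fit can be set up along the following lines, e.g. with the GaussianProcesses.jl package and the SE kernel and hyperparameters discussed in the note at the end of this section (a minimal sketch, not the full code from the repository):

```julia
using GaussianProcesses, Random

rng = Xoshiro(1)
n  = 50
ts = sort(rand(rng, n) .* 2π .- π)            # n random parameter values in [−π, π]

# Placeholder for the y(t) curve from Sec. 1; replace with the Fourier expansion
# evaluated with the Table 1 coefficients.
elephant_y(t) = 0.0
ys = elephant_y.(ts) .+ 2 .* randn(rng, n)    # noisy observations with σ = 2

kern = SE(log(8.0), log(75.0))                # SE kernel with ℓ = 8 and signal std σ = 75 (passed on a log scale)
gp   = GP(ts, ys, MeanZero(), kern, log(2.0)) # last argument: log of the observation noise std
μ, σ² = predict_y(gp, range(-π, π; length = 200))   # posterior mean and variance on a fine grid
```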

Why Gaussian processes? Aside from being widely used, Gaussian processes work well with small datasets, and determining the cause of problems during training or inference is easier for Gaussian processes than for comparable machine learning methods. For example, Gaussian processes have been used by the moonshot company X in a project to expand internet connectivity with stratospheric balloons. Using Gaussian processes, each balloon decides how best to exploit prevailing winds to situate itself as part of one large communication network.[4]

Figure 7 | The fit (cyan line) to the data (blue points) and the 0.95 confidence interval (blue ribbon) given by the Gaussian process for (a) 10, (b) 50, (c) 100, and (d) 500 data points.

To evaluate the quality of the Gaussian process regression, we calculate the error based on the difference between the true value and the fitted one. For a concise introduction to errors in machine learning regression, see Ref. [10]. Here, we calculate the mean absolute error (MAE), the mean squared error (MSE), and the root mean square error (RMSE). The MAE, MSE, RMSE corresponding to our regression above (Fig. 7) are listed in Table 2.
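These metrics are straightforward to compute; a minimal sketch in Julia (function names are illustrative):

```julia
using Statistics

mae(ŷ, y)  = mean(abs.(ŷ .- y))    # mean absolute error
mse(ŷ, y)  = mean(abs2.(ŷ .- y))   # mean squared error
rmse(ŷ, y) = sqrt(mse(ŷ, y))       # root mean squared error
```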

Table 2 | The MAE, MSE, RMSE corresponding to the regression shown in Figure 7

From Fig. 7 and Tab. 2, we see that, unsurprisingly, the quality of the fit improves with more data points. Fig. 8 visualises this behaviour in a log-log plot.

Figure 8 | Shows the MAE, MSE, RMSE corresponding to the regression shown in Figure 7 / Table 2. The axes are shown on a logarithmic scale. The grey, dotted line indicates a 1/N scaling for comparison.

We see that increasing the number of points from N=50 to N=500 reduces the RMSE by 60%. Later, we will see that halving the noise level yields a similar reduction.

Note: For the Gaussian process regression, we use the squared exponential (SE) function as a kernel (Eq. 4). In Gaussian process regression, the SE kernel is the default in most machine learning libraries. The SE kernel has a few advantages over other kernels. For example, every function drawn from its prior is infinitely differentiable. It also has only two parameters: the length scale ℓ and the output variance σ². The length scale ℓ determines the length of the ‘wiggles’ in the function. The output variance σ² determines the average distance of the function from its mean. For the fit shown in Fig. 7, we have chosen the hyperparameters ℓ=8 and σ=75.

Equation 4 | Squared exponential kernel
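In symbols, the SE kernel reads

k(x, x') = \sigma^2 \exp\!\left( -\frac{(x - x')^2}{2\ell^2} \right)

with length scale ℓ and output variance σ².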

4. Second Experiment: Influence of Noise Type

Next, we use the data x(t) and add noise generated by three different distributions: uniform, normal, and skewed normal, as introduced in Sec. 2. For the uniform distribution, we sample from the interval a=-2.0 to b=2.0. For the normal distribution, we use μ=0 for the mean and σ²=4.0 for the variance. For the skewed normal distribution, we use the parameters ξ=0, ω=2.0, and α=2.0. For all three distributions, we use a dataset with N=1000 data points. From the dataset, we randomly select 500 data points, as shown in the left column of Fig. 9.

Figure 9 | Left column: The curves x(t) and y(t) plus noise based on different distributions, and the histogram of the noise (Δx, Δy). Right column: The fit (cyan line), the sampled noisy points (blue points), and the confidence interval (blue ribbon) given by a Gaussian process with 500 data points. From top to bottom, the noise is drawn from the uniform (rand), normal (randn), and skewed normal (skewed) distribution.

We use Gaussian process regression, as in Sec. 3. The results are shown in the right column of Fig. 9. The data points are shown as blue points and the resulting fit as a cyan line. In addition, the 0.95 confidence interval of the fit is visualised as a blue ribbon.

For uniform and Gaussian noise, we obtain RMSEs of 0.13 and 0.31, respectively; the RMSE for Gaussian noise is higher because its variance is greater. The skewed normal case is more difficult: for Gaussian and uniform noise, minimising the fit RMSE is equivalent to finding the maximum likelihood fit, but for skewed noise the mean and the mode (maximum likelihood) are not the same. Since Gaussian process regression optimises for the maximum likelihood fit rather than minimising the RMSE, we expect a higher RMSE. Indeed, the RMSE is 1.4, as shown in Fig. 9. All in all, we see how the scale and shape of the noise affect the fit RMSE we can expect.

5. Third Experiment: Influence of Noisiness

In the third experiment, we use the curve x(t) and add noise generated by the uniform, normal, and skewed normal distributions, as introduced in Sec. 2. We vary the scale of the noise for each distribution as follows:

  • Uniform distribution: [a, b] = {[-1, 1], [-2, 2], [-4, 4], [-8, 8]}; mean = 0
  • Normal distribution: σ = {1, 2, 4, 8}; mean μ = 0
  • Skewed normal distribution: ω = {1, 2, 4, 8}; parameters ξ = 0, α = 2.0

We use a dataset with N=5000 data points for each distribution. We randomly select {50, 100, 500, 1000} points from the dataset. For each combination of scale, distribution and number of data points, we use Gaussian processes regression and calculate fit RMSE values as before in Sec. 3. The RMSEs are listed in Tab. 3 below.
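Schematically, the experiment grid can be run as sketched below (illustrative only; the fit and RMSE computation follow Sec. 3):

```julia
using Random, Distributions

rng = Xoshiro(0)
scales = (1, 2, 4, 8)
sizes  = (50, 100, 500, 1000)

# Draw n noise samples for a given distribution type and scale s.
noise_for(dist, s, n) = dist == :uniform ? 2s .* rand(rng, n) .- s :
                        dist == :normal  ? s .* randn(rng, n) :
                                           rand(rng, SkewNormal(0.0, float(s), 2.0), n)

for dist in (:uniform, :normal, :skewnormal), s in scales, n in sizes
    Δ = noise_for(dist, s, n)
    # sample n points of x(t), add Δ, fit a Gaussian process as in Sec. 3,
    # and record the RMSE for Tab. 3
end
```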

Table 3 | The RMSE of the Gaussian process regression for different noise scales in the generated data. The noise is sampled from a uniform, normal, and skewed normal distribution. The dataset size varies from 50 to 1000 data points.

The third experiment shows that, for all three distributions, the number of data points must increase as the scale of the noise increases to retain the same fit quality as measured by the RMSE. For instance, starting with uniform noise sampled from the interval [-2, 2] (scale = 2) with N=100 points, we can either increase the number of points to N=1000, reducing the RMSE by 48%, or decrease the noisiness by sampling from the smaller interval [-1, 1] (scale = 1), reducing the RMSE by 33%. Looking at Tab. 3, we see similar tradeoffs for other scales, dataset sizes and noise types: halving the noise yields a similar improvement to increasing the dataset size by a factor of ten.

6. Conclusion

We have seen that noisier data leads to worse fits. Further, even for the same variance, the shape of the noise can have a profound effect on the fit quality. Finally, we compared improving data quality with increasing data quantity and found that decreasing noisiness can yield fit improvements similar to increasing the number of data points.

In industrial applications, where datasets are small and more data is hard to come by, understanding, controlling and reducing the noisiness of data offers a way to radically improve fit quality. There are various methods to reduce noise in a controlled and effective way. For inspiration, see Ref. [11].

References

  1. Andrew Ng "AI Doesn’t Have to Be Too Complicated or Expensive for Your Business", Harvard Business Review (July 2021)
  2. Wikipedia article "Andrew Ng" (December 2021)
  3. Nicholas Gordon "Don’t buy the ‘big data’ hype, says cofounder of Google Brain", fortune.com (July 2021)
  4. James Wilson, Paul R. Daugherty, and Chase Davenport "The Future of AI Will Be About Less Data, Not More", Harvard Business Review (January 2019)
  5. David J.C. MacKay "Information Theory, Inference, and Learning Algorithms", Cambridge University Press, ISBN 978-0521642989 (September 2003)
    Carl Edward Rasmussen and Christopher K.I. Williams, “Gaussian Processes for Machine Learning”, MIT Press, ISBN 978-0262182539 (November 2005)
  6. Jürgen Mayer, Khaled Khairy, and Jonathon Howard "Drawing an elephant with four complex parameters", American Journal of Physics 78, 648, DOI:10.1119/1.3254017 (May 2010)
  7. Freeman Dyson "A meeting with Enrico Fermi" Nature 427, 6972, 297, DOI:10.1038/427297a (January 2004)
  8. Julia Kho "The One Theorem Every Data Scientist Should Know", Medium.com — TowardsDataScience (October 2018)
  9. Cooper Doyle "The Signal and the Noise: How the central limit theorem makes data science possible", Medium.com — TowardsDataScience (September 2021)
  10. Eugenio Zuccarelli "Performance Metrics in Machine Learning — Part 2: Regression", Medium.com — TowardsDataScience (January 2021)
  11. Andrew Zhu “Clean Up Data Noise with Fourier Transform in Python”, Medium.com — TowardsDataScience (October 2021)


I am the Head of ZEISS AI Accelerator with a background in computational physics, numerics and machine learning, bridging the gap between research and innovation.