Optimal Undersampling using Machine Learning, with Python

Here's how to smartly undersample your signal in a few lines of code

Photo by Prateek Katyal on Unsplash

In the era of Big Data, undersampling is a key part of data processing. Although undersampling can be defined in a very rigorous way, the idea is simple: we want to take a long, memory-hungry, time-consuming signal and replace it with a smaller, cheaper one.

In this post you will learn how to undersample your signal in a "smart" way, using Machine Learning and a few lines of code. We will start by describing the optimization task behind undersampling, and we will then explain our strategy to make this optimization as efficient as possible.

Let’s dive in 🙂

1. Introduction

Of course, every undersampling task is, in the end, an optimization task. Let me explain why.

Consider this function:

Image by author

I made this function using 100 points and, as you can see, that is enough to make it look perfectly smooth. Now, what is undersampling about?

In its most naive form, for a 2D signal, it works like this:

  1. Pick random points along the x axis (or sample uniformly spaced points). Of course, you should pick fewer points than the original signal has. In our case, you could consider 10 data points, or 20, or 30, but fewer than 100.
  2. Connect these points using an interpolation technique. The simplest one is linear interpolation: you just connect the points with straight lines.

Very simple. Let's see how it looks if we randomly pick 10 points out of the 100 we have.
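
In NumPy, this naive version is a couple of lines. Here is a minimal sketch (assuming, as in the plots of this post, a Gaussian signal on 100 points):

    import numpy as np

    rng = np.random.default_rng(42)

    # the "full" 100-point signal
    x = np.linspace(-5, 5, 100)
    y = np.exp(-x**2)

    # 1. randomly pick 10 of the 100 points (sorted, so we can interpolate)
    idx = np.sort(rng.choice(len(x), size=10, replace=False))

    # 2. connect them with straight lines, evaluated on the full grid
    y_naive = np.interp(x, x[idx], y[idx])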

Image by author

Ok… are you guys happy with this undersampling? I bet you are not.

Let’s consider more data points and see what it looks like. In particular, let’s double them to 20:

Image by author

It’s better, of course.

Let’s try to see what happens with 70:

Image by author

In that case, we are basically perfect. Nonetheless, we are considering 70% of the data points, so even if we are technically undersampling, we are still keeping a lot of them.

Now it should be clearer why we are facing an optimization problem every time we try to undersample:

  • If you undersample using too few data points, your reconstruction is lossy (high efficiency, low reconstruction quality).
  • If you undersample using too many data points, your reconstruction is faithful, but the method is not very efficient, or not efficient at all (low efficiency, high reconstruction quality).
Image by author

2. Proposed Algorithm

The message we can take from this analysis is that randomly or evenly sampling the x axis is not an optimal strategy.

The algorithm we are proposing adds another step between the two steps of the traditional one:

  1. Pick random points along the x axis (or sample uniformly spaced points), fewer than the original signal has: 10 data points, or 20, or 30, but fewer than 100.
  2. Oversample the low-quality areas.
  3. Connect these points using an interpolation technique, for example linear interpolation.

Let’s analyze step 2 more carefully. If we look at the 10- and 20-point plots, we can see that there are zones where the error is larger than in others. For example, in the tails the error is minimal, because it is basically harmless to use lines instead of Gaussian curves there. However, if we use lines around the peak we get in trouble, and the quality is lower.

In our algorithm we consider the absolute difference between the target signal and the interpolated sampled one. Then we add points wherever this distance is too large. Let’s make it more concrete.

Let’s say the original signal has N points and that we start with n sampled points. We are going to do the following:

1. Interpolate the sampled signal and compute, from your n starting points, an interpolated signal with values at the original N data points
2. Compute the absolute normalized difference between the interpolated signal at the N data points and the original N values
3. Add a point wherever the difference is larger than a certain threshold (let's say 5 percent of the maximum)

Let’s code that 🙂

3. Hands-on implementation

Importing the libraries:
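
NumPy for the math and Matplotlib for the plots are all the sketches in this section need:

    import numpy as np
    import matplotlib.pyplot as plt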

Creating the signal:
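
Any smooth curve works here; a minimal stand-in is a Gaussian peak on 100 points, like the one in the plots above:

    # 100 evenly spaced points and a smooth Gaussian peak
    N = 100
    x = np.linspace(-5, 5, N)
    y = np.exp(-x**2)

    plt.plot(x, y)
    plt.show()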

Defining undersample and interpolation functions:
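
Here is a minimal version of the two functions (the names undersample and interpolate, and the choice to always keep the two endpoints, are mine):

    def undersample(x, n, rng):
        """Randomly pick n indices of the original points, always keeping
        the two endpoints so the interpolation covers the whole x range."""
        inner = rng.choice(np.arange(1, len(x) - 1), size=n - 2, replace=False)
        return np.sort(np.concatenate(([0], inner, [len(x) - 1])))

    def interpolate(x_sampled, y_sampled, x_full):
        """Linearly interpolate the sampled signal back onto the full grid."""
        return np.interp(x_full, x_sampled, y_sampled)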

Defining the algorithm:
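
One way to implement the three steps is greedy: interpolate, find the worst point, add it if its normalized error is above the threshold, and repeat until no point is. A sketch:

    def optimal_undersample(x, y, n_start=10, threshold=0.05, seed=42):
        """Iteratively add points where the linear interpolation of the
        sampled signal deviates too much from the original one."""
        rng = np.random.default_rng(seed)
        idx = set(undersample(x, n_start, rng).tolist())
        while True:
            sel = np.array(sorted(idx))
            y_interp = interpolate(x[sel], y[sel], x)
            # absolute difference, normalized by the maximum of the signal
            error = np.abs(y_interp - y) / np.abs(y).max()
            worst = int(np.argmax(error))
            if error[worst] <= threshold:
                break
            idx.add(worst)  # error is 0 at sampled points, so worst is new
        sel = np.array(sorted(idx))
        return x[sel], y[sel]

    x_s, y_s = optimal_undersample(x, y)
    plt.plot(x, y, label="original")
    plt.plot(x_s, y_s, "o-", label=f"optimized ({len(x_s)} points)")
    plt.legend()
    plt.show()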

4. Analyzing the results

The result of this procedure is shown in the following plot:

Image by author

As we can see, we start with an undersampled generic signal, but we end up with a truly optimized version of it, with more data points in the areas that require them.

5. N-Dimensional Signal

In the original file, you can find the same implementation for an N-dimensional signal; a sketch of the 2D case is below.
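
The greedy loop carries over almost unchanged: you only need an interpolator that works in more dimensions. Here is a sketch for a 2D signal using SciPy's griddata (the function name and the handling of points outside the convex hull are my choices):

    from scipy.interpolate import griddata

    def optimal_undersample_2d(points, values, n_start=50, threshold=0.05, seed=42):
        """points has shape (N, 2), values has shape (N,)."""
        rng = np.random.default_rng(seed)
        idx = set(rng.choice(len(points), size=n_start, replace=False).tolist())
        while True:
            sel = np.array(sorted(idx))
            interp = griddata(points[sel], values[sel], points, method="linear")
            error = np.abs(interp - values) / np.abs(values).max()
            # points outside the convex hull of the sample come back as NaN:
            # give them maximal error so the hull grows to cover them
            error = np.where(np.isnan(interp), 1.0, error)
            worst = int(np.argmax(error))
            if error[worst] <= threshold:
                break
            idx.add(worst)
        sel = np.array(sorted(idx))
        return points[sel], values[sel]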

6. Conclusions

If you liked the article and you want to know more about Machine Learning, or you just want to ask me something, you can:

A. Follow me on Linkedin, where I publish all my stories
B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me to receive all the corrections or doubts you may have.
C. Become a referred member, so you won’t have any "maximum number of stories for the month" and you can read whatever I (and thousands of other Machine Learning and Data Science top writers) write about the newest technology available.

