A Quick Way to Check the Linearity of Data

Learn how to use PCA to check if data is linear or non-linear

Aditya Dutt
Towards Data Science


Image by Author

When working with new data, one of the first questions to ask is how the data is distributed, and in particular whether it is linear or non-linear. Suppose we want to compress a 100-dimensional feature vector to fewer dimensions. We can use either PCA or an autoencoder to compress the features. PCA, however, is a linear method and does not work well for non-linear data, whereas autoencoders can model non-linear data. Therefore, before compressing the data, it is essential to know whether it is linear or not.

In this short tutorial, we will see how to tell whether data looks linear or non-linear by examining the eigenvalues that PCA computes, that is, the eigenvalues of the data's covariance matrix.

  1. For linear data, the first few eigenvalues will be large compared to the rest, which will be almost zero.
  2. For non-linear data, many principal components will have non-zero eigenvalues. Each eigenvector gives a direction in which the data is spread, and its eigenvalue measures the variance along that direction. If the data is non-linear and not spread along a single direction, then many (possibly all) of the eigenvectors will have non-zero eigenvalues, because there is no one general direction in which the data is spread.
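For reference, the quantities in the two points above can be written out as follows. The symbols (x_k for a data point, n for the number of points, d for the number of dimensions, C for the covariance matrix) are notation assumed here, not taken from the original article.

```latex
\bar{x} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad
C = \frac{1}{n-1}\sum_{k=1}^{n} (x_k - \bar{x})(x_k - \bar{x})^{\top}, \qquad
C v_i = \lambda_i v_i, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0

\text{fraction of variance along } v_i \;=\; \frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}
```

Data that lies on a straight line has a single non-zero eigenvalue, so this fraction is 1 for the first component and 0 for all others.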

To demonstrate this, we will create different data distributions and compute their PCA. Below is a short piece of Python code for demonstration.

Step 1: Import Python libraries
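The import cell is not shown here. A minimal sketch, assuming NumPy for the linear algebra and Matplotlib for the plots (scikit-learn's PCA class is a common alternative), might look like this. The fixed seed is an added assumption for reproducibility.

```python
import numpy as np               # arrays, random sampling, eigendecomposition
import matplotlib.pyplot as plt  # scatter plots

# Fixed seed so that the randomly generated data is reproducible.
rng = np.random.default_rng(seed=0)
```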

Step 2: Generate Linear Data. We will randomly generate 6-dimensional linear data.
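The original generation code is not shown here, so the following is only one plausible construction: points placed along a single straight line in 6-dimensional space. The names `direction` and `offset`, the sample count, and the seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_samples, n_dims = 1000, 6

# One latent parameter t moves each point along a single straight line.
t = rng.uniform(-1.0, 1.0, size=(n_samples, 1))

# Hypothetical direction and offset of the line in 6-D space.
direction = rng.normal(size=(1, n_dims))
direction /= np.linalg.norm(direction)   # scale to unit length
offset = rng.normal(size=(1, n_dims))

linear_data = offset + t * direction     # shape (1000, 6)
print(linear_data.shape)
```

Because every point is `offset` plus a multiple of the same `direction`, all of the variance lies along that one direction.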

Step 3: Now, generate random non-linear data.
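As the original code is not shown here, this is a sketch of one way to obtain non-linear data: a curve built from the functions t, t² and sin(3t), embedded in 6 dimensions by a random mixing matrix. The choice of functions, the `mixing` matrix, and the seed are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_samples, n_dims = 1000, 6

t = rng.uniform(-1.0, 1.0, size=n_samples)

# The curve (t, t^2, sin 3t) bends as t varies, so the points do not
# fall on a straight line.
curve = np.column_stack([t, t**2, np.sin(3.0 * t)])   # shape (1000, 3)

# Hypothetical random mixing matrix that embeds the curve in 6-D space.
mixing = rng.normal(size=(3, n_dims))
nonlinear_data = curve @ mixing                        # shape (1000, 6)
print(nonlinear_data.shape)
```

With this particular construction the data spans three directions, so three eigenvalues are non-zero and the remaining three vanish.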

Step 4: Now, we will generate a 6-dimensional unit hypersphere.
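A common way to sample a unit hypersphere, sketched here under the assumption that points on the surface are intended, is to draw Gaussian vectors and normalize each one to unit length. The sample count and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n_samples, n_dims = 1000, 6

# Normalizing isotropic Gaussian samples to unit length gives points
# spread uniformly over the surface of the unit hypersphere.
gaussian = rng.normal(size=(n_samples, n_dims))
sphere_data = gaussian / np.linalg.norm(gaussian, axis=1, keepdims=True)

# Every point should now lie at distance 1 from the origin.
print(np.allclose(np.linalg.norm(sphere_data, axis=1), 1.0))  # True
```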

Step 5: Visualize the three data distributions
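A possible plotting sketch is given below. It rebuilds stand-in versions of the three datasets so that it runs on its own; these constructions, the panel titles, and the output file name are assumptions rather than the author's code. Only the first two of the six dimensions are plotted.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
n, d = 1000, 6

# Stand-in datasets (assumed constructions, not the author's exact data).
t = rng.uniform(-1.0, 1.0, size=(n, 1))
linear_data = t * rng.normal(size=(1, d))
curve = np.hstack([t, t**2, np.sin(3.0 * t)])
nonlinear_data = curve @ rng.normal(size=(3, d))
g = rng.normal(size=(n, d))
sphere_data = g / np.linalg.norm(g, axis=1, keepdims=True)

datasets = {
    "Linear": linear_data,
    "Non-linear": nonlinear_data,
    "Unit hypersphere": sphere_data,
}

# One panel per dataset, plotting the first two of the six dimensions.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (title, data) in zip(axes, datasets.items()):
    ax.scatter(data[:, 0], data[:, 1], s=5)
    ax.set_title(title)
    ax.set_xlabel("Dimension 0")
    ax.set_ylabel("Dimension 1")
fig.tight_layout()
fig.savefig("original_data.png")  # or plt.show() in an interactive session
```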

Scatter Plot of Original Data

Step 6: Compute PCA, observe eigenvalues, and display principal components.
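The PCA step can be sketched with plain NumPy as an eigendecomposition of the covariance matrix. The helper `pca` and the stand-in datasets are assumptions made so the example runs on its own. scikit-learn's `PCA` exposes the same eigenvalues through its `explained_variance_` attribute.

```python
import numpy as np

def pca(data):
    """Return eigenvalues (descending), eigenvectors (as columns),
    and the data projected onto the principal components."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)              # d x d covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigenvalues)[::-1]             # re-sort to descending
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    return eigenvalues, eigenvectors, centered @ eigenvectors

rng = np.random.default_rng(seed=0)
n, d = 1000, 6

# Stand-in datasets (assumed constructions, not the author's exact data).
t = rng.uniform(-1.0, 1.0, size=(n, 1))
g = rng.normal(size=(n, d))
datasets = {
    "linear": t * rng.normal(size=(1, d)),
    "non-linear": np.hstack([t, t**2, np.sin(3.0 * t)]) @ rng.normal(size=(3, d)),
    "hypersphere": g / np.linalg.norm(g, axis=1, keepdims=True),
}

results = {name: pca(data) for name, data in datasets.items()}
for name, (eigenvalues, _, _) in results.items():
    print(name, np.round(eigenvalues, 3))
```

The third return value holds the projected data; plotting its first two columns against each other gives the PC0 versus PC1 scatter plot.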

Now, observe the eigenvalues in each case:

  • For linear data, the first eigenvalue is 0.51 and the rest are zero.
  • For random non-linear data, the first three eigenvalues are significantly non-zero and the remaining ones are zero.
  • For a unit hypersphere, there is almost equal spread in every direction. Therefore, all eigenvalues are non-zero and have almost equal magnitude.
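To turn "non-zero" into a number, one option is to count the eigenvalues that exceed a small fraction of the largest one. The helper `n_significant` and its threshold are hypothetical. The linear eigenvalues are the ones reported above, and the hypersphere values are the idealized ones (1/6 in each of the 6 directions), not measured results.

```python
import numpy as np

def n_significant(eigenvalues, rel_tol=1e-6):
    """Count eigenvalues larger than rel_tol times the largest one."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return int(np.sum(eigenvalues > rel_tol * eigenvalues.max()))

linear_eigs = [0.51, 0.0, 0.0, 0.0, 0.0, 0.0]   # values reported above
sphere_eigs = [1.0 / 6.0] * 6                   # idealized unit hypersphere in 6-D

print(n_significant(linear_eigs))  # 1
print(n_significant(sphere_eigs))  # 6
```

A count of 1 means the data lies on a single line, while a count equal to the number of dimensions means the data is spread in every direction.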
Scatter Plot of PCA on Original Data

We can see from this plot the difference between the PCA of linear and non-linear data. In the case of linear data, the plot shows a straight line: almost all of the variation is along PC0, and PC1 stays close to zero. In the case of non-linear data (the middle and right plots), there is spread in both the horizontal and vertical directions.

The complete code is available on GitHub here.

We have demonstrated how to tell if the data is linear or non-linear by observing the eigenvalues and interpreting PCA plots.

I hope you found this article useful!

I wrote this article because many students find it difficult to interpret the output of PCA. The best way to start understanding how an algorithm or model works is to observe its output on simple datasets. This helps build intuition about the algorithm.

Originally published at https://www.linkedin.com.
