P.C.A. meets explainability

An interesting scenario from astrophysics where the Principal Component Analysis components are actually easy to interpret

Piero Paialunga
Towards Data Science


[Cover image courtesy of Emily Morter]

Every data scientist, especially those who find themselves working with Big Data, knows the importance of dimensionality reduction. If you have a dataset with a large number of columns and a Machine Learning task to complete:

  • A) The algorithm could take a massive amount of time and energy to run
  • B) It could perform really badly

So it is important to know and understand some dimensionality reduction techniques, and one of the most famous is Principal Component Analysis (P.C.A.).

This algorithm projects your data onto a new set of axes with lower dimensionality. In simple terms, it reduces the number of columns. The disturbing fact is that even if you start with a dataset that is readable and easy to interpret, with almost (hey, I said almost) any probability you will end up with a reduced number of columns that are not easy to understand at all.

For example, let’s say you have a record of people with some information about them like Sex, Weight, Height, Favourite Movie, etc. The features are extremely clear, right? But if you apply a Principal Component Analysis, you will end up with fewer columns that are not easy to interpret: you don’t know what “column 1, 2, 3, 4” are. And I mean… you may not care about it, but I think we can all agree that it is not the best situation when you start losing track of what your features mean.

The interesting thing I want to share with you guys is that, in this specific scenario, you actually know what Principal Component Analysis is doing and which directions it chooses to project the data.

Let’s go.

The Libraries:
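A handful of standard libraries is enough. Here is a minimal sketch of the imports assumed by the snippets in the rest of the article:

    # A minimal sketch of the imports used in the snippets below
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from scipy.optimize import curve_fit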

The dataset:

The dataset that I’ve used is an astrophysics one, coming from Hubble Space Telescope observations. It is basically a catalog of star observations.

Let’s take a look at the features:

  • #ID is the id number…not interesting at all
  • X and Y are the coordinates of the stars
  • F606W and F814W are the two fluxes at the 606 nm and 814 nm wavelengths
  • error and error.1 are the two related errors
  • Chi is the goodness of fit of each star to the point spread function (P.S.F.)

Sharp is the target, so let’s pretend not to have it in our dataset.
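Loading the catalog with pandas is a one-liner. The file name and the whitespace-separated format below are assumptions, so adapt them to your copy of the data:

    # Hypothetical file name: the catalog is assumed to be a
    # whitespace-separated text file with the columns described above.
    columns = ['#ID', 'X', 'Y', 'F606W', 'error', 'F814W', 'error.1', 'Chi', 'Sharp']
    data = pd.read_csv('hubble_catalog.txt', sep=r'\s+', names=columns, comment='#')
    data = data.drop(columns=['Sharp'])  # pretend the target is not there
    print(data.shape)  # 50000+ rows, 8 columns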

As you can see, we have 50,000+ rows and 8 columns. I don’t know about your hardware, guys, but my computer starts to struggle a bit with these numbers, especially when I want to apply some expensive algorithm like SVMs or NNs.

P.C.A. time!

If we blindly apply the Principal Component Analysis now, we will see that it chooses the spatial coordinates to project the data. Let’s say we want 3 components.
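Here is a minimal sketch of that step. The features are deliberately left unscaled: the pixel coordinates span a much larger range than the other columns, and that is exactly why P.C.A. latches onto them.

    # "Blind" P.C.A. on all the features (no scaling), 3 components
    features = data.drop(columns=['#ID'])
    pca = PCA(n_components=3)
    components = pca.fit_transform(features)
    pca_df = pd.DataFrame(
        components,
        columns=['FirstComponent', 'SecondComponent', 'ThirdComponent'])
    print(pca.explained_variance_ratio_)

    # Pairwise scatter plots of the three components
    pairs = [('FirstComponent', 'SecondComponent'),
             ('FirstComponent', 'ThirdComponent'),
             ('SecondComponent', 'ThirdComponent')]
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    for ax, (a, b) in zip(axes, pairs):
        ax.scatter(pca_df[a], pca_df[b], s=1)
        ax.set_xlabel(a)
        ax.set_ylabel(b)
    plt.tight_layout()
    plt.show()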

It is clear that the ‘FirstComponent vs SecondComponent’ plot is the coordinates scatter plot:

But this is not that interesting, right? I mean, it is basically telling us that the coordinates cannot be projected onto other directions because they already carry a high variance by themselves.

Let’s do the same thing excluding the spatial features.
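The sketch is identical to the previous one, just dropping the two coordinate columns first:

    # Same P.C.A. as before, excluding the spatial coordinates
    features_no_xy = data.drop(columns=['#ID', 'X', 'Y'])
    pca = PCA(n_components=3)
    components = pca.fit_transform(features_no_xy)
    pca_df = pd.DataFrame(
        components,
        columns=['FirstComponent', 'SecondComponent', 'ThirdComponent'])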

Now we start partying…

KEEP IN MIND:

In astrophysics, an important quantity used to distinguish the nature of a star is the stellar color, and it can be expressed as the difference between the 814 nm and 606 nm fluxes: Stellar Color = F814W - F606W.

THIRD COMPONENT:

Now let’s take a close look at the ThirdComponent vs FirstComponent plot:

The P.C.A. has recovered the stellar color as one of its components!

In particular, if you play with it a little bit you can even identify an almost exact expression for the third component:

1.24 + 1.31 * ThirdComponent = Stellar Color
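A quick way to verify it is to plot the two quantities against each other (a sketch, with the stellar color computed as defined above):

    # Stellar color as defined above: difference between the two fluxes
    stellar_color = data['F814W'] - data['F606W']

    # If the relation holds, the points should lie on the diagonal
    plt.scatter(1.24 + 1.31 * pca_df['ThirdComponent'], stellar_color, s=1)
    plt.xlabel('1.24 + 1.31 * ThirdComponent')
    plt.ylabel('Stellar color (F814W - F606W)')
    plt.show()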

FIRST COMPONENT:

Even the First Component of the P.C.A. is not hard to retrieve. In fact, it appears to be something like this:

FirstComponent = -(data.F606W - 26.5) * 1.3 - 5 * data.error

If you don’t trust me (shame on you…), you can run a small optimization to find the best-fit parameters for that curve and retrieve more accurate values for those numbers (26.5, 1.3, -5…).
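A minimal sketch of such a fit with scipy’s curve_fit, using the expression above as the model and its rough numbers as starting guesses:

    # Fit FirstComponent = -(F606W - a) * b - c * error and refine a, b, c
    def first_component_model(X, a, b, c):
        f606w, err = X
        return -(f606w - a) * b - c * err

    params, _ = curve_fit(
        first_component_model,
        (data['F606W'].values, data['error'].values),
        pca_df['FirstComponent'].values,
        p0=[26.5, 1.3, 5.0])  # starting guesses from the expression above
    print(params)  # refined values for 26.5, 1.3 and 5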

SECOND COMPONENT:

The Second Component appears to be related to the Chi value through this expression:

Chi = SecondComponent * 0.96 + 2.276
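Again, a one-line sanity check (a sketch): the correlation should be close to 1 if the relation holds.

    # Compare the reconstructed Chi with the real one
    chi_hat = pca_df['SecondComponent'] * 0.96 + 2.276
    print(np.corrcoef(chi_hat, data['Chi'])[0, 1])  # ~1 if the relation holds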

To summarize:

  • The First Component is a combination of one flux and one error, with some other minor corrections
  • The Second Component is related to Chi
  • The Third Component is related to the stellar color

The First Component and the Second Component tell us that the fluxes and the errors are strongly correlated and can be put together, while the Chi variable is extremely informative and cannot be merged with anything else. The Third Component is mind-blowing, as it says that a significant amount of variance (a.k.a. information in the dataset) lies along the stellar color.

That means that P.C.A. is capturing the reason why the stellar color is such a widely used quantity in astrophysics.

Conclusion:

I’m a physicist, so I’m not going to lie on this one: this situation is extremely rare. When you use P.C.A. you usually pay the cost of losing explainability on your features. This article is not meant to make readers believe that you can always extract explainability from P.C.A.

Anyway, it is still fun to find physics in the results of Machine Learning algorithms, and it is actually mind-blowing when an algorithm that knows nothing about astrophysics captures an astrophysical quantity almost out of the blue.

If you liked the article and you want to know more about Machine Learning, or you just want to ask me something you can:

A. Follow me on Linkedin, where I publish all my stories
B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me with any corrections or doubts you may have.
C. Become a referred member, so you won’t have any “maximum number of stories for the month” limit and you can read whatever I (and thousands of other Machine Learning and Data Science top writers) write about the newest technology available.

Ciao :)


PhD in Aerospace Engineering at the University of Cincinnati. Machine Learning Engineer @ Gen Nine, Martial Artist, Coffee Drinker, from Italy.