NASA Asteroid Classification

Classifying whether an asteroid is hazardous or not.

Shubhankar Rawat
Towards Data Science

--

INTRODUCTION

Much of what lies in outer space remains a mystery, and scientists keep searching for something new. While a lot remains to be discovered, let us look at one thing that has been found.

ASTEROIDS
Yes! An accidental discovery made in 1801 by the Italian priest and astronomer Giuseppe Piazzi led to what we now call an asteroid. Piazzi found the first asteroid, named Ceres, orbiting between Mars and Jupiter. Since then, many asteroids have been discovered and studied by space organizations such as NASA.

Asteroids are minor planets, especially of the inner Solar System; larger asteroids have also been called planetoids. Millions of asteroids exist, and the vast majority of known ones orbit within the main asteroid belt between the orbits of Mars and Jupiter, or are co-orbital with Jupiter (the Jupiter Trojans).

The study of asteroids is also crucial, as history shows that some of them can be hazardous. Remember the Chicxulub crater? It was formed by an asteroid impact that likely wiped out the dinosaurs about 66 million years ago.

Being a data science enthusiast, I thought of using machine learning to predict whether an asteroid could be hazardous or not.
Searching on Kaggle, I found a NASA dataset about some of the asteroids discovered so far. The dataset contains a variety of information about each asteroid and labels it as hazardous or non-hazardous.
You can find the dataset here.

Now let us look at the dataset.

ASTEROID DATASET

The data is about asteroids and is provided by NeoWs (Near Earth Object Web Service).

A glimpse of the dataset

The dataset consists of 4687 data instances (rows) and 40 features (columns). Also, there are no null values in the dataset.
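These inspection steps can be sketched with pandas. The tiny DataFrame below is an illustrative stand-in (the rows and values are made up); on the real Kaggle CSV you would call `pd.read_csv` on the downloaded file and run the same `.shape` and `.isnull()` checks on all 4687 rows and 40 columns.

```python
import pandas as pd

# Toy stand-in for NASA's NeoWs dataset; column names follow the article,
# the three rows are invented for illustration only.
df = pd.DataFrame({
    "Neo Reference ID": [3703080, 3723955, 2446862],
    "Name": [3703080, 3723955, 2446862],
    "Absolute Magnitude": [21.6, 21.3, 20.3],
    "Hazardous": [True, False, True],
})

# The same calls apply to the full 4687 x 40 dataset.
n_rows, n_cols = df.shape
null_count = df.isnull().sum().sum()   # the real dataset has no nulls
```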

Descriptions of some of the features are given below:

  1. ‘Neo Reference ID’: This feature denotes the reference ID assigned to an asteroid.
  2. ‘Name’: This feature denotes the name given to an asteroid.
  3. ‘Absolute Magnitude’: This feature denotes the absolute magnitude of an asteroid. An asteroid’s absolute magnitude is the visual magnitude an observer would record if the asteroid were placed 1 Astronomical Unit (AU) away, and 1 AU from the Sun and at a zero phase angle.
  4. ‘Est Dia in KM(min)’: This feature denotes the minimum estimated diameter of the asteroid in kilometres (km).
  5. ‘Est Dia in M(min)’: This feature denotes the minimum estimated diameter of the asteroid in metres (m).
  6. ‘Relative Velocity km per sec’: This feature denotes the relative velocity of the asteroid in kilometre per second.
  7. ‘Relative Velocity km per hr’: This feature denotes the relative velocity of the asteroid in kilometre per hour.
  8. ‘Orbiting Body’: This feature denotes the planet around which the asteroid is revolving.
  9. ‘Jupiter Tisserand Invariant’: This feature denotes the Tisserand’s parameter for the asteroid. Tisserand’s parameter (or Tisserand’s invariant) is a value calculated from several orbital elements (semi-major axis, orbital eccentricity, and inclination) of a relatively small object and a more substantial ‘perturbing body’. It is used to distinguish different kinds of orbits.
  10. ‘Eccentricity’: This feature denotes the eccentricity of the asteroid’s orbit. Like many other bodies in the Solar System, asteroids trace orbits that are not perfect circles but ellipses. Eccentricity measures how far from circular each orbit is: the smaller the eccentricity, the more circular the orbit.
  11. ‘Semi Major Axis’: This feature denotes the semi-major axis of the asteroid’s orbit. As discussed above, the orbit of an asteroid is elliptical rather than circular; the semi-major axis is half of the ellipse’s longest diameter.
  12. ‘Orbital Period’: This feature denotes the value of the orbital period of the asteroid. Orbital period refers to the time taken by the asteroid to make one full revolution around its orbiting body.
  13. ‘Perihelion Distance’: This feature denotes the value of the Perihelion distance of the asteroid. For a body orbiting the Sun, the point of least distance is the perihelion.
  14. ‘Aphelion Dist’: This feature denotes the value of Aphelion distance of the asteroid. For a body orbiting the Sun, the point of greatest distance is the aphelion.
  15. ‘Hazardous’: This feature denotes whether the asteroid is hazardous or not.

To sum up, the features present in the dataset cover not only the geometry of the asteroid but also its path and speed.

THE APPROACH

As usual, you can find the code for this article in the following Github Repository.

It is worth noting that, in general, larger asteroids are more likely to be hazardous than comparatively smaller ones.
If we take the mean diameter of the asteroids labelled hazardous in this dataset, it turns out to be 0.70 km, whereas the mean diameter of the non-hazardous asteroids is 0.40 km.
Hence, the dataset supports this general observation.
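A per-class mean like this is a one-line `groupby` in pandas. The numbers below are invented toy values, not the real dataset's; only the column names follow the article.

```python
import pandas as pd

# Illustrative values only; in the real dataset 'Est Dia in KM(min)'
# holds the diameter estimate and 'Hazardous' the class label.
df = pd.DataFrame({
    "Est Dia in KM(min)": [0.9, 0.5, 0.1, 0.2, 0.3],
    "Hazardous": [True, True, False, False, False],
})

# Mean estimated diameter per class.
mean_dia = df.groupby("Hazardous")["Est Dia in KM(min)"].mean()
hazardous_mean = mean_dia.loc[True]       # mean over hazardous rows
non_hazardous_mean = mean_dia.loc[False]  # mean over non-hazardous rows
```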

Let us begin.

Feature Engineering

As one can see, the dataset contains several unnecessary features that hardly contribute to classification.
The features ‘Name’ and ‘Neo Reference ID’ are identifiers assigned to an asteroid. They are not useful for the machine learning model, since an asteroid’s identifier has no bearing on whether it is hazardous. Moreover, both columns contain the same values.
Thus, we can delete both features.

The feature ‘Close Approach Date’ is also unnecessary, since it only records when the asteroid passed near Earth. The date of closest approach tells us ‘when’, not ‘whether’, an asteroid is hazardous. Thus, we delete this feature as well.
For a similar reason, we also delete the ‘Orbit Determination Date’ feature.

Now, let us look at the ‘Orbiting Body’ feature. It contains only one value, “Earth”. Hence, we delete this feature as well, since a feature with a single constant value carries no information for the model.
The feature ‘Equinox’ likewise contains only one value, ‘J2000’, so we delete it too.
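The drops above can be sketched as follows. The identifier and date columns are dropped by name, while constant columns such as ‘Orbiting Body’ and ‘Equinox’ can be detected automatically with `nunique()`. The toy values are made up; only the column names come from the dataset.

```python
import pandas as pd

# Toy frame with the kinds of columns discussed above (values invented).
df = pd.DataFrame({
    "Neo Reference ID": [1, 2, 3],
    "Name": [1, 2, 3],
    "Close Approach Date": ["1995-01-01", "1995-01-08", "1995-01-15"],
    "Orbit Determination Date": ["2017-04-06", "2017-04-06", "2017-04-07"],
    "Orbiting Body": ["Earth", "Earth", "Earth"],
    "Equinox": ["J2000", "J2000", "J2000"],
    "Absolute Magnitude": [21.6, 21.3, 20.3],
})

# Identifiers and dates are dropped by name...
df = df.drop(columns=["Neo Reference ID", "Name",
                      "Close Approach Date", "Orbit Determination Date"])

# ...while constant columns can be detected automatically.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)
```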

Consider the features:

‘Est Dia in KM(min)’, ‘Est Dia in KM(max)’,
‘Est Dia in M(min)’, ‘Est Dia in M(max)’,
‘Est Dia in Miles(min)’, ‘Est Dia in Miles(max)’,
‘Est Dia in Feet(min)’, ‘Est Dia in Feet(max)’

All these features represent the estimated diameter of the asteroid in different units (KM = kilometres, M = metres, and so on). This is a textbook example of redundant data: the same value represented in different ways. Such redundancy should be removed. The beauty of statistical analysis is that it identifies such redundancies in the dataset even if a data scientist misses them. So rather than removing these features right away, let us have them identified statistically.

The removal of the above features so far was intuition-based. Now let us look at the statistical analysis and find out which features are statistically relevant.

Statistical Analysis

Before proceeding, let us look at the ‘Hazardous’ feature. Its values are ‘TRUE’ and ‘FALSE’; we encode these as 1 and 0, respectively.
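One simple way to do the encoding, assuming the labels arrive as the strings ‘TRUE’ and ‘FALSE’ (if they are already booleans, `.astype(int)` works instead):

```python
import pandas as pd

# The real column holds TRUE/FALSE labels; three toy rows for illustration.
df = pd.DataFrame({"Hazardous": ["TRUE", "FALSE", "TRUE"]})

# Map the string labels to 1 (hazardous) and 0 (not hazardous).
df["Hazardous"] = df["Hazardous"].map({"TRUE": 1, "FALSE": 0})
```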

Now, let us form the correlation matrix of the dataset.

Correlation matrix

The correlation matrix shows that several features are strongly correlated with one another, which means one feature from each correlated group can be removed without hesitation.
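Computing the matrix is a single call to `DataFrame.corr()`. In the toy frame below (invented values), the min and max diameter estimates are perfectly proportional, so their correlation comes out as 1.0 and flags the pair as redundant:

```python
import pandas as pd

# Toy numeric frame; corr() on the full dataset works the same way.
df = pd.DataFrame({
    "Est Dia in KM(min)": [0.1, 0.2, 0.3, 0.4],
    "Est Dia in KM(max)": [0.2, 0.4, 0.6, 0.8],   # proportional to the min column
    "Eccentricity":       [0.5, 0.1, 0.4, 0.2],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()
dia_corr = corr.loc["Est Dia in KM(min)", "Est Dia in KM(max)"]
```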

Coming back to the features:
‘Est Dia in KM(min)’, ‘Est Dia in KM(max)’,
‘Est Dia in M(min)’, ‘Est Dia in M(max)’,
‘Est Dia in Miles(min)’, ‘Est Dia in Miles(max)’,
‘Est Dia in Feet(min)’, ‘Est Dia in Feet(max)’

As mentioned above, the correlation matrix quickly identifies this redundancy, and we can now delete the repeated information.
I am going to delete the features:
‘Est Dia in M(min)’, ‘Est Dia in M(max)’,
‘Est Dia in Miles(min)’, ‘Est Dia in Miles(max)’,
‘Est Dia in Feet(min)’, ‘Est Dia in Feet(max)’.

A similar explanation can be given for the features:
‘Relative Velocity km per sec’, ‘Relative Velocity km per hr’, ‘Miles per hour’,
and
‘Miss Dist.(Astronomical)’, ‘Miss Dist.(lunar)’, ‘Miss Dist.(kilometers)’, ‘Miss Dist.(miles)’

Out of the features mentioned above, I am going to keep ‘Relative Velocity km per sec’ and ‘Miss Dist.(Astronomical)’.

If we look at the ‘Est Dia in KM(max)’ and ‘Est Dia in KM(min)’ we can see that they are strongly correlated.

Let us look at their scatter plot to get a clear idea.

The scatter plot shows that ‘Est Dia in KM(min)’ and ‘Est Dia in KM(max)’ are almost perfectly correlated: a clear linear relationship between the two is visible.
It is therefore recommended to delete one of these features; I am going to remove ‘Est Dia in KM(max)’.
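Dropping one column from each highly correlated pair can also be automated. The sketch below uses the upper triangle of the absolute correlation matrix and a threshold of 0.95; both the threshold and the toy values are my own choices for illustration, not taken from the article's code.

```python
import numpy as np
import pandas as pd

# Toy frame: the two diameter columns are almost perfectly correlated.
df = pd.DataFrame({
    "Est Dia in KM(min)": [0.1, 0.2, 0.3, 0.4],
    "Est Dia in KM(max)": [0.22, 0.45, 0.67, 0.89],
    "Eccentricity":       [0.5, 0.1, 0.4, 0.2],
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Drop every column that is >0.95 correlated with an earlier column.
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
df = df.drop(columns=to_drop)
```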

Let us now look at the variation in the ‘Hazardous’ feature.

The count plot shows that this is an imbalanced dataset: one class dominates the other in terms of the number of data instances.
83.89% of the data instances are labelled 0 (not hazardous) and only
16.11% are labelled 1 (hazardous).

This means that even a broken model that predicts every value as 0 (not hazardous) would achieve 83.89% accuracy. So we cannot rely on accuracy alone to evaluate a classifier trained on this dataset. To get a clear idea of imbalanced data and the various metrics used to evaluate a classification model, please read this article.
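The class shares and the majority-class baseline fall out of `value_counts(normalize=True)`. The toy label vector below uses an 84:16 split for round numbers; the real dataset's split is 83.89:16.11.

```python
import pandas as pd

# Toy labels with the same imbalance direction as the dataset.
y = pd.Series([0] * 84 + [1] * 16)

# Fraction of instances in each class.
share = y.value_counts(normalize=True)

# Accuracy of a "broken" model that always predicts the majority class.
baseline_accuracy = share.max()
```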

MACHINE LEARNING MODELS

The data has been analyzed and cleaned; now it’s time to build those machine learning models.

Let us divide the dataset in an 80:20 ratio into training and test sets, respectively.

The train set (generated as per my code) contains 3749 data instances, of which 610 are labelled 1 (hazardous). This means that if a model predicts all values as 0, its accuracy will be 83.73%, which we consider the baseline accuracy for the train set.
Similarly, the baseline accuracy for the test set (according to my code) is 84.54%.
Our model should do better than these accuracies or should be robust enough to deal with the class imbalance.
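An 80:20 split is one call to scikit-learn's `train_test_split`. The arrays and the `random_state` below are my own illustrative choices; note that passing `stratify=y` would force the class ratio to be identical in both splits, whereas the differing baseline accuracies above suggest the article's split was unstratified.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the cleaned feature matrix X and labels y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 84 + [1] * 16)

# 80:20 split; add stratify=y to preserve the class ratio exactly.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```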

The following models are used:

  • Naive Bayes Classifier
  • SVM
  • Decision Tree
  • LightGBM
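Fitting these models shares the same `fit`/`predict` interface. The sketch below uses synthetic imbalanced data and default hyperparameters as stand-ins (the article's exact settings are not shown); the LightGBM line is commented out only to keep this snippet free of the extra dependency, since `lightgbm.LGBMClassifier` is fitted the same way.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Synthetic imbalanced data standing in for the cleaned asteroid features.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.84], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    # "LightGBM": lightgbm.LGBMClassifier(random_state=42),  # same interface
}

# Collect a few of the metrics used in the results table.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred, zero_division=0),
        "mcc": matthews_corrcoef(y_test, pred),
    }
```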

The results are formulated in a table and are as follows:

If you want to know more about the metrics used above for model evaluation, then please refer to this article.
Now let us discuss each model.

Naive Bayes

If we look at the table for Naive Bayes, we see that the accuracies on the train and test sets equal the baseline accuracies. Moreover, the Specificity, Matthews Correlation Coefficient and False Positive Rate for the test set are ‘nan’, meaning the model is broken and has predicted every value as 0 (not hazardous). Such a model is of zero significance, because there is no point in using a model that can never fulfil its purpose.
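Why do some metrics come out as ‘nan’ for such a model? When every prediction is 0, the confusion matrix has zero predicted positives, so any metric whose denominator counts predicted positives (precision, and the MCC, whose denominator multiplies all four marginals) becomes 0/0. The exact formulas behind the article's table are not shown, so the sketch below uses precision as the example:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# A "broken" model that predicts 0 for everything.
y_true = np.array([0] * 84 + [1] * 16)
y_pred = np.zeros_like(y_true)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# tp == fp == 0, so precision = tp / (tp + fp) is 0/0: undefined (nan).
precision = tp / (tp + fp) if (tp + fp) else float("nan")
```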

Let us look at the confusion matrix too since a classification problem seems incomplete without a confusion matrix.

Confusion matrices for Naive Bayes Classifier

The confusion matrices clearly show that the model fails to predict even a single data instance as 1(hazardous) and hence the model is not robust enough.

SVM

For the record, SVM (SVC) proved to be the slowest model in terms of computation speed.

The results table for SVC is quite similar to that of Naive Bayes, which clearly shows that SVC has also failed (the Specificity, Matthews Correlation Coefficient and False Positive Rate are again ‘nan’).
Let us look at the confusion matrix of SVC.

Confusion matrices of Test and Train set for SVC

The confusion matrix for the test set shows that SVC fails to predict even a single value as 1 (hazardous). The confusion matrix for the train set shows that only two data instances (out of 3749) were predicted as 1, and both predictions were correct.

It is a slight (yet negligible) improvement over the Naive Bayes classifier, but I would still say the model is completely broken and has zero practical significance.

If you are new to data science, you are probably wondering how to find a model that can deal with data imbalance, since it is a serious issue here.

Thanks to researchers, several classification models are robust enough to deal with such imbalanced data. Consider the following.

Decision Tree

The results table of the Decision Tree classifier shows it to be a very robust, almost perfect model. The accuracy on the test set is 99.4%, which is excellent, and the Matthews Correlation Coefficient and F1 Score are close to 1, which indicates that the model is practically perfect.

Let us look at the confusion matrices for the Decision Tree.

Confusion matrices for the test set and train set for Decision Tree

The confusion matrices also depict the fact that the model is robust and is almost perfect.

There are only six incorrectly predicted values in the test set and none in the train set.

The Decision Tree is an excellent example of a robust model able to deal with class-imbalanced data.

The Decision Tree may have given us good results, but there are even more robust and efficient models.

Ensemble models are often more robust because they combine a ‘collection’ of weak predictors into a stronger one.
So, I will be using LightGBM as well.

LightGBM

The results table for LightGBM shows that the model is almost perfect and gives great results: the accuracy on the test set is 99.3%, the Matthews Correlation Coefficient is 0.971, and the F1 Score is 0.996.

Let us look at the confusion matrices for LightGBM:

Confusion matrices for Train and Test set for LightGBM

The confusion matrix for the test set shows only 7 incorrect predictions (one more than the Decision Tree’s), and there is only one wrong prediction on the train set.

Conclusion

In this article, we got to know about asteroids and more importantly built an almost perfect classifier to predict whether an asteroid is hazardous or not.

The choice of features worked well, as reflected in the strong results obtained by the machine learning models.

I hope you enjoyed the article and gained something from it.

In case of any doubts or suggestions, do let me know in the comments.

THANK YOU.


I am a data science and machine learning enthusiast, who loves to share knowledge.