Predicting Stars, Galaxies & Quasars with Random Forest Classifiers in Python

Insights from the Sloan Digital Sky Survey dataset

Theodore Yoong
Towards Data Science


The SDSS website banner!

Recently, in my quest to search for interesting (physics-related) datasets, I chanced upon the Sloan Digital Sky Survey (SDSS) dataset on Kaggle, as well as a brilliant Kaggle kernel by Faraz Rahman which predicted the different types of astronomical objects (stars, galaxies, and quasars) using Support Vector Machines in R. However, as R brings back some horrible memories and training SVMs requires a lot of computational effort, I decided to give this classification problem a go using scikit-learn’s Random Forest Classifier.

Overview of Data

Labels

So what exactly are stars, galaxies, and quasars? Had you asked me prior to starting this project, I wouldn't have been able to answer (shame on me). Fortunately, Faraz's notebook succinctly summarises what they are:

  • A GALAXY is a gravitationally bound system of stars, stellar remnants, interstellar gas, dust, and dark matter. Galaxies are categorised according to their visual morphology as elliptical, spiral, or irregular. Many galaxies are thought to have supermassive black holes at their active centers.
  • A STAR is a type of astronomical object consisting of a luminous spheroid of plasma held together by its own gravity. The nearest star to Earth is the Sun.
  • A QUASAR, also known as a quasi-stellar object, is an extremely luminous active galactic nucleus (AGN). The power radiated by quasars is enormous. The most powerful quasars have luminosities exceeding 10^41 watts, thousands of times greater than an ordinary large galaxy such as the Milky Way.

Professor Andrew Bunker’s page on the Short Option course on Stars and Galaxies (S26) is also a great resource.

Features

The details of the features can be found on the Kaggle dataset overview. A summary of the more important ones:

  • ra, dec — right ascension and declination respectively
  • u, g, r, i, z — filter bands (a.k.a. photometric system or astronomical magnitudes)
  • run, rerun, camcol, field — descriptors of fields (i.e. 2048 x 1489 pixels) within image
  • redshift — increase in wavelength due to motion of astronomical object
  • plate — plate number
  • mjd — modified Julian date of observation
  • fiberid — optic fiber ID

Exploratory Analysis

The analysis here follows Faraz's. I'll let the visualisations speak for themselves.

Loading Libraries and Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv("skyserver.csv")

Description of Data

Naturally, I started with df.head(), df.describe(), and df.info(). The output of df.info() is shown below:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
objid 10000 non-null float64
ra 10000 non-null float64
dec 10000 non-null float64
u 10000 non-null float64
g 10000 non-null float64
r 10000 non-null float64
i 10000 non-null float64
z 10000 non-null float64
run 10000 non-null int64
rerun 10000 non-null int64
camcol 10000 non-null int64
field 10000 non-null int64
specobjid 10000 non-null float64
class 10000 non-null object
redshift 10000 non-null float64
plate 10000 non-null int64
mjd 10000 non-null int64
fiberid 10000 non-null int64
dtypes: float64(10), int64(7), object(1)
memory usage: 1.4+ MB

None of the entries are NaN, as expected of a well-maintained dataset. Cleaning is not necessary.
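
As a quick sanity check (my own small addition, not from Faraz's notebook), counting the nulls per column confirms this:

# count missing values per column; each count should be zero
df.isnull().sum().to_frame().transpose()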

Unique Entries

The nunique() method returns a Series with the number of unique entries in each column.

df.nunique().to_frame().transpose()

Occurrences of each Astronomical Entity

I then ran value_counts() on the class column.

occurrences = df['class'].value_counts().to_frame().rename(index=str, columns={'class': 'Occurrences'})
occurrences

We see that the majority of the entries are either galaxies or stars. Only 8.5% of the entries are classified as quasars.
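
To see the proportions directly (another small sketch of mine, not in the original notebook), normalised value counts give the same picture:

# fraction of rows per class; QSO comes out at roughly 0.085
df['class'].value_counts(normalize=True).to_frame().transpose()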

Density Distribution Plots

Using kernel density estimation (KDE), I plotted smooth density distributions of the various features.

featuredf = df.drop(['class', 'objid'], axis=1)
featurecols = list(featuredf)
astrObjs = df['class'].unique()
colours = ['indigo', '#FF69B4', 'cyan']

plt.figure(figsize=(15, 10))
for i in range(len(featurecols)):
    plt.subplot(4, 4, i+1)
    for j in range(len(astrObjs)):
        sns.distplot(df[df['class'] == astrObjs[j]][featurecols[i]], hist=False, kde=True,
                     color=colours[j], kde_kws={'shade': True, 'linewidth': 3}, label=astrObjs[j])
    plt.legend()
    plt.title('Density Plot')
    plt.xlabel(featurecols[i])
    plt.ylabel('Density')
plt.tight_layout()

Filter band densities are also plotted for each class.

filterbands = pd.concat([df.iloc[:, 3:8], df['class']], axis=1)
featurecols2 = list(filterbands)[:-1]  # the five filter bands: u, g, r, i, z

plt.figure(figsize=(15, 5))
plt.suptitle('Density Plots')
sns.set_style("ticks")
for i in range(len(astrObjs)):
    plt.subplot(1, 3, i+1)
    for j in range(len(featurecols2)):
        sns.distplot(df[df['class'] == astrObjs[i]][featurecols2[j]], hist=False, kde=True,
                     kde_kws={'shade': True, 'linewidth': 3}, label=featurecols2[j])
    plt.legend()
    plt.xlabel(astrObjs[i])
    plt.ylabel('Density')
plt.tight_layout()

Additional Visualisations

For completeness, I include a 3D plot, identical to that of the original notebook. The original intention seems to be determining if a linear kernel for the SVM works (correct me if I’m wrong please). There was a lot of clustering at the bottom, and I took the log of the redshift (ignoring the errors) to make the visualisation clearer.

from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(5, 5))
ax = Axes3D(fig)
for obj in astrObjs:
    luminous = df[df['class'] == obj]
    ax.scatter(luminous['ra'], luminous['dec'], np.log10(luminous['redshift']))
ax.set_xlabel('ra')
ax.set_ylabel('dec')
ax.set_zlabel('log redshift')
ax.view_init(elev=0, azim=45)
plt.show()

Building the Random Forest Classifier

Training and Test Set Split

The traditional train-test split can be done by:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
features = df.drop(['class', 'objid'], axis=1)  # assumed feature set: everything except the label and the objid identifier
labels = df['class']
x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=123, stratify=labels)
clf = RandomForestClassifier()

Training Complexity

When I initially attempted to train my data with sklearn.svm.LinearSVC, my laptop started to overheat pretty badly. SVM training time complexity is generally between O(mn²) and O(mn³), where m is the number of features and n is the number of observations, as explained by Jessica Mick here. On the other hand, the training complexity of growing CART (Classification and Regression Trees) is between O(mn log n) and O(mn²), as explained here (a random forest is an ensemble of CARTs). With limited time, patience, and coffee on hand, I decided to make the swap to a random forest model.
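
Out of curiosity, you could time both fits on the training split defined above. The snippet below is my own rough sketch (with the old default of 10 trees), not a benchmark from the original analysis, and the numbers will vary by machine:

from time import perf_counter
from sklearn.svm import LinearSVC

# rough fit-time comparison on the unscaled training split
for name, model in [('LinearSVC', LinearSVC(max_iter=10000)),
                    ('Random forest', RandomForestClassifier(n_estimators=10))]:
    start = perf_counter()
    model.fit(x_train, y_train)
    print(f'{name}: {perf_counter() - start:.2f} s')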

In hindsight, one thing I could’ve done to speed up the SVM (and even the random forest) was to scale my data to [-1,1], as mentioned by Shelby Matlock in the same thread. I would also get more stable prediction results that way.

from sklearn.preprocessing import MinMaxScaler
scaling = MinMaxScaler(feature_range=(-1,1)).fit(x_train)
x_train_scaled = scaling.transform(x_train)
x_test_scaled = scaling.transform(x_test)

Hyperparameter Optimisation

For hyperparameter tuning, I found this and this rather handy. We begin by instantiating a random forest and looking at the default values of the available hyperparameters. Pretty-printing the get_params() method:

from pprint import pprint
pprint(clf.get_params())

This gave:

{'bootstrap': True,
'class_weight': None,
'criterion': 'gini',
'max_depth': None,
'max_features': 'auto',
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 10,
'n_jobs': None,
'oob_score': False,
'random_state': None,
'verbose': 0,
'warm_start': False}

The hyperparameters which I decided to focus on are:

  • n_estimators (number of trees in the forest)
  • max_features (max. no. of features used in node splitting, usu. < no. of features in dataset)
  • max_depth (max. no. of levels in each decision tree)
  • min_samples_split (min. no. of data points in a node before node is split)
  • min_samples_leaf (min. no. of data points allowed in node)
  • criterion (function used to measure the quality of a split, e.g. Gini impurity or entropy)

Tuning Using Random Search

To narrow down my search, I first ran a Randomised Search Cross-Validation. Here, I performed a random search of parameters using k = 10 fold cross-validation (cv=10), across 100 different combinations (n_iter=100), and with all available cores running concurrently (n_jobs=-1). Random search samples combinations of hyperparameters at random instead of iterating across every possible combination. A higher n_iter tries more combinations, while a higher cv reduces the possibility of overfitting.

from sklearn.model_selection import RandomizedSearchCV

hyperparameters = {'max_features': [None, 'auto', 'sqrt', 'log2'],
                   'max_depth': [None, 1, 5, 10, 15, 20],
                   'min_samples_leaf': [1, 2, 4],
                   'min_samples_split': [2, 5, 10],
                   'n_estimators': [int(x) for x in np.linspace(start=10, stop=100, num=10)],
                   'criterion': ['gini', 'entropy']}
rf_random = RandomizedSearchCV(clf, hyperparameters, n_iter=100, cv=10, verbose=2, random_state=123, n_jobs=-1)
rf_random.fit(x_train, y_train)

A huge bunch of stuff comes up. To obtain the best parameters, I called:

rf_random.best_params_

This gave:

{'n_estimators': 100,
'min_samples_split': 5,
'min_samples_leaf': 2,
'max_features': None,
'max_depth': 15,
'criterion': 'entropy'}

Tuning Using Grid Search

I could now specify a narrower range of hyperparameters to concentrate on. GridSearchCV is perfect for the fine-tuning of the hyperparameters.

from sklearn.model_selection import GridSearchCV

hyperparameters = {'max_features': [None],
                   'max_depth': [14, 15, 16],
                   'min_samples_leaf': [1, 2, 3],
                   'min_samples_split': [4, 5, 6],
                   'n_estimators': [90, 100, 110],
                   'criterion': ['entropy']}
rf_grid = GridSearchCV(clf, hyperparameters, cv=10, n_jobs=-1, verbose=2)
rf_grid.fit(x_train, y_train)

This took me roughly 50 minutes. I called:

rf_grid.best_params_

This returned:

{'criterion': 'entropy',
'max_depth': 14,
'max_features': None,
'min_samples_leaf': 2,
'min_samples_split': 5,
'n_estimators': 100}

Training the Classifier

I finally updated the classifier with the optimal hyperparameters and refitted it on the training set.

clf.set_params(criterion='entropy', max_features=None, max_depth=14, min_samples_leaf=2, min_samples_split=5, n_estimators=100)
clf.fit(x_train, y_train)

Testing and Evaluation

I then tested the updated classifier on the test set, and evaluated it against a couple of metrics.

Accuracy Score and F1 Score

Note that accuracy_score refers to the fraction of correct predictions, and f1_score is the harmonic mean of precision (the ability of the classifier not to label as positive a sample that is negative) and recall (the ability of the classifier to find all the positive samples). The scikit-learn documentation explains these concepts best:

Formula for accuracy_score: accuracy(y, ŷ) = (1/n_samples) Σ_i 1(ŷ_i = y_i)
Formula for f1_score (with beta = 1): F1 = 2 × (precision × recall) / (precision + recall)

sklearn.metrics has these readily available. The order of the scores in the f1_score list corresponds to the way the classes were encoded, which can be accessed by using the .classes_ attribute of the classifier.

from sklearn.metrics import accuracy_score, f1_score

sortedlabels = clf.classes_
y_pred = clf.predict(x_test)
accscore = accuracy_score(y_test, y_pred)
f1score = f1_score(y_test, y_pred, average=None)
print(accscore)
for i in range(len(sortedlabels)):
    print((sortedlabels[i], f1score[i]), end=" ")

This returns very pleasing scores.

0.99
('GALAXY', 0.9900265957446809) ('QSO', 0.9596774193548387) ('STAR', 0.9959935897435898)

Confusion Matrix

A confusion matrix C has matrix elements C_(i,j) corresponding to the number of observations known to be in group i but predicted to be in group j. In other words, diagonal elements represent correct predictions, while off-diagonal elements represent mislabelling. We aim to have large diagonal values C_(i,i) of the confusion matrix.
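
As a tiny illustration (a toy example of my own, not drawn from the dataset), with the labels ordered as ['GALAXY', 'QSO', 'STAR'], the element C_(1,0) counts quasars mislabelled as galaxies:

from sklearn.metrics import confusion_matrix

# toy example: two quasars, one of which is mislabelled as a galaxy
y_true_toy = ['GALAXY', 'QSO', 'STAR', 'QSO']
y_pred_toy = ['GALAXY', 'QSO', 'STAR', 'GALAXY']
print(confusion_matrix(y_true_toy, y_pred_toy, labels=['GALAXY', 'QSO', 'STAR']))
# [[1 0 0]
#  [1 1 0]
#  [0 0 1]]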

We take a look at the classification report and confusion matrix.

from sklearn.metrics import classification_report, confusion_matrix

cm = confusion_matrix(y_test, y_pred, labels=sortedlabels)
print(classification_report(y_test, y_pred))
print(cm)

This returns:

              precision    recall  f1-score   support

      GALAXY       0.98      0.99      0.98      1499
         QSO       0.95      0.89      0.92       255
        STAR       0.99      1.00      1.00      1246

   micro avg       0.98      0.98      0.98      3000
   macro avg       0.97      0.96      0.97      3000
weighted avg       0.98      0.98      0.98      3000

[[1481   11    7]
 [  29  226    0]
 [   1    1 1244]]

Visualising the Confusion Matrix

I generally find it useful (and pretty!) to plot the confusion matrix. In the seaborn colour palette I have chosen, the darker colour implies a larger number of entries.

cm = pd.DataFrame(cm, index=sortedlabels, columns=sortedlabels)
sns.set(font_scale=1.2)
sns.heatmap(cm, linewidths=0.5, cmap=sns.light_palette((1, 0.2, 0.6), n_colors=10000), annot=True)
plt.xlabel('Predicted')
plt.ylabel('True')

If you have feedback or constructive criticism, feel free to leave it in the comments below!
