Space Science with Python

Space Science with Python — The Origin of Comets

Part 8 of the tutorial series. Simple data visualisations are sometimes enough to get impressive scientific results

Thomas Albin

Published in

Towards Data Science

9 min readMay 16, 2020

Artistic impression of the so-called Kuiper Belt. A region beyond Neptune, where ice and rocks bodies revolve around the Sun. It is a source of potential comets as well as the so-called Oort’ Cloud that is located way beyond this belt. Credit: ESO/M. Kornmesser; License: *Creative Commons Attribution 4.0 International License*

Preface

This is the 8th part of my Python tutorial series “Space Science with Python”. All codes that are shown here are uploaded on GitHub. Enjoy!

This tutorial’s database has been created in the previous session, but is also uploaded on my GitHub repository.

Introduction

In the last couple of weeks we learned about SPICE, a great NASA toolkit that helps developers, scientists and engineers to work on Space Science related problems: Orbit mechanics, coordinate determination, position calculation and much more (Python wrapper source for SPICE). For now, we had only a look at the tip of the iceberg. More topics will follow where we will also work on spacecraft data!

Last time, we started to download and parse actual science data: the comet data set from the Minor Planet Center (MPC). We set up a small SQLite database and have now some scientific questions to answer:

Where do these comets come from?
Why are there different types (C, P, I Types)?
Have we observed all of them, or is there some kind of bias?
And very important: When does the next Great Comet come?

We will aim each individual question in the next tutorials. Surprisingly, we will not need (yet) any high level machine learning or data science methods to derive deep scientific insights. Science is about acquiring, cleaning and exploring these data and creating a model that needs to be tested. To get familiar with scientific tasks, a qualitative approach is also feasible: describing and interpreting the data.

Source regions

In 1950 a paper was published by the Dutch astronomer Jan Hendrik Oort [1]: The structure of the cloud of comets surrounding the Solar System and a hypothesis concerning its origin. His paper and ideas were similar as Öpik’s previous work in the 1930s [2]: The structure of the cloud of comets surrounding the Solar System and a hypothesis concerning its origin. Both scientists were independently wondering, where the comets come from. Unlike other Solar System bodies like planets or main belt asteroids, comets “ignore” the usual harmony of orbiting the Sun close to the ecliptic plane. They come randomly from anywhere, with large orbits and recurring periods of hundreds or thousands of years! What is the source of these bodies? At the time of doing their research, Öpik and Oort had only a few data points of comets. The applied statistical considerations to postulate a theory that is valid to this day: The Öpik-Oort Cloud; or briefly called: Oort Cloud.

The Oort Cloud is a postulated spherical region around the Sun, ranging from around 1000 AU to over 100,000 AU! The inner parts of the Oort Cloud merge probably with the outer parts of the so-called Kuiper Belt that also contains icy bodies beyond the Neptune orbit. The Kuiper Belt is associated with a broad and wide disk that contains a lot of scattered objects.

Gravitational perturbations or collisions between the bodies in the Oort Cloud can cause an orbit change resulting into a high eccentricity or slightly parabolic paths towards the inner part of the Solar System. E.g., a theoretical orbit reconstruction of the comet C/2014 S3 (PANSTARRS) can be seen below:

Visualisation of the Solar System (Centre) with the Kupier Belt (inner disk) and the Oort Cloud (blue shell / sphere). The red line indicates the computed history of the comet C/2014 S3 (PANSTARRS) that originates from the Oort Cloud (top right: beginning), and moved towards the Sun where planet fly-bys caused a change in the orbital elements. Credit: ESO/L. Calçada, License: *Creative Commons Attribution 4.0 International License*

Deep Dive

Can we “see” some parts of the Kuiper Belt or the postulated Oort Cloud in the data? Is it possible to get an idea of Öpik’s and Oort’s theory? Let’s have a look at the pure orbital elements from actual observed comets. Last time we created an SQLite database that contains all data points. Due to its small size, the database is uploaded on my GitHub repository as well as the following codes.

We start with importing all necessary modules for today’s lesson. This time, we do not need SPICE respectively SpiceyPy. sqlite3 is the only standard library; numpy and pandas are used for data handling and matplotlib is needed for the plotting routines. A new library is imported later for some analysis work.

Part 1/7

A connection to the database is established in line 3 with the sqlite3 connect command. The returned connection object is needed to extract data from the database; either by using the corresponding sqlite3 commands or by using the pandas read_sql command as shown in line 7+8 and 11+12, respectively. Two different dataframes are created: One for all P type comets, the other one for all C type comets. The labels represent a simple classification schema for comets:

P Type: So-called Periodic Comets with an orbit period of less than 200 years. (example: 67P/Churyumov–Gerasimenko
C Type: Comets with an orbit period of more than 200 years (example: C/1995 O1 (Hale-Bopp))

Other types are X Type comets (unknown orbit), A Type comets (an asteroid that has been classified as a comet by mistake), D Type comets (comets that are lost) and I Type comets (bodies of interstellar origin).

Since we want to analyse the spatial distribution of the comets, we extract the aphelion and inclination data for P types (line 7+8) and C types (line 11+12). Since the eccentricity of C types can exceed 1 we extract also the eccentricity for these comets to distinguish between bound and unbound orbits.

Part 2/7

Pandas has some nice and quick statistics functions that can be applied quickly. One of them is describe. The function is applied on a dataframe and returns some general descriptive parameters like the mean, standard deviation, minimum and maximum value as well as the median (50 %) and lower (25 %) and upper (75 %) interquartile range.

We print the results for both datasets in the following part and distinguish between bound (line 8+9) and unbound (line 12+13) C type orbits:

Part 3/7

The results are shown below. We have 627 periodic comets. The median aphelion is at around 5.9 AU (Jupiter orbits the Sun at around 5 AU) and 75 % of these comets have an inclination of less than 19°. So, they populate the inner part of the Solar System, close to the ecliptic plane. This does not confirm the Öpik’s and Oort’s theory, does it?

Descriptive statistics of P comets
       APHELION_AU  INCLINATION_DEG
count   627.000000       627.000000
mean      7.793602        16.525725
std       6.567561        20.943154
min       2.440626         0.234800
25%       5.145018         7.088200
50%       5.929885        11.550400
75%       8.995666        18.868550
max     101.318552       172.527900

So let’s take a look a the C type comets. We have 159 bound orbits and 67 unbound orbits. The median aphelion for bound orbits is at 450 AU and more than 25 % have an aphelion larger than 2000 AU! These are impressive long term orbits. According to the database, one object has an aphelion of over 200,000 AU (please note: the database does not contain measurement errors, this comet could also be unbound!). The inclinations are distributed along a larger range, the median is at 72° and the median inclination for unbound comets is even at almost 100°! It appears that these comets have random inclinations.


Descriptive statistics of C comets with an eccentricity < 1
         APHELION_AU  INCLINATION_DEG  ECCENTRICITY
count     159.000000       159.000000    159.000000
mean     4766.428836        76.079123      0.946816
std     20826.176153        44.847788      0.096976
min        15.260793         3.148100      0.428280
25%        60.233850        40.870450      0.939769
50%       450.794467        72.079400      0.991066
75%      2000.520291       105.572200      0.997811
max    226057.150184       164.245500      0.999979


Descriptive statistics of C comets with an eccentricity >= 1
       APHELION_AU  INCLINATION_DEG  ECCENTRICITY
count          0.0        67.000000     67.000000
mean           NaN        96.280825      1.003104
std            NaN        44.336388      0.006557
min            NaN        11.333000      1.000000
25%            NaN        56.890150      1.000482
50%            NaN        99.442800      1.001626
75%            NaN       128.486550      1.003426
max            NaN       174.620300      1.049508

A picture is worth a thousand words. And this applies also for science! So let’s generate scientific insights by plotting the obtained data. First, we create a scatter plot, where the inclination is plotted vs. the aphelion. The scatter markers get different colours and shapes to distinguish between P and C types. The following code generates a scatter plot that is shown below. Formatting plots and making them “publishable” requires some time and effort as you can see in the miscellaneous formatting commands. Each comment describes the necessity of each line. We plot only the comets with a bound orbit (eccentricity e<1):

Part 4/7

Please note: the x axis (aphelion in AU) is plotted logarithmically. Our first impression from the descriptive statistics appears to be as expected. The P types are concentrated within 10 AU and are moving on orbits close to the ecliptic plane. The C types, however, are scattered randomly up to highly retrograde orbits. What we see are parts of the Kuiper Belt and inner parts of the Oort Cloud as expected.

Scatter plot of the inclination vs. the aphelion of P and C type comets. The aphelion axis is scaled logarithmically and only C types are considered with a bound orbit (eccentricity e<1). Credit: T. Albin

The Oort Cloud is shaped as a sphere that replenishes the inner Solar System constantly with new comets. So, the inclination values of C types should be equally distributed. A scatter plot is a helpful tool to get a first impression, but let’s create a distribution visualisation that can be interpreted more easily.

A common approach to visualise data is a histogram, where discrete data points are summarised in different bins. The width of the bin determines the smoothness of the distribution: A too small bin width causes an oversampling, a too-large bin width causes and undersampling. There are a lot of rules or rules-of-thumb to determine the correct bin width, like to Sturge’s rule, Scott’s rule, Rice’s rule and many others. For the complete inclination range (line 11) we will create a histogram based on the very simple square-root choice.

Part 5/7

There are different ways to create a continuous distribution based on discrete data. Besides histograms one can replace the data points with so-called kernel function (e.g., Gaussian or the Epanechnikov kernel). The overlapping kernels generate the resulting distribution (some examples are provided by scikit-learn). The kernel width optimisation has several implementations and several papers studied miscellaneous methods. E.g. Shimazaki and Shinomoto (2010) developed an adaptive kernel width estimator that computes different widths for different kernels, depending on local density variations.

In our case, we use the scipy function stats.gauss_kde that applies Scott’s rule for the kernel width determination. Line 6 and 7 compute a Kernel Density Estimator (KDE) for the P types and computes the density distribution based along the inclination range. Line 10 and 11 apply the same method for C types (using bound and unbound orbits).

Part 6/7

Now we can plot the histogram and the KDEs for both comet types using the same colours as previously shown. The distributions are normalised for better readability (otherwise the P types would be scaled with over 600 and the C types with only 200 comets).

Part 7/7

The plot shows the distributions vs. the inclination. You can see that the P types are moving closer to the ecliptic plane while the C types inclination distribution is almost equally distributed. Small variations are caused by observational biases and also due to the fact that we have only 200 data points for C types. All in all, one can assume that the large aphelion values and the almost equally distributed inclinations for C types indicate that they appear from any random direction as postulated.

Normalised inclination distribution of P and C types comets. The histograms are based on the square-root choice, overlapped with Gaussian Kernel Density Estimators to reveal the distribution of the comets’ tilted orbits. Credit: T. Albin

Conclusion & Outlook

Scientists gather and share data, freely available to download and to analyse. We used, and will use, free software solutions to derive scientific insights. Who could have known that the theory of a spherical comet region is still valid after 70 years? We did not apply any high level machine learning algorithms or methods. Simple data exploration and visualisation were feasible enough to get a feeling for Öpik’s and Oort’s theory.

But there is more to discover. Why are P type comets so “concentrated”? And since we discovered so many comets, are there more to discover?

We will answer these topics in the next sessions, where we will see that Jupiter is a major influence factor and that observational bias effects can cause sleepless nights.

Thomas

References

[1] Oort, J. “The structure of the cloud of comets surrounding the Solar System and a hypothesis concerning its origin”. Bulletin of the Astronomical Institutes of the Netherlands. 11, 1950, pp. 91–110. 1950BAN….11…91O. (Open Access Link)

[2] Öpik, E. “Note on Stellar Perturbations of Nearly Parabolic Orbits.” Proceedings of the American Academy of Arts and Sciences, vol. 67, no. 6, 1932, pp. 169–183. www.jstor.org/stable/20022899.

[3 ] H. Shimazaki and S. Shinomoto, “Kernel Bandwidth Optimization in Spike Rate Estimation,” in Journal of Computational Neuroscience 29(1–2): 171–182, 2010 http://dx.doi.org/10.1007/s10827-009-0180-4.