
Earthquake Time-series Forecasts using a Hybrid Clustering-LSTM Approach – Part I: EDA

Testing the method on the last 50 years of California's historic earthquakes

The notorious 1906 San Francisco earthquake struck the Bay Area on April 18, 1906, leaving more than 3,000 dead. The earthquake had a moment magnitude of 7.9 [[Image source](https://en.wikipedia.org/wiki/1906_San_Francisco_earthquake), [license](https://catalog.archives.gov/id/524396)].

This blog post is the first in a series describing the details of this time-series analysis.

0. Foreword

Given our current technical limitations, deterministic prediction of earthquakes remains an unattainable claim. The limitation has two main origins: 1 – our knowledge of the subsurface fault structure is extremely limited; 2 – the seismometers recording seismic data have intrinsic sensitivity limitations. Thanks to advances in seismological research, the future looks bright as our physical understanding of earthquakes improves. Still, the occurrence of some earthquakes has gone unnoticed, such as the 2004 Parkfield earthquake, which had been expected to occur much earlier, in 1988. Another way to look at the earthquake forecast problem is through more advanced statistics and AI tools. As a data scientist, I took this different approach to gain a better statistical understanding of the earthquake phenomenon; so the term "forecast" here refers only to statistical understanding and by no means to deterministic earthquake prediction. Knowing this, let's now dive into the analysis.

1. Introduction

Over the last century alone, earthquakes were responsible for millions of deaths. Beyond the threat to life, the enormous economic and infrastructure damage that follows can leave long-lasting wounds on societies. For example, the state of California, situated near the Pacific-North American plate boundary, has seen many devastating earthquakes, some of which caused a great number of casualties and large economic losses (see a brief list in my GitHub repository documentation).

While these earthquakes occur, we sometimes overlook the wealth of earthquake data being continuously recorded by the Global Seismographic Network (GSN). These seismic data, recorded across a wide variety of earthquakes, fault systems, geographical locations, and sensors, provide a dataset diverse enough to examine earthquakes from a data science point of view. Besides, data science tools can be put to a real test in assessing such a crucial societal hazard. Some earthquake events have gone unforeseen by current physical earthquake models; the hope is that data-driven earthquake analysis will provide additional insight into the systematics of earthquake occurrence. Heavily populated residential areas could greatly benefit from an earthquake early-warning system (for example, check out ShakeAlert), especially one based on powerful AI tools.

2. Difficulties with Forecasting Earthquake Time-series

Figure 1. Earthquake sequence surrounding the magnitude 6.9 Loma Prieta earthquake of 1989. Note the occurrence of the main earthquake without prior large precursory events (Image by author).

Figure 1 shows the sequence of foreshocks (events before the main earthquake), the main magnitude 6.9 earthquake, and the aftershocks (events following it) during the 1989 Loma Prieta earthquake (63 deaths, $10 billion in damage). The main earthquake occurred almost without any large precursory signals. This complicates earthquake forecasting for more conventional time-series models such as autoregressive ones.

3. Problem Statement

The San Andreas Fault extends nearly 700 miles NW-SE along the California coast. This fault is responsible for the notorious 1906 San Francisco earthquake, as well as the later Loma Prieta and Parkfield earthquakes. But it is not the only fault system: earthquakes occur on many nearby faults, which may even interact in complex ways. Are there any systematics in the occurrence of these earthquakes? And if there are, can they help us discern how smaller earthquakes grow, leading to larger earthquakes? The analysis here is data-based rather than physics-based. This means that any earthquake, of any magnitude and at any distance, carries the same prior weight of importance in predicting the major earthquake; there are no prior assumptions. Only the statistical analysis decides on each earthquake's weight of importance.

4. Model Workflow

The model workflow comprises the following consecutive steps: 1 – data collection; 2 – data cleaning; 3 – exploratory data analysis (EDA); 4 – time-series modeling; 5 – recommendations. For full access to the code for this project, I invite you to see the repository on my GitHub profile. The majority of the code was written in Python, using the Pandas, scikit-learn, and TensorFlow libraries.

5. Data Collection and Cleaning

Figure 2: San Andreas fault geometry, where the notorious 1906 San Francisco, 1989 Loma Prieta, and 2004 Parkfield earthquakes occurred [Image from Richard et al., 2014, Source, License].

I used the USGS API to pull earthquake data for the last 50 years in and around the state of California. The API is publicly available thanks to the USGS. To build an inclusive enough catalog, I downloaded earthquakes from a wide longitude and latitude range of (-133, -107) and (24, 50) degrees, respectively (well beyond California). In this blog post, I mainly explore the EDA of the magnitude 6.9 Loma Prieta (1989) earthquake data. This earthquake occurred on a locked segment of the San Andreas Fault (see Figure 2). A function along the following lines was used to query the USGS earthquake API:
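The original function lives in the GitHub repository; below is a minimal sketch of what such a query can look like, assuming the public USGS FDSN event endpoint. The function name and parameters are my own illustration, and since the service caps each response (at roughly 20,000 events), a full 50-year pull would loop this over shorter time windows:

```python
import requests

USGS_URL = "https://earthquake.usgs.gov/fdsnws/event/1/query"

def fetch_quakes(start, end, min_mag=2.0):
    """Pull one time window of events as GeoJSON from the USGS API."""
    params = {
        "format": "geojson",
        "starttime": start,                        # e.g. "1989-09-18"
        "endtime": end,                            # e.g. "1989-10-18"
        "minlatitude": 24, "maxlatitude": 50,      # bounding box well beyond California
        "minlongitude": -133, "maxlongitude": -107,
        "minmagnitude": min_mag,
    }
    resp = requests.get(USGS_URL, params=params, timeout=60)
    resp.raise_for_status()
    return resp.json()["features"]                 # list of nested event dicts
```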

The imported JSON files were heavily nested; hence, during data cleaning, I "denested" the JSON, transformed it into dataframes, fixed the column datatypes, imputed the NaN values, and finally concatenated everything into one workable global dataframe. For a full description of the data cleaning, visit my GitHub profile. Finally, I indexed the dataframe by timestamps to obtain a time-series dataframe:
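A minimal sketch of that flattening and indexing step, assuming the GeoJSON layout the USGS API returns (the selected columns are illustrative; the full cleaning, dtype fixes, and imputation live in the repository):

```python
import pandas as pd

def to_timeseries(features):
    """Flatten nested GeoJSON events into a time-indexed DataFrame."""
    df = pd.json_normalize(features)               # "denests" properties.* and geometry.*
    # geometry.coordinates holds [longitude, latitude, depth_km]
    coords = pd.DataFrame(df["geometry.coordinates"].tolist(),
                          columns=["longitude", "latitude", "depth"])
    out = coords.assign(mag=df["properties.mag"])
    # properties.time is a Unix epoch in milliseconds
    out.index = pd.to_datetime(df["properties.time"], unit="ms")
    return out.sort_index()

# e.g. one month of events around the Loma Prieta earthquake:
ts = to_timeseries(fetch_quakes("1989-09-18", "1989-10-18"))
```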

This time-series is irregular, meaning the time intervals are not uniform, as expected from sensor data with limited sensitivity to events; some consecutive earthquakes are 8 milliseconds apart, and some nearly one day. This complicates the time-series modeling, which I will cover in a future blog post.

6. EDA – 50 years of earthquake data

The downloaded data contained a wealth of information, including event timing, moment magnitudes, (human-evaluated) intensity, and the inverted three-dimensional locations of the earthquakes. To keep the data size manageable, I considered only earthquakes with moment magnitudes larger than 2 as the more important ones.

Note that moment magnitude is on a logarithmic scale, meaning that a magnitude 6 earthquake is, roughly speaking, 10 times bigger than a magnitude 5 earthquake in ground-motion amplitude (and about 32 times bigger in released energy).

An initial look at the temporal evolution of the magnitudes revealed a sudden spike in the time-series, without much early notice, as shown in Figure 1.

Figure 3: Total earthquake data for the past 50 years. There were 85 earthquakes larger than magnitude 6 (gray line in top panel). The moment magnitude distribution is also shown as a histogram; it closely follows the Gutenberg-Richter distribution (Image by author).

Figure 3 shows the total dataset (nearly 313k downloaded earthquake records). There were 85 earthquakes with magnitudes larger than 6, which I took as the threshold for a destructive earthquake (Britannica). The destructiveness of an earthquake is a more complex function of factors well beyond moment magnitude, such as source depth, duration, and source mechanism; here, however, I simply take moment magnitude 6 as the threshold for destructiveness. The distribution of earthquake magnitudes closely followed the famous Gutenberg-Richter power-law distribution (log10(N) = a − bM, where N and M are frequency and magnitude, respectively, and a and b are constants). Since moment magnitude is on a logarithmic scale, the linear decline in the histogram of Figure 3 indicates this distribution. I will not go into the details of this distribution, as that is beyond the scope of this blog post. One important point: although this distribution describes the global statistics of earthquake occurrence, relying only on it for hazard analysis is very unsafe, for the following reason:

"Imagine we have one large, say magnitude 7, earthquake which did not follow this distribution (as in the case of 2004 Parkfield earthquake). It is only one datapoint and will not disturb the power-law distribution, as it is only one data point among thousands! But, it is a deadly data point to miss on!

Hence, more detailed statistical analysis will be enormously helpful for a better understanding of the earthquake phenomenon.
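As an aside, a rough Gutenberg-Richter fit takes only a few lines. The sketch below is my own illustration (not repository code), assuming `mags` is an array of catalog magnitudes; b-values near 1 are typical for tectonic regions.

```python
import numpy as np

def gutenberg_richter_fit(mags, bin_width=0.1):
    """Fit log10(N) = a - b*M, where N counts events with magnitude >= M."""
    m_grid = np.arange(mags.min(), mags.max(), bin_width)
    counts = np.array([(mags >= m).sum() for m in m_grid])   # cumulative counts N(M)
    keep = counts > 0
    slope, intercept = np.polyfit(m_grid[keep], np.log10(counts[keep]), 1)
    return intercept, -slope                                 # a, b (fitted slope is -b)

a, b = gutenberg_richter_fit(ts["mag"].to_numpy())
```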

What happened 30 days prior to the Loma Prieta Earthquake?

Figure 4: Earthquake map for the 30 days leading up to the main Loma Prieta earthquake (map created with the folium library) (Image by author).

To understand what led to the main earthquake, we need to trace back the events preceding the main shock. As Figure 4 shows, there was massive activity in the 30 days before the main event. Events tended to cluster along a NW-SE orientation, mainly on the upper and lower locked regions of the San Andreas Fault (see also Figure 2). Ultimately, the upper locked region ruptured in the magnitude 6.9 earthquake. With this initial assessment, we can now look for any location systematics.
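A map like Figure 4 can be sketched with folium in a few lines (my own illustration, with the approximate Loma Prieta epicenter hard-coded as the map center):

```python
import folium

m = folium.Map(location=[37.04, -121.88], zoom_start=8)   # approx. Loma Prieta epicenter
window = ts["1989-09-18":"1989-10-17"]                     # the 30 days before the main shock
for t, row in window.iterrows():
    folium.CircleMarker(
        location=[row["latitude"], row["longitude"]],
        radius=row["mag"],                                 # scale marker size by magnitude
        fill=True,
    ).add_to(m)
m.save("loma_prieta_30days.html")
```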

Correlation between location and magnitude

Figure 5: 2D histograms of the entire dataset. Top: there is a strong NW-SE lineation in longitude-latitude. Bottom: at a depth of ~10 km, there appears to be a clustering of major earthquakes (Image by author).

To better understand the dataset, I assessed the correlations between the variables in the dataframe. The 2D histograms (Figure 5) showed that the majority of earthquakes loosely align along a NW-SE lineation. Also, the magnitude-depth distribution suggested that major earthquakes occur at a depth of nearly 10 km. This suggests we can use clustering techniques to engineer features and then see whether they help with the forecast problem.
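For readers who want to reproduce this kind of view, the 2D histograms take only a couple of lines with matplotlib (an illustrative sketch; the blog's figures come from the repository code):

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.hist2d(ts["longitude"], ts["latitude"], bins=200, cmap="viridis")
ax1.set(xlabel="Longitude", ylabel="Latitude", title="Epicenter lineation (NW-SE)")
ax2.hist2d(ts["mag"], ts["depth"], bins=100, cmap="viridis")
ax2.set(xlabel="Moment magnitude", ylabel="Depth (km)", title="Magnitude vs. depth")
plt.tight_layout()
plt.show()
```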

Feature Engineering: Spatial Clustering

Figure 6: Translation and transformation with time as clusters of earthquakes evolve. We can use these changes as features for earthquake forecasts (Image by author).

Figure 6 shows the schematic concept behind the features. As the figure is self-descriptive, I will not go through a full explanation. I used the DBSCAN algorithm to cluster events in windows of 20 consecutive earthquakes, beginning from the first datapoint, using longitude, latitude, and depth for 3D clustering. We can define three features from the clustering (see the sketch after this list):

1 – Event density (or spatial size of the cluster).

2 – Aspect ratio (3D shape of the cluster).

3 – Cluster mass-center translation.

[The DBSCAN hyperparameters were eps=0.2 and min_samples=5.]
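A minimal sketch of how such windowed clustering features could be computed with scikit-learn. The column and feature names are illustrative, and I omit the coordinate scaling a real pipeline would need (degrees and kilometers are not directly comparable):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def window_cluster_features(df, window=20, eps=0.2, min_samples=5):
    """Cluster each window of consecutive events and derive spatial features."""
    cols = ["longitude", "latitude", "depth"]
    feats, prev_center = [], None
    for start in range(0, len(df) - window + 1, window):
        xyz = df[cols].iloc[start:start + window].to_numpy()
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(xyz)
        core = xyz[labels != -1] if (labels != -1).any() else xyz  # drop noise points
        center, stds = core.mean(axis=0), core.std(axis=0)
        feats.append({
            "long_std": stds[0], "lat_std": stds[1], "dept_std": stds[2],  # aspect ratio
            "density": len(core) / (stds.prod() + 1e-9),  # crude events-per-volume proxy
            "center_shift": 0.0 if prev_center is None
                            else float(np.linalg.norm(center - prev_center)),
        })
        prev_center = center
    return feats
```

The per-window dictionaries would then be joined onto the averaged dataframe described in the next section.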

Feature importance

For an initial assessment of the developed features, I filtered out only the section of the time-series before the main Loma Prieta earthquake. I averaged the data over every 20 rows and added the features derived from the spatial clustering to the original dataframe columns.

Finally, I added a target label: the time remaining until the main Loma Prieta earthquake (time_to_failure).
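A minimal sketch of that windowed averaging and labeling, assuming the time-indexed `ts` dataframe from above (the origin time is the cataloged Loma Prieta time in UTC; exact column handling is illustrative):

```python
import numpy as np
import pandas as pd

main_shock = pd.Timestamp("1989-10-18 00:04:15")   # Loma Prieta origin time (UTC)
pre = ts[ts.index < main_shock]                    # keep only the pre-event record

# Average every 20 consecutive rows into one sample ...
n = (len(pre) // 20) * 20                          # drop the incomplete last window
agg = pre.iloc[:n].groupby(np.arange(n) // 20).mean(numeric_only=True)
agg.index = pre.index[19:n:20]                     # stamp each window by its last event

# ... and label each sample with the remaining time to the main shock (in seconds)
agg["time_to_failure"] = (main_shock - agg.index).total_seconds()
```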

Below is the dataframe structure:

Dataframe for the Loma Prieta earthquake. The target label is the time_to_failure column. See GitHub repository for full description of columns.

A simple regression analysis was run for a preliminary assessment of the features. Note that this regression is only exploratory, and the results need to be cross-validated on other, distinct test earthquake datasets as a future effort. The correlation map below (Figure 7) shows that the features are in fact strong indicators of the time to the main earthquake.

Figure 7: Correlation between features and the target label (time_to_failure). As can be seen, these are strong indicators of the time to failure. See GitHub repository for full description of columns (Image by author).
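A minimal sketch of this exploratory step, assuming the `agg` dataframe from above; standardizing the features first makes the linear-regression coefficients comparable across features (the basis of Figure 8):

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X = agg.drop(columns="time_to_failure")
y = agg["time_to_failure"]

print(X.corrwith(y).sort_values())     # feature-target correlations (cf. Figure 7)

# Coefficients on standardized features act as rough importances (cf. Figure 8)
model = LinearRegression().fit(StandardScaler().fit_transform(X), y)
for name, coef in sorted(zip(X.columns, model.coef_), key=lambda t: -abs(t[1])):
    print(f"{name}: {coef:+.3f}")
```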

The feature importance plot in Figure 8 has interesting implications. The dept_std, long_std, and lat_std columns together indicate the aspect ratio of the cluster; they suggest that, closer to failure, the cluster tends to elongate more vertically and longitudinally and less latitudinally. These were the strongest features. Density is negatively correlated, meaning the cluster expands before failure. Another interesting aspect is the mean magnitude of the earthquakes: inspecting the dataset for a few other earthquakes also showed this trend, namely that before the main earthquake happens, the mean magnitude of events actually declines. In other words, the fault becomes more seismically quiet. This is a curious observation and needs further study.

Figure 8: Feature importance based on linear regression for the Loma Prieta earthquake. Features are discussed in the text. See GitHub repository for full description of columns (Image by author).

Future work

This is the beginning of my journey into the time-series analysis of this dataset. At this point, I would like to finish this blog post with a few remarks and leave the more detailed descriptions for future posts. Here, I performed an initial EDA and suggested a few engineered features based on spatial clustering. The features seem to be descriptive of the time to the main earthquake.

In the next post, I will use the developed features, specifically the clustering features, to perform more advanced time-series analysis. Based on early evaluation, I found LSTMs to be an appropriate method for my purposes. Two aspects in particular drew my interest towards LSTM sequence modeling: 1 – the irregular nature of my time-series; 2 – the shock-like nature of earthquakes, without many apparent precursors.

STAY TUNED…


Thanks for reading my blog post. Interested in reading more? Please check out my LinkedIn and GitHub profiles.

