
Earthquake & Popular Venues Data Analysis of Turkey

Which sector should I invest in which safe city?


1. Introduction

1.1. Description & Discussion of the Background

Turkey is an earthquake-prone country with 82 million inhabitants and 81 provinces. It is surrounded by the North Anatolia, Western Anatolia and Southeastern Anatolia earthquake zones. The diversity of landforms in Turkey is the result of earth movements that have shaped the terrain of the region for thousands of years.

It has extinct volcanoes and earthquakes still occur frequently. In the north and east of the country, there are large fault lines that cause earthquakes today. The great Marmara earthquake that occurred on the North Anatolian Fault Line in 1999 caused the death of thousands of people. [1]

Although Turkey is a country of earthquakes, it is also a tourist destination. Millions of tourists from all over the world visit every year, and some foreign nationals buy real estate in Turkey and do business there. [2]

1.2. Problem

A data analysis based on earthquake statistics and the popular venues of each city should interest both investors and consumers. Investors can plan projects in places with lower earthquake risk and choose business types that are less crowded in those places. Ordinary residents may want to buy property and live in places that are less dangerous and, at the same time, close to social venues.

Therefore, a clustering study based on each city’s earthquake statistics over a reasonable period (for example, 100 years) can identify the regions with high and low earthquake risk. In addition, comparing the popular venues in these regions can reveal how the regions differ from one another.


2. Data

2.1. Data Sources

The data sources used in this study to solve the problem are as follows:

  • From the AFAD Earthquake Catalog page, statistics for earthquakes that occurred between 1920–2020 with magnitudes between 4–10 on the Richter scale were downloaded in csv format using various filters. [3]
  • OpenStreetMap provided the GeoJSON data needed for the maps showing the earthquake distribution and the clustered groups. [4]
  • The Foursquare API provided the category, latitude and longitude of the most popular venues in each city. [5]
  • The list of Turkey’s cities was taken from Wikipedia. The latitude and longitude of each city were obtained by querying them one by one with the Python GeoPy Library. [6]

2.2. Data Exploring and Cleaning

The earthquake data scraped from AFAD contains the date, latitude, longitude, magnitude and depth of each earthquake. Data downloaded or scraped from multiple sources were combined into one table. A total of 5707 records measured between 01.01.1920 and 01.01.2020 with a magnitude greater than 4.0 were obtained. Some variables in the csv file that had missing information were left out of the data set because they were redundant.

Figure 1. AFAD Earthquake Catalog Page. (Image by Author)

While retrieving the AFAD earthquake data, the "Rectangular Search Type" was used, so some of the records came from locations outside the borders of Turkey. Since the earthquake statistics do not include the names of the cities where the earthquakes occurred, the city of each record was obtained with the GeoPy Library using reverse geocoding. Records of earthquakes that occurred outside Turkey were removed from the master data set. Thus, a total of 4364 rows remained in the master data.
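
The filtering step described above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the row layout, the function names and the stub lookup are hypothetical, and the GeoPy helper assumes the Nominatim geocoder, which the article does not name.

```python
from typing import Callable, Optional


def make_nominatim_lookup(user_agent: str = "earthquake-demo") -> Callable[[float, float], Optional[str]]:
    """Build a lookup that returns the country code of a coordinate.

    Assumption: GeoPy's Nominatim geocoder is used. It needs network access
    and is rate limited, so real runs should pause between requests.
    """
    from geopy.geocoders import Nominatim  # third-party, imported lazily

    geolocator = Nominatim(user_agent=user_agent)

    def lookup(lat: float, lon: float) -> Optional[str]:
        location = geolocator.reverse((lat, lon), language="en")
        if location is None:  # nothing found, e.g. a point in open sea
            return None
        return location.raw.get("address", {}).get("country_code")

    return lookup


def keep_rows_in_country(rows, lookup, country_code="tr"):
    """Keep only rows whose epicenter reverse-geocodes to `country_code`."""
    return [row for row in rows if lookup(row["lat"], row["lon"]) == country_code]


def stub_lookup(lat, lon):
    """Offline stand-in for the geocoder, based on a rough bounding box."""
    return "tr" if 36.0 <= lat <= 42.0 and 26.0 <= lon <= 45.0 else "other"


# Hypothetical rows, not taken from the AFAD catalog.
sample = [
    {"lat": 40.7, "lon": 29.9, "magnitude": 7.4},
    {"lat": 35.2, "lon": 25.1, "magnitude": 5.1},
]
inside = keep_rows_in_country(sample, stub_lookup)
```

Passing the lookup as a parameter keeps the filter testable offline; in a real run, `make_nominatim_lookup()` would replace `stub_lookup`.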

Figure 2. The first edited version of earthquake statistics taken from AFAD website. (Image by Author)

According to the data set obtained, it can be seen by looking at the Magnitude Histogram Chart that the majority of the earthquakes that occurred were between 4 and 4.78 in magnitude. The histogram graph is informative about the distribution of earthquakes and is useful for getting to know the structure of the data set.

Figure 3. Histogram of Earthquake Magnitude. (Image by Author)

In addition, the standard deviation, minimum and maximum values of the data set were examined to get a first impression of the data. The table for this review is shown below.

Figure 4. Statistical information of the first data set created. (Image by Author)
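
A summary table like the one in Figure 4 can be approximated with the standard library alone. The sketch below uses hypothetical values rather than the AFAD records, and the helper name `summarize` is an assumption made for illustration. Libraries such as pandas report a similar table through `DataFrame.describe()`.

```python
import statistics


def summarize(values):
    """Return count, mean, sample standard deviation, min and max."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical magnitudes and depths (km), not taken from the AFAD catalog.
magnitudes = [4.0, 4.2, 4.5, 4.1, 5.0, 4.3]
depths = [5.0, 10.0, 7.0, 30.0, 12.0, 8.0]

magnitude_summary = summarize(magnitudes)
depth_summary = summarize(depths)
```

A smaller standard deviation for one variable than for another, as with Magnitude and Depth here, simply means its values sit closer to their mean.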

Looking at Figure 4, the Magnitude standard deviation of ~0.47 shows that the Magnitude values are closer to each other than the values of the other variables. The Magnitude Histogram Chart in Figure 3 also supports this idea. The Depth Histogram Chart below, which shows the distribution of Depth values, can also be examined.

Figure 5. Histogram of Earthquake Depth. (Image by Author)

When the minimum and maximum values of the features are compared, their ranges are generally not far apart. For this reason, it was concluded that no standardization or normalization was needed.

2.3. Dealing with Missing Data

It was found that there were no statistics for some cities in the first data set. This means that no earthquake with a magnitude greater than 4.0 has been recorded in those cities in 100 years. These cities are Kırklareli, Rize, Bartın and Kilis. The Magnitude and Depth variables were filled with reasonable values only for these four cities so that their rows did not appear as NaN in the data set.

The absence of any recorded earthquake greater than 4.0 in these cities over 100 years does not mean that no earthquake occurred. Earthquakes smaller than 4.0, for which we have no record, may have occurred. Even if no earthquake occurred within the boundaries of these cities, earthquakes in neighboring cities can affect them. So it makes no sense to fill the NaN Magnitude values with zero.

However, some Depth values were found to be 0.0 in the statistics obtained from AFAD. Considering that this situation may be natural, the values of Max Depth and Min Depth to be derived below are accepted as 0.0 for these provinces. Max Magnitude values were accepted as 3.0 and Min Magnitude values as 1.0. Based on this, the Avg Magnitude value was accepted as 2.5.
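
The fill rule above can be written as a small helper. The placeholder numbers are the ones stated in the text; the function name, the column names and the Izmir row are hypothetical and only serve the illustration.

```python
# Placeholder values stated in the text for cities without any record.
PLACEHOLDER = {
    "avg_magnitude": 2.5,
    "max_magnitude": 3.0,
    "min_magnitude": 1.0,
    "max_depth": 0.0,
    "min_depth": 0.0,
}


def fill_missing_cities(features, all_cities, placeholder=PLACEHOLDER):
    """Return a copy of `features` with a placeholder row for each absent city."""
    filled = {city: dict(values) for city, values in features.items()}
    for city in all_cities:
        filled.setdefault(city, dict(placeholder))
    return filled


# Hypothetical feature row for one city that does have records.
features = {
    "Izmir": {
        "avg_magnitude": 4.4,
        "max_magnitude": 6.9,
        "min_magnitude": 4.0,
        "max_depth": 40.0,
        "min_depth": 0.0,
    }
}
cities = ["Izmir", "Kırklareli", "Rize", "Bartın", "Kilis"]
filled = fill_missing_cities(features, cities)
```

Cities that already have records keep their own values; only the absent ones receive the placeholder row.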

2.4. Feature Extraction

The average, maximum and minimum features for each city were generated by feature extraction. These attributes were added to the first data set edited. The new attributes are derived based on the Magnitude and Depth variables. The _EqLatitude and _EqLongitude attributes were then removed from the data set. Finally, the Master data set was created with the addition of the data table with the latitude and longitude information of 81 cities to the data set.
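
The per-city aggregation can be sketched as below. The row layout, the column names and the sample values are assumptions made for illustration; the article does not show the code it used.

```python
from collections import defaultdict
from statistics import mean


def city_features(rows):
    """Aggregate per-earthquake rows into avg/max/min features per city."""
    grouped = defaultdict(lambda: {"magnitude": [], "depth": []})
    for row in rows:
        grouped[row["city"]]["magnitude"].append(row["magnitude"])
        grouped[row["city"]]["depth"].append(row["depth"])

    stats = (("avg", mean), ("max", max), ("min", min))
    return {
        city: {
            f"{name}_{column}": func(values)
            for column, values in columns.items()
            for name, func in stats
        }
        for city, columns in grouped.items()
    }


# Hypothetical rows; the values are illustrative, not AFAD records.
rows = [
    {"city": "Van", "magnitude": 5.0, "depth": 10.0},
    {"city": "Van", "magnitude": 7.0, "depth": 20.0},
    {"city": "Izmir", "magnitude": 4.5, "depth": 8.0},
]
features = city_features(rows)
```

Each city ends up with six derived values, three for Magnitude and three for Depth, which can then be joined with the city coordinates.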

Figure 6. Master Data. (Image by Author)

3. Methodology

Using the Python Folium Library, the geographical details of 4364 earthquake points were visualized for initial insights. It was a useful visualization, especially in terms of seeing in which regions earthquakes occurred and gaining an idea about the earthquake distribution.

Figure 7. Earthquakes that occurred between the years of 1920–2020 in Turkey. (Image by Author)

The Foursquare API was used to discover popular venues in cities. For each city, the query was limited to 100 popular venues within a 20-kilometer-wide area. In some cities, a narrower search returned no popular venue data. For this reason, a 20-kilometer-wide search was used in order to obtain at least 10 rows from each city. A total of 6581 rows of data were obtained.

Figure 8. Location data obtained by Foursquare API query. (Image by Author)

A summary table was created for the popular places identified in the cities. The pivot table shows the total number of venues returned by the Foursquare API for each city. The chart of this table is as follows.

Figure 9. Total number of Popular Venues in each City. (Image by Author)

When Figure 9 was examined, it was seen that 100 results were returned for many large cities. Provinces such as Istanbul, Yalova, Izmir, Bursa, Trabzon, Adana, Mersin, Diyarbakır and Antalya appear rich in popular places, while provinces such as Ağrı, Tunceli, Hakkâri, Siirt and Erzurum remained below 20 results.

Of course, this chart does not include all popular places in the provinces, because the search covered only a 20-kilometer distance around a single latitude-longitude pair representing each city. Given the area of some cities, this may be a narrow search.

There may also be many locations that were not detected or considered popular by Foursquare. More popular location information can be obtained by performing more detailed searches with more latitude and longitude information about a city.

When the data obtained with the Foursquare API was summarized, it was seen that a total of 347 types of popular venues belonging to different categories were identified. A new data table has been created showing the 10 most common venues in each province.
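
Ranking the most common venue categories per city amounts to counting and sorting. The sketch below assumes the venue data is available as (city, category) pairs; the function name and the sample rows are hypothetical.

```python
from collections import Counter


def top_categories(venues, n=10):
    """Map each city to its `n` most frequent venue categories."""
    counts = {}
    for city, category in venues:
        counts.setdefault(city, Counter())[category] += 1
    return {
        city: [category for category, _ in counter.most_common(n)]
        for city, counter in counts.items()
    }


# Hypothetical (city, category) rows, not actual Foursquare results.
venues = [
    ("Ankara", "Café"), ("Ankara", "Café"), ("Ankara", "Café"),
    ("Ankara", "Hotel"), ("Ankara", "Hotel"), ("Ankara", "Park"),
    ("Rize", "Tea Room"), ("Rize", "Tea Room"), ("Rize", "Café"),
]
top_two = top_categories(venues, n=2)
```

With `n=10` the same function yields a table in the spirit of Figure 10, one ranked list per city.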

Figure 10. Top Ten Most Common Venues. (Image by Author)

After the popular venues were determined, one-hot encoding was applied to the 347 venue categories, the City attribute was removed from the Master Data, and clustering was performed with the K-Means algorithm. The K-Means algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. K-Means is one of the most common clustering methods in unsupervised learning.
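
One-hot encoding turns each venue into a vector with a single 1 in the position of its category. To obtain one row per city, the sketch below averages these vectors per city; that aggregation is an assumption, since the article only states that the final table has one row per city. The function name and sample rows are hypothetical.

```python
def one_hot_city_profile(venues):
    """Return the sorted category list and, per city, the share of each category."""
    categories = sorted({category for _, category in venues})
    index = {category: i for i, category in enumerate(categories)}

    totals = {}
    counts = {}
    for city, category in venues:
        row = totals.setdefault(city, [0] * len(categories))
        row[index[category]] += 1  # same as adding the venue's one-hot vector
        counts[city] = counts.get(city, 0) + 1

    profiles = {
        city: [value / counts[city] for value in row]
        for city, row in totals.items()
    }
    return categories, profiles


# Hypothetical (city, category) rows, not actual Foursquare results.
venues = [
    ("Ankara", "Café"), ("Ankara", "Café"), ("Ankara", "Hotel"), ("Ankara", "Park"),
    ("Rize", "Tea Room"), ("Rize", "Café"),
]
categories, profiles = one_hot_city_profile(venues)
```

Each profile sums to 1, so cities with many venues and cities with few venues become comparable before clustering.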

Figure 11. A summary of the dataset used for the K-Means algorithm. (Image by Author)

Figure 11 contains a summary of the data set (81 rows x 355 columns) used in the K-Means clustering algorithm. Several distance metrics are available for measuring the distance between points. The sqeuclidean (squared Euclidean) metric was chosen in this study because it made the elbow break point easier to see. To determine the optimum number of clusters, clustering was repeated with increasing values of the cluster count (k) and the results were analyzed. Based on the Elbow Method, the optimum value was determined to be K = 3. This can also be seen in the graphic below.

Figure 12. Optimal K number with The Elbow Method. (Image by Author)
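
The Elbow Method can be illustrated with a small pure-Python K-Means. This is a toy sketch, not the study's implementation: the 2-D points are hypothetical, the deterministic farthest-point seeding replaces the random initialization that libraries normally use, and picking the elbow by the largest second difference is only one possible heuristic (the article read the elbow from the chart).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    add the point farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans_inertia(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centers = farthest_point_init(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:  # assignments no longer change
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def elbow_k(inertias):
    """Pick the k where the drop in inertia slows the most.
    Assumes the keys of `inertias` are consecutive integers."""
    ks = sorted(inertias)
    return max(
        ks[1:-1],
        key=lambda k: (inertias[k - 1] - inertias[k]) - (inertias[k] - inertias[k + 1]),
    )


# Three well-separated hypothetical groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 10), (5, 11), (6, 10), (6, 11),
]
inertias = {k: kmeans_inertia(points, k) for k in range(1, 6)}
best_k = elbow_k(inertias)
```

On this toy data the inertia falls sharply up to three clusters and barely changes afterwards, which is the same pattern the elbow chart is meant to reveal.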

After the clustering study, a new data table was created giving the clusters found, city names, avg. magnitude of earthquakes occurring in cities, and the 10 most common venues in each city.

Figure 13. Most Common Venues and Cluster Numbers. (Image by Author)

The 3 clusters detected were examined in terms of earthquake risk and labeled LOW, MEDIUM and HIGH. The risk level of each cluster was determined by calculating the average earthquake magnitude of all cities in that cluster.

The cluster of cities with the highest earthquake risk was labeled HIGH, and the cluster with the lowest risk was labeled LOW. A new data table was created consisting of the total number of cities in each cluster, the colors representing the clusters on charts and maps, and the cluster labels.
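
The labeling rule can be sketched as ranking the clusters by the mean of their cities' average magnitudes. The function name, the city names and all values below are hypothetical.

```python
from statistics import mean


def label_clusters(avg_magnitude_by_city, cluster_by_city, labels=("LOW", "MEDIUM", "HIGH")):
    """Rank clusters by mean city magnitude and assign labels in ascending order."""
    by_cluster = {}
    for city, cluster in cluster_by_city.items():
        by_cluster.setdefault(cluster, []).append(avg_magnitude_by_city[city])
    ranked = sorted(by_cluster, key=lambda cluster: mean(by_cluster[cluster]))
    return dict(zip(ranked, labels))


# Hypothetical cities, magnitudes and K-Means cluster ids.
avg_magnitude_by_city = {"city_a": 4.1, "city_b": 4.2, "city_c": 4.6, "city_d": 4.7, "city_e": 4.4}
cluster_by_city = {"city_a": 2, "city_b": 2, "city_c": 0, "city_d": 0, "city_e": 1}

cluster_labels = label_clusters(avg_magnitude_by_city, cluster_by_city)
```

Because K-Means assigns arbitrary cluster ids, deriving the labels from the data in this way avoids hard-coding which id means which risk level.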

Figure 14. Label, color and count information of the detected clusters. (Image by Author)

When the number of cities in each cluster is compared, Cluster HIGH ranks first with 46.9%. This result shows that nearly half of the cities in the country are at high risk of earthquakes.

On the other hand, the number of cities in Cluster LOW, where the earthquake risk is low, is too large to overlook. Investors who want to limit their exposure to earthquake risk can focus on the provinces in this cluster.

Cluster MEDIUM, where the earthquake risk is at a medium level, appears to be the smallest cluster in terms of the total number of cities. Below you can examine the pie chart containing the city numbers of the clusters.

Figure 15. Count of the cities in clusters. (Image by Author)
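
The shares in the pie chart follow directly from the city counts reported for each cluster in this article (38, 10 and 33 out of 81). A short check of the arithmetic:

```python
# City counts per cluster as reported in the cluster sections of this article.
cluster_sizes = {"HIGH": 38, "MEDIUM": 10, "LOW": 33}

total = sum(cluster_sizes.values())
shares = {label: round(100 * count / total, 1) for label, count in cluster_sizes.items()}
# shares == {"HIGH": 46.9, "MEDIUM": 12.3, "LOW": 40.7}
```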

4. Results Analysis

The results so far can guide ordinary people who want to live in a safe city or base their real estate purchases on earthquake safety. To offer guidance to investors, the clusters need to be examined separately and in more detail.

4.1. Cluster: "HIGH"

Figure 16. Pivot Table of Cluster HIGH. (Image by Author)

Cluster HIGH is the cluster of cities with the greatest earthquake risk. 38 of Turkey’s 81 cities are in this cluster, making it the largest cluster by number of cities. Most of its cities are located inland, far from the sea and in mountainous terrain, although some of them have a coastline.

Most of the cities that have a coast in the cluster are neighbors of the Aegean Sea. There are also two cities in the cluster, neighboring Marmara, Black Sea and Mediterranean.

Figure 17. Pivot Table of 1st Common Venues in Cluster HIGH. (Image by Author)

The table given in Figure 17 was obtained when a comparison was made in terms of "the first most common places" among the cities in the cluster.

According to this table, "Café" is the first most common venue in 24 of the 38 cities in this cluster, followed by "Hotel" and "Turkish Restaurant". Investors can expect strong competition in these sectors, so it would be wiser to turn to less competitive ones.

Almost all of the cities in this cluster are located on a fault line. Those who want to buy real estate in particular can instead choose a city from Cluster MEDIUM or Cluster LOW, which are farther from the fault lines.

In addition, since this is the cluster with the highest earthquake risk, it may be preferable to invest in a lower-cost type of business that is less affected by earthquakes. For high-cost investments, cities in clusters with less earthquake risk may be preferred.

When the 1st to 10th most common venues of all cities in this cluster are counted together, the top 20 venues are as shown in the graph below.

Figure 18. Top 20 Chart of All Most Common Venues in Cluster HIGH. (Image by Author)

According to the graphic in Figure 18, "Seafood Restaurant" appears as a less competitive place than other sectors. An investment can be made in this sector, which can be utilized especially in the seaside cities of this cluster.

In addition, as an alternative to "Dessert Shop", which ranks seventh, the "Bakery" sector serves a similar market and faces less competition according to the graph, so investing in it can be considered a good opportunity.

Some of the cities in this cluster are located in the inner parts of the country where the continental climate is experienced and in the mountainous regions with very cold winters. On the other hand, looking at the chart, it is seen that there are many businesses related to beverages such as tea and coffee.

Rather than entering this highly competitive area, a "Breakfast Spot", which serves tea and coffee along with breakfast and faces less competition, can be a good investment option.

4.2. Cluster: "MEDIUM"

Figure 19. Pivot Table of Cluster MEDIUM. (Image by Author)

This cluster, which ranks second in terms of earthquake hazard, consists of 10 cities. Its cities usually neighbor cities from Cluster HIGH. In this respect, it can resemble Cluster HIGH in terms of landforms, fault lines passing through cities and earthquake risk. Except for Muğla, Antalya and Mersin, none of the cities in the cluster has a seafront. The cities in Eastern Anatolia are located in mountainous, high-altitude regions.

When a comparison is made between the cities in the cluster in terms of "first most common places", the following table is obtained.

Figure 20. Pivot Table of 1st Common Venues in Cluster MEDIUM. (Image by Author)

As can be seen from this table, "Café" is the leading sector in the first two clusters, so competition is very high. When the cities in Cluster MEDIUM are examined, the "Mountain" sector, which is common in some provinces but has less competition, may be suitable for investment.

For further comment on this issue, it would be useful to look at the top 20 venues list, which includes all common venues in cities. Below you can see the graph that gives the sum of all common places from 1 to 10.

Figure 21. Top 20 Chart of All Most Common Venues in Cluster MEDIUM. (Image by Author)

Cities with a coastline in this cluster can be considered for investment, especially in tourism. As with Cluster HIGH, we can look at industries that are lower on the list and therefore less competitive.

Depending on the development of tourism in the country, the "Hotel" sector, which ranks 4th in the list of top 20 places, and the "Scenic Lookout", which seems less competitive in cities with high altitude, may be a good option for investment.

Cities by the sea can be attractive in terms of real estate. For this purpose, however, it would be more appropriate to choose a seafront city from Cluster LOW, which has the least earthquake risk.

4.3. Cluster: "LOW"

Figure 22. Pivot Table of Cluster LOW. (Image by Author)

This is the cluster with the least earthquake risk. It consists of 33 cities in total and ranks second among the clusters in terms of number of cities. Almost all of its cities with a coastline are adjacent to the Black Sea. With the exception of Kocaeli and Sakarya, all of Turkey’s cities bordering the Black Sea are in this cluster.

Cities in this cluster may be preferred especially for real estate investment. Those who want to buy a house by the sea and from a place with the least earthquake risk should evaluate the cities in this cluster and the real estate opportunities in these cities in detail.

In terms of population density, the cluster includes the most populous cities in the country, such as Istanbul and Ankara. The dense population and the lower earthquake risk compared with other provinces make these cities attractive for investment.

The "first most common places" in cities are shown in the table below.

Figure 23. Pivot Table of 1st Common Venues in Cluster LOW. (Image by Author)

In the cities in this cluster, investment opportunities are higher than in the provinces in other clusters. The abundance of cities on the seaside, fewer earthquakes, and a wide variety of sectors to invest in make the cities in Cluster LOW more attractive.

Therefore, it would be a better decision to choose a sector that is different and open to development, instead of a sector like "Café" that is very common and competitive in every cluster. The image below shows the top 20 venue list of the cities in Cluster LOW.

Figure 24. Top 20 Chart of All Most Common Venues in Cluster LOW. (Image by Author)

The "Dessert Shop" and "Bakery" investment idea we mentioned for Cluster HIGH before is also valid in this cluster. When Figure 24 is examined, it will be seen that these sectors can be alternatives to each other since they are in similar areas.

Fishing has developed in cities with a coastline to the Black Sea. Therefore, "Seafood Restaurant", which is less competitive, can be a good investment. It is also a good business idea to supply fish for Seafood Restaurants.

"Garden" and "Farm" can be evaluated as similar, less competitive business types, suitable for joint or separate investments.

When the tables and graphs of all clusters are examined, the "Mountain" sector holds a notable position in each cluster. It can therefore be considered a good investment opportunity in high-altitude cities in all clusters.

Again, when all of the clusters are considered together, it is possible to see that the "Farm", "Park" and "Plaza" sectors, whose competition is at medium levels, have a large percentage in total.

4.4. Cluster Maps

All detected clusters were visualized as points on a map of Turkey using the Python Folium Library. The map below shows the visual version of the explanations given above about the clusters.

Figure 25. The visualization of cluster point in the map of Turkey. (Image by Author)

The dots shown in Red on the map above belong to Cluster HIGH, which has the highest earthquake risk. Orange dots represent MEDIUM and Green dots represent Cluster LOW.

In addition to this map, a Choropleth Map colored according to the average earthquake magnitude was also created. Both the average earthquake magnitude and the clusters are shown together on the Choropleth Map.

In addition, clicking on a point opens a popup with the name of the relevant city, its cluster and its 100-year average earthquake magnitude.

Figure 26. The visualization of Cluster Points and Average Earthquake Magnitude distribution in the map of Turkey. (Image by Author)
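
The cluster points and their popups can be drawn with Folium roughly as follows. This is a sketch under stated assumptions: the row layout, the function names, the map center and the marker size are hypothetical, and the sample magnitude is illustrative. Only the color scheme (red, orange, green) comes from the article.

```python
# Colors per cluster label, as described for the cluster map.
CLUSTER_COLORS = {"HIGH": "red", "MEDIUM": "orange", "LOW": "green"}


def popup_text(city, label, avg_magnitude):
    """Text shown when a city's point is clicked."""
    return f"{city} | Cluster {label} | Avg magnitude: {avg_magnitude:.2f}"


def build_cluster_map(rows, center=(39.0, 35.0), zoom_start=6):
    """Draw one colored circle per city; `center` is a rough midpoint of Turkey."""
    import folium  # third-party, imported lazily so the helpers work without it

    fmap = folium.Map(location=list(center), zoom_start=zoom_start)
    for row in rows:
        folium.CircleMarker(
            location=[row["lat"], row["lon"]],
            radius=6,
            color=CLUSTER_COLORS[row["label"]],
            fill=True,
            popup=popup_text(row["city"], row["label"], row["avg_magnitude"]),
        ).add_to(fmap)
    return fmap


# Hypothetical row; coordinates are approximate and the magnitude is illustrative.
rows = [{"city": "Ankara", "lat": 39.93, "lon": 32.86, "label": "LOW", "avg_magnitude": 4.3}]
example_popup = popup_text("Ankara", "LOW", 4.3)
```

Calling `build_cluster_map(rows).save("clusters.html")` would write an interactive map to a local file, where the file name is again only an example.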

5. Discussion

The first aim of this study is to help business people and ordinary citizens in their investment decisions and social lives for the future by analyzing popular venues in cities with earthquake data.

Although the statistical data used in the project represented only part of each city, it was treated as representing the whole city. For example, an earthquake that occurred in only one part of a city was considered to have occurred in the whole city. The city-based study could therefore be repeated at the district or neighborhood level. In addition, more data could be obtained from sources other than the Foursquare API, which would make sharper and different results possible.

Another aim of the project is to make inferences by examining old earthquakes and to produce information that we can benefit from.

For example, if there have been too many earthquakes in a city and the magnitude of the earthquake is high, it is likely that an earthquake will occur in that city again due to fault activity. However, this acceptance does not lead to the conclusion that places where there have been no earthquakes for many years are safe. The approaches used in this study can be handled with different data and different methods in the future, and different results can be reached.

When the red points in Figure 25 are examined, it will be seen that they are compatible with The Seismic Hazard Map of Turkey. [7]


6. Conclusion

As a result, this project has helped to better understand the characteristics of cities in terms of earthquakes and social venues. The project has supported not only the investors but also the city managers or planners in their decisions. It also played a guiding role for all kinds of researchers using data analysis types in areas similar to the one in this study.

All data sets, Python codes and image files used in the study are stored in the relevant GitHub account. [8] Thus, it is made ready to be used for different projects in the future.


Zeki ÇIPLAK
Physicist, Data & Computer Science Enthusiast.


7. References


Stay with the Science…

