Analysis of crimes in Mexico during 2017 with Machine Learning techniques (Cluster Analysis): Comparing the Elbow Method and the Silhouette Method

Using Machine Learning to identify problematic regions in Mexico.

Isaac Arroyo
Towards Data Science

--

Final Division of Mexican states by their “Problematic features” (Folium)

Authors:

Arroyo-Velázquez Isaac; Camacho-Perez Enrique

Looking at the big picture

Crime has always been a delicate topic of great interest in Mexico. Some people believe that there is a relationship between the number and type of crimes and the country's regions. In the following work, an analysis is made with data from the National Survey of Victimization and Perception of Public Security (ENVIPE) 2018, which aims to estimate the number of crimes committed during 2017, in order to compare the results with the Mexico Peace Index (IPM) 2018¹.

Data Acquisition

The data to be used are part of the National Survey of Victimization and Perception of Public Security (ENVIPE) 2018²; we also add data on the number of homicides in each state³.

All these data are collected and grouped in a CSV file named CrimesMX2017 with the following content:

  • ENTITY_CODE : ID Number.
  • STATE : State’s name.
  • ID: Official name’s abbreviation.
  • HOMICIDES*: The act of one human killing another.
  • CAR_THEFT*: Total or partial theft of vehicle.
  • EXTORTION* : Intimidation to perform an act to the detriment of your patrimony.
  • STREET_TRANSPORT_THEFT*: Robbery/Theft or assault on the street or public transportation.
  • HOME_THEFT*: Home theft.
  • FRAUD*: Delivery of money for a product or service that was not received as agreed.
  • POPULATION: Total number of inhabitants in the entity⁴.
  • URBAN_PP : Percentage of urban population⁴.

* Crime prevalence rate by state per hundred thousand inhabitants

Exploratory Data Analysis

Libraries used for the EDA
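The original embeds the EDA code as a gist that is not reproduced here. Below is a minimal sketch of the setup, assuming the CSV file name `CrimesMX2017` from the article; since that file is not included, a small inline frame with the same columns is used for illustration.

```python
import pandas as pd

# In the article the data comes from a CSV file:
# df = pd.read_csv("CrimesMX2017.csv")
# Small inline sample with a few of the same columns, for illustration only:
df = pd.DataFrame({
    "STATE": ["Yucatán", "Jalisco", "Sonora"],
    "HOMICIDES": [2.1, 15.3, 29.7],
    "CAR_THEFT": [30.2, 210.5, 310.8],
})

# Statistical summary (count, mean, std, quartiles) per numeric column
summary = df.describe()
print(summary.loc["mean"])
```

`describe()` produces the kind of central-statistics table shown in the next section.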

General aspects

A statistical summary of the data is shown below:

Central statistics

It can be seen that the highest averages of number of crimes per hundred thousand inhabitants are:

  1. CAR_THEFT
  2. STREET_TRANSPORT_THEFT
  3. EXTORTION

Data distribution

Figure 1: Distribution of the variables (Pandas)

Bar charts

An interesting analysis is to show the different distributions of variables by state:

Figure 2: Distribution of crimes in the states (Matplotlib)
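A per-state bar chart like Figure 2 can be sketched as follows; the state names and rates here are a hypothetical subset, not the article's full data set.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical subset of the data, for illustration only
df = pd.DataFrame({
    "STATE": ["Yucatán", "Jalisco", "Sonora", "Chihuahua"],
    "CAR_THEFT": [30.2, 210.5, 310.8, 180.4],
})

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(df["STATE"], df["CAR_THEFT"], color="steelblue")
ax.set_ylabel("Rate per 100,000 inhabitants")
ax.set_title("CAR_THEFT by state")
fig.tight_layout()
```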

Correlation of the variables

For this analysis only the most important variables will be taken, these are:

  1. CAR_THEFT
  2. STREET_TRANSPORT_THEFT
  3. EXTORTION
  4. HOMICIDES

The correlation between these four variables will be shown below for a better perspective of the problem:

Figure 3: Pairplot made to show the correlation of the variables (Seaborn)

Sometimes it is better to see the numbers in a different way:

Figure 4: Correlation Matrix. Here we can observe in a better way the strongest and weakest correlations (Seaborn)

It is observed that the highest correlations are:

  • HOMICIDES — STREET_TRANSPORT_THEFT
  • CAR_THEFT — STREET_TRANSPORT_THEFT
  • HOMICIDES — EXTORTION
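The correlation matrix behind Figure 4 can be computed directly with pandas; this sketch uses random stand-in data since the real rates are not reproduced here. The article renders the matrix with Seaborn's heatmap.

```python
import numpy as np
import pandas as pd

# Random stand-in for the 32 states x 4 crime-rate variables
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.uniform(0, 300, size=(32, 4)),
    columns=["CAR_THEFT", "STREET_TRANSPORT_THEFT", "EXTORTION", "HOMICIDES"],
)

corr = df.corr()  # Pearson correlation by default
# The article visualizes this with:
# import seaborn as sns; sns.heatmap(corr, annot=True)
```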

Data pre-processing: Standardization

An essential step for cluster analysis is standardization. In this process, the variables are re-scaled so that they carry the same weight when the similarity between data points is computed. Standardization is necessary when the variables have different scales.

One way to perform this task is to model each variable's distribution as a normal distribution with mean μ and standard deviation σ, and to standardize with the following formula:

Equation 1: z = (x − μ) / σ
Data pre-processing
Standardized data
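The standardization formula z = (x − μ)/σ can be applied per column with NumPy; the matrix below is a hypothetical stand-in for the crime-rate data. scikit-learn's `StandardScaler` performs the same transformation.

```python
import numpy as np

# Hypothetical crime-rate matrix: rows = states, columns = variables
X = np.array([
    [2.1, 30.2],
    [15.3, 210.5],
    [29.7, 310.8],
])

# z = (x - mu) / sigma, applied column by column
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_std = (X - mu) / sigma

# Equivalent with scikit-learn:
# from sklearn.preprocessing import StandardScaler
# X_std = StandardScaler().fit_transform(X)
```

After this step every column has mean 0 and standard deviation 1, so no variable dominates the distance computations.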

K-Means Algorithm and Cluster Analysis

from sklearn.cluster import KMeans

What is it? (a summary)

The K-Means algorithm seeks to find K clusters in a data set. These clusters have to be as far apart from each other as possible while keeping their own elements as close together as possible.

Figure 5: Representation of the K-Means algorithm
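A minimal usage of scikit-learn's `KMeans`, on two artificial well-separated groups of points rather than the article's data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated hypothetical groups of 2-D points
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(20, 2)),
])

# Fit K-Means with K=2; n_init set explicitly for reproducibility
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_              # cluster assignment of each point
centroids = km.cluster_centers_  # one centroid per cluster
```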

The problem

Cluster analysis is ideal for finding patterns, segmenting clients and, in our case, finding similarities between states. However, the question is always the same: what value of K makes the number of clusters optimal?

The options

There are two ways to find the optimal number of clusters:

  • The Elbow Method: this approach plots the values of the WCSS (Within-Cluster Sum of Squares), and the point where the value falls most sharply from the previous one is selected.
  • The Silhouette Method: in this procedure, the Silhouette Coefficient is plotted, and the maximum value is selected.

The Elbow Method

The Elbow Method is the most popular way to find the optimal number of clusters for a given data set. It employs the Within-Cluster Sum of Squares (WCSS) parameter, which is computed from each group's centroid* location.

* Centroid: the geometric point at the mean position of the cluster's data points.

How does the Elbow Method Work ?

The Within-Cluster Sum of Squares (WCSS) is the sum of the squared distances from each data point to the centroid of the cluster it belongs to.

Different numbers of clusters give different WCSS values. If the WCSS associated with each possible number of clusters is plotted, something similar to the following is obtained:

Figure 6: Plot made by me (Matplotlib)

The WCSS decays as the number of clusters increases, and the drop in WCSS from one number of clusters to the next diminishes. What is the optimal number of clusters? The point after which the decrease is no longer significant.

Looking at the image above, there is a significant difference when the number of clusters is 4; after that number, the difference between clusters is attenuated or not significant. In this example, the optimal number of clusters is 4.

This method is called the Elbow Method because the point where the decrease of the WCSS starts to become less significant makes the graph look like a semi-flexed arm, and the point that gives us the optimal number of clusters is the elbow.

Figure 7: Plot made by me (Matplotlib)

Application of the Elbow Method

Necessary code to plot WCSS parameter against number of clusters
Figure 8: WCSS against Number of clusters (Matplotlib)
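The plotting code is embedded as a gist in the original; a sketch of the usual approach is shown below, with random data standing in for the standardized crime matrix. `KMeans` exposes the WCSS of a fitted model as `inertia_`.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Random stand-in for the standardized crime data (32 states x 4 variables)
rng = np.random.default_rng(0)
X_std = rng.normal(size=(32, 4))

wcss = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_std)
    wcss.append(km.inertia_)  # inertia_ is the WCSS for this K

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("WCSS")
```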

The optimal number of clusters is not precise in this case.

Silhouette Method

As observed, the optimal number of clusters is not always clear with the Elbow Method; however, there is another option: the Silhouette Method.

What does the Silhouette Method do?

In order to use the Silhouette Method, it is necessary to obtain the silhouette score. This parameter indicates how appropriate a point's classification is for the cluster in which it is located. The mathematical equation is:

Equation 2: s = (b − a) / max(a, b)

Where:

  • a is the mean distance from the point to all other points in its own cluster.
  • b is the mean distance from the point to the points in the nearest neighbouring cluster.

Each cluster's silhouette score is the average of the silhouette scores of all points in that cluster, and the silhouette score of the model is the average of the clusters' values.

The ideal case is when a is so small that (we can say) it tends to zero and b is so large that (we can say) it tends to infinity; then the model's classification is excellent:

Equation 3: a → 0, b → ∞ ⟹ s → 1

This means that the worst-case scenario is when a >> b, and the model's classification is the worst:

Equation 4: a >> b ⟹ s → −1

In summary: the closer the score gets to one, the better. In this method we look for the maximum value of the silhouette score to find the optimal number of clusters.

Application of the Silhouette Method

Necessary code to plot the silhouette coefficient against number of clusters
Figure 9: Silhouette Coefficient against Number of clusters (Matplotlib). Max. Silhouette Score: 0.331725. Optimal number of clusters: 5
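The original silhouette code is also a gist; a sketch of the standard loop with scikit-learn's `silhouette_score` follows, using well-separated synthetic blobs as a stand-in for the crime data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated data as a stand-in for the standardized crime data
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The optimal K is the one with the maximum silhouette score
best_k = max(scores, key=scores.get)
```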

Complementation: The Elbow Method + The Silhouette Method

The Elbow Method is more of a decision rule, while the Silhouette is a metric used for validation while clustering. Thus, it can be used in combination with the Elbow Method.

Therefore, the Elbow Method and the Silhouette Method are not alternatives to each other for finding the optimal K. Rather they are tools to be used together for a more confident decision⁵ .

Below, the combination of both methods is applied to the crime problem in Mexico.

Figure 10: Using both methods brings us better results. The optimal number of clusters is 5

Final result: Optimal number of clusters = 5

In this case it can be seen that the optimal number of clusters is 5, and K-Means can be applied with K = 5 to separate the data into clusters.
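The final fit can be sketched as follows; since the real standardized matrix is not included here, synthetic data with five groups stands in for it, and attaching the labels back to the data frame is shown as a comment.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in for the standardized crime data (32 states x 4 variables)
X_std, _ = make_blobs(n_samples=32, centers=5, cluster_std=0.4, random_state=1)

# Fit the final model with the optimal K found above
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_std)
clusters = km.labels_
# Attach the assignment back to the original frame, e.g.:
# df["CLUSTER"] = clusters
```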

Analysis of the output

Distribution of the variables

The following graph shows the relationship between each pair of the selected variables, with each cluster in a different colour.

Figure 11: Pairplot of the clustered data (Seaborn)

Correlation of variables: Homicides, Car theft and Robbery/Theft or assault on the street or Public Transport

The variables with the highest correlations (HOMICIDES, CAR_THEFT and STREET_TRANSPORT_THEFT) are plotted together, making it much easier to appreciate how the clusters are separated.

Figure 12: 3D Scatter plot (Matplotlib)

Division of Mexican states

The geographical location of each cluster is shown on a map created with the Folium library⁶, identifying each state by the cluster colour assigned by the K-Means algorithm.

Figure 13: Division of the Mexican states (Folium)

Discussion

It is possible to notice a geographical relationship in the way the states were grouped, so here are some comments on each cluster and the states it covers:

  • Cluster 1:
    States: Aguascalientes, Baja California, Nuevo León and Sonora
    Comment: These four states are located in the north of Mexico, and their main characteristic is to have high rates of car theft .
  • Cluster 2:
    States: Chihuahua, Guanajuato, Guerrero, Michoacán and Sinaloa
    Comment: This cluster has the characteristic of having the highest homicide rates. The states that belong to the cluster are located in the North and West of Mexico.
  • Cluster 3:
    States: Baja California Sur, Campeche, Coahuila, Colima, Chiapas, Durango, Hidalgo, Oaxaca, Quintana Roo, Tabasco, Tamaulipas, Veracruz and Yucatán.
    Comment: This cluster contains the highest number of states, and in general they have the lowest indices of the four parameters that are being considered. Most of the states of this cluster are found in the Southeast of Mexico.
  • Cluster 4:
    States: Mexico City and State of Mexico
    Comment: These two entities behave very similarly across the parameters, so the algorithm places them in a single cluster; their main characteristic is a high rate of theft on the street and public transport.
  • Cluster 5:
    States: Jalisco, Morelos, Nayarit, Puebla, Querétaro, San Luis Potosí, Tlaxcala and Zacatecas
    Comment: In this cluster the states are geographically very close, they are located in the western region of Mexico and have the characteristic of having high rates of extortion.

Conclusions

Unsupervised learning can help show patterns or similarities that are not easily observed, especially when many variables are involved.

When the K-Means algorithm is used to perform cluster analysis, the Elbow Method is a great tool for choosing the number of clusters, although sometimes the optimal number it suggests is not precise. The Silhouette Method is then used to find the maximum silhouette score (either visually or with software support). As mentioned earlier, the two methods are not alternatives to each other for finding the optimal K; they are tools to be used together for a more confident decision.

With this exercise, it was observed that Machine Learning tools can be applied to studies from other areas such as the Social Sciences.

Final Remarks

The results obtained have some similarity with those of the Mexico Peace Index 2018, found on the IMCO (Instituto Mexicano para la Competitividad A.C.) website and published on April 13, 2018¹ ⁷.

GitHub Repository

References

[1] IEP Mexico. El Índice de Paz México. https://www.indicedepazmexico.org/la-paz-en-mexico

[2] INEGI. Encuesta Nacional de Victimización y Percepción sobre Seguridad Pública (ENVIPE) 2018. https://www.inegi.org.mx/programas/envipe/2018/default.html#Tabulados

[3] El Universal. Rompe 2017 récord en asesinatos, con 31 mil 174. https://www.eluniversal.com.mx/nacion/seguridad/inegi-homicidios-en-mexico-registran-record-en-2017

[4] SEMARNAT, Población rural y urbana , https://apps1.semarnat.gob.mx:8443/dgeia/compendio_2016/archivos/04_demografia/D1_DEMOGRAF01_02_D.pdf

[5] How to Determine the Optimal K for K-Means?, https://medium.com/analytics-vidhya/how-to-determine-the-optimal-k-for-k-means-708505d204eb

[6] Folium documentation, https://python-visualization.github.io/folium/

[7] Institute for Economics and Peace, El Índice de Paz México 2018, http://visionofhumanity.org/app/uploads/2018/04/Mexico-Peace-Index-2018-Spanish.pdf
