Suicide in the 21st Century (Part 1)

Published in

Towards Data Science

9 min read · Jun 20, 2019

Suicide is the leading cause of death among young people aged 20–34 years in the UK, with three times as many men taking their own lives compared to women. Alarmingly, recent statistics show that just 27% of suicide victims between 2005 and 2015 were classified as patient suicides, i.e. the individual had been in contact with mental health services in the year before their death.

In Part 1 of this post, we will do some basic analysis of suicide rates across the world, and in Part 2, we will dive deeper into some machine learning, with K-means in particular.

Let’s get started!

The dataset used for this project was taken from Kaggle, posted as ‘Suicide rates overview 1985 to 2016’ in the form of a large .csv file. The dataset contains 27820 rows across 12 columns: ‘country’, ‘year’, ‘sex’, ‘age’, ‘suicides_no’, ‘population’, ‘suicides/100k pop’, ‘country-year’, ‘HDI for year’, ‘gdp_for_year ($)’, ‘gdp_per_capita ($)’, ‘generation’.

Data Preprocessing

Data preprocessing is one of the most crucial steps in the data mining process, dealing with the preparation and transformation of the initial dataset. Although the dataset itself is relatively clean, it is still necessary to remove any redundant data that is not needed for this analysis.

The full data preprocessing steps are available in my GitHub repository for this project; here, we will go over a few of the most important ones.

Firstly, it is useful to check whether any data is missing, as running analysis on data frames with missing values is very likely to cause errors. As the null counts below show, the column HDIForYear (the dataset's ‘HDI for year’ column; the code in this post uses renamed column names) has many null entries. There are many methods for filling in missing data, but most of them assume that most of the rows of that column are filled. In this case, 19456 of the total 27820 rows are missing data for this attribute, so it is best to drop the column altogether.

df.isnull().sum()

df.drop('HDIForYear', axis=1, inplace = True)
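The decision to drop the column can also be made from the share of missing values rather than the raw count (19456 of 27820 rows is roughly 70%). The following is a minimal sketch on a made-up toy frame; the 0.5 cut-off is a hypothetical choice for illustration, not a rule from the original analysis.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names follow the article.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C"],
    "SuicidesNo": [10, 12, 7, 9, 4],
    "HDIForYear": [np.nan, 0.7, np.nan, np.nan, np.nan],
})

# Share of missing values per column (0.0 to 1.0).
missing_share = toy.isnull().mean()

# Hypothetical cut-off: drop any column that is more than half empty.
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()
cleaned = toy.drop(columns=to_drop)

print(missing_share)
print("Dropped:", to_drop)
```

Reassigning the result of `drop` (instead of using `inplace=True`) keeps the original frame available for comparison.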

Before starting the analysis, it is important to make sure the data frame only contains data that is relevant to it. As this analysis will focus on young male suicide in the 21st century, we wish to remove any rows where Gender is ‘female’, any rows where Year is earlier than 2000, and any rows outside the age range we wish to analyze (15–34 years). These filters are simple to apply in Pandas, as outlined below.

df = df[df.Year >= 2000]
df = df[df.Gender == 'male']
criteria_1 = df['Age'] == '15-24 years'
criteria_2 = df['Age'] == '25-34 years'
criteria_all = criteria_1 | criteria_2
df = df[criteria_all]
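The same three filters can also be combined into a single boolean mask. This is only an alternative sketch on made-up rows, using the renamed column names from the snippets above; `isin` replaces the two separate age criteria.

```python
import pandas as pd

# Toy rows with made-up values, using the article's renamed columns.
toy = pd.DataFrame({
    "Year": [1999, 2005, 2010, 2010, 2012],
    "Gender": ["male", "male", "female", "male", "male"],
    "Age": ["15-24 years", "25-34 years", "15-24 years",
            "35-54 years", "15-24 years"],
})

age_groups = ["15-24 years", "25-34 years"]

# One mask: 21st century AND male AND in one of the two age groups.
mask = (
    (toy["Year"] >= 2000)
    & (toy["Gender"] == "male")
    & toy["Age"].isin(age_groups)
)
young_men = toy[mask]
print(young_men)
```

The parentheses around each comparison matter, because `&` binds more tightly than `>=` and `==` in Python.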

Now that the dataset is closer to what we need for analysis, the data frame can be grouped by year and summed to show the number of suicides per year. Running this shows that the year 2016 has very little data: its totals are more than 10 times lower than those of previous years, which cannot be right. Therefore, the 2016 data is removed for accuracy reasons.

#create new data frame grouped by year to check
yearlyStats = df.groupby('Year').sum()
yearlyStats

df = df[df.Year != 2016]
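One way to spot an incomplete year like this without reading the totals by eye is to compare the number of rows per year. The sketch below uses made-up numbers, and the "less than half the median row count" rule is a hypothetical heuristic, not the criterion used in the original analysis.

```python
import pandas as pd

# Toy data with made-up values: 2016 has far fewer rows than other years.
toy = pd.DataFrame({
    "Year": [2014] * 4 + [2015] * 4 + [2016] * 1,
    "SuicidesNo": [100, 120, 90, 110, 105, 115, 95, 100, 30],
})

# Row count and total suicides per year, side by side.
per_year = toy.groupby("Year")["SuicidesNo"].agg(rows="count", total="sum")

# Hypothetical heuristic: flag years with under half the median row count.
cutoff = 0.5 * per_year["rows"].median()
sparse_years = per_year.index[per_year["rows"] < cutoff].tolist()

print(per_year)
print("Sparse years:", sparse_years)
```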

Removing this data only removes 32 rows, showing how incomplete the 2016 data was. Using the same groupby method, it is simple to show the data by country, so we can get a picture of which countries have very little reported data. The following countries were removed, as any analysis on them would be inaccurate: Antigua, Barbados, Grenada, Maldives, Montenegro, Saint Vincent.

Data Aggregation

We know that the dataset has a Country column, but what if we wish to run analysis on larger groups, such as continents? This can again be accomplished in Python/Pandas in three fairly simple steps.

1. Create continent arrays and assign countries to them, according to the United Nations Statistics Division
2. Move these to a dictionary
3. Use the map function in Pandas to map continents to the countries

Note that Step 1 can be skipped and the countries put straight into a dictionary, but moving them to an array first makes it easier in the future, for example if a country were to be added to the dataset.

#create lists of countries per continent
europe = ['Albania', 'Austria', 'Azerbaijan', 'Belarus', 'Belgium', 'Bosnia and Herzegovina', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Georgia', 'Germany', 'Greece', 'Hungary', 'Iceland', 'Ireland', 'Italy', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Montenegro', 'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'Russian Federation', 'San Marino', 'Serbia', 'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'Ukraine', 'United Kingdom']
asia = ['Armenia', 'Bahrain', 'Israel', 'Japan', 'Kazakhstan', 'Kuwait', 'Kyrgyzstan', 'Macau', 'Maldives', 'Mongolia', 'Oman', 'Philippines', 'Qatar', 'Republic of Korea', 'Singapore', 'Sri Lanka', 'Thailand', 'Turkey', 'Turkmenistan', 'United Arab Emirates', 'Uzbekistan']
northamerica = ['Antigua and Barbuda', 'Bahamas', 'Barbados', 'Belize', 'Canada', 'Costa Rica', 'Cuba', 'Dominica', 'El Salvador', 'Grenada', 'Guatemala', 'Jamaica', 'Mexico', 'Nicaragua', 'Panama', 'Puerto Rico', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Vincent and Grenadines', 'United States']
southamerica = ['Argentina', 'Aruba', 'Brazil', 'Chile', 'Colombia', 'Ecuador', 'Guyana', 'Paraguay', 'Suriname', 'Trinidad and Tobago', 'Uruguay']
africa = ['Cabo Verde', 'Mauritius', 'Seychelles', 'South Africa']
australiaoceania = ['Australia', 'Fiji', 'Kiribati', 'New Zealand']

#move these to a dictionary of continents
continents = {country: 'Asia' for country in asia}
continents.update({country: 'Europe' for country in europe})
continents.update({country: 'Africa' for country in africa})
continents.update({country: 'North_America' for country in northamerica})
continents.update({country: 'South_America' for country in southamerica})
continents.update({country: 'Australia_Oceania' for country in australiaoceania})

Then we can simply map the continents to our countries:

df['Continent'] = df['Country'].map(continents)
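One thing to watch with `map` is that any country missing from the dictionary silently becomes NaN in the new column. A quick check along the following lines can catch spelling mismatches; the shortened lists and the country name 'Atlantis' are made up for illustration.

```python
import pandas as pd

# Shortened, illustrative lists; the real ones are much longer.
europe = ["Albania", "Austria"]
asia = ["Japan"]

continents = {country: "Europe" for country in europe}
continents.update({country: "Asia" for country in asia})

# 'Atlantis' is a made-up name standing in for an unlisted country.
toy = pd.DataFrame({"Country": ["Albania", "Japan", "Atlantis", "Austria"]})
toy["Continent"] = toy["Country"].map(continents)

# Countries that did not match any dictionary key end up as NaN.
unmapped = sorted(toy.loc[toy["Continent"].isna(), "Country"].unique())
print("Unmapped countries:", unmapped)
```

An empty list here means every country in the data frame was assigned a continent.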

Now that the data has been preprocessed, the data frame has gone from 27820 rows and 12 columns to 2668 rows and 10 columns, and is now ready to be analysed.

Exploratory Data Analysis (EDA)

First of all, let’s define a nice colour palette for our plots.

import matplotlib.pyplot as plt
import seaborn as sns

flatui = ["#6cdae7", "#fd3a4a", "#ffaa1d", "#ff23e5", "#34495e", "#2ecc71"]
sns.set_palette(flatui)
sns.palplot(sns.color_palette())

We will start with some basic plots that show interesting data in graphical form. By grouping the data by Year and doing a .sum(), we are able to create a temporary data frame with the total number of suicides by year globally. Taking this frame and applying Matplotlib code with Seaborn aesthetics allows us to show the global number of suicides per year, whilst also plotting an average line across.

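#temporary data frame: total suicides per year (numeric columns only)
data_per_year = df.groupby('Year').sum(numeric_only=True)
data_per_year['SuicidesNo'].plot()
plt.title('Total No. of Suicides per Year: 2000 To 2015', fontsize = 22)
#dashed horizontal line at the average yearly total
plt.axhline(y=52720, color='black', linestyle='--')
plt.ylabel('No. Suicides', fontsize = 20)
plt.xlabel('Year', fontsize = 20)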
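The average line above is drawn at a hard-coded value. As a small sketch, the same value could instead be computed from the grouped frame, so that it stays correct if the filters change. The numbers below are made up and only the column names follow the article.

```python
import pandas as pd

# Toy data with made-up values.
toy = pd.DataFrame({
    "Year": [2000, 2000, 2001, 2001, 2002],
    "SuicidesNo": [10, 20, 15, 25, 30],
})

# Total suicides per year, then the mean of those yearly totals.
data_per_year = toy.groupby("Year")[["SuicidesNo"]].sum()
yearly_mean = data_per_year["SuicidesNo"].mean()

print(data_per_year)
print("Average yearly total:", yearly_mean)
# The computed value can then replace the constant, e.g.
# plt.axhline(y=yearly_mean, color='black', linestyle='--')
```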

Here we can see that there is a downward trend: the total number of suicides recorded per year is falling. It could be speculated that this is because of increasing awareness, funding, etc., but this is something that can be explored more deeply later.

Next, we can show the mean number of suicides per 100k population per year, by continent, by using a bar chart in Matplotlib. A new data frame is created grouping by continent, using .mean() this time. This data frame is then represented below:

#mean of the numeric columns per continent
data_per_continent = df.groupby('Continent').mean(numeric_only=True)
data_per_continent
ax = data_per_continent['Suicides/100kPop'].plot(kind='bar', figsize=(15, 10), fontsize=14)
plt.title('Mean Suicides/Year by Continent', fontsize = 22)
ax.set_xlabel("Continent", fontsize=20)
ax.set_ylabel("Suicides/100k Population", fontsize=20)
plt.show()

Interestingly, we can see that South America is the continent with the highest rate of suicide in young men, followed by Europe. Although useful, this chart does not show how the suicide rate of each continent changes over time. After grouping the data using ‘Continent’ and ‘Year’, and executing the following code, we are able to plot the year-by-year suicides/100k population for each continent:

#dftesting is not defined in this post; it is presumably the preprocessed data frame from the full notebook
dfAgg = dftesting.groupby(['Continent','Year'],sort=True,as_index=False)['Suicides/100kPop'].mean()
by_cont = dfAgg.groupby('Continent')
#one line per continent
for name, group in by_cont:
    plt.plot(group['Year'], group['Suicides/100kPop'], label=name, linewidth=6.0)
plt.title('Mean Suicide/100k, Year by Year, per Continent', fontsize = 22)
plt.ylabel('Suicides/100k', fontsize = 20)
plt.xlabel('Year', fontsize = 20)
leg = plt.legend(fontsize = 12)
#thicken the legend lines so the colours are easy to read
for line in leg.get_lines():
    line.set_linewidth(10)
plt.show()

Suicide rate by Continent over the years
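As an alternative to looping over the groups, the continent-by-year means can be reshaped into a wide table with one column per continent. This is a sketch on made-up numbers, assuming the same column names as the snippets in this post.

```python
import pandas as pd

# Toy data with made-up values.
toy = pd.DataFrame({
    "Continent": ["Europe", "Europe", "Asia", "Asia", "Europe", "Asia"],
    "Year": [2000, 2000, 2000, 2001, 2001, 2001],
    "Suicides/100kPop": [20.0, 30.0, 10.0, 12.0, 22.0, 8.0],
})

# Rows are years, columns are continents, cells are mean rates.
wide = toy.pivot_table(
    index="Year",
    columns="Continent",
    values="Suicides/100kPop",
    aggfunc="mean",
)
print(wide)
# Calling wide.plot() would then draw one line per continent.
```

The wide layout also makes it easy to read off a single continent's values for a given year.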

As can be seen, this graph shows the overall downward trend but also sharp spikes in continents such as South America and Africa (the latter likely due to inconsistencies in the reported data). Next, we wish to find out which countries have the highest suicide rates. We could also find the lowest; however, this would be skewed by countries with a low incidence of reporting (mainly African countries). In Python, we can create a plot by computing the mean Suicides/100kPop of each country, sorting the values in descending order and plotting the .head() of the result as a bar plot.

#mean suicide rate per country, highest first
data_suicide_mean = df['Suicides/100kPop'].groupby(df.Country).mean().sort_values(ascending=False)
f,ax = plt.subplots(1,1,figsize=(15,4))
#pass x and y by keyword; recent Seaborn versions no longer accept them positionally
ax = sns.barplot(x=data_suicide_mean.head(10).index, y=data_suicide_mean.head(10))
plt.ylabel('Suicides/100k', fontsize = 20)
plt.xlabel('Country', fontsize = 20)

Countries with the highest suicide rates
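The sort-then-head step can also be written with `nlargest`, which returns the top values already in descending order. A minimal sketch on made-up numbers, with single-letter placeholders instead of real country names:

```python
import pandas as pd

# Toy data with made-up values.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C", "C"],
    "Suicides/100kPop": [50.0, 54.0, 10.0, 14.0, 30.0, 34.0],
})

# Mean rate per country, keeping only the two highest.
top2 = toy.groupby("Country")["Suicides/100kPop"].mean().nlargest(2)
print(top2)
```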

Lithuania shows the highest suicide rate over the years, followed closely by Russia and Kazakhstan, with all three countries having a mean suicide rate of over 50 per 100k population. It is interesting to note that Lithuania and Kazakhstan both border Russia.

As the 2016 data was removed due to incompleteness, the most recent year we can run analysis on is 2015. Matplotlib and Seaborn support scatterplots, which let us plot suicide rate against GDP (per capita in the code below), with one point per country. Again, preparing the data frame is important: any non-2015 data and irrelevant columns are excluded, and the data is grouped by Continent and Country, keeping the suicide rate and GDP columns and applying .sum(), to give the data frame the shape that is needed. Furthermore, adding hue=‘Continent’ to the scatterplot parameters colours each point according to the continent that the country belongs to.

#plot suicide rate vs gdp
plt.figure(figsize=(20,16))
sns.scatterplot(x='GdpPerCapital($)',s=300, y='Suicides/100kPop',data=dfcont, hue='Continent')
plt.title('Suicide Rates: 2015', fontsize= 30)
plt.ylabel('Suicide Rate /100k Population', fontsize = 22)
plt.xlabel('GDP ($)', fontsize = 22)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(loc=1, prop={'size': 30})
plt.show()

Suicide rates vs GDP, coloured by Continent

Interestingly, there appear to be many countries with both very low GDP and very low suicide rates, which is slightly unexpected. However, this could be due to poorer countries having a low rate of reported suicide when in fact the number could be much higher. Still, GDP appears to have some relationship with the rate of suicide, which is examined with a correlation matrix further on.

It would also be interesting to see if the general happiness of a country affects its suicide rates amongst young men. Taking the 2015 World Happiness Report [10], a list can be created of all the happiness scores for the countries in the data frame; this can then simply be read into a new column ‘HappinessScore’ with the values converted to float. For this plot, countries with a HappinessScore of less than or equal to 5.5 are removed; this is because many of the countries with low scores have low suicide rates, probably due to incomplete data, non-reporting of suicide, or different classifications of suicide. This data can then be plotted using a scatterplot in Matplotlib/Seaborn to give the following visualization, again using hue=‘Continent’:

#plot suicide rates vs happiness score
plt.figure(figsize=(20,16))
sns.scatterplot(x='HappinessScore',s=300, y='Suicides/100kPop',data=dfcont, hue='Continent')
plt.title('Suicide Rates: 2015', fontsize= 30)
plt.ylabel('Suicide Rate /100k Population', fontsize = 22)
plt.xlabel('HappinessScore', fontsize = 22)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(loc=1, prop={'size': 30})
plt.show()

Suicide rates vs HappinessScore, coloured by Continent
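The step of attaching the happiness scores and filtering on them is not shown in the plotting code, so here is one way it might look. The scores below are made-up placeholder values, not figures from the World Happiness Report, and the country names are single-letter stand-ins.

```python
import pandas as pd

# Toy frame with made-up rates.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Suicides/100kPop": [12.0, 30.0, 5.0],
})

# Hypothetical scores, stored as strings as they might be when read from a file.
happiness = {"A": "7.2", "B": "5.9", "C": "4.8"}

# Attach the scores as a new float column.
toy["HappinessScore"] = toy["Country"].map(happiness).astype(float)

# Keep only countries scoring above 5.5, as described in the text.
filtered = toy[toy["HappinessScore"] > 5.5]
print(filtered)
```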

Again, it is difficult to tell if there is any real relationship between the suicide rates of a country and its happiness score; therefore, the relationship will be explored further. We can do this by applying bivariate analysis, computing a correlation matrix in Pandas, which gives the pairwise correlation of columns.

dfcont.corr(method = 'pearson')

It can be observed that in this data frame there is a correlation of -0.175131 between GdpPerCapita($) and Suicides/100kPop using the Pearson method. This indicates a relationship between the two, but not a strong one; the negative sign means the two tend to move in opposite directions, i.e. as one increases, the other tends to decrease. This can also be visualized as a heatmap using Seaborn, giving a more pleasing view of the correlation matrix.

sns.heatmap(dfcont.corr(method = 'pearson'),cmap='YlGnBu',annot=True)
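To make the Pearson coefficient less of a black box, it can be computed by hand as the sum of the products of the deviations from the mean, divided by the square root of the product of the summed squared deviations. The sketch below uses made-up numbers and hypothetical column names ('gdp', 'rate') and checks the manual result against Pandas.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; column names are illustrative only.
toy = pd.DataFrame({
    "gdp": [1.0, 2.0, 3.0, 4.0, 5.0],
    "rate": [10.0, 9.0, 9.5, 7.0, 8.0],
})

x = toy["gdp"].to_numpy()
y = toy["rate"].to_numpy()

# Deviations of each value from its column mean.
dx = x - x.mean()
dy = y - y.mean()

# Pearson r: co-variation divided by the product of the spreads.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same coefficient from Pandas.
r_pandas = toy["gdp"].corr(toy["rate"], method="pearson")

print(r_manual, r_pandas)
```

A value close to -1 or 1 indicates a strong linear relationship, while a value near 0, like the -0.175131 reported above, indicates a weak one.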

Thanks for reading!

Stay tuned for part 2 which will be out within the next week. We’ll stick with this dataset and jump into some machine learning.

Written by Harry Bitten