Using Machine Learning to Predict Country Population

A simple linear regression model based on more than 50 years of population records to predict next year's population.

Abid Ebna Saif Utsha
Towards Data Science


Photo by Ryoji Iwata on Unsplash

Machine Learning has become one of the trendy topics in recent times. There is a lot of development and research going on to keep this field moving forward. In this article, I will demonstrate a simple linear regression model that will predict a country’s population in the upcoming years.

I am not going to explain linear regression in detail here, as there are many articles on Medium and many online courses that discuss it in depth. I am simply going to show and explain a small project developed using linear regression in Python. The dataset used in this project was collected from the World Bank.

The following libraries are needed for this project.

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import re
import json

To suppress warnings in the code output, the warnings module can be used, but this is optional.

import warnings
warnings.filterwarnings("ignore")

After importing the necessary libraries, it is time to load the data into a pandas dataframe. I saved the population file as pop.csv after downloading it.

df = pd.read_csv('pop.csv')
df.head()

This will give the following result:

df.head()

From the image above, it can be seen that a bit more preprocessing is needed before passing the data to the linear regression model. There are a few columns, such as Indicator Code and Indicator Name, that are not needed for this project. Before dropping them, I select one country based on the Country Name column.

bd = df.loc[df['Country Name'] == 'Bangladesh']
# Drop the label columns on a new object rather than in place, to avoid a SettingWithCopyWarning
bd = bd.drop(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code'], axis=1)
bd = bd.T
bd.head()

Here I used .loc to select Bangladesh and then dropped four columns, which leaves a single row with one column per year. I transposed the dataframe so that I end up with only two pieces of information per row: the year and the population. The image below shows the result of the above code.
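The effect of the transpose can be reproduced on a tiny made-up table with the same layout as the World Bank file (one row per country, one column per year). This is only a sketch: the table `toy`, its country names, and its numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical wide-format table: one row per country, one column per year.
toy = pd.DataFrame({
    "Country Name": ["Aland", "Bland"],
    "Country Code": ["ALD", "BLD"],
    "1960": [100.0, 200.0],
    "1961": [110.0, 210.0],
    "1962": [120.0, 220.0],
})

# Select one country, drop the label columns, then transpose.
one = toy.loc[toy["Country Name"] == "Bland"].drop(columns=["Country Name", "Country Code"])
tall = one.T

print(tall.columns.tolist())  # the single column is named after the original row label
print(tall.index.tolist())    # the year labels are now the index
```

Because the selected row had index label 1, the transposed dataframe has a single column named 1, and the year labels end up in the index.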

However, the column name is not shown: instead of population, the column is labeled 18 (the row's original index label). The year also appears as the index. For an autoregressive (AR) model, having the year as the index would have been fine, but for linear regression I want the year as a separate column, because the model fits the following equation:

y = mx + c

This is the simple linear regression formula, where y is the dependent (predicted) variable, x is the independent variable, m is the slope (coefficient), and c is the intercept. In this project, x is the year and y is the predicted population. The following code drops the missing values, turns the year index into a regular column, and renames the columns.

bd.dropna(inplace=True)
bd=bd.reset_index().rename(columns={18:'population','index':'year'})
bd.head()

This will give the following result:

Now I can use this dataframe to train the linear regression model and get the desired output.

x = bd.iloc[:, 0].values.reshape(-1, 1)
y = bd.iloc[:, 1].values.reshape(-1, 1)
model = LinearRegression().fit(x, y)
y_pred = model.predict([[2019]])
y_pred

In this block of code, I reshaped the year and population columns into 2D arrays, which is the shape LinearRegression expects. Then I created the model and fit it with x and y. After that, I used model.predict() to predict the population for 2019, which gives the following result.

array([[1.65040186e+08]])
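As a small aside on shapes and on reading that output: scikit-learn estimators expect the features as a 2D array with one row per sample, which is what reshape(-1, 1) produces. The sketch below uses made-up years to show the shape change, then formats the value printed above as a whole number. Note that the printed value is already rounded for display, so the integer is approximate.

```python
import numpy as np

years = np.array([2015, 2016, 2017])   # made-up 1D array, shape (3,)
x = years.reshape(-1, 1)               # one column, row count inferred: shape (3, 1)
print(years.shape, x.shape)

# The prediction shown above, written as a plain integer with thousands separators.
predicted = 1.65040186e+08
print(f"{int(round(predicted)):,d}")
```

In other words, the model predicts roughly 165 million people for 2019.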

Now, all of this is great, but I have population data for more than 100 countries, and the code above restricts me to only one of them. It serves as the backbone of the following code, which shows the implementation of the actual project. The script takes a country and a year as user input, applies the same preprocessing as before with a little tweaking, and uses the linear regression model to show the predicted result.

def main():
    country = input("Please input the country name: ").lower()
    year = int(input("Please input the year to predict: "))
    df = pd.read_csv('pop.csv')
    lists, df = country_list_gen(df)
    if country in lists:
        df = selecting_country(df, country)
        model = prediction_model(df)
        result = prediction(model, year)
        print(f"\n Result: {country.upper()} population in {year} will be {result:,d}")
    else:
        print('kindly check country name spelling from country_list.json')


if __name__ == "__main__":
    main()

The above code is the main function of my script. First, it takes the country name as input from the user and converts it to a lowercase string. After that, it takes the year as input and converts it to an integer. Then the original CSV file is loaded into a dataframe using pandas read_csv(). Next, it calls the following function.

def country_list_gen(df):
    df.rename(columns={'Country Name': 'country_name'}, inplace=True)
    df['country_name'] = df['country_name'].apply(lambda row: row.lower())
    lists = df['country_name'].unique().tolist()
    with open('country_list.json', 'w', encoding='utf-8') as f:
        json.dump(lists, f, ensure_ascii=False, indent=4)
    return lists, df

This function stores all the unique country names in a JSON file, which helps users check whether a country is available and whether they have misspelled its name. The function takes the raw dataframe as its input parameter and renames the Country Name column to country_name. After that, it converts all the country names to lowercase using a combination of .apply() and a lambda function. Then, all unique country names are converted into a list and saved as country_list.json. In the end, it returns the list and the modified dataframe to the main function. I am giving the main function code again to make it easier to follow.

def main():
    country = input("Please input the country name: ").lower()
    year = int(input("Please input the year to predict: "))
    df = pd.read_csv('pop.csv')
    lists, df = country_list_gen(df)
    if country in lists:
        df = selecting_country(df, country)
        model = prediction_model(df)
        result = prediction(model, year)
        print(f"\n Result: {country.upper()} population in {year} will be {result:,d}")
    else:
        print('kindly check available country name and their spelling from country_list.json')
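The case-insensitive lookup that main relies on can be tried in isolation. The following is a minimal sketch under assumptions: the three-row table is made up, the file is written to a temporary directory rather than the working directory, and .str.lower() is used as a shorthand for the .apply() call in country_list_gen (the two behave the same for string values).

```python
import json
import os
import tempfile

import pandas as pd

# Made-up table with a repeated country name.
toy = pd.DataFrame({"Country Name": ["Bangladesh", "Bangladesh", "Nepal"]})
names = toy["Country Name"].str.lower().unique().tolist()

# Write the list to a temporary location and read it back.
path = os.path.join(tempfile.mkdtemp(), "country_list.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(names, f, ensure_ascii=False, indent=4)

with open(path, encoding="utf-8") as f:
    saved = json.load(f)

user_input = "BANGLADESH".lower()   # any capitalization matches after lower-casing
print(user_input in saved)
print("bangladsh" in saved)         # a misspelling does not match
```

Lower-casing both the stored names and the user input is what makes the membership test independent of capitalization; misspellings still fall through to the else branch.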

Now, the returned list contains the available country names, so a simple if-else statement is executed. If the country name entered by the user is not in the list, for example because it is misspelled, the script instructs the user to look into the country_list.json file. If the name matches, it executes the following functions.

def selecting_country(df, country):
    df = df.loc[df['country_name'] == country]
    # Drop the label columns on a new object rather than in place, to avoid a SettingWithCopyWarning
    df = df.drop(['country_name', 'Country Code', 'Indicator Name', 'Indicator Code'], axis=1)
    df = df.T
    df.dropna(inplace=True)
    df = df.reset_index()
    return df


def prediction_model(df):
    x = df.iloc[:, 0].values.reshape(-1, 1)
    y = df.iloc[:, 1].values.reshape(-1, 1)
    model = LinearRegression().fit(x, y)
    return model


def prediction(model, year):
    return int(model.coef_[0][0] * year + model.intercept_[0])

The selecting_country function takes the country name and filters the dataframe. It then drops the unnecessary fields, transposes the dataframe, resets its index, and returns it to the main function. The main function then calls prediction_model and passes the dataframe as a parameter. Here I did not bother renaming the columns, as I select them by position and convert them into 2D arrays x and y. The function then creates a LinearRegression() model, fits it with x and y, and returns the fitted model to the main function. The main function passes this model, along with the year entered by the user, to the prediction function. This function simply takes the coefficient and intercept from the model and uses ‘y=mx+c’ to derive the predicted population.
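To see that the manual ‘y=mx+c’ calculation agrees with the model's own predict() method, here is a minimal sketch on made-up data that lies exactly on the line y = 2x + 5. The data and variable names are illustrative assumptions, not part of the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on y = 2x + 5, so the fit is easy to check.
x = np.array([1, 2, 3, 4]).reshape(-1, 1)
y = (2 * x + 5).astype(float)

model = LinearRegression().fit(x, y)

m = model.coef_[0][0]      # slope; coef_ is 2D here because y was passed as a 2D array
c = model.intercept_[0]    # intercept

manual = m * 10 + c                       # y = mx + c computed by hand
built_in = model.predict([[10]])[0][0]    # the same prediction from the model itself
print(manual, built_in)
```

Both values agree up to floating-point rounding, so the prediction function above could equally call model.predict(); computing it by hand just makes the formula explicit.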

The sample outputs for two different scenarios are given below, one with the correct country name and another with the wrong spelling.

Correct name
Wrong Spelling

So, this was my project, which I created in under an hour. It is a really simple project to implement and a good way to get a practical understanding of the linear regression model. I am from Bangladesh, which is why I used Bangladesh as the country name throughout this article. You can access and see the full code on GitHub. You can see my previous article here, which was about getting geolocation information from public IP addresses using a free API.

Thank you very much for reading my article. I hope you learned something new, as I did while doing this project and writing this article.


I am working as a Data Engineer, with interests in Machine Learning and Software Engineering.