Predicting Demographic Trends for Global UNHCR Persons of Concern

Using regression models to design better social support systems

Jonathan Shapiro

Published in

Towards Data Science

6 min read · Jan 20, 2020

Refugee camp in Bulgaria | Photo credit: Dhaka Tribune

Introduction

Over the past decade, the number of refugees and displaced persons has skyrocketed, putting us in the midst of the largest refugee crisis ever seen. Meanwhile, countries like the United States are now accepting a dismally low number of refugees, while thrashing institutional services and support networks. It is paramount that we rectify these human rights violations immediately and better plan for the mass migrations that we are already seeing.

Though data science and machine learning technologies have historically been weaponized against marginalized communities rather than used to empower them (look no further than Weapons of Math Destruction by Cathy O’Neil), they can also help generate equity if used correctly. Within the context of displacement, we can begin working toward the lofty goal of predicting migration trends across populations. By looking not only at where people are coming from and in what numbers, but also at the demographic makeup of those populations, governments and organizations can put themselves in a better position to support displaced people during the relocation process. Clearly, it would be better to know that a group is 70% children under the age of four than to have to assume an equal spread across ages.

The United Nations Refugee Agency (UNHCR) collects demographic data on individuals classified as persons of concern. This can be used as an example dataset to build a scaffold for predicting demographic trends globally in the near future. Specifically, this tool uses regression techniques to predict the total number of individuals in each of eight demographic buckets. Note that this approach can and should be applied to more specific demographic datasets (by age, non-binary gender identity, etc.).

Methodology

Pre-processing: For the purposes of this project, the UNHCR data needs to be normalized and aggregated at the country level. A few simple data cleaning steps are applied: removing asterisks, grouping by country, combining demographics that were not always reported as separate categories, and renaming countries so that latitude and longitude can be pulled for them. We also can’t run regressions on countries that only have a single year of data, so those countries are dropped.
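The cleaning steps above could look roughly like the following pandas sketch. The column names, the `clean_demographics` helper, the country "Exampleland" and all of the numbers are hypothetical stand-ins, not the actual UNHCR schema or the code from this project.

```python
import pandas as pd

# Hypothetical raw extract: column names and values are illustrative only.
raw = pd.DataFrame({
    "Country": ["Cambodia", "Cambodia", "Cambodia", "Exampleland"],
    "Year": [2016, 2016, 2017, 2017],
    "Female 5-11": ["10", "*", "12", "3"],
    "Female 12-17": ["5", "4", "*", "1"],
})

def clean_demographics(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    value_cols = ["Female 5-11", "Female 12-17"]
    for col in value_cols:
        # Strip asterisks, then coerce to numbers; a bare "*" becomes NaN.
        stripped = out[col].astype(str).str.replace("*", "", regex=False)
        out[col] = pd.to_numeric(stripped, errors="coerce")
    # Combine categories that were not always reported separately.
    out["Female 5-17"] = out[value_cols].sum(axis=1, min_count=1)
    out = out.drop(columns=value_cols)
    # Aggregate to one row per country and year.
    out = out.groupby(["Country", "Year"], as_index=False)["Female 5-17"].sum()
    # Drop countries with only one year of data: no line can be fit to them.
    n_years = out.groupby("Country")["Year"].transform("nunique")
    return out[n_years > 1].reset_index(drop=True)

cleaned = clean_demographics(raw)
print(cleaned)
```

Treating an asterisk as a missing value (rather than zero) is an assumption of this sketch; the right choice depends on what the asterisk means in the source file.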

Machine Learning: The main model uses simple linear regression for prediction. Functions are built to specify a country within the dataset and create a linear regression object for a specific demographic. The regression is then fit across multiple test_size values in sklearn’s train_test_split, keeping the split that gives the highest R². For all countries and demographics, the fitted regression objects are converted into a dataframe for easy calculation of predictions.

Visualization: Two different visualizations are used to inspect the predicted data. The first clearly demonstrates the historical data, regression fit and predicted value of a specific year, while the second provides a map with points plotted geographically on a logarithmic color scale.

Scaffold Development

I won’t be going over data cleaning in this article, but feel free to look at my GitHub if you are interested in the specifics. For the scaffold, I created three functions that specify a country to pull data for, create the regression model, and maximize R² by iterating over different test sizes in train_test_split. Here, the linear regression function can be replaced with other regression models should they produce better fits.
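A minimal sketch of what such a three-function scaffold might look like is below. The function names (`country_series`, `fit_regression`, `best_regression`), the candidate test sizes, the fixed `random_state` and the synthetic "Country A" data are all assumptions made for illustration, not the original code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def country_series(df, country, demographic):
    """Pull the year/count series for one country and one demographic."""
    subset = df[df["Country"] == country].sort_values("Year")
    X = subset[["Year"]].to_numpy(dtype=float)
    y = subset[demographic].to_numpy(dtype=float)
    return X, y

def fit_regression(X, y, test_size, random_state=0):
    """Fit a linear regression on the training split; score R² on the test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    return model, model.score(X_test, y_test)

def best_regression(X, y, test_sizes=(0.2, 0.3, 0.4, 0.5)):
    """Try several test sizes and keep the fit with the highest test R²."""
    best_model, best_r2, best_size = None, -np.inf, None
    for size in test_sizes:
        model, r2 = fit_regression(X, y, size)
        if r2 > best_r2:
            best_model, best_r2, best_size = model, r2, size
    return best_model, best_r2, best_size

# Synthetic data for a hypothetical country (not real UNHCR figures).
rng = np.random.default_rng(42)
years = np.arange(2001, 2019)
data = pd.DataFrame({
    "Country": "Country A",
    "Year": years,
    "Female 5-17": 100 * (years - 2000) + rng.normal(0, 20, size=len(years)),
})

X, y = country_series(data, "Country A", "Female 5-17")
model, r2, size = best_regression(X, y)
prediction_2020 = float(model.predict(np.array([[2020.0]]))[0])
print(f"best test_size={size}, R²={r2:.3f}, predicted 2020 value={prediction_2020:.0f}")
```

One caveat on the design: choosing the test size by its own test-set R² makes the reported score optimistic, so it is best read as a rough fit indicator rather than a measure of out-of-sample accuracy.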

Testing for Cambodia with the demographic “Female 5–17,” we get:

Now, this functionality can be iterated over each country and demographic in the cleaned dataset. Clearly some R² values are better than others. More on that later.
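Iterating over every country and demographic and storing the fitted lines in a dataframe could be sketched as follows. The helper names, the stored columns (`slope`, `intercept`, `r2`) and the synthetic data are assumptions, and clipping negative predictions to zero is a choice made for this sketch rather than something described in the article.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def fit_best(X, y, test_sizes=(0.2, 0.3, 0.4), random_state=0):
    """Return the (model, R²) pair with the highest test R² over the test sizes."""
    best = None
    for size in test_sizes:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=size, random_state=random_state
        )
        model = LinearRegression().fit(X_tr, y_tr)
        r2 = model.score(X_te, y_te)
        if best is None or r2 > best[1]:
            best = (model, r2)
    return best

def fit_all(df, demographics):
    """Fit one line per (country, demographic) and collect the coefficients."""
    rows = []
    for country, group in df.groupby("Country"):
        X = group[["Year"]].to_numpy(dtype=float)
        for demo in demographics:
            model, r2 = fit_best(X, group[demo].to_numpy(dtype=float))
            rows.append({
                "Country": country,
                "Demographic": demo,
                "slope": float(model.coef_[0]),
                "intercept": float(model.intercept_),
                "r2": r2,
            })
    return pd.DataFrame(rows)

def predict_year(coefs, year):
    """Evaluate every stored line at `year`; counts cannot go below zero."""
    out = coefs.copy()
    out["prediction"] = (out["slope"] * year + out["intercept"]).clip(lower=0)
    return out

# Synthetic, exactly linear series for two hypothetical countries.
years = np.arange(2001, 2019)
frames = [
    pd.DataFrame({"Country": "Country A", "Year": years,
                  "Female 5-17": 10.0 * (years - 2000),
                  "Male 5-17": 5.0 * (years - 2000) + 3}),
    pd.DataFrame({"Country": "Country B", "Year": years,
                  "Female 5-17": 1000.0 - 100 * (years - 2000),
                  "Male 5-17": 2.0 * (years - 2000)}),
]
data = pd.concat(frames, ignore_index=True)

coefs = fit_all(data, ["Female 5-17", "Male 5-17"])
predictions = predict_year(coefs, 2020)
print(predictions)
```

Storing only the slope and intercept keeps later predictions to simple column arithmetic, which is one way to read "easy calculation" in the methodology above.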

Visualization

As described above, the visualizations for this tool come in two flavors: the country level and the global level. The first is a simple linear regression plot for a given demographic and country that includes the historical data, the associated regression line (with R² maximized), and a highlighted point showing the predicted value for a chosen year. The second is a geographic plot of the world on a logarithmic color scale, with blue representing very few displaced persons (0–10) and red representing very large numbers (1,000,000–10,000,000) for a given demographic and year. With this plot, you can click on any point and get that country’s prediction.

The visualization development is seen below. Note the second function pulls latitude and longitude for an input address.
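As a rough illustration of the two plot types, here is a static matplotlib sketch. It is not the original implementation: the function names are hypothetical, the coordinates and counts are made up, the geocoding step is replaced by hard-coded latitude/longitude values, and the click-to-inspect interactivity of the map is not reproduced.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LogNorm

def plot_country_fit(years, values, slope, intercept, target_year, title):
    """Historical points, the fitted line, and the predicted point for one year."""
    fig, ax = plt.subplots()
    ax.scatter(years, values, label="Historical data")
    span = np.array([min(years), target_year], dtype=float)
    ax.plot(span, slope * span + intercept, label="Regression fit")
    predicted = slope * target_year + intercept
    ax.scatter([target_year], [predicted], color="red", zorder=3,
               label=f"Predicted {target_year}")
    ax.annotate(f"{predicted:,.0f}", (target_year, predicted),
                textcoords="offset points", xytext=(-10, 10), ha="right")
    ax.set_xlabel("Year")
    ax.set_ylabel("Persons of concern")
    ax.set_title(title)
    ax.legend()
    return fig, predicted

def plot_log_points(lats, lons, counts):
    """Scatter points by longitude/latitude, colored on a log scale (blue to red)."""
    fig, ax = plt.subplots()
    norm = LogNorm(vmin=1, vmax=10_000_000)
    safe_counts = np.clip(counts, 1, None)  # a log scale cannot represent 0
    points = ax.scatter(lons, lats, c=safe_counts, cmap="coolwarm", norm=norm)
    fig.colorbar(points, ax=ax, label="Predicted persons of concern")
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    return fig, norm

# Hypothetical inputs for illustration only.
years = list(range(2001, 2019))
values = [50 * (y - 2000) for y in years]
fig1, predicted = plot_country_fit(
    years, values, slope=50.0, intercept=-100000.0,
    target_year=2020, title="Hypothetical country, Female 5-17",
)
fig2, norm = plot_log_points(
    lats=[10.0, 35.0, 50.0], lons=[105.0, 65.0, 10.0], counts=[5, 250000, 0]
)
fig1.savefig("country_fit.png")
fig2.savefig("log_points.png")
```

A real map would draw these points over a basemap and look up coordinates from the country name; both are left out here to keep the sketch free of network calls.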

Examples of visualizations are seen below:

Results and Discussion

This model was built to predict the size of specific demographic groups within displaced populations, in order to better prepare the services needed to support persons of concern, but there is clearly a lack of statistical rigor in the results.

The biggest shortcoming of this study is its predictive ability given the amount of data publicly available. Each country has a maximum of 18 data points, one per year since 2001. With this small amount of data and the incredibly volatile nature of displacement, the actual predictive viability of this tool is very low. If anything, it can only provide directional answers as to where we will be seeing displacement by specific demographics. Each time the regression is run, the R² can vary widely depending on the country, despite maximization. This is especially common among countries with low population displacement. With a longer period of data collection and more specific metrics, the algorithmic scaffold would be more successful.
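The run-to-run variation in R² can be illustrated with a small experiment: with only 18 points, which years land in the test split changes the score noticeably. The series below is synthetic (a weak trend plus noise, standing in for a country with low displacement), and the helper name and parameters are assumptions for this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
years = np.arange(2001, 2019, dtype=float).reshape(-1, 1)
# Synthetic small, noisy series: not real UNHCR data.
counts = 20 + 2 * (years.ravel() - 2001) + rng.normal(0, 15, size=len(years))

def r2_across_splits(X, y, test_size=0.3, n_splits=20):
    """Refit on different random train/test splits and collect the test R²."""
    scores = []
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = LinearRegression().fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    return np.array(scores)

scores = r2_across_splits(years, counts)
print(f"R² over {len(scores)} splits: min={scores.min():.2f}, max={scores.max():.2f}")
```

Fixing `random_state` makes a single run reproducible, but it does not remove the underlying sensitivity to the split.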

One way to improve the predictive capability of this tool would be to map against specific causal data where trends are less variable. It makes sense, for example, to draw on the massive amount of predictive work done around climate change, which is already forcibly displacing individuals, rather than trying to predict whether specific wars will break out. This isn’t to say that the latter is less or more important, but that the model would be better equipped if it intelligently pulled in causal factors in addition to predicting exclusively from historical trends of total population movement. In essence, displacement trends are complex, so getting usable information requires pulling in causal data as well as using a more advanced regression model. Even so, this project could provide a scaffold for loading, cleaning and transforming complex datasets in order to build and analyze the necessary learning algorithms.

Despite the statistical pitfalls with the dataset used, there are inferences we can gather. Based on the logarithmic map that details predicted numbers of displaced people, we get expected results based on historical trends. Specifically, the largest populations are predicted to come generally from Sub-Saharan Africa, the Middle East, Central America and parts of Southeast Asia.

Some countries show high numbers that we may not naively expect, such as Germany. However, research will usually indicate why these numbers exist. For example, not only have many displaced individuals been traveling to Germany, but the government has also been deporting a large number of them. Because of its proximity to many areas that are experiencing refugee crises, the numbers have unfortunately become quite large.

It is also evident that demographic trends vary widely, even within the same country. If you play around with the data yourself, you will see that some populations are decreasing over time, while others are spiking (e.g. older males in Afghanistan vs. younger males).

For further reading on how data science has been used to create more equitable situations for refugees, please see Bansak, et al. (2018).

Feel free to reach out with any thoughts or questions, and for the complete code, see my GitHub. A big thank you to Madeline Hardt for consulting on this project.