
How do you find meaning in data? In our mini project, my friend @ErikaSM and I sought to predict what Singapore’s minimum wage might be if we had one, and documented that process in an article over here. If you have not read it, do take a look.
Since then, we have received comments on our process and suggestions to develop deeper insight into our data. This follow-up article therefore has two main objectives: finding meaning in data, and learning how to perform stepwise regression.
The Context
In the previous article, we discussed how the idea of a minimum wage in Singapore has frequently been a hot topic for debate. This is because Singapore uses a Progressive Wage Model and hence does not have a minimum wage.
The official stance of the Singapore Government is that a competitive pay structure motivates the labour force to work hard, in line with the value of meritocracy embedded in Singapore’s culture. Regardless of the arguments for or against a minimum wage in Singapore, the poor struggle to afford necessities and to take care of themselves and their families.
We took a neutral stance, acknowledging the validity of both sides of the argument, and instead presented a prediction of Singapore’s minimum wage based on certain metrics across different countries. The predicted minimum wage was also contrasted with the wage floors for certain jobs under the Progressive Wage Model (PWM) to spark some discussion about whether the poorest are earning enough.
The Methodology
We collected data on minimum wage, cost of living, and quality of life from Wikipedia and World Data. The quality of life dataset includes scores in a few categories: Stability, Rights, Health, Safety, Climate, Costs, and Popularity.
The scores across these indicators and categories were fed into a linear regression model, which was then used to predict the minimum wage using Singapore’s statistics as the independent variables. This linear model was coded in Python using sklearn, and more details about the code can be found in our previous article. I will also briefly outline the modelling and prediction process here.
The predicted annual minimum wage was US$20,927.50 for Singapore. A brief comparison can be seen in this graph below.

Our professor encouraged us to use stepwise regression to better understand our variables. In this iteration, we incorporated stepwise regression for dimensionality reduction, not only to produce a simpler and more effective model, but also to derive insights from our data.
Stepwise Regression
So what exactly is stepwise regression? In any phenomenon, certain factors will play a bigger role than others in determining an outcome. In simple terms, stepwise regression is a process that helps determine which factors are important and which are not. In our case, certain variables had rather high p-values and were not contributing meaningfully to the accuracy of our prediction. Only the important factors are kept, so that the linear model makes its prediction based on the factors that help it produce the most accurate result.
In this article, I will outline stepwise regression using a backwards elimination approach. All variables are initially included, and at each step the most statistically insignificant variable is dropped; in other words, the most ‘useless’ variable gets kicked out. This is repeated until all remaining variables are statistically significant.
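To make the idea concrete, here is a minimal sketch of what an automated backwards-elimination loop could look like using statsmodels. This is only an illustration; in the walkthrough below, we carry out the same steps manually, one variable at a time.
## sketch of an automated backwards-elimination loop (illustration only)
import statsmodels.api as sm
def backward_eliminate(data, x_columns, y, alpha=0.05):
    cols = list(x_columns)
    while cols:
        results = sm.OLS(y, data[cols]).fit()  ## fit OLS on the current set of columns
        worst = results.pvalues.idxmax()       ## find the least significant variable
        if results.pvalues[worst] <= alpha:    ## stop once everything left is significant
            break
        cols.remove(worst)                     ## otherwise, kick out the ‘most useless’ one
    return cols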
The Coding Bits
Before analysing the regression models, we first modified the data to reflect a monthly wage instead of an annual wage. We recognised that most people tend to think of their wages in months rather than across an entire year, so expressing the data this way would help our audience understand it better. It is worth noting that this change in scale does not affect the modelling process or the outcomes.
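For illustration, this conversion is just a division by 12. The name of the annual wage column here is an assumption, since the exact header depends on the data file.
## converting the annual wage column into a monthly one
## (the "Annual Nominal (USD)" column name is assumed for illustration)
data["Monthly Nominal (USD)"] = data["Annual Nominal (USD)"] / 12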
Looking at our previous model, we produced summary statistics to test the accuracy of the model. Before that, we first have to specify the relevant X and Y columns and obtain that information from the data file.
## getting column names ('data' is the combined country-level DataFrame loaded earlier)
x_columns = ["Workweek (hours)", "GDP per capita", "Cost of Living Index", "Stability", "Rights", "Health", "Safety", "Climate", "Costs", "Popularity"]
y = data["Monthly Nominal (USD)"]
Next, to gather the model statistics, we use the statsmodels.api library. Here, a function is created that grabs the columns of interest from a list and then fits an ordinary least squares (OLS) linear model to them. The statistics summary can then be printed out very easily.
## creating function to get model statistics
import numpy as np
import statsmodels.api as sm
def get_stats():
    ## fit an ordinary least squares model on the current columns of interest
    ## (no constant term is added, so the fit is through the origin)
    x = data[x_columns]
    results = sm.OLS(y, x).fit()
    print(results.summary())
get_stats()

Here, we are concerned with the column "P > |t|". Quoting the technical explanation from the UCLA Institute for Digital Research and Education, this column gives the 2-tailed p-value used in testing the null hypothesis.
"Coefficients having p-values less than alpha are statistically significant. For example, if you chose alpha to be 0.05, coefficients having a p-value of 0.05 or less would be statistically significant (i.e., you can reject the null hypothesis and say that the coefficient is significantly different from 0)."
In other words, we generally want to drop variables with a p-value greater than 0.05. As seen from the initial summary above, the least statistically significant variable is "Safety", with a p-value of 0.968. Hence, we drop "Safety" as shown below and print the new summary.
x_columns.remove("Safety")
get_stats()

This time, the least statistically significant variable is "Health", so we remove it as well.
x_columns.remove("Health")
get_stats()

We continue this process until all p-values are below 0.05.
x_columns.remove("Costs")
x_columns.remove("Climate")
x_columns.remove("Stability")

Finally, we find that there are five variables left: Workweek, GDP per Capita, Cost of Living Index, Rights, and Popularity. Since each of their p-values is below 0.05, all of these variables are statistically significant.
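At this point, only the surviving variables remain in x_columns, which we can confirm with a quick print:
print(x_columns)
>> ['Workweek (hours)', 'GDP per capita', 'Cost of Living Index', 'Rights', 'Popularity']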
We can now fit a linear model on this new set of variables and use it to predict Singapore’s minimum wage. As seen below, the predicted monthly minimum wage is about US$1,774.
## creating a linear model and prediction
from sklearn.linear_model import LinearRegression
import pandas as pd
## fit the model on the remaining statistically significant columns
x = data[x_columns]
linear_model = LinearRegression()
linear_model.fit(x, y)
## predict using Singapore's statistics as the independent variables
sg_data = pd.read_csv('testing.csv')
x_test = sg_data[x_columns]
y_pred = linear_model.predict(x_test)
print("Prediction for Singapore is ", y_pred)
>> Prediction for Singapore is [1774.45875071]
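As a quick sanity check, we can scale the monthly prediction back up to an annual figure:
## scaling the monthly prediction back to an annual figure for comparison
print("Annual equivalent is ", y_pred * 12)
Based on the printed monthly value, this works out to roughly US$21,290 a year, in the same ballpark as the US$20,927.50 predicted earlier; the small difference is expected, since the reduced model uses fewer variables.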
Finding Meaning in the Data
This is the most important part of the process. Carly Fiorina, former CEO of Hewlett-Packard, once said: "The goal is to turn data into information, and information into insight." This is exactly what we aim to achieve.
"The goal is to turn data into information, and information into insight." ~ Carly Fiorina, former CEO of Hewlett-Packard
Just from looking at the variables, we could have easily guessed which would be statistically significant. For example, GDP per Capita and the Cost of Living Index would logically be good indicators of a country’s minimum wage. Even the number of hours in a workweek makes sense as an indicator.
However, we noticed that "Rights" was still included in the linear model. This spurred us to first look at the relationship between Rights and Minimum Wage. Upon plotting the graph, we found this aesthetically pleasing relationship.

Initially, we would not have considered Rights to be correlated with minimum wage, since the more obvious candidates of GDP and Cost of Living stood out as contributors to the minimum wage level. This made us reconsider how we understood minimum wage and compelled us to dig deeper.
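As an aside, a scatter plot like the one above can be produced with a few lines of matplotlib. This is only a minimal sketch, not our exact plotting code; the column names are the ones used earlier in this article.
## a minimal sketch of the Rights vs minimum wage scatter plot
import matplotlib.pyplot as plt
plt.scatter(data["Rights"], data["Monthly Nominal (USD)"])
plt.xlabel("Civil Rights score")
plt.ylabel("Monthly minimum wage (USD)")
plt.show()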
From World Data, "Rights" refers to civil rights and revolves mainly around people’s participation in politics and corruption. The Civil Rights Index includes democratic participation by the population and measures to combat corruption, and it also factors in the public’s perception of the government using data from Transparency.org.
"In addition, other factors include democratic participation by the population and (with less emphasis) measures to combat corruption. In order to assess not only the measures against corruption, but also its perception by the population, the corruption index based on Transparency.org was also taken into account."
This prompted us to consider the correlation between civil rights and minimum wage. With this information in hand, we did further research and found several articles that might explain the correlation.
The Leadership Conference on Civil and Human Rights, an American civil rights interest group, released a report on why the minimum wage is a civil and human rights issue, arguing for stronger minimum wage policy to reduce inequality and ensure that individuals and families struggling in low-paying jobs are paid fairly. It hence makes sense that a country with more democratic participation is also more likely to voice concerns about the minimum wage, forcing a discussion and consequently raising it over time.
The next variable we looked at was Popularity. We first checked how this was measured by World Data.
"The general migration rate and the number of foreign tourists were therefore evaluated as indicators of a country’s popularity. A lower rating was also used to compare the refugee situation in the respective country. A higher number of foreign refugees results in higher popularity, while a high number of fleeing refugees reduces popularity."

At first glance, there seems to be no correlation. However, if we treat China, France, the USA, and Spain as outliers, the majority of the data points seem to fit an exponential curve better. This raises two questions. Firstly, why is there a relationship between Popularity and minimum wage? Secondly, why are these four countries outliers?
To be very honest, this stumped us. We simply could not see how popularity could be correlated with a minimum wage. Nevertheless, there was an important takeaway: popularity is somehow statistically significant in predicting a country’s minimum wage. While we might not be the people to explain that relationship, it still turns otherwise overlooked data into something meaningful.
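For readers who want to probe the exponential intuition above, here is a rough sketch of how it could be checked. This is our own illustration, not part of the original analysis; the "Country" column name and the exact country labels are assumptions about how the dataset is laid out.
## rough check of the exponential intuition: fit a line to log(wage) vs Popularity
## after dropping the four suspected outliers (column and country names assumed)
import numpy as np
outliers = ["China", "France", "USA", "Spain"]
subset = data[~data["Country"].isin(outliers)]
slope, intercept = np.polyfit(subset["Popularity"], np.log(subset["Monthly Nominal (USD)"]), 1)
print("log(wage) is roughly {:.3f} * Popularity + {:.3f}".format(slope, intercept))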
Conclusion
It is worth bringing back Carly Fiorina’s quote: "The goal is to turn data into information, and information into insight." As humans, we require tools and methods to convert data into information, and experience and knowledge to convert that information into insight.
We first used Python as a tool and performed stepwise regression to make sense of the raw data. This let us discover not only information that we had predicted, but also information that we did not initially consider. It is easy to guess that Workweek, GDP, and Cost of Living would be strong indicators of the minimum wage. However, it was only through regression that we discovered that Civil Rights and Popularity are also statistically significant.
In this case, we found research online that could possibly explain these findings. This resulted in the new insight that the minimum wage is actually seen as a human rights issue, and that an increase in democratic participation can result in more conversations about a minimum wage, and hence push it upwards.
However, it is not always possible to find meaning in data that easily. Unfortunately, we, as university students, may not be the best people to offer plausible explanations for our findings, as seen in our attempts to explain the relationship between Popularity and minimum wage. Still, it is within our capacity to take this information and share it with the world, leaving it as an open-ended question for discussion to flourish.
That is how we can add value to the world using data.
- Written in collaboration with Erika Medina