How the combination of the right data and Machine Learning can be leveraged to shape the smart cities of the future
We live in a society that is increasingly aware of the adverse effects of breathing polluted air. As a result, pollution is becoming a critical issue when cities are designed or redesigned. Unfortunately, it is difficult to assess the effect of individual choices, because causality is hard to confirm; and because human behaviour is central to the outcome, the problem is even more complicated. Luckily, urban planners are looking into ways to nudge people into making better individual choices. However, planners do not possess the tools needed to evaluate what must be done to reduce pollution, especially those who cannot afford to use large supercomputers.
Humans have a considerable effect on pollution levels. The Covid-19 pandemic showed the extent to which our behaviour is correlated with air pollution in our cities. In Barcelona, NO2 levels dropped 64% in March 2020, to levels previously deemed unreachable. Combine this result with the knowledge, established by multiple studies, that NO2 pollution is associated with health problems such as diabetes, hypertension, strokes, chronic obstructive pulmonary disease, and asthma, and the stakes become clear.
The pollution challenge we face in our cities is important, and the consequences of success or failure will be felt by everyone. To discover what we could do to help, we joined forces with 300.000 km/s, a Barcelona urban planning think tank that works with smart data for cities. Our aim was to use data intelligently so that city architects can make more informed decisions when considering air pollution. With an abundance of data available, two essential questions arise from the get-go: what data is relevant, and how do we make data smart?
Our approach
Our journey started at Esteve's kitchen table as we discussed the various options for an impactful and exciting project to conclude our master's degree in business analytics. We quickly settled on the subject of smart cities, an area in which Esteve had done some recent research. Esteve contacted Mar and Pablo, the co-founders of 300.000 km/s. Together with Esteve, they helped and supported us throughout this eight-month journey and guided us with their expertise whenever we took a wrong turn. We would not have got this far without them.
We started our project with a dataset provided by 300.000 km/s. It contained summarised travel data about the movement of individuals in Spain, collected from the movement of cellular devices. Spain was divided into roughly 2,500 regions, and all travel between these regions was recorded. Scholars have long shown that NO2 is strongly correlated with travel, most notably from diesel cars. To reinforce this initial data, we added numerous contextual statistics about each area, ranging from the number of people per age group to average incomes.
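To give a flavour of what that preparation might look like, here is a minimal sketch of aggregating origin-destination travel counts and joining them onto per-region statistics. The file and column names (travel_between_regions.csv, origin, trips, region_id) are placeholders rather than the names used in the actual dataset.

import pandas as pd

# Placeholder inputs: one row per origin-destination pair, and one row per region
travel = pd.read_csv('travel_between_regions.csv')
regions_stats = pd.read_csv('region_statistics.csv')

# Collapse the origin-destination pairs into a single mobility feature per region:
# the total number of trips leaving each region
outgoing = (travel.groupby('origin')['trips']
                  .sum()
                  .rename('outgoing_trips')
                  .reset_index())

# Join the mobility feature onto the per-region statistics used as model inputs
features = regions_stats.merge(outgoing, left_on='region_id', right_on='origin', how='left')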
To accurately predict NO2 levels across Spain, we needed to think carefully about our modelling techniques. Our model used a combination of standard and less common machine learning techniques: throughout the project we worked with correlation matrices, random forest regression trees, graph-based representations, and spatially lagged features. While we were struggling to use the data optimally, Andre, the data scientist at 300.000 km/s, introduced us to the concept of spatial lag. This kind of feature exploits the greatest strength of the data we possess, namely its geographical structure, and allowed us to introduce 'spatiality' into our machine learning vocabulary.
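A spatially lagged feature is, in essence, the average of a variable over each region's neighbours. The snippet below is a minimal sketch of how such a feature can be built with libpysal; the GeoDataFrame regions, with one polygon per area and an income column, is a stand-in for our real data rather than our actual pipeline.

import libpysal

# Contiguity weights: two regions are neighbours if their polygons touch (queen criterion)
w = libpysal.weights.Queen.from_dataframe(regions)
w.transform = 'r'  # row-standardise, so each lag becomes an average over the neighbours

# Spatially lagged income: for every region, the average income of its neighbouring regions
regions['income_lag'] = libpysal.weights.lag_spatial(w, regions['income'])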
With spatially lagged features we could extract vital information that is usually lost in traditional machine learning techniques such as random forests or XGBoost. To make sure we only lagged features that genuinely carry spatial information, we looked at Moran's I coefficient. This coefficient measures spatial autocorrelation: in simple terms, how well the value of a variable in one area can be predicted from the values of the same variable in geographically neighbouring areas.
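Reusing w and regions from the sketch above, Moran's I can be computed with the esda package; again, this illustrates the check rather than reproducing our exact code.

from esda.moran import Moran

# Moran's I for average NO2: values near +1 mean strong positive spatial autocorrelation
# (neighbouring regions have similar NO2 levels), values near 0 mean no spatial pattern
mi = Moran(regions['NO2_avg'], w)
print(f"Moran's I: {mi.I:.3f} (pseudo p-value: {mi.p_sim:.4f})")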
Our final deliverable was a model that uses the best combination of 'normal' and spatially lagged features to predict NO2 levels across Spain. We started our search with more than 30 candidate features and ended with a model that uses just eight. Based on the Moran's I scores and multiple trials with different combinations, we chose which of those features to spatially lag. The resulting model predicts NO2 levels throughout Spain with an accuracy of 88.8%. We found that the percentage of space used for residential buildings and the number of homes with surfaces between 61 and 90 m2 were the strongest predictors of NO2 levels. Other notable predictors were homes with surfaces between 45 and 60 m2 and the number of people aged between 0 and 25 per square kilometre. In other words, we could predict NO2 levels with precision using primarily residential information, an insight that shows how closely city planning affects liveability. The code below shows the final model, its cross-validated accuracy, and the resulting feature importances.
# Imports used throughout the modelling and mapping code below
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from sklearn.model_selection import KFold, cross_val_score

# Containers for the cross-validation results of the models we compared
resall = pd.DataFrame()
res_w1 = pd.DataFrame()

# 10-fold cross-validation with a fixed seed for reproducibility
seed = 7
kfold = KFold(n_splits=10, random_state=seed, shuffle=True)

# Final model: a gradient-boosted tree regressor (XGBoost) trained on the combination of
# 'normal' and spatially lagged features (mix_matrix), with average NO2 per region as target (Y)
model12 = xgb.XGBRegressor(colsample_bytree=0.4,
                           gamma=0,
                           learning_rate=0.07,
                           max_depth=3,
                           min_child_weight=1.5,
                           n_estimators=10000,
                           reg_alpha=0.75,
                           reg_lambda=0.45,
                           subsample=0.6,
                           seed=42)
model12.fit(mix_matrix, Y)

# Cross-validated score (the regressor's default R²) for the NO2 predictions
results_NO2_avg = cross_val_score(model12, mix_matrix, Y, cv=kfold)
print(f'XGBoost - Accuracy {results_NO2_avg.mean()*100:.3f}% std {results_NO2_avg.std()*100:.3f}%')

res_w1["Res"] = results_NO2_avg
res_w1["Type"] = "XGBoost"
resall = pd.concat([resall, res_w1], ignore_index=True)
XGBoost – Accuracy 88.876% std 1.377%
# Print and plot the feature importances of the fitted model ('names' lists the eight feature names)
plt.figure(figsize=(30, 9))
for name, importance in zip(names, model12.feature_importances_):
    print(f'{name:15s} {importance:.4f}')
sns.barplot(x=names, y=model12.feature_importances_)
income_lag        0.0429
pob_sale_lag      0.0842
less45m2_per_km2  0.0765
med_small_per_km2 0.1632
large_med_per_km2 0.2218
more90m2_per_km2  0.0553
young_per_km2_lag 0.1494
housing_perc      0.2067
# Relative prediction error per region: |actual - predicted| / actual
pred2 = model12.predict(mix_matrix)
abs_diff2 = abs(Y - pred2) / Y
abs_diff2 = pd.DataFrame(abs_diff2)

# Attach each region's geometry so the errors can be drawn on a map
abs_diff2 = abs_diff2.merge(df_tryout.geometry, how='left', left_index=True, right_index=True)
abs_diff2 = gpd.GeoDataFrame(abs_diff2, crs="EPSG:4326", geometry='geometry')

# Clip the relative error to [0, 1] and plot it; darker regions indicate larger errors
abs_diff2.NO2_avg = np.clip(abs_diff2.NO2_avg, 0, 1)
fig, ax = plt.subplots(1, figsize=(30, 20))
abs_diff2.plot(column='NO2_avg', cmap='gist_yarg', linewidth=0.8, ax=ax, legend=True)
The key finding
We opened this project with two questions: what data is relevant, and how do we make data smart? We can now answer both, but to do so we need to take a step back.
No matter how much data we gather, it will not be enough to create a full digital twin of a city. And even if it were, such a model would not scale, because every additional city would need an equivalent amount of data to train it.
So, is there a way to avoid this scenario? Do we really need all this data? In other words, which decisions could actually be taken based on the model?
In practice, a legislator decides whether to ban trucks from passing through a street, restrict traffic flows, or promote bicycles. Not only are these decisions coarse-grained, they are also likely to be binary (our legislator would ask herself: "Should we ban trucks in the city centre this weekend, or not?"). Moreover, even in the most optimistic scenario, these discrete decisions apply only within a small range of action (e.g., the legislator may permit the use of only two out of three lanes).
Therefore, we do not need a highly accurate model; we need a model accurate enough to help leaders make these coarse-grained interventions.
The strength of our model is that it is completely scalable and flexible enough to adapt to various scenarios: it can be fed with synthetic data, and it was built entirely on publicly available data sources.
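As an illustration of that flexibility (this is not part of our original notebook), a planner could probe the trained model with a synthetic scenario. The sketch below assumes mix_matrix is a DataFrame indexed by region, selected_regions is a hypothetical list of the areas being redesigned, and the 10% reduction in residential share is an arbitrary choice.

# Synthetic what-if scenario: reduce the residential share (housing_perc) by 10%
# in the chosen regions and compare predicted NO2 levels before and after
scenario = mix_matrix.copy()
scenario.loc[selected_regions, 'housing_perc'] *= 0.9

baseline_no2 = model12.predict(mix_matrix)
scenario_no2 = model12.predict(scenario)
print(f'Average predicted change in NO2: {(scenario_no2 - baseline_no2).mean():.3f}')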
Final considerations
Multiple sectors can leverage the results of our model. The public sector stands to be a major beneficiary, as urban planning affects pollution. By adopting innovative strategies to reduce traffic between places, cities may achieve a greater impact at a lower cost than with the measures currently in use. The model tells us what happens if we adjust specific traffic flows within the whole structure: for example, building offices in Sant Cugat to reduce traffic flow into Barcelona and so improve Barcelona's air quality. This approach contrasts with the measures taken today, where politicians intervene only in the places where pollution is already too high.
Nations can use models like this to check whether their pollution policies are working as planned. Our predictions can serve as a benchmark for areas where specific pollution-reducing measures have been taken, and so help review their success. This will shorten the time to market for successful ideas, as confirming their results will take less time, and it will enable a faster rollout of new ideas, since poor ones will be identified sooner. The result will be cost savings and better protection for the environment.