Predicting “Bikeability” in U.S. Cities

Becca R
Towards Data Science


According to the United Nations, more than half of the world’s population now lives in cities. As the worldwide poverty rate continues to fall and incomes rise, the number of private vehicles on the road has surged. Together, these trends mean more traffic congestion, which in turn drives up the greenhouse gas emissions that cause climate change. Alternative “green” modes of transit, namely walking, scootering, and biking, can improve urban mobility and help cities meet emissions-reduction targets, but not all cities are equally conducive to these modes. It is in this context that I developed a data science project to predict the “bikeability” of cities across the United States and to explore which urban features determine it.

I used OLS regression techniques to predict the target value, urban bike score, which ranges from 0 (poor biking conditions) to 100 (perfect biking conditions). Features for my model included a city’s public transit score, population, population density, business density, weather, GDP per capita, and local tax rates. Data sources included Walk Score, U.S. Climate Data, INRIX, City Lab, and the Tax Foundation.

Part I: Data Wrangling

The project began as most data science projects do: with getting and cleaning data! After gathering and cleaning data from the aforementioned sources, I needed to join the disparate information. One major headache was the different naming conventions for the same city (Washington, D.C. and Washington DC, for example). Since I lowercase city names before merging, I ran the following function on all datasets to make sure cities with multiple naming conventions (or misspellings) were harmonized across sources.

def fix_cities(df):
    # Harmonize alternate names and misspellings; 'city' is lowercased before this runs
    df.loc[df['city'] == 'nashville', 'city'] = 'nashville-davidson'
    df.loc[df['city'] == 'louisville', 'city'] = 'louisville-jefferson'
    df.loc[df['city'] == 'lexington', 'city'] = 'lexington-fayette'
    df.loc[df['city'] == 'oklahomacity', 'city'] = 'oklahoma city'
    df.loc[df['city'] == 'salt lakecity', 'city'] = 'salt lake city'
    df.loc[df['city'] == 'sanfrancisco', 'city'] = 'san francisco'
    df.loc[df['city'] == 'virginiabeach', 'city'] = 'virginia beach'
    df.loc[df['city'] == 'washington,d.c.', 'city'] = 'washington, d.c.'
    df.loc[df['city'] == 'washington dc', 'city'] = 'washington, d.c.'
    df.loc[df['city'] == 'washington', 'city'] = 'washington, d.c.'
    df.loc[df['city'] == 'new york', 'city'] = 'new york city'

Next, since there are some cities with the same name across different states, I needed to join dataframes on a combination of city and state. However, some datasets included the full state name while others included the two-letter abbreviation only. I created a dictionary to map state names to state abbreviations.

states_abbrev = {
    'Alaska': 'AK', 'Alabama': 'AL', 'Arkansas': 'AR', 'Arizona': 'AZ',
    'California': 'CA', 'Colorado': 'CO', 'Connecticut': 'CT',
    'District of Columbia': 'DC', 'Delaware': 'DE', 'Florida': 'FL',
    'Georgia': 'GA', 'Hawaii': 'HI', 'Iowa': 'IA', 'Idaho': 'ID',
    'Illinois': 'IL', 'Indiana': 'IN', 'Kansas': 'KS', 'Kentucky': 'KY',
    'Louisiana': 'LA', 'Massachusetts': 'MA', 'Maryland': 'MD', 'Maine': 'ME',
    'Michigan': 'MI', 'Minnesota': 'MN', 'Missouri': 'MO', 'Mississippi': 'MS',
    'Montana': 'MT', 'North Carolina': 'NC', 'North Dakota': 'ND',
    'Nebraska': 'NE', 'New Hampshire': 'NH', 'New Jersey': 'NJ',
    'New Mexico': 'NM', 'Nevada': 'NV', 'New York': 'NY', 'Ohio': 'OH',
    'Oklahoma': 'OK', 'Oregon': 'OR', 'Pennsylvania': 'PA', 'Puerto Rico': 'PR',
    'Rhode Island': 'RI', 'South Carolina': 'SC', 'South Dakota': 'SD',
    'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 'Virginia': 'VA',
    'Vermont': 'VT', 'Washington': 'WA', 'Wisconsin': 'WI',
    'West Virginia': 'WV', 'Wyoming': 'WY'}

I had previously “pickled” each individual dataframe, so my next step was to read in each pickle, make sure city names and states were harmonized, and merge dataframes by city and state. This gave me one master dataset to pickle for later, when I was ready to start my exploratory data analysis (EDA) and feature engineering.

%matplotlib inline
%config InlineBackend.figure_formats = ['retina']
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_pickle('city_traffic.pkl')
df['city'] = df['city'].str.lower()
df['state'] = df['state'].str.strip()
df['city'] = df['city'].str.strip()
fix_cities(df)

files = ['city_busdensity.pkl', 'city_percip.pkl', 'city_poulations.pkl',
         'city_taxes.pkl', 'city_temp.pkl', 'city_walkscore.pkl']
for file in files:
    df_pkl = pd.read_pickle(file)
    if 'city' in df_pkl.columns:
        df_pkl['city'] = df_pkl['city'].str.strip()
        df_pkl['city'] = df_pkl['city'].str.lower()
        fix_cities(df_pkl)  # harmonize names in the incoming dataframe
    if 'state_long' in df_pkl.columns:
        df_pkl['state'] = df_pkl['state_long'].map(states_abbrev)  # full state name -> abbreviation
    if 'state' in df_pkl.columns:
        df_pkl['state'] = df_pkl['state'].str.strip()
        df = pd.merge(df, df_pkl, on=['city', 'state'], how='outer')
        print(file, 'success')
    else:
        print(file, 'no state column')

df.to_pickle('city_data.pkl')

Part II: EDA and Feature Engineering

I loaded back in my pickled dataframe and dropped everything except for my target value (bike_score) and features of interest.

df = pd.read_pickle('city_data.pkl')
df = df[['bike_score', 'walk_score', 'transit_score',
         'congestion', 'bus_density', 'pop_density', 'population',
         'gdp_per_cap', 'state_tax', 'city_tax', 'total_tax',
         'avg_percip', 'avg_temp']]

Next, I created a pairplot to view the distributions of my variables, as well as the bivariate relationships among them.

sns.pairplot(df, height=1.2, aspect=1.5);

While most features appeared to be normally distributed, population, population density (‘pop_density’), and business density (‘bus_density’) were notable exceptions, and I wondered if they might benefit from log transformations.

log_vars = ['population', 'pop_density', 'bus_density']
for v in log_vars:
    df['log_' + v] = np.log(df[v])

I plotted histograms of these three features before and after their log transformations. The code and plots are shown below.

f, axes = plt.subplots(3, 2)
f.subplots_adjust(hspace=0.5)
sns.distplot(df.population, ax=axes[0][0], kde=False, bins='auto')
sns.distplot(df.log_population, ax=axes[0][1], kde=False, bins='auto')
sns.distplot(df.pop_density, ax=axes[1][0], kde=False, bins='auto')
sns.distplot(df.log_pop_density, ax=axes[1][1], kde=False, bins='auto')
sns.distplot(df.bus_density, ax=axes[2][0], kde=False, bins='auto')
sns.distplot(df.log_bus_density, ax=axes[2][1], kde=False, bins='auto')
for i in range(3):
    for j in range(2):
        axes[i][j].set_xticks([])
        axes[i][j].set_yticks([])
Histograms of select features, before and after log transformation

As can be seen from the histograms, the distribution of population looked much closer to normal after the log transformation. Population density and business density, on the other hand, still did not appear normally distributed. Rather, these features have large outliers (New York City, for example) that skew them to the right. Nonetheless, the relationship between the transformed density features and the target value appeared more linear after the log transformation, so I decided to leave them in and see whether they turned out to be significant.

Bike score as a function of urban population density and log urban population density
Bike score as a function of urban business density and log urban business density
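These scatter plots can be drawn along the following lines; a minimal sketch for the population-density pair (the business-density pair is analogous), assuming the df and log columns created above:

f, axes = plt.subplots(1, 2, figsize=(10, 4))
# Left: raw density vs. bike score; right: log-transformed density vs. bike score
sns.regplot(x='pop_density', y='bike_score', data=df, ax=axes[0])
sns.regplot(x='log_pop_density', y='bike_score', data=df, ax=axes[1])
axes[0].set_title('Bike score vs. population density')
axes[1].set_title('Bike score vs. log population density')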

Next, I explored the correlations among my features to identify any collinearity.
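A correlation heatmap like the one below can be produced with seaborn in a couple of lines; a minimal sketch, assuming the df defined above:

plt.figure(figsize=(10, 8))
# Pairwise Pearson correlations across all numeric columns
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm', center=0);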

Heatmap of feature correlations
corrs = {}
cols = list(df.iloc[:, 1:].columns)
for x in cols:
    cols = [w for w in cols if w != x]   # drop x so each pair is counted only once
    for w in cols:
        corrs[(x, w)] = df[x].corr(df[w])
results = [(x, v.round(2)) for x, v in corrs.items() if v > 0.5]
results = sorted(results, key=lambda x: x[1], reverse=True)
results
Correlations for pairs of features

The first three correlations were expected (the log of a variable should be highly correlated with the variable itself). Total tax was simply the sum of city and state taxes, so I chose total_tax as the single tax feature to include in the model. The bottom four pairs of features, with correlations between 0.52 and 0.56, were of concern, so I created interaction terms for them.

interact = [x[0] for x in results[4:]]   # the remaining correlated (non-trivial) feature pairs
interacts = []
for col1, col2 in interact:
    col_interact = col1 + '_x_' + col2
    interacts.append(col_interact)
    df[col_interact] = df[col1] * df[col2]

Next, I checked the variance inflation factors (VIFs) of various combinations of features to ensure that any collinearity among my final features was low. Luckily, all of my VIFs were below the “magic number” of 3.

from statsmodels.stats.outliers_influence import variance_inflation_factor

df['constant'] = 1
X = df[['congestion', 'population', 'gdp_per_cap', 'walk_score_x_transit_score',
        'total_tax', 'avg_percip', 'avg_temp', 'log_population',
        'bus_density_x_pop_density', 'constant']]
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
vif.sort_values(ascending=False)
Variance inflation factors of model features

With my data primed for analysis, I pickled this version of the dataframe before continuing to hyperparameter tuning and model selection.

df.to_pickle('regression_data.pkl')

Part III: Hyperparameter Tuning and Model Selection

Finally, I was ready to create my model and see which features determine “bikeability” in a city!

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, Ridge, RidgeCV
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.model_selection import GridSearchCV
df = pd.read_pickle('regression_data.pkl')

First, I defined ‘y’ as my target variable (bike_score) and ‘X’ as a matrix of urban features.

y = df['bike_score']
X = df[['constant', 'congestion', 'transit_score', 'gdp_per_cap', 'total_tax',
        'avg_percip', 'avg_temp', 'log_population', 'bus_density_x_pop_density']]

Next, I split my data into training and testing sets. I trained my model on 80% of the data and reserved 20% for testing.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=88)

a. Linear regression

I first fit a “vanilla” linear regression model using 5-fold cross-validation (so one-fifth of the training data was always held out for scoring). The average mean absolute error (MAE) across the validation folds was close to 8.

lr = LinearRegression()
scores = cross_val_score(lr, X_train, y_train, cv=5, scoring='neg_mean_absolute_error')
print(np.mean(scores) * (-1))

b. Ridge regularization

Knowing that some of my features were still correlated with one another, I next fit a ridge regression on my (standardized) features; ridge adds an L2 penalty that shrinks coefficients and keeps their estimates stable when features are collinear.

std_scale = StandardScaler()
X_train_scaled = std_scale.fit_transform(X_train)
X_test_scaled = std_scale.transform(X_test)

ridge = Ridge(random_state=88)

I used GridSearchCV from scikit-learn to find the alpha parameter that minimizes the average mean absolute error across five cross-validation folds.

grid_r1 = {'alpha': np.array([100, 75, 50, 25, 10, 1, 0.1, 0.01, 0.001, 0.0001, 0])}
ridge_grid = GridSearchCV(ridge, grid_r1, cv=5, scoring='neg_mean_absolute_error')
ridge_grid.fit(X_train_scaled, y_train)
print("tuned hyperparameters (best parameters):", ridge_grid.best_params_)
print("MAE score:", -1 * ridge_grid.best_score_)

The best alpha was 100, which gave an average MAE of 7.5 across the five folds of the training set — a slight improvement from the “vanilla” linear regression.
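To see how the tuned ridge holds up on the held-out 20%, the fitted grid object can be used directly; a minimal sketch, using the objects defined above (GridSearchCV refits the best estimator on the full training split by default):

ridge_best = ridge_grid.best_estimator_           # the alpha=100 ridge, refit on the training split
y_pred_ridge = ridge_best.predict(X_test_scaled)
print('Test MAE (ridge):', round(mae(y_test, y_pred_ridge), 2))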

c. Lasso regularization

Lasso regularization adds an L1 penalty that shrinks some coefficients exactly to zero, effectively keeping only the most useful features. Given this, I decided to throw the “kitchen sink” of features into this model. Once again, my first step was to use GridSearchCV to determine the optimal alpha parameter. An alpha of 1.0 returned an MAE of 6.6, already a significant improvement over the ridge model.

y = df['bike_score']
XL = df.drop(['bike_score', 'constant'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(XL, y, test_size=0.2, random_state=88)

std_scale = StandardScaler()
X_train_scaled = std_scale.fit_transform(X_train)
X_test_scaled = std_scale.transform(X_test)

grid_l1 = {'alpha': np.array([100, 75, 50, 25, 10, 1, 0.1, 0.01, 0.001, 0.0001, 0])}
lasso = Lasso(random_state=88)
lasso_grid = GridSearchCV(lasso, grid_l1, cv=5, scoring='neg_mean_absolute_error')
lasso_grid.fit(X_train_scaled, y_train)
print("tuned hyperparameters (best parameters):", lasso_grid.best_params_)
print("MAE score:", -1 * lasso_grid.best_score_)

As I mentioned above, lasso regularization tends to drive several feature coefficients to zero and keep only a select set of significant features. Indeed, only walk score, the walk score/transit score interaction, congestion, and average precipitation returned non-zero coefficients. I decided to keep all of these except walk score in the final model, to avoid the collinearity between walk_score and the walk_score*transit_score interaction.

best_lasso = lasso_grid.best_estimator_   # the alpha=1.0 lasso, refit on the training split
coefficients = sorted(list(zip(X_test.columns, best_lasso.coef_.round(2))),
                      key=lambda x: x[1], reverse=True)
coefficients
Feature coefficients with lasso regularization

Below is the code for my final model, which fits a lasso regression with alpha 1.0 on the three significant features in the training data (the walk score/transit score interaction, congestion, and average precipitation). Finally, I used the model to predict bike scores in the test set (remember that 20% of the data I put aside?). The MAE on the test data (the mean absolute error between model predictions and actual target values) was 6.3, meaning that my model was able to predict city bike scores within 6.3 points of the actual score, on average.

y = df['bike_score']
XL2 = df[['walk_score_x_transit_score', 'congestion', 'avg_percip']]
X_train, X_test, y_train, y_test = train_test_split(XL2, y, test_size=0.2, random_state=88)

std_scale = StandardScaler()
X_train_scaled = std_scale.fit_transform(X_train)
X_test_scaled = std_scale.transform(X_test)

lasso = Lasso(random_state=88, alpha=1.0, fit_intercept=True)
lm2 = lasso.fit(X_train_scaled, y_train)
y_pred = lm2.predict(X_test_scaled)
mae(y_test, y_pred)

Results

Based on the model coefficients, more “bikeable” cities are those with a precedent for alternative modes of transit (public transit and walking), more congestion (a greater incentive to find an alternative to private vehicles), and less precipitation (rain and snow). According to the lasso model, features such as local tax rates, business density, and population density had no significant effect on a city’s “bikeability”.

coefficients = sorted(list(zip(X_test.columns, lm2.coef_.round(2))), key=lambda x: x[1], reverse=True)
coefficients
Feature coefficients, final regression model with lasso regularization

The low R-squared and adjusted R-squared of my model suggest that much of the variance of a city’s bikeability is not explained by my selected features. A next step will be to gather additional features that might explain bikeability, such as the quality of bike-related infrastructure and overall hilliness.
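For reference, the adjusted R-squared computed below uses the standard formula 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations being scored and k is the number of features.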

def r2_adj(X, y_true, r_squared):
    # Adjusted R-squared penalizes R-squared for the number of features in X
    n = len(y_true)
    k = X.shape[1]
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

r2 = lm2.score(X_test_scaled, y_test)
r2a = r2_adj(X_test_scaled, y_test, r2)
print('R-squared', round(r2, 2), 'Adjusted R-squared', round(r2a, 2))

Once I refine the model and further reduce the MAE through additional feature selection and engineering, future work will be to incorporate international cities, for which walkscore.com does not currently provide bike scores. With urbanization increasing worldwide and the threat of climate change looming, it is imperative that cities find ways to reduce private-vehicle congestion and emissions. Bike- and scooter-share programs can contribute to this effort, and I hope this project sheds some light on the factors that create viable preconditions for implementing such programs.

I am always striving to improve my work, so suggestions are always welcome! Please feel free to provide feedback in the comments.
