NYCOpenData has a treasure trove of rich, interesting datasets to explore on topics like health, the environment, business, and education. I stumbled upon the 2018 Central Park Squirrel Census dataset and knew immediately that I had to do something with it. The dataset covers squirrel sightings collected over the course of two weeks by volunteers in Central Park. After looking through the data dictionary, I was drawn to a feature named ‘Approaches’ that denotes whether a squirrel was observed approaching a human. I thought it’d be neat to train a machine learning (ML) model to help me determine whether a squirrel within the bounds of Central Park would approach me. This article walks through that weekend project and details the entire process of building the model. There’s a little bit of everything here: geospatial data, clustering, visualization, feature engineering, unstructured text, model training, model calibration, and model deployment.
I deployed the model as a Streamlit app where you can enter your coordinates and other features, and it will tell you the probability of a squirrel approaching you. You can play with it here. Also, if you are interested in looking through some of the code, I’ve posted the .ipynb here.
Loading and Initial EDA
The data loading was pretty standard.
import pandas as pd

ini_squirrel_df = pd.read_csv('/content/drive/MyDrive/SquirrelML/NYC_Squirrels.csv')
To carry out the initial EDA, I used dataprep to quickly get an idea of the feature distributions, cardinalities, patterns, missing data, and correlations present in the raw dataset. You can see the report here. There were several useful insights that I gained from this, which allowed me to plan my subsequent feature engineering and remove redundant or unnecessary features. The most notable observations I gleaned from this EDA were as follows:
- The dataset comprises 3023 unique observations and 31 columns, with 13% of cells missing.
- Most of the features are categorical and Boolean.
- The latitude and longitude entries appear to have an approximately tetramodal distribution
- The "Shift" column is fairly balanced (~55% of observations were made in the afternoon/night)
- The "Indifferent" column is fairly balanced (~51% of entries are False)
- The vast majority of squirrels are Adults and have a primary fur color of Gray
- The most common location where squirrels were observed was in trees
- Color notes, Specific location, Other Activities, and Other Interactions are text columns that have a lot of missing data
- The "Approaches" column is HIGHLY imbalanced (94.11% of the observations are negative)
These are all useful observations: I now know that the problem I’m trying to tackle is an imbalanced, binary classification problem. I also know that using clustering methods to group squirrel sightings might be useful moving forward, and that the text columns may contain information that will enable the creation of new, potentially useful features.
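For reference, the dataprep report above takes only a couple of lines to generate. Here is a minimal sketch, assuming dataprep's create_report API and an illustrative output filename:

from dataprep.eda import create_report

# Build an interactive EDA report (distributions, missing values, correlations, etc.)
report = create_report(ini_squirrel_df)
report.save('squirrel_eda_report')  # writes an HTML report; report.show_browser() opens it directly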
Clustering Squirrel Sightings
As can be seen below, the squirrel sightings appear to have an approximately trimodal or tetramodal distribution. I thought it would be interesting to try and group the squirrel sightings based on location.
Though we could try to establish an optimal number of clusters via silhouette scores or the elbow method, in this case I simply wanted to partition the sightings into 4 areas. I ran the K-Means algorithm using the Latitude and Longitude as shown below:
from sklearn.cluster import KMeans
import joblib

# Function for applying clustering
def apply_clustering(df, num_clusters=4):
    kmeans = KMeans(n_clusters=num_clusters, n_init='auto')
    # 'X' and 'Y' hold the longitude and latitude of each sighting
    df['Cluster'] = kmeans.fit_predict(df[['X', 'Y']])
    centroids = kmeans.cluster_centers_
    # Save the fitted KMeans model so it can be reused later (e.g., in the app)
    joblib.dump(kmeans, '/content/drive/MyDrive/SquirrelML/squirrel_kmeans.pkl')
    return df, centroids
The results of the K-Means algorithm are shown below. There are some clear boundaries, which I’ve accentuated a bit via a Voronoi diagram. You can also see the centroids of each cluster depicted via red X’s. I’ve also made it so that points corresponding to an approaching squirrel are not enclosed in a dark circle, while those that didn’t approach are enclosed.
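If you want to reproduce the Voronoi overlay, here is a rough sketch of how it can be drawn from the K-Means centroids using scipy; the styling choices are assumptions on my part:

import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

# Assumes the clustering has been run, e.g.:
# squirrel_df, centroids = apply_clustering(ini_squirrel_df)
vor = Voronoi(centroids)

fig, ax = plt.subplots(figsize=(8, 8))
# Draw only the Voronoi cell boundaries (no extra points/vertices)
voronoi_plot_2d(vor, ax=ax, show_points=False, show_vertices=False, line_colors='gray')

# Sightings colored by cluster; centroids marked with red X's
ax.scatter(squirrel_df['X'], squirrel_df['Y'], c=squirrel_df['Cluster'], cmap='viridis', s=10)
ax.scatter(centroids[:, 0], centroids[:, 1], marker='x', color='red', s=100)
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
plt.show()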
Geospatial Visualization of Squirrel Sightings
Geospatial data is fun to work with. If you look up an aerial view of Central Park, or if you are simply familiar with its layout, you can see the one-to-one correspondence between the park and the plot. For instance, there are 3 major bodies of water in Central Park, which show up as the empty areas of the plot. You can find a labeled layout of Central Park here. That large ‘empty’ area situated in the ‘Yellow’ cluster is actually ‘The Reservoir’. On the top right, we also see an emptyish area that corresponds to the ‘Harlem Meer’. Finally, in the bottom left of the blue cluster we can see the body of water known as ‘The Lake’.
To better take advantage of this data, I thought it’d be neat to visualize these sightings directly on a geospatially accurate map of NYC. To do this, I once again turned to NYCOpenData and downloaded the shape (.shp) files for NYC over here. Your download should include 4 different files: a .shp file, a .prj file, a .dbf file, and a .shx file. Make sure that all of those files are in the same directory before you load them. Once I had those files downloaded, I loaded them into my environment with GeoPandas. These files are generally quite large and can take a few minutes to load (in my case it took about 5 minutes).
import geopandas as gpd

nyc_map = gpd.read_file('geo_export_0a3d2fab-a76c-40da-bc23-6e335dd753dd.shp')
To make the .shp file a bit more lightweight and faster to work with, I used Geofeather to convert the GeoPandas frame into a feather file.
import geofeather

geofeather.to_geofeather(nyc_map, 'nyc_map.feather')
Now I can regenerate the nyc_map from the feather file, which is many times faster than loading the original .shp file. We can plot the nyc_map as follows:
nyc_map = geofeather.from_geofeather('nyc_map.feather')
nyc_map.plot()
And the result is below:
That looks great! You can also see Central Park toward the top left of the NYC landmass. As cool as this view is, I want to focus on the Central Park area. Given that, I won’t need the entire dataset, so I’ll filter it based on the squirrel sighting coordinates. This also has the added benefit of speeding up all subsequent operations, since we will be dealing with a much smaller subset of the NYC map.
A centroid is the geometric center of a shape. In the context of geospatial data, it refers to the central point of a geographic feature. This point is the average position of all the points in the shape. For simple shapes like polygons (e.g., the boundary of a city or park), the centroid is often inside the shape. For more complex shapes, especially those with concave parts, the centroid might be outside the physical boundaries of the shape. There are a few approaches one can use to calculate centroids. Here I’m simply using the centroid attribute that GeoPandas (via shapely) provides on the nyc_map’s geometry:
from shapely.geometry import Point
# Calculate the centroid for each geometry
nyc_map['centroid'] = nyc_map.geometry.centroid
Then, I can filter the nyc_map using the coordinate ranges of the squirrel sightings.
# Filter based on centroid's longitude and latitude
filtered_nyc_map = nyc_map[(nyc_map['centroid'].x >= -74.0) &
(nyc_map['centroid'].x <= -73.94) &
(nyc_map['centroid'].y >= 40.75) &
(nyc_map['centroid'].y <= 40.82)]
And now we can plot the filtered NYC map, which, if we did things correctly, should be centered around Central Park.
# Plot the filtered DataFrame
filtered_nyc_map.plot(figsize=(10, 10), aspect='auto')
The result is below:
Success! However, we have one more thing to do. I want to include the squirrel sightings, colored by cluster and accentuated based on squirrel approach on this plot. We can do that using the code below:
# Convert the 'X' and 'Y' columns into a GeoSeries of Points
points = gpd.GeoSeries([Point(xy) for xy in zip(squirrel_df['X'], squirrel_df['Y'])])
# Create a new GeoDataFrame
squirrel_geo_df = gpd.GeoDataFrame(squirrel_df, geometry=points)
# Plotting the filtered NYC map
fig, ax = plt.subplots(figsize=(10, 10))
filtered_nyc_map.plot(ax=ax, color='#207388')  # Base map
# Overlay the squirrel data points, color-coded by 'Cluster'
squirrel_geo_df.plot(ax=ax, column='Cluster', categorical=True, markersize=20,
                     cmap='viridis', legend=True,
                     legend_kwds={'title': 'Cluster', 'loc': 'upper left'})
# Outline the sightings where the squirrel did NOT approach, so approaching
# squirrels stand out as the points without a dark circle around them
did_not_approach = squirrel_geo_df[~squirrel_geo_df['Approaches']]
did_not_approach.plot(ax=ax, markersize=20, facecolor='none', edgecolor='black')
# Adding labels and title
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('Squirrel Sightings in Central Park by Cluster')
plt.show()
And the result of this is below:
That’s pretty cool! I’m pretty happy with this. There was one more visualization I was considering: I found a really cool aerial picture of Central Park and thought it would be neat to overlay the squirrel sightings on top of that aerial view. However, some complications with the plotting got in the way, so I figured I’d save that for a future time.
Feature Engineering
As I alluded to in the introduction, there were a fair number of feature engineering opportunities in this dataset. Most of the new features I created were derived from the textual columns, which contained additional descriptions the volunteers made about each squirrel sighting. I identified a few key categories, and terms associated with each category, which I used alongside regular expressions to create the new features as shown below:
# Define lists of terms for various features
tree_terms = ['tree', 'trees', 'maple', 'log', 'branch', 'oak', 'willow', 'trunk', 'stump', 'elm']
shrubbery_terms = ['shrub', 'bush', 'weeds']
rock_terms = ['rock', 'stone']
grassland_terms = ['lawn', 'grass', 'field', 'glen']
path_terms = ['path', 'road', 'pavement', 'street']
structure_terms = ['conservatory', 'fountain', 'bench', 'fence', 'statue', 'bin', 'carousel', 'trellis',
'structure', 'car', 'table', 'ledge', 'railing', 'overpass', 'post', 'can', 'house',
'arch', 'bar', 'sanctuary', 'bridge', 'bike', 'rack', 'construction', 'playground']
water_terms = ['water', 'shore', 'pond', 'pool', 'stream']
# Function to create a matching column based on specified terms
def create_matching_column(df, column_name, source_column, terms):
pattern = '|'.join(terms)
df[column_name] = df[source_column].str.contains(pattern, case=False, na=False)
return df
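Applying the helper is then just a matter of picking a text column and a term list. The calls below are illustrative (assuming the 'Specific Location' text column as the source); the exact source columns used for each feature live in the notebook:

# Illustrative calls; the notebook applies these across the relevant text columns
squirrel_df = create_matching_column(squirrel_df, 'Seen in Tree', 'Specific Location', tree_terms)
squirrel_df = create_matching_column(squirrel_df, 'Seen in Shrubbery', 'Specific Location', shrubbery_terms)
squirrel_df = create_matching_column(squirrel_df, 'Seen in Rock', 'Specific Location', rock_terms)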
This let me create new features like ‘Seen in Tree’, ‘Seen in Shrubbery’, and ‘Seen in Rock’, amongst several others. In addition, columns like ‘Primary Fur Color’ and ‘Highlight Fur Color’ were One Hot Encoded since they had only a handful of categories (e.g., Primary Fur Color had Gray, Black, Cinnamon, and Unknown). There were other engineered features as well; if you are interested in the full EDA on the added features, as well as the full set of features that were created, you can check out the report here.
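The one-hot encoding step is straightforward with pandas. A minimal sketch (column names taken from the data dictionary, with prefixes chosen to match the PFC_/HFC_ names that show up later):

# Treat missing fur colors as their own 'Unknown' category, then one-hot encode
for col in ['Primary Fur Color', 'Highlight Fur Color']:
    squirrel_df[col] = squirrel_df[col].fillna('Unknown')

squirrel_df = pd.get_dummies(squirrel_df,
                             columns=['Primary Fur Color', 'Highlight Fur Color'],
                             prefix=['PFC', 'HFC'])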
I created a pipeline to do all the necessary cleaning, formatting, etc. that I decided was needed after playing with the dataset. You can check out all of that work in the notebook linked at the beginning of the article. A few things worth mentioning are:
- We now have 44 features (the original dataset had 31 columns)
- No observations were dropped so we still have 3023 rows.
- There is no missing data anymore. No imputation or dropping of rows was performed: most of the missing data was in the textual columns, and those columns were dropped after they were used for feature engineering. In columns like Primary Fur Color or Age, I simply treated the NaNs as a new category called ‘Unknown’.
The correlation map below wasn’t particularly helpful. It mostly revealed correlations/relationships that I already anticipated. For instance, most of the strong anticorrelations are between variables associated with Primary Fur Color (e.g., Gray vs Cinnamon) or squirrel Age (e.g., Adult vs Juvenile).
At this point I wanted to use dimensionality reduction techniques like Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) to try and identify any additional patterns/clusters within my data, as well as to gauge which features carry the most weight. The t-SNE plot wasn’t particularly revealing, though I can sort of see a few groupings in the data. I carried out PCA as well, and both the loading and explained variance plots were useful: they showed that 30 features accounted for ~90% of the variance, and that features like Cluster, PFC_Gray, X, Y, PFC_Cinnamon, and HFC_Gray were some of the strongest contributors in the dataset.
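If you want to reproduce the explained-variance part of that analysis, a rough sketch is below; the scaling choice and variable names are assumptions on my part:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = squirrel_df.drop('Approaches', axis=1)

# Standardize so that no single feature dominates the variance
X_scaled = StandardScaler().fit_transform(features)
pca = PCA().fit(X_scaled)

# How many components are needed to reach ~90% of the variance?
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components_90 = int(np.argmax(cum_var >= 0.90)) + 1
print(f"{n_components_90} components explain ~90% of the variance")

# Loadings of the first principal component, sorted by magnitude
loadings = pd.Series(pca.components_[0], index=features.columns)
print(loadings.abs().sort_values(ascending=False).head(10))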
Training the Model
From the moment I decided to take a ML approach for this, I knew that there were really only two algorithms I’d consider: Decision Tree and Random Forest. The reason being… squirrels like trees and that joke is the main reason I decided to do this project. Though other algorithms may perform better, their choice wouldn’t be as funny. If you have any funny suggestions for approaches let me know 🙂
Anyways, let’s get to modeling. First off, here is how my dataframe looks at this point.
The first thing we need to do is create our train and test sets. One thing to keep in mind is that we must stratify the split on our target variable since the dataset is highly imbalanced. Stratified splitting preserves the percentage of samples for each class.
from sklearn.model_selection import train_test_split

# Assuming squirrel_df is your DataFrame and 'Approaches' is the target variable
X = squirrel_df.drop('Approaches', axis=1) # Features
y = squirrel_df['Approaches'] # Target
# Stratify the split to maintain the class distribution in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
Given how imbalanced the dataset is, one approach I considered was to use data augmentation techniques. However, I’ll save this for a future iteration, since introducing synthetic samples can lead to other complications and, in my experience, models trained on them tend not to generalize well.
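For reference, if I do go down the augmentation route, SMOTE from imbalanced-learn would be the obvious first try. A sketch of what that would look like (not part of the current pipeline):

# Not used here; just what a SMOTE-based run would look like
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
print(Counter(y_train_res))  # the minority class is now upsampled to parity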
In the meantime, I’ll deal with the imbalance by using balanced class weights when training the models. Regarding metrics, I’m going to look at both the Receiver Operating Characteristic Area Under the Curve (ROC-AUC) score and the Precision-Recall Area Under the Curve (PR-AUC) score. The PR-AUC score in particular is useful for imbalanced classification problems, since it focuses on the minority class while the ROC curve covers both classes.
I did a grid search on the Decision Tree and Random Forest models. I used a stratified K-fold with 5 splits since I want to preserve the class percentage between splits. I also used both the PR-AUC and ROC-AUC as the metrics to optimize during the hyperparameter tuning. For the random forest, I looked at n_estimators, max_depth, min_samples_split, and min_samples_leaf using a conservative range of values. The code for the random forest hyperparameter tuning is shown below:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, average_precision_score, roc_auc_score
# Define the parameter grid for Random Forest
rf_param_grid = {
'n_estimators': [50, 100, 200, 500], # Number of trees in the forest
'max_depth': [None, 5, 10, 20], # Maximum depth of the tree
'min_samples_split': [2, 5, 7, 10], # Minimum number of samples required to split a node
'min_samples_leaf': [1, 3, 5, 10] # Minimum number of samples required at each leaf node
}
# Initialize Stratified K-Fold
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Define multiple scoring metrics
scoring_metrics = {
'PR-AUC': make_scorer(average_precision_score, needs_proba=True),
'ROC-AUC': 'roc_auc'
}
# Initialize the GridSearchCV object with Stratified K-Fold
rf_grid_search = GridSearchCV(
estimator=RandomForestClassifier(random_state=42, class_weight='balanced'),
param_grid=rf_param_grid,
cv=stratified_kfold, # Using Stratified K-Fold here
scoring=scoring_metrics,
refit='PR-AUC',
n_jobs=-1,
verbose=2
)
rf_grid_search.fit(X_train, y_train)
# Best parameters and scores
print("Best parameters for Random Forest:", rf_grid_search.best_params_)
print("Best PR-AUC score:", rf_grid_search.best_score_)
# To access the results for each metric
cv_results = rf_grid_search.cv_results_
# Find the index of the best score based on the refit criterion
best_index = rf_grid_search.best_index_
# Display the best hyperparameters and corresponding scores
print("Best Hyperparameters:", rf_grid_search.best_params_)
print(f"Best Mean Test PR-AUC: {cv_results['mean_test_PR-AUC'][best_index]:.4f}")
print(f"Best Mean Test ROC-AUC: {cv_results['mean_test_ROC-AUC'][best_index]:.4f}")
For the Decision Tree, I also used a small range of values for max_depth, min_samples_split, and min_samples_leaf. The code for the hyperparameter tuning of the Decision Tree is shown below:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import make_scorer, average_precision_score, roc_auc_score
# Define the parameter grid for Decision Tree
dt_param_grid = {
'max_depth': [10, 20, None], # Maximum depth of the tree
'min_samples_split': [2, 5], # Minimum number of samples required to split a node
'min_samples_leaf': [1, 2] # Minimum number of samples required at each leaf node
}
# Initialize Stratified K-Fold
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Define multiple scoring metrics
scoring_metrics = {
'PR-AUC': make_scorer(average_precision_score, needs_proba=True),
'ROC-AUC': 'roc_auc'
}
# Initialize the GridSearchCV object with Stratified K-Fold
dt_grid_search = GridSearchCV(
estimator=DecisionTreeClassifier(random_state=42, class_weight='balanced'),
param_grid=dt_param_grid,
cv=stratified_kfold, # Using Stratified K-Fold here
scoring=scoring_metrics,
refit='PR-AUC',
n_jobs=-1,
verbose=2
)
dt_grid_search.fit(X_train, y_train)
# Best parameters and scores
print("Best parameters for Decision Tree:", dt_grid_search.best_params_)
print("Best PR-AUC score:", dt_grid_search.best_score_)
# To access the results for each metric
dt_cv_results = dt_grid_search.cv_results_
# Find the index of the best score based on the refit criterion
best_index = dt_grid_search.best_index_
# Display the best hyperparameters and corresponding scores
print("Best Hyperparameters:", dt_grid_search.best_params_)
print(f"Best Mean Test PR-AUC: {dt_cv_results['mean_test_PR-AUC'][best_index]:.4f}")
print(f"Best Mean Test ROC-AUC: {dt_cv_results['mean_test_ROC-AUC'][best_index]:.4f}")
Evaluating and Explaining the Model
In evaluating the performance of the best Random Forest and Decision Tree models on the highly imbalanced dataset, I observed distinct outcomes in ROC-AUC and PR-AUC scores. The Random Forest model excelled with a ROC-AUC score of 0.91, significantly outperforming the Decision Tree’s respectable score of 0.77. This indicates its superior ability to distinguish between classes. However, the PR-AUC scores, which are crucial in the context of an imbalanced dataset due to their focus on the precision-recall balance, told a different story. Both models surpassed the baseline performance of a ‘No Skill’ classifier (which predicts the majority class in all cases), yet they showed room for improvement in precision and recall: the Random Forest achieved a PR-AUC score of 0.46, and the Decision Tree only 0.20. These results underscore the importance of focusing on PR-AUC as the target metric in imbalanced scenarios and suggest avenues for future improvement, such as experimenting with different class imbalance mitigation techniques, adjusting decision thresholds, or exploring alternative algorithms better suited for imbalanced data.
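For context, the test-set scores above can be computed along these lines; this is a sketch, and the variable names follow the grid search code from the previous section:

from sklearn.metrics import roc_auc_score, average_precision_score

# Positive-class probabilities from each tuned model on the held-out test set
rf_probs = rf_grid_search.best_estimator_.predict_proba(X_test)[:, 1]
dt_probs = dt_grid_search.best_estimator_.predict_proba(X_test)[:, 1]

print(f"Random Forest ROC-AUC: {roc_auc_score(y_test, rf_probs):.2f}, PR-AUC: {average_precision_score(y_test, rf_probs):.2f}")
print(f"Decision Tree ROC-AUC: {roc_auc_score(y_test, dt_probs):.2f}, PR-AUC: {average_precision_score(y_test, dt_probs):.2f}")

# The 'No Skill' PR-AUC baseline is simply the positive-class prevalence
print(f"No-skill PR-AUC baseline: {y_test.mean():.2f}")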
The feature importances also show that latitude, longitude, and Indifferent are the top features contributing the most toward the model’s predictions. This is further supported by the SHAP value plots for both models.
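The SHAP summary plot for the Random Forest can be generated roughly as follows; this is a sketch, and the exact handling of the output depends on the installed shap version:

import shap

best_rf = rf_grid_search.best_estimator_
explainer = shap.TreeExplainer(best_rf)
shap_values = explainer.shap_values(X_test)

# Depending on the shap version, binary classifiers return either a list with one
# array per class or a single (samples, features, classes) array; keep the positive class
if isinstance(shap_values, list):
    shap_values = shap_values[1]
elif shap_values.ndim == 3:
    shap_values = shap_values[:, :, 1]

shap.summary_plot(shap_values, X_test)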
The probability distributions for the Random Forest and Decision Tree models are shown below. The Random Forest histogram shows a distribution of predicted probabilities with a peak around 0.1 to 0.2 and a long tail extending towards 0.8. This suggests that the Random Forest model tends to predict many instances with a low probability of being the positive class, but there is uncertainty, as the distribution does not peak near 0 or 1. The tail towards the higher probabilities indicates that the model is somewhat confident about certain predictions, but overall it takes a cautious stance, since the probabilities are not clustered near 0 or 1.

The Decision Tree histogram shows an extremely peaked distribution at probability 0 and very few instances with a probability of 1. This is indicative of a model that is very confident about a large majority of instances belonging to the negative class. The lack of instances with intermediate probabilities suggests that the Decision Tree model makes very certain predictions – it is either very sure an instance is in the positive class or very sure it is not. This is typical for decision trees, as they tend to produce more extreme probability estimates.
The Random Forest model is likely to be better calibrated, as it provides a smoother distribution of probabilities and does not tend towards the extremes as much as the Decision Tree. The Decision Tree’s extreme confidence might be a sign of overfitting, where the model has learned the training data too well, including noise, leading to overly confident predictions. Neither model shows the ideal two-peaked distribution that would indicate high confidence in distinguishing between classes. This might suggest that the models are not capturing the underlying patterns perfectly and might benefit from further tuning, additional features, or more complex modeling techniques. I could certainly play around a bit more with features, however, I’ll save that for a future iteration. Hence, I decided that probability calibration for the better performing Random Forest model would be a good next step.
Calibrating the Model
Calibrating a machine learning model is the process of adjusting its predicted probabilities to better align with the observed frequencies in the data. It is a step that is often overlooked when training models, which limits their usability. Calibration is important because the raw estimates of a machine learning model may not represent the true likelihood of an event. This is crucial to consider if your model is to be used in any kind of decision-making process (especially one where the stakes are high). In addition, well-calibrated probabilities allow for more meaningful comparisons between different models.
To do the calibration, I used Platt Scaling (also known as Sigmoid Calibration). This method fits a logistic regression model to the outputs of the classifier. The basic workflow is as follows:
- Train the model
- Split the data
- Fit the calibration model
- Evaluate the calibration
The code for this process is shown below.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import brier_score_loss
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt
# Split the training data further into a smaller training set and a calibration set,
# stratifying again to preserve the class balance
X_train_2, X_calib, y_train_2, y_calib = train_test_split(X_train, y_train, test_size=0.2, stratify=y_train)
# Fit the model on the training data
best_rf_classifier.fit(X_train_2, y_train_2)
# Calibrate the model on the calibration data
# Using Platt scaling (method='sigmoid'); Isotonic regression (method='isotonic') is the other option
calibrated_rf = CalibratedClassifierCV(estimator=best_rf_classifier, method='sigmoid', cv='prefit')
calibrated_rf.fit(X_calib, y_calib)
# Now you can use calibrated_rf to make predictions with calibrated probabilities
calibrated_probs = calibrated_rf.predict_proba(X_test)
# Positive-class probabilities on the test set, used for the Brier score
brier_probs = calibrated_probs[:, 1]
# Evaluate calibration performance
brier_score = brier_score_loss(y_test, brier_probs)
print(f"Brier score: {brier_score:.4f}")
# Get the calibrated probabilities for the positive class
prob_true, prob_pred = calibration_curve(y_test, calibrated_probs[:, 1], n_bins=20)
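The prob_true/prob_pred pairs can then be turned into the reliability diagram described below. A plotting sketch (the styling is assumed rather than the notebook's exact code):

# Reliability diagram: mean predicted probability vs. observed fraction of positives
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot([0, 1], [0, 1], color='orange', linestyle='--', label='Perfectly calibrated')
ax.plot(prob_pred, prob_true, marker='o', label='Calibrated Random Forest')
ax.set_xlabel('Mean predicted probability')
ax.set_ylabel('Fraction of positives')
ax.set_title('Calibration Curve')
ax.legend()
plt.show()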
The results of the calibration can be seen below. A perfectly calibrated model would have all the points lying exactly on top of the orange reference line. The probabilities are, overall, well calibrated, with a notable exception at ~0.25 where the model is clearly underestimating. In addition, I calculated the Brier score for the calibration. Brier scores range between 0 and 1, where a score of 0 represents perfect accuracy and a score of 1 represents perfect inaccuracy. The Brier score for this calibration is 0.0445, which is quite low and therefore good.
Though there are further improvements I can do to this process, this is something that I’ll refine in a future iteration. For now, I’ll simply save my model and continue.
import joblib
# Save the model to a file
joblib.dump(calibrated_rf, '/content/drive/MyDrive/SquirrelML/cal_Squirrel_RF.pkl')
Deploying the Model as a Streamlit App
So many machine learning models get made, yet sadly, so many are left to collect dust. I think it is important, whenever you are working on an ML project, to have a way for your model to be deployed and used by people. This way your work has a chance to be appreciated and to serve as a bit more than just a training exercise, even if it’s just a silly application like the one I built here. I find that people usually remember your work more if they get to interact with it somehow (which can be quite useful if you are job hunting). Deploying your model as part of a Streamlit app is a great way to showcase your work in a tangible way.
The basic premise of the app is to let the user play around with the various features used to train the model and to return the probability of a squirrel approaching them as an output. Among the model’s inputs are the user’s latitude and longitude. The user can enter them directly in the app, but I also used Folium to include an interactive map, initially centered on Central Park, that gives you the latitude and longitude of the location you click.
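To give a feel for how little code this takes, here is a hypothetical sketch of the app's core prediction logic. The widget choices, file names, and the build_feature_row helper are illustrative rather than the app's actual code:

import joblib
import streamlit as st

# Load the calibrated model saved earlier (path is illustrative)
model = joblib.load('cal_Squirrel_RF.pkl')

st.title('Will a Central Park squirrel approach you?')

# Coordinates can be typed in directly (the real app also lets you click a Folium map)
lat = st.number_input('Latitude', value=40.78)
lon = st.number_input('Longitude', value=-73.96)
shift = st.selectbox('Time of day', ['AM', 'PM'])
# ... widgets for the remaining features go here ...

if st.button('Predict'):
    # build_feature_row is a hypothetical helper that assembles the inputs into a
    # single-row DataFrame with the same columns the model was trained on
    features = build_feature_row(lat, lon, shift)
    sap = model.predict_proba(features)[0, 1]
    st.write(f'Squirrel Approach Probability: {sap:.0%}')
    st.image('happy_squirrel.jpg' if sap > 0.5 else 'sad_squirrel.jpg')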
The user can then play around with the different features found in the drop-down menus, and once they have their desired feature configuration they can click the Predict button. Pressing Predict returns the Squirrel Approach Probability (SAP) and an accompanying picture (a happy squirrel if the probability is above 50%, and a sad squirrel if it is at or below 50%). You can see an example output below:
Conclusion
This was a fun little project to work on. The model can certainly be improved, and I might revisit this in the future to fine-tune it further. For instance, I could use the results of my PCA analysis alongside the SHAP values to reduce the dimensionality of my dataset down to the core features. There are also companion datasets to the Squirrel Census with information like the amount of litter present, the activity levels observed, the sighting duration, and the weather conditions, amongst others. I found those datasets after I was done with my processing, but I’d be keen on incorporating them into my model since they seem like they’d be informative. Those datasets can be found on NYCOpenData here and here. I could also try data augmentation techniques.
As always, I hope you enjoyed my work and thanks for reading!
All data from NYCOpenData is free of restrictions on its use. Check the following links for their FAQ and Terms of Use.