Ensemble Classification: A Brief Overview With Examples

Classifying points of interest as Airports, Train Stations, and Bus Stops using ensemble classification methods

Photo by Charles Deluvio on Unsplash

Note: This article is the final installment in a series on classifying transportation POI data. The first article used various machine learning models to classify records as Airports, Bus Stops, and Train Stations. The second centered on feature-reduction algorithms as a means of tuning the models from the first article for better accuracy. The third used the Spark Multilayer Perceptron Classifier to classify the records. Check out these articles to see how the project has evolved in the search for the best way to classify POI data.

This article will use ensemble classification to classify POI records from SafeGraph using foot traffic patterns. SafeGraph is a data provider offering POI data for hundreds of businesses and categories, free to academics. For this project, I have chosen to use SafeGraph patterns data to classify records as various POIs. The schema for the patterns data can be found here: Schema Info

What Is Ensemble Classification?

Ensemble learning is the concept of combining multiple "weak learners" to create a machine learning model capable of performing better than any of them could individually. Most of the time these weak learners don’t perform well on their own because they have either high bias or high variance. The point of combining multiple weak learners in an ensemble model is to reduce this bias and variance.
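To see why combining weak learners helps, here is a toy calculation (not from the article's data): if three classifiers each have 70% accuracy and make independent errors, a hard majority vote is correct whenever at least two of the three are correct, which comes out noticeably higher than any single learner. The 0.70 figure and the independence assumption are both illustrative.

```python
# Toy illustration of majority voting: three independent weak learners,
# each with (hypothetical) accuracy p. The vote is correct when at least
# 2 of the 3 learners are correct.
from math import comb

p = 0.70  # assumed accuracy of each weak learner
majority = sum(comb(3, k) * p**k * (1 - p) ** (3 - k) for k in (2, 3))
print(f"single learner: {p:.3f}, majority vote: {majority:.3f}")
```

In practice the learners' errors are correlated, so the real gain is smaller, but the direction of the effect is the same.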

Photo by Aron L on Unsplash

Ensemble Classification of POI Data

Before we can start on the ensemble classification itself, we must first load the data. This process can be found in the notebook and has been explained in detail in the first article of the series, as well as in the second. The steps behind loading and preprocessing the data for this article were:

  1. Dropping unnecessary columns: ['parent_safegraph_place_id', 'placekey', 'safegraph_place_id', 'parent_placekey', 'safegraph_brand_ids', 'brands', 'poi_cbg']
  2. Creating a ground-truth column that labels each record as either Airport, Bus Station, Train Station, or Unknown
  3. Dropping Unknown records to clear out records that cannot be identified
  4. Horizontally exploding columns of JSON strings using pyspark
  5. Horizontally exploding columns of arrays
  6. Using Sklearn LabelEncoder package to transform class column
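The steps above can be sketched roughly as follows. The series performs the exploding in pyspark; the sketch below uses pandas for brevity, and the column contents are made-up placeholders rather than the actual SafeGraph schema values.

```python
# Minimal sketch of the preprocessing steps (illustrative data, not the
# actual SafeGraph patterns schema; the series itself uses pyspark).
import json
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "location_name": ["JFK", "Main St Stop", "Union Station"],
    "class": ["Airport", "Bus", "Train"],
    "popularity_by_day": [
        '{"Monday": 10, "Tuesday": 12}',
        '{"Monday": 3, "Tuesday": 4}',
        '{"Monday": 7, "Tuesday": 6}',
    ],
})

# Horizontally explode the JSON-string column into one column per key
exploded = df["popularity_by_day"].apply(json.loads).apply(pd.Series)
df = pd.concat([df.drop(columns="popularity_by_day"), exploded], axis=1)

# Encode the string class labels as integers for the sklearn models
df["class"] = LabelEncoder().fit_transform(df["class"])
print(df)
```

Each JSON key becomes its own numeric column, which is what lets the tabular sklearn models consume the nested SafeGraph fields.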

As a result of these transformations, the output data has the following columns:

Raw_visit_counts: Number of visits in our panel to this POI during the date range.

Raw_visitor_counts: Number of unique visitors from our panel to this POI during the date range.

Distance_from_home: Median distance from home traveled by visitors (of visitors whose home we have identified) in meters.

Median_dwell: Median minimum dwell time in minutes.

Bucketed_dwells (exploded to <5, 5–10, 11–20, 21–60, 61–120, 121–240): Key is the range of minutes and value is the number of visits that fell within that duration

Popularity_by_day (exploded to Monday–Sunday): A mapping of each day of the week to the number of visits on that day (local time) over the course of the date range

Popularity_by_hour (exploded to popularity_1–popularity_24): A mapping of each hour of the day to the number of visits in that hour over the course of the date range, in local time. The first element in the array corresponds to the hour from midnight to 1 am

Device_type (exploded to iOS and Android): The number of visitors to the POI using Android vs. iOS devices. Only device_type values with at least 2 devices are shown, and any category with fewer than 5 devices is reported as


Now that the data is ready to go, we can begin on the Ensemble Learning aspects.

For this portion of the article, we will be using the Sklearn Voting Classifier, the built-in model for ensemble learning in the Sklearn package. Before using it, however, we first need to train the three weak learners that will make up the ensemble. These are the same models as in the first article of this series: the Gaussian Naive Bayes model, the Decision Tree model, and the K-Nearest Neighbors model.

# Training all three weak learners on the preprocessed data
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

dtree_model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
gnb = GaussianNB().fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=22).fit(X_train, y_train)

Now that all three models have been trained, the next step is to aggregate them into a list of (name, model) tuples for the ensemble model to use:

from sklearn.ensemble import VotingClassifier
estimators=[('knn', knn), ('gnb', gnb), ('dtree_model', dtree_model)]
ensemble = VotingClassifier(estimators, voting='hard')
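A design note on `voting='hard'`: hard voting counts each learner's predicted label equally, while soft voting averages the predicted class probabilities, which often works better when the learners are well calibrated. All three learners here expose `predict_proba`, so either mode is available. The sketch below shows the soft variant on synthetic data; the dataset and the `weights` values are illustrative, not from the article.

```python
# Sketch of the soft-voting variant on synthetic 3-class data
# (illustrative only; the article's actual model uses voting='hard').
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_classes=3,
                           n_informative=5, random_state=0)

soft = VotingClassifier(
    [("knn", KNeighborsClassifier()),
     ("gnb", GaussianNB()),
     ("dtree", DecisionTreeClassifier(max_depth=3))],
    voting="soft",
    weights=[2, 1, 2],  # illustrative: weight KNN and the tree over GNB
).fit(X, y)
print(soft.score(X, y))
```

The `weights` argument lets you discount a consistently weaker learner (here, hypothetically, the Naive Bayes model) without removing it from the ensemble.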

Upon training and testing the model we receive the following outputs:

ensemble.fit(X_train, y_train)
ensemble.score(X_test, y_test)

from sklearn.metrics import confusion_matrix, plot_confusion_matrix
prediction = ensemble.predict(X_test)
confusion_matrix(y_test, prediction)
plot_confusion_matrix(ensemble, X_test, y_test, normalize='true',
                      values_format='.3f',
                      display_labels=['Airport', 'Bus', 'Train'])

The accuracy of the ensemble (0.68) is slightly lower than that of the best of the three weak learners (0.75). It performs much better than the Gaussian Naive Bayes model (0.265) and on par with the K-Nearest Neighbors classifier (0.679). Since the ensemble performs better than the average of these three accuracies, ensemble classification is an effective way to increase prediction accuracy on this particular dataset. As before, the main cause of the model's shortcomings is the imbalance in the dataset and the lack of Bus Station records. Manually rebalancing the data did not solve this problem, as we saw in the Spark deep learning article, so the best course of action for future classification work is to avoid strongly imbalanced data at all costs or suffer a severe drop in accuracy.
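The imbalance problem is easy to miss if you look only at overall accuracy. A quick sketch with synthetic labels (the counts below are made up, not the article's results) shows how per-class recall and balanced accuracy expose a minority class the model is failing on:

```python
# Sketch: per-class recall reveals a failing minority class even when
# overall accuracy looks good (synthetic labels, not the article's data).
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

y_true = np.array([0] * 80 + [1] * 5 + [2] * 15)           # 1 = rare "Bus" class
y_pred = np.array([0] * 80 + [0] * 4 + [1] * 1 + [2] * 15)  # 4 of 5 Bus rows missed

cm = confusion_matrix(y_true, y_pred)
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print("accuracy:", (y_true == y_pred).mean())   # 0.96 despite the miss
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("per-class recall:", per_class_recall)
```

Here accuracy is 0.96 even though the rare class has only 20% recall, which is exactly the pattern the normalized confusion matrix above is meant to surface.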

Conclusion

Through this project we saw the value of ensemble classification and how efficiently several weak learners can be combined into a single model. The resulting model performs much better than the average accuracy of the three weak learners it combines. This concludes the series on POI classification. Across the four parts of the series, we have seen how effectively SafeGraph data can be classified using machine learning, and that the accuracy of the predictions can be quite high.

Questions?

I invite you to ask them in the #safegraphdata channel of the SafeGraph Community, a free Slack community for data enthusiasts. Receive support, share your work, or connect with others in the GIS community. Through the SafeGraph Community, academics have free access to data on over 7 million businesses in the USA, UK, and Canada.
