
POI Classification Using Visit and Popularity Metrics – Part 1

Using geospatial features for classification of locations as Bus Stops, Train Stations, or Airports


Image by Annie Spratt on Unsplash

This article demonstrates basic classification with supervised machine learning, using an applicable real-world dataset from SafeGraph. SafeGraph is a data provider offering POI data for hundreds of business categories, and it provides this data to academics for free. For this project, I chose SafeGraph Patterns data to classify records as various POIs. The schema for the Patterns data can be found here: Schema Info

From the numerous categories available in the SafeGraph Shop, this project focuses on classifying different transportation services. The categories chosen for this project are:

  • Bus and Other Vehicle Transit Systems
  • Rail Transportation
  • Other Airport Operations

These categories correspond to POI data for bus stops, train stations, and airports, respectively.

This article is the first in a series of posts using this data:

  • POI Classification using safegraph data and sklearn classifiers
  • Model tuning using PCA and cross-validation
  • POI Classification using spark Deep Learning classifiers
  • POI Classification within single NAICS code (fast-food restaurant classification)

Image from Markus Spiske on Unsplash

Below are snapshots of the code written for this project, along with brief descriptions of each step. The full code for this project can be found here: Notebook Link

Section 1: Dependencies

The following are the dependencies for this project. The packages we need are Pandas, NumPy, Seaborn, Matplotlib, scikit-learn, and PySpark.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import plot_confusion_matrix

---

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop2.7.tgz
!tar xf spark-3.1.2-bin-hadoop2.7.tgz
!pip install -q findspark
!pip install pyspark

---

import findspark
findspark.init("/content/spark-3.1.2-bin-hadoop2.7")
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

Section 2: Data Load

# Authenticate to Google Drive from within Colab
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Download a CSV from Drive by file id and load it into pandas
def pd_read_csv_drive(id, drive, dtype=None):
    downloaded = drive.CreateFile({'id':id})
    downloaded.GetContentFile('Filename.csv')
    return(pd.read_csv('Filename.csv',dtype=dtype))

# Map a dataset name to its Drive file id
def get_drive_id(filename):
    drive_ids = {'patterns' : '1ReqpLgv50_3mCvZuKLlMdHxwfCPIHmqz'}
    return(drive_ids[filename])
transportation_df = pd_read_csv_drive(get_drive_id('patterns'), drive=drive)
transportation_df.head(3)

The output looks something like this:

The table shown is the Patterns data for the three POI categories listed above, for the month of December 2018. This particular month was chosen because public transportation is used more heavily during the holiday season, which makes the popularity and visitor metrics for each record more distinct.

Some of the columns that we will use as features for our classification models are:

  • raw_visit_counts – Number of visits in our panel to this POI during the date range.
  • raw_visitor_counts – Number of unique visitors from our panel to this POI during the date range.
  • visits_by_day – The number of visits to the POI each day (local time) over the covered time period.
  • distance_from_home – Median distance from home traveled by visitors (of visitors whose home we have identified) in meters.
  • median_dwell – Median minimum dwell time in minutes.
  • bucketed_dwell_times – Key is range of minutes and value is number of visits that were within that duration
  • popularity_by_hour – A mapping of hour of day to the number of visits in each hour over the course of the date range in local time. First element in the array corresponds to the hour of midnight to 1 am
  • popularity_by_day – A mapping of day of week to the number of visits on each day (local time) in the course of the date range
  • device_type – The number of visitors to the POI using Android vs. iOS. Only device types with at least 2 devices are shown, and any category with fewer than 5 devices is reported as 4
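
To make these structures concrete, here is a hypothetical sketch of how the nested columns arrive as plain strings in the raw Patterns CSV (the values below are made up for illustration, not real SafeGraph records):

# Hypothetical values showing how the nested columns arrive
# as plain strings before any parsing (values are made up)
example_row = {
    'popularity_by_day': '{"Monday": 120, "Tuesday": 98, "Wednesday": 105, "Thursday": 111, "Friday": 160, "Saturday": 75, "Sunday": 60}',
    'popularity_by_hour': '[2, 1, 0, 0, 3, 12, 40, 85, 90, 66, 50, 44, 48, 52, 55, 60, 78, 95, 70, 40, 25, 18, 9, 4]',
    'bucketed_dwell_times': '{"<5": 30, "5-10": 55, "11-20": 41, "21-60": 70, "61-120": 22, "121-240": 10, ">240": 4}',
    'device_type': '{"android": 51, "ios": 72}'
}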

Section 3: Data Cleansing

The first course of action is to drop unnecessary columns.

transportation_df = transportation_df.drop(['parent_safegraph_place_id','placekey','safegraph_place_id','parent_placekey','safegraph_brand_ids','brands','poi_cbg'], axis = 1)
transportation_df.head(3)

Here we drop parent_safegraph_place_id, placekey, safegraph_place_id, parent_placekey, safegraph_brand_ids, brands, and poi_cbg. These columns hold identifiers, brand information, census block groups, and Placekeys that are unrelated to the scope of this project, so we remove them.

Creation of ground truth Class column (Helper function)

# Assign a ground-truth class based on keywords in the location name;
# names that match no keyword are labeled Unknown
def class_definer(record):
    fixed_record = record.lower()
    if('bus' in fixed_record):
        return 'Bus'
    elif('airport' in fixed_record):
        return 'Airport'
    elif('train' in fixed_record):
        return 'Train'
    elif('metro' in fixed_record):
        return 'Train'
    elif('transport' in fixed_record):
        return 'Bus'
    elif('amtrak' in fixed_record):
        return 'Train'
    elif('bart' in fixed_record):
        return 'Train'
    elif('cta' in fixed_record):
        return 'Train'
    elif('mta' in fixed_record):
        return 'Train'
    elif('transit' in fixed_record):
        return 'Bus'
    elif('mbta' in fixed_record):
        return 'Train'
    elif('station' in fixed_record):
        return 'Train'
    elif('railway' in fixed_record):
        return 'Train'
    else:
        return 'Unknown'
transportation_df['Class'] = transportation_df['location_name'].transform(lambda x: class_definer(x))
transportation_df.head(3)

This code creates a Class column to establish the ground truth required for supervised learning algorithms such as these classifiers. The data used for this project comes with a caveat, particularly for the bus data: the bus NAICS category, 'Bus and Other Vehicle Transit Systems', includes not only bus stops but also records such as truck rentals and yacht services. The function above labels those irrelevant records as Unknown so they can be removed.

transportation_df = transportation_df[transportation_df['Class'] != 'Unknown'].dropna()
transportation_df.head(3)

These Unknown records correspond to the motor transit records that come bundled with the bus data in the Patterns dataset. We drop them because they are not really bus services, tending instead toward truck services, yacht rentals, and other transit services far removed from the concept of a bus stop.

Now the data is formatted properly and has a Class column that classification algorithms can use. The next step is to reshape the dataframe so that all of the features can be fed to a classification algorithm. This requires horizontally expanding columns such as visits_by_day and popularity_by_hour. Some of the columns that need to be exploded contain arrays and some contain JSON. We will start with the JSON columns.

To do this we will use PySpark, specifically the from_json function, to horizontally explode these columns. First, we must convert the pandas dataframe to a Spark dataframe.

transportation_df = spark.createDataFrame(transportation_df)
transportation_df.show(2)

Now that the data is in a Spark dataframe, we need to define a schema for each JSON string column that will be exploded. Each schema lists the columns produced when the JSON string is expanded, along with their data types; with these schemas we can explode the JSON string columns.

#Horizontal Explosion of JSON columns using Pyspark
from pyspark.sql.functions import from_json,expr
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, IntegerType
day_schema = StructType(
  [
     StructField('Monday', IntegerType(),True),
     StructField('Tuesday', IntegerType(),True),
     StructField('Wednesday', IntegerType(),True),
     StructField('Thursday', IntegerType(),True),
     StructField('Friday', IntegerType(),True),
     StructField('Saturday', IntegerType(),True),
     StructField('Sunday', IntegerType(),True)
  ]
)
device_schema = StructType(
  [
     StructField('android', IntegerType(),True),
     StructField('ios', IntegerType(),True)
  ]
)
bucketedDT_schema = StructType(
  [
    StructField('<5',IntegerType(),True),
    StructField('5-10',IntegerType(),True),
    StructField('11-20',IntegerType(),True),
    StructField('21-60',IntegerType(),True),
    StructField('61-120',IntegerType(),True),
    StructField('121-240',IntegerType(),True),
    StructField('>240',IntegerType(),True)
  ]
)
transportation_df = (
    transportation_df
    .withColumn('popularity_by_day', from_json('popularity_by_day', day_schema))
    .withColumn('device_type', from_json('device_type', device_schema))
    .withColumn('bucketed_dwell_times', from_json('bucketed_dwell_times', bucketedDT_schema))
    .select('location_name', 'raw_visit_counts', 'raw_visitor_counts', 'visits_by_day',
            'distance_from_home', 'median_dwell', 'bucketed_dwell_times.*',
            'popularity_by_hour', 'popularity_by_day.*', 'device_type.*', 'Class')
)
transportation_df = transportation_df.toPandas()
transportation_df.head(3)

Now that the JSON strings have been exploded, we can explode the array columns. Since this can be done fairly easily in pandas, we convert the dataframe back to a pandas dataframe. Expanding an array column requires the column to actually contain arrays rather than strings formatted like arrays, and SafeGraph Patterns data stores these columns as strings. So the first step is to convert each string to a list, which can be done with the literal_eval function from the ast package. From there we can take the individual values at each index and explode them into separate columns.

from ast import literal_eval
transportation_df['popularity_by_hour'] = transportation_df['popularity_by_hour'].transform(lambda x: literal_eval(x))
pops = ['popularity_' + str(i) for i in range(1,25)]
transportation_df[pops] = pd.DataFrame(transportation_df.popularity_by_hour.to_list(), index=transportation_df.index)
transportation_df = transportation_df.drop(['popularity_by_hour'], axis = 1)
transportation_df = transportation_df.reindex()
transportation_df.drop('visits_by_day', axis=1, inplace=True)
transportation_df.head(3)

The final data cleansing step is to use sklearn's LabelEncoder to convert the Class column to numeric values for the classification models.

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(transportation_df['Class'])
transportation_df['Class'] = le.transform(transportation_df['Class'])
transportation_df.head(3)
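
As a quick sanity check (not part of the original notebook), le.classes_ holds the original labels in encoded order, so you can recover the label-to-integer mapping:

# classes_ is sorted alphabetically, so the mapping here is
# Airport -> 0, Bus -> 1, Train -> 2
print(le.classes_)
print(le.transform(le.classes_))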

Section 4: Classification

Image from Umberto and Unsplash
Image from Umberto and Unsplash

Since there are 3 unique classes, we need to use a multiclass classifier. We will try Gaussian Naive Bayes, Decision Trees, and the K-Nearest Neighbors Classifier. First, we must split our data into test and train sets.

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Use every column except the label and the location name as features
x_cols = []
for item in list(transportation_df.columns):
    if(item != 'Class' and item != 'location_name'):
        x_cols.append(item)
X = transportation_df[x_cols]
y = transportation_df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
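
Since class imbalance comes up when we interpret the models below, it is worth a quick look at how many records fall into each class. A small check, not in the original notebook:

# Count records per encoded class (0 = Airport, 1 = Bus, 2 = Train)
print(y.value_counts())
# Proportions per class in the training split
print(y_train.value_counts(normalize=True))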

Gaussian Naive Bayes Classifier:

Here is a little background regarding the functionality of the Naive Bayes Classifier: Gaussian Naive Bayes

from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB().fit(X_train, y_train)
gnb_predictions = gnb.predict(X_test)
gnb_predictions
accuracy = gnb.score(X_test, y_test)
print(accuracy)
confusion_matrix(y_test, gnb_predictions)

Now let’s visualize the results using a heatmap

plot_confusion_matrix(gnb, X_test, y_test, normalize='true', values_format = '.3f', display_labels=['Airport','Bus','Train'])

Yellow indicates high incidence and purple indicates low incidence. Correct predictions appear on the diagonal from the top-left corner to the bottom-right corner.

Ideally, we would see yellow in the diagonal and purple everywhere else.

This shows that Gaussian Naive Bayes does not do well at classifying this particular dataset. This can be attributed to the classifier treating each feature independently, without considering correlations between them (hence 'naive'). That makes it hard for the classifier to do well when features have high correlation coefficients, as in this dataset, where many features were exploded from a single column and are directly correlated with one another.
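
One way to check this intuition is to inspect the feature correlation matrix directly. A minimal sketch (not in the original notebook) using the pandas corr method and seaborn:

# Pairwise Pearson correlations between features; the exploded
# hourly/daily counts tend to correlate strongly with raw_visit_counts,
# violating the independence assumption Naive Bayes relies on
feature_corr = X.corr()
sns.heatmap(feature_corr, cmap='viridis')
plt.show()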

The accuracy of the model comes out to about 26%. The heatmap shows that many records that are actually airports are misclassified as train stations, and the same is true of true bus stops. In other words, the majority of records are being classified as train stations.

Decision Trees:

Here is a little background regarding the functionality of the Decision Tree classifier: Decision Tree

from sklearn.tree import DecisionTreeClassifier
dtree_model = DecisionTreeClassifier(max_depth = 3).fit(X_train, y_train)
dtree_predictions = dtree_model.predict(X_test)
dtree_model.score(X_test,y_test)
confusion_matrix(y_test, dtree_predictions)

Visualizing results using a heatmap

plot_confusion_matrix(dtree_model, X_test, y_test, normalize='true', values_format = '.3f', display_labels=['Airport','Bus','Train'])

From this, we can see that the Decision Tree model performs much better than the Naive Bayes, with an accuracy of 75%. Considering that this is a multiclass classifier, this accuracy is fairly decent.

The classifier does a very good job of correctly classifying airports, at an accuracy rate of ~91.7%, indicating that a model that accounts for correlations between features can classify the airports in this data very well. The same cannot be said of the bus records: none of the true bus stops were classified as bus stops. The train station data fares somewhat better, with ~55.3% of true train stations classified correctly. An interesting observation about this model's predictions is that no record is classified as a bus stop, either correctly or incorrectly. This can be attributed to the smaller number of bus records relative to the train station and airport records, combined with the nature of the Decision Tree algorithm: Decision Trees perform extremely well on balanced data but struggle on imbalanced data like ours.
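
A common mitigation worth trying here is sklearn's class_weight='balanced' option, which reweights each class by its inverse frequency so the rare Bus class counts for more during training. A sketch under that assumption, with results unverified:

# Reweight classes by inverse frequency so the underrepresented
# Bus class influences the splits more
dtree_balanced = DecisionTreeClassifier(max_depth = 3, class_weight = 'balanced').fit(X_train, y_train)
print(dtree_balanced.score(X_test, y_test))
confusion_matrix(y_test, dtree_balanced.predict(X_test))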

K-Nearest Neighbors:

Here is a little background regarding the functionality of the KNN classifier: KNN

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 22).fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
knn_predictions = knn.predict(X_test)
confusion_matrix(y_test, knn_predictions)

Visualizing results using a heatmap

plot_confusion_matrix(knn, X_test, y_test, normalize='true', values_format = '.3f', display_labels=['Airport','Bus','Train'])

This classifier has an accuracy of about 68.0%. It does extremely well at classifying the airport records, with an accuracy of ~93.7%. The number of correctly classified bus stops is no longer zero, but it is still very low, which can again be attributed to the imbalanced data. The classifier does somewhat worse than the Decision Tree at train classification, predicting only ~18.3% of true train stations as train stations.
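
The choice of n_neighbors = 22 above is a hyperparameter, and a simple sweep over k (a sketch, not in the original notebook) can show whether a different neighborhood size works better:

# Evaluate test accuracy for a range of neighborhood sizes
for k in range(1, 31, 3):
    knn_k = KNeighborsClassifier(n_neighbors = k).fit(X_train, y_train)
    print(k, round(knn_k.score(X_test, y_test), 3))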

Conclusion:

This project gave us an introduction to the classification of POI categories based on SafeGraph Patterns visit data. Our algorithms saw some success in classifying airports vs train stations vs bus stops, all based on things like dwell time, visitor distance from home, popularity by the hour of the day, and popularity by day of the week.

The next post in the series (link coming soon!) will add complexity and power to our classifiers by tuning our models and conducting principal component analysis.

Questions?

I invite you to ask them in the #safegraphdata channel of the SafeGraph Community, a free Slack community for data enthusiasts. Receive support, share your work, or connect with others in the GIS community. Through the SafeGraph Community, academics have free access to data on over 7 million businesses in the USA, UK, and Canada.

