PCA & K-Means for Traffic Data in Python

Reduce dimensionality and cluster Taipei MRT stations based on hourly traffic

Beth Ou Yang
Towards Data Science

--

Taipei Rail Map (based on actually introduced romanization standards), including THSR, TRA, Taipei MRT & other lines. Image by Taiwan J.

Principal Component Analysis (PCA) has been used in traffic data to detect anomalies, but it can also be used to capture the patterns of a transit station’s traffic history, just like what it does on the purchase data of a customer.

In this article, we will go through:

  1. What tricks does PCA do
  2. What can we do after applying PCA
  3. Playtime!
    Take a look into our dataset:
    Taipei Metro Rapid Transit System, Hourly Traffic Data
    The full code is also included in the above Kaggle dataset.
  4. Using PCA on hourly traffic data
  5. Clustering on the PCA result
  6. Insights on the Taipei MRT traffic
  7. Key takeaways

1. What tricks does PCA do

In brief, PCA summarizes the data by finding linear combinations of the features. This can be thought of as taking several pictures of a 3D object: PCA naturally sorts the pictures from the most representative to the least before handing them to you.

With the input being our original data, PCA produces 2 useful outputs: Z and W. Multiplying them gives us the reconstructed data, that is, the original data with some tolerable information loss (since we have reduced the dimensionality).
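A minimal sketch of this reconstruction idea, using a random stand-in for the data (the shapes and values here are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for the data: 6 "stations" x 4 "hours" (values are made up).
rng = np.random.default_rng(0)
X = rng.random((6, 4))

pca = PCA(n_components=2)
Z = pca.fit_transform(X)   # Z: one row of component scores per station
W = pca.components_        # W: one row of feature weights per component

# Multiplying Z by W (and adding back the mean that PCA subtracted)
# reconstructs the data, with some loss from the dropped dimensions.
X_rec = Z @ W + pca.mean_
print(X_rec.shape)   # (6, 4), same shape as X
```

With all 4 components kept, the reconstruction would be exact; keeping only 2 trades a little accuracy for a much smaller representation.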

We will explain these 2 output matrices with our data in the practice below.

2. What can we do after applying PCA

After applying PCA to our data to reduce the dimensionality, we can use it for other machine learning tasks, such as clustering, classification, and regression.

In the case of Taipei MRT later in this article, we will perform clustering on the lower-dimensional data, where a few dimensions can be interpreted as passenger proportions in different parts of the day, such as morning, noon, and evening. Stations that share similar proportions of passengers during the day would be considered to be in the same cluster (their patterns are alike!).

3. Take a look at our traffic dataset!

The dataset we use here is Taipei Metro Rapid Transit System, Hourly Traffic Data, with the columns: date, hour, origin, destination, passenger_count.

In our case, I will keep weekday data only, since there are more interesting patterns across stations during weekdays. For example, stations in residential areas may have more commuters entering in the daytime, while stations in business areas may have more people entering in the evening.

Stations in residential areas may have more commuters entering in the daytime.

The plot above shows the hourly traffic trends (the number of passengers entering the station) of 4 different stations. The 2 lines in red are Xinpu and Yongan Market, which are located in very crowded residential areas of New Taipei City. On the other hand, the 2 lines in blue are Taipei City Hall and Zhongxiao Fuxing, where many companies are located and business activities take place.

The trends reflect both the nature of these areas and stations, and we can notice that the difference is most obvious when comparing their trends during commute hours (7 to 9 a.m., and 5 to 7 p.m.).
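A sketch of how such a trend table can be assembled from the dataset's columns (the station names match the examples above, but the counts here are made up):

```python
import pandas as pd

# Hypothetical slice of the dataset: entries per station and hour.
df = pd.DataFrame({
    "hour": [8, 8, 18, 18],
    "origin": ["Xinpu", "Taipei_City_Hall", "Xinpu", "Taipei_City_Hall"],
    "passenger_count": [9000, 2000, 3000, 8000],
})

# Average entries per station and hour; each row of `trend` becomes
# one line in the chart (stations as rows, hours as columns).
trend = (df.groupby(["origin", "hour"])["passenger_count"]
           .mean()
           .unstack("hour"))
print(trend)
```

Plotting each row of `trend` against the hour columns reproduces the line chart described above.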

4. Using PCA on hourly traffic data

Why reduce dimensionality before conducting further machine learning tasks?

There are 2 main reasons:

  1. As the number of dimensions increases, data points appear sparse and dissimilar in many ways, which is referred to as “the curse of dimensionality”.
  2. Due to the high-dimensional nature of the traffic data, it is difficult to visualize and interpret.

By applying PCA, we can identify the hours when the traffic trends of different stations are most obvious and representative. Intuitively, from the plot shown previously, we can assume that hours around 8 a.m. and 6 p.m. may be representative enough to cluster the stations.

Remember we mentioned the useful output matrices, Z and W, of PCA in the previous section? Here, we are going to interpret them with our MRT case.

Original data, X

  • Index : stations
  • Column : hours
  • Values : the proportion of passengers entering in the specific hour (for each station: #passengers in that hour / #total passengers)
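A minimal sketch of how such an X can be built from the raw rows (the station names and counts here are made up; the column names follow the dataset):

```python
import pandas as pd

# Hypothetical raw rows (weekdays only); values are made up.
raw = pd.DataFrame({
    "origin": ["A", "A", "B", "B"],
    "hour": [8, 18, 8, 18],
    "passenger_count": [600, 400, 100, 900],
})

# Total entries per station and hour (stations as index, hours as columns)...
X = raw.pivot_table(index="origin", columns="hour",
                    values="passenger_count", aggfunc="sum")

# ...then divide each row by the station's total, so the values are
# per-station proportions that sum to 1.
X = X.div(X.sum(axis=1), axis=0)
print(X)
```

Normalizing to proportions means the clustering later compares the *shape* of each station's daily pattern, not its raw passenger volume.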

With such X, we can apply PCA by the following code:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

n_components = 3
pca = PCA(n_components=n_components)

# Standardize each hour column before fitting PCA
X_tran = StandardScaler().fit_transform(X)

pca.fit(X_tran)

Here, we specify the parameter n_components to be 3, which implies that PCA will extract the 3 most significant components for us.

Note that it is like “taking several pictures of a 3D object, sorted from the most representative to the least,” and we choose the top 3 pictures. So, if we set n_components to 5, we will get 2 more pictures, but our top 3 will remain the same!
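This stability of the top components can be checked directly; a sketch on a random stand-in for the standardized data (comparing up to sign, since a component's sign is arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the standardized station-by-hour matrix.
rng = np.random.default_rng(42)
X_tran = rng.random((30, 10))

pca3 = PCA(n_components=3).fit(X_tran)
pca5 = PCA(n_components=5).fit(X_tran)

# Asking for 5 components only appends 2 more "pictures";
# the top 3 are unchanged.
print(np.allclose(np.abs(pca3.components_),
                  np.abs(pca5.components_[:3])))

# explained_variance_ratio_ reports how representative each picture is,
# already sorted from most to least.
print(pca3.explained_variance_ratio_)
```

`explained_variance_ratio_` is also a practical way to decide how many components are worth keeping.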

PCA output, W matrix

W can be thought of as the weights on each feature (i.e., hour) with regard to our “pictures”, or more specifically, the principal components.

import pandas as pd

pd.set_option('display.precision', 2)

W = pca.components_
W_df = pd.DataFrame(W, columns=hour_mapper.keys(),
                    index=[f'PC_{i}' for i in range(1, n_components+1)])
W_df.round(2).style.background_gradient(cmap='Blues')

For our 3 principal components, we can see that PC_1 puts more weight on night hours, while PC_2 weights noon more, and PC_3 is about the morning.
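This read-out can also be automated: the hour with the largest absolute weight summarizes what each PC “looks at”. A sketch with a made-up W mimicking the heatmap (the hour labels and weights are illustrative, not the actual fitted values):

```python
import pandas as pd

# Made-up W with hour-labelled columns (3 PCs x 4 sample hours).
W_df = pd.DataFrame(
    [[0.1, 0.2, 0.3, 0.9],   # PC_1: heaviest weight on a night hour
     [0.2, 0.9, 0.3, 0.1],   # PC_2: heaviest weight on noon
     [0.9, 0.1, 0.2, 0.3]],  # PC_3: heaviest weight on a morning hour
    index=["PC_1", "PC_2", "PC_3"],
    columns=["08", "12", "15", "21"],
)

# For each PC, report the hour with the largest absolute weight.
print(W_df.abs().idxmax(axis=1))
```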

PCA output, Z matrix

We can interpret the Z matrix as the representations of the stations.

# Transform the standardized data into the component space
Z = pca.transform(X_tran)

# Name the PCs according to the insights on W matrix
Z_df = pd.DataFrame(Z, index=origin_mapper.keys(), columns=['Night', 'Noon', 'Morning'])

# Look at the stations we demonstrated earlier
Z_df = Z_df.loc[['Zhongxiao_Fuxing', 'Taipei_City_Hall', 'Xinpu', 'Yongan_Market'], :]
Z_df.style.background_gradient(cmap='Blues', axis=1)

In our case, as we have interpreted the W matrix and understood the latent meaning of each component, we can assign names to the PCs.

The Z matrix for these 4 stations indicates that the first 2 stations have a larger proportion of passengers in the night hours, while the other 2 have more in the morning. This also seconds the findings in our EDA (recall the line chart of these 4 stations in the earlier part).

5. Clustering on the PCA result with K-Means

After getting the PCA result, let’s further cluster the transit stations according to their traffic patterns, which are represented by the 3 principal components.

In the last section, we obtained the Z matrix, which holds the representations of the stations with regard to night, noon, and morning.

We will cluster the stations based on these representations, such that the stations in the same group would have similar passenger distributions among these 3 periods.

There are a bunch of clustering methods, such as K-Means, DBSCAN, and hierarchical clustering. Since the main topic here is to see the convenience of PCA, we will skip the process of experimenting with which method is more suitable and go with K-Means.

from sklearn.cluster import KMeans

# Fit the Z matrix to a K-Means model
# (random_state fixed for reproducibility)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(Z)

After fitting the K-Means model, let’s visualize the clusters with a 3D scatter plot using Plotly.

import plotly.express as px

# Keep the station names as the index so they can be shown on hover
cluster_df = pd.DataFrame(Z, columns=['PC1', 'PC2', 'PC3'],
                          index=origin_mapper.keys()).reset_index()

# Turn the labels from integers to strings,
# so that they are treated as discrete values in the plot.
cluster_df['label'] = kmeans.labels_
cluster_df['label'] = cluster_df['label'].astype(str)

fig = px.scatter_3d(cluster_df, x='PC1', y='PC2', z='PC3',
                    color='label',
                    hover_data={'origin': cluster_df['index']},
                    labels={
                        'PC1': 'Night',
                        'PC2': 'Noon',
                        'PC3': 'Morning',
                    },
                    opacity=0.7,
                    size_max=1,
                    width=800, height=500
                    ).update_layout(margin=dict(l=0, r=0, b=0, t=0)
                    ).update_traces(marker_size=5)
fig.show()
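Beyond the 3D plot, a quick way to read off each cluster’s members is to pair the fitted labels with the station index. A sketch with made-up representations (the station names match the earlier examples, but the values are illustrative):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up station representations (rows: stations, columns: periods).
Z_df = pd.DataFrame(
    [[2.0, 0.1, -1.5],
     [1.8, 0.0, -1.2],
     [-1.5, 0.2, 2.1],
     [-1.7, 0.1, 1.9]],
    index=["Taipei_City_Hall", "Zhongxiao_Fuxing", "Xinpu", "Yongan_Market"],
    columns=["Night", "Noon", "Morning"],
)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Z_df)

# List each cluster's member stations.
for label in sorted(set(kmeans.labels_)):
    print(label, list(Z_df.index[kmeans.labels_ == label]))
```

Here the two night-heavy stations land in one cluster and the two morning-heavy stations in the other, mirroring the grouping discussed next.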

6. Insights on the Taipei MRT traffic — Clustering results

  • Cluster 0 : More passengers in daytime, and therefore it may be the “living area” group.
  • Cluster 2 : More passengers in evening, and therefore it may be the “business area” group.
  • Cluster 1 : Both day and night hours see many people entering the stations, and the nature of these stations is more complicated to explain, since there could be different reasons for different stations. Below, we will look into 2 extreme cases in this cluster.

For example, Taipei Main Station, the station in Cluster 1 with the largest number of passengers, is a huge transit hub in Taipei where commuters transfer between buses, railway systems, and the MRT. Therefore, the high-traffic pattern during both morning and evening is clear.

On the contrary, Taipei Zoo station is in Cluster 1 as well, but it is not a case of “both day and night hours are full of people”. Instead, there are not many people in either period, because few residents live around that area and most citizens seldom visit Taipei Zoo on weekdays.

The patterns of these 2 stations are not much alike, yet they are in the same cluster. That is, Cluster 1 might contain too many stations that are actually dissimilar. Thus, in the future, we would have to fine-tune the hyper-parameters of K-Means, such as the number of clusters; methods like the silhouette score and the elbow method would be helpful.
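A sketch of how those two methods could guide that tuning, on a random stand-in for the station representations Z:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Random stand-in for the station representations Z.
rng = np.random.default_rng(0)
Z = rng.random((40, 3))

# Try several cluster counts: a higher silhouette score suggests tighter,
# better-separated clusters, while inertia_ feeds the elbow method
# (look for the k where the drop in inertia levels off).
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Z)
    print(k, round(silhouette_score(Z, km.labels_), 3),
          round(km.inertia_, 1))
```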

Conclusion

In summary,

  1. Applying PCA on traffic data to reduce dimensionality can be seen as extracting 3 important periods (morning, noon, evening) out of the 21 operating hours in total.
  2. PCA outputs are W and Z matrices, where Z can be viewed as the representations of stations with regard to principal components (time periods), and W can be thought of as the representations of principal components (time periods) with regard to original features (hours).
  3. Examining the W matrix can help us understand the latent meaning of each principal component.
  4. Clustering methods can be used on the PCA output, Z matrix.

Note that we skipped EDA and hyper-parameter tuning here in order to focus on the topic of this article, but they are actually important.

Thank you for reading so far!
Hope you have a wonderful online journey in Taipei 🫶


Unless otherwise noted, all images are by the author.
