Cluster analysis has become one of the most important methods in data analysis, machine learning and data science. The general idea of clustering is to divide a set of objects with various features into groups, called clusters. There are many clustering algorithms in use nowadays, but here I am going to walk through one of the most powerful and simplest of them: K-means clustering.

Table of contents
- Getting familiar with the dataset
- The history and the intuition of the algorithm
- Elbow method or how to find the best number of clusters
- Clustering validation: Silhouette analysis
- Additional resources
Getting familiar with the dataset
Today we are going to apply the full power of cluster analysis to a dataset provided by San Francisco Open Data: monthly air traffic statistics by airline at San Francisco International Airport. For our work we will use Python and its libraries matplotlib (with seaborn), pandas, scikit-learn and numpy. So let’s begin our journey into clustering!!!
Firstly, as in every data science task, we have to get acquainted with the data we are working with.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("Datasets/Air_Traffic_Passenger_Statistics.csv")
df.head()

As we can see there are multiple columns in our dataset, but for cluster analysis we will use Operating Airline, GEO Region, Passenger Count and the number of flights held by each airline (which we can derive from the number of rows per airline). So, to spot potential outliers and get some insights about our data, let’s first make some plots using the Python data visualization library seaborn.
# Frequency plot: number of records (flights) per operating airline
plt.figure(figsize=(15, 15))
sns.countplot(x="Operating Airline", palette="Set3", data=df)
plt.xticks(rotation=90)
plt.ylabel("Number of flights held")
plt.show()

From the frequency plot of the airlines we can see that a couple of airlines hold far more flights than the others: United Airlines and United Airlines – Pre 07/01/2013. These airlines are our potential outliers. The reason why we are trying to find and eliminate outliers will be explained later. Now let’s look at the frequency plot of the geographic regions that hold flights.
# Frequency plot: number of records (flights) per geographic region
plt.figure(figsize=(15, 15))
sns.countplot(x="GEO Region", palette="Set3", data=df)
plt.xticks(rotation=90)
plt.ylabel("Number of flights held")
plt.show()

As we can see, the USA is the leader among the regions, which makes sense because our outlier airlines (United Airlines and United Airlines – Pre 07/01/2013) are American. We can also see that both regions and airlines can be separated into groups (or clusters) by the number of flights they hold.
The history and the intuition of the algorithm
K-means clustering is one of the most popular clustering algorithms nowadays. The idea goes back to Hugo Steinhaus in the 1950s. The goal of the algorithm is to divide a set of points X in n-dimensional space into k groups, each with a centroid, in such a way that the objective function (the sum of squared distances between the points and their corresponding centroids) is minimized.
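In symbols, with clusters $S_1, \dots, S_k$ and centroids $c_1, \dots, c_k$, the objective (which scikit-learn calls inertia) is:

$$J = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - c_i \rVert^2$$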

Let’s see how it can be achieved by looking at the algorithm:
First, the centroids are initialized randomly. Then the following two steps are repeated until convergence, i.e., until the centroids stop changing between iterations (see the sketch after this list):
- Assign each point of X to the cluster of its closest centroid
- Recompute the centroid of each cluster as the mean of its points in every dimension
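Here is a minimal NumPy sketch of these two steps. It is an illustrative toy implementation (the function name kmeans_sketch and its parameters are made up for this post), assuming Euclidean distance and that no cluster ends up empty; later we will use the scikit-learn implementation instead.

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids as k randomly chosen points of X
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Step 1: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its cluster
        # (assumes every cluster is non-empty, fine for a sketch)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Convergence: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids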
Because the algorithm works with distances, it is very sensitive to outliers; that is why, before doing cluster analysis, we have to identify the outliers and remove them from the dataset. In order to find outliers more accurately, we will build a scatter plot.
# Number of flights held by each airline (rows per airline)
airline_count = df["Operating Airline"].value_counts()
airline_count.sort_index(inplace=True)
# Total number of passengers transported by each airline
passenger_count = df.groupby("Operating Airline").sum()["Passenger Count"]
passenger_count.sort_index(inplace=True)
x = airline_count.values
y = passenger_count.values
plt.figure(figsize=(10, 10))
plt.scatter(x, y)
plt.xlabel("Flights held")
plt.ylabel("Passengers")
# Label every point with the airline name
for i, txt in enumerate(airline_count.index.values):
    plt.annotate(txt, (x[i], y[i]))
plt.show()

We can see that most of the airlines are grouped together in the bottom left part of the plot, some lie above them, and our two outliers, United Airlines and United Airlines – Pre 07/01/2013, sit far away from the rest. So let’s get rid of them.
# Rank the airlines by the combined magnitude of flights and passengers
df_1 = airline_count + passenger_count
df_1.sort_values(ascending=False, inplace=True)
# The two largest values correspond to our two outliers
outliers = df_1.head(2).index.values
airline_count = airline_count.drop(outliers)
airline_count.sort_index(inplace=True)
passenger_count = passenger_count.drop(outliers)
passenger_count.sort_index(inplace=True)
x = airline_count.values
y = passenger_count.values
We have prepared our dataset for clustering. However, the question is how to determine the "best" number of clusters to use. In order to figure this out, we will apply the "elbow method".
Elbow method or how to find the best number of clusters
As we have already discussed, we are trying to minimize the objective function. It is clear that the more centroids we have, the smaller the value of the objective function; it even reaches 0, the global minimum, when every point is a centroid of its own. Such a variant is useless to us, so we will look for a trade-off between the number of clusters and the value of the objective function. To do this, let's plot the value of the objective function against the number of clusters.
from sklearn.cluster import KMeans

X = np.array(list(zip(x, y)))
inertias = []
# Fit K-means for each candidate number of clusters and record the inertia
for k in range(2, 10):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
plt.plot(range(2, 10), inertias, "o-g")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()

Our task is to find the number of clusters after which the objective function stops decreasing quickly. Or, by analogy with a human arm: if we imagine the plot as an arm, then the optimal number of clusters is at the point of the elbow. In our case the optimal number of clusters is 6. So now we are ready to do cluster analysis on our dataset.
# Cluster the airlines into 6 groups and visualize the result
kmeans = KMeans(n_clusters=6)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.figure(figsize=(15, 15))
plt.xlabel("Flights held")
plt.ylabel("Passengers")
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=300, cmap="Set1")
# Label every point with the airline name
for i, txt in enumerate(airline_count.index.values):
    plt.annotate(txt, (X[i, 0], X[i, 1]), size=7)
plt.show()

Now that we have clustered our data, we can make some assumptions and predictions based on the results. We can see that American Airlines and SkyWest Airlines hold many flights and transport many people, so we can infer that these airlines may have planes with a bigger capacity, or that people simply use them more than the others. Cluster analysis allows you to get many insights about your data. Now it's important for us to validate the results of our clustering (to see how good the partition is), and for this we will use silhouette analysis.
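For instance, one quick way to inspect the groups is to print which airlines fall into each cluster. This is just a small sketch reusing the y_kmeans labels and the airline_count index from the code above:

# Group the airline names by their assigned cluster label
clusters = pd.Series(y_kmeans, index=airline_count.index)
for label, members in clusters.groupby(clusters):
    print(label, list(members.index))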
Clustering validation: Silhouette analysis
In order to understand the main feature of this type of validation, we have to find out what a silhouette coefficient is.
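For a point $i$ it is defined, in its standard form, as:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$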

For each point in our dataset we can calculate a silhouette coefficient using the formula above, but it is still unclear what a(i) and b(i) are. So let's start with the meaning of the a(i) parameter.
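In the standard notation, with $C_i$ denoting the cluster of point $i$ and $d(i, j)$ the distance between points $i$ and $j$:

$$a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, j \neq i} d(i, j)$$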

For each point i we can calculate a(i): the average of the distances between point i and all other points j of the same cluster. Now let's understand what the b(i) parameter means.
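Again in the standard notation:

$$b(i) = \min_{C_k \neq C_i} \frac{1}{|C_k|} \sum_{j \in C_k} d(i, j)$$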

For each point i we can calculate b(i): the smallest average distance between point i and the points of any other cluster, i.e., the average distance to the nearest neighboring cluster. Using the parameters described above we can calculate the silhouette coefficient for every point in the dataset. The silhouette coefficient takes values in the range [-1, 1]. A value of 1 means that the point is really close to its own cluster and far from the others; analogously, -1 is the worst case. It's always useful to make a silhouette plot, so let's figure out what it looks like. In a silhouette plot we draw the silhouette coefficient of every point, grouped into clusters and sorted within them. Looking at the plot we can tell how good our clustering is: if the widths of all the "bars" are about the same and every point inside a cluster has a silhouette coefficient bigger than the global average, the clustering can be called good; otherwise we can't say it's absolutely perfect, but we can make some trade-offs depending on the data we are working with. So now let's make a silhouette plot for our clustering (the code is based on the example from the scikit-learn documentation on silhouette analysis).
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm

n_clusters = 6
plt.figure(figsize=(10, 10))
plt.gca().set_xlim([-0.1, 1])
plt.gca().set_ylim([0, len(X) + (n_clusters + 1) * 10])
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
labels = clusterer.fit_predict(X)
print("The average silhouette_score is :{}".format(silhouette_score(X, labels)))
# Silhouette coefficient of every individual point
sample_silhouette_values = silhouette_samples(X, labels)
y_lower = 10
for i in range(n_clusters):
    # Collect and sort the silhouette values of the i-th cluster
    ith_cluster_silhouette_values = sample_silhouette_values[labels == i]
    ith_cluster_silhouette_values.sort()
    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i
    color = cm.nipy_spectral(float(i) / n_clusters)
    # Draw one horizontal "bar" per point of the cluster
    plt.gca().fill_betweenx(np.arange(y_lower, y_upper),
                            0, ith_cluster_silhouette_values,
                            facecolor=color, edgecolor=color, alpha=0.7)
    plt.gca().text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
    y_lower = y_upper + 10  # leave a gap between clusters
plt.gca().axvline(x=silhouette_score(X, labels), color="red", linestyle="--", label="Avg silhouette score")
plt.title("The silhouette plot")
plt.xlabel("The silhouette coefficient values")
plt.ylabel("Cluster label")
plt.legend()
plt.show()

As we can see, the clusters are not of the same size and not all of them consist of points whose silhouette coefficient is above the global average. But considering that our data is distributed in a somewhat unusual way (see the scatter plot), we can conclude that this clustering has produced good results.
Thank you a lot for reading, I hope you enjoyed the great power of cluster analysis!!!