Conceptual vs inbuilt Principal Component Analysis for Breast Cancer Diagnosis

To combine the concept of eigenvalues and eigenvectors with Python's inbuilt PCA function, followed by ML models to measure advanced accuracy parameters on the Breast Cancer (Wisconsin) dataset. Two clustering and four classification models are used for the analysis.

Photo by Štefan Štefančík on Unsplash

Introduction:

We, the data science enthusiasts passionate about exploring the world through Python, know that Python has made Principal Component Analysis very easy with just a few lines of code. But we must not forget the fundamental concept behind choosing our principal components. In this article, before applying machine learning models for clustering and classification, I have implemented PCA in two ways: one by calculating the eigenvalues and eigenvectors, and the other with the traditional scikit-learn tool.

Before diving into the methods, let me first ask you two very simple questions:

Why deal with an enormous dataset which makes computations a mammoth task?

Why not diminish it while retaining almost all the valuable information it has to offer?

This is where PCA comes to your rescue with a dagger and chops down the data, reducing its dimension to make analysis easier while still preserving the statistics of the data at hand. Now you might wonder how PCA determines which variables to remove and which to keep. That is the beauty of dimensionality reduction: instead of removing variables, it combines the information into new principal components, which are built as linear combinations of the original variables. Yes, accuracy does get compromised a bit, but the simplicity gained is worth much more.

While you think on that, let me sketch a very simple situation you can relate to.

Suppose you plan to meet your school friends after a long time. They are eagerly waiting for you at their homes. However, your friend circle is quite big, and visiting each of them at their respective residences will take up quite a lot of time and energy. What is the immediate solution that you can think of? That is when you decide to call up each one of them and ask them to meet at a café nearby. In this way, you get to meet all your friends with minimum time and energy loss, without having to rule out anyone. Isn’t this similar to the explanation above?

PCA constructs the new variables, or principal components, in such a way that most of the information is crammed into the first component, the maximum of what remains into the second component, and so on. It creates as many components as there are original variables, but the amount of information contained in each decreases as the component number increases. A simple graph will help you visualize how the principal components account for the maximum total variance in the data. This picture will get clearer soon when we implement this on a dataset.

The dataset which we shall work on-

I have chosen a dataset whose dimension is large enough to apply PCA, since real-world data, be it structured or unstructured, tends to be huge. The ‘Breast Cancer (Wisconsin)’ dataset from Kaggle contains data on cancerous and non-cancerous patients. Here is the link to the data: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data.

The analysis is implemented in Python (Google Colab), and here is the link to my code on GitHub:

https://github.com/debangana97/Data-science-works/blob/main/Breast_cancer_diagnosis_using_PCA_two_ways.ipynb

As is evident from the file, there are 569 rows and 30 feature columns, that is to say, 30 different variables corresponding to 30 different characteristics of the patients. These characteristics help in determining whether a patient is suffering from breast cancer or not. Nevertheless, inspecting all 30 variables at once to predict the condition of a future patient is very tedious and susceptible to errors. A wrong cancer prediction is the last thing a doctor would want!

Objectives:

The main objectives of this analysis can be summarized in the following points:

  1. To use PCA to decrease the dimension of the data, that is, to combine the 30 variables into far fewer principal components
  2. To cluster the data based on the new variables (principal components) obtained through PCA
  3. To check whether the reduced data can be used for classification or not
  4. To find the optimum number of principal components based on accuracies of the classification models

Now that we have clarity on the main purposes of this article let us get into it.

Preparing the data ..

Data preprocessing, or cleaning of the data, is one of the major, if not the most important, prerequisites in data analysis. Proper preparation of the data leads to reliable and comprehensible results. Here are a few steps which I found necessary for my analysis:

  • Removing the unnecessary columns which do not contribute to the study
#removing the unnecessary columns
exclude = ['Unnamed: 32','id']
data = data.drop(exclude, axis = 1)
  • Checking for missing values. Quite fortunately there were no missing values in my dataset and hence I could use the entire original data without having to manipulate any point
  • Checking for data imbalance
Fig: Countplot showing the number of subjects in each class
  • Splitting the data into training and testing sets where 20% of the data has been kept aside for testing the performance of the model and the remaining was used for training. The dimension of the training set is 455 (rows) by 30 (columns)
  • Dividing the training data into two separate data frames. The dependent/target variable(y) consists of the cancer result namely ‘diagnosis’ column of the data and the feature variable(x) consists of all the characteristics of patients
  • Standardization of the matrix of features x in both the training set and the testing set is done for better computation and unbiased results. The ‘StandardScaler‘ class is used to both center and normalize the data
#standardization or feature scaling of the matrix of features x does both centering and normalizing of the data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train_std = sc.fit_transform(x_train)
x_test_std = sc.transform(x_test)  #reuse the scaler fitted on the training set to avoid data leakage
  • Encoding the categorical variable ‘y’

Principal Component Analysis:

Method 1

Before carrying out PCA it is essential to standardize the data. Standardization makes sure that each of the original variables contributes equally to the analysis. This dimension reduction method is very sensitive to the variance of the original variables: variables with large ranges of values would dominate over those with smaller ranges, and the results would tend to become biased. This step transforms the variables so that the entire training data has a mean of 0 and a standard deviation of 1.

print(np.mean(x_train_std))
print(np.std(x_train_std))
Output: 4.021203394686281e-17 
        1.0

In Python, I have computed the covariance matrix. The positive and negative signs in the matrix suggest whether the feature variables are directly or inversely correlated with one another.

From linear algebra we know that from this matrix we can construct eigenvectors, which are the directions of the new axes carrying the most information (highest variance); these are termed the principal components. Their corresponding eigenvalues indicate the amount of information (variance) conveyed by each principal component.
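
If you want to reproduce this step yourself, here is a minimal sketch of the covariance and eigen-decomposition computation, assuming x_train_std is the standardized training matrix from the preprocessing section (the exact code in the notebook may differ).

#a minimal sketch: covariance matrix and eigen-decomposition of the standardized training data
import numpy as np
cov_matrix = np.cov(x_train_std, rowvar=False)          #30 x 30 covariance matrix
eig_values, eig_vectors = np.linalg.eigh(cov_matrix)    #eigh suits symmetric matrices
order = np.argsort(eig_values)[::-1]                    #sort components by decreasing eigenvalue
eig_values, eig_vectors = eig_values[order], eig_vectors[:, order]
print(eig_values[:3])                                   #variance carried by the first three components

np.linalg.eigh is used here because a covariance matrix is symmetric, so the eigenvalues returned are guaranteed to be real.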

So now which components should you choose?

Think..

Think..

Correct! The ones with the greatest eigenvalues, right?

I selected the two components with the highest eigenvalues and plotted a 2-D graph showing the distinction between ‘Malignant’ and ‘Benign’ on the basis of these two components. Here too, I played around a bit using different sets of components to see which pair gave better visualization of the two kinds of patients.
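
A hedged sketch of that 2-D projection, assuming the sorted eig_vectors from the sketch above and that y_train still holds the raw ‘M’/‘B’ labels at this stage:

#projecting the standardized data onto the first two principal axes and plotting the two classes
import numpy as np
import matplotlib.pyplot as plt
projected = x_train_std @ eig_vectors[:, :2]            #shape (455, 2)
for label, colour in [('M', 'red'), ('B', 'green')]:
    mask = (np.asarray(y_train) == label)
    plt.scatter(projected[mask, 0], projected[mask, 1], c=colour, label=label, alpha=0.6)
plt.xlabel('PC1'); plt.ylabel('PC2'); plt.legend(); plt.show()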

  • The first projection was made onto the plane of the 1st and 2nd principal components; the second onto the plane of the 2nd and 3rd principal components.
Fig: Plotting the data grouped by the two kinds of patients based on the 1st and 2nd principal components
Fig: Plotting the data grouped by the two kinds of patients based on the 2nd and 3rd principal components
  • From the plots it is clearly visible that the first projection makes a better separation of classes. This is an obvious result since the first component contains the highest proportion of information.

Try plotting the graph using the first and the third components. What do you get? Did the result improve?

Method 2

Next, I performed Principal Component Analysis using scikit-learn directly, which will seem effortless after the first method. Python does all the above calculations and finally presents us with a graph (scree plot) showing the principal components in order of the percentage of variation they explain. The plot shows that, cumulatively, about 73% of the total variation is explained by the first three components alone.
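
A minimal sketch of Method 2 with scikit-learn, again assuming x_train_std from above; the notebook's exact plotting code may differ.

#scree plot with scikit-learn: percentage of variance explained by each of the 30 components
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca_full = PCA().fit(x_train_std)                        #keep all 30 components
explained = pca_full.explained_variance_ratio_ * 100
plt.bar(range(1, len(explained) + 1), explained)
plt.xlabel('Principal component'); plt.ylabel('% of variance explained'); plt.show()
print(np.cumsum(explained)[:3])                          #cumulative variance of the first three components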

Fig: Scree plot to show the percentage of variation explained by each of the 30 components

To compare the results with Method 1, I plotted the data showing the two different classes based on the first two principal components, which gives a visualization similar to the one produced without scikit-learn. Hence, both methods have performed equally well, and Python comes out on par with the human brain!

Fig: Python scikit-learn giving a similar graph to the one above

What Next?..

Now that the data is reduced from 30 variables to 3, a feature vector is defined whose columns are the eigenvectors of the three chosen components, and the standardized data is projected onto it. The transformed data therefore has dimension 455 (rows) by 3 (columns).

#transforming the data to the reduced dimension data
#this is the data that we are going to work with during the analysis
from sklearn.decomposition import PCA
Pca = PCA(n_components = 3)   #three components, as chosen above
pca_data = Pca.fit_transform(x_train_std)
print('The reduced data is of the dimension: ', pca_data.shape)
Output: The reduced data is of the dimension:  (455, 3)

The first machine learning algorithms that I am about to implement are the K-Means clustering and Hierarchical Agglomerative Clustering models. Roughly speaking, clustering is an unsupervised ML algorithm and requires no training labels whatsoever, since it forms clusters from the entire data on the basis of the feature variables alone. Let us now see how these models work on our dataset.

K-means Clustering vs Hierarchical Clustering-

Before determining the clusters, how can we decide how many clusters to divide our data into? The clusters should be formed in such a way that they are homogeneous within themselves while being heterogeneous with respect to the other clusters.

In K-Means clustering we use the Within Cluster Sum of Squares (WCSS) in the elbow method plot to find the optimum number of clusters. Why ‘elbow’, you ask? Take a look at the plot and you will notice the curve is shaped like an elbow, hence the name. But how can we decide whether to use 2, 3 or even 5 clusters?
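
A rough sketch of the elbow computation, assuming pca_data is the reduced (455, 3) training matrix produced in the PCA step above:

#elbow method: within-cluster sum of squares (WCSS) for k = 1 to 10
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42).fit(pca_data)
    wcss.append(km.inertia_)                             #inertia_ is the WCSS
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters k'); plt.ylabel('WCSS'); plt.show()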

Fig: Elbow method plot

To resolve this confusion, I have used a metric called the Silhouette score, which measures how close each point in one cluster is to the points in the neighboring clusters. A Silhouette score of +1 suggests that the point is very far from the neighboring cluster, indicating that the clusters are distinct, while a score of 0 suggests that the point overlaps with, or is not very far from, the points in the other cluster. So, clearly, the number of clusters corresponding to the highest Silhouette score is the best choice for the analysis.
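
A minimal sketch of that comparison using scikit-learn's silhouette_score, again assuming pca_data from above:

#silhouette score for k = 2 to 5 clusters
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(pca_data)
    print(k, silhouette_score(pca_data, labels))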

Fig: Silhouette score for each number of clusters

As can be seen from the graph, 2 clusters give the best results.

Coming to the hierarchical method of clustering, it uses a dendrogram to determine the optimal number of clusters, which clearly shows that 2 clusters will give promising results.

#Using dendrogram to find the optimal number of clusters
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(pca_data, method = 'ward'))
Fig: Dendrogram

In the following lines of code, the two methods have been thoroughly compared using 2, 3, 4 and 5 clusters. With the help of visualization techniques, the clusters are presented in different colors. The graphs are truly eye-catching, aren’t they?
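
For reference, here is a hedged sketch of fitting both clustering models for a given k on pca_data; the notebook repeats this for k = 2 to 5 and colours each resulting cluster.

#fitting K-Means and hierarchical (Ward) agglomerative clustering for a chosen k
from sklearn.cluster import KMeans, AgglomerativeClustering
k = 2
km_labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(pca_data)
hc_labels = AgglomerativeClustering(n_clusters=k, linkage='ward').fit_predict(pca_data)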

Fig: Clustering done using 3 clusters

Fig: Clustering done using 5 clusters

As the number of clusters increases, they start overlapping with each other.

I have used another validation metric, known as the Davies-Bouldin index, to settle on the optimal number of clusters. It evaluates the ratio between the within-cluster scatter and the between-cluster separation.

So, what do you think will be an ideal value for this index?

If the clusters are well defined, compact and distinct, then the cluster scatter is sure to be less than the cluster separation. Hence, you are right to think that a lower value of the Davies-Bouldin index indicates better clustering. The table below gives the Davies-Bouldin index for the different values of k, each along with its respective centroid coordinates. As discussed earlier, the value of k with the lowest index is the best choice for clustering.
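
A minimal sketch of that comparison with scikit-learn's davies_bouldin_score, assuming pca_data from above:

#Davies-Bouldin index and cluster centroids for k = 2 to 5
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(pca_data)
    print(k, davies_bouldin_score(pca_data, km.labels_))
    print(km.cluster_centers_)                           #centroid coordinates for this k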

Fig: Table showing the Davies-Bouldin index for different numbers of clusters, accompanied by the cluster centroid coordinates

As is evident from the above discussion and the table thereafter, for k = 2, clustering will give the best results.

Can Clustering and Classification be applied together?

Basic knowledge of machine learning prompts a natural question: clustering is an unsupervised ML algorithm while classification is a supervised one, so how can both be applied to the same dataset? Is the data pre-labelled or not? Can we generate labels for unlabelled data?

The answers to these questions can be given once we compute the Purity metric of clustering. Before getting into the results I will tell you a bit about this metric and how it works.

  • Poor clustering has a purity value close to 0, while the highest possible purity score is 1. A purity of 1 means that every cluster contains points of only one class, so each cluster can be given a single label. Since classification is a supervised machine learning algorithm where the classes are pre-labelled, a purity close to 1 means the cluster assignments can be used as labels for classification.
  • In general, high purity is easy to achieve when the number of clusters is large; in the extreme case where every point forms its own cluster, purity is trivially 1 and there is no point in forming clusters at all.
  • In our case, I have computed this metric for both K-Means clustering and Hierarchical Agglomerative Clustering with the number of clusters set to 2, 3, 4 and 5, and found that in all cases the purity value is close to 1. Thus, even with a small number of clusters, we can apply classification. A sketch of the purity computation follows this list.
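
Purity is not a built-in scikit-learn metric; here is a minimal sketch of how it can be computed from a contingency table, assuming km_labels are cluster assignments from one of the fitted models above and y_train holds the true diagnosis labels.

#purity: assign each cluster its majority class, then count the correctly assigned points
import numpy as np
from sklearn.metrics.cluster import contingency_matrix
def purity_score(y_true, y_pred):
    cm = contingency_matrix(y_true, y_pred)              #rows = true classes, columns = clusters
    return np.sum(np.amax(cm, axis=0)) / np.sum(cm)
print(purity_score(y_train, km_labels))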

K-NN Classification, Logistic Regression or Support Vector Machine: who classifies the best?

Since the purity metric confirmed that classification models will work efficiently on the data, I decided to use four different models in order to compare the results. The target variable (y) shows whether the patient is diagnosed with breast cancer or not. It is a binary variable where ‘M’ stands for malignant and ‘B’ stands for benign. Label encoding helps in classifying and visualizing the data. For each model, I have used varying numbers of components, analyzed the outcomes, and presented the results in a tabular format.

#Encoding the target variable
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)  #reuse the encoder fitted on the training labels
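
A hedged sketch of fitting the four classifiers on the reduced data, assuming Pca is the fitted 3-component PCA object from the earlier snippet; the notebook repeats this for other numbers of components.

#fitting the four classification models on the PCA-reduced data and scoring them on the test set
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
models = {'K-NN': KNeighborsClassifier(n_neighbors=5),
          'Logistic Regression': LogisticRegression(),
          'Linear SVM': SVC(kernel='linear'),
          'Kernel SVM': SVC(kernel='rbf')}
pca_test = Pca.transform(x_test_std)                     #reuse the PCA fitted on the training set
for name, model in models.items():
    model.fit(pca_data, y_train)
    print(name, accuracy_score(y_test, model.predict(pca_test)))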

Visualizations of classification results are a treat to watch! Using 2 principal components I have presented some fascinating graphs for you to enjoy..

Fig: K-Nearest Neighbors Classifier on training and testing sets

Fig: Logistic Regression on training and testing sets

Fig: Linear SVM on training and testing sets

Fig: Radial Basis Function (kernel) SVM on training and testing sets

Well, that was a lot of graphs in one go! But could you point out some of the differences among them? It looks like linear classifiers work better on this dataset.

Using different numbers of principal components, I have applied the classification models to come back to our primary question: how many principal components do we actually require?

The accuracy results (in %) obtained from the classification analysis can be tabulated as follows-

Fig: Table showing the overall accuracy of each classification model for a specific number of principal components

If we study the table closely, the following conclusions can be drawn:

  • For 2 principal components, K-NN gives the greatest accuracy while kernel SVM gives the lowest
  • K-NN reaches the maximum accuracy for 4 components and thereafter it begins to decrease
  • Linear SVM results in the greatest accuracy of about 96% using 4 principal components
  • Logistic Regression also was able to gain about 95% accuracy when 4 principal components were considered and after that, it remained constant
  • Kernel SVM provided an accuracy of about 94% using 4 principal components
  • The overall accuracy obtained by Kernel SVM is lower than that obtained by Linear SVM, which suggests that the two classes are better separated by a straight line

Some other comparable metrics that I have added include-

Confusion matrix: It is a 2×2 contingency table showing the number of correct predictions and the number of wrong predictions made by the model

Accuracy score for each separate class: Calculates the individual accuracy of the model in predicting each class

Average class specific accuracy: The mean of the individual accuracies calculated above

Positive predictive value score: The probability that a patient who is predicted by the model to have cancer is actually suffering from cancer.
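
A minimal sketch of these metrics for a binary problem, assuming y_pred holds the test-set predictions of one of the classifiers above and malignant is encoded as 1:

#confusion matrix, class-specific accuracies and positive predictive value
from sklearn.metrics import confusion_matrix, precision_score
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
class_acc = [tn / (tn + fp), tp / (tp + fn)]             #accuracy for benign (0) and malignant (1)
print('Accuracy per class:', class_acc)
print('Average class specific accuracy:', sum(class_acc) / 2)
print('Positive predictive value:', precision_score(y_test, y_pred))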

Go and check out the entire code which I have provided and you will surely understand these values.

Concluding we can say..

Since our objective was to decrease the dimension of the data while retaining the maximum information, I think it is safe to say that the optimum number of principal components should be 4.

Do you think otherwise?

I hope reading this article contributed to your data science knowledge repository and helped you gain deeper insights. So, remember from next time: don’t overfeed your Python notebook if you can give it the same nutrients with a smaller amount of food.. Thank you!

