
Principal Component Analysis and SVM in a Pipeline with Python

Pipeline, GridSearchCV and Contour Plot

Decision Boundary (Picture: Author’s Own Work, Saitama, Japan)

In a previous post I described principal component analysis (PCA) in detail, and in another the mathematics behind the support vector machine (SVM) algorithm. Here, I will combine SVM, PCA, and grid-search cross-validation to create a pipeline that finds the best parameters for binary classification, and eventually plot a decision boundary to show how well our algorithm has performed. What you can expect to learn/review in this post –

  • Joint-plots and representing data in a meaningful way through the Seaborn library.
  • If you have more than 2 components in principal component analysis, how do you choose and represent the 2 components that are more relevant than the others?
  • Creating a pipeline with PCA and SVM to find the best-fit parameters through grid-search cross-validation.
  • Finally, choosing the 2 principal components to represent the SVM decision boundary in a 3D/2D plot, drawn using Matplotlib.

1. Know the Data-Set Better: Joint-plots and Seaborn

Here, I have used the scikit-learn breast cancer data-set, a relatively easy data-set for studying binary classification, with the 2 classes being Malignant and Benign. Let’s look at a few rows of the data-frame.

As we can see, there are a total of 569 samples and 30 features in the data-set, and our task is to separate malignant samples from benign ones. After checking that there are no missing data, we look at the feature names and the correlation plots of the mean features.
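If you want to follow along, here is a minimal sketch of how such a data-frame can be built; the variable names are my own choice for illustration, not necessarily the ones used in the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the scikit-learn breast cancer data-set into a pandas data-frame
cancer = load_breast_cancer()
cancer_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
cancer_df['target'] = cancer.target  # 0 = malignant, 1 = benign

print(cancer_df.shape)                  # (569, 31): 569 samples, 30 features + target
print(cancer_df.isnull().sum().sum())   # quick check for missing values
print(cancer_df.head())
```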

Below is the correlation plot of the mean features, plotted using the seaborn library. As expected, ‘area’, ‘perimeter’, and ‘radius’ are highly correlated.

Fig. 1: Correlation plot of mean features.

We can use seaborn’s ‘jointplot’ to understand the relationship between individual features. Let’s see 2 examples below where, as an alternative to scatter plots, I have opted for 2D density plots. On the right panel, I used the ‘hex’ setting where, along with the histograms, we can see how many points are concentrated in each small hexagonal area. The darker the hexagon, the more points (observations) fall in that region, and this intuition can also be checked against the histograms plotted on the margins for the 2 features.

Fig. 2: Joint-plots can carry more info than simple scatter plots.

On the left, apart from the histograms of the individual features plotted on the margins, the contours represent a 2D kernel density estimate (KDE). Instead of just discrete histograms, KDEs are often useful, and you can find one fantastic explanation here.
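A minimal sketch of the two joint-plots in Fig. 2 is given below, continuing from the cancer_df data-frame built above. The specific feature pair (‘mean radius’ vs ‘mean texture’) is my own example; the original figure may use a different pair.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# 2D KDE joint-plot (left panel of Fig. 2)
sns.jointplot(x='mean radius', y='mean texture', data=cancer_df, kind='kde')

# Hexbin joint-plot (right panel of Fig. 2)
sns.jointplot(x='mean radius', y='mean texture', data=cancer_df, kind='hex')
plt.show()
```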

We can also draw some pair plots to study which features are more relevant for separating malignant from benign samples. Let’s see one example below –
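A sketch of such a pair plot is below; the particular subset of mean features is an assumption on my part, chosen only for illustration.

```python
# Pair plot over a handful of mean features, coloured by the target class
mean_features = ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'target']
sns.pairplot(cancer_df[mean_features], hue='target')
plt.show()
```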

Fig. 3: Pair plots of a few features in the Cancer data-set. Code can be found in my GitHub.

Once we have explored the data-set enough to understand what we have in hand, let’s move on to the main classification task.


2. Pipeline, GridSearchCV, and Support Vector Machine:

2.1. Principal Components:

Now we will create a pipeline using StandardScaler, PCA and Support Vector Machine following the steps below –

  • Start by splitting the data-set into train and test sets.
  • Check the effect of principal component analysis: PCA reduces the dimensions of the feature space in such a way that the final features are orthogonal to each other. As we have 30 features in the cancer data-set, it is good to visualize what PCA actually does. You can read more details in my other post on PCA. Before applying PCA, we normalize our features by subtracting the mean and scaling to unit variance, using StandardScaler. So, we start off by selecting 4 orthogonal components, as sketched below –
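A minimal sketch of these two steps follows; the split fractions and random seed are my own assumptions, not values quoted in the post.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Train/test split (test_size and random_state are illustrative choices)
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, stratify=cancer.target, random_state=42)

# Standardize first, then project onto 4 orthogonal principal components
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

pca = PCA(n_components=4)
X_train_pca = pca.fit_transform(X_train_scaled)
print(X_train_pca.shape)  # (number of training samples, 4)
```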

Let’s plot the cancer data for these 4 selected principal components –
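One way to produce such a figure is to scatter every pair of the 4 components against each other, coloured by class; the layout below is a sketch, not necessarily the exact plotting code behind Fig. 4.

```python
import itertools
import matplotlib.pyplot as plt

# Scatter plots of every pair of the 4 principal components, coloured by class
fig, axes = plt.subplots(2, 3, figsize=(12, 7))
for ax, (i, j) in zip(axes.ravel(), itertools.combinations(range(4), 2)):
    ax.scatter(X_train_pca[:, i], X_train_pca[:, j], c=y_train, cmap='coolwarm', s=10)
    ax.set_xlabel(f'PC {i + 1}')
    ax.set_ylabel(f'PC {j + 1}')
plt.tight_layout()
plt.show()
```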

Fig. 4: Which principal components are more relevant?

As you can see in the plots, the first 2 principal components are the most relevant for separating malignant and benign samples. How about the variance ratio?
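The variance ratio can be read directly from the fitted PCA object:

```python
# Fraction of the total variance explained by each of the 4 components
print(pca.explained_variance_ratio_)
# Combined contribution of the first 2 components (roughly 0.8 for this data-set)
print(pca.explained_variance_ratio_[:2].sum())
```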

As expected, the first 2 components contribute ~80% of the total variance. This is worth showing before choosing 2 components for plotting the decision boundary because you may have a data-set with many features where choosing 2 principal components is not justified in terms of the percentage of variance explained. It is always a good check, before you create a pipeline with PCA and some classifier, to justify the choice of 2 principal components.

2.2. Pipeline & Grid-Search Cross Validation:

Once we have seen how important PCA is for classification and for plotting the classifier’s decision boundary, we create a pipeline with StandardScaler, PCA and SVM.

You can check more about Pipeline and grid-search cross-validation in detail in a post I wrote separately. I chose 2 principal components because our goal is to draw the decision boundary in a 2D/3D plot, and the best parameters ‘C’ and ‘gamma’ for an SVM with a radial basis function kernel are obtained with the number of principal components held fixed. Let’s check the fit parameters –
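Below is a minimal sketch of such a pipeline and grid search; the candidate values in param_grid are illustrative assumptions, not the exact grid used in the original notebook.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Scaling -> 2 principal components -> RBF-kernel SVM
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('svm', SVC(kernel='rbf'))
])

# Hypothetical search grid over the SVM hyper-parameters
param_grid = {
    'svm__C': [0.1, 1, 10, 100],
    'svm__gamma': [0.01, 0.1, 0.5, 1, 10],
}

grid = GridSearchCV(pipe, param_grid, cv=4)
grid.fit(X_train, y_train)

print(grid.best_params_)            # best 'C' and 'gamma'
print(grid.score(X_test, y_test))   # accuracy on the held-out test set
```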

Here, we see that using 2 principal components and 4-fold cross-validation, our pipeline with an SVM classifier obtained 94% accuracy. You can, of course, play around with different values or maybe use a different kernel. For more on the mathematics behind kernel functions, you can check my other post. Before moving on to the next section, we can complete the analysis by plotting the confusion matrix, from which one can obtain precision, recall and F1 score.
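A quick sketch of the confusion matrix, using the fitted grid-search object from above (requires a reasonably recent scikit-learn for ConfusionMatrixDisplay):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_pred = grid.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=cancer.target_names).plot()
plt.show()
```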

Fig. 5: Confusion Matrix obtained using our pipeline with SVM classifier and RBF kernel.

3. Plot SVM Decision Boundary:

I have followed Scikit-Learn’s tutorial for plotting the maximum-margin separating hyperplane, but instead of the linear kernel used in the tutorial, a radial basis function kernel is used here. We are going to use the decision_function method, which, for binary classification, returns the signed distance of each sample from the separating hyperplane. Intuitively, we can think of this method as telling us on which side of the hyperplane generated by the classifier we are, and how far from it. For the mathematical formulation of the decision rule for SVM, you can check my previous post if you are interested. Let’s see the contours of the decision function –

Fig. 6: Decision Function contours with Radial Basis Function kernel is shown here along with the support vectors. Best fit-parameters are obtained from 5 fold grid search cross-validation.

Let’s understand the code I have used to plot the above figure –

  • Start off by choosing 2 principal components for PCA. Only 2 components, because this lets us visualize the boundary in a 2D/3D plot; visualizing more than 2 components along with the decision function contours is problematic!
  • Set up the SVM classifier with a radial basis function kernel, with the ‘C’ and ‘gamma’ parameters set to the best-fit values obtained from grid-search cross-validation.
  • Define a function to create contours from x, y (eventually the chosen principal components) and Z (the SVM decision function).
  • Create a rectangular grid out of an array of x values and an array of y values with the function make_meshgrid. You can check why a NumPy meshgrid is necessary.
  • Finally, plot the decision function as a 2D contour plot along with the support vectors as scatter points. A minimal sketch of these steps is given right after this list.
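The sketch below follows those steps, assuming the best-fit values quoted later in the post (C = 1, γ = 0.5) and reusing X_train_scaled and y_train from the earlier snippets; it is an illustration of the approach, not the exact plotting code behind Fig. 6.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Project the standardized training data onto 2 principal components
pca2 = PCA(n_components=2)
X_pca2 = pca2.fit_transform(X_train_scaled)

# RBF-kernel SVM with the (assumed) best-fit parameters
clf = SVC(kernel='rbf', C=1.0, gamma=0.5)
clf.fit(X_pca2, y_train)

def make_meshgrid(x, y, h=0.02):
    """Rectangular grid spanning the range of the two principal components."""
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    return np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

xx, yy = make_meshgrid(X_pca2[:, 0], X_pca2[:, 1])

# Evaluate the decision function on the grid and reshape for contour plotting
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Filled contours of the decision function, data points, and support vectors
plt.contourf(xx, yy, Z, levels=20, cmap='coolwarm', alpha=0.6)
plt.scatter(X_pca2[:, 0], X_pca2[:, 1], c=y_train, cmap='coolwarm', s=15, edgecolors='k')
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=60, facecolors='none', edgecolors='k', label='support vectors')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend()
plt.show()
```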

Hopefully, this post has helped you understand the strategy of setting up a support vector machine classifier and effectively using principal component analysis to visualize the decision boundary.

Let’s see a 3D animated view of it –

Fig. 7: Decision Function contours as viewed from the top. 3D representation of figure 6.

Finally, I conclude the post by emphasizing that it is always good to check your understanding, and here we can see how the gamma parameter affects the decision function contours.

For the plots above, we had γ = 0.5, C = 1. You can read my other post on SVM kernels, where I have discussed how increasing γ parameter can give rise to complicated decision boundaries.

Let’s check this using 2 different values of ‘gamma’, keeping the ‘C’ parameter fixed to its best-fit value; a minimal code sketch for this comparison is given after the figures below –

  • For γ = 0.01
Fig. 8: Decision function contours plotted with low gamma (γ=0.01) value. Can be compared with figure 6, where γ = 0.5.
  • For γ = 10.0
Fig. 9: High gamma parameter is causing extremely complicated decision boundaries.
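The comparison amounts to refitting the same classifier with different gamma values and redrawing the contours; a sketch, reusing X_pca2, y_train, xx and yy from the earlier snippet:

```python
# Refit and replot the decision function for two other gamma values, C kept fixed at 1.0
for gamma in (0.01, 10.0):
    clf = SVC(kernel='rbf', C=1.0, gamma=gamma)
    clf.fit(X_pca2, y_train)
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, levels=20, cmap='coolwarm', alpha=0.6)
    plt.scatter(X_pca2[:, 0], X_pca2[:, 1], c=y_train, cmap='coolwarm', s=15, edgecolors='k')
    plt.title(f'gamma = {gamma}, C = 1.0')
    plt.show()
```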

As you can see, increasing the gamma parameter gives rise to extremely complicated contours. Most importantly, for high gamma values, almost all the samples (of each class) act as support vectors. This is certainly a case of over-fitting for such a simple, easy-to-classify cancer data-set.

"Always check your intuition and understanding !!" – Anonymous


For further reading, I suggest you check the several excellent demonstrations given in Sebastian Raschka’s book on Machine Learning with Python (Pages: 76–88, 2nd Edition, September 2017).

Stay strong and happy, Cheers!

Find the code used for this post on GitHub.

