
For a couple of months now, I have been looking for projects to work on. I have gone from wanting to work on this NFL challenge to actually building a web app with a friend that some people would probably classify as PG-13 (one day I will write about it and, maybe, deploy it).
I ended up looking for datasets on Kaggle that could help me brush up on my time series skills. You know… just for fun. That was when I ran into this dataset about wind power forecasting and decided to give it a try.
Let’s do time series analysis!
After I cleaned up the data (got rid of some columns with a huge number of missing values, turned some categorical features into numbers, parsed the dates, filled some missing values with the median, dropped more columns that didn’t add much to the data, etc.), I found that:
- The features left were highly correlated with the dependent variable 😍
- Many of the features were also highly correlated with each other 😫
When your features are correlated with each other, they are not independent. This is a common problem in data science called "multicollinearity".
In an ideal case, features should be independent of each other. Solving this issue is super important, because highly correlated features in multivariate regression models inflate the variance of the coefficient estimates, which makes the results unstable and hard to interpret.
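Here is a minimal sketch of how one could spot the highly correlated pairs with pandas; the filename and the 0.9 cutoff are just placeholders:

```python
import numpy as np
import pandas as pd

# Placeholder: load the cleaned dataset (the filename is illustrative).
df = pd.read_csv("wind_data_clean.csv")

# Correlation matrix of the numeric columns.
corr = df.corr(numeric_only=True)

# Keep only off-diagonal pairs with |correlation| above an arbitrary 0.9 cutoff.
off_diagonal = ~np.eye(len(corr), dtype=bool)
high_corr = corr.abs().where(off_diagonal).stack()
print(high_corr[high_corr > 0.9].sort_values(ascending=False))
```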
One way to approach this problem is Principal Component Analysis (PCA).
Did I say time series before? Oh, never mind, we are doing PCA now
PCA is a feature extraction method commonly used to tackle multicollinearity, among other things. The greatest advantage of PCA in this case is that, after applying it, each of the "new" variables will be uncorrelated with one another.
This section is based on this article by Matt Brems (I have to thank my time series professor for sending it to me!). I will go step by step and explain how to do PCA using just NumPy (and a little bit of pandas to manipulate the DataFrame at the beginning).
Step 1
It is very important to clean the dataset first. Since that is not the objective of this article, you can go take a look at what I did in this repo.
Step 2
Separate your data into the dependent variable (Y) and the features or independent variables (X).
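Here is a minimal sketch of this split, assuming the cleaned data lives in a DataFrame called `df` and the dependent variable is in a column named "Power" (the actual column name may differ):

```python
# "Power" is an assumed name for the dependent variable column.
Y = df["Power"].to_numpy()
X = df.drop(columns=["Power"]).to_numpy(dtype=float)
```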
Step 3
Take the matrix of independent variables X and, for each column, subtract the mean of that column from every value in it. (This ensures that each column has a mean of zero.) You can also normalize X by dividing each column by its standard deviation. The purpose of this step is to standardize the features so they have a mean equal to zero and a standard deviation equal to one.
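With NumPy broadcasting this takes only a couple of lines (continuing with the `X` from the previous step):

```python
# Standardize column by column: subtract the mean, then divide by the standard deviation.
X_mean = X.mean(axis=0)
X_std = X.std(axis=0)
Z = (X - X_mean) / X_std
```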
Step 4
This is a sanity-check step. Let’s make sure the mean and standard deviation are zero and one, respectively.
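Something along these lines does the trick, allowing for floating-point error:

```python
import numpy as np

# Every column of Z should now have mean ~0 and standard deviation ~1.
print(np.allclose(Z.mean(axis=0), 0))  # True
print(np.allclose(Z.std(axis=0), 1))   # True
```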
Step 5
Take the matrix Z, transpose it, and multiply the transposed matrix by Z. Up to a scaling factor of 1/(n−1), ZᵀZ is the covariance matrix of Z. (The scaling only changes the size of the eigenvalues, not the eigenvectors, so it does not affect the principal components.)
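In code this is a single matrix product; `np.cov` is shown as an alternative that already includes the 1/(n−1) factor:

```python
import numpy as np

# Z transposed times Z (proportional to the covariance matrix of Z).
cov = Z.T @ Z

# Equivalent up to scaling, with the 1/(n-1) factor included:
# cov = np.cov(Z, rowvar=False)
```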
Step 6
Compute the eigenvalues of this matrix and the matrix whose columns are the corresponding normalized eigenvectors, in matching order.
In this step it is important to make sure that the eigenvalues and their eigenvectors are sorted in descending order (from the largest eigenvalue to the smallest). Sort the eigenvalues first and then reorder the eigenvectors accordingly.
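A sketch of this step with `np.linalg.eig`; since `np.argsort` sorts in ascending order, the index array is reversed to get the eigenvalues from largest to smallest:

```python
import numpy as np

# Eigendecomposition of the symmetric matrix from step 5.
# (np.linalg.eigh is an alternative for symmetric matrices; it returns ascending order.)
eig_values, eig_vectors = np.linalg.eig(cov)

# Sort the eigenvalues, and their eigenvectors, in descending order.
order = np.argsort(eig_values)[::-1]
eig_values = eig_values[order]
eig_vectors = eig_vectors[:, order]
```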
Step 7
Assign P to the matrix of eigenvectors and D to the diagonal matrix with the eigenvalues on the diagonal and zeros everywhere else. The eigenvalue in each diagonal position of D is associated with the corresponding column of P.
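With the sorted results from step 6, this is just a relabelling plus `np.diag`:

```python
import numpy as np

# P: eigenvectors as columns; D: the matching eigenvalues on the diagonal.
P = eig_vectors
D = np.diag(eig_values)
```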
Step 8
Calculate Z* = ZP. This new matrix, Z*, is a centered or standardized version of X, but now each observation is a combination of the original variables, where the weights are determined by the eigenvectors.
One important thing about this new matrix Z* is that, because the eigenvectors in P are orthogonal to one another, the columns of Z* will be uncorrelated with each other!
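The projection itself is one matrix multiplication; the print afterwards is just a quick way to convince yourself that the columns of Z* really are uncorrelated:

```python
import numpy as np

# Project the standardized data onto the eigenvectors: Z* = ZP.
Z_star = Z @ P

# The correlation matrix of the columns of Z* is (numerically) the identity.
print(np.round(np.corrcoef(Z_star, rowvar=False), 3))
```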
Step 9
With Z* one can now make the decision about how many features to keep versus how many to drop. This is commonly done through a scree plot.

A scree plot shows how much variation each principal component captures from the data. The y-axis represents the amount of variation (see this article for more information about scree plots and how to interpret them). The plot on the left shows both the amount of variation explained by each individual principal component and the cumulative variation obtained by adding each additional principal component to the model.
How many principal components to choose comes down to a rule of thumb: the selected principal components should describe at least 80–85% of the variance. In this case, the first principal component alone explains roughly 82% of the variance, and adding the second pushes that figure to almost 90%.
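Here is a minimal way to compute and plot these proportions from the sorted eigenvalues; the 85% line is just the rule of thumb mentioned above, and matplotlib is assumed to be available:

```python
import matplotlib.pyplot as plt
import numpy as np

# Proportion of variance explained by each component, and the running total.
explained = eig_values / eig_values.sum()
cumulative = np.cumsum(explained)
components = np.arange(1, len(explained) + 1)

plt.plot(components, explained, "o-", label="per component")
plt.plot(components, cumulative, "s--", label="cumulative")
plt.axhline(0.85, color="grey", linestyle=":", label="85% rule of thumb")
plt.xlabel("Principal component")
plt.ylabel("Proportion of variance explained")
plt.legend()
plt.show()
```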
If you are wondering how to interpret the principal components in the context of the data, I found this article particularly easy to follow. It uses the iris dataset and gives lots of examples using the scree, profile, and pattern plots.
Conclusions
PCA can be tricky sometimes, but with the right resources and patience it can become enjoyable. On top of that, I found it useful to get back to basics and get a full grasp of PCA by coding everything with NumPy and understanding the linear algebra behind every line of code.
I hope you found this article helpful! You can see the full Python script here.
Also, if you have any feedback, please do not hesitate to get in touch.
Thanks for reading and happy coding! 🤓