
Evaluation of Clustering Methods for Discovering Wind Turbine Neighbors

An application of data science techniques in renewable energy data analysis

Photo by Abby Anaday on Unsplash

Introduction

In this article, we will employ popular clustering algorithms to discover turbine "neighbors" on a wind farm. The term "neighbors" here refers to a group of turbines that share similar characteristics based on geographical location, wind speed, power output, or other environmental and mechanical variables.

In a previous article titled "Clustering-based data preprocessing for operational wind turbines", we covered in detail the cluster-based approach for imputing missing values during wind turbine data preprocessing. That analysis used only the KMeans algorithm; the focus here is to evaluate other available algorithms for the same purpose.


In choosing the "best" model for grouping the turbines, we are specifically looking for the model that creates the optimal number of clusters whose statistics give the best prediction of missing wind speed values.

We use the root mean square error (RMSE) and mean absolute error (MAE) to evaluate the missing wind speed predictions. In addition, we evaluate predictions of the non-zero missing values using the mean absolute percentage error (MAPE), since the percentage error is undefined when the true value is zero.
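As a minimal sketch (the arrays below are illustrative values only, not the actual test data), these metrics can be computed with Scikit-learn:

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
)

# Illustrative values: actual vs. cluster-average predicted wind speeds.
y_true = np.array([5.2, 0.0, 7.8, 3.1])
y_pred = np.array([5.0, 0.4, 8.1, 2.9])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)

# MAPE only over the non-zero actual values, where it is well defined.
nonzero = y_true != 0
mape = mean_absolute_percentage_error(y_true[nonzero], y_pred[nonzero])
print(rmse, mae, mape)
```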

It should be noted that the clustering-based approach may not fill all missing values if there are no available data for turbines in a cluster at the desired timestamp or if some clusters consist of too few turbines. Hence, we will also report the percentage of missing values filled by the algorithms.

Efficiently grouping turbines in a wind farm can reduce both the human and computational effort required for data analysis.


The Data

We use the publicly available Longyuan wind turbine data hosted on Kaggle, cited in the references below. The data is released for public use here. This is an operational dataset from 134 turbines covering 245 days at a 10-minute resolution. Turbine location data is also provided. The heads of the datasets are shown below:

Image by author
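As a hedged sketch (the file names below are placeholders; substitute the actual files from the Kaggle release), the two datasets can be loaded and inspected with pandas:

```python
import pandas as pd

# Placeholder file names -- replace with the actual files from the Kaggle release.
scada = pd.read_csv("wind_turbine_scada_245days.csv")   # 10-minute operational data
locations = pd.read_csv("wind_turbine_locations.csv")   # turbine coordinates

print(scada.head())
print(locations.head())
```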

Data Preparation

We first perform data cleaning and filtering to extract the data points representing the normal operating conditions of the turbines. The raw and filtered operational data are shown below:

Image by author

In addition, we create a test set from the cleaned data for evaluating the algorithms. The test set consists of 19,940 rows, which is about 1.05 percent of the cleaned data.

The training data is further transformed into a pivot-table-like structure with the turbine ID as the row index, the timestamp as the column index, and wind speed as the values.

Image by author

To conclude the data preparation step, we scale the transformed data using Scikit-learn's StandardScaler before passing it to the clustering algorithms.
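A minimal sketch of this transformation is shown below. The column names (TurbID, Tmstamp, Wspd) are assumed from the SDWPF schema, train is the cleaned training DataFrame from the previous step, and filling residual gaps with the column mean is one simple choice rather than necessarily the method used in the original analysis:

```python
from sklearn.preprocessing import StandardScaler

# One row per turbine, one column per timestamp, wind speed as values.
wide = train.pivot_table(index="TurbID", columns="Tmstamp", values="Wspd")

# Fill any remaining gaps (here, with the column mean) so the scaler and the
# clustering algorithms receive a complete matrix.
wide = wide.fillna(wide.mean())

# Standardize each timestamp column to zero mean and unit variance.
X = StandardScaler().fit_transform(wide)
```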


Cluster Modeling

We create turbine clusters based on wind speed using the KMeans, Agglomerative Clustering (AGC), Gaussian Mixture Model (GMM), and Affinity Propagation (AP) algorithms. The cluster average at a given timestamp is used to predict (fill) a member turbine's missing value at that timestamp.

In addition, important hyperparameters such as the number of clusters and the damping factor (in the case of AP) are chosen using the Silhouette metric. The Silhouette metric evaluates the quality of the clusters produced by an algorithm and can therefore be used to select these hyperparameters.
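A hedged sketch of this selection procedure for KMeans is shown below, where X is the scaled turbine-by-timestamp matrix from the data preparation step and the candidate range of cluster counts is an assumption:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Sweep candidate cluster counts and keep the one with the best silhouette score.
scores = {}
for k in range(2, 16):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```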

KMeans Clustering

This is the most popular clustering method. It creates clusters by minimizing the within-cluster sum of squared distances, which is equivalent to maximizing the between-cluster sum of squares. It requires the number of clusters to be selected as a hyperparameter.

Image by author

From the plot above, the optimal number of clusters for this model is 6. Next, we fit the model using this hyperparameter.
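A minimal fitting sketch, assuming X and the pivoted DataFrame wide from the data preparation step:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Fit KMeans with the silhouette-selected number of clusters.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X)

# Map each turbine ID to its cluster of wind-speed "neighbors".
kmeans_clusters = pd.Series(kmeans_labels, index=wide.index, name="cluster")
print(kmeans_clusters.value_counts())
```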

Agglomerative Clustering (AGC):

This is a type of hierarchical clustering algorithm that recursively merges pairs of clusters using a linkage distance, such that data points in the same cluster are more similar to one another and data points in different clusters are dissimilar. The optimal number of clusters must be selected, as in the case of KMeans.

Image by author

For this model, the optimal number of clusters is 10 and we use this hyperparameter to fit the model.
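A hedged sketch of fitting this model is shown below; the "ward" linkage is an assumption, as the article does not state which linkage criterion was used:

```python
from sklearn.cluster import AgglomerativeClustering

# Hierarchical clustering with 10 clusters; "ward" linkage merges the pair of
# clusters that gives the smallest increase in within-cluster variance.
agc = AgglomerativeClustering(n_clusters=10, linkage="ward")
agc_labels = agc.fit_predict(X)
```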

Gaussian Mixture Models (GMM):

This is a probabilistic model that learns a mixture of a finite number of Gaussian distributions from the training data and employs the expectation-maximization (EM) algorithm to fit them. In this case, we use KMeans to initialize the model components and the Silhouette metric to select the optimal number of clusters.

Image by author

The optimal number of clusters for the GMM algorithm is 10.
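A minimal sketch of the fit, again assuming X from the data preparation step:

```python
from sklearn.mixture import GaussianMixture

# Ten Gaussian components, initialized with KMeans and fitted via EM.
gmm = GaussianMixture(n_components=10, init_params="kmeans", random_state=42)
gmm_labels = gmm.fit_predict(X)
```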

Affinity Propagation (AP):

This algorithm creates clusters by exchanging messages between pairs of data points to determine how well suited each point is to represent the others, and this process is repeated iteratively until convergence is achieved. Although the damping factor and the preference are key hyperparameters for this algorithm, we tune only the damping factor.

Image by author

The optimal damping factor is 0.6 and this results in the creation of 13 clusters.
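A hedged sketch of the fit; leaving the preference at its Scikit-learn default (the median of the input similarities) is an assumption:

```python
from sklearn.cluster import AffinityPropagation

# Damping factor of 0.6 selected via the silhouette metric; the preference is
# left at its default value.
ap = AffinityPropagation(damping=0.6, random_state=42)
ap_labels = ap.fit_predict(X)
print(len(set(ap_labels)))  # number of clusters discovered
```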


Results

Using the optimal hyperparameters for each algorithm, we group the turbines that are considered neighbors with respect to wind speed. The results are shown below:

Image by author

Across all models studied, the results show that turbines at both edges along the X-axis are more likely to be neighbors with one another than with geographically closer turbines that sit between other turbines. Similarly, turbines near the middle of the X-axis are more likely to be wind-speed neighbors with one another.

However, the models differ in how they partition the middle-row turbines, which directly impacts the representativeness of the resulting clusters and, hence, their ability to accurately predict the missing values of member turbines.

These results are quite intuitive, given that turbines at the edges of the farm are likely to experience similarly minimal obstruction of the wind compared with those in the middle of the park.


Model Evaluation

Based on the test data, we evaluate the performance of each clustering model in predicting missing wind speed values at a given timestamp from cluster average values.
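A minimal sketch of this imputation rule, assuming wide is the turbine-by-timestamp wind speed matrix (with the test values masked out as NaN) and labels holds a given model's cluster assignment for each turbine:

```python
import numpy as np

def impute_with_cluster_mean(wide, labels):
    """Fill a turbine's missing wind speed with the mean of its cluster
    members at the same timestamp."""
    filled = wide.copy()
    for cluster_id in np.unique(labels):
        members = wide.index[labels == cluster_id]
        # Per-timestamp (column-wise) mean over the member turbines.
        cluster_mean = wide.loc[members].mean()
        filled.loc[members] = wide.loc[members].fillna(cluster_mean)
    return filled
```

A timestamp at which every member of a cluster is missing remains unfilled, which is why the percentage of missing values filled differs between models.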

In addition, the fewer the turbines in a cluster, the less able that cluster is to fill missing values, because there is less data to draw from at each timestamp. Hence, clustering models that produce a large number of small clusters may fill fewer missing values. The table below shows a summary of the results:

Next, we visualize the performance of the models.

Image by author

The Affinity Propagation model gives the best performance in terms of MAE and MAPE, while the KMeans model gives the lowest RMSE. In general, the performances of all considered algorithms are comparable, and their results can be combined to fill more missing values.

The RMSE metric is more sensitive to outliers and penalizes large deviations more heavily than the other metrics because the errors are squared before averaging, whereas the MAE and MAPE are more intuitive measures of model performance.

A sample of the predicted missing values displayed below shows a very good agreement between the models and the ground truth.

Image by author

Now, we visualize the imputation errors.

Image by author

The imputation errors for all algorithms are well-behaved and symmetric around the zero-error position.

The Jupyter Lab notebook containing all the Python code used in this article can be found here.


Conclusions

We have explored four different clustering algorithms for creating turbine groups. These turbine groups, if optimally created, can provide useful information about member turbines that can be used to fill missing values during data preprocessing.

The KMeans model is quite intuitive and fills the most missing values. However, other methods, such as Affinity Propagation, give better performance based on MAE and MAPE. Hence, the results from these algorithms can be combined for greater benefit.

I hope you enjoyed reading this article, until next time. Cheers!


What’s more interesting? You can access more enlightening articles from me and other authors by subscribing to Medium via my referral link below which also supports my writing.

Join Medium with my referral link – Abiodun Olaoye

Don’t forget to check other stories on applying state-of-the-art Data Science principles in the renewable energy space.

References

Zhou, J., Lu, X., Xiao, Y., Su, J., Lyu, J., Ma, Y., & Dou, D. (2022). SDWPF: A Dataset for Spatial Dynamic Wind Power Forecasting Challenge at KDD Cup 2022. arXiv. https://doi.org/10.48550/arXiv.2208.04360

Clustering-based data preprocessing for operational wind turbines

Wind energy analytics toolbox: Iterative power curve filter

