Introduction
In this article, we will employ popular clustering algorithms to discover turbine "neighbors" on a wind farm. The term neighbors here refers to a group of turbines with similar characteristics, such as geographical location, wind speed, power output, or other environmental and mechanical variables.
In a previous article titled "Clustering-based data preprocessing for operational wind turbines", we covered in detail the cluster-based approach for imputing missing values during wind turbine data preprocessing. That analysis used only the KMeans algorithm; the focus here is to evaluate other available algorithms for the same purpose.
In choosing the "best" model for grouping the turbines, we are specifically looking for the model that creates the optimal number of clusters whose statistics give the best prediction of missing wind speed values.
We use the root mean square error (RMSE) and mean absolute error (MAE) to evaluate the missing wind speed predictions. In addition, we evaluate predictions of the non-zero missing values using the mean absolute percentage error (MAPE).
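For concreteness, here is a minimal sketch of how these three metrics can be computed with NumPy and scikit-learn; y_true and y_pred are placeholder arrays of the held-out and imputed wind speeds, and evaluate_imputation is an illustrative helper name.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def evaluate_imputation(y_true, y_pred):
    """RMSE and MAE on all points; MAPE only on the non-zero true values."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    nonzero = y_true != 0  # MAPE is undefined when the true value is zero
    mape = np.mean(np.abs((y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero])) * 100
    return rmse, mae, mape
```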
It should be noted that the clustering-based approach may not fill all missing values if there are no available data for turbines in a cluster at the desired timestamp or if some clusters consist of too few turbines. Hence, we will also report the percentage of missing values filled by the algorithms.
Efficiently grouping turbines in a wind farm can reduce both the human and computational effort required for data analysis.
The Data
We use the publicly available Longyuan wind turbine data on Kaggle, with citations as necessary. The data is released for public use here. It is an operational dataset from 134 turbines covering 245 days at a 10-minute resolution; turbine location data is also provided. The first few rows of the datasets are shown below:


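Loading the two files might look like the sketch below; the file names are assumptions and should be adjusted to match your local copy of the Kaggle data.

```python
import pandas as pd

# File names are assumptions -- rename them to match your local copy of the Kaggle dataset.
scada = pd.read_csv("wtbdata_245days.csv")     # 10-minute SCADA data for the 134 turbines
locations = pd.read_csv("turb_location.csv")   # x/y coordinates of each turbine

print(scada.head())
print(locations.head())
```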
Data Preparation
We first perform data cleaning and filtering to extract the data points representing the normal operating conditions of the turbines. The raw and filtered operational data are shown below:

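As an illustration only (the exact filtering rules are in the notebook), a simple normal-operation filter could look like the sketch below, assuming the SDWPF column names Wspd (wind speed) and Patv (active power); the thresholds are illustrative.

```python
# Illustrative filter for normal operating points -- not the exact rules used in the notebook.
normal = scada.dropna(subset=["Wspd", "Patv"])
normal = normal[(normal["Patv"] > 0) & (normal["Wspd"] > 0)]
```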
In addition, we create a test set from the cleaned data for evaluating the algorithms. The test set consists of 19,940 rows, which is 1.05 percent of the cleaned data.
The training data is further transformed into a pivot-table-like structure with the turbine ID as the row index, the timestamp as the column index, and wind speed as the values.

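A sketch of this reshaping step with pandas, assuming a combined timestamp column named "timestamp" was created earlier from the Day and Tmstamp fields:

```python
# Turbine-by-timestamp matrix of wind speeds (rows: TurbID, columns: timestamps).
wide = normal.pivot_table(index="TurbID", columns="timestamp", values="Wspd")
```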
To conclude the data preparation step, we scale the transformed data using Scikit-learn's StandardScaler before passing it to the clustering algorithms.
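A minimal sketch of the scaling step; filling residual gaps with the per-timestamp mean is a simplification made here so the algorithms can run, and may differ from the notebook.

```python
from sklearn.preprocessing import StandardScaler

# Fill remaining gaps with the per-timestamp (column) mean so the clustering algorithms can run.
X = wide.fillna(wide.mean())

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # rows: turbines, columns: scaled wind speeds
```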
Cluster Modeling
We create turbine clusters based on wind speed using the KMeans, Agglomerative Clustering (AGC), Gaussian Mixture Models (GMM), and Affinity Propagation (AP) algorithms. In each case, the cluster average at a given timestamp is used to predict (fill) missing values at that timestamp, as sketched below.
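The fill mechanism itself can be sketched as follows; fill_with_cluster_mean is an illustrative helper, where wide is the turbine-by-timestamp matrix and labels is the cluster assignment produced by any of the algorithms.

```python
import numpy as np

def fill_with_cluster_mean(wide, labels):
    """Fill missing wind speeds with the turbine's cluster average at each timestamp."""
    filled = wide.copy()
    for cluster_id in np.unique(labels):
        members = wide.index[labels == cluster_id]
        cluster_mean = wide.loc[members].mean(axis=0)          # per-timestamp average
        # Timestamps where the whole cluster is missing stay NaN (i.e. unfilled).
        filled.loc[members] = filled.loc[members].fillna(cluster_mean)
    return filled
```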
In addition, important hyperparameters such as the number of clusters and the damping factor (in the case of AP) are chosen using the Silhouette metric, which measures the quality of the clusters an algorithm produces and can therefore guide hyperparameter selection.
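As a sketch of how this selection can be done, a small helper that fits a candidate model and returns its silhouette score might look as follows; silhouette_for is an illustrative name, X_scaled is the scaled matrix from the previous step, and the candidate range is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_for(estimator, X):
    """Fit a candidate clustering model and return the silhouette score of its labels."""
    labels = estimator.fit_predict(X)
    return silhouette_score(X, labels)

# Example: screen candidate cluster counts for KMeans (illustrative range).
kmeans_scores = {
    k: silhouette_for(KMeans(n_clusters=k, n_init=10, random_state=42), X_scaled)
    for k in range(2, 16)
}
```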
KMeans:
This is the most popular clustering method. It creates clusters by minimizing the within-cluster sum of squared distances, which in turn maximizes the between-cluster separation. It requires the number of clusters to be selected as a hyperparameter.

From the plot above, the optimal number of clusters for this model is 6. Next, we fit the model using this hyperparameter.
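A minimal sketch of this step, continuing from the scaled matrix X_scaled above:

```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=6, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X_scaled)   # one cluster id per turbine
```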
Agglomerative Clustering (AGC):
This is a type of hierarchical clustering algorithm that recursively merges pairs of clusters according to a linkage distance, so that data points in the same cluster are more similar and data points in different clusters are dissimilar. The optimal number of clusters needs to be selected, as in the case of KMeans.

For this model, the optimal number of clusters is 10 and we use this hyperparameter to fit the model.
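A corresponding sketch with scikit-learn; the linkage is left at the default ward setting, which may differ from the notebook.

```python
from sklearn.cluster import AgglomerativeClustering

agc = AgglomerativeClustering(n_clusters=10)   # default ward linkage on Euclidean distances
agc_labels = agc.fit_predict(X_scaled)
```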
Gaussian Mixture Models (GMM):
This is a probabilistic model that learns a mixture of a finite number of Gaussian distributions from the training data and uses the expectation-maximization (EM) algorithm to fit them. In this case, we use KMeans to initialize the model components and the Silhouette metric to select the optimal number of clusters.

The optimal number of clusters for the GMM algorithm is 10.
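A minimal sketch of the GMM fit with KMeans initialization:

```python
from sklearn.mixture import GaussianMixture

# init_params="kmeans" initializes the component means with KMeans, as described above.
gmm = GaussianMixture(n_components=10, init_params="kmeans", random_state=42)
gmm_labels = gmm.fit_predict(X_scaled)   # most likely component for each turbine
```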
Affinity Propagation (AP):
This algorithm creates clusters by iteratively exchanging messages between pairs of data points to determine how suitable each point is as an exemplar for the others, until convergence is reached. Although both the damping factor and the preference are key hyperparameters for this algorithm, we tune only the damping factor.

The optimal damping factor is 0.6 and this results in the creation of 13 clusters.
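A minimal sketch of the fit; the preference parameter is left at its scikit-learn default (the median similarity).

```python
from sklearn.cluster import AffinityPropagation

ap = AffinityPropagation(damping=0.6, random_state=42)
ap_labels = ap.fit_predict(X_scaled)

n_clusters = len(ap.cluster_centers_indices_)   # 13 clusters for this data
```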
Results
Using the optimal hyperparameters for each algorithm, we group the turbines that are considered neighbors with respect to wind speed. The results are shown below:

Across all the models studied, turbines at both edges along the X-axis are more likely to be neighbors (in terms of wind speed) than turbines that are geographically closer but sit between other turbines. Similarly, turbines near the middle of the X-axis are more likely to be neighbors of one another.
However, the models differ in how they partition the middle-row turbines, which directly impacts the representativeness of the resulting clusters and hence their ability to accurately predict the missing values of member turbines.
These results are quite intuitive, given that turbines on the edges are likely to experience similarly minimal obstruction of the wind compared with those in the middle of the park.
Model Evaluation
Based on the test data, we evaluate the performance of each clustering model in predicting missing wind speed values at a given timestamp from cluster average values.
In addition, the fewer the turbines in a cluster, the lower that cluster's ability to fill missing values due to insufficient data. Hence, clustering models that produce a large number of clusters may fill fewer missing values. The table below shows a summary of the results:
Next, we visualize the performance of the models.

The Affinity Propagation model gives the best performance in terms of MAE and MAPE, while the KMeans model gives the lowest RMSE. In general, the performances of all the considered algorithms are comparable, and their results can be combined to fill more missing values.
The RMSE is more sensitive to outliers and penalizes large deviations more heavily than the other metrics because the errors are squared before averaging, whereas the MAE and MAPE are more intuitive measures of model performance.
A sample of the predicted missing values, displayed below, shows very good agreement between the models and the ground truth.

Now, we visualize the imputation errors.

The imputation errors for all the algorithms are well-behaved and symmetric around zero.
The Jupyter Lab notebook containing all the Python code used in this article can be found here.
Conclusions
We have explored four different clustering algorithms for creating turbine groups. These turbine groups, if optimally created, can provide useful information about member turbines that can be used to fill missing values during data preprocessing.
The KMeans model is quite intuitive and fills the most missing values. However, other methods, such as Affinity Propagation, perform better in terms of MAE and MAPE. Hence, the results from these algorithms can be combined for greater benefit.
I hope you enjoyed reading this article. Until next time, cheers!
Want more? You can access more enlightening articles from me and other authors by subscribing to Medium via my referral link below, which also supports my writing.
Don’t forget to check out other stories on applying state-of-the-art Data Science principles in the renewable energy space.
References
Zhou, J., Lu, X., Xiao, Y., Su, J., Lyu, J., Ma, Y., & Dou, D. (2022). SDWPF: A Dataset for Spatial Dynamic Wind Power Forecasting Challenge at KDD Cup 2022. arXiv. https://doi.org/10.48550/arXiv.2208.04360
Clustering-based data preprocessing for operational wind turbines