Significant Wave Prediction Using Neural Network
Test Drive Using Pytorch and Comparison of Models
Prologue
So, right now I want to test my skills using PyTorch. I am new to PyTorch; previously I used TensorFlow to build ML models. I have already read all the basic tutorials and am ready to build a real model using real data. What's your favorite site for finding real data to test your model? For me it's kaggle.com. After scrolling through all the available datasets, I chose this wave dataset, downloaded it, and prepared to play with it.
First Glimpse of the Dataset
Here we have the date and time of when the data was captured, significant wave height (Hs), maximum wave height (Hmax), the zero up-crossing wave period (Tz) (err, I don't know what this is), the peak energy wave period (Tp) (energy? Of what? I don't know), peak direction (I think this is the direction the wave comes from?), and lastly sea surface temperature (SST). We have a dataset with 30-minute time resolution. I think -99.9 is a flag for no data. I am not familiar with oceanographic data but I will do my best :D.
From all the available columns, the ones I understand are just Hs, Hmax, and SST (and of course the date and time too). I will try to predict Hs.
Hs Prediction
I will just play with the data that I understand (Hs, Hmax, SST). Let's see the correlation between Hs and the other 2 variables. In the process, we must get rid of the -99.9 flags, so any row that contains this number in any of our 3 selected variables is ignored. Here's the code:
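A minimal sketch of that filtering step. The column names `Hs`, `Hmax`, and `SST` follow the Kaggle dataset; in the real notebook the frame would come from `pd.read_csv` on the downloaded file, so the tiny inline frame here is just a stand-in:

```python
import pandas as pd

# Stand-in for the CSV: two rows carry the -99.9 no-data flag
df = pd.DataFrame({
    "Hs":   [1.2, -99.9, 1.5, 1.4, 1.8, 1.1],
    "Hmax": [2.1,  2.3,  2.6, 2.5, -99.9, 1.9],
    "SST":  [24.0, 24.1, 24.2, 24.3, 24.1, 23.9],
})

cols = ["Hs", "Hmax", "SST"]

# Keep only rows where none of our 3 selected variables is flagged
clean = df[(df[cols] != -99.9).all(axis=1)]

# Correlation of each variable with the others
print(clean[cols].corr())
```

The `(df[cols] != -99.9).all(axis=1)` mask is True only for rows where every one of the 3 columns holds real data, which matches the "ignore the whole datum" rule above.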
Here's the scatter plot of Hs and Hmax.
Looks very good. Their correlation coefficient is 0.97, almost perfect. But wait: Hmax is the highest wave height observed, and Hs is the average of the high waves (the mean height of the highest one-third of waves). So including Hmax as a predictor of Hs is just redundant. And SST has a correlation coefficient of only 0.26. Looks bad. So the last choice is to just use Hs to predict Hs: we will try to predict the future Hs using a historical time series of Hs.
For the first try, we use 3 hours of Hs data to predict the next 30 minutes. Before that, we will build the param and label dataset first. It would be quite time-consuming if we merged the build-param-and-label section and the model section and re-ran both every time we experimented with the model.
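The windowing above can be sketched like this: at 30-minute resolution, 3 hours is 6 consecutive samples, and the label is the very next sample. The function name and the toy series are illustrative; a real version should also skip windows that cross a removed -99.9 gap:

```python
import numpy as np

def build_param_label(hs, window=6):
    """Slide a window over the Hs series: each param is `window`
    consecutive values (3 h at 30-min resolution) and the label
    is the very next value (30 min ahead)."""
    params, labels = [], []
    for i in range(len(hs) - window):
        params.append(hs[i:i + window])
        labels.append(hs[i + window])
    return np.array(params, dtype=np.float32), np.array(labels, dtype=np.float32)

hs = np.arange(10, dtype=np.float32)    # toy series standing in for Hs
all_param, all_label = build_param_label(hs)
print(all_param.shape, all_label.shape)  # (4, 6) (4,)
```

Saving `all_param` and `all_label` to disk once is what lets you experiment with the model section without rebuilding the dataset every time.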
And now, using the already-built param and label data, we will try to predict Hs using all of the data. Let's use a Multi-Layer Perceptron (MLP) model.
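A minimal PyTorch sketch of such an MLP, mapping the 6 past Hs values to the next one. The layer count, hidden size, and learning rate here are my own assumptions, not the exact hyperparameters the article settled on:

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Map a window of 6 past Hs values to the next Hs value.
    Hidden sizes are illustrative, not the article's tuned values."""
    def __init__(self, window=6, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(window, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)   # (batch, 1) -> (batch,)

model = MLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One illustrative training step on a random batch
xb, yb = torch.randn(8, 6), torch.randn(8)
opt.zero_grad()
loss = loss_fn(model(xb), yb)
loss.backward()
opt.step()
print(model(xb).shape)   # torch.Size([8])
```

MSE loss is the natural choice here since the benchmark below is phrased in terms of RMSE.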
Here we use the last 10000 data points as testing data and the rest as training data. Now, how can we benchmark this model? Let's use the standard deviation of the data. Why? If you look thoroughly at the formula for standard deviation, it's actually the same as the RMSE of a model that only ever outputs the mean of the data.
That's the simplest way to predict something if we don't know anything about the data. The standard deviation of the Hs data is 0.5, so the model's error must be below this value for it to count as a good prediction.
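You can check that identity in a couple of lines: the RMSE of the "always predict the mean" model is exactly the population standard deviation (NumPy's `std` uses `ddof=0` by default, which is the matching definition). The small array here is just toy data:

```python
import numpy as np

hs = np.array([0.8, 1.2, 1.5, 1.1, 0.9, 1.6])

# RMSE of the model that always outputs the mean of the data...
rmse_of_mean_model = np.sqrt(np.mean((hs - hs.mean()) ** 2))

# ...equals the (population) standard deviation
print(rmse_of_mean_model, hs.std())
```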
And the RMSE of our MLP model is:
It's 0.15 of the standard deviation of the Hs data. Let's see the graph of the training and test error for this MLP model.
At the early stage it converges very fast, and later it slows down. Anyway, before posting this result and graph here I tried many combinations of MLP hyperparameters, and every combination gives a slightly different result and graph. (Yeah, even when you don't change anything and just re-run the model, you will get a slightly different result, because that's the nature of neural networks.) I think this is the best combination so far.
Can we improve on this? Maybe. Let's try using a Long Short-Term Memory (LSTM) model.
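A minimal LSTM sketch for the same task: the 6-value window is fed as a sequence of 6 timesteps with one feature each, and the last hidden state predicts the next Hs. The hidden size is an illustrative assumption, not the article's tuned value:

```python
import torch
import torch.nn as nn

class LSTMModel(nn.Module):
    """Treat the 6-step Hs window as a length-6 sequence with one
    feature per step; predict the next Hs from the last hidden state."""
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                    # x: (batch, window)
        out, _ = self.lstm(x.unsqueeze(-1))  # (batch, window, hidden)
        return self.head(out[:, -1]).squeeze(-1)  # last step -> (batch,)

model = LSTMModel()
x = torch.randn(8, 6)        # batch of 8 windows
print(model(x).shape)        # torch.Size([8])
```

Because the output shape matches the MLP's, the same training loop, loss, and RMSE benchmark can be reused unchanged.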
And the RMSE is
The error improves, but only a little. Here's the train and test graph for the LSTM.
The model converges very fast at the early stage and slows down after that, same as the MLP.
For the next try, let's use just a portion of the data: the first 3400 data points, with the last 96 of those held out as test data. Just edit testing_data_num, all_param, and all_label in the code.
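That edit can be sketched as plain array slicing. The names `testing_data_num`, `all_param`, and `all_label` come from the article's code, but the arrays below are random stand-ins and the exact slicing is my assumption:

```python
import numpy as np

# Toy stand-ins for the real param/label arrays
all_param = np.random.rand(10000, 6).astype(np.float32)
all_label = np.random.rand(10000).astype(np.float32)

# Keep only the first 3400 samples
all_param = all_param[:3400]
all_label = all_label[:3400]

# Hold out the last 96 of those as the test set
testing_data_num = 96
train_param = all_param[:-testing_data_num]
train_label = all_label[:-testing_data_num]
test_param = all_param[-testing_data_num:]
test_label = all_label[-testing_data_num:]

print(train_param.shape, test_param.shape)   # (3304, 6) (96, 6)
```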
Here's the result for the MLP
And for the LSTM
Yay, it's getting a little better, but this time the MLP is better than the LSTM. As you can see in the train and test graph, the error on the test data decreases steadily even though there is fluctuation in the train data. The moral of the story: don't be afraid when your training error increases or fluctuates; it doesn't mean the same is happening on your test data. It's just the algorithm's way of generalizing your model.
Epilogue
So that's it for the test drive using PyTorch. You can still improve this model, for example by trying to predict Hs multiple timesteps into the future, or just by tuning the hyperparameters to get a better result. You can get the full code here. See you in another post.
References
https://www.kaggle.com/jolasa/waves-measuring-buoys-data-mooloolaba (wave dataset)