Significant Wave Prediction Using Neural Network

Test Drive Using Pytorch and Comparison of Models

Muhammad Ryan
Towards Data Science


Prologue

So, right now I want to test my skills using PyTorch. I am new to PyTorch; previously I used TensorFlow to build ML models. I have already read all the basic tutorials and am ready to build a real model using real data. What is your favorite site for finding real data to test a model on? For me, it's kaggle.com. After scrolling through all the available datasets, I chose this wave dataset, downloaded it, and prepared to play with it.

First Glimpse of the Dataset

Image by Author

Here we have the date and time when the data was captured, significant wave height (Hs), maximum wave height (Hmax), the zero up-crossing wave period (Tz) (err, I don't know what this is), the peak energy wave period (Tp) (energy? Of what? I don't know), peak direction (I think this is the direction the waves come from?), and lastly sea surface temperature (SST). The dataset has a 30-minute time resolution. I think -99.9 is a flag for no data. I am not familiar with oceanographic data, but I will do my best :D.

From all the available columns, the ones I understand are just Hs, Hmax, and SST (and of course date and time too). I will try to predict Hs.

Hs Prediction

I will just play with the data that I understand (Hs, Hmax, SST). Let's see the correlation between Hs and the other 2 variables. In the process, we must get rid of -99.9, so any row that contains this value in any of our 3 selected variables is ignored. Here is the code.
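Roughly, it boils down to something like this (the file name and column labels here are my assumptions; check them against the actual CSV headers):

```python
import pandas as pd

# Load the Kaggle buoy CSV (file name and column labels are assumptions;
# adjust them to match the actual dataset).
df = pd.read_csv('waves.csv')
data = df[['Hs', 'Hmax', 'SST']]

# -99.9 flags missing data: drop every row containing it in any column.
data = data[(data != -99.9).all(axis=1)]

# Correlation of Hs with the other two variables.
print(data.corr()['Hs'])
```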

Here is the scatter plot of Hs and Hmax.

Image by Author
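For anyone following along, a plot like the one above can be reproduced with matplotlib (the metre units on the axes are my assumption about the dataset):

```python
import matplotlib.pyplot as plt

plt.scatter(data['Hmax'], data['Hs'], s=2)
plt.xlabel('Hmax (m)')
plt.ylabel('Hs (m)')
plt.show()
```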

Looks very good. Their correlation coefficient is 0.97, almost perfect. But wait, Hmax is the highest individual wave height observed, and Hs is the mean height of the highest third of the waves. So it's just a redundant Hs if we include this variable as a predictor of Hs. And for SST:

Image by Author

with a correlation coefficient of 0.26. Looks bad. So the last option is to just use Hs to predict Hs: we will try to predict future Hs using a historical time series of Hs.

For the first try, we use 3 hours of Hs data to predict the next 30 minutes. Before that, we will build the param and label dataset first. It would be quite time consuming if we merged the param-and-label building section with the model section and re-ran both every time we experimented with the model.
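A minimal sketch of that build step, assuming the filtered data from before (note that dropping flagged rows leaves gaps in the 30-minute series, which this simple version glosses over):

```python
import numpy as np

hs = data['Hs'].to_numpy(dtype=np.float32)

window = 6  # 3 hours of history at 30-minute resolution
# Each param row is 6 consecutive Hs values; the label is the next one.
all_param = np.stack([hs[i:i + window] for i in range(len(hs) - window)])
all_label = hs[window:]  # the Hs value 30 minutes after each window

# Save once so experiments don't rebuild the dataset every run.
np.save('all_param.npy', all_param)
np.save('all_label.npy', all_label)
```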

And now, using the already built param and label data, we will try to predict Hs using all of the data. Let's use a Multi-Layer Perceptron (MLP) model.
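A minimal sketch of the model and training loop, building on the arrays from the previous step (the layer sizes, learning rate, and epoch count are my assumptions; as noted below, I tried several hyperparameter combinations):

```python
import torch
import torch.nn as nn

# Train/test split: the last 10000 samples are held out for testing.
testing_data_num = 10000
train_x = torch.from_numpy(all_param[:-testing_data_num])
train_y = torch.from_numpy(all_label[:-testing_data_num]).unsqueeze(1)
test_x = torch.from_numpy(all_param[-testing_data_num:])
test_y = torch.from_numpy(all_label[-testing_data_num:]).unsqueeze(1)

# A small MLP: 6 input timesteps -> 1 predicted Hs value.
model = nn.Sequential(
    nn.Linear(6, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Full-batch training; swap in mini-batches for bigger data.
for epoch in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(train_x), train_y)
    loss.backward()
    optimizer.step()

# Test RMSE, to compare against the benchmark discussed below.
with torch.no_grad():
    rmse = torch.sqrt(loss_fn(model(test_x), test_y)).item()
print(rmse)
```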

Here we use the last 10000 data points as testing data and the rest as training data. Now, how can we benchmark this model? Let's use the standard deviation of the data. Why? If you look closely at the formula for the standard deviation, it is actually the same as the RMSE of a model that only ever outputs the mean of the data.
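To spell the reasoning out: if the model's prediction for every sample is the mean $\bar{x}$, the RMSE formula collapses into exactly the (population) standard deviation:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat{x}_i\right)^2}
\;\;\xrightarrow{\;\hat{x}_i = \bar{x}\;}\;\;
\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2} = \sigma
$$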

Always predicting the mean is the simplest way to predict something when we know nothing else about the data. The standard deviation of the Hs data is 0.5, so the error of the model must be below this value to count as a good prediction.

And the RMSE of our MLP model here is

0.077 (Image by Author)

That is 0.15 of the standard deviation of the Hs data. Let's look at the graph of the training and test error of this MLP model.

Image by Author

At the early stage it converges very fast, and later it slows down. Anyway, before posting this result and graph here, I tried many combinations of MLP hyperparameters, and every combination gives a slightly different result and graph (yeah, even when you don't change anything and just re-run the model, you will get a slightly different result, because random initialization is in the nature of neural networks). I think this is the best combination so far.

Can we improve on this? Maybe. Let's try using a Long Short-Term Memory (LSTM) model.
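A minimal sketch of such a model (the hidden size is my assumption); it plugs into the same split and training loop as the MLP, only the model definition changes:

```python
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, hidden_size=32):  # hidden size is an assumption
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
                            batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x: (batch, 6) -> (batch, 6, 1), one Hs value per timestep
        out, _ = self.lstm(x.unsqueeze(-1))
        # predict from the hidden state at the last timestep
        return self.fc(out[:, -1, :])

model = LSTMModel()
```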

And the RMSE is

0.076 (Image by Author)

The error is improving, but just a little. Here is the train and test graph for the LSTM.

The way it converges is the same as MLP (Image by Author)

The model converges very fast at the early stage and slows down exponentially after that, the same as the MLP.

For the next try, let's use just a portion of the data: only the first 3400 data points, with the last 96 of those serving as test data. Just edit testing_data_num, all_param, and all_label in the code.
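Concretely, the edit amounts to something like this:

```python
# Keep only the first 3400 samples; the last 96 of those are the test set.
all_param = all_param[:3400]
all_label = all_label[:3400]
testing_data_num = 96
```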

Here is the result for the MLP:

Image by Author
Image by Author

And for the LSTM

Image by Author
Image by Author

Yay, it's getting a little better, but this time the MLP is better than the LSTM. As you can see in the train and test graph, the error on the test data decreases steadily even though there is fluctuation in the train error. The moral of the story: don't be afraid when your training error increases or fluctuates; it doesn't mean the same thing is happening on your test data. It's just the algorithm's way of generalizing your model.

Epilogue

So that's it for the test drive using PyTorch. You can still improve this model, for example by predicting Hs multiple timesteps into the future, or by tuning the hyperparameters to get a better result. You can get the full code here. See you in another post.

References

https://www.kaggle.com/jolasa/waves-measuring-buoys-data-mooloolaba (wave dataset)

https://en.wikipedia.org/wiki/Standard_deviation (standard deviation)
