
Linear Regression: Modeling Oceanographic Data

The most important, overlooked, and predictable regression model.

Linear Regression Image – By Author

Linear regression is a simple yet powerful tool that should be in every data scientist’s back pocket. In this article, I will use real-world oceanographic data to showcase what it can do.

CalCOFI Dataset

We are going to be working with the CalCOFI dataset (CC BY 4.0 license). It contains comprehensive oceanographic measurements collected off the coast of California from 1949 up to the present day, including water temperature, salinity, measurement depth, dissolved oxygen level, and more. In this article, we will only be using the water temperature and salinity features.

California Coast – By Karsten Koehn

Introduction

Linear regression models predict the value of a continuous variable based on the value of another continuous variable. Simple linear regression is the most basic form of linear regression, where the model mapping this relationship is a straight line.

Let’s start by importing the data and taking a sample of 500 measurements. In the plot below, we can see the relationship between temperature and salinity: as the temperature increases, the water salinity decreases, indicating a negative relationship between the variables.
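A minimal sketch of this step is shown below. The file name bottle.csv and the column names T_degC and Salnty follow the Kaggle release of the CalCOFI bottle data and are assumptions on my part.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the CalCOFI bottle data (assumed file and column names from the Kaggle release).
df = pd.read_csv("bottle.csv", usecols=["T_degC", "Salnty"])

# Drop rows with missing values and take a reproducible sample of 500 measurements.
sample = df.dropna().sample(n=500, random_state=42)

# Scatter plot of salinity against temperature.
plt.scatter(sample["T_degC"], sample["Salnty"], s=10)
plt.xlabel("Temperature (°C)")
plt.ylabel("Salinity")
plt.show()
```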

Data points – By Author

Now let’s say we were given a new data point with a temperature measurement but no salinity value. Could we estimate its salinity from the temperature? (Surprise! Yes, we can.)

There are many ways of tackling this problem, but one of the most common is to draw a line that fits the data. That way, given any temperature measurement, we can provide a reasonable estimate of the corresponding salinity.

We could simply draw a line that looks good to the eye, but to make a really good estimate, we need some mathematics. In the following section, we will find the equation of this line using ordinary least squares.

OLS (Ordinary Least Squares)

Ordinary least squares is the method used to find the unknown parameters of a linear regression model. It chooses these parameters by minimizing the sum of the squares of the differences between the observed labels and the predicted labels. Put simply, it chooses the line of best fit. The equation of this line is shown below.

Simple Linear Regression Formula – By Author
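In text form, this is the standard equation of a straight line:

ŷ = B0 + B1 · x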

The equation of the line above has two parameters, B0 and B1. B0 corresponds to the y-intercept, and B1 corresponds to the slope of the line. By tuning these two parameters, we can create any line in 2D space.

To find the optimal values of B0 and B1, we can use the closed-form OLS solution below. The equation involves a matrix X, a vector Y, and a resulting vector B; after we plug in our data and solve, B contains the optimal coefficients (B0, B1).

Ordinary Least Squares Formula – By Author
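In text form, this is the familiar normal-equations solution:

B = (Xᵀ X)⁻¹ Xᵀ Y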

The formula above looks complex, but it is structured so that we can increase the complexity of the fitted line without changing the formula itself: we simply add columns to X (we will do this later in the article).

Using the following code block, we can do just that. If you trust that my computer can do matrix and vector multiplication correctly, we get a straight line fitted to the data, shown below the code block.
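Here is a minimal NumPy sketch of this computation; the helper name ols, the reuse of the sample DataFrame from the earlier snippet, and the use of np.linalg.solve in place of an explicit matrix inverse are my own choices.

```python
import numpy as np

def ols(X, y):
    # Closed-form OLS: solve (X^T X) B = X^T y for B.
    # np.linalg.solve is more numerically stable than forming the inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Design matrix: a column of ones (for B0) next to the temperatures (for B1).
x = sample["T_degC"].to_numpy()
y = sample["Salnty"].to_numpy()
X = np.column_stack([np.ones_like(x), x])

beta = ols(X, y)   # beta[0] is the intercept B0, beta[1] is the slope B1
y_hat = X @ beta   # predicted salinity at each temperature

# Plot the fitted line over the data points.
order = np.argsort(x)
plt.scatter(x, y, s=10)
plt.plot(x[order], y_hat[order], color="red")
plt.xlabel("Temperature (°C)")
plt.ylabel("Salinity")
plt.show()
```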

Simple Linear Regression – By Author

The line looks good to the eye, but it is important to use a loss function to evaluate the model. The loss function we will use is the MSE (mean squared error), which measures how far the data points are from the regression line. In other words, we are taking the mean of the squared residuals.

Mean Squared Error Formula – By Author
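In text form, with n data points, observed values yᵢ, and predicted values ŷᵢ:

MSE = (1/n) · Σ (yᵢ − ŷᵢ)²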

You may wonder: why are we squaring the residuals?

We do this to penalize data points that are far from the regression line. With the mean absolute error, moving a close point one unit closer or a far point one unit closer has the same effect on the loss. With the MSE, far-away points are penalized more heavily. Penalizing these outlying points is a double-edged sword, as a single extreme outlier can have a significant effect on the MSE.
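As a sketch, the MSE can be computed directly from the residuals; the function name mse is my own, and y and y_hat are reused from the fit above.

```python
def mse(y_true, y_pred):
    # Mean of the squared residuals.
    return np.mean((y_true - y_pred) ** 2)

print(mse(y, y_hat))  # the article reports ~1.5355 for its sample
```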

When we evaluate the simple linear regression model that we created, we get an MSE value of ~1.5355.

Simple Linear Regression w/ Loss – By Author

Linear Regression (non-simple)

Fitting the data with a straight line worked quite well, but it is evident that a curve would fit the data better. Thankfully, we can achieve this by simply adding columns to the X matrix in the OLS formula. We add a new column containing the squared temperature values, and we are now fitting a model that is non-linear in its inputs with linear regression (mind-blown). The next column we add would be temperature³, then temperature⁴, temperature⁵, and so on.

The following functions help plot models of increasing degree using Matplotlib. The polynomial_features() function arranges the feature inputs into a NumPy array (the X matrix). Then, the plot_regression() function calculates the optimal beta values using the OLS function, computes the MSE, and plots the results.
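Below is one plausible reconstruction of these functions, reusing the ols and mse helpers from the earlier sketches; everything beyond the names polynomial_features() and plot_regression() is my assumption.

```python
def polynomial_features(x, degree):
    # Build the X matrix with columns [1, x, x^2, ..., x^degree].
    return np.column_stack([x ** d for d in range(degree + 1)])

def plot_regression(x, y, degree):
    # Fit via OLS, compute the MSE, and plot the fitted curve over the data.
    X = polynomial_features(x, degree)
    beta = ols(X, y)
    y_hat = X @ beta
    loss = mse(y, y_hat)
    order = np.argsort(x)  # sort so the curve draws from left to right
    plt.scatter(x, y, s=10)
    plt.plot(x[order], y_hat[order], color="red")
    plt.title(f"Degree {degree} Model (MSE = {loss:.4f})")
    plt.show()
    return loss

plot_regression(x, y, 2)
```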

Degree 2 Model – By Author

This model fits much better! Simply adding the squared temperature values improved on the simple linear regression model. If we keep adding more columns to the X matrix, we will fit the training data better and better.

Careful! The more we increase the complexity of the line, the higher the risk of overfitting to the training data. If we get a new sample of data points, we want the model to fit those points well too. We need a model that picks up on the general patterns in the data, but not the random noise of the training sample.

A simple way to address this is to use a validation dataset, which lets us test the model’s performance on "unseen" data. We will also set a threshold by which the validation MSE must improve for us to keep increasing the complexity. In our case, we keep increasing the degree of the input features until the MSE no longer decreases by at least 0.02, and then we stop training. This is done with the following code block.
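A sketch of this loop follows; the use of scikit-learn’s train_test_split and the 70/30 split are my assumptions, while the 0.02 stopping threshold comes from the article.

```python
from sklearn.model_selection import train_test_split

# Hold out 30% of the sample as a validation set.
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=42)

def fit_and_score(degree):
    # Fit on the training set; report the MSE on the validation set.
    beta = ols(polynomial_features(x_train, degree), y_train)
    return mse(y_val, polynomial_features(x_val, degree) @ beta)

threshold = 0.02
degree, best = 1, fit_and_score(1)
while True:
    candidate = fit_and_score(degree + 1)
    # Stop once the validation MSE no longer improves by at least the threshold.
    if best - candidate < threshold:
        break
    degree, best = degree + 1, candidate

print(degree, best)  # the article arrives at a degree-3 model
```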

In the end, the model that generalizes best to new data is the degree-3 model. We can visualize how increasing the line’s complexity affects the fit in the plot below.

Model Progression – By Author

We can then plot the optimal model and see how it performs on the validation set. In the following plot, the model achieves a validation MSE of ~0.70 and a training MSE of ~0.75, so we can say that the model generalizes well to new data.

Final Degree 3 Model – By Author

Closing Thoughts

This model type is one of the building blocks of machine learning. The mathematics behind it is intuitive, the model is interpretable, and it can generate strong predictions. In this article, we showed that we can model the relationship between water salinity and temperature using linear regression.

Linear regression has proven useful for prediction in both academia and industry. That being said, it is only the tip of the iceberg. Other model types can incorporate more than one input variable in their predictions; examples include multiple linear regression, decision trees, and random forests. You could use one of these to predict salinity using temperature and latitude. Time to go down another rabbit hole exploring these models!

Code for this article can be found here.

References

I used various sources to aid in the writing of this article. Thank you to all for sharing!

CalCOFI Dataset – Sohier Dane

OLS Explained Visually – Victor Powell, Lewis Lehe

What is Linear Regression? – IBM

MSE vs RMSE – O’Reilly

7 Common Regression Algorithms – Dominik Polzer

