
Predictive Analytics: Regression Analysis with LSTM, GRU and BiLSTM in TensorFlow

A step-by-step tutorial on developing LSTM, GRU and BiLSTM models for multi-step forecasting of water consumption

Niousha Rasifaghihi
Towards Data Science
10 min read · Jul 16, 2020


In this post, I develop three sequential models, LSTM, GRU and Bidirectional LSTM (BiLSTM), to predict water consumption under the impact of climate change. Then, I use the most reliable one for multi-step forecasting of urban water consumption over the next 10 years.

First, let me refresh your mind on the fundamentals. Then I walk you through a complete data science project in Python.

👩‍💻 Python Code on GitHub

Recurrent Neural Network

Recurrent Neural Networks (RNNs) are a class of Neural Networks designed to work with sequential data such as time series. In an RNN, the output of each step can be fed back into the network as input for the next step, creating a recurrent structure.
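In its simplest (vanilla) form, this recurrence can be written as h_t = tanh(W_x · x_t + W_h · h_{t−1} + b), where x_t is the input at time step t and h_t is the hidden state that is passed on to the next step.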

Gradient Vanishing problem

RNNs are trained by backpropagation through time, and during training they suffer from the vanishing gradient problem. The gradient is the value used to update the network's weights; it vanishes when it shrinks as it is propagated back through time. Layers that receive a very small gradient stop learning, and this leaves the network with only short-term memory.

💡 What can be a solution to the gradient vanishing problem ❓

Long Short-Term Memory

Long Short-Term Memory (LSTM) is a specialized RNN designed to mitigate the vanishing gradient problem. LSTMs learn long-term dependencies using a mechanism called gates, which learn what information in the sequence is important to keep and what to throw away. LSTMs have three gates: input, forget and output.

The architecture of LSTM cell
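For reference (this is the standard LSTM formulation, not something specific to this project), the cell updates its states at each time step t as follows, where σ is the sigmoid function, ⊙ is element-wise multiplication and the prime marks a candidate value:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)   (forget gate)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)   (input gate)
C'_t = tanh(W_C · [h_{t−1}, x_t] + b_C)   (candidate cell state)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C'_t   (new cell state)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)   (output gate)
h_t = o_t ⊙ tanh(C_t)   (new hidden state)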

Bidirectional LSTMs

The idea of Bidirectional LSTMs (BiLSTM) is to aggregate input information from both the past and the future of a specific time step. In a BiLSTM, at any point in time, information from both directions of the sequence is preserved.

Gated Recurrent Unit

The Gated Recurrent Unit (GRU) is a newer recurrent architecture and is quite similar to the LSTM. A GRU drops the separate cell state and uses only a hidden state to transfer information. The other difference from the LSTM is that a GRU has only two gates: reset and update.

The architecture of GRU cell
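Again for reference only, the standard GRU updates are (same notation as above; conventions for the update gate vary slightly between references):

z_t = σ(W_z · [h_{t−1}, x_t])   (update gate)
r_t = σ(W_r · [h_{t−1}, x_t])   (reset gate)
h'_t = tanh(W · [r_t ⊙ h_{t−1}, x_t])   (candidate hidden state)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h'_t   (new hidden state)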

☺ Time to get our hands dirty❗

Dataset

The city of Brossard, Quebec, Canada, part of the metropolitan area of Montreal, is chosen as the study site. For this project, daily water consumption data are obtained for the period 2011–09–01 to 2015–09–30. For the same period, minimum temperature, maximum temperature and total precipitation records are collected from Environment Canada.

Import libraries
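The exact import list in the GitHub notebook may differ slightly, but the following minimal set covers everything used in this tutorial:

# Import libraries used throughout the tutorial
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, GRU, Bidirectional
from sklearn.preprocessing import MinMaxScaler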

Set random seed

Set a random seed so that each run of the code produces the same results.

# Set random seed for reproducibility
tf.random.set_seed(1234)

Step 1: Read and explore data

In this project, I am dealing with multivariate time-series data. When I import the data from a CSV file, I make sure the Date column gets the correct DateTime format by passing parse_dates = ['Date']. Also, when working with dates and times, it is much easier to set the Date column as the dataframe index.

# Read file
file = 'Data.csv'
raw_data = pd.read_csv(file, parse_dates = ['Date'], index_col = 'Date')
df = raw_data.copy()

The dataframe contains four columns:
  • Max_T: Maximum temperature (℃)
  • Min_T: Minimum temperature (℃)
  • T_P: Total precipitation (mm)
  • UWC: Urban water consumption (m³/capita.day)
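As a quick sanity check (a small sketch, not shown in the original notebook), the first rows and summary statistics can be printed before plotting:

# Quick look at the data
print(df.shape)
print(df.head())
print(df.describe())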
# Define a function to draw a time-series plot
def timeseries(x_axis, y_axis, x_label, y_label):
    plt.figure(figsize = (10, 6))
    plt.plot(x_axis, y_axis, color = 'black')
    plt.xlabel(x_label, {'fontsize': 12})
    plt.ylabel(y_label, {'fontsize': 12})

timeseries(df.index, df['UWC'], 'Time (day)', 'Daily water consumption ($m^3$/capita.day)')
Time-series of daily water consumption from 2011–09–01 to 2015–09–30

Step 2: Data pre-processing

Data pre-processing is the most time-consuming step and includes:

  • Handle missing values
  • Replace outliers
  • Split the dataset into train and test data
  • Split the target variable and dependent variables
  • Data transformation
  • Create a 3D input dataset

2.1 Handle missing values

When it comes to time-series data, it is a good idea to use linear interpolation to replace missing values.

# Check missing values
df.isnull().sum()

# Replace missing values by interpolation
def replace_missing(attribute):
    return attribute.interpolate(inplace = True)

replace_missing(df['Max_T'])
replace_missing(df['Min_T'])
replace_missing(df['T_P'])
replace_missing(df['UWC'])

2.2 Replace outliers

I use a statistical method to detect outliers. It assumes the data points are normally distributed, so values that fall in a low-probability region are treated as outliers. Here, values outside the range μ±2σ are labelled as outliers; under the normality assumption, μ±2σ contains about 95% of the data.

# Outlier detection
up_b = df['UWC'].mean() + 2*df['UWC'].std()
low_b = df['UWC'].mean() - 2*df['UWC'].std()
# Replace outliers by interpolation
df.loc[df['UWC'] > up_b, 'UWC'] = np.nan
df.loc[df['UWC'] < low_b, 'UWC'] = np.nan
df['UWC'].interpolate(inplace=True)

2.3 Split the dataset into train and test data

In this project, I set the first 80% of data as train data and the remaining 20% as test data. I train the model with train data and validate its performance with test data.

# Split train data and test data
train_size = int(len(df) * 0.8)
train_dataset, test_dataset = df.iloc[:train_size], df.iloc[train_size:]

# Plot train and test data
plt.figure(figsize = (10, 6))
plt.plot(train_dataset.UWC)
plt.plot(test_dataset.UWC)
plt.xlabel('Time (day)')
plt.ylabel('Daily water consumption ($m^3$/capita.day)')
plt.legend(['Train set', 'Test set'], loc = 'upper right')

print('Dimension of train data: ', train_dataset.shape)
print('Dimension of test data: ', test_dataset.shape)
Time-series of pre-processed daily water consumption for train data and test data

2.4 Split the target variable and dependent variables

UWC is the target variable (output) and is modelled as a function of the input variables: Max_T, Min_T and T_P.

# Split train data to X and y
X_train = train_dataset.drop('UWC', axis = 1)
y_train = train_dataset.loc[:,['UWC']]
# Split test data to X and y
X_test = test_dataset.drop('UWC', axis = 1)
y_test = test_dataset.loc[:,['UWC']]

2.5 Data transformation

A good rule of thumb is that normalized data lead to better performance in Neural Networks. In this project, I use MinMaxScaler from scikit-learn.

I define different scalers for input and output as they have different shapes. This is especially important for using the inverse-transform function.

  • X_train.shape: (1192, 3)
  • y_train.shape: (1192, 1)
  • X_test.shape: (299, 3)
  • y_test.shape: (299, 1)

It is important to keep the output in the range 0–1 so that it matches the scale of the tanh activation used by the LSTM, GRU and BiLSTM layers. The input variables should also be small values, ideally in the range 0–1.

💡 What are the data transformation steps❓

  • Fit the scaler (MinMaxScaler) on the training data (i.e., the minimum and maximum observable values are estimated from the training data)
  • Apply the scaler to training data
  • Apply the scaler to test data

It is important to note that we should scale the unseen data with the scaler fitted on the training data.

# Different scaler for input and output
scaler_x = MinMaxScaler(feature_range = (0,1))
scaler_y = MinMaxScaler(feature_range = (0,1))
# Fit the scaler using available training data
input_scaler = scaler_x.fit(X_train)
output_scaler = scaler_y.fit(y_train)
# Apply the scaler to training data
train_y_norm = output_scaler.transform(y_train)
train_x_norm = input_scaler.transform(X_train)
# Apply the scaler to test data
test_y_norm = output_scaler.transform(y_test)
test_x_norm = input_scaler.transform(X_test)

2.6 Create a 3D input dataset

LSTM, GRU and BiLSTM take a 3D input (num_samples, num_timesteps, num_features). So, I create a helper function, create_dataset, to reshape input.

In this project, I set time_steps = 30, which means the model makes each prediction based on the previous 30 days of data (in the first iteration of the for-loop, the input holds days 1–30 and the output is the UWC of day 31).

# Create a 3D input
def create_dataset(X, y, time_steps = 1):
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        v = X[i:i+time_steps, :]
        Xs.append(v)
        ys.append(y[i+time_steps])
    return np.array(Xs), np.array(ys)

TIME_STEPS = 30
X_test, y_test = create_dataset(test_x_norm, test_y_norm, TIME_STEPS)
X_train, y_train = create_dataset(train_x_norm, train_y_norm, TIME_STEPS)
print('X_train.shape: ', X_train.shape)
print('y_train.shape: ', y_train.shape)
print('X_test.shape: ', X_test.shape)
print('y_test.shape: ', y_test.shape)

Step 3: Create BiLSTM, LSTM and GRU models

3.1 BiLSTM, LSTM and GRU models in TensorFlow

The first function, create_model_bilstm, creates a BiLSTM model and takes the number of units (neurons) in the hidden layers as its argument. The second function, create_model, takes two arguments: the number of units in the hidden layers and the model type (LSTM or GRU).

For the sake of simplicity, BiLSTM, LSTM and GRU all have 64 neurons in the input layer, one hidden layer with 64 neurons, and 1 neuron in the output layer.

To make the LSTM and GRU models more robust, Dropout layers are used. Dropout(0.2) randomly drops 20% of the units from the network during training.

# Create BiLSTM model
def create_model_bilstm(units):
    model = Sequential()
    model.add(Bidirectional(LSTM(units = units, return_sequences = True),
                            input_shape = (X_train.shape[1], X_train.shape[2])))
    model.add(Bidirectional(LSTM(units = units)))
    model.add(Dense(1))
    # Compile model
    model.compile(loss = 'mse', optimizer = 'adam')
    return model

# Create LSTM or GRU model
def create_model(units, m):
    model = Sequential()
    model.add(m(units = units, return_sequences = True,
                input_shape = [X_train.shape[1], X_train.shape[2]]))
    model.add(Dropout(0.2))
    model.add(m(units = units))
    model.add(Dropout(0.2))
    model.add(Dense(units = 1))
    # Compile model
    model.compile(loss = 'mse', optimizer = 'adam')
    return model

# BiLSTM
model_bilstm = create_model_bilstm(64)

# GRU and LSTM
model_gru = create_model(64, GRU)
model_lstm = create_model(64, LSTM)

3.2 Fit the models

I train each model on the train data for 100 epochs with batch_size = 32, using 20% of the train data as validation data. I set shuffle = False because it gives better performance for time-series data.

To avoid overfitting, I use early stopping so that training stops when the validation loss has not improved for 10 epochs (patience = 10).

# Fit BiLSTM, LSTM and GRU
def fit_model(model):
    early_stop = keras.callbacks.EarlyStopping(monitor = 'val_loss',
                                               patience = 10)
    history = model.fit(X_train, y_train, epochs = 100,
                        validation_split = 0.2, batch_size = 32,
                        shuffle = False, callbacks = [early_stop])
    return history

history_bilstm = fit_model(model_bilstm)
history_lstm = fit_model(model_lstm)
history_gru = fit_model(model_gru)

Plot train loss and validation loss

These plots show how many epochs each model actually trained for and how the train and validation losses evolved, which helps evaluate each model's predictive performance.

# Plot train loss and validation loss
def plot_loss(history):
    plt.figure(figsize = (10, 6))
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.ylabel('Loss')
    plt.xlabel('epoch')
    plt.legend(['Train loss', 'Validation loss'], loc = 'upper right')

plot_loss(history_bilstm)
plot_loss(history_lstm)
plot_loss(history_gru)
Train loss vs validation loss for BiLSTM
Train loss vs validation loss for LSTM
Train loss vs validation loss for GRU

3.3 Inverse transform the target variable

After training the models, I transform the target variable back to the original data space for both train and test data using scaler_y.inverse_transform.

y_test = scaler_y.inverse_transform(y_test)
y_train = scaler_y.inverse_transform(y_train)

3.4 Make prediction using BiLSTM, LSTM and GRU

Here I predict UWC using the BiLSTM, LSTM and GRU models. Then, I plot the true future (test data) against the predictions of the three models.

# Make prediction
def prediction(model):
    prediction = model.predict(X_test)
    prediction = scaler_y.inverse_transform(prediction)
    return prediction

prediction_bilstm = prediction(model_bilstm)
prediction_lstm = prediction(model_lstm)
prediction_gru = prediction(model_gru)

# Plot true future vs prediction
def plot_future(prediction, y_test):
    plt.figure(figsize = (10, 6))
    range_future = len(prediction)
    plt.plot(np.arange(range_future), np.array(y_test), label = 'True Future')
    plt.plot(np.arange(range_future), np.array(prediction), label = 'Prediction')
    plt.legend(loc = 'upper left')
    plt.xlabel('Time (day)')
    plt.ylabel('Daily water consumption ($m^3$/capita.day)')

plot_future(prediction_bilstm, y_test)
plot_future(prediction_lstm, y_test)
plot_future(prediction_gru, y_test)
True future vs prediction of daily water consumption for BiLSTM model
True future vs prediction of daily water consumption for LSTM model
True future vs prediction of daily water consumption for GRU model

3.5 Calculate RMSE and MAE

Let me use two goodness-of-fit measures, MAE and RMSE, to evaluate the performance of the models.
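For n test samples with observed values y_i and predictions ŷ_i, the two measures are defined as:

MAE = (1/n) Σ |ŷ_i − y_i|
RMSE = √( (1/n) Σ (ŷ_i − y_i)² )

which is exactly what the function below computes element-wise with NumPy.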

# Define a function to calculate MAE and RMSE
def evaluate_prediction(predictions, actual, model_name):
    errors = predictions - actual
    mse = np.square(errors).mean()
    rmse = np.sqrt(mse)
    mae = np.abs(errors).mean()
    print(model_name + ':')
    print('Mean Absolute Error: {:.4f}'.format(mae))
    print('Root Mean Square Error: {:.4f}'.format(rmse))
    print('')

evaluate_prediction(prediction_bilstm, y_test, 'Bidirectional LSTM')
evaluate_prediction(prediction_lstm, y_test, 'LSTM')
evaluate_prediction(prediction_gru, y_test, 'GRU')

The goodness-of-fit measures show that the three models perform quite similarly. Even so, the BiLSTM model is slightly more accurate than LSTM and GRU, so I use it for multi-step forecasting of UWC over the next 10 years.

⚠️ Note: The result of this project does not imply that BiLSTM generally performs better than LSTM and GRU. It is simply an illustration of how to compare these models and select the most reliable one for prediction.

Step 4: Multi-step forecasting of water consumption in 10 years

I import climate data projections for the study site and filter them to the period from 2015–01–01 to 2025–01–01.
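The projections code is not shown in this article, so here is a hypothetical sketch of how the unseen input X_new could be built; the file name and column names are assumptions (kept consistent with the training features), not the exact ones from the GitHub repo:

# Hypothetical example: load climate projections and build the unseen input X_new
# (file name and column names are assumptions, not from the original repo)
projections = pd.read_csv('Climate_Projections.csv', parse_dates = ['Date'],
                          index_col = 'Date')
# Keep the projection period and the same input features used for training
X_new = projections.loc['2015-01-01':'2025-01-01', ['Max_T', 'Min_T', 'T_P']]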

I create a helper function, plot_history_future, to plot the history and future UWC. Then, I create a function, forecast, to reshape the unseen input and make predictions using the BiLSTM model.

# Plot history and future data
def plot_history_future(y_train, prediction):
    plt.figure(figsize = (10, 6))
    range_history = len(y_train)
    range_future = list(range(range_history, range_history + len(prediction)))
    plt.plot(np.arange(range_history), np.array(y_train), label = 'History')
    plt.plot(range_future, np.array(prediction), label = 'Prediction')
    plt.legend(loc = 'upper right')
    plt.xlabel('Time (day)')
    plt.ylabel('Daily water consumption ($m^3$/capita.day)')

# Multi-step forecasting
def forecast(X_input, time_steps):
    # Scale the unseen input with the scaler fitted on the train set
    X = input_scaler.transform(X_input)
    # Reshape unseen data to a 3D input
    Xs = []
    for i in range(len(X) - time_steps):
        v = X[i:i+time_steps, :]
        Xs.append(v)
    X_transformed = np.array(Xs)
    # Make prediction for unseen data using the BiLSTM model
    prediction = model_bilstm.predict(X_transformed)
    prediction_actual = scaler_y.inverse_transform(prediction)
    return prediction_actual

prediction = forecast(X_new, TIME_STEPS)
plot_history_future(y_train, prediction)
History and prediction of daily water consumption based on the BiLSTM model

Conclusion

Thank you for reading this article. I know it was quite a long tutorial😏 I hope it helps you develop LSTM, GRU and BiLSTM models in TensorFlow for your own data science projects😊

Your feedback is greatly appreciated. You can reach me on LinkedIn.
