Neural Networks

Intro
Gated Recurrent Units (GRU) and Long Short-Term Memory (LSTM) networks were introduced to tackle the vanishing/exploding gradient problem in standard Recurrent Neural Networks (RNNs).
In this article, I will give you an overview of GRU architecture and provide you with a detailed Python example that you can use to build your own GRU models.
Contents
- GRU’s place within the Machine Learning universe
- How is GRU constructed, and how does it differ from standard RNN and LSTM?
- A complete Python example of building GRU neural networks with Keras and Tensorflow libraries
GRU’s place within the Machine Learning universe
The below chart is my attempt to categorize the most common Machine Learning algorithms.
While we often use Neural Networks in a supervised manner with labelled training data, I felt that their unique approach to Machine Learning deserved a separate category.
Hence, my graph shows Neural Networks (NNs) branching out from the core of the Machine Learning universe. Recurrent Neural Networks occupy a sub-branch of NNs and contain algorithms such as standard RNNs, LSTMs, and GRUs.
How is GRU constructed, and how does it differ from standard RNN and LSTM?
Let’s remind ourselves of the typical RNN structure, which contains input, hidden and output layers. Note that you can have any number of nodes, and the below 2–3–2 design is just for illustration.

Unlike Feed Forward Neural Networks, RNNs contain recurrent units in their hidden layer, which allow the algorithm to process sequence data. This is done by recurrently passing hidden states from previous timesteps and combining them with inputs of the current one.
Timestep – a single pass of the inputs through the recurrent unit. The number of timesteps equals the length of the sequence.
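To make the recurrence concrete, here is a minimal NumPy sketch of a vanilla recurrent unit (variable names and shapes are illustrative, not taken from the Keras model we build later). The key point is that the same weights are applied at every timestep, and the hidden state computed at one timestep is fed back in at the next.
import numpy as np

def rnn_forward(x_seq, W_x, W_h, b):
    # x_seq: sequence of input vectors; W_x, W_h, b: shared weights and bias (illustrative)
    h = np.zeros(W_h.shape[0])  # initial hidden state h_0
    hidden_states = []
    for x_t in x_seq:  # one iteration = one timestep
        # Combine the current input with the previous hidden state, then squash with tanh
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        hidden_states.append(h)
    return np.stack(hidden_states)  # one hidden state per timestep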
The recurrent unit architecture inside a standard RNN and LSTM
We know that RNNs utilize recurrent units to learn from the sequence data, which is true for all three types – standard RNN, LSTM, and GRU.
However, what happens inside the recurrent unit is very different between them.
For example, standard RNN uses a hidden state to remember information. Meanwhile, LSTM and GRU introduce gates to control what to remember and what to forget before updating the hidden state. In addition to that, LSTM also has a cell state, which acts as long-term memory.
Here are simplified recurrent unit diagrams (weights and biases not shown) for standard RNN and LSTM. See how they compare to each other.

Note that in both cases, after the hidden state (and the cell state for LSTM) is calculated at timestep t, they are passed back to the recurrent unit and combined with the input at timestep t+1 to calculate the new hidden state (and cell state) at timestep t+1. This process repeats for t+2, t+3, …, t+n until the predefined number (n) of timesteps is reached.
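As a rough illustration of what the LSTM part of the diagram computes, here is a minimal NumPy sketch of a single LSTM timestep. The weight matrices W and biases b are assumed to be dictionaries of appropriately shaped arrays; real implementations such as Keras organise the weights differently, so treat this as a conceptual sketch only.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    v = np.concatenate([h_prev, x_t])      # combine previous hidden state and current input
    f = sigmoid(W['f'] @ v + b['f'])       # forget gate: what to drop from the cell state
    i = sigmoid(W['i'] @ v + b['i'])       # input gate: what new information to store
    c_cand = np.tanh(W['c'] @ v + b['c'])  # candidate cell state
    c_new = f * c_prev + i * c_cand        # new cell state (long-term memory)
    o = sigmoid(W['o'] @ v + b['o'])       # output gate: what part of the cell state to expose
    h_new = o * np.tanh(c_new)             # new hidden state (short-term memory)
    return h_new, c_new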
How does GRU work?
GRU is similar to LSTM, but it has fewer gates. Also, it relies solely on a hidden state for memory transfer between recurrent units, so there is no separate cell state. Let’s analyze this simplified GRU diagram in detail (weights and biases not shown).

1–2 Reset gate – the previous hidden state (h_t-1) and the current input (x_t) are combined (multiplied by their respective weights, with a bias added) and passed through a reset gate. Since the sigmoid function ranges between 0 and 1, step one determines which values should be discarded (0), remembered (1), or partially retained (somewhere in between). Step two then resets the previous hidden state by multiplying it with the output of step one.
3–4–5 Update gate – step three may look analogous to step one, but keep in mind that the weights and biases used to scale these vectors are different, producing a different sigmoid output. After passing the combined vector through the sigmoid function, we subtract it from a vector of ones (step four) and multiply the result by the previous hidden state (step five). That is one part of updating the hidden state with new information.
6–7–8 Hidden state candidate – after resetting the previous hidden state in step two, the output is combined with the new inputs (x_t), multiplied by the respective weights with biases added, and passed through a tanh activation function (step six). The hidden state candidate is then multiplied by the update gate output (step seven) and added to the scaled previous hidden state from step five to form the new hidden state h_t (step eight).
Next, the process repeats for timestep t+1, etc., until the recurrent unit processes the entire sequence.
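Putting the eight steps together, here is a minimal NumPy sketch of a single GRU timestep, reusing the sigmoid helper and the illustrative weight dictionaries from the LSTM sketch above. Note that some implementations swap which of z and 1 - z scales the old state versus the candidate; the idea is the same.
def gru_step(x_t, h_prev, W, b):
    v = np.concatenate([h_prev, x_t])            # previous hidden state + current input
    r = sigmoid(W['r'] @ v + b['r'])             # steps 1-2: reset gate
    z = sigmoid(W['z'] @ v + b['z'])             # step 3: update gate
    keep_old = (1.0 - z) * h_prev                # steps 4-5: how much of the old state to keep
    v_reset = np.concatenate([r * h_prev, x_t])  # reset hidden state from step 2 joins the new input
    h_cand = np.tanh(W['h'] @ v_reset + b['h'])  # step 6: hidden state candidate
    h_new = z * h_cand + keep_old                # steps 7-8: blend candidate with the old state
    return h_new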

Python example of building GRU neural networks with Keras and Tensorflow libraries
Now, we will use GRU to create a many-to-many prediction model, which means using a sequence of values to predict the following sequence. Note that GRU could also be used in one-to-one (not recommended because it’s not sequence data), many-to-one, and one-to-many setups.
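Concretely, in this many-to-many setup both the inputs and the targets end up as 3D arrays of shape (samples, timesteps, features); the shaping helper defined later produces exactly this layout.
# Illustrative shapes only (the real arrays are built by the shaping helper defined below):
# X: (n_samples, 18, 1) - each sample holds 18 past monthly temperatures
# Y: (n_samples, 18, 1) - each sample holds the 18 following monthly temperatures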
Data preparation
First, we need to get the following data and libraries:
- Australian weather data from Kaggle (license: Creative Commons, the original source of the data: Commonwealth of Australia, Bureau of Meteorology).
- Pandas and Numpy for data manipulation
- Plotly for data visualizations
- Tensorflow/Keras for GRU Neural Networks
- Scikit-learn library for data scaling (MinMaxScaler)
Let’s import all libraries:
# Tensorflow / Keras
from tensorflow import keras # for building Neural Networks
print('Tensorflow/Keras: %s' % keras.__version__) # print version
from keras.models import Sequential # for creating a linear stack of layers for our Neural Network
from keras import Input # for instantiating a keras tensor
from keras.layers import Bidirectional, GRU, RepeatVector, Dense, TimeDistributed # for creating layers inside the Neural Network
# Data manipulation
import pandas as pd # for data manipulation
print('pandas: %s' % pd.__version__) # print version
import numpy as np # for data manipulation
print('numpy: %s' % np.__version__) # print version
# Sklearn
import sklearn
print('sklearn: %s' % sklearn.__version__) # print version
from sklearn.preprocessing import MinMaxScaler # for feature scaling
# Visualization
import plotly
import plotly.express as px
import plotly.graph_objects as go
print('plotly: %s' % plotly.__version__) # print version
The above code prints package versions I used in this example:
Tensorflow/Keras: 2.7.0
pandas: 1.3.4
numpy: 1.21.4
sklearn: 1.0.1
plotly: 5.4.0
Next, download and ingest Australian weather data (source: Kaggle). We only ingest a subset of columns since we don’t need the whole dataset for our model.
Also, we perform some simple data manipulation and derive a couple of new variables: Year-Month and Median Temperature.
# Set Pandas options to display more columns
pd.options.display.max_columns=150
# Read in the weather data csv - keep only the columns we need
df=pd.read_csv('weatherAUS.csv', encoding='utf-8', usecols=['Date', 'Location', 'MinTemp', 'MaxTemp'])
# Drop records where target MinTemp=NaN or MaxTemp=NaN
df=df[pd.isnull(df['MinTemp'])==False]
df=df[pd.isnull(df['MaxTemp'])==False]
# Convert dates to year-months
df['Year-Month']= (pd.to_datetime(df['Date'], yearfirst=True)).dt.strftime('%Y-%m')
# Derive median daily temperature (mid point between Daily Max and Daily Min)
df['MedTemp']=df[['MinTemp', 'MaxTemp']].median(axis=1)
# Show a snapshot of the data
df

Currently, we have one Median Temperature record for each location and date. However, daily temperatures fluctuate a lot, making the prediction much harder. So, let's calculate monthly averages and transpose the data to have locations as rows and Year-Months as columns.
# Create a copy of an original dataframe
df2=df[['Location', 'Year-Month', 'MedTemp']].copy()
# Calculate monthly average temperature for each location
df2=df2.groupby(['Location', 'Year-Month'], as_index=False).mean()
# Transpose dataframe
df2_pivot=df2.pivot(index=['Location'], columns='Year-Month')['MedTemp']
# Remove locations with lots of missing (NaN) data
df2_pivot=df2_pivot.drop(['Dartmoor', 'Katherine', 'Melbourne', 'Nhil', 'Uluru'], axis=0)
# Remove months with lots of missing (NaN) data
df2_pivot=df2_pivot.drop(['2007-11', '2007-12', '2008-01', '2008-02', '2008-03', '2008-04', '2008-05', '2008-06', '2008-07', '2008-08', '2008-09', '2008-10', '2008-11', '2008-12', '2017-01', '2017-02', '2017-03', '2017-04', '2017-05', '2017-06'], axis=1)
# Display the new dataframe
df2_pivot

Since we are working with real-life data, we notice that three months (2011-04, 2012-12, and 2013-02) are entirely missing from the dataframe. Therefore, we impute values for the missing months by averaging the preceding and subsequent months.
# Add missing months 2011-04, 2012-12 and 2013-02 and impute data
df2_pivot['2011-04']=(df2_pivot['2011-03']+df2_pivot['2011-05'])/2
df2_pivot['2012-12']=(df2_pivot['2012-11']+df2_pivot['2013-01'])/2
df2_pivot['2013-02']=(df2_pivot['2013-01']+df2_pivot['2013-03'])/2
# Sort columns so Year-Months are in the correct order
df2_pivot=df2_pivot.reindex(sorted(df2_pivot.columns), axis=1)
Finally, we can plot data on a chart.
# Plot average monthly temperature derived from daily medians for each location
fig = go.Figure()
for location in df2_pivot.index:
    fig.add_trace(go.Scatter(x=df2_pivot.loc[location, :].index,
                             y=df2_pivot.loc[location, :].values,
                             mode='lines',
                             name=location,
                             opacity=0.8,
                             line=dict(width=1)
                             ))
# Change chart background color
fig.update_layout(dict(plot_bgcolor = 'white'), showlegend=True)
# Update axes lines
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey',
                 zeroline=True, zerolinewidth=1, zerolinecolor='lightgrey',
                 showline=True, linewidth=1, linecolor='black',
                 title='Date'
                 )
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey',
                 zeroline=True, zerolinewidth=1, zerolinecolor='lightgrey',
                 showline=True, linewidth=1, linecolor='black',
                 title='Degrees Celsius'
                 )
# Set figure title
fig.update_layout(title=dict(text="Average Monthly Temperatures", font=dict(color='black')))
fig.show()

The graph shows all locations initially, but I have picked five of them (Cairns, Canberra, Darwin, Gold Coast, and Mount Ginini) to display in the above image.
Note how the mean temperature, as well as variation, differs between locations. We can either train a location-specific model for better precision or a generic model to predict temperatures for every area.
In this example, I will create a generic model trained on all locations. Note that you can find a location-specific model code in my LSTM article.
Training and evaluating GRU model
Here are a few things to highlight before we start.
- We will use sequences of 18 months to predict the average temperatures for the next 18 months. You can adjust that to your liking but beware that there will not be enough data for sequences beyond 23 months in length.
- We will split the data into two separate dataframes – one for training and the other for validation (out of time validation).
- Since we are creating a many-to-many prediction model, we need to use a slightly more complex encoder-decoder configuration. Both encoder and decoder are hidden GRU layers, with information passed from one to another via a repeat vector layer.
- A repeat vector is necessary when we want to have sequences of different lengths, e.g., a sequence of 18 months to predict the next 12 months. It ensures that we provide the right shape for the decoder layer. However, if your input and output sequences are of the same length, as in my example, then you can also set return_sequences=True in the encoder layer and remove the repeat vector (see the sketch after this list).
- Note that we added a Bidirectional wrapper to the GRU layers. It lets the model process each sequence in both directions, which sometimes produces better results. However, its use is optional.
- Also, we need to use a Time Distributed wrapper in the output layer to predict outputs for each timestep individually.
- Finally, I have used MinMaxScaling in this example because it has produced better results than the unscaled version. You can find both scaled and unscaled setups within Jupyter Notebooks in my GitHub repository (link available at the end of the article).
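Before the full model, here is a minimal sketch of the alternative configuration mentioned above, where the encoder returns a full sequence and the repeat vector layer is dropped. It only works when input and output sequences have the same length; the layer sizes are illustrative, and it reuses the imports from the beginning of the article.
# Alternative encoder-decoder without a RepeatVector layer (illustrative layer sizes)
alt_model = Sequential(name="GRU-Model-Alt")
alt_model.add(Input(shape=(18, 1)))                                  # 18 timesteps, 1 feature
alt_model.add(Bidirectional(GRU(units=32, return_sequences=True)))   # encoder emits one output per timestep
alt_model.add(Bidirectional(GRU(units=32, return_sequences=True)))   # decoder keeps the sequence shape
alt_model.add(TimeDistributed(Dense(units=1, activation='linear')))  # one prediction per timestep
alt_model.compile(optimizer='adam', loss='mean_squared_error')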
First, let’s define a helper function to reshape the data to a 3D array required by GRU.
def shaping(datain, timestep, scaler):
    # Loop through each location
    for location in datain.index:
        datatmp = datain[datain.index==location].copy()
        # Convert input dataframe to array and flatten
        arr=datatmp.to_numpy().flatten()
        # Scale using transform (using previously fitted scaler)
        arr_scaled=scaler.transform(arr.reshape(-1, 1)).flatten()
        cnt=0
        for mth in range(0, len(datatmp.columns)-(2*timestep)+1): # Define range
            cnt=cnt+1 # Gives us the number of samples. Later used to reshape the data
            X_start=mth # Start month for inputs of each sample
            X_end=mth+timestep # End month for inputs of each sample
            Y_start=mth+timestep # Start month for targets of each sample. Note, start is inclusive and end is exclusive, that's why X_end and Y_start is the same number
            Y_end=mth+2*timestep # End month for targets of each sample
            # Assemble input and target arrays containing all samples
            if mth==0:
                X_comb=arr_scaled[X_start:X_end]
                Y_comb=arr_scaled[Y_start:Y_end]
            else:
                X_comb=np.append(X_comb, arr_scaled[X_start:X_end])
                Y_comb=np.append(Y_comb, arr_scaled[Y_start:Y_end])
        # Reshape input and target arrays
        X_loc=np.reshape(X_comb, (cnt, timestep, 1))
        Y_loc=np.reshape(Y_comb, (cnt, timestep, 1))
        # Append an array for each location to the master array
        if location==datain.index[0]:
            X_out=X_loc
            Y_out=Y_loc
        else:
            X_out=np.concatenate((X_out, X_loc), axis=0)
            Y_out=np.concatenate((Y_out, Y_loc), axis=0)
    return X_out, Y_out
Next, we train the GRU neural network over 50 epochs and display the model summary with evaluation metrics. You can follow my comments within the code to understand each step.
##### Step 1 - Specify parameters
timestep=18
scaler = MinMaxScaler(feature_range=(-1, 1))
##### Step 2 - Prepare data
# Split data into train and test dataframes
df_train=df2_pivot.iloc[:, 0:-2*timestep].copy()
df_test=df2_pivot.iloc[:, -2*timestep:].copy()
# Use fit to train the scaler on the training data only, actual scaling will be done inside reshaping function
scaler.fit(df_train.to_numpy().reshape(-1, 1))
# Use previously defined shaping function to reshape the data for GRU
X_train, Y_train = shaping(datain=df_train, timestep=timestep, scaler=scaler)
X_test, Y_test = shaping(datain=df_test, timestep=timestep, scaler=scaler)
##### Step 3 - Specify the structure of a Neural Network
model = Sequential(name="GRU-Model") # Model
model.add(Input(shape=(X_train.shape[1],X_train.shape[2]), name='Input-Layer')) # Input Layer - need to specify the shape of inputs
model.add(Bidirectional(GRU(units=32, activation='tanh', recurrent_activation='sigmoid', stateful=False), name='Hidden-GRU-Encoder-Layer')) # Encoder Layer
model.add(RepeatVector(X_train.shape[1], name='Repeat-Vector-Layer')) # Repeat Vector
model.add(Bidirectional(GRU(units=32, activation='tanh', recurrent_activation='sigmoid', stateful=False, return_sequences=True), name='Hidden-GRU-Decoder-Layer')) # Decoder Layer
model.add(TimeDistributed(Dense(units=1, activation='linear'), name='Output-Layer')) # Output Layer, Linear(x) = x
##### Step 4 - Compile the model
model.compile(optimizer='adam', # default='rmsprop', an algorithm to be used in backpropagation
loss='mean_squared_error', # Loss function to be optimized. A string (name of loss function), or a tf.keras.losses.Loss instance.
metrics=['MeanSquaredError', 'MeanAbsoluteError'], # List of metrics to be evaluated by the model during training and testing. Each of these can be a string (name of a built-in function), a function or a tf.keras.metrics.Metric instance.
loss_weights=None, # default=None, Optional list or dictionary specifying scalar coefficients (Python floats) to weight the loss contributions of different model outputs.
weighted_metrics=None, # default=None, List of metrics to be evaluated and weighted by sample_weight or class_weight during training and testing.
run_eagerly=None, # Defaults to False. If True, this Model's logic will not be wrapped in a tf.function. Recommended to leave this as None unless your Model cannot be run inside a tf.function.
steps_per_execution=None # Defaults to 1. The number of batches to run during each tf.function call. Running multiple batches inside a single tf.function call can greatly improve performance on TPUs or small models with a large Python overhead.
)
##### Step 5 - Fit the model on the dataset
history = model.fit(X_train, # input data
Y_train, # target data
batch_size=1, # Number of samples per gradient update. If unspecified, batch_size will default to 32.
epochs=50, # default=1, Number of epochs to train the model. An epoch is an iteration over the entire x and y data provided
verbose=1, # default='auto', ('auto', 0, 1, or 2). Verbosity mode. 0 = silent, 1 = progress bar, 2 = one line per epoch. 'auto' defaults to 1 for most cases, but 2 when used with ParameterServerStrategy.
callbacks=None, # default=None, list of callbacks to apply during training. See tf.keras.callbacks
validation_split=0.2, # default=0.0, Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch.
#validation_data=(X_test, y_test), # default=None, Data on which to evaluate the loss and any model metrics at the end of each epoch.
shuffle=True, # default=True, Boolean (whether to shuffle the training data before each epoch) or str (for 'batch').
class_weight=None, # default=None, Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
sample_weight=None, # default=None, Optional Numpy array of weights for the training samples, used for weighting the loss function (during training only).
initial_epoch=0, # Integer, default=0, Epoch at which to start training (useful for resuming a previous training run).
steps_per_epoch=None, # Integer or None, default=None, Total number of steps (batches of samples) before declaring one epoch finished and starting the next epoch. When training with input tensors such as TensorFlow data tensors, the default None is equal to the number of samples in your dataset divided by the batch size, or 1 if that cannot be determined.
validation_steps=None, # Only relevant if validation_data is provided and is a tf.data dataset. Total number of steps (batches of samples) to draw before stopping when performing validation at the end of every epoch.
validation_batch_size=None, # Integer or None, default=None, Number of samples per validation batch. If unspecified, will default to batch_size.
validation_freq=10, # default=1, Only relevant if validation data is provided. If an integer, specifies how many training epochs to run before a new validation run is performed, e.g. validation_freq=2 runs validation every 2 epochs.
max_queue_size=10, # default=10, Used for generator or keras.utils.Sequence input only. Maximum size for the generator queue. If unspecified, max_queue_size will default to 10.
workers=1, # default=1, Used for generator or keras.utils.Sequence input only. Maximum number of processes to spin up when using process-based threading. If unspecified, workers will default to 1.
use_multiprocessing=True, # default=False, Used for generator or keras.utils.Sequence input only. If True, use process-based threading. If unspecified, use_multiprocessing will default to False.
)
##### Step 6 - Use model to make predictions
# Predict results on training data
#pred_train = model.predict(X_train)
# Predict results on test data
pred_test = model.predict(X_test)
##### Step 7 - Print Performance Summary
print("")
print('-------------------- Model Summary --------------------')
model.summary() # print model summary
print("")
print('-------------------- Weights and Biases --------------------')
print("Too many parameters to print but you can use the code provided if needed")
print("")
# for layer in model.layers:
#     print(layer.name)
#     for item in layer.get_weights():
#         print("  ", item)
# print("")
# Print the last value in the evaluation metrics contained within history file
print('-------------------- Evaluation on Training Data --------------------')
for item in history.history:
    print("Final", item, ":", history.history[item][-1])
print("")
# Evaluate the model on the test data using "evaluate"
print('-------------------- Evaluation on Test Data --------------------')
results = model.evaluate(X_test, Y_test)
print("")
The above code prints the following summary and evaluation metrics for our GRU neural network (note, your results may differ due to the stochastic nature of neural network training):

Now, let’s regenerate predictions for the 5 locations we picked earlier and plot the results on a chart to compare actual and predicted values.
Predict
# Select locations to predict temperatures for
location=['Cairns', 'Canberra', 'Darwin', 'GoldCoast', 'MountGinini']
dfloc_test = df_test[df_test.index.isin(location)].copy()
# Reshape test data
X_test, Y_test = shaping(datain=dfloc_test, timestep=timestep, scaler=scaler)
# Predict results on test data
pred_test = model.predict(X_test)
Plot
# Plot average monthly temperatures (actual and predicted) for test (out of time) data
fig = go.Figure()
# Trace for actual temperatures
for location in dfloc_test.index:
    fig.add_trace(go.Scatter(x=dfloc_test.loc[location, :].index,
                             y=dfloc_test.loc[location, :].values,
                             mode='lines',
                             name=location,
                             opacity=0.8,
                             line=dict(width=1)
                             ))
# Trace for predicted temperatures
for i in range(0, pred_test.shape[0]):
    fig.add_trace(go.Scatter(x=np.array(dfloc_test.columns[-timestep:]),
                             # Need to inverse transform the predictions before plotting
                             y=scaler.inverse_transform(pred_test[i].reshape(-1, 1)).flatten(),
                             mode='lines',
                             name=dfloc_test.index[i]+' Prediction',
                             opacity=1,
                             line=dict(width=2, dash='dot')
                             ))
# Change chart background color
fig.update_layout(dict(plot_bgcolor = 'white'))
# Update axes lines
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey',
                 zeroline=True, zerolinewidth=1, zerolinecolor='lightgrey',
                 showline=True, linewidth=1, linecolor='black',
                 title='Year-Month'
                 )
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey',
                 zeroline=True, zerolinewidth=1, zerolinecolor='lightgrey',
                 showline=True, linewidth=1, linecolor='black',
                 title='Degrees Celsius'
                 )
# Set figure title
fig.update_layout(title=dict(text="Average Monthly Temperatures", font=dict(color='black')))
fig.show()

It looks like our GRU model has done a pretty good job capturing temperature trends for each location!
Final remarks
GRU and LSTM are similar not only in their architecture but also in their predictive ability. Hence, it’s up to you to try them both before picking your favourite.
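Since Keras exposes both layer types through the same interface, trying LSTM instead of GRU typically only requires swapping the layer class. The sketch below mirrors the model structure used earlier; it assumes the same imports and the X_train array from the training section are available.
from keras.layers import LSTM  # the LSTM layer shares the GRU layer's interface

lstm_model = Sequential(name="LSTM-Model")
lstm_model.add(Input(shape=(X_train.shape[1], X_train.shape[2]), name='Input-Layer'))
lstm_model.add(Bidirectional(LSTM(units=32, activation='tanh', recurrent_activation='sigmoid'), name='Hidden-LSTM-Encoder-Layer'))
lstm_model.add(RepeatVector(X_train.shape[1], name='Repeat-Vector-Layer'))
lstm_model.add(Bidirectional(LSTM(units=32, activation='tanh', recurrent_activation='sigmoid', return_sequences=True), name='Hidden-LSTM-Decoder-Layer'))
lstm_model.add(TimeDistributed(Dense(units=1, activation='linear'), name='Output-Layer'))
lstm_model.compile(optimizer='adam', loss='mean_squared_error', metrics=['MeanAbsoluteError'])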
If you want the complete Python code in one piece, you can find the Jupyter Notebook in my GitHub repository.
Thanks for reading, and do not hesitate to get in touch if you have any questions or suggestions.
Cheers! 👏 Saul Dobilas
RNN: Recurrent Neural Networks – How to Successfully Model Sequential Data in Python
Feed Forward Neural Networks – How To Successfully Build Them in Python
Deep Feed Forward Neural Networks and the Advantage of ReLU Activation Function