Bayesian Hyper-Parameter Optimization: Neural Networks, TensorFlow, Facies Prediction Example

Automate hyper-parameter tuning for neural networks: learning rate, number of dense layers, number of nodes per layer, and activation function

Ryan A. Mardani
Towards Data Science


The purpose of this work is to optimize the neural network hyper-parameters to estimate facies classes from well logs. I will include some code in this article, but for the full Jupyter notebook you can visit my GitHub.

Note: if you are new to TensorFlow, its installation is elaborated by Jeff Heaton.

In machine learning, model parameters can be divided into two main categories:
1- Trainable parameters: such as the weights in a neural network, which are learned by the training algorithm without user interference.
2- Hyper-parameters: parameters the user sets before training, such as the learning rate or the number of dense layers in the model.
Selecting the best hyper-parameters by hand is a tedious task, and it becomes practically impossible once you are dealing with more than two or three of them.
One way is to divide the valid range of each parameter into evenly spaced values and ask the computer to loop over every combination and evaluate the results. This method is called grid search. Although the machine does the work, it is time-consuming: with 3 hyper-parameters and 10 possible values each, you would train 10³ = 1,000 neural network models, which is a huge task even with a reasonably sized training dataset.
Another way is random search: instead of an organized sweep, it evaluates random combinations of parameters and keeps the best one found. With a fixed budget it scales better than grid search, but as the number of hyper-parameters grows, the chance of stumbling on a good combination by pure luck still shrinks quickly. The sketch below contrasts the two approaches.
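
To make the cost difference concrete, here is a small sketch (not from the original article) using scikit-learn's ParameterGrid and ParameterSampler; the parameter values are purely illustrative:

from sklearn.model_selection import ParameterGrid, ParameterSampler

param_space = {
    'learning_rate': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],  # 5 values
    'num_dense_layers': [1, 2, 3, 4, 5],              # 5 values
    'num_dense_nodes': [16, 32, 64, 128, 256],        # 5 values
}

# grid search: every combination must be trained
grid = ParameterGrid(param_space)
print(len(grid))  # 5 * 5 * 5 = 125 full trainings

# random search: train only a fixed budget of randomly drawn combinations
random_candidates = list(ParameterSampler(param_space, n_iter=20, random_state=42))
print(len(random_candidates))  # 20 trainings, regardless of how fine the grid is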

Scikit-Optimize (skopt), which we will use here for the facies estimation task, is a simple and efficient library for minimizing expensive, noisy black-box functions. Bayesian optimization builds a surrogate model of the hyper-parameter search space; a Gaussian process is one such model. The surrogate provides an estimate of how model performance varies as the hyper-parameters change.

As the picture shows, the true objective function (red dashed line) is surrounded by noise (red shade). The red points show where scikit-optimize sampled the search space for one hyper-parameter (one dimension). Scikit-optimize fits a Gaussian process (green line) through the sampled points to estimate the true fitness value everywhere else. In regions with few or no samples (such as the left side of the picture between two red samples), the estimate is highly uncertain: the green and red lines diverge, and the green shaded band (roughly two standard deviations) is wide.
The optimizer then proposes a new set of hyper-parameters to explore the search space further. In the early iterations the estimate is rough, but in later iterations sampling concentrates where the surrogate agrees well with the true objective, around the trough of the curve. A minimal sketch of this idea is shown below.
For more study, you may refer to Scikit Optimize documentation.
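
Here is a minimal sketch (not the article's code) of the idea behind the figure: fit a Gaussian process to a few noisy samples of an expensive objective and use its mean and uncertainty to decide where to sample next. The toy objective and all parameter values are assumptions for illustration only.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.RandomState(0)

def objective(x, noise=0.1):
    # stand-in for an expensive, noisy black-box fitness function
    return np.sin(3 * x).ravel() + noise * rng.randn(x.shape[0])

X_sample = rng.uniform(0, 3, size=(8, 1))   # a few expensive evaluations
y_sample = objective(X_sample)

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=0.1)
gp.fit(X_sample, y_sample)

X_grid = np.linspace(0, 3, 200).reshape(-1, 1)
mean, std = gp.predict(X_grid, return_std=True)  # the green line and its uncertainty band

# a crude "explore where uncertainty is largest" rule; skopt's acquisition
# functions (e.g. Expected Improvement) are smarter versions of this idea
next_x = X_grid[np.argmax(std), 0]
print('next point to evaluate:', next_x)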

Data Review
The Council Grove gas reservoir is located in Kansas. Nine wells are available from this carbonate reservoir. Facies were studied from core samples at every half foot and matched with the logging data at each well location. The feature variables include five wireline log measurements and two geologic constraining variables derived from geologic knowledge. For more detail refer here; you may download the dataset from here. The seven variables are:

  1. GR: this wireline logging tool measures gamma emission
  2. ILD_log10: this is resistivity measurement
  3. PE: photoelectric effect log
  4. DeltaPHI: Phi is a porosity index in petrophysics.
  5. PHIND: average of the neutron and density porosity logs
  6. NM_M: nonmarine-marine indicator
  7. RELPOS: relative position

The nine discrete facies (classes of rocks) are:

  1. (SS) Nonmarine sandstone
  2. (CSiS) Nonmarine coarse siltstone
  3. (FSiS) Nonmarine fine siltstone
  4. (SiSh) Marine siltstone and shale
  5. (MS) Mudstone (limestone)
  6. (WS) Wackestone (limestone)
  7. (D) Dolomite
  8. (PS) Packstone-grainstone (limestone)
  9. (BS) Phylloid-algal bafflestone (limestone)

After reading the dataset into Python, we keep one well's data aside as a blind set to examine model performance later. We also need to convert the facies numbers into strings in the dataset (a sketch of this step follows the code below; refer to the full notebook for details).

import pandas as pd

df = pd.read_csv('training_data.csv')

blind = df[df['Well Name'] == 'SHANKLE']
training_data = df[df['Well Name'] != 'SHANKLE']
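
The dummy-variable step in the next section expects a string column called FaciesLabels. Here is a minimal sketch of that conversion (the full notebook has its own version), assuming the numeric Facies codes 1-9 map in order to the abbreviations listed above:

# map numeric facies codes (1-9) to the label strings used as 'FaciesLabels'
facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS',
                 'WS', 'D', 'PS', 'BS']

training_data = training_data.copy()
training_data['FaciesLabels'] = training_data['Facies'].apply(
    lambda f: facies_labels[int(f) - 1])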

Feature Engineering
Facies classes should be converted to dummy (one-hot) variables in order to be used in the neural network:

dummies = pd.get_dummies(training_data['FaciesLabels'])
Facies_cat = dummies.columns
labels = dummies.values  # target matrix
# select predictors
features = training_data.drop(['Facies', 'Formation', 'Well Name', 'Depth', 'FaciesLabels'], axis=1)

Preprocessing (Standardization)

As the features span very different ranges, let's standardize them so the network trains efficiently.

from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(features)
scaled_features = scaler.transform(features)

# data split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    scaled_features, labels, test_size=0.2, random_state=42)

Hyper-Parameters

In this work, we will predict facies from well logs using deep learning in TensorFlow. There are several hyper-parameters that we may adjust for deep learning; I will try to find the optimized values for:

  1. Learning rate
  2. Number of dense layers
  3. Number of nodes for each layer
  4. Activation function: 'relu' or 'sigmoid'

To define these search dimensions, we will use the scikit-optimize (skopt) library. From skopt, the Real function defines our desired range for the learning rate (lower bound = 1e-6, upper bound = 1e-1) with a log-uniform prior. The search dimensions for the number of layers (between 1 and 10) and the number of nodes in each layer (between 5 and 512) are defined with skopt's Integer function.

from skopt.space import Real, Integer, Categorical

dim_learning_rate = Real(low=1e-6, high=1e-1, prior='log-uniform',
                         name='learning_rate')
dim_num_dense_layers = Integer(low=1, high=10, name='num_dense_layers')
dim_num_dense_nodes = Integer(low=5, high=512, name='num_dense_nodes')

For the activation function, we use skopt's Categorical dimension:

dim_activation = Categorical(categories=['relu', 'sigmoid'],
                             name='activation')

Bring all search-dimensions into a single list:

dimensions = [dim_learning_rate,
              dim_num_dense_layers,
              dim_num_dense_nodes,
              dim_activation]

If you have ever hand-tuned the hyper-parameters of a deep learning project, you know how hard the process is. You may also use your own guess (like mine below, used as the default) to compare the results with the Bayesian tuning approach.

default_parameters = [1e-5, 1, 16, 'relu']

Hyper-Parameter Optimization

Create Model

Like some examples developed by TensorFlow, we need to define a model function first. After choosing the model type (Sequential here), we introduce the data dimension (input shape) in the first layer. The number of dense layers, the number of nodes, and the activation type are hyper-parameters we want to optimize. Softmax activation is used in the output layer because this is a classification problem. The learning rate, another hyper-parameter, is passed to the Adam optimizer. Finally, the model is compiled with the 'categorical_crossentropy' loss, since we are dealing with a classification problem (facies prediction).

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Dense
from tensorflow.keras.optimizers import Adam


def create_model(learning_rate, num_dense_layers,
                 num_dense_nodes, activation):

    model = Sequential()

    # input layer matching the number of features
    model.add(InputLayer(input_shape=(scaled_features.shape[1],)))

    for i in range(num_dense_layers):
        name = 'layer_dense_{0}'.format(i+1)

        # add dense layer
        model.add(Dense(num_dense_nodes,
                        activation=activation,
                        name=name))

    # use softmax-activation for classification
    model.add(Dense(labels.shape[1], activation='softmax'))

    # use the Adam method for training the network
    optimizer = Adam(learning_rate=learning_rate)

    # compile the model so it can be trained
    model.compile(optimizer=optimizer,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    return model
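
As a quick sanity check (not in the original article), you can build a model with the default hyper-parameters and inspect its architecture before starting the search:

model = create_model(learning_rate=1e-5, num_dense_layers=1,
                     num_dense_nodes=16, activation='relu')
model.summary()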

Train and Evaluate the Model

This function creates and trains a network with the given hyper-parameters and then evaluates its performance on the validation dataset. It returns the fitness value: the negative of the classification accuracy, because skopt performs minimization rather than maximization.

from skopt.utils import use_named_args
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras import backend as K


@use_named_args(dimensions=dimensions)
def fitness(learning_rate, num_dense_layers,
            num_dense_nodes, activation):
    """
    Hyper-parameters:
    learning_rate: Learning-rate for the optimizer.
    num_dense_layers: Number of dense layers.
    num_dense_nodes: Number of nodes in each dense layer.
    activation: Activation function for all layers.
    """

    # Print the hyper-parameters.
    print('learning rate: {0:.1e}'.format(learning_rate))
    print('num_dense_layers:', num_dense_layers)
    print('num_dense_nodes:', num_dense_nodes)
    print('activation:', activation)
    print()

    # Create the neural network with these hyper-parameters.
    model = create_model(learning_rate=learning_rate,
                         num_dense_layers=num_dense_layers,
                         num_dense_nodes=num_dense_nodes,
                         activation=activation)

    # Dir-name for the TensorBoard log-files.
    log_dir = log_dir_name(learning_rate, num_dense_layers,
                           num_dense_nodes, activation)

    # Create a callback-function for Keras which will be
    # run after each epoch has ended during training.
    # This saves the log-files for TensorBoard.
    # Note that there are complications when histogram_freq=1.
    # It might give strange errors and it also does not properly
    # support Keras data-generators for the validation-set.
    callback_log = TensorBoard(
        log_dir=log_dir,
        histogram_freq=0,
        write_graph=True,
        write_grads=False,
        write_images=False)

    # Use Keras to train the model.
    history = model.fit(x=X_train,
                        y=y_train,
                        epochs=3,
                        batch_size=128,
                        validation_data=validation_data,
                        callbacks=[callback_log])

    # Get the classification accuracy on the validation-set
    # after the last training-epoch.
    accuracy = history.history['val_accuracy'][-1]

    # Print the classification accuracy.
    print()
    print("Accuracy: {0:.2%}".format(accuracy))
    print()

    # Save the model if it improves on the best-found performance.
    # We use the global keyword so we update the variable outside
    # of this function.
    global best_accuracy

    # If the classification accuracy of the saved model is improved ...
    if accuracy > best_accuracy:
        # Save the new model to harddisk.
        model.save(path_best_model)

        # Update the classification accuracy.
        best_accuracy = accuracy

    # Delete the Keras model with these hyper-parameters from memory.
    del model

    # Clear the Keras session, otherwise it will keep adding new
    # models to the same TensorFlow graph each time we create
    # a model with a different set of hyper-parameters.
    K.clear_session()

    # NOTE: Scikit-optimize does minimization so it tries to
    # find a set of hyper-parameters with the LOWEST fitness-value.
    # Because we are interested in the HIGHEST classification
    # accuracy, we need to negate this number so it can be minimized.
    return -accuracy

# This function is adapted from Hvass-Labs, TensorFlow-Tutorials.
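
The fitness function relies on a few globals and a helper that are defined elsewhere in the notebook. Here is a minimal sketch, assuming a hypothetical file name for the best model and using the (X_test, y_test) split from earlier as the validation set:

path_best_model = 'best_model.h5'   # hypothetical file name for the best model
best_accuracy = 0.0                 # best validation accuracy found so far
validation_data = (X_test, y_test)  # validation set used inside fitness()

def log_dir_name(learning_rate, num_dense_layers, num_dense_nodes, activation):
    # one TensorBoard log directory per hyper-parameter combination
    return './logs/lr_{0:.0e}_layers_{1}_nodes_{2}_{3}/'.format(
        learning_rate, num_dense_layers, num_dense_nodes, activation)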

Run this to check the performance of the default hyper-parameters:

fitness(x=default_parameters)

Run Hyper-Parameter Optimization

We have already checked the default hyper-parameter performance. Now we can run Bayesian optimization from the scikit-optimize library. Here we use 40 calls to the fitness function; this is an expensive operation and should be used carefully with large datasets and models.

from skopt import gp_minimize

search_result = gp_minimize(func=fitness,
                            dimensions=dimensions,
                            acq_func='EI',  # Expected Improvement
                            n_calls=40,
                            x0=default_parameters)

Only the last few runs are shown below (see the notebook for the full output):

Progress visualization

Using the plot_convergence function of skopt, we can visualize the optimization progress, with the best fitness value found so far on the y-axis.

from skopt.plots import plot_convergence

plot_convergence(search_result)
# plt.savefig("Converge.png", dpi=400)

Optimal Hyper-Parameters

Using the search_result object, we can see the best hyper-parameters that the Bayesian optimizer found.

search_result.x

Optimized hyper-parameters are in order: Learning rate, number of dense layers, number of nodes in each layer, and the best activation function.

We can see all results for 40 calls with corresponding hyper-parameters and fitness values.

sorted(zip(search_result.func_vals, search_result.x_iters))
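
Because the fitness values are negated accuracies, printing them back as percentages makes this list easier to read (a small helper, not from the original article):

# show the five best runs with their accuracy and hyper-parameters
for fitness_value, params in sorted(zip(search_result.func_vals,
                                        search_result.x_iters))[:5]:
    print('accuracy: {0:.2%}   hyper-parameters: {1}'.format(-fitness_value, params))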

An interesting point is that the ‘relu’ activation function is almost dominant.

Plots

First, let's look at a 2D plot of two of the optimized parameters: a landscape plot of the estimated fitness values as a function of the learning rate and the number of nodes in each layer.
The Bayesian optimizer builds a surrogate model of the search space and searches within it rather than in the real search space, which is why it is faster. In the plot, yellow regions are better and blue regions are worse. The black dots are the optimizer's sampling locations, and the red star marks the best parameters found.

from skopt.plots import plot_objective_2D
fig = plot_objective_2D(result=search_result,
                        dimension_identifier1='learning_rate',
                        dimension_identifier2='num_dense_nodes',
                        levels=50)
# plt.savefig("Lr_numnods.png", dpi=400)

Some points:

  1. The surrogate model can be inaccurate because it is built from only 40 samples of the fitness function.
  2. The plot may change each time the optimization is re-run because of random noise and the stochastic training of the neural network.
  3. This is a 2D plot, whereas we optimized 4 parameters, so the full search space has 4 dimensions.

from skopt.plots import plot_objective
import matplotlib.pyplot as plt

# create a list of dimension names for plotting
dim_names = ['learning_rate', 'num_dense_layers', 'num_dense_nodes', 'activation']
ax = plot_objective(result=search_result, dimensions=dim_names)
plt.savefig("all_dimen.png", dpi=400)

In these plots, we can see how the optimization proceeded: the Bayesian approach concentrates its sampling, and therefore fits its surrogate model best, in the most promising regions of the search space. Putting all four parameters together, the best results in this run were obtained with a learning rate of about 0.003, 6 dense layers, about 327 nodes in each layer, and the 'relu' activation function.

Evaluate the model with optimized hyper-parameters with blind data

The same data preparation steps are required for the blind well as well (a sketch follows below; the full version is in the notebook). Now we can build a model with the optimized parameters and check its predictions.
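
Here is a minimal sketch of that preparation, assuming the facies_labels list from the earlier sketch and reusing the scaler and dummy columns fitted on the training data:

# prepare the blind well with the SAME transformations used for training
blind = blind.copy()
blind['FaciesLabels'] = blind['Facies'].apply(lambda f: facies_labels[int(f) - 1])

features_blind = blind.drop(['Facies', 'Formation', 'Well Name',
                             'Depth', 'FaciesLabels'], axis=1)
scaled_features_blind = scaler.transform(features_blind)  # reuse the fitted scaler, do not refit

labels_blind = (pd.get_dummies(blind['FaciesLabels'])
                .reindex(columns=Facies_cat, fill_value=0)
                .values.astype('float32'))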

opt_par = search_result.x

# use hyper-parameters from optimization

learning_rate = opt_par[0]
num_layers = opt_par[1]
num_nodes = opt_par[2]
activation = opt_par[3]

Create the model:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, InputLayer
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential()
model.add(InputLayer(input_shape=(scaled_features.shape[1],)))

# add the optimized number of hidden layers, each with the optimized node count
for _ in range(num_layers):
    model.add(Dense(num_nodes, activation=activation, kernel_initializer='random_normal'))

model.add(Dense(labels.shape[1], activation='softmax', kernel_initializer='random_normal'))

optimizer = Adam(learning_rate=learning_rate)

model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=20,
                        verbose=1, mode='auto', restore_best_weights=True)

histories = model.fit(X_train, y_train, validation_data=(X_test, y_test),
                      callbacks=[monitor], verbose=2, epochs=100)

Let's look at how the model accuracy develops:

plt.plot(histories.history['accuracy'], 'bo')
plt.plot(histories.history['val_accuracy'], 'b')
plt.title('Training and validation accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.savefig("accu.png", dpi=400)
plt.show()

The training and validation accuracy plot shows that after reaching roughly 80% accuracy (around epoch 10), the model starts to overfit: we no longer see an improvement in the validation accuracy.
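
As a quick extra check (not in the original article), the loss curves tell the same story: validation loss stops improving while training loss keeps decreasing.

plt.plot(histories.history['loss'], 'bo')
plt.plot(histories.history['val_loss'], 'b')
plt.title('Training and validation loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper right')
plt.show()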

Let's evaluate model performance on data the model has not seen yet (the blind well). We generally expect a machine learning model to be less accurate on blind data than during training, especially when the dataset is small or the features do not capture the full complexity of the problem.

result = model.evaluate(scaled_features_blind, labels_blind)
print("{0}: {1:.2%}".format(model.metrics_names[1], result[1]))

Predict Blind Well Data and Plot

y_pred = model.predict(scaled_features_blind)  # result is a probability array
y_pred_idx = np.argmax(y_pred, axis=1) + 1
# +1 because facies numbering starts at 1, not 0 like the index
blind['Pred_Facies'] = y_pred_idx
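
For a per-facies breakdown on the blind well (not in the original article), scikit-learn's classification report and confusion matrix can be computed from the true and predicted facies codes:

from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(blind['Facies'].values, y_pred_idx))
print(confusion_matrix(blind['Facies'].values, y_pred_idx))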

The function to plot the comparison:

import matplotlib.colors as colors
from mpl_toolkits.axes_grid1 import make_axes_locatable


def compare_facies_plot(logs, compadre, facies_colors):
    # make sure logs are sorted by depth
    logs = logs.sort_values(by='Depth')
    cmap_facies = colors.ListedColormap(
        facies_colors[0:len(facies_colors)], 'indexed')

    ztop = logs.Depth.min(); zbot = logs.Depth.max()

    cluster1 = np.repeat(np.expand_dims(logs['Facies'].values, 1), 100, 1)
    cluster2 = np.repeat(np.expand_dims(logs[compadre].values, 1), 100, 1)

    f, ax = plt.subplots(nrows=1, ncols=7, figsize=(12, 6))
    ax[0].plot(logs.GR, logs.Depth, '-g', alpha=0.8, lw=0.9)
    ax[1].plot(logs.ILD_log10, logs.Depth, '-b', alpha=0.8, lw=0.9)
    ax[2].plot(logs.DeltaPHI, logs.Depth, '-k', alpha=0.8, lw=0.9)
    ax[3].plot(logs.PHIND, logs.Depth, '-r', alpha=0.8, lw=0.9)
    ax[4].plot(logs.PE, logs.Depth, '-c', alpha=0.8, lw=0.9)
    im1 = ax[5].imshow(cluster1, interpolation='none', aspect='auto',
                       cmap=cmap_facies, vmin=1, vmax=9)
    im2 = ax[6].imshow(cluster2, interpolation='none', aspect='auto',
                       cmap=cmap_facies, vmin=1, vmax=9)

    divider = make_axes_locatable(ax[6])
    cax = divider.append_axes("right", size="20%", pad=0.05)
    cbar = plt.colorbar(im2, cax=cax)
    cbar.set_label((5*' ').join([' SS ', 'CSiS', 'FSiS',
                                 'SiSh', ' MS ', ' WS ', ' D ',
                                 ' PS ', ' BS ']))
    cbar.set_ticks(range(0, 1)); cbar.set_ticklabels('')

    for i in range(len(ax)-2):
        ax[i].set_ylim(ztop, zbot)
        ax[i].invert_yaxis()
        ax[i].grid()
        ax[i].locator_params(axis='x', nbins=3)

    ax[0].set_xlabel("GR")
    ax[0].set_xlim(logs.GR.min(), logs.GR.max())
    ax[1].set_xlabel("ILD_log10")
    ax[1].set_xlim(logs.ILD_log10.min(), logs.ILD_log10.max())
    ax[2].set_xlabel("DeltaPHI")
    ax[2].set_xlim(logs.DeltaPHI.min(), logs.DeltaPHI.max())
    ax[3].set_xlabel("PHIND")
    ax[3].set_xlim(logs.PHIND.min(), logs.PHIND.max())
    ax[4].set_xlabel("PE")
    ax[4].set_xlim(logs.PE.min(), logs.PE.max())
    ax[5].set_xlabel('Facies')
    ax[6].set_xlabel(compadre)

    ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([])
    ax[4].set_yticklabels([]); ax[5].set_yticklabels([]); ax[6].set_yticklabels([])
    ax[5].set_xticklabels([])
    ax[6].set_xticklabels([])
    f.suptitle('Well: %s' % logs.iloc[0]['Well Name'], fontsize=14, y=0.94)

Run (facies_colors is the list of nine facies colors defined in the full notebook):

compare_facies_plot(blind, 'Pred_Facies', facies_colors)
plt.savefig("Compo.png", dpi=400)

Conclusion

In this work, we optimized neural network hyper-parameters using a Bayesian approach with the Scikit-Optimize (skopt) library. This approach is superior to random search and grid search, especially for complex datasets. Using this method, we can get rid of hand-tuning hyper-parameters for neural networks, although each run may return somewhat different parameters because of the stochastic nature of the search and the training.
