Large amounts of data are acquired daily from wells around the world. However, the quality of that data can vary significantly from missing data to data impacted by sensor failure and borehole conditions. This can have knock-on consequences on other parts of a subsurface project, such as delays and inaccurate assumptions and conclusions.
As missing data is one of the most common issues we face with well log data quality, numerous methods and techniques have been developed to estimate values and fill in the gaps. This includes the application of Machine Learning technology – which has increased in popularity over the past few decades with libraries such as TensorFlow and PyTorch.
In this tutorial, we will be using Keras, which is a high-level neural networks API that runs on top of TensorFlow. We will use it to illustrate the process of building a machine-learning model to allow predictions of bulk density (RHOB). This is a commonly acquired logging measurement. However, it can be significantly impacted by bad hole conditions or, in some cases, tools can fail, resulting in no measurements over key intervals.
We will start with a very simple model that does not normalise the inputs, even though normalisation is a common step in the machine learning workflow. We will then build a second model with normalised inputs and illustrate its impact on the final prediction result.
Importing Libraries and Loading Data
The first step in this tutorial is to import the libraries we will be working with.
For this tutorial, we need 5 libraries:
- pandas: To load and manipulate our dataset
- NumPy: To calculate the error metrics and build the reference line in our plots
- sklearn.model_selection: To create our training and testing data split
- TensorFlow: To build and run our neural network
- matplotlib: To visualise prediction results
These are imported as follows:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf
import matplotlib.pyplot as plt
Once we have imported the libraries, we need to load the data we will use to train and test our model.
For this tutorial, we will use a dataset containing a series of well log measurements from 3 wells in the Volve Field, located off the west coast of Norway. This data comes from the publicly available Equinor Volve Dataset.
Full details of this dataset can be found at the end of the article.
To read our CSV file, we simply call upon:
df = pd.read_csv("Data/Volve/VolveNN.csv")
df
When we view the dataframe, we can see what logging measurements are contained within it, and also the first and last 5 rows of the data.
For this tutorial, we will assume that all data preparation steps have been carried out and that the data has been quality-checked by a petrophysicist/geoscientist.
However, if we want to double-check that every column is fully populated and contains no null values, we can call df.describe(). When we do this, we need to check that the count row shows 24,111 for all columns/measurements.
It should be noted that ensuring we have quality data before applying machine learning is very important, as poor-quality data can lead to errors and misleading predictions.
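As a quick illustration of that check, the sketch below counts missing values on a small made-up dataframe. The toy frame and its values are hypothetical stand-ins for the Volve data; only the pandas calls (isna, count, dropna) are the point here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the well log data (values are made up)
toy = pd.DataFrame({
    "GR": [45.2, 60.1, np.nan, 88.7],
    "NPHI": [0.21, 0.25, 0.30, 0.28],
    "RHOB": [2.45, np.nan, 2.38, np.nan],
})

# Count missing values per column; a fully populated column reports 0
missing_per_column = toy.isna().sum()
print(missing_per_column)

# count() ignores NaN, which is why the count row of describe() exposes gaps
non_null_counts = toy.count()
print(non_null_counts)

# Keep only the rows where every measurement is present
complete_rows = toy.dropna()
print(len(complete_rows))
```

If every column reports the same non-null count as the number of rows, there are no gaps to deal with before modelling.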
Splitting Data into Training and Testing
For this tutorial, we will attempt to predict Bulk Density (RHOB). This logging measurement can sometimes be missing from well-logging datasets for various reasons. Some of these reasons include the data not being required for the objectives of drilling that well, or it could be simply excluded to save on drilling and logging costs.
Consequently, we often have to use existing well log datasets containing the RHOB measurement to build a machine learning model that can be used to predict the measurement within other wells where it was not acquired.
Our next step is to split our data in two.
The X variable holds the data that will be used as input for our model, and the y variable holds our target output, which in this case is RHOB.
# Define feature variables (X) and target variable (y)
X = df[['DTS', 'GR', 'NPHI', 'PEF', 'DT']]
y = df['RHOB']
We could carry on with the data as is and build, train and predict using it. However, we would not really be able to understand how well our model is performing.
This is where we split our data into two subsets. A subset for training the model, and a subset for validating and tuning the model. Ideally, we would have a third dataset for testing how well our model performs on completely unseen data.
However, for this example, we will stick with two subsets.
To split our data, we call upon the train_test_split function from sklearn and pass in our X and y variables.
We will also set the split to be 70% for training, and the remaining 30% for validating and fine-tuning our model. This can be varied depending on the size of your dataset. For example, with smaller datasets, you may want a larger training subset.
# Split the dataset into 70% training and 30% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
We can check that the split has worked by checking the length of X_train and X_test:
len(X_train), len(X_test)
Which returns a tuple with the sizes of the subsets:
(16877, 7234)
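To see where these two numbers come from, here is a minimal NumPy sketch of a shuffled 70/30 split. It is not the sklearn implementation; it assumes the test size is rounded up and the remainder goes to training, and the seed value of 42 is arbitrary.

```python
import math
import numpy as np

n_rows = 24111        # row count of the dataframe used in this tutorial
test_fraction = 0.3

# Assumption: the test size is rounded up and the remainder goes to training
n_test = math.ceil(n_rows * test_fraction)
n_train = n_rows - n_test
print(n_train, n_test)

# Shuffle the row positions with a seeded generator so the split is repeatable
rng = np.random.default_rng(42)
shuffled = rng.permutation(n_rows)
test_idx = shuffled[:n_test]
train_idx = shuffled[n_test:]

# Every row ends up in exactly one of the two subsets
print(len(train_idx) + len(test_idx) == n_rows)
```

With train_test_split itself, passing a fixed random_state value gives the same repeatability, which makes it easier to compare models between runs.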
In a standard workflow, we would normally normalise/standardise our data to account for the varying data ranges. However, for our first model, we will run with un-normalised data and then apply normalisation in our second model to see if it improves the results.
Building and Training a Keras Model
When building models with Keras, there are two main ways of creating a neural network model. These are the Sequential API and the Functional API.
With the Sequential method, we simply stack layers on top of each other in a linear manner, whereas the Functional API offers more flexibility and can be used to create more complex models which have multiple inputs and outputs and shared layers.
For this tutorial, we will be using the Sequential API, as it is the simplest to use and get started with.
Defining The Keras Neural Network Model
To get started with the Sequential API, we first create our model as follows.
It is often best to start simple and small when building neural networks, gradually increasing the complexity until you are happy with the results.
For this example, we are going to create a very simple neural network consisting of a single hidden layer with 8 neurons and relu as the activation function. This layer transforms our input data by applying weights, biases and the activation function, and then passes the result to the final output layer. The output layer has a single neuron and provides a numerical output representing the bulk density (RHOB) curve.
To find out more about the different activation functions and how they work, I recommend checking out the following page:
# Define a simple Neural Network using Keras Sequential API
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1)
])
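To make the role of the two layers more concrete, the sketch below reproduces the same layer shapes in plain NumPy. The weights are random and untrained, and the input batch is made up, so the output values mean nothing; the aim is only to show the arithmetic and the number of trainable parameters for 5 input curves.

```python
import numpy as np

rng = np.random.default_rng(0)

# 5 input curves, 8 hidden neurons, 1 output value
n_inputs, n_hidden, n_outputs = 5, 8, 1

# Randomly initialised weights and zero biases (illustrative values only)
W1 = rng.normal(size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, n_outputs))
b2 = np.zeros(n_outputs)

def forward(x):
    """Forward pass of a Dense(8, relu) layer followed by a linear Dense(1) layer."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU sets negative values to zero
    return hidden @ W2 + b2                 # no activation on the output layer

batch = rng.normal(size=(4, n_inputs))      # 4 hypothetical depth samples
prediction = forward(batch)
print(prediction.shape)

# Weights plus biases of both layers: (5*8 + 8) + (8*1 + 1)
n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)
```

The output has one value per input row, which is why the Keras prediction later comes back with a trailing dimension of size 1.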
Compiling The Keras Neural Network Model
Once we have defined our model, we next need to compile it. This configures and sets up how the model will learn.
To keep our model simple, we will use Mean Absolute Error (MAE) as the loss function (which is used to quantify how well the model is performing against the target feature) and the metric (also used to judge how our model is performing – but can be a more human-friendly score if using different loss functions).
We will also set the optimizer to ‘Adam’. This is a commonly used optimizer, and it determines how the model will update its weights based on the selected loss function.
# Compile the model
model.compile(loss='mae',
              optimizer='Adam',
              metrics=['mae'])
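For readers curious about what an optimizer such as Adam does with the loss, here is a minimal single-parameter sketch of the Adam update rule. It is a simplified illustration and not the Keras implementation; the function name adam_minimise, the toy objective (w - 3)^2 and the hyperparameter values are all assumptions chosen for the example.

```python
import math

def adam_minimise(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimise a one-parameter function with the Adam update rule (illustrative)."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)                        # gradient of the loss at the current weight
        m = beta1 * m + (1 - beta1) * g       # running average of the gradients
        v = beta2 * v + (1 - beta2) * g * g   # running average of the squared gradients
        m_hat = m / (1 - beta1 ** t)          # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy objective (w - 3)^2, whose gradient is 2 * (w - 3) and whose minimum is at w = 3
w_final = adam_minimise(lambda w: 2.0 * (w - 3.0), w=0.0)
print(round(w_final, 3))
```

In the Keras model the same idea is applied to every weight and bias at once, with the gradient coming from the MAE loss.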
Fitting / Training The Keras Neural Network Model
The final step of creating our model is to "fit" our model to the training data. This will begin the training process of our defined model.
We will also set the number of epochs to 30. One epoch represents a complete pass of the training data through the neural network. During each pass, the model weights are updated in order to minimise the selected loss function.
When building models, saving the model fit results to a history variable is a good idea. This will allow us to plot the results and keep a record of the training history.
history = model.fit(X_train, y_train, epochs=30,
                    validation_data=(X_test, y_test))
Once we start running the model, we will get the following text output detailing the progress of the model and how well each epoch is performing.
Once the model has completed, we can view the history in graphical form by generating a matplotlib figure like so.
# pandas creates its own figure when plotting, so the size is set here
pd.DataFrame(history.history).plot(figsize=(10, 8))
plt.xlabel('Epochs')
plt.ylabel('Mean Absolute Error (MAE)')
In the image above, the blue curve (loss) and the orange curve (mae) represent the training loss and performance metric (MAE).
Both curves decrease sharply at the beginning, suggesting that the model is learning and improving its performance in the initial epochs. However, as the number of epochs increases, the decrease in the loss and metric becomes more gradual. This may indicate that our model has reached convergence and arrived at its final solution.
Applying the Keras Model to the Test Data
Once our model has been trained, we can apply the model and predict the values within the target feature of the test subset.
This is done by calling upon the following code.
y_pred = model.predict(X_test)
Once we run this line, we will see Keras making its prediction.
227/227 [==============================] - 0s 1ms/step
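The 227 in this progress bar is the number of batches Keras worked through. As a quick check, assuming the default batch size of 32 (we did not pass batch_size anywhere), the step counts can be worked out from the subset sizes:

```python
import math

batch_size = 32                  # Keras default when batch_size is not passed
n_train, n_test = 16877, 7234    # subset sizes from the 70/30 split
epochs = 30

# The last batch may be partially filled, so the division is rounded up
steps_per_epoch = math.ceil(n_train / batch_size)
predict_steps = math.ceil(n_test / batch_size)

# Total number of batches processed over the whole training run
total_training_steps = steps_per_epoch * epochs

print(steps_per_epoch, predict_steps, total_training_steps)
```

A larger batch size would reduce the number of steps per epoch, at the cost of fewer weight updates per pass through the data.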
Assessing Model Performance
After our model has made its prediction, we can now evaluate how well it performs on the test data by calling upon a couple of metrics available within Keras.
These are the Mean Absolute Error (MAE), which represents the average absolute difference between the actual and predicted values, and the Root Mean Square Error (RMSE), which is the square root of the average squared difference and therefore penalises large errors more heavily.
In order to pass our predicted variable into these metrics, we first need to remove any extra dimensions of size 1 from the generated result. This ensures that the shapes of y_test and y_pred are the same.
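To see why the shapes matter, here is a small NumPy illustration with made-up values. It assumes the prediction comes back as a column of shape (n, 1) while the target is a flat array of shape (n,), which is the situation we have here.

```python
import numpy as np

y_true = np.array([2.45, 2.50, 2.38])            # shape (3,), like y_test
y_hat = np.array([[2.40], [2.55], [2.35]])       # shape (3, 1), like the Keras output

# Without squeezing, broadcasting compares every true value with every prediction
wrong = y_true - y_hat
print(wrong.shape)

# After squeezing, each prediction is compared only with its own true value
right = y_true - np.squeeze(y_hat)
print(right.shape)
```

The first result is a 3 x 3 grid of differences rather than 3 errors, so any average taken over it would not be the error we are after.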
mae_1 = tf.keras.metrics.mae(y_test, tf.squeeze(y_pred)).numpy()
# RMSE is the square root of the mean squared error
rmse_1 = np.sqrt(tf.keras.metrics.mse(y_test, tf.squeeze(y_pred)).numpy())
mae_1, rmse_1
When we view the metrics, we get the following scores back.
(0.07578269, 0.11054294)
This tells us that, on average, our predicted result is 0.0758 g/cc off from the actual result. The RMSE of 0.1105 g/cc is noticeably larger than the MAE, which indicates that we have some instances where our results are significantly different from the actual values.
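The difference between the two metrics is easier to see on a tiny example. The sketch below uses made-up density values and two hypothetical sets of predictions that share the same MAE; the helper names mae and rmse are just for this illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean of the absolute differences."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Square root of the mean of the squared differences."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([2.40, 2.45, 2.50, 2.55])

# Case 1: every prediction is off by the same small amount
even_errors = y_true + np.array([0.05, -0.05, 0.05, -0.05])

# Case 2: three perfect predictions and one large miss
one_outlier = y_true + np.array([0.0, 0.0, 0.0, 0.20])

print(mae(y_true, even_errors), rmse(y_true, even_errors))
print(mae(y_true, one_outlier), rmse(y_true, one_outlier))
```

Both cases have an MAE of about 0.05 g/cc, but the RMSE roughly doubles in the second case because squaring gives the single large miss more weight. A gap between RMSE and MAE is therefore a hint that a few predictions are well off.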
Visualising the Error Results
It is all good and well to look at metrics. However, judging how well our model performs from just these two numbers alone can be difficult.
One way we can visualise our results is with a simple scatter plot of the actual and predicted measurements.
def create_scatter_comparison(ytrue, ypreds):
    # Auto calculate the min and max scales for the data
    minscale = min(ytrue.min(), ypreds.min())*0.95
    maxscale = max(ytrue.max(), ypreds.max())*1.05
    plt.figure(figsize=(10, 10))
    plt.scatter(ytrue, ypreds)
    # Create a 1:1 relationship line
    line_points = np.linspace(minscale, maxscale, 100)
    plt.plot(line_points, line_points, c='red')
    plt.xlim(minscale, maxscale)
    plt.ylim(minscale, maxscale)
    plt.xlabel('Actual Measurements')
    plt.ylabel('Predicted Measurements')
    plt.show()
We can then call upon our plot by passing in the actual and predicted values.
create_scatter_comparison(y_test, y_pred)
When we run this, we get back the following plot.
Overall, the model is doing a good job of predicting bulk density (RHOB) – based on most data points sitting close to the 1:1 relationship line. However, there are a few areas where we could benefit from some improvement in the model.
We can see a higher spread of values between 2.2 g/cc and 2.6 g/cc, indicating that our model is under-predicting bulk density in this range.
Improving the Keras Model By Applying MinMaxScaler to the Input Data
Some machine learning models, including Neural Networks, perform better when the data is normalised to a standard range.
In well log measurements, we sometimes have data scaled from 0 to 0.5, and others that reach the 10s of thousands. This can result in some input curves having more weight compared to others.
To give each input curve equal footing when it comes to modelling, we need to change the input data to a standard range. This can also improve model training times and model prediction accuracy.
One way to normalise the values is by using the MinMaxScaler from sklearn.
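By default, this scaler will rescale each input curve to a range between 0 and 1.
Once the MinMaxScaler has been imported, we can then fit it to our X_train data and use it to transform both X_train and X_test. Fitting on the training data only keeps information from the test subset out of the scaling.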
We do not need to change the target feature.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
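Under the hood, min-max scaling with the default 0 to 1 range is simply (x - min) / (max - min) applied to each column. The sketch below reproduces that by hand in NumPy on made-up values, with the minimum and maximum taken from the training rows only, mirroring the fit and transform steps above.

```python
import numpy as np

# Hypothetical two-column inputs (for example GR and NPHI); values are made up
train = np.array([[50.0, 0.10], [100.0, 0.30], [150.0, 0.50]])
test = np.array([[75.0, 0.20], [200.0, 0.40]])

# "Fit": learn the per-column minimum and maximum from the training data only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

def scale(x):
    """Apply min-max scaling using the statistics learned from the training data."""
    return (x - col_min) / (col_max - col_min)

train_s = scale(train)
test_s = scale(test)
print(train_s)
print(test_s)
```

Note that the test value of 200 scales to 1.5 because it is larger than anything seen in training. This is expected behaviour, and it is a useful reminder that new wells can contain values outside the range the model was trained on.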
We can then pass in our newly scaled variables and re-run the model.
# Create a model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1)
])
# Compile model
model.compile(loss='mae',
              optimizer='Adam',
              metrics=['mae'])
# Fit the model
history = model.fit(X_train_s, y_train, epochs=30)
Again, Keras will report on the model’s progress as it is running.
Right away, we notice that the mae values are smaller than the original run. Hopefully, this should mean a better model.
We can then make a new prediction and re-assess the key metrics.
y_pred = model.predict(X_test_s)
# Use new variable names so the scores from the first model are not overwritten
mae_2 = tf.keras.metrics.mae(y_test, tf.squeeze(y_pred)).numpy()
rmse_2 = np.sqrt(tf.keras.metrics.mse(y_test, tf.squeeze(y_pred)).numpy())
mae_2, rmse_2
This gives us an MAE of 0.0293 g/cc and an RMSE of 0.0455 g/cc. Both are lower than the scores from the first model, which indicates that our model has improved by applying the MinMaxScaler.
(0.029252911, 0.04554793)
We can further confirm this by passing our new prediction results into the scatter plot and comparing the results with the previous model.
Even though the plots are on different scales, we can see a significant improvement in the 2.2 g/cc to 2.6 g/cc range. This confirms that applying the normalisation to our data has resulted in a better prediction result.
Summary
In this tutorial, we have seen how to build a very simple Keras Neural Network model to predict a common well log measurement using other available well log data. This can be extremely useful when we have missing data or data impacted by poor borehole conditions, such as washout.
Even though this tutorial stops after one revision to the model, it is always wise to try different variations of the model setup and different combinations of inputs.
Remember, the whole process of building a successful machine-learning model involves multiple iterations to arrive at the final model. Even after the model has been deployed, it can still be revised when new data becomes available.
Finally, making predictions using machine learning technologies should not be seen as a direct substitute for domain expertise. Instead, domain expertise should be used in conjunction with the modelling and prediction process. This ensures that accuracy is maintained and any erroneous results are caught.
Dataset Used
The data used within this tutorial is a subset of the Volve Dataset that Equinor released in 2018. Full details of the dataset, including the licence, can be found at the link below.
The Volve data licence is based on the CC BY 4.0 licence. Full details of the licence agreement can be found here:
https://cdn.sanity.io/files/h61q9gi9/global/de6532f6134b9a953f6c41bac47a0c055a3712d3.pdf?equinor-hrs-terms-and-conditions-for-licence-to-data-volve.pdf