Using data aggregation techniques can help us transform an overwhelming and almost incomprehensible numeric dataset into something that is easily digestible and much more reader-friendly. The process of data aggregation involves summarising multiple data points into single metrics that can be used to provide a high-level overview of the data.
One way we can apply this process within petrophysics and Geoscience is to summarise the lithological composition of geological formations that have been interpreted from well log measurements.
Within this short tutorial, we will see how we can take a large dataset consisting of 90 plus wells from the Norwegian Continental Shelf, and extract the lithology composition of the Zechstein Group.
Importing Libraries and Loading Data
To begin, we first need to import the pandas library, which will be used to load our datafile from CSV and carry out the aggregations.
import Pandas as pd
Once the pandas library has been imported, we can then read the CSV file using pd.read_csv()
.
The data we are going to be using is from the combined XEEK and Force 2020 Machine Learning competition which aimed to predict lithology from well log measurements. The dataset we are using represents all of the available training data. Further details of this dataset can be found at the end of the article.
As this CSV file contains data that is separated by a semi-colon rather than by a comma, we need to pass in a colon to the sep
parameter.
df = pd.read_csv('data/train.csv', sep=';')
We can then run this code to begin the loading process. As we have a large dataset (11 million + rows), this may take several seconds. But, once it has finished loading, we can view our dataframe by calling upon the df
object. This will return our dataframe and show the first five and last five rows from it.
Using pandas .map() to Convert Numeric Codes to Lithology Strings
Within this dataset, the lithology data is stored within the FORCE_2020_LITHOFACIES_LITHOLOGY
column. However, when we look closely at our data, we will see that the lithology values are encoded numerically. Unless you know the key, it will be hard to decipher what number represents what lithology.
Luckily for this dataset, we have the key and can create a dictionary with the key and lithology pairs.
lithology_numbers = {30000: 'Sandstone',
65030: 'Sandstone/Shale',
65000: 'Shale',
80000: 'Marl',
74000: 'Dolomite',
70000: 'Limestone',
70032: 'Chalk',
88000: 'Halite',
86000: 'Anhydrite',
99000: 'Tuff',
90000: 'Coal',
93000: 'Basement'}
To apply this to our dataset, we can use the pandas map()
function, which will perform a lookup using our dictionary and then assign the correct lithology label to the numeric value.
df['LITH'] = df['FORCE_2020_LITHOFACIES_LITHOLOGY'].map(lithology_numbers)
Once this has run, we can view the dataframe again to make sure the mapping has been successful and a new LITH column has been added to the end of the dataframe.
Filtering the Dataframe for a Specific Geological Group
As we have a rather large dataset with 11,705,511 rows, it would be good to focus on a particular geological group for our lithology compositional analysis.
In this case, we will subset the data and look at the Zechstein Group.
We can do this by using the query()
method and passing in a simple string: GROUP == "ZECHSTEIN GP."
df_zechstein = df.query('GROUP == "ZECHSTEIN GP."')
df_zechstein.WELL.unique()
We can check how many wells we have in the subset by calling upon df_zechstein.WELL.unique()
, which returns the following array containing 8 wells.
array(['15/9-13', '16/1-2', '16/10-1', '16/11-1 ST3', '16/2-16', '16/2-6',
'16/4-1', '17/11-1'], dtype=object)
As we are only interested in the lithologies, we can simply extract the Well name and the lithology columns. This will also make it easier to carry out the aggregation.
df_zechstein_liths = df_zechstein[['WELL', 'LITH']]
Aggregating the Data Using Chained Pandas Functions
Now that we have the data in a format that can be worked with we can begin the aggregation process. For this, we are going to chain multiple pandas methods together in a single.
First, we will group the data by the WELL column using the groupby
function. This essentially makes subsets of the dataframe for each unique well name within the WELL column
Next, we will count the occurrences of each lithology type within each group. The normalize=True
part means it will give a proportion (between 0 and 1) rather than absolute counts. For example, if in one well (group), ‘Sandstone’ occurs 5 times and ‘Shale’ occurs 15 times, the function will return 0.25 for ‘Sandstone’ and 0.75 for ‘Shale’ instead of 5 and 15.
Finally, we need to rearrange our resulting dataframe so that the row index contains the well names and the columns contain the lithology names. If a well did not have any instances of a certain lithology then they are filled with zero, due to fill_value=0
.
summary_df = df_zechstein_liths.groupby('WELL').value_counts(normalize=True).unstack(fill_value=0)
summary_df
What we get back is the following dataframe with the decimal proportions of each lithology within each of the wells.
If we want to view these as percentages, we can change how they are displayed using the following code:
summary_df.style.format('{:.2%}')
When we run the code, we get back the following dataframe which provides a more readable table and can be incorporated into a report.
Applying this styling does not change the actual values. They will still be stored as their decimal equivalents.
If we do want to change the values permanently to percentages, we can do so by multiplying the dataframe by 100.
summary_df = summary_df * 100
summary_df
Once the data is in this format, we can use it to create something similar to the infographic below, which shows the percentages for the lithologies for each of the wells.
Summary
Within this short tutorial, we have seen how we can take a large collection of well log data (from 90 plus wells) and extract and summarise a particular geological group. This allows us to understand the lithological composition of the geological group in an easy-to-read and understandable format that can be incorporated into a report or presentation.
Dataset Used in this Tutorial
Training dataset used as part of a Machine Learning competition run by Xeek and FORCE 2020 (Bormann et al., 2020). This dataset is licensed under Creative Commons Attribution 4.0 International.
The full dataset can be accessed at the following link: https://doi.org/10.5281/zenodo.4351155.
Thanks for reading. Before you go, you should definitely subscribe to my content and get my articles in your inbox. You can do that here!
Secondly, you can get the full Medium experience and support thousands of other writers and me by signing up for a membership. It only costs you $5 a month, and you have full access to all of the fantastic Medium articles, as well as the chance to make money with your writing.
If you sign up using my link, you will support me directly with a portion of your fee, and it won’t cost you more. If you do so, thank you so much for your support.