
GeoVec: word embeddings for geosciences

Analogies, categorisation, relatedness and spatial interpolation of word embeddings.

Recommended readings referenced in this post

Deep learning and Soil Science – Part 1

Deep learning and Soil Science – Part 2

Introduction

As I described in the first post on deep learning and soil science, part of a soil scientist's job is performed in the field and consists of describing what we see as thoroughly as possible. These records are important: they help us form a mental model of the study area and remind us of details that we might otherwise forget.

Here is (part of) a typical soil description:


Ap1 – 0 to 7 inches; brown (10YR 5/3) ashy fine sandy loam, very dark grayish brown (10YR 3/2) moist; weak fine granular structure; soft, very friable, nonsticky and slightly plastic, common very fine and fine roots; few very fine interstitial and tubular pores; 15 percent sand-size pumice less than 2.0mm in diameter; neutral (pH 6.6); clear smooth boundary. (0 to 8 inches thick)

Ap2 – 7 to 9 inches; brown (10YR 5/3) ashy fine sandy loam, very dark grayish brown (10YR 3/2) moist; weak medium granular structure; slightly hard, friable, nonsticky and slightly plastic; common very fine and fine roots; common fine and very fine tubular and interstitial pores; 20 percent sand-size pumice less than 2.0mm in diameter; neutral (pH 7.0); clear wavy boundary. (0 to 7 inches thick)


We can see that it is organised by layers and contains details such as colour, the presence of roots, descriptions of the pores, textural class (the estimated proportions of clay, silt and sand), etc. Most of the time the descriptions follow a recommended format, but they may contain more or less free-form text depending on the study.

These layer descriptions are usually accompanied by samples that are sent to the laboratory; the results can then be used, for example, to make maps of soil properties like the one I described in Digital Soil Mapping with CNNs. But the descriptions themselves are almost never used, although they contain important information. For instance, a layer with a dark colour could have a relatively high concentration of organic matter compared with a lighter layer (of course, there are exceptions).

Word embeddings

In order to use the valuable information contained in descriptions, my colleague Ignacio Fuentes and I decided to experiment with natural language processing (NLP) to generate domain-specific word embeddings. I’m not going to explain what word embeddings are but you can find good articles such as this introduction to word embeddings.

Model development

Since most language models take into account the co-occurrence of words, we used a relatively large corpus of 280,764 full-text articles related to geosciences. After pre-processing the text (tokenisation, removing stopwords, etc.), we fitted a GloVe model to generate word embeddings that we could use in other applications.
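The pre-processing step can be sketched in a few lines. The tokeniser and the tiny stopword list below are illustrative stand-ins, not the exact pipeline used in the paper:

```python
import re

# Illustrative stopword list; a real pipeline would use a full list (e.g. NLTK's).
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "with"}

def preprocess(text):
    """Lowercase, split into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

tokens = preprocess("The weak fine granular structure of the soil horizon")
# -> ['weak', 'fine', 'granular', 'structure', 'soil', 'horizon']
```

The cleaned token lists are what a GloVe implementation consumes to build its co-occurrence matrix.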

Evaluation of word embeddings

Downloading the 280,764 articles, pre-processing them and fitting the GloVe model was actually the easy part of the process. Since we needed to evaluate the generated vector space, we created a domain-specific test suite covering three tasks: analogy, relatedness and categorisation.

In case those tasks don’t sound familiar, here is a brief description and examples.

  • Analogy: Given two related word pairs, a:b and x:y, the task is to answer the question "a is to b as x is to ___?". An example related to minerals and their colours is "chalcanthite is to blue as malachite is to ___? (green)".
  • Relatedness: For a given pair of words (a, b), a human subject assigns a score of 0 if the words are unrelated or 1 if they are related. An example is the pair "(Andisol, Volcano)": since Andisols (a type of soil) are associated with volcanic areas, the relatedness score should be 1.
  • Categorisation: Given two sets of words s1 = {a, b, c, …} and s2 = {x, y, z, …}, the test checks whether a clustering algorithm can correctly assign each word to its corresponding group (see example in the results).
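The analogy task is typically answered with vector arithmetic: the word closest to v_b − v_a + v_x (excluding the query words) is the model's answer. A minimal sketch over a toy vocabulary, with hand-made 2-d vectors standing in for real GeoVec embeddings:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(vocab, a, b, x):
    """Answer 'a is to b as x is to ?' via v_b - v_a + v_x,
    returning the closest word excluding the query words."""
    target = vocab[b] - vocab[a] + vocab[x]
    candidates = {w: v for w, v in vocab.items() if w not in {a, b, x}}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

# Invented 2-d vectors: one axis for mineral identity, one for colour.
vocab = {
    "chalcanthite": np.array([1.0, 0.1]),
    "blue":         np.array([1.0, 1.0]),
    "malachite":    np.array([-1.0, 0.1]),
    "green":        np.array([-1.0, 1.0]),
    "quartz":       np.array([0.5, -1.0]),
}
answer = analogy(vocab, "chalcanthite", "blue", "malachite")
# -> "green"
```

The same routine works unchanged on real 300-dimensional vectors; only the vocabulary dictionary changes.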

Results

We compared our domain-specific embeddings (GeoVec) with general domain embeddings provided by the authors of GloVe and we observed an overall performance increase of 107.9%. Of course, this is an expected outcome considering the specificity of the tasks.

Analogies

If you are familiar with word embeddings (or you read this introduction to word embeddings), you have probably seen plots showing the relationships between capital cities and countries, king–male/queen–female, or other groups of analogies. In this work we were able to obtain similar results, but related to geosciences.

Fig. 1: Two-dimensional PCA projection of selected words. Simple syntactic relationship between particle fraction sizes and rocks (left panel) and advanced semantic relationship between rocks and rock types (right panel).

From the plot, any pair of related words can be expressed as an analogy. For example, from the left panel, it is possible to generate the analogy "claystone is to clay as sandstone is to ___? (sand)" and the first model output is indeed "sand". In the left panel it is possible to observe simple analogies, mostly syntactic since "claystone" contains the word "clay". The right panel presents a more advanced relationship where rock names are assigned to their corresponding rock type.

Categorisation

In the case of categorisation, the image below shows two examples where a k-means algorithm correctly differentiates groups of embeddings. The left panel shows groups of soil classes from two different classification systems (USDA and WRB). There is only one ambiguous soil class (Vertisols), which is present in both classification systems, but the embeddings encode that relationship correctly, placing the class between both groups. The right panel shows how the embeddings encode information at different aggregation levels: the same soils that were correctly differentiated into two groups in the left panel form a single cohesive group when compared with rock types in the right panel.
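The clustering itself can be done with plain k-means. Below is a self-contained sketch using Lloyd's algorithm over two invented groups of 2-d points standing in for soil-name and rock-name embeddings (the real evaluation runs on the 300-d vectors):

```python
import numpy as np

def kmeans(points, k, init, iters=20):
    """Plain Lloyd's algorithm; `init` gives the indices of the initial centres."""
    centres = points[list(init)].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # Assign each point to its nearest centre, then recompute the centres.
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centres = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels

# Two invented groups of 2-d points standing in for soil and rock embeddings.
soils = np.array([[0.0, 0.1], [0.1, 0.0], [0.05, 0.05]])
rocks = np.array([[1.0, 0.9], [0.9, 1.0], [0.95, 0.95]])
labels = kmeans(np.vstack([soils, rocks]), k=2, init=[0, 5])
# The three soil points share one label and the three rock points the other.
```

A production version would use `sklearn.cluster.KMeans`, which also handles empty clusters and multiple restarts; this sketch keeps the algorithm visible.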

Fig. 2: Two-dimensional PCA projection of selected categorisations. Clusters representing soil types from different soil classification systems (left panel) and a different aggregation level where the same soil types are grouped as a single cluster when compared with rocks (right panel).

Interpolating embeddings

Maybe you have seen some cool examples of latent space interpolation for images, such as the face interpolation example below, but it is not commonly used in the context of word embeddings.

Fig. 3: Linear interpolation in latent space between real images (Kingma and Dhariwal, 2018)

In the case of words, we wanted to explore whether the interpolated embeddings (the space "between" two words) yield meaningful intermediate concepts.

In order to generate the interpolated embeddings, we obtained linear combinations of two word embeddings using the formula:

v_int = α · v_a + (1 − α) · v_b

where v_int is the interpolated embedding and v_a and v_b are the embeddings of the two selected words. By varying α in the range [0, 1], we generated a gradient of embeddings. For each intermediate embedding, we calculated the cosine similarity against all the words in the corpus and selected the closest word.
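This interpolation and nearest-word lookup can be sketched over a toy vocabulary. The particle-size ordering is real; the 2-d vectors are invented so that the gradient is visible by construction:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def closest_word(vocab, v):
    """Word in the vocabulary with the highest cosine similarity to v."""
    return max(vocab, key=lambda w: cosine(vocab[w], v))

# Toy vocabulary: the particle-size ordering is real, the 2-d vectors invented.
vocab = {w: np.array([x, 1.0]) for w, x in
         [("clay", 0.0), ("silt", 0.2), ("sand", 0.4),
          ("gravel", 0.6), ("cobble", 0.8), ("boulder", 1.0)]}

v_a, v_b = vocab["boulder"], vocab["clay"]
path = [closest_word(vocab, alpha * v_a + (1 - alpha) * v_b)  # v_int
        for alpha in (1.0, 0.8, 0.6, 0.4, 0.2, 0.0)]
# path reproduces the size gradient: boulder, cobble, gravel, sand, silt, clay
```

With real embeddings the lookup scans the whole corpus vocabulary, but the logic is identical.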

Fig. 4: Interpolated embedding in a two-dimensional PCA projection showing a size gradient with "clay"<"silt"<"sand"<"gravel"<"cobble"<"boulder". Red and blue dots represent selected words ("clay" and "boulder") and black dots represent the closest word (cosine similarity) to the interpolated embeddings.

In Fig. 4 you can see a gradient of embeddings between the words "boulder" and "clay". These two extreme words correspond to different particle sizes, coarse and fine, respectively. A more complete list of particle sizes, in order from coarse to fine, is "boulder" > "cobble" > "gravel" > "sand" > "silt" > "clay". The resulting interpolated embeddings actually correspond (are closest) to those particle sizes, in the same order!

We were hoping to see this result (though we were still quite amazed!) because we wanted to explore the idea of spatially interpolating embeddings, which is closer to our field.

Spatial interpolation of embeddings

The idea of this project was to generate a 3D geological map based on borehole observations distributed as shown in the map below. The descriptions have associated coordinates and depths, similar to the soil description at the top of this post.

In order to perform the 3D interpolation, we first had to generate "description embeddings". Since this is a proof of concept, we decided to use a simple approach: calculating the mean of the word embeddings in each description, which yields a single embedding with 300 components (the dimensionality of GeoVec) for each description.
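The averaging step looks like this; the three-component vectors and tiny vocabulary are invented for illustration (GeoVec embeddings have 300 components):

```python
import numpy as np

def description_embedding(tokens, vocab):
    """Mean of the embeddings of the in-vocabulary tokens of one description."""
    vectors = [vocab[t] for t in tokens if t in vocab]
    return np.mean(vectors, axis=0)

# Hypothetical 3-component embeddings; GeoVec itself uses 300 components.
vocab = {"sandy": np.array([1.0, 0.0, 0.0]),
         "loam":  np.array([0.0, 1.0, 0.0]),
         "brown": np.array([0.0, 0.0, 1.0])}

emb = description_embedding(["brown", "sandy", "loam", "moist"], vocab)
# "moist" is out of vocabulary and is skipped; emb is [1/3, 1/3, 1/3]
```

Mean pooling discards word order, which is a known limitation; for a proof of concept it keeps every description comparable in a single fixed-size vector.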

After interpolating the embeddings, we obtained a 3D model that looks like:

Fig. 5: 3D lithological map of classes (left panel) and its associated measures of uncertainty. Uncertainties correspond to Confusion Index (CI) between the first and second most likely class (middle panel) and Entropy (right panel).

The left panel shows the most likely class per voxel (derived from a multi-class classifier), the middle panel shows the Confusion Index between the first and second most likely class per voxel, and the right panel shows the corresponding entropy.
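Both uncertainty measures have standard definitions: the confusion index compares the two highest class probabilities, and the entropy is the Shannon entropy of the full probability vector. A sketch with hypothetical per-voxel probabilities (the exact formulation in the paper may differ):

```python
import numpy as np

def confusion_index(probs):
    """CI = 1 - (p1 - p2), with p1, p2 the two highest class probabilities."""
    p1, p2 = np.sort(probs)[::-1][:2]
    return float(1.0 - (p1 - p2))

def shannon_entropy(probs):
    """Shannon entropy of a class-probability vector (natural log)."""
    p = np.asarray(probs)
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return float(-(p * np.log(p)).sum())

p = np.array([0.5, 0.4, 0.1])  # hypothetical per-voxel class probabilities
ci = confusion_index(p)        # 0.9: the top two classes are hard to tell apart
h = shannon_entropy(p)
```

A CI near 1 means the classifier could barely separate its top two candidates, while entropy summarises the spread over all classes at once.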

A complete description of the interpolation process and the multi-class classifier can be found in the corresponding publication (at the moment it is under review, but I will update this post after the publication has been accepted).

Final words

These are just the first attempts to use word embeddings in geosciences. The results were very interesting since the GloVe model seems to capture many "natural" properties (such as particle sizes) quite well. Of course, that is only possible if the corpus is large and diverse enough.

We made the embeddings public so people can use and experiment with them. We will also make the test suite available so that others can expand it with new and more complex tests, and use it as a baseline for new (and better!) models.

I see great potential for word embeddings in geosciences. A lot of descriptive information has been collected over the years that could be "rescued" and used. I also think it is possible to complement numerical data with word embeddings, so I will keep experimenting and writing about it. Stay tuned!

Citations

More details about this work can be found in the corresponding papers.

  • Padarian, J. and Fuentes, I., 2019. Word embeddings for application in geosciences: development, evaluation, and examples of soil-related concepts. SOIL, 5, 177–187, https://doi.org/10.5194/soil-5-177-2019.
  • Fuentes, I., Padarian, J., Iwanaga, T., Vervoort, W., 2019. 3D lithological mapping of bore descriptions using word embeddings. Computers & Geosciences (under review).

The embeddings are available at:

References

Kingma, D.P. and Dhariwal, P., 2018. Glow: Generative flow with invertible 1×1 convolutions. In Advances in Neural Information Processing Systems (pp. 10215–10224).

