Using unsupervised machine learning to uncover hidden scientific knowledge

Word2vec learns materials science from millions of abstracts

Vahe Tshitoyan
Towards Data Science


Credit: Olga Kononova

It has become increasingly difficult to keep up with the pace at which new scientific literature is published. It might take months for an individual researcher to do an extensive literature review, even on a single topic. What if a machine could read all of the papers ever published on a specific topic in minutes and tell scientists the best way forward? Well, we are still very far from that, but our research described below suggests a novel approach to utilizing the scientific literature for materials discovery with minimal human supervision.

In order for a computer algorithm to make use of natural language, words need to be represented in some mathematical form. In 2013, the authors of the Word2vec algorithm [1, 2] found an interesting way to automatically learn such representations from a large body of text. Words that appear in similar contexts often have similar meanings. Hence, if a neural network is trained to predict the neighbouring words of a target word, it will learn similar representations for similar target words. They showed that individual words can be efficiently represented as high-dimensional vectors (embeddings), and that semantic relationships between words can be expressed as linear vector operations (see here for a tutorial explaining Word2vec in more detail). One famous example of such a semantic relationship is the expression

“king” - “queen” ≈ “man” - “woman” (1),

where the subtraction is performed between the vectors of the corresponding words. This semantic relationship between the pairs of words on both sides of (1) represents the concept of gender.

Figure 1: Analogies between pairs of words are captured by linear operations between the corresponding embeddings. Figure on the right borrowed from [3].
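For readers who want to try this themselves, below is a minimal sketch of training such embeddings and testing the analogy in (1) with the gensim library (gensim 4.x API assumed). The toy corpus is purely illustrative; meaningful embeddings require millions of sentences.

```python
# Minimal sketch: train Word2vec on a (toy) tokenised corpus and query an
# analogy. Assumes the gensim 4.x API; the corpus here is only illustrative.
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "spoke", "to", "the", "man"],
    ["the", "queen", "spoke", "to", "the", "woman"],
    # ... millions more tokenised sentences in practice
]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# "king" - "queen" + "woman" should land near "man" once the gender
# relationship has been learned (it will not on this toy corpus).
print(model.wv.most_similar(positive=["king", "woman"], negative=["queen"], topn=3))
```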

Naturally, if instead of general text sources such as Common Crawl or Wikipedia we use purely scientific text, which in our case [3] was several million materials science abstracts, these vector operations embed more specialised knowledge. For instance,

“ZrO2” - “Zr” ≈ “NiO” - “Ni”,

where the above expression represents the concept of an oxide.
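The same kind of analogy query can be run against the domain-specific embeddings. The sketch below assumes a pre-trained gensim model loaded from disk; the file name is illustrative, and the actual pre-trained embeddings are available in the repository linked at the end of this post.

```python
# Hedged sketch: query the oxide analogy against pre-trained domain
# embeddings. The model path is an assumption; substitute the file shipped
# with the mat2vec repository.
from gensim.models import Word2Vec

w2v = Word2Vec.load("pretrained_embeddings")

# "ZrO2" - "Zr" + "Ni" should rank "NiO" highly if the oxide relationship
# is captured by the embeddings.
print(w2v.wv.most_similar(positive=["ZrO2", "Ni"], negative=["Zr"], topn=5))
```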

Another example of a semantic relationship is word similarity, determined by the dot product (projection) of the embeddings. In the original Word2vec model, the words “large” and “big” have vectors that are close to each other (have a large dot product) but far from the vector for “Armenia”. In our specialised model, the most similar word to “LiCoO2” is “LiMn2O4” — both are lithium-ion battery cathode materials. In fact, if we project the ≈12,000 most frequently mentioned materials (those with more than 10 mentions in the text) onto a 2D plane using t-SNE [4], we find that materials mostly cluster according to their applications and compositional similarity.

Figure 2: Materials used for similar applications as well as with similar chemical compositions cluster together. The most common elements in each “application cluster” match our materials science knowledge. Each chart on the bottom is obtained by counting chemical elements in the compositions of the materials from the corresponding application clusters. Figure borrowed from [3].
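The projection itself is straightforward. Below is a rough sketch using scikit-learn's t-SNE, where `materials` (the list of material names with more than 10 mentions) and the `w2v` model are assumed from the sketch above.

```python
# Rough sketch of the "materials map": project material embeddings to 2D
# with t-SNE [4]. `materials` and `w2v` are assumed to exist already.
import numpy as np
from sklearn.manifold import TSNE

vectors = np.stack([w2v.wv[m] for m in materials])        # (n_materials, dim)
coords = TSNE(n_components=2, metric="cosine").fit_transform(vectors)

# coords[:, 0] and coords[:, 1] can now be scatter-plotted; clusters tend to
# group materials with similar applications and compositions.
```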

Now, we can do something even more fun and colour the “materials map” in the top left corner of Figure 2 according to a particular application. Each dot, corresponding to a single material, can be coloured according to the similarity of its embedding with the embedding of an application word, e.g. “thermoelectric” (a word used to describe heat-to-electricity conversion and vice versa).

Figure 3: Materials “light up” according to their similarity to the application keyword. Figure borrowed from [3].

As many of you might have guessed, the brightest spots in the figure above are well known thermoelectric materials explicitly mentioned in the scientific abstracts alongside the word “thermoelectric”. However, some other bright spots have never been studied as thermoelectrics, so the algorithm is indicating a relationship that is not explicitly written in the text. The question is, can these materials be good thermoelectrics that are yet to be discovered? Surprisingly enough, the answer is YES!

One of the several ways we tested this hypothesis was by training word embeddings as if we were still in the past. We removed all abstracts published after a given cut-off year, one year at a time between 2000 and 2018, and trained 18 different models. We used each of these models to rank materials according to their similarity† to the word “thermoelectric” (the intensity of the colour in Figure 3), and took the top 50 that had not been studied as thermoelectrics as of that year. It turns out that many of these materials were subsequently reported as thermoelectrics in later years, as shown in the figure below.

Figure 4: If we went back in time one year at a time and made predictions using only the data available at that time, many of them would have come true by now. Each grey line corresponds to the predictions for a given year, and the solid red and blue lines are averages across all prediction years. Figure borrowed from [3].
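Conceptually, each of these year-by-year predictions amounts to sorting candidate materials by their similarity to the application keyword. The sketch below uses plain cosine similarity between word embeddings for simplicity; the ranking in the paper actually combines output and word embeddings (see the note at the end), and the filtering of materials already studied as thermoelectrics is omitted here.

```python
# Simplified sketch of the ranking step: sort candidate materials by their
# cosine similarity to the application keyword. `w2v` and `materials` are
# assumed from the earlier sketches.
keyword = "thermoelectric"

ranked = sorted(
    materials,
    key=lambda m: w2v.wv.similarity(m, keyword),  # cosine similarity
    reverse=True,
)
top_candidates = ranked[:50]  # the paper keeps the top 50 not-yet-studied materials
```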

In fact, one of the top 5 predictions in 2009 would have been CuGaTe2, which is considered one of the best present-day thermoelectrics but was only discovered in 2012. Interestingly, while our manuscript [3] was in preparation and under review, 3 of the 50 predictions we made with all of the available abstracts were also reported as good thermoelectrics.

So, how does this all work? We can get some clues by looking at the context words of the predicted materials, and seeing which of these context words have high similarities with both the material and the application keyword “thermoelectric”. Some of the top contributing context words for 3 of our top 5 predictions are shown below.

Figure 5: Context words for 3 of our top 5 predictions that contribute the most to the predictions. The width of the connecting lines is proportional to the cosine similarities between the words. Figure borrowed from [3].

Effectively, the algorithm captures context words (or, more precisely, combinations of context words) that are important for a material to be a thermoelectric. As materials scientists, we know, for instance, that chalcogenides (a class of materials) are often good thermoelectrics and that the presence of a band gap is crucial most of the time. We see how the algorithm has learnt this from co-occurrences of words. The graph above captures only first-order connections, but higher-order connections could also be contributing to the predictions.
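A rough way to reproduce this kind of analysis is to score candidate context words by how similar they are to both the material and the keyword. The scoring below (the minimum of the two cosine similarities) and the vocabulary cut-off are simplifications for illustration, not the exact procedure from the paper.

```python
# Hedged sketch of the analysis behind Figure 5: find context words that are
# similar to both a predicted material and the application keyword.
keyword = "thermoelectric"
material = "CuGaTe2"

scores = []
for word in w2v.wv.index_to_key[:20000]:        # most frequent words first
    s_mat = w2v.wv.similarity(word, material)
    s_key = w2v.wv.similarity(word, keyword)
    scores.append((word, min(s_mat, s_key)))    # must be close to both

top_context = sorted(scores, key=lambda x: x[1], reverse=True)[:10]
print(top_context)
```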

For scientific applications, natural language processing (NLP) is almost always used as a tool to extract already known facts from the literature rather than to make predictions. This is different from other areas such as stock value prediction, where, for instance, news articles about a company are analysed to predict how the value of its stock will change in the future. But even then, most methods feed the features extracted from the text into other, larger models that use additional features from structured databases. We hope that the ideas described here will encourage direct, unsupervised, NLP-driven inference methods for scientific discovery. Word2vec is not the most advanced NLP algorithm, so a natural next step could be to substitute it with newer, context-aware embeddings such as BERT [5] and ELMo [6]. We also hope that, since the methods described here require minimal human supervision, researchers from other scientific disciplines will be able to use them to accelerate machine-assisted scientific discovery.

Notes

†A crucial step in obtaining good predictions was to use output embeddings (the output layer of the Word2vec neural network) for materials and word embeddings (the hidden layer of the Word2vec neural network) for the application keyword. This effectively translates to predicting co-occurrences of words in the abstracts. The algorithm is therefore identifying potential “gaps” in the research literature, such as chemical compositions that researchers should study in the future for functional applications. See the supplementary materials of the original publication for more details.
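In gensim terms, this corresponds to taking the material's vector from the output layer and the keyword's vector from the hidden layer. A hedged sketch is shown below; the attribute names follow gensim 4.x with negative sampling and may differ between versions.

```python
# Hedged sketch of the input/output embedding trick: the dot product of a
# material's output-layer vector with the keyword's hidden-layer vector
# approximates how likely the two words are to co-occur in an abstract.
import numpy as np

def cooccurrence_score(w2v, material, keyword):
    out_vec = w2v.syn1neg[w2v.wv.key_to_index[material]]  # output-layer vector
    in_vec = w2v.wv[keyword]                              # hidden-layer (word) vector
    return float(np.dot(out_vec, in_vec))

ranked = sorted(materials, key=lambda m: cooccurrence_score(w2v, m, "thermoelectric"),
                reverse=True)
```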

The code we used for Word2vec training and the pre-trained embeddings are available at https://github.com/materialsintelligence/mat2vec. The default hyperparameters in the code are the ones used in this study.

Disclaimer

The work discussed here was performed while I was a postdoc at Lawrence Berkeley National Laboratory, working alongside an amazing team of researchers — John Dagdelen, Leigh Weston, Alex Dunn, Ziqin Rong, Olga Kononova, Kristin A. Persson, Gerbrand Ceder and Anubhav Jain.

Also big thanks to Ani Nersisyan for the suggested improvements to this story.

References

[1] T. Mikolov, K. Chen, G. Corrado & J. Dean, Efficient Estimation of Word Representations in Vector Space (2013), https://arxiv.org/abs/1301.3781

[2] T. Mikolov, I. Sutskever, K. Chen, G. Corrado & J. Dean, Distributed Representations of Words and Phrases and their Compositionality (2013), https://arxiv.org/abs/1310.4546

[3] V. Tshitoyan, J. Dagdelen, L. Weston, A. Dunn, Z. Rong, O. Kononova, K. A. Persson, G. Ceder & A. Jain, Unsupervised word embeddings capture latent knowledge from materials science literature (2019), Nature 571, 95–98

[4] L. van der Maaten & G. Hinton, Visualizing Data using t-SNE (2008), Journal of Machine Learning Research

[5] J. Devlin, M.-W. Chang, K. Lee & K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018), https://arxiv.org/abs/1810.04805

[6] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations (2018), https://arxiv.org/abs/1802.05365
