Positive and Unlabeled Materials Machine Learning

How semi-supervised learning is used to accelerate materials synthesis

Nathan C. Frey, PhD
Towards Data Science


This post was co-authored with Vishnu Harshith from IIT Madras

Many real-world problems involve datasets where only some of the data is labeled and the rest is unlabeled. In this post, we discuss our implementation of semi-supervised learning for predicting the synthesizability of theoretical materials.


When we think about the materials that will enable next-generation technologies, it’s probably not the case that there is one ultimate material waiting to be found that will solve all our problems. The problems we need to solve (producing and storing clean energy, mitigating climate change, desalinating water, etc.) are complex and varied.

Even zooming in to the next generation of electronics, computers, and nanotechnology, there probably isn’t a single perfect material to exploit in the same way that silicon has been used in all our familiar devices. We’re also assuming that we don’t currently have all the materials and technology that we need to solve these problems, because we’re not exactly living in a Star Trek future just yet.

The solution is targeted searches that identify the best candidate material(s) for the application of interest. We’re now pretty good at simulating materials on the computer and getting an idea about whether a certain material can do what we want it to. We can even invent new, hypothetical materials and predict their properties.

Let’s say we invent a wonder material and think that, based on our simulations, it might be pretty cool. The next step is to go, hat in hand, to a friend in a laboratory and ask them nicely to try to make the thing. If the material is very similar to something else they’ve made, that might be easy. In general though, we have no idea what can be made in the lab. Synthesizing new materials is an expensive and time-consuming process that relies heavily on trial and error and “intuition,” i.e. senior scientists who have built a memory bank of what seems to work and what doesn’t. This is the Thomas Edison model of failing 1,000 times and hoping your eventual invention justifies all that effort. It seems there’s an opportunity to do a bit better.

Finding Mona Lisa materials

Our idea was to see if we could train a machine learning model to develop this sort of “intuition” by feeding it examples of materials that have been synthesized. We call these positive samples. All the other materials we dream up are unlabeled because we don’t know if they can be made or not. In this interview we compared the situation to detecting forged paintings. If a model sees many examples of original paintings (real, synthesized materials), it should be able to recognize an original versus a forgery (an unrealistic material that isn’t quite right and can’t be easily synthesized) when we show it something new.

We’re particularly interested in two-dimensional (2D) materials, so in our study we focused on a family of 2D materials called MXenes. The important thing to know is that MXenes are great because they 1) have tons of interesting properties; 2) can be made from all kinds of elements across the periodic table, so there are a lot of them; and 3) look really cool under a microscope.

Electron microscope image of MXene. From Wikimedia Commons.

Unfortunately, even though there are lots of possible MXenes, it is still time-consuming and expensive to discover new ones, so we don’t have as many positive examples as we would if we were classifying paintings. This is a persistent problem in materials science: it is expensive and sometimes impossible to collect more data. So we started by looking at MAX materials, the 3D parent materials of MXenes, where we have more examples of things that have already been made.

How it works

To deal with all these positive and unlabeled examples, we adapted a framework that is unsurprisingly called “positive and unlabeled learning” (PU learning) [1]. This sort of approach is more useful than you might think — it turns out lots of problems can be framed this way. Imagine you’re a data scientist and you have a list of customers who have bought your product (positive examples) and a list of potential customers who might or might not buy your product (unlabeled examples). Or maybe you’re a biologist with a list of proteins with some interesting characteristics, but you have no idea if other proteins have those properties or not.

Somehow you need to take what you know from the positive examples and apply that to all the unlabeled ones. In simple terms, with PU learning some unlabeled examples will randomly be labeled “negative.” Then a machine learning model — a decision tree classifier in our case — is trained to classify the positive and negative examples. The decision tree is made up of nodes, where each node represents an ‘if-then’ rule that splits the data according to its features. To augment the data we used bootstrapping, where samples are drawn from the data with replacement to create random subsets of the original data. This is repeated with different random sets of negative examples until the classifier becomes good at recognizing the positive examples. The model should “learn” the characteristics of the positive samples and be able to identify other positive samples in the unlabeled data.
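
To make this concrete, here is a minimal sketch of the bootstrapped PU approach described above, assuming scikit-learn and NumPy; the function name pu_bagging_scores and the hyperparameters are made up for illustration, and the actual pumml implementation differs in its details.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def pu_bagging_scores(X_pos, X_unlabeled, n_rounds=100, seed=0):
    # Repeatedly treat a random bootstrap of unlabeled samples as "negative,"
    # train a decision tree, and average the scores each unlabeled point
    # receives in the rounds where it was held out of training.
    rng = np.random.default_rng(seed)
    n_pos, n_unl = len(X_pos), len(X_unlabeled)
    scores, counts = np.zeros(n_unl), np.zeros(n_unl)
    for _ in range(n_rounds):
        idx = rng.choice(n_unl, size=n_pos, replace=True)  # bootstrapped "negatives"
        X_train = np.vstack([X_pos, X_unlabeled[idx]])
        y_train = np.concatenate([np.ones(n_pos), np.zeros(n_pos)])
        clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)
        oob = np.setdiff1d(np.arange(n_unl), idx)  # score only the held-out points
        scores[oob] += clf.predict_proba(X_unlabeled[oob])[:, 1]
        counts[oob] += 1
    return scores / np.maximum(counts, 1)  # averaged score per unlabeled sample

Each unlabeled material ends up with a score between 0 and 1 that can be read as a synthesizability likelihood.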


We calculated a bunch of properties for all the materials we were interested in, built and trained a positive and unlabeled machine learning model to recognize what is special about the synthesized materials, and then predicted which new materials should be synthesizable. Our model learned to use some information that we already know is a good indicator of synthesizability, like how much the atoms want to be bonded together. That’s good, because the model needs to be at least as good as our basic intuition that comes from physics and chemistry. But the model also found patterns in how the atoms are arranged, and how electrons are distributed in the material. More than that, the model makes decisions based on many more properties than we could reasonably think about at one time.

So first we make sure that the model captures the things we know it needs to capture from physics and chemistry, which agrees with our human drive to simplify things and explain things with one or two important factors. But we also make sure it’s doing something that we humans are bad at — looking at a more granular level of detail and considering hundreds of materials in seconds.
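
As a toy illustration of that kind of model inspection, a single decision tree reports how heavily it relied on each property through its feature importances. The feature names and random data below are made up purely for illustration:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

feature_names = ["formation_energy", "avg_bond_length", "n_valence_electrons"]
X = np.random.rand(200, len(feature_names))  # placeholder material features
y = np.random.randint(0, 2, size=200)        # placeholder positive/negative labels
clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
for name, imp in sorted(zip(feature_names, clf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.2f}")  # higher values mean the tree split on this feature more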

Our model found 18 new MXenes that we think should be synthesizable [2]. These 2D materials might have properties we haven’t seen before, and they could be used in next-generation batteries or electronic devices.

Classifying everything in the Materials Project database

Since we can now determine whether a given compound is synthesizable, the next question is: can we make predictions without having to generate lots of data about each theoretical material? Can we instead determine the synthesizability of a compound knowing only its chemical formula or crystal structure?

To tackle this problem, we needed vast amounts of data. Luckily, the Materials Project database contains more than 131,000 inorganic compounds and 49,000 molecules. We separated the compounds into two major categories: f-block compounds, which contain f electrons, and non-f-block compounds, which do not. This separation was done primarily because f electrons sometimes exhibit strange behavior and are hard to describe. We used Matminer to featurize these compounds, automatically generating extra information that isn’t directly available from the database.
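
For example, here is a minimal featurization sketch using Matminer’s composition featurizers; the "magpie" preset and the example composition are illustrative, and the exact features used in our models may differ:

from pymatgen.core import Composition
from matminer.featurizers.composition import ElementProperty

featurizer = ElementProperty.from_preset("magpie")  # statistics over elemental properties
comp = Composition("Ti2AlC")  # a known MAX phase, used here as an example
features = featurizer.featurize(comp)
print(list(zip(featurizer.feature_labels(), features))[:3])  # peek at a few features

Each compound becomes a fixed-length vector of statistics over its elements’ properties, which is exactly the kind of input a classifier expects.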

Basic pre-processing and hyperparameter tuning resulted in a true positive rate of 0.91 over the entire Materials Project database; that means our model correctly picks out materials that have already been synthesized 91% of the time.
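
As a reminder of what that metric means, here is how a true positive rate might be computed on held-out materials known to be synthesized; the scores below are made-up values, not model outputs:

import numpy as np

scores_for_known_positives = np.array([0.93, 0.88, 0.41, 0.97, 0.76])  # made-up scores
tpr = (scores_for_known_positives > 0.5).mean()  # fraction correctly classified positive
print(f"True positive rate: {tpr:.2f}")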

These trained models are now available to predict the synthesizability of new compounds. The user can give an input in the form of a Materials Project ID, a chemical formula, or a crystal structure, and the model will predict the synthesizability score of that compound. Here are some examples:

We can predict the synthesizability of any theoretical compound in the Materials Project database by providing its unique ID:

pup = PUPredict('your_api_key')
print(pup.synth_score_from_mpid('mp-1213718'))  # theoretical Cs2TbO3
print(pup.synth_score_from_mpid('mp-771359'))  # theoretical Cu2O3

Output:
[array([0.37218361])]
[array([0.49403711])]

We can also provide a chemical formula or crystal structure and the algorithm will predict the synthesizability of any compounds in the Materials Project that match:

pup.synth_score_from_formula('Ba2Yb2Al4Si2N10O4')

Output:
[array([0.04694946]), array([0.04952542])]

There are example Jupyter notebooks in the pumml repository for exploring more features, like building your own models using new data.

What’s Next?

This is just the beginning and there’s a lot to do. We looked at one family of 2D materials but there are many, many more out there. The Materials Project example shows how pumml can be used to predict the synthesizability of any material. We need more data, more testing, and most of all, experimental validation. We hope that other researchers will use pumml on new materials systems and to solve problems outside of synthesis too, in other situations where data is incomplete. We’re excited to see the future of materials synthesis, as we try to harness the power of artificial intelligence and machine learning to go beyond the Edisonian model of discovery.

Getting in touch

If you liked this explainer or have any questions, feel free to reach out to Nathan over email or connect on LinkedIn and Twitter.

You can reach out to Vishnu Harshith over email or connect on LinkedIn.

If you’re interested in more technical details, you can read the paper here and check out pumml on GitHub.

You can find out more about Nathan’s projects and publications on his website.

References

[1] Elkan, Charles, and Keith Noto. Learning classifiers from only positive and unlabeled data. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2008.

[2] Nathan C. Frey, Jin Wang, Gabriel Iván Vega Bellido, Babak Anasori, Yury Gogotsi, and Vivek B. Shenoy. Prediction of Synthesis of 2D Metal Carbides and Nitrides (MXenes) and Their Precursors with Positive and Unlabeled Machine Learning. ACS Nano 2019 13 (3), 3031–3041.
