
Explore Better Materials Using Deep Graph Convolutional Networks and Bayesian Optimization

Distill the knowledge from theoretical calculations to design materials and minimise the number of backbreaking experiments.

Do you want better materials? Yes, we do. We want better batteries with larger capacity and a longer lifetime. Or better solar panels with a higher energy-generation rate. Or better semiconductors with less Joule heating loss. All of these performance limits stem from the materials the products are made of. So yes, we do want better materials, for every product.

Motivation

But how can we find them among a practically infinite number of existing materials? One way is to exploit a simulation database such as the Materials Project. This type of database compiles material characteristics computed with DFT calculations, which helps narrow the scope of exploration. However, DFT is known to predict several properties poorly, band gaps among them, so we cannot take those values at face value. Moreover, according to the Crystallography Open Database, the number of catalogued crystals has reached 476,995 so far, so checking every candidate's performance experimentally is impossible. I wish there were a good experimental database and a search engine for materials discovery! Sadly, none exists (at least not yet). So we should create a way to transfer the knowledge from a theoretical database to an experimental crystal-structure database. With the aid of machine learning, let's consider a methodology to explore desired materials efficiently.


Problem Setting

Let's consider exploring a material with a desirable bandgap. When we want a better semiconductor, this is the most fundamental characteristic to guide the search. The desired bandgap value varies with the design of the product. As such, we can simplify the semiconductor materials discovery problem as follows:

When we set E as the desirable bandgap value, how can we find the crystal whose bandgap is as close as possible to E, from the crystal database, with minimal exploration trials?

Let's make this question more concrete. The things we need to clarify are the following:

  1. How accurately do we need to get closer to the target value?
  2. Which metric is the most suitable?
  3. How can we construct the dataset?
  4. To what extent do we need to explore?

Firstly, we should estimate the target value within a 0.01 eV error. (eV, the electron volt, is the unit of the bandgap.) A bandgap can vary on this scale depending on the manufacturing method or conditions, so pursuing higher precision would be in vain. Secondly, we can adopt the MAE (mean absolute error) metric when we emphasise the absolute value rather than the ratio. As its name suggests, it is the average of the absolute differences between the target value and the estimated values. Thirdly, we can use a CIF (Crystallographic Information File) dataset. CIF is the international format for describing crystal structures; it contains the essential structural information but is not directly quantitative. As the data source, we can use the Materials Project database as a pseudo-experimental dataset: if we can find a material with bandgap E in this hypothetical experimental dataset, we can apply the same strategy to make an actual experiment more efficient. Lastly, considering a realistic situation, we can survey scientific papers to collect preliminary measurements of the targeted value; let's say their number is about 100. Besides, we can exclude many candidates for other undesirable aspects, such as synthesis cost or reactivity. This way, we obtain 100 pieces of prior information and narrow the exploration candidates down to about 6,000. Then, we can conceptualise the experimental material exploration problem as follows:

When we set E as the desirable bandgap value, how can we find, with minimal exploration trials, a crystal whose bandgap is within an MAE of 0.01 eV of E among about 6,000 candidates, exploiting the 100 pieces of prior information?
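For concreteness, the success criterion above can be written as a tiny check (plain Python; the numbers are illustrative, not real measurements):

```python
def mae(targets, estimates):
    """Mean absolute error between target and estimated bandgaps [eV]."""
    return sum(abs(t - e) for t, e in zip(targets, estimates)) / len(targets)

E = 2.534       # desired bandgap
found = 2.531   # bandgap of a discovered crystal
print(mae([E], [found]) < 0.01)  # True: within the 0.01 eV tolerance
```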


Machine Learning Strategies

Bayesian Optimisation

As a quick Google search reveals, Bayesian Optimisation seems well suited to efficient exploration. Bayesian Optimisation is an algorithm designed to find better data points in minimal trials. For instance, a Google research team applied this algorithm to smartly optimise a chocolate chip cookie recipe. This seems practical, so we should apply it to materials discovery. But wait: we require descriptors, in other words, a set of variables that identifies each unique crystal. The Google team used the quantitative values of each step of the cookie recipe, such as the gravimetric ratio of tapioca starch. How can we quantify crystals?

Crystal Graph Convolutional Neural Network

One easy method is to utilise a pre-trained deep learning model for crystals. CGCNN, the Crystal Graph Convolutional Neural Network, is a pioneering deep learning architecture in materials science. The authors' GitHub repository provides the pre-trained models, and everyone can use them. Looking into the prepared model folder, we find the bandgap model (band-gap.pth.tar). By using this model as a feature extractor, we can convert CIF files into 128 quantitative descriptors on autopilot.

Principal Component Analysis

Unfortunately, 128 descriptors are too many for Bayesian Optimisation. Although there are many cutting-edge algorithms for high-dimensional optimisation, a lower-dimensional space generally works better without extra effort. Besides, these 128 descriptors merely identify the crystal quantitatively, so high dimensionality is not essential. Consequently, we can reduce the dimensionality with PCA (principal component analysis). By reducing 128 dimensions to 3, we obtain a much more efficient exploration space.


The Code

Python Library Requirements:

  • pymatgen
  • pytorch
  • scikit-learn
  • GPyOpt
  • GPy

Dataset Construction

We will use the Materials Project API to construct the dataset. First, create an account on the Materials Project and get an API key; this can be done by following the official instructions. Then, we will compile two datasets, one for the prior information and one for exploration. You should change MY_API_KEY to your own key.
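A minimal sketch of such a query is shown below, using pymatgen's legacy MPRester client. The helper names (split_prior_experiment, write_dataset) and the random prior/experiment split are my assumptions based on the text, not the article's original code:

```python
import csv
import os
import random

def split_prior_experiment(records, n_prior=100, seed=0):
    """Shuffle the queried records and split them into a 'prior' set
    (known measurements) and an 'experiment' set (candidates)."""
    records = list(records)
    random.Random(seed).shuffle(records)
    return records[:n_prior], records[n_prior:]

def write_dataset(folder, records):
    """Write one CIF file per crystal plus the id_prop.csv label file
    that CGCNN expects (material id, bandgap)."""
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, "id_prop.csv"), "w", newline="") as f:
        writer = csv.writer(f)
        for r in records:
            cif_path = os.path.join(folder, f"{r['material_id']}.cif")
            with open(cif_path, "w") as g:
                g.write(r["cif"])
            writer.writerow([r["material_id"], r["band_gap"]])

MY_API_KEY = ""  # paste your Materials Project API key here

if __name__ == "__main__" and MY_API_KEY:
    from pymatgen.ext.matproj import MPRester  # legacy API client
    with MPRester(MY_API_KEY) as mpr:
        results = mpr.query(
            criteria={"band_gap": {"$gte": 2.3, "$lte": 2.8}},
            properties=["material_id", "band_gap", "cif"],
        )
    prior, experiment = split_prior_experiment(results, n_prior=100)
    write_dataset("cif_prior", prior)
    write_dataset("cif_experiment", experiment)
```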

In this code, we search for crystals with a bandgap between 2.3 and 2.8 eV, which yields 6,749 materials. They are then divided into two folders, "cif_prior" and "cif_experiment", containing 100 and 6,649 CIF files, respectively. Besides, the bandgap values are stored as id_prop.csv in each folder.

Convert CIF to 128 descriptors using the pre-trained CGCNN model

You can git clone the CGCNN repository by following the official instructions. You need to copy atom_init.json to both the "cif_prior" and "cif_experiment" folders. Then, you can create the feature-extraction code by modifying predict.py. I created extract_feature.py based on the validate function in predict.py. The code is too long to reproduce here, so I will show only the modified parts.

Firstly, the modified part of the main function is its last block, like this.

Then, modify the middle part of validate function like this.
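As an alternative to editing validate directly, the same 128-dimensional feature can be captured with a forward hook on the model's final linear layer (the attribute is named fc_out in the CGCNN code; treating it as such is an assumption here). The tiny stand-in network below only makes the sketch runnable without the repository; with the real code you would register the hook on the model loaded from band-gap.pth.tar:

```python
import torch
import torch.nn as nn

features = {}

def capture_fc_in(module, inputs, output):
    # inputs[0] is the crystal feature vector entering the final layer,
    # i.e. the 128 descriptors we want to export.
    features["crys_fea"] = inputs[0].detach()

# Stand-in with the same attribute name as CGCNN's last layer (fc_out).
class StandInNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Linear(64, 128)
        self.fc_out = nn.Linear(128, 1)

    def forward(self, x):
        return self.fc_out(torch.relu(self.body(x)))

model = StandInNet()
model.fc_out.register_forward_hook(capture_fc_in)
_ = model(torch.randn(4, 64))      # one forward pass per batch
print(features["crys_fea"].shape)  # a (batch, 128) descriptor matrix
```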

Then, you can execute extract_feature.py with the following arguments.

python3 extract_feature.py ./pre-trained/band-gap.pth.tar ./cif_experiment

Thus you obtain the 128 descriptors as cgcnn_features.csv. We should create the features for both "cif_prior" and "cif_experiment", then rename cgcnn_features.csv to cgcnn_features_prior.csv and cgcnn_features_experiment.csv, respectively.

Reduce dimensions from 128 to 3 by PCA

We will convert the 128 features into 3-dimensional data. As before, execute the code twice, once per dataset, and rename the output cgcnn_pca.csv to cgcnn_pca_prior.csv and cgcnn_pca_experiment.csv.
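A minimal sketch of this reduction with scikit-learn is shown below. Random arrays stand in for the two CGCNN feature tables; note that fitting the PCA once on the prior set and reusing it for the experiment set keeps both datasets in the same 3-D space, which is my assumption about the intended workflow:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-ins for the 128-descriptor tables produced by CGCNN.
feats_prior = rng.normal(size=(100, 128))
feats_experiment = rng.normal(size=(6649, 128))

# Fit once, then project both tables into the same 3-D space.
pca = PCA(n_components=3)
pca_prior = pca.fit_transform(feats_prior)
pca_experiment = pca.transform(feats_experiment)

print(pca_prior.shape, pca_experiment.shape)  # (100, 3) (6649, 3)
```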

Bayesian Optimisation

Lastly, we will explore better materials using Bayesian Optimisation. The experimental setup is done by defining the following class instances.

All the preparation is done. The next code will automatically explore better materials. In this setting, the target bandgap was set to E = 2.534 eV. Our desired materials should lie within an MAE of 0.01 eV, so the target range is between 2.524 and 2.544 eV. The Bayesian Optimisation loop repeats the exploration 30 times, in accordance with n_experiment. The discovered materials and the corresponding values are stored in the following instance attributes:

  • the explored bandgap values; exp.explored_bandgaps
  • the crystal names; exp.crystals
  • the cumulative loss curve; results

You can freely visualise or export these results by adding the code based on these instances and variables.
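Since the original class code is not reproduced here, the sketch below shows the same discrete exploration loop using scikit-learn's Gaussian process with a hand-written expected-improvement acquisition as a stand-in for GPyOpt. The descriptors and bandgaps are synthetic assumptions, not real Materials Project data:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
E = 2.534  # target bandgap [eV]

# Synthetic stand-ins: 3-D PCA descriptors and "true" bandgaps in 2.3-2.8 eV.
X = rng.normal(size=(500, 3))
gaps = 2.3 + 0.5 / (1.0 + np.exp(-X @ np.array([1.0, -0.5, 0.3])))

observed = list(rng.choice(len(X), size=10, replace=False))  # prior data
prior_best = np.abs(gaps[observed] - E).min()

for _ in range(30):  # exploration budget (n_experiment)
    y = np.abs(gaps[observed] - E)  # loss: distance to the target gap
    gp = GaussianProcessRegressor(normalize_y=True).fit(X[observed], y)
    cand = [i for i in range(len(X)) if i not in observed]
    mu, sd = gp.predict(X[cand], return_std=True)
    # Expected improvement for minimisation of the loss.
    z = (y.min() - mu) / np.maximum(sd, 1e-9)
    ei = sd * (z * norm.cdf(z) + norm.pdf(z))
    observed.append(cand[int(np.argmax(ei))])  # "synthesise and measure"

best = np.abs(gaps[observed] - E).min()
print(f"best |gap - E| after 30 trials: {best:.4f} eV")
```

Because candidates are never removed once measured, the best loss can only improve monotonically over the prior set's best.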

In my setting, Bayesian Optimisation found a desired material within 5 trials. This method thus seems useful in an actual materials exploration scheme. Enjoy the materials discovery!

