GSoC 2021 with ML4SCI | The NMR Project

A short summary of my Google Summer of Code-2021 project, the experience, the big takeaways and advice for upcoming young developers.

Anantha S Rao
Towards Data Science

This blog is a short summary of my Google Summer of Code (GSoC) 2021 project under the ML4SCI organization. GSoC is a 10-week global program focused on bringing more student developers into open source software development. This year marked the 17th anniversary of Google Summer of Code, which saw 6,991 proposals from 4,795 students from 103 countries across the globe, out of whom 1,286 students were given an opportunity to work with 199 open source organizations.[1]

GSoC is all about contributing to Open Source Software | Image by Kiran Anagani from Unsplash

Project description and Motivation

This section is for a general audience. A more detailed physical treatment is available in the following sections.

At low temperatures, many materials transition into an electronic phase which cannot be classified as a simple metal or insulator. Novel quantum phases of matter like superconductors and spin liquids are harder to study due to their fragile nature, making non-intrusive and indirect measurements important. Scientists therefore use Nuclear Magnetic Resonance (NMR), a non-intrusive method, to probe these materials externally. NMR is an experimental technique mainly used in quality control and scientific research to determine the molecular structure, purity, and content of a sample. The GSoC project idea was to explore the connection between electronic phases and nuclei in these materials via simulations of NMR. We analyze the time-evolution of nuclear spins due to external magnetic pulses and classify them using suitable Machine Learning models.[2]

Basic pipeline of the GSoC project. We simulate echo experiments on a computer and use the nuclear magnetization curves to predict the underlying parameters and classify the kind of material.
Workflow of the Project: We simulate NMR spin-echo experiments using different material parameters to obtain the average nuclear magnetization of the material as a function of time (as seen during an NMR experiment). We then use Machine Learning on the dataset to predict the parameters and classify the type of material, respectively | Image by Author

ML4SCI and the NMR Project

Image Credits : GSoC & ML4SCI

Organization description

Unlike most other organizations participating in Google Summer of Code, I feel that ML4SCI is unique in both its methods and objectives. While most organizations look for developers to build up their code repositories, resolve bugs and add new features, the primary objective of ML4SCI is to solve open-ended research questions in the basic sciences by applying Machine Learning through free and open source software. This year the projects ranged from deep learning in gravitational physics and astronomy to quantum machine learning, machine learning for fluid turbulence, novel quantum materials, etc.

Project Code Repository

I have compiled my work into a single open-source repository. Titled “Decoding Quantum States with NMR”, it contains a set of tutorials to read, preprocess and extract features from the NMR simulation data. We also describe how to train a random forest classification/regression model and then analyze the feature importance to interpret the underlying features in the time-series.

Why ML4SCI?

As a Physics major, I believe that my background and interests in condensed matter physics, quantum mechanics, and artificial intelligence align well with the philosophy and interests of the ML4SCI group, and especially with the NMR-GSoC research project. I have previously worked on the application of deep learning in the basic sciences and believe that Python is the juggernaut in open-source software for science and engineering. Lastly, as an advocate for free and open-source software, it is always an absolute honor to build tools for open science and society.

Spin Echo in NMR (Some Physics)

This section is more technical and can be skipped on a first reading.

Magnetic resonance occurs in quantum systems when a magnetic dipole is exposed to two external magnetic fields: (1) a static magnetic field and (2) an oscillating electromagnetic field. The oscillating field can make the dipole transition between its energy states with a certain probability. When the frequency of the oscillating field leads to the maximum possible transition probability between any two energy states, magnetic resonance is said to have been achieved. In Nuclear Magnetic Resonance, we use this fundamental phenomenon to measure various properties of the desired material. [3]

A common technique used to probe molecular properties in NMR is called the “Spin Echo”. The nuclear spins are aligned, let loose, and then refocused, making a sharp peak, or “echo”, of the original alignment. When the spins interact with each other through spin-spin coupling, the refocused echo in the magnetization can become highly distorted. This technique of studying the magnetization curve to determine electronic phases is used often by scientists and researchers. We use these simulated Nuclear Magnetization curves to train a machine learning model to predict the model parameters and the kind of the underlying material.
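
To make the "aligned, let loose, and then refocused" picture concrete, here is a minimal toy sketch. It is not the simulation code used in the project: it assumes non-interacting spins that differ only by a random frequency offset, an ideal π pulse at time `tau`, and illustrative parameter values of my own choosing. Spin-spin coupling, which distorts the echo in the real simulations, is deliberately left out.

```python
import numpy as np

def spin_echo_magnetization(n_spins=2000, tau=1.0, n_steps=301, spread=20.0, seed=0):
    """Return (t, |M(t)|) for non-interacting spins with random frequency offsets.

    An ideal pi pulse at t = tau inverts the phase each spin has accumulated,
    so all phases cancel again at t = 2 * tau, which is where the echo appears.
    """
    rng = np.random.default_rng(seed)
    # Hypothetical spread of precession frequencies (radians per unit time).
    offsets = rng.normal(0.0, spread, size=n_spins)
    t = np.linspace(0.0, 3.0 * tau, n_steps)
    # Before the pulse the phase is w * t.
    # After the pulse it is -w * tau + w * (t - tau) = w * (t - 2 * tau).
    phase = np.where(
        t[None, :] < tau,
        offsets[:, None] * t[None, :],
        offsets[:, None] * (t[None, :] - 2.0 * tau),
    )
    # Average transverse magnetization: magnitude of the mean unit phasor.
    magnetization = np.abs(np.exp(1j * phase).mean(axis=0))
    return t, magnetization

t, m = spin_echo_magnetization()
# Look for the echo in the second half of the record (after the initial decay).
half = len(t) // 2
echo_index = half + int(np.argmax(m[half:]))
print(f"echo near t = {t[echo_index]:.2f}, |M| = {m[echo_index]:.3f}")
```

In this toy model the magnetization decays as the spins dephase and then recovers at twice the pulse time. In the project's simulations, it is the deviation from this clean refocusing that carries information about the couplings.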

Although the NMR “spin echo” technique may sound complicated, the following animation created by Gavin W Morley (https://en.wikipedia.org/wiki/Spin_echo) makes it much clearer!

Modeling electronic and nuclear spins

Most materials can be classified by their electronic properties into three categories: metal, insulator, and semiconductor. The electrons in a material interact with its nuclear spins (something very non-classical) by way of the hyperfine interaction, and if the electron-nuclear coupling becomes strong enough, a non-negligible two-step process can couple the nuclei with each other throughout the material. In that two-step process, a nuclear spin couples to an electron and changes its motion, and that electron later “scatters” off another nuclear spin elsewhere in the material. We represent this two-step scattering process by way of an effective spin-spin coupling between nuclei at positions r_j and r_i, given by:

Effective coupling between two nuclear spins in a material | Image by Author

where α is the coupling strength and ξ the coupling length. Generally, α and ξ depend on the details of the nuclear-electron coupling and the quantum state of the electrons, but in our work, we will use them to see if a spin-echo experiment can provide enough information to accurately “reverse engineer” these values from a single M(t) curve. Our simulations also include dissipation of the nuclear spins: due to couplings with the environment the spin information can be “lost”. This occurs at a time scale T(decay)≃√Γ. Our goal then will be to develop an ML model that accurately determines the above three variables (α, ξ, and Γ) from a single M(t) curve.
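
To give a feel for the roles of α and ξ, here is a minimal sketch that evaluates two radial kernel shapes on a small 2D lattice and sums them around a central spin. The functional forms (a Gaussian and a power law), the exponent `p`, the lattice size and the parameter values are all assumptions made for illustration; they are not necessarily the expressions used in the project's simulations.

```python
import numpy as np

def gaussian_kernel(r, alpha, xi):
    # Assumed illustrative form: alpha * exp(-(r / xi)^2)
    return alpha * np.exp(-(r / xi) ** 2)

def power_law_kernel(r, alpha, xi, p=3.0):
    # Assumed illustrative form: alpha * (r / xi)^(-p), evaluated only for r > 0
    return alpha * (r / xi) ** (-p)

def kernel_sum(kernel, lattice_size=21, **params):
    """Sum kernel(|r_j - r_i|) over all sites j != i, with i the central site."""
    coords = np.arange(lattice_size) - lattice_size // 2
    x, y = np.meshgrid(coords, coords, indexing="ij")
    r = np.hypot(x, y)   # distance of every site from the central spin
    mask = r > 0         # exclude the central spin itself
    return float(kernel(r[mask], **params).sum())

w_gauss = kernel_sum(gaussian_kernel, alpha=0.1, xi=2.0)
w_power = kernel_sum(power_law_kernel, alpha=0.1, xi=2.0, p=3.0)
print(f"Gaussian lattice sum:  {w_gauss:.4f}")
print(f"Power-law lattice sum: {w_power:.4f}")
```

Here α simply rescales the total coupling felt by a spin, while ξ controls how many neighbours contribute appreciably. A single summary number of this kind is one way to compare kernels of different shapes.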

The Dataset (Some Data Science and some Physics)

This summer, we have worked to extend NMR to the sensing of strong electronic spin-spin correlations and non-dissipative spin-spin interactions between nuclei. To study a wide range of physical phenomena, we simulate the behavior of a typical NMR spin echo experiment on a lattice of 2D spins with different kinds of inter-nuclei interactions and coupling strengths. We employ a radial Gaussian, RKKY, and a Power-Law type kernel for inter-nuclei interactions to mimic long-range (from gapped spin excitations) and short-range (from gapless spin excitations) interactions. Since the average magnetization measured during the spin echo is a function of the local interaction types and the electronic susceptibility of the underlying material, we aim to understand the material type and predict the coupling strengths from machine learning classification and regression models based solely on the shape of the echo curve. The dataset we use is the magnitude of average nuclear magnetization as a function of time, as measured during multiple spin-echo simulations.

from NMR_ML import Dataset
# Create the dataset object
gaussian_data = Dataset(data_directory_path="data/2021-08-14_gauss/")
# Load rawdata and get the desired time-window
rawdata = gaussian_data.load_data()
rscl_df, echo_pulse = gaussian_data.get_window(rawdata, center_ratio=2/3, width=150)
# Repeat the process for the Power-Law and RKKY datasets
The Dataset we use for our Machine Learning experiments. The top panel shows the entire time-series dataset. The middle panel shows the three radial kernel types and the corresponding spin-echo curves in the region of interest. In the bottom panel, we look at the frequency distribution of the time-series in the rotating frame of the NMR experiment. | Image by Author

Methods

To achieve a higher level of interpretability, we use different feature spaces, namely, the pointwise time-series, the pointwise frequency distribution, and polynomial features. We extract specific features from the time-series by a technique called “Multiscale Polynomial Featurization”. Here, we partition the time series into bins of equal length and fit a cubic polynomial in each bin. We use these coefficients as the new features of the curve. Finally, we perform tasks like regression and classification on the new features.[4]
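
As an illustration of the "pointwise frequency distribution" feature space, the sketch below turns a real-valued time series into an amplitude spectrum with NumPy's real FFT. The helper name `frequency_features`, the sampling step and the synthetic test curve are hypothetical choices of mine; the project's tutorials may compute the frequency-domain features differently.

```python
import numpy as np

def frequency_features(curve, dt=1.0):
    """Return (freqs, amplitudes) of a real-valued time series via the real FFT."""
    curve = np.asarray(curve, dtype=float)
    # Normalise by the number of samples so amplitudes do not grow with length.
    amplitudes = np.abs(np.fft.rfft(curve)) / curve.size
    freqs = np.fft.rfftfreq(curve.size, d=dt)
    return freqs, amplitudes

# Synthetic stand-in for an echo: a Gaussian envelope times an oscillation at 2.0.
dt = 0.05
t = np.arange(200) * dt
curve = np.exp(-((t - 5.0) ** 2)) * np.cos(2 * np.pi * 2.0 * t)

freqs, amps = frequency_features(curve, dt=dt)
peak = freqs[np.argmax(amps)]
print(f"dominant frequency: {peak:.2f}")
```

Each entry of `amps` can then be used as one feature, in the same way that each time point is one feature in the pointwise time-series representation.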

Overall pipeline of the methods employed. We use both pointwise and polynomial features for higher interpretability | Image by Author
Multiscale Polynomial Featurization: (a) We split the time-series into equal bins of, say, 4, 5 or 10 parts, called the “Polynomial Domains”, and (b) fit a cubic polynomial in each bin. We use the coefficients of each such fitted polynomial as the new extracted features of the dataset | Image by Author | Inspired by Torrisi, et al. “Random forest machine learning models for interpretable X-ray absorption near-edge structure spectrum-property relationships.” npj Computational Materials 6.1 (2020): 1–11.
from NMR_ML import Dataset, PolynomialFeatures
# Create the dataset object
gaussian_data = Dataset(data_directory_path="2021-08-14_gauss/")
# Load rawdata and get the desired time-window
rawdata = gaussian_data.load_data()
rscl_df, echo_pulse = gaussian_data.get_window(rawdata, center_ratio=2/3, width=150)
# Obtain the polynomial features dataset
polynomial_features = PolynomialFeatures(n_splits=[4,5,10], order_fits=[3,3,3])
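
For readers who want to see the idea without the NMR_ML package, the featurization described above can be sketched with plain NumPy. This is a standalone approximation under my own assumptions (bins made with `np.array_split`, each bin rescaled to the interval [-1, 1] before fitting); the function name is hypothetical and it does not reproduce the internals of the `PolynomialFeatures` class.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def multiscale_polynomial_features(curve, n_splits=(4, 5, 10), order=3):
    """Fit a polynomial in each bin, for several bin counts, and concatenate
    the coefficients (ordered x^0, x^1, ..., x^order within each bin)."""
    curve = np.asarray(curve, dtype=float)
    features = []
    for n_bins in n_splits:
        # array_split gives bins of (nearly) equal length.
        for segment in np.array_split(curve, n_bins):
            # Local coordinate on [-1, 1] keeps the fit well conditioned.
            x = np.linspace(-1.0, 1.0, segment.size)
            coeffs = P.polyfit(x, segment, deg=order)
            features.extend(coeffs)
    return np.array(features)

# Example: a 150-point window, the same width used in the snippet above.
t = np.linspace(0.0, 1.0, 150)
curve = np.exp(-((t - 0.5) / 0.1) ** 2)
features = multiscale_polynomial_features(curve)
print(features.shape)  # (4 + 5 + 10) bins * 4 coefficients each
```

With cubic fits, every bin contributes four numbers: the average level (x⁰), slope (x¹), curvature (x²) and a cubic term (x³). This is what makes the resulting feature importances easier to read than raw time points.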

Results

We built both classification models, to classify the magnetization time-series into the three interaction types, and regression models, to predict the interaction parameters (αx, αz, ξ) or the total kernel integral over the entire set of spins (W = Σ K(r(i,j))). We used ensemble learning and combined many weak decision tree learners to build a random-forest classifier/regressor. Furthermore, we extracted the essential features from the time-series and frequency domain data to understand which sub-sections of the echo pulse are most useful for understanding the material. More details about feature extraction methods and ML techniques are available in the `/Tutorial-nbs` section.

  • With knowledge of the type of interaction, we were able to predict the value of the kernel integral (W) with an R² of ~0.8. We observe that the most important features for regression are the average value (x⁰) in the pre-echo region and the slope and curvature of the magnetization right at the echo-pulse (x¹, x²).
Regression results: Here, we use four versions of the dataset. We use the pointwise and polynomial features from both the time-domain and frequency domain | Image by Author
  • Using a Random Forest classifier, we were able to achieve a good F1-score of ~0.9 on classifying the echo-curves that were of the power-law vs non-power law type. Essentially, we are able to classify the type of interactions between the nuclei of a material solely by looking at the average magnetization of the sample in an NMR experiment.
Classification results: Using the confusion matrix, we see that on higher kernel integrals (W), the classifier works better at distinguishing the power law vs the non-power law type | Image by Author
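
The general recipe behind these results (train a random forest, score it, then inspect feature importances) can be sketched with scikit-learn as follows. The data here are synthetic random numbers with a made-up labelling rule, used only so the snippet runs on its own; the scores it prints say nothing about the NMR results reported above, and the hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a feature matrix (rows: curves, columns: features).
rng = np.random.default_rng(0)
n_samples, n_features = 600, 12
X = rng.normal(size=(n_samples, n_features))
# Made-up binary label that depends mostly on feature 0 and partly on feature 3.
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

score = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
# Impurity-based importances: one value per feature, summing to 1.
ranking = np.argsort(clf.feature_importances_)[::-1]

print(f"F1-score: {score:.2f}")
print("confusion matrix:\n", cm)
print("features ranked by importance:", ranking[:3])
```

With polynomial features as input, each column corresponds to a specific coefficient in a specific bin, so the top-ranked columns point to the part of the echo curve the model relies on. `RandomForestRegressor` follows the same pattern for the regression task, with `r2_score` in place of the F1-score.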

Future Work

Although we were able to achieve most of the objectives mentioned in the project proposal, we believe that there is still room for improvement and will be focusing more on the following aspects:

  • Based on the results described in the Results section, we would like to further explore if deep learning can offer better performance on classification/regression without compromising the model’s interpretability. Although this would entail the need for larger datasets and computing power, the new insights garnered will be very useful in discerning new physics and further contribute to harnessing the data revolution to build and explore novel quantum materials.
  • We would also like to explore optimizing and tuning the applied magnetic pulses in silico to design an experiment that can predict the coupling strengths with higher accuracy.

I hope this project and the NMR_ML scripts benefit someone who is trying to solve a similar problem. Although they are not a perfect solution, I hope these scripts can give you an idea of how to approach the problem. If you found something helpful, do consider starring the repository, creating an issue, or just dropping a message!

Final thoughts and Message

I would like to thank my project mentors Dr. Stephen Carr, Charles Snider, Prof. Vesna Mitrovic, Prof. Chandrashekar Ramanathan and Prof. Brad Marston, the whole NMR research group and the entire ML4SCI community for their constant support, motivation and guidance. I had a great summer working on this super-exciting project and am looking forward to continuing the collaboration in pursuit of knowledge and with renewed excitement!

Finally, a few suggestions for future programmers and students who wish to participate in GSoC: please view GSoC as a project that you inherently want to pursue, not something you do for a certificate or a token of merit. It is a much more rewarding process if pursued in alignment with your long-term goals and career ambitions. It is about learning, creating and giving back to a community of wonderful open-source developers and being part of original scientific research (in my case). Finding niche projects that align with your interests is not as hard as it may sound if you have the right technical skills and background. The most important non-technical skill that you will need, and will master, during the summer is communication, communication, and communication!

Message from my mentor Dr Stephen Carr:

GSOC provides valuable structure and goals for students. This is vitally important for the relatively short summer research experiences, as most conventional research projects span months or years. The application process provides a large selection of passionate students with many different backgrounds and interests, helping ensure each student finds a project that is a good fit for them (and vice-versa!).

Thank you, ML4SCI, Brown University, and Google, for giving me such a wonderful learning opportunity.

References:

[1] https://opensource.googleblog.com/2021/05/google-summer-of-code-2021-students-are.html

[2] https://ml4sci.org/gsoc/2021/proposal_NMR.html

[3] Cohen-Tannoudji, C., Diu, B., & Laloë, F. (1986). Quantum Mechanics, Volume 2, p. 626.

[4] Torrisi, Steven B., et al. “Random forest machine learning models for interpretable X-ray absorption near-edge structure spectrum-property relationships.” npj Computational Materials 6.1 (2020): 1–11.


Physicist • Currently IISERPune + IBMResearch • Incoming UMCP Physics PhD '23 • Documenting here on (quantum) research, productivity & graduate school