Getting Started in Materials Informatics

How to get started doing research in materials informatics (data science + materials science)

Nathan C. Frey, PhD
Towards Data Science


Understand structure — property — performance — processing relationships in materials with data science. From Wikimedia Commons.

In this post I share resources and recommendations for getting involved in materials informatics research. As it becomes increasingly expensive and time-consuming to discover and engineer new materials to address some of our most pressing global challenges (human health, food and water security, climate, etc.), we need materials scientists with scientific domain expertise and training in data science. Whether you want to use data science in your own research or simply want a better understanding of the state of play in the field, this post will help you on your way.

I’ve shared a list of resources on GitHub for getting started with materials informatics. The list includes papers and interactive tutorials, helpful Python libraries, blogs, newsletters, podcasts, databases, and academic materials informatics research groups.

What is materials informatics?

Informatics is the science of transforming information (data). In a sense, all materials science involves informatics because all materials science is built on data and theory that explains the data. By “materials informatics,” we specifically mean materials science that is superpowered by modern data science. This lets us accelerate material property prediction, conduct systematic searches through the space of possible materials to discover compounds with optimized properties, and even design new materials based on the properties we want.

Experimental scientists can use materials informatics to reduce the time and effort they spend in the lab on trial-and-error experiments, while computational scientists and theorists can provide better guidance on what materials to make, how to make them, and how to understand their properties. The more you understand data science and machine learning, the clearer it will be that these are tools for efficient science, not replacements for scientists and scientific thinking.

Prerequisites

Data science prerequisites: glasses, lots of monitors, night mode text editors. From Unsplash.

I’m assuming that interested readers are already working in some domain of materials science: nanomaterials, polymers, metallurgy, biomaterials, or quantum materials. To do research in (or develop an appreciation for) materials informatics, you need to start with a basic foundation in data science: statistics, a scientific computing language like Python or Julia with associated numerical computing capabilities and data structures, machine learning, and storytelling with data (visualization).

The good news is that if you’re a scientist, you probably have a lot of the necessary background. More good news is that there has never been a better time to learn data science: there are tons of free courses, tutorials, and blogs to learn from. In some ways the hardest part is filtering through all those resources to find the best ones. There are plenty of guides to getting started with data science.

Depending on your learning style, you might want to start with a bottom-up approach reading textbooks and implementing methods from scratch, or a top-down approach where you pick a fun problem in your own research or a simple problem from Kaggle and start hacking away, learning as you go.

Into the rabbit hole

If you want a brief technical introduction to materials informatics, you can start with the resources listed here. Ranging from a detailed definition of materials informatics to formalized best practices, interactive tutorials, and a three-day workshop, this handful of resources will help you get started as fast as possible.

If you’re interested in keeping up with the field, either as a practitioner or a casual observer, there’s a collection of blogs, podcasts, and newsletters to help you.

Pictured here: you after supercharging your research with materials informatics. From Unsplash.

Tools of the trade

Once you’re ready to start doing your own research in materials informatics, it helps to make use of the many amazing open-source projects already available. Python is the programming language of choice in the field, and the Jupyter notebook environment is the virtual laboratory for doing materials informatics. With those tools as the substrate, there are entire software ecosystems for generating and analyzing calculated materials data, building machine learning pipelines for predicting properties and designing new materials, visualizing materials data, and sharing your research in a machine-readable way.
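To make "a machine learning pipeline for predicting properties" a little more concrete, here is a minimal sketch using only the Python standard library. It turns a chemical formula into atomic fractions (a very simple featurization) and predicts a property with a nearest-neighbor baseline. This is an illustrative stand-in, not a method from the resource list: the function names are hypothetical, the element symbols "A" and "B" are placeholders, and the target values are invented. In real work you would likely reach for the dedicated libraries in the list instead.

```python
import re
from collections import defaultdict

# Matches an element symbol followed by an optional amount, e.g. "Fe2" or "O".
# Assumption: simple formulas only (no parentheses, charges, or hydrates).
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def composition_fractions(formula):
    """Turn a simple formula like 'Fe2O3' into a dict of atomic fractions."""
    counts = defaultdict(float)
    for element, amount in _TOKEN.findall(formula):
        counts[element] += float(amount) if amount else 1.0
    total = sum(counts.values())
    return {element: n / total for element, n in counts.items()}


def distance(a, b):
    """L1 distance between two composition dicts over the union of their elements."""
    return sum(abs(a.get(el, 0.0) - b.get(el, 0.0)) for el in set(a) | set(b))


def knn_predict(train, formula, k=1):
    """Predict a property as the mean target of the k most similar compositions.

    train is a list of (formula, target) pairs.
    """
    query = composition_fractions(formula)
    ranked = sorted(
        train, key=lambda item: distance(composition_fractions(item[0]), query)
    )
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Placeholder training set: "A" and "B" are stand-in symbols and the targets
# are invented numbers, not measured or calculated data.
train = [("AB", 1.0), ("AB3", 3.0), ("B", 5.0)]
prediction = knn_predict(train, "AB2", k=1)
```

The point of a baseline this simple is to have something to beat: if a more elaborate model cannot outperform "look up the most similar known compound," the extra complexity is probably not earning its keep.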

A key principle to strive for is to make sure your data is Findable, Accessible, Interoperable, and Reusable (FAIR). In other words, make it easy for other scientists to use your data for materials informatics! Luckily, there are great frameworks available for making clear visualizations of your data and sharing it in a way that makes it explorable through a web interface and a Python API.
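As a rough sketch of what "machine-readable" can mean in practice, the snippet below bundles a single property value with the metadata someone else would need to reuse it, then serializes it as JSON. The field names and the `make_fair_record` helper are my own assumptions for illustration, not a standard schema, and the example values are placeholders. Accessibility is not something code like this can provide: it depends on where you host the data.

```python
import json


def make_fair_record(identifier, formula, property_name, value, unit, method, license_name):
    """Bundle one property value with the metadata needed to reuse it.

    Hypothetical layout for illustration; real databases define their own schemas.
    """
    return {
        "id": identifier,  # Findable: a stable, unique identifier
        "formula": formula,
        "property": {"name": property_name, "value": value, "unit": unit},
        "provenance": {"method": method},  # Reusable: how the value was obtained
        "license": license_name,  # Reusable: the terms under which others may use it
    }


def to_json(records):
    """Interoperable: plain JSON with explicit units can be read from any language."""
    return json.dumps(records, indent=2, sort_keys=True)


def from_json(text):
    """Load records back from their JSON form."""
    return json.loads(text)


# Placeholder record: the formula, value, and method are invented, not real data.
example = make_fair_record(
    identifier="example-0001",
    formula="XY2",
    property_name="band_gap",
    value=1.0,
    unit="eV",
    method="placeholder",
    license_name="CC-BY-4.0",
)
serialized = to_json([example])
```

Even a small habit like always storing the unit and the method next to the number removes a lot of guesswork for whoever picks up the data later, including future you.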

It’s all about the data…

The first commandment of data science is “garbage in, garbage out.” It all comes down to obtaining a great data source with some kind of signal, processing it, and applying the Swiss Army knife of available tools to extract some insights. No machine learning method will be able to provide meaningful outputs if that first step is missing: a quality data source. Heroic folks across the world have gone to a lot of trouble calculating and measuring properties for many hundreds of thousands of materials and making that data easily available in databases.
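Here is a small sketch of what guarding against "garbage in" might look like before any modeling happens: drop records with missing values, physically implausible values, or exact duplicates, and keep a tally of why each record was rejected. The record layout, the `band_gap_eV` key, and the non-negativity rule are assumptions chosen for illustration. The right checks depend entirely on your data.

```python
import math


def clean_records(records, value_key="band_gap_eV", min_value=0.0):
    """Return (kept, rejected), where rejected counts dropped records by reason."""
    seen = set()
    kept = []
    rejected = {"missing": 0, "out_of_range": 0, "duplicate": 0}
    for rec in records:
        value = rec.get(value_key)
        # Missing: the key is absent, None, or NaN
        if value is None or (isinstance(value, float) and math.isnan(value)):
            rejected["missing"] += 1
            continue
        # Out of range: assumed physical bound (e.g. a band gap cannot be negative)
        if value < min_value:
            rejected["out_of_range"] += 1
            continue
        # Duplicate: same formula and same value already seen
        key = (rec.get("formula"), value)
        if key in seen:
            rejected["duplicate"] += 1
            continue
        seen.add(key)
        kept.append(rec)
    return kept, rejected
```

Returning the rejection counts alongside the cleaned data is a deliberate choice: if half of a dataset gets filtered out, that is something you want to notice and investigate rather than silently accept.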

Even now, most materials data, which takes an enormous amount of labor and resources to generate, is locked away in figures, tables, and text in PDFs. There are ongoing efforts to parse PDFs and mine the data, but you can make an immediate impact within your own research group and your field by contributing your data to a database and making it FAIR.

…and the questions

The zeroth commandment of data science is “ask the right questions.” Data science can’t tell you what your research direction should be. It remains the chief job of the scientist to figure out what the most interesting questions are to ask and investigate. Hopefully, by using these resources, you’ll gain a better understanding of what materials informatics can and can’t do.

Materials informaticians

To help students, postdocs, and faculty find all the awesome groups working in this field, I’ve started a short list of materials informatics groups (in no particular order). These groups focus on totally different areas of materials science, with the common theme that they use theory, computation, and data science in some capacity. If you work with a materials informatics group and would like to be added to the list, let me know!

Contributions welcome!

I’d love your feedback! The list of resources isn’t meant to be comprehensive; instead, it’s a curated collection of things I’ve found useful in my own work. There are a lot of cutting-edge methods that I’ve omitted (e.g., quantum machine learning, equivariant neural networks) because they aren’t quite at the maturity level needed for day-to-day use and they aren’t really appropriate for the “getting started” level. If there is a materials informatics resource with some functionality that isn’t covered by anything in the list, feel free to read the contributions guide and share it with me. I’d also like to acknowledge Pat Walters’ excellent list of resources for cheminformatics, which inspired this list.

Getting in touch

If you liked this post or have any questions, feel free to reach out over email or connect with me on LinkedIn and Twitter.

The full list of resources is available here on GitHub.

You can find out more about my projects and publications on my website or just read a bit more about me.


Senior ML Scientist & Group Leader @PrescientDesign • @Genentech | Co-founder @AtomicDataSciences | Prev Postdoc @MIT, NDSEG Fellow @UPenn, @Berkeley Lab