Happy Monday Everyone,

I saw this article referencing a review in Nature on machine learning as applied to chemistry and materials science, and thought I'd read the original paper (which can be found here behind a paywall; try sci-hub). Today we will be discussing some of the ideas in “Machine learning for molecular and materials science.”

The beginning of the article is rather concise: the Schrödinger equation was so effective at letting physicists calculate the physical and chemical properties of an element from the distribution of its electrons that in 1929 Paul Dirac claimed the underlying physical laws of chemistry were “completely known.” In the 1960s the first programs appeared that could apply these calculations to predict chemical properties ab initio (that is, from the basic physical laws alone). Modern algorithms and computers let us do this for thousands of atoms, but the cost of the calculation grows exponentially as more atoms are added, so shortcuts start to look desirable. Enter the world of machine learning.

Machine learning, by definition, is a machine coming up with a solution (an algorithm or an optimized set of parameters) without being explicitly programmed to do so. This consists of translating the thing you want it to do into a “cost” function, which the machine then tries to minimize. Linear regression (the tool in Excel and other graphing programs that draws a trend line through a scatter of data points) is a particularly simple example: the cost is the distance between the line and the points, and the machine nudges the line one way or the other until it is as close as possible to all of them. Linear regression is machine learning.
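
To make the “minimize a cost” idea concrete, here is a minimal sketch in Python (using NumPy, with made-up data points) that fits a trend line by gradient descent, nudging the slope and intercept in whichever direction reduces the summed squared distance to the points. In practice you would call a library routine, but the loop makes the idea explicit.

```python
import numpy as np

# Made-up data points (x, y) -- imagine these came from a scatter plot in Excel.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

slope, intercept = 0.0, 0.0   # start with a flat line
learning_rate = 0.01

for step in range(5000):
    predictions = slope * x + intercept
    errors = predictions - y
    cost = np.mean(errors ** 2)        # the "cost": mean squared distance to the points
    # Gradients say which direction (and how strongly) to nudge each parameter.
    grad_slope = 2 * np.mean(errors * x)
    grad_intercept = 2 * np.mean(errors)
    slope -= learning_rate * grad_slope
    intercept -= learning_rate * grad_intercept

print(f"fitted line: y = {slope:.2f} x + {intercept:.2f}, final cost = {cost:.4f}")
```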

For chemistry, however, the article discusses five other algorithms that are common and useful (a toy sketch using two of them follows the list):

  1. Naive Bayes classifiers – apply Bayes’ theorem to find the “most probable hypothesis” given past and known data.
  2. K-nearest neighbor – Finds materials most similar to the current one in known properties to predict the physical or chemical property in question.
  3. Decision Trees – create a logical diagram, similar to a flow chart, that focuses on the probability of a given outcome, where each decision is a parameter. A favourite in predictive analytics, both for its simplicity and for the easy way the model can be translated into a set of rules people can act on.
  4. Kernel Methods – of which one of the most popular and well-known is the Support Vector Machine (SVM). I know virtually nothing about these and will say no more about them.
  5. Neural Networks – Who knows about machine learning and hasn’t heard of these?
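
To give a feel for how two of these look in code, here is a minimal scikit-learn sketch on an invented six-material dataset. The features, labels, and the “crystallized” outcome are all made up for illustration; they are not taken from the paper or any real database.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Invented toy data: each row is a "material" described by two crude features
# (say, mean atomic radius in angstroms and electronegativity difference),
# and the label is whether it formed a crystal in some hypothetical experiment.
X = [[1.2, 0.4], [1.3, 0.5], [0.7, 1.8], [0.8, 1.6], [1.1, 0.3], [0.6, 1.9]]
y = [1, 1, 0, 0, 1, 0]   # 1 = crystallized, 0 = did not

new_material = [[0.9, 1.2]]  # an unseen composition we want a prediction for

# k-nearest neighbours: look at the 3 most similar known materials and take a vote.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print("kNN prediction:", knn.predict(new_material))

# Decision tree: learns a flow chart of threshold questions on the features.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print("tree prediction:", tree.predict(new_material))
```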

In materials science, the cost function could be based on having the machine try to predict a property of a material given some known data. Perhaps, given its chemical formula and structure, it tries to predict whether the material can crystallize. In organic chemistry, the cost function could be about finding the most efficient way to produce compound X from starter materials Y, and in fact machine learning has been used to create/discover new organic synthesis pathways.
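
As a concrete (and again invented) example of what such a cost function might look like, here is a log-loss for a “does it crystallize?” predictor. The function name and numbers are mine, not the paper’s; the point is simply that the cost is small when the model assigns high probability to what actually happened and large when it is confidently wrong.

```python
import numpy as np

def crystallization_cost(predicted_prob, actually_crystallized):
    """Log loss: penalizes confident wrong predictions about crystallization.

    predicted_prob: model's probability (0-1) that each material crystallizes.
    actually_crystallized: 1 if the material crystallized, 0 if it did not.
    """
    p = np.clip(np.asarray(predicted_prob, dtype=float), 1e-12, 1 - 1e-12)
    t = np.asarray(actually_crystallized, dtype=float)
    return float(np.mean(-(t * np.log(p) + (1 - t) * np.log(1 - p))))

# A model that is right and confident pays a small cost...
print(crystallization_cost([0.9, 0.1], [1, 0]))   # ~0.105
# ...while a model that is confidently wrong pays a large one.
print(crystallization_cost([0.1, 0.9], [1, 0]))   # ~2.303
```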

Once the cost function is defined, a set of training data is amassed. This can come from one of many existing databases (ChEMBL, AFLOWLIB, and the Crystallography Open Database are all examples; more are given in the Nature paper). To prevent overfitting, about 20% of the training data is usually set aside at the start, and after the models have been trained they are tested against this 20%. A model overfits when it starts memorizing particulars of the data set rather than generalizing its rules.
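
In code, the 20% hold-out looks something like the sketch below. The feature matrix and labels here are placeholders rather than a real pull from ChEMBL or AFLOWLIB, but the split-then-compare pattern is the standard one.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Placeholder feature matrix and labels standing in for a real database pull.
X = [[1.2, 0.4], [1.3, 0.5], [0.7, 1.8], [0.8, 1.6],
     [1.1, 0.3], [0.6, 1.9], [1.0, 0.6], [0.9, 1.5]]
y = [1, 1, 0, 0, 1, 0, 1, 0]

# Set 20% of the data aside before any training happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# A model that has memorized its training set will score well here...
print("training accuracy:", model.score(X_train, y_train))
# ...but a big drop on the held-out 20% is the tell-tale sign of overfitting.
print("held-out accuracy:", model.score(X_test, y_test))
```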

The article goes on to recommend that chemists start sharing their data not only in a form suited to publication and reading by other chemists, but also in a standardized format that machines can easily read. It also speculates about the possibilities of quantum computing for powerful computational chemistry, as well as the possibility of machine learning being used to discover new chemical principles, and what that might look like. I highly recommend you check it out.

In a future week I would like to look at one or more of these machine learning algorithms in more detail, and perhaps go over an example, applying it to an open-source chemical database to generate useful conclusions.