Review
https://doi.org/10.1038/s41586-018-0337-2

Machine learning for molecular and materials science

Keith T. Butler1, Daniel W. Davies2, Hugh Cartwright3, Olexandr Isayev4* & Aron Walsh5,6*

Here we summarize recent progress in machine learning for the chemical sciences. We outline machine-learning techniques that are suitable for addressing research questions in this domain, as well as future directions for the field. We envisage a future in which the design, synthesis, characterization and application of molecules and materials is accelerated by artificial intelligence.

The Schrödinger equation provides a powerful structure–property relationship for molecules and materials. For a given spatial arrangement of chemical elements, the distribution of electrons and a wide range of physical responses can be described. The development of quantum mechanics provided a rigorous theoretical foundation for the chemical bond.
In 1929, Paul Dirac famously proclaimed that the underlying physical laws for the whole of chemistry are “completely known”1. John Pople, realizing the importance of rapidly developing computer technologies, created a program, Gaussian 70, that could perform ab initio calculations: predicting the behaviour, for molecules of modest size, purely from the fundamental laws of physics2. In the 1960s, the Quantum Chemistry Program Exchange brought quantum chemistry to the masses in the form of useful practical tools3. Suddenly, experimentalists with little or no theoretical training could perform quantum calculations too. Using modern algorithms and supercomputers, systems containing thousands of interacting ions and electrons can now be described using approximations to the physical laws that govern the world on the atomic scale4–6.

The field of computational chemistry has become increasingly predictive in the twenty-first century, with activity in applications as wide ranging as catalyst development for greenhouse gas conversion, materials discovery for energy harvesting and storage, and computer-assisted drug design7. The modern chemical-simulation toolkit allows the properties of a compound to be anticipated (with reasonable accuracy) before it has been made in the laboratory. High-throughput computational screening has become routine, giving scientists the ability to calculate the properties of thousands of compounds as part of a single study. In particular, density functional theory (DFT)8,9, now a mature technique for calculating the structure and behaviour of solids10, has enabled the development of extensive databases that cover the calculated properties of known and hypothetical systems, including organic and inorganic crystals, single molecules and metal alloys11–13.

The emergence of contemporary artificial-intelligence methods has the potential to substantially alter and enhance the role of computers in science and engineering.
The combination of big data and artificial intelligence has been referred to as both the “fourth paradigm of science”14 and the “fourth industrial revolution”15, and the number of applications in the chemical domain is growing at an astounding rate.

A subfield of artificial intelligence that has evolved rapidly in recent years is machine learning. At the heart of machine-learning applications lie statistical algorithms whose performance, much like that of a researcher, improves with training. There is a growing infrastructure of machine-learning tools for generating, testing and refining scientific models. Such techniques are suitable for addressing complex problems that involve massive combinatorial spaces or nonlinear processes, which conventional procedures either cannot solve or can tackle only at great computational cost. As the machinery for artificial intelligence and machine learning matures, important advances are being made not only by those in mainstream artificial-intelligence research, but also by experts in other fields (domain experts) who adopt these approaches for their own purposes. As we detail in Box 1, the resources and tools that facilitate the application of machine-learning techniques mean that the barrier to entry is lower than ever.

In the rest of this Review, we discuss progress in the application of machine learning to address challenges in molecular and materials research. We review the basics of machine-learning approaches, identify areas in which existing methods have the potential to accelerate research and consider the developments that are required to enable more wide-ranging impacts.

Nuts and bolts of machine learning

With machine learning, given enough data and a rule-discovery algorithm, a computer has the ability to determine all known physical laws (and potentially those that are currently unknown) without human input.
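
A very small-scale illustration of rule discovery from data may help here. The sketch below is not taken from the Review: it assumes a linear Beer-Lambert-type relation (absorbance proportional to concentration at fixed path length) as the hidden rule, generates synthetic measurements with a deterministic pseudo-noise term, fits the proportionality constant by least squares on a portion of the data, and checks the prediction on the held-out remainder. The names `fit_through_origin` and `TRUE_EPSILON` and all numerical values are hypothetical.

```python
import math


def fit_through_origin(xs, ys):
    """Least-squares slope for the model y = slope * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Assumed hidden rule (arbitrary units, path length fixed at 1):
# absorbance = TRUE_EPSILON * concentration.
TRUE_EPSILON = 2.5

concentrations = [0.1 * (i + 1) for i in range(20)]
# Deterministic pseudo-noise of amplitude 0.05 stands in for measurement error.
absorbances = [
    TRUE_EPSILON * c + 0.05 * math.sin(7 * i)
    for i, c in enumerate(concentrations)
]

# Learn from a portion of the data and keep the rest for checking predictions.
train = [i for i in range(20) if i % 4 != 0]
held_out = [i for i in range(20) if i % 4 == 0]

epsilon_fit = fit_through_origin(
    [concentrations[i] for i in train],
    [absorbances[i] for i in train],
)

# Largest absolute prediction error on data the fit never saw.
max_error = max(
    abs(epsilon_fit * concentrations[i] - absorbances[i]) for i in held_out
)

print(f"fitted constant: {epsilon_fit:.3f}")
print(f"largest held-out error: {max_error:.3f}")
```

The fitted constant lands close to the value used to generate the data, and the held-out error stays on the order of the injected noise. Real applications differ in that the functional form is usually not known in advance, which is where more flexible models come in.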
In traditional computational approaches, the computer is little more than a calculator, employing a hard-coded algorithm provided by a human expert. By contrast, machine-learning approaches learn the rules that underlie a dataset by assessing a portion of that data and building a model to make predictions. We consider the basic steps involved in the construction of a model, as illustrated in Fig. 1; this constitutes a blueprint of the generic workflow that is required for the successful application of machine learning in a materials-discovery process.

Data collection

Machine learning comprises models that learn from existing (training) data. Data may require initial preprocessing, during which missing or spurious elements are identified and handled. For example, the Inorganic Crystal Structure Database (ICSD) currently contains more than 190,000 entries, which have been checked for technical mistakes but are still subject to human and measurement errors. Identifying and removing such errors is essential to avoid machine-learning algorithms being misled. There is a growing public concern about the lack of reproducibility and error propagation of experimental data

1ISIS Facility, Rutherford Appleton Laboratory, Harwell Campus, Harwell, UK.
2Department of Chemistry, University of Bath, Bath, UK.
3Department of Chemistry, Oxford University, Oxford, UK.
4Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
5Department of Materials Science and Engineering, Yonsei University, Seoul, South Korea.
6Department of Materials, Imperial College London, London, UK.
*e-mail: olexandr@olexandrisayev.com; a.walsh@imperial.ac.uk

26 JULY 2018 | VOL 559 | NATURE | 547
© 2018 Springer Nature Limited. All rights reserved
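
The preprocessing step described under Data collection, in which missing or spurious elements are identified and handled, might look like the following minimal sketch. It is an illustration under stated assumptions, not the procedure used by the ICSD or by the authors: records are plain dictionaries, the field name `lattice_a` and all values are made up, and spurious entries are flagged with a median-absolute-deviation (MAD) test, which is one common robust choice among many.

```python
import math
import statistics


def clean_records(records, key, threshold=3.5):
    """Split records into (kept, rejected).

    An entry is rejected as "missing" if records[key] is None or NaN, and as
    "outlier" if its MAD-based robust score exceeds `threshold`.
    """
    present, rejected = [], []
    for rec in records:
        value = rec.get(key)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            rejected.append((rec, "missing"))
        else:
            present.append(rec)

    if not present:
        return [], rejected

    values = [rec[key] for rec in present]
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)

    kept = []
    for rec in present:
        # 0.6745 rescales the MAD so the score is comparable to a z-score
        # for normally distributed data.
        score = 0.6745 * abs(rec[key] - median) / mad if mad > 0 else 0.0
        if score > threshold:
            rejected.append((rec, "outlier"))
        else:
            kept.append(rec)
    return kept, rejected


# Hypothetical entries; "E" mimics a misplaced decimal point.
records = [
    {"id": "A", "lattice_a": 4.05},
    {"id": "B", "lattice_a": 4.08},
    {"id": "C", "lattice_a": None},
    {"id": "D", "lattice_a": 3.99},
    {"id": "E", "lattice_a": 40.5},
    {"id": "F", "lattice_a": 4.02},
    {"id": "G", "lattice_a": float("nan")},
]

kept, rejected = clean_records(records, "lattice_a")
print("kept:", [rec["id"] for rec in kept])
print("rejected:", [(rec["id"], reason) for rec, reason in rejected])
```

Returning the rejected entries together with a reason, rather than silently dropping them, keeps the cleaning step auditable, which matters when the underlying data carry human and measurement errors.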