Decision trees represent a model as a series of decisions and outcomes arranged in the nodes and branches of a tree. Both root and leaf nodes contain questions or criteria to be addressed. Branches are arrows connecting nodes, showing the flow from question to answer. Decision trees are often used in ensemble methods (meta-algorithms), which combine multiple trees into one predictive model to improve performance.

Kernel methods are a class of algorithms, the best known members of which are support vector machines and kernel ridge regression27. The name 'kernel' comes from the use of a kernel function: a function that transforms input data into a higher-dimensional representation that makes the problem easier to solve. In a sense, a kernel is a similarity function provided by the domain expert: it takes two inputs and creates an output that quantifies how similar they are.

Artificial neural networks and deep neural networks28 loosely mimic the operation of the brain, with artificial neurons (the processing units) arranged in input, output and hidden layers. In the hidden layers, each neuron receives input signals from other neurons, integrates those signals and then uses the result in a straightforward computation. Connections between neurons have weights, the values of which represent the stored knowledge of the network. Learning is the process of adjusting the weights so that the training data are reproduced as accurately as possible.

Whatever the model, most learners are not fully autonomous, requiring at least some guidance. The values of internal variables (hyperparameters) are estimated beforehand using systematic and random searches, or heuristics. Even modest changes in the values of hyperparameters can improve or impair learning considerably, and the selection of optimal values is often problematic. Consequently, the development of automatic optimization algorithms is an area of active investigation, as is their incorporation into accessible packages for non-expert users (see Table 2).

Model optimization
When the learner (or set of learners) has been chosen and predictions are being made, a trial model must be evaluated to allow for optimization and ultimate selection of the best model. Three principal sources of error arise and must be taken into account: model bias, model variance and irreducible errors, with the total error being the sum of these. Bias is the error from incorrect assumptions in the algorithm and can result in the model missing underlying relationships. Variance is sensitivity to small fluctuations in the training set. Even well-trained machine-learning models may contain errors due to noise in the training data, measurement limitations, calculation uncertainties, or simply outliers or missing data.
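One practical way to probe these sources of error is to compare a model's error on the training data with its error on withheld data as the model's complexity grows. The short sketch below is our illustration, not taken from the review; it assumes scikit-learn is available, and the synthetic data, tree depths and noise level are arbitrary choices made for demonstration.

```python
# Toy sketch (illustrative only, synthetic data): as model complexity grows,
# training error keeps falling while held-out error plateaus and then rises.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # noisy target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

for depth in (1, 2, 4, 8, 16):  # increasing model complexity
    model = DecisionTreeRegressor(max_depth=depth).fit(X_train, y_train)
    err_train = mean_squared_error(y_train, model.predict(X_train))
    err_test = mean_squared_error(y_test, model.predict(X_test))
    print(f"depth={depth:2d}  train MSE={err_train:.3f}  test MSE={err_test:.3f}")
```

In such an experiment, a very shallow tree gives high error on both sets (high bias), whereas a very deep tree drives the training error towards zero while the held-out error stalls or worsens (high variance); the discussion below makes this diagnosis precise.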
Poor model performance usually indicates high bias or high variance, as illustrated in Fig. 2. High bias (also known as underfitting) occurs when the model is not flexible enough to adequately describe the relationship between inputs and predicted outputs, or when the data are insufficiently detailed to allow the discovery of suitable rules. High variance (or overfitting) occurs when a model becomes too complex; typically, this happens as the number of parameters is increased. The diagnostic test for overfitting is that the accuracy of the model in representing the training data continues to improve, while its performance in estimating test data plateaus or declines.

The key test of the accuracy of a machine-learning model is its successful application to unseen data. A widely used method for determining the quality of a model involves withholding a randomly selected portion of the data during training. This withheld dataset, known as a test set, is shown to the model once training is complete (Fig. 2). The extent to which the output data in the test set are accurately predicted then provides a measure of the effectiveness of training. Cross-validation, in which this splitting is repeated over different partitions of the data, is reliable only when the samples used for training and validation are representative of the whole population, which may present problems if the sample size is small or if the model is applied to data from compounds that are very different from those in the original dataset. A careful selection of methods for evaluating the transferability and applicability of a model is required in such cases.

Accelerating the scientific method
Whether through the enumeration and analysis of experimental data or the codification of chemical intuition, the application of informatics to guide laboratory chemists is advancing rapidly. In this section, we explore how machine learning is helping to advance chemical and materials design, synthesis, characterization and modelling, and to reduce the barriers between them. We also describe some of the important developments in the field of artificial intelligence for data-mining existing literature.

Guiding chemical synthesis
Organic chemists were among the first scientists to recognize the potential of computational methods in laboratory practice. E. J. Corey's Organic Chemical Simulation of Synthesis (OCSS) program29, developed 50 years ago, was an attempt to automate retrosynthetic analysis. In a synthetic chemistry route, the number of possible transformations at each step is vast, so the space of candidate routes to be searched grows combinatorially.

Fig. 1 | Evolution of the research workflow in computational chemistry. The standard paradigm in the first-generation approach is to calculate the physical properties of an input structure, which is often performed via an approximation to the Schrödinger equation combined with local optimization of the atomic forces. In the second-generation approach, global optimization (for example, an evolutionary algorithm) maps an input of chemical composition to an output that contains predictions of the structure or ensemble of structures that the combination of elements is likely to adopt. The emerging third-generation approach is to use machine-learning techniques to predict composition, structure and properties, provided that sufficient data are available and an appropriate model is trained. Four stages of training a machine-learning model, with some of the common choices, are listed in the bottom panel.

[Fig. 1 panels] First generation (structure-property calculation): input, structure; algorithm, local optimization; output, property. Second generation (crystal structure prediction): input, composition; algorithm, global optimization; output, structure and property. Third generation (statistically driven design): input, chemical and physical data; method, machine learning; output, composition, structure and property. Bottom panel, four stages of training: (i) data collection (experiment, simulation, databases); (ii) representation (optimize format, remove noise, extract features); (iii) type of learning (supervised, semi-supervised, unsupervised); (iv) model selection (cross-validation, ensembles, anomaly checks).
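As a minimal sketch of the four training stages listed in the bottom panel of Fig. 1, the following example is our illustration, assuming scikit-learn; the five descriptors and the target are synthetic stand-ins for experimental or database records.

```python
# Illustrative sketch of the Fig. 1 workflow: collect data, build a
# representation, train a supervised learner, select a model by cross-validation.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# (i) Data collection: stand-in for experimental, simulated or database records.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 5))  # five hypothetical descriptors
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=150)

# (ii) Representation: standardize features (a simple 'optimize format' step).
# (iii) Supervised learning: an ensemble of decision trees (random forest).
model = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=1))

# (iv) Model selection: 5-fold cross-validated score.
scores = cross_val_score(model, X, y, cv=5)
print(f"mean CV R^2: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Wrapping the representation and the learner in a single pipeline ensures that the scaling step is fitted only on the training folds during cross-validation, so the held-out folds remain genuinely unseen.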
Table 1 | Classes of machine-learning techniques and some chemical questions they could answer

Class         | Method                   | Algorithms include                           | Chemical query
Bayesian      | Probabilistic inference  | Naive Bayes; Bayesian networks               | Is my new theory valid?
Evolutionary^a | Evolving structures     | Genetic algorithm; particle swarm            | What molecule gives this property?
Symbolist     | Logical inference        | Rules; decision trees                        | How do I make this material?
Connectionist | Pattern recognition      | Artificial neural networks; back propagation | What compound did I synthesize?
Analogist     | Constrained optimization | Nearest neighbour; support vectors           | Find a structure–property relation

The classes shown were chosen following ref. 97.
^a Although evolutionary algorithms are often integrated into machine-learning procedures, they form part of a wider class of stochastic search algorithms.
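To connect the kernel-based entries of Table 1 back to the earlier discussion of hyperparameters, the sketch below is our illustration; the model choice, grid values and data are assumptions made for demonstration. It runs a cross-validated grid search over the regularization strength and kernel width of a kernel ridge regression model.

```python
# Illustrative hyperparameter search (synthetic data, arbitrary grid values)
# for kernel ridge regression with a radial-basis-function kernel.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.uniform(0, 5, size=(120, 1))
y = np.exp(-X).ravel() + rng.normal(scale=0.05, size=120)

# Hyperparameters: regularization strength (alpha) and RBF kernel width (gamma).
grid = {"alpha": [1e-3, 1e-2, 1e-1, 1.0], "gamma": [0.1, 1.0, 10.0]}
search = GridSearchCV(KernelRidge(kernel="rbf"), grid, cv=5)
search.fit(X, y)
print("best hyperparameters:", search.best_params_)
```

Exhaustive grids of this kind scale poorly as the number of hyperparameters grows, which is one motivation for the random, heuristic and automatic search strategies discussed above.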