published in peer-reviewed scientific literature. In certain fields, such as cheminformatics, best practices and guidelines have been established to address these problems16.

The training of a machine-learning model may be supervised, semi-supervised or unsupervised, depending on the type and amount of available data. In supervised learning, the training data consist of sets of input and associated output values. The goal of the algorithm is to derive a function that, given a specific set of input values, predicts the output values to an acceptable degree of fidelity. If the available dataset consists of only input values, unsupervised learning can be used in an attempt to identify trends, patterns or clustering in the data. Semi-supervised learning may be of value if there is a large amount of input data, but only a limited amount of corresponding output data.

Supervised learning is the most mature and powerful of these approaches, and is used in the majority of machine-learning studies in the physical sciences, such as in the mapping of chemical composition to a property of interest. Unsupervised learning is less common, but can be used for more general analysis and classification of data or to identify previously unrecognized patterns in large datasets17.

Data representation
Even though raw scientific data are usually numerical, the form in which data are presented often affects learning. In many types of spectroscopy, the signal is acquired in the time domain, but for interpretation it is converted to the frequency domain using the Fourier transform. Like scientists, a machine-learning algorithm might learn more effectively using one format than the other. The process of converting raw data into something more suitable for an algorithm is called featurization or feature engineering. The more suitable the representation of the input data, the more accurately an algorithm can map it to the output data. Selecting how best to represent the data could require insight into both the underlying scientific problem and the operation of the learning algorithm, because it is not always obvious which choice of representation will give the best performance; this is an active topic of research for chemical systems18.

Many representations are available to encode structures and properties. One example is the Coulomb matrix19, which contains information on atomic nuclear repulsion and the potential energy of free atoms, and is invariant to molecular translations and rotations. Molecular systems also lend themselves to descriptions as graphs20.
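As a concrete illustration, the sketch below builds the Coulomb matrix for a single molecule. It is a minimal NumPy implementation assuming Cartesian coordinates in ångströms; the water geometry used here is an arbitrary example and is not taken from the original work19.

```python
# Minimal sketch of the Coulomb matrix representation (ref. 19).
# Diagonal elements approximate the energies of the free atoms;
# off-diagonal elements encode nuclear repulsion between atom pairs.
import numpy as np

def coulomb_matrix(Z, R):
    """Z: (n,) nuclear charges; R: (n, 3) Cartesian coordinates in angstroms."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4  # free-atom term
            else:
                # nuclear repulsion: depends only on charges and distance
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M

# Illustrative example: a water molecule (O, H, H).
Z = [8, 1, 1]
R = [[ 0.000, 0.000, 0.000],
     [ 0.757, 0.586, 0.000],
     [-0.757, 0.586, 0.000]]
print(coulomb_matrix(Z, R))
```

Because it depends only on nuclear charges and interatomic distances, the matrix is unchanged by translating or rotating the molecule. It is not, however, invariant to the ordering of the atoms, so in practice the rows and columns are commonly sorted (for example, by their norms) before the matrix is used as input to a learner.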
In the solid state, the conventional description of crystal structures that uses translation vectors and fractional coordinates of the atoms is not appropriate for machine learning because a lattice can be represented in an infinite number of ways by choosing a different coordinate system. Representations based on radial distribution functions21, Voronoi tessellations22 or property-labelled materials fragments23 are among the new ways in which this problem is being tackled.

Choice of learner
When the dataset has been collected and represented appropriately, it is time to choose a model to learn from it. A wide range of model types (or learners) exists for model building and prediction. Supervised-learning models may predict output values within a discrete set (such as categorizing a material as a metal or an insulator) or a continuous set (such as polarizability). Building a model for the former requires classification, whereas the latter requires regression. A range of learning algorithms can be applied (see Table 1), depending on the type of data and the question posed. It may be helpful to use an ensemble of different algorithms, or of similar algorithms with different values for their internal parameters (known as ‘bagging’ or ‘stacking’), to create a more robust overall model. We outline some of the common algorithms (learners) in the following; toy sketches of three of them, and of a stacked ensemble, follow at the end of this section.

Naive Bayes classifiers24 are a collection of classification algorithms based on Bayes’ theorem that identify the most probable hypothesis, given the data and prior knowledge about the problem. Bayes’ theorem provides a formal way of calculating the probability that a hypothesis is correct, given a set of existing data. New hypotheses can then be tested and the prior knowledge updated. In this way, the hypothesis (or model) with the highest probability of correctly representing the data can be selected.

In k-nearest-neighbour25 methods, the distances between samples and training data in a descriptor hyperspace are calculated. They are so called because the output value for a prediction relies on the values of the k ‘nearest neighbours’ in the data, where k is an integer. Nearest-neighbour models can be used for both classification and regression: in classification, the prediction is determined by the class of the majority of the k nearest points; in regression, it is determined by the average of the k nearest points.

Decision trees26 are flowchart-like diagrams used to determine a course of action or outcome. Each branch of the tree represents a possible decision, occurrence or reaction. The tree is structured to show how and why one choice may lead to the next, with branches indicating that each option is mutually exclusive. Decision trees comprise a root node, leaf nodes and branches. The root node is the starting point of the tree.
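To make these descriptions concrete, the following sketch trains the three learners described above on a synthetic two-class dataset using the open-source scikit-learn library. The dataset, hyperparameters and accuracy metric are illustrative choices for this toy example, not recommendations from the review.

```python
# Toy comparison of three common learners on a synthetic classification task.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class dataset; sample size and noise are arbitrary choices.
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

learners = {
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(n_neighbors=5),          # k = 5 nearest neighbours
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}
for name, model in learners.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.2f}")
```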
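Continuing the same sketch, the three learners can be combined into a single, typically more robust model by ‘stacking’; here a logistic-regression meta-learner (an arbitrary choice for this example) learns how to weight their predictions. Substituting scikit-learn’s BaggingClassifier would illustrate ‘bagging’ in the same way.

```python
# Stacking the three base learners from the sketch above.
# Reuses X_train, X_test, y_train, y_test and `learners`.
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=list(learners.items()),    # the three base learners
    final_estimator=LogisticRegression(), # meta-learner combining their outputs
)
stack.fit(X_train, y_train)
print(f"stacked ensemble: test accuracy = {stack.score(X_test, y_test):.2f}")
```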
Box 1 | Learning to learn
One of the most exciting aspects of machine-learning techniques is their potential to democratize molecular and materials modelling by reducing the computer power and prior knowledge required for entry. Just as Pople’s Gaussian software made quantum chemistry more accessible to a generation of experimental chemists, machine-learning approaches, if developed and implemented correctly, can broaden the routine application of computer models by non-specialists.

The accessibility of machine-learning technology relies on three factors: open data, open software and open education. There is an increasing drive for open data within the physical sciences, with an ideal best practice outlined recently98,99. Some of the open software being developed is listed in Table 2.

There are also many excellent open education resources available, such as massive open online courses (MOOCs). fast.ai (http://www.fast.ai) is a course that is “making neural nets uncool again” by making them accessible to a wider community of researchers. One of the advantages of this course is that users start to build working machine-learning models almost immediately. However, it is not for absolute beginners, requiring a working knowledge of computer programming and high-school-level mathematics. DataCamp (https://www.datacamp.com) offers an excellent introduction to coding for data-driven science and covers many practical analysis tools relevant to chemical datasets. This course features interactive environments for developing and testing code and is suitable for non-coders because it teaches Python at the same time as machine learning.

Academic MOOCs are useful courses for those wishing to get more involved with the theory and principles of artificial intelligence and machine learning, as well as the practice. The Stanford MOOC (https://www.coursera.org/learn/machine-learning) is popular, with excellent alternatives available from sources such as https://www.edx.org (see, for example, ‘Learning from data (introductory machine learning)’) and https://www.udemy.com (search for ‘Machine learning A–Z’). The underlying mathematics is the topic of a course from Imperial College London (https://www.coursera.org/specializations/mathematics-machine-learning).

Many machine-learning professionals run informative blogs and podcasts that deal with specific aspects of machine-learning practice. These are useful resources for general interest as well as for broadening and deepening knowledge. There are too many to provide an exhaustive list here, but we recommend https://machinelearningmastery.com and http://lineardigressions.com as a starting point.