RESEARCH REVIEW

elpasolite crystal structure (ABC₂D₆), was applied to screen all two million possible combinations of elements that satisfy the formula, revealing chemical trends and identifying 128 new materials⁷². Such models are expected to become a central feature in the next generation of high-throughput virtual screening procedures.
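To give a sense of the scale of such a screen, the sketch below counts the ordered assignments of distinct elements to the four sites of ABC₂D₆ and ranks candidates with a placeholder scoring function. The pool of 39 elements is an assumption made here because it yields roughly two million assignments; the text above does not state the pool size. The names `count_assignments`, `toy_score` and `screen`, and the per-element values, are hypothetical and carry no physical meaning. In a real workflow a trained machine-learning model would take the place of `toy_score`.

```python
import heapq
from itertools import permutations
from math import perm

# Stoichiometric weights of the A, B, C and D sites in ABC2D6.
SITE_WEIGHTS = (1, 1, 2, 6)


def count_assignments(n_elements, n_sites=4):
    """Number of ordered ways to place distinct elements on the sites."""
    return perm(n_elements, n_sites)


def toy_score(assignment, element_values):
    """Placeholder surrogate: stoichiometry-weighted sum of made-up values."""
    return sum(w * element_values[e] for w, e in zip(SITE_WEIGHTS, assignment))


def screen(element_values, keep=3):
    """Return the `keep` lowest-scoring (A, B, C, D) assignments."""
    candidates = permutations(element_values, 4)  # generated lazily
    return heapq.nsmallest(
        keep, candidates, key=lambda a: toy_score(a, element_values)
    )


if __name__ == "__main__":
    # Assumed pool of 39 elements (not stated in the source).
    print(count_assignments(39))  # 39 * 38 * 37 * 36 = 1974024

    # Made-up values for a five-element pool, purely for illustration.
    values = {"Li": 0.3, "Na": 0.4, "K": 0.5, "F": 0.1, "Cl": 0.2}
    for candidate in screen(values):
        print(candidate, round(toy_score(candidate, values), 2))
```

Because the candidates are generated lazily and only the best few are retained, memory use stays small even when the full space runs to millions of compositions.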
The majority of crystal-solid machine-learning studies so far have concentrated on a particular type of crystal structure. This is because of the difficulty of representing crystalline solids in a format that can be fed easily to a statistical learning procedure. By concentrating on a single structure type, the representation is inherently built into the model. Developing flexible, transferable representations is one of the important areas of research in machine learning for crystalline solids (see subsection ‘Data representation’). As we will see below, the use of machine learning in molecular chemistry is more advanced than in the solid state, to a large extent owing to the greater ease with which molecules can be described in a manner amenable to algorithmic interpretation.

Molecular science. The quantitative structure–activity relationship is now a firmly established tool for drug discovery and molecular design. With the development of massive databases of assayed and virtual molecules⁷³,⁷⁴, methods for rapid, reliable, virtual screening of these molecules for pharmacological (or other) activity are required to unlock the potential of such molecules. Models based on quantitative structure–activity relationships can be described as the application of statistical methods to the problem of finding empirical relationships of the type $P_i = k'(D_1, D_2, \ldots, D_n)$, where $P_i$ is the property of interest, $k'$ is a (typically linear) mathematical transformation and $D_i$ are calculated or measured structural properties⁷⁵. Machine learning has a long history in the development of quantitative structure–activity relationships, stretching back over half a century⁷⁶. Molecular science is benefiting from cutting-edge algorithmic developments in machine learning such as generative adversarial networks⁷⁷ and reinforcement learning for the computational design of biologically active compounds.
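As an illustration of the structure–activity relationship described above, the following sketch fits a linear form of the transformation by ordinary least squares. It assumes that the transformation is linear with an added intercept term, which the text does not specify beyond 'typically linear'. The function names `fit_linear_qsar` and `predict` are hypothetical, and the descriptor and property values are synthetic numbers generated from a known linear rule so that the fit can be checked; they are not measured data.

```python
def fit_linear_qsar(descriptors, properties):
    """Least-squares fit of P ~ k0 + sum_j k_j * D_j via the normal equations."""
    rows = [[1.0, *d] for d in descriptors]  # prepend the intercept column
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * p for r, p in zip(rows, properties))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        if abs(aug[col][col]) < 1e-12:
            raise ValueError("descriptors are linearly dependent")
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def predict(coeffs, descriptor):
    """Evaluate the fitted relationship for one set of descriptors."""
    return coeffs[0] + sum(k * d for k, d in zip(coeffs[1:], descriptor))


if __name__ == "__main__":
    # Synthetic descriptors (D1, D2) and properties from P = 0.5 + 2*D1 - D2.
    descriptors = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 8.0)]
    properties = [0.5 + 2.0 * d1 - 1.0 * d2 for d1, d2 in descriptors]
    coeffs = fit_linear_qsar(descriptors, properties)
    print([round(c, 6) for c in coeffs])
    print(round(predict(coeffs, (6.0, 4.0)), 6))
```

In practice the descriptors would be calculated or measured structural properties, and a regularized or nonlinear learner would often replace the plain least-squares fit used here.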
In a generative adversarial network, two models are trained simultaneously: a generative model (or generator) captures the distribution of data while a discriminative model (or discriminator) estimates the probability that a sample came from the training set rather than the generator. The training procedure for the generator is to maximize the probability of the discriminator making an error (Fig. 3). Models based on objective-reinforced generative adversarial networks⁷⁸ are capable of generating new organic molecules from scratch. Such models can be trained to produce diverse molecules that contain specific chemical features and physical responses, through a reward mechanism that resembles classical conditioning in psychology. Using reinforcement learning, newly generated chemical structures can be biased towards those with the desired physical and biological properties (de novo design).

Reclaiming the literature

A final area for which we consider the recent progress of machine learning (across all disciplines) is tapping into the vast amount of knowledge that already exists. Although the scientific literature provides a wealth of information to researchers, it is increasingly difficult to navigate owing to the proliferation of journals, articles and databases.
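Returning briefly to the generative adversarial network described earlier in this section, the alternating training of generator and discriminator can be sketched on a one-dimensional toy problem. Everything in the sketch is an assumption made for illustration: the generator is a linear map of Gaussian noise, the discriminator is a logistic function, the 'training set' is drawn from a normal distribution with mean 3, and the generator uses the common non-saturating loss. It is not the objective-reinforced model of ref. 78 and does not generate molecules.

```python
import math
import random
from statistics import fmean


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(steps=500, batch=32, lr=0.05, real_mean=3.0, seed=0):
    """Alternate discriminator and generator gradient steps; return parameters."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator: G(z) = a * z + b
    w, c = 0.1, 0.0  # discriminator: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        real = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fake = [a * z + b for z in noise]

        # Discriminator step: minimise -log D(real) - log(1 - D(fake)).
        d_real = [sigmoid(w * x + c) for x in real]
        d_fake = [sigmoid(w * x + c) for x in fake]
        grad_w = fmean([(p - 1.0) * x for p, x in zip(d_real, real)]) + fmean(
            [p * x for p, x in zip(d_fake, fake)]
        )
        grad_c = fmean([p - 1.0 for p in d_real]) + fmean(d_fake)
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: minimise -log D(fake), i.e. push the updated
        # discriminator towards labelling generated samples as real.
        d_fake = [sigmoid(w * x + c) for x in fake]
        grad_a = fmean([(p - 1.0) * w * z for p, z in zip(d_fake, noise)])
        grad_b = fmean([(p - 1.0) * w for p in d_fake])
        a -= lr * grad_a
        b -= lr * grad_b
    return (a, b), (w, c)


if __name__ == "__main__":
    (a, b), (w, c) = train_toy_gan()
    print(f"generator: a={a:.3f}, b={b:.3f}")
    print(f"discriminator: w={w:.3f}, c={c:.3f}")
```

With a discriminator this simple, only the location of the data can be distinguished, so the generator's offset `b` is the quantity that the adversarial signal actually steers; the spread parameter `a` tends to shrink, and the parameters may oscillate rather than settle. Practical models use neural networks for both players.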
Text mining has become a popular approach to identifying and extracting

Table 3 | Publicly accessible structure and property databases for molecules and solids

Name | Description | URL

Computed structures and properties
AFLOWLIB | Structure and property repository from high-throughput ab initio calculations of inorganic materials | http://aflowlib.org
Computational Materials Repository | Infrastructure to enable collection, storage, retrieval and analysis of data from electronic-structure codes | https://cmr.fysik.dtu.dk
GDB | Databases of hypothetical small organic molecules | http://gdb.unibe.ch/downloads
Harvard Clean Energy Project | Computed properties of candidate organic solar absorber materials | https://cepdb.molecularspace.org
Materials Project | Computed properties of known and hypothetical materials carried out using a standard calculation scheme | https://materialsproject.org
NOMAD | Input and output files from calculations using a wide variety of electronic-structure codes | https://nomad-repository.eu
Open Quantum Materials Database | Computed properties of mostly hypothetical structures carried out using a standard calculation scheme | http://oqmd.org
NREL Materials Database | Computed properties of materials for renewable-energy applications | https://materials.nrel.gov
TEDesignLab | Experimental and computed properties to aid the design of new thermoelectric materials | http://tedesignlab.org
ZINC | Commercially available organic molecules in 2D and 3D formats | https://zinc15.docking.org

Experimental structures and properties
ChEMBL | Bioactive molecules with drug-like properties | https://www.ebi.ac.uk/chembl
ChemSpider | Royal Society of Chemistry’s structure database, featuring calculated and experimental properties from a range of sources | https://chemspider.com
Citrination | Computed and experimental properties of materials | https://citrination.com
Crystallography Open Database | Structures of organic, inorganic, metal–organic compounds and minerals | http://crystallography.net
CSD | Repository for small-molecule organic and metal–organic crystal structures | https://www.ccdc.cam.ac.uk
ICSD | Inorganic Crystal Structure Database | https://icsd.fiz-karlsruhe.de
MatNavi | Multiple databases targeting properties such as superconductivity and thermal conductance | http://mits.nims.go.jp
MatWeb | Datasheets for various engineering materials, including thermoplastics, semiconductors and fibres | http://matweb.com
NIST Chemistry WebBook | High-accuracy gas-phase thermochemistry and spectroscopic data | https://webbook.nist.gov/chemistry
NIST Materials Data Repository | Repository to upload materials data associated with specific publications | https://materialsdata.nist.gov
PubChem | Biological activities of small molecules | https://pubchem.ncbi.nlm.nih.gov

552 | NATURE | VOL 559 | 26 JULY 2018
© 2018 Springer Nature Limited. All rights reserved.