Initial sequencing and analysis of the human genome

International Human Genome Sequencing Consortium*

* A partial list of authors appears on the opposite page. Affiliations are listed at the end of the paper.

NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com | © 2001 Macmillan Magazines Ltd

The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

The rediscovery of Mendel's laws of heredity in the opening weeks of the 20th century (refs 1–3) sparked a scientific quest to understand the nature and content of genetic information that has propelled biology for the last hundred years. The scientific progress made falls naturally into four main phases, corresponding roughly to the four quarters of the century. The first established the cellular basis of heredity: the chromosomes. The second defined the molecular basis of heredity: the DNA double helix. The third unlocked the informational basis of heredity, with the discovery of the biological mechanism by which cells read the information contained in genes and with the invention of the recombinant DNA technologies of cloning and sequencing by which scientists can do the same.

The last quarter of a century has been marked by a relentless drive to decipher first genes and then entire genomes, spawning the field of genomics. The fruits of this work already include the genome sequences of 599 viruses and viroids, 205 naturally occurring plasmids, 185 organelles, 31 eubacteria, seven archaea, one fungus, two animals and one plant.

Here we report the results of a collaboration involving 20 groups from the United States, the United Kingdom, Japan, France, Germany and China to produce a draft sequence of the human genome. The draft genome sequence was generated from a physical map covering more than 96% of the euchromatic part of the human genome and, together with additional sequence in public databases, it covers about 94% of the human genome. The sequence was produced over a relatively short period, with coverage rising from about 10% to more than 90% over roughly fifteen months. The sequence data have been made available without restriction and updated daily throughout the project. The task ahead is to produce a finished sequence, by closing all gaps and resolving all ambiguities. Already about one billion bases are in final form and the task of bringing the vast majority of the sequence to this standard is now straightforward and should proceed rapidly.

The sequence of the human genome is of interest in several respects. It is the largest genome to be extensively sequenced so far, being 25 times as large as any previously sequenced genome and eight times as large as the sum of all such genomes. It is the first vertebrate genome to be extensively sequenced. And, uniquely, it is the genome of our own species.

Much work remains to be done to produce a complete finished sequence, but the vast trove of information that has become available through this collaborative effort allows a global perspective on the human genome. Although the details will change as the sequence is finished, many points are already clear.

• The genomic landscape shows marked variation in the distribution of a number of features, including genes, transposable elements, GC content, CpG islands and recombination rate. This gives us important clues about function. For example, the developmentally important HOX gene clusters are the most repeat-poor regions of the human genome, probably reflecting the very complex coordinate regulation of the genes in the clusters.
• There appear to be about 30,000–40,000 protein-coding genes in the human genome, only about twice as many as in worm or fly. However, the genes are more complex, with more alternative splicing generating a larger number of protein products.
• The full set of proteins (the 'proteome') encoded by the human genome is more complex than those of invertebrates. This is due in part to the presence of vertebrate-specific protein domains and motifs (an estimated 7% of the total), but more to the fact that vertebrates appear to have arranged pre-existing components into a richer collection of domain architectures.
• Hundreds of human genes appear likely to have resulted from horizontal transfer from bacteria at some point in the vertebrate lineage. Dozens of genes appear to have been derived from transposable elements.
• Although about half of the human genome derives from transposable elements, there has been a marked decline in the overall activity of such elements in the hominid lineage. DNA transposons appear to have become completely inactive and long-terminal repeat (LTR) retroposons may also have done so.
• The pericentromeric and subtelomeric regions of chromosomes are filled with large recent segmental duplications of sequence from elsewhere in the genome. Segmental duplication is much more frequent in humans than in yeast, fly or worm.
• Analysis of the organization of Alu elements explains the longstanding mystery of their surprising genomic distribution, and suggests that there may be strong selection in favour of preferential retention of Alu elements in GC-rich regions and that these 'selfish' elements may benefit their human hosts.
• The mutation rate is about twice as high in male as in female meiosis, showing that most mutation occurs in males.
• Cytogenetic analysis of the sequenced clones confirms suggestions that large GC-poor regions are strongly correlated with 'dark G-bands' in karyotypes.
• Recombination rates tend to be much higher in distal regions (around 20 megabases (Mb)) of chromosomes and on shorter chromosome arms in general, in a pattern that promotes the occurrence of at least one crossover per chromosome arm in each meiosis.
• More than 1.4 million single nucleotide polymorphisms (SNPs) in the human genome have been identified. This collection should allow the initiation of genome-wide linkage disequilibrium mapping of the genes in the human population.

In this paper, we start by presenting background information on the project and describing the generation, assembly and evaluation of the draft genome sequence. We then focus on an initial analysis of the sequence itself: the broad chromosomal landscape; the repeat elements and the rich palaeontological record of evolutionary and biological processes that they provide; the human genes and proteins and their differences and similarities with those of other
Genome Sequencing Centres (Listed in order of total genomic sequence contributed, with a partial list of personnel. A full list of contributors at each centre is available as Supplementary Information.)

Whitehead Institute for Biomedical Research, Center for Genome Research: Eric S. Lander1*, Lauren M. Linton1, Bruce Birren1*, Chad Nusbaum1*, Michael C. Zody1*, Jennifer Baldwin1, Keri Devon1, Ken Dewar1, Michael Doyle1, William FitzHugh1*, Roel Funke1, Diane Gage1, Katrina Harris1, Andrew Heaford1, John Howland1, Lisa Kann1, Jessica Lehoczky1, Rosie LeVine1, Paul McEwan1, Kevin McKernan1, James Meldrim1, Jill P. Mesirov1*, Cher Miranda1, William Morris1, Jerome Naylor1, Christina Raymond1, Mark Rosetti1, Ralph Santos1, Andrew Sheridan1, Carrie Sougnez1, Nicole Stange-Thomann1, Nikola Stojanovic1, Aravind Subramanian1 & Dudley Wyman1

The Sanger Centre: Jane Rogers2, John Sulston2*, Rachael Ainscough2, Stephan Beck2, David Bentley2, John Burton2, Christopher Clee2, Nigel Carter2, Alan Coulson2, Rebecca Deadman2, Panos Deloukas2, Andrew Dunham2, Ian Dunham2, Richard Durbin2*, Lisa French2, Darren Grafham2, Simon Gregory2, Tim Hubbard2*, Sean Humphray2, Adrienne Hunt2, Matthew Jones2, Christine Lloyd2, Amanda McMurray2, Lucy Matthews2, Simon Mercer2, Sarah Milne2, James C. Mullikin2*, Andrew Mungall2, Robert Plumb2, Mark Ross2, Ratna Shownkeen2 & Sarah Sims2

Washington University Genome Sequencing Center: Robert H. Waterston3*, Richard K. Wilson3, LaDeana W. Hillier3*, John D. McPherson3, Marco A. Marra3, Elaine R. Mardis3, Lucinda A. Fulton3, Asif T. Chinwalla3*, Kymberlie H. Pepin3, Warren R. Gish3, Stephanie L. Chissoe3, Michael C. Wendl3, Kim D. Delehaunty3, Tracie L. Miner3, Andrew Delehaunty3, Jason B. Kramer3, Lisa L. Cook3, Robert S. Fulton3, Douglas L. Johnson3, Patrick J. Minx3 & Sandra W. Clifton3

US DOE Joint Genome Institute: Trevor Hawkins4, Elbert Branscomb4, Paul Predki4, Paul Richardson4, Sarah Wenning4, Tom Slezak4, Norman Doggett4, Jan-Fang Cheng4, Anne Olsen4, Susan Lucas4, Christopher Elkin4, Edward Uberbacher4 & Marvin Frazier4

Baylor College of Medicine Human Genome Sequencing Center: Richard A. Gibbs5*, Donna M. Muzny5, Steven E. Scherer5, John B. Bouck5*, Erica J. Sodergren5, Kim C. Worley5*, Catherine M. Rives5, James H. Gorrell5, Michael L. Metzker5, Susan L. Naylor6, Raju S. Kucherlapati7, David L. Nelson & George M. Weinstock8

RIKEN Genomic Sciences Center: Yoshiyuki Sakaki9, Asao Fujiyama9, Masahira Hattori9, Tetsushi Yada9, Atsushi Toyoda9, Takehiko Itoh9, Chiharu Kawagoe9, Hidemi Watanabe9, Yasushi Totoki9 & Todd Taylor9

Genoscope and CNRS UMR-8030: Jean Weissenbach10, Roland Heilig10, William Saurin10, Francois Artiguenave10, Philippe Brottier10, Thomas Bruls10, Eric Pelletier10, Catherine Robert10 & Patrick Wincker10

GTC Sequencing Center: Douglas R. Smith11, Lynn Doucette-Stamm11, Marc Rubenfield11, Keith Weinstock11, Hong Mei Lee11 & JoAnn Dubois11

Department of Genome Analysis, Institute of Molecular Biotechnology: Andre Rosenthal12, Matthias Platzer12, Gerald Nyakatura12, Stefan Taudien12 & Andreas Rump12

Beijing Genomics Institute/Human Genome Center: Huanming Yang13, Jun Yu13, Jian Wang13, Guyang Huang14 & Jun Gu15

Multimegabase Sequencing Center, The Institute for Systems Biology: Leroy Hood16, Lee Rowen16, Anup Madan16 & Shizen Qin16

Stanford Genome Technology Center: Ronald W. Davis17, Nancy A. Federspiel17, A. Pia Abola17 & Michael J. Proctor17

Stanford Human Genome Center: Richard M. Myers18, Jeremy Schmutz18, Mark Dickson18, Jane Grimwood18 & David R. Cox18

University of Washington Genome Center: Maynard V. Olson19, Rajinder Kaul19 & Christopher Raymond19

Department of Molecular Biology, Keio University School of Medicine: Nobuyoshi Shimizu20, Kazuhiko Kawasaki20 & Shinsei Minoshima20

University of Texas Southwestern Medical Center at Dallas: Glen A. Evans21†, Maria Athanasiou21 & Roger Schultz21

University of Oklahoma's Advanced Center for Genome Technology: Bruce A. Roe22, Feng Chen22 & Huaqin Pan22

Max Planck Institute for Molecular Genetics: Juliane Ramser23, Hans Lehrach23 & Richard Reinhardt23

Cold Spring Harbor Laboratory, Lita Annenberg Hazen Genome Center: W. Richard McCombie24, Melissa de la Bastide24 & Neilay Dedhia24

GBF – German Research Centre for Biotechnology: Helmut Blöcker25, Klaus Hornischer25 & Gabriele Nordsiek25

* Genome Analysis Group (listed in alphabetical order, also includes individuals listed under other headings): Richa Agarwala26, L. Aravind26, Jeffrey A. Bailey27, Alex Bateman2, Serafim Batzoglou1, Ewan Birney28, Peer Bork29,30, Daniel G. Brown1, Christopher B. Burge31, Lorenzo Cerutti28, Hsiu-Chuan Chen26, Deanna Church26, Michele Clamp2, Richard R. Copley30, Tobias Doerks29,30, Sean R. Eddy32, Evan E. Eichler27, Terrence S. Furey33, James Galagan1, James G. R. Gilbert2, Cyrus Harmon34, Yoshihide Hayashizaki35, David Haussler36, Henning Hermjakob28, Karsten Hokamp37, Wonhee Jang26, L. Steven Johnson32, Thomas A. Jones32, Simon Kasif38, Arek Kaspryzk28, Scot Kennedy39, W. James Kent40, Paul Kitts26, Eugene V. Koonin26, Ian Korf3, David Kulp34, Doron Lancet41, Todd M. Lowe42, Aoife McLysaght37, Tarjei Mikkelsen38, John V. Moran43, Nicola Mulder28, Victor J. Pollara1, Chris P. Ponting44, Greg Schuler26, Jörg Schultz30, Guy Slater28, Arian F. A. Smit45, Elia Stupka28, Joseph Szustakowki38, Danielle Thierry-Mieg26, Jean Thierry-Mieg26, Lukas Wagner26, John Wallis3, Raymond Wheeler34, Alan Williams34, Yuri I. Wolf26, Kenneth H. Wolfe37, Shiaw-Pyng Yang3 & Ru-Fang Yeh31

Scientific management: National Human Genome Research Institute, US National Institutes of Health: Francis Collins46*, Mark S. Guyer46, Jane Peterson46, Adam Felsenfeld46* & Kris A. Wetterstrand46; Office of Science, US Department of Energy: Aristides Patrinos47; The Wellcome Trust: Michael J. Morgan48
organisms; and the history of genomic segments. (Comparisons are drawn throughout with the genomes of the budding yeast Saccharomyces cerevisiae, the nematode worm Caenorhabditis elegans, the fruit¯y Drosophila melanogaster and the mustard weed Arabidopsis thaliana; we refer to these for convenience simply as yeast, worm, ¯y and mustard weed.) Finally, we discuss applications of the sequence to biology and medicine and describe next steps in the project. A full description of the methods is provided as Supplementary Information on Nature's web site (http://www. nature.com). We recognize that it is impossible to provide a comprehensive analysis of this vast dataset, and thus our goal is to illustrate the range of insights that can be gleaned from the human genome and thereby to sketch a research agenda for the future. Background to the Human Genome Project The Human Genome Project arose from two key insights that emerged in the early 1980s: that the ability to take global views of genomes could greatly accelerate biomedical research, by allowing researchers to attack problems in a comprehensive and unbiased fashion; and that the creation of such global views would require a communal effort in infrastructure building, unlike anything previously attempted in biomedical research. Several key projects helped to crystallize these insights, including: (1) The sequencing of the bacterial viruses FX1744,5 and lambda6 , the animal virus SV407 and the human mitochondrion8 between 1977 and 1982. These projects proved the feasibility of assembling small sequence fragments into complete genomes, and showed the value of complete catalogues of genes and other functional elements. (2) The programme to create a human genetic map to make it possible to locate disease genes of unknown function based solely on their inheritance patterns, launched by Botstein and colleagues in 1980 (ref. 9). 
(3) The programmes to create physical maps of clones covering the yeast10 and worm11 genomes to allow isolation of genes and regions based solely on their chromosomal position, launched by Olson and Sulston in the mid-1980s. (4) The development of random shotgun sequencing of complementary DNA fragments for high-throughput gene discovery by Schimmel12 and Schimmel and Sutcliffe13, later dubbed expressed sequence tags (ESTs) and pursued with automated sequencing by Venter and others14±20. The idea of sequencing the entire human genome was ®rst proposed in discussions at scienti®c meetings organized by the US Department of Energy and others from 1984 to 1986 (refs 21, 22). A committee appointed by the US National Research Council endorsed the concept in its 1988 report23, but recommended a broader programme, to include: the creation of genetic, physical and sequence maps of the human genome; parallel efforts in key model organisms such as bacteria, yeast, worms, ¯ies and mice; the development of technology in support of these objectives; and research into the ethical, legal and social issues raised by human genome research. The programme was launched in the US as a joint effort of the Department of Energy and the National Institutes of Health. In other countries, the UK Medical Research Council and the Wellcome Trust supported genomic research in Britain; the Centre d'Etude du Polymorphisme Humain and the French Muscular Dystrophy Association launched mapping efforts in France; government agencies, including the Science and Technology Agency and the Ministry of Education, Science, Sports and Culture supported genomic research efforts in Japan; and the European Community helped to launch several international efforts, notably the programme to sequence the yeast genome. By late 1990, the Human Genome Project had been launched, with the creation of genome centres in these countries. Additional participants subsequently joined the effort, notably in Germany and China. 
In addition, the Human Genome Organization (HUGO) was founded to provide a forum for international coordination of genomic research. Several books (refs 24–26) provide a more comprehensive discussion of the genesis of the Human Genome Project.

Figure 1 Timeline of large-scale genomic analyses. Shown are selected components of work on several non-vertebrate model organisms (red), the mouse (blue) and the human (green) from 1990; earlier projects are described in the text. SNPs, single nucleotide polymorphisms; ESTs, expressed sequence tags.
[Timeline spanning 1984 to 2001, with rows for other organisms, mouse and human; labelled milestones include the NRC report, the pilot project (15%), chromosomes 22 and 21, the working draft (90%) and finishing (~100%).]
NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com | © 2001 Macmillan Magazines Ltd

Through 1995, work progressed rapidly on two fronts (Fig. 1). The first was construction of genetic and physical maps of the human and mouse genomes (refs 27–31), providing key tools for identification of disease genes and anchoring points for genomic sequence. The second was sequencing of the yeast (ref. 32) and worm (ref. 33) genomes, as
well as targeted regions of mammalian genomes (refs 34–37). These projects showed that large-scale sequencing was feasible and developed the two-phase paradigm for genome sequencing. In the first, 'shotgun', phase, the genome is divided into appropriately sized segments and each segment is covered to a high degree of redundancy (typically, eight- to tenfold) through the sequencing of randomly selected subfragments. The second is a 'finishing' phase, in which sequence gaps are closed and remaining ambiguities are resolved through directed analysis. The results also showed that complete genomic sequence provided information about genes, regulatory regions and chromosome structure that was not readily obtainable from cDNA studies alone.

In 1995, genome scientists considered a proposal (ref. 38) that would have involved producing a draft genome sequence of the human genome in a first phase and then returning to finish the sequence in a second phase. After vigorous debate, it was decided that such a plan was premature for several reasons. These included the need first to prove that high-quality, long-range finished sequence could be produced from most parts of the complex, repeat-rich human genome; the sense that many aspects of the sequencing process were still rapidly evolving; and the desirability of further decreasing costs.

Instead, pilot projects were launched to demonstrate the feasibility of cost-effective, large-scale sequencing, with a target completion date of March 1999. The projects successfully produced finished sequence with 99.99% accuracy and no gaps (ref. 39). They also introduced bacterial artificial chromosomes (BACs; ref. 40), a new large-insert cloning system that proved to be more stable than the cosmids and yeast artificial chromosomes (YACs; ref. 41) that had been used previously. The pilot projects drove the maturation and convergence of sequencing strategies, while producing 15% of the human genome sequence.
With successful completion of this phase, the human genome sequencing effort moved into full-scale production in March 1999.

The idea of first producing a draft genome sequence was revived at this time, both because the ability to finish such a sequence was no longer in doubt and because there was great hunger in the scientific community for human sequence data. In addition, some scientists favoured prioritizing the production of a draft genome sequence over regional finished sequence because of concerns about commercial plans to generate proprietary databases of human sequence that might be subject to undesirable restrictions on use (refs 42–44).

The consortium focused on an initial goal of producing, in a first production phase lasting until June 2000, a draft genome sequence covering most of the genome. Such a draft genome sequence, although not completely finished, would rapidly allow investigators to begin to extract most of the information in the human sequence. Experiments showed that sequencing clones covering about 90% of the human genome to a redundancy of about four- to fivefold ('half-shotgun' coverage; see Box 1) would accomplish this (refs 45, 46). The draft genome sequence goal has been achieved, as described below.

The second sequence production phase is now under way. Its aims are to achieve full-shotgun coverage of the existing clones during 2001, to obtain clones to fill the remaining gaps in the physical map, and to produce a finished sequence (apart from regions that cannot be cloned or sequenced with currently available techniques) no later than 2003.

Strategic issues

Hierarchical shotgun sequencing

Soon after the invention of DNA sequencing methods (refs 47, 48), the shotgun sequencing strategy was introduced (refs 49–51); it has remained the fundamental method for large-scale genome sequencing (refs 52–54) for the past 20 years. The approach has been refined and extended to make it more efficient.
For example, improved protocols for fragmenting and cloning DNA allowed construction of shotgun libraries with more uniform representation. The practice of sequencing from both ends of double-stranded clones ('double-barrelled' shotgun sequencing) was introduced by Ansorge and others (ref. 37) in 1990, allowing the use of 'linking information' between sequence fragments.

The application of shotgun sequencing was also extended by applying it to larger and larger DNA molecules: from plasmids (~4 kilobases (kb)) to cosmid clones (40 kb; ref. 37), to artificial chromosomes cloned in bacteria and yeast (100–500 kb; ref. 55) and bacterial genomes (1–2 megabases (Mb); ref. 56). In principle, a genome of arbitrary size may be directly sequenced by the shotgun method, provided that it contains no repeated sequence and can be uniformly sampled at random. The genome can then be assembled using the simple computer science technique of 'hashing' (in which one detects overlaps by consulting an alphabetized look-up table of all k-letter words in the data). Mathematical analysis of the expected number of gaps as a function of coverage is similarly straightforward (ref. 57).

Practical difficulties arise because of repeated sequences and cloning bias. Small amounts of repeated sequence pose little problem for shotgun sequencing. For example, one can readily assemble typical bacterial genomes (about 1.5% repeat) or the euchromatic portion of the fly genome (about 3% repeat). By contrast, the human genome is filled (>50%) with repeated sequences, including interspersed repeats derived from transposable elements, and long genomic regions that have been duplicated in tandem, palindromic or dispersed fashion (see below). These include large duplicated segments (50–500 kb) with high sequence identity (98–99.9%), at which mispairing during recombination creates deletions responsible for genetic syndromes. Such features complicate the assembly of a correct and finished genome sequence.
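To make the 'hashing' idea and the coverage argument above concrete, the following Python sketch builds a look-up table of all k-letter words and uses it to propose overlapping reads. It is a minimal illustration under stated assumptions, not the consortium's assembly software: the function names are hypothetical, a Python dictionary stands in for the alphabetized table, and the gap estimate uses the standard Poisson idealization (N reads of length L sampled uniformly from a repeat-free genome of length G, with any overlap detectable), under which roughly N·exp(-c) gaps are expected at coverage c = N·L/G. The first two reads echo the fragments drawn in Fig. 2; the third is invented.

```python
import math
from collections import defaultdict


def kmer_index(reads, k):
    """Look-up table: each k-letter word -> list of (read_id, offset) hits."""
    index = defaultdict(list)
    for read_id, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((read_id, pos))
    return index


def candidate_overlaps(reads, k):
    """Pairs of distinct reads that share at least one k-letter word."""
    pairs = set()
    for hits in kmer_index(reads, k).values():
        ids = sorted({read_id for read_id, _ in hits})
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                pairs.add((a, b))
    return pairs


def expected_gaps(n_reads, read_len, genome_len):
    """Idealized expectation N * exp(-c), with coverage c = N * L / G."""
    coverage = n_reads * read_len / genome_len
    return n_reads * math.exp(-coverage)


reads = [
    "ACCGTAAATGGGCTGATCATGCTTAAA",
    "TGATCATGCTTAAACCCTGTGCATCCTACTG",
    "GGGGGGGGGGCCCCCCCCCC",  # hypothetical unrelated read
]
print(candidate_overlaps(reads, k=8))  # -> {(0, 1)}
print(expected_gaps(n_reads=1000, read_len=500, genome_len=100_000))
```

In a genome with repeats, two reads from different copies of a repeat would share k-letter words just as true neighbours do, which is why the sketch only proposes candidate overlaps rather than confirming them.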
Figure 2 Idealized representation of the hierarchical shotgun sequencing strategy. A library is constructed by fragmenting the target genome and cloning it into a large-fragment cloning vector; here, BAC vectors are shown. The genomic DNA fragments represented in the library are then organized into a physical map and individual BAC clones are selected and sequenced by the random shotgun strategy. Finally, the clone sequences are assembled to reconstruct the sequence of the genome.
[Diagram labels, top to bottom: Genomic DNA; BAC library; Organized mapped large clone contigs; BAC to be sequenced; Shotgun clones; Shotgun sequence (...ACCGTAAATGGGCTGATCATGCTTAAA and TGATCATGCTTAAACCCTGTGCATCCTACTG...); Assembly (...ACCGTAAATGGGCTGATCATGCTTAAACCCTGTGCATCCTACTG...).]

There are two approaches for sequencing large repeat-rich genomes. The first is a whole-genome shotgun sequencing approach, as has been used for the repeat-poor genomes of viruses, bacteria and flies, using linking information and computational
analysis to attempt to avoid misassemblies. The second is the 'hierarchical shotgun sequencing' approach (Fig. 2), also referred to as 'map-based', 'BAC-based' or 'clone-by-clone'. This approach involves generating and organizing a set of large-insert clones (typically 100–200 kb each) covering the genome and separately performing shotgun sequencing on appropriately chosen clones. Because the sequence information is local, the issue of long-range misassembly is eliminated and the risk of short-range misassembly is reduced. One caveat is that some large-insert clones may suffer rearrangement, although this risk can be reduced by appropriate quality-control measures involving clone fingerprints (see below).

The two methods are likely to entail similar costs for producing finished sequence of a mammalian genome. The hierarchical approach has a higher initial cost than the whole-genome approach, owing to the need to create a map of clones (about 1% of the total cost of sequencing) and to sequence overlaps between clones. On the other hand, the whole-genome approach is likely to require much greater work and expense in the final stage of producing a finished sequence, because of the challenge of resolving misassemblies. Both methods must also deal with cloning biases, resulting in under-representation of some regions in either large-insert or small-insert clone libraries.

There was lively scientific debate over whether the human genome sequencing effort should employ whole-genome or hierarchical shotgun sequencing. Weber and Myers (ref. 58) stimulated these discussions with a specific proposal for a whole-genome shotgun approach, together with an analysis suggesting that the method could work and be more efficient. Green (ref. 59) challenged these conclusions and argued that the potential benefits did not outweigh the likely risks.

In the end, we concluded that the human genome sequencing effort should employ the hierarchical approach for several reasons.
First, it was prudent to use the approach for the first project to sequence a repeat-rich genome. With the hierarchical approach, the ultimate frequency of misassembly in the finished product would probably be lower than with the whole-genome approach, in which it would be more difficult to identify regions in which the assembly was incorrect.

Second, it was prudent to use the approach in dealing with an outbred organism, such as the human. In the whole-genome shotgun method, sequence would necessarily come from two different copies of the human genome. Accurate sequence assembly could be complicated by sequence variation between these two copies, both SNPs (which occur at a rate of 1 per 1,300 bases) and larger-scale structural heterozygosity (which has been documented in human chromosomes). In the hierarchical shotgun method, each large-insert clone is derived from a single haplotype.

Third, the hierarchical method would be better able to deal with inevitable cloning biases, because it would more readily allow targeting of additional sequencing to under-represented regions. And fourth, it was better suited to a project shared among members of a diverse international consortium, because it allowed work and responsibility to be easily distributed. As the ultimate goal has always been to create a high-quality, finished sequence to serve as a foundation for biomedical research, we reasoned that the advantages of this more conservative approach outweighed the additional cost, if any.

A biotechnology company, Celera Genomics, has chosen to incorporate the whole-genome shotgun approach into its own efforts to sequence the human genome. Their plan (refs 60, 61) uses a mixed strategy, which involves combining some coverage with whole-genome shotgun data generated by the company together with the publicly available hierarchical shotgun data generated by the International Human Genome Sequencing Consortium.
If the raw sequence reads from the whole-genome shotgun component are made available, it may be possible to evaluate the extent to which the sequence of the human genome can be assembled without the need for clone-based information. Such analysis may help to refine sequencing strategies for other large genomes.

Technology for large-scale sequencing

Sequencing the human genome depended on many technological improvements in the production and analysis of sequence data. Key innovations were developed both within and outside the Human Genome Project. Laboratory innovations included four-colour fluorescence-based sequence detection (ref. 62), improved fluorescent dyes (refs 63–66), dye-labelled terminators (ref. 67), polymerases specifically designed for sequencing (refs 68–70), cycle sequencing (ref. 71) and capillary gel electrophoresis (refs 72–74). These studies contributed to substantial improvements in the automation, quality and throughput of collecting raw DNA sequence (refs 75, 76).

There were also important advances in the development of software packages for the analysis of sequence data. The PHRED software package (refs 77, 78) introduced the concept of assigning a 'base-quality score' to each base, on the basis of the probability of an erroneous call. These quality scores make it possible to monitor raw data quality and also assist in determining whether two similar sequences truly overlap. The PHRAP computer package (http://bozeman.mbt.washington.edu/phrap.docs/phrap.html) then systematically assembles the sequence data using the base-quality scores. The program assigns 'assembly-quality scores' to each base in the assembled sequence, providing an objective criterion to guide sequence finishing. The quality scores were based on and validated by extensive experimental data.

Another key innovation for scaling up sequencing was the development by several centres of automated methods for sample preparation.
This typically involved creating new biochemical protocols suitable for automation, followed by construction of appropriate robotic systems.

Coordination and public data sharing

The Human Genome Project adopted two important principles with regard to human sequencing. The first was that the collaboration would be open to centres from any nation. Although potentially less efficient, in a narrow economic sense, than a centralized approach involving a few large factories, the inclusive approach was strongly favoured because we felt that the human genome sequence is the common heritage of all humanity and the work should transcend national boundaries, and we believed that scientific progress was best assured by a diversity of approaches. The collaboration was coordinated through periodic international meetings (referred to as 'Bermuda meetings' after the venue of the first three gatherings) and regular telephone conferences. Work was shared flexibly among the centres, with some groups focusing on particular chromosomes and others contributing in a genome-wide fashion.

The second principle was rapid and unrestricted data release. The centres adopted a policy that all genomic sequence data should be made publicly available without restriction within 24 hours of assembly (refs 79, 80). Pre-publication data releases had been pioneered in mapping projects in the worm (ref. 11) and mouse (refs 30, 81) genomes and were prominently adopted in the sequencing of the worm, providing a direct model for the human sequencing efforts. We believed that scientific progress would be most rapidly advanced by immediate and free availability of the human genome sequence. The explosion of scientific work based on the publicly available sequence data in both academia and industry has confirmed this judgement.
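Returning briefly to the base-quality scores described under 'Technology for large-scale sequencing': Box 1 ties a PHRED score of 20 to 99% accuracy and a score of 30 to 99.9% accuracy, which corresponds to the logarithmic relation q = -10·log10(p) between a score q and an error probability p. The Python sketch below simply evaluates that relation; the function names are hypothetical, and nothing here reproduces how PHRED itself derives scores from raw traces.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a quality score q: 10 ** (-q / 10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Quality score implied by an error probability p: -10 * log10(p)."""
    return -10 * math.log10(p)


def expected_errors(quality_scores):
    """Expected number of miscalled bases, summing per-base probabilities."""
    return sum(phred_to_error_prob(q) for q in quality_scores)


# Per-base accuracy implied by a few representative scores.
for q in (20, 30, 40):
    print(q, f"{1 - phred_to_error_prob(q):.4%}")
```

Summing the per-base probabilities, as in the hypothetical `expected_errors` helper, is one simple way such scores could be used to monitor raw data quality across a read.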
Generating the draft genome sequence

Generating a draft sequence of the human genome involved three steps: selecting the BAC clones to be sequenced, sequencing them and assembling the individual sequenced clones into an overall draft genome sequence. A glossary of terms related to genome sequencing and assembly is provided in Box 1.

The draft genome sequence is a dynamic product, which is regularly updated as additional data accumulate en route to the
ultimate goal of a completely finished sequence. The results below are based on the map and sequence data available on 7 October 2000, except as otherwise noted. At the end of this section, we provide a brief update of key data.

Clone selection

The hierarchical shotgun method involves the sequencing of overlapping large-insert clones spanning the genome. For the Human Genome Project, clones were largely chosen from eight large-insert libraries containing BAC or P1-derived artificial chromosome (PAC) clones (Table 1; refs 82–88). The libraries were made by partial digestion of genomic DNA with restriction enzymes. Together, they represent around 65-fold coverage (redundant sampling) of the genome. Libraries based on other vectors, such as cosmids, were also used in early stages of the project.

The libraries (Table 1) were prepared from DNA obtained from anonymous human donors in accordance with US Federal Regulations for the Protection of Human Subjects in Research (45CFR46) and following full review by an Institutional Review Board. Briefly, the opportunity to donate DNA for this purpose was broadly advertised near the two laboratories engaged in library

Box 1 Genome glossary

Sequence

Raw sequence: Individual unassembled sequence reads, produced by sequencing of clones containing DNA inserts.

Paired-end sequence: Raw sequence obtained from both ends of a cloned insert in any vector, such as a plasmid or bacterial artificial chromosome.

Finished sequence: Complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps.

Coverage (or depth): The average number of times a nucleotide is represented by a high-quality base in a collection of random raw sequence. Operationally, a 'high-quality base' is defined as one with an accuracy of at least 99% (corresponding to a PHRED score of at least 20).

Full shotgun coverage: The coverage in random raw sequence needed from a large-insert clone to ensure that it is ready for finishing; this varies among centres but is typically 8–10-fold. Clones with full shotgun coverage can usually be assembled with only a handful of gaps per 100 kb.

Half shotgun coverage: Half the amount of full shotgun coverage (typically, 4–5-fold random coverage).

Clones

BAC clone: Bacterial artificial chromosome vector carrying a genomic DNA insert, typically 100–200 kb. Most of the large-insert clones sequenced in the project were BAC clones.

Finished clone: A large-insert clone that is entirely represented by finished sequence.

Full shotgun clone: A large-insert clone for which full shotgun sequence has been produced.

Draft clone: A large-insert clone for which roughly half-shotgun sequence has been produced. Operationally, the collection of draft clones produced by each centre was required to have an average coverage of fourfold for the entire set and a minimum coverage of threefold for each clone.

Predraft clone: A large-insert clone for which some shotgun sequence is available, but which does not meet the standards for inclusion in the collection of draft clones.

Contigs and scaffolds

Contig: The result of joining an overlapping collection of sequences or clones.

Scaffold: The result of connecting contigs by linking information from paired-end reads from plasmids, paired-end reads from BACs, known messenger RNAs or other sources. The contigs in a scaffold are ordered and oriented with respect to one another.

Fingerprint clone contigs: Contigs produced by joining clones inferred to overlap on the basis of their restriction digest fingerprints.

Sequenced-clone layout: Assignment of sequenced clones to the physical map of fingerprint clone contigs.

Initial sequence contigs: Contigs produced by merging overlapping sequence reads obtained from a single clone, in a process called sequence assembly.

Merged sequence contigs: Contigs produced by taking the initial sequence contigs contained in overlapping clones and merging those found to overlap. These are also referred to simply as 'sequence contigs' where no confusion will result.

Sequence-contig scaffolds: Scaffolds produced by connecting sequence contigs on the basis of linking information.

Sequenced-clone contigs: Contigs produced by merging overlapping sequenced clones.

Sequenced-clone-contig scaffolds: Scaffolds produced by joining sequenced-clone contigs on the basis of linking information.

Draft genome sequence: The sequence produced by combining the information from the individual sequenced clones (by creating merged sequence contigs and then employing linking information to create scaffolds) and positioning the sequence along the physical map of the chromosomes.

N50 length: A measure of the contig length (or scaffold length) containing a 'typical' nucleotide. Specifically, it is the maximum length L such that 50% of all nucleotides lie in contigs (or scaffolds) of size at least L.

Computer programs and databases

PHRED: A widely used computer program that analyses raw sequence to produce a 'base call' with an associated 'quality score' for each position in the sequence. A PHRED quality score of X corresponds to an error probability of approximately $10^{-X/10}$. Thus, a PHRED quality score of 30 corresponds to 99.9% accuracy for the base call in the raw read.

PHRAP: A widely used computer program that assembles raw sequence into sequence contigs and assigns to each position in the sequence an associated 'quality score', on the basis of the PHRED scores of the raw sequence reads. A PHRAP quality score of X corresponds to an error probability of approximately $10^{-X/10}$. Thus, a PHRAP quality score of 30 corresponds to 99.9% accuracy for a base in the assembled sequence.

GigAssembler: A computer program developed during this project for merging the information from individual sequenced clones into a draft genome sequence.

Public sequence databases: The three coordinated international sequence databases: GenBank, the EMBL data library and DDBJ.

Map features

STS: Sequence tagged site, corresponding to a short (typically less than 500 bp) unique genomic locus for which a polymerase chain reaction assay has been developed.

EST: Expressed sequence tag, obtained by performing a single raw sequence read from a random complementary DNA clone.

SSR: Simple sequence repeat, a sequence consisting largely of a tandem repeat of a specific k-mer (such as (CA)15). Many SSRs are polymorphic and have been widely used in genetic mapping.

SNP: Single nucleotide polymorphism, or a single nucleotide position in the genome sequence for which two or more alternative alleles are present at appreciable frequency (traditionally, at least 1%) in the human population.

Genetic map: A genome map in which polymorphic loci are positioned relative to one another on the basis of the frequency with which they recombine during meiosis. The unit of distance is centimorgans (cM), denoting a 1% chance of recombination.

Radiation hybrid (RH) map: A genome map in which STSs are positioned relative to one another on the basis of the frequency with which they are separated by radiation-induced breaks. The frequency is assayed by analysing a panel of human–hamster hybrid cell lines, each produced by lethally irradiating human cells and fusing them with recipient hamster cells such that each carries a collection of human chromosomal fragments. The unit of distance is centirays (cR), denoting a 1% chance of a break occurring between two loci.
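Two of the glossary definitions are directly computable: the relation between a PHRED or PHRAP quality score and its error probability, and the N50 length. The following is a minimal Python sketch of both, written only from the definitions in Box 1. The function names are illustrative and are not taken from PHRED, PHRAP or any other project software.

```python
from typing import Iterable


def phred_error_probability(score: float) -> float:
    """Error probability for a quality score X, using the Box 1 relation 10^(-X/10)."""
    return 10 ** (-score / 10)


def phred_accuracy(score: float) -> float:
    """Base-call accuracy implied by a quality score (1 minus the error probability)."""
    return 1.0 - phred_error_probability(score)


def n50_length(lengths: Iterable[int]) -> int:
    """Largest L such that at least 50% of all nucleotides lie in contigs of size >= L.

    Contigs are visited from longest to shortest; the length at which the
    running total first reaches half of all nucleotides is the N50 length.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50_length needs at least one contig")


if __name__ == "__main__":
    # A score of 20 is the 'high-quality base' threshold (99% accuracy).
    print(phred_accuracy(20), phred_accuracy(30))
    # Hypothetical contig lengths, chosen only to illustrate the definition.
    print(n50_length([10, 20, 30, 40]))
```

For the hypothetical contig lengths 10, 20, 30 and 40, the two longest contigs hold 70 of the 100 nucleotides, so the N50 length is 30.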
construction. Volunteers of diverse backgrounds were accepted on a first-come, first-taken basis. Samples were obtained after discussion with a genetic counsellor and written informed consent. The samples were made anonymous as follows: the sampling laboratory stripped all identifiers from the samples, applied random numeric labels, and transferred them to the processing laboratory, which then removed all labels and relabelled the samples. All records of the labelling were destroyed. The processing laboratory chose samples at random from which to prepare DNA and immortalized cell lines. Around 5–10 samples were collected for every one that was eventually used. Because no link was retained between donor and DNA sample, the identity of the donors for the libraries is not known, even by the donors themselves. A more complete description can be found at http://www.nhgri.nih.gov/Grant_info/Funding/Statements/RFA/human_subjects.html.

During the pilot phase, centres showed that sequence-tagged sites (STSs) from previously constructed genetic and physical maps could be used to recover BACs from specific regions. As sequencing expanded, some centres continued this approach, augmented with additional probes from flow sorting of chromosomes to obtain long-range coverage of specific chromosomes or chromosomal regions (refs 89–94).

For the large-scale sequence production phase, a genome-wide physical map of overlapping clones was also constructed by systematic analysis of BAC clones representing 20-fold coverage of the human genome (ref. 86). Most clones came from the first three sections of the RPCI-11 library, supplemented with clones from sections of the RPCI-13 and CalTech D libraries (Table 1). DNA from each BAC clone was digested with the restriction enzyme HindIII, and the sizes of the resulting fragments were measured by agarose gel electrophoresis.
The pattern of restriction fragments provides a 'fingerprint' for each BAC, which allows different BACs to be distinguished and the degree of overlaps to be assessed. We used these restriction-fragment fingerprints to determine clone overlaps, and thereby assembled the BACs into fingerprint clone contigs.

The fingerprint clone contigs were positioned along the chromosomes by anchoring them with STS markers from existing genetic and physical maps. Fingerprint clone contigs were tied to specific STSs initially by probe hybridization and later by direct search of the sequenced clones. To localize fingerprint clone contigs that did not contain known markers, new STSs were generated and placed onto chromosomes (ref. 95). Representative clones were also positioned by fluorescence in situ hybridization (FISH) (ref. 86 and C. McPherson, unpublished).

We selected clones from the fingerprint clone contigs for sequencing according to various criteria. Fingerprint data were reviewed (refs 86, 90) to evaluate overlaps and to assess clone fidelity (to bias against rearranged clones; refs 83, 96). STS content information and BAC end sequence information were also used (refs 91, 92). Where possible, we tried to select a minimally overlapping set spanning a region. However, because the genome-wide physical map was constructed concurrently with the sequencing, continuity in many regions was low in early stages.
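The idea of inferring clone overlap from restriction-fragment fingerprints can be illustrated with a small Python sketch. It treats each fingerprint as a list of fragment sizes and counts fragments that agree within a relative tolerance. This is a deliberately simplified illustration, not the scoring used by the project's fingerprinting software. The function names, the 2% size tolerance, the 50% shared-fraction threshold and the fragment sizes are all assumptions made for the example.

```python
from typing import Sequence


def shared_bands(fp_a: Sequence[float], fp_b: Sequence[float],
                 tolerance: float = 0.02) -> int:
    """Count fragments in two fingerprints that match within a relative size tolerance.

    Both fingerprints are sorted and walked in parallel, so each fragment
    is matched at most once.
    """
    a, b = sorted(fp_a), sorted(fp_b)
    i = j = matches = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance * max(a[i], b[j]):
            matches += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return matches


def likely_overlap(fp_a: Sequence[float], fp_b: Sequence[float],
                   min_shared_fraction: float = 0.5,
                   tolerance: float = 0.02) -> bool:
    """Call an overlap when enough of the smaller fingerprint is shared (assumed rule)."""
    smaller = min(len(fp_a), len(fp_b))
    if smaller == 0:
        return False
    return shared_bands(fp_a, fp_b, tolerance) / smaller >= min_shared_fraction


if __name__ == "__main__":
    # Hypothetical fragment sizes in base pairs for three clones.
    bac1 = [1200, 3400, 5600, 8000, 15000]
    bac2 = [3410, 5590, 8000, 22000]
    bac3 = [700, 2500, 9900]
    print(shared_bands(bac1, bac2), likely_overlap(bac1, bac2))
    print(shared_bands(bac1, bac3), likely_overlap(bac1, bac3))
```

A fixed threshold of this kind ignores how likely it is that two unrelated clones share bands by chance, which a statistical score would take into account.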
These small fingerprint clone contigs were nonetheless useful in identifying validated, nonredundant clones

Table 1 Key large-insert genome-wide libraries

| Library name* | GenBank abbreviation | Vector type | Source DNA | Library segment or plate numbers | Enzyme digest | Average insert size (kb) | Total number of clones in library | Number of fingerprinted clones† | BAC-end sequence (ends/clones/clones with both ends sequenced)‡ | Number of clones in genome layout§ | Sequenced clones used in draft: number‖ | Sequenced clones used in draft: total bases (Mb)¶ | Sequenced clones used in draft: fraction of total from library |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Caltech B | CTB | BAC | 987SK cells | All | HindIII | 120 | 74,496 | 16 | 2/1/1 | 528 | 518 | 66.7 | 0.016 |
| Caltech C | CTC | BAC | Human sperm | All | HindIII | 125 | 263,040 | 144 | 21,956/14,445/7,255 | 621 | 606 | 88.4 | 0.021 |
| Caltech D1 (CITB-H1) | CTD | BAC | Human sperm | All | HindIII | 129 | 162,432 | 49,833 | 403,589/226,068/156,631 | 1,381 | 1,367 | 185.6 | 0.043 |
| Caltech D2 (CITB-E1) | | BAC | Human sperm | All | | | | | | | | | |
| | | | | 2,501–2,565 | EcoRI | 202 | 24,960 | | | | | | |
| | | | | 2,566–2,671 | EcoRI | 182 | 46,326 | | | | | | |
| | | | | 3,000–3,253 | EcoRI | 142 | 97,536 | | | | | | |
| RPCI-1 | RP1 | PAC | Male, blood | All | MboI | 110 | 115,200 | 3,388 | | 1,070 | 1,053 | 117.7 | 0.028 |
| RPCI-3 | RP3 | PAC | Male, blood | All | MboI | 115 | 75,513 | | | 644 | 638 | 68.5 | 0.016 |
| RPCI-4 | RP4 | PAC | Male, blood | All | MboI | 116 | 105,251 | | | 889 | 881 | 95.5 | 0.022 |
| RPCI-5 | RP5 | PAC | Male, blood | All | MboI | 115 | 142,773 | | | 1,042 | 1,033 | 116.5 | 0.027 |
| RPCI-11 | RP11 | BAC | Male, blood | All | | 178 | 543,797 | 267,931 | 379,773/243,764/134,110 | 19,405 | 19,145 | 3,165.0 | 0.743 |
| | | | | 1 | EcoRI | 164 | 108,499 | | | | | | |
| | | | | 2 | EcoRI | 168 | 109,496 | | | | | | |
| | | | | 3 | EcoRI | 181 | 109,657 | | | | | | |
| | | | | 4 | EcoRI | 183 | 109,382 | | | | | | |
| | | | | 5 | MboI | 196 | 106,763 | | | | | | |
| Total of top eight libraries | | | | | | | 1,482,502 | 321,312 | 805,320/484,278/297,997 | 25,580 | 25,241 | 3,903.9 | 0.916 |
| Total all libraries | | | | | | | | 354,510 | 812,594/488,017/100,775 | 30,445 | 29,298 | 4,260.5 | 1 |
* For the CalTech libraries (ref. 82), see http://www.tree.caltech.edu/lib_status.html; for RPCI libraries (ref. 83), see http://www.chori.org/bacpac/home.htm.

† For the FPC map and fingerprinting (refs 84–86), see http://genome.wustl.edu/gsc/human/human_database.shtml.

‡ The number of raw BAC end sequences (ends/clones/clones with both ends sequenced) available for use in human genome sequencing. Typically, for clones in which sequence was obtained from both ends, more than 95% of both end sequences contained at least 100 bp of nonrepetitive sequence. BAC-end sequencing of RPCI-11 and of the CalTech libraries was done at The Institute for Genomic Research, the California Institute of Technology and the University of Washington High Throughput Sequencing Center. The sources for the Table were http://www.ncbi.nlm.nih.gov/genome/clone/BESstat.shtml and refs 87, 88.

§ These are the clones in the sequenced-clone layout map (http://genome.wustl.edu/gsc/human/Mapping/index.shtml) that were pre-draft, draft or finished.

‖ The number of sequenced clones used in the assembly. This number is less than that in the previous column owing to removal of a small number of obviously contaminated, combined or duplicated projects; in addition, not all of the clones from completed chromosomes 21 and 22 were included here because only the available finished sequence from those chromosomes was used in the assembly.

¶ The number reported is the total sequence from the clones indicated in the previous column. Potential overlap between clones was not removed here, but Ns were excluded.
that were used to 'seed' the sequencing of new regions. The small fingerprint clone contigs were extended or merged with others as the map matured.

The clones that make up the draft genome sequence therefore do not constitute a minimally overlapping set: there is overlap and redundancy in places. The cost of using suboptimal overlaps was justified by the benefit of earlier availability of the draft genome sequence data. Minimizing the overlap between adjacent clones would have required completing the physical map before undertaking large-scale sequencing. In addition, the overlaps between BAC clones provide a rich collection of SNPs. More than 1.4 million SNPs have already been identified from clone overlaps and other sequence comparisons (ref. 97).

Because the sequencing project was shared among twenty centres in six countries, it was important to coordinate selection of clones across the centres. Most centres focused on particular chromosomes or, in some cases, larger regions of the genome. We also maintained a clone registry to track selected clones and their progress. In later phases, the global map provided an integrated view of the data from all centres, facilitating the distribution of effort to maximize coverage of the genome. Before performing extensive sequencing on a clone, several centres routinely examined an initial sample of 96 raw sequence reads from each subclone library to evaluate possible overlap with previously sequenced clones.

Sequencing

The selected clones were subjected to shotgun sequencing. Although the basic approach of shotgun sequencing is well established, the details of implementation varied among the centres. For example, there were differences in the average insert size of the shotgun libraries, in the use of single-stranded or double-stranded cloning vectors, and in sequencing from one end or both ends of each insert. Centres differed in the fluorescent labels employed and in the degree to which they used dye-primers or dye-terminators.
The sequence detectors included both slab gel- and capillary-based devices. Detailed protocols are available on the web sites of many of the individual centres (URLs can be found at www.nhgri.nih.gov/genome_hub). The extent of automation also varied greatly among the centres, with the most aggressive automation efforts resulting in factory-style systems able to process more than 100,000 sequencing reactions in 12 hours (Fig. 3). In addition, centres differed in the amount of raw sequence data typically obtained for each clone (so-called half-shotgun, full shotgun and finished sequence).

Sequence information from the different centres could be directly integrated despite this diversity, because the data were analysed by a common computational procedure. Raw sequence traces were processed and assembled with the PHRED and PHRAP software packages (refs 77, 78; P. Green, unpublished). All assembled contigs of more than 2 kb were deposited in public databases within 24 hours of assembly.

The overall sequencing output rose sharply during production (Fig. 4). Following installation of new sequence detectors beginning in June 1999, sequencing capacity and output rose approximately eightfold in eight months to nearly 7 million samples processed per month, with little or no drop in success rate (ratio of useable reads to attempted reads). By June 2000, the centres were producing raw sequence at a rate equivalent to onefold coverage of the entire human genome in less than six weeks. This corresponded to a continuous throughput exceeding 1,000 nucleotides per second, 24 hours per day, seven days per week. This scale-up resulted in a concomitant increase in the sequence available in the public databases (Fig. 4).

A version of the draft genome sequence was prepared on the basis of the map and sequence data available on 7 October 2000. For this version, the mapping effort had assembled the fingerprinted BACs into 1,246 fingerprint clone contigs.
The sequencing effort had sequenced and assembled 29,298 overlapping BACs and other large-insert clones (Table 2), comprising a total length of 4.26 gigabases (Gb). This resulted from around 23 Gb of underlying raw shotgun sequence data, or about 7.5-fold coverage averaged across the genome (including both draft and finished sequence). The various contributions to the total amount of sequence deposited in the HTGS division of GenBank are given in Table 3.

NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com
© 2001 Macmillan Magazines Ltd

Figure 3 The automated production line for sample preparation at the Whitehead Institute, Center for Genome Research. The system consists of custom-designed factory-style conveyor belt robots that perform all functions from purifying DNA from bacterial cultures through setting up and purifying sequencing reactions.

Figure 4 Total amount of human sequence in the High Throughput Genome Sequence (HTGS) division of GenBank. The total is the sum of finished sequence (red) and unfinished (draft plus predraft) sequence (yellow). [Plot not reproduced: x axis, month (Jan-96 to Oct-00); y axis, sequence (Mb), 0 to 5,000; series, finished and unfinished (draft and pre-draft).]

Table 2 Total genome sequence from the collection of sequenced clones, by sequence status

Sequence status  Number of clones  Total clone length (Mb)  Average number of sequence reads per kb*  Average sequence depth†  Total amount of raw sequence (Mb)
Finished         8,277             897                      20–25                                     8–12                     9,085
Draft            18,969            3,097                    12                                        4.5                      13,395
Predraft         2,052             267                      6                                         2.5                      667
Total                                                                                                                          23,147

* The average number of reads per kb was estimated based on information provided by each sequencing centre. This number differed among sequencing centres, based on the actual protocols used.
† The average depth in high quality bases (≥99% accuracy) was estimated from information provided by each sequencing centre. The average varies among the centres, and the number may vary considerably for clones with the same sequencing status. For draft clones in the public databases (keyword: HTGS_draft), the number can be computed from the quality scores listed in the database entry.
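The figures in Table 2 and the throughput quoted in the text can be cross-checked with a few lines of arithmetic. The sketch below is only a rough sanity check: the variable names are illustrative, and dividing raw sequence by clone length is an assumed proxy for redundancy, not the definition of sequence depth used in the table (which counts only high-quality bases).

```python
# Values copied from Table 2 (in Mb); the dictionary layout is illustrative.
table2 = {
    "Finished": {"clone_length_mb": 897, "raw_sequence_mb": 9085},
    "Draft": {"clone_length_mb": 3097, "raw_sequence_mb": 13395},
    "Predraft": {"clone_length_mb": 267, "raw_sequence_mb": 667},
}


def raw_fold(row):
    """Raw sequence divided by clone length: a crude redundancy estimate."""
    return row["raw_sequence_mb"] / row["clone_length_mb"]


# Total raw sequence across all three categories (the 'Total' row of Table 2).
total_raw_mb = sum(row["raw_sequence_mb"] for row in table2.values())

# Per-category ratio of raw sequence to clone length, rounded to one decimal.
folds = {status: round(raw_fold(row), 1) for status, row in table2.items()}

# Bases produced in six weeks at a sustained 1,000 nucleotides per second.
SECONDS_PER_WEEK = 7 * 24 * 60 * 60
bases_in_six_weeks = 1000 * 6 * SECONDS_PER_WEEK

print(total_raw_mb)        # 23147
print(folds)               # {'Finished': 10.1, 'Draft': 4.3, 'Predraft': 2.5}
print(bases_in_six_weeks)  # 3628800000
```

The ratios fall close to the tabulated depths (8–12, 4.5 and 2.5), and the six-week total of about 3.6 Gb is of the same order as a single pass over the genome, in line with the throughput statement in the text.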
By agreement among the centres, the collection of draft clones produced by each centre was required to have fourfold average sequence coverage, with no clone below threefold. (For this purpose, sequence coverage was defined as the average number of times that each base was independently read with a base-quality score corresponding to at least 99% accuracy.) We attained an overall average of 4.5-fold coverage across the genome for draft clones. A few of the sequenced clones fell below the minimum of threefold sequence coverage or have not been formally designated by centres as meeting draft standards; these are referred to as predraft (Table 2). Some of these are clones that span remaining gaps in the draft genome sequence and were in the process of being sequenced on 7 October 2000; a few are old submissions from centres that are no longer active.

The lengths of the initial sequence contigs in the draft clones vary as a function of coverage, but half of all nucleotides reside in initial sequence contigs of at least 21.7 kb (see below). Various properties of the draft clones can be assessed from instances in which there was substantial overlap between a draft clone and a finished (or nearly finished) clone. By examining the sequence alignments in the overlap regions, we estimated that the initial sequence contigs in a draft sequence clone cover an average of about 96% of the clone and are separated by gaps with an average size of about 500 bp.

Although the main emphasis was on producing a draft genome sequence, the centres also maintained sequence finishing activities during this period, leading to a twofold increase in finished sequence from June 1999 to June 2000 (Fig. 4). The total amount of human sequence in this final form stood at more than 835 Mb on 7 October 2000, or more than 25% of the human genome. This includes the finished sequences of chromosomes 21 and 22 (refs 93, 94).
As centres have begun to shift from draft to finished sequencing in the last quarter of 2000, the production of finished sequence has increased to an annualized rate of 1 Gb per year and is continuing to rise.

In addition to sequencing large-insert clones, three centres generated a large collection of random raw sequence reads from whole-genome shotgun libraries (Table 4; ref. 98). These 5.77 million successful sequences contained 2.4 Gb of high-quality bases; this corresponds to about 0.75-fold coverage and would be statistically expected to include about 50% of the nucleotides in the human genome (data available at http://snp.cshl.org/data). The primary objective of this work was to discover SNPs, by comparing these random raw sequences (which came from different individuals) with the draft genome sequence. However, many of these raw sequences were obtained from both ends of plasmid clones and thereby also provided valuable 'linking' information that was used in sequence assembly. In addition, the random raw sequences provide sequence coverage of about half of the nucleotides not yet represented in the sequenced large-insert clones; these can be used as probes for portions of the genome not yet recovered.

Assembly of the draft genome sequence

We then set out to assemble the sequences from the individual large-insert clones into an integrated draft sequence of the human genome. The assembly process had to resolve problems arising from the draft nature of much of the sequence, from the variety of clone sources, and from the high fraction of repeated sequences in the human genome. This process involved three steps: filtering, layout and merging.

The entire data set was filtered uniformly to eliminate contamination from nonhuman sequences and other artefacts that had not already been removed by the individual centres. (Information about contamination was also sent back to the centres, which are updating the individual entries in the public databases.)
We also identified instances in which the sequence data from one BAC clone was substantially contaminated with sequence data from another (human or nonhuman) clone. The problems were resolved in most instances; 231 clones remained unresolved, and these were eliminated from the assembly reported here. Instances of lower levels of cross-contamination (for example, a single 96-well microplate misassigned to the wrong BAC) are more difficult to detect; some undoubtedly remain and may give rise to small spurious sequence contigs in the draft genome sequence. Such issues are readily resolved as the clones progress towards finished sequence, but they necessitate some caution in certain applications of the current data.

The sequenced clones were then associated with specific clones on the physical map to produce a 'layout'. In principle, sequenced clones that correspond to fingerprinted BACs could be directly assigned by name to fingerprint clone contigs on the fingerprint-based physical map. In practice, however, laboratory mixups occasionally resulted in incorrect assignments.
Table 3 Total human sequence deposited in the HTGS division of GenBank

Sequencing centre                                                     Total human sequence (kb)  Finished human sequence (kb)
Whitehead Institute, Center for Genome Research*                      1,196,888                  46,560
The Sanger Centre*                                                    970,789                    284,353
Washington University Genome Sequencing Center*                       765,898                    175,279
US DOE Joint Genome Institute                                         377,998                    78,486
Baylor College of Medicine Human Genome Sequencing Center             345,125                    53,418
RIKEN Genomic Sciences Center                                         203,166                    16,971
Genoscope                                                             85,995                     48,808
GTC Sequencing Center                                                 71,357                     7,014
Department of Genome Analysis, Institute of Molecular Biotechnology   49,865                     17,788
Beijing Genomics Institute/Human Genome Center                        42,865                     6,297
Multimegabase Sequencing Center; Institute for Systems Biology        31,241                     9,676
Stanford Genome Technology Center                                     29,728                     3,530
The Stanford Human Genome Center and Department of Genetics           28,162                     9,121
University of Washington Genome Center                                24,115                     14,692
Keio University                                                       17,364                     13,058
University of Texas Southwestern Medical Center at Dallas             11,670                     7,028
University of Oklahoma Advanced Center for Genome Technology          10,071                     9,155
Max Planck Institute for Molecular Genetics                           7,650                      2,940
GBF – German Research Centre for Biotechnology                        4,639                      2,338
Cold Spring Harbor Laboratory Lita Annenberg Hazen Genome Center      4,338                      2,104
Other                                                                 59,574                     35,911
Total                                                                 4,338,224                  842,027

Total human sequence deposited in GenBank by members of the International Human Genome Sequencing Consortium, as of 8 October 2000. The amount of total sequence (finished plus draft plus predraft) is shown in the second column and the amount of finished sequence is shown in the third column. Total sequence differs from totals in Tables 1 and 2 because of inclusion of padding characters and of some clones not used in assembly. HTGS, high throughput genome sequence.
* These three centres produced an additional 2.4 Gb of raw plasmid paired-end reads (see Table 4), consisting of 0.99 Gb from Whitehead Institute, 0.66 Gb from The Sanger Centre and 0.75 Gb from Washington University.

Table 4 Plasmid paired-end reads

                Total reads deposited*  Read pairs†  Size range of inserts (kb)
Random-sheared  3,227,685               1,155,284    1.8–6
Enzyme digest   2,539,222               761,010      0.8–4.7
Total           5,766,907               1,916,294

The plasmid paired-end reads used a mixture of DNA from a set of 24 samples from the DNA Polymorphism Discovery Resource (http://locus.umdnj.edu/nigms/pdr.html). This set of 24 anonymous US residents contains samples from European-Americans, African-Americans, Mexican-Americans, Native Americans and Asian-Americans, although the ethnicities of the individual samples are not identified. Informed consent to contribute samples to the DNA Polymorphism Discovery Resource was obtained from all 450 individuals who contributed samples. Samples from the European-American, African-American and Mexican-American individuals came from NHANES (http://www.cdc.gov/nchs/nhanes.htm); individuals were recontacted to obtain their consent for the Resource project. New samples were obtained from Asian-Americans whose ancestry was from a variety of East and South Asian countries. New samples were also obtained for the Native Americans; tribal permission was obtained first, and then individual consents. See http://www.nhgri.nih.gov/Grant_info/Funding/RFA/discover_polymorphisms.html and ref. 98.
* Reflects data deposited with and released by The SNP Consortium (see http://snp.cshl.org/data).
† Read pairs represents the number of cases in which sequence from both ends of a genomic cloned fragment was determined and used in this study as linking information.

To eliminate such problems, sequenced clones were associated with the fingerprint clone contigs in the physical map by using the sequence data to calculate a
partial list of restriction fragments in silico and comparing that list with the experimental database of BAC fingerprints. The comparison was feasible because the experimental sizing of restriction fragments was highly accurate (to within 0.5–1.5% of the true size, for 95% of fragments from 600 to 12,000 base pairs (bp)) (refs 84, 85). Reliable matching scores could be obtained for 16,193 of the clones. The remaining sequenced clones could not be placed on the map by this method because they were too short, or they contained too many small initial sequence contigs to yield enough restriction fragments, or possibly because their sequences were not represented in the fingerprint database.

An independent approach to placing sequenced clones on the physical map used the database of end sequences from fingerprinted BACs (Table 1). Sequenced clones could typically be reliably mapped if they contained multiple matches to BAC ends, with all corresponding to clones from a single genomic region (multiple matches were required as a safeguard against errors known to exist in the BAC end database and against repeated sequences). This approach provided useful placement information for 22,566 sequenced clones.

Altogether, we could assign 25,403 sequenced clones to fingerprint clone contigs by combining in silico digestion and BAC end sequence match data. To place most of the remaining sequenced clones, we exploited information about sequence overlap or BAC-end paired links of these clones with already positioned clones. This left only a few, mostly small, sequenced clones that could not be placed (152 sequenced clones containing 5.5 Mb of sequence out of 29,298 sequenced clones containing more than 4,260 Mb of sequence); these are being localized by radiation hybrid mapping of STSs derived from their sequences.
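The in silico digestion step described above can be illustrated with a minimal sketch. It is not the consortium's software or scoring scheme: the function names are hypothetical, the recognition site AAGCTT (HindIII) is assumed purely for illustration, and the greedy matching rule is a simplification. Only fragments bounded by two sites inside the same contig are reported, which is why a draft clone made of several contigs yields a partial list. The 1.5% tolerance and the 600 to 12,000 bp window are taken from the sizing accuracy quoted in the text.

```python
import re


def internal_fragments(contig, site="AAGCTT"):
    """Lengths of fragments bounded by two recognition sites within one contig.

    End pieces are dropped because their true length is unknown in a draft contig.
    """
    starts = [m.start() for m in re.finditer(f"(?={re.escape(site)})", contig)]
    return [b - a for a, b in zip(starts, starts[1:])]


def count_matching_fragments(predicted, observed, tolerance=0.015,
                             min_bp=600, max_bp=12000):
    """Count predicted fragments that pair with a distinct observed fragment.

    A pair matches when the sizes differ by at most `tolerance` (relative).
    Matching is greedy: each observed fragment is used at most once.
    """
    remaining = sorted(observed)
    matches = 0
    for size in sorted(p for p in predicted if min_bp <= p <= max_bp):
        for i, obs in enumerate(remaining):
            if abs(obs - size) <= tolerance * size:
                matches += 1
                del remaining[i]
                break
    return matches


# Toy contig with three sites, giving two complete fragments (700 and 1,200 bp).
contig = "CC" + "AAGCTT" + "G" * 694 + "AAGCTT" + "T" * 1194 + "AAGCTT" + "ACGT"
predicted = internal_fragments(contig)

# Hypothetical fingerprints (measured fragment sizes in bp) for two BACs.
print(predicted)                                                # [700, 1200]
print(count_matching_fragments(predicted, [705, 1190, 3000]))   # 2
print(count_matching_fragments(predicted, [650, 1300]))         # 0
```

A real matching score would also need to account for the number of fragments expected to match by chance, which this sketch ignores.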
The fingerprint clone contigs were then mapped to chromosomal locations, using sequence matches to mapped STSs from four human radiation hybrid maps (refs 95, 99, 100), one YAC and radiation hybrid map (ref. 29), and two genetic maps (refs 101, 102), together with data from FISH (refs 86, 90, 103). The mapping was iteratively refined by comparing the order and orientation of the STSs in the fingerprint clone contigs and the various STS-based maps, to identify and refine discrepancies (Fig. 5). Small fingerprint clone contigs (< 1 Mb) were difficult to orient and, sometimes, to order using these methods.

In all, 942 fingerprint clone contigs contained sequenced clones. (An additional 304 of the 1,246 fingerprint clone contigs did not contain sequenced clones, but these tended to be extremely small and together contain less than 1% of the mapped clones. About one-third have been targeted for sequencing. A few derive from the Y chromosome, for which the map was constructed separately (ref. 89). Most of the remainder are fragments of other larger contigs or represent other artefacts. These are being eliminated in subsequent versions of the database.) Of these 942 contigs with sequenced clones, 852 (90%, containing 99.2% of the total sequence) were localized to specific chromosome locations in this way. An additional 51 fingerprint clone contigs, containing 0.5% of the sequence, could be assigned to a specific chromosome but not to a precise position.
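Comparing marker order between a contig and an STS-based map, as described above, can be pictured with a small sketch. It is an illustration under stated assumptions rather than the procedure the consortium used: the function name and the marker positions are hypothetical, and the rule (counting concordant versus discordant marker pairs, in the spirit of a Kendall rank statistic) is just one simple way to flag a segment whose orientation disagrees with a map. Because only relative order matters, the two position lists may use different units (for example Mb and cM).

```python
def orientation(draft_positions, map_positions):
    """Return +1 if marker order agrees, -1 if it is reversed, 0 if undecided.

    Both lists give positions of the same markers, in the same marker order.
    """
    concordant = discordant = 0
    n = len(draft_positions)
    for i in range(n):
        for j in range(i + 1, n):
            product = ((draft_positions[i] - draft_positions[j])
                       * (map_positions[i] - map_positions[j]))
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
    if concordant > discordant:
        return 1
    if discordant > concordant:
        return -1
    return 0


# Hypothetical markers: positions in the draft (Mb) and on a genetic map (cM).
agrees = orientation([12.0, 12.5, 13.1], [36.0, 37.5, 38.2])
inverted = orientation([10.1, 10.4, 10.9, 11.3], [35.0, 34.2, 33.1, 32.5])
print(agrees, inverted)  # 1 -1
```

With very few markers the vote is easily tied or swayed by a single mapping error, which is consistent with the difficulty of orienting small contigs noted in the text.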
Figure 5 Positions of markers on previous maps of the genome (the Genethon genetic map (ref. 101) and Marshfield genetic map (http://research.marshfieldclinic.org/genetics/genotyping_service/mgsver2.htm), the GeneMap99 radiation hybrid map (ref. 100), and the Whitehead YAC and radiation hybrid map (ref. 29)) plotted against their derived position on the draft sequence for chromosome 2. The horizontal units are Mb but the vertical units of each map vary (cM, cR and so on) and thus all were scaled so that the entire map spans the full vertical range. Markers that map to other chromosomes are shown in the chromosome lines at the top. The data sets generally follow the diagonal, indicating that order and orientation of the marker sets on the different maps largely agree (note that the two genetic maps are completely superimposed). In a, there are two segments (bars) that are inverted in an earlier version draft sequence relative to all the other maps. b, The same chromosome after the information was used to reorient those two segments. [Plot not reproduced: panels a and b; x axis, chromosome 2 position (50 to 250); y axis, map location; series, Genethon map, Gene map, Marshfield map and YAC map; top strip, chromosomes 1, 3 to 22, X and Y.]

Figure 6 The key steps (a–d) in assembling individual sequenced clones into the draft genome sequence. A1–A5 represent initial sequence contigs derived from shotgun sequencing of clone A, and B1–B6 are from clone B. [Diagram not reproduced: panel labels 'end-to-end alignment: OK' and 'alignment in middle only: not OK'.]