Science The Sequence of the Human Genome J Craig Venter, et al. Science291,1304(2001) Do|:10.1126/ scIence.10580 NAAAS he following resources related to this article are available online at www.sciencemag.org(thisinformationiscurrentasofSeptember27,2009: A correction has been published for this article at http://www.sciencemag.orglcgilcontent/fulllsci;295/5559/1466b Updated information and services, including high-resolution figures, can be found in the online version of this article at http://ww.sciencemag.org/cgi/content/full/291/5507/1304 Supporting Online Material can be found at http://www.sciencemag.org/cgi/content/full/291/5507/1304/dc1 A list of selected additional articles on the science Web sites related to this article can be found at http://www.sciencemag.org/cgi/content/full/291/5507/1304#trelated-content This article cites 152 articles. 52 of which can be accessed for free http://www.sciencemag.org/cgi/content/full/291/5507/1304#otherarticles This article has been cited by 4774 article(s)on the ISI Web of Science This article has been cited by 80 articles hosted by HighWire Press;see http://www.sciencemag.org/cgi/content/ful/29175 04*otherarticles 8NNNEaoomo This article appears in the following subject collections Genetics http://www.sciencemag.org/cgi/collection/genetics Information about obtaining reprints of this article or about obtaining permission to reproduce this article in whole or in part can be found at http://www.sciencemag.org/about/permissions.dtl 5ogEgco3sE9品 Science(print ISSN 0036-8075: online ISSN 1095-9203)is published weekly, except the last week in December, by the American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. Copyright 2001 by the American Association for the Advancement of Science; all rights reserved. The title Science is a egistered trademark of AAAS
DOI: 10.1126/science.1058040 Science 291, 1304 (2001); J. Craig Venter, et al. The Sequence of the Human Genome www.sciencemag.org (this information is current as of September 27, 2009 ): The following resources related to this article are available online at http://www.sciencemag.org/cgi/content/full/sci;295/5559/1466b A correction has been published for this article at: http://www.sciencemag.org/cgi/content/full/291/5507/1304 version of this article at: Updated information and services, including high-resolution figures, can be found in the online http://www.sciencemag.org/cgi/content/full/291/5507/1304/DC1 Supporting Online Material can be found at: found at: A list of selected additional articles on the Science Web sites related to this article can be http://www.sciencemag.org/cgi/content/full/291/5507/1304#related-content http://www.sciencemag.org/cgi/content/full/291/5507/1304#otherarticles This article cites 152 articles, 52 of which can be accessed for free: This article has been cited by 4774 article(s) on the ISI Web of Science. http://www.sciencemag.org/cgi/content/full/291/5507/1304#otherarticles This article has been cited by 80 articles hosted by HighWire Press; see: http://www.sciencemag.org/cgi/collection/genetics Genetics This article appears in the following subject collections: http://www.sciencemag.org/about/permissions.dtl this article in whole or in part can be found at: Information about obtaining reprints of this article or about obtaining permission to reproduce registered trademark of AAAS. 2001 by the American Association for the Advancement of Science; all rights reserved. The title Science is a American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. Copyright Science (print ISSN 0036-8075; online ISSN 1095-9203) is published weekly, except the last week in December, by the on September 27, 2009 www.sciencemag.org Downloaded from
THE HUMAN GENOME The Sequence of the Human Genome J. Craig Venter, Mark D. Adams, Eugene W. Myers, Peter W. Li,'Richard ].Mural, Granger G. Sutton, Hamilton O. Smith, Mark Yandell, 'Cheryl A. Evans, Robert A.Holt, T Jeannine D. Gocayne, Peter Amanatides, Richard M. Ballew, Daniel H. Hus Jennifer Russo Wortman, 'Qing Zhang, 'Chinnappa D. Kodira, ' Xiangqun H. Zheng, 'Lin Chen, Marian Skupski, 'Gangadharan Subramanian, 'Paul D. Thomas, 'Jinghui Zhang, George L. Gabor Miklos,2 Catherine Nelson, Samuel Broder, ' Andrew G. Clark, Joe Nadeau,' Victor A McKusick, Norton Zinder, Arnold J. Levine,Richard J. Roberts, Mel Simor Carolyn Slayman, 0 Michael Hunkapiller, 1 Randall Bolanos, Arthur Delcher, ' lan Dew, Daniel Fasulo, Michael Flanigan, Liliana Florea, Aaron Halpern, Sridhar Hannenhalli, Saul Kravitz, Samuel Levy. Clark Mobarry, 'Knut Reinert, Karin Remington, Jane Abu-Threideh, ELLen Beasley, ' Kendra Biddick,' Vivien Bonazzi, Rhonda Brandon, Michele Cargill, Ishwar Chandramouliswaran, ' Rosane Charlab, Kabir Chaturvedi, Zuoming Deng, 'Valentina Di Francesco, 'Patrick Dunn, ' Karen Eilbeck, Carlos Evangelista, ' Andrei E Gabrielian, Weiniu Gan, Wangmao Ge, 'Fangcheng Gong, Zhiping Gu, Gennady V Merkulov, Natalia Milshina, Helen M Moore, Ashwinikumar K Naik, Vaibhav A Narayan, Beena Neelam, Deborah Nusskern, Douglas B Rusch, Steven Salzberg Wei Shao, 'Bixiong Shue, Jingtao Sun, Zhen Yuan Wang, Aihui Wang, Xin Wang, Jian Wang, Ming-Hui Wei, 'Ron Wides, cHunlin Xiao, Chunhua Yan, Alison Yao, Jane Ye, Ming Zhan Weiqing Zhang, Hongyu Zhang, Qi Zhao, Liansheng Zheng, Fei Zhong, Wenyan Zhong, hiaoping C Zhu, Shaying Zhao, Dennis Gilbert, Suzanna Baumhueter, Gene Spier, Christine Carter, Anibal Cravchik, Trevor Woodage, Feroze ALi, Huijin An, Aderonke Awe, Danita Baldwin, Holly Baden, Mary Barnstead, lan Barrow, Karen Beeson, Dana Busam, Amy Carver, Angela Center, Ming Lai Cheng, Liz Curry, Steve Danaher, ' Lionel Davenport, Raymond Desilets, Susanne Dietz, Kristina Dodson, Lisa Doup, Steven Ferriera, ' Neha Garg, Damon Hostin, Jarrett Houck, Timothy Howland, Chinyere lbegwam, Jeffery Johnson,'un,7 Andres GLuecksmann, Brit Hart, Jason Haynes, Charles Haynes, Cheryl Heiner, Suzanne Hlac ooogE5go33E Francis Kalush, Lesley Kline, Shashi Koduru, Amy Love, Felecia Mann, David May, teven McCawley, Tina McIntosh, Ivy McMullen, Mee Moy, Linda Moy, Brian Murphy. Keith Nelson, ' Cynthia Pfannkoch, Eric Pratts, Vinita Puri, Hina Qureshi, Matthew Reardon, Robert Rodriguez, Yu-Hui Rogers, ' Deanna Romblad, Bob Ruhfel, Richard Scott, Cynthia Sitter, Michelle Smallwood, Erin Stewart, Renee Strong, ELlen Suh, Reginald Thomas, NiNi Tint, Sukyee Tse, 'Claire Vech, Gary Wang, Jeremy Wetter, sherita Williams, Monica Williams, Sandra Windsor, Emily Winn-Deen, Keriellen Wolfe, Jayshree Zaveri, Karena Zaveri, Josep F. Abril, 4 Roderic Guigo. 4 Michael J. Campbell, Kimmen V. Sjolander, ' Brian Karlak, 1 Anish Kejariwal, Huaiyu Mi, 'Betty Lazareva, ' Thomas Hatton, Apurva Narechania, 'Karen Diemer, Anushya Muruganujan, 'Nan Guo, Shinji Sato, Vineet Bafna, Sorin Istrail, ' Ross Lippert, T Russell Schwartz, Brian Walenz, Shibu Yooseph, David Allen, Anand Basu, James Baxendale, Louis Blick, Marcelo Caminha, John Carnes-Stine, ' Parris Caulk, Yen-Hui Chiang, My Coyne, Carl Dahlke, 'Anne Deslattes Mays, Maria Dombroski, ' Michael Donnelly, Dale Ely, Shiva Sparham, Carl Fosler, Harold Gire, Stephen Glanowski, Kenneth Glasser, Anna Glodek, Mark Gorokhov Ken Graham, Barry Gropman, ' Michael Harris, Jeremy Heil, Scott Henderson, Jeffrey Hoover, Donald Jennings, Catherine Jordan, James Jordan, John Kasha, Leonid Kagan, Cheryl Kraft, Alexander Levitsky, Mark Lewis, Xiangjun Liu, ' John Lopez, 'Daniel Ma, William Majoros, T Joe McDaniel, 'Sean Murphy, Matthew Newman, Trung Nguyen, 'Ngoc Nguyen, Marc NodelL, Sue Pan, Jim Peck, Marshall Peterson, william Rowe, Robert Sanders, John Scott, Michael Simpson, Thomas Smith, Arlan Sprague, ' Timothy Stockwell, Russell Turner, 'ELi Venter, Mei Wang 'Meiyuan Wen, 'David Wu, 'Mitchell Wu, Ashley Xia, ' Ali Zandieh, Xiaohong Zhu 1304 16FebRuaRy2001Vol291SciEncewww.sciencemag.org
The Sequence of the Human Genome J. Craig Venter,1 * Mark D. Adams,1 Eugene W. Myers,1 Peter W. Li,1 Richard J. Mural,1 Granger G. Sutton,1 Hamilton O. Smith,1 Mark Yandell,1 Cheryl A. Evans,1 Robert A. Holt,1 Jeannine D. Gocayne,1 Peter Amanatides,1 Richard M. Ballew,1 Daniel H. Huson,1 Jennifer Russo Wortman,1 Qing Zhang,1 Chinnappa D. Kodira,1 Xiangqun H. Zheng,1 Lin Chen,1 Marian Skupski,1 Gangadharan Subramanian,1 Paul D. Thomas,1 Jinghui Zhang,1 George L. Gabor Miklos,2 Catherine Nelson,3 Samuel Broder,1 Andrew G. Clark,4 Joe Nadeau,5 Victor A. McKusick,6 Norton Zinder,7 Arnold J. Levine,7 Richard J. Roberts,8 Mel Simon,9 Carolyn Slayman,10 Michael Hunkapiller,11 Randall Bolanos,1 Arthur Delcher,1 Ian Dew,1 Daniel Fasulo,1 Michael Flanigan,1 Liliana Florea,1 Aaron Halpern,1 Sridhar Hannenhalli,1 Saul Kravitz,1 Samuel Levy,1 Clark Mobarry,1 Knut Reinert,1 Karin Remington,1 Jane Abu-Threideh,1 Ellen Beasley,1 Kendra Biddick,1 Vivien Bonazzi,1 Rhonda Brandon,1 Michele Cargill,1 Ishwar Chandramouliswaran,1 Rosane Charlab,1 Kabir Chaturvedi,1 Zuoming Deng,1 Valentina Di Francesco,1 Patrick Dunn,1 Karen Eilbeck,1 Carlos Evangelista,1 Andrei E. Gabrielian,1 Weiniu Gan,1 Wangmao Ge,1 Fangcheng Gong,1 Zhiping Gu,1 Ping Guan,1 Thomas J. Heiman,1 Maureen E. Higgins,1 Rui-Ru Ji,1 Zhaoxi Ke,1 Karen A. Ketchum,1 Zhongwu Lai,1 Yiding Lei,1 Zhenya Li,1 Jiayin Li,1 Yong Liang,1 Xiaoying Lin,1 Fu Lu,1 Gennady V. Merkulov,1 Natalia Milshina,1 Helen M. Moore,1 Ashwinikumar K Naik,1 Vaibhav A. Narayan,1 Beena Neelam,1 Deborah Nusskern,1 Douglas B. Rusch,1 Steven Salzberg,12 Wei Shao,1 Bixiong Shue,1 Jingtao Sun,1 Zhen Yuan Wang,1 Aihui Wang,1 Xin Wang,1 Jian Wang,1 Ming-Hui Wei,1 Ron Wides,13 Chunlin Xiao,1 Chunhua Yan,1 Alison Yao,1 Jane Ye,1 Ming Zhan,1 Weiqing Zhang,1 Hongyu Zhang,1 Qi Zhao,1 Liansheng Zheng,1 Fei Zhong,1 Wenyan Zhong,1 Shiaoping C. Zhu,1 Shaying Zhao,12 Dennis Gilbert,1 Suzanna Baumhueter,1 Gene Spier,1 Christine Carter,1 Anibal Cravchik,1 Trevor Woodage,1 Feroze Ali,1 Huijin An,1 Aderonke Awe,1 Danita Baldwin,1 Holly Baden,1 Mary Barnstead,1 Ian Barrow,1 Karen Beeson,1 Dana Busam,1 Amy Carver,1 Angela Center,1 Ming Lai Cheng,1 Liz Curry,1 Steve Danaher,1 Lionel Davenport,1 Raymond Desilets,1 Susanne Dietz,1 Kristina Dodson,1 Lisa Doup,1 Steven Ferriera,1 Neha Garg,1 Andres Gluecksmann,1 Brit Hart,1 Jason Haynes,1 Charles Haynes,1 Cheryl Heiner,1 Suzanne Hladun,1 Damon Hostin,1 Jarrett Houck,1 Timothy Howland,1 Chinyere Ibegwam,1 Jeffery Johnson,1 Francis Kalush,1 Lesley Kline,1 Shashi Koduru,1 Amy Love,1 Felecia Mann,1 David May,1 Steven McCawley,1 Tina McIntosh,1 Ivy McMullen,1 Mee Moy,1 Linda Moy,1 Brian Murphy,1 Keith Nelson,1 Cynthia Pfannkoch,1 Eric Pratts,1 Vinita Puri,1 Hina Qureshi,1 Matthew Reardon,1 Robert Rodriguez,1 Yu-Hui Rogers,1 Deanna Romblad,1 Bob Ruhfel,1 Richard Scott,1 Cynthia Sitter,1 Michelle Smallwood,1 Erin Stewart,1 Renee Strong,1 Ellen Suh,1 Reginald Thomas,1 Ni Ni Tint,1 Sukyee Tse,1 Claire Vech,1 Gary Wang,1 Jeremy Wetter,1 Sherita Williams,1 Monica Williams,1 Sandra Windsor,1 Emily Winn-Deen,1 Keriellen Wolfe,1 Jayshree Zaveri,1 Karena Zaveri,1 Josep F. Abril,14 Roderic Guigo´,14 Michael J. Campbell,1 Kimmen V. Sjolander,1 Brian Karlak,1 Anish Kejariwal,1 Huaiyu Mi,1 Betty Lazareva,1 Thomas Hatton,1 Apurva Narechania,1 Karen Diemer,1 Anushya Muruganujan,1 Nan Guo,1 Shinji Sato,1 Vineet Bafna,1 Sorin Istrail,1 Ross Lippert,1 Russell Schwartz,1 Brian Walenz,1 Shibu Yooseph,1 David Allen,1 Anand Basu,1 James Baxendale,1 Louis Blick,1 Marcelo Caminha,1 John Carnes-Stine,1 Parris Caulk,1 Yen-Hui Chiang,1 My Coyne,1 Carl Dahlke,1 Anne Deslattes Mays,1 Maria Dombroski,1 Michael Donnelly,1 Dale Ely,1 Shiva Esparham,1 Carl Fosler,1 Harold Gire,1 Stephen Glanowski,1 Kenneth Glasser,1 Anna Glodek,1 Mark Gorokhov,1 Ken Graham,1 Barry Gropman,1 Michael Harris,1 Jeremy Heil,1 Scott Henderson,1 Jeffrey Hoover,1 Donald Jennings,1 Catherine Jordan,1 James Jordan,1 John Kasha,1 Leonid Kagan,1 Cheryl Kraft,1 Alexander Levitsky,1 Mark Lewis,1 Xiangjun Liu,1 John Lopez,1 Daniel Ma,1 William Majoros,1 Joe McDaniel,1 Sean Murphy,1 Matthew Newman,1 Trung Nguyen,1 Ngoc Nguyen,1 Marc Nodell,1 Sue Pan,1 Jim Peck,1 Marshall Peterson,1 William Rowe,1 Robert Sanders,1 John Scott,1 Michael Simpson,1 Thomas Smith,1 Arlan Sprague,1 Timothy Stockwell,1 Russell Turner,1 Eli Venter,1 Mei Wang,1 Meiyuan Wen,1 David Wu,1 Mitchell Wu,1 Ashley Xia,1 Ali Zandieh,1 Xiaohong Zhu1 T H E H UMAN G ENOME 1304 16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org on September 27, 2009 www.sciencemag.org Downloaded from
THE HUMAN GENOME A 2.91-billion base pair(bp) consensus sequence of the euchromatic portion of DNA using chain-terminating nucleotide ana- the human genome was generated by the whole-genome shotgun sequencing logs(3). In the same year, the first human gene ethod. The 148-billion bp DNA sequence was generated over 9 months from was isolated and sequenced(4). In 1986. Hood 27, 271, 853 high-quality sequence reads(5.11-fold coverage of the genome and co-workers (5) described an improvement from both ends of plasmid clones made from the DNA of five individuals. Two in the Sanger sequencing method that included assembly strategies-a whole-genome assembly and a regional chromosome attaching fluorescent dyes to the nucleotides assembly-were used, each combining sequence data from Celera and the hich permitted them to be sequentially read publicly funded genome effort. The public data were shredded into 550-bp by a computer. The first automated DNA se- egments to create a 2.9-fold coverage of those genome regions that had been quencer, developed by Applied Biosystems in quenced, without including biases inherent in the cloning and assembly Califonia in 1987. was shown to be successful erage in the asser the publicly funded group. This brought the effective cov- when the sequences of two genes were obtained with this new technology(6). From early se- the final assembly over what would be obtained with 5. 11-fold coverage. The quencing of human genomic regions (7),it two assembly strategies yielded very similar results that largely agree with became clear that cDNA sequences(which are independent mapping data. The assemblies effectively cover the euchromatic reverse-transcribed from RNA) would be es regions of the human chromosomes. More than 90% of the genome is in sential to annotate and validate gene predictions scaffold assemblies of 100,000 bp or more, and 25% of the genome is in in the human genome. These studies were the scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed basis in part for the development of the ex- 26, 588 protein-encoding transcripts for which there was strong corroborating pressed sequence tag (EST)method of gene evidence and an additional, 000 computationally derived genes with mouse dentification(8), which is a random selection matches or other weak supporting evidence. Although gene-dense clusters are very high throughput sequencing approach to obvious, almost half the genes are dispersed in low G+C sequence separated characterize cDNA libraries. The ESt method by large tracts of apparently noncoding sequence. Only 1.1% of the genome led to the rapid discovery and mapping of hu- o is spanned by exons, whereas 24% is in introns, with 75% of the genome being man genes(9). The increasing numbers of hu- intergenic DNA. Duplications of segmental blocks, ranging in size up to chro- man EST sequences necessitated the develop- nosomal lengths, are abundant throughout the genome and reveal a complex ment of new computer algorithms to analyze o evolutionary history. Comparative genomic analysis indicates vertebrate ex large amounts of sequence data, and in 1993 at ansions of genes associated with neuronal function, with tissue-specific de- elopmental regulation, and with the hemostasis and immune systems. DNA algorithm was developed that permitted assem- 5 sequence comparisons between the consensus sequence and publicly funded bly and analysis of hundreds of thousands of genome data provided locations of 2.1 million single-nucleotide polymorphisms ESTS. This algorithm permitted characteriza- (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per tion and annotation of human genes on the basis o 1250 on average, but there was marked heterogeneity in the level of poly of 30,000 EST assemblies(10) morphism across the genome. Less than 1% of all SNPs resulted in variation The complete 49-kbp bacteriophage lamb- proteins, but the task of determining which SNPs have functional consequences remains an open challenge shotgun restriction digest method in 1982 (11). When considering methods for sequenc- Decoding of the dNa that constitutes the derstanding human evolution, the causation ing the smallpox virus genome in 1991(12) human genome has been widely anticipated of disease, and the interplay between the a whole-genome shotgun sequencing method for the contribution it will make toward un- environment and heredity in defining the hu- was discussed and subsequently rejected ow- a Celera Genomics, 45 West Gude Drive, Rockville, MD determining the complete nucleotide se- for genome assembly. However, in 1994, 8 20850. USA "Genetixxpress, 78 Pacific Road. palm quence of the human genome was first for- when a microbial genome-sequencing project Genome Project, University of California, Berkeley, cA years, the idea met with mixed reactions in shotgun sequencing approach was considered 3 4720, USA. " Department of Biology, Penn State Uni- the scientific community(2). However, in possible with the TIGR EST assembly algo- rsity. 208 Mueller Lab, University Park, PA 16802, 1990, the Human Genome Project(HGP)was rithm. In 1995, the 1.8-Mbp Haemophilus officially initiated in the United States under influenzae genome ed by 44106,USA. Johns Hopkins the direction of the National Institutes of whole-genome shotgun sequencing method Avenue, Cleveland,OH 44106, USA 6yo0okins Hospi- Health and the U.S. Department of Energy (13). The experience with several subsequent mD 21287 22 us a rgeckefeler unive situ mre with a I5-year, $3 billion plan for completing genome-sequencing efforts established the York Avenue. New York. NY 10021-6399, USA Ne land BioLabs, 32 Tozer Road, Beverly, MA 01915, Ir intention to build a unique genome- A key feature of the sequencing approach USA. Division of Biology 147-75, California Institute sequencing facility, to determine the se- used for these megabase-size and larger ge- Technology, 1200 East California Boulevard, Pasa- quence of the human genome over a 3-year nomes was the use of paired-end sequences dena, CA 91125, USA. Yale University School of period. Here we report the penultimate mile- (also called mate pairs), derived from sub- Haven, CT 06520-8000, USA "Applied Biosystems, stone along the path toward that goal, a nearly clone libraries with distinct insert sizes and 850 Lincoln Centre Drive, Foster City. CA 94404, USA complete sequence of the euchromatic por- cloning characteristics. Paired-end sequences "The Institute for Genomic Research, 9712 Medical tion of the human genome. The sequencing are sequences 500 to 600 bp in length from rmatica Medica, In- shotgun method with subsequent assembly of prescribed lengths. The success of using end stitut Municipal vestigacio Medica, Universitat sequences from long segments(18 to 20 kbp) ompeu Fabra, 08003-Barcelona, Catalonia, Spain. The modern history of DNA sequencing of DNA cloned into bacteriophage lambda in To whom correspondence should be addressed. E- began in 1977, when Sanger reported his meth- assembly of the microbial genomes led to the mailhumangenome@celera.com od for determining the order of nucleotides of suggestion(16)of an approach to simulta- www.sciencemagorgSciEnceVol29116FebRuarY2001 1305
A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies—a whole-genome assembly and a regional chromosome assembly—were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional ;12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G1C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge. Decoding of the DNA that constitutes the human genome has been widely anticipated for the contribution it will make toward understanding human evolution, the causation of disease, and the interplay between the environment and heredity in defining the human condition. A project with the goal of determining the complete nucleotide sequence of the human genome was first formally proposed in 1985 (1). In subsequent years, the idea met with mixed reactions in the scientific community (2). However, in 1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the U.S. Department of Energy with a 15-year, $3 billion plan for completing the genome sequence. In 1998 we announced our intention to build a unique genomesequencing facility, to determine the sequence of the human genome over a 3-year period. Here we report the penultimate milestone along the path toward that goal, a nearly complete sequence of the euchromatic portion of the human genome. The sequencing was performed by a whole-genome random shotgun method with subsequent assembly of the sequenced segments. The modern history of DNA sequencing began in 1977, when Sanger reported his method for determining the order of nucleotides of DNA using chain-terminating nucleotide analogs (3). In the same year, the first human gene was isolated and sequenced (4). In 1986, Hood and co-workers (5) described an improvement in the Sanger sequencing method that included attaching fluorescent dyes to the nucleotides, which permitted them to be sequentially read by a computer. The first automated DNA sequencer, developed by Applied Biosystems in California in 1987, was shown to be successful when the sequences of two genes were obtained with this new technology (6). From early sequencing of human genomic regions (7), it became clear that cDNA sequences (which are reverse-transcribed from RNA) would be essential to annotate and validate gene predictions in the human genome. These studies were the basis in part for the development of the expressed sequence tag (EST) method of gene identification (8), which is a random selection, very high throughput sequencing approach to characterize cDNA libraries. The EST method led to the rapid discovery and mapping of human genes (9). The increasing numbers of human EST sequences necessitated the development of new computer algorithms to analyze large amounts of sequence data, and in 1993 at The Institute for Genomic Research (TIGR), an algorithm was developed that permitted assembly and analysis of hundreds of thousands of ESTs. This algorithm permitted characterization and annotation of human genes on the basis of 30,000 EST assemblies (10). The complete 49-kbp bacteriophage lambda genome sequence was determined by a shotgun restriction digest method in 1982 (11). When considering methods for sequencing the smallpox virus genome in 1991 (12), a whole-genome shotgun sequencing method was discussed and subsequently rejected owing to the lack of appropriate software tools for genome assembly. However, in 1994, when a microbial genome-sequencing project was contemplated at TIGR, a whole-genome shotgun sequencing approach was considered possible with the TIGR EST assembly algorithm. In 1995, the 1.8-Mbp Haemophilus influenzae genome was completed by a whole-genome shotgun sequencing method (13). The experience with several subsequent genome-sequencing efforts established the broad applicability of this approach (14, 15). A key feature of the sequencing approach used for these megabase-size and larger genomes was the use of paired-end sequences (also called mate pairs), derived from subclone libraries with distinct insert sizes and cloning characteristics. Paired-end sequences are sequences 500 to 600 bp in length from both ends of double-stranded DNA clones of prescribed lengths. The success of using end sequences from long segments (18 to 20 kbp) of DNA cloned into bacteriophage lambda in assembly of the microbial genomes led to the suggestion (16) of an approach to simulta- 1 Celera Genomics, 45 West Gude Drive, Rockville, MD 20850, USA. 2 GenetixXpress, 78 Pacific Road, Palm Beach, Sydney 2108, Australia. 3 Berkeley Drosophila Genome Project, University of California, Berkeley, CA 94720, USA. 4 Department of Biology, Penn State University, 208 Mueller Lab, University Park, PA 16802, USA. 5 Department of Genetics, Case Western Reserve University School of Medicine, BRB-630, 10900 Euclid Avenue, Cleveland, OH 44106, USA. 6 Johns Hopkins University School of Medicine, Johns Hopkins Hospital, 600 North Wolfe Street, Blalock 1007, Baltimore, MD 21287–4922, USA. 7 Rockefeller University, 1230 York Avenue, New York, NY 10021–6399, USA. 8 New England BioLabs, 32 Tozer Road, Beverly, MA 01915, USA. 9 Division of Biology, 147-75, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125, USA. 10Yale University School of Medicine, 333 Cedar Street, P.O. Box 208000, New Haven, CT 06520–8000, USA. 11Applied Biosystems, 850 Lincoln Centre Drive, Foster City, CA 94404, USA. 12The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA. 13Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, 52900 Israel. 14Grup de Recerca en Informa`tica Me`dica, Institut Municipal d’Investigacio´ Me`dica, Universitat Pompeu Fabra, 08003-Barcelona, Catalonia, Spain. *To whom correspondence should be addressed. Email: humangenome@celera.com T H E H UMAN G ENOME www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1305 on September 27, 2009 www.sciencemag.org Downloaded from
THE HUMAN GENOME map and sequence the human ge- coverage and to use the unordered and unori- 1 Sources of DNA and Sequencing This section discusses the rationale donor selection to changes fied whol posed to do genome over a inogeographi terim assembled se An Overview of the Predicted Protein-group.I terly. The modifications included a plan to Coding Genes in the Human Genome ood was form random shotgun sequencing to om males, 30 ml of whole, heparinized blood was 1306 16 february 2001 vol 291 science www.sciencemag.org
neously map and sequence the human genome by means of end sequences from 150- kbp bacterial artificial chromosomes (BACs) (17, 18). The end sequences spanned by known distances provide long-range continuity across the genome. A modification of the BAC end-sequencing (BES) method was applied successfully to complete chromosome 2 from the Arabidopsis thaliana genome (19). In 1997, Weber and Myers (20) proposed whole-genome shotgun sequencing of the human genome. Their proposal was not well received (21). However, by early 1998, as less than 5% of the genome had been sequenced, it was clear that the rate of progress in human genome sequencing worldwide was very slow (22), and the prospects for finishing the genome by the 2005 goal were uncertain. In early 1998, PE Biosystems (now Applied Biosystems) developed an automated, highthroughput capillary DNA sequencer, subsequently called the ABI PRISM 3700 DNA Analyzer. Discussions between PE Biosystems and TIGR scientists resulted in a plan to undertake the sequencing of the human genome with the 3700 DNA Analyzer and the whole-genome shotgun sequencing techniques developed at TIGR (23). Many of the principles of operation of a genome-sequencing facility were established in the TIGR facility (24). However, the facility envisioned for Celera would have a capacity roughly 50 times that of TIGR, and thus new developments were required for sample preparation and tracking and for wholegenome assembly. Some argued that the required 150-fold scale-up from the H. influenzae genome to the human genome with its complex repeat sequences was not feasible (25). The Drosophila melanogaster genome was thus chosen as a test case for whole-genome assembly on a large and complex eukaryotic genome. In collaboration with Gerald Rubin and the Berkeley Drosophila Genome Project, the nucleotide sequence of the 120-Mbp euchromatic portion of the Drosophila genome was determined over a 1-year period (26–28). The Drosophila genome-sequencing effort resulted in two key findings: (i) that the assembly algorithms could generate chromosome assemblies with highly accurate order and orientation with substantially less than 10-fold coverage, and (ii) that undertaking multiple interim assemblies in place of one comprehensive final assembly was not of value. These findings, together with the dramatic changes in the public genome effort subsequent to the formation of Celera (29), led to a modified whole-genome shotgun sequencing approach to the human genome. We initially proposed to do 10-fold sequence coverage of the genome over a 3-year period and to make interim assembled sequence data available quarterly. The modifications included a plan to perform random shotgun sequencing to ;5-fold coverage and to use the unordered and unoriented BAC sequence fragments and subassemblies published in GenBank by the publicly funded genome effort (30) to accelerate the project. We also abandoned the quarterly announcements in the absence of interim assemblies to report. Although this strategy provided a reasonable result very early that was consistent with a whole-genome shotgun assembly with eightfold coverage, the human genome sequence is not as finished as the Drosophila genome was with an effective 13-fold coverage. However, it became clear that even with this reduced coverage strategy, Celera could generate an accurately ordered and oriented scaffold sequence of the human genome in less than 1 year. Human genome sequencing was initiated 8 September 1999 and completed 17 June 2000. The first assembly was completed 25 June 2000, and the assembly reported here was completed 1 October 2000. Here we describe the whole-genome random shotgun sequencing effort applied to the human genome. We developed two different assembly approaches for assembling the ;3 billion bp that make up the 23 pairs of chromosomes of the Homo sapiens genome. Any GenBank-derived data were shredded to remove potential bias to the final sequence from chimeric clones, foreign DNA contamination, or misassembled contigs. Insofar as a correctly and accurately assembled genome sequence with faithful order and orientation of contigs is essential for an accurate analysis of the human genetic code, we have devoted a considerable portion of this manuscript to the documentation of the quality of our reconstruction of the genome. We also describe our preliminary analysis of the human genetic code on the basis of computational methods. Figure 1 (see fold-out chart associated with this issue; files for each chromosome can be found in Web fig. 1 on Science Online at www.sciencemag.org/cgi/content/full/291/ 5507/1304/DC1) provides a graphical overview of the genome and the features encoded in it. The detailed manual curation and interpretation of the genome are just beginning. To aid the reader in locating specific analytical sections, we have divided the paper into seven broad sections. A summary of the major results appears at the beginning of each section. 1 Sources of DNA and Sequencing Methods 2 Genome Assembly Strategy and Characterization 3 Gene Prediction and Annotation 4 Genome Structure 5 Genome Evolution 6 A Genome-Wide Examination of Sequence Variations 7 An Overview of the Predicted ProteinCoding Genes in the Human Genome 8 Conclusions 1 Sources of DNA and Sequencing Methods Summary. This section discusses the rationale and ethical rules governing donor selection to ensure ethnic and gender diversity along with the methodologies for DNA extraction and library construction. The plasmid library construction is the first critical step in shotgun sequencing. If the DNA libraries are not uniform in size, nonchimeric, and do not randomly represent the genome, then the subsequent steps cannot accurately reconstruct the genome sequence. We used automated high-throughput DNA sequencing and the computational infrastructure to enable efficient tracking of enormous amounts of sequence information (27.3 million sequence reads; 14.9 billion bp of sequence). Sequencing and tracking from both ends of plasmid clones from 2-, 10-, and 50-kbp libraries were essential to the computational reconstruction of the genome. Our evidence indicates that the accurate pairing rate of end sequences was greater than 98%. Various policies of the United States and the World Medical Association, specifically the Declaration of Helsinki, offer recommendations for conducting experiments with human subjects. We convened an Institutional Review Board (IRB) (31) that helped us establish the protocol for obtaining and using human DNA and the informed consent process used to enroll research volunteers for the DNA-sequencing studies reported here. We adopted several steps and procedures to protect the privacy rights and confidentiality of the research subjects (donors). These included a two-stage consent process, a secure random alphanumeric coding system for specimens and records, circumscribed contact with the subjects by researchers, and options for off-site contact of donors. In addition, Celera applied for and received a Certificate of Confidentiality from the Department of Health and Human Services. This Certificate authorized Celera to protect the privacy of the individuals who volunteered to be donors as provided in Section 301(d) of the Public Health Service Act 42 U.S.C. 241(d). Celera and the IRB believed that the initial version of a completed human genome should be a composite derived from multiple donors of diverse ethnic backgrounds Prospective donors were asked, on a voluntary basis, to self-designate an ethnogeographic category (e.g., African-American, Chinese, Hispanic, Caucasian, etc.). We enrolled 21 donors (32). Three basic items of information from each donor were recorded and linked by confidential code to the donated sample: age, sex, and self-designated ethnogeographic group. From females, ;130 ml of whole, heparinized blood was collected. From males, ;130 ml of whole, heparinized blood was T H E H UMAN G ENOME 1306 16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org on September 27, 2009 www.sciencemag.org Downloaded from
THE HUMAN GENOME collected, as well as five specimens of semen, the dideoxy sequencing method (35), which throug collected over a 6-week period. Permanent typically yields only 500 to 750 bp of sequence central laboratory information management lymphoblastoid cell lines were created by per reaction. This limitation on read length has system(LIMS)tracked all sample plates by Epstein-Barr virus immortalization. DNA made monumental gains in throughput a pre- unique bar code identifiers. The facility was from five subjects was selected for genomic requisite for the analysis of large eukaryotic supported by a quality control team that per- DNA sequencing: two males and three fe- genomes. We accomplished this at the Celera formed raw material and in-process testing males--one African-American, one Asian- facility, which occupies about 30,000 square and a quality assurance group with responsi- Chinese, one Hispanic-Mexican, and two feet of laboratory space and produces sequence bilities including document control, valida- Caucasians(see Web fig. 2 on Science Online data continuously at a rate of 175,000 total tion, and auditing of the facility. Critical to atwww.sciencemag.org/cgi/content/291/5507/readsperdayTheDna-sEquenCingfacilityisthesuccessofthescale-upwasthevalidation 1304/DC1). The decision of whose DNA to supported by a high-performance computation- of all software and instrumentation before well as technical issues such as the quality of ular by design and automa the dNa libraries and availability of immortal- sample backlogs allowed four principal 1.2 Trace processing ized cell lines modules to operate independently: (i)li- An automated trace-processing pipeline has rary transformation, plating, and colony been developed to process each sequence file 1.1 Library construction and picking; (ii) DNA template preparation; (37). After quality and vector trimming, the sequencing eaction set-up average trimmed sequence length was 543 ing process is preparation of high-quality plas- mination with the ABI PRISM 3700 DNA nentially distributed with a mean of 99.5% 8 Central to the whole-genome shotgun sequenc- and purification; and (iv) sequence deter- bp, and the sequencing accuracy was expo mid libraries in a variety of insert sizes so that Analyzer. Because the inputs and outputs and with less than 1 in 1000 reads being less one read from both ends of each plasmid insert. matched and sample backlogs are continu- quence was screened for matches to contam- High-quality libraries have an equal representa- ously managed, sequencing has proceeded inants including sequences of vector alone, E. on of all parts of the genome, a small number without a single day s interruption since the coli genomic DNA, and human mitochondri- o from such sources as the mitochondrial genome 1999. The abl 3700 is a fully automated with a significant match to a contaminant was o and Escherichia coli genomic DNA. DNA from capillary array sequencer and as such can discarded. A total of 713 reads matched E. s each donor was used to construct plasmid librar- be operated with a minimal amount of coli genomic DNA and 21 14 reads matched ies in one or more of three size classes: 2 kbp, 10 hands-on time, currently estimated at about the human mitochondrial genome cbp, and 50 kbp (Table 1)(33) 15 min per day. The capillary system also c In designing the DNA-sequencing pro- facilitates correct associations of sequenc- 1.3 Quality assessment and control cess, we focused on developing a simple ing traces with samples through the elimi- The importance of the base-pair level ac- ystem that could be implemented in a robust nation of manual sample loading and lane- curacy of the sequence data increases as the and reproducible manner and monitored ef- tracking errors associated with slab gels. size and repetitive nature of the genome to fectively(Fig. 2)(34). About 65 production staff were hired and be sequenced increases. Each sequence Current sequencing protocols are based on trained, and were rotated on a regular basis read must be placed uniquely in the ge- Table 1. Celera-generated data input into assembly Number of reads for different insert libraries Individual Total number of Total No of sequencing reads 2,767,357 2.767, 11.736,757 7467.755 66930 27 10464393006 853819 942,164,187 952523 19993 1.085640534 1,498607 Total 13543099 10894467 2834,287 27,271,853 14808616,179 Fold sequence coverage (2.9-Gb genome) 220 0.37 0.28 old clone coverag 18.39 18.39 11.26 Total 16.43 3868 sert size*(mean nsert size*(SD) 8.10% Mates Averag 8080 75.60 "insert size and SD are calculated from assembly of mates on contigs. t% Mates is based on laboratory tracking of sequencing runs www.sciencemagorgSciEnceVol29116FebRuarY2001 1307
collected, as well as five specimens of semen, collected over a 6-week period. Permanent lymphoblastoid cell lines were created by Epstein-Barr virus immortalization. DNA from five subjects was selected for genomic DNA sequencing: two males and three females—one African-American, one AsianChinese, one Hispanic-Mexican, and two Caucasians (see Web fig. 2 on Science Online at www.sciencemag.org/cgi/content/291/5507/ 1304/DC1). The decision of whose DNA to sequence was based on a complex mix of factors, including the goal of achieving diversity as well as technical issues such as the quality of the DNA libraries and availability of immortalized cell lines. 1.1 Library construction and sequencing Central to the whole-genome shotgun sequencing process is preparation of high-quality plasmid libraries in a variety of insert sizes so that pairs of sequence reads (mates) are obtained, one read from both ends of each plasmid insert. High-quality libraries have an equal representation of all parts of the genome, a small number of clones without inserts, and no contamination from such sources as the mitochondrial genome and Escherichia coli genomic DNA. DNA from each donor was used to construct plasmid libraries in one or more of three size classes: 2 kbp, 10 kbp, and 50 kbp (Table 1) (33). In designing the DNA-sequencing process, we focused on developing a simple system that could be implemented in a robust and reproducible manner and monitored effectively (Fig. 2) (34). Current sequencing protocols are based on the dideoxy sequencing method (35), which typically yields only 500 to 750 bp of sequence per reaction. This limitation on read length has made monumental gains in throughput a prerequisite for the analysis of large eukaryotic genomes. We accomplished this at the Celera facility, which occupies about 30,000 square feet of laboratory space and produces sequence data continuously at a rate of 175,000 total reads per day. The DNA-sequencing facility is supported by a high-performance computational facility (36). The process for DNA sequencing was modular by design and automated. Intermodule sample backlogs allowed four principal modules to operate independently: (i) library transformation, plating, and colony picking; (ii) DNA template preparation; (iii) dideoxy sequencing reaction set-up and purification; and (iv) sequence determination with the ABI PRISM 3700 DNA Analyzer. Because the inputs and outputs of each module have been carefully matched and sample backlogs are continuously managed, sequencing has proceeded without a single day’s interruption since the initiation of the Drosophila project in May 1999. The ABI 3700 is a fully automated capillary array sequencer and as such can be operated with a minimal amount of hands-on time, currently estimated at about 15 min per day. The capillary system also facilitates correct associations of sequencing traces with samples through the elimination of manual sample loading and lanetracking errors associated with slab gels. About 65 production staff were hired and trained, and were rotated on a regular basis through the four production modules. A central laboratory information management system (LIMS) tracked all sample plates by unique bar code identifiers. The facility was supported by a quality control team that performed raw material and in-process testing and a quality assurance group with responsibilities including document control, validation, and auditing of the facility. Critical to the success of the scale-up was the validation of all software and instrumentation before implementation, and production-scale testing of any process changes. 1.2 Trace processing An automated trace-processing pipeline has been developed to process each sequence file (37). After quality and vector trimming, the average trimmed sequence length was 543 bp, and the sequencing accuracy was exponentially distributed with a mean of 99.5% and with less than 1 in 1000 reads being less than 98% accurate (26). Each trimmed sequence was screened for matches to contaminants including sequences of vector alone, E. coli genomic DNA, and human mitochondrial DNA. The entire read for any sequence with a significant match to a contaminant was discarded. A total of 713 reads matched E. coli genomic DNA and 2114 reads matched the human mitochondrial genome. 1.3 Quality assessment and control The importance of the base-pair level accuracy of the sequence data increases as the size and repetitive nature of the genome to be sequenced increases. Each sequence read must be placed uniquely in the geTable 1. Celera-generated data input into assembly. Individual Number of reads for different insert libraries Total number of base pairs 2 kbp 10 kbp 50 kbp Total No. of sequencing reads A 0 0 2,767,357 2,767,357 1,502,674,851 B 11,736,757 7,467,755 66,930 19,271,442 10,464,393,006 C 853,819 881,290 0 1,735,109 942,164,187 D 952,523 1,046,815 0 1,999,338 1,085,640,534 F 0 1,498,607 0 1,498,607 813,743,601 Total 13,543,099 10,894,467 2,834,287 27,271,853 14,808,616,179 Fold sequence coverage A 0 0 0.52 0.52 (2.9-Gb genome) B 2.20 1.40 0.01 3.61 C 0.16 1.17 0 0.32 D 0.18 0.20 0 0.37 F 0 0.28 0 0.28 Total 2.54 2.04 0.53 5.11 Fold clone coverage A 0 0 18.39 18.39 B 2.96 11.26 0.44 14.67 C 0.22 1.33 0 1.54 D 0.24 1.58 0 1.82 F 0 2.26 0 2.26 Total 3.42 16.43 18.84 38.68 Insert size* (mean) Average 1,951 bp 10,800 bp 50,715 bp Insert size* (SD) Average 6.10% 8.10% 14.90% % Mates† Average 74.50 80.80 75.60 *Insert size and SD are calculated from assembly of mates on contigs. †% Mates is based on laboratory tracking of sequencing runs. T H E H UMAN G ENOME www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1307 on September 27, 2009 www.sciencemag.org Downloaded from
THE HUMAN GENOME nome, and even a modest error rate can entire human genome in a single facility, dent, nonbiased view of the genome. The sec- reduce the effectiveness of assembly. In we were able to ensure uniform quality ond approach involves clustering all of the frag- ddition, maintaining the validity of mate- standards and the cost advantages associat- ments to a region or chromosome on the basis pair information is absolutely critical for ed with automation, an economy of scale, of mapping information. The clustered data the algorithms described below. Procedural and process consistency were then shredded and subjected to computa- controls were established for maintaining tional assembly. Both approaches provided es- the validity of sequence mate-pairs as se- 2 Genome Assembly Strategy and ntially the same reconstruction of assembled quencing reactions proceeded through the Characterization process, including strict rules built into the Summary. We describe in this section the two DNA sequence win proper order and orienta- LIMS. The accuracy of sequence data pro- approaches that we used to assemble the ge- greater sequence coverage(fewer gaps) and duced by the Celera process was validated nome. One method involves the computational was the principal sequence used for the analysi in the course of the Drosophila genome combination of all sequence reads with shred- phase. In addition, we document the complete- project (26). By collecting data for the ded data from Gen Bank to generate an indepen- ness and correctness of this assembly process Potential Entry Points Potential Exit Points oces Human Sample Workflow Process - sample screening Tissue Samples DNA/RNA Extraction QC: size and clarity DNA/RNA (DNA Resources] /DNA Resources, (DNA Resources) QC, size concentration QC, insert size DNA/RNA(External) Libraries (DNA Resource Library Construction library complexity /DNA Resources/ (DNA Resources] 8cNEam5o Libraries QC: titer functional test Pre-Sequencing Fluorescently Labeled DNA Resource (Pre-Sequencing Labl C: monitor statistical Fluorescently Labeled Sequencing summary data Trace Files [NT] Sequencing Lab (Pre-Sequencing Lab) (Sequencing Lab/ vector contaminant Trace Files [UNIX load QCDS quality info Post-Sequencing creening [Content Systems/ 33sE= QC: byte count, External Fragments emove duplicates Proces Pre-Assembly IContent Systems·EDA Content Systems] /Content Systemsj QC: "gatekee External Trimmed syntax, duplicates Fragments Proto I/O File Generation_ quality values. Proto l/o files y Chromosome Proto l/o Files gatekeeper"run again Assembly Team QA review Assemblies [Informatics Research/ R/C Fig. 2. Flow diagram for sequencing pipeli lected, and processed in compliance with standard operating proc and da tau ith ot Maternal and extemal entities ac t dures, with a focus on quality within and across departments. Each ntrol measures, and responsible parties are indicated and are process has defined inputs and outputs with the capability to exchang further in the text. 1308 16FebRuaRy2001Vol291SciEncewww.sciencemag.org
nome, and even a modest error rate can reduce the effectiveness of assembly. In addition, maintaining the validity of matepair information is absolutely critical for the algorithms described below. Procedural controls were established for maintaining the validity of sequence mate-pairs as sequencing reactions proceeded through the process, including strict rules built into the LIMS. The accuracy of sequence data produced by the Celera process was validated in the course of the Drosophila genome project (26). By collecting data for the entire human genome in a single facility, we were able to ensure uniform quality standards and the cost advantages associated with automation, an economy of scale, and process consistency. 2 Genome Assembly Strategy and Characterization Summary. We describe in this section the two approaches that we used to assemble the genome. One method involves the computational combination of all sequence reads with shredded data from GenBank to generate an independent, nonbiased view of the genome. The second approach involves clustering all of the fragments to a region or chromosome on the basis of mapping information. The clustered data were then shredded and subjected to computational assembly. Both approaches provided essentially the same reconstruction of assembled DNA sequence with proper order and orientation. The second method provided slightly greater sequence coverage (fewer gaps) and was the principal sequence used for the analysis phase. In addition, we document the completeness and correctness of this assembly process Fig. 2. Flow diagram for sequencing pipeline. Samples are received, selected, and processed in compliance with standard operating procedures, with a focus on quality within and across departments. Each process has defined inputs and outputs with the capability to exchange samples and data with both internal and external entities according to defined quality guidelines. Manufacturing pipeline processes, products, quality control measures, and responsible parties are indicated and are described further in the text. T H E H UMAN G ENOME 1308 16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org on September 27, 2009 www.sciencemag.org Downloaded from
THE HUMAN GENOME d provide a comparison to the public gend 2.1 Assembly data sets equences. In the past 2 years the PFP has sequence, which was reconstructed largely by We used two independent sets of data for our focused on a product of lower quality and com- an independent BAC-by-BAC approach Our assemblies. The first was a random shotgun pleteness, but on a faster time-course, by con- assemblies effectively covered the euchromatic data set of 27.27 million reads of average length centrating on the production of Phase I data regions of the human chromosomes. More than 543 bp produced at Celera. This consisted from a 3X to 4X light-shotgun of each BAC 90% of the genome was in scaffold assemblies largely of mate-pair reads from 16 libraries clone of 100,000 bp or greater, and 25% of the ge. constructed from DNA samples taken from five We screened the bactig sequences for con- nome was in scaffolds of 10 million bp or different donors. Libraries with insert sizes of 2, taminants by using the blast algorithm 0, and 50 kbp were used. By looking at how against three data sets: (i) vector sequences mate pairs from a library were positioned inin Univec core (38), filtered for a 25-bl Shotgun sequence assembly is a classic known sequenced stretches of the genome, we match at 98% sequence identity at the ends example of an inverse problem: given a set were able to characterize the range of insert of the sequence and a 30-bp match internal of reads omly sampled from a target sizes in each library and determine a mean and to the sequence; (ii) the nonhuman portion sequence, reconstruct the order and the po- standard deviation. Table 1 details the number of the High Throughput Genomic(HTG) sition of those reads in the target Genome of reads, sequencing coverage, and clone cov- Seqences division of Gen Bank (39), fil- assembly algorithms developed for Dro- erage achieved by the data set. The clone cov- tered at 200 bp at 98%; and (iii)the non- sophila have now been extended to assemble erage is the coverage of the genome in cloned redundant nucleotide sequences from Gen- the "25-fold larger human genome. Celera as- DNA, considering the entire insert of each Bank without primate and human virus en- semblies consist of a set of contigs that are clone that has sequence from both ends. The tries, filtered at 200 bp at 98%.Whenever mapped to chromosomal locations by using amount of physical DNA coverage of the ge- 50 bp of the end of a contig, the ip up to 8 ordered and oriented into scaffolds that are then clone coverage provides a measure of the 25 bp or more of vector was found within lection of overlapping sequence reads that pro- Celera trimmed sequences gave a 51X cover- these criteria we removed 2.6 Mbp of pos- N vide a consensus reconstruction for a contigu- age of the genome, and clone coverage was sible contaminant and vector from the ous interval of the genome Mate pairs are a 3.42X, 1640X, and 18 84X for the 2, 10, and Phase 3 data, 61.0 Mbp from the Phase 1 entral component of the assembly strategy. 50-kbp libraries, respectively, for a total of and 2 data, and 16 1 Mbp from the Phase a size of gaps between consecutive contigs is The second data set was from the publicly 4363. 7 Mbp of PFP sequence data 20% ( known with reasonable precision. This is ac- funded Human Genome Project(PFP)and is finished, 75% rough-draft(Phase 1 and 2) arily derived from BAC clones (30). The and 5% single sequencing reads(Phase 0) one of which is in one contig, and the other of BAC data input to the assemblies came from a An additional 104, 018 BAC end-sequence which is in another, implies an orientation and download of GenBank on I September 2000 mate pairs were also downloaded and in- o distance between the two contigs(Fig 3). Fi-(Table 2) totaling 4443. 3 Mbp of sequence. cluded in the data sets for both assembly E nally, our assemblies did not incorporate all The data for each BAC is deposited at one of processes(18) reads into the final set of reported scaffolds. four levels of completion. Phase 0 data are a set This set of unincorporated reads is termed of generally unassembled sequencing reads 2. 2 Assembly strategies chaff, " and typically consisted of reads from from a very light shotgun of the BAC, typically Two different approaches to assembly were within highly repetitive regions, data from other less than 1x. Phase I data are unordered as- pursued. The first was a whole-genome as- organisms introduced through various routes as semblies of contigs, which we call BAC contigs sembly process that used Celera data and the found in many genome projects, and data of or bactigs. Phase 2 data are ordered assemblies PFP data in the form of additional synthetic a poor quality or with untrimmed vector. was a compart STS tioned the Celera and pfp data into sets o Mappe Genome localized to large chromosomal segments and 9 then performed ab initio shotgun assembly on 3 each set. Figure 4 gives a schematic of the overall process flow For the whole-genome assembly, the PFP data was first disassembled or shredded into a synthetic shotgun data set of 550-bp reads that form a perfect 2X covering of the bactigs. This Gap(mean std. dev. Known resulted in 16.05 million "faux reads that were sufficient to cover the genome 2.96X because of redundancy in the Bac data set, without Contig corporating the biases inherent in the PFP Consensus assembly process. The combined data set of Reads(of several haplotypes) 43.32 million reads(8X), and all associated mate-pair information, were then subjected to SNPS our whole-genome assembly algorithm to pro- BAC Fragments duce a reconstruction of the genome. Neither the location of a bac in the ernally derived reads ive different individuals(black lines)are combined to produce a assembly of bactigs was used in this process. contig and a ce (green line). Contigs are connected into scaffolds(red) by using Bactigs were are then mapped to the genome (gray line)with STS (blue star) found strong evidence that 2. 13% of them were misassembled (40). Furthermore, BAC location www.sciencemagorgSciEnceVol29116FebRuarY2001 1309
and provide a comparison to the public genome sequence, which was reconstructed largely by an independent BAC-by-BAC approach. Our assemblies effectively covered the euchromatic regions of the human chromosomes. More than 90% of the genome was in scaffold assemblies of 100,000 bp or greater, and 25% of the genome was in scaffolds of 10 million bp or larger. Shotgun sequence assembly is a classic example of an inverse problem: given a set of reads randomly sampled from a target sequence, reconstruct the order and the position of those reads in the target. Genome assembly algorithms developed for Drosophila have now been extended to assemble the ;25-fold larger human genome. Celera assemblies consist of a set of contigs that are ordered and oriented into scaffolds that are then mapped to chromosomal locations by using known markers. The contigs consist of a collection of overlapping sequence reads that provide a consensus reconstruction for a contiguous interval of the genome. Mate pairs are a central component of the assembly strategy. They are used to produce scaffolds in which the size of gaps between consecutive contigs is known with reasonable precision. This is accomplished by observing that a pair of reads, one of which is in one contig, and the other of which is in another, implies an orientation and distance between the two contigs (Fig. 3). Finally, our assemblies did not incorporate all reads into the final set of reported scaffolds. This set of unincorporated reads is termed “chaff,” and typically consisted of reads from within highly repetitive regions, data from other organisms introduced through various routes as found in many genome projects, and data of poor quality or with untrimmed vector. 2.1 Assembly data sets We used two independent sets of data for our assemblies. The first was a random shotgun data set of 27.27 million reads of average length 543 bp produced at Celera. This consisted largely of mate-pair reads from 16 libraries constructed from DNA samples taken from five different donors. Libraries with insert sizes of 2, 10, and 50 kbp were used. By looking at how mate pairs from a library were positioned in known sequenced stretches of the genome, we were able to characterize the range of insert sizes in each library and determine a mean and standard deviation. Table 1 details the number of reads, sequencing coverage, and clone coverage achieved by the data set. The clone coverage is the coverage of the genome in cloned DNA, considering the entire insert of each clone that has sequence from both ends. The clone coverage provides a measure of the amount of physical DNA coverage of the genome. Assuming a genome size of 2.9 Gbp, the Celera trimmed sequences gave a 5.13 coverage of the genome, and clone coverage was 3.423, 16.403, and 18.843 for the 2-, 10-, and 50-kbp libraries, respectively, for a total of 38.73 clone coverage. The second data set was from the publicly funded Human Genome Project (PFP) and is primarily derived from BAC clones (30). The BAC data input to the assemblies came from a download of GenBank on 1 September 2000 (Table 2) totaling 4443.3 Mbp of sequence. The data for each BAC is deposited at one of four levels of completion. Phase 0 data are a set of generally unassembled sequencing reads from a very light shotgun of the BAC, typically less than 13. Phase 1 data are unordered assemblies of contigs, which we call BAC contigs or bactigs. Phase 2 data are ordered assemblies of bactigs. Phase 3 data are complete BAC sequences. In the past 2 years the PFP has focused on a product of lower quality and completeness, but on a faster time-course, by concentrating on the production of Phase 1 data from a 33 to 43 light-shotgun of each BAC clone. We screened the bactig sequences for contaminants by using the BLAST algorithm against three data sets: (i) vector sequences in Univec core (38), filtered for a 25-bp match at 98% sequence identity at the ends of the sequence and a 30-bp match internal to the sequence; (ii) the nonhuman portion of the High Throughput Genomic (HTG) Seqences division of GenBank (39), filtered at 200 bp at 98%; and (iii) the nonredundant nucleotide sequences from GenBank without primate and human virus entries, filtered at 200 bp at 98%. Whenever 25 bp or more of vector was found within 50 bp of the end of a contig, the tip up to the matching vector was excised. Under these criteria we removed 2.6 Mbp of possible contaminant and vector from the Phase 3 data, 61.0 Mbp from the Phase 1 and 2 data, and 16.1 Mbp from the Phase 0 data (Table 2). This left us with a total of 4363.7 Mbp of PFP sequence data 20% finished, 75% rough-draft (Phase 1 and 2), and 5% single sequencing reads (Phase 0). An additional 104,018 BAC end-sequence mate pairs were also downloaded and included in the data sets for both assembly processes (18). 2.2 Assembly strategies Two different approaches to assembly were pursued. The first was a whole-genome assembly process that used Celera data and the PFP data in the form of additional synthetic shotgun data, and the second was a compartmentalized assembly process that first partitioned the Celera and PFP data into sets localized to large chromosomal segments and then performed ab initio shotgun assembly on each set. Figure 4 gives a schematic of the overall process flow. For the whole-genome assembly, the PFP data was first disassembled or “shredded” into a synthetic shotgun data set of 550-bp reads that form a perfect 23 covering of the bactigs. This resulted in 16.05 million “faux” reads that were sufficient to cover the genome 2.963 because of redundancy in the BAC data set, without incorporating the biases inherent in the PFP assembly process. The combined data set of 43.32 million reads (83), and all associated mate-pair information, were then subjected to our whole-genome assembly algorithm to produce a reconstruction of the genome. Neither the location of a BAC in the genome nor its assembly of bactigs was used in this process. Bactigs were shredded into reads because we found strong evidence that 2.13% of them were misassembled (40). Furthermore, BAC location Fig. 3. Anatomy of whole-genome assembly. Overlapping shredded bactig fragments (red lines) and internally derived reads from five different individuals (black lines) are combined to produce a contig and a consensus sequence (green line). Contigs are connected into scaffolds (red) by using mate pair information. Scaffolds are then mapped to the genome (gray line) with STS (blue star) physical map information. T H E H UMAN G ENOME www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1309 on September 27, 2009 www.sciencemag.org Downloaded from
THE HUMAN GENOME information was ignored because some BACs at least 2.2% of the BACs contained sequence (see below). In short, we performed a true, ab were not correctly placed on the PFP physical data that were not part of the given BAC (41), initio whole-genome assembly in which we map and because we found strong evidence that possibly as a result of sample- tracking errors took the expedient of deriving additional se- quence coverage, but not mate pairs, assembled Table 2. Gen Bank data input into assembly bactigs, or genome locality, from some exter- ally generated data. Completion phase sequence compartmentalized shotgun Center Statistics (CSA), Celera and PFP data were partitioned 0 1 and 2 3 into the largest possible chromosomal segments Whitehead Institute/ Number of accession records 2,825 363 or"components"that could be determined with MIT Center for lumber of contigs 243.786 3 confidence, and then shotgun assembly was ap- Genome Research, 194, 490, 158 1,083, 848, 245 48,829, 358 plied to each partitioned subset wherein the 1553,5 13,654 875, 618 2,202 bactig data were again shredded into faux reads 4417,055 to ensure an independent ab initio assembly of Average contig length(bp) the component. By subsetting the data in this ashington University, Number of accession records way, the overall computational effort was re- 3,232 USA 61812 duced and the effect of interchromosomal dupli- Total base pai 1,195.732561,171.788 cations was ameliorated This also resulted in a Total vector masked(bp) reconstruction of the genome that was relatively Total contaminant ma 224691,476,141 independent of the whole-genome assembly re- Average contig length(bp) 9,079 126,319 pared for consistency The quality of the parti College of Number of accession records 363 tioning into components was crucial so that N Total base pairs Total vector masked(bp) 0 1.,784,70 * 3960 gether. We constructed components from()the 218.769 Total contaminant masked Average contig length(bp) 5. 919 135.033 to Celera's data set. The BaC assemblies were Production Sequencing Number of accession records 754 obtained by a combining assembler that used the s Number of contigs 34938 4 bactigs and the 5X Celera data mapped to those Genome Institute Total base 8.680,214294.249,631 60975.328 bactigs as input. This effort was undertaken as Total vector masked(bp) 7,274 an interim step solely because the more accurate g 665,818 4,642372 118 387 and complete the scaffold for a given sequence E stretch, the more accurately one one can tile these age contig length(bp) 8422 scaffolds into contiguous components on the he Institute of Physical Number of accession records 1.149 Number of contigs 300 basis of sequence overlap and mate-pair infor- esearch(RIKEN), Total ba 018281227520093,926 mation. We further visually inspected and cu- Japan Total vector masked(bp) 2371 rated the scaffold tiling of the components to Total contaminant masked(bp) 308426 27781 further increase its acc uracy. For the final Csa Average contig length(bp) 7,093 66,978 assembly, all but the partitioning was ignored, a nger Centre, UK Number of accession records d an independent, ab initio reconstruction of Number of contigs 0 74324 2,599 the sequence in each component was obtained 8 0 689,059, 692 246, 118,000 by applying our whole-genome assembly algo- Total vector masked(bp) 427326 25,054 rithm to the partitioned, relevant Celera data and 9 otal contaminan verage contig length(bp) 0 2,066, 305 374 561 the shredded, faux reads of the partitioned, rel- 3 94697 evant bactig data. Others* Number of accession records 42 458 Number of contigs 3,458 2.3 Whole-genome assembly Total base pai 5.564879283358877246,474.157 Total vector masked 57448 27947 32136 The algorithms used for whole-genome as ..665 1791.849 sembly (WGA) of the human genome were enhancements to those used to produce the erage contig length(bp) e of the drosophila All centers combined Number of accession records Number of contigs 409628 9,137 The WGA assembler consists of a pipeline 3360047574835,72226 mposed of five prin ages: Screener Total vector masked(bp) 2438575 82, 284 Overlapper, Unitigger, Scaffolder, and Repeat Total contaminant masked 16311,664 365230 Average contig length(bp) 811 66 and marks all microsatellite repeats with less uting serang tots o shung mologyr Keio University School of Medicine: lawrence ing Alu, Line, and ribosomal DNA.Marked 914 than a 6-bp element, and screens out all H: Genome Therapeutics Corporation; GENOSCOPE me Center: known interspersed repeat elements, includ- Livermore National Laboratory: Cold Spring Harbor Laboratory: Los ALamos National Laboratory: Max-Planck Institut fuer regions get searched search; The Institute of Physical and Cherechnology Corporation: Stanford University: The Institute for Genomic screened regions do not get searched, but can rsity of Oklahoma: Universi Southwestern Medical Center, University of washingto tThe 4,405,, 825 bases contributed by all centers were be part of an overlap that involves unscreened shredded into faux reads resulting in 2.96x coverage of the genome atching segments 1310 16FebRuaRy2001Vol291SciEncewww.sciencemag.org
information was ignored because some BACs were not correctly placed on the PFP physical map and because we found strong evidence that at least 2.2% of the BACs contained sequence data that were not part of the given BAC (41), possibly as a result of sample-tracking errors (see below). In short, we performed a true, ab initio whole-genome assembly in which we took the expedient of deriving additional sequence coverage, but not mate pairs, assembled bactigs, or genome locality, from some externally generated data. In the compartmentalized shotgun assembly (CSA), Celera and PFP data were partitioned into the largest possible chromosomal segments or “components” that could be determined with confidence, and then shotgun assembly was applied to each partitioned subset wherein the bactig data were again shredded into faux reads to ensure an independent ab initio assembly of the component. By subsetting the data in this way, the overall computational effort was reduced and the effect of interchromosomal duplications was ameliorated. This also resulted in a reconstruction of the genome that was relatively independent of the whole-genome assembly results so that the two assemblies could be compared for consistency. The quality of the partitioning into components was crucial so that different genome regions were not mixed together. We constructed components from (i) the longest scaffolds of the sequence from each BAC and (ii) assembled scaffolds of data unique to Celera’s data set. The BAC assemblies were obtained by a combining assembler that used the bactigs and the 53 Celera data mapped to those bactigs as input. This effort was undertaken as an interim step solely because the more accurate and complete the scaffold for a given sequence stretch, the more accurately one can tile these scaffolds into contiguous components on the basis of sequence overlap and mate-pair information. We further visually inspected and curated the scaffold tiling of the components to further increase its accuracy. For the final CSA assembly, all but the partitioning was ignored, and an independent, ab initio reconstruction of the sequence in each component was obtained by applying our whole-genome assembly algorithm to the partitioned, relevant Celera data and the shredded, faux reads of the partitioned, relevant bactig data. 2.3 Whole-genome assembly The algorithms used for whole-genome assembly (WGA) of the human genome were enhancements to those used to produce the sequence of the Drosophila genome reported in detail in (28). The WGA assembler consists of a pipeline composed of five principal stages: Screener, Overlapper, Unitigger, Scaffolder, and Repeat Resolver, respectively. The Screener finds and marks all microsatellite repeats with less than a 6-bp element, and screens out all known interspersed repeat elements, including Alu, Line, and ribosomal DNA. Marked regions get searched for overlaps, whereas screened regions do not get searched, but can be part of an overlap that involves unscreened matching segments. Table 2. GenBank data input into assembly. Center Statistics Completion phase sequence 0 1 and 2 3 Whitehead Institute/ Number of accession records 2,825 6,533 363 MIT Center for Number of contigs 243,786 138,023 363 Genome Research, Total base pairs 194,490,158 1,083,848,245 48,829,358 USA Total vector masked (bp) 1,553,597 875,618 2,202 Total contaminant masked (bp) 13,654,482 4,417,055 98,028 Average contig length (bp) 798 7,853 134,516 Washington University, Number of accession records 19 3,232 1,300 USA Number of contigs 2,127 61,812 1,300 Total base pairs 1,195,732 561,171,788 164,214,395 Total vector masked (bp) 21,604 270,942 8,287 Total contaminant masked (bp) 22,469 1,476,141 469,487 Average contig length (bp) 562 9,079 126,319 Baylor College of Number of accession records 0 1,626 363 Medicine, USA Number of contigs 0 44,861 363 Total base pairs 0 265,547,066 49,017,104 Total vector masked (bp) 0 218,769 4,960 Total contaminant masked (bp) 0 1,784,700 485,137 Average contig length (bp) 0 5,919 135,033 Production Sequencing Number of accession records 135 2,043 754 Facility, DOE Joint Number of contigs 7,052 34,938 754 Genome Institute, Total base pairs 8,680,214 294,249,631 60,975,328 USA Total vector masked (bp) 22,644 162,651 7,274 Total contaminant masked (bp) 665,818 4,642,372 118,387 Average contig length (bp) 1,231 8,422 80,867 The Institute of Physical Number of accession records 0 1,149 300 and Chemical Number of contigs 0 25,772 300 Research (RIKEN), Total base pairs 0 182,812,275 20,093,926 Japan Total vector masked (bp) 0 203,792 2,371 Total contaminant masked (bp) 0 308,426 27,781 Average contig length (bp) 0 7,093 66,978 Sanger Centre, UK Number of accession records 0 4,538 2,599 Number of contigs 0 74,324 2,599 Total base pairs 0 689,059,692 246,118,000 Total vector masked (bp) 0 427,326 25,054 Total contaminant masked (bp) 0 2,066,305 374,561 Average contig length (bp) 0 9,271 94,697 Others* Number of accession records 42 1,894 3,458 Number of contigs 5,978 29,898 3,458 Total base pairs 5,564,879 283,358,877 246,474,157 Total vector masked (bp) 57,448 279,477 32,136 Total contaminant masked (bp) 575,366 1,616,665 1,791,849 Average contig length (bp) 931 9,478 71,277 All centers combined† Number of accession records 3,021 21,015 9,137 Number of contigs 258,943 409,628 9,137 Total base pairs 209,930,983 3,360,047,574 835,722,268 Total vector masked (bp) 1,655,293 2,438,575 82,284 Total contaminant masked (bp) 14,918,135 16,311,664 3,365,230 Average contig length (bp) 811 8,203 91,466 *Other centers contributing at least 0.1% of the sequence include: Chinese National Human Genome Center; Genomanalyse Gesellschaft fuer Biotechnologische Forschung mbH; Genome Therapeutics Corporation; GENOSCOPE; Chinese Academy of Sciences; Institute of Molecular Biotechnology; Keio University School of Medicine; Lawrence Livermore National Laboratory; Cold Spring Harbor Laboratory; Los Alamos National Laboratory; Max-Planck Institut fuer Molekulare, Genetik; Japan Science and Technology Corporation; Stanford University; The Institute for Genomic Research; The Institute of Physical and Chemical Research, Gene Bank; The University of Oklahoma; University of Texas Southwestern Medical Center, University of Washington. †The 4,405,700,825 bases contributed by all centers were shredded into faux reads resulting in 2.963 coverage of the genome. T H E H UMAN G ENOME 1310 16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org on September 27, 2009 www.sciencemag.org Downloaded from
THE HUMAN GENOME against every other read in search of complete 100-to 400-bp repetitive segment? nd other aggressive and thus more likely to make a The Overlapper compares every read singly interspersed Alu element mistake. For the human assembly, we contin- end-to-end overlaps of at least 40 bp and with The result of running the Unitigger was ued to use the first""Rocks"substage where no more than 6% differences in the match. thus a set of correctly assembled subcontigs all unitigs with a good, but not definitive, Because all data are scrupulously covering an estimated 73.6% of the human discriminator score are placed in a scaffold trimmed, the Overlapper can insist genome. The Scaffolder then proceeded to gap. This was done with the condition that plete overlap matches. Computing Ise mate-pair information to link these to- two or more mate pairs with one of their all overlaps took roughly 10,000 CPU hours gether into scaffolds. When there are two or reads already in the scaffold unambiguously with a suite of four-processor Alpha SMPs more mate pairs that imply that a given pair place the unitig in the given gap. We estimate with 4 gigabytes of RAM. This took 4 to 5 of U-unitigs are at a certain distance and the probability of inserting a unitig into an days in elapsed time with 40 such machines orientation with respect to each other, the incorrect gap with this strategy to be less than operating in parallel robability of this being wrong is again 10- based on a probabilistic analysis Every overlap computed above is statisti- roughly 1 in 10, assuming that mate pairs We revised the ensuing"Stones"substage cally a l-in-107 event and thus not a coinci- are false less than 2% of the time. Thus, one of the human assembly, making it more like dental event. What makes assembly combi- can with high confidence link together all the mechanism suggested in our earlier work natorially difficult is that while many over- U-unitigs that are linked by at least two 2-or (43). For each gap, every read R that is placed laps are actually sampled from overlapping 10-kbp mate pairs producing intermediate- in the gap by virtue of its mated pair M being regions of the genome, and thus imply that sized scaffolds that are then recursively in a contig of the scaffold and implying rs the sequence reads should be assembled to- linked together by confirming 50-kbp mate placement is collected. Celera's mate-pairing gether, even more overlaps are actually from pairs and BAC end sequences. This process information is correct more than 99% of the wo distinct copies of a low-copy repeated yielded scaffolds that are on the order of time. Thus, almost every, but not all, of the an error if put together. We call the former their contigs that generally correspond to re- a read does not belong it rarely agrees with N "true overlaps"and the latter"repeat-induced petitive elements and occasionally to small the remainder of the reads. Therefore, we overlaps. " The assembler must avoid choos- sequencing gaps. These scaffolds reconstruct simply assemble this set of reads within the ng repeat-induced overlaps, especially early the majority of the unique sequence within a gap, eliminating any reads that conflict with in the process. For the Drosophila assembly, we engaged more reliable than the one it replaced for the o ger.We first find all assemblies of reads that in a three-stage repeat resolution strategy Drosophila assembly; in the assembly of a s appear to be uncontested with respect to all where each stage was progressively more simulated shotgun data set of hi other reads. We call the contigs formed from these subassemblies unitigs (for uniquely as- sembled contigs). Formally, these unitigs are 5.11X Celera Reads Public Bactigs the uncontested interval subgraphs of the 21B graph of all overlaps (42). Unfortunately, al- though empirically many of these assemblies are correct(and thus involve only true over- laps), some are in fact collections of reads from several copies of a repetitive element Shredd Matcher that have been overcollapsed into a single 2.96X FaUx ubassembly. However, the overcollapsed Reads Celera-uni Bactigs Celera pairs 33sE= are easily identified because their av- eads binned by BAC) 8 overage depth is too high to be con- with the overall level of sequence WGA WGA coverage. We developed a simple statistical discriminator that gives the logarithm of the odds ratio that a unitig is composed of unique Unique BAC DNA or of a repeat consisting of two or more copies. The discriminator, set to a sufficiently stringent threshold, identifies a subset of the Tiler unitigs that we are certain are correct. In addition, a second, less stringent threshold identifies a subset of remaining unitigs very kely to be correctly assembled, of which we elect those that will consistently scaffold (see below ) and thus are again almost certain Components to be correct. We call the union of these two WGA+Shredder ets U-unitigs. Empirically, we found from 6X simulated shotgun of human chromosome 22 that we get U-unitigs covering 98% of the stretches of unique DNA that are >2 kbp WGA Assembly CSA Assembly long. We are further able to identity the Fig. 4. Architecture of Celera's two-pronged assembly strat boundary of the start of a repetitive element process performing the function indicated by its label, with the labels on arcs between ovals at the ends of a U-unitig and leverage this so describing the nature of the objects produced and/or consumed by a process. This figu that U-unitigs span more than 93% of all summarizes the discussion in the text that defines the terms and phrases used www.sciencemagorgSciEnceVol29116FebRuarY2001 1311
The Overlapper compares every read against every other read in search of complete end-to-end overlaps of at least 40 bp and with no more than 6% differences in the match. Because all data are scrupulously vectortrimmed, the Overlapper can insist on complete overlap matches. Computing the set of all overlaps took roughly 10,000 CPU hours with a suite of four-processor Alpha SMPs with 4 gigabytes of RAM. This took 4 to 5 days in elapsed time with 40 such machines operating in parallel. Every overlap computed above is statistically a 1-in-1017 event and thus not a coincidental event. What makes assembly combinatorially difficult is that while many overlaps are actually sampled from overlapping regions of the genome, and thus imply that the sequence reads should be assembled together, even more overlaps are actually from two distinct copies of a low-copy repeated element not screened above, thus constituting an error if put together. We call the former “true overlaps” and the latter “repeat-induced overlaps.” The assembler must avoid choosing repeat-induced overlaps, especially early in the process. We achieve this objective in the Unitigger. We first find all assemblies of reads that appear to be uncontested with respect to all other reads. We call the contigs formed from these subassemblies unitigs (for uniquely assembled contigs). Formally, these unitigs are the uncontested interval subgraphs of the graph of all overlaps (42). Unfortunately, although empirically many of these assemblies are correct (and thus involve only true overlaps), some are in fact collections of reads from several copies of a repetitive element that have been overcollapsed into a single subassembly. However, the overcollapsed unitigs are easily identified because their average coverage depth is too high to be consistent with the overall level of sequence coverage. We developed a simple statistical discriminator that gives the logarithm of the odds ratio that a unitig is composed of unique DNA or of a repeat consisting of two or more copies. The discriminator, set to a sufficiently stringent threshold, identifies a subset of the unitigs that we are certain are correct. In addition, a second, less stringent threshold identifies a subset of remaining unitigs very likely to be correctly assembled, of which we select those that will consistently scaffold (see below), and thus are again almost certain to be correct. We call the union of these two sets U-unitigs. Empirically, we found from a 63 simulated shotgun of human chromosome 22 that we get U-unitigs covering 98% of the stretches of unique DNA that are .2 kbp long. We are further able to identify the boundary of the start of a repetitive element at the ends of a U-unitig and leverage this so that U-unitigs span more than 93% of all singly interspersed Alu elements and other 100-to 400-bp repetitive segments. The result of running the Unitigger was thus a set of correctly assembled subcontigs covering an estimated 73.6% of the human genome. The Scaffolder then proceeded to use mate-pair information to link these together into scaffolds. When there are two or more mate pairs that imply that a given pair of U-unitigs are at a certain distance and orientation with respect to each other, the probability of this being wrong is again roughly 1 in 1010, assuming that mate pairs are false less than 2% of the time. Thus, one can with high confidence link together all U-unitigs that are linked by at least two 2- or 10-kbp mate pairs producing intermediatesized scaffolds that are then recursively linked together by confirming 50-kbp mate pairs and BAC end sequences. This process yielded scaffolds that are on the order of megabase pairs in size with gaps between their contigs that generally correspond to repetitive elements and occasionally to small sequencing gaps. These scaffolds reconstruct the majority of the unique sequence within a genome. For the Drosophila assembly, we engaged in a three-stage repeat resolution strategy where each stage was progressively more aggressive and thus more likely to make a mistake. For the human assembly, we continued to use the first “Rocks” substage where all unitigs with a good, but not definitive, discriminator score are placed in a scaffold gap. This was done with the condition that two or more mate pairs with one of their reads already in the scaffold unambiguously place the unitig in the given gap. We estimate the probability of inserting a unitig into an incorrect gap with this strategy to be less than 1027 based on a probabilistic analysis. We revised the ensuing “Stones” substage of the human assembly, making it more like the mechanism suggested in our earlier work (43). For each gap, every read R that is placed in the gap by virtue of its mated pair M being in a contig of the scaffold and implying R’s placement is collected. Celera’s mate-pairing information is correct more than 99% of the time. Thus, almost every, but not all, of the reads in the set belong in the gap, and when a read does not belong it rarely agrees with the remainder of the reads. Therefore, we simply assemble this set of reads within the gap, eliminating any reads that conflict with the assembly. This operation proved much more reliable than the one it replaced for the Drosophila assembly; in the assembly of a simulated shotgun data set of human chromoFig. 4. Architecture of Celera’s two-pronged assembly strategy. Each oval denotes a computation process performing the function indicated by its label, with the labels on arcs between ovals describing the nature of the objects produced and/or consumed by a process. This figure summarizes the discussion in the text that defines the terms and phrases used. T H E H UMAN G ENOME www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1311 on September 27, 2009 www.sciencemag.org Downloaded from
THE HUMAN GENOME some 22, all stones were placed correctly. have required a computer with a 600-gigabyte tribution of each was essentially exponen The final method of resolving gaps is to RAM. By making the Overlapper and Unitigger More than 50% of all gaps were less than 500 the gap. We call this external gap"walking. computation with a maximum of instantaneous long, and no gap was >100 kbp long. Similar- We did not include the very aggressive "Peb- usage of 28 gigabytes of RAM. Moreover, the ly, more than 65% of the sequence is in contigs bles"substage described in our Drosophila incremental nature of the first three stages al- >30 kbp, more than 31% is in contigs >100 ork, which made enough mistakes so as to lowed us to continually update the state of this kbp, and the largest contig was 1. 22 Mbp long produce repeat reconstructions for long inter- of the computation as data were delivered Table 3 gives detailed summary statistics for spersed elements whose quality was only and then perform a 7-day run to complete Scaf- the structure of this assembly with a direct 99.62% correct. We decided that for the hu- folding and Repeat Resolution whenever de- comparison to the compartmentalized shotgun man genome it was philosophically better not sired. For our assembly operations, the total assembly to introduce a step that was certain to produce compute infrastructure consists of 10 four-pro- somewhat larger number of gaps of some- cluster(Compaq's ES40, Regatta) and a 16- assemby mentalized shotgun less than 99.99% accuracy. The cost was a cessor SMPs with 4 gigabytes of memory per 2.4 what larger size. processor NUMA machine with 64 gigabytes In addition to the WGA approach, we pur At the final stage of the assembly process, of memory(Compag's gS160, wildfire). The sued a localized assembly approach that w nd also at several intermediate points, a total compute for a run of the assembler was intended to subdivide the genome into seg- consensus sequence of every contig is pro- roughly 20,000 CPU hours. nents. each of which could be shotgun as- duced. Our algorithm is driven by the princi- The assembly of Celera's data, together sembled individually. We expected that this e of maximum pa y, with quality- with the shredded bactig data, produced a set of would help in resolution of large interchro- value-weighted measures for evaluating each scaffolds totaling 2.848 Gbp in span and con- mosomal duplications and base. The net effect is a Bayesian estimate of sisting of 2. 586 Gbp of sequence. The chaff, or tics for calculating U-unitigs. The compart the correct base to report at each position. set of reads not incorporated in the assembly, mentalized assembly process involved clus- N Consensus generation uses Celera data when- numbered 11.27 million(26%), which is con- tering Celera reads and bactigs into large, ever it is present In the event that no Celera sistent with our experience for Drosophila. multiple megabase regions of the genome, data cover a given region, the bac data More than 84% of the genome was covered by and then running the WGa assembler on the scaffolds >100 kbp long, and these averaged Celera data and shredded, faux reads ob- a A key element of achieving a WGA of the 91% sequence and 9% gaps with a total of tained from the bactig data human genome was to parallelize the Overlap- 2. 297 Gbp of sequence. There were a total of The first phase of the CSa strategy was to per and the central consensus sequence- con- 93, 857 gaps among the 1637 scaffolds >100 separate Celera reads into those that matched structing subroutines. In addition, memory was kbp. The average scaffold size was 1.5 Mbp, the BAC contigs for a particular PFP BAC a real issue-a straightforward application of the average contig size was 24.06 kbp, and the entry, and those that did not match any public o the software we had built for Drosophila would average gap size was 2.43 kbp, where the dis- data. Such matches must be guaranteed to E Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies. Scaffold size >30 kbp >100 kbp >500 kbp >1000kbp Compartmentalized shotgun assembly o. of bp in 2748892430 2,700489906 2489357,260 including intrascaffold gap 22486891283 No of bp in contigs 653979,733 2.524251302 2491.538372 2320648201 2,10652190 1935 1,060 170033 107199 93138 92078 No. of gaps≤1kbp 72,091 69.175 67289 59915 53,354 Average scaffold size 54.217 1,395602 3.118848 Average contig size 15609 22496 23,242 25686 Average intrascaffold gap size 1832 19883 1988321 1988321 1988321 1988321 ole-genome assembly No of bp in scaffolds 2574792618 2.525334447 2,328.535 No. of bp in contigs 2,334,343339 2,297,678935 2,143002 No, of scaffolds No. of contigs 221,03 84 No of gaps 102068 96682 No. of gaps≤1kbp 132 4079 1,027,041 2846620 Average contig size(bp) 23.534 24.061 25.999 Average intrascaffold gap size 2487 2213 1,224,073 1.224073 1,224073 1.224073 1.224073 1312 16FebRuaRy2001Vol291SciEncewww.sciencemag.org
some 22, all stones were placed correctly. The final method of resolving gaps is to fill them with assembled BAC data that cover the gap. We call this external gap “walking.” We did not include the very aggressive “Pebbles” substage described in our Drosophila work, which made enough mistakes so as to produce repeat reconstructions for long interspersed elements whose quality was only 99.62% correct. We decided that for the human genome it was philosophically better not to introduce a step that was certain to produce less than 99.99% accuracy. The cost was a somewhat larger number of gaps of somewhat larger size. At the final stage of the assembly process, and also at several intermediate points, a consensus sequence of every contig is produced. Our algorithm is driven by the principle of maximum parsimony, with qualityvalue–weighted measures for evaluating each base. The net effect is a Bayesian estimate of the correct base to report at each position. Consensus generation uses Celera data whenever it is present. In the event that no Celera data cover a given region, the BAC data sequence is used. A key element of achieving a WGA of the human genome was to parallelize the Overlapper and the central consensus sequence–constructing subroutines. In addition, memory was a real issue—a straightforward application of the software we had built for Drosophila would have required a computer with a 600-gigabyte RAM. By making the Overlapper and Unitigger incremental, we were able to achieve the same computation with a maximum of instantaneous usage of 28 gigabytes of RAM. Moreover, the incremental nature of the first three stages allowed us to continually update the state of this part of the computation as data were delivered and then perform a 7-day run to complete Scaffolding and Repeat Resolution whenever desired. For our assembly operations, the total compute infrastructure consists of 10 four-processor SMPs with 4 gigabytes of memory per cluster (Compaq’s ES40, Regatta) and a 16- processor NUMA machine with 64 gigabytes of memory (Compaq’s GS160, Wildfire). The total compute for a run of the assembler was roughly 20,000 CPU hours. The assembly of Celera’s data, together with the shredded bactig data, produced a set of scaffolds totaling 2.848 Gbp in span and consisting of 2.586 Gbp of sequence. The chaff, or set of reads not incorporated in the assembly, numbered 11.27 million (26%), which is consistent with our experience for Drosophila. More than 84% of the genome was covered by scaffolds .100 kbp long, and these averaged 91% sequence and 9% gaps with a total of 2.297 Gbp of sequence. There were a total of 93,857 gaps among the 1637 scaffolds .100 kbp. The average scaffold size was 1.5 Mbp, the average contig size was 24.06 kbp, and the average gap size was 2.43 kbp, where the distribution of each was essentially exponential. More than 50% of all gaps were less than 500 bp long, .62% of all gaps were less than 1 kbp long, and no gap was .100 kbp long. Similarly, more than 65% of the sequence is in contigs .30 kbp, more than 31% is in contigs .100 kbp, and the largest contig was 1.22 Mbp long. Table 3 gives detailed summary statistics for the structure of this assembly with a direct comparison to the compartmentalized shotgun assembly. 2.4 Compartmentalized shotgun assembly In addition to the WGA approach, we pursued a localized assembly approach that was intended to subdivide the genome into segments, each of which could be shotgun assembled individually. We expected that this would help in resolution of large interchromosomal duplications and improve the statistics for calculating U-unitigs. The compartmentalized assembly process involved clustering Celera reads and bactigs into large, multiple megabase regions of the genome, and then running the WGA assembler on the Celera data and shredded, faux reads obtained from the bactig data. The first phase of the CSA strategy was to separate Celera reads into those that matched the BAC contigs for a particular PFP BAC entry, and those that did not match any public data. Such matches must be guaranteed to Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies. Scaffold size All .30 kbp .100 kbp .500 kbp .1000 kbp Compartmentalized shotgun assembly No. of bp in scaffolds 2,905,568,203 2,748,892,430 2,700,489,906 2,489,357,260 2,248,689,128 (including intrascaffold gaps) No. of bp in contigs 2,653,979,733 2,524,251,302 2,491,538,372 2,320,648,201 2,106,521,902 No. of scaffolds 53,591 2,845 1,935 1,060 721 No. of contigs 170,033 112,207 107,199 93,138 82,009 No. of gaps 116,442 109,362 105,264 92,078 81,288 No. of gaps #1 kbp 72,091 69,175 67,289 59,915 53,354 Average scaffold size (bp) 54,217 966,219 1,395,602 2,348,450 3,118,848 Average contig size (bp) 15,609 22,496 23,242 24,916 25,686 Average intrascaffold gap size (bp) 2,161 2,054 1,985 1,832 1,749 Largest contig (bp) 1,988,321 1,988,321 1,988,321 1,988,321 1,988,321 % of total contigs 100 95 94 87 79 Whole-genome assembly No. of bp in scaffolds (including intrascaffold gaps) 2,847,890,390 2,574,792,618 2,525,334,447 2,328,535,466 2,140,943,032 No. of bp in contigs 2,586,634,108 2,334,343,339 2,297,678,935 2,143,002,184 1,983,305,432 No. of scaffolds 118,968 2,507 1,637 818 554 No. of contigs 221,036 99,189 95,494 84,641 76,285 No. of gaps 102,068 96,682 93,857 83,823 75,731 No. of gaps #1 kbp 62,356 60,343 59,156 54,079 49,592 Average scaffold size (bp) 23,938 1,027,041 1,542,660 2,846,620 3,864,518 Average contig size (bp) 11,702 23,534 24,061 25,319 25,999 Average intrascaffold gap size (bp) 2,560 2,487 2,426 2,213 2,082 Largest contig (bp) 1,224,073 1,224,073 1,224,073 1,224,073 1,224,073 % of total contigs 100 90 89 83 77 T H E H UMAN G ENOME 1312 16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org on September 27, 2009 www.sciencemag.org Downloaded from