THE HUMAN GENOME

A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies (a whole-genome assembly and a regional chromosome assembly) were used, each combining sequence data from Celera and the publicly funded genome effort.
The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional ~12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome.
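The coverage and polymorphism figures quoted up to this point can be cross-checked with simple arithmetic. The sketch below is only a back-of-the-envelope check and not part of the original analysis. It assumes that the Celera coverage and the shredded public-data coverage simply add, and that the 1-in-1250 difference rate applies uniformly across the 2.91-billion bp consensus. All variable names are illustrative.

```python
# Figures taken from the abstract text.
total_bases = 14.8e9             # bp of sequence generated
num_reads = 27_271_853           # high-quality sequence reads
celera_coverage = 5.11           # fold coverage from those reads
shredded_public_coverage = 2.9   # fold coverage from shredded public data
consensus_length = 2.91e9        # bp in the euchromatic consensus
bp_per_difference = 1250         # one difference per 1250 bp on average

# Average length of a read, in bp.
mean_read_length = total_bases / num_reads

# Genome size implied by "14.8 billion bp is 5.11-fold coverage".
implied_genome_size = total_bases / celera_coverage

# Assumption: the two sources of coverage are additive.
effective_coverage = celera_coverage + shredded_public_coverage

# Assumption: the average difference rate holds over the whole consensus.
expected_differences = consensus_length / bp_per_difference

print(f"mean read length     : {mean_read_length:.0f} bp")
print(f"implied genome size  : {implied_genome_size / 1e9:.2f} billion bp")
print(f"effective coverage   : {effective_coverage:.2f}-fold")
print(f"expected differences : {expected_differences / 1e6:.2f} million")
```

Under these assumptions the numbers are mutually consistent: reads average a little over 540 bp, the implied genome size is close to the 2.91-billion bp consensus, and the summed coverage is close to the eightfold figure quoted above.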
Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.

Decoding of the DNA that constitutes the human genome has been widely anticipated for the contribution it will make toward understanding human evolution, the causation of disease, and the interplay between the environment and heredity in defining the human condition. A project with the goal of determining the complete nucleotide sequence of the human genome was first formally proposed in 1985 (1). In subsequent years, the idea met with mixed reactions in the scientific community (2). However, in 1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the U.S. Department of Energy with a 15-year, $3 billion plan for completing the genome sequence. In 1998 we announced our intention to build a unique genome-sequencing facility, to determine the sequence of the human genome over a 3-year period. Here we report the penultimate milestone along the path toward that goal, a nearly complete sequence of the euchromatic portion of the human genome. The sequencing was performed by a whole-genome random shotgun method with subsequent assembly of the sequenced segments.

The modern history of DNA sequencing began in 1977, when Sanger reported his method for determining the order of nucleotides of DNA using chain-terminating nucleotide analogs (3). In the same year, the first human gene was isolated and sequenced (4). In 1986, Hood and co-workers (5) described an improvement in the Sanger sequencing method that included attaching fluorescent dyes to the nucleotides, which permitted them to be sequentially read by a computer. The first automated DNA sequencer, developed by Applied Biosystems in California in 1987, was shown to be successful when the sequences of two genes were obtained with this new technology (6).
From early sequencing of human genomic regions (7), it became clear that cDNA sequences (which are reverse-transcribed from RNA) would be essential to annotate and validate gene predictions in the human genome. These studies were the basis in part for the development of the expressed sequence tag (EST) method of gene identification (8), which is a random selection, very high throughput sequencing approach to characterize cDNA libraries. The EST method led to the rapid discovery and mapping of human genes (9). The increasing numbers of human EST sequences necessitated the development of new computer algorithms to analyze large amounts of sequence data, and in 1993 at The Institute for Genomic Research (TIGR), an algorithm was developed that permitted assembly and analysis of hundreds of thousands of ESTs. This algorithm permitted characterization and annotation of human genes on the basis of 30,000 EST assemblies (10).

The complete 49-kbp bacteriophage lambda genome sequence was determined by a shotgun restriction digest method in 1982 (11). When considering methods for sequencing the smallpox virus genome in 1991 (12), a whole-genome shotgun sequencing method was discussed and subsequently rejected owing to the lack of appropriate software tools for genome assembly. However, in 1994, when a microbial genome-sequencing project was contemplated at TIGR, a whole-genome shotgun sequencing approach was considered possible with the TIGR EST assembly algorithm. In 1995, the 1.8-Mbp Haemophilus influenzae genome was completed by a whole-genome shotgun sequencing method (13). The experience with several subsequent genome-sequencing efforts established the broad applicability of this approach (14, 15).

A key feature of the sequencing approach used for these megabase-size and larger genomes was the use of paired-end sequences (also called mate pairs), derived from subclone libraries with distinct insert sizes and cloning characteristics.
Paired-end sequences are sequences 500 to 600 bp in length from both ends of double-stranded DNA clones of prescribed lengths. The success of using end sequences from long segments (18 to 20 kbp) of DNA cloned into bacteriophage lambda in assembly of the microbial genomes led to the suggestion (16) of an approach to simulta-

1 Celera Genomics, 45 West Gude Drive, Rockville, MD 20850, USA.
2 GenetixXpress, 78 Pacific Road, Palm Beach, Sydney 2108, Australia.
3 Berkeley Drosophila Genome Project, University of California, Berkeley, CA 94720, USA.
4 Department of Biology, Penn State University, 208 Mueller Lab, University Park, PA 16802, USA.
5 Department of Genetics, Case Western Reserve University School of Medicine, BRB-630, 10900 Euclid Avenue, Cleveland, OH 44106, USA.
6 Johns Hopkins University School of Medicine, Johns Hopkins Hospital, 600 North Wolfe Street, Blalock 1007, Baltimore, MD 21287-4922, USA.
7 Rockefeller University, 1230 York Avenue, New York, NY 10021-6399, USA.
8 New England BioLabs, 32 Tozer Road, Beverly, MA 01915, USA.
9 Division of Biology, 147-75, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125, USA.
10 Yale University School of Medicine, 333 Cedar Street, P.O. Box 208000, New Haven, CT 06520-8000, USA.
11 Applied Biosystems, 850 Lincoln Centre Drive, Foster City, CA 94404, USA.
12 The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA.
13 Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, 52900 Israel.
14 Grup de Recerca en Informàtica Mèdica, Institut Municipal d'Investigació Mèdica, Universitat Pompeu Fabra, 08003-Barcelona, Catalonia, Spain.
*To whom correspondence should be addressed. E-mail: humangenome@celera.com

www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1305
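The role of paired-end sequences (mate pairs) described in the introduction can be illustrated with a small sketch. Because the two reads of a pair come from opposite ends of a clone of roughly known length, a pair whose reads fall on two different contigs constrains the size of the gap between them. The code below is a simplified illustration under stated assumptions, not the assembler used in the paper: the `MateLink` record, the `estimate_gap` function, the coordinates, and the 2,000-bp insert size are all hypothetical, and real insert sizes vary around their nominal value.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class MateLink:
    """One mate pair whose two reads land on different contigs (hypothetical record)."""
    insert_size: int   # nominal clone insert length in bp (assumed known exactly here)
    left_start: int    # start of the forward read on the left contig (0-based)
    right_end: int     # end of the reverse read on the right contig (exclusive)


def estimate_gap(left_contig_len: int, links: list[MateLink]) -> float:
    """Average the gap size implied by each mate pair spanning two contigs.

    Each insert covers the tail of the left contig from the forward read's
    start, then the unknown gap, then the head of the right contig up to the
    reverse read's end.
    """
    if not links:
        raise ValueError("need at least one spanning mate pair")
    gaps = []
    for link in links:
        covered_left = left_contig_len - link.left_start
        covered_right = link.right_end
        gaps.append(link.insert_size - covered_left - covered_right)
    return mean(gaps)


# Two illustrative links from libraries with distinct insert sizes.
links = [
    MateLink(insert_size=2_000, left_start=9_200, right_end=700),
    MateLink(insert_size=18_000, left_start=1_000, right_end=8_500),
]
gap = estimate_gap(left_contig_len=10_000, links=links)
print(f"estimated gap: {gap} bp")
```

In this toy case both links imply the same gap. In practice, agreement between libraries with distinct insert sizes is one reason several libraries are useful: short inserts constrain local order tightly, while long inserts bridge larger gaps and repeats.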