THE HUMAN GENOME

information was ignored because some BACs were not correctly placed on the PFP physical map and because we found strong evidence that at least 2.2% of the BACs contained sequence data that were not part of the given BAC (41), possibly as a result of sample-tracking errors (see below).
In short, we performed a true, ab initio whole-genome assembly in which we took the expedient of deriving additional sequence coverage, but not mate pairs, assembled bactigs, or genome locality, from some externally generated data.

In the compartmentalized shotgun assembly (CSA), Celera and PFP data were partitioned into the largest possible chromosomal segments or "components" that could be determined with confidence, and then shotgun assembly was applied to each partitioned subset wherein the bactig data were again shredded into faux reads to ensure an independent ab initio assembly of the component. By subsetting the data in this way, the overall computational effort was reduced and the effect of interchromosomal duplications was ameliorated. This also resulted in a reconstruction of the genome that was relatively independent of the whole-genome assembly results so that the two assemblies could be compared for consistency. The quality of the partitioning into components was crucial so that different genome regions were not mixed together. We constructed components from (i) the longest scaffolds of the sequence from each BAC and (ii) assembled scaffolds of data unique to Celera's data set. The BAC assemblies were obtained by a combining assembler that used the bactigs and the 5× Celera data mapped to those bactigs as input. This effort was undertaken as an interim step solely because the more accurate and complete the scaffold for a given sequence stretch, the more accurately one can tile these scaffolds into contiguous components on the basis of sequence overlap and mate-pair information. We further visually inspected and curated the scaffold tiling of the components to further increase its accuracy.
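The shredding of bactig sequence into faux reads can be illustrated with a minimal sketch. The read length, the target coverage, and the function name `shred` are illustrative assumptions and are not taken from the text above; the sketch only shows the general idea of cutting an assembled contig into overlapping pseudo-reads so that an assembler can treat it like shotgun data.

```python
def shred(contig: str, read_len: int = 550, coverage: float = 2.0) -> list:
    """Cut a contig into overlapping faux reads (hypothetical helper).

    read_len and coverage are assumed values chosen for illustration.
    """
    if len(contig) <= read_len:
        # A contig shorter than one read is kept whole.
        return [contig]
    # Spacing read starts by read_len / coverage gives roughly the
    # requested depth away from the contig ends.
    step = max(1, int(read_len / coverage))
    last_start = len(contig) - read_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        # Make sure the final bases of the contig are covered.
        starts.append(last_start)
    return [contig[s:s + read_len] for s in starts]


if __name__ == "__main__":
    toy_contig = "ACGT" * 500          # 2,000 bp toy sequence
    reads = shred(toy_contig, read_len=500, coverage=2.0)
    print(len(reads), "faux reads of", len(reads[0]), "bp")
```

Because the faux reads carry no mate-pair or locality information, an assembly built from them depends only on their sequence, which is the property the text relies on for an independent reconstruction.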
For the final CSA assembly, all but the partitioning was ignored, and an independent, ab initio reconstruction of the sequence in each component was obtained by applying our whole-genome assembly algorithm to the partitioned, relevant Celera data and the shredded, faux reads of the partitioned, relevant bactig data.

2.3 Whole-genome assembly

The algorithms used for whole-genome assembly (WGA) of the human genome were enhancements to those used to produce the sequence of the Drosophila genome reported in detail in (28).

The WGA assembler consists of a pipeline composed of five principal stages: Screener, Overlapper, Unitigger, Scaffolder, and Repeat Resolver. The Screener finds and marks all microsatellite repeats whose repeat element is shorter than 6 bp, and screens out all known interspersed repeat elements, including Alu, LINE, and ribosomal DNA. Marked regions get searched for overlaps, whereas screened regions do not get searched, but can be part of an overlap that involves unscreened matching segments.

Table 2. GenBank data input into assembly.
                                 Completion phase sequence
  Statistic                                    0        1 and 2              3

Whitehead Institute/MIT Center for Genome Research, USA
  Number of accession records              2,825          6,533            363
  Number of contigs                      243,786        138,023            363
  Total base pairs                   194,490,158  1,083,848,245     48,829,358
  Total vector masked (bp)             1,553,597        875,618          2,202
  Total contaminant masked (bp)       13,654,482      4,417,055         98,028
  Average contig length (bp)                 798          7,853        134,516

Washington University, USA
  Number of accession records                 19          3,232          1,300
  Number of contigs                        2,127         61,812          1,300
  Total base pairs                     1,195,732    561,171,788    164,214,395
  Total vector masked (bp)                21,604        270,942          8,287
  Total contaminant masked (bp)           22,469      1,476,141        469,487
  Average contig length (bp)                 562          9,079        126,319

Baylor College of Medicine, USA
  Number of accession records                  0          1,626            363
  Number of contigs                            0         44,861            363
  Total base pairs                             0    265,547,066     49,017,104
  Total vector masked (bp)                     0        218,769          4,960
  Total contaminant masked (bp)                0      1,784,700        485,137
  Average contig length (bp)                   0          5,919        135,033

Production Sequencing Facility, DOE Joint Genome Institute, USA
  Number of accession records                135          2,043            754
  Number of contigs                        7,052         34,938            754
  Total base pairs                     8,680,214    294,249,631     60,975,328
  Total vector masked (bp)                22,644        162,651          7,274
  Total contaminant masked (bp)          665,818      4,642,372        118,387
  Average contig length (bp)               1,231          8,422         80,867

The Institute of Physical and Chemical Research (RIKEN), Japan
  Number of accession records                  0          1,149            300
  Number of contigs                            0         25,772            300
  Total base pairs                             0    182,812,275     20,093,926
  Total vector masked (bp)                     0        203,792          2,371
  Total contaminant masked (bp)                0        308,426         27,781
  Average contig length (bp)                   0          7,093         66,978

Sanger Centre, UK
  Number of accession records                  0          4,538          2,599
  Number of contigs                            0         74,324          2,599
  Total base pairs                             0    689,059,692    246,118,000
  Total vector masked (bp)                     0        427,326         25,054
  Total contaminant masked (bp)                0      2,066,305        374,561
  Average contig length (bp)                   0          9,271         94,697

Others*
  Number of accession records                 42          1,894          3,458
  Number of contigs                        5,978         29,898          3,458
  Total base pairs                     5,564,879    283,358,877    246,474,157
  Total vector masked (bp)                57,448        279,477         32,136
  Total contaminant masked (bp)          575,366      1,616,665      1,791,849
  Average contig length (bp)                 931          9,478         71,277

All centers combined†
  Number of accession records              3,021         21,015          9,137
  Number of contigs                      258,943        409,628          9,137
  Total base pairs                   209,930,983  3,360,047,574    835,722,268
  Total vector masked (bp)             1,655,293      2,438,575         82,284
  Total contaminant masked (bp)       14,918,135     16,311,664      3,365,230
  Average contig length (bp)                 811          8,203         91,466

*Other centers contributing at least 0.1% of the sequence include: Chinese National Human Genome Center; Genomanalyse Gesellschaft fuer Biotechnologische Forschung mbH; Genome Therapeutics Corporation; GENOSCOPE; Chinese Academy of Sciences; Institute of Molecular Biotechnology; Keio University School of Medicine; Lawrence Livermore National Laboratory; Cold Spring Harbor Laboratory; Los Alamos National Laboratory; Max-Planck Institut fuer Molekulare Genetik; Japan Science and Technology Corporation; Stanford University; The Institute for Genomic Research; The Institute of Physical and Chemical Research, Gene Bank; The University of Oklahoma; University of Texas Southwestern Medical Center; University of Washington.
†The 4,405,700,825 bases contributed by all centers were shredded into faux reads resulting in 2.96× coverage of the genome.

1310 16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org
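As a quick consistency check on the "All centers combined" rows of Table 2, the short sketch below recomputes the total base count quoted in the footnote and the average contig lengths from the tabulated totals. The figures are copied from the table; the variable names are illustrative only.

```python
# "All centers combined" rows of Table 2, keyed by completion phase.
combined = {
    "0":       {"contigs": 258_943, "bases": 209_930_983},
    "1 and 2": {"contigs": 409_628, "bases": 3_360_047_574},
    "3":       {"contigs": 9_137,   "bases": 835_722_268},
}

# Sum of the three phase totals; the table footnote quotes this figure.
total_bases = sum(row["bases"] for row in combined.values())

# Average contig length is total base pairs divided by number of contigs.
avg_contig_len = {
    phase: round(row["bases"] / row["contigs"])
    for phase, row in combined.items()
}

if __name__ == "__main__":
    print(f"total bases: {total_bases:,}")
    for phase, length in avg_contig_len.items():
        print(f"phase {phase}: average contig length {length:,} bp")
```

The recomputed values agree with the table: the phase totals sum to the 4,405,700,825 bases given in the footnote, and the rounded averages match the tabulated 811, 8,203, and 91,466 bp.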