some 22, all stones were placed correctly. The final method of resolving gaps is to fill them with assembled BAC data that cover the gap. We call this external gap “walking.” We did not include the very aggressive “Pebbles” substage described in our Drosophila work, which made enough mistakes so as to produce repeat reconstructions for long interspersed elements whose quality was only 99.62% correct. We decided that for the human genome it was philosophically better not to introduce a step that was certain to produce less than 99.99% accuracy. The cost was a somewhat larger number of gaps of somewhat larger size.
At the final stage of the assembly process, and also at several intermediate points, a consensus sequence of every contig is produced. Our algorithm is driven by the principle of maximum parsimony, with quality-value-weighted measures for evaluating each base. The net effect is a Bayesian estimate of the correct base to report at each position. Consensus generation uses Celera data whenever it is present. In the event that no Celera data cover a given region, the BAC data sequence is used.

A key element of achieving a WGA of the human genome was to parallelize the Overlapper and the central consensus sequence-constructing subroutines. In addition, memory was a real issue: a straightforward application of the software we had built for Drosophila would have required a computer with 600 gigabytes of RAM. By making the Overlapper and Unitigger incremental, we were able to achieve the same computation with a maximum instantaneous usage of 28 gigabytes of RAM. Moreover, the incremental nature of the first three stages allowed us to continually update the state of this part of the computation as data were delivered and then perform a 7-day run to complete Scaffolding and Repeat Resolution whenever desired. For our assembly operations, the total compute infrastructure consists of 10 four-processor SMPs with 4 gigabytes of memory per cluster (Compaq’s ES40, Regatta) and a 16-processor NUMA machine with 64 gigabytes of memory (Compaq’s GS160, Wildfire). The total compute for a run of the assembler was roughly 20,000 CPU hours.

The assembly of Celera’s data, together with the shredded bactig data, produced a set of scaffolds totaling 2.848 Gbp in span and consisting of 2.586 Gbp of sequence. The chaff, or set of reads not incorporated in the assembly, numbered 11.27 million (26%), which is consistent with our experience for Drosophila.
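The quality-value-weighted consensus call described earlier in this section can be illustrated with a small sketch. This is not Celera's algorithm. It assumes Phred-scaled quality values, independent read errors, a uniform prior over the four bases, and errors spread evenly over the three wrong bases. The function name `consensus_base` and the input layout are hypothetical.

```python
import math

BASES = "ACGT"


def consensus_base(column):
    """Pick the most probable base for one alignment column.

    column: list of (observed_base, phred_quality) pairs, one per read.
    Returns (best_base, posterior_probability) under a uniform prior.
    """
    log_like = {b: 0.0 for b in BASES}
    for observed, quality in column:
        # Phred scale: Q = -10 * log10(P_error). Cap at 0.75, the error
        # rate of a random guess, so that Q = 0 does not produce log(0).
        p_err = min(10 ** (-quality / 10), 0.75)
        for candidate in BASES:
            if candidate == observed:
                log_like[candidate] += math.log(1 - p_err)
            else:
                # Assumption: an error is equally likely to be any of
                # the three other bases.
                log_like[candidate] += math.log(p_err / 3)

    # Normalise in a numerically stable way to get posteriors.
    peak = max(log_like.values())
    weights = {b: math.exp(v - peak) for b, v in log_like.items()}
    total = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / total


if __name__ == "__main__":
    # Two low-quality A calls against one high-quality C call.
    print(consensus_base([("A", 5), ("A", 5), ("C", 40)]))
```

Under these assumptions a single high-quality C outweighs two low-quality A calls, which a simple majority vote would not do.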
More than 84% of the genome was covered by scaffolds >100 kbp long, and these averaged 91% sequence and 9% gaps with a total of 2.297 Gbp of sequence. There were a total of 93,857 gaps among the 1637 scaffolds >100 kbp. The average scaffold size was 1.5 Mbp, the average contig size was 24.06 kbp, and the average gap size was 2.43 kbp, where the distribution of each was essentially exponential. More than 50% of all gaps were less than 500 bp long, more than 62% of all gaps were less than 1 kbp long, and no gap was >100 kbp long. Similarly, more than 65% of the sequence is in contigs >30 kbp, more than 31% is in contigs >100 kbp, and the largest contig was 1.22 Mbp long. Table 3 gives detailed summary statistics for the structure of this assembly with a direct comparison to the compartmentalized shotgun assembly.

2.4 Compartmentalized shotgun assembly

In addition to the WGA approach, we pursued a localized assembly approach that was intended to subdivide the genome into segments, each of which could be shotgun assembled individually. We expected that this would help in resolution of large interchromosomal duplications and improve the statistics for calculating U-unitigs. The compartmentalized assembly process involved clustering Celera reads and bactigs into large, multiple megabase regions of the genome, and then running the WGA assembler on the Celera data and shredded, faux reads obtained from the bactig data.

The first phase of the CSA strategy was to separate Celera reads into those that matched the BAC contigs for a particular PFP BAC entry, and those that did not match any public data. Such matches must be guaranteed to

Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies.

Scaffold size                                                       All        >30 kbp       >100 kbp       >500 kbp      >1000 kbp

Compartmentalized shotgun assembly
No. of bp in scaffolds (including intrascaffold gaps)     2,905,568,203  2,748,892,430  2,700,489,906  2,489,357,260  2,248,689,128
No. of bp in contigs                                      2,653,979,733  2,524,251,302  2,491,538,372  2,320,648,201  2,106,521,902
No. of scaffolds                                                 53,591          2,845          1,935          1,060            721
No. of contigs                                                  170,033        112,207        107,199         93,138         82,009
No. of gaps                                                     116,442        109,362        105,264         92,078         81,288
No. of gaps ≤1 kbp                                               72,091         69,175         67,289         59,915         53,354
Average scaffold size (bp)                                       54,217        966,219      1,395,602      2,348,450      3,118,848
Average contig size (bp)                                         15,609         22,496         23,242         24,916         25,686
Average intrascaffold gap size (bp)                               2,161          2,054          1,985          1,832          1,749
Largest contig (bp)                                           1,988,321      1,988,321      1,988,321      1,988,321      1,988,321
% of total contigs                                                  100             95             94             87             79

Whole-genome assembly
No. of bp in scaffolds (including intrascaffold gaps)     2,847,890,390  2,574,792,618  2,525,334,447  2,328,535,466  2,140,943,032
No. of bp in contigs                                      2,586,634,108  2,334,343,339  2,297,678,935  2,143,002,184  1,983,305,432
No. of scaffolds                                                118,968          2,507          1,637            818            554
No. of contigs                                                  221,036         99,189         95,494         84,641         76,285
No. of gaps                                                     102,068         96,682         93,857         83,823         75,731
No. of gaps ≤1 kbp                                               62,356         60,343         59,156         54,079         49,592
Average scaffold size (bp)                                       23,938      1,027,041      1,542,660      2,846,620      3,864,518
Average contig size (bp)                                         11,702         23,534         24,061         25,319         25,999
Average intrascaffold gap size (bp)                               2,560          2,487          2,426          2,213          2,082
Largest contig (bp)                                           1,224,073      1,224,073      1,224,073      1,224,073      1,224,073
% of total contigs                                                  100             90             89             83             77

THE HUMAN GENOME 1312 16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org
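The derived rows of Table 3 can be cross-checked against its count rows. The sketch below uses only the "All" column and assumes three relationships that the table's numbers appear to follow but that the text does not state explicitly: gaps = contigs − scaffolds, average size = total bp / count, and average gap = (scaffold bp − contig bp) / gaps. The dictionary keys and the function name `derived_stats` are hypothetical.

```python
# "All" column of Table 3, copied from the table above.
TABLE3_ALL = {
    "compartmentalized": {
        "scaffold_bp": 2_905_568_203,
        "contig_bp": 2_653_979_733,
        "scaffolds": 53_591,
        "contigs": 170_033,
    },
    "whole_genome": {
        "scaffold_bp": 2_847_890_390,
        "contig_bp": 2_586_634_108,
        "scaffolds": 118_968,
        "contigs": 221_036,
    },
}


def derived_stats(row):
    """Recompute the derived Table 3 rows from the raw totals and counts."""
    # Assumption: every gap sits between two adjacent contigs of one scaffold.
    gaps = row["contigs"] - row["scaffolds"]
    return {
        "gaps": gaps,
        "avg_scaffold_bp": round(row["scaffold_bp"] / row["scaffolds"]),
        "avg_contig_bp": round(row["contig_bp"] / row["contigs"]),
        "avg_gap_bp": round((row["scaffold_bp"] - row["contig_bp"]) / gaps),
    }


if __name__ == "__main__":
    for name, row in TABLE3_ALL.items():
        print(name, derived_stats(row))
```

For both assemblies the recomputed values match the published "No. of gaps", "Average scaffold size", "Average contig size", and "Average intrascaffold gap size" entries in the "All" column.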