and provide a comparison to the public genome sequence, which was reconstructed largely by an independent BAC-by-BAC approach. Our assemblies effectively covered the euchromatic regions of the human chromosomes. More than 90% of the genome was in scaffold assemblies of 100,000 bp or greater, and 25% of the genome was in scaffolds of 10 million bp or larger.

Shotgun sequence assembly is a classic example of an inverse problem: given a set of reads randomly sampled from a target sequence, reconstruct the order and the position of those reads in the target. Genome assembly algorithms developed for Drosophila have now been extended to assemble the ~25-fold larger human genome. Celera assemblies consist of a set of contigs that are ordered and oriented into scaffolds that are then mapped to chromosomal locations by using known markers. The contigs consist of a collection of overlapping sequence reads that provide a consensus reconstruction for a contiguous interval of the genome. Mate pairs are a central component of the assembly strategy. They are used to produce scaffolds in which the size of gaps between consecutive contigs is known with reasonable precision. This is accomplished by observing that a pair of reads, one of which is in one contig, and the other of which is in another, implies an orientation and distance between the two contigs (Fig. 3). Finally, our assemblies did not incorporate all reads into the final set of reported scaffolds.
This set of unincorporated reads is termed “chaff,” and typically consisted of reads from within highly repetitive regions, data from other organisms introduced through various routes as found in many genome projects, and data of poor quality or with untrimmed vector.

2.1 Assembly data sets

We used two independent sets of data for our assemblies. The first was a random shotgun data set of 27.27 million reads of average length 543 bp produced at Celera. This consisted largely of mate-pair reads from 16 libraries constructed from DNA samples taken from five different donors. Libraries with insert sizes of 2, 10, and 50 kbp were used. By looking at how mate pairs from a library were positioned in known sequenced stretches of the genome, we were able to characterize the range of insert sizes in each library and determine a mean and standard deviation. Table 1 details the number of reads, sequencing coverage, and clone coverage achieved by the data set. The clone coverage is the coverage of the genome in cloned DNA, considering the entire insert of each clone that has sequence from both ends. The clone coverage provides a measure of the amount of physical DNA coverage of the genome. Assuming a genome size of 2.9 Gbp, the Celera trimmed sequences gave a 5.1× coverage of the genome, and clone coverage was 3.42×, 16.40×, and 18.84× for the 2-, 10-, and 50-kbp libraries, respectively, for a total of 38.7× clone coverage.

The second data set was from the publicly funded Human Genome Project (PFP) and is primarily derived from BAC clones (30). The BAC data input to the assemblies came from a download of GenBank on 1 September 2000 (Table 2) totaling 4443.3 Mbp of sequence. The data for each BAC is deposited at one of four levels of completion. Phase 0 data are a set of generally unassembled sequencing reads from a very light shotgun of the BAC, typically less than 1×. Phase 1 data are unordered assemblies of contigs, which we call BAC contigs or bactigs.
Phase 2 data are ordered assemblies of bactigs. Phase 3 data are complete BAC sequences. In the past 2 years the PFP has focused on a product of lower quality and completeness, but on a faster time-course, by concentrating on the production of Phase 1 data from a 3× to 4× light-shotgun of each BAC clone.

We screened the bactig sequences for contaminants by using the BLAST algorithm against three data sets: (i) vector sequences in Univec core (38), filtered for a 25-bp match at 98% sequence identity at the ends of the sequence and a 30-bp match internal to the sequence; (ii) the nonhuman portion of the High Throughput Genomic (HTG) Sequences division of GenBank (39), filtered at 200 bp at 98%; and (iii) the nonredundant nucleotide sequences from GenBank without primate and human virus entries, filtered at 200 bp at 98%. Whenever 25 bp or more of vector was found within 50 bp of the end of a contig, the tip up to the matching vector was excised. Under these criteria we removed 2.6 Mbp of possible contaminant and vector from the Phase 3 data, 61.0 Mbp from the Phase 1 and 2 data, and 16.1 Mbp from the Phase 0 data (Table 2). This left us with a total of 4363.7 Mbp of PFP sequence data: 20% finished, 75% rough-draft (Phase 1 and 2), and 5% single sequencing reads (Phase 0). An additional 104,018 BAC end-sequence mate pairs were also downloaded and included in the data sets for both assembly processes (18).

2.2 Assembly strategies

Two different approaches to assembly were pursued. The first was a whole-genome assembly process that used Celera data and the PFP data in the form of additional synthetic shotgun data, and the second was a compartmentalized assembly process that first partitioned the Celera and PFP data into sets localized to large chromosomal segments and then performed ab initio shotgun assembly on each set. Figure 4 gives a schematic of the overall process flow.
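
The end-trimming rule from the contaminant screen in section 2.1 (25 bp or more of vector within 50 bp of a contig end) can be illustrated with a minimal Python sketch. This is not the authors' code. The function name `trim_vector_tips`, the half-open `(start, end)` hit format, and the choice to remove the vector match together with the tip are all assumptions made for illustration; the text does not specify how BLAST hits were represented.

```python
def trim_vector_tips(contig_len, vector_hits, min_match=25, end_window=50):
    """Return the (start, end) interval of a contig kept after tip excision.

    vector_hits is a list of half-open (hit_start, hit_end) intervals giving
    where vector matched on the contig (a hypothetical input format).
    """
    keep_start, keep_end = 0, contig_len
    for hit_start, hit_end in vector_hits:
        if hit_end - hit_start < min_match:
            continue  # match shorter than the 25-bp threshold: ignore
        if hit_start <= end_window:
            # Vector lies near the left end: drop the tip and the match
            # (assumption: the match itself is removed as well).
            keep_start = max(keep_start, hit_end)
        if contig_len - hit_end <= end_window:
            # Vector lies near the right end: drop everything from the match on.
            keep_end = min(keep_end, hit_start)
    # Guard against a contig that is trimmed away entirely.
    return keep_start, max(keep_start, keep_end)


if __name__ == "__main__":
    print(trim_vector_tips(1000, [(10, 40)]))    # vector near the left end
    print(trim_vector_tips(1000, [(400, 450)]))  # internal hit, not a tip
```

Internal hits are left untouched here because the text describes excision only for matches near a contig end; how internal matches were handled is not stated on this page.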
For the whole-genome assembly, the PFP data was first disassembled or “shredded” into a synthetic shotgun data set of 550-bp reads that form a perfect 2× covering of the bactigs. This resulted in 16.05 million “faux” reads that were sufficient to cover the genome 2.96× because of redundancy in the BAC data set, without incorporating the biases inherent in the PFP assembly process. The combined data set of 43.32 million reads (8×), and all associated mate-pair information, were then subjected to our whole-genome assembly algorithm to produce a reconstruction of the genome. Neither the location of a BAC in the genome nor its assembly of bactigs was used in this process. Bactigs were shredded into reads because we found strong evidence that 2.13% of them were misassembled (40). Furthermore, BAC location

Fig. 3. Anatomy of whole-genome assembly. Overlapping shredded bactig fragments (red lines) and internally derived reads from five different individuals (black lines) are combined to produce a contig and a consensus sequence (green line). Contigs are connected into scaffolds (red) by using mate pair information. Scaffolds are then mapped to the genome (gray line) with STS (blue star) physical map information.

THE HUMAN GENOME
www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1309
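
The shredding step described for the whole-genome assembly can be illustrated with a minimal Python sketch. It assumes that a 2× covering is obtained by tiling 550-bp reads at a stride of half the read length; the page does not say how the authors built their covering, so the function name `shred`, the stride choice, and the handling of the right end are hypothetical.

```python
def shred(bactig, read_len=550):
    """Split a bactig sequence into overlapping 'faux' reads.

    Reads start every read_len // 2 bases, so each interior base falls in
    two reads. In this simplified sketch the outermost half-read at each
    end of the bactig is covered only once.
    """
    if len(bactig) <= read_len:
        return [bactig]  # too short to shred: keep as a single read
    step = read_len // 2
    last_start = len(bactig) - read_len
    reads = [bactig[s:s + read_len] for s in range(0, last_start + 1, step)]
    if last_start % step != 0:
        # Anchor one extra read at the right end so the tail is not dropped.
        reads.append(bactig[-read_len:])
    return reads


if __name__ == "__main__":
    toy_bactig = "ACGT" * 300  # 1200-bp synthetic sequence, not real data
    faux_reads = shred(toy_bactig)
    print(len(faux_reads), {len(r) for r in faux_reads})
```

Because every faux read is a substring of the bactig, any misjoin inside a bactig affects only the reads that span it, which fits the stated motivation that 2.13% of bactigs showed evidence of misassembly.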