The sequencing effort had cultures through setting up and purifying sequencing reactions sequenced and assembled 29, 298 overlapping BACs and other large insert clones(Table 2), comprising a total length of 4.26 gigabases (Gb). This resulted from around 23 Gb of underlying raw shotgun sequence data, or about 7.5-fold coverage averaged across the 4,500 Finished genome(including both draft and finished sequence). The various Unfinished(draft and pre-d contributions to the total amount of sequence deposited in the HTGS division of Gen Bank are given in Table 3 Table 2 Total genome sequence from 2500 sequence status Sequent umber of Total clon number depth sequence(Mb) nis number di Figure 4 Total amount of human sequence in the High Throughput Genome Sequer sequencing centre. The average varies among the centres, and the number may rGS)division of GenBank. The total is the sum of finished sequence(red) and unfinished vary considerably for clones with the same sequencing status. For draft clones in the public draft plus predraft sequence yellow) NatuRevOl409115FeBruAry2001www.nature.com A@2001 Macmillan Magazines Ltdthat were used to `seed' the sequencing of new regions. The small ®ngerprint clone contigs were extended or merged with others as the map matured. The clones that make up the draft genome sequence therefore do not constitute a minimally overlapping setÐthere is overlap and redundancy in places. The cost of using suboptimal overlaps was justi®ed by the bene®t of earlier availability of the draft genome sequence data. Minimizing the overlap between adjacent clones would have required completing the physical map before under￾taking large-scale sequencing. In addition, the overlaps between BAC clones provide a rich collection of SNPs. More than 1.4 million SNPs have already been identi®ed from clone overlaps and other sequence comparisons97. 
Because the sequencing project was shared among twenty centres in six countries, it was important to coordinate selection of clones across the centres. Most centres focused on particular chromosomes or, in some cases, larger regions of the genome. We also maintained a clone registry to track selected clones and their progress. In later phases, the global map provided an integrated view of the data from all centres, facilitating the distribution of effort to maximize coverage of the genome. Before performing extensive sequencing on a clone, several centres routinely examined an initial sample of 96 raw sequence reads from each subclone library to evaluate possible overlap with previously sequenced clones.

Sequencing

The selected clones were subjected to shotgun sequencing. Although the basic approach of shotgun sequencing is well established, the details of implementation varied among the centres. For example, there were differences in the average insert size of the shotgun libraries, in the use of single-stranded or double-stranded cloning vectors, and in sequencing from one end or both ends of each insert. Centres differed in the fluorescent labels employed and in the degree to which they used dye-primers or dye-terminators. The sequence detectors included both slab gel- and capillary-based devices. Detailed protocols are available on the web sites of many of the individual centres (URLs can be found at www.nhgri.nih.gov/genome_hub). The extent of automation also varied greatly among the centres, with the most aggressive automation efforts resulting in factory-style systems able to process more than 100,000 sequencing reactions in 12 hours (Fig. 3). In addition, centres differed in the amount of raw sequence data typically obtained for each clone (so-called half-shotgun, full shotgun and finished sequence).

Sequence information from the different centres could be directly integrated despite this diversity, because the data were analysed by a common computational procedure. Raw sequence traces were processed and assembled with the PHRED and PHRAP software packages (refs 77, 78; P. Green, unpublished). All assembled contigs of more than 2 kb were deposited in public databases within 24 hours of assembly.

The overall sequencing output rose sharply during production (Fig. 4). Following installation of new sequence detectors beginning in June 1999, sequencing capacity and output rose approximately eightfold in eight months to nearly 7 million samples processed per month, with little or no drop in success rate (ratio of useable reads to attempted reads). By June 2000, the centres were producing raw sequence at a rate equivalent to onefold coverage of the entire human genome in less than six weeks. This corresponded to a continuous throughput exceeding 1,000 nucleotides per second, 24 hours per day, seven days per week. This scale-up resulted in a concomitant increase in the sequence available in the public databases (Fig. 4).

A version of the draft genome sequence was prepared on the basis of the map and sequence data available on 7 October 2000. For this version, the mapping effort had assembled the fingerprinted BACs into 1,246 fingerprint clone contigs. The sequencing effort had sequenced and assembled 29,298 overlapping BACs and other large-insert clones (Table 2), comprising a total length of 4.26 gigabases (Gb). This resulted from around 23 Gb of underlying raw shotgun sequence data, or about 7.5-fold coverage averaged across the genome (including both draft and finished sequence). The various contributions to the total amount of sequence deposited in the HTGS division of GenBank are given in Table 3.

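The production figures quoted above can be cross-checked with a little arithmetic. The following is a minimal sketch and not part of the original analysis. This section does not state a genome size, so one is inferred here from the quoted 23 Gb of raw sequence and 7.5-fold coverage; that inference and all variable names are assumptions made for illustration.

```python
# Back-of-the-envelope check of the production figures quoted in the text.
# Assumption: the genome size is inferred from "23 Gb raw / 7.5-fold coverage";
# this section does not state a genome size directly.

SECONDS_PER_DAY = 24 * 60 * 60

raw_sequence_gb = 23.0        # "around 23 Gb of underlying raw shotgun sequence"
fold_coverage = 7.5           # "about 7.5-fold coverage"
throughput_nt_per_s = 1_000   # "exceeding 1,000 nucleotides per second"

# Genome size implied by the two coverage figures (in Gb).
implied_genome_gb = raw_sequence_gb / fold_coverage

# Days needed for onefold coverage at exactly 1,000 nt/s, around the clock.
days_for_onefold = implied_genome_gb * 1e9 / throughput_nt_per_s / SECONDS_PER_DAY

# Raw sequence produced in six weeks (42 days) at that rate (in Gb).
six_week_output_gb = throughput_nt_per_s * 42 * SECONDS_PER_DAY / 1e9

print(f"implied genome size: {implied_genome_gb:.2f} Gb")
print(f"days for onefold coverage at 1,000 nt/s: {days_for_onefold:.1f}")
print(f"six-week output at 1,000 nt/s: {six_week_output_gb:.2f} Gb")
```

Under these assumptions, onefold coverage at 1,000 nucleotides per second takes roughly five weeks, which is consistent with the statement that onefold coverage was being produced in less than six weeks.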
NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com 867

Figure 3 The automated production line for sample preparation at the Whitehead Institute, Center for Genome Research. The system consists of custom-designed factory-style conveyor belt robots that perform all functions from purifying DNA from bacterial cultures through setting up and purifying sequencing reactions.

Figure 4 Total amount of human sequence in the High Throughput Genome Sequence (HTGS) division of GenBank. The total is the sum of finished sequence (red) and unfinished (draft plus predraft) sequence (yellow). [Plot: sequence (Mb), 0 to 5,000, against month, Jan-96 to Oct-00.]

Table 2 Total genome sequence from the collection of sequenced clones, by sequence status

| Sequence status | Number of clones | Total clone length (Mb) | Average number of sequence reads per kb* | Average sequence depth† | Total amount of raw sequence (Mb) |
|---|---|---|---|---|---|
| Finished | 8,277 | 897 | 20–25 | 8–12 | 9,085 |
| Draft | 18,969 | 3,097 | 12 | 4.5 | 13,395 |
| Predraft | 2,052 | 267 | 6 | 2.5 | 667 |
| Total | | | | | 23,147 |

* The average number of reads per kb was estimated based on information provided by each sequencing centre. This number differed among sequencing centres, based on the actual protocols used.
† The average depth in high quality bases (≥99% accuracy) was estimated from information provided by each sequencing centre. The average varies among the centres, and the number may vary considerably for clones with the same sequencing status. For draft clones in the public databases (keyword: HTGS_draft), the number can be computed from the quality scores listed in the database entry.

© 2001 Macmillan Magazines Ltd
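As a quick consistency check, the per-status rows of Table 2 can be summed and compared with the totals quoted in the text. This is a minimal sketch using only the numbers transcribed from the table; the tuple layout and variable names are illustrative assumptions.

```python
# Table 2 rows as transcribed:
# (status, number of clones, total clone length in Mb, raw sequence in Mb)
TABLE_2 = [
    ("Finished", 8_277, 897, 9_085),
    ("Draft", 18_969, 3_097, 13_395),
    ("Predraft", 2_052, 267, 667),
]

total_clones = sum(row[1] for row in TABLE_2)
total_length_mb = sum(row[2] for row in TABLE_2)
total_raw_mb = sum(row[3] for row in TABLE_2)

# Compare with the 29,298 clones quoted in the text.
print("clones:", total_clones)
# Compare with the total length of 4.26 Gb quoted in the text.
print("clone length (Gb):", total_length_mb / 1000)
# Compare with the table total and the "around 23 Gb" quoted in the text.
print("raw sequence (Mb):", total_raw_mb)
```

The column sums agree with the 29,298 clones, the 4.26 Gb total clone length and the roughly 23 Gb of raw sequence stated in the text.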