THE HUMAN GENOME
www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001 1307

collected, as well as five specimens of semen, collected over a 6-week period. Permanent lymphoblastoid cell lines were created by Epstein-Barr virus immortalization. DNA from five subjects was selected for genomic DNA sequencing: two males and three females (one African-American, one Asian-Chinese, one Hispanic-Mexican, and two Caucasians) (see Web fig. 2 on Science Online at www.sciencemag.org/cgi/content/291/5507/1304/DC1). The decision of whose DNA to sequence was based on a complex mix of factors, including the goal of achieving diversity as well as technical issues such as the quality of the DNA libraries and availability of immortalized cell lines.

1.1 Library construction and sequencing

Central to the whole-genome shotgun sequencing process is preparation of high-quality plasmid libraries in a variety of insert sizes so that pairs of sequence reads (mates) are obtained, one read from each end of each plasmid insert.
High-quality libraries have an equal representation of all parts of the genome, a small number of clones without inserts, and no contamination from such sources as the mitochondrial genome and Escherichia coli genomic DNA. DNA from each donor was used to construct plasmid libraries in one or more of three size classes: 2 kbp, 10 kbp, and 50 kbp (Table 1) (33).

In designing the DNA-sequencing process, we focused on developing a simple system that could be implemented in a robust and reproducible manner and monitored effectively (Fig. 2) (34).

Current sequencing protocols are based on the dideoxy sequencing method (35), which typically yields only 500 to 750 bp of sequence per reaction. This limitation on read length has made monumental gains in throughput a prerequisite for the analysis of large eukaryotic genomes. We accomplished this at the Celera facility, which occupies about 30,000 square feet of laboratory space and produces sequence data continuously at a rate of 175,000 total reads per day. The DNA-sequencing facility is supported by a high-performance computational facility (36).

The process for DNA sequencing was modular by design and automated. Intermodule sample backlogs allowed four principal modules to operate independently: (i) library transformation, plating, and colony picking; (ii) DNA template preparation; (iii) dideoxy sequencing reaction set-up and purification; and (iv) sequence determination with the ABI PRISM 3700 DNA Analyzer. Because the inputs and outputs of each module have been carefully matched and sample backlogs are continuously managed, sequencing has proceeded without a single day's interruption since the initiation of the Drosophila project in May 1999. The ABI 3700 is a fully automated capillary array sequencer and as such can be operated with a minimal amount of hands-on time, currently estimated at about 15 min per day.
The capillary system also facilitates correct associations of sequencing traces with samples through the elimination of manual sample loading and lane-tracking errors associated with slab gels. About 65 production staff were hired and trained, and were rotated on a regular basis through the four production modules. A central laboratory information management system (LIMS) tracked all sample plates by unique bar code identifiers. The facility was supported by a quality control team that performed raw material and in-process testing and a quality assurance group with responsibilities including document control, validation, and auditing of the facility. Critical to the success of the scale-up was the validation of all software and instrumentation before implementation, and production-scale testing of any process changes.

1.2 Trace processing

An automated trace-processing pipeline has been developed to process each sequence file (37). After quality and vector trimming, the average trimmed sequence length was 543 bp, and the sequencing accuracy was exponentially distributed with a mean of 99.5% and with fewer than 1 in 1000 reads being less than 98% accurate (26). Each trimmed sequence was screened for matches to contaminants including sequences of vector alone, E. coli genomic DNA, and human mitochondrial DNA. The entire read for any sequence with a significant match to a contaminant was discarded. A total of 713 reads matched E. coli genomic DNA and 2114 reads matched the human mitochondrial genome.

1.3 Quality assessment and control

The importance of the base-pair level accuracy of the sequence data increases as the size and repetitive nature of the genome to be sequenced increase. Each sequence read must be placed uniquely in the ge-

Table 1. Celera-generated data input into assembly.

                  Number of reads for different insert libraries
Individual        2 kbp       10 kbp       50 kbp        Total  Total number of base pairs

No. of sequencing reads
A                     0            0    2,767,357    2,767,357               1,502,674,851
B            11,736,757    7,467,755       66,930   19,271,442              10,464,393,006
C               853,819      881,290            0    1,735,109                 942,164,187
D               952,523    1,046,815            0    1,999,338               1,085,640,534
F                     0    1,498,607            0    1,498,607                 813,743,601
Total        13,543,099   10,894,467    2,834,287   27,271,853              14,808,616,179

Fold sequence coverage (2.9-Gb genome)
A                     0            0         0.52         0.52
B                  2.20         1.40         0.01         3.61
C                  0.16         0.17            0         0.32
D                  0.18         0.20            0         0.37
F                     0         0.28            0         0.28
Total              2.54         2.04         0.53         5.11

Fold clone coverage
A                     0            0        18.39        18.39
B                  2.96        11.26         0.44        14.67
C                  0.22         1.33            0         1.54
D                  0.24         1.58            0         1.82
F                     0         2.26            0         2.26
Total              3.42        16.43        18.84        38.68

Insert size* (mean)
Average        1,951 bp    10,800 bp    50,715 bp

Insert size* (SD)
Average           6.10%        8.10%       14.90%

% Mates†
Average           74.50        80.80        75.60

*Insert size and SD are calculated from assembly of mates on contigs.
†% Mates is based on laboratory tracking of sequencing runs.
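The fold sequence coverage figures in Table 1 can be checked by dividing the base pairs sequenced by the 2.9-Gb genome size the table assumes. The following is a minimal Python sketch of that arithmetic, using the per-individual totals copied from the table; the names `GENOME_SIZE_BP`, `BASE_PAIRS`, `TOTAL_READS`, and `fold_coverage` are illustrative and do not come from the paper.

```python
# Sketch: recompute the "Total" fold sequence coverage column of Table 1.
GENOME_SIZE_BP = 2_900_000_000  # 2.9-Gb genome size stated in Table 1

# Total number of base pairs per individual, copied from Table 1.
BASE_PAIRS = {
    "A": 1_502_674_851,
    "B": 10_464_393_006,
    "C": 942_164_187,
    "D": 1_085_640_534,
    "F": 813_743_601,
}

# Total number of sequencing reads across all libraries, from Table 1.
TOTAL_READS = 27_271_853


def fold_coverage(base_pairs: int, genome_size: int = GENOME_SIZE_BP) -> float:
    """Fold sequence coverage = sequenced base pairs / genome size."""
    return base_pairs / genome_size


# Per-individual coverage and the overall total.
coverage = {ind: fold_coverage(bp) for ind, bp in BASE_PAIRS.items()}
total_bp = sum(BASE_PAIRS.values())
total_coverage = fold_coverage(total_bp)

# Mean read length implied by the table totals.
mean_read_length = total_bp / TOTAL_READS

for ind, cov in coverage.items():
    print(f"{ind}: {cov:.2f}-fold")
print(f"Total: {total_coverage:.2f}-fold")
print(f"Mean read length: {mean_read_length:.0f} bp")
```

Rounded to two decimals, the results agree with the Total column of the fold sequence coverage rows (0.52, 3.61, 0.32, 0.37, 0.28, and 5.11 overall). The implied mean read length is 543 bp, the average trimmed sequence length reported in section 1.2.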