DNA from five subjects was selected for genomic DNA sequencing: two males and three females, comprising one African-American, one Asian-Chinese, one Hispanic-Mexican, and two Caucasians (see Web fig. 2 on Science Online at www.sciencemag.org/cgi/content/291/5507/1304/DC1). The decision of whose DNA to sequence was based on a complex mix of factors, including the goal of achieving diversity as well as technical issues such as the quality of the DNA libraries and availability of immortalized cell lines.

1.1 Library construction and sequencing

Central to the whole-genome shotgun sequencing process is preparation of high-quality plasmid libraries in a variety of insert sizes so that pairs of sequence reads (mates) are obtained, one read from each end of each plasmid insert. High-quality libraries have an equal representation of all parts of the genome, a small number of clones without inserts, and no contamination from such sources as the mitochondrial genome and Escherichia coli genomic DNA. DNA from each donor was used to construct plasmid libraries in one or more of three size classes: 2 kbp, 10 kbp, and 50 kbp (Table 1) (33).

In designing the DNA-sequencing process, we focused on developing a simple system that could be implemented in a robust and reproducible manner and monitored effectively (Fig. 2) (34). Current sequencing protocols are based on the dideoxy sequencing method (35), which typically yields only 500 to 750 bp of sequence per reaction. This limitation on read length has made monumental gains in throughput a prerequisite for the analysis of large eukaryotic genomes. We accomplished this at the Celera facility, which occupies about 30,000 square feet of laboratory space and produces sequence data continuously at a rate of 175,000 total reads per day. The DNA-sequencing facility is supported by a high-performance computational facility (36).

The process for DNA sequencing was modular by design and automated.
Intermodule sample backlogs allowed four principal modules to operate independently: (i) library transformation, plating, and colony picking; (ii) DNA template preparation; (iii) dideoxy sequencing reaction set-up and purification; and (iv) sequence determination with the ABI PRISM 3700 DNA Analyzer. Because the inputs and outputs of each module have been carefully matched and sample backlogs are continuously managed, sequencing has proceeded without a single day’s interruption since the initiation of the Drosophila project in May 1999.

The ABI 3700 is a fully automated capillary array sequencer and as such can be operated with a minimal amount of hands-on time, currently estimated at about 15 min per day. The capillary system also facilitates correct associations of sequencing traces with samples through the elimination of manual sample loading and lane-tracking errors associated with slab gels.

About 65 production staff were hired and trained, and were rotated on a regular basis through the four production modules. A central laboratory information management system (LIMS) tracked all sample plates by unique bar code identifiers. The facility was supported by a quality control team that performed raw material and in-process testing and a quality assurance group with responsibilities including document control, validation, and auditing of the facility. Critical to the success of the scale-up was the validation of all software and instrumentation before implementation, and production-scale testing of any process changes.

1.2 Trace processing

An automated trace-processing pipeline has been developed to process each sequence file (37). After quality and vector trimming, the average trimmed sequence length was 543 bp, and the sequencing accuracy was exponentially distributed with a mean of 99.5% and with less than 1 in 1000 reads being less than 98% accurate (26).
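The text does not describe how quality trimming is performed (the pipeline is cited as ref. 37). Purely as an illustration of what such a step can look like, the sketch below keeps the contiguous stretch of a read that maximizes the summed excess of per-base quality over a cutoff. The function name `quality_trim`, the Phred-style quality values, and the cutoff of 20 are all assumptions made for this example and are not taken from the paper.

```python
from typing import Sequence, Tuple


def quality_trim(quals: Sequence[int], cutoff: int = 20) -> Tuple[int, int]:
    """Return (start, end) of the half-open interval to keep.

    Hypothetical rule: choose the contiguous segment that maximizes
    sum(q - cutoff), found with a single left-to-right scan.
    Returns (0, 0) when no base exceeds the cutoff.
    """
    best_sum = 0
    best = (0, 0)
    cur_sum = 0
    cur_start = 0
    for i, q in enumerate(quals):
        cur_sum += q - cutoff
        if cur_sum <= 0:
            # A non-positive running total cannot help any later segment.
            cur_sum = 0
            cur_start = i + 1
        elif cur_sum > best_sum:
            best_sum = cur_sum
            best = (cur_start, i + 1)
    return best


# Toy read with low-quality bases at both ends (values are made up).
bases = "NNACGTNN"
quals = [5, 8, 30, 35, 40, 38, 10, 6]
start, end = quality_trim(quals)
trimmed = bases[start:end]
print(start, end, trimmed)
```

Vector trimming would be a separate step, for example matching read ends against the known vector sequence, and is not shown here.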
Each trimmed sequence was screened for matches to contaminants including sequences of vector alone, E. coli genomic DNA, and human mitochondrial DNA. The entire read for any sequence with a significant match to a contaminant was discarded. A total of 713 reads matched E. coli genomic DNA and 2114 reads matched the human mitochondrial genome.

1.3 Quality assessment and control

The importance of the base-pair level accuracy of the sequence data increases as the size and repetitive nature of the genome to be sequenced increases. Each sequence read must be placed uniquely in the ge-

Table 1. Celera-generated data input into assembly.

| Measure | Individual | 2 kbp library | 10 kbp library | 50 kbp library | Total | Total number of base pairs |
|---|---|---|---|---|---|---|
| No. of sequencing reads | A | 0 | 0 | 2,767,357 | 2,767,357 | 1,502,674,851 |
| | B | 11,736,757 | 7,467,755 | 66,930 | 19,271,442 | 10,464,393,006 |
| | C | 853,819 | 881,290 | 0 | 1,735,109 | 942,164,187 |
| | D | 952,523 | 1,046,815 | 0 | 1,999,338 | 1,085,640,534 |
| | F | 0 | 1,498,607 | 0 | 1,498,607 | 813,743,601 |
| | Total | 13,543,099 | 10,894,467 | 2,834,287 | 27,271,853 | 14,808,616,179 |
| Fold sequence coverage (2.9-Gb genome) | A | 0 | 0 | 0.52 | 0.52 | |
| | B | 2.20 | 1.40 | 0.01 | 3.61 | |
| | C | 0.16 | 0.17 | 0 | 0.32 | |
| | D | 0.18 | 0.20 | 0 | 0.37 | |
| | F | 0 | 0.28 | 0 | 0.28 | |
| | Total | 2.54 | 2.04 | 0.53 | 5.11 | |
| Fold clone coverage | A | 0 | 0 | 18.39 | 18.39 | |
| | B | 2.96 | 11.26 | 0.44 | 14.67 | |
| | C | 0.22 | 1.33 | 0 | 1.54 | |
| | D | 0.24 | 1.58 | 0 | 1.82 | |
| | F | 0 | 2.26 | 0 | 2.26 | |
| | Total | 3.42 | 16.43 | 18.84 | 38.68 | |
| Insert size* (mean) | Average | 1,951 bp | 10,800 bp | 50,715 bp | | |
| Insert size* (SD) | Average | 6.10% | 8.10% | 14.90% | | |
| % Mates† | Average | 74.50 | 80.80 | 75.60 | | |

*Insert size and SD are calculated from assembly of mates on contigs.
†% Mates is based on laboratory tracking of sequencing runs.
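As a worked illustration of how the columns of Table 1 relate to one another, the sketch below recomputes the base-pair totals and the fold sequence coverage for each individual. It assumes that every read contributes the 543-bp average trimmed length quoted in the text and that coverage is total base pairs divided by the 2.9-Gb genome size given in the table. Under that assumption the products match the base-pair column of Table 1. The function and constant names are ours, not the authors'.

```python
# Illustrative recomputation of Table 1 totals; names are hypothetical.
MEAN_TRIMMED_READ_BP = 543          # average trimmed read length from the text
GENOME_SIZE_BP = 2_900_000_000      # 2.9-Gb genome size used in Table 1

# Total sequencing reads per individual, copied from Table 1.
READS_PER_INDIVIDUAL = {
    "A": 2_767_357,
    "B": 19_271_442,
    "C": 1_735_109,
    "D": 1_999_338,
    "F": 1_498_607,
}


def total_base_pairs(n_reads: int, read_len: int = MEAN_TRIMMED_READ_BP) -> int:
    """Base pairs contributed by n_reads reads of a fixed average length."""
    return n_reads * read_len


def fold_coverage(n_reads: int,
                  read_len: int = MEAN_TRIMMED_READ_BP,
                  genome_size: int = GENOME_SIZE_BP) -> float:
    """Fold sequence coverage: sequenced base pairs divided by genome size."""
    return total_base_pairs(n_reads, read_len) / genome_size


for individual, n_reads in READS_PER_INDIVIDUAL.items():
    print(individual,
          total_base_pairs(n_reads),
          round(fold_coverage(n_reads), 2))

all_reads = sum(READS_PER_INDIVIDUAL.values())
print("Total", total_base_pairs(all_reads), round(fold_coverage(all_reads), 2))
```

The same function can be applied to the per-library read counts to recompute the 2 kbp, 10 kbp, and 50 kbp coverage columns. Fold clone coverage additionally depends on the number of mated pairs and the insert size of each library, so it is not reproduced here.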