THE HUMAN GENOME
1308 16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org

[Fig. 2 flow diagram, "Human Sample Workflow Process" (labels only). Stages: sample screening, DNA/RNA extraction, library construction, pre-sequencing, sequencing, post-sequencing, pre-assembly, Proto I/O file generation, assembly. Parties named: DNA Resources, Pre-Sequencing Lab, Sequencing Lab, Content Systems, Assembly Team, Informatics Research. QC checks named include size and concentration, insert size, library complexity, titer, functional test, vector contaminant screening, duplicate removal, a "gatekeeper" syntax check, and QA review of assemblies.]

nome, and even a modest error rate can reduce the effectiveness of assembly. In addition, maintaining the validity of mate-pair information is absolutely critical for the algorithms described below.
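As a rough illustration of why mate-pair validity matters: two reads sequenced from opposite ends of the same clone insert should end up facing each other, roughly one insert length apart, in a correct assembly. The sketch below checks that condition for a single pair. The record type, the function name, and the insert length and tolerance values are illustrative assumptions, not the data structures or parameters of the assembler described in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPlacement:
    """Hypothetical record: where one read landed in an assembled sequence."""
    read_id: str
    start: int      # leftmost aligned coordinate (0-based)
    end: int        # one past the rightmost aligned coordinate
    forward: bool   # True if the read aligns to the forward strand


def mate_pair_is_consistent(a: ReadPlacement, b: ReadPlacement,
                            mean_insert: int, tolerance: int) -> bool:
    """Return True if two mated reads face each other and span a distance
    within `tolerance` of the expected insert length (both assumed inputs)."""
    left, right = (a, b) if a.start <= b.start else (b, a)
    # Reads from opposite ends of one insert should point toward each other:
    # the left read on the forward strand, the right read on the reverse strand.
    if not (left.forward and not right.forward):
        return False
    # Distance from the outer edge of the left read to the outer edge of the pair.
    span = max(left.end, right.end) - left.start
    return abs(span - mean_insert) <= tolerance


# Example with made-up coordinates and an assumed 2000 bp insert.
r1 = ReadPlacement("r1", 1000, 1500, True)
r2 = ReadPlacement("r2", 2600, 3100, False)
print(mate_pair_is_consistent(r1, r2, mean_insert=2000, tolerance=300))
```

A pair whose mate link was recorded incorrectly (for example, reads from two different clones labeled as mates) would typically fail such a check or, worse, pass it by chance and pull unrelated regions together, which is the kind of error the procedural controls are meant to prevent.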
Procedural controls were established for maintaining the validity of sequence mate-pairs as sequencing reactions proceeded through the process, including strict rules built into the LIMS. The accuracy of sequence data produced by the Celera process was validated in the course of the Drosophila genome project (26). By collecting data for the entire human genome in a single facility, we were able to ensure uniform quality standards and the cost advantages associated with automation, an economy of scale, and process consistency.

Fig. 2. Flow diagram for sequencing pipeline. Samples are received, selected, and processed in compliance with standard operating procedures, with a focus on quality within and across departments. Each process has defined inputs and outputs with the capability to exchange samples and data with both internal and external entities according to defined quality guidelines. Manufacturing pipeline processes, products, quality control measures, and responsible parties are indicated and are described further in the text.

2 Genome Assembly Strategy and Characterization

Summary. We describe in this section the two approaches that we used to assemble the genome. One method involves the computational combination of all sequence reads with shredded data from GenBank to generate an independent, nonbiased view of the genome. The second approach involves clustering all of the fragments to a region or chromosome on the basis of mapping information. The clustered data were then shredded and subjected to computational assembly. Both approaches provided essentially the same reconstruction of assembled DNA sequence with proper order and orientation. The second method provided slightly greater sequence coverage (fewer gaps) and was the principal sequence used for the analysis phase. In addition, we document the completeness and correctness of this assembly process