We found 1.9 instances per Mb in which the alignment showed an internal gap, again suggesting a possible misassembly; and 0.5 instances per Mb in which the alignment indicated that two initial sequence contigs that overlapped by at least 150 bp had not been merged by PHRAP. Finally, there were another 0.9 instances per Mb with various other problems. This gives a total of 8.6 instances per Mb of possible misassembly, with about half being relatively small issues involving a few hundred bases. Some of the potential problems might not result from misassembly, but might reflect sequence polymorphism in the population, small rearrangements during growth of the large-insert clones, regions of low-quality sequence or matches between segmental duplications. Thus, the frequency of misassemblies may be overstated. On the other hand, the criteria for recognizing overlap between draft and finished clones may have eliminated some misassemblies.

Layout of the sequenced clones. We assessed the accuracy of the layout of sequenced clones onto the fingerprinted clone contigs by calculating the concordance between the positions assigned to a sequenced clone on the basis of in silico digestion and the position assigned on the basis of BAC end sequence data. The positions agreed in 98% of cases in which independent assignments could be made by both methods. The results were also compared with well studied regions containing both finished and draft genome sequence. These results indicated that sequenced clone order in the fingerprint map was reliable to within about half of one clone length (~100 kb).

A direct test of the layout is also provided by the draft genome sequence assembly itself. With extensive coverage of the genome, a correctly placed clone should usually (although not always) show sequence overlap with its neighbours in the map. We found only 421 instances of 'singleton' clones that failed to overlap a neighbouring clone.
Close examination of the data suggests that most of these are correctly placed, but simply do not yet overlap an adjacent sequenced clone. About 150 clones appeared to be candidates for being incorrectly placed.

Alignment of the fingerprint clone contigs. The alignment of the fingerprint clone contigs with the chromosomes was based on the radiation hybrid, YAC and genetic maps of STSs. The positions of most of the STSs in the draft genome sequence were consistent with these previous maps, but the positions of about 1.7% differed from one or more of them. Some of these disagreements may be due to errors in the layout of the sequenced clones or in the underlying

articles 872 NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com

[Figure 8 plots. Panel a, clone level continuity: cumulative percentage (0 to 100) against size (0 to 5,000 kb) for sequenced clones, sequenced-clone contigs, sequenced-clone-contig scaffolds and fingerprint clone contigs. Panel b, sequence level continuity: cumulative percentage (0 to 100) against size (0 to 1,000 kb) for initial sequence contigs, sequence contigs and sequence-contig scaffolds.]

Figure 8 Cumulative distributions of several measures of clone level contiguity and sequence contiguity. The figures represent the proportion of the draft genome sequence contained in contigs of at most the indicated size. a, Clone level contiguity. The clones have a tight size distribution with an N50 of ~160 kb (corresponding to 50% on the cumulative distribution). Sequenced-clone contigs represent the next level of continuity, and are linked by mRNA sequences or pairs of BAC end sequences to yield the sequenced-clone-contig scaffolds. The underlying contiguity of the layout of sequenced clones against the fingerprinted clone contigs is only partially shown at this scale. b, Sequence contiguity. The input fragments have low continuity (N50 = 21.7 kb).
After merging, the sequence contigs grow to an N50 length of about 82 kb. After linking, sequence-contig scaffolds with an N50 length of about 274 kb are created.

Figure 9 Overview of features of draft human genome. The Figure shows the occurrences of twelve important types of feature across the human genome. Large grey blocks represent centromeres and centromeric heterochromatin (size not precisely to scale). Each of the feature types is depicted in a track, from top to bottom as follows. (1) Chromosome position in Mb. (2) The approximate positions of Giemsa-stained chromosome bands at the 800 band resolution. (3) Level of coverage in the draft genome sequence. Red, areas covered by finished clones; yellow, areas covered by predraft sequence. Regions covered by draft sequenced clones are in orange, with darker shades reflecting increasing shotgun sequence coverage. (4) GC content. Percentage of bases in a 20,000-base window that are C or G. (5) Repeat density. Red line, density of SINE class repeats in a 100,000-base window; blue line, density of LINE class repeats in a 100,000-base window. (6) Density of SNPs in a 50,000-base window. The SNPs were detected by sequencing and alignments of random genomic reads. Some of the heterogeneity in SNP density reflects the methods used for SNP discovery. Rigorous analysis of SNP density requires comparing the number of SNPs identified to the precise number of bases surveyed. (7) Non-coding RNA genes. Brown, functional RNA genes such as tRNAs, snoRNAs and rRNAs; light orange, RNA pseudogenes. (8) CpG islands. Green ticks represent regions of ~200 bases with CpG levels significantly higher than in the genome as a whole, and GC ratios of at least 50%. (9) Exofish ecores. Regions of homology with the pufferfish T. nigroviridis (ref. 292) are blue. (10) ESTs with at least one intron when aligned against genomic DNA are shown as black tick marks. (11) The starts of genes predicted by Genie or Ensembl are shown as red ticks.
The starts of known genes from the RefSeq database (ref. 110) are shown in blue. (12) The names of genes that have been uniquely located in the draft genome sequence, characterized and named by the HGM Nomenclature Committee. Known disease genes from the OMIM database are red, other genes blue. This Figure is based on an earlier version of the draft genome sequence than analysed in the text, owing to production constraints. We are aware of various errors in the Figure, including omissions of some known genes and misplacements of others. Some genes are mapped to more than one location, owing to errors in assembly, close paralogues or pseudogenes. Manual review was performed to select the most likely location in these cases and to correct other regions. For updated information, see http://genome.ucsc.edu/ and http://www.ensembl.org/.

© 2001 Macmillan Magazines Ltd
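
As a rough illustration of the N50 values quoted in the legend to Figure 8, the following Python sketch computes an N50 from a list of contig lengths. It assumes the common definition (the largest length L such that contigs of length at least L together contain at least half of all bases); the function name `n50` and the example lengths are hypothetical and are not taken from the assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths.

    Assumed definition: the largest length L such that contigs of
    length >= L together contain at least half of all bases.
    """
    sizes = sorted(lengths, reverse=True)
    if not sizes:
        raise ValueError("need at least one contig length")
    half = sum(sizes) / 2
    running = 0
    for size in sizes:
        # Accumulate bases from the longest contig downwards and stop
        # at the contig that carries the running total past one half.
        running += size
        if running >= half:
            return size
    return sizes[-1]


# Hypothetical contig lengths in kb (illustrative only).
example_lengths_kb = [300, 200, 100, 80, 60, 40, 20]
print(n50(example_lengths_kb))
```

Applying the same function to initial sequence contigs, merged sequence contigs and scaffolds in turn would, under this definition, give the kind of progression reported in the legend (21.7 kb, about 82 kb and about 274 kb), although the exact procedure used for the published figures may differ.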

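
Track 4 of Figure 9 reports GC content as the percentage of bases in a 20,000-base window that are C or G. A minimal Python sketch of such a calculation is given below. It assumes consecutive, non-overlapping windows, scores a shorter final window on its own length, and counts ambiguous bases such as N in the denominator; none of these choices is specified in the legend, and the function name `gc_percent_windows` is hypothetical.

```python
from typing import List


def gc_percent_windows(sequence: str, window: int = 20_000) -> List[float]:
    """Percentage of G or C bases in consecutive windows of a sequence.

    Assumptions (not stated in the figure legend): windows do not
    overlap, the last window may be shorter than the others, and every
    character in a window counts towards the denominator.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    seq = sequence.upper()
    percentages = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc_count = chunk.count("G") + chunk.count("C")
        percentages.append(100.0 * gc_count / len(chunk))
    return percentages


# Toy sequence with a 4-base window (illustrative only).
print(gc_percent_windows("GGCCAATT", window=4))
```

A sliding window, or one that excludes ambiguous bases from the denominator, would be an equally reasonable reading of the legend and would give slightly different values near gaps.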