of the end of the contig, suggesting a possible false join in the assembly of the initial sequence contig. In about half of these cases, the potential misassembly involved fewer than 400 bases, suggesting that a single raw sequence read may have been incorrectly joined. We found 1.9 instances per Mb in which the alignment showed an internal gap, again suggesting a possible misassembly; and 0.5 instances per Mb in which the alignment indicated that two initial sequence contigs that overlapped by at least 150 bp had not been merged by PHRAP. Finally, there were another 0.9 instances per Mb with various other problems. This gives a total of 8.6 instances per Mb of possible misassembly, with about half being relatively small issues involving a few hundred bases.

Some of the potential problems might not result from misassembly, but might reflect sequence polymorphism in the population, small rearrangements during growth of the large-insert clones, regions of low-quality sequence or matches between segmental duplications. Thus, the frequency of misassemblies may be overstated. On the other hand, the criteria for recognizing overlap between draft and finished clones may have eliminated some misassemblies.

Layout of the sequenced clones. We assessed the accuracy of the layout of sequenced clones onto the fingerprinted clone contigs by calculating the concordance between the positions assigned to a sequenced clone on the basis of in silico digestion and the position assigned on the basis of BAC end sequence data.
The positions agreed in 98% of cases in which independent assignments could be made by both methods. The results were also compared with well studied regions containing both finished and draft genome sequence. These results indicated that sequenced clone order in the fingerprint map was reliable to within about half of one clone length (~100 kb).

A direct test of the layout is also provided by the draft genome sequence assembly itself. With extensive coverage of the genome, a correctly placed clone should usually (although not always) show sequence overlap with its neighbours in the map. We found only 421 instances of 'singleton' clones that failed to overlap a neighbouring clone. Close examination of the data suggests that most of these are correctly placed, but simply do not yet overlap an adjacent sequenced clone. About 150 clones appeared to be candidates for being incorrectly placed.

Alignment of the fingerprint clone contigs. The alignment of the fingerprint clone contigs with the chromosomes was based on the radiation hybrid, YAC and genetic maps of STSs. The positions of most of the STSs in the draft genome sequence were consistent with these previous maps, but the positions of about 1.7% differed from one or more of them. Some of these disagreements may be due to errors in the layout of the sequenced clones or in the underlying

articles 872 NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com

[Figure 8 plot. a, Clone level continuity: Sequenced clones, Sequenced-clone contigs, Sequenced-clone-contig scaffolds and Fingerprint clone contigs; x axis, Size (kb), 0 to 5,000; y axis, Cumulative percentage, 0 to 100. b, Sequence level continuity: Initial sequence contigs, Sequence contigs and Sequence-contig scaffolds; x axis, Size (kb), 0 to 1,000; y axis, Cumulative percentage, 0 to 100.]

Figure 8 Cumulative distributions of several measures of clone level contiguity and sequence contiguity.
The figures represent the proportion of the draft genome sequence contained in contigs of at most the indicated size. a, Clone level contiguity. The clones have a tight size distribution with an N50 of ~160 kb (corresponding to 50% on the cumulative distribution). Sequenced-clone contigs represent the next level of continuity, and are linked by mRNA sequences or pairs of BAC end sequences to yield the sequenced-clone-contig scaffolds. The underlying contiguity of the layout of sequenced clones against the fingerprinted clone contigs is only partially shown at this scale. b, Sequence contiguity. The input fragments have low continuity (N50 = 21.7 kb). After merging, the sequence contigs grow to an N50 length of about 82 kb. After linking, sequence-contig scaffolds with an N50 length of about 274 kb are created.

Figure 9 Overview of features of draft human genome. The Figure shows the occurrences of twelve important types of feature across the human genome. Large grey blocks represent centromeres and centromeric heterochromatin (size not precisely to scale). Each of the feature types is depicted in a track, from top to bottom as follows. (1) Chromosome position in Mb. (2) The approximate positions of Giemsa-stained chromosome bands at the 800 band resolution. (3) Level of coverage in the draft genome sequence. Red, areas covered by finished clones; yellow, areas covered by predraft sequence. Regions covered by draft sequenced clones are in orange, with darker shades reflecting increasing shotgun sequence coverage. (4) GC content. Percentage of bases in a 20,000-base window that are C or G. (5) Repeat density. Red line, density of SINE class repeats in a 100,000-base window; blue line, density of LINE class repeats in a 100,000-base window. (6) Density of SNPs in a 50,000-base window. The SNPs were detected by sequencing and alignments of random genomic reads. Some of the heterogeneity in SNP density reflects the methods used for SNP discovery.
Rigorous analysis of SNP density requires comparing the number of SNPs identified to the precise number of bases surveyed. (7) Non-coding RNA genes. Brown, functional RNA genes such as tRNAs, snoRNAs and rRNAs; light orange, RNA pseudogenes. (8) CpG islands. Green ticks represent regions of ~200 bases with CpG levels significantly higher than in the genome as a whole, and GC ratios of at least 50%. (9) Exofish ecores. Regions of homology with the pufferfish T. nigroviridis (ref. 292) are blue. (10) ESTs with at least one intron when aligned against genomic DNA are shown as black tick marks. (11) The starts of genes predicted by Genie or Ensembl are shown as red ticks. The starts of known genes from the RefSeq database (ref. 110) are shown in blue. (12) The names of genes that have been uniquely located in the draft genome sequence, characterized and named by the HGM Nomenclature Committee. Known disease genes from the OMIM database are red, other genes blue. This Figure is based on an earlier version of the draft genome sequence than analysed in the text, owing to production constraints. We are aware of various errors in the Figure, including omissions of some known genes and misplacements of others. Some genes are mapped to more than one location, owing to errors in assembly, close paralogues or pseudogenes. Manual review was performed to select the most likely location in these cases and to correct other regions. For updated information, see http://genome.ucsc.edu/ and http://www.ensembl.org/.

© 2001 Macmillan Magazines Ltd
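
The N50 values and cumulative distributions described in the Figure 8 legend can be computed from a list of contig lengths. The following is a minimal sketch in Python, assuming the usual definition of N50 as the largest length L such that contigs of length at least L together contain at least half of all bases. The function names n50 and cumulative_fraction and the toy lengths are illustrative assumptions and are not taken from the assembly software used for the draft genome sequence.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a set of contig lengths.

    Assumed definition: the largest length L such that contigs of
    length >= L contain at least 50% of the total bases.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def cumulative_fraction(lengths: Sequence[int], size: int) -> float:
    """Fraction of all bases held in contigs of at most `size`.

    This mirrors the quantity plotted in Figure 8: the proportion of
    sequence contained in contigs of at most the indicated size.
    """
    total = sum(lengths)
    return sum(length for length in lengths if length <= size) / total


if __name__ == "__main__":
    toy_lengths_kb = [10, 20, 30, 40]  # hypothetical contig sizes in kb
    print(n50(toy_lengths_kb))
    print(cumulative_fraction(toy_lengths_kb, 20))
```

In this toy example the cumulative fraction passes 50% at a contig size of 30 kb, which is the value n50 returns; this is the sense in which the legend describes N50 as corresponding to 50% on the cumulative distribution.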