contigs in 942 fingerprint clone contigs.

The hierarchy of contigs is summarized in Fig. 7. Initial sequence contigs are integrated to create merged sequence contigs, which are then linked to form sequence-contig scaffolds. These scaffolds reside within sequenced-clone contigs, which in turn reside within fingerprint clone contigs.

The draft genome sequence

The result of the assembly process is an integrated draft sequence of the human genome. Several features of the draft genome sequence are reported in Tables 5-7, including the proportion represented by finished, draft and predraft categories. The Tables also show the numbers and lengths of different types of contig, for each chromosome and for the genome as a whole.

The contiguity of the draft genome sequence at each level is an important feature. Two commonly used statistics have significant drawbacks for describing contiguity. The 'average length' of a contig is deflated by the presence of many small contigs comprising only a small proportion of the genome, whereas the 'length-weighted average length' is inflated by the presence of large segments of finished sequence.
Instead, we chose to describe the contiguity as a property of the 'typical' nucleotide. We used a statistic called the 'N50 length', defined as the largest length L such that 50% of all nucleotides are contained in contigs of size at least L.

The continuity of the draft genome sequence reported here and the effectiveness of assembly can be readily seen from the following: half of all nucleotides reside within an initial sequence contig of at least 21.7 kb, a sequence contig of at least 82 kb, a sequence-contig scaffold of at least 274 kb, a sequenced-clone contig of at least 826 kb and a fingerprint clone contig of at least 8.4 Mb (Tables 6, 7). The cumulative distributions for each of these measures of contiguity are shown in Fig. 8, in which the N50 values for each measure can be seen as the value at which the cumulative distributions cross 50%.

We have also estimated the size of each chromosome, by estimating the gap sizes (see below) and the extent of missing heterochromatic sequence (refs 93, 94, 105-108) (Table 8). This is undoubtedly an oversimplification and does not adequately take into account the sequence status of each chromosome. Nonetheless, it provides a useful way to relate the draft sequence to the chromosomes.

Quality assessment

The draft genome sequence already covers the vast majority of the genome, but it remains an incomplete, intermediate product that is regularly updated as we work towards a complete finished sequence. The current version contains many gaps and errors. We therefore sought to evaluate the quality of various aspects of the current draft genome sequence, including the sequenced clones themselves, their assignment to a position in the fingerprint clone contigs, and the assembly of initial sequence contigs from the individual clones into sequence-contig scaffolds.
Nucleotide accuracy is reflected in a PHRAP score assigned to each base in the draft genome sequence and available to users through the Genome Browsers (see below) and public database entries. A summary of these scores for the unfinished portion of the genome is shown in Table 9. About 91% of the unfinished draft genome sequence has an error rate of less than 1 per 10,000 bases (PHRAP score > 40), and about 96% has an error rate of less than 1 in 1,000 bases (PHRAP > 30). These values are based only on the quality scores for the bases in the sequenced clones; they do not reflect additional confidence in the sequences that are represented in overlapping clones. The finished portion of the draft genome sequence has an error rate of less than 1 per 10,000 bases.

Individual sequenced clones. We assessed the frequency of misassemblies, which can occur when the assembly program PHRAP joins two nonadjacent regions in the clone into a single initial sequence contig. The frequency of misassemblies depends heavily on the depth and quality of coverage of each clone and the nature of the underlying sequence; thus it may vary among genomic regions and among individual centres. Most clone misassemblies are readily corrected as coverage is added during finishing, but they may have been propagated into the current version of the draft genome sequence and they justify caution for certain applications.

We estimated the frequency of misassembly by examining instances in which there was substantial overlap between a draft clone and a finished clone. We studied 83 Mb of such overlaps, involving about 9,000 initial sequence contigs.
We found 5.3 instances per Mb in which the alignment of an initial sequence contig to the finished sequence failed to extend to within 200 bases

articles
NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com 871

Table 6 Clone level contiguity of the draft genome sequence

Chromosome  Sequenced-clone contigs   Sequenced-clone-contig scaffolds  Fingerprint clone contigs with sequence
            Number  N50 length (kb)   Number  N50 length (kb)           Number  N50 length (kb)
All          4,884              826    2,191            2,279              942            8,398
1              453              650      197            1,915              106            3,537
2              348            1,028      127            3,140               52           10,628
3              409              672      201            1,550               73            5,077
4              384              606      163            1,659               41            6,918
5              385              623      164            1,642               48            5,747
6              292              814       98            3,292               17           24,680
7              224            1,074       86            3,527               29           20,401
8              292              542      115            1,742               43            6,236
9              143            1,242       78            2,411               21           29,108
10             179            1,097      105            1,952               16           30,284
11             224              887       89            3,024               31            9,414
12             196            1,138       76            2,717               28            9,546
13             128            1,151       56            3,257               13           25,256
14              54            3,079       27            8,489               14           22,128
15             123              797       56            2,095               19            8,274
16             159              620       92            1,317               57            2,716
17             138              831       58            2,138               43            2,816
18             137              709       47            2,572               24            4,887
19             159              569       79            1,200               51            1,534
20              42            2,318       20            6,862                9           23,489
21               5           28,515        5           28,515                5           28,515
22              11           23,048       11           23,048               11           23,048
X              325              572      181            1,082              143            1,436
Y               27            1,539       20            3,290                8            5,135
UL              47              227       40              281               40              281

Number and size of sequenced-clone contigs, sequenced-clone-contig scaffolds and those fingerprint clone contigs (see Box 1) that contain sequenced clones; some small fingerprint clone contigs do not as yet have associated sequence. UL, fingerprint clone contigs that could not reliably be placed on a chromosome.
These length estimates are from the draft genome sequence, in which gaps between sequence contigs are arbitrarily represented with 100 Ns and gaps between sequence clone contigs with 50,000 Ns for 'bridged gaps' and 100,000 Ns for 'unbridged gaps'. These arbitrary values differ minimally from empirical estimates of gap size (see text), and using the empirically derived estimates would change the N50 lengths presented here only slightly. For unfinished chromosomes, the N50 length ranges from 1.5 to 3 times the arithmetic mean for sequenced-clone contigs, 1.5 to 3 times for sequenced-clone-contig scaffolds, and 1.5 to 6 times for fingerprint clone contigs with sequence.

© 2001 Macmillan Magazines Ltd
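
As an illustration of the N50 lengths reported in Table 6, the following minimal Python sketch applies the definition given in the text (the largest length L such that 50% of all nucleotides are contained in contigs of size at least L) to a list of contig lengths. The function names and the toy lengths are assumptions made for illustration only; this is not the software used to produce the draft assembly, and it ignores the treatment of gaps described in the table note.

```python
def n50(lengths):
    """Largest L such that at least 50% of all bases lie in contigs of length >= L."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # The first contig that brings the running sum to half the total
        # bases gives the N50 length.
        if 2 * running >= total:
            return length


def mean_length(lengths):
    """Plain arithmetic mean; pulled down by many small contigs."""
    return sum(lengths) / len(lengths)


def length_weighted_mean(lengths):
    """Mean contig length seen from a randomly chosen base; pulled up by long contigs."""
    return sum(x * x for x in lengths) / sum(lengths)


# Hypothetical toy data: one 10 kb contig and ten 1 kb contigs.
toy = [10] + [1] * 10
print(n50(toy))                   # 10
print(round(mean_length(toy), 2)) # 1.82
print(length_weighted_mean(toy))  # 5.5
```

In this toy case half of all bases sit in the single 10 kb contig, so the N50 length is 10 kb, while the arithmetic mean is below 2 kb because of the ten small contigs. This mirrors the table note's observation that the N50 length exceeds the arithmetic mean for unfinished chromosomes.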