fingerprint map. However, many involve STSs that have been localized on only one or two of the previous maps or that occur as isolated discrepancies in conflict with several flanking STSs.
Many of these cases are probably due to errors in the previous maps (with error rates for individual maps estimated at 1–2%; ref. 100). Others may be due to incorrect assignment of the STSs to the draft genome sequence (by the electronic polymerase chain reaction (e-PCR) computer program) or to database entries that contain sequence data from more than one clone (owing to cross-contamination).

Graphical views of the independent data sets were particularly useful in detecting problems with order or orientation (Fig. 5). Areas of conflict were reviewed and corrected if supported by the underlying data. In the version discussed here, there were 41 sequenced clones falling in 14 sequenced-clone contigs with STS content information from multiple maps that disagreed with the flanking clones or sequenced-clone contigs; the placement of these clones thus remains suspect. Four of these instances suggest errors in the fingerprint map, whereas the others suggest errors in the layout of sequenced clones. These cases are being investigated and will be corrected in future versions.

Assembly of the sequenced clones. We assessed the accuracy of the assembly by using a set of 148 draft clones comprising 22.4 Mb for which finished sequence subsequently became available (ref. 104). The initial sequence contigs lack information about order and orientation, and GigAssembler attempts to use linking data to infer such information as far as possible (ref. 104). Starting with initial sequence contigs that were unordered and unoriented, the program placed 90% of the initial sequence contigs in the correct orientation and 85% in the correct order with respect to one another. In a separate test, GigAssembler was tested on simulated draft data produced from finished sequence on chromosome 22 and similar results were obtained.

Some problems remain at all levels.
First, errors in the initial sequence contigs persist in the merged sequence contigs built from them and can cause difficulties in the assembly of the draft genome sequence. Second, GigAssembler may fail to merge some overlapping sequences because of poor data quality, allelic differences or misassemblies of the initial sequence contigs; this may result in apparent local duplication of a sequence. We have estimated by various methods the amount of such artefactual duplication in the assembly from these and other sources to be about 100 Mb. On the other hand, nearby duplicated sequences may occasionally be incorrectly merged. Some sequenced clones remain incorrectly placed on the layout, as discussed above, and others (<0.5%) remain unplaced. The fingerprint map has undoubtedly failed to resolve some closely related duplicated regions, such as the Williams region and several highly repetitive subtelomeric and pericentric regions (see below). Detailed examination and sequence finishing may be required to sort out these regions precisely, as has been done with chromosome Y (ref. 89). Finally, small sequenced-clone contigs with limited or no STS landmark content remain difficult to place. Full utilization of the higher resolution radiation hybrid map (the TNG map) may help in this (ref. 95). Future targeted FISH experiments and increased map continuity will also facilitate positioning of these sequences.

Genome coverage

We next assessed the nature of the gaps within the draft genome sequence, and attempted to estimate the fraction of the human genome not represented within the current version.

Gaps in draft genome sequence coverage. There are three types of gap in the draft genome sequence: gaps within unfinished sequenced clones; gaps between sequenced-clone contigs, but within fingerprint clone contigs; and gaps between fingerprint clone contigs.
The first two types are relatively straightforward to close simply by performing additional sequencing and finishing on already identified clones. Closing the third type may require screening of additional large-insert clone libraries and possibly new technologies for the most recalcitrant regions. We consider these three cases in turn.

We estimated the size of gaps within draft clones by studying instances in which there was substantial overlap between a draft clone and a finished clone, as described above. The average gap size in these draft sequenced clones was 554 bp, although the precise estimate was sensitive to certain assumptions in the analysis. Assuming that the sequence gaps in the draft genome sequence are fairly represented by this sample, about 80 Mb or about 3% (likely range 2–4%) of sequence may lie in the 145,514 gaps within draft sequenced clones.

The gaps between sequenced-clone contigs but within fingerprint clone contigs are more difficult to evaluate directly, because the draft genome sequence flanking many of the gaps is often not precisely aligned with the fingerprinted clones. However, most are much smaller than a single BAC. In fact, nearly three-quarters of these gaps are bridged by one or more individual BACs, as indicated by linking information from BAC end sequences. We measured the sizes of a subset of gaps directly by examining restriction fragment fingerprints of overlapping clones. A study of 157 'bridged' gaps and 55 'unbridged' gaps gave an average gap size of 25 kb. Allowing for the possibility that these gaps may not be fully representative and that some restriction fragments are not included in the calculation, a more conservative estimate of gap size would be 35 kb. This would indicate that about 150 Mb or 5% of the human genome may reside in the 4,076 gaps between sequenced-clone contigs. This sequence should be readily obtained as the clones spanning them are sequenced.
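The two totals quoted above follow from multiplying the number of gaps by the assumed mean gap size. The following is a minimal sketch of that arithmetic, not the authors' analysis; the function name and the genome size used for the percentages are assumptions made here for illustration (this section does not state the denominator it uses).

```python
# Back-of-envelope check of the gap totals quoted in the text.
# ASSUMPTION: a genome size of about 3,200 Mb is used only to turn
# megabases into a rough percentage; it is not given in this section.
ASSUMED_GENOME_MB = 3200.0


def total_gap_mb(n_gaps: int, mean_gap_bp: float) -> float:
    """Total sequence (in Mb) expected to lie in n_gaps gaps of a given mean size."""
    return n_gaps * mean_gap_bp / 1e6


# Gaps within draft sequenced clones: 145,514 gaps averaging 554 bp.
within_clone_mb = total_gap_mb(145_514, 554)

# Gaps between sequenced-clone contigs: 4,076 gaps, using the measured
# mean (25 kb) and the more conservative estimate (35 kb).
between_contig_mb_measured = total_gap_mb(4_076, 25_000)
between_contig_mb_conservative = total_gap_mb(4_076, 35_000)

for label, mb in [
    ("within draft clones", within_clone_mb),
    ("between contigs (25 kb)", between_contig_mb_measured),
    ("between contigs (35 kb)", between_contig_mb_conservative),
]:
    share = 100.0 * mb / ASSUMED_GENOME_MB
    print(f"{label}: {mb:.1f} Mb, roughly {share:.1f}% of the assumed genome size")
```

The products come to about 80.6 Mb and 142.7 Mb, in line with the rounded figures of "about 80 Mb" and "about 150 Mb" in the text. The percentages depend on the assumed denominator and are only indicative.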
The size of the gaps between fingerprint clone contigs was estimated by comparing the fingerprint maps to the essentially completed chromosomes 21 and 22. The analysis shows that the fingerprinted BAC clones in the global database cover 97–98% of the sequenced portions of those chromosomes (ref. 86). The published sequences of these chromosomes also contain a few small gaps (5 and 11, respectively) amounting to some 1.6% of the euchromatic sequence, and do not include the heterochromatic portion. This suggests that the gaps between contigs in the fingerprint map contain about 4% of the euchromatic genome. Experience with closure of such gaps on chromosomes 20 and 7 suggests that many of these gaps are less than one clone in length and will be closed by clones from other libraries. However, recovery of sequence from these gaps represents the most challenging aspect of producing a complete finished sequence of the human genome.

As another measure of the representation of the BAC libraries, Riethman (ref. 109) has found BAC or cosmid clones that link to telomeric half-YACs or to the telomeric sequence itself for 40 of the 41 non-satellite telomeres. Thus, the fingerprint map appears to have no substantial gaps in these regions. Many of the pericentric regions are also represented, but analysis is less complete here (see below).

Representation of random raw sequences. In another approach to measuring coverage, we compared a collection of random raw sequence reads to the existing draft genome sequence. In principle,

874 NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com

Table 9 Distribution of PHRAP scores in the draft genome sequence

PHRAP score    Percentage of bases in the draft genome sequence
0–9            0.6
10–19          1.3
20–29          2.2
30–39          4.8
40–49          8.1
50–59          8.7
60–69          9.0
70–79          12.1
80–89          17.3
>90            35.9
PHRAP scores are a logarithmically based representation of the error probability. A PHRAP score of X corresponds to an error probability of $10^{-X/10}$. Thus, PHRAP scores of 20, 30 and 40 correspond to accuracy of 99%, 99.9% and 99.99%, respectively. PHRAP scores are derived from quality scores of the underlying sequence reads used in sequence assembly. See http://www.genome.washington.edu/UWGC/analysistools/phrap.htm.

© 2001 Macmillan Magazines Ltd
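The relation in the Table 9 footnote can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, the table is transcribed by the lower bound of each score bin, and the last bin (printed as ">90") is keyed at 90 for convenience.

```python
def phrap_error_probability(score: float) -> float:
    """Error probability for a PHRAP score X, using 10 ** (-X / 10)."""
    return 10 ** (-score / 10)


def phrap_accuracy_percent(score: float) -> float:
    """Per-base accuracy, in percent, implied by a PHRAP score."""
    return 100.0 * (1.0 - phrap_error_probability(score))


# Table 9 as (lower bound of score bin, percentage of bases).
# ASSUMPTION: the ">90" row is keyed by 90 here.
TABLE_9 = [
    (0, 0.6), (10, 1.3), (20, 2.2), (30, 4.8), (40, 8.1),
    (50, 8.7), (60, 9.0), (70, 12.1), (80, 17.3), (90, 35.9),
]


def percent_of_bases_at_or_above(min_score: int) -> float:
    """Sum the percentages of all bins whose lower bound is >= min_score."""
    return sum(pct for lower, pct in TABLE_9 if lower >= min_score)


for score in (20, 30, 40):
    print(f"PHRAP {score}: accuracy {phrap_accuracy_percent(score):.2f}%")

print(f"Bases with PHRAP >= 40: {percent_of_bases_at_or_above(40):.1f}%")
```

Summing the rows from 40 upward gives about 91.1% of bases in bins corresponding to an accuracy of 99.99% or better, and the ten rows together sum to 100%.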