the fraction of reads matching the draft genome sequence should provide an estimate of genome coverage. In practice, the comparison is complicated by the need to allow for repeat sequences, the imperfect sequence quality of both the raw sequence and the draft genome sequence, and the possibility of polymorphism. Nonetheless, the analysis provides a reasonable view of the extent to which the genome is represented in the draft genome sequence and the public databases.

We compared the raw sequence reads against both the sequences used in the construction of the draft genome sequence and all of GenBank using the BLAST computer program. Of the 5,615 raw sequence reads analysed (each containing at least 100 bp of contiguous non-repetitive sequence), 4,924 had a match of ≥97% identity with a sequenced clone, indicating that 88 ± 1.5% of the genome was represented in sequenced clones. The estimate is subject to various uncertainties. Most serious is the proportion of repeat sequence in the remainder of the genome. If the unsequenced portion of the genome is unusually rich in repeated sequence, we would underestimate its size (although the excess would be comprised of repeated sequence).

We examined those raw sequences that failed to match by comparing them to the other publicly available sequence resources. Fifty (0.9%) had matches in public databases containing cDNA sequences, STSs and similar data. An additional 276 (or 43% of the remaining raw sequence) had matches to the whole-genome shotgun reads discussed above (consistent with the idea that these reads cover about half of the genome).

We also examined the extent of genome coverage by aligning the cDNA sequences for genes in the RefSeq dataset (ref. 110) to the draft genome sequence. We found that 88% of the bases of these cDNAs could be aligned to the draft genome sequence at high stringency (at least 98% identity).
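The read-based coverage figures above can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the text. The function name `fraction_with_se` and the choice of a simple binomial standard error are assumptions made for illustration, not the authors' stated method. The text does not say how the ± 1.5% was derived. Summing the three match categories into one combined fraction is likewise this sketch's reading of the text.

```python
import math

# Counts quoted in the text.
TOTAL_READS = 5615        # raw reads analysed
MATCHED_CLONES = 4924     # reads matching a sequenced clone at >=97% identity
MATCHED_OTHER_DB = 50     # unmatched reads found in cDNA/STS-type databases
MATCHED_WGS = 276         # further unmatched reads found among whole-genome shotgun reads


def fraction_with_se(hits, total):
    """Return (fraction, binomial standard error) for hits out of total.

    Assumption: reads are independent random samples, so the binomial
    model applies. This covers sampling error only.
    """
    p = hits / total
    se = math.sqrt(p * (1.0 - p) / total)
    return p, se


clone_cov, clone_se = fraction_with_se(MATCHED_CLONES, TOTAL_READS)

# Reads with no match in a sequenced clone, and what is left after the
# cDNA/STS database hits are set aside.
unmatched = TOTAL_READS - MATCHED_CLONES
remaining_after_db = unmatched - MATCHED_OTHER_DB
wgs_share = MATCHED_WGS / remaining_after_db

# Fraction of reads matching any of the three public resources.
combined_cov = (MATCHED_CLONES + MATCHED_OTHER_DB + MATCHED_WGS) / TOTAL_READS

print(f"sequenced clones:      {clone_cov:.1%} (binomial SE {clone_se:.2%})")
print(f"other databases:       {MATCHED_OTHER_DB / TOTAL_READS:.1%}")
print(f"WGS share of the rest: {wgs_share:.0%}")
print(f"combined coverage:     {combined_cov:.1%}")
```

The sampling error alone comes to well under one percentage point, so the wider ± 1.5% quoted in the text presumably also reflects the other uncertainties it lists, such as repeat content and sequence quality.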
(A few of the alignments with either the random raw sequence reads or the cDNAs may be to a highly similar region in the genome, but such matches should affect the estimate of genome coverage by considerably less than 1%, based on the estimated extent of duplication within the genome (see below).)

These results indicate that about 88% of the human genome is represented in the draft genome sequence and about 94% in the combined publicly available sequence databases. The figure of 88% agrees well with our independent estimates above that about 3%, 5% and 4% of the genome reside in the three types of gap in the draft genome sequence.

Finally, a small experimental check was performed by screening a large-insert clone library with probes corresponding to 16 of the whole genome shotgun reads that failed to match the draft genome sequence. Five hybridized to many clones from different fingerprint clone contigs and were discarded as being repetitive. Of the remaining eleven, two fell within sequenced clones (presumably within sequence gaps of the first type), eight fell in fingerprint clone contigs but between sequenced clones (gaps of the second type) and one failed to identify clones in the fingerprint map (gaps of the third type) but did identify clones in another large-insert library. Although these numbers are small, they are consistent with the view that much of the remaining genome sequence lies within already identified clones in the current map.

Estimates of genome and chromosome sizes. Informed by this analysis of genome coverage, we proceeded to estimate the sizes of the genome and each of the chromosomes (Table 8). Beginning with the current assigned sequence for each chromosome, we corrected for the known gaps on the basis of their estimated sizes (see above). We attempted to account for the sizes of centromeres and heterochromatin, neither of which is well represented in the draft sequence.
Finally, we corrected for around 100 Mb of artefactual duplication in the assembly. We arrived at a total human genome size estimate of around 3,200 Mb, which compares favourably with previous estimates based on DNA content.

We also independently estimated the size of the euchromatic portion of the genome by determining the fraction of the 5,615 random raw sequences that matched the finished portion of the human genome (whose total length is known with greater precision). Twenty-nine per cent of these raw sequences found a match among 835 Mb of nonredundant finished sequence. This leads to an estimate of the euchromatic genome size of 2.9 Gb. This agrees reasonably with the prediction above based on the length of the draft genome sequence (Table 8).

Update. The results above reflect the data on 7 October 2000. New data are continually being added, with improvements being made to the physical map, new clones being sequenced to close gaps and draft clones progressing to full shotgun coverage and finishing. The draft genome sequence will be regularly reassembled and publicly released.

Currently, the physical map has been refined such that the number of fingerprint clone contigs has fallen from 1,246 to 965; this reflects the elimination of some artefactual contigs and the closure of some gaps. The sequence coverage has risen such that 90% of the human genome is now represented in the sequenced clones and more than 94% is represented in the combined publicly available sequence databases. The total amount of finished sequence is now around 1 Gb.

Broad genomic landscape

What biological insights can be gleaned from the draft sequence? In this section, we consider very large-scale features of the draft genome sequence: the distribution of GC content, CpG islands and recombination rates, and the repeat content and gene content of the human genome.
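The independent euchromatic size estimate given above (29% of random reads matching 835 Mb of finished sequence) follows from simple proportionality, sketched here. The two input values come from the text. The helper name `estimate_total_size_mb` and the assumption that reads sample the euchromatic genome uniformly, so that the hit fraction equals the size fraction, are illustrative choices rather than a description of the authors' exact procedure.

```python
FINISHED_MB = 835       # nonredundant finished sequence, in Mb (from the text)
MATCH_FRACTION = 0.29   # share of random raw reads matching finished sequence


def estimate_total_size_mb(sampled_mb, hit_fraction):
    """Scale a region of known size up to the whole.

    Assumption: reads are drawn uniformly from the whole, so
    hit_fraction ~= sampled_mb / total_mb.
    """
    if not 0.0 < hit_fraction <= 1.0:
        raise ValueError("hit_fraction must be in (0, 1]")
    return sampled_mb / hit_fraction


euchromatic_mb = estimate_total_size_mb(FINISHED_MB, MATCH_FRACTION)
print(f"estimated euchromatic genome size: {euchromatic_mb / 1000:.1f} Gb")
```

Because the match fraction is quoted only to two significant figures, the result should be read as roughly 2.9 Gb rather than as a precise value.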
The draft genome sequence makes it possible to integrate these features and others at scales ranging from individual

Figure 10 Screen shot from UCSC Draft Human Genome Browser. See http://genome.ucsc.edu/.

Figure 11 Screen shot from the Genome Browser of Project Ensembl. See http://www.ensembl.org.

NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com | 875
© 2001 Macmillan Magazines Ltd