ultimate goal of a completely finished sequence. The results below are based on the map and sequence data available on 7 October 2000, except as otherwise noted. At the end of this section, we provide a brief update of key data.

Clone selection

The hierarchical shotgun method involves the sequencing of overlapping large-insert clones spanning the genome.
For the Human Genome Project, clones were largely chosen from eight large-insert libraries containing BAC or P1-derived artificial chromosome (PAC) clones (Table 1; refs 82–88). The libraries were made by partial digestion of genomic DNA with restriction enzymes. Together, they represent around 65-fold coverage (redundant sampling) of the genome. Libraries based on other vectors, such as cosmids, were also used in early stages of the project.

The libraries (Table 1) were prepared from DNA obtained from anonymous human donors in accordance with US Federal Regulations for the Protection of Human Subjects in Research (45CFR46) and following full review by an Institutional Review Board. Briefly, the opportunity to donate DNA for this purpose was broadly advertised near the two laboratories engaged in library

NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com 865

Box 1 Genome glossary

Sequence

Raw sequence: Individual unassembled sequence reads, produced by sequencing of clones containing DNA inserts.
Paired-end sequence: Raw sequence obtained from both ends of a cloned insert in any vector, such as a plasmid or bacterial artificial chromosome.
Finished sequence: Complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps.
Coverage (or depth): The average number of times a nucleotide is represented by a high-quality base in a collection of random raw sequence. Operationally, a 'high-quality base' is defined as one with an accuracy of at least 99% (corresponding to a PHRED score of at least 20).
Full shotgun coverage: The coverage in random raw sequence needed from a large-insert clone to ensure that it is ready for finishing; this varies among centres but is typically 8–10-fold. Clones with full shotgun coverage can usually be assembled with only a handful of gaps per 100 kb.
Half shotgun coverage: Half the amount of full shotgun coverage (typically, 4–5-fold random coverage).
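As a rough illustration of the coverage definition above, the following Python sketch counts high-quality bases (PHRED score of at least 20) across a set of reads and divides by the clone length. The function names, the list-of-quality-scores input format and the toy numbers are assumptions made for illustration and do not come from the paper; the error-probability relation is the one given in the glossary's PHRED entry.

```python
from typing import Sequence

# Glossary threshold: a high-quality base has at least 99% accuracy (PHRED >= 20).
HIGH_QUALITY_PHRED = 20


def phred_to_error_probability(score: float) -> float:
    """Approximate error probability for a PHRED score X, i.e. 10 ** (-X / 10)."""
    return 10 ** (-score / 10)


def coverage(read_qualities: Sequence[Sequence[int]], clone_length: int) -> float:
    """Average number of high-quality bases per nucleotide of the clone.

    read_qualities holds one list of per-base PHRED scores per raw read
    (a hypothetical input format chosen for this sketch).
    """
    if clone_length <= 0:
        raise ValueError("clone_length must be positive")
    high_quality_bases = sum(
        1 for read in read_qualities for q in read if q >= HIGH_QUALITY_PHRED
    )
    return high_quality_bases / clone_length


if __name__ == "__main__":
    # Toy example: two short reads over a 2-base "clone".
    reads = [[30, 30, 10], [20, 19, 40]]
    print(coverage(reads, clone_length=2))
    print(phred_to_error_probability(20))
```

Under these assumptions, a clone whose computed coverage falls in the 8–10-fold range would correspond to full shotgun coverage, and one in the 4–5-fold range to half shotgun coverage.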

Clones

BAC clone: Bacterial artificial chromosome vector carrying a genomic DNA insert, typically 100–200 kb. Most of the large-insert clones sequenced in the project were BAC clones.
Finished clone: A large-insert clone that is entirely represented by finished sequence.
Full shotgun clone: A large-insert clone for which full shotgun sequence has been produced.
Draft clone: A large-insert clone for which roughly half-shotgun sequence has been produced. Operationally, the collection of draft clones produced by each centre was required to have an average coverage of fourfold for the entire set and a minimum coverage of threefold for each clone.
Predraft clone: A large-insert clone for which some shotgun sequence is available, but which does not meet the standards for inclusion in the collection of draft clones.

Contigs and scaffolds

Contig: The result of joining an overlapping collection of sequences or clones.
Scaffold: The result of connecting contigs by linking information from paired-end reads from plasmids, paired-end reads from BACs, known messenger RNAs or other sources. The contigs in a scaffold are ordered and oriented with respect to one another.
Fingerprint clone contigs: Contigs produced by joining clones inferred to overlap on the basis of their restriction digest fingerprints.
Sequenced-clone layout: Assignment of sequenced clones to the physical map of fingerprint clone contigs.
Initial sequence contigs: Contigs produced by merging overlapping sequence reads obtained from a single clone, in a process called sequence assembly.
Merged sequence contigs: Contigs produced by taking the initial sequence contigs contained in overlapping clones and merging those found to overlap. These are also referred to simply as 'sequence contigs' where no confusion will result.
Sequence-contig scaffolds: Scaffolds produced by connecting sequence contigs on the basis of linking information.
Sequenced-clone contigs: Contigs produced by merging overlapping sequenced clones.
Sequenced-clone-contig scaffolds: Scaffolds produced by joining sequenced-clone contigs on the basis of linking information.
Draft genome sequence: The sequence produced by combining the information from the individual sequenced clones (by creating merged sequence contigs and then employing linking information to create scaffolds) and positioning the sequence along the physical map of the chromosomes.
N50 length: A measure of the contig length (or scaffold length) containing a 'typical' nucleotide. Specifically, it is the maximum length L such that 50% of all nucleotides lie in contigs (or scaffolds) of size at least L.

Computer programs and databases

PHRED: A widely used computer program that analyses raw sequence to produce a 'base call' with an associated 'quality score' for each position in the sequence. A PHRED quality score of X corresponds to an error probability of approximately $10^{-X/10}$. Thus, a PHRED quality score of 30 corresponds to 99.9% accuracy for the base call in the raw read.
PHRAP: A widely used computer program that assembles raw sequence into sequence contigs and assigns to each position in the sequence an associated 'quality score', on the basis of the PHRED scores of the raw sequence reads. A PHRAP quality score of X corresponds to an error probability of approximately $10^{-X/10}$. Thus, a PHRAP quality score of 30 corresponds to 99.9% accuracy for a base in the assembled sequence.
GigAssembler: A computer program developed during this project for merging the information from individual sequenced clones into a draft genome sequence.
Public sequence databases: The three coordinated international sequence databases: GenBank, the EMBL data library and DDBJ.

Map features

STS: Sequence tagged site, corresponding to a short (typically less than 500 bp) unique genomic locus for which a polymerase chain reaction assay has been developed.
EST: Expressed sequence tag, obtained by performing a single raw sequence read from a random complementary DNA clone.
SSR: Simple sequence repeat, a sequence consisting largely of a tandem repeat of a specific k-mer (such as $(\mathrm{CA})_{15}$). Many SSRs are polymorphic and have been widely used in genetic mapping.
SNP: Single nucleotide polymorphism, or a single nucleotide position in the genome sequence for which two or more alternative alleles are present at appreciable frequency (traditionally, at least 1%) in the human population.
Genetic map: A genome map in which polymorphic loci are positioned relative to one another on the basis of the frequency with which they recombine during meiosis. The unit of distance is centimorgans (cM), denoting a 1% chance of recombination.
Radiation hybrid (RH) map: A genome map in which STSs are positioned relative to one another on the basis of the frequency with which they are separated by radiation-induced breaks. The frequency is assayed by analysing a panel of human–hamster hybrid cell lines, each produced by lethally irradiating human cells and fusing them with recipient hamster cells such that each carries a collection of human chromosomal fragments. The unit of distance is centirays (cR), denoting a 1% chance of a break occurring between two loci.

© 2001 Macmillan Magazines Ltd
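The N50 length defined in the glossary can be made concrete with a short Python sketch. It follows the stated definition directly (the maximum length L such that at least 50% of all nucleotides lie in contigs or scaffolds of size at least L). The function name and the example lengths are assumptions for illustration only and are not taken from the paper or from any assembler's code.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Largest L such that at least 50% of all nucleotides lie in
    contigs (or scaffolds) of size at least L."""
    sorted_lengths = sorted(lengths, reverse=True)
    total = sum(sorted_lengths)
    if total <= 0:
        raise ValueError("need at least one contig of positive length")
    running = 0
    # Walk from the longest contig down; the first length at which the
    # running total reaches half of all nucleotides is the N50 length.
    for length in sorted_lengths:
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: running total must reach the total")


if __name__ == "__main__":
    # Hypothetical contig lengths (in kb) for illustration.
    print(n50([10, 20, 30, 40]))
```

In the toy example the contigs of 40 kb and 30 kb together hold 70 of the 100 kb, so half of all nucleotides lie in contigs of at least 30 kb, while the 40 kb contig alone holds less than half.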