ARTICLES NATURE | Vol 447 | 14 June 2007

• Regulatory sequences that surround transcription start sites are symmetrically distributed, with no bias towards upstream regions.
• Chromatin accessibility and histone modification patterns are highly predictive of both the presence and activity of transcription start sites.
• Distal DNaseI hypersensitive sites have characteristic histone modification patterns that reliably distinguish them from promoters; some of these distal sites show marks consistent with insulator function.
• DNA replication timing is correlated with chromatin structure.
• A total of 5% of the bases in the genome can be confidently identified as being under evolutionary constraint in mammals; for approximately 60% of these constrained bases, there is evidence of function on the basis of the results of the experimental assays performed to date.
• Although there is general overlap between genomic regions identified as functional by experimental assays and those under evolutionary constraint, not all bases within these experimentally defined regions show evidence of constraint.
• Different functional elements vary greatly in their sequence variability across the human population and in their likelihood of residing within a structurally variable region of the genome.
• Surprisingly, many functional elements are seemingly unconstrained across mammalian evolution. This suggests the possibility of a large pool of neutral elements that are biochemically active but provide no specific benefit to the organism. This pool may serve as a 'warehouse' for natural selection, potentially acting as the source of lineage-specific elements and functionally conserved but non-orthologous elements between species.

Below, we first provide an overview of the experimental techniques used for our studies, after which we describe the insights gained from analysing and integrating the generated data sets. We conclude with a perspective of what we have learned to date about this 1% of the human genome and what we believe the prospects are for a broader and deeper investigation of the functional elements in the human genome. To aid the reader, Box 1 provides a glossary for many of the abbreviations used throughout this paper.

Experimental techniques

Table 1 (expanded in Supplementary Information section 1.1) lists the major experimental techniques used for the studies reported here, relevant acronyms, and references reporting the generated data sets.
These data sets reflect over 400 million experimental data points (603 million data points if one includes comparative sequencing bases). In describing the major results and initial conclusions, we seek to distinguish 'biochemical function' from 'biological role'. Biochemical function reflects the direct behaviour of a molecule(s), whereas biological role is used to describe the consequence(s) of this function for the organism. Genome-analysis techniques nearly always focus on biochemical function but not necessarily on biological role. This is because the former is more amenable to large-scale data-generation methods, whereas the latter is more difficult to assay on a large scale.

The ENCODE pilot project aimed to establish redundancy with respect to the findings represented by different data sets. In some instances, this involved the intentional use of different assays that were based on a similar technique, whereas in other situations, different techniques assayed the same biochemical function. Such redundancy has allowed methods to be compared and consensus data sets to be generated, much of which is discussed in companion papers, such as the ChIP-chip platform comparison (refs 10, 11). All ENCODE data have been released after verification but before this publication, as befits a 'community resource' project (see http://www.wellcome.ac.uk/doc_wtd003208.html). Verification is defined as when the experiment is reproducibly confirmed (see Supplementary Information section 1.2). The main portal for ENCODE data is provided by the UCSC Genome Browser (http://genome.ucsc.edu/ENCODE/); this is

Box 1 | Frequently used abbreviations in this paper

AR Ancient repeat: a repeat that was inserted into the early mammalian lineage and has since become dormant; the majority of ancient repeats are thought to be neutrally evolving.
CAGE tag A short sequence from the 5′ end of a transcript
CDS Coding sequence: a region of a cDNA or genome that encodes proteins
ChIP-chip Chromatin immunoprecipitation followed by detection of the products using a genomic tiling array
CNV Copy number variants: regions of the genome that have large duplications in some individuals in the human population
CS Constrained sequence: a genomic region associated with evidence of negative selection (that is, rejection of mutations relative to neutral regions)
DHS DNaseI hypersensitive site: a region of the genome showing a sharply different sensitivity to DNaseI compared with its immediate locale
EST Expressed sequence tag: a short sequence of a cDNA indicative of expression at this point
FAIRE Formaldehyde-assisted isolation of regulatory elements: a method to assay open chromatin using formaldehyde crosslinking followed by detection of the products using a genomic tiling array
FDR False discovery rate: a statistical method for setting thresholds on statistical tests to correct for multiple testing
GENCODE Integrated annotation of existing cDNA and protein resources to define transcripts with both manual review and experimental testing procedures
GSC Genome structure correction: a method to adapt statistical tests to make fewer assumptions about the distribution of features on the genome sequence. This provides a conservative correction to standard tests
HMM Hidden Markov model: a machine-learning technique that can establish optimal parameters for a given model to explain the observed data
Indel An insertion or deletion; two sequences often show a length difference within alignments, but it is not always clear whether this reflects a previous insertion or a deletion
PET A short sequence that contains both the 5′ and 3′ ends of a transcript
RACE Rapid amplification of cDNA ends: a technique for amplifying cDNA sequences between a known internal position in a transcript and its 5′ end
RFBR Regulatory factor binding region: a genomic region found by a ChIP-chip assay to be bound by a protein factor
RFBR-Seqsp Regulatory factor binding regions that are from sequence-specific binding factors
RT–PCR Reverse transcriptase polymerase chain reaction: a technique for amplifying a specific region of a transcript
RxFrag Fragment of a RACE reaction: a genomic region found to be present in a RACE product by an unbiased tiling-array assay
SNP Single nucleotide polymorphism: a single base pair change between two individuals in the human population
STAGE Sequence tag analysis of genomic enrichment: a method similar to ChIP-chip for detecting protein factor binding regions but using extensive short sequence determination rather than genomic tiling arrays
SVM Support vector machine: a machine-learning technique that can establish an optimal classifier on the basis of labelled training data
TR50 A measure of replication timing corresponding to the time in the cell cycle when 50% of the cells have replicated their DNA at a specific genomic position
TSS Transcription start site
TxFrag Fragment of a transcript: a genomic region found to be present in a transcript by an unbiased tiling-array assay
Un.TxFrag A TxFrag that is not associated with any other functional annotation
UTR Untranslated region: part of a cDNA either at the 5′ or 3′ end that does not encode a protein sequence
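
The FDR entry in Box 1 describes setting thresholds on statistical tests to correct for multiple testing. As an illustration only, the sketch below uses the Benjamini-Hochberg step-up procedure, one widely used way of controlling the false discovery rate. The text above does not say which FDR procedure the ENCODE analyses used, so the choice of procedure, the function name `benjamini_hochberg` and the default `alpha` are all assumptions made for this example.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Flag which tests pass a Benjamini-Hochberg FDR threshold.

    Illustrative sketch only: the procedure and the default alpha are
    assumptions, not taken from the paper.
    Returns a list of booleans in the same order as p_values.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    # Step-up rule: every test ranked at or below max_k is called significant,
    # even if its own p-value missed its own rank threshold.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            significant[idx] = True
    return significant


if __name__ == "__main__":
    # Hypothetical p-values, for illustration only.
    example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(example, alpha=0.05))
```

With ten tests and `alpha=0.05`, the rank thresholds are 0.005, 0.010, 0.015 and so on, so only the two smallest hypothetical p-values pass. A fixed per-test cutoff of 0.05 would instead have accepted five of the ten.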