Genetics course teaching resources (research frontiers): The code of life hidden in the genome. Nature ENCODE pilot project



ARTICLES
Vol 447 | 14 June 2007 | doi:10.1038/nature05874 | ©2007 Nature Publishing Group

Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project

The ENCODE Project Consortium*
*A list of authors and their affiliations appears at the end of the paper.

We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.

The human genome is an elegant but cryptic store of information. The roughly three billion bases encode, either directly or indirectly, the instructions for synthesizing nearly all the molecules that form each human cell, tissue and organ. Sequencing the human genome [1–3] provided highly accurate DNA sequences for each of the 24 chromosomes. However, at present, we have an incomplete understanding of the protein-coding portions of the genome, and markedly less understanding of both non-protein-coding transcripts and genomic elements that temporally and spatially regulate gene expression. To understand the human genome, and by extension the biological processes it orchestrates and the ways in which its defects can give rise to disease, we need a more transparent view of the information it encodes.

The molecular mechanisms by which genomic information directs the synthesis of different biomolecules have been the focus of much of molecular biology research over the last three decades. Previous studies have typically concentrated on individual genes, with the resulting general principles then providing insights into transcription, chromatin remodelling, messenger RNA splicing, DNA replication and numerous other genomic processes. Although many such principles seem valid as additional genes are investigated, they generally have not provided genome-wide insights about biological function.

The first genome-wide analyses that shed light on human genome function made use of observing the actions of evolution. The ever-growing set of vertebrate genome sequences [4–8] is providing increasing power to reveal the genomic regions that have been most and least acted on by the forces of evolution. However, although these studies convincingly indicate the presence of numerous genomic regions under strong evolutionary constraint, they have less power in identifying the precise bases that are constrained and provide little, if any, insight into why those bases are biologically important. Furthermore, although we have good models for how protein-coding regions evolve, our present understanding about the evolution of other functional genomic regions is poorly developed. Experimental studies that augment what we learn from evolutionary analyses are key for solidifying our insights regarding genome function.

The Encyclopedia of DNA Elements (ENCODE) Project [9] aims to provide a more biologically informative representation of the human genome by using high-throughput methods to identify and catalogue the functional elements encoded. In its pilot phase, 35 groups provided more than 200 experimental and computational data sets that examined in unprecedented detail a targeted 29,998 kilobases (kb) of the human genome. These roughly 30 Mb (equivalent to ~1% of the human genome) are sufficiently large and diverse to allow for rigorous pilot testing of multiple experimental and computational methods. These 30 Mb are divided among 44 genomic regions; approximately 15 Mb reside in 14 regions for which there is already substantial biological knowledge, whereas the other 15 Mb reside in 30 regions chosen by a stratified random-sampling method (see http://www.genome.gov/10506161). The highlights of our findings to date include:

• The human genome is pervasively transcribed, such that the majority of its bases are associated with at least one primary transcript and many transcripts link distal regions to established protein-coding loci.
• Many novel non-protein-coding transcripts have been identified, with many of these overlapping protein-coding loci and others located in regions of the genome previously thought to be transcriptionally silent.
• Numerous previously unrecognized transcription start sites have been identified, many of which show chromatin structure and sequence-specific protein-binding properties similar to well-understood promoters.


• Regulatory sequences that surround transcription start sites are symmetrically distributed, with no bias towards upstream regions.
• Chromatin accessibility and histone modification patterns are highly predictive of both the presence and activity of transcription start sites.
• Distal DNaseI hypersensitive sites have characteristic histone modification patterns that reliably distinguish them from promoters; some of these distal sites show marks consistent with insulator function.
• DNA replication timing is correlated with chromatin structure.
• A total of 5% of the bases in the genome can be confidently identified as being under evolutionary constraint in mammals; for approximately 60% of these constrained bases, there is evidence of function on the basis of the results of the experimental assays performed to date.
• Although there is general overlap between genomic regions identified as functional by experimental assays and those under evolutionary constraint, not all bases within these experimentally defined regions show evidence of constraint.
• Different functional elements vary greatly in their sequence variability across the human population and in their likelihood of residing within a structurally variable region of the genome.
• Surprisingly, many functional elements are seemingly unconstrained across mammalian evolution. This suggests the possibility of a large pool of neutral elements that are biochemically active but provide no specific benefit to the organism. This pool may serve as a ‘warehouse’ for natural selection, potentially acting as the source of lineage-specific elements and functionally conserved but non-orthologous elements between species.

Below, we first provide an overview of the experimental techniques used for our studies, after which we describe the insights gained from analysing and integrating the generated data sets. We conclude with a perspective of what we have learned to date about this 1% of the human genome and what we believe the prospects are for a broader and deeper investigation of the functional elements in the human genome. To aid the reader, Box 1 provides a glossary for many of the abbreviations used throughout this paper.

Experimental techniques

Table 1 (expanded in Supplementary Information section 1.1) lists the major experimental techniques used for the studies reported here, relevant acronyms, and references reporting the generated data sets. These data sets reflect over 400 million experimental data points (603 million data points if one includes comparative sequencing bases). In describing the major results and initial conclusions, we seek to distinguish ‘biochemical function’ from ‘biological role’. Biochemical function reflects the direct behaviour of a molecule(s), whereas biological role is used to describe the consequence(s) of this function for the organism. Genome-analysis techniques nearly always focus on biochemical function but not necessarily on biological role. This is because the former is more amenable to large-scale data-generation methods, whereas the latter is more difficult to assay on a large scale.

The ENCODE pilot project aimed to establish redundancy with respect to the findings represented by different data sets. In some instances, this involved the intentional use of different assays that were based on a similar technique, whereas in other situations, different techniques assayed the same biochemical function. Such redundancy has allowed methods to be compared and consensus data sets to be generated, much of which is discussed in companion papers, such as the ChIP-chip platform comparison [10,11]. All ENCODE data have been released after verification but before this publication, as befits a ‘community resource’ project (see http://www.wellcome.ac.uk/doc_wtd003208.html). Verification is defined as when the experiment is reproducibly confirmed (see Supplementary Information section 1.2). The main portal for ENCODE data is provided by the UCSC Genome Browser (http://genome.ucsc.edu/ENCODE/); this is

Box 1 | Frequently used abbreviations in this paper

AR. Ancient repeat: a repeat that was inserted into the early mammalian lineage and has since become dormant; the majority of ancient repeats are thought to be neutrally evolving.
CAGE tag. A short sequence from the 5′ end of a transcript.
CDS. Coding sequence: a region of a cDNA or genome that encodes proteins.
ChIP-chip. Chromatin immunoprecipitation followed by detection of the products using a genomic tiling array.
CNV. Copy number variants: regions of the genome that have large duplications in some individuals in the human population.
CS. Constrained sequence: a genomic region associated with evidence of negative selection (that is, rejection of mutations relative to neutral regions).
DHS. DNaseI hypersensitive site: a region of the genome showing a sharply different sensitivity to DNaseI compared with its immediate locale.
EST. Expressed sequence tag: a short sequence of a cDNA indicative of expression at this point.
FAIRE. Formaldehyde-assisted isolation of regulatory elements: a method to assay open chromatin using formaldehyde crosslinking followed by detection of the products using a genomic tiling array.
FDR. False discovery rate: a statistical method for setting thresholds on statistical tests to correct for multiple testing.
GENCODE. Integrated annotation of existing cDNA and protein resources to define transcripts with both manual review and experimental testing procedures.
GSC. Genome structure correction: a method to adapt statistical tests to make fewer assumptions about the distribution of features on the genome sequence. This provides a conservative correction to standard tests.
HMM. Hidden Markov model: a machine-learning technique that can establish optimal parameters for a given model to explain the observed data.
Indel. An insertion or deletion; two sequences often show a length difference within alignments, but it is not always clear whether this reflects a previous insertion or a deletion.
PET. A short sequence that contains both the 5′ and 3′ ends of a transcript.
RACE. Rapid amplification of cDNA ends: a technique for amplifying cDNA sequences between a known internal position in a transcript and its 5′ end.
RFBR. Regulatory factor binding region: a genomic region found by a ChIP-chip assay to be bound by a protein factor.
RFBR-Seqsp. Regulatory factor binding regions that are from sequence-specific binding factors.
RT–PCR. Reverse transcriptase polymerase chain reaction: a technique for amplifying a specific region of a transcript.
RxFrag. Fragment of a RACE reaction: a genomic region found to be present in a RACE product by an unbiased tiling-array assay.
SNP. Single nucleotide polymorphism: a single base pair change between two individuals in the human population.
STAGE. Sequence tag analysis of genomic enrichment: a method similar to ChIP-chip for detecting protein factor binding regions but using extensive short sequence determination rather than genomic tiling arrays.
SVM. Support vector machine: a machine-learning technique that can establish an optimal classifier on the basis of labelled training data.
TR50. A measure of replication timing corresponding to the time in the cell cycle when 50% of the cells have replicated their DNA at a specific genomic position.
TSS. Transcription start site.
TxFrag. Fragment of a transcript: a genomic region found to be present in a transcript by an unbiased tiling-array assay.
Un.TxFrag. A TxFrag that is not associated with any other functional annotation.
UTR. Untranslated region: part of a cDNA either at the 5′ or 3′ end that does not encode a protein sequence.
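As an illustration of the TR50 entry in Box 1, the sketch below estimates the time at which half of the cells have replicated a given genomic position. It is a minimal sketch under stated assumptions: the function name `estimate_tr50`, the sampling times and the replicated fractions are all hypothetical, and linear interpolation between time points is chosen only for illustration; the text does not describe how TR50 was actually computed.

```python
from typing import Optional, Sequence


def estimate_tr50(times: Sequence[float],
                  fractions: Sequence[float]) -> Optional[float]:
    """Return the time at which the replicated fraction first reaches 0.5.

    times      -- sampling times (hypothetical units, e.g. hours into S phase)
    fractions  -- fraction of cells that have replicated the position at each time
    Linear interpolation between neighbouring samples is an assumption of this sketch.
    """
    if len(times) != len(fractions):
        raise ValueError("times and fractions must have the same length")
    if fractions and fractions[0] >= 0.5:
        return times[0]  # already half replicated at the first sample
    points = list(zip(times, fractions))
    for (t0, f0), (t1, f1) in zip(points, points[1:]):
        if f0 < 0.5 <= f1:  # the 50% level is crossed inside this interval
            return t0 + (0.5 - f0) * (t1 - t0) / (f1 - f0)
    return None  # the position never reaches 50% in the sampled window


# Hypothetical time course for one genomic position.
example_times = [0.0, 2.0, 4.0, 6.0, 8.0]
example_fractions = [0.0, 0.1, 0.4, 0.8, 1.0]
print(f"TR50 estimate: {estimate_tr50(example_times, example_fractions):.2f}")
```

A smaller value would indicate a position that replicates earlier in the cell cycle; positions that never reach 50% in the sampled window return `None` rather than an extrapolated value.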

Footnotes to Table 1
* Not all da[…] ENCODE Project.
† Histone code nomenclature follows the Brno nomenclature as described in ref. 129.
‡ Also contains histone modification.

Table 2 | Bases detected in processed transcripts either as a GENCODE exon, a TxFrag, or as either a GENCODE exon or a TxFrag

Row | GENCODE exon | TxFrag | Either GENCODE exon or TxFrag
Total detectable transcripts (bases) | 1,776,157 (5.9%) | 1,369,611 (4.6%) | 2,519,280 (8.4%)
Transcripts detected in tiled regions of arrays (bases) | 1,447,192 (9.8%) | 1,369,611 (9.3%) | 2,163,303 (14.7%)

Percentages are of total bases in ENCODE in the first row and bases tiled in arrays in the second.
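The percentages in Table 2 can be cross-checked from the base counts. The sketch below divides the first-row counts by the 29,998 kb targeted in the pilot phase, which gives 5.9%, 4.6% and 8.4% to one decimal place. The number of bases tiled on the arrays is not stated in this section, so for the second row it is inferred here from the reported 14.7%; that inferred denominator, and all variable names, are assumptions of this sketch.

```python
# Total bases targeted by the pilot project: 29,998 kb (from the text).
ENCODE_BASES = 29_998 * 1_000

# Base counts as listed in Table 2.
total_detectable = {
    "GENCODE exon": 1_776_157,
    "TxFrag": 1_369_611,
    "Either": 2_519_280,
}
detected_on_arrays = {
    "GENCODE exon": 1_447_192,
    "TxFrag": 1_369_611,
    "Either": 2_163_303,
}


def percent(part: float, whole: float) -> float:
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


# First row: percentages of all ENCODE bases.
row1 = {k: round(percent(v, ENCODE_BASES), 1) for k, v in total_detectable.items()}

# Second row: the tiled-base count is an inferred value (assumption),
# back-calculated from the reported 14.7% for the "Either" column.
implied_tiled_bases = detected_on_arrays["Either"] / 0.147
row2 = {k: round(percent(v, implied_tiled_bases), 1)
        for k, v in detected_on_arrays.items()}

print("Row 1 (% of ENCODE bases):", row1)
print("Row 2 (% of tiled bases): ", row2)
print("Implied tiled bases (approx.):", round(implied_tiled_bases))
```

Because the 14.7% figure is itself rounded, the implied tiled-base count is only approximate and should not be read as a reported value.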

augmented by multiple other websites (see Supplementary Information section 1.1). A common feature of genomic analyses is the need to assess the significance of the co-occurrence of features or of other statistical tests. One confounding factor is the heterogeneity of the genome, which can produce uninteresting correlations of variables distributed across the genome. We have developed and used a statistical framework that mitigates many of these hidden correlations by adjusting the appropriate null distribution of the test statistics. We term this correction procedure genome structure correction (GSC) (see Supplementary Information section 1.3). In the next five sections, we detail the various biological insights of the pilot phase of the ENCODE Project. Transcription Overview. RNA transcripts are involved in many cellular functions, either directly as biologically active molecules or indirectly by encoding other active molecules. In the conventional view of genome organization, sets of RNA transcripts (for example, messenger RNAs) are encoded by distinct loci, with each usually dedicated to a single biological role (for example, encoding a specific protein). However, this picture has substantially grown in complexity in recent years12. Other forms of RNA molecules (such as small nucleolar RNAs and micro (mi)RNAs) are known to exist, and often these are encoded by regions that intercalate with protein-coding genes. These observations are consistent with the well-known discrepancy between the levels of observable mRNAs and large structural RNAs compared with the total RNA in a cell, suggesting that there are numerous RNA species yet to be classified13–15. In addition, studies of specific loci have indicated the presence of RNA transcripts that have a role in chromatin maintenance and other regulatory control. We sought to assay and analyse transcription comprehensively across the 44 ENCODE regions in an effort to understand the repertoire of encoded RNA molecules. 
Transcript maps. We used three methods to identify transcripts emanating from the ENCODE regions: hybridization of RNA (either total or polyA-selected) to unbiased tiling arrays (see Supplementary Information section 2.1), tag sequencing of cap-selected RNA at the 5′ or joint 5′/3′ ends (see Supplementary Information sections 2.2 and S2.3), and integrated annotation of available complementary DNA and EST sequences involving computational, manual, and experimental approaches (ref. 16; see Supplementary Information section 2.4). We abbreviate the regions identified by unbiased tiling arrays as TxFrags, the cap-selected RNAs as CAGE or PET tags (see Box 1), and the integrated annotation as GENCODE transcripts. When a TxFrag does not overlap a GENCODE annotation, we call it an Un.TxFrag. Validation of these various studies is described in papers reporting these data sets (ref. 17; see Supplementary Information sections 2.1.4 and 2.1.5). These methods recapitulate previous findings, but provide enhanced resolution owing to the larger number of tissues sampled and the integration of results across the three approaches (see Table 2). To begin with, our studies show that 14.7% of the bases represented in the unbiased tiling arrays are transcribed in at least one tissue sample. Consistent with previous work (refs 14, 15), many (63%) TxFrags reside outside of GENCODE annotations, both in intronic (40.9%) and intergenic (22.6%) regions. GENCODE annotations are richer than the more-conservative RefSeq or Ensembl annotations, with 2,608 transcripts clustered into 487 loci, leading to an average of 5.4 transcripts per locus. Finally, extensive testing of predicted protein-coding sequences outside of GENCODE annotations was positive in only 2% of cases (ref. 16), suggesting that GENCODE annotations cover nearly all protein-coding sequences.
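
The "either GENCODE exon or TxFrag" percentages reported in Table 2 are fractions of bases covered by the union of two sets of genomic intervals. The following is a minimal sketch of that kind of calculation, not the consortium's pipeline: the function names, the half-open coordinate convention and the toy coordinates are all assumptions made for illustration. Only the last line uses numbers from the text (2,608 transcripts in 487 loci).

```python
from typing import Iterable, List, Tuple

# Assumption for this sketch: intervals are half-open [start, end) base coordinates.
Interval = Tuple[int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            last_start, last_end = merged[-1]
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(intervals: Iterable[Interval]) -> int:
    """Number of distinct bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


def percent_covered(intervals: Iterable[Interval], total_bases: int) -> float:
    """Covered bases as a percentage of a region of total_bases bases."""
    return 100.0 * covered_bases(intervals) / total_bases


# Hypothetical toy data, not ENCODE coordinates.
toy_exons = [(0, 100), (150, 250)]
toy_txfrags = [(80, 160), (400, 450)]
toy_region_length = 1000

exon_pct = percent_covered(toy_exons, toy_region_length)
txfrag_pct = percent_covered(toy_txfrags, toy_region_length)
either_pct = percent_covered(toy_exons + toy_txfrags, toy_region_length)

# Average transcripts per locus, using the counts quoted in the text.
transcripts_per_locus = round(2608 / 487, 1)
```

Because exons and TxFrags overlap, the "either" figure is smaller than the sum of the two separate figures, which is the pattern seen in both rows of Table 2.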
Table 1 | Summary of types of experimental techniques used in ENCODE

Feature class | Experimental technique(s) | Abbreviations | References | Number of experimental data points
Transcription | Tiling array, integrated annotation | TxFrag, RxFrag, GENCODE | 117, 118, 19, 119 | 63,348,656
5′ ends of transcripts* | Tag sequencing | PET, CAGE | 121, 13 | 864,964
Histone modifications | Tiling array | Histone nomenclature†, RFBR | 46 | 4,401,291
Chromatin‡ structure | QT-PCR, tiling array | DHS, FAIRE | 42, 43, 44, 122 | 15,318,324
Sequence-specific factors | Tiling array, tag sequencing, promoter assays | STAGE, ChIP-Chip, ChIP-PET, RFBR | 41, 52, 11, 120, 123, 81, 34, 51, 124, 49, 33, 40 | 324,846,018
Replication | Tiling array | TR50 | 59, 75 | 14,735,740
Computational analysis | Computational methods | CCI, RFBR cluster | 80, 125, 10, 16, 126, 127 | NA
Comparative sequence analysis* | Genomic sequencing, multisequence alignments, computational analyses | CS | 87, 86, 26 | NA
Polymorphisms* | Resequencing, copy number variation | CNV | 103, 128 | NA

* Not all data generated by the ENCODE Project.
† Histone code nomenclature follows the Brno nomenclature as described in ref. 129.
‡ Also contains histone modification.

Table 2 | Bases detected in processed transcripts either as a GENCODE exon, a TxFrag, or as either a GENCODE exon or a TxFrag

 | GENCODE exon | TxFrag | Either GENCODE exon or TxFrag
Total detectable transcripts (bases) | 1,776,157 (5.9%) | 1,369,611 (4.6%) | 2,519,280 (8.4%)
Transcripts detected in tiled regions of arrays (bases) | 1,447,192 (9.8%) | 1,369,611 (9.3%) | 2,163,303 (14.7%)

Percentages are of total bases in ENCODE in the first row and bases tiled in arrays in the second row.

The GENCODE annotations are categorized both by likely function (mainly, the presence of an open reading frame) and by classification evidence (for example, transcripts based solely on ESTs are distinguished from other scenarios); this classification is not strongly correlated with expression levels (see Supplementary Information sections 2.4.2 and 2.4.3). Analyses of more biological samples have allowed a richer description of the transcription specificity (see Fig. 1 and Supplementary Information section 2.5). We found that 40% of TxFrags are present in only one sample, whereas only 2% are present in all samples. Although exon-containing TxFrags are more likely (74%) to be expressed in more than one sample, 45% of unannotated TxFrags are also expressed in multiple samples. GENCODE annotations of separate loci often (42%) overlap with respect to their genomic coordinates, in particular on opposite strands (33% of loci). Further analysis of GENCODE-annotated sequences with respect to the positions of open reading frames revealed that some component exons do not have the expected synonymous versus non-synonymous substitution patterns of protein-coding sequence (see Supplementary Information section 2.6) and some have deletions incompatible with
protein structure (ref. 18). Such exons are on average less expressed (25% versus 87% by RT–PCR; see Supplementary Information section 2.7) than exons involved in more than one transcript (see Supplementary Information section 2.4.3), but when expressed have a tissue distribution comparable to well-established genes. Critical questions are raised by the presence of a large amount of unannotated transcription with respect to how the corresponding sequences are organized in the genome: do these reflect longer transcripts that include known loci, do they link known loci, or are they completely separate from known loci? We further investigated these issues using both computational and new experimental techniques.

Unannotated transcription. Consistent with previous findings, the Un.TxFrags did not show evidence of encoding proteins (see Supplementary Information section 2.8). One might expect Un.TxFrags to be linked within transcripts that exhibit coordinated expression and have similar conservation profiles across species. To test this, we clustered Un.TxFrags using two methods. The first method (ref. 19) used expression levels in 11 cell lines or conditions, dinucleotide composition, location relative to annotated genes, and evolutionary conservation profiles to cluster TxFrags (both unannotated and annotated). By this method, 14% of Un.TxFrags could be assigned to annotated loci, and 21% could be clustered into 200 novel loci (with an average of ~7 TxFrags per locus). We experimentally examined these novel loci to study the connectivity of transcripts amongst Un.TxFrags and between Un.TxFrags and known exons. Overall, about 40% of the connections (18 out of 46) were validated by RT–PCR. The second clustering method involved analysing a time course (0, 2, 8 and 32 h) of expression changes in human HL60 cells following retinoic-acid stimulation.
There is a coordinated program of expression changes from annotated loci, which can be shown by plotting Pearson correlation values of the expression levels of exons inside annotated loci versus unrelated exons (see Supplementary Information section 2.8.2). Similarly, there is coordinated expression of nearby Un.TxFrags, albeit lower, though still significantly different from randomized sets. Both clustering methods indicate that there is coordinated behaviour of many Un.TxFrags, consistent with them residing in connected transcripts.

Transcript connectivity. We used a combination of RACE and tiling arrays (ref. 20) to investigate the diversity of transcripts emanating from protein-coding loci. Analogous to TxFrags, we refer to transcripts detected using RACE followed by hybridization to tiling arrays as RxFrags. We performed RACE to examine 399 protein-coding loci (those loci found entirely in ENCODE regions) using RNA derived from 12 tissues, and were able to unambiguously detect 4,573 RxFrags for 359 loci (see Supplementary Information section 2.9). Almost half of these RxFrags (2,324) do not overlap a GENCODE exon, and most (90%) loci have at least one novel RxFrag, which often extends a considerable distance beyond the 5′ end of the locus. Figure 2 shows the distribution of distances between these new RACE-detected ends and the previously annotated TSS of each locus. The average distance of the extensions is between 50 kb and 100 kb, with many extensions (>20%) being more than 200 kb. Consistent with the known presence of overlapping genes in the human genome, our findings reveal evidence for an overlapping gene at 224 loci, with transcripts from 180 of these loci (~50% of the RACE-positive loci) appearing to have incorporated at least one exon from an upstream gene. To characterize further the 5′ RxFrag extensions, we performed RT–PCR followed by cloning and sequencing for 550 of the 5′ RxFrags (including the 261 longest extensions identified for each locus).

Figure 1 | Annotated and unannotated TxFrags detected in different cell lines. The proportion of different types of transcripts detected in the indicated number of cell lines (from 1/11 at the far left to 11/11 at the far right) is shown. The data for annotated and unannotated TxFrags are indicated separately, and also split into different categories based on GENCODE classification: exonic, intergenic (proximal being within 5 kb of a gene and distal being otherwise), intronic (proximal being within 5 kb of an intron and distal being otherwise), and matching other ESTs not used in the GENCODE annotation (principally because they were unspliced). The y axis indicates the per cent of tiling array nucleotides present in that class for that number of samples (combination of cell lines and tissues).
[Plot: x axis, number of cell lines (1/11 to 11/11); y axis, tiling array nucleotides (%); annotated and novel transcripts shown separately; categories: intronic proximal, intronic distal, intergenic proximal, intergenic distal, other ESTs, GENCODE exonic.]

Figure 2 | Length of genomic extensions to GENCODE-annotated genes on the basis of RACE experiments followed by array hybridizations (RxFrags). The indicated bars reflect the frequency of extension lengths among different length classes. The solid line shows the cumulative frequency of extensions of that length or greater. Most of the extensions are greater than 50 kb from the annotated gene (see text for details).
[Plot: x axis, extension length (kb), in classes <0.5, 0.5–1, 1–5, 5–10, 10–25, 25–50, 50–100, 100–200, 200–300, 300–400, 400–500 and ≥500; y axes, per cent of RxFrag extensions (shaded boxes) and cumulative per cent of extensions this length or greater (line).]

The approach of mapping RACE products using microarrays is a combination method previously described and validated in several studies (refs 14, 17, 20). Hybridization of the RT–PCR products to tiling arrays confirmed connectivity in almost 60% of the cases. Sequenced clones confirmed transcript extensions. Longer extensions were harder to clone and sequence, but 5 out of 18 RT–PCR-positive extensions over 100 kb were verified by sequencing (see Supplementary Information section 2.9.7 and ref. 17). The detection of numerous RxFrag extensions coupled with evidence of considerable intronic transcription indicates that protein-coding loci are more transcriptionally complex than previously thought. Instead of the traditional view that many genes have one or more alternative transcripts that code for alternative proteins, our data suggest that a given gene may both encode multiple protein products and produce other transcripts that include sequences from both strands and from neighbouring loci (often without encoding a different protein). Figure 3 illustrates such a case, in which a new fusion transcript is expressed in the small intestine, and consists of at least three coding exons from the ATP5O gene and at least two coding exons from the DONSON
gene, with no evidence of sequences from two intervening protein-coding genes (ITSN1 and CRYZL1).

Pseudogenes. Pseudogenes, reviewed in refs 21 and 22, are generally considered non-functional copies of genes, are sometimes transcribed, and often complicate analysis of transcription owing to close sequence similarity to functional genes. We used various computational methods to identify 201 pseudogenes (124 processed and 77 non-processed) in the ENCODE regions (see Supplementary Information section 2.10 and ref. 23). Tiling-array analysis of 189 of these revealed that 56% overlapped at least one TxFrag. However, possible cross-hybridization between the pseudogenes and their corresponding parent genes may have confounded such analyses. To assess better the extent of pseudogene transcription, 160 pseudogenes (111 processed and 49 non-processed) were examined for expression using RACE/tiling-array analysis (see Supplementary Information section 2.9.2). Transcripts were detected for 14 pseudogenes (8 processed and 6 non-processed) in at least one of the 12 tested RNA sources, the majority (9) being in testis (see ref. 23). Additionally, there was evidence for the transcription of 25 pseudogenes on the basis of their proximity (within 100 bp of a pseudogene end) to CAGE tags (8), PETs (2), or cDNAs/ESTs (21). Overall, we estimate that at least 19% of the pseudogenes in the ENCODE regions are transcribed, which is consistent with previous estimates (refs 24, 25).

Non-protein-coding RNA. Non-protein-coding RNAs (ncRNAs) include structural RNAs (for example, transfer RNAs, ribosomal RNAs, and small nuclear RNAs) and more recently discovered regulatory RNAs (for example, miRNAs). There are only 8 well-characterized ncRNA genes within the ENCODE regions (U70, ACA36, ACA56, mir-192, mir-194-2, mir-196, mir-483 and H19), whereas representatives of other classes (for example, box C/D snoRNAs, tRNAs, and functional snRNAs) seem to be completely absent in the ENCODE regions.
Tiling-array data provided evidence for transcription in at least one of the assayed RNA samples for all of these ncRNAs, with the exception of mir-483 (expression of mir-483 might be specific to fetal liver, which was not tested). There is also evidence for the transcription of 6 out of 8 pseudogenes of ncRNAs (mainly snoRNA-derived). Similar to the analysis of protein pseudogenes, the hybridization results could also originate from the known snoRNA gene elsewhere in the genome. Many known ncRNAs are characterized by a well-defined RNA secondary structure. We applied two de novo ncRNA prediction algorithms, EvoFold and RNAz, to predict structured ncRNAs (as well as functional structures in mRNAs) using the multi-species sequence alignments (see below, Supplementary Information section 2.11 and ref. 26). Using a sensitivity threshold capable of detecting all known miRNAs and snoRNAs, we identified 4,986 and 3,707 candidate ncRNA loci with EvoFold and RNAz, respectively. Only 268 loci (5% and 7%, respectively) were found with both programs, representing a 1.6-fold enrichment over that expected by chance; the lack of more extensive overlap is due to the two programs having optimal sensitivity at different levels of GC content and conservation. We experimentally examined 50 of these targets using RACE/tiling-array analysis for brain and testis tissues (see Supplementary Information sections 2.11 and 2.9.3); the predictions were validated at a 56%, 65%, and 63% rate for EvoFold, RNAz and dual predictions, respectively.

Primary transcripts. The detection of numerous unannotated transcripts coupled with increasing knowledge of the general complexity of transcription prompted us to examine the extent of primary (that is, unspliced) transcripts across the ENCODE regions. Three data sources provide insight about these primary transcripts: the GENCODE annotation, PETs, and RxFrag extensions.
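
One way to combine three such data sources is to count, for every base, how many technologies and how many separate observations support it. The sketch below illustrates that bookkeeping under stated assumptions and is not the consortium's method: the function name, the category labels, the half-open coordinates and the toy intervals are all hypothetical, and each interval is treated as one observation of a primary transcript spanning it.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Assumption for this sketch: half-open [start, end) coordinates within one region.
Interval = Tuple[int, int]


def classify_bases(region_length: int,
                   observations: Dict[str, List[Interval]]) -> Counter:
    """Tally bases by how many technologies and observations cover them.

    observations maps a technology name to the intervals it reported.
    The labels assume exactly three technologies are supplied.
    """
    # Per technology, count how many observations cover each base.
    per_tech = {tech: [0] * region_length for tech in observations}
    for tech, intervals in observations.items():
        counts = per_tech[tech]
        for start, end in intervals:
            for pos in range(max(start, 0), min(end, region_length)):
                counts[pos] += 1

    tally: Counter = Counter()
    for pos in range(region_length):
        support = [c[pos] for c in per_tech.values() if c[pos] > 0]
        if not support:
            label = "no coverage"
        elif len(support) == 1:
            label = ("one technology, one observation" if support[0] == 1
                     else "one technology, multiple observations")
        elif len(support) == 2:
            label = "two technologies"
        else:
            label = "all three technologies"
        tally[label] += 1
    return tally


# Hypothetical toy data, not ENCODE coordinates.
toy_observations = {
    "GENCODE": [(0, 4)],
    "PET": [(2, 6), (2, 5)],
    "RxFrag": [(3, 4), (8, 9)],
}
toy_tally = classify_bases(10, toy_observations)
```

Dividing each tally by the region length gives the proportion of bases in each support category; the per-base loop is kept simple for clarity and would need a sweep-line or array-based approach for regions of megabase size.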

Figure 3 | Overview of RACE experiments showing a gene fusion. Transcripts emanating from the region between the DONSON and ATP5O genes. A 330-kb interval of human chromosome 21 (within ENm005) is shown, which contains four annotated genes: DONSON, CRYZL1, ITSN1 and ATP5O. The 5′ RACE products generated from small intestine RNA and detected by tiling-array analyses (RxFrags) are shown along the top. Along the bottom is shown the placement of a cloned and sequenced RT–PCR product that has two exons from the DONSON gene followed by three exons from the ATP5O gene; these sequences are separated by a 300 kb intron in the genome. A PET tag shows the termini of a transcript consistent with this RT–PCR product.
[Diagram: chromosome 21 coordinates 33,900,000 to 34,200,000; tracks for RxFrags, GENCODE reference genes (DONSON, CRYZL1, ITSN1, ATP5O), PETs and the cloned RT–PCR product.]

Figure 4 | Coverage of primary transcripts across ENCODE regions. Three different technologies (integrated annotation from GENCODE, RACE-array experiments (RxFrags) and PET tags) were used to assess the presence of a nucleotide in a primary transcript. Use of these technologies provided the opportunity to have multiple observations of each finding. The proportion of genomic bases detected in the ENCODE regions associated with each of the following scenarios is depicted: detected by all three technologies, by two of the three technologies, by one technology but with multiple observations, and by one technology with only one observation. Also indicated are genomic bases without any detectable coverage of primary transcripts.
[Chart categories: no coverage; one technology, one observation; one technology, two observations; two technologies; all three technologies.]

Figure 4 summarizes the fraction of bases in the ENCODE regions that overlap transcripts identified by these technologies. Remarkably, 93% of bases are represented in a primary transcript identified by at least two independent observations (but potentially using the same technology); this figure is reduced to 74% in the case of primary transcripts detected by at least two different technologies. These increased spans are not mainly due to cell line rearrangements because they were present in multiple tissue experiments that confirmed the spans (see Supplementary Information section 2.12). These estimates assume that the presence of PETs or RxFrags defining the terminal ends of a transcript implies that the entire intervening DNA is transcribed and then processed. Other mechanisms, thought to be unlikely in the human genome, such as trans-splicing or polymerase jumping, would also produce these long termini and potentially should be reconsidered in more detail. Previous studies have suggested a similar broad amount of transcription across the human (refs 14, 15) and mouse (ref. 27) genomes. Our studies confirm these results, and have investigated the genesis of these transcripts in greater detail, confirming the presence of substantial intragenic and intergenic transcription. At the same time, many of the resulting transcripts are neither traditional protein-coding

transcripts nor easily explained as structural non-coding RNAs. Other studies have noted complex transcription around specific loci or chimaeric-gene structures (for example refs 28–30), but these have often been considered exceptions; our data show that complex intercalated transcription is common at many loci. The results presented in the next section show extensive amounts of regulatory factors around novel TSSs, which is consistent with this extensive transcription. The biological relevance of these unannotated transcripts remains unanswered by these studies. Evolutionary information (detailed below) is mixed in this regard; for example, it indicates that unannotated transcripts show weaker evolutionary conservation than many other annotated features. As with other ENCODE-detected elements, it is difficult to identify clear biological roles for the majority of these transcripts; such experiments are challenging to perform on a large scale and, furthermore, it seems likely that many of the corresponding biochemical events may be evolutionarily neutral (see below).

Regulation of transcription

Overview. A significant challenge in biology is to identify the transcriptional regulatory elements that control the expression of each transcript and to understand how the function of these elements is coordinated to execute complex cellular processes. A simple, commonplace view of transcriptional regulation involves five types of cis-acting regulatory sequences: promoters, enhancers, silencers, insulators and locus control regions (ref. 31). Overall, transcriptional regulation involves the interplay of multiple components, whereby the availability of specific transcription factors and the accessibility of specific genomic regions determine whether a transcript is generated (ref. 31). However, the current view of transcriptional regulation is known to be overly simplified, with many details remaining to be established.
For example, the consensus sequences of transcription factor binding sites (typically 6 to 10 bases) have relatively little information content and are present numerous times in the genome, with the great majority of these not participating in transcriptional regulation. Does chromatin structure then determine whether such a sequence has a regulatory role? Are there complex inter-factor interactions that integrate the signals from multiple sites? How are signals from different distal regulatory elements coupled without affecting all neighbouring genes? Meanwhile, our understanding of the repertoire of transcriptional events is becoming more complex, with an increasing appreciation of alternative TSSs (refs 32, 33) and the presence of non-coding (refs 27, 34) and anti-sense transcripts (refs 35, 36).

To better understand transcriptional regulation, we sought to begin cataloguing the regulatory elements residing within the 44 ENCODE regions. For this pilot project, we mainly focused on the binding of regulatory proteins and chromatin structure involved in transcriptional regulation. We analysed over 150 data sets, mainly from ChIP-chip (refs 37–39), ChIP-PET and STAGE (refs 40, 41) studies (see Supplementary Information sections 3.1 and 3.2). These methods use chromatin immunoprecipitation with specific antibodies to enrich for DNA in physical contact with the targeted epitope. This enriched DNA can then be analysed using either microarrays (ChIP-chip) or high-throughput sequencing (ChIP-PET and STAGE). The assays included 18 sequence-specific transcription factors and components of the general transcription machinery (for example, RNA polymerase II (Pol II), TAF1 and TFIIB/GTF2B). In addition, we tested more than 600 potential promoter fragments for transcriptional activity by transient-transfection reporter assays that used 16 human cell lines (ref. 33).
We also examined chromatin structure by studying the ENCODE regions for DNaseI sensitivity (by quantitative PCR (ref. 42) and tiling arrays (refs 43, 44); see Supplementary Information section 3.3), histone composition (ref. 45), histone modifications (using ChIP-chip assays; refs 37, 46), and histone displacement (using FAIRE, see Supplementary Information section 3.4). Below, we detail these analyses, starting with the efforts to define and classify the 5′ ends of transcripts with respect to their associated regulatory signals. Following that are summaries of generated data about sequence-specific transcription factor binding and clusters of regulatory elements. Finally, we describe how this information can be integrated to make predictions about transcriptional regulation.

Transcription start site catalogue. We analysed two data sets to catalogue TSSs in the ENCODE regions: the 5′ ends of GENCODE-annotated transcripts and the combined results of two 5′-end-capture technologies, CAGE and PET-tagging. The initial results suggested the potential presence of 16,051 unique TSSs. However, in many cases, multiple TSSs resided within a single small segment (up to ~200 bases); this was due to some promoters containing TSSs with many very close precise initiation sites (ref. 47). To normalize for this effect, we grouped TSSs that were 60 or fewer bases apart into a single cluster, and in each case considered the most frequent CAGE or PET tag (or the 5′-most TSS in the case of TSSs identified only from GENCODE data) as representative of that cluster for downstream analyses.

The above effort yielded 7,157 TSS clusters in the ENCODE regions. We classified these TSSs into three categories: known (present at the end of GENCODE-defined transcripts), novel (supported by other evidence) and unsupported.
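The clustering step described above can be illustrated with a small sketch. It is not the consortium's pipeline: the function name `cluster_tss`, the single-linkage chaining of neighbouring sites, and the tie-break towards the smaller coordinate are all assumptions made for illustration, and strand handling is left out.

```python
from collections import Counter


def cluster_tss(positions, max_gap=60):
    """Group 5' tag positions from one strand of one chromosome into clusters.

    positions: iterable of int, one entry per observed tag (repeats allowed).
    Neighbouring distinct sites are chained into the same cluster when they
    are max_gap or fewer bases apart (chaining is an assumption).
    Returns a list of (start, end, representative, tag_count) tuples.
    """
    counts = Counter(positions)
    clusters = []
    current = []
    for pos in sorted(counts):
        # Start a new cluster when the gap to the previous site is too large.
        if current and pos - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)

    result = []
    for members in clusters:
        # Representative: the most frequently observed site in the cluster;
        # ties go to the smaller coordinate (an assumption).
        rep = max(members, key=lambda p: (counts[p], -p))
        total = sum(counts[p] for p in members)
        result.append((members[0], members[-1], rep, total))
    return result


tags = [100, 100, 130, 185, 400, 401, 401, 1000]
for cluster in cluster_tss(tags):
    print(cluster)
```

With these hypothetical tags, the last cluster contains a single tag, which corresponds to what Table 3 calls a singleton cluster.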
The novel TSSs were further subdivided on the basis of the nature of the supporting evidence (see Table 3 and Supplementary Information section 3.5), with all four of the resulting subtypes showing significant overlap with experimental evidence using the GSC statistic. Although there is a larger relative proportion of singleton tags in the novel category, when analysis is restricted to only singleton tags, the novel TSSs continue to have highly significant overlap with supporting evidence (see Supplementary Information section 3.5.1).

Correlating genomic features with chromatin structure and transcription factor binding. By measuring relative sensitivity to DNaseI digestion (see Supplementary Information section 3.3), we identified DNaseI hypersensitive sites throughout the ENCODE regions. DHSs and TSSs both reflect genomic regions thought to be enriched for regulatory information and many DHSs reside at or near TSSs. We partitioned DHSs into those within 2.5 kb of a TSS (958; 46.5%) and the remaining ones, which were classified as distal (1,102; 53.5%). We then cross-analysed the TSSs and DHSs with data sets relating to histone modifications, chromatin accessibility and sequence-specific transcription factor binding by summarizing these signals in aggregate relative to the distance from TSSs or DHSs. Figure 5 shows representative profiles of specific histone modifications, Pol II and selected transcription factor binding for the different categories of TSSs. Further profiles and statistical analysis of these studies can be found in Supplementary Information section 3.6.

In the case of the three TSS categories (known, novel and unsupported), known and novel TSSs are both associated with similar signals for multiple factors (ranging from histone modifications through DNaseI accessibility), whereas unsupported TSSs are not.
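The split of DHSs into TSS-proximal and distal sets can be sketched as a nearest-neighbour lookup. This is a minimal illustration under stated assumptions: the function name `split_dhs_by_tss_distance` is hypothetical, each site is reduced to a single coordinate on one chromosome, and "within 2.5 kb" is treated as inclusive.

```python
from bisect import bisect_left


def split_dhs_by_tss_distance(dhs_positions, tss_positions, window=2500):
    """Partition DHS coordinates into (proximal, distal) lists.

    A DHS is proximal when its nearest TSS lies within `window` bases
    (inclusive, an assumption); otherwise it is distal.
    """
    tss = sorted(tss_positions)
    proximal, distal = [], []
    for dhs in dhs_positions:
        i = bisect_left(tss, dhs)
        # Only the TSSs immediately left and right of the DHS can be nearest.
        neighbours = tss[max(i - 1, 0):i + 1]
        nearest = min((abs(dhs - t) for t in neighbours), default=None)
        if nearest is not None and nearest <= window:
            proximal.append(dhs)
        else:
            distal.append(dhs)
    return proximal, distal


# Hypothetical coordinates, for illustration only.
prox, dist = split_dhs_by_tss_distance(
    [9000, 12500, 12501, 30000, 52000], [10000, 50000]
)
print(prox, dist)

# The percentages quoted in the text follow from the two counts.
total = 958 + 1102
print(round(100 * 958 / total, 1), round(100 * 1102 / total, 1))
```

The last line reproduces the 46.5% and 53.5% split reported for the 2,060 DHSs.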
Table 3 | Different categories of TSSs defined on the basis of support from different transcript-survey methods

Category      Transcript survey method        Number of TSS clusters (non-redundant)*   P value†      Singleton clusters‡ (%)
Known         GENCODE 5′ ends                 1,730                                     2 × 10^-70    25 (74 overall)
Novel         GENCODE sense exons             1,437                                     6 × 10^-39    64
Novel         GENCODE antisense exons         521                                       3 × 10^-8     65
Novel         Unbiased transcription survey   639                                       7 × 10^-63    71
Novel         CpG island                      164                                       4 × 10^-90    60
Unsupported   None                            2,666                                     -             83.4

* Number of TSS clusters with this support, excluding TSSs from higher categories.
† Probability of overlap between the transcript support and the PET/CAGE tags, as calculated by the Genome Structure Correction statistic (see Supplementary Information section 1.3).
‡ Per cent of clusters with only one tag. For the ‘known’ category this was calculated as the per cent of GENCODE 5′ ends with tag support (25%) or overall (74%).


The enrichments seen with chromatin modifications and sequence-specific factors, along with the significant clustering of this evidence, indicate that the novel TSSs do not reflect false positives and probably use the same biological machinery as other promoters. Sequence-specific transcription factors show a marked increase in binding across the broad region that encompasses each TSS. This increase is notably symmetric, with binding equally likely upstream or downstream of a TSS (see Supplementary Information section 3.7 for an explanation of why this symmetrical signal is not an artefact of the analysis of the signals). Furthermore, there is enrichment of SMARCC1 binding (a member of the SWI/SNF chromatin-modifying complex), which persists across a broader extent than other factors. The broad signals with this factor indicate that the ChIP-chip results reflect both specific enrichment at the TSS and broader enrichments across ~5-kb regions (this is not due to technical issues, see Supplementary Information section 3.8).

We selected 577 GENCODE-defined TSSs at the 5′ ends of a protein-coding transcript with over 3 exons, to assess expression status. Each transcript was classified as: (1) ‘active’ (gene on) or ‘inactive’ (gene off) on the basis of the unbiased transcript surveys, and (2) residing near a ‘CpG island’ or not (‘non-CpG island’) (see Supplementary Information section 3.17). As expected, the aggregate signal of histone modifications is mainly attributable to active TSSs (Fig. 5), in particular those near CpG islands. Pronounced doublet peaks at the TSS can be seen with these large signals (similar to previous work in yeast; ref. 48) owing to the chromatin accessibility at the TSS. Many of the histone marks and Pol II signals are now clearly asymmetrical, with a persistent level of Pol II into the genic region, as expected. However, the sequence-specific factors remain largely symmetrically distributed.
TSSs near CpG islands show a broader distribution of histone marks than those not near CpG islands (see Supplementary Information section 3.6). The binding of some transcription factors (E2F1, E2F4 and MYC) is extensive in the case of active genes, and is lower (or absent) in the case of inactive genes.

Chromatin signature of distal elements. Distal DHSs show characteristic patterns of histone modification that are the inverse of TSSs, with high H3K4me1 accompanied by lower levels of H3K4me3 and H3ac (Fig. 5). Many factors with high occupancy at TSSs (for example, E2F4) show little enrichment at distal DHSs, whereas other factors (for example, MYC) are enriched at both TSSs and distal DHSs (ref. 49). A particularly interesting observation is the relative enrichment of the insulator-associated factor CTCF (ref. 50) at both distal DHSs and TSSs; this contrasts with SWI/SNF components SMARCC2 and SMARCC1, which are TSS-centric. Such differential

[Figure 5 panels: a, GENCODE TSS; b, Novel TSS; c, Unsupported tags; d, Distal DHS; e, Gene on CpG; f, Gene off CpG. x axis: distance to nearest TSS or DHS (−5,000 to 5,000); y axis: aggregate normalized intensity. Series: H3K4me1, H3K4me2, H3K4me3, H3ac, H4ac, FAIRE, DNaseI, MYC, E2F1, E2F4, CTCF, SMARCC1, Pol II.]

Figure 5 | Aggregate signals of tiling-array experiments from either ChIP-chip or chromatin structure assays, represented for different classes of TSSs and DHS. For each plot, the signal was first normalized with a mean of 0 and standard deviation of 1, and then the normalized scores were summed at each position for that class of TSS or DHS and smoothed using a kernel density method (see Supplementary Information section 3.6). For each class of sites there are two adjacent plots. The left plot depicts the data for general factors: FAIRE and DNaseI sensitivity as assays of chromatin accessibility and H3K4me1, H3K4me2, H3K4me3, H3ac and H4ac histone modifications (as indicated); the right plot shows the data for additional factors, namely MYC, E2F1, E2F4, CTCF, SMARCC1 and Pol II. The columns provide data for the different classes of TSS or DHS (unsmoothed data and statistical analysis shown in Supplementary Information section 3.6).
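The normalize-then-sum procedure in the Figure 5 legend can be sketched as follows. This is only an illustration under several assumptions: the signal is a plain list indexed by base position, normalization uses the population standard deviation over the whole signal, the names `zscore` and `aggregate_profile` are hypothetical, and the kernel-density smoothing step mentioned in the legend is deliberately left out.

```python
from statistics import mean, pstdev


def zscore(signal):
    """Rescale a signal to mean 0 and standard deviation 1."""
    mu = mean(signal)
    sd = pstdev(signal)
    if sd == 0:
        # A flat signal carries no information; return zeros.
        return [0.0 for _ in signal]
    return [(x - mu) / sd for x in signal]


def aggregate_profile(signal, anchors, flank):
    """Sum normalized scores at each offset around a class of anchor sites.

    signal:  per-position intensities for one experiment.
    anchors: indices of the TSSs or DHSs in one class.
    flank:   number of positions to include on either side of each anchor.
    Returns a list of length 2 * flank + 1; index `flank` is offset 0.
    """
    z = zscore(signal)
    profile = [0.0] * (2 * flank + 1)
    for anchor in anchors:
        for offset in range(-flank, flank + 1):
            idx = anchor + offset
            # Positions that fall off either end of the signal are skipped.
            if 0 <= idx < len(z):
                profile[offset + flank] += z[idx]
    return profile


# Toy signal with peaks at positions 2 and 6, both used as anchors.
toy = [0, 0, 4, 0, 0, 0, 4, 0]
print(aggregate_profile(toy, anchors=[2, 6], flank=1))
```

In this toy case the summed score peaks at offset 0 and is equal on either side, a small-scale analogue of a symmetric aggregate profile.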

The large increase in pre that E2F1, a sequence-specific factor that regulates the expression of moter proximal regulatory clusters identified by including the addi many genes at the Gl to S transition 2, is also tightly associated with tional novel TSSs coupled with the positive promoter and RACE TSSs, this association is as strong as that of TAFl, the well-known lys suggests that most of the regulatory regions identifiable by TATA box-binding protein associated factor 1 (ref. 53). These results these clustering methods represent bona fide promoters(see suggest that E2FI has a more general role in transcription than prev- Supplementary Information 3.16). Although the regulatory factor cale assays did not support the promoter binding that was found in many of the sites from these experiments would have previously smaller-scale studies(for example, on SIRTI and SPIl(PUl)). Integration of data on sequence-specific factors. We expect that place use of RefSeq- or Ensembl-based gene definition to define regulatory information is not dispersed independently across the distal sites promoter proximity will dramatically overestimate the number of genome but rather is clustered into distinct regions". We refer to Predicting SSs and transcriptional activity on the basis of chro- regions that contain multiple regulatory elements as regulatory clus- matin structure. The strong association between TSSs and both his- ters. We sought to predict the location of regulatory clusters by tone modifications and DHSs prompted us to investigate whether the location and activity of TSSs could be predicted solely on the basis of chromatin structure information. We trained a support vector amce specie a oo: machine(SVM)by using histone modification data anchored around DHSs to discriminate between DHSs near TSSs and those distant from TSSs. We used a selected 2,573 DHSs, split roughly between TSS- proximal DHSs and TSS-distal DHSs, as a training set. The SVM Information section 3.17). 
Using this SVM, we then predicted TSSs using information about DHSs and histone modifications 110 high-scoring predicted TSSs, 81 resided within 2.5 kb of a novel TSS. As expected, these show a significant overlap to the novel TSS groups(defined above) but without a strong bias towards any par ticular category(see Supplementary Information section 3. 17.1.5) To investigate the relationship between chromatin structure and gene expression, we examined transcript levels in two cell lines using a transcript-tiling array. We compared this transcript data with the 0.3 results of ChIP-chip experiments that measured histone modifica- Fraction of tsss near RFBRs tions across the ENCODE regions. From this, we developed a variety Ire 6 Distribution of RFBRs relative to GENCODE TSSs. Different of predictors of expression status using chromatin modifications as FBRS fr variables; these were derived using both decision trees and SVMs(see plotted showing their relative distribution near TSSs. The xaxis indicates the Supplementary Information section 3. 17). The best of these correctly roportion of TSSs close(within 2.5 kb)to the specified factor. The yaxis predicts expression status(transcribed versus non-transcribed)in indicates the proportion of RFBRs close to TSSs. The size of the circle 91% of cases. This success rate did not decrease dramatically when provides an indication of the number of RFBRs for each factor. A handful of the predicting algorithm incorporated the results from one cell line to representative factors are labelled. predict the expression status of another cell line. Interestingly, despite E2007 Nature Publishing Group

behaviour of sequence-specific factors points to distinct biological differences, mediated by transcription factors, between distal regulatory sites and TSSs.

Unbiased maps of sequence-specific regulatory factor binding. The previous section focused on specific positions defined by TSSs or DHSs. We then analysed sequence-specific transcription factor binding data in an unbiased fashion. We refer to regions with enriched binding of regulatory factors as RFBRs. RFBRs were identified on the basis of ChIP-chip data in two ways: first, each investigator developed and used their own analysis method(s) to define high-enrichment regions, and second (and independently), a stringent false discovery rate (FDR) method was applied to analyse all data using three cut-offs (1%, 5% and 10%). The laboratory-specific and FDR-based methods were highly correlated, particularly for regions with strong signals (refs 10, 11). For consistency, we used the results obtained with the FDR-based method (see Supplementary Information section 3.10). These RFBRs can be used to find sequence motifs (see Supplementary Information section S3.11).

RFBRs are associated with the 5′ ends of transcripts. The distribution of RFBRs is non-random (see ref. 10) and correlates with the positions of TSSs. We examined the distribution of specific RFBRs relative to the known TSSs. Different transcription factors and histone modifications vary with respect to their association with TSSs (Fig. 6; see Supplementary Information section 3.12 for modelling of random expectation). Factors for which binding sites are most enriched at the 5′ ends of genes include histone modifications, TAF1 and RNA Pol II with a hypo-phosphorylated carboxy-terminal domain (ref. 51), confirming previous expectations.
Surprisingly, we found that E2F1, a sequence-specific factor that regulates the expression of many genes at the G1 to S transition (ref. 52), is also tightly associated with TSSs (ref. 52); this association is as strong as that of TAF1, the well-known TATA box-binding protein associated factor 1 (ref. 53). These results suggest that E2F1 has a more general role in transcription than previously suspected, similar to that for MYC (refs 54–56). In contrast, the large-scale assays did not support the promoter binding that was found in smaller-scale studies (for example, on SIRT1 and SPI1 (PU1)).

Integration of data on sequence-specific factors. We expect that regulatory information is not dispersed independently across the genome, but rather is clustered into distinct regions (ref. 57). We refer to regions that contain multiple regulatory elements as ‘regulatory clusters’. We sought to predict the location of regulatory clusters by cross-integrating data generated using all transcription factor and histone modification assays, including results falling below an arbitrary threshold in individual experiments. Specifically, we used four complementary methods to integrate the data from 129 ChIP-chip data sets (see Supplementary Information section 3.13 and ref. 58). These four methods detect different classes of regulatory clusters and as a whole identified 1,393 clusters. Of these, 344 were identified by all four methods, with another 500 found by three methods (see Supplementary Information section 3.13.5). 67% of the 344 regulatory clusters identified by all four methods (or 65% of the full set of 1,393) reside within 2.5 kb of a known or novel TSS (as defined above; see Table 3 and Supplementary Information section 3.14 for a breakdown by category). Restricting this analysis to previously annotated TSSs (for example, RefSeq or Ensembl) reveals that roughly 25% of the regulatory clusters are close to a previously identified TSS.
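As a rough illustration of the proximity measure used above, the fraction of regulatory clusters lying within 2.5 kb of a TSS could be computed along the following lines. This is a minimal sketch and not the Consortium's pipeline: the function name `fraction_near_tss`, the use of cluster midpoints, the single-chromosome coordinates and the toy positions are all assumptions made for illustration.

```python
from bisect import bisect_left


def fraction_near_tss(cluster_midpoints, tss_positions, max_dist=2500):
    """Return the fraction of clusters whose midpoint lies within max_dist bp of a TSS.

    Hypothetical helper: positions are integer base-pair coordinates on one
    chromosome; max_dist=2500 mirrors the 2.5-kb criterion in the text.
    """
    if not cluster_midpoints or not tss_positions:
        return 0.0
    tss_sorted = sorted(tss_positions)
    near = 0
    for pos in cluster_midpoints:
        i = bisect_left(tss_sorted, pos)
        # Only the TSSs immediately left and right of pos can be the nearest.
        candidates = tss_sorted[max(i - 1, 0):i + 1]
        if min(abs(pos - t) for t in candidates) <= max_dist:
            near += 1
    return near / len(cluster_midpoints)


# Toy data (invented): three of the four clusters fall within 2.5 kb of a TSS.
clusters = [1_000, 12_400, 30_000, 55_000]
tss = [2_000, 10_000, 54_000]
print(fraction_near_tss(clusters, tss))
```

Swapping the two arguments gives the complementary quantity, the fraction of TSSs that lie near a cluster. Enlarging the TSS list, for example by adding novel TSSs to an annotation-only set, can only keep this fraction the same or raise it.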
These results suggest that many of the regulatory clusters identified by integrating the ChIP-chip data sets are undiscovered promoters or are somehow associated with transcription in another fashion. To test these possibilities, sets of 126 and 28 non-GENCODE-based regulatory clusters were tested for promoter activity (see Supplementary Information section 3.15) and by RACE, respectively. These studies revealed that 24.6% of the 126 tested regulatory clusters had promoter activity and that 78.6% of the 28 regulatory clusters analysed by RACE yielded products consistent with a TSS (ref. 58). The ChIP-chip data sets were generated on a mixture of cell lines, predominantly HeLa and GM06990, and were different from the CAGE/PET data, meaning that tissue specificity contributes to the presence of unique TSSs and regulatory clusters. The large increase in promoter-proximal regulatory clusters identified by including the additional novel TSSs, coupled with the positive promoter and RACE assays, suggests that most of the regulatory regions identifiable by these clustering methods represent bona fide promoters (see Supplementary Information section 3.16). Although the regulatory factor assays were more biased towards regions associated with promoters, many of the sites from these experiments would have previously been described as distal to promoters. This suggests that commonplace use of RefSeq- or Ensembl-based gene definition to define promoter proximity will dramatically overestimate the number of distal sites.

Predicting TSSs and transcriptional activity on the basis of chromatin structure. The strong association between TSSs and both histone modifications and DHSs prompted us to investigate whether the location and activity of TSSs could be predicted solely on the basis of chromatin structure information. We trained a support vector machine (SVM) by using histone modification data anchored around DHSs to discriminate between DHSs near TSSs and those distant from TSSs.
We used a selected 2,573 DHSs, split roughly between TSS-proximal DHSs and TSS-distal DHSs, as a training set. The SVM performed well, with an accuracy of 83% (see Supplementary Information section 3.17). Using this SVM, we then predicted new TSSs using information about DHSs and histone modifications: of 110 high-scoring predicted TSSs, 81 resided within 2.5 kb of a novel TSS. As expected, these show a significant overlap with the novel TSS groups (defined above) but without a strong bias towards any particular category (see Supplementary Information section 3.17.1.5). To investigate the relationship between chromatin structure and gene expression, we examined transcript levels in two cell lines using a transcript-tiling array. We compared this transcript data with the results of ChIP-chip experiments that measured histone modifications across the ENCODE regions. From this, we developed a variety of predictors of expression status using chromatin modifications as variables; these were derived using both decision trees and SVMs (see Supplementary Information section 3.17). The best of these correctly predicts expression status (transcribed versus non-transcribed) in 91% of cases. This success rate did not decrease dramatically when the predicting algorithm incorporated the results from one cell line to predict the expression status of another cell line.

Figure 6 | Distribution of RFBRs relative to GENCODE TSSs. Different RFBRs from sequence-specific factors (red) or general factors (blue) are plotted showing their relative distribution near TSSs. The x axis indicates the proportion of TSSs close (within 2.5 kb) to the specified factor. The y axis indicates the proportion of RFBRs close to TSSs. The size of the circle provides an indication of the number of RFBRs for each factor. A handful of representative factors are labelled. [Plot labels: E2F1, Pol II, TAF1, MYC, CTCF, SIRT1, SPI1, H3K27me3, STAT1, SMARCC1, SMARCC2, H3K4me2, H3K4me3, H3K4me1; circle-size legend: >200, >100, >50, >25, ≤25.]

Interestingly, despite


the striking difference in histone modification enrichments in TSSs residing near versus those more distal to CpG islands (see Fig. 5 and Supplementary Information section 3.6), including information about the proximity to CpG islands did not improve the predictors. This suggests that despite the marked differences in histone modifications among these TSS classes, a single predictor can be made, using the interactions between the different histone modification levels.

In summary, we have integrated many data sets to provide a more complete view of regulatory information, both around specific sites (TSSs and DHSs) and in an unbiased manner. From analysing multiple data sets, we find 4,491 known and novel TSSs in the ENCODE regions, almost tenfold more than the number of established genes. This large number of TSSs might explain the extensive transcription described above; it also begins to change our perspective about regulatory information: without such a large TSS catalogue, many of the regulatory clusters would have been classified as residing distal to promoters. In addition to this revelation about the abundance of promoter-proximal regulatory elements, we also identified a considerable number of putative distal regulatory elements, particularly on the basis of the presence of DHSs. Our study of distal regulatory elements was probably most hindered by the paucity of data generated using distal-element-associated transcription factors; nevertheless, we clearly detected a set of distal-DHS-associated segments bound by CTCF or MYC. Finally, we showed that information about chromatin structure alone could be used to make effective predictions about both the location and activity of TSSs.

Replication

Overview. DNA replication must be carefully coordinated, both across the genome and with respect to development.
On a larger scale, early replication in S phase is broadly correlated with gene density and transcriptional activity (refs 59–66); however, this relationship is not universal, as some actively transcribed genes replicate late and vice versa (refs 61, 64–68). Importantly, the relationship between transcription and DNA replication emerges only when the signal of transcription is averaged over a large window (>100 kb) (ref. 63), suggesting that larger-scale chromosomal architecture may be more important than the activity of specific genes (ref. 69). The ENCODE Project provided a unique opportunity to examine whether individual histone modifications on human chromatin can be correlated with the time of replication and whether such correlations support the general relationship of active, open chromatin with early replication. Our studies also tested whether segments showing interallelic variation in the time of replication have two different types of histone modifications consistent with an interallelic variation in chromatin state.

DNA replication data set. We mapped replication timing across the ENCODE regions by analysing Brd-U-labelled fractions from synchronized HeLa cells (collected at 2 h intervals throughout S phase) on tiling arrays (see Supplementary Information section 4.1). Although the HeLa cell line has a considerably altered karyotype, correlation of these data with other cell line data (see below) suggests the results are relevant to other cell types. The results are expressed as the time at which 50% of any given genomic position is replicated (TR50), with higher values signifying later replication times. In addition to the five ‘activating’ histone marks, we also correlated the TR50 with H3K27me3, a modification associated with polycomb-mediated transcriptional repression (refs 70–74).
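To make the TR50 definition concrete, the sketch below estimates the time at which a position reaches 50% replication from cumulative replicated fractions sampled at 2 h intervals. Linear interpolation between time points, the function name `tr50` and the example values are assumptions for illustration only; the source defers the actual procedure to Supplementary Information section 4.1.

```python
def tr50(times_h, replicated_fraction):
    """Estimate the time (hours into S phase) at which 50% replication is reached.

    times_h: increasing sample times; replicated_fraction: cumulative fraction
    replicated (0..1) at each time. Returns None if 50% is never reached.
    Linear interpolation is an assumption of this sketch.
    """
    if replicated_fraction[0] >= 0.5:
        return float(times_h[0])
    points = list(zip(times_h, replicated_fraction))
    for (t0, f0), (t1, f1) in zip(points, points[1:]):
        if f0 < 0.5 <= f1:
            # Interpolate between the two samples that bracket 50%.
            return t0 + (0.5 - f0) * (t1 - t0) / (f1 - f0)
    return None


# Invented cumulative profiles for an early- and a late-replicating position.
times = [0, 2, 4, 6, 8, 10]
early = [0.0, 0.25, 0.75, 1.0, 1.0, 1.0]
late = [0.0, 0.0, 0.125, 0.25, 0.75, 1.0]
print(tr50(times, early), tr50(times, late))
```

With these toy profiles the late-replicating position receives the larger value, matching the convention that higher TR50 values signify later replication.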
To provide a consistent comparison framework, the histone data were smoothed to 100-kb resolution, and then correlated with the TR50 data by a sliding window correlation analysis (see Supplementary Information section 4.2). The continuous profiles of the activating marks, histone H3K4 mono-, di-, and tri-methylation and histone H3 and H4 acetylation, are generally anti-correlated with the TR50 signal (Fig. 7a and Supplementary Information section 4.3). In contrast, H3K27me3 marks show a predominantly positive correlation with late-replicating segments (Fig. 7a; see Supplementary Information section 4.3 for additional analysis). Although most genomic regions replicate in a temporally specific window in S phase, other regions demonstrate an atypical pattern of replication (Pan-S) where replication signals are seen in multiple parts of S phase. We have suggested that such a pattern of replication stems from interallelic variation in the chromatin structure (refs 59, 75). If one allele is in active chromatin and the other in repressed chromatin, both types of modified histones are expected to be enriched in the Pan-S segments.

Figure 7 | Correlation between replication timing and histone modifications. a, Comparison of two histone modifications (H3K4me2 and H3K27me3), plotted as enrichment ratio from the ChIP-chip experiments and the time for 50% of the DNA to replicate (TR50), indicated for ENCODE region ENm006. The colours on the curves reflect the correlation strength in a sliding 250-kb window. b, Differing levels of histone modification for different TR50 partitions. The amounts of enrichment or depletion of different histone modifications in various cell lines are depicted (indicated along the bottom as ‘histone mark.cell line’; GM = GM06990). Asterisks indicate enrichments/depletions that are not significant on the basis of multiple tests. Each set has four partitions on the basis of replication timing: early, mid, late and Pan-S.

An ENCODE region was classified as a non-specific (or Pan-S) region when >60% of the probes in a 10-kb window


replicated in multiple intervals in S phase. The remaining regions were sub-classified into early-, mid- or late-replicating based on the average TR50 of the temporally specific probes within a 10-kb window (ref. 75). For regions of each class of replication timing, we determined the relative enrichment of various histone modification peaks in HeLa cells (Fig. 7b; Supplementary Information section 4.4). The correlations of activating and repressing histone modification peaks with TR50 are confirmed by this analysis (Fig. 7b). Intriguingly, the Pan-S segments are unique in being enriched for both activating (H3K4me2, H3ac and H4ac) and repressing (H3K27me3) histones, consistent with the suggestion that the Pan-S replication pattern arises from interallelic variation in chromatin structure and time of replication (ref. 75). This observation is also consistent with the Pan-S replication pattern seen for the H19/IGF2 locus, a known imprinted region with differential epigenetic modifications across the two alleles (ref. 76).

The extensive rearrangements in the genome of HeLa cells led us to ask whether the detected correlations between TR50 and chromatin state are seen with other cell lines. The histone modification data with GM06990 cells allowed us to test whether the time of replication of genomic segments in HeLa cells correlated with the chromatin state in GM06990 cells. Early- and late-replicating segments in HeLa cells are enriched and depleted, respectively, for activating marks in GM06990 cells (Fig. 7b). Thus, despite the presence of genomic rearrangements (see Supplementary Information section 2.12), the TR50 and chromatin state in HeLa cells are not far from a constitutive baseline also seen with a cell line from a different lineage.
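The window classification described above (Pan-S when more than 60% of probes in a 10-kb window replicate in multiple intervals, otherwise early, mid or late by the average TR50 of the temporally specific probes) can be sketched as follows. The 60% rule comes from the text; the function name `classify_window`, the probe representation and, in particular, the `early_cutoff` and `late_cutoff` values are invented placeholders, since the source does not state the TR50 boundaries between classes.

```python
from statistics import mean


def classify_window(probes, pan_s_fraction=0.6, early_cutoff=3.0, late_cutoff=5.0):
    """Classify one 10-kb window from its probes.

    probes: list of (tr50_hours, replicates_in_multiple_intervals) tuples.
    pan_s_fraction follows the >60% rule in the text; early_cutoff and
    late_cutoff are hypothetical values, not taken from the source.
    """
    if not probes:
        raise ValueError("window contains no probes")
    multi = sum(1 for _, is_multi in probes if is_multi)
    if multi / len(probes) > pan_s_fraction:
        return "Pan-S"
    # Average TR50 over the temporally specific probes only.
    avg = mean(t for t, is_multi in probes if not is_multi)
    if avg < early_cutoff:
        return "early"
    if avg > late_cutoff:
        return "late"
    return "mid"


# Invented windows: mostly specific probes, mostly multi-interval probes, late probes.
window_a = [(2.0, False), (2.5, False), (3.0, True)]
window_b = [(4.0, True)] * 4 + [(4.0, False)]
window_c = [(6.0, False), (5.5, False)]
print(classify_window(window_a), classify_window(window_b), classify_window(window_c))
```

Because the Pan-S test is applied first, any window that reaches the TR50 averaging step has at least 40% temporally specific probes, so the average is always defined.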
The enrichment of multiple activating histone modifications and the depletion of a repressive modification from segments that replicate early in S phase extend previous work in the field at a level of detail and scale not attempted before in mammalian cells. The duality of histone modification patterns in Pan-S areas of the HeLa genome, and the concordance of chromatin marks and replication time across two disparate cell lines (HeLa and GM06990) confirm the coordination of histone modifications with replication in the human genome.

Chromatin architecture and genomic domains

Overview. The packaging of genomic DNA into chromatin is intimately connected with the control of gene expression and other chromosomal processes. We next examined chromatin structure over a larger scale to ascertain its relation to transcription and other processes. Large domains (50 to >200 kb) of generalized DNaseI sensitivity have been detected around developmentally regulated gene clusters (ref. 77), prompting speculation that the genome is organized into ‘open’ and ‘closed’ chromatin territories that represent higher-order functional domains. We explored how different chromatin features, particularly histone modifications, correlate with chromatin structure, both over short and long distances.

Chromatin accessibility and histone modifications. We used histone modification studies and DNaseI sensitivity data sets (introduced above) to examine general chromatin accessibility without focusing on the specific DHS sites (see Supplementary Information sections 3.1, 3.3 and 3.4). A fundamental difficulty in analysing continuous data across large genomic regions is determining the appropriate scale for analysis (for example, 2 kb, 5 kb, 20 kb, and so on). To address this problem, we developed an approach based on wavelet analysis, a mathematical tool pioneered in the field of signal processing that has recently been applied to continuous-value genomic analyses.
Wavelet analysis provides a means for consistently transforming continuous signals into different scales, enabling the correlation of different phenomena independently at differing scales.

Global correlations of chromatin accessibility and histone modifications. We computed the regional correlation between DNaseI sensitivity and each histone modification at multiple scales using a wavelet approach (Fig. 8 and Supplementary Information section 4.2). To make quantitative comparisons between different histone modifications, we computed histograms of correlation values between DNaseI sensitivity and each histone modification at several scales and then tested these for significance at specific scales. Figure 8c shows the distribution of correlation values at a 16-kb scale, which is considerably larger than individual cis-acting regulatory elements. At this scale, H3K4me2, H3K4me3 and H3ac show similarly high correlation. However, they are significantly distinguished from H3K4me1 and H4ac modifications (P < 1.5 × 10^-33; see Supplementary Information section 4.5), which show lower correlation with DNaseI sensitivity. These results suggest that larger-scale relationships between chromatin accessibility and histone modifications are dominated by sub-regions in which higher average DNaseI sensitivity is accompanied by high levels of H3K4me2, H3K4me3 and H3ac modifications.

Local correlations of chromatin accessibility and histone modifications. Narrowing to a scale of ~2 kb revealed a more complex situation, in which H3K4me2 is the histone modification that is best correlated with DNaseI sensitivity. However, there is no clear combination of marks that correlate with DNaseI sensitivity in a way that is analogous to that seen at a larger scale (see Supplementary Information section 4.3).
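The idea of correlating two signals separately at each scale can be illustrated with a small sketch. It uses a Haar discrete wavelet transform and a Pearson correlation of the detail coefficients at each level; the Haar wavelet is chosen here only for simplicity, as the source does not say which wavelet or correlation statistic was used, and the function names and toy signals are hypothetical.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation of two equal-length lists; None if either is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0 or sv == 0:
        return None
    return cov / (su * sv)


def haar_step(signal):
    """One Haar DWT level: return (approximation, detail) at half the length."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / sqrt(2) for a, b in pairs]
    detail = [(a - b) / sqrt(2) for a, b in pairs]
    return approx, detail


def wavelet_correlation(x, y, levels):
    """Correlate Haar detail coefficients of x and y at each level (1 = finest).

    Assumes len(x) == len(y) and that the length is divisible by 2**levels.
    """
    result = {}
    ax, ay = list(x), list(y)
    for level in range(1, levels + 1):
        ax, dx = haar_step(ax)
        ay, dy = haar_step(ay)
        result[level] = pearson(dx, dy) if len(dx) >= 2 else None
    return result


# Toy signals (invented): a shared coarse trend plus opposite fine-scale wiggles.
coarse = [0, 1, 3, 2, 5, 9, 4, 4]
fine = [1, 2, 1, 3, 2, 1, 3, 1]
x, y = [], []
for c, f in zip(coarse, fine):
    x += [c + f, c - f]
    y += [c - f, c + f]
print(wavelet_correlation(x, y, levels=2))
```

In this toy case the two signals are anti-correlated at the finest level and positively correlated at the next coarser level, which shows why a single whole-signal correlation can hide scale-dependent relationships.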
Figure 8 | Wavelet correlations of histone marks and DNaseI sensitivity. As an example, correlations between DNaseI sensitivity and H3K4me2 (both in the GM06990 cell line) over a 1.1-Mb region on chromosome 7 (ENCODE region ENm013) are shown. a, The relationship between histone modification H3K4me2 (upper plot) and DNaseI sensitivity (lower plot) is shown for ENCODE region ENm013. The curves are coloured with the strength of the local correlation at the 4-kb scale (top dashed line in panel b). b, The same data as in a are represented as a wavelet correlation. The y axis shows the differing scales decomposed by the wavelet analysis from large to small scale (in kb); the colour at each point in the heatmap represents the level of correlation at the given scale, measured in a 20-kb window centred at the given position. c, Distribution of correlation values at the 16-kb scale between the indicated histone marks. The y axis is the density of these correlation values across ENCODE; all modifications show a peak at a positive-correlation value.

One explanation for the increased
