NATURE | Vol 447 | 14 June 2007 ARTICLES
©2007 Nature Publishing Group

augmented by multiple other websites (see Supplementary Information section 1.1).

A common feature of genomic analyses is the need to assess the significance of the co-occurrence of features or of other statistical tests. One confounding factor is the heterogeneity of the genome, which can produce uninteresting correlations of variables distributed across the genome. We have developed and used a statistical framework that mitigates many of these hidden correlations by adjusting the appropriate null distribution of the test statistics. We term this correction procedure genome structure correction (GSC) (see Supplementary Information section 1.3).

In the next five sections, we detail the various biological insights of the pilot phase of the ENCODE Project.

Transcription

Overview. RNA transcripts are involved in many cellular functions, either directly as biologically active molecules or indirectly by encoding other active molecules. In the conventional view of genome organization, sets of RNA transcripts (for example, messenger RNAs) are encoded by distinct loci, with each usually dedicated to a single biological role (for example, encoding a specific protein). However, this picture has substantially grown in complexity in recent years (ref. 12). Other forms of RNA molecules (such as small nucleolar RNAs and micro (mi)RNAs) are known to exist, and often these are encoded by regions that intercalate with protein-coding genes. These observations are consistent with the well-known discrepancy between the levels of observable mRNAs and large structural RNAs compared with the total RNA in a cell, suggesting that there are numerous RNA species yet to be classified (refs 13–15).
In addition, studies of specific loci have indicated the presence of RNA transcripts that have a role in chromatin maintenance and other regulatory control. We sought to assay and analyse transcription comprehensively across the 44 ENCODE regions in an effort to understand the repertoire of encoded RNA molecules.

Transcript maps. We used three methods to identify transcripts emanating from the ENCODE regions: hybridization of RNA (either total or polyA-selected) to unbiased tiling arrays (see Supplementary Information section 2.1), tag sequencing of cap-selected RNA at the 5′ or joint 5′/3′ ends (see Supplementary Information sections 2.2 and S2.3), and integrated annotation of available complementary DNA and EST sequences involving computational, manual, and experimental approaches (ref. 16; see Supplementary Information section 2.4). We abbreviate the regions identified by unbiased tiling arrays as TxFrags, the cap-selected RNAs as CAGE or PET tags (see Box 1), and the integrated annotation as GENCODE transcripts. When a TxFrag does not overlap a GENCODE annotation, we call it an Un.TxFrag. Validation of these various studies is described in papers reporting these data sets (ref. 17; see Supplementary Information sections 2.1.4 and 2.1.5).

These methods recapitulate previous findings, but provide enhanced resolution owing to the larger number of tissues sampled and the integration of results across the three approaches (see Table 2). To begin with, our studies show that 14.7% of the bases represented in the unbiased tiling arrays are transcribed in at least one tissue sample. Consistent with previous work (refs 14, 15), many (63%) TxFrags reside outside of GENCODE annotations, both in intronic (40.9%) and intergenic (22.6%) regions. GENCODE annotations are richer than the more-conservative RefSeq or Ensembl annotations, with 2,608 transcripts clustered into 487 loci, leading to an average of 5.4 transcripts per locus.
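As an illustration of the Un.TxFrag definition given above, the following minimal Python sketch labels transcribed fragments by testing them for overlap with annotated exons. It is not the pipeline used in the study: the half-open interval representation, the function names `overlaps` and `label_txfrags`, and the toy coordinates are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical representation: half-open [start, end) intervals on one strand
# of a single ENCODE region. The paper does not prescribe this layout.
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True when two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def label_txfrags(txfrags: List[Interval], annotations: List[Interval]) -> List[str]:
    """Label each fragment 'TxFrag' if it overlaps any annotated exon,
    otherwise 'Un.TxFrag' (unannotated)."""
    return [
        "TxFrag" if any(overlaps(t, a) for a in annotations) else "Un.TxFrag"
        for t in txfrags
    ]


# Toy coordinates for illustration only; not taken from the ENCODE data sets.
annotations = [(100, 200), (500, 650)]
txfrags = [(150, 180), (210, 260), (640, 700), (900, 950)]

labels = label_txfrags(txfrags, annotations)
unannotated_fraction = labels.count("Un.TxFrag") / len(labels)
print(labels, unannotated_fraction)
```

In this toy input, two of the four fragments overlap no annotated exon, so half are labelled Un.TxFrag. For region-scale data, a sorted sweep or an interval tree would likely replace the quadratic scan used here.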
Finally, extensive testing of predicted protein-coding sequences outside of GENCODE annotations was positive in only 2% of cases (ref. 16), suggesting that GENCODE annotations cover nearly all protein-coding sequences. The GENCODE annotations are categorized both by likely function (mainly, the presence of an open reading frame) and by classification evidence (for example, transcripts based solely on ESTs are distinguished from other scenarios); this classification is not strongly correlated with expression levels (see Supplementary Information sections 2.4.2 and 2.4.3).

Analyses of more biological samples have allowed a richer description of the transcription specificity (see Fig. 1 and Supplementary Information section 2.5). We found that 40% of TxFrags are present in only one sample, whereas only 2% are present in all samples. Although exon-containing TxFrags are more likely (74%) to be expressed in more than one sample, 45% of unannotated TxFrags are also expressed in multiple samples. GENCODE annotations of separate loci often (42%) overlap with respect to their genomic coordinates, in particular on opposite strands (33% of loci).
Further analysis of GENCODE-annotated sequences with respect to the positions of open reading frames revealed that some component exons do not have the expected synonymous versus non-synonymous substitution patterns of protein-coding sequence (see Supplementary Information section 2.6) and some have deletions incompatible with

Table 1 | Summary of types of experimental techniques used in ENCODE

Feature class | Experimental technique(s) | Abbreviations | References | Number of experimental data points
Transcription | Tiling array, integrated annotation | TxFrag, RxFrag, GENCODE | 117, 118, 19, 119 | 63,348,656
5′ ends of transcripts* | Tag sequencing | PET, CAGE | 121, 13 | 864,964
Histone modifications | Tiling array | Histone nomenclature†, RFBR | 46 | 4,401,291
Chromatin‡ structure | QT-PCR, tiling array | DHS, FAIRE | 42, 43, 44, 122 | 15,318,324
Sequence-specific factors | Tiling array, tag sequencing, promoter assays | STAGE, ChIP-Chip, ChIP-PET, RFBR | 41, 52, 11, 120, 123, 81, 34, 51, 124, 49, 33, 40 | 324,846,018
Replication | Tiling array | TR50 | 59, 75 | 14,735,740
Computational analysis | Computational methods | CCI, RFBR cluster | 80, 125, 10, 16, 126, 127 | NA
Comparative sequence analysis* | Genomic sequencing, multi-sequence alignments, computational analyses | CS | 87, 86, 26 | NA
Polymorphisms* | Resequencing, copy number variation | CNV | 103, 128 | NA

* Not all data generated by the ENCODE Project.
† Histone code nomenclature follows the Brno nomenclature as described in ref. 129.
‡ Also contains histone modification.

Table 2 | Bases detected in processed transcripts either as a GENCODE exon, a TxFrag, or as either a GENCODE exon or a TxFrag

Measure | GENCODE exon | TxFrag | Either GENCODE exon or TxFrag
Total detectable transcripts (bases) | 1,776,157 (5.9%) | 1,369,611 (4.6%) | 2,519,280 (8.4%)
Transcripts detected in tiled regions of arrays (bases) | 1,447,192 (9.8%) | 1,369,611 (9.3%) | 2,163,303 (14.7%)

Percentages are of total bases in ENCODE in the first row and bases tiled in arrays in the second row.
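The percentages in Table 2 can be checked for internal consistency by back-calculating the denominator that each one implies. The sketch below is a rough check in Python, not part of the published analysis; the counts and rounded percentages are copied from Table 2, while the helper name `implied_denominator` and the dictionary keys are hypothetical.

```python
# (bases detected, published percentage) for the three columns of Table 2:
# GENCODE exon, TxFrag, either GENCODE exon or TxFrag.
table2 = {
    "total ENCODE bases": [(1_776_157, 5.9), (1_369_611, 4.6), (2_519_280, 8.4)],
    "tiled bases": [(1_447_192, 9.8), (1_369_611, 9.3), (2_163_303, 14.7)],
}


def implied_denominator(count: int, percent: float) -> float:
    """Back-calculate the total that makes `count` equal to `percent` of it."""
    return count / (percent / 100.0)


# One implied denominator per table cell, grouped by row.
implied = {
    row: [implied_denominator(count, pct) for count, pct in cells]
    for row, cells in table2.items()
}

for row, values in implied.items():
    print(row, [f"{v / 1e6:.1f} Mb" for v in values])
```

Because the published percentages are rounded to one decimal place, the implied denominators agree only approximately: close to 30 Mb within the first row, and roughly 14.7 to 14.8 Mb within the second.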