gene, with no evidence of sequences from two intervening protein-coding genes (ITSN1 and CRYZL1).

Pseudogenes. Pseudogenes, reviewed in refs 21 and 22, are generally considered non-functional copies of genes; they are sometimes transcribed and often complicate analysis of transcription owing to close sequence similarity to functional genes. We used various computational methods to identify 201 pseudogenes (124 processed and 77 non-processed) in the ENCODE regions (see Supplementary Information section 2.10 and ref. 23). Tiling-array analysis of 189 of these revealed that 56% overlapped at least one TxFrag. However, possible cross-hybridization between the pseudogenes and their corresponding parent genes may have confounded such analyses.
To better assess the extent of pseudogene transcription, 160 pseudogenes (111 processed and 49 non-processed) were examined for expression using RACE/tiling-array analysis (see Supplementary Information section 2.9.2). Transcripts were detected for 14 pseudogenes (8 processed and 6 non-processed) in at least one of the 12 tested RNA sources, the majority (9) being in testis (see ref. 23). Additionally, there was evidence for the transcription of 25 pseudogenes on the basis of their proximity (within 100 bp of a pseudogene end) to CAGE tags (8), PETs (2), or cDNAs/ESTs (21). Overall, we estimate that at least 19% of the pseudogenes in the ENCODE regions are transcribed, which is consistent with previous estimates (refs 24, 25).

Non-protein-coding RNA. Non-protein-coding RNAs (ncRNAs) include structural RNAs (for example, transfer RNAs, ribosomal RNAs, and small nuclear RNAs) and more recently discovered regulatory RNAs (for example, miRNAs). There are only 8 well-characterized ncRNA genes within the ENCODE regions (U70, ACA36, ACA56, mir-192, mir-194-2, mir-196, mir-483 and H19), whereas representatives of other classes (for example, box C/D snoRNAs, tRNAs, and functional snRNAs) seem to be completely absent from the ENCODE regions. Tiling-array data provided evidence for transcription in at least one of the assayed RNA samples for all of these ncRNAs, with the exception of mir-483 (expression of mir-483 might be specific to fetal liver, which was not tested). There is also evidence for the transcription of 6 out of 8 pseudogenes of ncRNAs (mainly snoRNA-derived). As in the analysis of protein pseudogenes, the hybridization results could also originate from the known snoRNA gene elsewhere in the genome.

Many known ncRNAs are characterized by a well-defined RNA secondary structure.
We applied two de novo ncRNA prediction algorithms, EvoFold and RNAz, to predict structured ncRNAs (as well as functional structures in mRNAs) using the multi-species sequence alignments (see below, Supplementary Information section 2.11 and ref. 26). Using a sensitivity threshold capable of detecting all known miRNAs and snoRNAs, we identified 4,986 and 3,707 candidate ncRNA loci with EvoFold and RNAz, respectively. Only 268 loci (5% and 7%, respectively) were found with both programs, representing a 1.6-fold enrichment over that expected by chance; the lack of more extensive overlap is due to the two programs having optimal sensitivity at different levels of GC content and conservation. We experimentally examined 50 of these targets using RACE/tiling-array analysis for brain and testis tissues (see Supplementary Information sections 2.11 and 2.9.3); the predictions were validated at a 56%, 65%, and 63% rate for EvoFold, RNAz and dual predictions, respectively.

Primary transcripts. The detection of numerous unannotated transcripts, coupled with increasing knowledge of the general complexity of transcription, prompted us to examine the extent of primary (that is, unspliced) transcripts across the ENCODE regions. Three data sources provide insight about these primary transcripts: the GENCODE annotation, PETs, and RxFrag extensions. Figure 4 summarizes the fraction of bases in the ENCODE regions that overlap transcripts identified by these technologies. Remarkably, 93% of bases are represented in a primary transcript identified by at least two independent observations (but potentially using the same technology); this figure is reduced to 74% in the case of primary transcripts detected by at least two different technologies. These increased spans are not mainly due to cell line rearrangements because they were present in multiple tissue experiments that confirmed the spans (see Supplementary Information section 2.12).
These estimates assume that the presence of PETs or RxFrags defining the terminal ends of a transcript implies that the entire intervening DNA is transcribed and then processed. Other mechanisms thought to be unlikely in the human genome, such as trans-splicing or polymerase jumping, would also produce these long termini and potentially should be reconsidered in more detail.

Previous studies have suggested a similarly broad amount of transcription across the human (refs 14, 15) and mouse (ref. 27) genomes. Our studies confirm these results, and have investigated the genesis of these transcripts in greater detail, confirming the presence of substantial intragenic and intergenic transcription. At the same time, many of the resulting transcripts are neither traditional protein-coding

[Figure 4 image. Legend categories: No coverage; One technology, one observation; One technology, two observations; Two technologies; All three technologies.]

Figure 4 | Coverage of primary transcripts across ENCODE regions. Three different technologies (integrated annotation from GENCODE, RACE-array experiments (RxFrags) and PET tags) were used to assess the presence of a nucleotide in a primary transcript. Use of these technologies provided the opportunity to have multiple observations of each finding. The proportion of genomic bases detected in the ENCODE regions associated with each of the following scenarios is depicted: detected by all three technologies, by two of the three technologies, by one technology but with multiple observations, and by one technology with only one observation. Also indicated are genomic bases without any detectable coverage of primary transcripts.

[Figure 3 image. Chr. 21 coordinates 33,900,000 to 34,200,000; tracks: RxFrag, PETs, GENCODE reference genes (DONSON, CRYZL1, ITSN1, ATP5O, with strand labels), and the cloned RT–PCR product (DONSON, ATP5O).]

Figure 3 | Overview of RACE experiments showing a gene fusion. Transcripts emanating from the region between the DONSON and ATP5O genes.
A 330-kb interval of human chromosome 21 (within ENm005) is shown, which contains four annotated genes: DONSON, CRYZL1, ITSN1 and ATP5O. The 5′ RACE products generated from small intestine RNA and detected by tiling-array analyses (RxFrags) are shown along the top. Along the bottom is shown the placement of a cloned and sequenced RT–PCR product that has two exons from the DONSON gene followed by three exons from the ATP5O gene; these sequences are separated by a 300-kb intron in the genome. A PET tag shows the termini of a transcript consistent with this RT–PCR product.

NATURE | Vol 447 | 14 June 2007 | ARTICLES | 803 | ©2007 Nature Publishing Group
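The counts and percentages quoted in the Pseudogenes and Non-protein-coding RNA paragraphs can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are ours, and because the text does not state how the RACE-detected and proximity-supported pseudogene sets overlap, only bounds on their union are derived.

```python
def pct(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole

# Pseudogene counts quoted in the text.
identified = 124 + 77            # processed + non-processed
examined = 111 + 49              # subset tested by RACE/tiling array
race_detected = 8 + 6            # processed + non-processed with transcripts
proximity_total = 25             # near CAGE tags, PETs or cDNAs/ESTs
proximity_by_type = 8 + 2 + 21   # exceeds 25, so some have several evidence types

# Assumption: the overlap between the two evidence sets is not given,
# so the union can only be bounded.
union_min = max(race_detected, proximity_total)
union_max = race_detected + proximity_total
implied_by_19_pct = round(0.19 * identified)

# ncRNA prediction counts quoted in the text.
evofold_loci, rnaz_loci, shared_loci = 4986, 3707, 268
# A 1.6-fold enrichment implies this many shared loci expected by chance.
expected_by_chance = shared_loci / 1.6

print(f"pseudogenes identified: {identified}, examined: {examined}")
print(f"transcribed union between {union_min} and {union_max} "
      f"({pct(union_min, identified):.1f}% to {pct(union_max, identified):.1f}%)")
print(f"19% of {identified} is about {implied_by_19_pct} pseudogenes")
print(f"shared loci: {pct(shared_loci, evofold_loci):.1f}% of EvoFold, "
      f"{pct(shared_loci, rnaz_loci):.1f}% of RNAz")
print(f"shared loci expected by chance: about {expected_by_chance:.1f}")
```

The "at least 19%" estimate corresponds to roughly 38 of the 201 pseudogenes, which falls inside the derivable bounds, and the shared-locus percentages round to the quoted 5% and 7%.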
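The five categories in Figure 4, and the two summary figures in the Primary transcripts paragraph (bases seen in at least two observations versus at least two technologies), can be expressed as a small classification rule. This is a minimal sketch under our own assumptions, not the authors' pipeline: the per-base input format, the function names and the toy data are hypothetical, and "multiple observations" is taken to mean two or more.

```python
from typing import Dict, List, Tuple

TECHNOLOGIES = ("GENCODE", "RxFrag", "PET")

def classify_base(observations: Dict[str, int]) -> str:
    """Assign one genomic base to a Figure 4 category.

    `observations` maps a technology name to the number of independent
    observations of a primary transcript covering the base (hypothetical format).
    """
    counts = [observations.get(tech, 0) for tech in TECHNOLOGIES]
    n_tech = sum(1 for c in counts if c > 0)
    if n_tech == 0:
        return "No coverage"
    if n_tech == 3:
        return "All three technologies"
    if n_tech == 2:
        return "Two technologies"
    # Exactly one technology: split by the number of observations.
    if max(counts) >= 2:
        return "One technology, two observations"
    return "One technology, one observation"

def summarize(bases: List[Dict[str, int]]) -> Tuple[float, float]:
    """Return (fraction with >= 2 observations, fraction with >= 2 technologies)."""
    if not bases:
        raise ValueError("no bases supplied")
    labels = [classify_base(b) for b in bases]
    two_tech = sum(l in ("Two technologies", "All three technologies") for l in labels)
    two_obs = two_tech + sum(l == "One technology, two observations" for l in labels)
    return two_obs / len(labels), two_tech / len(labels)

# Toy data for five bases; these are not values from the paper.
toy_bases = [
    {"GENCODE": 1, "RxFrag": 1, "PET": 1},
    {"GENCODE": 1, "PET": 1},
    {"RxFrag": 3},
    {"PET": 1},
    {},
]
at_least_two_obs, at_least_two_tech = summarize(toy_bases)
print(at_least_two_obs, at_least_two_tech)
```

Under this rule the two-technology fraction can never exceed the two-observation fraction, which is consistent with the quoted figure dropping from 93% to 74% when different technologies are required.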