ARTICLES NATURE | Vol 447 | 14 June 2007
protein structure (ref. 18). Such exons are on average less expressed (25% versus 87% by RT–PCR; see Supplementary Information section 2.7) than exons involved in more than one transcript (see Supplementary Information section 2.4.3), but when expressed have a tissue distribution comparable to well-established genes. Critical questions are raised by the presence of a large amount of unannotated transcription with respect to how the corresponding sequences are organized in the genome: do these reflect longer transcripts that include known loci, do they link known loci, or are they completely separate from known loci? We further investigated these issues using both computational and new experimental techniques.

Unannotated transcription. Consistent with previous findings, the Un.TxFrags did not show evidence of encoding proteins (see Supplementary Information section 2.8).
One might expect Un.TxFrags to be linked within transcripts that exhibit coordinated expression and have similar conservation profiles across species. To test this, we clustered Un.TxFrags using two methods. The first method (ref. 19) used expression levels in 11 cell lines or conditions, dinucleotide composition, location relative to annotated genes, and evolutionary conservation profiles to cluster TxFrags (both unannotated and annotated). By this method, 14% of Un.TxFrags could be assigned to annotated loci, and 21% could be clustered into 200 novel loci (with an average of ~7 TxFrags per locus). We experimentally examined these novel loci to study the connectivity of transcripts amongst Un.TxFrags and between Un.TxFrags and known exons. Overall, about 40% of the connections (18 out of 46) were validated by RT–PCR. The second clustering method involved analysing a time course (0, 2, 8 and 32 h) of expression changes in human HL60 cells following retinoic-acid stimulation. There is a coordinated program of expression changes from annotated loci, which can be shown by plotting Pearson correlation values of the expression levels of exons inside annotated loci versus unrelated exons (see Supplementary Information section 2.8.2). Similarly, there is coordinated expression of nearby Un.TxFrags, albeit lower, though still significantly different from randomized sets. Both clustering methods indicate that there is coordinated behaviour of many Un.TxFrags, consistent with them residing in connected transcripts.

Transcript connectivity. We used a combination of RACE and tiling arrays (ref. 20) to investigate the diversity of transcripts emanating from protein-coding loci. Analogous to TxFrags, we refer to transcripts detected using RACE followed by hybridization to tiling arrays as RxFrags.
We performed RACE to examine 399 protein-coding loci (those loci found entirely in ENCODE regions) using RNA derived from 12 tissues, and were able to unambiguously detect 4,573 RxFrags for 359 loci (see Supplementary Information section 2.9). Almost half of these RxFrags (2,324) do not overlap a GENCODE exon, and most (90%) loci have at least one novel RxFrag, which often extends a considerable distance beyond the 5′ end of the locus. Figure 2 shows the distribution of distances between these new RACE-detected ends and the previously annotated TSS of each locus. The average distance of the extensions is between 50 kb and 100 kb, with many extensions (>20%) being more than 200 kb. Consistent with the known presence of overlapping genes in the human genome, our findings reveal evidence for an overlapping gene at 224 loci, with transcripts from 180 of these loci (~50% of the RACE-positive loci) appearing to have incorporated at least one exon from an upstream gene. To characterize further the 5′ RxFrag extensions, we performed RT–PCR followed by cloning and sequencing for 550 of the 5′ RxFrags (including the 261 longest extensions identified for each locus). The approach of mapping RACE products using microarrays is a combination method previously described and validated in several studies (refs 14, 17, 20). Hybridization of the RT–PCR products to tiling arrays confirmed connectivity in almost 60% of the cases. Sequenced clones confirmed transcript extensions. Longer extensions were harder to clone and sequence, but 5 out of 18 RT–PCR-positive extensions over 100 kb were verified by sequencing (see Supplementary Information section 2.9.7 and ref. 17). The detection of numerous RxFrag extensions coupled with evidence of considerable intronic transcription indicates that protein-coding loci are more transcriptionally complex than previously thought.
Instead of the traditional view that many genes have one or more alternative transcripts that code for alternative proteins, our data suggest that a given gene may both encode multiple protein products and produce other transcripts that include sequences from both strands and from neighbouring loci (often without encoding a different protein). Figure 3 illustrates such a case, in which a new fusion transcript is expressed in the small intestine, and consists of at least three coding exons from the ATP5O gene and at least two coding exons from the DONSON

[Figure 1 plot. x axis: number of cell lines (1/11 to 11/11). y axis: tiling array nucleotides (%), shown separately for annotated transcripts and novel transcripts. Categories: intronic proximal, intronic distal, intergenic proximal, intergenic distal, other ESTs, GENCODE exonic.]

Figure 1 | Annotated and unannotated TxFrags detected in different cell lines. The proportion of different types of transcripts detected in the indicated number of cell lines (from 1/11 at the far left to 11/11 at the far right) is shown. The data for annotated and unannotated TxFrags are indicated separately, and also split into different categories based on GENCODE classification: exonic, intergenic (proximal being within 5 kb of a gene and distal being otherwise), intronic (proximal being within 5 kb of an intron and distal being otherwise), and matching other ESTs not used in the GENCODE annotation (principally because they were unspliced). The y axis indicates the per cent of tiling array nucleotides present in that class for that number of samples (combination of cell lines and tissues).
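The time-course comparison described under "Unannotated transcription" rests on Pearson correlation between expression profiles measured at 0, 2, 8 and 32 h. The following is a minimal sketch of that kind of comparison, not the authors' pipeline: the element names, locus labels and expression values are invented for illustration, and the helper `mean_correlations` is a hypothetical name. It contrasts the mean correlation of pairs within the same locus against pairs from different loci; the significance test against randomized sets mentioned in the text is not reproduced here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: element name -> (locus label, expression at 0, 2, 8, 32 h).
# All names and values are invented for this illustration.
profiles = {
    "locusA_exon1": ("A", [1.0, 2.0, 4.0, 8.0]),
    "locusA_exon2": ("A", [1.5, 2.5, 4.5, 9.0]),
    "locusB_exon1": ("B", [8.0, 4.0, 2.0, 1.0]),
    "locusB_exon2": ("B", [9.0, 5.0, 2.5, 1.0]),
}


def mean_correlations(profiles):
    """Return (mean r for same-locus pairs, mean r for different-locus pairs)."""
    same, different = [], []
    for (_, (locus1, p1)), (_, (locus2, p2)) in combinations(profiles.items(), 2):
        bucket = same if locus1 == locus2 else different
        bucket.append(pearson(p1, p2))
    return sum(same) / len(same), sum(different) / len(different)


within, between = mean_correlations(profiles)
print(f"mean r within loci:  {within:.3f}")
print(f"mean r between loci: {between:.3f}")
```

With real data, the same comparison would be run for exons of annotated loci and then for nearby Un.TxFrags, with shuffled assignments serving as the randomized baseline.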
[Figure 2 plot. x axis: extension length (kb), in classes <0.5, 0.5–1, 1–5, 5–10, 10–25, 25–50, 50–100, 100–200, 200–300, 300–400, 400–500 and ≥500. y axes: per cent of RxFrag extensions (shaded boxes) and cumulative per cent of extensions this length or greater (line).]

Figure 2 | Length of genomic extensions to GENCODE-annotated genes on the basis of RACE experiments followed by array hybridizations (RxFrags). The indicated bars reflect the frequency of extension lengths among different length classes. The solid line shows the cumulative frequency of extensions of that length or greater. Most of the extensions are greater than 50 kb from the annotated gene (see text for details).
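The solid line in Figure 2 is a reverse cumulative distribution: for each length class, the per cent of extensions in that class or any longer class. A minimal sketch of that calculation follows. The length classes are those on the figure's x axis, but the counts are invented for illustration and do not reproduce the published data; `cumulative_at_least` is a hypothetical helper name.

```python
def cumulative_at_least(per_cent_per_class):
    """For each class, sum the per cent in that class and in all longer classes."""
    result = []
    running = 0.0
    # Walk from the longest class to the shortest, accumulating as we go.
    for value in reversed(per_cent_per_class):
        running += value
        result.append(running)
    return result[::-1]


# Length classes (kb) as labelled on the Figure 2 x axis.
length_classes = [
    "<0.5", "0.5-1", "1-5", "5-10", "10-25", "25-50",
    "50-100", "100-200", "200-300", "300-400", "400-500", ">=500",
]

# Invented counts per class (NOT the published data), one per length class.
counts = [1, 2, 3, 4, 5, 6, 7, 8, 5, 4, 3, 2]

total = sum(counts)
per_cent = [100 * c / total for c in counts]      # heights of the bars
cumulative = cumulative_at_least(per_cent)        # values along the line

for label, bar, line in zip(length_classes, per_cent, cumulative):
    print(f"{label:>8} kb  bar {bar:5.1f}%  this length or greater {line:5.1f}%")
```

Reading the line at the 200–300 kb class gives the share of extensions of 200 kb or more, which is how a statement such as "more than 200 kb" in the text maps onto the figure.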