ARTICLES NATURE | Vol 447 | 14 June 2007
behaviour of sequence-specific factors points to distinct biological differences, mediated by transcription factors, between distal regulatory sites and TSSs.

Unbiased maps of sequence-specific regulatory factor binding. The previous section focused on specific positions defined by TSSs or DHSs.
We then analysed sequence-specific transcription factor binding data in an unbiased fashion. We refer to regions with enriched binding of regulatory factors as RFBRs. RFBRs were identified on the basis of ChIP-chip data in two ways: first, each investigator developed and used their own analysis method(s) to define high-enrichment regions, and second (and independently), a stringent false discovery rate (FDR) method was applied to analyse all data using three cut-offs (1%, 5% and 10%). The laboratory-specific and FDR-based methods were highly correlated, particularly for regions with strong signals (refs 10, 11). For consistency, we used the results obtained with the FDR-based method (see Supplementary Information section 3.10). These RFBRs can be used to find sequence motifs (see Supplementary Information section S3.11).

RFBRs are associated with the 5′ ends of transcripts. The distribution of RFBRs is non-random (see ref. 10) and correlates with the positions of TSSs. We examined the distribution of specific RFBRs relative to the known TSSs. Different transcription factors and histone modifications vary with respect to their association with TSSs (Fig. 6; see Supplementary Information section 3.12 for modelling of random expectation). Factors for which binding sites are most enriched at the 5′ ends of genes include histone modifications, TAF1 and RNA Pol II with a hypo-phosphorylated carboxy-terminal domain (ref. 51), confirming previous expectations. Surprisingly, we found that E2F1, a sequence-specific factor that regulates the expression of many genes at the G1 to S transition (ref. 52), is also tightly associated with TSSs (ref. 52); this association is as strong as that of TAF1, the well-known TATA box-binding protein-associated factor 1 (ref. 53). These results suggest that E2F1 has a more general role in transcription than previously suspected, similar to that for MYC (refs 54-56).
In contrast, the large-scale assays did not support the promoter binding that was found in smaller-scale studies (for example, on SIRT1 and SPI1 (PU1)).

Integration of data on sequence-specific factors. We expect that regulatory information is not dispersed independently across the genome, but rather is clustered into distinct regions (ref. 57). We refer to regions that contain multiple regulatory elements as ‘regulatory clusters’. We sought to predict the location of regulatory clusters by cross-integrating data generated using all transcription factor and histone modification assays, including results falling below an arbitrary threshold in individual experiments. Specifically, we used four complementary methods to integrate the data from 129 ChIP-chip data sets (see Supplementary Information section 3.13 and ref. 58). These four methods detect different classes of regulatory clusters and as a whole identified 1,393 clusters. Of these, 344 were identified by all four methods, with another 500 found by three methods (see Supplementary Information section 3.13.5). Of the 344 regulatory clusters identified by all four methods, 67% (or 65% of the full set of 1,393) reside within 2.5 kb of a known or novel TSS (as defined above; see Table 3 and Supplementary Information section 3.14 for a breakdown by category). Restricting this analysis to previously annotated TSSs (for example, RefSeq or Ensembl) reveals that roughly 25% of the regulatory clusters are close to a previously identified TSS. These results suggest that many of the regulatory clusters identified by integrating the ChIP-chip data sets are undiscovered promoters or are associated with transcription in some other fashion. To test these possibilities, sets of 126 and 28 non-GENCODE-based regulatory clusters were tested for promoter activity (see Supplementary Information section 3.15) and by RACE, respectively.
These studies revealed that 24.6% of the 126 tested regulatory clusters had promoter activity and that 78.6% of the 28 regulatory clusters analysed by RACE yielded products consistent with a TSS (ref. 58). The ChIP-chip data sets were generated on a mixture of cell lines, predominantly HeLa and GM06990, and were different from the CAGE/PET data, meaning that tissue specificity contributes to the presence of unique TSSs and regulatory clusters. The large increase in promoter-proximal regulatory clusters identified by including the additional novel TSSs, coupled with the positive promoter and RACE assays, suggests that most of the regulatory regions identifiable by these clustering methods represent bona fide promoters (see Supplementary Information 3.16). Although the regulatory factor assays were more biased towards regions associated with promoters, many of the sites from these experiments would have previously been described as distal to promoters. This suggests that commonplace use of RefSeq- or Ensembl-based gene definitions to define promoter proximity will dramatically overestimate the number of distal sites.

Predicting TSSs and transcriptional activity on the basis of chromatin structure. The strong association between TSSs and both histone modifications and DHSs prompted us to investigate whether the location and activity of TSSs could be predicted solely on the basis of chromatin structure information. We trained a support vector machine (SVM) by using histone modification data anchored around DHSs to discriminate between DHSs near TSSs and those distant from TSSs. We used 2,573 selected DHSs, split roughly between TSS-proximal DHSs and TSS-distal DHSs, as a training set. The SVM performed well, with an accuracy of 83% (see Supplementary Information section 3.17). Using this SVM, we then predicted new TSSs using information about DHSs and histone modifications: of 110 high-scoring predicted TSSs, 81 resided within 2.5 kb of a novel TSS.
As expected, these overlap significantly with the novel TSS groups (defined above) but without a strong bias towards any particular category (see Supplementary Information section 3.17.1.5).

To investigate the relationship between chromatin structure and gene expression, we examined transcript levels in two cell lines using a transcript-tiling array. We compared these transcript data with the results of ChIP-chip experiments that measured histone modifications across the ENCODE regions. From this, we developed a variety of predictors of expression status using chromatin modifications as variables; these were derived using both decision trees and SVMs (see Supplementary Information section 3.17). The best of these correctly predicts expression status (transcribed versus non-transcribed) in 91% of cases. This success rate did not decrease dramatically when the predicting algorithm incorporated the results from one cell line to predict the expression status of another cell line. Interestingly, despite

Figure 6 | Distribution of RFBRs relative to GENCODE TSSs. Different RFBRs from sequence-specific factors (red) or general factors (blue) are plotted showing their relative distribution near TSSs. The x axis indicates the proportion of TSSs close (within 2.5 kb) to the specified factor. The y axis indicates the proportion of RFBRs close to TSSs. The size of the circle provides an indication of the number of RFBRs for each factor. A handful of representative factors are labelled. [Plot: x axis, fraction of TSSs near RFBRs (0 to 0.3); y axis, fraction of RFBRs near TSSs (0 to 0.8); circle-size classes >200, >100, >50, >25 and ≤25 RFBRs; labelled factors: E2F1, Pol II, TAF1, MYC, CTCF, SIRT1, SPI1, H3K27me3, STAT1, SMARCC1, SMARCC2, H3K4me2, H3K4me3, H3K4me1.]
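
The two axes of Figure 6 are proximity fractions, and the same 2.5 kb criterion is applied in the text to regulatory clusters and to predicted TSSs. The Python sketch below shows one way such fractions could be computed. It is an illustration only and not the analysis pipeline used in this study: the function names, the toy coordinates, the restriction to a single chromosome and the representation of each RFBR by a single midpoint are all assumptions made for the example.

```python
from bisect import bisect_left

WINDOW = 2500  # 2.5 kb, the proximity cut-off quoted in the text


def near_any(position, sorted_sites, window=WINDOW):
    """Return True if any site lies within `window` bp of `position`.

    `sorted_sites` must be sorted in ascending order (assumption of this sketch).
    """
    # First site at or beyond the left edge of the window.
    i = bisect_left(sorted_sites, position - window)
    # It counts only if it does not overshoot the right edge.
    return i < len(sorted_sites) and sorted_sites[i] <= position + window


def figure6_coordinates(tss_positions, rfbr_midpoints, window=WINDOW):
    """Return (fraction of TSSs near an RFBR, fraction of RFBRs near a TSS).

    Both inputs are non-empty lists of coordinates on one chromosome
    (hypothetical simplification; real data would be grouped by chromosome).
    """
    tss = sorted(tss_positions)
    rfbr = sorted(rfbr_midpoints)
    x = sum(near_any(t, rfbr, window) for t in tss) / len(tss)
    y = sum(near_any(r, tss, window) for r in rfbr) / len(rfbr)
    return x, y


# Toy coordinates, invented for illustration (not ENCODE data).
toy_tss = [10_000, 50_000, 90_000, 130_000]
toy_rfbr = [9_000, 51_500, 70_000]

x, y = figure6_coordinates(toy_tss, toy_rfbr)
print(f"fraction of TSSs near RFBRs: {x:.2f}")
print(f"fraction of RFBRs near TSSs: {y:.2f}")
```

With these toy coordinates, two of the four TSSs lie within 2.5 kb of an RFBR (x = 0.50) and two of the three RFBRs lie within 2.5 kb of a TSS (y ≈ 0.67). Sorting once and using binary search keeps each lookup logarithmic in the number of sites.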