RESEARCH ARTICLE

Additionally, it uses reference sets of predefined essential and non-essential genes30. However, to avoid their status (essential or non-essential) being defined a priori, we removed any high-confidence cancer driver genes as defined previously7 from these sets. The resulting curated reference gene sets are available as built-in data objects in the R implementation of BAGEL (curated_BAGEL_essential.rdata and curated_BAGEL_nonEssential.rdata, both available at https://github.com/francescojm/BAGELR/tree/master/data). A statistical significance threshold for gene-level Bayesian factors was determined for each cell line as described previously8. Each gene was assigned a scaled Bayesian factor, computed by subtracting the Bayesian factor at the 5% FDR threshold defined for each cell line from the original Bayesian factor, and a binary fitness score equal to 1 if the resulting scaled Bayesian factor was greater than 0 (and 0 otherwise). Further details on these analyses are included in the Supplementary Information. In addition, CRISPRcleanR-corrected sgRNA treatment counts were derived from the corrected sgRNA-level count fold changes (using the ccr.correctCounts function of CRISPRcleanR) and used as input into MAGeCK35 to compute the depletion significance using mean–variance modelling.
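The scaling of the Bayesian factors described above can be illustrated with a minimal Python sketch. It assumes that the Bayesian factor at the 5% FDR threshold has already been determined for the cell line; the function name, gene labels and numeric values are hypothetical and are not taken from BAGEL or from the study data.

```python
def scale_bayesian_factors(bayesian_factors, bf_at_5pct_fdr):
    """Return (scaled, binary) dictionaries for a single cell line.

    bayesian_factors: mapping gene -> original Bayesian factor.
    bf_at_5pct_fdr:   Bayesian factor at the 5% FDR threshold for this
                      cell line (assumed to be precomputed elsewhere).
    """
    # Scaled Bayesian factor: original value minus the per-cell-line threshold.
    scaled = {gene: bf - bf_at_5pct_fdr for gene, bf in bayesian_factors.items()}
    # Binary fitness score: 1 if the scaled value is greater than 0, else 0.
    binary = {gene: int(value > 0) for gene, value in scaled.items()}
    return scaled, binary


# Hypothetical Bayesian factors for three genes in one cell line.
example_bfs = {"GENE_A": 12.5, "GENE_B": -3.0, "GENE_C": 4.0}
scaled, binary = scale_bayesian_factors(example_bfs, bf_at_5pct_fdr=4.0)
print(scaled)  # {'GENE_A': 8.5, 'GENE_B': -7.0, 'GENE_C': 0.0}
print(binary)  # {'GENE_A': 1, 'GENE_B': 0, 'GENE_C': 0}
```

Because the threshold is subtracted separately for each cell line, a scaled value above 0 marks a gene as significantly depleted at 5% FDR in that cell line, which makes the scaled scores comparable across cell lines.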
The MAGeCK analysis was performed using the MAGeCK Python package (version 0.5.3), specifying in the command-line call that no normalization was required (as this was already performed by CRISPRcleanR). At the end of this stage, the following gene-level depletion score matrices were produced for each cell line: raw count fold changes, copy number bias-corrected count fold changes, Bayesian factors, scaled Bayesian factors, binary fitness scores and MAGeCK depletion FDRs. All scores are summarized for each cell line and available at https://cog.sanger.ac.uk/cmp/download/essentiality_matrices.zip, together with all the sgRNA raw count files (available at https://cog.sanger.ac.uk/cmp/download/raw_sgrnas_counts.zip).

High-level CRISPR screen data analyses.

Adaptive daisy model (ADaM) to identify core fitness genes. We designed the adaptive daisy model (ADaM), a heuristic algorithm for the identification of core fitness genes, implemented it in an R package and made it publicly available at https://github.com/francescojm/ADaM. ADaM is based on the daisy model8, but it adaptively determines the minimal number of cell lines m from a given cancer type in which a gene should exert a significant fitness effect for that gene to be considered a core fitness gene for that cancer type. ADaM is described further in the Supplementary Information. To identify pan-cancer core fitness genes, we applied the same method to determine the minimal number k of cancer types in which a gene should be predicted to be a core fitness gene for it to be considered a pan-cancer core fitness gene.

Characterization of ADaM pan-cancer core fitness genes. Reference sets of essential and non-essential genes were extracted from a previously published study30. Other reference gene sets (used while characterizing the ADaM pan-cancer core fitness genes, described below) were derived from the Molecular Signatures Database (MSigDB36) and post-processed as described previously32.
A more recent set of a priori known essential genes was derived from a previously published study9. The pan-cancer core fitness genes that did not belong to any of the aforementioned gene sets were tested for gene family enrichments (using a hypergeometric test), with gene annotations derived using the BioMart R package37, and for biological pathway enrichments using a comprehensive collection of pathway gene sets from Pathway Commons38 (post-processed to reduce redundancies across different sets as described previously39). All enrichment P values were corrected using the Benjamini–Hochberg method. Results are shown in Supplementary Table 4.

Comparison between the ADaM pan-cancer core fitness genes and other reference sets of essential genes. We compared the pan-cancer core fitness genes identified by ADaM with the BAGEL reference set of essential genes30 and a more recently proposed larger set of essential genes9 in terms of size, estimated precision (number of included true-positive genes/number of included genes) and recall (number of included true-positive genes/total number of true-positive genes). In these comparisons, we used gold-standard essential genes involved in cell essential processes (downloaded from the MSigDB36 and post-processed as described previously32). In addition, we estimated FDRs for the three gene sets (number of included false-positive genes/total number of false-positive genes), considering genes predicted to be strongly context-specific essential (thus not core-fitness essential) according to a previous publication12 to be false-positive genes, and using three different confidence levels, as further described in the Supplementary Information.

Basal expression of cancer-type specific core fitness genes in normal tissues. Median basal gene expression values (reads per kilobase of transcript per million mapped reads) in normal human tissues were downloaded from the GTEx Portal40, log-transformed and quantile-normalized on a tissue-type basis.
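The precision, recall and false-positive ratio used in the gene set comparison above reduce to simple set intersections. The following Python sketch illustrates the three ratios exactly as they are worded in the text; the function name, the key names and the gene labels are hypothetical and serve only as an example.

```python
def gene_set_metrics(candidate, true_positives, false_positives):
    """Compare a candidate essential-gene set with reference sets.

    candidate:       genes included in the set under evaluation.
    true_positives:  gold-standard essential genes.
    false_positives: genes assumed to be context-specific essential.
    """
    candidate = set(candidate)
    included_tp = candidate & set(true_positives)
    included_fp = candidate & set(false_positives)
    return {
        "size": len(candidate),
        # included true-positive genes / included genes
        "precision": len(included_tp) / len(candidate),
        # included true-positive genes / total true-positive genes
        "recall": len(included_tp) / len(set(true_positives)),
        # included false-positive genes / total false-positive genes
        "false_positive_ratio": len(included_fp) / len(set(false_positives)),
    }


# Hypothetical gene labels, for illustration only.
metrics = gene_set_metrics(
    candidate={"G1", "G2", "G3", "G4"},
    true_positives={"G1", "G2", "G5"},
    false_positives={"G4", "G6"},
)
print(metrics)
```

In this toy example two of the four candidate genes are gold-standard essential genes (precision 0.5), two of the three gold-standard genes are recovered (recall of about 0.67) and one of the two context-specific genes is included (ratio 0.5). The sketch assumes that all three input sets are non-empty.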
Statistical and computational analyses.

ANOVA to identify genomic correlates with gene fitness. We performed a systematic ANOVA to test associations between gene-level fitness effects and the presence of 484 cancer driver events (CDEs; 151 single-nucleotide variants and 333 copy number variants)7 or MSI status at the pan-cancer as well as individual cancer-type levels. In total, 10 cancer types with at least 10 screened cell lines were analysed (breast carcinoma, colorectal carcinoma, gastric carcinoma, head and neck carcinoma, lung adenocarcinoma, neuroblastoma, oral cavity carcinoma, ovarian carcinoma, pancreatic carcinoma and squamous cell lung carcinoma). The remaining cancer types were collapsed on a tissue basis (annotation in Supplementary Table 1) and the resulting tissues with at least 10 cell lines were included in the analysis (bone, central nervous system, oesophagus, haematopoietic and lymphoid). A total of 14 analyses (referred to, for simplicity, as cancer-type-specific ANOVAs in the main text and below) plus a pan-cancer analysis including all screened cell lines were performed. Each ANOVA was performed using the analytical framework described previously7 and implemented in a Python package41 (https://github.com/CancerRxGene/gdsctools). Only genes that did not belong to any set of prior known essential genes (defined in the previous sections) and that were not predicted by ADaM to be core fitness genes were included in the analyses. For all tested gene fitness–CDE associations, effect sizes relative to the pooled s.d. (quantified using Cohen’s d), effect sizes relative to the individual s.d. (quantified using two different Glass’s Δ metrics, for the CDE-positive and the CDE-negative populations separately), CDE P values and all other statistical scores were obtained from the fitted models.
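The effect size measures named above can be illustrated with a minimal Python sketch. It uses the textbook definitions with sample standard deviations (n − 1 denominator) and a signed mean difference (CDE-positive minus CDE-negative). These conventions are assumptions and may differ from those in gdsctools, and the function name and fold-change values are hypothetical.

```python
import math
import statistics


def effect_sizes(cde_positive, cde_negative):
    """Cohen's d and the two Glass's deltas for one gene fitness-CDE pair.

    Each argument is a list of gene-level fitness effects (for example,
    depletion fold changes) for the corresponding group of cell lines.
    """
    mean_diff = statistics.mean(cde_positive) - statistics.mean(cde_negative)
    sd_pos = statistics.stdev(cde_positive)  # sample s.d. of the CDE-positive group
    sd_neg = statistics.stdev(cde_negative)  # sample s.d. of the CDE-negative group
    n_pos, n_neg = len(cde_positive), len(cde_negative)
    # Pooled s.d.: variances weighted by their degrees of freedom.
    pooled_sd = math.sqrt(
        ((n_pos - 1) * sd_pos ** 2 + (n_neg - 1) * sd_neg ** 2)
        / (n_pos + n_neg - 2)
    )
    return {
        "cohens_d": mean_diff / pooled_sd,
        "glass_delta_positive": mean_diff / sd_pos,
        "glass_delta_negative": mean_diff / sd_neg,
    }


# Hypothetical fold changes: three CDE-positive and four CDE-negative cell lines.
sizes = effect_sizes([-2.0, -1.5, -2.5], [-0.5, 0.0, 0.5, 0.0])
print(sizes)
```

Cohen’s d scales the mean difference by a single pooled spread, whereas each Glass’s Δ scales it by the spread of one group only, so the two Δ values can diverge when the two populations have very different variances.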
An association was tested only if each of the two sets resulting from the dichotomy induced by CDE status contained at least three cell lines (that is, at least three CDE-positive and three CDE-negative cell lines). The P values from all ANOVAs were corrected together using the Tibshirani–Storey method42. Subsequently, MSI status was also tested for statistical associations with differential gene fitness effects at the pan-cancer level and for cancer types with at least three MSI cell lines. We used the following statistical significance and effect size thresholds to categorize associations between gene fitness effects and genomic markers. Class A marker: a P-value threshold of 10⁻³ with an FDR threshold equal to 25% (or 5% for MSI) and with Glass’s Δ > 1. Different FDR thresholds were used for associations with CDEs or MSI because the number of tests performed in the former was six orders of magnitude larger than in the latter. Class B marker: an FDR threshold of 30% with at least one Glass’s Δ > 1 for pan-cancer associations. Class C marker or weaker: an ANOVA P-value threshold of 10⁻³ and, for pan-cancer associations, at least one Glass’s Δ > 1; for weaker markers, a simple Student’s t-test (assessing the difference in mean depletion fold change between CDE-positive and CDE-negative cell lines) P-value threshold of 0.05 and, for pan-cancer associations, at least one Glass’s Δ > 1. The additional constraint on Glass’s Δ values (quantifying the effect size with respect to the s.d. of the two involved sub-populations of samples) was applied to the pan-cancer markers to account for the substantially larger number of samples analysed in the pan-cancer setting, which might result in highly significant P values even for associations with a small effect size. Further details on this analysis are reported in the Supplementary Information.

Target priority scores and target tractability.
Computation of the target priority scores and their significance is described in the Supplementary Information. To estimate the likelihood that a target binds a small molecule or is accessible to an antibody, we made use of a genome-wide target tractability assessment pipeline14. The in silico pipeline integrates data from public sources and assigns human protein-coding genes to hierarchical qualitative buckets. Predicted tractability and confidence in the data increased from bucket 10 to bucket 1; targets in bucket 1 were considered to be the most tractable. Of note, targets in lower buckets (that is, buckets 10 to 8) were considered to have an uncertain tractability, and should not be ruled out as ‘intractable’ without a deep tractability assessment. Further details are provided in the Supplementary Information.

Characterization of target protein families and enrichment analysis. To characterize protein families and compute statistical enrichment, we made use of the Panther online tool43.

GPX4 differential expression analysis. RNA-sequencing gene expression measurements transformed using voom44 were obtained from a previously published study45. For GPX4 analysis, cell lines were divided into two groups according to their loss-of-fitness response to GPX4 knockout (using BAGEL FDR < 5% as the significance threshold for gene depletion) and gene expression fold changes were calculated between the GPX4 non-dependent and dependent cell lines (log2 values of the mean difference). Differential gene expression was statistically assessed using the R package Limma46. Gene set enrichment analysis was performed with ssGSEA36 and cancer hallmark gene sets were used to identify significant enrichment among the top differentially expressed genes. Then, 10,000 random permutations were performed for each signature to calculate empirical P values and a Benjamini–Hochberg FDR correction was applied.

WRN dependency in MSI cell lines.
Co-competition assay. The sequences of sgRNAs that target WRN and cell lines used in validation experiments are described in Supplementary Table 10. This included two sgRNAs from the original screen and two independent sgRNAs. The sgRNAs were cloned into pKLV2-U6gRNA5(BbsI)-PGKpuro2ABFP-W (Addgene, 67974). Cell lines were transduced at around 50% efficiency as described above in six-well plates. A co-competition