To compare our analysis to an independent source, we downloaded the raw human and mouse TF–target interactions from the TRRUST database. Mouse genes and TFs were mapped to HGNC-approved symbols, and the human and mouse data were combined. TF–target interactions with unknown directionality were discarded. TFs that had at least 20 targets with known directionality were retained for the analysis.
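The filtering step above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the tuple layout, the mode labels `Activation`, `Repression` and `Unknown`, and the function name `filter_by_known_targets` are assumptions, and the mouse-to-HGNC symbol mapping is taken to have been applied already.

```python
from collections import defaultdict

# Assumed labels for interactions whose direction is known.
KNOWN_MODES = {"Activation", "Repression"}


def filter_by_known_targets(interactions, min_targets=20):
    """Keep TFs with at least `min_targets` distinct targets of known direction.

    `interactions` is an iterable of (tf, target, mode) tuples
    (hypothetical layout for a combined human + mouse table).
    """
    targets_by_tf = defaultdict(lambda: defaultdict(set))
    for tf, target, mode in interactions:
        if mode in KNOWN_MODES:          # discard unknown directionality
            targets_by_tf[tf][target].add(mode)
    return {
        tf: {target: sorted(modes) for target, modes in targets.items()}
        for tf, targets in targets_by_tf.items()
        if len(targets) >= min_targets   # count distinct targets per TF
    }
```

A target reported as both activated and repressed by the same TF is counted once here; whether the original analysis did the same is not stated in the text.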
The server-side of ChEA3 was written in Java and runs on Tomcat 9. Java servlets process gene list submissions from the front end. The user interface of ChEA3 is implemented with jQuery ( 29 ), the templating application Mobirise 4.8.1, and Bootstrap v4 ( 30 ). The interactive TF network visualization is implemented with D3.js v4 ( 31 ). The front and back end components are compiled and assembled together into a JAR file. The web application is running in a Docker container ( 32 ) and the Docker image is deposited in Docker Hub ( https://hub.docker.com/r/maayanlab/chea3 ). ChEA3 also provides API access to the service. The results from the API are returned in JavaScript Object Notation (JSON) format. The complete ChEA3 web service code is available on GitHub at https://github.com/maayanlab/chea3web .
To create an interactive global view of the human TF regulatory network, Weighted Gene Co-expression Network Analysis (WGCNA) ( 33 ) was applied on GTEx ( 12 ), ARCHS4 ( 20 ) and TCGA expression data. The quantile-normalized GTEx gene expression dataset was filtered to only include TFs. WGCNA was applied on the reduced TF GTEx matrix using the WGCNA R package with default parameters. Similarly, 100 random RNA-seq samples for each of 18 tissue types were pulled from the ARCHS4 database and were quantile normalized. The expression dataset was filtered to include only TFs, and WGCNA was applied with default parameters. To generate the TCGA network, TCGA primary tumor samples were randomly sampled such that we obtained a set of 26 cancer types with 100 samples for each type. The expression dataset was quantile-normalized, filtered to include only TFs, and WGCNA was applied with default parameters. The three resulting networks were visualized using Cytoscape ( 34 ) with the Allegro Edge-Repulsive Strong Clustering plugin. Node positions were exported from Cytoscape and visualized on the ChEA3 results page using D3.js.
To annotate the GTEx network, module eigengenes were correlated to GTEx tissue sample labels. Nodes were colored by the most significant tissue correlation to their parent module. GO Biological Pathway enrichment was conducted on the network-module-gene-members using the topGO R package ( 35 ) with the set of TFs as the background gene universe. Nodes were colored by the most significant result from this enrichment analysis. To annotate the TCGA network, module eigengenes were correlated to TCGA tumor sample types. Nodes were colored by the most significant tumor correlation to their parent module. To annotate the ARCHS4 network, module eigengenes were correlated to ARCHS4 tissue sample labels. Nodes were colored by the most significant tissue correlation to their parent module.
A transcription factor co-regulatory network was constructed from all TF–TF interactions described by the six ChEA3 primary libraries. Edges that were supported by evidence from two or more different libraries were retained in the network. Edges are directed where ChIP-seq evidence supports the interaction and are undirected in the case of co-occurrence or co-expression evidence only. The network is subset based on the top TF results from a user query and is visualized using D3.js.
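One way to read the edge-retention rule above is sketched below. The library labels in `CHIP_LIBRARIES`, the input layout, and the choice to count support on unordered TF pairs are assumptions made for illustration; the text does not specify these details.

```python
from collections import defaultdict

# Assumed labels for the libraries that provide ChIP-seq evidence.
CHIP_LIBRARIES = {"ENCODE", "ReMap", "Literature"}


def build_coreg_network(library_edges, min_support=2):
    """Return sorted (source, target, kind) edges seen in >= min_support libraries.

    `library_edges` maps a library name to an iterable of (source_tf, target_tf).
    """
    support = defaultdict(set)    # unordered TF pair -> supporting libraries
    chip_dirs = defaultdict(set)  # unordered TF pair -> directions seen in ChIP-seq
    for library, edges in library_edges.items():
        for src, dst in edges:
            if src == dst:
                continue          # ignore self-loops in this sketch
            pair = frozenset((src, dst))
            support[pair].add(library)
            if library in CHIP_LIBRARIES:
                chip_dirs[pair].add((src, dst))

    network = []
    for pair, libraries in support.items():
        if len(libraries) < min_support:
            continue
        if pair in chip_dirs:     # ChIP-seq evidence gives the edge a direction
            network.extend((s, d, "directed") for s, d in chip_dirs[pair])
        else:                     # co-expression or co-occurrence only
            a, b = sorted(pair)
            network.append((a, b, "undirected"))
    return sorted(network)
```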
From the results of each query, a binary matrix with the top 5 TFs returned by each library on the columns and query genes on the rows is populated according to whether the query gene appears within the target gene set of the library TF. This matrix is submitted to the Clustergrammer API ( 36 ) which returns a URL to an interactive clustergram of the matrix. This URL is displayed in an iframe as part of the ChEA3 results visualizations.
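The binary matrix described above could be assembled along these lines. The function name and input layout are hypothetical, and the call to the Clustergrammer API is deliberately left out because its request format is not given in the text.

```python
def membership_matrix(query_genes, top_tfs, target_sets):
    """Build a genes-by-TFs 0/1 matrix.

    query_genes: list of gene symbols (rows).
    top_tfs:     list of TF labels, e.g. the top 5 from each library (columns).
    target_sets: dict mapping each TF label to its set of target genes.
    """
    rows = list(query_genes)
    cols = list(top_tfs)
    matrix = [
        [1 if gene in target_sets.get(tf, set()) else 0 for tf in cols]
        for gene in rows
    ]
    return rows, cols, matrix
```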
ChEA3 performs TF target overrepresentation analysis against six TF target set libraries covering 1632 unique TFs (Table 1). Site-specific DNA-binding TFs were included in ChEA3 as defined in the seminal publication by Lambert et al. ( 1 ). Non-specific transcription factors, cofactors, and chromatin modifiers are excluded. Genes that are highly co-expressed with transcription factors were pulled from the GTEx ( 12 ) RNA-seq data, and from the uniformly reprocessed GEO RNA-seq data from ARCHS4 ( 20 ). Uniformly reprocessed publicly available ChIP-seq data were collected from ENCODE ( 22 ) and ReMap ( 26 ). ChIP-seq data were also manually mined from the literature by curating target lists from supporting materials of individual TF studies as an expansion of the work done for ChEA and ChEA2 ( 16 , 17 ). Finally, we used a wisdom-of-the-crowd approach to identify TF targets by mining user-submitted queries to the web tool Enrichr ( 23 , 24 ) to identify genes that frequently co-occur in submitted gene list queries with all human TFs. ChEA3 uses the pairwise Fisher's exact test (FET) to compare a user-submitted gene set query to each gene set in each ChEA3 library. Results are returned for each library separately as a list of TFs ranked by their FET P-value. TFs are given both an integer rank, with 1 corresponding to the most significant matching TF-associated gene set, and a scaled rank from 1/n to 1, where n is the number of unique TFs in the library.
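The ranking scheme above can be illustrated with a small standard-library sketch. It assumes a one-sided (overrepresentation) FET computed as a hypergeometric tail, one gene set per TF, and a hypothetical background of 20,000 genes; none of these choices is specified in the text, so treat it as an illustration rather than the ChEA3 implementation.

```python
from math import comb


def fisher_right_tail(overlap, query_size, set_size, universe):
    """One-sided FET P-value: P(X >= overlap) under the hypergeometric null."""
    upper = min(query_size, set_size)
    numerator = sum(
        comb(set_size, k) * comb(universe - set_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return numerator / comb(universe, query_size)


def rank_library(query, library, universe=20000):
    """Rank the TFs of one library by FET P-value.

    library: dict mapping TF -> set of target genes (one set per TF, assumed).
    universe: assumed background size; not taken from the paper.
    """
    query = set(query)
    scored = []
    for tf, targets in library.items():
        targets = set(targets)
        p = fisher_right_tail(len(query & targets), len(query), len(targets), universe)
        scored.append((p, tf))
    scored.sort()                 # most significant first; ties broken by TF name
    n = len(scored)
    return {
        tf: {"p_value": p, "rank": i, "scaled_rank": i / n}
        for i, (p, tf) in enumerate(scored, start=1)
    }
```

With this definition the top TF in a library of n TFs gets integer rank 1 and scaled rank 1/n, and the last gets scaled rank 1, which matches the range given above.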
Table 1. Transcription factor target gene set libraries included in ChEA3. The TF coverage heatmap spans the 1634 human site-specific TFs in ( 1 ) with 1632 of those factors covered by ChEA3.
Library | Unique TFs | Unique TF Interactions | Gene Sets |
---|---|---|---|
ARCHS4 Coexpression | 1628 | 480 504 | 1628 human |
ENCODE ChIP-seq | 118 | 392 667 | 552 (470 human, 82 mouse) |
Enrichr Queries | 1404 | 409 279 | 1404 (unknown species) |
GTEx Coexpression | 1607 | 468 672 | 1607 human |
Literature ChIP-seq | 164 | 340 547 | 307 (138 human, 164 mouse, 5 rat) |
ReMap ChIP-seq | 297 | 417 025 | 297 human |
We hypothesized that integrating TF target overrepresentation results from across libraries to generate a composite ranking of TFs might overcome unique biases associated with each library and improve the predictive performance of ChEA3. To this end, we developed two integration techniques: MeanRank and TopRank. For each of the 1632 TFs covered by ChEA3, we take the mean of the integer ranks across all libraries and re-rank based on this mean to generate a composite ranking that we term MeanRank. For TopRank, we take the maximum scaled rank assigned to each TF across all libraries and re-rank to generate a composite ranking. By benchmarking the quality of the ranking using an independent TF–target dataset, we demonstrate that the MeanRank and TopRank approaches indeed outperform the original six TF–target libraries using the benchmarking strategy described below.
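A compact sketch of the two integration schemes follows. Two assumptions are worth flagging: a TF absent from a library is simply left out of that library's contribution to its mean, and the "maximum scaled rank" in TopRank is read as the best scaled rank, i.e. the numerically smallest value, since 1/n marks the most significant TF. Neither detail is confirmed by the text, and the function name is hypothetical.

```python
from collections import defaultdict


def composite_rankings(library_rankings):
    """Return (mean_rank_order, top_rank_order) as lists of TFs, best first.

    library_rankings: dict mapping library name -> list of TFs ordered from
    most to least significant within that library.
    """
    integer_ranks = defaultdict(list)
    scaled_ranks = defaultdict(list)
    for ordered_tfs in library_rankings.values():
        n = len(ordered_tfs)
        for i, tf in enumerate(ordered_tfs, start=1):
            integer_ranks[tf].append(i)
            scaled_ranks[tf].append(i / n)

    # MeanRank: average integer rank over the libraries that cover the TF.
    mean_score = {tf: sum(r) / len(r) for tf, r in integer_ranks.items()}
    # TopRank: best (smallest) scaled rank across libraries; see assumption above.
    top_score = {tf: min(r) for tf, r in scaled_ranks.items()}

    mean_order = sorted(mean_score, key=lambda tf: (mean_score[tf], tf))
    top_order = sorted(top_score, key=lambda tf: (top_score[tf], tf))
    return mean_order, top_order
```

Scaled ranks matter for TopRank because the libraries differ widely in size: rank 5 of 118 TFs is not comparable to rank 5 of 1628 without normalization.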
To benchmark the predictive performance of ChEA3, gene expression signatures were extracted from 946 single-TF loss-of-function (LOF) and gain-of-function (GOF) human and mouse experiments from GEO. Relevant studies were identified, then control and perturbation samples were tagged, and then signatures were extracted using a uniform pipeline. This was first achieved for microarray studies by contributors to a microtask crowdsourcing project ( 18 ) and then for RNA-seq utilizing the ARCHS4 resource ( 20 ). We generated up, down, and combined up/down gene sets from these signatures and queried the ChEA3 API with each gene set to determine how well each ChEA3 library recovers the perturbed TF. ROC and PR curves were generated from the rankings of the experimentally perturbed TFs, which compose the positive class, and sampled rankings of the unperturbed TFs, which compose the negative class (Figure 1A–C). We also compared the empirical cumulative probability density (ECPD) of the ranks of the positive class, D(r), for each library to the ECPD of a uniform rank distribution, r, by plotting the difference D(r) – r (Figure 1D). A greater deviation from zero indicates better recovery of the perturbed TFs. Integrating results across libraries yielded improved predictive performance by multiple metrics that assess the global distribution of ranks. By these metrics, the MeanRank approach performs the best. Interestingly, the Enrichr ‘wisdom of the crowd’ library displays the best performance of the six ChEA3 TF target libraries.
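The deviation D(r) – r described above can be computed as in the sketch below, where D(r) is the fraction of perturbed TFs whose scaled rank is at most r. The evaluation grid and the function name are illustrative choices, not taken from the paper.

```python
from bisect import bisect_right


def ecdf_deviation(scaled_ranks, grid_size=100):
    """Return (r, D(r) - r) pairs on an evenly spaced grid over [0, 1].

    scaled_ranks: scaled ranks (in (0, 1]) of the perturbed TF, one per
    benchmark query.
    """
    ranks = sorted(scaled_ranks)
    n = len(ranks)
    curve = []
    for step in range(grid_size + 1):
        r = step / grid_size
        d_r = bisect_right(ranks, r) / n   # empirical CDF evaluated at r
        curve.append((r, d_r - r))         # uniformly distributed ranks give ~0
    return curve
```

If a library ranked the perturbed TFs no better than chance, the scaled ranks would be roughly uniform and the curve would stay near zero; a curve rising well above zero at small r means the perturbed TFs concentrate near the top of the ranking.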
Figure 1. Performance of the ChEA3 libraries and integration techniques in recovering the perturbed TFs from 946 TF LOF and GOF experiments from the TFpertGEOupdn benchmark dataset. ( A ) Mean ROC AUC and mean PR AUC over 5000 bootstrapped ROC and PR curves; ( B ) composite ROC curves generated from 5000 bootstrapped curves; ( C ) composite PR curves generated from 5000 bootstrapped curves; ( D ) the deviation of the cumulative distribution from uniform of the scaled rankings of each perturbed TF in the benchmarking dataset. Anderson–Darling test of uniformity: MeanRank P = 6.34 × 10^−7; TopRank P = 6.34 × 10^−7; ARCHS4 P = 6.34 × 10^−7; ENCODE P = 2.06 × 10^−6; Enrichr Queries P = 6.83 × 10^−7; GTEx P = 6.45 × 10^−7; Literature ChIP-seq P = 1.28 × 10^−6; ReMap P = 1.02 × 10^−6.
Arguably, a ChEA3 user is interested only in the top-ranked TFs returned by the tool. Therefore, for each rank percentile, we examined the fraction of the benchmarking dataset TF perturbations that are recovered. We computed this fraction in two ways: one where we considered the entire benchmarking dataset, and one where we considered only the subset of the benchmarking dataset where the perturbed TFs are covered by the library (Supplementary Figure S1). When we examined the fraction of the benchmarking subset recovered in the top percentile of the TF ranks, we observed that the integrated libraries perform comparably to the ChIP-seq libraries, but with much greater TF coverage (Figure 2, Supplementary Figure S2).
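The two recovery fractions can be sketched as follows, assuming each benchmark experiment is summarized by the scaled rank of its perturbed TF, with `None` marking a TF that the library does not cover. The input encoding and the 0.01 threshold for the "top percentile" are assumptions for illustration.

```python
def fraction_recovered(scaled_ranks, threshold=0.01):
    """Return (fraction_of_total, fraction_of_covered_subset).

    scaled_ranks: one entry per benchmark experiment; the scaled rank of the
    perturbed TF, or None if the library has no gene set for that TF.
    """
    covered = [r for r in scaled_ranks if r is not None]
    hits = sum(1 for r in covered if r <= threshold)
    total_fraction = hits / len(scaled_ranks) if scaled_ranks else 0.0
    subset_fraction = hits / len(covered) if covered else 0.0
    return total_fraction, subset_fraction
```

The gap between the two numbers reflects coverage: a small ChIP-seq library can score well on the covered subset while recovering few perturbed TFs overall.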
Figure 2. Fraction of the TFpertGEOupdn benchmarking dataset subset recovered in the top one percentile of rankings compared to the library TF coverage. ( A ) A heatmap visualizing transcription factor coverage for the ChEA3 libraries. ( B ) The fraction of the TFpertGEOupdn subset TFs recovered in the top percentile of ranks for each ChEA3 library. Only the TFpertGEOupdn gene sets where the perturbed TF was covered by the library were considered when computing the ‘Percent Subset Recovered’.
In our global assessment of the TF rankings returned by the ChEA3 libraries, the ChIP-seq libraries displayed the lowest performance. In contrast, in the subset of the benchmarking experiments where there is ChIP-seq data for the perturbed TF, the ChIP-seq libraries performed well in recovering the perturbed TFs (Figure 2B). This may reflect that some TF target sets are more amenable to determination by ChIP-seq analysis than others, which could be due to several factors. Others have reported a high rate of non-functional binding sites ( 2 ). Further, ‘hyper-ChIPable’ regions of the genome exist near highly expressed genes and these regions show binding of many TFs ( 37 ), possibly further diminishing the specificity of the TF target gene sets in these libraries. Some TFs and their target sets may be more or less tissue- or context-specific than others ( 38 ), and some TFs may regulate their targets more distally ( 39 ).
In order to assess aspects of a signature that could be contributing to better predictions, the benchmarking dataset was separated into four groups of TF perturbation gene sets. These four groups consist of: (i) upregulated sets of genes from TF GOF experiments (Figure 3A; Supplementary Figures S3A and S4A); (ii) downregulated sets of genes from TF GOF experiments (Figure 3B; Supplementary Figures S3B and S4B); (iii) upregulated sets of genes from TF LOF experiments (Figure 3C; Supplementary Figures S3C and S4C), and (iv) downregulated sets of genes from TF LOF experiments (Figure 3D; Supplementary Figures S3D and S4D). For queries where the perturbed upstream TF had a loss of function, the ChIP-seq libraries perform best when queried with the downregulated genes. Conversely, for signatures where the upstream TFs were over-expressed, the ChIP-seq libraries perform best when queried with the upregulated genes. This observed behavior aligns with the notion that most transcription factors are activators. The co-expression and co-occurrence libraries recover the upstream TFs comparably well across TF perturbation and query types. We also assessed human and mouse TF LOF/GOF experiment-associated gene sets separately using the human hsTFpertGEOupdn and mouse mmTFpertGEOupdn datasets and found comparable ChEA3 performance for both species (Supplementary Figures S5–S8). Finally, we assessed the effect of input size on predictive performance using the TFpertGEO200 and TFpertGEO1000 benchmark sets and found that the performance of ChEA3 is robust to a range of input gene set sizes (Supplementary Figures S9–S14).
Figure 3. Effect of input type on ChEA3 performance. The deviation of the cumulative distribution from uniform of the scaled rankings of perturbed TFs in the benchmarking dataset for: ( A ) TF overexpression or chemical activation experiments from TFpertGEOup; ( B ) TF overexpression or chemical activation experiments from TFpertGEOdn; ( C ) TF knockdown, knockout or chemical inactivation experiments from TFpertGEOup; and ( D ) TF knockdown, knockout or chemical inactivation experiments from TFpertGEOdn.
There are several existing tools that perform TF prioritization given gene sets or signatures as input (Table 2). Since these tools were built with human data, we used the human TF LOF and GOF experiments for benchmarking them. For tools that accept discrete gene sets as input, which include BART, TFEA.ChIP and MAGICACT, we used the 443 single TF GOF and LOF experiments from the hsTFpertGEOupdn benchmarking dataset. VIPER and DoRothEA v2 require full gene expression signatures as input. We benchmarked DoRothEA regulons and an ARACNe-AP regulon generated from GTEx data on the 443 full signatures in the hsTFpertGEOsig benchmarking dataset. These results were compared to ChEA3 benchmarked on the hsTFpertGEOupdn dataset (Figures 4 and 5, Supplementary Figure S15). We show that both ChEA3 integration strategies, MeanRank and TopRank, outperform all the tools we tested when benchmarked against the hsTFpertGEO benchmarking datasets and also have greater TF coverage.
Table 2. Summary of tools benchmarked against ChEA3
Tool | TF coverage (a) | Required input | Method | Data used to make predictions | Availability |
---|---|---|---|---|---|
TFEA.ChIP ( 9 ) | 271 | Gene set or sorted list of DEGs | FET or GSEA ( 15 ) | ENCODE ( 22 ) and GEO ( 19 ) ChIP-seq experiments | R package, Web Server https://www.iib.uam.es/TFEA.ChIP/ |
BART ( 8 ) | 273 | Gene set | Correlates cis-regulatory profile derived from query gene set with TF genomic binding profiles | DNase I hypersensitivity, TF ChIP-seq | Standalone Application, Web Server http://bartweb.uvasomrc.io |
VIPER ( 5 ) | 454 (b), 731 (c), 1607 (d) | Gene signature | aREA (analytic Rank-based Enrichment Analysis) ( 5 ) | ARACNe-generated gene regulatory network in same tissue type as query | R package |
DoRothEA v2 ( 7 ) | Reg. A: 163; Reg. B: 188; Reg. C: 313; Reg. D: 416; Reg. E: 1306; Top10Score: 1306 | Gene signature | aREA ( 5 ) | Literature, ReMap ChIP-seq ( 26 ), TF motif ( 47 , 48 ), GTEx co-expression ( 12 ) | R object for use with VIPER R package |
MAGICACT ( 11 ) | 109 | Gene set | Mann-Whitney test | ENCODE ( 22 ) ChIP-seq | Standalone Application |
(a) HGNC-mappable TFs that are considered site-specific TFs as defined by Lambert ( 1 ). Some tools contain additional general transcription factors, co-factors, or chromatin modifiers.
(b) Published B-cell regulatory network available in the bcellviper R package.
(c) ARACNe-AP built network using expression data from GSE50588 ( 2 , 28 ).
(d) ARACNe-AP built network using expression data from GTEx ( 12 ).
Figure 4. Comparison of available TF prediction tools with ChEA3 using the hsTFpertGEO benchmarking dataset. ( A ) Composite ROC curves generated from 5000 bootstrapped curves; ( B ) composite PR curves generated from 5000 bootstrapped curves; ( C ) the deviation of the cumulative distribution from uniform of the scaled rankings of each perturbed TF in the benchmarking dataset; Anderson–Darling test of uniformity: VIPER GTEx Regulon P = 1.39 × 10^−6; MAGICACT P = 6.58 × 10^−5; TFEA.ChIP P = 2.47 × 10^−6; BART P = 2.34 × 10^−6; DoRothEA Regulon A P = 2.39 × 10^−6; DoRothEA Regulon B P = 2.22 × 10^−6; DoRothEA Regulon C P = 1.92 × 10^−6; DoRothEA Regulon D P = 1.71 × 10^−6; DoRothEA Regulon E P = 1.46 × 10^−6; DoRothEA Regulon TOP10score P = 1.46 × 10^−6; ( D ) mean ROC AUC and mean PR AUC over 5000 bootstrapped ROC and PR curves for available TF prediction tools as compared with ChEA3 benchmarked with hsTFpertGEO.
Figure 5. Comparison of available TF prediction tools with ChEA3. ( A ) The percent of the perturbed TFs recovered by the tool in the top one percentile of ranks as compared to TF coverage of the tool. For the ‘Percent Subset Recovered’ metric, we consider only the subset of the hsTFpertGEO TF perturbation experiments where the TF is covered by the tool. ( B ) The percent of the perturbed TFs recovered by the tool in the top one percentile of ranks as compared to TF coverage of the tool. For the ‘Percent Total Recovered’ metric, we consider all 443 TF perturbation experiments in the hsTFpertGEO benchmarking datasets. ( C ) Mean AUROC over 5000 bootstrapped curves compared to tool TF coverage. ( D ) Mean AUPR over 5000 bootstrapped curves compared to tool TF coverage.
VIPER was designed for a gene regulatory network inferred from the same cell or tissue type as the query. Therefore, we generated signatures from 49 TF shRNA experiments applied to a B-lymphoblastoid cell line ( 2 ) as described in the methods. We tested two regulatory networks for VIPER: a published B-cell ARACNe regulatory network ( 40 , 41 ), and a network we built from all expression data described in the 49 TF shRNA study ( 2 ) using ARACNe-AP ( 28 ). We also derived discrete gene sets from the same TF shRNA dataset for input into ChEA3 for comparison (Supplementary Figures S16 and S17). We show that both ChEA3 integration strategies, MeanRank and TopRank, outperform both B-cell regulons tested. Notably, TF regulons derived from ARCHS4 and GTEx co-expression data, which include many disparate cell and tissue types, perform better on the benchmarking dataset than the VIPER A GM19238 B-cell regulon. It should be noted that the VIPER A regulon was derived from the same gene expression dataset that was also used to generate the benchmarking query signatures.
Integrating libraries of putative TF targets allows for global analyses of transcription factor activity. To understand the general repressing or activating characteristics of the TFs in the ChEA3 libraries, we computed odds ratios (ORs) using gene sets from the TFpertGEOupdn benchmarking dataset and the ChEA3 ChIP-seq libraries. For the odds ratio computation, we define the numerator as the number of genes that are both upregulated upon perturbation of the TF and are ChIP-seq targets of the TF divided by the number of upregulated genes that are not ChIP-seq targets. We define the denominator as the number of genes that are both downregulated on perturbation of the TF and are ChIP-seq targets of the TF, divided by the number of downregulated genes that are not ChIP-seq targets. If the numerator is greater, then the targets tend to be upregulated and the OR > 1. If the denominator is greater, then the targets tend to be downregulated and the OR < 1. We then consider whether the perturbation increased or decreased the activity of the TF. If the TF acts as an activator and the experiment was a GOF perturbation, then we expect the OR > 1. Conversely, if the TF acts as an activator and the experiment was a LOF perturbation, then we expect the OR < 1. If the TF acts as a repressor and the experiment was a GOF perturbation, then we expect the OR < 1. Finally, if the TF acts as a repressor and the experiment was a LOF perturbation, then we expect the OR > 1. Therefore, the negative log(OR) was considered for TF LOF experiments, while the positive log(OR) was considered for TF GOF experiments (Figure 6A, B). These values will be positive if the TF is an activator and negative if the TF is a repressor. We recover known activators and repressors in this analysis. For example, REST, a known repressor of neuronal genes in non-neuronal cell types, is shown to predominantly downregulate its targets.
CTCF is predominantly considered an insulator ( 42 ), but has also been shown to be an activator ( 43 ). In our analysis, CTCF appears to be predominantly activating its putative targets. This observation points to a potential role in tethering distant enhancers to their promoters ( 44 ). MYC, while predominantly shown to be an activating TF, also has significant ORs that suggest a repressor role in some contexts, which is supported by previous studies ( 45 ). We also compared our analysis to mouse and human TF–target interactions mined from literature in the TRRUST v2 reference database (Figure 6C) ( 46 ). The TRRUST database contains signed and directed connections mined from the literature. Overall, our automated analysis agreed with the trends observed from TRRUST.
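The odds-ratio computation and sign convention described above can be sketched as below. The function name and the decision to return `None` when any of the four counts is zero are assumptions; the text does not say how undefined ratios were handled, and the significance filter (P < 0.05) is not reproduced here.

```python
from math import log


def signed_log_odds_ratio(up_genes, down_genes, chip_targets, perturbation):
    """Return log(OR) for GOF or -log(OR) for LOF experiments.

    Positive values suggest an activator, negative values a repressor.
    Returns None when the OR is undefined (assumed handling).
    """
    up, down, targets = set(up_genes), set(down_genes), set(chip_targets)
    up_target, up_other = len(up & targets), len(up - targets)
    down_target, down_other = len(down & targets), len(down - targets)
    if min(up_target, up_other, down_target, down_other) == 0:
        return None

    # OR = (up targets / up non-targets) / (down targets / down non-targets)
    odds_ratio = (up_target / up_other) / (down_target / down_other)
    if perturbation == "GOF":
        return log(odds_ratio)
    if perturbation == "LOF":
        return -log(odds_ratio)
    raise ValueError("perturbation must be 'GOF' or 'LOF'")
```

For example, with three of four upregulated genes and one of four downregulated genes among the ChIP-seq targets, the OR is (3/1)/(1/3) = 9, so a GOF experiment yields log(9) > 0, consistent with an activator, while the same counts from a LOF experiment yield −log(9), consistent with a repressor.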
Figure 6. Scatterplots showing activating/repressing activity across TFs. Significant ORs ( P < 0.05) are plotted. For uniformity, when examining loss-of-function TF perturbations, we consider –log(OR), as this will be positive if the TF acts as an activator of its targets and negative if it acts as a repressor. Conversely, we consider log(OR) for gain-of-function perturbations, which will be positive if the TF is an activator and negative if the TF acts as a repressor. Red arrows indicate TFs discussed in the results. ( A ) ORs from gain-of-function TF perturbations; ( B ) ORs from loss-of-function TF perturbations; ( C ) TF–target interactions from the TRRUST v2 database. For each TF, the percent of activating TF–target interactions (red) or repressive TF–target interactions (blue) from the subset of TF–target interactions in TRRUST v2 for which directionality is available.
The ChEA3 landing page contains an input form for users to submit their gene list. Following submission of a gene set, searchable, sortable and exportable results tables appear for each of the six ChEA3 libraries, and for the two integration methods: MeanRank and TopRank. These tables appear in the order of how well the library, or the integration technique, performed in our benchmark. This is implemented to aid users with deciding which table is most relevant for hypotheses generation. We project the results from these tables onto three global edgeless TF co-expression networks, and also generate local TF co-regulatory networks for each library with the top TF results. A clustergram tab shows the overlapping query gene targets among the top library results ( 36 ), and a bar chart shows the contributions of each library to the top TF rankings from the MeanRank integration method. The global co-expression networks serve to provide context for the user about how the most enriched TFs fit within the larger TF co-regulatory network. The local TF co-regulatory networks contain directed and undirected edges to communicate how the top returned TFs may co-regulate one another. The clustergram provides visualization of consensus targets and enriched TFs across libraries. The networks and diagrams are exportable as publication-quality figures in vector graphics format. The ChEA3 landing page also contains information about the methods, benchmarking results, a brief tutorial, and example code for demonstrating how to submit queries through the ChEA3 API. These informational sections are accessible using the navigation bar at the top of the page, or by scrolling.
ChEA3 is a web server application that predicts TFs associated with user-submitted gene sets using data from multiple orthogonal omics sources. Other sources for TF–target association are also available from the ChEA3 site, and this collection is expected to continually grow. We benchmarked the performance of six primary libraries within ChEA3 and show that integrating enrichment analyses from multiple libraries improves the recovery of the ‘correct’ upstream TFs associated with a user gene set. This data integration approach highlights the strength of combining evidence from independent sources. Interestingly, the ‘wisdom-of-the-crowd’ gene set library created from the Enrichr queries outperformed all other libraries in the global analysis of rankings. This passive form of discovery, resulting from the usage of a bioinformatics tool, can be applied to other tasks, for example, gene function prediction.
We also show that ChEA3 outperforms other existing tools at TF ranking. Such a benchmarking approach should be challenged, tested by others, and used in future studies to compare similar tools. We also demonstrate how integrating data from two assay types, namely ChIP-seq and genome-wide mRNA expression, enables global analysis that can determine whether a TF is mostly an activator or a repressor. Integrating ChIP-seq and genome-wide mRNA expression from TF perturbation studies can be used to construct signed directed networks that can be further analyzed to better understand the topology of the human TF regulatory network. Overall, ChEA3 can guide many future experimental and computational studies that aim to explore gene expression regulatory mechanisms in mammalian cells.
Supplementary Data are available at NAR Online.
NIH [U54-HL127624 (LINCS-DCIC), U24-CA224260 (IDG-KMC), T32-GM062754 (Pharmacological Sciences Training Program), and OT3-OD025467 (NIH Data Commons)]. Funding for open access charge: NIH [U54-HL127624 (LINCS-DCIC)].
Conflict of interest statement. None declared.