Algorithm 1: Assign activity labels to chemical nodes for a target assay a

for each chemical node c in G do
    if ∃ an edge ( c, r, a ) s.t. the edge type of r is chemicalHasActiveAssay then
        ℓ( c ) ← 1
    else if ∃ an edge ( c, r, a ) s.t. the edge type of r is chemicalHasInactiveAssay then
        ℓ( c ) ← 0
    else
        ℓ( c ) is undefined ▹ No label available for node c
    end if
end for
G_a ← G − a ▹ Delete node a from the graph to prevent information leakage
return G_a
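As a concrete illustration, the labeling procedure of Algorithm 1 can be sketched in Python. The edge-triple representation and the helper name `label_chemicals` are assumptions for illustration, not part of the study's implementation:

```python
def label_chemicals(edges, assay):
    """Assign binary labels to chemicals based on their edges to `assay`,
    then drop all of the assay node's edges to prevent information leakage.

    `edges` is a list of (chemical, edge_type, assay) triples.
    Returns (labels, remaining_edges); chemicals with no edge to `assay`
    are simply absent from `labels` (their label is undefined).
    """
    labels = {}
    for c, r, a in edges:
        if a != assay:
            continue
        if r == "chemicalHasActiveAssay":
            labels[c] = 1
        elif r == "chemicalHasInactiveAssay":
            labels[c] = 0
    # Delete the target assay node (all its incident edges) from the graph
    remaining = [(c, r, a) for (c, r, a) in edges if a != assay]
    return labels, remaining
```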

The resulting graph G a containing labeled chemicals is then used as input to the GCN, which we train to predict the correct labels. We use an 80%/20% train/test split on the labeled chemicals, optimize the GCN’s parameters using the Adam algorithm (a computationally efficient variant of stochastic gradient descent suitable for sparse gradients), 18 and compute the error between predicted and true labels via cross entropy loss.

Additional details on the node classification approach are given in Appendix B .

2.4. Baseline QSAR classifiers

For comparison, we built 2 additional (non-NN) QSAR models that represent rigorously defined benchmarks consistent with current practice in predictive toxicology: a random forest classifier, 19 and a gradient boosting classifier. 20 Each model was trained on the aforementioned MACCS fingerprints of chemicals computed from SMILES strings, with an 80%/20% training/testing split. We tuned 6 hyperparameters for each random forest model and 5 for each gradient boosting model, as described in Table S1 . These were tuned using grid search, where the optimal hyperparameter set is defined as the one that minimizes binary cross-entropy between predicted and true labels on the training data.
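A minimal sketch of this baseline setup using scikit-learn. The hyperparameter grids below are small placeholders rather than the 6- and 5-parameter grids of Table S1, and the fingerprint data are synthetic:

```python
# Baseline (non-NN) QSAR classifiers on binary fingerprint features.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 166))   # toy MACCS-like 166-bit vectors
y = X[:, 0] | X[:, 1]                     # toy activity label for illustration

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "rf": (RandomForestClassifier(random_state=0),
           {"n_estimators": [50, 100], "max_depth": [4, None]}),
    "gb": (GradientBoostingClassifier(random_state=0),
           {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]}),
}
fitted = {}
for name, (est, grid) in models.items():
    # Grid search scored by log loss (binary cross-entropy), as in the text
    search = GridSearchCV(est, grid, scoring="neg_log_loss", cv=3)
    search.fit(X_tr, y_tr)
    fitted[name] = search
```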

3. Results

3.1. GNN node classification performance vs. baseline QSAR models

Of the 68 total assays in the Tox21 database, we retained 52 for analysis in the QSAR experiments. The remaining 16 assays were not used due to either a low number of active chemicals or underrepresentation of screened chemicals in the ComptoxAI graph database. Additionally, we discarded compound labels for chemicals with inconclusive or ambiguous screening results.

As shown in Figure 3 , the GNN model significantly outperforms both the random forest (Wilcoxon signed-rank test p -value 2.3·10 −4 ) and gradient boosting ( p -value 2.6·10 −3 ) models in terms of area under the receiver operating characteristic curve (AUROC), with a mean AUROC of 0.883 (compared to 0.834 for random forest and 0.851 for gradient boosting). This is robust evidence that the GNN model tends to substantially outperform ‘traditional’ QSAR models. A notable characteristic of the GNN AUROCs is that their distribution has a higher variance than either the random forest or gradient boosting AUROCs. Anecdotally, this may be due to diminished sensitivity of the GNN model when trained on assays with few positive examples—neural networks tend to struggle as labeled data become more sparse, which seems to be the case here. We also compared F1-score distributions between the 3 model types; however, the differences between the 3 models are not statistically significant. The relatively low F1-scores of the 3 model types are a result of the class imbalance in the QSAR toxicity assays—all of the assays contain far more negative samples (assay is inactive) than positive samples (assay is active), which magnifies the impact of any false negatives on F1. The same increased variance observed in the GNN AUROCs is also present in the GNN F1-scores.
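The significance tests above are paired Wilcoxon signed-rank tests over per-assay AUROC values. A sketch with SciPy, using synthetic AUROC vectors rather than the study's actual results:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-assay AUROCs for two paired models (one value per assay);
# these numbers are illustrative stand-ins, not the study's measurements.
gnn_auroc = np.array([0.91, 0.88, 0.85, 0.90, 0.87, 0.92, 0.86, 0.89, 0.84, 0.93])
rf_auroc  = np.array([0.900, 0.859, 0.818, 0.857, 0.816, 0.855, 0.784, 0.803, 0.742, 0.821])

# Paired, non-parametric test on the per-assay AUROC differences
stat, p = wilcoxon(gnn_auroc, rf_auroc)
```

Because the test is paired, both vectors must list assays in the same order; here every difference favors the first model, so the test statistic (the smaller rank sum) is zero.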

Fig. 3.

Overall performance metrics of the 3 QSAR model types on each of the Tox21 assays—a.) AUROC and b.) F1 score. The mean AUROC is significantly higher for the GNN model than for either of the two baseline models. Differences in F1 scores are not statistically significant. The GNN achieves poor F1 scores on assays with relatively few (e.g., < 100) “active” annotations in Tox21, which is consistent with known performance of neural networks on data with sparse labels. p -values correspond to Wilcoxon signed-rank tests on means, with a significance level of 0.05.

We performed further review of model performance on two selected assays of interest: PXR agonism (labeled tox21-pxr-p1 in Tox21) and HepG2 cell viability (tox21-rt-viability-hepg2-p1). We selected these assays because (1) both are semantically distinct from all other Tox21 assays (i.e., there are no other assays measuring pregnane X receptor activity or cell viability), so we do not expect information leakage from other highly correlated Tox21 assays present in the GNN, and (2) both have a sufficient number of positive chemicals that their ROC curves attain high resolution at all values of the decision rule. Figure 4 shows that the GNN outperforms the random forest and gradient boosting models at virtually all discrimination thresholds in both cases. The high performance of the GNN on HepG2 cell viability is especially noteworthy—cell viability is notoriously challenging to predict in chemical screening experiments. Many of the other 50 Tox21 assays showed similar patterns in performance. All ROC plots are available in Supplemental Materials .

Fig. 4.

Receiver operating characteristic (ROC) curves for two selected Tox21 assays: a.) PXR agonism (tox21-pxr-p1) and b.) HepG2 cell viability (tox21-rt-viability-hepg2-p1). In both cases, the area under the curve (AUC) is significantly higher for the GNN model than either the Random Forest or Gradient Boosting models. AUC values are given with 95% confidence intervals.
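For reference, AUROC itself can be computed without any plotting machinery as the Mann-Whitney probability that a randomly chosen active chemical outscores a randomly chosen inactive one. A dependency-free sketch:

```python
def auroc(scores, labels):
    """AUROC via the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs in which the positive example scores
    higher, counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```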

To verify that the improved performance of the GCN over the baseline QSAR models is not solely due to the use of multilayer neural networks, we performed an additional analysis of the cell viability assay using a multilayer perceptron (MLP) model. The MLP model never achieved better than an AUROC of 0.76, which is comparable to the performance of the other baseline models, yet still significantly less than the GCN’s AUROC of 0.85. Complete details on the MLP analysis are available in Supplemental Materials .

3.2. Ablation analysis of graph components’ influence on the trained model

To better understand how the GNN model outperforms the random forest and gradient boosting models, we performed an ablation analysis on the two previously mentioned assays—pregnane X agonism and HepG2 cell viability. For both of the assays, we re-trained the model after removing specific components from the GNN:

  • All assay nodes.
  • All gene nodes.
  • MACCS fingerprints for chemical nodes (replacing them with dummy variables so the structure of the network would remain the same).
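The third ablation amounts to overwriting each chemical's 166-bit feature vector with a constant dummy of the same shape, so the network topology is untouched. A minimal NumPy sketch, where the function name and data are illustrative assumptions:

```python
import numpy as np

def ablate_fingerprints(node_features, chemical_idx, fill_value=0.0):
    """Replace the chemical rows of a node-feature matrix with a constant
    dummy vector of identical shape, leaving all other rows untouched."""
    ablated = node_features.copy()
    ablated[chemical_idx] = fill_value
    return ablated

features = np.arange(12, dtype=float).reshape(4, 3)   # 4 nodes, 3 features
masked = ablate_fingerprints(features, chemical_idx=[0, 2])
```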

ROC plots for these experiments are shown in Figure 5 . For both assays, the full GNN model performed best, although only modestly better (in terms of AUROC) than the versions without MACCS fingerprints or gene nodes. However, the performance of the GNN drops substantially—barely better than guessing labels at random (i.e., AUROC = 0.5)—when assay nodes are removed from the graph. In other words, much of the inferential capacity of the GNN models is conferred by chemicals’ connections to assays other than the one for which activity is being predicted. Similarly, MACCS fingerprints are not—on their own—enough for the GNN to attain equal performance to the baseline QSAR models, which only use MACCS fingerprints as predictive features. Therefore, although the GNN achieves significantly better performance than the two baseline models, it is only able to do so with the added context of network relationships between chemicals, assays, and (to a lesser degree) genes.

Fig. 5.

Receiver Operator Characteristic (ROC) curves for two selected Tox21 assays using different configurations of the GNN model. ‘GNN - full’ is the complete model as described in §2.3.1 . ‘GNN - no structure’ omits the MACCS chemical descriptors and replaces them with node embeddings. ‘GNN - no gene’ omits gene nodes and their incident edges. ‘GNN - no assay’ removes all assay nodes and incident edges, so predictions are made solely using chemicals, genes, the remaining edges, and the MACCS fingerprints as chemical node features. AUC values include 95% confidence intervals.

4. Discussion

4.1. GNNs versus traditional ML for QSAR modeling

The toxicology community largely agrees that QSAR underperforms on many tasks, and that new methodological advances are desperately needed. In this study, we demonstrate that GNNs significantly outperform the current gold-standard techniques in the field. Aside from the fact that neural networks can more easily adapt to nonlinear objectives than non-neural network models, 21 this is likely a natural consequence of incorporating biomedical knowledge that goes beyond chemical structure characteristics. Gene interactions provide clues about how chemicals influence metabolic and signaling pathways in vivo , and non-target assays (i.e., other assays in the graph aside from the one currently being predicted) may correlate with activity of the target assay.

4.2. Interpretability of GNNs in QSAR

Chemical fingerprints—such as MACCS, which we use in this study—provide a valuable approach to representing chemicals that is suitable for machine learning. However, models based on fingerprints are challenging to interpret. 7 , 22 Although each field of a MACCS fingerprint corresponds to meaningful chemical properties (such as whether the chemical contains multiple aromatic rings, or at least one nitrogen atom), the fingerprint is largely inscrutable in QSAR applications, since biological activity is the result of many higher-order interactions between the chemical of interest and biomolecules.

In this study, the knowledge representation-based heterogeneous graph data represent easily interpretable relationships between entity types that mediate toxic responses to chemicals. Although not implemented in this particular study, a GNN architecture known as a graph attention network explicitly highlights portions of a graph that are influential in predictions, providing a logical next step for continuing this body of work on GNNs in QSAR modeling. Other, simpler approaches also provide avenues for exploring interpretability, such as visualizing the weights of edges that originate at assay nodes in the trained GNN. Often, the sheer size of graphs makes this approach intractable, but since our graph only contains 52 assays it is relatively straightforward to inspect their weights. For example, the highest-weighted assays for the HepG2 cell viability prediction task are HepG2 Caspase-3/7 mediated cytotoxicity and NIH/3T3 Sonic hedgehog antagonism (a marker of developmental toxicity). The first of these makes sense from an intuitive standpoint, as it measures toxic response in the same cell line as the predicted assay. The second does not have an immediately obvious connection to the predicted assay, but may be linked to the fact that Shh antagonists can induce apoptosis. 23 Either way, it is easy to see that assay weights can be used to generate specific hypotheses for future targeted studies of mechanisms that underlie toxicity.
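Computationally, this kind of inspection is just a sort over a small map from assay name to learned edge weight. A sketch with hypothetical assay identifiers and weights, not the model's actual values:

```python
def top_assays(weights, k=2):
    """Return the k assay identifiers with the largest learned edge weights."""
    return sorted(weights, key=weights.get, reverse=True)[:k]

# Hypothetical learned weights for edges originating at assay nodes
weights = {
    "tox21-casp3-hepg2": 0.94,
    "tox21-shh-3T3": 0.81,
    "tox21-ar-bla-agonist-p1": 0.12,
    "tox21-er-luc-agonist-p2": 0.07,
}
```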

We provide all assay weights for these two assays in Supplemental Materials .

4.3. Sources of bias and their effects on QSAR for toxicity prediction

Like any meta-analysis technique, QSAR is subject to multiple sources of bias that can be introduced at several levels, not the least of which is in the original experiments used to generate toxic activity annotations for training data samples. This was a greater issue historically, when known activities for chemicals were compiled either from published scientific journal article results or from reporting guidelines for in vivo experiments. 24 Publication bias caused negative activity annotations to be extremely incomplete, and techniques for imputing negative annotations were inconsistent. Older QSAR studies often did not state the original sources of their data, so verification and reproducibility of results are immensely challenging (if not impossible).

Fortunately, modern large-scale screening efforts (including Tox21) were created to directly address these and other issues. 25 While our training data are still subject to batch effects, bias in selecting assays and chemicals for screening, and other systematic and experimental errors that are propagated along to the final QSAR model, we are relatively confident that publication bias, reporting bias, and other issues that plagued early QSAR studies have been substantially decreased. Furthermore, our GNN approach to QSAR modeling may be more robust to these sources of bias than non-GNN approaches, because (a.) the graph incorporates multiple levels of biological knowledge that can ‘fill in gaps’ left by incomplete or inaccurate data at other levels and (b.) GNNs—and heterogeneous GNNs in particular—exhibit properties that make them inherently robust to noise. 26 , 27 In the future, we will continue to refine novel predictions of the GNN by fusing electronic health record data via drug exposures to chemicals in the graph, and assess whether predicted toxicity outcomes are observed in real-world data.

5. Conclusions

In this study, we introduce a novel GNN-based approach to QSAR modeling for toxicity prediction, and evaluate it on data from 52 assays to show that it significantly outperforms existing methods. GNNs comprise an incredibly active emerging topic within artificial intelligence research, and as one of the first GNN applications in computational toxicology we hope that our results serve as a ‘jumping off point’ for a vast body of similar work. We plan to evaluate graph attention networks, new data modalities, and network regularization techniques in the near future, and encourage contributions from the toxicology and informatics communities at-large to improve predictive toxicology’s overall data ecosystem.

Overview of the graph machine learning approach used in this study. We build a toxicology-focused graph database (named ComptoxAI) using data aggregated from diverse public databases, and extract a subgraph for QSAR analysis containing chemicals, assays, and genes. We then train and evaluate a graph neural network that predicts whether or not a chemical activates specific toxicology-focused assays from the Tox21 database.

Acknowledgements

This work was supported by US National Institutes of Health grants K99-LM013646 (PI: Joseph Romano), R01-LM010098, R01-LM012601, R01-AI116794, UL1-TR001878, UC4-DK112217 (PI: Jason Moore), T32-ES019851, and P30-ES013508 (PI: Trevor Penning).

Appendix A. Graph convolutional network architecture

Our GCN implementation uses a message-passing paradigm that combines aspects of the GraphSAGE 28 and R-GCN 17 architectures. Let G = (V, E, R) be a heterogeneous graph consisting of nodes v_i ∈ V, edges (v_i, r, v_j) ∈ E, and a set of edge types r ∈ R. Each edge is labeled with exactly one edge type. Each chemical node (represented below as h^(0)) is represented by a bit string of length 166 corresponding to its MACCS fingerprint, while all other nodes (assays and genes) are represented by a single decimal-valued ‘embedding feature’ that is learned during optimization. The magnitude of an assay or gene’s embedding is roughly equivalent to that node’s importance in the network, and can be introspected for model interpretation.

Each layer of the network is defined as an edge-wise aggregation of adjacent nodes:

\[ h_i^{(l)} = \sigma\left( \sum_{r \in R} \rho_{j \in N_i^r} \left( W_r^{(l-1)} h_j^{(l-1)} + W_0^{(l-1)} h_i^{(l-1)} \right) \right) \]
(A.1)

where h_i^(l) is the hidden representation of node i in layer l , N_i^r is the set of immediate neighbors of node i under edge type r , and σ is a nonlinear activation function (either softmax or leaky ReLU, as explained in Appendix B ). ρ can be any differentiable ‘reducer’ function that combines messages passed from incident edges of a single type; in the case of this study we use summation. Since our graph contains relatively few edge types, regularization of the weight matrices W is not needed.
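A single layer of the form (A.1) can be sketched in NumPy as follows. The summation reducer ρ and leaky-ReLU σ follow the text, while the toy graph and weight matrices are assumptions for illustration; note that, exactly as written in (A.1), the self-loop term W_0 h_i is accumulated once per incident edge.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def gcn_layer(h, edges_by_type, W_rel, W_self, sigma=leaky_relu):
    """One layer of Eq. (A.1): for each node i, sum over edge types r and
    neighbors j in N_i^r of (W_r h_j + W_0 h_i), then apply sigma.

    h: (n_nodes, d_in) hidden representations from the previous layer.
    edges_by_type: dict mapping edge type r -> list of (src j, dst i) pairs.
    W_rel: dict mapping edge type r -> (d_in, d_out) weight matrix W_r.
    W_self: (d_in, d_out) self-loop weight matrix W_0.
    """
    out = np.zeros((h.shape[0], W_self.shape[1]))
    for r, edges in edges_by_type.items():
        for j, i in edges:
            # message from neighbor j plus the self term for node i
            out[i] += h[j] @ W_rel[r] + h[i] @ W_self
    return sigma(out)
```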

Appendix B. Node classification model

For classifying chemicals as active or inactive with regard to an assay of interest, we stack 2 GCN layers in the form given by (A.1) , with a leaky ReLU activation between the two layers and softmax applied to the second layer’s output. Since we only classify chemical nodes, we ignore outputs for all other node types (and for chemicals with undefined labels); labels are generated via Algorithm 1 . We train the network by minimizing binary cross-entropy between the network’s softmax outputs and true activity values:

\[ \mathcal{L} = -\sum_{i \in Y} \left[ \ell\left(h_i^{(0)}\right) \ln h_i^{(2)} + \left(1 - \ell\left(h_i^{(0)}\right)\right) \ln\left(1 - h_i^{(2)}\right) \right] \]
(A.2)

where Y is the set of all labeled nodes, ℓ(h_i^(0)) is the true label of node i , and h_i^(2) is the final-layer output for node i .
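The loss (A.2) is ordinary binary cross-entropy summed over the labeled nodes. A NumPy sketch, assuming labels and softmax outputs arrive as arrays; the eps clipping is an added numerical guard, not part of the paper's formulation:

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy of Eq. (A.2), summed over labeled nodes.
    y_true: 0/1 labels; y_pred: softmax outputs in (0, 1).
    eps guards the logarithms against exactly-0 or exactly-1 outputs."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```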

The relatively shallow architecture of the network allows us to optimize the model using the Adam algorithm applied to the entire training data set, but the model can be adapted to mini-batch training when appropriate or necessary.

Footnotes

6. Supplemental Materials

All source code pertaining to this study is available on GitHub at https://github.com/EpistasisLab/qsar-gnn . A frozen copy of the code at the time of writing is available at https://doi.org/10.5281/zenodo.5154055 .

Supplemental tables and figures are available on FigShare at https://doi.org/10.6084/m9.figshare.15094083 .

a The full graph database for ComptoxAI can be found at https://comptox.ai , and will be described in a separate, upcoming publication.

b To prevent information leakage, since connectivity to the assay would perfectly predict the node labels.

References

1. Raies AB and Bajic VB, In silico toxicology: computational methods for the prediction of chemical toxicity , Wiley Interdisciplinary Reviews: Computational Molecular Science 6 , 147 (2016).
2. Tice RR, Austin CP, Kavlock RJ and Bucher JR, Improving the human hazard characterization of chemicals: a Tox21 update , Environmental Health Perspectives 121 , 756 (2013).
3. Roncaglioni A, Toropov AA, Toropova AP and Benfenati E, In silico methods to predict drug toxicity , Current Opinion in Pharmacology 13 , 802 (2013).
4. Dudek AZ, Arodz T and Gálvez J, Computational methods in developing quantitative structure-activity relationships (QSAR): a review , Combinatorial Chemistry & High Throughput Screening 9 , 213 (2006).
5. Tropsha A, Best practices for QSAR model development, validation, and exploitation , Molecular Informatics 29 , 476 (2010).
6. Hansch C and Fujita T, p-σ-π analysis. A method for the correlation of biological activity and chemical structure , Journal of the American Chemical Society 86 , 1616 (1964).
7. Cherkasov A, Muratov EN, Fourches D, Varnek A, Baskin II, Cronin M, Dearden J, Gramatica P, Martin YC, Todeschini R et al., QSAR modeling: where have you been? Where are you going to? , Journal of Medicinal Chemistry 57 , 4977 (2014).
8. Maggiora GM, On outliers and activity cliffs: why QSAR often disappoints (2006).
9. Yue X, Wang Z, Huang J, Parthasarathy S, Moosavinasab S, Huang Y, Lin SM, Zhang W, Zhang P and Sun H, Graph embedding on biomedical networks: methods, applications and evaluations , Bioinformatics 36 , 1241 (2020).
10. Williams AJ, Grulke CM, Edwards J, McEachran AD, Mansouri K, Baker NC, Patlewicz G, Shah I, Wambaugh JF, Judson RS et al., The CompTox Chemistry Dashboard: a community data resource for environmental chemistry , Journal of Cheminformatics 9 , 1 (2017).
11. Brown GR, Hem V, Katz KS, Ovetsky M, Wallin C, Ermolaeva O, Tolstoy I, Tatusova T, Pruitt KD, Maglott DR et al., Gene: a gene-centered information resource at NCBI , Nucleic Acids Research 43 , D36 (2015).
12. Durant JL, Leland BA, Henry DR and Nourse JG, Reoptimization of MDL keys for use in drug discovery , Journal of Chemical Information and Computer Sciences 42 , 1273 (2002).
13. Himmelstein DS, Lizee A, Hessler C, Brueggeman L, Chen SL, Hadley D, Green A, Khankhanian P and Baranzini SE, Systematic integration of biomedical knowledge prioritizes drugs for repurposing , eLife 6 , p. e26726 (2017).
14. Kipf TN and Welling M, Semi-supervised classification with graph convolutional networks , arXiv preprint arXiv:1609.02907 (2016).
15. Chen Z-M, Wei X-S, Wang P and Guo Y, Multi-label image recognition with graph convolutional networks , 5177 (2019).
16. Zhang C, Song D, Huang C, Swami A and Chawla NV, Heterogeneous graph neural network , 793 (2019).
17. Schlichtkrull M, Kipf TN, Bloem P, Van Den Berg R, Titov I and Welling M, Modeling relational data with graph convolutional networks , 593 (2018).
18. Kingma DP and Ba J, Adam: A method for stochastic optimization , arXiv preprint arXiv:1412.6980 (2014).
19. Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP and Feuston BP, Random forest: a classification and regression tool for compound classification and QSAR modeling , Journal of Chemical Information and Computer Sciences 43 , 1947 (2003).
20. Sheridan RP, Wang WM, Liaw A, Ma J and Gifford EM, Extreme gradient boosting as a method for quantitative structure–activity relationships , Journal of Chemical Information and Modeling 56 , 2353 (2016).
21. Hornik K, Stinchcombe M and White H, Multilayer feedforward networks are universal approximators , Neural Networks 2 , 359 (1989).
22. Matveieva M and Polishchuk P, Benchmarks for interpretation of QSAR models , Journal of Cheminformatics 13 , 1 (2021).
23. Wu C, Hu S, Cheng J, Wang G and Tao K, Smoothened antagonist GDC-0449 (vismodegib) inhibits proliferation and triggers apoptosis in colon cancer cell lines , Experimental and Therapeutic Medicine 13 , 2529 (2017).
24. Cronin MT, Richarz A-N and Schultz TW, Identification and description of the uncertainty, variability, bias and influence in quantitative structure-activity relationships (QSARs) for toxicity prediction , Regulatory Toxicology and Pharmacology 106 , 90 (2019).
25. Shoichet BK, Virtual screening of chemical libraries , Nature 432 , 862 (2004).
26. Xie Y, Xu H, Li J, Yang C and Gao K, Heterogeneous graph neural networks for noisy few-shot relation classification , Knowledge-Based Systems 194 , p. 105548 (2020).
27. Nt H and Maehara T, Revisiting graph neural networks: All we have is low-pass filters , arXiv preprint arXiv:1905.09550 (2019).
28. Hamilton WL, Ying R and Leskovec J, Inductive representation learning on large graphs , 1025 (2017).