Computational Toxicology and Exposure Communities of Practice: The Effect of Noise on the Predictive Limit of QSAR Models and The Statistical Effects of Categorizing Continuous Data in QSAR
Date and Time
11:00 am - 12:00 pm EDT
Please feel free to forward to others who may be interested in joining!
You are invited to the EPA CompTox Communities of Practice.
Topic: The Effect of Noise on the Predictive Limit of QSAR Models and The Statistical Effects of Categorizing Continuous Data in QSAR
Who: Dr. Scott Kolmar, Post-Doctoral Researcher in the Center for Computational Toxicology and Exposure
When: June 24, 2021, from 11:00 AM to 12:00 PM EDT
Where: Please register through Eventbrite; information to join will be sent after registration.
One key challenge in the field of Quantitative Structure-Activity Relationships (QSAR) is how to effectively treat experimental error in the training and evaluation of computational models. It is often assumed in the field of QSAR that models cannot produce predictions that are more accurate than their training data. Additionally, it is implicitly assumed, by necessity, that data points in test sets or validation sets do not contain error, and that each data point is a population mean. This work proposes the hypothesis that QSAR models can make predictions that are more accurate than their training data and that the error-free test set assumption leads to a significant misevaluation of model performance. To test this hypothesis, 15 levels of simulated Gaussian-distributed error were added to 8 benchmark datasets, which were then modeled with 5 machine learning algorithms. These results have implications for how QSAR models are evaluated, especially for disciplines where experimental error is very large, such as computational toxicology.
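The noise-injection idea described above can be illustrated with a minimal sketch. It is not the speaker's pipeline: the synthetic linear data, the ordinary least-squares model, the noise levels, and the names `noise_experiment` and `rmse` are all assumptions made for illustration. The sketch adds Gaussian error to both training and test labels, then scores the model against the noisy test labels and against the error-free values, which are known here only because the data are simulated.

```python
import numpy as np


def rmse(a, b):
    """Root-mean-square error between two arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))


def noise_experiment(sigma, n_train=500, n_test=500, n_features=5, seed=0):
    """Fit least squares on noisy labels; score against noisy and true test values.

    Everything here is a toy assumption: the abstract does not specify
    datasets, descriptors, or algorithms at this level of detail.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=n_features)              # hypothetical "true" relationship
    X_train = rng.normal(size=(n_train, n_features))
    X_test = rng.normal(size=(n_test, n_features))

    y_train_true = X_train @ w                   # error-free values (unknowable in practice)
    y_test_true = X_test @ w

    # Simulated experimental error: Gaussian with standard deviation sigma.
    y_train_noisy = y_train_true + rng.normal(scale=sigma, size=n_train)
    y_test_noisy = y_test_true + rng.normal(scale=sigma, size=n_test)

    # Train only on the noisy labels, as a modeler would have to.
    coef, *_ = np.linalg.lstsq(X_train, y_train_noisy, rcond=None)
    pred = X_test @ coef

    return {
        "train_label_rmse": rmse(y_train_noisy, y_train_true),
        "rmse_vs_noisy_test": rmse(pred, y_test_noisy),
        "rmse_vs_true_test": rmse(pred, y_test_true),
    }


if __name__ == "__main__":
    for sigma in (0.5, 1.0, 2.0):                # illustrative noise levels only
        print(sigma, noise_experiment(sigma))
```

In this toy setting, comparing `rmse_vs_true_test` with `train_label_rmse` shows whether predictions are closer to the underlying values than the training labels were, and comparing it with `rmse_vs_noisy_test` shows how much a noisy test set changes the apparent performance. Real QSAR data offer no error-free reference, which is why the experiment relies on simulated error.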
The growing application of QSAR principles to the field of computational toxicology motivates many QSAR modelers to present binary outcomes, rather than continuous outcomes, to their toxicological audiences. This motivation often results in continuous toxicological endpoints being transformed into categorical endpoints before algorithms are trained and predictions are made. On a fundamental mathematical level, this procedure causes a significant loss of information and yields predictions that are statistically less significant. The second portion of this presentation investigates whether this fundamental statistical principle results in less predictive QSAR models. Using several benchmark datasets and machine learning algorithms, models that categorize continuous data and then make predictions are compared to models that make predictions on continuous data and then categorize the predictions. The results at this stage suggest that the predicted statistical outcome is obscured by the complexity of the machine learning pipeline, perhaps because of the numerous feature variables present in QSAR workflows.
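The two workflows compared above can be sketched side by side. This is a toy illustration under stated assumptions, not the study's method: the synthetic data, the activity cutoff `threshold`, the noise level, and the function name `compare_pipelines` are hypothetical, and a plain least-squares fit stands in for both the classifier (fit to 0/1 labels and cut at 0.5) and the regressor.

```python
import numpy as np


def add_intercept(X):
    """Prepend a column of ones so the linear fits include an intercept."""
    return np.column_stack([np.ones(len(X)), X])


def compare_pipelines(threshold=0.0, sigma=0.5, n_train=400, n_test=400,
                      n_features=5, seed=0):
    """Return test accuracy for both workflows on synthetic data.

    All parameters are illustrative assumptions, including the cutoff
    that turns the continuous endpoint into an active/inactive label.
    """
    rng = np.random.default_rng(seed)
    w = np.concatenate([[0.0], rng.normal(size=n_features)])  # hypothetical true weights
    X_train = add_intercept(rng.normal(size=(n_train, n_features)))
    X_test = add_intercept(rng.normal(size=(n_test, n_features)))

    # Continuous endpoint with simulated measurement error.
    y_train = X_train @ w + rng.normal(scale=sigma, size=n_train)
    y_test = X_test @ w + rng.normal(scale=sigma, size=n_test)

    labels_train = (y_train > threshold).astype(float)
    labels_test = y_test > threshold

    # Workflow A: categorize first, then model the binary labels.
    coef_a, *_ = np.linalg.lstsq(X_train, labels_train, rcond=None)
    pred_a = (X_test @ coef_a) > 0.5

    # Workflow B: model the continuous values, then categorize the predictions.
    coef_b, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    pred_b = (X_test @ coef_b) > threshold

    return {
        "categorize_then_predict": float(np.mean(pred_a == labels_test)),
        "predict_then_categorize": float(np.mean(pred_b == labels_test)),
    }


if __name__ == "__main__":
    print(compare_pipelines())
```

Varying `sigma`, `threshold`, or `n_features` in this sketch is one way to explore how the gap between the two accuracies changes. A linear toy problem says nothing about the benchmark datasets or algorithms used in the presentation.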
This abstract does not necessarily reflect U.S. EPA policy.
For more information, visit the EPA's Computational Toxicology Communities of Practice webpage.