Dr. Gerard Dumancas
Institution: Louisiana State University
Department: Mathematics and Physical Sciences
Proposed research ideas:
The objective of this proposal is to apply advanced machine learning techniques to improve the scoring functions of protein-ligand docking models. In molecular docking, a large number of binding poses (orientation plus conformation) are evaluated using a scoring function. A scoring function is a mathematical predictive model that produces a score representing the binding free energy of a binding pose. The result of the docking process is a set of ligands ranked according to their predicted binding scores. The key to computer-aided drug design is therefore an efficient, highly accurate scoring function. However, traditional scoring functions proposed in the past generally perform poorly. To improve predictive scoring functions, we propose to use advanced machine learning tools. Using data obtained from the Protein Data Bank (PDB; http://www.rcsb.org/pdb/), we propose to assess the performance of an ensemble learning technique called stacked generalization (stacking) in improving the scoring functions of protein-ligand docking models. Stacking combines predictive models by feeding their predictions into a second-level algorithm that learns how best to combine them into a single aggregate predictor. The idea behind stacking is that combining the predictions of a variety of machine learning algorithms should, in principle, lead to better predictive performance than any individual algorithm can achieve on its own. As an example, we will focus on determining the binding affinity (via scoring functions) of nucleotide inhibitors to Zika virus (ZIKV) polymerase.
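The stacking idea described above can be illustrated with a minimal sketch. It is written in Python with the standard library only, purely for illustration; it is not the proposed implementation. The data are synthetic, the two base learners are deliberately weak one-feature logistic regressions standing in for a more diverse set of algorithms, and all function names (fit_logistic, stack_fit, stack_predict) are hypothetical. The one essential point it shows is that the meta-model is trained on out-of-fold base predictions, so it never sees a prediction made on data the base learner was fitted to.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=200):
    """Fit logistic regression by batch gradient descent; returns [bias, w_1, ..., w_d]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def predict_proba(w, X):
    """Predicted probability of class 1 for each row of X."""
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]

def select(X, cols):
    """Keep only the listed feature columns."""
    return [[row[c] for c in cols] for row in X]

def stack_fit(X, y, feature_subsets, k=5):
    """Train base learners (one per feature subset) and a meta-model on their
    out-of-fold predictions."""
    n = len(X)
    meta_X = [[0.0] * len(feature_subsets) for _ in range(n)]
    for fold in range(k):
        held_out = list(range(fold, n, k))          # interleaved folds
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        y_tr = [y[i] for i in train]
        for m, cols in enumerate(feature_subsets):
            w = fit_logistic(select([X[i] for i in train], cols), y_tr)
            preds = predict_proba(w, select([X[i] for i in held_out], cols))
            for i, p in zip(held_out, preds):
                meta_X[i][m] = p                    # out-of-fold prediction
    meta_w = fit_logistic(meta_X, y)                # level-1 (meta) model
    # Refit each base learner on all training data for use at prediction time.
    base_ws = [fit_logistic(select(X, cols), y) for cols in feature_subsets]
    return base_ws, meta_w

def stack_predict(model, X, feature_subsets):
    """Feed base-learner predictions into the meta-model."""
    base_ws, meta_w = model
    columns = [predict_proba(w, select(X, cols))
               for w, cols in zip(base_ws, feature_subsets)]
    meta_X = [list(row) for row in zip(*columns)]
    return predict_proba(meta_w, meta_X)

# Synthetic stand-in data (assumption): two descriptors, binary label.
random.seed(0)
X_all = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(300)]
y_all = [1 if a + b + random.gauss(0, 0.5) > 0 else 0 for a, b in X_all]
X_train, y_train = X_all[:200], y_all[:200]
X_test, y_test = X_all[200:], y_all[200:]

subsets = [[0], [1]]                                # each base learner sees one feature
model = stack_fit(X_train, y_train, subsets)
scores = stack_predict(model, X_test, subsets)      # stacked scores in [0, 1]
```

In this toy setting each base learner sees only half of the signal, and the meta-model recovers a better ranking by weighting both. With real docking data the base learners would instead be the different algorithms named in the proposal, each trained on the same descriptors.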
In this study, we will feed the predictions of a number of learning algorithms (including logistic regression, elastic net, LASSO, random forests, decision trees, gradient boosting, support vector machines, artificial neural networks, and partial least squares discriminant analysis) into a meta-model scoring function. The area under the ROC curve (AUC) will be used to measure the predictive performance of the individual classification algorithms and of the stacked model. We will use the pROC package in R to calculate the AUCs. All machine learning calculations will be performed using packages in R version 3.1.2. We will compare our results with those of AutoDock Vina, a molecular docking and virtual screening program. An effective scoring function will provide the foundation for identifying promising drug candidates that bind to specific proteins, thereby inhibiting viral progression. Such candidates can then be synthesized and physically tested using high-throughput screening.
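To make the evaluation metric concrete, the following is a minimal Python sketch of what the AUC measures: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. It is a stand-in for illustration only and does not reproduce the pROC package; the function name roc_auc and the example labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 class labels
    scores: iterable of model scores, higher meaning "more likely positive"
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))

# Hypothetical example: 3 positives, 3 negatives, one positive ranked too low.
example_labels = [1, 1, 0, 1, 0, 0]
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
example_auc = roc_auc(example_labels, example_scores)  # 8 of 9 pairs correct
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the same number can be used to compare each base learner, the stacked meta-model, and a reference scoring function on a common footing.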
Louisiana State University at Alexandria (LSUA), founded in 1960, is the only state-supported undergraduate-only university in Louisiana and operates within the Louisiana State University (LSU) system. Having transitioned from a 2-year community college to a 4-year university, LSUA has 3378 students, of whom 67.0% are female and ~19% are African American or Hispanic. Mentoring women and students from underrepresented minorities is my primary motivation for participating in this program. This mentorship encompasses both teaching and research. In teaching, I am presently mentoring an African American undergraduate student in the general chemistry laboratories. The position requires the student to have knowledge of and experience in basic mathematical calculations and the handling of chemical reagents. In research, I have been mentoring two undergraduate students on projects that apply chemometrics and machine learning methods to biomedical and food science problems. One of these students comes from an underrepresented group and is representative of a pool of promising candidates who can build successful careers in computational science. The demand for qualified STEM professionals is significant, but the supply of STEM workers to fill these positions is at risk if underrepresented groups are not engaged in these fields. Supporting underrepresented minorities in pursuing STEM-related careers is therefore central to my participation. My second motivation is to build strong collaborative relationships between LSUA and the Berkeley Lab Computational Research Division, which could have a significant long-term impact on the development of a diverse workforce by training students in these fields.
Further, the techniques learned and collaborations formed through this workshop will allow me to advance my research program at LSUA by applying my skills and expertise in machine learning within a multidisciplinary approach. These advances will in turn allow me to pursue sustainable funding from major US national funding agencies and to train undergraduate students to become future scientists.