Rafael Zamora

Name: Rafael Zamora
Pronouns: He/him/his

Biography:
My background is in computational research, with training and experience in scientific machine learning and high-performance computing. I participated in Lawrence Berkeley National Laboratory's VFP and SULI internship programs, where I implemented scalable machine learning algorithms for the analysis of proteomic data. I then joined LBNL as a staff computer systems engineer, serving as a domain expert in deep learning applications for healthcare and biomedical studies. I currently work on the Million Veteran Program (MVP), a collaboration with administrators and medical doctors from the Department of Veterans Affairs (VA). I provide technical support for the development and implementation of deep learning-enabled electronic health record (EHR) analysis, including the development of statistical language models. Under MVP, I study how NLP can supplement suicide risk assessment by identifying socioeconomic information that appears in clinical text but has low prevalence in structured EHR data. I lead the team's development of large language models trained on VA data for applications in knowledge discovery and information retrieval. The goal of my work is to build systems that help VA physicians understand how medical phenotypes are expressed in clinical text, accurately identify risk factors, and develop scalable models to predict treatment outcomes.

Institution/Lab: Lawrence Berkeley National Laboratory
Website: https://crd.lbl.gov/divisions/amcr/computational-science-dept/acsd/staff/staff-members/rafael-zamora-resendiz/

SRP Collaboration Topic/Title: Scaling Protein Structure Machine Learning Applications Using HPC

Field or research area: Computational Biology

Please select all the topical areas that apply to your project:
Computational Science Applications (i.e., bioscience, cosmology, chemistry, environmental science, nanotechnology, climate, etc.); High-Performance Computing; Machine Learning and AI

Brief Abstract:
Graph convolution and self-attention deep learning models have grown in popularity in the domain of proteomics. These machine learning approaches have been applied to several proteomic problems, including protein classification and protein-ligand binding affinity prediction. Searching for ligands and poses with high binding affinity has immediate applications in drug design and drug repurposing. While many deep learning tools model binding-site interactions at the residue level, atom-level models would effectively increase the resolution at which interactions are represented. Even so, scaling to systems with more than 1,000 atoms is a non-trivial task that requires HPC. Recently, much work has gone into scaling the training of language model architectures such as GPT and BERT using HPC. The proposed project will explore using language model architectures to scale the modeling of protein structures in the context of protein-ligand binding. Model parallelism will be used to enable modeling of proteins at the atomic level. After training on large datasets of binding-site interactions, these models will be tested against out-of-training examples to assess their utility in imputing binding affinities across diverse protein and ligand databases. Methods for interpreting model parameters will be developed to provide biologically meaningful insights that can aid drug design and repurposing.
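To illustrate why atom-level modeling strains compute budgets, below is a minimal sketch (not part of the proposal itself; all names are hypothetical) of single-head scaled dot-product self-attention over a sequence of atom embeddings in NumPy. The pairwise score matrix grows quadratically with the number of atoms, which is what makes scaling past roughly 1,000 atoms demand model parallelism and HPC resources.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (n_atoms, d_model) atom embeddings; Wq/Wk/Wv: projection matrices.
    The (n_atoms, n_atoms) score matrix is the quadratic-cost term.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # atom-atom logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over atoms
    return weights @ V                                     # mixed atom features

# Toy example: 5 "atoms" with 8-dimensional embeddings.
rng = np.random.default_rng(0)
n_atoms, d_model = 5, 8
X = rng.standard_normal((n_atoms, d_model))
Wq = rng.standard_normal((d_model, d_model))
Wk = rng.standard_normal((d_model, d_model))
Wv = rng.standard_normal((d_model, d_model))

out = self_attention(X, Wq, Wk, Wv)
assert out.shape == (n_atoms, d_model)
```

For a protein-ligand system with tens of thousands of atoms, the score matrix alone holds hundreds of millions of entries per layer, motivating the model-parallel training strategy described above.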

Desired relevant skills, background, or interests:
Good understanding of programming and machine learning. Some understanding of parallel programming and of working with high-performance computing systems. Interest in applying computational skills to the domains of biology and medicine. A strong desire to improve one's skills in writing well-documented, reusable scientific code.

Other comments:

Do any special requirements apply? U.S. Citizen Only
Other, specify:

Keywords:
computational biology; high-performance computing; machine learning; large language models; structural proteomics

Lightning Talk Title: Improving Protein Structure Discovery Using HPC & Large Language Models