Project

AFib Risk Prediction

Overview

At BAIR and UCSF Computational Precision Health, I worked on predictive modeling systems for atrial fibrillation using longitudinal electronic health record trajectories across more than 25 million patients and over 100 million clinical encounters. The goal was to predict AFib onset up to twelve months in advance while understanding the temporal and clinical structure underlying disease progression.
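
To make the prediction target concrete, here is a minimal sketch of how a twelve-month onset label could be defined for one patient. The function name, the `(visit_date, codes)` encounter layout, the index-date convention, and the two example AFib codes are all assumptions for illustration, not the project's actual label definition.

```python
from datetime import date, timedelta

# Assumed event codes, for illustration only: ICD-10-CM I48.91 and ICD-9-CM 427.31
# both denote atrial fibrillation. The project's real code list is not given here.
AFIB_CODES = frozenset({"I48.91", "427.31"})


def onset_label(encounters, index_date, horizon_days=365, afib_codes=AFIB_CODES):
    """Label one patient for AFib onset within `horizon_days` after `index_date`.

    encounters: iterable of (visit_date, codes) pairs.
    Returns 1 if the first AFib-coded encounter falls inside the window,
    0 if there is no onset inside the window, and None if AFib is already
    coded on or before the index date (not a new-onset case, so excluded).
    """
    afib_dates = [d for d, codes in encounters if afib_codes.intersection(codes)]
    if not afib_dates:
        return 0
    first_onset = min(afib_dates)
    if first_onset <= index_date:
        return None
    window_end = index_date + timedelta(days=horizon_days)
    return 1 if first_onset <= window_end else 0
```

Returning None for patients with prior AFib keeps prevalent cases out of both the positive and negative classes, which is one common way to keep the task about onset rather than recurrence.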

A major challenge was simply making the dataset usable. The raw OSHPD records were distributed across large, unordered Parquet files that expanded far beyond their on-disk size once loaded into memory and had no patient-level structure. I built distributed preprocessing pipelines using PySpark and PyArrow to partition encounters into temporal patient trajectories, construct efficient lookup systems, and generate longitudinal representations from ICD and CPT code sequences. This included large-scale batching, chunked processing, patient-level grouping, and HPC-based parallel preprocessing workflows.
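
The core grouping logic can be sketched on a single machine in plain Python. This is a simplified stand-in for the distributed PySpark and PyArrow pipeline, not the pipeline itself: the `Encounter` fields, `iter_chunks`, and `build_trajectories` are hypothetical names, and a real run at this scale would write per-patient partitions to disk rather than hold the grouped result in one dictionary.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class Encounter(NamedTuple):
    patient_id: str
    visit_date: date
    codes: Tuple[str, ...]  # ICD / CPT codes recorded at this encounter


def iter_chunks(rows: Iterable[Encounter], size: int) -> Iterator[List[Encounter]]:
    """Yield fixed-size chunks so the input is consumed incrementally."""
    chunk: List[Encounter] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly shorter, chunk
        yield chunk


def build_trajectories(rows: Iterable[Encounter], chunk_size: int = 1000) -> Dict[str, List[Encounter]]:
    """Group unordered encounters by patient, then sort each group by date."""
    by_patient: Dict[str, List[Encounter]] = defaultdict(list)
    for chunk in iter_chunks(rows, chunk_size):
        for enc in chunk:
            by_patient[enc.patient_id].append(enc)
    # Sorting per patient turns an unordered bag of visits into a trajectory.
    return {pid: sorted(encs, key=lambda e: e.visit_date) for pid, encs in by_patient.items()}
```

For example, calling `build_trajectories(rows, chunk_size=2)` on three unordered encounters for two patients returns one date-ordered list per patient ID.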

On the modeling side, we benchmarked multiple approaches for AFib onset and stroke prediction, including gradient-boosted trees, random forests, MLPs, temporal sequence models, and LLM-based classifiers. The baseline models consistently outperformed the clinical CHA2DS2-VASc scoring system while identifying larger high-risk patient groups at comparable precision levels.
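
For reference, the CHA2DS2-VASc comparison can be sketched as follows. The score uses the standard published point weights. The `flagged_at_precision` helper is a hypothetical illustration of what "larger high-risk groups at comparable precision" can mean operationally, and is not the evaluation code used in the project.

```python
def cha2ds2_vasc(age, female, chf, hypertension, diabetes, stroke_tia, vascular_disease):
    """Standard CHA2DS2-VASc score, an integer from 0 to 9."""
    score = 0
    score += 1 if chf else 0                                # C: congestive heart failure
    score += 1 if hypertension else 0                       # H: hypertension
    score += 2 if age >= 75 else (1 if age >= 65 else 0)    # A2 / A: age bands
    score += 1 if diabetes else 0                           # D: diabetes mellitus
    score += 2 if stroke_tia else 0                         # S2: prior stroke / TIA / thromboembolism
    score += 1 if vascular_disease else 0                   # V: vascular disease
    score += 1 if female else 0                             # Sc: sex category
    return score


def flagged_at_precision(scores, labels, target_precision):
    """Size of the largest group flagged by a score threshold whose precision
    is at least `target_precision`. `labels` are 0/1 outcomes."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = 0
    true_pos = 0
    for k, (score, label) in enumerate(ranked, start=1):
        true_pos += label
        # Only cut where the score changes, so tied patients stay together.
        at_boundary = k == len(ranked) or ranked[k][0] != score
        if at_boundary and true_pos / k >= target_precision:
            best = k
    return best
```

Running `flagged_at_precision` on a model's scores and on the CHA2DS2-VASc scores with the same labels and the same target precision gives two group sizes that can be compared directly.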

One of the more interesting directions involved adapting Llama 3.1 with QLoRA finetuning for longitudinal EHR classification. We transformed structured patient encounters into text trajectories and experimented with transformer-based risk prediction pipelines using instruction tuning and embedding-based classifiers. While the approach was technically successful, the results exposed important limitations of applying general-purpose LLMs to sparse coded medical trajectories: strong dataset priors, modality mismatch, weak interpretability, and minimal gains despite scaling compute.
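
As an illustration of the encounter-to-text step, a minimal serializer might look like the sketch below. The line template, the `serialize_trajectory` name, and the choice to express time as days before an index date are assumptions; the actual prompt format used for finetuning is not described here.

```python
from datetime import date


def serialize_trajectory(encounters, index_date):
    """Render (visit_date, codes) pairs as one text line per encounter,
    oldest first, keeping only encounters strictly before `index_date`."""
    lines = []
    for visit_date, codes in sorted(encounters, key=lambda e: e[0]):
        if visit_date >= index_date:
            # Skip anything on or after the prediction date so the text
            # never contains information from the outcome window.
            continue
        days_before = (index_date - visit_date).days
        lines.append(f"{days_before} days before index: " + ", ".join(codes))
    return "\n".join(lines)
```

Relative offsets are used instead of calendar dates so the text carries the temporal spacing between visits without exposing absolute dates.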

The project fundamentally changed how I think about ML in healthcare. The hardest problems were not model architecture choices, but representation learning, temporal structure, data engineering, label leakage, clinical interpretability, and understanding where foundation models actually help versus where simpler systems dominate.

Technical Highlights

  • 100M+ encounter longitudinal EHR preprocessing pipeline
  • PySpark and PyArrow distributed trajectory generation
  • AFib onset prediction at horizons of up to 12 months
  • XGBoost, MLP, and LSTM models, plus Llama 3.1 finetuned with QLoRA
  • Clinical feature attribution and risk modeling
  • HPC-scale preprocessing and model training