Project

UniTrial

Overview

UniTrial is a retrieval-augmented system for matching patients to clinical trials using longitudinal EHR data, medical embeddings, and live trial metadata. The project started from a simple observation: patient recruitment remains one of the largest bottlenecks in clinical research, especially in oncology and neurodegenerative disease trials where eligibility criteria are fragmented across unstructured records, biomarkers, and evolving trial requirements.

The system ingests patient EHR PDFs or condition-level prompts, extracts structured and unstructured medical concepts, and matches them against live ClinicalTrials.gov trials using a multimodal RAG pipeline. To reduce hallucinations and improve retrieval quality, I built a retrieval schema grounded in FHIR entities and MeSH identifiers rather than relying purely on free-form generation. The retrieval stack combines MED-BERT embeddings, BGE embeddings, ChromaDB vector search, and custom filtering pipelines for biomarker extraction and eligibility alignment.
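The schema-first retrieval idea can be illustrated with a minimal sketch: filter candidate trials on shared MeSH codes first, then rank the survivors by embedding similarity. Everything here is an assumption for illustration rather than the project's actual code. `TrialRecord`, `match_trials`, the placeholder trial IDs, and the 3-dimensional toy vectors are hypothetical, and a plain Python list stands in for ChromaDB and the MED-BERT/BGE embeddings.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class TrialRecord:
    """Hypothetical trial representation: an ID, MeSH condition codes, and an embedding."""
    nct_id: str
    mesh_ids: frozenset
    embedding: tuple


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_trials(patient_mesh_ids, patient_embedding, trials, top_k=3):
    """Keep trials sharing at least one MeSH code, then rank them by similarity."""
    candidates = [t for t in trials if t.mesh_ids & patient_mesh_ids]
    ranked = sorted(
        candidates,
        key=lambda t: cosine(patient_embedding, t.embedding),
        reverse=True,
    )
    return [t.nct_id for t in ranked[:top_k]]


# Toy data: placeholder trial IDs and tiny vectors standing in for real embeddings.
TRIALS = [
    TrialRecord("NCT-EXAMPLE-1", frozenset({"D001943"}), (0.9, 0.1, 0.0)),
    TrialRecord("NCT-EXAMPLE-2", frozenset({"D001943"}), (0.2, 0.8, 0.1)),
    TrialRecord("NCT-EXAMPLE-3", frozenset({"D000544"}), (0.9, 0.1, 0.0)),
]

matches = match_trials(frozenset({"D001943"}), (1.0, 0.0, 0.0), TRIALS)
print(matches)
```

In this toy setup the third trial has the same embedding as the first but a different MeSH code, so the structured filter drops it before similarity is ever computed. That is the sense in which a coded schema constrains retrieval beyond what free-text similarity alone would allow.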

A large part of the work involved building reliable healthcare data infrastructure around the retrieval system. We implemented preprocessing and de-identification workflows for EHR ingestion, extended ClinicalTrials.gov querying through a custom REST API layer, and designed separate interfaces that let patients and CROs securely search both trial and patient representations.
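Two of these infrastructure pieces can be sketched in a few lines, under stated assumptions. The first function is a rule-based scrub that replaces a handful of obvious identifiers with placeholders. The patterns are illustrative only, and a real de-identification workflow would need much broader coverage (names, addresses, free-text dates, and so on). The second builds a query URL for the public ClinicalTrials.gov v2 studies endpoint. Treating that endpoint as what the custom REST layer wraps is an assumption, and `scrub_phi`, `build_trial_query_url`, and `PHI_PATTERNS` are hypothetical names.

```python
import re
from urllib.parse import urlencode

# Illustrative patterns only: (compiled regex, placeholder) pairs applied in order.
PHI_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "[DATE]"),
    (re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE), "[MRN]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[EMAIL]"),
]


def scrub_phi(text):
    """Replace each matched identifier with its placeholder token."""
    for pattern, placeholder in PHI_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


# Public ClinicalTrials.gov v2 endpoint (assumed target of the custom API layer).
STUDIES_ENDPOINT = "https://clinicaltrials.gov/api/v2/studies"


def build_trial_query_url(condition, status="RECRUITING", page_size=20):
    """Build a studies query URL filtered by condition and recruitment status."""
    params = {
        "query.cond": condition,
        "filter.overallStatus": status,
        "pageSize": page_size,
    }
    return f"{STUDIES_ENDPOINT}?{urlencode(params)}"


note = "Patient MRN: 483920 seen 03/14/2021, SSN 123-45-6789, contact jane.doe@example.org."
scrubbed = scrub_phi(note)
url = build_trial_query_url("breast cancer")
print(scrubbed)
print(url)
```

The sketch only constructs the URL and makes no network call, so it can be exercised offline. Fetching and paging through results would sit on top of it.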

What I found especially interesting was the systems side of healthcare AI. The answer was rarely “just use an LLM.” The real challenge was designing representations and retrieval systems that could handle noisy clinical data, sparse patient histories, hallucination risk, and constantly changing eligibility criteria while remaining interpretable enough for real clinical workflows.

Technical Stack

  • Python
  • LlamaIndex
  • LangChain
  • ChromaDB
  • MED-BERT
  • BGE embeddings
  • FHIR and MeSH ID schema alignment
  • Firebase and Streamlit