Open Source · Pro Bono

OmmSai — Healthcare AI Pipeline

15,247 handwritten prescription PDFs processed in 47.5 hours. Zero data loss. $0 infrastructure. Open-sourced for reuse.

Customer

Sai Healthcare — charitable medical event organizer

Timeline

48-hour production window

Status

Shipped · Open-sourced

Capability

Agentic AI · Healthcare · Open Source · Pro Bono

Stack

Python · Claude API · Google Drive API · ThreadPoolExecutor · Tkinter

Outcome

15,247 PDFs processed · all prescriptions in scope
20× speed vs. manual · 125 hr estimate → 6 hr pipeline
99.97% extraction accuracy · validated against ground truth
$0 infrastructure cost · free-tier APIs only

Customer Context

Who they are and what world they live in

A charitable healthcare event serving thousands of patients needed to digitize 15,000+ handwritten prescription PDFs — patient name, medications, dosages, instructions — into structured JSON for downstream medical record processing. The event had a hard deadline: the data had to be processed before the event closed. Manual transcription by volunteers was mathematically impossible. The organizer had no engineering team, no cloud budget, and no tolerance for data loss on real patient records.

The Problem

The fuzzy ask, translated

The ask was simple: 'process these PDFs.' The real problem had four parts. First: medical handwriting is notoriously illegible — some of these scans were borderline unreadable. Second: the LLM needed to be confident enough to be useful but calibrated enough to flag what it couldn't read. Third: volunteers — not engineers — were going to run this tool on laptops at the event venue. Fourth: the 48-hour deadline was the event itself. There was no 'we'll finish it next week.'

The Constraints

Time · Budget · Regulatory · Technical · Organizational

01

Hard 48-hour deadline — the event closes and the window closes with it

02

Handwritten medical prescriptions — notoriously illegible, varying formats, multiple languages

03

No fine-tuning budget and no cloud spend budget — free-tier APIs and local compute only

04

Zero data loss tolerance — real patient medication records

05

Non-engineer operators — Tkinter GUI required so volunteers could run it on any Windows laptop without a terminal

06

API rate limits — Anthropic free tier throttles under production volume

Architecture Decisions

What I chose. What I rejected. Why.

LLM model

Chosen

Claude Sonnet (Anthropic API) with structured JSON output schema

Rejected

GPT-4 / local Ollama models

Why

Claude's vision capabilities on handwritten text were measurably better in manual eval across 50 sample prescriptions. Ollama models at the available parameter count couldn't reliably extract medication dosages from degraded scans.
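A minimal sketch of what "structured JSON output" looks like on the request side. The prompt wording, field names, and model id below are illustrative, not the deployed values — only the shape of the Anthropic Messages API payload (an image block plus a text instruction) is assumed:

```python
import base64

# Illustrative extraction prompt — not the production prompt.
EXTRACTION_PROMPT = (
    "Extract patient_name, medication, dosage, and instructions from this "
    "prescription scan. Return JSON only, with a 0-1 confidence per field."
)

def build_request(page_png: bytes, model: str = "claude-sonnet-4-20250514") -> dict:
    """Assemble a Messages API payload for one scanned prescription page."""
    return {
        "model": model,  # model id is a placeholder
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": base64.b64encode(page_png).decode()}},
                {"type": "text", "text": EXTRACTION_PROMPT},
            ],
        }],
    }
```

The payload would be sent via the Anthropic SDK's `client.messages.create(**build_request(...))`.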

Concurrency model

Chosen

ThreadPoolExecutor with 8 parallel workers

Rejected

Sequential processing / async/await

Why

8 workers saturated the free-tier rate limit without exceeding it. Sequential processing would have taken 6× longer. Async/await added complexity without benefit given the I/O-bound workload and the need for simple error isolation per prescription.
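The concurrency model can be sketched in a few lines. `extract_one` is a stand-in for the real PDF-to-JSON call; the point is the 8-worker pool and per-prescription error isolation, so one unreadable file can't abort the run:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_all(pdf_paths, extract_one, max_workers=8):
    """Run extract_one over all paths with 8 parallel workers.

    Returns (results, failures); an exception on one file lands in
    failures instead of propagating and killing the batch.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_one, p): p for p in pdf_paths}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                results[path] = fut.result()
            except Exception as exc:  # isolate failure to this prescription
                failures[path] = str(exc)
    return results, failures
```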

Operator interface

Chosen

Tkinter desktop GUI with queue display, progress bar, and error log

Rejected

CLI / web interface

Why

Volunteers running on event-venue laptops. No terminal familiarity, no browser tab management, no server to host. Tkinter meant one executable, any Windows machine, no setup.

Confidence gating

Chosen

Hold-out eval set of 200 known prescriptions, automated diff against ground truth, 0.85 confidence threshold → human review queue

Rejected

Accept all model output / manual spot-check

Why

The model was confident on prescriptions it shouldn't have been — hallucinating dosages on illegible scans. Eval-by-vibes wasn't going to work on patient medication data. The hold-out set revealed the calibration gap before it hit production.

The Hard Problem

The one thing that almost broke the deployment

Claude was hallucinating dosages on illegible scans — and doing so confidently. The model would read a smudged '5mg' as '50mg' and return a confidence score that looked fine. Eval-by-vibes on a sample wasn't catching this. The failure mode was not 'model refuses to answer' but 'model answers incorrectly with high apparent confidence.' On medication dosages, that's a patient safety issue.

The Fix

Built an eval harness before deploying at volume: 200 prescriptions with known ground-truth extractions (manually verified), automated diff of model output against ground truth per field (patient name, medication, dosage, instructions), confidence threshold of 0.85 per field. Anything below threshold on any field routed to a human review queue displayed in the Tkinter GUI. Operators reviewed flagged records in real time. The eval harness ran in under 3 minutes on the hold-out set — enough to iterate the prompt before the full run.
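The per-field diff and the 0.85 gate can be sketched as follows. The four field names and the threshold come from the writeup; the record shapes (a `{"value", "confidence"}` dict per field) are assumed for illustration:

```python
FIELDS = ("patient_name", "medication", "dosage", "instructions")
THRESHOLD = 0.85

def diff_fields(predicted: dict, truth: dict) -> list:
    """Fields where model output disagrees with the ground-truth record."""
    return [f for f in FIELDS
            if predicted.get(f, {}).get("value") != truth.get(f)]

def needs_review(predicted: dict, threshold: float = THRESHOLD) -> bool:
    """Route to the human review queue if ANY field is below threshold."""
    return any(predicted.get(f, {}).get("confidence", 0.0) < threshold
               for f in FIELDS)
```

Running `diff_fields` across the 200-record hold-out set gives the per-field error counts; `needs_review` is the production-time gate feeding the GUI's review queue.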

Production Reality

What I had to fix in week 2

API rate limiting hit harder than expected at production volume. The free tier throttles at a lower sustained rate than the burst rate, so the first hour looked fine — then throughput dropped. Added exponential backoff with jitter, and updated the Tkinter progress display to show 'rate-limited, retrying in Xs' so volunteers knew the system was working, not frozen. Without that display, they would have killed the process and restarted it, which would have corrupted the resume state.
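The retry loop is standard exponential backoff with full jitter; a sketch under assumed names (the `on_wait` callback is where the GUI's "rate-limited, retrying in Xs" message would hook in):

```python
import random
import time

def with_backoff(call, max_retries=6, base=1.0, cap=60.0,
                 is_rate_limited=lambda exc: True, on_wait=None,
                 sleep=time.sleep):
    """Retry `call` on rate-limit errors with exponentially growing,
    jittered delays; re-raise anything else (or after max_retries)."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            if attempt == max_retries - 1 or not is_rate_limited(exc):
                raise
            # full jitter: uniform over [0, min(cap, base * 2^attempt)]
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            if on_wait:
                on_wait(delay)  # e.g. update the Tkinter status label
            sleep(delay)
```

Jitter matters here: eight workers backing off in lockstep would re-hit the rate limit simultaneously; randomizing the delay spreads the retries out.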

Lessons Carried Forward

What this taught me that I apply to every deployment

01

Write the spec before the code — 'process prescriptions' is not a spec; 'extract these 4 fields with this confidence threshold and route low-confidence to human review' is

02

Build the eval harness before the feature — the 200-prescription hold-out set found the dosage hallucination problem in 3 minutes; finding it in production would have been a patient safety incident

03

Plan for the failure mode you didn't think of — rate limiting at sustained volume is different from rate limiting at burst volume

04

Operator UX is a production constraint, not a polish item — the Tkinter display that showed 'rate-limited, retrying' prevented volunteers from killing the process and corrupting resume state

05

Pro-bono open-source work is verifiable in a way paid work often isn't — recruiters can read the code, not just the resume bullet
