Fine-Tuning

How XY.AI Labs Built Customer-Specific EOB Parsers with Serverless Fine-Tuning

  • 77% → 87%

    EOB parsing accuracy

  • 2–3×

    lower infrastructure costs

  • Multiple per day

    fine-tuning iterations

Executive Summary

XY.AI builds automated end‑to‑end workflows for mid‑sized healthcare providers, targeting the friction that slows operations, including revenue cycle management. To operate at scale amid highly variable healthcare data formats, the team needed specialized models their existing engineers could train, evaluate, and maintain without standing up a dedicated AI infrastructure function.

Together AI’s serverless fine‑tuning platform gave XY.AI a plug‑and‑play workflow for training Qwen 2.5 14B models securely on customer‑specific data using fully managed infrastructure. Iteration velocity accelerated from weekly experiments to multiple runs per day, infrastructure costs dropped 2–3×, and EOB parsing accuracy improved from 77% to 87%.

About XY.AI

XY.AI automates complex workflows for healthcare providers and operators, eliminating manual transcription, data entry, and cross‑system reconciliation that slow operations including revenue cycle management. Lamara De Brouwer — who combines a background in psychology and computer science — founded the company alongside serial entrepreneur Sam De Brouwer. The company closed its seed round in mid‑2024.

The platform combines multimodal browser‑level automation with backend AI to bridge disconnected healthcare portals. A typical workflow extracts data from a payer portal, applies customer‑specific business logic, and populates downstream systems for submission or reconciliation. XY.AI supports customers across the healthcare ecosystem, including small and medium‑sized providers, RCM organizations, EHRs, billing systems, and more.

The challenge

Building specialized models for healthcare data parsing required solving three interrelated problems: 

Infrastructure friction throttled iteration

Initially, XY.AI managed its own fine‑tuning and serving stack using Unsloth and vLLM. Engineers provisioned GPU instances, tuned serving parameters, and debugged model‑specific optimizations by hand. Each new model architecture required additional research and bespoke serving configuration. This overhead capped experimentation at one or two training runs per week — forcing a seed‑stage team to spend scarce engineering time on infrastructure rather than evaluation, data design, and domain‑specific improvements.

Healthcare data varies by provider

Explanation of Benefits (EOB) documents are notoriously inconsistent across the healthcare system. Payers use wildly different formats: Some summarize claims at a high level, while others itemize every adjustment and payment detail. Large or merged healthcare organizations often operate multiple schemas simultaneously — sometimes five or more in parallel. While EOBs are a concrete example, this pattern of structural variance appears broadly across healthcare data. Parsing this variability into reliable structured JSON requires rapid iteration on prompts, data formatting, and hyperparameters.

Per‑customer models require scalable economics

XY.AI’s product strategy depends on customer‑specific fine‑tuning rather than one‑size‑fits‑all deployments. Scaling this approach on self‑hosted infrastructure would require a dedicated AI platform team — economically impractical at seed stage. Although individual training runs were inexpensive, managing GPU clusters and serving endpoints for dozens of customers threatened to overwhelm both cost structure and operational capacity.

The solution

Together AI turned fine‑tuning, evaluation, and deployment into a single repeatable loop, from experiment to testable endpoint, significantly improving XY.AI's iteration cadence.

XY.AI migrated from its self‑hosted stack to the Together Fine‑Tuning Platform, replacing custom infrastructure work with a standard API workflow. The team standardized on Qwen 2.5 14B as the base model, training LoRA adapters for structured EOB extraction. On Together's managed infrastructure, training runs complete in 10–20 minutes at roughly $10 per run. Multiple times a day, the team submits a job, receives an endpoint serving the trained adapter, and evaluates the result.
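
As a concrete illustration of this workflow, the sketch below drives one iteration from the Together Python SDK. It is a minimal example, not XY.AI's actual code: the training file, suffix, and hyperparameter values are hypothetical, and parameter names should be verified against the current Fine‑Tuning API documentation.

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# Upload one customer's formatted EOB training set (hypothetical file name).
train_file = client.files.upload(file="acme_health_eob_train.jsonl")

# Launch a LoRA fine-tuning run on the Qwen 2.5 14B base model.
job = client.fine_tuning.create(
    training_file=train_file.id,
    model="Qwen/Qwen2.5-14B-Instruct",
    lora=True,                    # train a LoRA adapter, not full weights
    n_epochs=3,
    learning_rate=1e-5,
    suffix="acme-health-eob-v1",  # hypothetical adapter version tag
)
print(job.id, job.status)  # poll until complete, then evaluate the endpoint
```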

Evaluation covers generation quality (exact match, response rate), per‑field accuracy across 11 fields plus overall field accuracy, calibration (expected calibration error and accuracy–retention curves), and routing tables mapping confidence → expected accuracy → retention. Engineers iterate on data formatting, prompts, hyperparameters, and post‑processing in the same loop.
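
For readers unfamiliar with these metrics, the self‑contained sketch below shows the two calibration computations named above: expected calibration error (ECE) over binned confidences, and the accuracy–retention table from which a routing policy is read off. The data, bin count, and thresholds are illustrative, not XY.AI's.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE: bin-weighted gap between mean confidence and accuracy per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

def accuracy_retention_table(conf, correct, thresholds):
    """For each threshold: fraction auto-processed (retention) and its accuracy."""
    rows = []
    for t in thresholds:
        kept = conf >= t
        acc = correct[kept].mean() if kept.any() else float("nan")
        rows.append((t, kept.mean(), acc))
    return rows

# Illustrative held-out results: per-example confidence and correctness.
conf = np.array([0.95, 0.91, 0.72, 0.55, 0.88, 0.40, 0.99, 0.63])
correct = np.array([1, 1, 1, 0, 1, 0, 1, 1], dtype=float)

print("ECE:", round(expected_calibration_error(conf, correct), 3))
for t, retention, acc in accuracy_retention_table(conf, correct, [0.5, 0.7, 0.9]):
    print(f"threshold={t:.2f}  retention={retention:.2f}  accuracy={acc:.2f}")
```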

This loop supports a confidence-based routing layer in production. Using token-level log probabilities, the inference service flags predictions that clear a calibrated threshold — about 50% of cases — and processes them automatically with ~95% accuracy on that subgroup. Lower-confidence predictions route to human review as a safety check and generate feedback data for future training. Confidence thresholds are validated against previously unseen evaluation data, so the routing policy stays aligned with empirical accuracy over time. 
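
The case study doesn't specify XY.AI's exact scoring function, but a common choice is the geometric mean of the output token probabilities, i.e. the exponential of the mean token logprob, compared against the threshold read from the held‑out routing table. A minimal sketch, with the threshold value purely illustrative:

```python
import math

AUTO_THRESHOLD = 0.90  # illustrative; in practice taken from the held-out routing table

def sequence_confidence(token_logprobs):
    """Geometric-mean token probability: exp of the mean token logprob."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def route(token_logprobs):
    conf = sequence_confidence(token_logprobs)
    return ("AUTO" if conf >= AUTO_THRESHOLD else "HUMAN_REVIEW"), conf

# Example: logprobs for a short generated JSON payload (made-up values).
decision, conf = route([-0.02, -0.10, -0.01, -0.05])
print(decision, round(conf, 3))  # AUTO 0.956
```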

[Diagram] XY.AI's pipeline in three phases:

  • Training (Together‑managed infrastructure): raw EOB data (X12/835, claims) is de‑identified, formatted, and split, with a frozen held‑out set; a per‑customer LoRA adapter is fine‑tuned on Qwen 2.5 14B and versioned.

  • Offline eval & calibration (held‑out): exact match, response rate, and per‑field accuracy (11 fields) are computed on the held‑out split; ECE and accuracy–retention curves yield versioned routing tables (confidence → expected accuracy); adapters that pass held‑out thresholds are promoted from a staging endpoint to production.

  • Production (live traffic + monitoring): new EOBs are scored via token logprobs, routed by confidence to AUTO or HUMAN REVIEW, schema‑checked, and emitted as structured JSON (payments, adjustments, status); predictions, confidences, routing decisions, and delayed outcome labels are logged to a metrics store for accuracy and calibration drift monitoring, and curated feedback labels flow into future training while the held‑out eval set stays frozen.

This setup also handles multi-tenancy. Because EOB formats vary by provider, XY.AI trains and validates a LoRA adapter per customer. Together's managed multi-LoRA serverless capability makes loading and unloading adapters straightforward and keeps inference cost-effective, while isolating customer traffic. With this architecture, a three-person engineering team efficiently maintains distinct production models across many customers, without dedicated per-customer GPU deployments.
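
In practice, per‑customer routing can be as simple as mapping each tenant to the model name of its fine‑tuned adapter and passing that name on every inference call. The sketch below assumes the Together chat completions API with token logprobs enabled; the adapter names, mapping, and system prompt are hypothetical.

```python
from together import Together

client = Together()

# Hypothetical tenant -> fine-tuned LoRA adapter model names.
CUSTOMER_ADAPTERS = {
    "acme-health": "xyai/Qwen2.5-14B-Instruct-acme-health-eob-v3",
    "beta-clinic": "xyai/Qwen2.5-14B-Instruct-beta-clinic-eob-v1",
}

def parse_eob(customer_id: str, eob_text: str):
    """Call the customer's adapter; return the JSON output and token logprobs."""
    resp = client.chat.completions.create(
        model=CUSTOMER_ADAPTERS[customer_id],
        messages=[
            {"role": "system", "content": "Extract this EOB into the target JSON schema."},
            {"role": "user", "content": eob_text},
        ],
        logprobs=1,  # token-level logprobs feed the confidence router
    )
    choice = resp.choices[0]
    return choice.message.content, choice.logprobs
```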

Results (Reported by XY.AI)

Together AI’s platform enabled XY.AI to achieve the training velocity and cost efficiency required to automate healthcare workflows at scale.

  • 3× faster training cycles

    Iteration increased from 1–2 runs per week to multiple runs per day

  • 2–3× lower infrastructure costs

    Fully managed fine-tuning and serverless deployment replaced the self-hosted stack

  • 87% EOB parsing accuracy

Improved from 77% to 87% accuracy, measured against expert human baselines

Rapid iteration drives model quality

Training velocity increased from one or two experiments per week to multiple iterations per day. This pace enabled systematic optimization of data formats, prompts, and calibration logic, driving EOB‑to‑JSON parsing accuracy from 77% to 87% when measured against expert human baselines.

Production deployment with human‑in‑the‑loop architecture

In production, high‑confidence predictions are processed automatically while ambiguous cases route to human review. More than half of incoming EOBs clear the confidence threshold and bypass manual handling. Over time, human feedback feeds retraining that steadily reduces review volume while holding the automated path to its calibrated accuracy target.

Infrastructure cost avoidance

Together's fully managed training and serving eliminated the need for a dedicated AI infrastructure hire. XY.AI's existing team handles all model development, while infrastructure costs dropped 2–3× compared to the prior self‑hosted setup, even as experimentation throughput rose significantly.

"Together AI does for fine-tuning and inference what Vercel does for LLM-based apps—it removes the infrastructure layer so we can focus on our product. We fine‑tune and deploy customer‑specific models through simple API calls. That lets our existing team move from weekly to daily iteration, cut costs by 2–3×, and improve accuracy from 77% to 87%." — Lamara De Brouwer, Co-Founder & CTO, XY.AI Labs

Use case details

Highlights

  • Weekly runs → multiple per day
  • 2–3× lower infra costs
  • EOB accuracy 77% → 87%
  • 10–20 min runs (~$10/run)
  • Human-in-loop routing at scale

Use case

Fine-tune and deploy models for EOB→JSON extraction

Company segment

Healthcare
