How I Get Reliable JSON from Vision Models (and What Breaks)
A deep dive into the pipeline behind Dr. Vin's AI vehicle assessments: structured output, JSON repair, composable prompts, and the image-prep gotchas that wreck bounding boxes.
The hardest part of building Dr. Vin was not getting a vision model to find car damage. That part works surprisingly well. The hard part was getting the model to return its findings as structured, parseable JSON instead of prose. Prompt engineering gets you most of the way there, but "most of the way" means your app breaks on the requests where it does not work.
LLM output is not a function return value. It is a suggestion. Sometimes the suggestion is wrong, sometimes it is truncated mid-object, and sometimes the model just ignores you and describes the photo in plain English. You need infrastructure around the model, not just a better prompt.
Here is what I ended up building.
Four Passes, Not One
Every photo that enters Dr. Vin goes through a four-pass pipeline. Most of them never reach the full vision model call.
```
Photo -> Triage (10ms) -> Quick-scan (200ms) -> Full Analysis (2-5s) -> Enrichment (5ms)
             |                   |
         duplicate           not a photo
           (skip)              (skip)
```
Triage runs in about 10 milliseconds with no AI involved. It computes a SHA-256 hash of the image bytes and checks for duplicates within the session. It also compares the image against a manifest of previous analyses. If the photo was already analyzed with the same prompt, it skips entirely.
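Triage is small enough to sketch in full. Here is a minimal version, assuming an in-memory session cache and a manifest keyed by image hash; `seenHashes` and `manifest` are hypothetical names, not the pipeline's real ones:

```typescript
import { createHash } from 'node:crypto';

// Hypothetical sketch of the triage pass: hash the bytes, skip duplicates
// within the session, and skip images already analyzed under the same prompt.
function triage(
  image: Buffer,
  promptHash: string,
  seenHashes: Set<string>,       // per-session duplicate cache
  manifest: Map<string, string>, // imageHash -> promptHash of last analysis
): 'duplicate' | 'already-analyzed' | 'analyze' {
  const imageHash = createHash('sha256').update(image).digest('hex');
  if (seenHashes.has(imageHash)) return 'duplicate';
  seenHashes.add(imageHash);
  if (manifest.get(imageHash) === promptHash) return 'already-analyzed';
  return 'analyze';
}
```

Because the manifest stores the prompt hash alongside the image hash, re-running a batch after a prompt change automatically re-analyzes only the affected photos.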
Quick-scan is a lightweight AI call with a minimal prompt. It classifies the document type: is this actually a photograph of a car, or is it a screenshot of the listing page? A receipt? A meme the seller included for some reason? Non-photos get filtered here for almost nothing.
Full analysis is the real work. A composed prompt with all the analysis modules runs against the vision model. This is where damage detection, severity classification, repair cost estimation, and bounding box generation happen. It takes 2-5 seconds per image.
Enrichment runs after the AI call with no model involvement. If the same scratch appears in three photos taken from different angles, enrichment correlates those findings so they show up as one issue, not three. It also sequences images by content (exterior shots before interior, driver side before passenger).
A typical batch of 20 photos from a car listing might include 3 duplicates (the seller uploaded the same angle twice), 2 screenshots of the listing page, and a photo of the odometer that does not need full damage analysis. That is 5-7 images that cost almost nothing to process. The pipeline catches them early so I only spend tokens on the photos that matter.
Making LLMs Speak JSON
Getting structured JSON from a vision model has three layers, and all three have to work.
Layer 1: Native structured output. Gemini supports responseMimeType: 'application/json', which constrains the model to valid JSON at the token level. This is not "please return JSON" in the prompt. The model physically cannot produce non-JSON tokens. OpenAI has a similar response_format: { type: 'json_object' } mode. This gets you valid JSON almost all the time.
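For reference, the knobs involved look like this. The field names match each provider's public API; the token cap value is my assumption, not a recommendation:

```typescript
// Gemini: generationConfig fragment passed alongside a generateContent request.
// responseMimeType constrains decoding to valid JSON at the token level.
const generationConfig = {
  responseMimeType: 'application/json',
  maxOutputTokens: 8192, // assumption: a generous cap reduces mid-object truncation
};

// OpenAI: the equivalent option on a chat completion request.
const response_format = { type: 'json_object' };
```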
Layer 2: Schema validation. Valid JSON is not the same as correct JSON. The model might return {"damage": "yes"} instead of the structured object with severity levels, bounding boxes, and repair cost ranges that the UI expects. Every response passes through a Zod schema that validates the shape. I use .passthrough() on the schema so product-specific fields survive validation without being stripped.
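The real pipeline uses Zod, but the layer itself is simple enough to show dependency-free. A hypothetical sketch (field names like `panel` are illustrative): validate the fields the UI depends on, and, like `.passthrough()`, keep everything else:

```typescript
// Hypothetical minimal validator: checks the fields the UI depends on and,
// like Zod's .passthrough(), returns the object with unknown fields intact.
type Severity = 'minor' | 'moderate' | 'severe';

interface Finding {
  type: string;
  severity: Severity;
  box_2d: [number, number, number, number]; // [ymin, xmin, ymax, xmax] on a 0-1000 grid
  [key: string]: unknown;                   // passthrough: extra fields survive
}

function validateFinding(value: unknown): Finding {
  const v = value as Record<string, unknown>;
  if (typeof v?.type !== 'string') throw new Error('finding.type must be a string');
  if (!['minor', 'moderate', 'severe'].includes(v.severity as string))
    throw new Error('finding.severity must be a known severity level');
  const box = v.box_2d;
  if (!Array.isArray(box) || box.length !== 4 || !box.every((n) => typeof n === 'number'))
    throw new Error('finding.box_2d must be four numbers');
  return v as Finding;
}
```

Valid JSON with the wrong shape fails loudly here instead of rendering as a broken UI.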
Layer 3: Repair. Models sometimes truncate their output when they hit token limits. You get the first 80% of a perfectly structured JSON object, then it just stops mid-key. The response is valid up to the truncation point but JSON.parse() throws because there's an unclosed brace.
This is the function that handles it:
```typescript
function repairJson(text: string): string {
  let repaired = text;
  // Remove trailing commas before } or ]
  repaired = repaired.replace(/,\s*([\]}])/g, '$1');
  // Track unclosed braces/brackets on a stack, ignoring any inside strings
  const open: string[] = [];
  let inString = false;
  let escape = false;
  for (const ch of repaired) {
    if (escape) { escape = false; continue; }
    if (ch === '\\') { escape = true; continue; }
    if (ch === '"') { inString = !inString; continue; }
    if (inString) continue;
    if (ch === '{' || ch === '[') open.push(ch);
    if (ch === '}' || ch === ']') open.pop();
  }
  // Close any trailing incomplete string
  if (inString) repaired += '"';
  // Close unclosed delimiters in reverse order of opening (innermost first)
  while (open.length > 0) {
    repaired += open.pop() === '{' ? '}' : ']';
  }
  // Clean up trailing commas again after repair
  repaired = repaired.replace(/,\s*([\]}])/g, '$1');
  return repaired;
}
```
There are two key details. The first is tracking string context: a curly brace inside a string value like "color": "rust {oxidized}" is not a structural brace, so the function walks every character, tracks escape sequences and whether it is inside a quoted string, and only pushes structural braces and brackets onto the stack. The second is closing order: popping the stack closes delimiters innermost-first, which matters for truncations like {"findings": [{"type": "scratch" that need }]} appended in exactly that order, not ]}}.
This function has saved more production incidents than almost any other function in the codebase. It is not a substitute for structured output. It is a safety net for the small fraction of responses where the model runs out of tokens.
A note on function calling: it is the "correct" answer for structured output, and I tried it. But native structured output mode plus schema validation plus repair turned out to be a better fit. Function definitions consume prompt tokens on every request, the output shape is locked to the tool contract (so changing your schema means redefining the tool), and you lose the ability to get free-form JSON fields that your schema does not explicitly define. Function calling is the right choice for agentic tool use where the model decides which action to take. For "analyze this image and return a JSON object matching this shape," constrained output mode wins.
Composable Prompts
I started with one giant prompt that did everything: role definition, output format, damage rules, bounding box coordinates, severity scale. One string, 3,000 tokens, and growing every week. The problem was that the instructions bled into each other. I added a line about being specific with paint damage and the model quietly stopped returning bounding boxes. I fixed the bounding boxes and the severity scores started coming back wrong. Every edit I made broke something I was not looking at.
The fix was composable prompt modules. I built an engine called Phototology that handles the vision pipeline, and at its core is a PromptBuilder:
```typescript
const prompt = new PromptBuilder()
  .use('base')              // role definition + JSON format rules
  .use('vehicle-condition') // damage detection, severity, repair costs
  .use('moderation')        // safety screening (always on)
  .addCustom('Return bounding boxes as [ymin, xmin, ymax, xmax] on a 1000x1000 grid.')
  .build();
```
Each module is a self-contained prompt fragment, typically 40-100 lines. It declares which JSON output fields it expects and provides an example of its expected output. The builder concatenates all module prompts and merges output examples into a single JSON schema shown to the model.
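A stripped-down sketch of what a builder like this can look like. The module registry and its contents here are stand-ins, not Phototology's real modules:

```typescript
import { createHash } from 'node:crypto';

// Hypothetical module registry; real modules are 40-100 line prompt fragments
// that also declare which JSON output fields they emit.
const MODULES: Record<string, { prompt: string; outputExample: object }> = {
  base: { prompt: 'You are a vehicle inspector. Respond only with JSON.', outputExample: {} },
  'vehicle-condition': {
    prompt: 'List every instance of damage with severity and a repair cost range.',
    outputExample: { findings: [] },
  },
};

class PromptBuilder {
  private parts: string[] = [];
  private examples: object[] = [];

  use(name: string): this {
    const mod = MODULES[name];
    if (!mod) throw new Error(`unknown module: ${name}`);
    this.parts.push(mod.prompt);
    this.examples.push(mod.outputExample);
    return this;
  }

  addCustom(text: string): this {
    this.parts.push(text);
    return this;
  }

  build(): { prompt: string; hash: string } {
    // Merge module output examples into one JSON shape shown to the model
    const merged = Object.assign({}, ...this.examples);
    const prompt =
      this.parts.join('\n\n') + '\n\nRespond with JSON shaped like:\n' + JSON.stringify(merged);
    // Hash the assembled prompt so stale analyses are a string comparison away
    const hash = createHash('sha256').update(prompt).digest('hex');
    return { prompt, hash };
  }
}
```

Because build() hashes the final assembled string, any edit to any module changes the hash, and unchanged module combinations keep theirs.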
There are about 20 modules. I can open any single module and understand exactly what it does without reading the rest of the prompt. Modules are order-independent, so I can compose different combinations for different analysis types without worrying about instruction conflicts.
One side benefit: the builder computes a SHA-256 hash of the assembled prompt. This hash is stored alongside every analysis result. If I update the damage detection module, I know exactly which images were analyzed with the old prompt and need to be re-run. Images analyzed by unrelated modules are unaffected. The hash comparison is a string match, so figuring out what is stale costs nothing.
Pointing at Damage Without Hiding It
The model returns bounding boxes as [ymin, xmin, ymax, xmax] on a normalized 1000x1000 grid. Coordinates are independent of image dimensions. A bounding box at [100, 200, 400, 600] means "from 10% down, 20% from the left, to 40% down, 60% from the left." Rendering is just CSS: convert from the 1000-scale grid to percentages and absolutely-position a div.
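The conversion is small enough to show. A sketch, assuming the overlay div is absolutely positioned inside a relatively positioned image wrapper:

```typescript
// Convert a Gemini box_2d value ([ymin, xmin, ymax, xmax] on a 0-1000 grid)
// into percentage-based CSS for an absolutely-positioned overlay div.
function boxToCss(box: [number, number, number, number]) {
  const [ymin, xmin, ymax, xmax] = box;
  return {
    top: `${ymin / 10}%`,
    left: `${xmin / 10}%`,
    width: `${(xmax - xmin) / 10}%`,
    height: `${(ymax - ymin) / 10}%`,
  };
}

// Center point of the box, useful for a single pin instead of a full region.
function boxCenter(box: [number, number, number, number]) {
  const [ymin, xmin, ymax, xmax] = box;
  return { top: `${(ymin + ymax) / 20}%`, left: `${(xmin + xmax) / 20}%` };
}
```

Spread the result into a style prop (or assign to element.style) and the overlay scales with the image at any display size.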
The irony of damage detection is that you are trying to point at something that is already hard to see. A faint paint scratch, a small dent, a hairline crack in the windshield. The whole point is to help the user find it. But bounding boxes are not perfect, and a translucent overlay sitting on top of a subtle scratch can make it harder to see, not easier. You are literally covering the evidence with the flag.
It gets worse with scale. Some photos have 10 or more findings. Overlapping shaded regions, numbered indicators, confidence borders. At that point it stops looking like a diagnostic tool and starts looking like a Disneyworld hotel carpet. There is a fine line between the experience feeling like magic and it completely ruining the photo.
I moved to minimal damage dots, small numbered indicators placed at the center of each bounding box. The model still returns the full bounding box coordinates, and I still use them for positioning and for calculating whether findings overlap. But the visual treatment went from "highlight the entire region" to "drop a pin and let the user look." Less impressive in a demo. Much more useful when you are actually trying to decide whether to buy a car.
Image-Prep Gotchas I Wish I'd Known
Three things about feeding images to Gemini that aren't in Google's docs. Each one cost me a day.
1. Gemini does not respect EXIF orientation.
Gemini returns bounding box coordinates in the raw pixel-buffer space. It does not read EXIF orientation metadata. iPhone photos taken in portrait have landscape-oriented raw buffers (the sensor is physically landscape) with an EXIF tag telling display software to rotate. Browsers respect the tag and display the image upright.
So if you send the raw iPhone JPEG to Gemini and render its bboxes on the browser-displayed image, the coordinates are rotated by whatever the EXIF orientation says. For iPhone portraits (EXIF=6) that's 90 degrees. For upside-down captures (EXIF=3) it's 180.
I spent hours convinced the model was hallucinating coordinates before I realized the model and my renderer were looking at the image in different orientations. The fix: physically rotate the buffer before sending. sharp(buffer).rotate() applies the EXIF tag and strips it in the process, so Gemini and your renderer are both looking at the same orientation.
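The mapping between the common EXIF orientation values and the rotation involved is worth having in your head. This table is the math sharp applies for you, not code from the pipeline:

```typescript
// Degrees the raw pixel buffer must be rotated clockwise so it matches what
// EXIF-aware viewers display. sharp(buffer).rotate() applies exactly this.
function exifRotationDegrees(orientation: number): number {
  switch (orientation) {
    case 3: return 180; // upside-down capture
    case 6: return 90;  // iPhone portrait: landscape sensor, rotate clockwise
    case 8: return 270; // portrait held the other way up
    default: return 0;  // 1 or missing: buffer is already upright
  }
}
```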
2. Coordinate accuracy drifts below 1000 pixels.
In A/B tests I ran with Gemini 2.5 Flash, a 960x720 image of a silver hatchback returned a paint_oxidation finding with centerY at 34.5%, which mapped to the sky above the car. The same image upscaled to 1333x1000 returned centerY on the actual body panel where the oxidation was. The exact centerY value varies run-to-run (my original measurement was 70.0%, a re-run landed at 53.6%); what doesn't vary is that it moves from sky to body panel as soon as the image clears 1000 pixels.
Same model, same prompt, same photo. Only the pixel dimensions changed.
I haven't seen this documented. My best guess is that Gemini internally upscales sub-tile images (published tile sizes for Gemini vision sit around 768x768) and whatever interpolation it uses throws off the coordinate alignment. The fix for me: upscale to at least 1000 pixels on each dimension before sending. sharp's resize(1000, 1000, { fit: 'outside', withoutReduction: true }) leaves large images alone and upscales small ones proportionally.
Your mileage may vary. Measure on your own photos.
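The sizing rule above reduces to a few lines. This is a pure-function sketch of what fit: 'outside' with withoutReduction computes, not sharp's actual implementation:

```typescript
// Target dimensions so the smaller side is at least `min` pixels,
// preserving aspect ratio and never shrinking larger images.
function targetDimensions(w: number, h: number, min = 1000) {
  if (Math.min(w, h) >= min) return { width: w, height: h }; // leave large images alone
  const scale = min / Math.min(w, h);
  return { width: Math.round(w * scale), height: Math.round(h * scale) };
}
```

For the 960x720 hatchback photo above, this yields 1333x1000, which is exactly the upscale that moved the finding from the sky onto the body panel.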
3. The field name is box_2d, specifically.
For the first month of this project my response schema called the bounding box field bbox. Gemini would return coordinates, but they were noticeably worse than they should have been.
Switching the field name to box_2d everywhere in the schema and prompts improved accuracy noticeably. I don't have clean A/B numbers for this one (the rename landed alongside other changes), but external evidence is consistent with it: Google's own docs specify box_2d as the expected field name, and Gemini appears post-trained to that exact key in that exact coordinate order: box_2d, [ymin, xmin, ymax, xmax], normalized to a 0-1000 grid. Field name and coordinate order both matter. SimEdw's 2025 benchmark covers bounding-box accuracy across Gemini variants with useful numbers if you want to dig deeper.
If you're building on Gemini bounding boxes: use box_2d, use [ymin, xmin, ymax, xmax], normalize to 0-1000, and don't try to be clever about field names.
All three of these live outside what Google's docs cover in detail. Each one cost me hours of staring at wrong coordinates before I pattern-matched to "something is off with the image, not the model."
Try It
Dr. Vin runs this pipeline on every photo set. If you want to see it in action, upload some photos from a used car listing at drvin.ai and see what it finds. The free assessment shows the top findings and an overall condition grade. No account, no VIN, no make or model needed. Just photos.
If you are building something similar and run into issues with structured output, bounding boxes, or provider reliability, I am happy to share thoughts. If I had a dollar for every mistake described above, I would not need to charge for Dr. Vin's full reports.
Related Reading
An honest breakdown of what AI photo analysis finds versus what a mechanic finds. They cover different ground - here's how to use both effectively.
What $15 of Homework Can Save You on a Used Car - Smart buyers show up with a condition report. Here is what $15 and five minutes of photo uploads gets you at the negotiating table.
