
How to Evaluate an AI Development Studio: 12 Questions That Filter Out Vibe-Coders

April 10, 2026 · Afiniti Global Team · 8 min read

If you cannot get clean answers to these twelve questions in a one-hour discovery call, do not hire that agency. The AI development category attracted thousands of agencies between 2024 and 2026, most of them pivoted from generic web shops with no production AI experience. The questions below are designed to surface that gap quickly, before you have spent six months and $200,000 finding out.

Question 1: Walk me through your evaluation methodology for a recent agent

What you want to hear: a description of unit-level tool tests, trajectory evals on common task types, an outcome eval (often LLM-as-judge with human review), and a regression suite tied to a CI pipeline. Walk-away signal: "We test it manually" or "We rely on user feedback." Production agents without an eval harness break silently every model upgrade.
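The layers above can be sketched in a few lines. This is a hypothetical minimal harness, not any studio's actual tooling: `lookup_order`, `score_outcome`, and `run_regression` are illustrative stand-ins for a real tool, an LLM-as-judge scorer, and a CI regression gate.

```python
def lookup_order(order_id: str) -> dict:
    # Stand-in for a real agent tool; in production this hits an API.
    fake_db = {"A-100": {"status": "shipped"}}
    return fake_db.get(order_id, {"status": "not_found"})

def test_tool_unit():
    # Unit-level tool test: deterministic input -> expected output.
    assert lookup_order("A-100")["status"] == "shipped"
    assert lookup_order("ZZZ")["status"] == "not_found"

def score_outcome(transcript: str) -> float:
    # Outcome eval stub. In practice this is an LLM-as-judge call
    # whose scores are spot-checked by human reviewers.
    return 1.0 if "order shipped" in transcript.lower() else 0.0

def run_regression(transcripts: list[str], threshold: float = 0.9) -> bool:
    # Regression gate: CI fails the build if the mean outcome
    # score drops below the threshold.
    mean = sum(score_outcome(t) for t in transcripts) / len(transcripts)
    return mean >= threshold
```

The point of the question is not the specific framework; it is whether the agency can describe each of these layers and show where they run in CI.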

Question 2: Show me a production trace from your last deployment

What you want to hear: they pull up a real trace showing model calls, tool calls, latencies, and outcome scores. They can talk through what each part means and why it is instrumented that way. Walk-away signal: they cannot show traces because "client confidentiality" with no redacted alternative. Real studios have either redacted demo traces or live traces from open-source projects they have shipped.

Question 3: How do you handle prompt versioning and rollback?

What you want to hear: prompts in version control, deploys gated by evals, traffic ramp via feature flags, kill switch with human fallback, last-known-good rollback in under five minutes. Walk-away signal: prompts edited in a vendor UI with no versioning or "we just push the new prompt."

Question 4: When have you escalated a model upgrade decision and how did you decide?

What you want to hear: a story about running both the old and new models on the eval suite, comparing outcome scores, A/B testing, and then upgrading, holding, or reverting based on the data. Walk-away signal: "We always upgrade to the latest model" or "We had Claude 4 in production and just swapped to Opus 4.6, easy" with no evidence of evaluation.
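The data-driven decision described above reduces to a simple comparison. This is an illustrative sketch only: `run_old` and `run_new` stand in for real model-API clients, and each eval case carries its own scoring function.

```python
from statistics import mean

def compare_models(eval_cases, run_old, run_new, min_lift=0.0):
    """Run both models over the same eval set and recommend an action.

    eval_cases: list of {"input": ..., "score_fn": callable} dicts.
    Returns ("upgrade" | "hold" | "hold_and_investigate", lift).
    """
    old_scores = [c["score_fn"](run_old(c["input"])) for c in eval_cases]
    new_scores = [c["score_fn"](run_new(c["input"])) for c in eval_cases]
    lift = mean(new_scores) - mean(old_scores)
    if lift > min_lift:
        return "upgrade", lift          # new model measurably better
    if lift < 0:
        return "hold_and_investigate", lift  # possible regression
    return "hold", lift                 # no measurable difference
```

A real comparison would also A/B test on live traffic before a full cutover; the key signal is that the decision hinges on measured scores, not the model's release date.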

Question 5: How do you decide between MCP and custom tool APIs?

What you want to hear: a decision framework based on integration ownership, latency, eval needs, and third-party extensibility — and an opinion that varies by integration. Walk-away signal: "MCP for everything" or "we don't use MCP." Either dogma signals shallow experience.

Question 6: Show me the eval suite for a similar project to ours

What you want to hear: a real codebase with eval files organized by tool and task type, with a CI run that produces a quantitative report. Walk-away signal: a Notion page with three test cases.

Question 7: How do you price evals and ongoing improvement?

What you want to hear: 25–35% of build cost goes to evals, monitoring, and iteration in year one — explicitly budgeted, not buried. Walk-away signal: a fixed-bid build with no improvement budget, or evals priced as a separate add-on you can decline.

Question 8: What's the worst production incident you've had and what did you learn?

What you want to hear: a specific incident with a specific cause, a postmortem, and a structural change that prevented recurrence. Walk-away signal: "We have not had any incidents." Either they have not shipped to real production or they are not telling the truth.

Question 9: How do you handle data residency, audit trails, and compliance?

What you want to hear: explicit answers about where data flows, what is logged, how long it is retained, how access is gated, and what changes for HIPAA, SOC 2, or PCI workloads. Walk-away signal: "That's handled by the cloud provider" with no specifics.

Question 10: Walk me through your discovery process before you write any code

What you want to hear: structured workflow mapping, stakeholder interviews, a documented decision on what is in and out of scope, an explicit success metric, and an honest discussion of what they would not build. Walk-away signal: a one-page proposal generated within 24 hours of first contact. AI projects fail at scope, not at code.

Question 11: How do you handle model failures, hallucinations, and tool errors at runtime?

What you want to hear: structured tool outputs the agent must follow, retry logic with exponential backoff, fallback paths, and explicit handling of low-confidence states with human escalation. Walk-away signal: "The model handles it" or "We re-prompt." That is not a strategy.
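The runtime-failure pattern above can be sketched as follows. Everything here is hypothetical scaffolding: `call_with_backoff`, `NeedsHuman`, and the confidence floor are illustrative names, and a real model call would replace the passed-in callables.

```python
import time

class NeedsHuman(Exception):
    """Raised when the agent should escalate to a human."""

def call_with_backoff(fn, max_retries=3, base_delay=0.5):
    # Retry transient failures with exponential backoff: 0.5s, 1s, 2s, ...
    for attempt in range(max_retries):
        try:
            return fn()
        except TimeoutError:
            time.sleep(base_delay * (2 ** attempt))
    raise NeedsHuman("tool failed after retries")

def answer(call_model, fallback, confidence_floor=0.7):
    # call_model returns a structured result, e.g. {"confidence": 0.9, "text": "..."}.
    result = call_with_backoff(call_model)
    if result["confidence"] < confidence_floor:
        # Low-confidence states escalate to a human, not the end user.
        raise NeedsHuman(f"confidence {result['confidence']:.2f} below floor")
    return result.get("text") or fallback()
```

The specifics will differ per stack; what you are listening for is that failures have named, tested paths rather than "the model handles it."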

Question 12: Who specifically will work on our project and what have they shipped?

What you want to hear: named senior engineers with public artifacts — open-source contributions, talks, blog posts, GitHub history — that demonstrate they have actually built and shipped agent systems. Walk-away signal: "Our team" or unnamed contractors.

Bonus filters for an extra layer of confidence

Ask whether the proposal includes a kill switch and a fallback path. Ask whether the contract has an SLA on outcome metrics, not just uptime. Ask for references from a project the agency has supported in production for at least nine months — pilots that ended at month three do not prove production capability. Ask whether they have written publicly about the pattern they will use for your project; if not, ask why not.

What good answers look like in aggregate

The studios that earn your business will sound less impressive on the first call than the ones that lose it. They will use words like "evaluation," "trajectory," "rollback," "incident," and "scope." They will be slower to commit to a price because they want to scope properly. They will offer references from projects in their tenth month of production, not their second. They will tell you what they would not build, and why. Vibe-coders sound great on the first call. Production studios sound careful on the first call and great six months in.

AI Agency · Vendor Selection · AI Development · B2B · Hiring · Evaluation
