
⚡️ happy tuesday.
Dario Amodei (CEO of Anthropic) dropped a fat manifesto this week called "The Adolescence of Technology." His core claim: AI smarter than Nobel laureates across every field—what he calls a "country of geniuses in a datacenter"—could show up as early as late 2026. Meanwhile, his company shipped MCP Apps: basically 10 interactive connectors that let Claude work inside Slack, Figma, Asana, Canva, and more.
Today, we’re talking about:
Fixing AI trust issues
Why enterprise AI keeps dying
Dario vs. software engineers
Microsoft's awkward AI hedge
AI that touches things
WTF is AEO/GEO? (join live 4 free)
📧 Reply w/ your take, get a gift: Are coding jobs actually cooked?
We’ll Be Your Chief AI Officer
Tenex is the AI services firm behind this newsletter. We build software that moves the P&L. Our work sits at the intersection of engineering, operations, and real business constraints.

How Gauntlet's CTO Tests AI Before It Ships (+ How to Make Your System Trustworthy)

the problem: In McKinsey's latest global survey of nearly 2,000 organizations, 51% reported AI backfiring in the past year. The top culprit? Plain old inaccuracy. The uncomfortable truth: most of these systems aren't broken. They're unreliable. And unreliable is worse than broken—because you can't predict when it'll fail.
think: Your demo works perfectly in the meeting. Your chatbot nails three questions. Leadership applauds. Then you deploy it. Real users show up with weird inputs and edge cases, and the whole thing goes Hindenburg.
the solution: Ash Tilawat is the Mr. Miyagi of AI engineering—the kind of teacher who makes you sand the floor and paint the fence before you realize you've learned karate.
As CTO of Gauntlet AI, he's trained over 1,000 devs through a program that's become legendary in the AI community. The reason: people who graduate ship production systems that actually work when push comes to shove. His evaluation framework is the core of what he teaches—and he just open-sourced the entire thing. Notebooks, schemas, configuration templates, everything.
the result: If you're the one who has to explain why the chatbot hallucinated in front of a customer, this framework is for you. By the end of this workflow, you'll have a library of test cases that prove your system works. You catch failures before users do. And over time, that library becomes a moat—proprietary data competitors can't clone, even if they copy every other feature you have.
Master Mr. Miyagi's method for machine trust in 5 moves:
build your golden set
test the weird stuff
build a replay harness
create your rubric
run experiments
1. build your golden set
If you're reading this, you're one of three people: the engineer building this thing, the PM who owns it, or the boots-on-the-ground expert whose job is being automated—or who at least knows what good looks like.
We're talking about AI systems that take requests and do stuff—chatbots that answer questions, agents that pull data or take actions, copilots embedded in your product. If your system takes an input and generates an output, this applies.
Sit down together. Identify the 30–50 most important things your AI should handle correctly for that expert. Not edge cases yet—just the core stuff. The requests that, if broken, make the whole system useless.
For each one, write down:
the input: What the user asks
must do: What tools get called, what sources get retrieved
must say: Keywords that have to appear
must NOT say: Keywords that should never appear
Run each test manually. You and your PM check them off one by one. Get lazy or sloppy here, and everything downstream inherits the slop.
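To make that concrete, here's a minimal sketch of what one golden-set entry and its pass/fail check could look like in plain Python. The field names and the example PTO question are illustrative, not a prescribed schema—adapt them to whatever your system actually does.

```python
# A minimal golden-set entry as a plain Python dict.
# Field names and values here are hypothetical examples.
GOLDEN_SET = [
    {
        "input": "How many PTO days do I have left?",
        "must_call": ["lookup_pto_balance"],            # tools that must fire
        "must_retrieve": ["pto_policy.md"],             # sources that must be pulled
        "must_say": ["days remaining"],                 # keywords that must appear
        "must_not_say": ["I don't know", "unlimited"],  # keywords that must never appear
    },
]

def check_case(case, output_text, tools_called, sources_retrieved):
    """Binary pass/fail for one golden-set entry."""
    text = output_text.lower()
    return (
        all(t in tools_called for t in case["must_call"])
        and all(s in sources_retrieved for s in case["must_retrieve"])
        and all(kw.lower() in text for kw in case["must_say"])
        and not any(kw.lower() in text for kw in case["must_not_say"])
    )
```

Checking these off "manually" can literally be you and your PM running each input and calling `check_case` on what came back.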
2. test the weird stuff
Your golden set covers the common requests. That's exactly what it should do. Labeled scenarios catch everything else—edge cases, strange phrasings, multi-step requests that make your system hesitate or fail silently.
The approach: create a matrix of categories (domain × tool type) and systematically fill the gaps. If your HR agent handles PTO questions but breaks on visa requirements, that's a labeled scenario you're missing. You can use an LLM to generate variations of existing test cases, but a human still needs to validate that the generated scenarios actually make sense.
pro tip: Track your coverage gaps explicitly—preferably in a visual matrix. Those gaps are exactly where production failures hide, waiting for a real user to stumble into them. Here's what we mean using that HR agent example:
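A coverage matrix doesn't need tooling—a few lines of Python can enumerate every domain × tool-type cell and flag the empty ones. The categories below are hypothetical stand-ins for that HR agent; swap in your own.

```python
from itertools import product

# Hypothetical categories for an HR agent — replace with your own.
domains = ["PTO", "benefits", "visa", "payroll"]
tool_types = ["lookup", "update", "escalate"]

# Scenarios you've actually written, tagged by (domain, tool_type).
covered = {("PTO", "lookup"), ("PTO", "update"), ("benefits", "lookup")}

# Every empty cell is a production failure waiting for a real user.
gaps = [cell for cell in product(domains, tool_types) if cell not in covered]
for domain, tool in gaps:
    print(f"missing scenario: {domain} x {tool}")
```

The visa × lookup gap from the example above falls straight out of this list.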

3. build a replay harness
wtf is it: A recording system for your AI interactions.
Running your full test suite against a live system every time you tweak a prompt gets expensive fast. A replay harness fixes this. It documents the exact state of each interaction—input, system prompts, tool calls, sources retrieved—so you can replay them with modified configurations without burning another API call.
This is also where you graduate from binary pass/fail to actual numeric scores.
for retrieval systems: Track precision (of the docs you retrieved, how many were relevant?) and recall (of all the relevant docs, how many did you actually grab?).
for agents: Track tool call accuracy (did it call the right tool at the right time?).
Pick 2–3 metrics to start. More than that, and you'll drown.
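Those 2–3 starter metrics are a few lines each. Here's one hedged sketch of precision/recall for retrieval and tool-call accuracy for agents—the function names are ours, not part of Ash's released framework.

```python
def precision_recall(retrieved, relevant):
    """Retrieval metrics.
    precision: of the docs you retrieved, how many were relevant?
    recall: of all the relevant docs, how many did you actually grab?"""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def tool_call_accuracy(expected_calls, actual_calls):
    """Agent metric: did it call the right tool at the right step?"""
    correct = sum(1 for e, a in zip(expected_calls, actual_calls) if e == a)
    return correct / max(len(expected_calls), 1)
```

Because the replay harness stores the exact tool calls and retrieved sources per interaction, you can recompute these scores across configurations without another API call.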
4. create your rubric
Remember the rubrics your English teacher handed out in 8th grade? The ones that told you exactly what an A looked like versus a C? Same concept—but now you're teaching an LLM how to grade.
The scoring system: accuracy, completeness, groundedness, tone.
The LLM applies it consistently across thousands of outputs. But you don't just write a rubric and hand it over. You calibrate the LLM-as-judge against human judgment first.
Run 50–100 examples through both human and LLM scoring.
Find where they disagree.
Figure out why.
Adjust the rubric wording.
Re-run.
Keep going until LLM scores match human scores within half a point.
Once calibrated, the LLM can grade thousands of outputs while your humans spot-check monthly to catch drift.
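The calibration loop above boils down to one number: how far apart are the human and LLM scores on average? A minimal sketch, assuming scores on the same numeric scale (function names are ours):

```python
def calibration_gap(human_scores, llm_scores):
    """Mean absolute disagreement between human and LLM-as-judge scores,
    plus the worst disagreements — the examples to investigate first."""
    diffs = [abs(h, ) if False else abs(h - l) for h, l in zip(human_scores, llm_scores)]
    mean_gap = sum(diffs) / len(diffs)
    worst = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)[:5]
    return mean_gap, worst

def is_calibrated(human_scores, llm_scores, tolerance=0.5):
    """The stopping condition from the loop above: keep adjusting the
    rubric wording until scores match within half a point."""
    gap, _ = calibration_gap(human_scores, llm_scores)
    return gap <= tolerance
```

The `worst` indices tell you exactly which examples to re-read when figuring out *why* the judge disagrees with your humans.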
5. run experiments
With the infrastructure in place, you can finally test different configurations against each other. When a new model drops or you want to try a prompt rewrite, create a new configuration and run your entire eval suite against both the baseline and the variant.
The decision framework is simple:
Does the golden set still pass 100%?
Did any critical metric regress?
Is accuracy improved or at least unchanged?
Are cost and latency acceptable?
If all checks pass, ship it. If anything fails, hold and investigate before deploying to production.
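That decision framework is simple enough to encode as a gate in CI. A sketch, with illustrative metric names and made-up cost/latency budgets—tune the thresholds to your own use case:

```python
def ship_decision(baseline, variant, golden_pass_rate,
                  max_cost_ratio=1.5, max_latency_ratio=1.5):
    """Gate a new configuration against the baseline.
    `baseline` and `variant` are dicts of metrics; keys are illustrative."""
    if golden_pass_rate < 1.0:
        return "hold: golden set no longer passes 100%"
    if variant["accuracy"] < baseline["accuracy"]:
        return "hold: accuracy regressed"
    if variant["cost_per_request"] > baseline["cost_per_request"] * max_cost_ratio:
        return "hold: cost increase beyond budget"
    if variant["latency_ms"] > baseline["latency_ms"] * max_latency_ratio:
        return "hold: latency regression"
    return "ship"
```

Note the cost check sits right next to accuracy—that's the 2%-better-but-3x-pricier tradeoff made explicit instead of discovered on the invoice.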
pro tip: Track cost per request alongside accuracy. A 2% accuracy improvement sounds great until you realize it came with a 3x cost increase. Depending on your use case, that tradeoff might still be worth it—but you need to know you're making it.
tl;dr: The teams still running production AI on vibes are f***ed. The ones building systematic eval sets are compounding an advantage that gets harder to catch every month.

WTF is AEO/GEO??? Building a Scoring System
Guest: HubSpot CMO, Kipp Bodnar
Day: Wednesday, Jan 28
Time: 4:00 PM – 5:00 PM EST
Evals Part 2: How do you know if your org's AI is useful?
Guest: Hamel Husain, Parlance Labs
Day: Wednesday, Feb 4
Time: 4:00 PM – 5:00 PM EST

Alex, the co-founder of the company that pays for this newsletter, just dropped an essay on why enterprise AI keeps dying. The stats: 2 out of 3 AI projects die in pilot. 74% stall on operational barriers—not technical ones. $67.4B lost to hallucinations in 2024.
the two reasons that sting:
pilot graveyard: Nobody owns the path to production. No deployment plan, no change management, just endless "testing" until the project quietly dies.
organ transplant: Your AI probably works, your org is just rejecting it. Middle managers see AI as a threat, employees don't trust it, and leadership isn't aligned.
the fix: Define success metrics tied to P&L before you start, assign a clear owner, plan for production on day one, and invest as much in change management as the tech itself.
At Davos, Anthropic CEO Dario Amodei told an audience that AI will be writing essentially all code within 12 months.
the big quote: "If AI does to white-collar workers what globalization did to blue-collar workers, we need a credible plan—not abstractions about the jobs of tomorrow."
DeepMind's Demis Hassabis agreed that the early tremors are already here: companies are slowing hiring for junior roles and internships. If AI can do the work of a junior developer for free, the traditional corporate ladder starts to look broken.
But Dario’s “AI is taking your job” drumbeat has a counter-narrative. A new Oxford Economics report argues that "Companies don't appear to be replacing workers with AI on a significant scale." Instead, they may be using AI as PR cover for routine layoffs.
the data: AI was cited for 55,000 U.S. job cuts in 2025—sounds alarming until you realize that's only 4.5% of total layoffs. "Market conditions" caused four times more.
So why blame AI? Investor relations. The report notes that attributing cuts to AI "conveys a more positive message to investors" than admitting to weak demand or excessive hiring. It's a tech pivot story instead of a "we overhired" story.
Wharton professor Peter Cappelli points out the tell: companies announce layoffs "because of AI," but read the fine print and they're saying "we expect AI will cover this work." Future tense. They haven't automated anything. They're just hoping it'll work out and telling investors what they want to hear.
So which is it? Dario says the jobs are going. Oxford says it's spin. Both might be true, but today's layoffs are mostly old-fashioned corrections dressed up in AI clothes.
Microsoft told employees across Windows, Microsoft 365, Teams, Bing, Edge, and Surface to install Claude Code—Anthropic's coding tool that competes directly with GitHub Copilot.
Engineers are now expected to use both and provide direct comparisons. Designers and PMs are prototyping with Claude Code. Meanwhile, Microsoft is still selling Copilot to the rest of us.
That's a striking vote of confidence in a competitor's product. And a signal that even Microsoft isn't sure its own AI coding tool is the best option. The AI wars are far from settled, and the companies building these tools are hedging their bets internally while pushing certainty externally.

Want our evals walkthrough as a video instead of a writeup? Here's our full convo with Ash.


eli5: Physical AI is AI that can touch things. Not chatbots, not image generators—machines that perceive, reason, and move through the real world. Robots, autonomous vehicles, drones, humanoids. The leap from "AI that writes your emails" to "AI that drives your forklift."
the jargon, translated:
Embodied AI: the academic term for Physical AI; you'll see both used interchangeably
Sim-to-real transfer: training robots in simulation before they touch the physical world, because you can't let a robot wander around a warehouse breaking shit while it learns
Vision-language-action models (VLAs): AI that can see, understand natural language instructions, and translate both into physical movement; the core architecture making this moment possible
why this matters: You might've seen headlines about Apple and OpenAI racing to build AI wearables. That's a sideshow. The real physical AI story is playing out in warehouses, factories, and on the roads—and it's already further along than most people realize.
Amazon now runs over a million robots across 300 fulfillment centers. Waymo has completed 10 million paid robotaxi rides. Aurora is hauling commercial freight between Dallas and Houston with no one behind the wheel. BMW is testing humanoid robots at its South Carolina plant for tasks that require genuine dexterity—two-handed coordination, precision gripping, the kind of work traditional automation has never been able to handle.
And China has made physical AI a national priority, writing it into their latest Five-Year Plan as a "new driver of economic growth" alongside quantum and 6G. When Beijing bets this visibly on a technology category, it tends to accelerate global investment and competition.
the shift: Industrial robots used to do one task forever—same motion, same part, same outcome, year after year. Physical AI makes them adaptive. A robot that can sort two types of fruit, then figure out a third without being retrained. Machines that respond to changing circumstances instead of failing the moment something's different.
what changed: First, VLAs matured enough to give robots a unified way of processing visual input, understanding spoken or written commands, and executing physical actions—all in one model.
Second, simulation environments got good enough that robots can train virtually on millions of hours of operations before ever touching real hardware, then transfer those learned behaviors to the physical world with increasing reliability. The training happens in software; the deployment happens in your facility.
apply it: If your business moves physical things—manufacturing, logistics, warehousing, field operations—this is reshaping your cost structure over the next three to five years. The companies positioned to win aren't necessarily the ones buying robots first; they're the ones already generating the operational data these systems need to train on.

Open roles:
Newsletter Writer (yup, you’ll write this thing)
AI Strategist
Talent Acquisition Lead
Technical Recruiter
Forward Deployed Engineer
Applied AI Engineer
Engagement Manager
Salary ranges vary by role and experience. Additional comp based on output. Must be NY-based.
