How Much to Label: Not a Percentage of Traffic, but "Label Until You Can Conclude"

"We labeled 50, 96% correct — ship it?" No — the statistical lower bound is only 86%. In 5 minutes you'll see through the small-sample 96% mirage; in 10, why labeling volume tracks "intents × channels," not traffic share; in 20, you'll have a table mapping true accuracy to rows-to-label, plus the cheapest rule there is: labeling volume grows with the number of intents × channels, not with traffic.

Yaqin Hei·Jul 3, 2026·12 min read
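
One way to reproduce the gap between the 96% point estimate and the roughly 86% lower bound quoted in the teaser above is a Wilson score interval. This is a sketch under assumptions: the post may use a different interval, and both the 48-of-50 split and the 95% level (z = 1.96) are inferred from "50 labeled, 96% correct" rather than quoted from it. The function name `wilson_lower_bound` is illustrative.

```python
import math


def wilson_lower_bound(correct: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a two-sided 95% interval (an assumption here).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = correct / total
    denom = 1 + z * z / total
    center = p_hat + z * z / (2 * total)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total))
    return (center - margin) / denom


# 48 of 50 labeled rows correct: the point estimate is 0.96,
# but the lower bound lands in the mid-0.86 range.
print(f"point estimate: {48 / 50:.2f}")
print(f"lower bound:    {wilson_lower_bound(48, 50):.3f}")
```

With the observed accuracy held at 96%, a larger sample tightens the bound, which is the intuition behind "label until you can conclude."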

After Launch Is Where Agent Architecture Is Decided — Start With How You Sample Your Eval Set

"The stronger AI gets, the fewer people you need" is the most popular illusion of the past two years. Where agents are built with real money, labeling and evaluation teams are growing, not shrinking. In 5 minutes you'll see what separates "can demo it" from "can keep it right"; in 10, the 3 sampling questions that expose an eval set; in 20, how to redraw yours from "eyeball the logs" into a stratified sampling frame that pulls rare, high-stakes events — a wrong refund, a mishandled compliance case — back from "never sampled" into view.

Yaqin Hei·Jul 2, 2026·12 min read

When the AI Becomes the Storefront, You Decay Into a Supplier: The Relationship Moat for 10 Million Merchants

WeChat is beta-testing an AI that orders and pays for users, and Qwen just opened a brand agent. The two routes point at the same question: once the AI is the storefront, does the customer still belong to you? This one is for owners and membership / CRM leads: in 10 minutes you'll see how a super app slowly grinds you into a 'supplier,' spot the door Qwen's brand agent leaves open, and walk out with the 3 boundaries to hold (membership / repurchase / profile) plus the questions to put to the platform next week. WeChat is still read-only, with no writes, so build the relationship layer now, before the write permissions land and you find out you've been exposed all along.

Yaqin Hei·Jun 30, 2026·10 min read

AI Picked the First Store and the Other Four Vanished: WeChat's New Shelf for 10 Million Merchants

WeChat is beta-testing an AI agent that can search, compare, order, and pay across ~10 million merchants. This one is for owners and growth leads: in 10 minutes you'll see how the AI entry rewrote the physics of traffic distribution, spot the 3 signals that your service is already invisible to the AI, and walk out with 5 questions to put to your platform and team next week. It doesn't bet on any specific API — it explains the one thing that's certain: the shelf's rules changed, and whoever reads the rules first grabs the slot.

Yaqin Hei·Jun 29, 2026·10 min read

The Day WeChat's AI Ordered for Users, 10 Million Merchants' Mini-Programs Expired — A Layered Rebuild Blueprint

WeChat's AI agent has entered beta — it can search, compare, order, and pay across ~10 million merchants. This is the rebuild blueprint for architects and eng leads: which capabilities to scout with auto mode, which to package as SKILLs, how atomic APIs use a state machine to keep the AI on rails, and how to stitch the relationship layer back onto your side. In 30 minutes you walk out with a layered rebuild map + 10 things to do this week + 5 questions for the platform and vendors. The contract isn't frozen — so this post teaches you to design by MODE, not to bet on a specific API.

Yaqin Hei·Jun 24, 2026·30 min read

Your Dashboard Is Throttling the Agent It Watches

A periodic P99 spike, arriving every few minutes like clockwork — but CPU, QPS, and error rate are all flat, and the Agent code hasn't changed. Everyone's first guess is 'ES retrieval got slow.' It didn't: the retrieval path is fully async, clean. The culprit is the one place you'd never suspect — the ops dashboard built to watch the Agent was quietly choking it. This is the postmortem: how one sync call freezes a single-threaded event loop, how to align spike timestamps to dashboard refreshes, the two-line fix (to_thread + TTL cache), and 10 event-loop probes you can add to your own async service this week. 20 minutes, and you'll be able to catch the same 'one sync call stalls a whole loop' bug in your own stack.

Yaqin Hei·Jun 10, 2026·20 min read
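
The "to_thread + TTL cache" fix named in the teaser above can be sketched roughly as follows. This is a minimal illustration, not the post's actual code: `cached_in_thread`, `slow_dashboard_query`, the `_cache` dict, and the 30-second TTL are all hypothetical. Only `asyncio.to_thread` (Python 3.9+) is a real standard-library call.

```python
import asyncio
import time
from typing import Any, Callable

# Hypothetical module-level cache: key -> (stored_at, value).
_cache: dict[str, tuple[float, Any]] = {}
CALLS = {"n": 0}  # counts how often the blocking query really runs


def slow_dashboard_query() -> dict:
    """Stand-in for a blocking call (for example a sync DB or HTTP client)."""
    CALLS["n"] += 1
    time.sleep(0.2)
    return {"sessions": 42}


async def cached_in_thread(key: str, fn: Callable[[], Any], ttl_s: float = 30.0) -> Any:
    """Serve from cache while fresh; otherwise run fn off the event loop."""
    hit = _cache.get(key)
    if hit is not None and time.monotonic() - hit[0] < ttl_s:
        return hit[1]
    # asyncio.to_thread runs fn in a worker thread, so the event loop
    # keeps serving other coroutines during the blocking call.
    value = await asyncio.to_thread(fn)
    _cache[key] = (time.monotonic(), value)
    return value


async def demo() -> tuple[dict, dict]:
    first = await cached_in_thread("dashboard", slow_dashboard_query)
    second = await cached_in_thread("dashboard", slow_dashboard_query)  # cache hit
    return first, second


if __name__ == "__main__":
    print(asyncio.run(demo()), CALLS)
```

The sketch does not deduplicate concurrent callers on a cold cache: two requests arriving before the first result is stored would each run the query once. A lock or a shared in-flight task would close that gap if it matters.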

What a Real, Money-Moving L2 Refund Workflow Actually Looks Like | Workflow Deep Dive

The refund flow on the slide is a clean 8-step line. Built for real, that line is only a fifth of it — the other four-fifths decide whether to get to the payout at all. This deep dive takes apart a real L2 refund workflow: from linear 8 steps to a branching tree, why most of the code isn't refunding but not-refunding, why limits must live in a DB table and not the prompt, why every leaf with an unwired external system defaults to a human. Twenty-five minutes in, you can take this skeleton to a vendor and ask where their refund workflow's guardrails are.

Yaqin Hei·Jun 3, 2026·25 min read

Your KB Changed. The Search Index Didn't — Anatomy of a 9-Day Silent Desync | KB-Ops Deep Dive

The same refund line: curl it locally, you get the new wording; curl it in prod, you get the 9-day-old 'not as good as Taobao.' Between the source file and prod sat one step I assumed was automatic and was actually manual. This is the engineering postmortem: two stacked silent-desync root causes + how 33 test-feedback rows cluster into 16 with one cause + why an all-green dashboard hid it + 10 gates you can add to your own KB pipeline this week. Twenty minutes in, you can find the same hole in your own source-plus-derived-index system.

Yaqin Hei·Jun 3, 2026·20 min read

Self-Serve Rate ≠ Correct Rate — The Gates a Customer-Service Agent Must Clear Before Launch | Agentic AI in Practice (XIII)

The question at the review board — 'should self-serve rate be 65 or 90?' — crams two different axes into one number. A session that refunds the wrong amount still reads approved. Five minutes in you can see through a single-number 'self-serve rate 95%' report; ten minutes in you can build a 9-gate, 3-layer launch gate; twenty minutes in you can ask, at the review board, 'is this red line actually reconciled, or is it just falling back to a human because the API isn't wired yet?' — the kind of question that exposes a fake green checkmark on the spot.

Yaqin Hei·Jun 2, 2026·15 min read

The Org Chart Is the Real Architecture Diagram — 90% of Stalled Agent Projects Aren't a Tech Problem | Agentic AI in Practice (XII)

Annotation delivered, eval baseline built, four scenarios shipped — and the project still stalled for three weeks. The root cause wasn't code; it was five roles marked 'TBD' on the RACI sheet. Five minutes in you can see through 'the project team is already staffed'; ten minutes in you can draw the 3 roles an Agent landing must add plus a one-page ownership table; twenty minutes in you can walk into a kickoff and ask 'who has the authority to mark this doc expired?' — the kind of question that exposes an org gap on the spot.

Yaqin Hei·Jun 1, 2026·14 min read

A Second Agent as Reviewer — 11 of 34 Facts in a 25-Page AI Plan Were Fabricated | Agentic AI in Practice (XI)

Vendor PPTs, AI-drafted admissions emails, internal plans written with Claude — fact-fabrication rate sits at a 20-30% industry baseline. Five minutes in you can spot why 'let the LLM double-check itself' is pseudo-verification; ten minutes in you have the DRAFT → VERIFY → FINALIZE 3-phase gate template; twenty minutes in you have R1-R7 — the seven categories of fact errors that keep recurring (enum casing / fabricated emails / API paths / model IDs / deadlines) — turned into a PR checklist. Next time you review AI-drafted material, every claim traces back to a `file:line` or URL.

Yaqin Hei·May 29, 2026·13 min read

Corpus Drives Codebook — Why Your Intent Taxonomy Is Stuck at 60% and How It Evolves from 36 to 48 | Agentic AI in Practice (X)

A customer-service Agent is in production with 36 intents and a 40% unknown rate, and the business side asks, 'can we just add an LLM fallback?' The real problem is not the classifier; it's the codebook itself. Five minutes in you can spot the wrong diagnosis ('unknown rate high = classifier weak'); ten minutes in you have the four-quadrant test that filters 80% of pseudo-missing-intent requests; twenty minutes in you have the corpus → codebook iteration loop that evolves a taxonomy from 36 to 48 stable intents.

Yaqin Hei·May 28, 2026·14 min read

Don't Let AI Agents Call APIs Directly — A 5-Layer Tool-Calling Stack + 25-API Contract Checklist

The most common fake architecture in customer-service Agent projects this year: 'we let Agents call order / ticket / logistics APIs directly, 25 integrations done, full coverage.' Then ask three questions. 'What happens when the OMS vendor changes?' 'Rewrite.' 'How does QA do mock integration?' 'Wait for the real interface.' 'Compliance audit for write operations?' 'We'll add logging.' That is an architecture with missing layers. Written for architects, founders, and project owners running enterprise Agentic projects: 5 min to spot the most expensive architecture mistake, 10 min to lock in the 5-layer responsibility split (Adapter / Service ABC / Tool / Workflow / Critic), 20 min to walk out with a 6-systems × 25-APIs integration matrix + 5 architecture decisions to drive this week.

Yaqin Hei·May 27, 2026·17 min read

Pytest-Green Doesn't Mean Ship-Ready: How to Actually Test an AI Agent (Dual-Track)

The claim that most easily fools a customer-service Agent project this year: 'pytest 400+ green, coverage 79%, CI gate passing.' Then the boss asks 'what's the faithfulness rate? Tone compliance? Prompt-injection block rate?' and nobody answers. The 'tests passed' bar for an AI system is not the 'tests passed' bar for traditional software. This piece is for architects, founders, and project owners shipping Agentic AI inside an enterprise: 5 min to see why pytest-green is misleading, 10 min to decide who owns which 4 of the 8+ test buckets, 20 min to walk out with a 7-quality-dimension threshold table + 3-cadence rhythm + 5 things to drive this week. Bring it to your next architecture review.

Yaqin Hei·May 25, 2026·16 min read

Intent Classification for Chatbots: Why Pure-Rule and Pure-LLM Both Fail (a 3-Tier Cascade)

Intent classification is the first node in any customer-service Agent — get it wrong and the next four architecture decisions are wasted. Pure-rule is brittle; pure-LLM blows the budget. The 3-tier fallback (rule → embedding → LLM) is the only engineering trade-off that stands up. Five minutes in you can spot the two fake architectures ('just use an LLM' / '100% rules'); ten minutes in you have starting thresholds for all three tiers; twenty minutes in you have the signals that say it's time to evolve from HybridClassifier to LLM Router.

Yaqin Hei·May 25, 2026·16 min read

Agent Skills vs Knowledge Base: Why Stuffing SOPs Into RAG Doesn't Make an Agent Capable

In every other vendor review someone asks: 'where's the MCP-style protocol for Skills? How are we supposed to ship without one?' The question is backwards: the absence of a protocol isn't a bad thing; it's the signal that you can start now. Five minutes to see through pitches that say 'we put all our SOPs in the knowledge base, so our Agent is shipped'; ten minutes to use a three-line test that surfaces every fake Skill in your design; twenty minutes to draft an enterprise Skill spec for your team.

Yaqin Hei·May 24, 2026·16 min read

Containment Rate vs Resolution Rate: The Only Customer-Service AI Metric That Matters (How "98% CSAT" Gets Faked)

The CEO gets a weekly email from the vendor: CSAT 98%. I pulled the raw data — ~5% of customers rated 'satisfied,' a fraction of a percent rated 'unsatisfied,' 95% never responded. 'Silent = satisfied by default' is how that 98% got built. Five minutes to see through four flavors of fake-resolution claim; ten minutes to redraw your team's customer-service north star.

Yaqin Hei·May 22, 2026·16 min read

Deploy and Abandon — The Costliest Misconception in AI Agent Projects | Agentic AI in Practice (IV)

My boss graded my Critic design a B, reasoning: 'this is for Apple-scale companies, we're not Apple.' That sentence is the single most expensive misconception in AI Agent adoption. Five minutes to see through the six hollow spots in a 'deploy and abandon' proposal; ten minutes to walk into a vendor review armed with four questions they can't answer.

Yaqin Hei·May 18, 2026·16 min read

Why a 70% Critic Beats a 95% Critic — A Fail-Closed Design Deep Dive | Agentic AI in Practice (III)

A Critic second-pass review is the only thing standing between an L2 customer-service Agent and a refund mistake. But the '95% automation rate' vendors keep showing you is almost always fail-open — Critic times out, the action passes through. Five minutes to see through three flavors of fake backstop; ten minutes to redraw your team's design.

Yaqin Hei·May 17, 2026·22 min read
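
The fail-open versus fail-closed distinction in the teaser above comes down to what happens when the Critic does not answer. A minimal sketch, assuming an async Critic that returns a boolean: `review_fail_closed`, the `Critic` type alias, the sample critics, and the `"escalate_to_human"` / `"execute"` labels are hypothetical names, not taken from the post.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical Critic signature: takes a proposed action, returns approve/deny.
Critic = Callable[[dict], Awaitable[bool]]


async def review_fail_closed(critic: Critic, action: dict, timeout_s: float = 2.0) -> str:
    """Run the action only on an explicit approval; anything else goes to a human."""
    try:
        approved = await asyncio.wait_for(critic(action), timeout=timeout_s)
    except Exception:
        # Timeout or Critic error: a fail-open design would let the action
        # through here. Fail-closed blocks it and escalates instead.
        return "escalate_to_human"
    return "execute" if approved is True else "escalate_to_human"


async def slow_critic(action: dict) -> bool:
    await asyncio.sleep(1.0)  # simulates a Critic that misses its deadline
    return True


async def approving_critic(action: dict) -> bool:
    return True


if __name__ == "__main__":
    refund = {"type": "refund", "amount": 120}
    print(asyncio.run(review_fail_closed(slow_critic, refund, timeout_s=0.05)))
    print(asyncio.run(review_fail_closed(approving_critic, refund)))
```

The design choice is that silence never counts as approval: a timed-out or crashed Critic lowers the automation rate rather than letting an unreviewed refund through.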

Trained on 52 Product Domains, the Earlier 51 All Regressed — A Dual-Replay Field Report on Catastrophic Forgetting in LLMs

Sequentially fine-tuned across 52 product domains, NLU F1 on earlier ones dropped 1-2 points each time (BWT -7.2). Dual-Replay — 9M adapter params + 20% dual-stream replay — pulled BWT to -4.7 (35% less forgetting), p99 under 100 ms. Five minutes in, you tell real improvement from dashboard noise; thirty in, you have five forgetting failure modes plus five questions for any vendor.

Yaqin Hei·Oct 13, 2025·30 min read

Trained for 60,000 Steps, the Agent Learned to Delete Tickets — Six Reward-Hacking Patterns in ITSM Automation

I built an ITSM Agent research environment fitted to real ServiceNow ticket data. After 60,000 training steps, DQN and PPO both hit 100% hacking rates: every ticket handled by some cheating shortcut, zero genuine resolutions. This is the engineer's-eye debrief: six ITSM-specific reward-hacking patterns + why your dashboard won't catch them + ten things your team can do this week.

Yaqin Hei·Oct 10, 2025·30 min read