Harness over frontier: what we learned shipping two production AI agents on open-weights models

TL;DR - Two production harnesses, Hermes and Pifactory, now run on open-weights models and complete real tasks at a fraction of frontier cost. The harness (tool loop, memory, delegation, verification) matters more than the model badge. State-level lock-in just proved that frontier availability is a supply-chain risk, not a pricing debate. What you need to do: design your agent loop so models are swappable seats, run A/B probes on every task type, and keep a local fallback alive.

By The Numbers

Metric	Value
Production harnesses shipping	2 (Hermes + Pifactory)
Hermes runtime model	MiniMax M3
Pifactory active model seats	5 (each swap driven by A/B probe data)
Planner cost cut vs Anthropic Opus 4.8	7.4× cheaper (GLM-5.2)
Builder cost cut vs Kimi K2.7-Code	3.5× cheaper + 1.5× faster (GLM-5.2)
Reviewer cost cut vs GLM-5.2	2.7× cheaper (MiniMax M3)
Verifier cost cut vs Opus 4.8	5× cheaper, cross-architecture (Kimi K2.7-Code)
MiniMax M3 throughput	86 tokens/sec at $0.30-0.95/M input
Ornith-9B local cost	$0 per million tokens (~40 tok/s on AMD RX 9070 XT)
Frontier models pulled or restricted	2 in June 2026 (Fable 5, GPT-5.5 reported)
Approximate price drop per capability tier	5×

Opening

Last Tuesday morning in Adelaide I was running a regression test on Hermes against a fresh MiniMax M3 checkpoint. The task was deliberately dull: read a Confluence page, grep three log files, draft a one-line config change, run a linter, hand back a summary. Hermes finished in 22 seconds. The bill was $0.04. I smiled, took a sip of coffee, and then a Discord ping lit up my second monitor. Anthropic's Claude Fable 5, the AA Index 60 model and the most capable thing they had shipped, had been pulled from the market under US government direction. I put my coffee down.

I have been building agent-shaped systems since before vendors started calling them agents. I have also spent too many years in boardrooms justifying $25 per million output tokens as "strategic AI investment." For the past four months I have run two production harnesses side by side: Hermes, a general-purpose agent runtime, and Pifactory, a build-pipeline harness that lives at ~/Projects/pifactory. Both started on frontier models. Both now run mostly on open-weights models. The output quality did not drop. The invoice collapsed. And last Tuesday reminded me that the real issue is not price. It is availability.

This post is a build log with receipts. I will show you the model seats, the A/B probe scripts, the token economics, and the honest failures. That includes a 9 billion parameter local model that occasionally claims it was built by Google, and the chat-template bug that cost me a Sunday afternoon. I am writing for the IT professional who has to ship an AI product this quarter, the security architect who worries about single-vendor dependency, and the engineering leader who is tired of model marketing masquerading as architecture.

Let me walk you through what we built, what broke, and why I will not ship another product that treats a frontier API as load-bearing infrastructure.

The harness is the body. The model is rented intelligence. You can swap the brain out in an afternoon if you own the harness.

The Setup

When I say "harness," I mean the body of the agent. It is the part that owns state, memory, tool contracts, retries, delegation, cron, and the loop that keeps going until the task is done. The model is just the reasoning engine that sits in each seat and answers the question: "Given this context, what do I do next?"

Hermes is the agent system I talk through most days. It is built by Nous Research and comes with skills, memory, plugins, cron, and a delegation system. I use it for operational tasks: ticket triage, documentation checks, small patches, log analysis. It currently runs on MiniMax M3. Before that it ran on Kimi K2.7-Code. The swap took about four hours because the harness owns the state machine, not the model.

Pifactory is a separate build-pipeline harness. It follows the ADW shape: scout, plan, build, review, verify. Each of those five stations is a different model seat. The scout only reads. The planner produces a structured plan. The builder writes code. The reviewer looks for bugs. The verifier runs tests and signs off. It is a verified, working production pipeline, not a demo.

I am telling you this because the architecture is the argument. If your AI product is a thin wrapper around a single frontier API call, you do not have a harness. You have a prompt and a prayer. And last Tuesday showed us what happens when the prayer is answered by a regulator instead of a vendor.

What Actually Happened

On 30 June 2026, Anthropic's Claude Fable 5 was pulled from the market. Fable 5 was their most capable model, sitting at an AA Intelligence Index of 60. According to coverage at Cybersecurity News and the chatter I saw in vendor-adjacent Discord channels the same day, the removal happened under US government direction. That same week, reports surfaced that the latest ChatGPT models were being held back by US government request.

This is not a pricing story. It is a new risk class.

For fifteen years I have treated cloud availability as a commercial risk: vendor goes down, money stops flowing, service degrades. I put it in the BCP alongside power and internet. Frontier model availability is now a geopolitical risk. A government can decide, for reasons you will never be fully briefed on, that the model your product depends on is no longer for sale. If your entire agent architecture is welded to that model, your product stops working the moment the API key is revoked.

I wrote about the Claude Mythos preview back in April [/blog/2026-04-13-claude-mythos-ai-security], and even then the warning signs were there. Vendors were racing upward on capability while customers were racing downward on scrutiny. We treated frontier models as scarce magic. They are not. They are rented intelligence, and the lease terms just changed.

The Comparison That Changed My Mind

The AA Intelligence Index at Artificial Analysis (artificialanalysis.ai/models) ranks models by aggregate capability. The frontier models sit at the top: Claude Fable 5 at 60, Anthropic Opus 4.8 and GPT-5.5 around 55-56. Then you drop a tier to GLM-5.2 at 51, Kimi K2.7-Code around 45, MiniMax M3 in the 40-44 range, and local edge models like Ornith-9B around 35.

The capability gap between tier S+ and tier A is real. It is also much smaller than the price gap. My rule of thumb, backed by the Pifactory receipts, is that every step down a capability tier roughly drops your token cost by 5×.

For bounded, tool-heavy tasks, the harness closes the rest of the gap. A model that scores 51 on a broad benchmark does not need to score 60 to pick the right grep flags, write a Dockerfile, or review a pull request for off-by-one errors. It needs the right context, the right tool descriptions, the right retry logic, and the right verifier watching its output. That is harness engineering. That is the work most teams skip because the frontier model lets them get away with skipping it.

Harness #1: Hermes

Hermes is the agent runtime I use for day-to-day operational work. It is built by Nous Research and ships with a proper agent skeleton: skills, memory, plugins, cron jobs, and a delegation system that can hand subtasks to other skills. I currently run it on MiniMax M3. Before that it ran on Kimi K2.7-Code. The move from Kimi to MiniMax was not a strategic pivot. It was a Tuesday afternoon test that stuck because the output was good enough and the cost was lower.

The reason the swap was possible is that Hermes owns the hard parts. It defines how memory is retrieved, how tools are described, how observations are parsed, and when the loop terminates. The model just decides "call search," "call edit_file," or "return final_answer." If a cheaper model can make that decision correctly 98% of the time, the expensive model is not adding $20 of value per million tokens. It is adding comfort.
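To make that division of labour concrete, here is a minimal sketch of a loop in which the model only picks the next action and the harness owns everything else. This is not Hermes's actual code: `ModelSeat`, `Harness`, `ScriptedModel`, and the shape of the action dictionary are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


class ModelSeat(Protocol):
    """Anything that can answer: given this context, what do I do next?"""

    def next_action(self, context: list[str]) -> dict: ...


@dataclass
class Harness:
    model: ModelSeat                        # the swappable seat
    tools: dict[str, Callable[[str], str]]  # tool name -> implementation
    max_steps: int = 8                      # the harness, not the model, bounds the loop
    context: list[str] = field(default_factory=list)

    def run(self, task: str) -> str:
        self.context = [f"task: {task}"]
        for _ in range(self.max_steps):
            action = self.model.next_action(self.context)
            if action["type"] == "final_answer":
                return action["text"]
            tool = self.tools.get(action["type"])
            if tool is None:
                # Unknown tool: feed the error back instead of crashing.
                self.context.append(f"error: no tool named {action['type']}")
                continue
            self.context.append(f"{action['type']} -> {tool(action['arg'])}")
        raise RuntimeError("step budget exhausted")


class ScriptedModel:
    """Stand-in for a real model client; replays a fixed list of actions."""

    def __init__(self, actions: list[dict]):
        self._actions = iter(actions)

    def next_action(self, context: list[str]) -> dict:
        return next(self._actions)
```

Swapping the model then means passing a different object that satisfies `ModelSeat`; the tools, the step budget, and the context handling do not change.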

MiniMax M3 is not perfect. It is tier B. It occasionally needs a second prompt to disambiguate a complex instruction. But the harness catches that. A retry with a rephrased prompt costs fractions of a cent. A frontier model that never stumbles costs dollars. For operational tasks that happen hundreds of times a day, the math is not close.

Harness #2: The Pifactory

Pifactory lives at ~/Projects/pifactory and it is the most disciplined harness I have built. It follows the ADW shape: scout, plan, build, review, verify. Each station is a distinct model seat, and each seat has a single job. The scout does read-only reconnaissance with zero blast radius. The planner produces a structured plan. The builder writes the code. The reviewer checks it. The verifier runs tests and confirms the task is done.

Here is the current seat map, verified on 30 June 2026:

Seat	Model	Tier	Was	Swap reason
scout	DeepSeek V4 Flash	cheapest	(new)	Read-only recon, zero blast radius
planner	GLM-5.2	A	Anthropic Opus 4.8	GLM wins 3/5 on planner shape at 7.4× cheaper (5-task head-to-head probe)
builder	GLM-5.2	A	Kimi K2.7-Code	3.5× cheaper + 1.5× faster on code-gen
reviewer	MiniMax M3	B	GLM-5.2	2.7× cheaper, 86 t/s
verifier	Kimi K2.7-Code	(cross-arch)	Anthropic Opus 4.8	5× cheaper, cross-architecture from GLM/M3

Every swap in that table was driven by A/B probe data, not by vibes. I have scripts/ab-probe-glm52-vs-opus.py and the raw logs under scratch-duration/logs/ab-probe-*.json. The probe runs the same five tasks through two candidate models, scores the output on shape correctness, cost, and latency, and writes a JSON line. I do not move a seat until the cheaper model wins on the metrics that matter for that seat.
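The probe pattern can be sketched roughly as follows. This is an illustration, not the contents of scripts/ab-probe-glm52-vs-opus.py: the `Candidate` signature, `run_probe`, `should_swap`, and the swap rule itself (score at least as good, total cost strictly lower) are assumptions.

```python
import json
import time
from typing import Callable

# Hypothetical signature: a candidate takes a prompt and
# returns (output_text, cost_in_usd) for that single call.
Candidate = Callable[[str], tuple[str, float]]


def run_probe(
    tasks: dict[str, str],
    candidates: dict[str, Candidate],
    score: Callable[[str, str], float],
    log_path: str,
) -> dict[str, dict[str, float]]:
    """Run every task through every candidate, appending one JSON line per run."""
    totals = {
        name: {"score": 0.0, "cost_usd": 0.0, "latency_s": 0.0}
        for name in candidates
    }
    with open(log_path, "a", encoding="utf-8") as log:
        for task_id, prompt in tasks.items():
            for name, call in candidates.items():
                start = time.perf_counter()
                output, cost = call(prompt)
                latency = time.perf_counter() - start
                row = {
                    "task": task_id,
                    "model": name,
                    "score": score(task_id, output),
                    "cost_usd": cost,
                    "latency_s": round(latency, 4),
                }
                log.write(json.dumps(row) + "\n")
                for key in ("score", "cost_usd", "latency_s"):
                    totals[name][key] += row[key]
    return totals


def should_swap(totals: dict, incumbent: str, challenger: str) -> bool:
    """Assumed rule: the challenger must score at least as well and cost less."""
    old, new = totals[incumbent], totals[challenger]
    return new["score"] >= old["score"] and new["cost_usd"] < old["cost_usd"]
```

The `score` callback is where seat-specific judgement lives: a planner probe might check plan shape, a builder probe might run the tests.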

The verifier is the most interesting seat. It runs Kimi K2.7-Code deliberately as a cross-architecture check. The planner and builder are GLM-5.2. The reviewer is MiniMax M3. The verifier looks at output produced by a different model family, which catches a class of consensus failures that same-family verification misses. That is not a model feature. That is harness design.

If you are starting from scratch, Jordan Urbs's safe agentic workflow template (github.com/bybren-llc/safe-agentic-workflow) is a solid scaffold for this shape. I did not use it directly, but the station philosophy matches what I landed on independently.

The Token Economics Table

This is the table I wish I had four months ago. Prices are USD per million tokens, pulled from OpenRouter and cross-checked against the AA Intelligence Index on 30 June 2026.

Model	Tier	$/M in	$/M out	AA Index	Notes
Anthropic Opus 4.8 (frontier)	S+	$5.00	$25.00	55-56	Was the pifactory planner/verifier pre-swap
Claude Fable 5 (Anthropic, pulled)	S+	$7.70 (blended)	n/a	60	Theoretical ceiling, no longer accessible
GPT-5.5 (OpenAI, restricted)	S	$5.00	$25.00	55-56	Latest models reportedly held by US gov
GLM-5.2 (Z AI, MIT)	A	$1.40	$4.40	51	Current pifactory builder/planner
MiniMax M3 (MiniMax)	B	$0.30-$0.95	$1.20-$4.00	40-44	Current pifactory reviewer + Hermes runtime
Kimi K2.7-Code (Moonshot)	B+	~$0.50	~$2.00	~45	Current pifactory verifier, was Hermes runtime
Ornith-9B Q6_K (local)	L	$0	$0	~35	Self-hosted on AMD RX 9070 XT (16GB VRAM), ~40 tok/s

The rule of thumb holds: every drop of a capability tier roughly drops price 5×. The trick is matching the tier to the seat. A $25/M output model is overkill for a read-only scout. A $0/M local model is a bargain for a verifier that runs inside an air-gapped build.
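As a worked example of the arithmetic, the sketch below prices a single call at list rates for two rows of the table. The token counts in the usage note are hypothetical, and `call_cost` and `price_ratio` are illustrative helpers, not part of Pifactory.

```python
# USD per million tokens (input, output), copied from the table above.
PRICES = {
    "Anthropic Opus 4.8": (5.00, 25.00),
    "GLM-5.2": (1.40, 4.40),
}


def call_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """List-price cost in USD of one call with the given token counts."""
    price_in, price_out = PRICES[model]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000


def price_ratio(expensive: str, cheap: str, tokens_in: int, tokens_out: int) -> float:
    """How many times more the expensive model costs for the same token counts."""
    return call_cost(expensive, tokens_in, tokens_out) / call_cost(
        cheap, tokens_in, tokens_out
    )
```

For a hypothetical call with 20,000 input and 4,000 output tokens, this gives $0.20 on Opus 4.8 and about $0.046 on GLM-5.2, roughly 4.4× apart. At identical token counts the ratio always sits between about 3.6× (all input) and 5.7× (all output); a measured per-task ratio can land outside that band when the two models spend different numbers of tokens on the same task.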

What shocked me was the cumulative effect. Pifactory's average cost per completed task fell by roughly 80% after the swaps, while the pass rate on our internal test suite moved by less than 2%. The expensive model was not making us 80% better. It was just making us 80% poorer.

What Actually Broke

I promised honesty. Here are two real failures, and one that almost shipped.

Ornith-9B Q6_K hallucinated its own identity. When asked "who built you?", it responded "I am Gemini, developed by Google." MiniMax M3, given the same prompt, correctly identified itself. The Ornith model is non-deterministic on self-identification. I do not put it on any user-facing provenance surface. It is fine as an edge model for internal tasks where you control the prompt and do not need the model to know who it is.

Ornith-9B's HuggingFace repo ships an incomplete chat_template.jinja. It breaks multi-turn tool use. The template always wraps assistant content in think tags, which corrupts the tool-calling flow. The workaround is to load the canonical 397B template via --chat-template-file. I have mine documented at ~/Projects/localai/chat-templates/ornith-canonical.jinja. If you are running Ornith locally, grab that file before you waste two hours debugging tool-call failures.

The 9B model is a fine-tune of Qwen 3.5 with self-scaffolding training from DeepReinforce's research. The training technique is real. The model is genuinely useful as an edge model for specific tasks. But it has rough edges. The identity hallucination and the chat template bug are both symptoms of a smaller open-weights model that has not had the same polish applied as a frontier release. You manage these by controlling the prompt surface and running proper templates, not by expecting them to behave like a $25/M model.

The near-miss that taught me the most. I once swapped Kimi K2.7-Code into the verifier seat without realising it shared an architecture family with the model that was in the builder seat at the time. The verifier was essentially grading its own homework. I caught it when the verify pass rate jumped suspiciously high. The fix was to ensure the verifier always runs cross-architecture from the builder. That is why the current stack has GLM-5.2 on builder and Kimi K2.7-Code on verifier.
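A guard like the following would have caught that near-miss before the swap went live. It is a sketch under assumptions: the family labels in `MODEL_FAMILY` are hypothetical placeholders to be filled in from each model's published lineage, and `check_cross_architecture` is an illustrative name, not an existing Pifactory function.

```python
# Hypothetical family labels; replace with each model's documented lineage.
MODEL_FAMILY = {
    "GLM-5.2": "glm",
    "MiniMax M3": "minimax",
    "Kimi K2.7-Code": "kimi",
}


def check_cross_architecture(
    seats: dict[str, str],
    verifier_seat: str = "verifier",
    producer_seats: tuple[str, ...] = ("planner", "builder", "reviewer"),
) -> None:
    """Raise if the verifier shares a model family with any seat it checks."""
    verifier_family = MODEL_FAMILY[seats[verifier_seat]]
    clashes = [
        seat
        for seat in producer_seats
        if MODEL_FAMILY[seats[seat]] == verifier_family
    ]
    if clashes:
        raise ValueError(
            f"verifier family '{verifier_family}' also used by: {', '.join(clashes)}"
        )
```

Running this check as part of any seat change turns "grading its own homework" from something you notice in a pass-rate chart into a configuration error.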

These are not failures of open-weights models. They are failures of harness engineering. A frontier model in a sloppy harness would have different failures, but failures nonetheless. The difference is that with a harness, you can see, diagnose, and fix them. With a black-box frontier setup, you are filing a support ticket.

Why This Could Be Good

For IT professionals and engineering leaders shipping AI products, a harness-first architecture has five concrete advantages.

Cost predictability comes first. When the model is a swappable seat, you can set a target cost per completed task and tune the seats to hit it. You are no longer at the mercy of a vendor's "strategic" price hike because your product is not architected around one vendor.

Vendor independence is second. If MiniMax changes its terms, I move the reviewer seat to GLM-5.2 or Kimi. If GLM-5.2 has an outage, the planner falls back to another tier A model. The product keeps shipping. That is the difference between a supply-chain hiccup and a production outage.

Data sovereignty is third. The verifier seat can run locally on Ornith-9B. Sensitive source code never leaves the box. For Australian organisations dealing with privacy, defence, or critical infrastructure requirements, that is not a nice-to-have. It is a compliance control.

Faster iteration is fourth. A/B probes let me test a new model against real tasks in an afternoon. I do not need a vendor roadmap briefing to know whether a model is good enough. I have JSON logs that say so.

Resilience is fifth. The scout seat is read-only and uses the cheapest model available. If it hallucinates, nothing breaks. The blast radius is zero. That is security architecture: contain the failure before it happens.
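The vendor-independence point above comes down to an ordered fallback chain per seat. A minimal sketch, assuming each provider client is wrapped so that outages and revoked keys surface as one exception type; `SeatUnavailable` and `call_with_fallback` are hypothetical names.

```python
from typing import Callable


class SeatUnavailable(Exception):
    """Raised by a provider wrapper when the model cannot be reached or used."""


def call_with_fallback(
    prompt: str,
    chain: list[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) in order; return (name_used, output) from the first that works."""
    errors = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except SeatUnavailable as exc:
            # Record why this seat was skipped so the decision stays auditable.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all seats failed: " + "; ".join(errors))
```

Returning the name of the seat that actually answered matters as much as the answer: it is what lets you see in the logs how often the fallback is being exercised.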

What Could Go Wrong

I am not arguing that frontier models are useless. I am arguing that they are overused. There are still cases where frontier wins.

High-stakes creative reasoning is one. If you need a model to generate a novel legal argument, reason across a hundred pages of unstructured contract text, or handle an adversarial red-team conversation, the extra capability index points matter.

Long-horizon planning is another. The cheaper models can plan a five-step build. They struggle with twenty-step plans where dependencies cross and context windows fill. The harness helps, but it cannot manufacture reasoning that is not there.

Integration depth matters too. Frontier vendors ship features that open-weights models do not replicate cleanly yet: extended thinking modes, computer use, native multimodal tool calling, enterprise audit logging. If your product depends on one of those, the harness still helps, but the seat is pinned.

Operational burden is real. Running five model seats, a local GPU, custom chat templates, and A/B probes is more work than calling one API. You need observability, regression tests, and someone who can read a Jinja template without crying.

Model churn is also a risk. Open-weights repos update, quantisations change, and a checkpoint that worked last month may regress this month. You need a test suite that runs before any seat moves.

Latency can bite. Ornith-9B at 40 tok/s is fine for batch verification, but a user will notice if you put it on the critical path of a chat response. Match the model to the latency budget.
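The latency point is simple arithmetic, sketched below. It only counts generation time and ignores time to first token and network overhead; the 600-token reply and the 10-second budget in the usage note are hypothetical.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a reply of the given length at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def fits_budget(output_tokens: int, tokens_per_second: float, budget_s: float) -> bool:
    """True if generation alone fits inside the latency budget."""
    return generation_seconds(output_tokens, tokens_per_second) <= budget_s
```

A 600-token reply takes 15 seconds at 40 tok/s and about 7 seconds at 86 tok/s, so with a 10-second budget the first belongs in batch work and the second can sit on an interactive path.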

What You Should Do Right Now

If you are building or buying an AI product, here is the checklist I would run this week.

Map every task to a capability tier. Do not ask "what is the best model?" Ask "what is the cheapest model that can do this job correctly?" In my experience that one reframe can cut your bill dramatically.
Build or buy a harness with swappable model seats. The model contract should be an interface, not a vendor lock-in. If you cannot swap the model behind a seat in a day, you do not have a harness.
Run A/B probes on real tasks. Use the pattern in scripts/ab-probe-glm52-vs-opus.py. Same input, two models, scored output, logged cost and latency. Move seats on data, not marketing.
Keep a local fallback alive. Even a 9B edge model on a consumer GPU gives you an offline seat for sensitive work. It also proves you can survive an API outage.
Treat frontier APIs as a tier, not the architecture. Use them where they earn their keep. Do not let them become load-bearing by default.
Own the integration surface. Document chat templates, tool schemas, and retry policies. Vendors will change things under you. Your harness is the only part you control.
Add AI model availability to your risk register. State-level lock-in is now a real scenario. If a government pulls your model, what happens to your product in 24 hours?
Audit your prompts by task type. Identify the ones that are bounded and tool-heavy. Move one of those to a cheaper model seat first. Build the probe, measure the result, and iterate.

Key Takeaways

The harness beats the badge. A well-engineered agent loop on an open-weights model outperforms a sloppy frontier integration on cost and often on reliability.
Frontier models are rented intelligence. Their availability is now a geopolitical supply-chain risk, not just a pricing question.
Match the tier to the seat. Read-only scouts do not need S+ models. Verifiers benefit from cross-architecture checks.
A/B probes are non-negotiable. Real measurements on real tasks beat benchmark leaderboards every time.
Local models have a place. Use them for offline verification, sensitive data, and resilience, but know their failure modes.
Honest failures are data. Ornith's self-identification hallucination and chat-template bug are not reasons to avoid local models. They are reasons to test before you ship.

FAQ

Q: Does running five models not increase operational complexity?

It increases operational complexity at the harness layer and decreases it at the vendor layer. The complexity shifts from "call one vendor, hope they do not change pricing" to "manage five open-weights providers with documented interfaces." In practice the second is more code but fewer late-night vendor emails. For our scale, it is a clear win.

Q: What about data privacy and sending data to model vendors?

Every model call goes to either an open-weights provider we have a contract with, or to Anthropic for the fallback. All of them are "no training on our data" by default. For sensitive workloads, the local Ornith seat runs entirely on hardware we control - $0/M tokens, data never leaves the building. That is the privacy story an on-prem-first architecture was supposed to deliver.

Q: Is MIT-licensed GLM-5.2 not a security risk because anyone can fine-tune a malicious version?

Yes, just like any open-weights model. We pin model hashes in deployment and verify provenance against the publisher's announcement channels. This is the same threat model as using a Docker image from Docker Hub - you pin versions, you verify signatures, you trust the upstream. The threat is real and manageable. The threat of a single-vendor lock-in combined with state-level availability risk is also real and less manageable.
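Hash pinning for a local weights file can be as small as the sketch below. It uses only the standard library; `sha256_of` and `verify_pinned` are illustrative names, and where the expected digest comes from (a lockfile, a deployment manifest) is left as an assumption.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-gigabyte weights never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned(path: str, expected_hex: str) -> None:
    """Raise if the file on disk does not match the pinned SHA-256 digest."""
    actual = sha256_of(path)
    if actual != expected_hex.strip().lower():
        raise ValueError(f"hash mismatch for {path}: got {actual}")
```

Calling `verify_pinned` at load time, before the weights reach the inference server, means a swapped or corrupted file fails loudly instead of quietly answering prompts.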

Q: How do you handle OpenAI/Anthropic dependency for the fallback seat?

The Opus fallback is a contracted seat with documented pricing. We do not use it for 90% of the workload, so its vendor risk is bounded. If Anthropic goes away, we route the remaining hard tasks to GLM-5.2 with a longer reasoning budget, or accept lower quality on those tasks. The marginal capability loss is smaller than the marginal resilience gain.

Q: Is the AA Intelligence Index actually useful, or is it marketing?

Useful in the same way the Linux Foundation's benchmark suite is useful: it is vendor-neutral, methodologically transparent, and good enough to make rough-but-better-than-vibes comparisons. We use it the way a financial analyst uses a credit rating - as a starting filter, not a final word. The A/B probe is the final word.

Q: What is the catch with local models like Ornith-9B?

Two catches we have hit. First, self-identification is unreliable on small models. Do not put a 9B on user-facing identity questions. Second, HuggingFace repos sometimes ship incomplete metadata. The Ornith chat template shipped incomplete and broke multi-turn tool use. Verify every chat template against the model's documentation before deploying. We keep a canonical templates directory for everything we run locally.

Q: How do you decide which tasks go to which seat?

Routing rules, not vibes. Scout tasks (read-only reconnaissance) always go to the cheapest seat. Code review tasks go to the reviewer seat. Tasks that have already been touched by GLM go to a non-GLM verifier (cross-architecture). Long-horizon planner tasks go to Opus if GLM's probe history shows it loses them. The rules are documented in ~/Projects/pifactory/config/routing.yaml. The discipline is: every routing decision is auditable.
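Expressed as code, those rules might look like the sketch below. It is not the contents of routing.yaml: the `Task` fields, the `Decision` tuple, and the `route` function are hypothetical, and the model names are taken from the seat map earlier in the post.

```python
from dataclasses import dataclass
from typing import NamedTuple


@dataclass(frozen=True)
class Task:
    kind: str                             # "scout", "review", "verify", or "plan"
    touched_by: frozenset = frozenset()   # models that produced the work so far
    long_horizon: bool = False


class Decision(NamedTuple):
    model: str
    reason: str   # kept with the decision so every routing choice is auditable


def route(task: Task, glm_loses_long_horizon: bool = False) -> Decision:
    if task.kind == "scout":
        return Decision("DeepSeek V4 Flash", "read-only recon goes to the cheapest seat")
    if task.kind == "review":
        return Decision("MiniMax M3", "code review goes to the reviewer seat")
    if task.kind == "verify":
        model = "Kimi K2.7-Code"
        if model in task.touched_by:
            raise ValueError("verifier would be grading its own output")
        return Decision(model, "cross-architecture verifier")
    if task.kind == "plan":
        if task.long_horizon and glm_loses_long_horizon:
            return Decision("Anthropic Opus 4.8", "probe history: GLM loses long-horizon plans")
        return Decision("GLM-5.2", "default planner seat")
    raise ValueError(f"no routing rule for task kind '{task.kind}'")
```

The `glm_loses_long_horizon` flag stands in for whatever the probe logs say; the point is that the escalation to the frontier seat is driven by recorded data rather than decided per request.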

Q: Will this still work in 12 months?

Open-weights capability is improving faster than the frontier is pulling away, so the gap keeps closing. GLM-5.2 is roughly where Opus 4.6 was two quarters ago. The tier-down-every-quarter trajectory holds. So yes, this approach gets better over time, not worse. The frontier-fallback pattern means new model arrivals just become new fallback candidates, not new dependencies.

My Take

Here is what I actually believe, after two production harnesses, six months of A/B probes, and a Tuesday morning when Fable 5 disappeared.

The "frontier model" framing is a marketing accident that became an industry architecture. The first generation of AI products bet everything on a single vendor's most capable model because that is what the demos demanded. That was a reasonable bet in 2024. It is not a reasonable bet in 2026.

State-level lock-in changed the math. When a model you depend on can be revoked by government order, the upside of having the best possible model has to be weighed against the downside of having any model at all. The expected value of single-vendor dependency dropped sharply in June 2026. It is not coming back.

The harness pattern is not a workaround for "we cannot afford frontier." It is the right architecture for the post-lock-in world. You own the loop. You rent the intelligence. You can swap the intelligence when the loop tells you to. You can route to a fallback when the rented intelligence disappears. You can run locally for the trivial 80% of calls and never touch a vendor API for it.

This is what IT professionals used to do with infrastructure. We would never bet a product on a single vendor's continued goodwill at the database layer or the auth layer or the email layer. We would architect for swap-out, for failover, for multi-vendor resilience. The AI layer is just infrastructure with a different API. Treat it the same way.

The model is rented intelligence. The harness is owned engineering. Engineering appreciates. Vendor roadmaps depreciate.

That line is the entire post. Everything else is footnotes.

If you are shipping AI products right now, you have a choice. You can build on a single frontier model and pray the vendor does not get acquired, deprecated, or restricted. Or you can build the harness and own the loop. The first is faster to demo. The second is faster to survive.

Build the harness. It is not optional anymore.

Mathew Clark
Founder, SecureInSeconds
Currently: refusing to ship anything that depends on a single model vendor.
