typed vs Opus 4.8: a coding head-to-head

The short version

We ran the same Claude Code against two backends -- Anthropic's flagship Opus 4.8 (direct) and typed -- on a battery of medium-to-hard coding tasks: real SWE-bench Verified bugs, plus algorithm, performance, and edge-case tasks in both Python and TypeScript.

They tied. On every task. Across 14 tasks, both backends went 14/14. We then built harder tasks specifically to separate them -- performance gates, backtracking, fiddly text-justification, insight-dependent algorithms -- and they tied on those too.

The takeaway: typed matches Opus 4.8 on real coding work. Same Claude Code, same tasks, same result -- at the same monthly price as the reference subscription.

How we ran it

The point of this benchmark was a clean, honest, apples-to-apples comparison -- not a leaderboard score, and not a setup rigged to make either side look good. So:

Same harness, same prompts, same tools. The only thing that changed between runs was ANTHROPIC_BASE_URL -- Anthropic direct vs typed. Identical Claude Code, identical task prompts, and the same effort tier on each (Opus via --effort xhigh; typed via the typed-xhigh model).
Objective graders. Every task is graded by running its real test suite (pytest for Python, vitest for TypeScript). No model judges another model; a task passes only if its tests pass.
Every task is validated both ways. Before a task counts, we confirm it is RED against the planted bug and GREEN against a known-correct fix -- so a broken task can never unfairly fail (or pass) either backend.
It all runs locally. No cloud build host, no special infrastructure.
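
As an illustration of the two-way validation described above, here is a minimal Python sketch of a RED/GREEN check. The names run_suite and task_is_valid and the two-directory layout are assumptions made for this example, not the harness's actual code, and it assumes pytest is installed.

```python
import subprocess
import sys
from typing import Callable


def run_suite(workdir: str) -> bool:
    """Run pytest in workdir; True only if the whole suite passes (exit code 0)."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-q"],
        cwd=workdir,
        capture_output=True,
    )
    return result.returncode == 0


def task_is_valid(
    buggy_dir: str,
    fixed_dir: str,
    runner: Callable[[str], bool] = run_suite,
) -> bool:
    """A task counts only if it is RED on the planted bug and GREEN on the known fix."""
    red = not runner(buggy_dir)   # the tests must fail against the buggy tree
    green = runner(fixed_dir)     # the tests must pass against the reference fix
    return red and green
```

Passing the runner in as a parameter keeps the check itself independent of the test framework, so the same shape would work with a vitest runner for the TypeScript tasks.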

The tasks

Real-world (SWE-bench Verified). Five real bug-fix tasks from the sympy project -- actual GitHub issues with their real regression tests. The agent gets the repo, the issue, and a failing test, and has to land the fix.

Algorithm & data-structure (TypeScript). Five classic bugs every engineer recognizes: an LRU cache that evicts the wrong entry, interval merging on unsorted input, cycle detection in a topological sort, async-pool result ordering, and a token-bucket rate limiter that overflows.
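
To give a feel for this class of bug, here is a small Python sketch of interval merging that handles unsorted input. The benchmark tasks themselves are in TypeScript; this is an illustrative stand-in, not one of the actual task files, and the function name is an assumption.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals, in any input order."""
    merged = []
    # Sorting by start is the step a buggy version typically skips:
    # without it, an interval can only be compared with whatever came before it.
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the last merged interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged
```

For example, merge_intervals([(8, 10), (1, 3), (2, 6)]) returns [[1, 6], [8, 10]]; a version that skips the sort would leave (1, 3) and (2, 6) unmerged behind (8, 10).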

The "harder" round -- built specifically to find a difference:

Performance gates. A function that is correct but too slow -- it passes correctness and then fails a measured-runtime assertion on a large input. Passing requires actually finding the efficient algorithm (O(n) instead of O(n*k); O(n log n) instead of O(n^2)), not just a working answer.
Thoroughness. Full text-justification -- a dozen-plus fiddly edge cases (even space distribution with the extra spaces favoring the left, last-line handling, single-word lines). Being 90% right still fails.
Insight-dependent. Max-product subarray (you have to track the running minimum too, because two negatives flip to a large positive) and coin change (greedy is not optimal -- it needs dynamic programming).
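
A performance gate can be sketched roughly as follows. Sliding-window maximum is used here only as a hypothetical example of an O(n) versus O(n*k) task; the write-up does not say which functions the real gates use, and the names and the time budget are assumptions.

```python
import time
from collections import deque


def sliding_max(values, k):
    """Maximum of every length-k window in O(n), using a monotonic deque of indices."""
    window = deque()  # indices whose values are strictly decreasing
    out = []
    for i, v in enumerate(values):
        # Drop indices whose values can never be a window maximum again.
        while window and values[window[-1]] <= v:
            window.pop()
        window.append(i)
        # Drop the front index once it slides out of the window.
        if window[0] <= i - k:
            window.popleft()
        if i >= k - 1:
            out.append(values[window[0]])
    return out


def within_budget(fn, args, budget_seconds):
    """The gate: run fn once and check the measured wall-clock time."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start <= budget_seconds
```

A naive version that rescans each window is equally correct, so it passes the correctness tests; on a large input with a large k it is the gate, not the answer, that fails.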

The results

Round	Opus 4.8 (direct)	typed
Real SWE-bench Verified (5)	5 / 5	5 / 5
Algorithm / data-structure TS (5)	5 / 5	5 / 5
Harder: perf gates + backtracking (2 TS)	2 / 2	2 / 2
Harder: perf + edge-cases (2 Python)	2 / 2	2 / 2
Total	14 / 14	14 / 14

The "can we even find a difference?" round. With the score still tied, we took two insight-dependent tasks -- max-product subarray (you have to track the running minimum, since two negatives multiply into a large positive) and coin change (greedy is wrong here; it needs dynamic programming) -- and ran each three times on each backend, to catch any flakiness or pass-rate gap that a single run might hide. That is 2 tasks x 3 runs = 6 runs per backend. The result: 12 / 12 -- six for six on each side. No divergence, even across repeated trials.
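
For readers who want to see the two insights spelled out, here is a minimal Python sketch of both. These are textbook formulations written for this example, not the task files or either backend's output, and the function names are assumptions.

```python
def max_product_subarray(nums):
    """Largest product of a contiguous, non-empty subarray."""
    best = hi = lo = nums[0]
    for x in nums[1:]:
        # Track the running minimum too: a negative x turns the smallest
        # (most negative) product so far into the largest.
        candidates = (x, hi * x, lo * x)
        hi, lo = max(candidates), min(candidates)
        best = max(best, hi)
    return best


def min_coins(coins, amount):
    """Fewest coins summing to amount, or -1 if impossible (bottom-up DP)."""
    inf = float("inf")
    dp = [0] + [inf] * amount  # dp[t] = fewest coins needed to make total t
    for total in range(1, amount + 1):
        for coin in coins:
            if coin <= total and dp[total - coin] + 1 < dp[total]:
                dp[total] = dp[total - coin] + 1
    return dp[amount] if dp[amount] != inf else -1
```

With coins [1, 3, 4] and amount 6, a greedy pick of the largest coin first uses three coins (4 + 1 + 1), while the DP finds two (3 + 3). Likewise, max_product_subarray([-2, 3, -4]) is 24 only because the running minimum of -6 is kept around for the final -4.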

On speed they were close -- roughly even task by task, with typed faster on average because Opus 4.8 had a couple of long outliers where it spent several extra minutes on a task (and still arrived at the same passing answer).

What this means (and what it doesn't)

We're not claiming typed beats Opus 4.8 -- it doesn't, and we're not going to manufacture a win we didn't measure. We're claiming something we think is more useful and more honest: on objective, test-graded coding tasks, you can't tell them apart by the results. Both are excellent. The work gets done, correctly, either way.

So the question stops being "which model is better at coding?" and becomes "how do you get billed for it?" That's where typed is different: the same frontier-class quality at the same monthly price as the reference subscription, with monthly billing and optional top-ups instead of 5-hour reset windows.

Honest caveats

This is a curated set of hard tasks, not the full SWE-bench Verified leaderboard -- so read it as "matched on a representative hard battery," not "identical leaderboard score."
The genuinely hardest SWE-bench tasks (the multi-hour-difficulty ones) need per-task Docker environments to run faithfully; those aren't in this set. The tasks here are real and hard, but they're the ones that run cleanly without that machinery.
A different model under the hood means different responses on edge cases. Most coding work will feel identical; some won't. Our migration page is upfront about where.

Try it: typed.cloud -- point your existing Claude Code at it, run a real session against your own codebase, and judge for yourself. Swap the variables back any time.