Coding Agents · July 26, 2026

Glsrm Coding Agent Benchmarks

We measure how coding agents perform on real software engineering tasks, reporting score, cost, token usage, and execution time across agents, models, and harnesses.

Snapshot: July 26, 2026

Highlights

Higher is better

Coding Agent Index

Equal-weight mean of the three suites · top configurations

1. Claude Code · Opus 5 (xhigh): 66.7
2. Codex · GPT-5.6 Sol (max): 66.6
3. Claude Code · Fable 5 (max) (with fallback): 65.8
4. Claude Code · Opus 5 (max): 65.5
5. Codex · GPT-5.6 Sol (xhigh): 65.1
6. Grok Build · Grok 4.5 (high): 64.4
7. Codex · GPT-5.6 Sol (high): 64.1
8. Claude Code · Opus 5 (high): 63.4
9. Codex · GPT-5.6 Terra (max): 62.3

Lower is better

Cost per Task

Mean USD per completed task · cheapest configurations

1. Cursor CLI · Composer 2: $0.04
2. Cursor CLI · Composer 2.5: $0.08
3. Codex · GPT-5.6 Luna (low): $0.21
4. Claude Code · DeepSeek V4 Pro (high): $0.27
5. Codex · GPT-5.6 Luna (none): $0.35
6. Codex · GPT-5.6 Terra (none): $0.37
7. Codex · GPT-5.6 Luna (medium): $0.47
8. Codex · GPT-5.6 Terra (low): $0.48
9. Cursor CLI · Composer 2.5 Fast: $0.55

Lower is better

Time per Task

Mean wall-clock minutes per task · fastest configurations

1. Codex · GPT-5.6 Terra (none): 1.8m
2. Codex · GPT-5.6 Luna (low): 1.9m
3. Codex · GPT-5.6 Luna (none): 2.5m
4. Codex · GPT-5.6 Terra (low): 2.8m
5. Codex · GPT-5.6 Sol (none): 3.4m
6. Codex · GPT-5.6 Luna (medium): 3.4m
7. Codex · GPT-5.6 Sol (low): 3.7m
8. Codex · GPT-5.6 Terra (medium): 4.3m
9. Codex · GPT-5.6 Sol (medium): 5.2m

Coding agent comparison summary

Every harness × model configuration we track, ranked by Coding Agent Index.

#	Agent	Model	Lab	Index	DeepSWE	Term-Bench	Atlas QnA	$/Task	Tokens/Task	Time	Turns
1	Claude Code (Anthropic)	Claude Opus 5 (xhigh)	Anthropic	66.7	60.5	84.9	54.8	$8.23	21.7M	23.6m	153
2	Codex (OpenAI)	GPT-5.6 Sol (max)	OpenAI	66.6	68.7	87.7	43.3	$7.08	13.2M	10.2m	114
3	Claude Code (Anthropic)	Claude Fable 5 (max) (with fallback)	Anthropic	65.8	66.1	82.5	48.9	$11.71	13.8M	23.4m	138
4	Claude Code (Anthropic)	Claude Opus 5 (max)	Anthropic	65.5	63.1	84.5	48.9	$8.95	23.7M	23.7m	166
5	Codex (OpenAI)	GPT-5.6 Sol (xhigh)	OpenAI	65.1	67.0	86.1	42.2	$5.24	9.9M	7.4m	96
6	Grok Build (xAI)	Grok 4.5 (high)	xAI	64.4	59.9	85.3	48.1	$2.59	3.6M	16.5m	61
7	Codex (OpenAI)	GPT-5.6 Sol (high)	OpenAI	64.1	64.9	82.5	44.9	$4.14	8.1M	6.3m	86
8	Claude Code (Anthropic)	Claude Opus 5 (high)	Anthropic	63.4	60.8	80.2	49.2	$3.80	9.6M	13.4m	93
9	Codex (OpenAI)	GPT-5.6 Terra (max)	OpenAI	62.3	67.0	84.1	35.8	$2.76	9.5M	8.4m	97
10	Claude Code (Anthropic)	Claude Opus 5 (medium)	Anthropic	61.9	62.8	78.6	44.4	$3.14	7.9M	12.2m	83
11	Codex (OpenAI)	GPT-5.5 (xhigh)	OpenAI	61.5	64.3	84.1	36.0	$5.07	12.3M	10.1m	106
12	Kimi Code CLI (Moonshot AI)	Kimi K3	Kimi	61.3	63.7	83.7	36.6	$3.18	10.6M	23.8m	125
13	Codex (OpenAI)	GPT-5.6 Sol (medium)	OpenAI	60.6	64.0	77.8	40.1	$2.99	5.8M	5.2m	72
14	Claude Code (Anthropic)	Claude Opus 4.8 (max)	Anthropic	60.6	55.8	79.4	46.5	$7.70	17.7M	23.1m	166
15	Codex (OpenAI)	GPT-5.6 Luna (max)	OpenAI	58.7	63.4	79.8	32.8	$1.57	15.5M	8m	115
16	Claude Code (Anthropic)	Claude Opus 4.8 (xhigh)	Anthropic	58.4	51.3	81.3	42.7	$5.67	13.6M	17.6m	136
17	Codex (OpenAI)	GPT-5.6 Terra (xhigh)	OpenAI	57.1	58.4	80.6	32.3	$1.90	6.5M	6.9m	75
18	Claude Code (Anthropic)	Claude Opus 5 (low)	Anthropic	56.8	56.9	74.2	39.2	$2.18	5.1M	9.5m	64
19	Claude Code (Anthropic)	Claude Opus 4.8 (high)	Anthropic	56.7	51.6	79.8	38.7	$3.72	9.0M	12.3m	104
20	Codex (OpenAI)	GPT-5.6 Terra (high)	OpenAI	55.8	60.5	76.0	30.9	$1.59	5.5M	6.2m	67
21	Codex (OpenAI)	GPT-5.6 Luna (xhigh)	OpenAI	54.7	56.6	76.2	31.2	$1.26	12.3M	6.6m	96
22	Codex (OpenAI)	GPT-5.5 (medium)	OpenAI	54.3	56.6	75.8	30.6	$2.75	7.0M	6.4m	78
23	Codex (OpenAI)	GPT-5.6 Sol (low)	OpenAI	53.6	53.4	73.0	34.4	$1.72	3.2M	3.7m	54
24	Claude Code (Anthropic)	Claude Opus 4.8 (medium)	Anthropic	53.6	49.3	75.4	36.0	$3.26	7.7M	12.4m	93
25	Opencode (SST)	Muse Spark 1.1 (xhigh)	Meta	53.5	54.3	73.0	33.3	$1.43	12.2M	12.6m	55
26	Codex (OpenAI)	GPT-5.6 Luna (high)	OpenAI	51.4	53.4	71.8	29.0	$0.96	9.5M	5.7m	84
27	Claude Code (Anthropic)	Claude Opus 4.7 (max)	Anthropic	50.3	40.1	73.8	37.1	$5.63	15.8M	15.7m	107
28	Opencode (Opencode)	Claude Opus 4.7 (medium)	Anthropic	49.9	39.5	75.4	34.9	$2.93	7.5M	12.2m	54
29	Codex (OpenAI)	GPT-5.6 Terra (medium)	OpenAI	47.8	45.7	69.4	28.2	$0.90	3.1M	4.3m	51
30	Claude Code (Anthropic)	Claude Opus 4.8 (low)	Anthropic	47.4	41.0	73.0	28.2	$2.15	5.1M	8.5m	67
31	Claude Code (Anthropic)	Claude Opus 4.6 (medium)	Anthropic	46.4	—	70.6	22.3	$1.28	4.4M	8m	34
32	Cursor CLI (Cursor)	GPT-5.5 (medium)	OpenAI	46.1	37.2	73.4	27.7	$2.01	4.0M	6.6m	78
33	Cursor CLI (Cursor)	Claude Opus 4.7 (medium)	Anthropic	45.4	31.6	70.6	33.9	$2.68	5.6M	13.6m	86
34	Codex (OpenAI)	GPT-5.6 Sol (none)	OpenAI	43.4	35.4	60.7	34.1	$1.40	3.4M	3.4m	55
35	Claude Code (Anthropic)	GLM-5.2	Z AI	43.2	28.6	71.9	29.0	$6.51	6.5M	25.1m	127
36	Codex (OpenAI)	GPT-5.6 Luna (medium)	OpenAI	42.4	36.6	63.5	27.2	$0.47	4.4M	3.4m	58
37	Claude Code (Anthropic)	Claude Opus 4.7 (medium)	Anthropic	40.5	27.4	71.4	22.6	$1.68	4.5M	6.3m	42
38	Codex (OpenAI)	GPT-5.4 (medium)	OpenAI	39.0	25.0	69.8	22.3	$2.42	5.9M	7.1m	73
39	Cursor CLI (Cursor)	Composer 2.5 Fast	Cursor	38.2	15.9	67.5	31.2	$0.55	4.2M	6.8m	117
40	Cursor CLI (Cursor)	Composer 2.5	Cursor	38.2	15.9	67.5	31.2	$0.08	3.6M	9.5m	117
41	Claude Code (Anthropic)	Claude Sonnet 4.6 (medium)	Anthropic	37.6	28.9	64.3	19.6	$2.01	8.4M	13.5m	67
42	Cursor CLI (Cursor)	GPT-5.4 (medium)	OpenAI	36.8	16.7	65.5	28.2	$1.55	4.0M	8.1m	26
43	Codex (OpenAI)	GPT-5.6 Terra (low)	OpenAI	36.7	29.8	57.5	22.8	$0.48	1.5M	2.8m	37
44	Claude Code (Anthropic)	GLM-5.1	Z AI	36.1	18.6	65.1	24.7	$4.33	25.9M	19.6m	174
45	Claude Code (Anthropic)	Qwen3.7 Plus (thinking)	Alibaba	36.0	19.2	65.1	23.7	$6.23	8.7M	10.6m	146
46	Claude Code (Anthropic)	Kimi K2.6	Kimi	32.6	16.5	65.5	15.9	$1.19	11.5M	41m	131
47	Claude Code (Anthropic)	DeepSeek V4 Pro (high)	DeepSeek	31.5	8.6	65.9	19.9	$0.27	9.8M	17.9m	127
48	Gemini CLI (Google)	Gemini 3.1 Pro (high)	Google	30.4	14.2	68.3	8.6	$2.00	4.7M	10.8m	31
49	Cursor CLI (Cursor)	Composer 2	Cursor	27.5	0.0	64.7	17.7	$0.04	2.9M	8.6m	28
50	Codex (OpenAI)	GPT-5.6 Luna (low)	OpenAI	25.1	10.3	49.6	15.3	$0.21	1.5M	1.9m	35
51	Codex (OpenAI)	GPT-5.6 Terra (none)	OpenAI	23.7	13.3	39.3	18.5	$0.37	1.1M	1.8m	34
52	Codex (OpenAI)	GPT-5.6 Luna (none)	OpenAI	20.4	6.5	37.3	17.5	$0.35	3.6M	2.5m	56

Coding Agent Index

Equal-weight mean of three real-world suites: hard repository tasks, agentic terminal work, and codebase Q&A. Higher is better.

Composed of DeepSWE (113 tasks), Terminal-Bench v2 (84 tasks), and SWE-Atlas-QnA (124 tasks). The agent is the unit of measurement — the same model lands differently in different harnesses.
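As a quick check on the definition, the index can be recomputed from the per-suite columns of the comparison table. This is a minimal sketch: the function name `coding_agent_index` is ours, and rounding to one decimal is an assumption based on how the table displays scores.

```python
from statistics import mean


def coding_agent_index(deepswe: float, term_bench: float, atlas_qna: float) -> float:
    """Equal-weight mean of the three suite scores, rounded to one decimal."""
    return round(mean([deepswe, term_bench, atlas_qna]), 1)


# Suite scores copied from the comparison table.
opus_5_xhigh = coding_agent_index(60.5, 84.9, 54.8)  # Claude Code · Opus 5 (xhigh)
sol_max = coding_agent_index(68.7, 87.7, 43.3)       # Codex · GPT-5.6 Sol (max)
composer_2 = coding_agent_index(0.0, 64.7, 17.7)     # Cursor CLI · Composer 2

print(opus_5_xhigh, sol_max, composer_2)
```

The three results match the Index column for those rows (66.7, 66.6, and 27.5).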

DeepSWE

Solve rate on 113 real-world software engineering tasks in real repositories, %.

Terminal-Bench v2

Solve rate on 84 agentic terminal tasks in a live shell, %.

SWE-Atlas-QnA

Rubric score on 124 codebase Q&A tasks, %.

The Harness Effect — Claude Opus 4.7 (medium)

The same model, three harnesses. The scaffold alone moves the Coding Agent Index.

Identical model weights and settings; only the harness changes. Claude Opus 4.7 (medium) scores 49.9 in Opencode, 45.4 in Cursor CLI, and 40.5 in Claude Code, so prompting, context management, and tooling are worth real points.

Harness Spread

Index points between a model's best and worst harness, for every model run in 2+ harnesses

1. Claude Opus 4.7 (medium) (3 harnesses): 9.5
2. GPT-5.5 (medium) (2 harnesses): 8.2
3. GPT-5.4 (medium) (2 harnesses): 2.2
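The spread can be recomputed from the Index column of the comparison table. This is a sketch only: the `INDEX` dictionary and the `harness_spread` helper are illustrative names, and the inputs are the rounded table values.

```python
from collections import defaultdict

# (harness, model) -> Coding Agent Index, copied from the comparison table.
INDEX = {
    ("Opencode", "Claude Opus 4.7 (medium)"): 49.9,
    ("Cursor CLI", "Claude Opus 4.7 (medium)"): 45.4,
    ("Claude Code", "Claude Opus 4.7 (medium)"): 40.5,
    ("Codex", "GPT-5.5 (medium)"): 54.3,
    ("Cursor CLI", "GPT-5.5 (medium)"): 46.1,
    ("Codex", "GPT-5.4 (medium)"): 39.0,
    ("Cursor CLI", "GPT-5.4 (medium)"): 36.8,
    ("Codex", "GPT-5.6 Sol (max)"): 66.6,  # one harness only, so it is skipped
}


def harness_spread(index):
    """Best-minus-worst index per model, for models run in 2+ harnesses."""
    by_model = defaultdict(list)
    for (_harness, model), score in index.items():
        by_model[model].append(score)
    return {
        model: round(max(scores) - min(scores), 1)
        for model, scores in by_model.items()
        if len(scores) >= 2
    }


spreads = harness_spread(INDEX)
for model, spread in sorted(spreads.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {spread}")
```

From the rounded table values this gives 9.4 for Claude Opus 4.7 (medium), 8.2 for GPT-5.5 (medium), and 2.2 for GPT-5.4 (medium). The 0.1 difference from the published 9.5 is presumably because the published figure is computed from unrounded scores; that is our assumption, not something the page states.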

Cost per Task

Mean USD to complete one task, across all three suites. Lower is better.

Measured mean spend per task at list API prices, cache discounts included. The spread is the story: the priciest configuration costs roughly 290× the cheapest.
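The cost accounting can be sketched as a function of the token mix. The function name, the token counts, and the per-million-token prices below are all hypothetical placeholders rather than figures from any run; only the 10% cache-read rate and the two table prices in the final ratio come from this page.

```python
def cost_per_task_usd(
    fresh_input_tokens: int,
    cache_read_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    cache_read_fraction: float = 0.10,  # cache reads bill at 10% of the input rate
) -> float:
    """Cost of one task from its token mix at list prices."""
    per_input_token = input_price_per_mtok / 1_000_000
    per_output_token = output_price_per_mtok / 1_000_000
    return (
        fresh_input_tokens * per_input_token
        + cache_read_tokens * per_input_token * cache_read_fraction
        + output_tokens * per_output_token
    )


# Hypothetical mix: 0.5M fresh input, 9M cache reads, 0.1M output,
# at hypothetical prices of $3 (input) and $15 (output) per million tokens.
cost = cost_per_task_usd(500_000, 9_000_000, 100_000, 3.00, 15.00)
print(f"${cost:.2f}")  # 1.50 + 2.70 + 1.50

# Spread between the priciest ($11.71) and cheapest ($0.04) table entries.
ratio = 11.71 / 0.04
print(f"{ratio:.0f}x")
```

In this made-up mix the cache reads are 94% of the tokens but under half of the bill, which is why the token split matters as much as the total.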

Coding Agent Index vs. Cost per Task

Capability against mean USD per task (log scale).

Up and to the left wins: more solved tasks per dollar. Open-weight models power most of the value corner.
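One way to read "up and to the left" is as a Pareto frontier: a configuration is on it if no other configuration is both at least as cheap and at least as capable. The sketch below uses a hand-picked subset of eight table rows, so its result is the frontier of that subset only, not of the full table; `pareto_frontier` is an illustrative helper.

```python
def pareto_frontier(configs):
    """Keep configs that no rival beats on cost and index at once."""
    frontier = []
    for name, cost, index in configs:
        dominated = any(
            c <= cost and i >= index and (c < cost or i > index)
            for _name, c, i in configs
        )
        if not dominated:
            frontier.append((name, cost, index))
    return sorted(frontier, key=lambda row: row[1])  # cheapest first


# (configuration, $/task, index): a subset of the comparison table.
SUBSET = [
    ("Claude Code · Opus 5 (xhigh)", 8.23, 66.7),
    ("Codex · GPT-5.6 Sol (max)", 7.08, 66.6),
    ("Claude Code · Fable 5 (max)", 11.71, 65.8),
    ("Grok Build · Grok 4.5 (high)", 2.59, 64.4),
    ("Codex · GPT-5.6 Luna (max)", 1.57, 58.7),
    ("Claude Code · GLM-5.2", 6.51, 43.2),
    ("Cursor CLI · Composer 2.5", 0.08, 38.2),
    ("Cursor CLI · Composer 2", 0.04, 27.5),
]

frontier_names = [name for name, _cost, _index in pareto_frontier(SUBSET)]
for name in frontier_names:
    print(name)
```

Within this subset, Claude Code · Fable 5 (max) and Claude Code · GLM-5.2 drop out because another configuration is both cheaper and higher-scoring; the other six remain.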

Token Usage per Task

Mean tokens consumed per task, split into fresh input, cache reads, and output.

Agents read far more than they write — cache reads are the overwhelming majority of tokens everywhere. Output is the thin dark sliver on top.

Cache Hit Rate

Share of context reads served from prompt cache, %. Higher is better for cost.

Harnesses that keep context stable cache better. Every point of hit rate is money: cached reads bill at a tenth of the fresh-input price.
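To make the "every point is money" claim concrete, here is a small sketch. It assumes the hit rate is cache reads divided by all context reads (fresh input plus cache reads), which is our reading of the definition above; the token counts are hypothetical.

```python
def cache_hit_rate(fresh_input_tokens: int, cache_read_tokens: int) -> float:
    """Share of context reads served from the prompt cache (0.0 to 1.0)."""
    total = fresh_input_tokens + cache_read_tokens
    if total == 0:
        raise ValueError("no context reads recorded")
    return cache_read_tokens / total


def input_cost_multiplier(hit_rate: float, cache_read_fraction: float = 0.10) -> float:
    """Context-read cost relative to paying the fresh-input price for every token."""
    return (1 - hit_rate) + hit_rate * cache_read_fraction


# Hypothetical run: 1M fresh input tokens and 9M cache-read tokens.
rate = cache_hit_rate(1_000_000, 9_000_000)
print(f"hit rate: {rate:.0%}")
print(f"at 90%: {input_cost_multiplier(0.90):.3f}x of uncached cost")
print(f"at 95%: {input_cost_multiplier(0.95):.3f}x of uncached cost")
```

Under these assumptions, moving from a 90% to a 95% hit rate lowers the context-read bill from 0.19 to 0.145 of the uncached cost, a cut of roughly a quarter.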

Coding Agent Index vs. Total Tokens

Capability against mean total tokens per task (millions).

Reading more of the repo correlates with solving more of it — but the best harnesses get more index per token read.

Execution Time per Task

Mean wall-clock minutes from task start to the agent declaring done. Lower is better.

Includes model latency, tool calls, builds, and test runs. The fastest configurations in the comparison table average under 2 minutes per task; the most deliberate average more than 20 minutes, and the slowest (Claude Code · Kimi K2.6) averages 41 minutes.

Coding Agent Index vs. Execution Time

Capability against mean wall-clock minutes per task.

Up and to the left wins: capable and quick. Slow is only worth it if the index follows.

Turns per Task

Mean assistant turns (tool-call rounds) per task.

More turns means more, smaller steps — not necessarily better results. Turn count tracks harness style more than capability.

Run Specifications

Every configuration runs the same way, so the numbers compare clean.

Environment: Fresh container per task, repo pinned to a fixed commit, network limited to package mirrors.
Attempts: One attempt per task (pass@1), no retries, no human nudges.
Configuration: Each harness runs at default settings with its recommended model configuration.
Budget: Hard cap of 60 minutes wall-clock per task; runs that exceed it score zero.
Cost accounting: List API prices at snapshot date; cache reads billed at 10% of the input rate.
Reporting: Cost, tokens, time, and turns are means across all completed tasks in the three suites.
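The attempt, budget, and reporting rules above can be sketched as a small aggregation. The `Run` record and both helper names are hypothetical, the sample runs are invented, and treating "completed" as "finished within the 60-minute cap" is our assumption; the page does not define it.

```python
from dataclasses import dataclass
from statistics import mean

TIME_CAP_MINUTES = 60  # hard wall-clock cap per task


@dataclass
class Run:
    passed: bool      # did the single attempt pass the task's tests?
    minutes: float    # wall-clock time for the attempt
    cost_usd: float   # spend for the attempt


def solve_rate(runs: list[Run]) -> float:
    """Percent of tasks solved under pass@1; runs over the cap score zero."""
    solved = sum(1 for r in runs if r.passed and r.minutes <= TIME_CAP_MINUTES)
    return 100 * solved / len(runs)


def mean_cost_completed(runs: list[Run]) -> float:
    """Mean spend over runs that finished within the cap (assumed meaning of 'completed')."""
    return mean(r.cost_usd for r in runs if r.minutes <= TIME_CAP_MINUTES)


# Invented sample: the third run passes but exceeds the cap, so it scores zero.
runs = [
    Run(passed=True, minutes=12.0, cost_usd=2.00),
    Run(passed=False, minutes=30.0, cost_usd=4.00),
    Run(passed=True, minutes=61.0, cost_usd=9.00),
    Run(passed=True, minutes=5.0, cost_usd=1.50),
]

print(solve_rate(runs))
print(mean_cost_completed(runs))
```

With this sample the solve rate is 50.0 and the mean cost over the three in-cap runs is 2.5, showing how a timed-out run lowers the score without entering the cost mean.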

Frequently Asked Questions

What is the Coding Agent Index?

The equal-weight mean of a configuration's scores on the three task suites. One number for how much real software work gets done — no extra weighting tricks, no style points.

What do the three suites actually test?

DeepSWE is 113 real-world software-engineering tasks in real repositories, graded end to end. Terminal-Bench v2 is 84 multi-step jobs in a live shell — builds, migrations, debugging, ops. SWE-Atlas-QnA is 124 questions that require navigating a large codebase and answering precisely.

How are tasks scored?

Implementation and terminal tasks are pass@1: one attempt, and the test suite either passes or it doesn't. Codebase Q&A earns partial credit against a rubric. Nothing is cherry-picked or re-run.

What counts as execution time?

Wall-clock time from handing the agent a task to the agent declaring done, including model latency, tool calls, builds, and test runs. It is the time you actually wait.

Why track tokens at all?

Because agents read far more than they write. Cache reads make up most of the tokens but bill at a tenth of the fresh-input rate, so two agents with the same index can differ several-fold in cost. The token mix explains the cost chart.

Why does the same model score differently across harnesses?

The harness decides what the model sees and which tools it gets — system prompts, context management, edit formats, test loops. Same engine, different car.

Methodology: every configuration runs the same three suites — DeepSWE (113 tasks), Terminal-Bench v2 (84 tasks), and SWE-Atlas-QnA (124 tasks) — and the Coding Agent Index is their equal-weight mean. Suite design and metric definitions follow a standard public coding-agents methodology. Figures are Glsrm editorial estimates calibrated to our model benchmark table, not any external published results. Cost per task is derived from each run's mean token mix at list API prices, with cache reads billed at 10% of the input rate. Prices change frequently. Logos identify the respective model creators.