Real software-engineering tasks, run through complete agent stacks — harness plus model — and scored on what actually ships: solve rate across three task suites (higher is better), dollars per task, tokens consumed, and minutes on the clock (all lower is better).
21 Configurations · 5 Harnesses · 17 Models · Snapshot: June 11, 2026
Coding Agent Index
Higher is better · Equal-weight mean of the three suites · top configurations
1. Claude Code · Opus 4.8 (max): 77.2
2. Claude Code · Opus 4.8 (medium): 67.2
3. Claude Code · Opus 4.7 (max): 66.6
4. Codex · GPT-5.5 (xhigh): 65.3
5. Opencode · Opus 4.7 (medium): 64.6
6. Cursor CLI · Composer 2.5: 62.9
7. Cursor CLI · Composer 2.5 Fast: 62.9
8. Cursor CLI · Opus 4.7 (medium): 61.1
9. Codex · GPT-5.5 (medium): 60.4
Cost per Task
Lower is better · Mean USD per completed task · cheapest configurations
1. Cursor CLI · Composer 2.5: $0.07
2. Cursor CLI · Composer 2: $0.07
3. Claude Code · DeepSeek V4 Pro (high): $0.35
4. Cursor CLI · Composer 2.5 Fast: $0.44
5. Claude Code · Kimi K2.6: $0.76
6. Claude Code · Sonnet 4.6 (medium): $1.02
7. Claude Code · Opus 4.7 (medium): $1.24
8. Claude Code · Opus 4.6 (medium): $1.27
9. Cursor CLI · Opus 4.7 (medium): $1.47
Time per Task
Lower is better · Mean wall-clock minutes per task · fastest configurations
1. Claude Code · Opus 4.7 (medium): 5.8m
2. Cursor CLI · GPT-5.5 (medium): 6.2m
3. Cursor CLI · Composer 2.5 Fast: 6.7m
4. Codex · GPT-5.4 (medium): 6.9m
5. Claude Code · Opus 4.6 (medium): 7m
6. Codex · GPT-5.5 (medium): 7.1m
7. Cursor CLI · GPT-5.4 (medium): 7.6m
8. Gemini CLI · Gemini 3.1 Pro (high): 7.6m
9. Cursor CLI · Opus 4.7 (medium): 7.8m
Coding agent comparison summary
Every harness × model configuration we track, ranked by Coding Agent Index.
| # | Agent | Model | Lab | Index | SWE-Pro | Term-Bench | Atlas QnA | $/Task | Tokens/Task | Time | Turns |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Code (Anthropic) | Opus 4.8 (max) | Anthropic | 77.2 | 69.8 | 79.4 | 82.5 | $4.62 | 9.4M | 17.7m | 134 |
| 2 | Claude Code (Anthropic) | Opus 4.8 (medium) | Anthropic | 67.2 | 49.8 | 75.0 | 76.8 | $1.60 | 3.4M | 8.8m | 63 |
| 3 | Claude Code (Anthropic) | Opus 4.7 (max) | Anthropic | 66.6 | 44.9 | 73.8 | 81.0 | $4.14 | 11.2M | 13.8m | 97 |
| 4 | Codex (OpenAI) | GPT-5.5 (xhigh) | OpenAI | 65.3 | 30.9 | 84.1 | 80.8 | $4.33 | 9.3M | 8.7m | 96 |
| 5 | Opencode (Opencode) | Opus 4.7 (medium) | Anthropic | 64.6 | 40.2 | 74.6 | 79.0 | $1.82 | 4.4M | 9.7m | 43 |
| 6 | Cursor CLI (Cursor) | Composer 2.5 | Cursor | 62.9 | 49.2 | 66.9 | 72.5 | $0.07 | 2.8M | 9.3m | 101 |
| 7 | Cursor CLI (Cursor) | Composer 2.5 Fast | Cursor | 62.9 | 49.2 | 66.9 | 72.5 | $0.44 | 3.1M | 6.7m | 101 |
| 8 | Cursor CLI (Cursor) | Opus 4.7 (medium) | Anthropic | 61.1 | 34.4 | 70.6 | 78.4 | $1.47 | 2.9M | 7.8m | 61 |
| 9 | Codex (OpenAI) | GPT-5.5 (medium) | OpenAI | 60.4 | 26.2 | 75.8 | 79.1 | $2.21 | 5.4M | 7.1m | 73 |
| 10 | Claude Code (Anthropic) | Opus 4.7 (medium) | Anthropic | 59.8 | 36.4 | 71.4 | 71.7 | $1.24 | 3.3M | 5.8m | 35 |
| 11 | Cursor CLI (Cursor) | GPT-5.5 (medium) | OpenAI | 57.8 | 24.9 | 73.4 | 75.0 | $1.61 | 2.8M | 6.2m | 69 |
| 12 | Codex (OpenAI) | GPT-5.4 (medium) | OpenAI | 53.5 | 18.4 | 69.8 | 72.4 | $2.09 | 4.9M | 6.9m | 70 |
| 13 | Claude Code (Anthropic) | | Alibaba | 53.1 | 22.9 | 64.7 | 71.8 | $4.98 | 6.0M | 10.5m | 126 |
| 14 | Claude Code (Anthropic) | | Z AI | 52.7 | 19.8 | 65.1 | 73.2 | $2.26 | 8.9M | 21.6m | 99 |
| 15 | Cursor CLI (Cursor) | GPT-5.4 (medium) | OpenAI | 52.2 | 18.9 | 64.7 | 72.9 | $1.53 | 3.8M | 7.6m | 36 |
| 16 | Claude Code (Anthropic) | Opus 4.6 (medium) | Anthropic | 51.3 | 11.8 | 70.2 | 71.9 | $1.27 | 4.2M | 7m | 38 |
| 17 | Claude Code (Anthropic) | Kimi K2.6 | Kimi | 50.5 | 27.3 | 64.3 | 59.8 | $0.76 | 7.3M | 41.5m | 111 |
| 18 | Claude Code (Anthropic) | DeepSeek V4 Pro (high) | DeepSeek | 50.2 | 18.0 | 64.7 | 67.8 | $0.35 | 6.2M | 18m | 101 |
| 19 | Claude Code (Anthropic) | Sonnet 4.6 (medium) | Anthropic | 49.4 | 14.9 | 63.1 | 70.3 | $1.02 | 4.3M | 9.2m | 47 |
| 20 | Cursor CLI (Cursor) | Composer 2 | Cursor | 48.5 | 12.2 | 64.3 | 68.9 | $0.07 | 3.3M | 8.7m | 44 |
| 21 | Gemini CLI (Google) | Gemini 3.1 Pro (high) | Google | 43.0 | 15.1 | 68.3 | 45.6 | $1.60 | 3.2M | 7.6m | 44 |
Coding Agent Index
Equal-weight mean of three real-world suites: hard repository tasks, agentic terminal work, and codebase Q&A. Higher is better.
Composed of SWE-Bench-Pro-Hard-AA (150 tasks), Terminal-Bench v2 (84 tasks), and SWE-Atlas-QnA (124 tasks). The agent is the unit of measurement — the same model lands differently in different harnesses.
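As a minimal sketch of that definition, the index can be reproduced from the three suite scores. The function names here are illustrative, not taken from the published methodology; the inputs are the suite scores of the top-ranked row in the comparison table. The task-weighted variant is shown only for contrast, to make clear what "equal-weight" rules out.

```python
# Task counts per suite, as stated on the page.
SUITE_TASKS = {"swe_pro": 150, "term_bench": 84, "atlas_qna": 124}

def coding_agent_index(swe_pro: float, term_bench: float, atlas_qna: float) -> float:
    """Equal-weight mean of the three suite scores (each in percent)."""
    return (swe_pro + term_bench + atlas_qna) / 3

def task_weighted_mean(swe_pro: float, term_bench: float, atlas_qna: float) -> float:
    """NOT the index: a per-task weighting, shown only for comparison."""
    scores = {"swe_pro": swe_pro, "term_bench": term_bench, "atlas_qna": atlas_qna}
    total_tasks = sum(SUITE_TASKS.values())
    return sum(scores[s] * SUITE_TASKS[s] for s in scores) / total_tasks

# Suite scores of the #1 configuration in the table.
top_index = coding_agent_index(69.8, 79.4, 82.5)
top_weighted = task_weighted_mean(69.8, 79.4, 82.5)
print(round(top_index, 1))  # 77.2, matching the Index column
```

Because each suite counts once regardless of size, the 84-task terminal suite moves the index as much as the 150-task repository suite.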
SWE-Bench-Pro-Hard-AA
Solve rate on 150 hard code generation tasks in real repositories, %.
Terminal-Bench v2
Solve rate on 84 agentic terminal tasks in a live shell, %.
SWE-Atlas-QnA
Rubric score on 124 codebase Q&A tasks, %.
The Harness Effect — Claude Opus 4.7 (medium)
The same model, three harnesses. The scaffold alone moves the Coding Agent Index.
Identical model weights and settings; only the harness changes. Prompting, context management, and tooling are worth real points.
Harness Spread
Points of Index · Index points between a model's best and worst harness, for every model run in 2+ harnesses
1. Claude Opus 4.7 (medium) (3 harnesses): 4.8
2. GPT-5.5 (medium) (2 harnesses): 2.6
3. GPT-5.4 (medium) (2 harnesses): 1.4
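The spread is simply the best index minus the worst index for each model that appears in two or more harnesses. A rough sketch follows, using two of the three models; the pairing of index values to harnesses is our reading of the ranked lists and the comparison table, and `harness_spread` is a hypothetical helper name.

```python
from collections import defaultdict

# (model, harness, index). Row matching is an assumption inferred from the page.
runs = [
    ("Claude Opus 4.7 (medium)", "Opencode", 64.6),
    ("Claude Opus 4.7 (medium)", "Cursor CLI", 61.1),
    ("Claude Opus 4.7 (medium)", "Claude Code", 59.8),
    ("GPT-5.5 (medium)", "Codex", 60.4),
    ("GPT-5.5 (medium)", "Cursor CLI", 57.8),
    ("Opus 4.8 (max)", "Claude Code", 77.2),  # one harness only, so no spread
]

def harness_spread(runs):
    """Best-minus-worst index per model, for models run in 2+ harnesses."""
    by_model = defaultdict(list)
    for model, _harness, index in runs:
        by_model[model].append(index)
    return {
        model: round(max(scores) - min(scores), 1)
        for model, scores in by_model.items()
        if len(scores) >= 2
    }

spread = harness_spread(runs)
print(spread)
```

Working from the one-decimal values shown in the table can differ by a tenth from a spread computed on unrounded scores.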
Cost per Task
Mean USD to complete one task, across all three suites. Lower is better.
Measured mean spend per task at list API prices, cache discounts included. The spread is the story: the priciest configuration costs roughly 70× the cheapest.
Coding Agent Index vs. Cost per Task
Capability against mean USD per task (log scale).
Up and to the left wins: more solved tasks per dollar. Open-weight models power most of the value corner.
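One way to make "up and to the left" concrete is a Pareto frontier: a configuration stays on the frontier only if no cheaper (or equally priced) configuration reaches the same or a higher index. The sketch below is an illustration rather than how the chart is drawn; it uses six rows from the comparison table, with model names matched to rows by rank.

```python
from typing import NamedTuple

class Config(NamedTuple):
    name: str
    usd_per_task: float
    index: float

def pareto_frontier(configs):
    """Configurations not dominated on (lower cost, higher index)."""
    frontier = []
    best_index = float("-inf")
    # Cheapest first; on a cost tie, the higher index is considered first.
    for c in sorted(configs, key=lambda c: (c.usd_per_task, -c.index)):
        if c.index > best_index:
            frontier.append(c)
            best_index = c.index
    return frontier

configs = [
    Config("Claude Code · Opus 4.8 (max)", 4.62, 77.2),
    Config("Claude Code · Opus 4.8 (medium)", 1.60, 67.2),
    Config("Codex · GPT-5.5 (xhigh)", 4.33, 65.3),
    Config("Opencode · Opus 4.7 (medium)", 1.82, 64.6),
    Config("Cursor CLI · Composer 2.5", 0.07, 62.9),
    Config("Cursor CLI · Composer 2.5 Fast", 0.44, 62.9),
]

frontier = pareto_frontier(configs)
for c in frontier:
    print(f"{c.name}: ${c.usd_per_task:.2f}, index {c.index}")
```

In this subset, Composer 2.5 Fast drops off the frontier because Composer 2.5 reaches the same index for less money; speed is not part of this two-axis comparison.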
Token Usage per Task
Mean tokens consumed per task, split into fresh input, cache reads, and output.
Agents read far more than they write — cache reads are the overwhelming majority of tokens everywhere. Output is the thin dark sliver on top.
Cache Hit Rate
Share of context reads served from prompt cache, %. Higher is better for cost.
Harnesses that keep context stable cache better. Every point of hit rate is money: cached reads bill at a tenth of the fresh-input price.
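To see why each point of hit rate matters, here is a small sketch of the hit rate and the resulting blended input price. The token counts and the $5 per million list price are made-up illustration values; only the one-tenth cache-read rate comes from the page.

```python
CACHE_READ_RATIO = 0.10  # cached reads bill at a tenth of the fresh-input price

def cache_hit_rate(fresh_input_tokens: int, cache_read_tokens: int) -> float:
    """Share of context reads served from the prompt cache (0.0 to 1.0)."""
    total = fresh_input_tokens + cache_read_tokens
    return cache_read_tokens / total if total else 0.0

def effective_input_price(list_price_per_mtok: float, hit_rate: float) -> float:
    """Blended USD per million input tokens at a given cache hit rate."""
    return list_price_per_mtok * ((1 - hit_rate) + hit_rate * CACHE_READ_RATIO)

# Hypothetical task: 300k fresh input tokens, 2.7M cache-read tokens.
hit = cache_hit_rate(300_000, 2_700_000)
blended = effective_input_price(5.00, hit)   # hypothetical $5 / Mtok list price
per_point = 5.00 * (1 - CACHE_READ_RATIO) * 0.01  # saving per point of hit rate
print(f"hit rate {hit:.0%}, blended ${blended:.2f}/Mtok, ${per_point:.3f}/Mtok per point")
```

Under these assumptions, a 90% hit rate cuts the effective input price to about a fifth of list price.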
Coding Agent Index vs. Total Tokens
Capability against mean total tokens per task (millions).
Reading more of the repo correlates with solving more of it — but the best harnesses get more index per token read.
Execution Time per Task
Mean wall-clock minutes from task start to the agent declaring done. Lower is better.
Includes model latency, tool calls, builds, and test runs. The fastest configurations in the table finish in about 6 minutes; the slowest averages 41.5 minutes, roughly seven times as long.
Coding Agent Index vs. Execution Time
Capability against mean wall-clock minutes per task.
Up and to the left wins: capable and quick. Slow is only worth it if the index follows.
Turns per Task
Mean assistant turns (tool-call rounds) per task.
More turns means more, smaller steps — not necessarily better results. Turn count tracks harness style more than capability.
Run Specifications
Every configuration runs the same way, so the numbers compare clean.
- Environment: Fresh container per task, repo pinned to a fixed commit, network limited to package mirrors.
- Attempts: One attempt per task (pass@1), no retries, no human nudges.
- Configuration: Each harness runs at default settings with its recommended model configuration.
- Budget: Hard cap of 60 minutes wall-clock per task; runs that exceed it score zero.
- Cost accounting: List API prices at snapshot date; cache reads billed at 10% of the input rate.
- Reporting: Cost, tokens, time, and turns are means across all completed tasks in the three suites.
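The cost-accounting rule can be sketched as a function of a task's token mix. The `TokenMix` type, the token counts, and the $5 and $25 per-million prices below are hypothetical placeholders, not figures from any run; only the 10% cache-read rate is taken from the specification above.

```python
from dataclasses import dataclass

@dataclass
class TokenMix:
    fresh_input: int  # uncached input tokens
    cache_read: int   # input tokens served from the prompt cache
    output: int       # generated tokens

def cost_per_task(mix: TokenMix, input_usd_per_mtok: float,
                  output_usd_per_mtok: float, cache_ratio: float = 0.10) -> float:
    """USD for one task at list prices, with cache reads at 10% of the input rate."""
    in_rate = input_usd_per_mtok / 1_000_000
    out_rate = output_usd_per_mtok / 1_000_000
    return (mix.fresh_input * in_rate
            + mix.cache_read * in_rate * cache_ratio
            + mix.output * out_rate)

# Hypothetical mix and prices, for illustration only.
mix = TokenMix(fresh_input=200_000, cache_read=3_000_000, output=40_000)
usd = cost_per_task(mix, input_usd_per_mtok=5.00, output_usd_per_mtok=25.00)
print(f"${usd:.2f}")  # $3.50
```

In this example the 3M cached tokens cost $1.50, against $1.00 each for the much smaller fresh-input and output shares.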
Frequently Asked Questions
What is the Coding Agent Index?
The equal-weight mean of a configuration's scores on the three task suites. One number for how much real software work gets done — no extra weighting tricks, no style points.
What do the three suites actually test?
SWE-Bench-Pro-Hard-AA is 150 hard bug-fix and feature tasks in real repositories, graded by hidden tests. Terminal-Bench v2 is 84 multi-step jobs in a live shell — builds, migrations, debugging, ops. SWE-Atlas-QnA is 124 questions that require navigating a large codebase and answering precisely.
How are tasks scored?
Implementation and terminal tasks are pass@1: one attempt, and the test suite either passes or it doesn't. Codebase Q&A earns partial credit against a rubric. Nothing is cherry-picked or re-run.
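A rough sketch of the two scoring modes, assuming the 60-minute cap from the run specifications is applied before grading. How rubric points are aggregated across Q&A tasks is not stated on the page, so the per-task averaging in `rubric_score` is an assumption, as are all names here.

```python
from dataclasses import dataclass

TIME_CAP_MINUTES = 60.0  # hard wall-clock cap from the run specifications

@dataclass
class RunResult:
    tests_passed: bool
    minutes: float

def solve_rate(results: list[RunResult]) -> float:
    """pass@1 solve rate in percent; runs over the time cap score zero."""
    solved = sum(1 for r in results
                 if r.tests_passed and r.minutes <= TIME_CAP_MINUTES)
    return 100 * solved / len(results)

def rubric_score(graded: list[tuple[float, float]]) -> float:
    """Mean per-task partial credit in percent, from (earned, possible) pairs.
    Averaging per task is an assumption, not the documented method."""
    return 100 * sum(earned / possible for earned, possible in graded) / len(graded)

# Hypothetical results: one pass, one fail, one pass over the cap, one pass.
results = [RunResult(True, 10), RunResult(False, 5),
           RunResult(True, 61), RunResult(True, 59)]
rate = solve_rate(results)
qna = rubric_score([(3, 4), (2, 2)])
print(rate, qna)  # 50.0 87.5
```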
What counts as execution time?
Wall-clock from handing the agent a task to the agent declaring done — model latency, tool calls, builds, and test runs included. It's the number you actually wait.
Why track tokens at all?
Because agents read far more than they write. Cache reads dominate the bill even though they are charged at a tenth of the input rate, so two agents with the same index can differ several-fold in cost. The token mix is the why behind the cost chart.
Why does the same model score differently across harnesses?
The harness decides what the model sees and which tools it gets — system prompts, context management, edit formats, test loops. Same engine, different car.
Methodology: every configuration runs the same three suites — SWE-Bench-Pro-Hard-AA (150 tasks), Terminal-Bench v2 (84 tasks), and SWE-Atlas-QnA (124 tasks) — and the Coding Agent Index is their equal-weight mean. Suite design and metric definitions follow the public coding-agents methodology of Artificial Analysis (artificialanalysis.ai/agents/coding-agents). Figures are Glsrm editorial estimates calibrated to our model benchmark table, not Artificial Analysis's published results. Cost per task is derived from each run's mean token mix at list API prices, with cache reads billed at 10% of the input rate. Prices change frequently. Logos identify the respective model creators.