SkillsBench
Benchmarking how well Skills work across diverse, verifiable tasks.
SkillsBench measures whether Agent Skills improve AI coding agents, using paired runs, deterministic verifiers, and a public leaderboard across expert-curated tasks.
The Architecture
How SkillsBench Evaluates Agents
Skills Layer
Domain-specific capabilities and workflows that extend agent functionality. Like applications on an OS, skills provide specialized knowledge and tools for particular tasks.
Agent Harness
The execution environment that orchestrates agents, manages tool access, and handles I/O. Analogous to an operating system that mediates between applications and hardware.
Models
The foundational AI models that power reasoning and generation. Like CPUs, they provide the raw computational capability that upper layers build upon.
Agent Performance
Pass rates across 10 agent–model configurations on SkillsBench (84 tasks, 5 trials per task).
| # | Agent | Model | With Skills |
|---|---|---|---|
| 1 | Gemini CLI | Gemini 3.1 Pro | 53.8% |
| 2 | Claude Code | Opus 4.7 | 51.6% |
| 3 | Gemini CLI | Gemini 3 Flash | 48.7% |
| 4 | Codex | GPT-5.5 | 48.1% |
| 5 | Claude Code | Opus 4.5 | 45.3% |
| 6 | Codex | GPT-5.2 | 44.7% |
| 7 | Claude Code | Opus 4.6 | 44.5% |
| 8 | Gemini CLI | Gemini 3 Pro | 41.2% |
| 9 | Claude Code | Sonnet 4.5 | 31.8% |
| 10 | Claude Code | Haiku 4.5 | 27.7% |
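The pass rates above aggregate verifier verdicts over repeated trials. A minimal sketch of that computation, assuming each trial carries a boolean verdict from a deterministic verifier (the `Trial` type and field names are illustrative, not SkillsBench data structures):

```python
# Hypothetical sketch: computing paired-run pass rates from trial records.
from dataclasses import dataclass

@dataclass
class Trial:
    task_id: str
    with_skills: bool
    passed: bool  # verdict from a deterministic verifier

def pass_rate(trials: list[Trial], with_skills: bool) -> float:
    """Fraction of trials in one arm (with or without skills) that passed."""
    arm = [t for t in trials if t.with_skills == with_skills]
    return sum(t.passed for t in arm) / len(arm)

# Toy example: 2 tasks, each run once with and once without skills.
trials = [
    Trial("t1", with_skills=True,  passed=True),
    Trial("t1", with_skills=False, passed=False),
    Trial("t2", with_skills=True,  passed=True),
    Trial("t2", with_skills=False, passed=True),
]
print(f"with skills: {pass_rate(trials, True):.1%}")   # 100.0%
print(f"baseline:    {pass_rate(trials, False):.1%}")  # 50.0%
```

Because the same tasks appear in both arms, the with-skills column can be compared directly against a no-skills baseline rather than across different task sets.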