by MilkeyAI

SkillsBench

Benchmarking how well Agent Skills work across diverse, verifiable tasks.

SkillsBench measures whether Agent Skills improve AI coding agents with paired runs, deterministic verifiers, and a public leaderboard across expert-curated tasks.
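The paired-run idea can be sketched in a few lines: each task is run both with and without skills, a deterministic verifier scores each run, and the per-task results are compared. The names below are hypothetical illustrations, not the actual SkillsBench harness API.

```python
def paired_gain(results: list[tuple[bool, bool]]) -> tuple[float, float, float]:
    """Given per-task (baseline_pass, skills_pass) outcomes from paired runs,
    return (baseline_rate, skills_rate, absolute_gain).

    Each tuple comes from running the same task twice with the same agent
    and model -- once without skills, once with -- and applying the same
    deterministic verifier to both outputs.
    """
    n = len(results)
    baseline_rate = sum(base for base, _ in results) / n
    skills_rate = sum(skilled for _, skilled in results) / n
    return baseline_rate, skills_rate, skills_rate - baseline_rate


# Four paired task outcomes: baseline passes 2/4, with skills 3/4.
outcomes = [(False, True), (True, True), (False, False), (True, True)]
base, skilled, gain = paired_gain(outcomes)
```

Because the two runs share the task, agent, model, and verifier, the gain isolates the contribution of the skills layer itself.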

The Architecture

How SkillsBench Evaluates Agents

01

Skills Layer

Domain-specific capabilities and workflows that extend agent functionality. Like applications on an OS, skills provide specialized knowledge and tools for particular tasks.

02

Agent Harness

The execution environment that orchestrates agents, manages tool access, and handles I/O. Analogous to an operating system that mediates between applications and hardware.

03

Models

The foundational AI models that power reasoning and generation. Like CPUs, they provide the raw computational capability that upper layers build upon.

Agent Performance

Pass rates across 10 agent–model configurations on SkillsBench (84 tasks, 5 trials per task).

#    Agent         Model            With Skills
1    Gemini CLI    Gemini 3.1 Pro   53.8%
2    Claude Code   Opus 4.7         51.6%
3    Gemini CLI    Gemini 3 Flash   48.7%
4    Codex         GPT-5.5          48.1%
5    Claude Code   Opus 4.5         45.3%
6    Codex         GPT-5.2          44.7%
7    Claude Code   Opus 4.6         44.5%
8    Gemini CLI    Gemini 3 Pro     41.2%
9    Claude Code   Sonnet 4.5       31.8%
10   Claude Code   Haiku 4.5        27.7%
84 tasks · 5 trials per task · 95% CIs
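With 84 tasks and 5 trials each, every configuration's pass rate is estimated from 420 runs, which is enough to attach a meaningful confidence interval. A minimal sketch of a 95% CI using the normal approximation (an assumption; the leaderboard does not specify its exact interval method):

```python
import math


def pass_rate_ci(passes: int, trials: int, z: float = 1.96) -> tuple[float, float, float]:
    """Pass rate with a normal-approximation confidence interval.

    z = 1.96 gives a 95% interval; the result is clamped to [0, 1].
    """
    p = passes / trials
    half_width = z * math.sqrt(p * (1 - p) / trials)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# 84 tasks x 5 trials = 420 runs; 226 passes works out to roughly 53.8%.
rate, lo, hi = pass_rate_ci(226, 420)
```

At this sample size the interval is a few percentage points wide, so adjacent leaderboard entries (e.g. 48.7% vs 48.1%) can have overlapping intervals.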