by MilkeyAI

SkillsBench

Benchmarking how well Agent Skills work across diverse, verifiable tasks.

SkillsBench measures whether Agent Skills improve AI coding agents with paired runs, deterministic verifiers, and a public leaderboard across expert-curated tasks.
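The paired-run idea can be sketched in a few lines: each task is run both with and without skills, a deterministic verifier scores each run, and the per-task results are compared. The names below are hypothetical illustrations, not the actual SkillsBench harness API.

```python
def paired_gain(results: list[tuple[bool, bool]]) -> tuple[float, float, float]:
    """Given per-task (baseline_pass, skills_pass) outcomes from paired runs,
    return (baseline_rate, skills_rate, absolute_gain).

    Each tuple comes from running the same task twice with the same agent
    and model -- once without skills, once with -- and applying the same
    deterministic verifier to both outputs.
    """
    n = len(results)
    baseline_rate = sum(base for base, _ in results) / n
    skills_rate = sum(skilled for _, skilled in results) / n
    return baseline_rate, skills_rate, skills_rate - baseline_rate


# Four paired task outcomes: baseline passes 2/4, with skills 3/4.
outcomes = [(False, True), (True, True), (False, False), (True, True)]
base, skilled, gain = paired_gain(outcomes)
```

Because the two runs share the task, agent, model, and verifier, the gain isolates the contribution of the skills layer itself.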

The Architecture

How SkillsBench Evaluates Agents

01

Skills Layer

Domain-specific capabilities and workflows that extend agent functionality. Like applications on an OS, skills provide specialized knowledge and tools for particular tasks.

02

Agent Harness

The execution environment that orchestrates agents, manages tool access, and handles I/O. Analogous to an operating system that mediates between applications and hardware.

03

Models

The foundational AI models that power reasoning and generation. Like CPUs, they provide the raw computational capability that upper layers build upon.

Agent Performance

Pass rates across 10 agent–model configurations on SkillsBench (84 tasks, 5 trials per task).

#    Agent         Model            With Skills
1    Gemini CLI    Gemini 3.1 Pro   53.8%
2    Claude Code   Opus 4.7         51.6%
3    Gemini CLI    Gemini 3 Flash   48.7%
4    Codex         GPT-5.5          48.1%
5    Claude Code   Opus 4.5         45.3%
6    Codex         GPT-5.2          44.7%
7    Claude Code   Opus 4.6         44.5%
8    Gemini CLI    Gemini 3 Pro     41.2%
9    Claude Code   Sonnet 4.5       31.8%
10   Claude Code   Haiku 4.5        27.7%
84 tasks · 5 trials per task · 95% CIs
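With 84 tasks and 5 trials each, every configuration's pass rate is estimated from 420 runs, which is enough to attach a meaningful confidence interval. A minimal sketch of a 95% CI using the normal approximation (an assumption; the leaderboard does not specify its exact interval method):

```python
import math


def pass_rate_ci(passes: int, trials: int, z: float = 1.96) -> tuple[float, float, float]:
    """Pass rate with a normal-approximation confidence interval.

    z = 1.96 gives a 95% interval; the result is clamped to [0, 1].
    """
    p = passes / trials
    half_width = z * math.sqrt(p * (1 - p) / trials)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# 84 tasks x 5 trials = 420 runs; 226 passes works out to roughly 53.8%.
rate, lo, hi = pass_rate_ci(226, 420)
```

At this sample size the interval is a few percentage points wide, so adjacent leaderboard entries (e.g. 48.7% vs 48.1%) can have overlapping intervals.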