
Introducing SkillsBench by MilkeyAI

SkillsBench by MilkeyAI benchmarks whether Agent Skills actually improve AI coding agents, using verifiable tasks, paired evaluation, and reproducible scoring.

Updated April 26, 2026
MilkeyAI Research

SkillsBench by MilkeyAI is a benchmark for answering one operational question: when an AI agent receives an Agent Skill, does the agent become more reliable at completing real work?

That question is now practical, not theoretical. AI coding agents can inspect repositories, run commands, modify files, call tools, and complete multi-step workflows. At the same time, teams are packaging procedural knowledge into Skills: small bundles of instructions, examples, scripts, references, and domain rules that an agent can load at inference time.

Skills are useful because they are easier to update than model weights and more structured than one-off prompts. But a Skill is only valuable if it improves verified outcomes. It can help, do nothing, or make the agent worse by adding irrelevant context. SkillsBench exists to measure that difference.

  • Benchmark focus: Agent Skills, procedural knowledge packaged for agents.
  • Method: paired runs that compare the same task with and without Skills.
  • Scoring: deterministic verifiers instead of subjective judging.
  • Use case: deployment decisions about which Skills are worth maintaining.

The research release behind SkillsBench reports 86 expert-created tasks across 11 domains, aggregate results on 84 tasks (two were excluded for verifier-runtime failures), and 7,308 evaluation trajectories. The accompanying paper is available on arXiv.

The Problem: Agents Need Procedural Knowledge

Modern foundation models are broadly capable, but professional work is often procedural. Success depends on knowing the right workflow, the expected file shape, the validation command, the domain convention, the tool-specific edge case, and the recovery path when something fails.

This is especially visible in coding-agent environments. An agent might understand Python, TypeScript, SQL, or shell commands, but still fail because it does not know the conventions of the target project. It might edit the right file but miss the generated registry. It might solve the apparent problem but fail the hidden verifier. It might over-apply a generic pattern where a domain-specific workflow is required.

Agent Skills are one answer to that gap. A Skill can describe a procedure, include a script, provide a template, or point to a reference that the agent should use while working. In practical terms, Skills are a way to turn repeated operational knowledge into reusable agent infrastructure.
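
As a purely illustrative picture, a Skill bundle can be thought of as a small structured artifact. The sketch below is a hypothetical Python representation for exposition only, not the actual Agent Skills or SkillsBench format:

```
from dataclasses import dataclass, field

@dataclass
class Skill:
    """Hypothetical representation of an Agent Skill bundle."""
    name: str                      # short identifier, e.g. "db-migrations"
    description: str               # when the skill applies
    instructions: str              # the procedure itself, in plain text
    scripts: list[str] = field(default_factory=list)     # helper scripts the agent may run
    references: list[str] = field(default_factory=list)  # supporting docs or templates

# Example: a skill encoding one team's migration workflow.
migration_skill = Skill(
    name="db-migrations",
    description="Apply when a task changes the database schema.",
    instructions=(
        "1. Generate a migration file instead of editing the schema directly.\n"
        "2. Run the migration against the test database.\n"
        "3. Verify with the project's check command before finishing."
    ),
    scripts=["scripts/run_migration.sh"],
    references=["docs/migration-template.sql"],
)
```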

The missing piece is evaluation. Without a benchmark, teams cannot reliably distinguish a useful Skill from a document that merely feels helpful.

What SkillsBench Evaluates

SkillsBench treats Agent Skills as first-class artifacts. The benchmark does not only ask which model is strongest; it asks whether curated procedural knowledge changes the outcome.

That distinction matters because agent performance has three interacting layers:

Layer | Role | What SkillsBench helps measure
Model | Reasoning, planning, code generation, language understanding | How much base capability exists without extra procedural knowledge
Agent harness | Tool access, file operations, terminal execution, context loading, Skill discovery | Whether the agent can route and apply Skills reliably
Skill | Task-specific procedure, examples, scripts, domain constraints | Whether curated knowledge improves the final verified outcome

This separation is the core of the benchmark. A model can be strong and still fail because it lacks procedural knowledge. A Skill can be well-written and still fail because the harness does not load it well. A harness can expose Skills correctly and still see no gain if the task already fits the model's base knowledge.

SkillsBench makes those interactions visible.
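
To make the interaction concrete, the sketch below (hypothetical names, not the SkillsBench API) enumerates the evaluation grid implied by these layers. Every combination of model, harness, and Skill condition is a separate measurement:

```
from itertools import product

# Hypothetical configuration values, for illustration only.
models = ["model-small", "model-large"]
harnesses = ["harness-a", "harness-b"]
skill_conditions = ["no_skill", "curated_skill"]

# Each (model, harness, condition) triple is one cell of the grid.
# Comparing cells that differ only in `condition` isolates the Skill effect;
# comparing cells that differ only in `harness` isolates Skill routing.
for model, harness, condition in product(models, harnesses, skill_conditions):
    print(f"evaluate(model={model}, harness={harness}, condition={condition})")
```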

Benchmark Design

SkillsBench uses expert-created tasks and deterministic verification so the benchmark can measure outcomes, not subjective impressions. Each task is designed to represent work that an agent might realistically perform in a professional environment.

The release reports 86 tasks across 11 domains. These tasks are intended to test more than ordinary coding fluency: they include domain constraints, brittle workflows, structured outputs, and verification requirements that reward correct execution rather than plausible answers.

Task Quality

For a benchmark like this to be useful, each task must be realistic, resolvable, and scorable. A task that leaks the solution through its Skill does not measure Skill usefulness. A task with an ambiguous verifier does not measure agent reliability. A task that is too artificial may not transfer to real workflows.

SkillsBench addresses this by using self-contained task environments and deterministic verifiers. The verifier decides whether the final state satisfies the task. This keeps the benchmark focused on completion, not stylistic judgment.

Paired Evaluation

The central protocol is paired evaluation. The same task is evaluated under controlled conditions, both with and without Skills, so their effect can be measured directly.

In simple terms:

  1. Run the agent without curated Skills.
  2. Run the agent with relevant curated Skills available.
  3. Compare pass rates under deterministic scoring.
  4. Repeat across tasks, domains, models, harnesses, and trials.

This is more informative than a normal leaderboard. A leaderboard can show that one agent scores higher than another. Paired evaluation shows whether the Skill itself changed performance.
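
A minimal sketch of the protocol above, assuming a hypothetical run_task helper that executes one trial and returns whether the deterministic verifier passed:

```
def paired_delta(task, agent, skill, run_task, trials=10):
    """Estimate the Skill effect on one task via paired runs.

    `run_task` is a hypothetical helper: it executes the agent on the task
    (optionally with the skill loaded) and returns True iff the deterministic
    verifier passes the final state.
    """
    base = sum(run_task(task, agent, skill=None) for _ in range(trials))
    with_skill = sum(run_task(task, agent, skill=skill) for _ in range(trials))
    base_rate = base / trials
    skill_rate = with_skill / trials
    return skill_rate - base_rate  # positive: skill helped; negative: it hurt
```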

Deterministic Scoring

SkillsBench uses deterministic verifiers rather than relying on an LLM judge. For agent benchmarks, this matters. LLM-as-judge systems can be useful in open-ended settings, but they add variance and interpretation risk. SkillsBench needs to detect whether procedural knowledge improved task completion, so the scoring has to be stable.

The task either passes the verifier or it does not.
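
For example, a verifier for a file-transformation task might look like the following sketch (a hypothetical task; the real SkillsBench verifiers are task-specific):

```
import json
from pathlib import Path

def verify(workdir: str) -> bool:
    """Deterministic check: did the agent produce the required output?

    Passes only if `report.json` exists, parses, and contains the expected
    fields with the expected values. No judgment of style or approach.
    """
    report = Path(workdir) / "report.json"
    if not report.exists():
        return False
    try:
        data = json.loads(report.read_text())
    except json.JSONDecodeError:
        return False
    return (
        isinstance(data.get("rows_processed"), int)
        and data.get("status") == "ok"
    )
```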

Key Findings

The most important result is not simply that Skills help; it is that Skills help conditionally.

The paper reports a 16.2 percentage point average pass-rate improvement from curated Skills across evaluated configurations. That is a meaningful gain, but the average hides substantial variation across domains, tasks, models, and harnesses.

Finding | What it means
Curated Skills improve average performance | Procedural knowledge can materially improve agent outcomes.
Gains vary by domain and task | Skills should be evaluated before deployment.
Some tasks get worse with Skills | More context can mislead or distract an agent.
Self-generated Skills add little benefit | Human-curated procedural knowledge remains valuable.
Smaller models can benefit from strong Skills | Skills can improve cost-performance tradeoffs.

Skills Help Most When Procedure Matters

Skills appear most valuable when a task depends on specialized procedure: uncommon formats, domain-specific constraints, brittle command sequences, or workflows that are unlikely to be fully captured in general model training.

This is the correct use case for Skills. They should not be a dumping ground for generic advice. They should encode the procedural detail that changes how an agent acts.

Skills Can Hurt

Negative Skill effects are real. A Skill can introduce the wrong workflow, overload the context window, or encourage the model to ignore local evidence. This is why blanket Skill injection is risky.

For production teams, the implication is direct: Skills should be tested like software. A Skill should have a clear purpose, an owner, and evidence that it improves the tasks where it will be used.

Self-Generated Skills Are Not Enough

The benchmark also evaluates settings where models attempt to generate their own Skills. The result is a useful warning. Models may recognize that extra procedural knowledge would help, but they do not reliably create the exact procedure needed to solve the task.

Self-generated Skills often become generic: inspect files, run tests, follow best practices. That can be helpful as broad advice, but it is not the same as curated domain knowledge. For reliability, human-authored and tested Skills still matter.

Skills Affect Cost and Model Choice

One of the strongest practical implications is cost-performance. If a smaller model receives the right procedural knowledge, it can close part of the gap with a larger model running without Skills.

This does not mean Skills replace model quality. It means Skills can change deployment economics. For repeated workflows, maintaining a strong Skill may be cheaper and more controllable than always moving to the largest available model.
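
A back-of-the-envelope way to see the economics, using purely illustrative numbers rather than benchmark results:

```
# Purely illustrative numbers, not SkillsBench results.
small_cost_per_run, small_pass_rate = 0.05, 0.70   # small model + curated Skill
large_cost_per_run, large_pass_rate = 0.50, 0.75   # large model, no Skill

# Expected cost per *solved* task = cost per run / pass rate.
small_cost_per_pass = small_cost_per_run / small_pass_rate  # ~$0.07
large_cost_per_pass = large_cost_per_run / large_pass_rate  # ~$0.67

print(f"small + skill:   ${small_cost_per_pass:.2f} per solved task")
print(f"large, no skill: ${large_cost_per_pass:.2f} per solved task")
```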

What This Means for Builders

SkillsBench turns Skill development into an engineering discipline. The takeaway is not "add Skills everywhere." The takeaway is "measure where Skills improve outcomes, then maintain those Skills deliberately."

Good Skills should be compact, procedural, and testable. They should explain when they apply, what steps matter, what files or commands are important, and how the agent can validate progress. They should avoid broad documentation dumps unless the reference material directly improves execution.

For agent teams, this creates a clear workflow:

  1. Identify repeated tasks where agents fail for procedural reasons.
  2. Write a focused Skill for the missing procedure.
  3. Evaluate the task with and without the Skill.
  4. Keep the Skill only if the measured delta justifies it.
  5. Re-test when the model, harness, task, or dependency changes.
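
Step 4 can be made mechanical. The sketch below treats the Skill like a regression-tested dependency, reusing the same hypothetical run_task helper as in the paired-evaluation sketch above; the threshold is an assumption, not a SkillsBench recommendation:

```
def skill_is_worth_keeping(task, agent, skill, run_task, trials=10, min_delta=0.05):
    """Gate a Skill on measured benefit, like a regression test.

    `run_task` is the same hypothetical helper as before: one trial,
    returning True iff the deterministic verifier passes. Re-run this
    gate whenever the model, harness, task, or a dependency changes,
    and retire the Skill if the delta no longer clears the bar.
    """
    base = sum(run_task(task, agent, skill=None) for _ in range(trials)) / trials
    with_skill = sum(run_task(task, agent, skill=skill) for _ in range(trials)) / trials
    return (with_skill - base) >= min_delta
```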

This is the standard MilkeyAI wants SkillsBench to support: not more context for its own sake, but measured procedural knowledge that improves verified work.

What This Means for Researchers

SkillsBench also creates research questions that normal coding benchmarks do not isolate well.

It gives researchers a way to study how agents consume procedural context, when external knowledge helps, when it hurts, and how harness design affects Skill usage. It also makes automatic Skill generation a measurable research problem rather than a vague prompting exercise.

The benchmark is especially useful because it separates the base model from the agent harness and the Skill layer. That separation makes it easier to understand why an agent improved or failed.

Explore SkillsBench

  • View the leaderboard for agent-model comparisons.
  • Read the documentation to understand the benchmark workflow.
  • Review the paper for methodology and detailed results.

SkillsBench by MilkeyAI makes Agent Skills measurable. That is the foundation for building agents that do more than receive instructions: they use the right procedural knowledge at the right time.

Frequently Asked Questions

What is SkillsBench by MilkeyAI?

SkillsBench by MilkeyAI is a benchmark for evaluating whether Agent Skills improve AI coding agents on realistic, verifiable tasks.

What does SkillsBench evaluate?

SkillsBench evaluates agent performance with and without Skills so teams can measure the performance delta introduced by procedural knowledge.

Why not just give agents more documentation?

More context is not automatically better. SkillsBench is designed to show whether focused procedural guidance improves verified task completion, instead of assuming that larger context helps.

Are Skills always beneficial?

No. Skills can improve performance, have no effect, or reduce performance. That is why paired evaluation is necessary.

Who is SkillsBench for?

SkillsBench is for agent builders, model teams, researchers, and practitioners who need reliable evidence about Skill effectiveness.