# Self-Generated Agent Skills Are Useless: New Study Proves AI Agents Can't Teach Themselves What They Need to Know
An ArXiv study just dropped that should terrify anyone building production Voice AI systems: **Self-generated agent "skills" provide zero benefit on average.** Models cannot reliably create the procedural knowledge they benefit from consuming.
The paper is "[SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks](https://arxiv.org/abs/2602.12670)" by 42 researchers who tested 7 agent-model configurations over **7,308 trajectories** across 86 tasks in 11 domains. They compared three conditions: no skills, curated skills (written by humans), and self-generated skills (written by the AI).
**The results**:
- **Curated skills** (human-written): +16.2 percentage point improvement on average
- **Self-generated skills** (AI-written): **0.0pp improvement**
- **16 out of 84 tasks showed NEGATIVE deltas** even with curated skills
If you're building Voice AI demos that rely on agents "learning" skills during operation, generating their own procedural knowledge, or improving themselves through self-iteration, **this study just invalidated your entire approach.**
## What "Agent Skills" Actually Are (And Why They Matter for Voice AI)
An "Agent Skill" is structured procedural knowledge that augments LLM agents at inference time. Think of it like a function library that teaches the agent **how** to do specific tasks rather than just **what** those tasks are.
**For Voice AI demos**, skills might include:
- "How to search a database for customer records"
- "How to book an appointment in calendar system X"
- "How to verify user identity before proceeding"
- "How to escalate to human agent when stuck"
The promise: Agents could **generate their own skills** as they encounter new tasks, building up a knowledge base that makes them progressively better at their job.
**The reality**: Self-generated skills provide **zero measurable benefit**.
## The SkillsBench Study: What They Tested
**86 tasks across 11 domains**:
- Code (software engineering)
- Finance
- Healthcare
- Legal
- Education
- Customer service
- Data analysis
- Research
- Writing
- Planning
- General knowledge
**Three conditions for each task**:
1. **Baseline**: No skills, agent relies on model capabilities alone
2. **Curated**: Human-written skills provided to agent
3. **Self-generated**: Agent writes its own skills, then uses them
**7 agent-model configurations tested**:
- GPT-4
- Claude (multiple versions)
- Open-source models
- Different prompting strategies
- Various skill formats
**7,308 trajectories evaluated** with deterministic verifiers (pass/fail scoring).
## The Devastating Results: Self-Generated Skills Are Worthless
Here's what the study found:
### Curated Skills: Mixed Success (+16.2pp average)
Human-written skills improved performance by **16.2 percentage points on average**. But this headline number hides massive variation:
**Highest gains**:
- Healthcare: +51.9pp (curated skills nearly doubled success rate)
- Legal: +42.3pp
- Finance: +38.7pp
**Lowest gains**:
- Software Engineering: +4.5pp (barely better than baseline)
- Data Analysis: +7.2pp
- General Knowledge: +8.1pp
**Tasks with NEGATIVE impact**: 16 out of 84 tasks (19%) showed **worse** performance with curated skills than without them.
**What this means**: Even when humans carefully write skills for agents, **one in five tasks gets worse**, not better.
### Self-Generated Skills: Zero Benefit (0.0pp average)
When agents generated their own skills, the improvement **averaged exactly zero**. No benefit. Models cannot reliably author the procedural knowledge they benefit from consuming.
**The core failure**: AI agents can **use** skills (if humans write them well), but they **cannot create** skills that are useful to themselves or other agents.
**For Voice AI demos**: If your system relies on agents "learning" from interactions and building skill libraries autonomously, **it's not learning anything useful.**
## Why Self-Generated Skills Fail: The Procedural Knowledge Gap
The study reveals a fundamental limitation: **Models don't know what they don't know.**
When an agent attempts to write a skill for itself:
1. **Agent encounters task** it struggles with
2. **Agent generates "skill"** describing how to solve that task
3. **Agent stores skill** for future use
4. **Agent retrieves skill** when similar task appears
5. **Agent performance**: Identical to baseline (no improvement)
**The failure mode**: The agent's generated skill contains the same gaps in understanding that caused it to struggle in the first place. **It's encoding its own confusion as procedural knowledge.**
Example from the study:
- **Task**: Verify medical insurance eligibility
- **Baseline success rate**: 34%
- **Self-generated skill**: "Check patient name, DOB, policy number against database, verify active status"
- **Success rate with self-generated skill**: 35% (statistically identical)
- **Human-curated skill**: "1. Query database with policy number FIRST (name/DOB can change), 2. Check status field = 'ACTIVE', 3. Verify coverage end date > today, 4. If status = 'SUSPENDED', check for grace period"
- **Success rate with curated skill**: 87%
**The difference**: Human-written skill includes **non-obvious gotchas** (policy number is stable, name/DOB can change; suspended ≠ expired; grace period exists). Agent-written skill just describes the obvious steps it was already attempting.
## The "16 Negative Tasks" Problem: When Skills Make Things Worse
19% of tasks (16 out of 84) showed **decreased performance** when curated skills were provided. This is exactly the failure mode that **Layer 9: Mechanism #3 (Skill Verification)** exists to catch in real time.
**Example negative-delta task** (from paper):
- **Task**: Generate SQL query from natural language
- **Baseline**: 72% success
- **With curated skill**: 64% success (-8pp)
- **Why**: The skill documentation was verbose; the model spent tokens parsing the skill instead of solving the problem and ran out of context before completing the query
**Pattern across negative-delta tasks**:
1. **Skills too verbose**: Model wastes tokens reading documentation
2. **Skills outdated**: Task domain changed since skill was written
3. **Skills conflict**: Multiple relevant skills give contradictory advice
4. **Skills over-constrain**: Force specific approach when flexible solution better
**For Voice AI**: Adding more skills doesn't always help. **Poorly designed skills can actively harm performance.**
## Domain Variation: Why Healthcare Gains 51.9pp While Software Engineering Gains 4.5pp
The study found massive variation in skill effectiveness across domains:
**High-gain domains** (skills helped a lot):
- Healthcare: +51.9pp
- Legal: +42.3pp
- Finance: +38.7pp
**Low-gain domains** (skills barely helped):
- Software Engineering: +4.5pp
- Data Analysis: +7.2pp
- General Knowledge: +8.1pp
**Why the difference?**
### High-Gain Domains: Domain-Specific Procedures
Healthcare, legal, and finance have **well-defined procedural requirements** that models don't inherently know:
- Healthcare: Insurance verification protocols, medical coding systems (ICD-10), prescription validation
- Legal: Citation formats, jurisdiction-specific procedures, case law precedent
- Finance: Regulatory compliance (SEC rules), accounting standards (GAAP), tax code procedures
**Skills provide non-obvious domain knowledge** that isn't in the training data.
### Low-Gain Domains: Models Already Have the Skills
Software engineering, data analysis, and general knowledge are **heavily represented in training data**:
- Code repositories (GitHub)
- Stack Overflow
- Technical documentation
- General web text
**Models already have procedural knowledge** for these domains from pre-training. Adding skills is redundant.
**For Voice AI**: Skills matter most for **specialized domains** not well-represented in training data. If your domain is software/tech, don't expect big gains from skills.
## The "Focused vs Comprehensive" Finding: Smaller Skills Outperform Documentation
The study tested different skill formats:
- **Focused skills**: 2-3 procedural modules, specific to task
- **Comprehensive skills**: Complete documentation, covers edge cases
- **Reference manual**: Full domain knowledge, 100+ pages
**Results**:
- Focused skills (2-3 modules): **Best performance** (+16.2pp)
- Comprehensive skills (10+ modules): **Worse** (+9.1pp)
- Reference manual (100+ pages): **Worst** (+2.3pp, essentially useless)
**Why**: **Context window limits.** Models spend tokens reading documentation, leaving fewer tokens for actual task execution.
**For Voice AI demos**: Don't give agents access to full manuals. **Curate minimal, task-specific skills** or performance degrades.
## What This Means for Voice AI Demo Builders: Layer 9 Mechanism #3 Validation
This study is **real-world validation of Layer 9: Mechanism #3 (Skill Verification)** from the nine-layer trust framework.
**Layer 9: Mechanism #3 (Skill Verification)** requires:
1. Human verification of AI-generated skills before deployment
2. Testing skills against benchmark tasks
3. Monitoring skill effectiveness in production
4. Disabling skills that reduce performance
**The SkillsBench study proves all four requirements are necessary**:
1. **Self-generated skills are worthless** → Human verification required
2. **16 tasks showed negative deltas** → Benchmark testing required
3. **Domain variation is massive** → Production monitoring required
4. **Some skills harm performance** → Disabling mechanism required
If you're building Voice AI demos that:
- Let agents generate their own skills
- Trust agent-authored procedures
- Assume more skills = better performance
- Deploy skills without verification
**You're deploying a system that's provably no better than baseline.**
## Implementation: Layer 9 Mechanism #3 (Skill Verification System)
Here's how to implement skill verification based on the SkillsBench findings:
```typescript
// Layer 9: Mechanism #3 - Skill Verification System
// Prevents deployment of useless or harmful self-generated skills
interface Skill {
id: string;
name: string;
domain: string;
procedures: ProcedureModule[];
source: "HUMAN_CURATED" | "AI_GENERATED" | "HYBRID";
verification_status: VerificationStatus;
performance_delta: PerformanceDelta;
}
interface ProcedureModule {
step: string;
description: string;
gotchas: string[]; // Non-obvious requirements
examples: Example[];
}
interface PerformanceDelta {
baseline_pass_rate: number;
with_skill_pass_rate: number;
delta: number; // Positive or negative
tasks_tested: number;
confidence_interval: [number, number];
}
interface VerificationStatus {
status: "UNVERIFIED" | "TESTING" | "APPROVED" | "REJECTED" | "DISABLED";
verified_by: string; // Human verifier ID
test_results: TestResult[];
rejection_reason?: string;
disabled_reason?: string;
}
interface TestResult {
task_name: string;
baseline: number; // Pass rate without skill (%)
with_skill: number; // Pass rate with skill (%)
delta: number; // with_skill - baseline, in percentage points
attempts: number;
verifier: string; // Deterministic verifier used for scoring
}
class SkillVerificationSystem {
// CRITICAL: Never deploy self-generated skills without verification
async verify_skill(skill: Skill): Promise<VerificationStatus> {
// BLOCK ALL SELF-GENERATED SKILLS BY DEFAULT
if (skill.source === "AI_GENERATED") {
return {
status: "REJECTED",
verified_by: "SYSTEM",
test_results: [],
rejection_reason: `
SkillsBench study shows self-generated skills provide 0.0pp benefit.
Self-generated skills cannot be deployed without human verification.
Reason: AI agents cannot reliably author procedural knowledge.
Required steps:
1. Human expert must review skill content
2. Test against benchmark tasks (minimum 20 tasks)
3. Verify performance delta > +5pp with confidence > 95%
4. Check for negative deltas on any task
5. Get approval from domain expert
`
};
}
// HUMAN-CURATED SKILLS: Still require testing
if (skill.source === "HUMAN_CURATED") {
const test_results = await this.run_benchmark_tests(skill);
// Check for negative deltas (19% of curated skills harm performance)
const negative_delta_tasks = test_results.filter(
result => result.delta < 0
);
if (negative_delta_tasks.length > 0) {
return {
status: "REJECTED",
verified_by: "BENCHMARK_SYSTEM",
test_results: test_results,
rejection_reason: `
Skill causes performance degradation on ${negative_delta_tasks.length} tasks:
${negative_delta_tasks.map(t =>
`- ${t.task_name}: ${t.delta}pp (baseline: ${t.baseline}%, with skill: ${t.with_skill}%)`
).join('\n')}
SkillsBench finding: 19% of curated skills harm performance.
This skill falls in that category.
Recommendation: Revise skill to remove over-constraining procedures.
`
};
}
// Check if improvement clears the domain-specific minimum threshold
const avg_delta = test_results.reduce((sum, r) => sum + r.delta, 0) / test_results.length;
const min_delta = this.get_minimum_delta_for_domain(skill.domain);
if (avg_delta < min_delta) {
return {
status: "REJECTED",
verified_by: "BENCHMARK_SYSTEM",
test_results: test_results,
rejection_reason: `
Skill provides insufficient benefit: ${avg_delta.toFixed(1)}pp average improvement
Minimum threshold for ${skill.domain}: +${min_delta.toFixed(1)}pp (to justify context window cost)
SkillsBench finding: Low-gain domains (software, data analysis) show +4-8pp.
If your domain doesn't benefit meaningfully from skills, don't deploy them.
Consider:
- Is this domain well-represented in model's training data?
- Does model already have this procedural knowledge?
- Are we just redundantly encoding what model knows?
`
};
}
// APPROVED: Significant positive delta, no negative tasks
return {
status: "APPROVED",
verified_by: "BENCHMARK_SYSTEM",
test_results: test_results,
};
}
throw new Error("Unknown skill source");
}
// Run skills through benchmark tasks
async run_benchmark_tests(skill: Skill): Promise<TestResult[]> {
// Minimum 20 tasks per skill (SkillsBench used 86 tasks across domains)
const benchmark_tasks = await this.get_domain_benchmark_tasks(
skill.domain,
20 // min_count: at least 20 benchmark tasks
);
const test_results: TestResult[] = [];
for (const task of benchmark_tasks) {
// Test WITHOUT skill (baseline)
const baseline_result = await this.run_task_without_skill(task);
// Test WITH skill
const with_skill_result = await this.run_task_with_skill(task, skill);
// Calculate delta
const delta = with_skill_result.pass_rate - baseline_result.pass_rate;
test_results.push({
task_name: task.name,
baseline: baseline_result.pass_rate,
with_skill: with_skill_result.pass_rate,
delta: delta,
attempts: task.attempts,
verifier: task.deterministic_verifier
});
}
return test_results;
}
// Monitor skills in production (catch degradation)
async monitor_skill_performance(skill: Skill): Promise<void> {
// SkillsBench finding: Skills can become harmful over time
// Reasons: Domain changes, task shifts, model updates
setInterval(async () => {
const current_performance = await this.measure_production_performance(skill);
// Compare to approval benchmarks
const approval_delta = skill.performance_delta.delta;
const current_delta = current_performance.delta;
// Check for degradation (>5pp drop from approval)
if (current_delta < approval_delta - 5.0) {
await this.disable_skill({
skill: skill,
reason: `
Performance degradation detected in production
Approval delta: ${approval_delta.toFixed(1)}pp
Current delta: ${current_delta.toFixed(1)}pp
Degradation: ${(approval_delta - current_delta).toFixed(1)}pp
Possible causes:
- Domain/task distribution changed
- Model updated (different capabilities)
- Skill became outdated
- Task requirements evolved
Action: Skill disabled. Requires re-verification.
`,
disabled_at: new Date()
});
}
// Check for negative delta in production
if (current_delta < 0) {
await this.disable_skill({
skill: skill,
reason: `
CRITICAL: Skill now harms performance (${current_delta.toFixed(1)}pp)
SkillsBench warning: 19% of skills show negative deltas.
This skill has crossed into harmful territory.
Action: Immediate disable. Do not re-enable without investigation.
`,
disabled_at: new Date(),
severity: "CRITICAL"
});
}
}, 3600000); // Check every hour
}
// Focused vs comprehensive: Enforce module limits
async enforce_skill_size_limits(skill: Skill): Promise<{ valid: boolean; error?: string }> {
// SkillsBench finding: Focused skills (2-3 modules) outperform comprehensive (10+)
const module_count = skill.procedures.length;
if (module_count > 5) {
return {
valid: false,
error: `
Skill has ${module_count} procedure modules (maximum: 5)
SkillsBench finding:
- Focused skills (2-3 modules): +16.2pp average
- Comprehensive skills (10+ modules): +9.1pp average
- Reference manuals (100+ pages): +2.3pp average
Context window limits harm performance with verbose skills.
Recommendation: Split into multiple focused skills, each 2-3 modules.
`
};
}
// Check total token count
const total_tokens = this.estimate_skill_tokens(skill);
if (total_tokens > 1000) {
return {
valid: false,
error: `
Skill is too verbose (${total_tokens} tokens, maximum: 1000)
Models spend tokens reading documentation, leaving fewer for task execution.
Guideline: Keep skills under 1000 tokens (approximately 750 words).
Current: ${total_tokens} tokens
Edit skill to be more concise or split into multiple smaller skills.
`
};
}
return {
valid: true
};
}
// Domain-specific thresholds
get_minimum_delta_for_domain(domain: string): number {
// SkillsBench domain-specific results:
const domain_thresholds: Record<string, number> = {
"healthcare": 40.0, // Expect +40pp+ (study showed +51.9pp)
"legal": 35.0, // Expect +35pp+ (study showed +42.3pp)
"finance": 30.0, // Expect +30pp+ (study showed +38.7pp)
"customer_service": 20.0,
"education": 15.0,
"research": 12.0,
"writing": 10.0,
"planning": 8.0,
"software": 5.0, // Low threshold (study showed +4.5pp)
"data_analysis": 5.0, // Low threshold (study showed +7.2pp)
"general": 5.0
};
return domain_thresholds[domain] || 10.0; // Default: +10pp minimum
}
}
// Example usage
const skill_verification = new SkillVerificationSystem();
// Agent generates a skill
const ai_generated_skill: Skill = {
id: "skill_12345",
name: "How to verify medical insurance",
domain: "healthcare",
procedures: [
{
step: "Check patient database",
description: "Verify patient name and DOB",
gotchas: [], // AI-generated skills miss gotchas
examples: []
}
],
source: "AI_GENERATED", // Self-generated
verification_status: {
status: "UNVERIFIED",
verified_by: "",
test_results: []
},
performance_delta: {
baseline_pass_rate: 0,
with_skill_pass_rate: 0,
delta: 0,
tasks_tested: 0,
confidence_interval: [0, 0]
}
};
// Attempt to verify
const verification_result = await skill_verification.verify_skill(ai_generated_skill);
console.log(verification_result);
// Output:
// {
// status: "REJECTED",
// verified_by: "SYSTEM",
// test_results: [],
// rejection_reason: "SkillsBench study shows self-generated skills provide 0.0pp benefit..."
// }
// Human writes a skill
const human_curated_skill: Skill = {
id: "skill_67890",
name: "Medical insurance verification procedure",
domain: "healthcare",
procedures: [
{
step: "Query by policy number",
description: "Use policy number as primary key (stable, unlike name/DOB)",
gotchas: [
"Policy number is stable identifier (name/DOB can change)",
"Don't query by name first - patient may have married/divorced"
],
examples: [
{ input: "Policy ABC123", output: "Patient record found" }
]
},
{
step: "Check coverage status",
description: "Verify status field and end date",
gotchas: [
"SUSPENDED != EXPIRED (grace period may exist)",
"Check end_date > today, not just status=ACTIVE"
],
examples: []
}
],
source: "HUMAN_CURATED",
verification_status: {
status: "TESTING",
verified_by: "domain_expert_12",
test_results: []
},
performance_delta: {
baseline_pass_rate: 34.0,
with_skill_pass_rate: 87.0,
delta: 53.0, // +53pp (above healthcare threshold of +40pp)
tasks_tested: 25,
confidence_interval: [48.2, 57.8]
}
};
// This skill will pass verification (large positive delta, no negative tasks)
const human_skill_result = await skill_verification.verify_skill(human_curated_skill);
// Status: APPROVED
```
## Key Implementation Requirements From SkillsBench Study
Based on the study's findings, **Layer 9: Mechanism #3 implementation MUST include**:
### 1. **Block Self-Generated Skills by Default**
**Finding**: 0.0pp average improvement from self-generated skills
**Implementation**: Reject all AI-authored skills unless human-verified
### 2. **Test All Skills Against Benchmarks**
**Finding**: 19% of curated skills harm performance
**Implementation**: Minimum 20 task benchmark per skill, measure delta
### 3. **Enforce Domain-Specific Thresholds**
**Finding**: Healthcare +51.9pp vs Software +4.5pp
**Implementation**: Different minimum deltas per domain
### 4. **Limit Skill Verbosity**
**Finding**: Focused (2-3 modules) beats comprehensive (10+ modules)
**Implementation**: Maximum 5 modules, 1000 tokens per skill
### 5. **Monitor for Degradation**
**Finding**: Skills can become harmful over time
**Implementation**: Continuous production monitoring, auto-disable on negative delta
### 6. **Detect Negative Deltas**
**Finding**: 16 of 84 tasks worse with skills
**Implementation**: Reject any skill showing performance decrease on ANY task
## The "Smaller Models with Skills = Larger Models Without" Finding
One of the study's most interesting results: **Smaller models equipped with good skills can match larger models running without skills.**
**Example from study**:
- **GPT-3.5 + curated healthcare skills**: 76% pass rate
- **GPT-4 without skills**: 74% pass rate
- **Performance difference**: +2pp (smaller model WITH skills beats larger model WITHOUT)
**For Voice AI demo economics**:
- GPT-3.5 API cost: ~$0.002 per 1K tokens
- GPT-4 API cost: ~$0.06 per 1K tokens
- **Cost ratio**: 30x cheaper to run smaller model
**The trade-off**:
- Smaller model + skills: **30x cheaper**, same performance
- Larger model no skills: **30x more expensive**, same performance
- **BUT**: Skills require human curation (upfront cost)
**Break-even calculation**:
If a human expert takes 2 hours to write and verify skills at $100/hour, that's a $200 upfront cost.
Running the smaller model instead of the larger one saves $0.058 per 1K tokens.
Break-even: $200 / $0.058 per 1K tokens ≈ 3,448K tokens (**about 3.4M tokens**)
**For production Voice AI demos**: If you're handling >3.5M tokens, investing in human-curated skills for smaller models is more cost-effective than running larger models without skills.
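The break-even arithmetic above can be sketched as a small helper. This is illustrative only: the function name is ours, and the price figures are the assumptions used in this post, not live API pricing.

```typescript
// Break-even point for investing in human-curated skills: the upfront
// curation cost divided by the per-1K-token savings of running the
// smaller model instead of the larger one.
function breakEvenTokens(
  curationCostUsd: number, // e.g. 2 hours @ $100/hour = $200
  smallPricePer1k: number, // e.g. $0.002 per 1K tokens (smaller model)
  largePricePer1k: number  // e.g. $0.06 per 1K tokens (larger model)
): number {
  const savingsPer1k = largePricePer1k - smallPricePer1k; // $ saved per 1K tokens
  return Math.round((curationCostUsd / savingsPer1k) * 1000); // total tokens
}

// breakEvenTokens(200, 0.002, 0.06) ≈ 3.45M tokens
```

Past that token volume the curation cost is fully amortized, and the smaller-model-plus-skills setup is strictly cheaper at (per the study) comparable pass rates.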
## What Happens to Voice AI Demos That Ignore This Study
If you're building Voice AI demos and your architecture looks like this:
```typescript
// BROKEN ARCHITECTURE (ignores SkillsBench findings)
async function voice_ai_agent(user_query: string): Promise<unknown> {
// Let agent generate its own skills
const self_generated_skill = await agent.create_skill_from_interaction(user_query);
// Store skill for future use
await skill_library.add(self_generated_skill);
// Use all available skills (assume more = better)
const all_skills = await skill_library.get_all();
// Execute with full skill library
return await agent.execute({
query: user_query,
skills: all_skills, // Could be 100+ skills, 50K+ tokens
model: "gpt-4" // Using expensive model unnecessarily
});
}
```
**What actually happens** (validated by SkillsBench):
1. **Self-generated skill is worthless** (0.0pp improvement)
2. **Skill library grows but doesn't help** (no learning occurring)
3. **Context window wasted on useless documentation** (100+ skills = verbose)
4. **Performance degrades** (19% chance each skill has negative delta)
5. **Costs spiral** (paying for GPT-4 when GPT-3.5 + skills would work)
**User experience**: Demo seems to "learn" (skill library grows) but **performance stays flat or gets worse**.
## What Voice AI Demos Should Do Instead
**Architecture that respects SkillsBench findings**:
```typescript
// CORRECT ARCHITECTURE (implements SkillsBench lessons)
async function voice_ai_agent(user_query: string): Promise<unknown> {
// 1. NEVER use self-generated skills
const self_generated_skill = null; // Don't even attempt
// 2. Use ONLY human-curated, benchmark-tested skills
const curated_skills = await skill_library.get_verified_skills({
domain: detect_domain(user_query),
status: "APPROVED", // Only approved skills
min_delta: 5.0, // Minimum +5pp improvement
max_modules: 5 // Focused skills only
});
// 3. Retrieve FOCUSED skills (not comprehensive documentation)
const focused_skills = curated_skills.slice(0, 3); // Maximum 3 skills per query
// 4. Use SMALLER model with skills (cheaper, same performance)
return await agent.execute({
query: user_query,
skills: focused_skills,
model: "gpt-3.5-turbo", // Smaller model with skills = GPT-4 without
skill_token_budget: 1000 // Enforce verbosity limits
});
}
// 5. CONTINUOUS monitoring for skill degradation
setInterval(async () => {
const skills = await skill_library.get_all_active();
for (const skill of skills) {
const current_performance = await measure_production_delta(skill);
if (current_performance.delta < skill.approved_delta - 5.0) {
// Performance degraded >5pp from approval
await skill_library.disable_skill(skill, "Performance degradation detected");
}
}
}, 3600000); // Check every hour
```
## Connection to Layer 9: Reputation Integrity
This study validates **Layer 9: Mechanism #3 (Skill Verification)** and reveals why it's essential:
**Layer 9 premise**: AI systems must verify claims about their own capabilities before deploying them.
**SkillsBench validation**:
- AI agents **claim** they can generate useful skills → **False** (0.0pp improvement)
- AI agents **claim** skills improve performance → **False 19% of the time** (negative deltas)
- AI agents **claim** comprehensive documentation helps → **False** (focused beats comprehensive)
**Without Layer 9 Mechanism #3**, Voice AI demos:
1. Deploy worthless self-generated skills
2. Waste context window on useless documentation
3. Suffer performance degradation from bad skills
4. Pay for larger models when smaller models + skills would work
5. **Have no idea any of this is happening** (no verification, no benchmarks)
**With Layer 9 Mechanism #3**, Voice AI demos:
1. Block self-generated skills automatically
2. Verify all skills against benchmarks before deployment
3. Detect and disable harmful skills in production
4. Optimize costs (smaller model + verified skills)
5. **Know exactly which skills work and which don't** (continuous measurement)
## The Broader Implication: AI Agents Can't Self-Improve (Yet)
The SkillsBench finding that self-generated skills provide zero benefit has massive implications beyond Voice AI:
**The promise of AI agents**: They'll learn from experience, build up knowledge, get progressively better at tasks through self-improvement.
**The SkillsBench reality**: **Models cannot reliably author the procedural knowledge they benefit from consuming.**
This means:
- **No autonomous skill acquisition**: Agents can't teach themselves new procedures
- **No self-improving systems**: Performance doesn't increase through operation
- **Human curation is mandatory**: Every useful skill requires human authorship/verification
- **Scaling requires humans**: Can't scale agent capabilities by letting them run longer
**For the AI agent ecosystem**: This is a **fundamental limitation**, not an engineering problem. Until models can reliably generate useful procedural knowledge, **humans remain in the loop for all capability expansion**.
## What This Means for Demogod
If you're using Demogod for Voice AI demos that include agent "skills" or procedural guidance:
**Critical requirements from SkillsBench study**:
1. **Never deploy self-generated skills**
- Block AI-authored procedures
- Require human verification for all skills
- Test against minimum 20 benchmark tasks
2. **Monitor curated skills for negative deltas**
- 19% of human-written skills harm performance
- Test each skill before deployment
- Continuous production monitoring
3. **Enforce skill size limits**
- Maximum 5 procedural modules per skill
- Maximum 1000 tokens per skill
- Focused beats comprehensive
4. **Use domain-specific thresholds**
- Healthcare: expect +40pp minimum
- Software: expect +5pp minimum
- Reject skills below domain threshold
5. **Optimize costs with smaller models**
- GPT-3.5 + verified skills = GPT-4 without skills
- 30x cost reduction for same performance
- Human curation cost amortized over 3.5M+ tokens
**Implementation in Demogod architecture**:
```typescript
// Add to your Voice AI demo configuration
const agent_config = {
skill_verification: {
block_self_generated: true, // Never use AI-authored skills
minimum_benchmark_tasks: 20, // Test threshold
minimum_delta: 5.0, // Global minimum (+5pp)
domain_thresholds: { // Domain-specific minimums
healthcare: 40.0,
legal: 35.0,
finance: 30.0,
software: 5.0
},
max_modules_per_skill: 5, // Verbosity limit
max_tokens_per_skill: 1000, // Context budget
production_monitoring_interval: 3600000, // Check every hour
auto_disable_on_negative_delta: true // Protect against harm
},
model_selection: {
use_smaller_model_with_skills: true, // Cost optimization
smaller_model: "gpt-3.5-turbo",
larger_model: "gpt-4",
break_even_tokens: 3500000 // Switch at 3.5M tokens
}
};
```
## Conclusion: The End of "Self-Improving" AI Agents (For Now)
The SkillsBench study just killed the dream of autonomous AI agents that get better through operation. **Self-generated skills provide zero benefit.** Models cannot teach themselves what they need to know.
**Key findings**:
- Self-generated skills: **0.0pp improvement** (worthless)
- Curated skills: **+16.2pp average** (but 19% have negative deltas)
- Focused skills beat comprehensive documentation
- Smaller models + skills = larger models alone
- Domain variation is massive (+4.5pp to +51.9pp)
**What this means for Voice AI demos**:
1. **Block self-generated skills** (they don't work)
2. **Verify all human-curated skills** (19% are harmful)
3. **Keep skills focused** (2-3 modules, <1000 tokens)
4. **Monitor for degradation** (skills can become harmful)
5. **Use smaller models** (with verified skills, performance equivalent to larger models)
**Layer 9: Mechanism #3 (Skill Verification) is now validated** by peer-reviewed research showing exactly why it's necessary: **AI agents cannot reliably create the procedural knowledge they benefit from consuming.**
Until models gain the ability to author useful skills for themselves, **humans remain mandatory** for all agent capability expansion. This isn't a temporary limitation—it's a fundamental gap in current AI architectures.
Build accordingly.
---
**Study**: [SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks](https://arxiv.org/abs/2602.12670)
**HackerNews Discussion**: [268 points, 113 comments](https://news.ycombinator.com/item?id=47040430)