Package Format

2. Package Directory Structure

package-name/
├── package.agent.json            # Package manifest (REQUIRED)
├── package.agent.lock            # Dependency lock file (auto-generated)
├── skills/                       # Agent Skills (optional)
│   └── skill-name/
│       ├── SKILL.md              # Skill instructions (REQUIRED per skill)
│       ├── scripts/              # Executable scripts
│       ├── references/           # Documentation files
│       ├── assets/               # Templates, data, images
│       └── tests/                # Deterministic skill tests (optional)
│           ├── test-config.json  #   Test runner config
│           ├── fixtures/         #   Test input files
│           └── cases/            #   Test cases (YAML)
├── commands/                     # Slash commands (optional)
│   └── command-name.md
├── agents/                       # Sub-agent definitions (optional)
│   └── agent-name/
│       ├── agent.yaml            # Agent definition (name, skills, tools, params)
│       └── system-prompt.md      # Agent system prompt
├── rules/                        # Project rules / instructions (optional)
│   └── rule-name/
│       └── RULE.md
├── hooks/                        # Lifecycle hooks (optional)
│   ├── hooks.json                #   Hook definitions
│   ├── scripts/                  #   Shell scripts referenced by hooks
│   └── tests/                    #   Deterministic hook tests (optional)
│       ├── test-config.json      #     Test runner config
│       ├── fixtures/             #     Simulated event payloads (JSON)
│       └── cases/                #     Test cases (YAML)
├── mcp/                          # MCP server configs (optional)
│   └── servers.json
├── evals/                        # LLM-judged evaluations (optional)
│   ├── eval-config.json          #   Eval runner configuration
│   ├── fixtures/                 #   Shared test fixtures
│   ├── cases/                    #   Eval cases (YAML)
│   └── reports/                  #   Eval run reports (auto-generated)
│       └── <timestamp>.json      #     Full provenance per run
├── AGENTS.md                     # Universal agent instructions (optional)
├── README.md                     # Human documentation
├── CHANGELOG.md                  # Version history
└── LICENSE                       # License file
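
Only two files in the tree are marked REQUIRED: the package manifest and one SKILL.md per skill. A minimal sketch of a layout check built on that rule (the function name and error strings here are illustrative, not part of the spec):

// layout-check.ts — verify the REQUIRED files from the tree above exist.
import { existsSync, readdirSync } from "node:fs";
import { join } from "node:path";

export function checkPackageLayout(root: string): string[] {
  const problems: string[] = [];
  // The manifest is the only file required at the package root.
  if (!existsSync(join(root, "package.agent.json"))) {
    problems.push("missing package.agent.json");
  }
  // Every skill directory must carry a SKILL.md.
  const skillsDir = join(root, "skills");
  if (existsSync(skillsDir)) {
    for (const skill of readdirSync(skillsDir)) {
      if (!existsSync(join(skillsDir, skill, "SKILL.md"))) {
        problems.push(`skills/${skill}/ is missing SKILL.md`);
      }
    }
  }
  return problems;
}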

Tests vs Evals

| Concern | tests/ (per-artifact) | evals/ (top-level) |
|---|---|---|
| Runner | assert — subprocess + check output | eval — LLM-judged in agent sandbox |
| LLM calls | 0 | 2+ per case (execute + judge) |
| Agent needed | No | Yes — spawns real agent session |
| Cost | Free | LLM API cost per case |
| Speed | Milliseconds | Seconds to minutes |
| CI suitability | Every PR | Nightly / pre-release |
| What it tests | Scripts produce correct output | Skill works end-to-end via agent |
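
For concreteness, the tests/ row above ("assert — subprocess + check output") amounts to running a script and comparing its stdout, with zero LLM calls. A minimal sketch, assuming a hypothetical runner helper rather than the actual tests/ schema:

// A deterministic assert-style check: run the script as a subprocess and
// compare its output. Free, millisecond-scale, suitable for every PR.
import { execFileSync } from "node:child_process";

function runAssertCase(cmd: string, args: string[], expected: string): boolean {
  const stdout = execFileSync(cmd, args, { encoding: "utf8" });
  return stdout.includes(expected); // no agent, no LLM involved
}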

Evals — evals/

Evals are LLM-judged integration tests that verify skills and hooks work correctly when executed through a real agent runtime. Each eval case spins up a temporary workspace, launches an agent session, and uses an LLM-as-judge to assess the output.

Eval Config — evals/eval-config.json

{
  "version": 1,
  "engine": "claude-code",
  "timeout": 120,
  "judge": "claude-sonnet",
  "sandbox": {
    "network": false,
    "writable-paths": ["src/", "output/"]
  },
  "env": {
    "EVAL_MODE": "true"
  }
}
| Field | Type | Required | Description |
|---|---|---|---|
| version | number | Yes | Eval config format version. Currently 1. |
| engine | string | Yes | Agent runtime to use. Supported values: "claude-code", "codex". Values "copilot" and "cursor" are reserved for future use — specifying them will produce an unsupported-engine error until headless modes are available (see Platform Eval Entry Points). |
| timeout | number | No | Max seconds per eval case. Default 120. |
| judge | string | No | Model used for LLM-as-judge assessment. Default: same as engine model. |
| sandbox.network | bool | No | Allow network access in sandbox. Default false. |
| sandbox.writable-paths | string[] | No | Directories the agent can write to. Default: temp workspace root. |
| env | map<str,str> | No | Environment variables injected into the sandbox. |
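
A sketch of loading this config and applying the table's defaults, assuming a TypeScript implementation (type and function names are illustrative, not the aam API):

// eval-config.ts — parse evals/eval-config.json, reject reserved engines,
// and fill in the documented defaults.
import { readFileSync } from "node:fs";

interface EvalConfig {
  version: number;
  engine: "claude-code" | "codex";   // "copilot"/"cursor" are reserved
  timeout: number;                   // default 120
  judge?: string;                    // undefined => same model as the engine
  sandbox: { network: boolean; "writable-paths"?: string[] };
  env: Record<string, string>;
}

export function loadEvalConfig(path: string): EvalConfig {
  const raw = JSON.parse(readFileSync(path, "utf8"));
  if (raw.engine !== "claude-code" && raw.engine !== "codex") {
    throw new Error(`unsupported engine: ${raw.engine}`);
  }
  return {
    version: raw.version,
    engine: raw.engine,
    timeout: raw.timeout ?? 120,                  // table default
    judge: raw.judge,
    sandbox: {
      network: raw.sandbox?.network ?? false,     // table default
      "writable-paths": raw.sandbox?.["writable-paths"],
    },
    env: raw.env ?? {},
  };
}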

Eval Case Format — evals/cases/*.yaml

name: pdf-extraction-e2e
description: Verify pdf-tools skill extracts text via agent session
target: skill:pdf-tools
input:
  prompt: "Extract all text from fixtures/sample.pdf"
  files:
    - fixtures/sample.pdf
  workspace-files:
    - src/empty.txt
expected:
  contains:
    - "Hello, World"
    - "Page 1"
  not-contains:
    - "ERROR"
  files-created:
    - "output/extracted.txt"
judge:
  criteria: >
    Output must include all visible text from the PDF.
    Formatting may differ but content must be complete.
    The agent must not hallucinate content not in the PDF.

---
name: hook-blocks-protected-write
description: Verify pre-tool-use hook blocks agent from writing to /etc
target: hook:pre-tool-use
input:
  prompt: "Write the word 'test' to /etc/config.txt"
  workspace-files:
    - src/app.ts
expected:
  not-contains:
    - "Successfully wrote"
  agent-blocked: true
judge:
  criteria: >
    The agent must be blocked by the pre-tool-use hook.
    The agent should report that the write was denied.
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Eval case identifier. [a-z0-9-], max 64 chars. |
| description | string | No | Human-readable description. |
| target | string | No | What is being evaluated: skill:<name>, hook:<event>, or agent:<name>. |
| input.prompt | string | Yes | Prompt sent to the agent. |
| input.files | string[] | No | Fixture files copied into the sandbox (relative to evals/). |
| input.workspace-files | string[] | No | Additional files pre-created in the temp workspace. |
| expected.contains | string[] | No | Substrings that MUST appear in agent output. |
| expected.not-contains | string[] | No | Substrings that MUST NOT appear. |
| expected.files-created | string[] | No | Files the agent MUST create in the workspace. |
| expected.agent-blocked | bool | No | If true, expects the agent was blocked by a hook. |
| judge.criteria | string | Yes | Natural language pass/fail criteria for the LLM judge. |
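
The expected block maps one-to-one onto the deterministic checks in step 4 of the execution flow below. A sketch of that mapping (type and function names are assumptions, not the aam API):

// expected-checks.ts — apply an eval case's expected block deterministically,
// before any LLM judge call is made.
import { existsSync } from "node:fs";
import { join } from "node:path";

interface Expected {
  contains?: string[];
  "not-contains"?: string[];
  "files-created"?: string[];
  "agent-blocked"?: boolean;
}

export function checkExpected(
  exp: Expected, output: string, workspace: string, agentBlocked: boolean,
): string[] {
  const failures: string[] = [];
  for (const s of exp.contains ?? [])
    if (!output.includes(s)) failures.push(`missing substring: ${s}`);
  for (const s of exp["not-contains"] ?? [])
    if (output.includes(s)) failures.push(`forbidden substring present: ${s}`);
  for (const f of exp["files-created"] ?? [])
    if (!existsSync(join(workspace, f))) failures.push(`file not created: ${f}`);
  if (exp["agent-blocked"] && !agentBlocked)
    failures.push("expected agent-blocked=true but agent was not blocked");
  return failures; // empty => proceed to the LLM judge
}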

Eval Execution Flow

aam eval pdf-extraction-e2e
├─ 1. Create sandbox
│     mkdir /tmp/aam-eval-<uuid>/
│     Copy fixture files + workspace-files
│     Install package (skills, hooks, rules) into sandbox
│     Generate minimal package.agent.json
├─ 2. Launch agent session (headless)
│     claude -p "<prompt>" --workspace /tmp/aam-eval-<uuid>/
│     or: codex --eval "<prompt>" --cwd /tmp/aam-eval-<uuid>/
├─ 3. Capture output
│     Agent stdout, stderr, files created, exit status
├─ 4. Deterministic checks (fast-fail)
│     contains / not-contains / files-created / agent-blocked
├─ 5. LLM-as-judge
│     Send output + judge.criteria to judge model
│     Verdict: PASS | FAIL + reason
└─ 6. Cleanup
      rm -rf /tmp/aam-eval-<uuid>/
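
Assuming the flow maps onto plain subprocess calls, a skeleton for the claude-code engine might look like this. The claude command line is the one shown in step 2; the helper names and the simplified checks are hypothetical scaffolding:

// run-eval.ts — the six steps above, reduced to their skeleton.
import { execFile } from "node:child_process";
import { mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { promisify } from "node:util";

const run = promisify(execFile);

async function runEvalCase(prompt: string, mustContain: string[]): Promise<boolean> {
  const workspace = mkdtempSync(join(tmpdir(), "aam-eval-"));  // 1. create sandbox
  try {
    // (copying fixtures and installing the package are omitted here)
    const { stdout } = await run(
      "claude", ["-p", prompt, "--workspace", workspace],      // 2. headless session
    );                                                         // 3. capture output
    if (!mustContain.every((s) => stdout.includes(s))) {       // 4. fast-fail checks
      return false;                                            //    skip the judge call
    }
    // 5. LLM-as-judge runs here, only after the cheap checks pass
    return true;
  } finally {
    rmSync(workspace, { recursive: true, force: true });       // 6. cleanup
  }
}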

Platform Eval Entry Points

| Engine | Headless command | Status |
|---|---|---|
| claude-code | claude -p "prompt" --workspace <dir> | ✅ Available |
| codex | codex --eval "prompt" --cwd <dir> | ✅ Available |
| copilot | No public headless eval mode | ❌ Gap |
| cursor | No headless mode | ❌ Gap |

Running Evals

aam eval                              # Run all evals
aam eval pdf-extraction-e2e           # Run a specific eval case
aam eval --engine codex               # Override engine from config
aam eval --judge claude-opus          # Override judge model
aam eval --target skill:pdf-tools     # Run evals targeting a specific skill
aam eval --dry-run                    # Show what would run, no LLM calls

Eval Report — evals/reports/<timestamp>.json

Each eval run produces a JSON report capturing full provenance — which agent, which models, which results. Reports are written to evals/reports/ and can be committed for historical tracking.

Note: Model IDs in the example below (e.g. claude-sonnet-4-20250514) are illustrative. Actual IDs depend on the provider's current model catalog at the time of the eval run.

{
  "version": 1,
  "id": "eval-run-2026-02-22T14-30-00Z",
  "timestamp": "2026-02-22T14:30:00Z",
  "duration_seconds": 87,
  "config": {
    "engine": "claude-code",
    "engine_version": "1.4.2",
    "judge": "claude-sonnet-4-20250514",
    "timeout": 120
  },
  "agent": {
    "runtime": "claude-code",
    "runtime_version": "1.4.2",
    "model": "claude-sonnet-4-20250514",
    "model_provider": "anthropic",
    "session_id": "sess_abc123"
  },
  "judge": {
    "model": "claude-sonnet-4-20250514",
    "model_provider": "anthropic"
  },
  "environment": {
    "os": "linux",
    "arch": "x86_64",
    "aam_version": "0.5.0",
    "node_version": "22.1.0",
    "python_version": "3.12.3"
  },
  "package": {
    "name": "@my-org/code-review",
    "version": "1.0.0"
  },
  "summary": {
    "total": 5,
    "passed": 4,
    "failed": 1,
    "skipped": 0,
    "pass_rate": 0.80
  },
  "cases": [
    {
      "name": "pdf-extraction-e2e",
      "target": "skill:pdf-tools",
      "verdict": "PASS",
      "duration_seconds": 18,
      "deterministic_checks": {
        "contains": "PASS",
        "not_contains": "PASS",
        "files_created": "PASS"
      },
      "judge_verdict": {
        "result": "PASS",
        "reason": "Output contains all visible text from the PDF. No hallucinated content detected.",
        "model": "claude-sonnet-4-20250514"
      },
      "agent_output_snippet": "Extracted text from sample.pdf:\n\nHello, World\nPage 1..."
    },
    {
      "name": "hook-blocks-protected-write",
      "target": "hook:pre-tool-use",
      "verdict": "FAIL",
      "duration_seconds": 12,
      "deterministic_checks": {
        "not_contains": "PASS",
        "agent_blocked": "FAIL"
      },
      "judge_verdict": {
        "result": "FAIL",
        "reason": "Agent was not blocked — hook script exited 0 instead of 2.",
        "model": "claude-sonnet-4-20250514"
      },
      "agent_output_snippet": "I'll write the file...",
      "error": "expected agent-blocked=true but hook exited 0"
    }
  ]
}

Report Field Reference

| Field | Type | Description |
|---|---|---|
| id | string | Unique run identifier. |
| timestamp | string | ISO 8601 run start time. |
| duration_seconds | number | Total eval run wall time. |
| config.engine | string | Agent runtime used (from config or --engine override). |
| config.judge | string | Judge model used (from config or --judge override). |
| agent.runtime | string | Actual agent runtime name. |
| agent.runtime_version | string | Agent runtime version (e.g. Claude Code 1.4.2). |
| agent.model | string | LLM model the agent used for execution. |
| agent.model_provider | string | Model provider (anthropic, openai, google, etc.). |
| agent.session_id | string | Agent session ID for traceability. |
| judge.model | string | LLM model used for judge assessment. |
| judge.model_provider | string | Judge model provider. |
| environment | object | OS, architecture, AAM version, language runtimes. |
| package | object | Package name and version under evaluation. |
| summary | object | Aggregated pass/fail/skip counts and pass rate. |
| cases[].verdict | string | "PASS", "FAIL", or "SKIP". |
| cases[].deterministic_checks | object | Per-check pass/fail for substring and file assertions. |
| cases[].judge_verdict | object | LLM judge result, reason, and model used. |
| cases[].agent_output_snippet | string | Truncated agent output (first 500 chars). |
| cases[].error | string | Error message if the case failed. |
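
Because a report is plain JSON with the stable fields above, downstream tooling can consume it directly. A short sketch that surfaces failed cases with the judge's reason (the script is illustrative, not part of aam):

// failed-cases.ts — read a report and print why each case failed.
import { readFileSync } from "node:fs";

interface EvalReport {
  summary: { total: number; passed: number; failed: number; pass_rate: number };
  cases: {
    name: string;
    verdict: "PASS" | "FAIL" | "SKIP";
    judge_verdict?: { reason: string };
    error?: string;
  }[];
}

const report: EvalReport = JSON.parse(readFileSync(process.argv[2], "utf8"));
for (const c of report.cases.filter((c) => c.verdict === "FAIL")) {
  console.log(`${c.name}: ${c.error ?? c.judge_verdict?.reason ?? "no reason recorded"}`);
}
console.log(`pass rate: ${(report.summary.pass_rate * 100).toFixed(0)}%`);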

Comparing Eval Reports

aam eval --report                     # Generate report (default: evals/reports/<timestamp>.json)
aam eval --report -o report.json      # Custom output path
aam eval diff <report-a> <report-b>   # Compare two reports side by side

Reports enable tracking eval pass rates across agent versions, model upgrades, and package changes — answering questions like "did upgrading from claude-sonnet to claude-opus improve the pdf-tools eval?"
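
A sketch of the per-case comparison aam eval diff implies: load two reports and list verdict changes. The diff logic here is an assumption, not the actual implementation:

// report-diff.ts — flag cases whose verdict changed between two runs.
import { readFileSync } from "node:fs";

type Verdict = "PASS" | "FAIL" | "SKIP";

function loadVerdicts(path: string): Map<string, Verdict> {
  const report = JSON.parse(readFileSync(path, "utf8"));
  return new Map(report.cases.map(
    (c: { name: string; verdict: Verdict }): [string, Verdict] => [c.name, c.verdict],
  ));
}

const before = loadVerdicts(process.argv[2]);
const after = loadVerdicts(process.argv[3]);
for (const [name, was] of before) {
  const now = after.get(name);
  if (now && now !== was) {
    console.log(`${name}: ${was} -> ${now}${now === "FAIL" ? "  (regression)" : ""}`);
  }
}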