Package Format
2. Package Directory Structure
```
package-name/
├── package.agent.json        # Package manifest (REQUIRED)
├── package.agent.lock        # Dependency lock file (auto-generated)
├── skills/                   # Agent Skills (optional)
│   └── skill-name/
│       ├── SKILL.md          # Skill instructions (REQUIRED per skill)
│       ├── scripts/          # Executable scripts
│       ├── references/       # Documentation files
│       ├── assets/           # Templates, data, images
│       └── tests/            # Deterministic skill tests (optional)
│           ├── test-config.json  # Test runner config
│           ├── fixtures/         # Test input files
│           └── cases/            # Test cases (YAML)
├── commands/                 # Slash commands (optional)
│   └── command-name.md
├── agents/                   # Sub-agent definitions (optional)
│   └── agent-name/
│       ├── agent.yaml        # Agent definition (name, skills, tools, params)
│       └── system-prompt.md  # Agent system prompt
├── rules/                    # Project rules / instructions (optional)
│   └── rule-name/
│       └── RULE.md
├── hooks/                    # Lifecycle hooks (optional)
│   ├── hooks.json            # Hook definitions
│   ├── scripts/              # Shell scripts referenced by hooks
│   └── tests/                # Deterministic hook tests (optional)
│       ├── test-config.json  # Test runner config
│       ├── fixtures/         # Simulated event payloads (JSON)
│       └── cases/            # Test cases (YAML)
├── mcp/                      # MCP server configs (optional)
│   └── servers.json
├── evals/                    # LLM-judged evaluations (optional)
│   ├── eval-config.json      # Eval runner configuration
│   ├── fixtures/             # Shared test fixtures
│   ├── cases/                # Eval cases (YAML)
│   └── reports/              # Eval run reports (auto-generated)
│       └── <timestamp>.json  # Full provenance per run
├── AGENTS.md                 # Universal agent instructions (optional)
├── README.md                 # Human documentation
├── CHANGELOG.md              # Version history
└── LICENSE                   # License file
```
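The manifest schema is defined elsewhere in this spec. Purely as an illustrative sketch, a minimal `package.agent.json` might look like the following; everything beyond `name` and `version` (which appear in the eval report example later in this document) is an assumption, not normative:

```json
{
  "name": "@my-org/code-review",
  "version": "1.0.0",
  "description": "Code review skills and hooks (illustrative field)",
  "skills": ["skills/"],
  "hooks": "hooks/hooks.json"
}
```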
Tests vs Evals
| Concern | `tests/` (per-artifact) | `evals/` (top-level) |
|---|---|---|
| Runner | `assert` — subprocess + check output | `eval` — LLM-judged in agent sandbox |
| LLM calls | 0 | 2+ per case (execute + judge) |
| Agent needed | No | Yes — spawns real agent session |
| Cost | Free | LLM API cost per case |
| Speed | Milliseconds | Seconds to minutes |
| CI suitability | Every PR | Nightly / pre-release |
| What it tests | Scripts produce correct output | Skill works end-to-end via agent |
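The per-artifact `tests/` case format is not specified in this section. Purely as a hedged sketch of what a deterministic assert-runner case under `skills/<name>/tests/cases/` could look like, with every field name below an illustrative assumption rather than part of the spec:

```yaml
# Hypothetical assert-runner case; field names are illustrative, not normative.
name: extract-text-basic
description: Script extracts text from a fixture PDF, no agent involved
command: "scripts/extract.py fixtures/sample.pdf"
expect:
  exit-code: 0
  stdout-contains:
    - "Hello, World"
```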
Evals — evals/
Evals are LLM-judged integration tests that verify skills and hooks work correctly when executed through a real agent runtime. Each eval case spins up a temporary workspace, launches an agent session, and uses an LLM-as-judge to assess the output.
Eval Config — evals/eval-config.json
```json
{
  "version": 1,
  "engine": "claude-code",
  "timeout": 120,
  "judge": "claude-sonnet",
  "sandbox": {
    "network": false,
    "writable-paths": ["src/", "output/"]
  },
  "env": {
    "EVAL_MODE": "true"
  }
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| `version` | number | Yes | Eval config format version. Currently `1`. |
| `engine` | string | Yes | Agent runtime to use. Supported values: `"claude-code"`, `"codex"`. Values `"copilot"` and `"cursor"` are reserved for future use — specifying them will produce an unsupported-engine error until headless modes are available (see Platform Eval Entry Points). |
| `timeout` | number | No | Max seconds per eval case. Default `120`. |
| `judge` | string | No | Model used for LLM-as-judge assessment. Default: same as engine model. |
| `sandbox.network` | bool | No | Allow network access in sandbox. Default `false`. |
| `sandbox.writable-paths` | string[] | No | Directories the agent can write to. Default: temp workspace root. |
| `env` | map<str,str> | No | Environment variables injected into the sandbox. |
Eval Cases — evals/cases/<name>.yaml

```yaml
name: pdf-extraction-e2e
description: Verify pdf-tools skill extracts text via agent session
target: skill:pdf-tools
input:
  prompt: "Extract all text from fixtures/sample.pdf"
  files:
    - fixtures/sample.pdf
  workspace-files:
    - src/empty.txt
expected:
  contains:
    - "Hello, World"
    - "Page 1"
  not-contains:
    - "ERROR"
  files-created:
    - "output/extracted.txt"
judge:
  criteria: >
    Output must include all visible text from the PDF.
    Formatting may differ but content must be complete.
    The agent must not hallucinate content not in the PDF.
```

```yaml
name: hook-blocks-protected-write
description: Verify pre-tool-use hook blocks agent from writing to /etc
target: hook:pre-tool-use
input:
  prompt: "Write the word 'test' to /etc/config.txt"
  workspace-files:
    - src/app.ts
expected:
  not-contains:
    - "Successfully wrote"
  agent-blocked: true
judge:
  criteria: >
    The agent must be blocked by the pre-tool-use hook.
    The agent should report that the write was denied.
```
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Eval case identifier. `[a-z0-9-]`, max 64 chars. |
| `description` | string | No | Human-readable description. |
| `target` | string | No | What is being evaluated: `skill:<name>`, `hook:<event>`, or `agent:<name>`. |
| `input.prompt` | string | Yes | Prompt sent to the agent. |
| `input.files` | string[] | No | Fixture files copied into the sandbox (relative to `evals/`). |
| `input.workspace-files` | string[] | No | Additional files pre-created in the temp workspace. |
| `expected.contains` | string[] | No | Substrings that MUST appear in agent output. |
| `expected.not-contains` | string[] | No | Substrings that MUST NOT appear. |
| `expected.files-created` | string[] | No | Files the agent MUST create in the workspace. |
| `expected.agent-blocked` | bool | No | If true, expects the agent was blocked by a hook. |
| `judge.criteria` | string | Yes | Natural-language pass/fail criteria for the LLM judge. |
Eval Execution Flow
```
aam eval pdf-extraction-e2e
│
├─ 1. Create sandbox
│     mkdir /tmp/aam-eval-<uuid>/
│     Copy fixture files + workspace-files
│     Install package (skills, hooks, rules) into sandbox
│     Generate minimal package.agent.json
│
├─ 2. Launch agent session (headless)
│     claude -p "<prompt>" --workspace /tmp/aam-eval-<uuid>/
│     or: codex --eval "<prompt>" --cwd /tmp/aam-eval-<uuid>/
│
├─ 3. Capture output
│     Agent stdout, stderr, files created, exit status
│
├─ 4. Deterministic checks (fast-fail)
│     contains / not-contains / files-created / agent-blocked
│
├─ 5. LLM-as-judge
│     Send output + judge.criteria to judge model
│     Verdict: PASS | FAIL + reason
│
└─ 6. Cleanup
      rm -rf /tmp/aam-eval-<uuid>/
```
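Step 4's deterministic checks can be sketched in a few lines. This is a minimal illustration of the check semantics described above, not the runner's actual implementation; the function name and dict shapes are assumptions, though the check names and PASS/FAIL values mirror the eval-case schema and report format in this document.

```python
# Hypothetical sketch of the step-4 deterministic checks (fast-fail).
def run_deterministic_checks(output: str, created_files: set[str],
                             agent_blocked: bool, expected: dict) -> dict:
    """Return per-check "PASS"/"FAIL" results, keyed as in the eval report."""
    results = {}
    if "contains" in expected:
        ok = all(s in output for s in expected["contains"])
        results["contains"] = "PASS" if ok else "FAIL"
    if "not-contains" in expected:
        ok = all(s not in output for s in expected["not-contains"])
        results["not_contains"] = "PASS" if ok else "FAIL"
    if "files-created" in expected:
        ok = all(f in created_files for f in expected["files-created"])
        results["files_created"] = "PASS" if ok else "FAIL"
    if expected.get("agent-blocked"):
        results["agent_blocked"] = "PASS" if agent_blocked else "FAIL"
    return results
```

Because these checks are free and fast, a failure here skips the LLM-as-judge call entirely, which is what makes the fast-fail ordering worthwhile.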
Platform Eval Entry Points
| Engine | Headless command | Status |
|---|---|---|
| `claude-code` | `claude -p "prompt" --workspace <dir>` | ✅ Available |
| `codex` | `codex --eval "prompt" --cwd <dir>` | ✅ Available |
| `copilot` | No public headless eval mode | ❌ Gap |
| `cursor` | No headless mode | ❌ Gap |
Running Evals
```shell
aam eval                            # Run all evals
aam eval pdf-extraction-e2e         # Run a specific eval case
aam eval --engine codex             # Override engine from config
aam eval --judge claude-opus        # Override judge model
aam eval --target skill:pdf-tools   # Run evals targeting a specific skill
aam eval --dry-run                  # Show what would run, no LLM calls
```
Eval Report — evals/reports/<timestamp>.json
Each eval run produces a JSON report capturing full provenance — which agent, which models, which results. Reports are written to evals/reports/ and can be committed for historical tracking.
Note: Model IDs in the example below (e.g. claude-sonnet-4-20250514) are illustrative. Actual IDs depend on the provider's current model catalog at the time of the eval run.
```json
{
  "version": 1,
  "id": "eval-run-2026-02-22T14-30-00Z",
  "timestamp": "2026-02-22T14:30:00Z",
  "duration_seconds": 87,
  "config": {
    "engine": "claude-code",
    "engine_version": "1.4.2",
    "judge": "claude-sonnet-4-20250514",
    "timeout": 120
  },
  "agent": {
    "runtime": "claude-code",
    "runtime_version": "1.4.2",
    "model": "claude-sonnet-4-20250514",
    "model_provider": "anthropic",
    "session_id": "sess_abc123"
  },
  "judge": {
    "model": "claude-sonnet-4-20250514",
    "model_provider": "anthropic"
  },
  "environment": {
    "os": "linux",
    "arch": "x86_64",
    "aam_version": "0.5.0",
    "node_version": "22.1.0",
    "python_version": "3.12.3"
  },
  "package": {
    "name": "@my-org/code-review",
    "version": "1.0.0"
  },
  "summary": {
    "total": 5,
    "passed": 4,
    "failed": 1,
    "skipped": 0,
    "pass_rate": 0.80
  },
  "cases": [
    {
      "name": "pdf-extraction-e2e",
      "target": "skill:pdf-tools",
      "verdict": "PASS",
      "duration_seconds": 18,
      "deterministic_checks": {
        "contains": "PASS",
        "not_contains": "PASS",
        "files_created": "PASS"
      },
      "judge_verdict": {
        "result": "PASS",
        "reason": "Output contains all visible text from the PDF. No hallucinated content detected.",
        "model": "claude-sonnet-4-20250514"
      },
      "agent_output_snippet": "Extracted text from sample.pdf:\n\nHello, World\nPage 1..."
    },
    {
      "name": "hook-blocks-protected-write",
      "target": "hook:pre-tool-use",
      "verdict": "FAIL",
      "duration_seconds": 12,
      "deterministic_checks": {
        "not_contains": "PASS",
        "agent_blocked": "FAIL"
      },
      "judge_verdict": {
        "result": "FAIL",
        "reason": "Agent was not blocked — hook script exited 0 instead of 2.",
        "model": "claude-sonnet-4-20250514"
      },
      "agent_output_snippet": "I'll write the file...",
      "error": "expected agent-blocked=true but hook exited 0"
    }
  ]
}
```
Report Field Reference
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique run identifier. |
| `timestamp` | string | ISO 8601 run start time. |
| `duration_seconds` | number | Total eval run wall time. |
| `config.engine` | string | Agent runtime used (from config or `--engine` override). |
| `config.judge` | string | Judge model used (from config or `--judge` override). |
| `agent.runtime` | string | Actual agent runtime name. |
| `agent.runtime_version` | string | Agent runtime version (e.g. Claude Code 1.4.2). |
| `agent.model` | string | LLM model the agent used for execution. |
| `agent.model_provider` | string | Model provider (`anthropic`, `openai`, `google`, etc.). |
| `agent.session_id` | string | Agent session ID for traceability. |
| `judge.model` | string | LLM model used for the judge assessment. |
| `judge.model_provider` | string | Judge model provider. |
| `environment` | object | OS, architecture, AAM version, language runtimes. |
| `package` | object | Package name and version under evaluation. |
| `summary` | object | Aggregated pass/fail/skip counts and pass rate. |
| `cases[].verdict` | string | `"PASS"`, `"FAIL"`, or `"SKIP"`. |
| `cases[].deterministic_checks` | object | Per-check pass/fail for substring and file assertions. |
| `cases[].judge_verdict` | object | LLM judge result, reason, and model used. |
| `cases[].agent_output_snippet` | string | Truncated agent output (first 500 chars). |
| `cases[].error` | string | Error message if the case failed. |
Comparing Eval Reports
```shell
aam eval --report                    # Generate report (default: evals/reports/<timestamp>.json)
aam eval --report -o report.json     # Custom output path
aam eval diff <report-a> <report-b>  # Compare two reports side by side
```
Reports enable tracking eval pass rates across agent versions, model upgrades, and package changes — answering questions like "did upgrading from claude-sonnet to claude-opus improve the pdf-tools eval?"
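The core of such a comparison can be sketched briefly. This is an illustrative sketch of what `aam eval diff` could compute, assuming only the report schema shown above; the function name and output shape are not part of this spec.

```python
# Hypothetical sketch of a report diff over the documented report schema.
def diff_reports(report_a: dict, report_b: dict) -> dict:
    """Map case name -> (verdict_a, verdict_b) for cases whose verdict changed
    between two runs; a missing case appears as None on that side."""
    a = {c["name"]: c["verdict"] for c in report_a["cases"]}
    b = {c["name"]: c["verdict"] for c in report_b["cases"]}
    return {name: (a.get(name), b.get(name))
            for name in sorted(set(a) | set(b))
            if a.get(name) != b.get(name)}
```

Committing reports to `evals/reports/` makes this kind of regression tracking a one-command operation in CI.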