How to Evaluate an AI Agent System for Production Readiness
A structured checklist for evaluating whether an AI agent system is ready for production use. Anchored to authority boundaries, scope enforcement, policy gates, evidence completeness, replayability, and independent verification.
Most AI agent evaluations focus on capability — does the agent do the task correctly? Production readiness requires a different question: can the system prove what happened, constrain what the agent can do, and recover when something goes wrong?
Before you start
Gather the following before beginning any evaluation:
- System architecture doc — how the agent, tools, policy layer, and execution environment relate to each other
- Policy documentation — what rules govern which actions the agent is allowed to take
- Sample execution records — actual or representative logs, receipts, or evidence bundles from prior runs
- Stated failure modes — documented behavior when policy is violated, scope is exceeded, or execution fails
If the system does not publish any of these, that is itself a finding.
1. Authority boundary
Who decides whether the agent is allowed to take an action?
| Question | What to look for |
|---|---|
| Who decides if an action is allowed? | A distinct policy layer, human approver, or automated rule — not the agent itself |
| Is that decision recorded separately from execution? | Authority decisions should be logged independently of the action log |
| Can an attacker bypass the authority layer by manipulating the agent's input? | Prompt injection, tool chaining, or context stuffing that overrides policy |
| What happens if the authority layer is unavailable? | Fail-open vs. fail-closed behavior; documented default |
A system where the agent evaluates its own authority to act has a weak trust model regardless of how accurate its self-assessment typically is.
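The separation above can be sketched in a few lines. This is a hypothetical illustration, not a real API: `check_authority`, `AuthorityUnavailable`, and the log shape are all invented for the example. It shows a decision made outside the agent, recorded independently of execution, with a documented fail-closed default.

```python
# Hypothetical sketch: an authority check separate from the agent that
# fails closed when the policy layer is unreachable. Names are illustrative.
import time

class AuthorityUnavailable(Exception):
    pass

def check_authority(action, policy_client, decision_log):
    """Ask the policy layer for a decision and record it in a log that is
    independent of the action log. The agent never judges its own authority."""
    try:
        allowed = policy_client(action)
    except AuthorityUnavailable:
        allowed = False  # fail closed: no decision means no execution
    decision_log.append({
        "action": action,
        "allowed": allowed,
        "decided_at": time.time(),
    })
    return allowed

# A policy layer that is down should deny by default, not allow.
def broken_client(action):
    raise AuthorityUnavailable()

log = []
assert check_authority("delete_vm", broken_client, log) is False
assert log[0]["allowed"] is False
```

The fail-open variant (returning `True` on exception) is the pattern to flag during evaluation: it converts an outage of the authority layer into unbounded agent authority.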
2. Scope enforcement
Is scope declared only at configuration time, or also enforced at execution time?
- Does each tool call include a scope check, or only the initial session setup?
- What happens if a tool call targets a resource outside declared scope — is it blocked, logged, or silently allowed?
- Can scope be expanded mid-session through prompting, tool chaining, or indirect instruction?
- Is scope enforcement implemented in the policy layer, the tool layer, or both?
Scope that is declared at configuration but not enforced per-call is not enforcement — it is documentation.
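A per-call check is mechanically simple, which makes its absence notable. The sketch below is illustrative (the glob-pattern scope format and function names are assumptions, not a real system's interface): every tool call is tested against declared scope, and an out-of-scope target is both blocked and logged rather than silently allowed.

```python
# Hypothetical sketch: scope enforced on every tool call, not only at
# session setup. Resource patterns and tool names are illustrative.
import fnmatch

class ScopeViolation(Exception):
    pass

def call_tool(tool, resource, declared_scope, audit_log):
    """Block, and log, any call whose target falls outside declared scope."""
    in_scope = any(fnmatch.fnmatch(resource, pat) for pat in declared_scope)
    audit_log.append({"tool": tool, "resource": resource, "in_scope": in_scope})
    if not in_scope:
        raise ScopeViolation(f"{resource} is outside declared scope")
    return f"{tool} executed on {resource}"

scope = ["staging/*"]
log = []
call_tool("restart", "staging/web-1", scope, log)
try:
    call_tool("restart", "prod/web-1", scope, log)  # blocked, not silent
except ScopeViolation:
    pass
assert [e["in_scope"] for e in log] == [True, False]
```

Note that the violation still produces an audit entry before the exception is raised; a blocked call that leaves no record is itself an evidence gap.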
3. Policy gates
Does the system evaluate whether a proposed action is in policy before executing it?
- Are policy checks run before every tool call, or only on the initial request?
- Who defines the policy? Who can change it? Is that change auditable?
- Is policy logic embedded in the agent prompt, defined in a separate policy engine, or delegated to external rules?
- Can a policy change be correlated with a change in agent behavior through the evidence record?
A policy gate that can be bypassed by changing the system prompt is not a gate — it is a default.
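The auditability question above turns on whether decisions can be tied to a specific policy revision. A minimal sketch, assuming an invented rule format: the gate lives outside the agent prompt, runs per tool call, defaults to deny, and returns the policy version alongside each decision so evidence can correlate policy changes with behavior changes.

```python
# Hypothetical sketch: a versioned policy gate evaluated before every
# tool call, defined outside the agent prompt. Rule shape is illustrative.
POLICY = {
    "version": "2024-06-01",
    "rules": [
        {"tool": "read_logs", "effect": "allow"},
        {"tool": "delete_db", "effect": "deny"},
    ],
}

def gate(tool_call, policy):
    """Return (decision, policy_version) so the evidence record can tie
    each decision to the exact policy revision that produced it."""
    for rule in policy["rules"]:
        if rule["tool"] == tool_call:
            return rule["effect"], policy["version"]
    return "deny", policy["version"]  # default deny for unknown tools

assert gate("read_logs", POLICY) == ("allow", "2024-06-01")
assert gate("delete_db", POLICY) == ("deny", "2024-06-01")
assert gate("unknown_tool", POLICY)[0] == "deny"
```

Because the policy is data rather than prompt text, changing it requires a change to a versioned artifact, which is exactly what makes the change auditable.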
4. Evidence completeness
Does the system produce an execution record for each tool call?
- Does the record include: inputs, tool identifier, parameters, response, and timestamp?
- Is the record produced by the same component that executed the action, or by a separate observer?
- Are records written before execution completes, or only on success?
- Is there a mechanism to detect a missing record (i.e., an action with no evidence)?
A summary of what the agent did is not an execution record. Completeness requires per-call evidence, not session-level narration.
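The "written before execution completes" question can be made concrete with a two-phase record. This is a sketch under assumed field names, not a real logging API: intent is written before the tool runs, so a crash mid-execution still leaves evidence that the call started.

```python
# Hypothetical sketch: a per-call execution record written in two phases,
# so a failure mid-execution cannot erase the fact that the call began.
# Field names are illustrative.
import time
import uuid

def record_call(evidence, tool, params, execute):
    call_id = str(uuid.uuid4())
    # Phase 1: write intent before execution. A record with response=None
    # and no finished_at is the signature of an interrupted call.
    evidence.append({"call_id": call_id, "tool": tool, "params": params,
                     "started_at": time.time(), "response": None})
    response = execute(tool, params)
    # Phase 2: complete the record only after execution succeeds.
    evidence[-1]["response"] = response
    evidence[-1]["finished_at"] = time.time()
    return response

evidence = []
record_call(evidence, "get_status", {"host": "web-1"}, lambda t, p: "ok")
assert evidence[0]["response"] == "ok"
assert evidence[0]["started_at"] <= evidence[0]["finished_at"]
```

A system that only writes records on success fails this test: the actions most in need of evidence, the ones that crashed or were interrupted, are precisely the ones that leave none.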
5. Replayability
Can you reconstruct what happened from the evidence alone, without access to the live system?
- Does the evidence record include enough context to understand why the agent took each action — not just what it did?
- Can a dispute about what happened be resolved using the evidence, or does resolution require trusting the vendor's interpretation?
- Is the execution record self-contained, or does it reference external state that may have changed?
- Can the sequence of actions be reconstructed in order, with timing?
Evidence that only the originating system can interpret is not independently replayable.

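Sequence reconstruction and missing-record detection follow directly from self-contained records. The sketch below assumes an invented record shape with a sequence number and timestamp; the point is that ordering and gap detection need nothing from the live system.

```python
# Hypothetical sketch: reconstructing an ordered timeline from evidence
# records alone, with no call back to the live system. Record shape is
# illustrative and assumes each record is self-contained.
def replay(records):
    """Order records by timestamp and flag gaps in the call sequence."""
    ordered = sorted(records, key=lambda r: r["ts"])
    seen = [r["seq"] for r in ordered]
    expected = set(range(1, max(seen) + 1)) if seen else set()
    missing = sorted(expected - set(seen))
    return ordered, missing

records = [
    {"seq": 3, "ts": 104.0, "tool": "restart", "reason": "config changed"},
    {"seq": 1, "ts": 100.0, "tool": "read_config", "reason": "operator request"},
]
timeline, missing = replay(records)
assert [r["seq"] for r in timeline] == [1, 3]
assert missing == [2]  # an action with no evidence is detectable
```

The `reason` field matters as much as the sequence: a replay that shows what happened but not why it happened still leaves dispute resolution dependent on the vendor's interpretation.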
6. Independent verification
Can any artifact from this system be verified by someone who does not trust the issuing system?
- Are signatures or hashes present on execution records?
- Are public keys published in a location the issuer does not control?
- Can a third party confirm the timestamp is accurate without relying on the issuer's assertion?
- Is there a verification path that does not require a live connection to the issuing system?
If every verification step terminates in "trust the vendor," the system has no independent verification.
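The shape of an offline verification path can be shown with a content digest. This is a deliberately simplified stand-in: a production system would use public-key signatures (e.g. Ed25519) with keys published outside the issuer's control, but the sketch shows the property being tested, namely that a verifier with only the receipt and an externally published value needs no live connection to the issuer.

```python
# Hypothetical sketch: offline verification of a receipt against a digest
# published somewhere the issuer does not control (e.g. a third-party log).
# sha256 stands in for a real signature scheme such as Ed25519.
import hashlib
import json

def receipt_digest(receipt):
    """Canonicalize the receipt so every verifier computes the same hash."""
    canonical = json.dumps(receipt, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def verify(receipt, published_digest):
    """True only if the receipt matches the independently published digest."""
    return receipt_digest(receipt) == published_digest

receipt = {"call_id": "abc", "tool": "restart", "result": "ok"}
published = receipt_digest(receipt)  # in practice, fetched from a third party
assert verify(receipt, published)
assert not verify({**receipt, "result": "failed"}, published)
```

The evaluation question is where `published` lives. If the verifier must fetch it from the issuer's own domain, as in the AutoOps example below, tampering with the receipt and the digest together is a single-party operation, and the verification loop closes back on trusting the vendor.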
Worked example: AutoOps
AutoOps is a fictional AI agent that runs infrastructure playbooks on operator request.
| Dimension | What AutoOps claims | What it can prove | Gap |
|---|---|---|---|
| Authority boundary | All actions approved by policy engine | Policy engine logs are internal only | No external check that approval preceded execution |
| Scope enforcement | Scope defined per-playbook at session start | No per-call scope check in tool layer | Scope validated once; not enforced per tool call |
| Policy gates | All tool calls evaluated against policy | Policy evaluation logged in same system as execution | Policy change and behavior change are not independently correlated |
| Evidence completeness | Full execution record per run | Record is session summary, not per-call | Individual tool call inputs and parameters not captured |
| Replayability | Logs available on request | Logs require AutoOps portal to render | Dispute resolution depends on vendor's log rendering |
| Independent verification | Signed receipts issued | Public key hosted on AutoOps domain | Verifier must trust the issuer to obtain the key |
Summary: AutoOps has documented governance and produces execution summaries. It does not independently enforce scope per call, does not separate policy evaluation from execution logging, and does not support verification without relying on the vendor.
Findings structure
Organize your findings consistently:
- What the system claims — stated governance, policy enforcement, and evidence model
- What the system can prove — what is cryptographically or independently verifiable
- What the system assumes — trust assumptions that are not independently verifiable (runtime honesty, timestamp accuracy, scope fidelity)
- Where the gaps are — claims that exceed what the evidence supports
- What to watch — assumptions most likely to fail or be challenged in production
The principle
A production-ready AI agent system is not one that works correctly most of the time. It is one that can prove what happened, constrain what it can do, and fail safely when it goes wrong.