How to Evaluate an AI Agent System for Production Readiness
A structured checklist for evaluating whether an AI agent system is ready for production use. Anchored to authority boundaries, scope enforcement, policy gates, evidence completeness, replayability, and independent verification.
Most AI agent evaluations focus on capability — does the agent do the task correctly? Production readiness requires a different question: can the system prove what happened, constrain what the agent can do, and recover when something goes wrong?
Before you start
Gather the following before beginning any evaluation:
- System architecture doc — how the agent, tools, policy layer, and execution environment relate to each other
- Policy documentation — what rules govern which actions the agent is allowed to take
- Sample execution records — actual or representative logs, receipts, or evidence bundles from prior runs
- Stated failure modes — documented behavior when policy is violated, scope is exceeded, or execution fails
If the system does not publish any of these, that is itself a finding.
1. Authority boundary
Who decides whether the agent is allowed to take an action?
| Question | What to look for |
|---|---|
| Who decides if an action is allowed? | A distinct policy layer, human approver, or automated rule — not the agent itself |
| Is that decision recorded separately from execution? | Authority decisions should be logged independently of the action log |
| Can an attacker bypass the authority layer by manipulating the agent's input? | Prompt injection, tool chaining, or context stuffing that overrides policy |
| What happens if the authority layer is unavailable? | Fail-open vs. fail-closed behavior; documented default |
A system where the agent evaluates its own authority to act has a weak trust model regardless of how accurate its self-assessment typically is.
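The separation above can be sketched in a few lines. This is a hypothetical illustration, not a real API: `check_authority`, `AuthorityUnavailable`, and the log shape are all invented for the example. It shows a decision made outside the agent, recorded independently of execution, with a documented fail-closed default.

```python
# Hypothetical sketch: an authority check separate from the agent that
# fails closed when the policy layer is unreachable. Names are illustrative.
import time

class AuthorityUnavailable(Exception):
    pass

def check_authority(action, policy_client, decision_log):
    """Ask the policy layer for a decision and record it in a log that is
    independent of the action log. The agent never judges its own authority."""
    try:
        allowed = policy_client(action)
    except AuthorityUnavailable:
        allowed = False  # fail closed: no decision means no execution
    decision_log.append({
        "action": action,
        "allowed": allowed,
        "decided_at": time.time(),
    })
    return allowed

# A policy layer that is down should deny by default, not allow.
def broken_client(action):
    raise AuthorityUnavailable()

log = []
assert check_authority("delete_vm", broken_client, log) is False
assert log[0]["allowed"] is False
```

The fail-open variant (returning `True` on exception) is the pattern to flag during evaluation: it converts an outage of the authority layer into unbounded agent authority.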
2. Scope enforcement
Is scope declared only at configuration time, or also enforced at execution time?
- Does each tool call include a scope check, or only the initial session setup?
- What happens if a tool call targets a resource outside declared scope — is it blocked, logged, or silently allowed?
- Can scope be expanded mid-session through prompting, tool chaining, or indirect instruction?
- Is scope enforcement implemented in the policy layer, the tool layer, or both?
Scope that is declared at configuration but not enforced per-call is not enforcement — it is documentation.
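A per-call check is mechanically simple, which makes its absence notable. The sketch below is illustrative (the glob-pattern scope format and function names are assumptions, not a real system's interface): every tool call is tested against declared scope, and an out-of-scope target is both blocked and logged rather than silently allowed.

```python
# Hypothetical sketch: scope enforced on every tool call, not only at
# session setup. Resource patterns and tool names are illustrative.
import fnmatch

class ScopeViolation(Exception):
    pass

def call_tool(tool, resource, declared_scope, audit_log):
    """Block, and log, any call whose target falls outside declared scope."""
    in_scope = any(fnmatch.fnmatch(resource, pat) for pat in declared_scope)
    audit_log.append({"tool": tool, "resource": resource, "in_scope": in_scope})
    if not in_scope:
        raise ScopeViolation(f"{resource} is outside declared scope")
    return f"{tool} executed on {resource}"

scope = ["staging/*"]
log = []
call_tool("restart", "staging/web-1", scope, log)
try:
    call_tool("restart", "prod/web-1", scope, log)  # blocked, not silent
except ScopeViolation:
    pass
assert [e["in_scope"] for e in log] == [True, False]
```

Note that the violation still produces an audit entry before the exception is raised; a blocked call that leaves no record is itself an evidence gap.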
3. Policy gates
Does the system evaluate whether a proposed action is in policy before executing it?
- Are policy checks run before every tool call, or only on the initial request?
- Who defines the policy? Who can change it? Is that change auditable?
- Is policy logic embedded in the agent prompt, defined in a separate policy engine, or delegated to external rules?
- Can a policy change be correlated with a change in agent behavior through the evidence record?
A policy gate that can be bypassed by changing the system prompt is not a gate — it is a default.
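The auditability question above turns on whether decisions can be tied to a specific policy revision. A minimal sketch, assuming an invented rule format: the gate lives outside the agent prompt, runs per tool call, defaults to deny, and returns the policy version alongside each decision so evidence can correlate policy changes with behavior changes.

```python
# Hypothetical sketch: a versioned policy gate evaluated before every
# tool call, defined outside the agent prompt. Rule shape is illustrative.
POLICY = {
    "version": "2024-06-01",
    "rules": [
        {"tool": "read_logs", "effect": "allow"},
        {"tool": "delete_db", "effect": "deny"},
    ],
}

def gate(tool_call, policy):
    """Return (decision, policy_version) so the evidence record can tie
    each decision to the exact policy revision that produced it."""
    for rule in policy["rules"]:
        if rule["tool"] == tool_call:
            return rule["effect"], policy["version"]
    return "deny", policy["version"]  # default deny for unknown tools

assert gate("read_logs", POLICY) == ("allow", "2024-06-01")
assert gate("delete_db", POLICY) == ("deny", "2024-06-01")
assert gate("unknown_tool", POLICY)[0] == "deny"
```

Because the policy is data rather than prompt text, changing it requires a change to a versioned artifact, which is exactly what makes the change auditable.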
4. Evidence completeness
Does the system produce an execution record for each tool call?
- Does the record include: inputs, tool identifier, parameters, response, and timestamp?
- Is the record produced by the same component that executed the action, or by a separate observer?
- Are records written before execution completes, or only on success?
- Is there a mechanism to detect a missing record (i.e., an action with no evidence)?
A summary of what the agent did is not an execution record. Completeness requires per-call evidence, not session-level narration.
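The "written before execution completes" question can be made concrete with a two-phase record. This is a sketch under assumed field names, not a real logging API: intent is written before the tool runs, so a crash mid-execution still leaves evidence that the call started.

```python
# Hypothetical sketch: a per-call execution record written in two phases,
# so a failure mid-execution cannot erase the fact that the call began.
# Field names are illustrative.
import time
import uuid

def record_call(evidence, tool, params, execute):
    call_id = str(uuid.uuid4())
    # Phase 1: write intent before execution. A record with response=None
    # and no finished_at is the signature of an interrupted call.
    evidence.append({"call_id": call_id, "tool": tool, "params": params,
                     "started_at": time.time(), "response": None})
    response = execute(tool, params)
    # Phase 2: complete the record only after execution succeeds.
    evidence[-1]["response"] = response
    evidence[-1]["finished_at"] = time.time()
    return response

evidence = []
record_call(evidence, "get_status", {"host": "web-1"}, lambda t, p: "ok")
assert evidence[0]["response"] == "ok"
assert evidence[0]["started_at"] <= evidence[0]["finished_at"]
```

A system that only writes records on success fails this test: the actions most in need of evidence, the ones that crashed or were interrupted, are precisely the ones that leave none.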
5. Replayability
Can you reconstruct what happened from the evidence alone, without access to the live system?
- Does the evidence record include enough context to understand why the agent took each action — not just what it did?
- Can a dispute about what happened be resolved using the evidence, or does resolution require trusting the vendor's interpretation?
- Is the execution record self-contained, or does it reference external state that may have changed?
- Can the sequence of actions be reconstructed in order, with timing?
Evidence that only the originating system can interpret is not independently replayable.

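Sequence reconstruction and missing-record detection follow directly from self-contained records. The sketch below assumes an invented record shape with a sequence number and timestamp; the point is that ordering and gap detection need nothing from the live system.

```python
# Hypothetical sketch: reconstructing an ordered timeline from evidence
# records alone, with no call back to the live system. Record shape is
# illustrative and assumes each record is self-contained.
def replay(records):
    """Order records by timestamp and flag gaps in the call sequence."""
    ordered = sorted(records, key=lambda r: r["ts"])
    seen = [r["seq"] for r in ordered]
    expected = set(range(1, max(seen) + 1)) if seen else set()
    missing = sorted(expected - set(seen))
    return ordered, missing

records = [
    {"seq": 3, "ts": 104.0, "tool": "restart", "reason": "config changed"},
    {"seq": 1, "ts": 100.0, "tool": "read_config", "reason": "operator request"},
]
timeline, missing = replay(records)
assert [r["seq"] for r in timeline] == [1, 3]
assert missing == [2]  # an action with no evidence is detectable
```

The `reason` field matters as much as the sequence: a replay that shows what happened but not why it happened still leaves dispute resolution dependent on the vendor's interpretation.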
6. Independent verification
Can any artifact from this system be verified by someone who does not trust the issuing system?
- Are signatures or hashes present on execution records?
- Are public keys published in a location the issuer does not control?
- Can a third party confirm the timestamp is accurate without relying on the issuer's assertion?
- Is there a verification path that does not require a live connection to the issuing system?
If every verification step terminates in "trust the vendor," the system has no independent verification.
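The shape of an offline verification path can be shown with a content digest. This is a deliberately simplified stand-in: a production system would use public-key signatures (e.g. Ed25519) with keys published outside the issuer's control, but the sketch shows the property being tested, namely that a verifier with only the receipt and an externally published value needs no live connection to the issuer.

```python
# Hypothetical sketch: offline verification of a receipt against a digest
# published somewhere the issuer does not control (e.g. a third-party log).
# sha256 stands in for a real signature scheme such as Ed25519.
import hashlib
import json

def receipt_digest(receipt):
    """Canonicalize the receipt so every verifier computes the same hash."""
    canonical = json.dumps(receipt, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def verify(receipt, published_digest):
    """True only if the receipt matches the independently published digest."""
    return receipt_digest(receipt) == published_digest

receipt = {"call_id": "abc", "tool": "restart", "result": "ok"}
published = receipt_digest(receipt)  # in practice, fetched from a third party
assert verify(receipt, published)
assert not verify({**receipt, "result": "failed"}, published)
```

The evaluation question is where `published` lives. If the verifier must fetch it from the issuer's own domain, as in the AutoOps example below, tampering with the receipt and the digest together is a single-party operation, and the verification loop closes back on trusting the vendor.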
Worked example: AutoOps
AutoOps is a fictional AI agent that runs infrastructure playbooks on operator request.
| Dimension | What AutoOps claims | What it can prove | Gap |
|---|---|---|---|
| Authority boundary | All actions approved by policy engine | Policy engine logs are internal only | No external check that approval preceded execution |
| Scope enforcement | Scope defined per-playbook at session start | No per-call scope check in tool layer | Scope validated once; not enforced per tool call |
| Policy gates | All tool calls evaluated against policy | Policy evaluation logged in same system as execution | Policy change and behavior change are not independently correlated |
| Evidence completeness | Full execution record per run | Record is session summary, not per-call | Individual tool call inputs and parameters not captured |
| Replayability | Logs available on request | Logs require AutoOps portal to render | Dispute resolution depends on vendor's log rendering |
| Independent verification | Signed receipts issued | Public key hosted on AutoOps domain | Verifier must trust the issuer to obtain the key |
Summary: AutoOps has documented governance and produces execution summaries. It does not independently enforce scope per call, does not separate policy evaluation from execution logging, and does not support verification without relying on the vendor.
Findings structure
Organize your findings consistently:
- What the system claims — stated governance, policy enforcement, and evidence model
- What the system can prove — what is cryptographically or independently verifiable
- What the system assumes — trust assumptions that are not independently verifiable (runtime honesty, timestamp accuracy, scope fidelity)
- Where the gaps are — claims that exceed what the evidence supports
- What to watch — assumptions most likely to fail or be challenged in production
The principle
A production-ready AI agent system is not one that works correctly most of the time. It is one that can prove what happened, constrain what it can do, and fail safely when it goes wrong.