April 5, 2026
WitnessOps

How to Evaluate an AI Agent System for Production Readiness

A structured checklist for evaluating whether an AI agent system is ready for production use. Anchored to authority boundaries, scope enforcement, policy gates, evidence completeness, replayability, and independent verification.

Most AI agent evaluations focus on capability — does the agent do the task correctly? Production readiness requires a different question: can the system prove what happened, constrain what the agent can do, and recover when something goes wrong?

Before you start

Gather the following before beginning any evaluation:

  - documentation of the system's trust model and authority boundaries
  - policy and scope definitions as configured, not as described in marketing material
  - sample execution records for at least one complete session
  - any published verification material (key distribution, receipt formats)

If the system does not publish any of these, that is itself a finding.

1. Authority boundary

Who decides whether the agent is allowed to take an action?

| Question | What to look for |
| --- | --- |
| Who decides if an action is allowed? | A distinct policy layer, human approver, or automated rule, not the agent itself |
| Is that decision recorded separately from execution? | Authority decisions should be logged independently of the action log |
| Can an attacker bypass the authority layer by manipulating the agent's input? | Prompt injection, tool chaining, or context stuffing that overrides policy |
| What happens if the authority layer is unavailable? | Fail-open vs. fail-closed behavior; a documented default |

A system where the agent evaluates its own authority to act has a weak trust model regardless of how accurate its self-assessment typically is.
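The separation above can be sketched in a few lines. This is an illustrative Python sketch, not any particular system's API: the dispatch layer, not the agent, invokes the authority check; the decision is appended to its own log; and a failing authority layer fails closed.

```python
# Illustrative sketch: authority decisions made and recorded separately
# from execution. All names and rules here are hypothetical.
import time

AUTHORITY_LOG = []   # in production: a separate append-only store
EXECUTION_LOG = []

def authority_check(action: str, params: dict) -> bool:
    """Decide whether an action is allowed. The agent never calls this
    on its own behalf; the dispatch layer does."""
    allowed = action in {"restart_service", "read_metrics"}  # example rule
    AUTHORITY_LOG.append({
        "ts": time.time(),
        "action": action,
        "params": params,
        "decision": "allow" if allowed else "deny",
    })
    return allowed

def dispatch(action: str, params: dict) -> str:
    # Fail closed: if the check denies or raises, the action never runs.
    if not authority_check(action, params):
        raise PermissionError(f"action {action!r} denied by authority layer")
    EXECUTION_LOG.append({"ts": time.time(), "action": action, "params": params})
    return f"executed {action}"
```

Note that every authority decision, including denials, lands in `AUTHORITY_LOG`, while only executed actions land in `EXECUTION_LOG`; an auditor can correlate the two without trusting either one alone.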

2. Scope enforcement

Is scope declared only at configuration time, or also enforced at execution time?

Scope that is declared at configuration but not enforced per-call is not enforcement — it is documentation.
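The difference is easy to see in code. This hypothetical sketch snapshots the declared scope at configuration time and checks it on every call, so the check cannot be skipped or widened later:

```python
# Illustrative sketch: scope enforced per tool call, not merely
# declared at configuration. Names are hypothetical.
from functools import wraps

def scoped(allowed_hosts: set):
    """Wrap a tool so each invocation is checked against the declared scope."""
    def decorator(tool):
        frozen = frozenset(allowed_hosts)  # snapshot at configuration time
        @wraps(tool)
        def wrapper(host, *args, **kwargs):
            if host not in frozen:
                raise PermissionError(f"host {host!r} outside declared scope")
            return tool(host, *args, **kwargs)
        return wrapper
    return decorator

@scoped({"db-1.internal", "db-2.internal"})
def restart(host: str) -> str:
    return f"restarted {host}"
```

A system that only validated `allowed_hosts` once at session start would pass the same unit test on call one and still execute an out-of-scope call later; the per-call check is what makes the scope enforcement rather than documentation.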

3. Policy gates

Does the system evaluate whether a proposed action is in policy before executing it?

A policy gate that can be bypassed by changing the system prompt is not a gate — it is a default.
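One way to make the gate prompt-proof is to place it in the tool-dispatch loop, outside the model's context window entirely. A hedged sketch, with hypothetical policy rules and a default-deny posture:

```python
# Illustrative sketch: a policy gate applied in code, in the dispatch
# loop, so no change to the system prompt can remove it.
POLICY = {
    "delete_volume": "deny",
    "restart_service": "allow",
}

def gate(proposed_call: dict) -> bool:
    # Default deny: an action with no explicit rule is refused.
    return POLICY.get(proposed_call["tool"], "deny") == "allow"

def run_agent_step(model_output: dict) -> dict:
    """model_output is whatever the agent proposed; the gate is applied
    unconditionally before anything executes."""
    if not gate(model_output):
        return {"status": "blocked", "tool": model_output["tool"]}
    return {"status": "executed", "tool": model_output["tool"]}
```

Because `gate` reads `POLICY` rather than anything the model produced, manipulating the agent's input can change what is proposed but not what is permitted.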

4. Evidence completeness

Does the system produce an execution record for each tool call?

A summary of what the agent did is not an execution record. Completeness requires per-call evidence, not session-level narration.
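What a per-call record looks like in practice: each tool invocation gets its own entry with inputs, parameters, and a digest of the result. The field names below are illustrative, not a standard:

```python
# Illustrative sketch: one evidence entry per tool call, capturing
# inputs and parameters rather than a session-level summary.
import hashlib
import json
import time

def record_call(log: list, tool: str, params: dict, result) -> dict:
    entry = {
        "ts": time.time(),
        "tool": tool,
        "params": params,  # full inputs, not a narrative summary
        "result_digest": hashlib.sha256(
            json.dumps(result, sort_keys=True).encode()
        ).hexdigest(),
    }
    log.append(entry)
    return entry
```

A session summary can omit the one call that matters in a dispute; a per-call log by construction cannot.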

5. Replayability

Can you reconstruct what happened from the evidence alone, without access to the live system?

Evidence that requires the originating system to interpret is not independently replayable.
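The test for replayability is whether a timeline can be rebuilt from the evidence file with nothing but a standard parser. A sketch, assuming a hypothetical JSON-lines evidence format:

```python
# Illustrative sketch: evidence as self-describing JSON lines that any
# reviewer can replay offline, with no access to the originating system.
import json

def replay(evidence_lines: list[str]) -> list[str]:
    """Rebuild a human-readable timeline from raw evidence alone."""
    timeline = []
    for line in evidence_lines:
        entry = json.loads(line)
        timeline.append(f'{entry["seq"]}: {entry["tool"]}({entry["params"]})')
    return timeline
```

If this function needs a vendor portal, a proprietary schema decoder, or a live API call to produce the timeline, the evidence is not independently replayable.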

6. Independent verification

Can any artifact from this system be verified by someone who does not trust the issuing system?

If every verification step terminates in "trust the vendor," the system has no independent verification.
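One building block for vendor-independent verification is a hash-chained evidence log, which any third party can check offline with a standard library. This sketch covers integrity only; binding the chain to an identity would additionally require a signature whose public key is distributed outside the vendor's control, which is omitted here:

```python
# Illustrative sketch: hash-chained evidence entries verifiable by
# anyone, offline, without trusting the issuing system's tooling.
import hashlib
import json

def chain_digest(prev_hash: str, entry: dict) -> str:
    payload = prev_hash + json.dumps(entry, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def verify_chain(entries: list[dict], hashes: list[str]) -> bool:
    prev = "0" * 64  # genesis value
    for entry, h in zip(entries, hashes):
        if chain_digest(prev, entry) != h:
            return False
        prev = h
    return len(entries) == len(hashes)
```

Tampering with any entry breaks every digest after it, so the verifier needs nothing from the vendor except the log itself; contrast this with a public key hosted on the issuer's own domain, where the verification chain still terminates in "trust the vendor."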

Worked example: AutoOps

AutoOps is a fictional AI agent that runs infrastructure playbooks on operator request.

| Dimension | What AutoOps claims | What it can prove | Gap |
| --- | --- | --- | --- |
| Authority boundary | All actions approved by policy engine | Policy engine logs are internal only | No external check that approval preceded execution |
| Scope enforcement | Scope defined per-playbook at session start | No per-call scope check in tool layer | Scope validated once, not enforced per tool call |
| Policy gates | All tool calls evaluated against policy | Policy evaluation logged in same system as execution | Policy change and behavior change are not independently correlated |
| Evidence completeness | Full execution record per run | Record is session summary, not per-call | Individual tool call inputs and parameters not captured |
| Replayability | Logs available on request | Logs require AutoOps portal to render | Dispute resolution depends on the vendor's log rendering |
| Independent verification | Signed receipts issued | Public key hosted on AutoOps domain | Verifier must trust the issuer to obtain the key |

Summary: AutoOps has documented governance and produces execution summaries. It does not independently enforce scope per call, does not separate policy evaluation from execution logging, and does not support verification without relying on the vendor.

Findings structure

Organize your findings consistently:

  1. What the system claims — stated governance, policy enforcement, and evidence model
  2. What the system can prove — what is cryptographically or independently verifiable
  3. What the system assumes — trust assumptions that are not independently verifiable (runtime honesty, timestamp accuracy, scope fidelity)
  4. Where the gaps are — claims that exceed what the evidence supports
  5. What to watch — assumptions most likely to fail or be challenged in production

The principle

A production-ready AI agent system is not one that works correctly most of the time. It is one that can prove what happened, constrain what it can do, and fail safely when it goes wrong.