April 16, 2026
WitnessOps

Why Third-Party Verifiers Still Fail When the Evidence Path Is Controlled

A third-party verifier that reads only what the system under review provides is not an independent check. Independence of the verifier is necessary but not sufficient — the evidence path must also be independent.

The Pattern

The SOC 2 audit is the canonical example. A company hires an accredited CPA firm with no financial relationship to the vendor, no shared personnel, and a contractual obligation to provide an independent opinion. The auditor is, by every structural definition, a third party. The engagement produces a signed Type II report. The vendor posts the badge.

What the audit actually reviewed was an evidence package assembled by the vendor: exported log samples, screenshots of configuration screens, access control records pulled from the vendor's own systems, and policy documents written by the vendor's compliance team. The auditor examined what the vendor provided. There is no mechanism in a standard SOC 2 engagement that requires the auditor to independently capture raw system state, verify that the exported records are complete rather than curated, or replay the execution environment to confirm the controls were actually operating as described.

The same structure appears in AI model evaluations. An AI lab commissions an external red team or safety evaluator. The evaluator has no stake in the outcome. But they test the model through the lab's own API, against a prompt set the lab may have approved in advance, using a scoring rubric the lab provided. The evaluator is independent. The evidence path runs entirely through the system under review. The resulting report certifies what the evaluator observed — not what the system does in production, not what it was doing before the evaluation window, not what it would do if the API layer were removed.


What Looks Strong

On paper, both engagements pass every structural test of independence. The verifier has no financial stake in the system's operator, no shared personnel, accredited credentials, and a contractual obligation to report what it finds. The engagement follows an accepted standard and produces a signed report. Every conventional independence check, from conflict-of-interest screening to professional accreditation, comes back clean.

Where the Evidence Path Is Actually Weak

1. Export dependency — the verifier receives only what the system chooses to export

The verifier's entire evidence base was assembled by the system being evaluated. In a SOC 2 engagement, the vendor exports log samples from its SIEM, pulls access records from its identity provider, and generates configuration screenshots from its admin console. The auditor receives this package and tests it for internal consistency. They do not have a channel to the underlying log storage that bypasses the vendor's export layer.

The practical consequence: if the system exports a curated 90-day window of clean logs, the auditor evaluates that window. Anomalies outside the export boundary, or suppressed by the export configuration, are invisible. The auditor is not negligent — they examined what was provided. The export dependency is structural, not a matter of auditor diligence.
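The boundary can be made concrete. Below is a minimal Python sketch, with all names, record shapes, and dates invented for illustration: the system's export layer returns a curated, clean 90-day window, and a verifier restricted to that channel finds the package internally consistent even though the full store holds anomalies outside the window.

```python
# Hypothetical log store: (day_index, event) pairs covering a full year,
# including anomalies the vendor's export will never surface.
full_store = [(d, "anomaly" if d in (20, 150) else "ok") for d in range(365)]

def vendor_export(store, start, end):
    """The only channel the verifier has: the system's own export layer.
    It selects the window and silently drops anomalous records."""
    return [(d, e) for d, e in store if start <= d < end and e == "ok"]

# The vendor exports a curated, clean 90-day window.
package = vendor_export(full_store, 240, 330)

def verifier_check(records):
    """All the verifier can do: test the package for internal consistency
    (contiguous coverage, no anomalous events). It cannot reach full_store."""
    days = [d for d, _ in records]
    contiguous = days == list(range(min(days), max(days) + 1))
    clean = all(e == "ok" for _, e in records)
    return contiguous and clean

print(verifier_check(package))                     # True: the export passes
print(any(e == "anomaly" for _, e in full_store))  # True: anomalies exist outside it
```

The check is not wrong; it is scoped to the export. Nothing in the package signals that the window boundary was chosen by the party under review.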

2. Selection control — the system decides which records are included in the evidence set

Export dependency is about the channel; selection control is about the contents. Even where the export mechanism is technically open, the system determines the query parameters, the date range, the record types, and the sample size. An AI evaluation lab that provides a prompt benchmark controls which capability surface is tested. A vendor providing "representative" log samples controls what representative means.

A verifier cannot reliably detect selection control from inside the evidence set. The set may be internally complete and consistent. The question is whether it is a representative draw from the full population of system behavior, and the verifier has no independent access to the population to check. Selection control means the verifier is assessing a sample they did not draw.
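The sampling gap can be sketched in a few lines. This is a toy illustration with made-up names and rates: the statistic computed from a vendor-drawn sample diverges from the population statistic, and nothing inside the sample reveals the divergence.

```python
import random

random.seed(0)

# Hypothetical population of 10,000 access-review outcomes, roughly 5% failures.
population = ["fail" if random.random() < 0.05 else "pass" for _ in range(10_000)]

def vendor_sample(pop, n=200):
    """The system draws the 'representative' sample and quietly prefers
    passing records. The sample is internally complete and well-formed."""
    passes = [r for r in pop if r == "pass"]
    return passes[:n]

sample = vendor_sample(population)

pop_rate = population.count("fail") / len(population)
sample_rate = sample.count("fail") / len(sample)

print(f"population failure rate: {pop_rate:.1%}")   # close to 5%
print(f"sample failure rate:     {sample_rate:.1%}")  # 0.0%
```

Any consistency test the verifier runs on `sample` alone passes; detecting the skew requires an independent draw from `population`, which is exactly what selection control withholds.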

3. Format control — the evidence is serialized by the system, not captured independently

Records exported by a system are serialized by that system's export pipeline. The verifier receives structured data — JSON logs, PDF reports, CSV access lists — that was formatted, filtered, and possibly post-processed before delivery. The raw substrate (append-only log store, kernel audit trail, model weight checkpoint) remains in the vendor's custody.

Format control matters because serialization is transformation. Fields can be omitted, timestamps normalized, identifiers pseudonymized, and sequences reordered — all without falsifying any individual record. A verifier comparing exported records for internal consistency will find them consistent. Consistency within a formatted artifact does not imply fidelity to the underlying system state. The verifier is auditing the export, not the system.
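A hedged sketch of the serialization gap, with all record shapes and field names invented: each transformation in the export pipeline is individually defensible and leaves the exported records well-formed, yet a commitment over the raw records (if one existed) would show the export is not the substrate.

```python
import hashlib
import json

# Hypothetical raw records in the vendor's custody.
raw = [
    {"id": "user-7", "ts": "2026-01-03T04:11:09Z", "action": "delete", "target": "audit-cfg"},
    {"id": "user-2", "ts": "2026-01-03T09:30:00Z", "action": "login", "target": "-"},
]

def export_pipeline(records):
    """Serialization is transformation: pseudonymize IDs, truncate timestamps
    to the day, drop the target field. No individual record is falsified."""
    return [
        {"id": f"subject-{i}", "ts": r["ts"][:10], "action": r["action"]}
        for i, r in enumerate(records)
    ]

exported = export_pipeline(raw)

# The verifier can confirm the export is internally well-formed...
assert all(set(r) == {"id", "ts", "action"} for r in exported)

# ...but fidelity to the raw substrate is only checkable with raw access,
# e.g. against a hash commitment made over the unserialized records.
def digest(records):
    return hashlib.sha256(json.dumps(records, sort_keys=True).encode()).hexdigest()

print(digest(raw) == digest(exported))  # False: the export is not the system state
```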

4. Replay denial — the verifier cannot reconstruct the execution without the originating platform

The strongest form of independent verification is reconstruction: run the system from a known state, observe the outputs, compare to claimed behavior. Replay denial is the condition where this is structurally impossible. The verifier cannot re-execute the SOC 2 audit period because the live infrastructure is not in their custody. The AI evaluator cannot re-run the model against a different prompt set without the lab's API cooperation. The certified configuration cannot be reproduced outside the vendor's deployment environment.

Without replay, the verifier's finding is: "the evidence provided is consistent with the claimed controls." It cannot be: "we independently confirmed the controls operated as described." These are not the same statement. A competent, skeptical verifier operating under replay denial is limited to assessing coherence of the evidence package. Coherence is not correctness.
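The gap between the two findings can be shown directly. In this hypothetical sketch, a fabricated evidence package passes every coherence check a verifier can run without the platform; the replay check that would expose it requires custody of the execution environment, which is exactly what replay denial withholds.

```python
import hashlib

def chain(entries):
    """Build a hash-chained evidence package from a list of claimed outputs."""
    h, out = "0" * 64, []
    for e in entries:
        h = hashlib.sha256((h + e).encode()).hexdigest()
        out.append((e, h))
    return out

def coherence_check(pkg):
    """All a verifier under replay denial can do: confirm the package is
    internally consistent. A fabricated-but-well-chained package passes."""
    h = "0" * 64
    for entry, claimed in pkg:
        h = hashlib.sha256((h + entry).encode()).hexdigest()
        if h != claimed:
            return False
    return True

def replay_check(system, inputs, pkg):
    """What replay would add: re-execute the system and compare actual
    outputs to the claims. Requires custody of the execution environment,
    so the verifier cannot invoke it."""
    return [system(x) for x in inputs] == [entry for entry, _ in pkg]

# A package describing controls that never actually ran.
fabricated = chain(["control A operated", "control B operated"])
print(coherence_check(fabricated))  # True: coherent, yet proves nothing about the system
```

`coherence_check` returning True is the most a package-only engagement can establish; `replay_check` is the statement the report cannot make.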


What a More Governable Version Would Need to Show

Each weakness names its own remedy. Against export dependency: a capture channel that reaches the underlying log store without passing through the system's export layer. Against selection control: samples drawn by the verifier, with query parameters, date ranges, and sample sizes chosen outside the system under review. Against format control: access to the raw substrate, or at minimum a prior cryptographic commitment to it, so the exported serialization can be checked against what it claims to represent. Against replay denial: the ability to re-execute the system from a known state in an environment the verifier controls. None of these change who the verifier is; all of them change where the evidence comes from.

The Principle

A third party that reads only what the system under review provides is a trusted reader of untrusted input — the independence of the reader does not change the provenance of what is being read. Verification independence requires independence of both the verifier and the evidence path. When the evidence path is controlled, the audit certifies the package, not the system.


If this looks familiar, reading more won’t fix it → /review


See also: How to Test Whether a Proof Surface Is Actually Independent — the structured test for whether the evidence path meets the independence threshold.