Evidence — 4CITE.ai Validation Results

Summary

Four Validation Results

Every scoring result below is based on the live 4CITE engine applied to real documents in controlled studies. No cherry-picked examples. No theoretical claims.

81

Point integrity gap

Real federal opinion (88) vs. AI-fabricated brief (7)

WP7 — The Absence Detector

71pt

Discrimination delta

Average gap across 20-document blinded batch

WP5 — Hallucination as Structural Extraction

13/13

Archetype accuracy

All 13 test documents correctly classified

R35 — Archetype Validation Study

2×

Win rate correlation

Structural clarity associated with a roughly doubled federal court win rate (42% → 85%)

Spencer & Feldman (2018)

Study 1

The 81-Point Gap

WP7 — The Absence Detector · 4 SHIELD LLC R&D

Real federal opinion vs. AI-fabricated brief: a structural comparison

In 2023, attorneys in Mata v. Avianca submitted a legal brief to the United States District Court for the Southern District of New York that cited six cases that did not exist. The brief had been generated with ChatGPT. When the court asked for verification, the attorneys submitted further fabricated material. Judge P. Kevin Castel sanctioned the attorneys under Rule 11 of the Federal Rules of Civil Procedure.

The Mata v. Avianca case is the canonical example of AI hallucination in a legal context: a document that had the form of a legal brief and cited cases in the correct format, with plausible-sounding names and citations, all of them fabricated. The fabricated citations were the Layer 2 failure. 4CITE measures Layer 3: structural integrity, independent of whether any individual citation is real.

The 4CITE engine was run on both the fabricated Mata v. Avianca brief and a genuine federal sanctions opinion of comparable subject matter and length. Results:

Document	Type	4CITE Score	Tier
Genuine federal sanctions opinion	Real judicial opinion	88	T1 — Integrated
Mata v. Avianca brief (ChatGPT-fabricated)	AI-hallucinated brief	7	T4 — Fabricated

The 81-point gap is the structural fingerprint of hallucinated accountability. The fabricated brief scored low not because the engine detected fabricated citations — it does not check citations. It scored low because the reasoning structure was hollow: the surface language performed legal argumentation without the underlying logical architecture that genuine legal opinions exhibit. The foundational gates (G4 Paradox Resolution, G6 Latent Intent) and the surface reasoning gate (G7 Argumentative Structure) all collapsed.

The implication: As AI systems improve at generating plausible citations, Layer 2 tools become less reliable as the sole defense. The structural integrity failure — the Layer 3 failure — is detectable independently of whether any citation is real. That is the Layer 3 proposition.

Source: WP7 — The Absence Detector, 4 SHIELD LLC R&D. Case: Mata v. Avianca, Inc., No. 22-cv-1461 (S.D.N.Y. 2023).

Study 2

The 71-Point Delta

WP5 — Hallucination as Structural Extraction · 4 SHIELD LLC R&D

Blinded batch discrimination: genuine vs. hallucinated documents

A blinded batch of 20 documents — a mix of genuine institutional documents and AI-hallucinated documents — was run through the 4CITE engine without the engine being told which was which. The documents were drawn from the law and business verticals, spanning multiple document classes.

The average discrimination delta between genuine and hallucinated documents was 71 points. The scoring was consistent across document types: the engine did not require domain-specific calibration to distinguish genuine reasoning from AI-generated performance of reasoning. The structural signatures are consistent enough across domains that the same measurement framework discriminates reliably without per-domain tuning.
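
As a rough illustration of how a delta of this kind can be computed, the sketch below takes the mean score of the genuine documents minus the mean score of the hallucinated ones, with labels attached only after scoring. The source does not state the exact formula used in WP5, so the difference-of-means definition, the function name `discrimination_delta`, and the example scores are all assumptions made for illustration.

```python
from statistics import mean


def discrimination_delta(scored_docs):
    """Mean score of genuine docs minus mean score of hallucinated docs.

    scored_docs: iterable of (score, label) pairs, where label is
    "genuine" or "hallucinated". Labels are joined to scores only after
    scoring, mirroring a blinded batch.
    """
    docs = list(scored_docs)
    genuine = [score for score, label in docs if label == "genuine"]
    hallucinated = [score for score, label in docs if label == "hallucinated"]
    if not genuine or not hallucinated:
        raise ValueError("need at least one document of each kind")
    return mean(genuine) - mean(hallucinated)


# Hypothetical scores for illustration only (not the WP5 batch).
example_batch = [
    (90, "genuine"),
    (80, "genuine"),
    (10, "hallucinated"),
    (20, "hallucinated"),
]
print(discrimination_delta(example_batch))
```

A difference of means is only one possible summary; a batch with overlapping score ranges could show a large mean delta and still misrank individual documents, so per-document scores are worth reporting alongside it.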

What this means: The 4CITE engine is not a specialized legal-document tool or a specialized financial-document tool. It is a domain-agnostic structural integrity engine. The 71-point delta holds across legal briefs and SEC filings alike, because the properties that distinguish genuine reasoning from hallucinated reasoning are structural, not domain-specific.

Source: WP5 — Hallucination as Structural Extraction, 4 SHIELD LLC R&D. 20-document blinded batch, cross-domain (law + business verticals).

Study 3

13/13 Archetype Classification

R35 — Archetype Validation Study · April 3, 2026

Framework generalization: correct classification across all test document archetypes

Before scoring, 4CITE classifies every document into one of 25 document archetypes: classes that define what a structurally sound version of that document type looks like, what failure modes are typical, and how to calibrate the gate scores for that class. Archetype classification is the prerequisite for scoring: a T1 judicial opinion might look very different from a T1 earnings call, even though both score high on structural integrity.

In the R35 validation study, 13 test documents spanning multiple document classes across law, business, and government were submitted to the classification system without labels. All 13 were correctly classified to their archetype. No domain-specific tuning was applied between document classes.

The significance: Archetype classification accuracy is the prerequisite for scoring accuracy. If the engine misclassifies a congressional testimony as a legal brief, the gate calibration is wrong and the score is meaningless. 13/13 accuracy confirms that the classification framework generalizes correctly across the document classes in the current corpus.
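
A figure such as 13/13 is a plain exact-match accuracy over a labeled test set. The sketch below shows that calculation under stated assumptions: the archetype label strings and the function name `classification_accuracy` are hypothetical, and the example deliberately includes one misclassification (a testimony labeled as a legal brief) to show how a miss would surface.

```python
def classification_accuracy(predicted, expected):
    """Return (correct, total, misclassified_indices) for two label lists."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    # Indices where the predicted archetype differs from the expected one.
    misclassified = [
        i for i, (p, e) in enumerate(zip(predicted, expected)) if p != e
    ]
    total = len(expected)
    return total - len(misclassified), total, misclassified


# Hypothetical archetype labels, for illustration only.
expected = [
    "judicial_opinion",
    "legal_brief",
    "congressional_testimony",
    "earnings_call",
]
predicted = [
    "judicial_opinion",
    "legal_brief",
    "legal_brief",  # testimony misread as a brief
    "earnings_call",
]

correct, total, misses = classification_accuracy(predicted, expected)
print(f"{correct}/{total} correct; misclassified indices: {misses}")
```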

Source: R35 — Archetype Validation Study, 4 SHIELD LLC R&D, April 3, 2026. 13-document test set across law, business, and government archetypes.

Study 4 — External Research

2× Federal Court Win Rate

Spencer & Feldman, 22 Legal Writing 61 (2018)

Brief structural clarity associated with a doubled win rate in federal court

Spencer and Feldman's 2018 study of federal appellate briefs found that brief structural clarity — the property of legal briefs in which the argument structure is coherent, logical, and follows clearly from its premises — was associated with a doubling of win rates: briefs with low structural clarity showed win rates of approximately 42%; briefs with high structural clarity showed win rates of approximately 85%.
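
The "doubling" is a ratio of the two approximate win rates, not a percentage-point difference. A quick check of the arithmetic, using only the two figures quoted above (the variable names are illustrative):

```python
# Approximate win rates in percent, as quoted in the paragraph above.
low_clarity_win_rate = 42
high_clarity_win_rate = 85

# Ratio of win rates: how many times higher the high-clarity rate is.
ratio = high_clarity_win_rate / low_clarity_win_rate

# Absolute gain, measured in percentage points.
point_gain = high_clarity_win_rate - low_clarity_win_rate

print(f"ratio: {ratio:.2f}x")
print(f"gain: {point_gain} percentage points")
```

The ratio comes out slightly above 2, which is what the 2× label summarizes.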

This study was conducted independently of 4CITE and used a different methodology. Its relevance to the 4CITE proposition: the property that the Spencer & Feldman study identified as outcome-predictive in federal litigation is precisely the property that 4CITE measures. Structural integrity is not aesthetic. It is predictive, and it was predictive in federal court before 4CITE existed to measure it.

What this adds: The Spencer & Feldman finding validates the underlying claim — that structural integrity in legal documents is a real, measurable, outcome-predictive property — using independent research that predates 4CITE. The 4CITE engine provides the automated, scalable measurement infrastructure that makes this insight operationally available.

Source: Spencer, A.B. & Feldman, R. (2018). Effective Appellate Advocacy: Brief Structure and the Win Rate in Federal Appeals. Legal Writing: The Journal of the Legal Writing Institute, 22, 61. External research; not commissioned by 4 SHIELD LLC.

Live Corpus

Scores From the 4CITE Corpus

Selected documents from the live corpus, scored by the 4CITE engine. All documents are from public record sources (PACER, CourtListener, SEC EDGAR, Congress.gov). Scores reflect the structural integrity of the document as assessed by the four-gate engine at the time of scoring.

Document	Vertical	Score	Tier	Notes
Federalist No. 51	⁴gov	91	T1	Highest G4 (Paradox Resolution) in corpus. Madison's paradox-first structure is the benchmark for foundational coherence.
Gettysburg Address	⁴gov	89	T1	272 words. Perfect alignment between stated purpose and structural behavior. No drift.
Genuine federal sanctions opinion	⁴law	88	T1	Benchmark document for the 81-point study. Genuine legal reasoning. G6 (Latent Intent) particularly strong.
Berkshire Hathaway Annual Letter 2023	⁴biz	87	T1	Highest biz-vertical score in initial corpus. G8 (Rhetorical Architecture) exemplary — engages failure honestly.
Washington Farewell Address	⁴gov	86	T1	Strong prospective accountability signal. G4 scores reflect genuine skin-in-the-game framing.
Apple Inc. 10-K FY2024	⁴biz	82	T1	30 docs scored, all T1. Composite range 67.5–88.8. G6 (Latent Intent) consistently weakest gate for AAPL — expected for large-cap institutional messaging.
Volcker Senate testimony (1981)	⁴gov	78	T1	Technically T1/T2 boundary. Scored T1 on G4 (Paradox Resolution) and G6 (Latent Intent). G8 (Rhetorical Architecture) slightly below threshold — testimony constraints noted.
Citizens United v. FEC opinion	⁴law	44	T3	G7 (Argumentative Structure) and G4 (Paradox Resolution) diverge significantly. Accountability Theater flag on the G7+G8 vs. G4+G6 gap. Not fraudulent — structural tension from contested constitutional doctrine.
State of the Union 2025	⁴gov	44	T3	Accountability Theater flag. G7 (Argumentative Structure) high; G6 (Latent Intent) and G4 (Paradox Resolution) low. Surface performance exceeds structural substance.
Zuckerberg Senate testimony (2018)	⁴gov	40	T3	G8 (Rhetorical Architecture) score: 14 out of 100. Avoided complexity at every opportunity. Boilerplate deflection detected across 344 responses.
SVB FY2022 Risk disclosures	⁴biz	22	T4	Filed December 2022. Bank failed March 2023. G4 (Paradox Resolution) and G6 (Latent Intent) collapse pattern visible in final filings. Interest rate risk acknowledged and simultaneously minimized: a structural contradiction.
Iraq WMD Senate hearing (2002)	⁴gov	21	T4	All four gates below 30. G4 (Paradox Resolution) collapse: stated certainty about WMD stockpiles has no structural support in the presented evidence.
Enron 10-K FY2000	⁴biz	8	T4	Filed February 2001. Filed for bankruptcy December 2001. Lowest biz-vertical score in current corpus. All four gates collapse. Surface language maintains performance of integrity while structural architecture shows complete foundation absence.
Mata v. Avianca brief (AI-fabricated)	⁴law	7	T4	Lowest law-vertical score in corpus. G6 (Latent Intent) near zero. Foundation collapses when citations (the stated basis) are structurally absent from the reasoning architecture.

Note on corpus scores: All corpus scores are based on publicly available documents and reflect structural analysis only. Scores do not constitute legal, financial, or compliance determinations. T3 and T4 scores indicate structural measurement results, not editorial judgments about the individuals or organizations involved. Structural tension (T3) can arise from institutional constraints, political complexity, and genuine uncertainty — not only from bad faith.
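
The Accountability Theater flag in the table is described only as a gap between the surface gates (G7, G8) and the foundational gates (G4, G6). One way such a check could be expressed is sketched below. The averaging rule, the 25-point threshold, the class and function names, and the example gate scores are all assumptions made for illustration; the source does not publish the engine's actual rule or per-gate values for these documents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateScores:
    """Per-gate scores on a 0-100 scale (hypothetical container)."""
    g4_paradox_resolution: float
    g6_latent_intent: float
    g7_argumentative_structure: float
    g8_rhetorical_architecture: float

    def surface(self) -> float:
        # Mean of the two surface gates (G7, G8).
        return (self.g7_argumentative_structure
                + self.g8_rhetorical_architecture) / 2

    def foundation(self) -> float:
        # Mean of the two foundational gates (G4, G6).
        return (self.g4_paradox_resolution + self.g6_latent_intent) / 2


def accountability_theater(scores: GateScores, threshold: float = 25.0) -> bool:
    """Flag when surface gates exceed foundational gates by >= threshold.

    The threshold value is an assumed placeholder, not a 4CITE parameter.
    """
    return scores.surface() - scores.foundation() >= threshold


# Hypothetical gate scores, for illustration only.
polished_but_hollow = GateScores(20, 25, 80, 65)
consistently_strong = GateScores(85, 88, 90, 86)

print(accountability_theater(polished_but_hollow))
print(accountability_theater(consistently_strong))
```

A gap-based rule like this flags documents whose surface argumentation outruns their foundations regardless of the overall score, which matches the table's note that a T3 flag reflects structural tension and not necessarily bad faith.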

See the engine running

4CITE is live in closed beta. Subscribe to a plan, or run a single document through a $15 walk-in cert.

Or get a walk-in cert — no account required.