daita@system:~$ cat ./legal_rag_hallucinations_stanford_study.md

Hallucination-Free? Not Even Close: Stanford Tests the Big Legal AI Tools

Created: 2026-04-15 | Size: 8671 bytes

TL;DR

A junior associate pastes Westlaw's summary into a brief. The cited case holds the opposite of what the summary says. Nobody notices until opposing counsel does. That is the kind of failure Stanford's RegLab documented in the first preregistered evaluation of the RAG-based legal AI tools sold by LexisNexis and Thomson Reuters (Journal of Empirical Legal Studies, March 2025). Across 202 real legal questions, Lexis+ AI and Ask Practical Law AI hallucinated on 1 in 6 queries. Westlaw AI-Assisted Research hallucinated on 1 in 3. RAG helps. It does not deliver the "hallucination-free" product the marketing promised.

This Is Not a General LLM Benchmark

This is a closed-universe test of vendor tools already in production at top law firms, sold under contracts that imply they are safer than a chatbot because they retrieve from real legal databases. The comparison to GPT-4 is a baseline, not the point.

The Vendor Claims on Record

The paper pins both vendors to their own marketing copy, verbatim:

  • LexisNexis (Lexis+ AI): "Lexis+ AI delivers 100% hallucination-free linked legal citations."
  • Thomson Reuters (Westlaw): "We avoid [hallucinations] by relying on the trusted content within Westlaw."

Methodology

202 queries split across four categories: general legal research (80), jurisdiction- or time-specific (70), false premise (22), and factual recall (30). Sources include LegalBench Rule QA, BARBRI bar prep, and hand-curated items probing circuit splits and overturned cases. Inter-rater agreement was Cohen's kappa 0.77, strong for a hand-graded legal task.
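
If you want to reproduce the agreement statistic on your own grading runs, here is a minimal Cohen's kappa sketch using the standard formula. It is not the paper's grading code, and the two-grader, single-label setup is an assumption.

```python
from collections import Counter

def cohens_kappa(grader_a: list[str], grader_b: list[str]) -> float:
    """Chance-corrected agreement between two graders:
    kappa = (p_o - p_e) / (1 - p_e)."""
    assert len(grader_a) == len(grader_b)
    n = len(grader_a)
    # Observed agreement: fraction of items both graders labeled the same.
    p_o = sum(x == y for x, y in zip(grader_a, grader_b)) / n
    # Expected chance agreement from each grader's label distribution.
    ca, cb = Counter(grader_a), Counter(grader_b)
    p_e = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / n**2
    return (p_o - p_e) / (1 - p_e)

# e.g. labels per response: "accurate", "incomplete", "hallucinated"
```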

Correctness vs. Groundedness

The paper's key framing splits hallucination into two axes:

  • Correctness: Does the answer state the law accurately?
  • Groundedness: Do the cited sources actually support the claims?

A response is a hallucination if it is either wrong or misgrounded. Vendors quietly redefine "grounded" to mean "we linked a real case next to the sentence," not "the case supports the sentence." A tool that cites a real Supreme Court case that says the opposite of its answer is still hallucinating.
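
To make the two-axis definition concrete, here is a minimal sketch of the scoring logic as the paper frames it; the `Judgment` structure and field names are illustrative, not the study's actual grading harness.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    """One graded response (field names are illustrative)."""
    correct: bool   # does the answer state the law accurately?
    grounded: bool  # do the cited sources actually support the claims?

def is_hallucination(j: Judgment) -> bool:
    # A response hallucinates if it fails EITHER axis. A real citation
    # sitting next to a claim it contradicts is misgrounded, and counts.
    return not (j.correct and j.grounded)

# Wrong on the law, and the citation does not back it up:
assert is_hallucination(Judgment(correct=False, grounded=False))
# Right answer, but the cited case does not support it:
assert is_hallucination(Judgment(correct=True, grounded=False))
```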

The Numbers

System | Accurate | Incomplete | Hallucinated | Mean words
--- | --- | --- | --- | ---
Lexis+ AI | 65% | 18% | 17% | 219
Westlaw AI-Assisted Research | 42% | 25% | 33% | 350
Ask Practical Law AI | 19% | 63% | 17% | 175
GPT-4 (baseline) | 49% | 8% | 43% | -

  • Westlaw is worst. It hallucinates nearly twice as often as Lexis+ AI and produces the longest answers. Longer answers correlate with more false propositions.
  • The narrower tool is not the safer tool. Ask Practical Law AI ties Lexis on hallucination rate but refuses or gives incomplete answers 63% of the time. When it does answer, it hallucinates just as often. Corpus size is not a reliability knob.
  • GPT-4 still hallucinates on 43% of queries. RAG helps, just not as much as the brochures claim.

Four Failure Modes

1. Naive retrieval. Matching keywords, wrong legal question. Dominant Lexis+ AI failure (0.47 prevalence). Semantic similarity is not legal relevance.

2. Inappropriate authority. Real case, wrong jurisdiction, overruled, or superseded. Westlaw 0.40. Knowing Roe was overturned by Dobbs is table stakes, and systems still miss it.

3. Reasoning errors. Right docs, wrong reading. The LLM inverts a holding or attributes a dissent to the majority. Westlaw 0.61, which explains why its long answers are the most dangerous.

4. Sycophancy. Agreeing with a false premise. Rare in RAG tools (0.00 to 0.06), common in GPT-4.
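
If you run a similar error analysis on your own system, tagging each hallucinated response with the modes above makes prevalence numbers like these reproducible. A minimal sketch; the enum names and the co-occurring-tag convention are assumptions, not the paper's code.

```python
from collections import Counter
from enum import Enum

class FailureMode(Enum):
    NAIVE_RETRIEVAL = "naive_retrieval"        # keyword match, wrong question
    INAPPROPRIATE_AUTHORITY = "inappropriate_authority"  # overruled / wrong jurisdiction
    REASONING_ERROR = "reasoning_error"        # right docs, wrong reading
    SYCOPHANCY = "sycophancy"                  # agreed with a false premise

def prevalence(tagged: list[list[FailureMode]]) -> dict[FailureMode, float]:
    """Fraction of hallucinated responses exhibiting each mode.
    Modes can co-occur, so the fractions need not sum to 1."""
    counts = Counter(m for tags in tagged for m in set(tags))
    return {m: counts[m] / len(tagged) for m in FailureMode}

# e.g. three hallucinated responses from a graded eval run:
prevalence([
    [FailureMode.REASONING_ERROR],
    [FailureMode.NAIVE_RETRIEVAL, FailureMode.REASONING_ERROR],
    [FailureMode.INAPPROPRIATE_AUTHORITY],
])  # reasoning_error: 0.67, naive_retrieval: 0.33, ...
```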

Failures That Do Not Look Like Failures

The errors are doctrinally subtle, not obviously wrong. A sample:

  • Westlaw claimed a U.S. Supreme Court case was reversed by the Nebraska Supreme Court, which is structurally impossible.
  • Lexis+ AI invented a nonexistent Tenth Circuit ruling on the equity "clean hands" doctrine.
  • Westlaw stated Robers v. U.S. held that collateral is a return of "any part" of a loan. The case held the exact opposite.
  • Ask Practical Law AI rewrote the FRE 804(b)(2) dying declaration exception as universal, citing a Fifth Circuit opinion that says it is not.

Paste any of these into a brief and you are the next Michael Cohen headline.

Your RAG System Has These Same Failure Modes

Swap "legal" for "clinical" or "compliance" and the typology transfers cleanly. A medical RAG chatbot will hit naive retrieval when "beta blockers for migraine" pulls a paper on beta blockers for hypertension. Inappropriate authority arrives the moment a 2011 guideline surfaces that a 2023 one has superseded. Reasoning errors happen when the LLM inverts the effect direction of an RCT it just read. Sycophancy is "isn't amoxicillin safe in pregnancy?" getting a confident yes. Every failure mode Stanford catalogued in law has a direct analogue in every RAG system shipping in regulated industries right now. The legal vertical just has the clearest vendor marketing to falsify.

The Vendor Cat and Mouse

LexisNexis shipped a "second generation" Lexis+ AI after the Stanford queries were run and declined to provide access for re-evaluation, arguing that the old results do not speak to the new version. Thomson Reuters initially denied the researchers access to Ask Practical Law AI and relented only after the study had begun. This is the shape of the next two years: researchers benchmark, vendors ship an update and declare the benchmark obsolete, and no new public number replaces it. Without preregistered, ongoing evaluation with vendor cooperation, "we fixed it" is a marketing claim, not a measurement.

What To Do If You Ship RAG

  • Measure groundedness separately from correctness. Cited ≠ supported; the sketch after this list scores them independently.
  • Track retrieval failures by cause. Naive retrieval, wrong authority, and reasoning errors need different fixes.
  • Watch answer length. Verbose outputs should trip a reviewer flag, not reassure the user.
  • Evaluate at the task the customer cares about, not the benchmark the lab published.
  • Assume users will not verify. Design for that.
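
A minimal sketch of the first two bullets wired together: groundedness scored independently of correctness, with a verbosity flag for review. The `nli_entails` callable stands in for whatever entailment model or LLM judge you use; it and every name here are assumptions, not a named library's API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    correct: bool
    grounded: bool
    verbose: bool  # trips a reviewer flag, does not reassure the user

def evaluate(
    answer: str,
    claims: list[str],          # atomic claims extracted from the answer
    cited_passages: list[str],  # full text of the sources the tool cited
    gold_correct: bool,         # correctness graded against an answer key
    nli_entails: Callable[[str, str], bool],  # (premise, claim) -> entailed?
    max_words: int = 250,       # verbosity threshold; tune on your own data
) -> EvalResult:
    # Grounded means every claim is entailed by at least one cited
    # passage. Linking a real document next to a sentence is not enough.
    grounded = all(
        any(nli_entails(p, c) for p in cited_passages) for c in claims
    )
    return EvalResult(
        correct=gold_correct,
        grounded=grounded,
        verbose=len(answer.split()) > max_words,
    )
```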

Trust Is Earned in Public

Stanford's real contribution is not the hallucination rate. It is that they preregistered the dataset, published the typology, and named the vendors. The industry has spent two years operating on vendor self-report. "Hallucination-free" is a marketing claim that never survived first contact with an independent evaluator, and the takeaway is not "avoid legal AI," it is "stop treating vendor marketing as evaluation." Until every product in this category ships public, preregistered benchmarks, the default assumption on any RAG reliability claim should be: prove it.


References

  1. Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools - Magesh, Surani, Dahl, Suzgun, Manning, Ho. Journal of Empirical Legal Studies, 2025. Original source.
  2. RegLab Dataset on Hugging Face - Preregistered dataset and outputs.
  3. Large Legal Fictions - Dahl et al., the predecessor paper on legal hallucinations in general LLMs.
  4. Here's What Happens When Your Lawyer Uses ChatGPT - Weiser 2023, the Avianca case.
  5. 2023 Year-End Report on the Federal Judiciary - Chief Justice Roberts on AI in legal practice.
  6. FTC: Keep Your AI Claims in Check - FTC guidance on AI marketing claims.
  7. Your AI Agent Aces the Benchmark. It Still Can't Be Trusted. - Daita blog on agent reliability.
  8. Knowledge Graphs for AI Agents: When Vector Search Hits a Wall - Daita blog on structured retrieval.
  9. Arize Phoenix vs Laminar: Picking the Right LLM Observability Stack - Daita blog on LLM observability.

daita@system:~$ _