How We Built a Rigorous LLM Evaluation Framework for Legal Document Analysis
When you use large language models to analyze legal documents, one question inevitably comes up:
How do you know the output is actually good?
In most domains, an imperfect answer is an inconvenience. In law, it’s a risk.
A deposition summary that misses a key admission. A contract review that overlooks a liability carve-out. An evidence analysis that misstates the timeline. These aren't cosmetic issues: they're the kinds of errors that can materially affect strategy, outcomes, and professional responsibility.
At DecoverAI, we needed a rigorous, repeatable, and defensible way to measure how well our LLMs perform on legal document analysis. So we built one. This post walks through how our evaluation framework works, what we learned building it, and why we think this approach sets a higher bar for the broader legal tech community.
The Datasets
We used discovery and court documents from the case Depp v. Heard that were available through CourtListener and are publicly accessible. These consist of deposition transcripts, trial testimony, motions, exhibits, evidentiary filings, and judicial rulings spanning multiple phases of the litigation. The dataset includes a mix of narrative testimony, adversarial questioning, procedural filings, and factual exhibits. These are exactly the kinds of heterogeneous, high-stakes documents that legal teams routinely analyze under time pressure. Importantly, the materials contain known facts, disputed narratives, and documented inconsistencies, which made them well-suited for evaluating not just surface-level summarization, but deeper legal reasoning tasks such as timeline reconstruction, issue spotting, contradiction detection, and attribution of statements to the correct speakers and sources.
The Core Idea: LLM-as-a-Judge
At the heart of our framework is a simple but powerful idea:
Use LLMs to judge LLMs.
We start with a source legal document such as a deposition, contract, or expert report. One model generates findings from that document. A separate LLM, acting as a judge, then evaluates those findings across multiple dimensions of legal quality. You can think of it as a senior attorney reviewing a junior associate's work, except the senior attorney is also an LLM operating against a precise, explicit rubric. Crucially, this is not a single thumbs-up or thumbs-down score. We evaluate outputs across multiple dimensions, each with its own criteria, scoring scale, and weight.
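To make the pattern concrete, here is a minimal sketch of the generate-then-judge loop. The `call_llm` helper, the model names, and the trimmed three-dimension rubric in the prompt are illustrative placeholders, not our production code or full rubric.

```python
import json

def call_llm(model: str, prompt: str) -> str:
    """Hypothetical provider call; swap in your actual OpenAI/Gemini/Anthropic client."""
    raise NotImplementedError

def generate_findings(document_text: str, model: str = "generator-model") -> str:
    prompt = (
        "You are a legal analyst. Extract the key findings from the document below: "
        "facts, parties, dates, admissions, and open legal issues.\n\n" + document_text
    )
    return call_llm(model, prompt)

def judge_findings(document_text: str, findings: str, model: str = "judge-model") -> dict:
    # The judge sees both the source and the findings and must return
    # dimension-level rubric scores as JSON (a trimmed three-dimension rubric here).
    prompt = (
        "You are a senior attorney reviewing a junior associate's analysis.\n"
        "Score the FINDINGS against the SOURCE on a 1-5 scale for each dimension: "
        "legal_accuracy, factual_completeness, citation_quality. Return JSON like "
        '{"legal_accuracy": 4, "factual_completeness": 5, "citation_quality": 3, '
        '"rationale": "..."}.\n\n'
        f"SOURCE:\n{document_text}\n\nFINDINGS:\n{findings}"
    )
    return json.loads(call_llm(model, prompt))
```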
Eight Dimensions of Legal Quality
Not all errors are created equal. A formatting issue is very different from incorrect legal reasoning. Our framework breaks evaluation down into eight dimensions that together capture what “good legal analysis” actually means in practice:
1. Legal Accuracy
This is the most important dimension. If a model misstates the law (confusing negligence with strict liability, for example), nothing else matters. Legal accuracy is weighted double relative to other dimensions.
2. Factual Completeness
Did the model extract all relevant facts? Dates, parties, amounts, contradictions, and key admissions all matter.
3. Evidence Identification
Does the model correctly identify what constitutes evidence, and distinguish between direct evidence, circumstantial evidence, and inference?
4. Legal Relevance
Are the findings actually relevant to the legal issues at hand, or merely accurate but beside the point?
5. Procedural Accuracy
Does the model correctly understand procedural concepts such as jurisdiction, statutes of limitation, or filing requirements?
6. Argumentation Structure
Is the reasoning coherent? Are conclusions supported by premises, or does the analysis jump to unsupported assertions?
7. Citation Quality
Does the model point to specific portions of the source document, or rely on vague references like “the testimony”?
8. Critical Issue Identification
Did the model surface the most important issues, or bury them in a wall of text?
Each dimension is scored on a 1–5 rubric with explicit criteria for every level. The final score is a weighted aggregate, with legal accuracy carrying double weight.
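As a simplified sketch of that weighted aggregate (the doubled legal-accuracy weight comes from the rubric above; the rest is illustrative), the computation looks roughly like this:

```python
# Dimension weights mirroring the rubric above: legal accuracy counts double.
DIMENSION_WEIGHTS = {
    "legal_accuracy": 2.0,
    "factual_completeness": 1.0,
    "evidence_identification": 1.0,
    "legal_relevance": 1.0,
    "procedural_accuracy": 1.0,
    "argumentation_structure": 1.0,
    "citation_quality": 1.0,
    "critical_issue_identification": 1.0,
}

def aggregate_score(dimension_scores: dict[str, float]) -> float:
    """Weighted mean of per-dimension 1-5 rubric scores."""
    total_weight = sum(DIMENSION_WEIGHTS.values())
    weighted_sum = sum(
        DIMENSION_WEIGHTS[dim] * score for dim, score in dimension_scores.items()
    )
    return weighted_sum / total_weight

# Example: an output scoring 4 everywhere except a 2 on legal accuracy is pulled
# down noticeably by the doubled weight.
example = {dim: 4 for dim in DIMENSION_WEIGHTS} | {"legal_accuracy": 2}
print(round(aggregate_score(example), 2))  # 3.56
```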
Consensus Through Multiple Judges
A single evaluator, human or machine, can be biased or inconsistent. To address this, our framework supports multi-judge consensus evaluation. Multiple LLM judges independently score the same output. We then aggregate their scores and measure inter-judge agreement. We currently use four distinct judge personas:
- A senior attorney (15+ years) focused on overall legal quality
- A litigation specialist emphasizing trial-readiness and adversarial robustness
- A legal researcher prioritizing thoroughness and citation accuracy
- An e-discovery expert evaluating document handling and evidentiary rigor
When agreement drops below a defined threshold (0.8), the disagreement itself becomes a signal, typically indicating a gray area that warrants human review.
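Here is a hedged sketch of that consensus step, assuming each judge persona returns a dict of 1-5 dimension scores. The agreement metric shown (one minus the mean normalized pairwise score gap) is illustrative; a standard inter-rater statistic such as Krippendorff's alpha could be swapped in without changing the shape of the code.

```python
from itertools import combinations
from statistics import mean

AGREEMENT_THRESHOLD = 0.8  # below this, route the output for human review

def pairwise_agreement(judge_scores: list[dict[str, int]]) -> float:
    """1.0 means identical scores; gaps are normalized by the 1-5 rubric span."""
    dims = judge_scores[0].keys()
    gaps = [
        mean(abs(a[d] - b[d]) / 4.0 for d in dims)
        for a, b in combinations(judge_scores, 2)
    ]
    return 1.0 - mean(gaps)

def consensus(judge_scores: list[dict[str, int]]) -> dict:
    dims = judge_scores[0].keys()
    agreement = pairwise_agreement(judge_scores)
    return {
        "scores": {d: mean(js[d] for js in judge_scores) for d in dims},
        "agreement": agreement,
        "needs_human_review": agreement < AGREEMENT_THRESHOLD,
    }
```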
The End-to-End Evaluation Pipeline
The framework is not just a scoring script. It’s a full pipeline from raw documents to statistically meaningful reports:
- Load source documents: Depositions, contracts, expert reports, and discovery responses, provided as text or markdown.
- Generate findings: One or more LLMs (e.g., GPT, Gemini, Claude) generate structured findings from each document.
- Evaluate: Judge models score the findings across all eight dimensions.
- Analyze: We compute means, standard deviations, correlations between dimensions, and pairwise statistical significance across models.
- Report: The system produces markdown reports and visualizations including box plots, heat maps, correlation matrices, and trend charts.
The entire pipeline runs asynchronously, allowing large evaluation runs to complete efficiently.
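The sketch below shows the asynchronous shape of such a pipeline. The `generate_findings_async` and `judge_findings_async` helpers are hypothetical stand-ins for provider calls (bodies omitted); the concurrent fan-out is the point.

```python
import asyncio

# Hypothetical awaitable helpers; bodies omitted in this sketch.
async def generate_findings_async(doc: str, model: str) -> str: ...
async def judge_findings_async(doc: str, findings: str, judge: str) -> dict: ...

async def evaluate_document(doc: str, generator: str, judges: list[str]) -> dict:
    findings = await generate_findings_async(doc, generator)
    # Fan out to all judge personas concurrently.
    judge_scores = await asyncio.gather(
        *(judge_findings_async(doc, findings, j) for j in judges)
    )
    return {"findings": findings, "judge_scores": list(judge_scores)}

async def run_pipeline(documents: list[str], generator: str, judges: list[str]) -> list[dict]:
    # Evaluate every document concurrently; a semaphore would cap provider load.
    return await asyncio.gather(
        *(evaluate_document(doc, generator, judges) for doc in documents)
    )
```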
Five Types of Experiments We Run
We designed the framework to answer the real questions teams face when deploying legal AI:
Baseline Comparison
Compare multiple models on the same document set to determine which performs best overall.
Prompt Ablation
Test how different prompting strategies affect performance: chain-of-thought, examples, system prompt length, and more.
Consensus Reliability
Measure how stable and repeatable the judges themselves are over time.
Document Type Analysis
Break down performance by document category. A model strong on depositions may struggle with contracts or expert reports.
Legal Domain Specificity
Analyze performance across practice areas, helping us route documents to the models best suited for them.
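One way to keep these experiment types runnable from a single entry point is to express them as configuration that a runner dispatches on. The names below are hypothetical and shown only to illustrate the shape, not our internal API.

```python
from dataclasses import dataclass, field
from enum import Enum

class ExperimentType(Enum):
    BASELINE_COMPARISON = "baseline_comparison"
    PROMPT_ABLATION = "prompt_ablation"
    CONSENSUS_RELIABILITY = "consensus_reliability"
    DOCUMENT_TYPE_ANALYSIS = "document_type_analysis"
    LEGAL_DOMAIN_SPECIFICITY = "legal_domain_specificity"

@dataclass
class ExperimentConfig:
    experiment: ExperimentType
    models: list[str]                 # generator models under test
    judges: list[str]                 # judge personas to apply
    document_glob: str = "data/*.md"  # which source documents to load
    prompt_variants: list[str] = field(default_factory=list)  # used by ablations
```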
What We Learned
A few insights stood out while building and running this framework:
- Weighted dimensions matter. Equal weighting produced misleading rankings. A polished but legally incorrect output is worse than a rough but accurate one. Doubling the weight of legal accuracy materially changed model selection decisions.
- Multiple judges are essential. Single-judge evaluations showed ~15% variance on repeat runs. Adding judges significantly reduced noise and improved trustworthiness.
- Document type is a confounding variable. Early results favored one model until we controlled for document type. The "better" model was simply stronger on the most common document type in our test set.
- Statistical significance prevents false confidence. Apparent score differences often disappear under t-tests. We require p < 0.05 before declaring one model meaningfully better than another (sketched below).
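The significance check from that last point can be sketched with SciPy's Welch t-test on per-document aggregate scores. The p < 0.05 threshold mirrors the post; the surrounding plumbing is illustrative, and a paired test would look much the same.

```python
from scipy import stats

def significantly_better(scores_a: list[float], scores_b: list[float],
                         alpha: float = 0.05) -> bool:
    """True only if model A scores higher on average AND the gap is significant."""
    _, p_value = stats.ttest_ind(scores_a, scores_b, equal_var=False)  # Welch's t-test
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)
    return mean_a > mean_b and p_value < alpha
```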
What the Results Actually Showed
A framework is only as valuable as the signal it produces. Here’s what we observed when we ran our evaluation suite against a representative set of legal documents (depositions, contracts, and discovery materials), using our baseline production configuration.
Overall Performance by Dimension
Across most dimensions, the system performed strongly and consistently:
- Evidence Detection - Average score: 0.96 | Pass rate: 100%. The model reliably identified what constituted evidence and distinguished factual testimony from inference.
- Completeness Check - Average score: 1.00 | Pass rate: 100%. All materially relevant facts (dates, parties, monetary figures, and key statements) were captured across the evaluated documents.
- Strategy & Planning - Average score: 0.91 | Pass rate: 100%. The findings aligned well with downstream legal strategy, surfacing issues that would plausibly inform motion practice, discovery scope, or deposition planning.
- Tool Efficiency - Average score: 0.85 | Pass rate: 100%. Outputs were concise, well-structured, and usable without unnecessary verbosity, an important signal for real-world legal workflows.
- Reasoning Transparency - Average score: 0.87 | Pass rate: 100%. Judges consistently found that conclusions were traceable to underlying facts and reasoning steps, rather than appearing as unexplained assertions.
The Hard Part: Legal Reasoning
One dimension stood out, and it is precisely why we built the framework.
- Legal Reasoning - Average score: 0.57 | Pass rate: 33.3%
In plain terms: this is where the model struggled, and the evaluation framework surfaced it clearly. The judges flagged issues such as:
- Overgeneralized legal conclusions not sufficiently tied to jurisdiction-specific doctrine
- Correct factual extraction paired with incomplete or imprecise legal analysis
- Reasoning that was directionally plausible but not yet “partner-grade”
This was not a surprise, and it was not a failure of the evaluation system; surfacing exactly this kind of weakness was the point.
Without dimension-level scoring and explicit rubrics, these weaknesses would have been masked by strong performance elsewhere. A single aggregate score would have suggested the system was “good enough.” Our framework made it obvious where focused improvement was required.
Why These Results Matter
Two takeaways were especially important for us:
- High pass rates elsewhere do not compensate for weak legal reasoning. This validated our decision to weight legal accuracy and reasoning more heavily than other dimensions.
- The framework gives us a precise roadmap for improvement. Instead of vague intuition ("this feels a bit off"), we can target legal reasoning specifically, through better prompts, domain-specific routing, or human-anchored calibration, without regressing on areas where the system already performs well.
Just as importantly, these results are reproducible. Re-running the same evaluations yields stable scores within a narrow variance band, especially when using multi-judge consensus.
How This Integrates Into DecoverAI
The evaluation framework is tightly integrated into our production stack:
- We use the same set of abstractions for both evaluation and production, ensuring we evaluate the exact code paths that run in real workflows.
- Configuration flows from a centralized config system, keeping model identifiers and provider settings consistent (see the sketch below).
- Evaluation traces can be routed through third-party tools (such as Langfuse) for observability, tracking cost, latency, and trends over time.
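A minimal sketch of the shared-config idea, with hypothetical names (the real system is larger): a single frozen settings object that both the production path and the evaluation harness import, so model identifiers and provider settings cannot drift apart.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelSettings:
    generator_model: str
    judge_models: tuple[str, ...]
    provider: str
    temperature: float = 0.0  # deterministic scoring keeps eval runs reproducible

# Loaded once (e.g., from environment or a config file) and imported by both the
# production analysis code and the evaluation harness.
SETTINGS = ModelSettings(
    generator_model="production-generator",  # placeholder identifiers
    judge_models=("judge-a", "judge-b", "judge-c", "judge-d"),
    provider="example-provider",
)
```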
What’s Next
The framework is already production-ready, but we’re extending it further:
- Human-in-the-loop calibration, anchoring LLM judges to attorney-scored gold standards
- Temporal drift detection, flagging performance degradation as document distributions change
- Expanded domain-specific rubrics, tuned to particular practice areas
Why This Matters
You cannot responsibly deploy AI to analyze depositions, contracts, and evidence without knowing, quantitatively, how good it is. This framework gives us that confidence. More importantly, it gives us a way to maintain that confidence as models, prompts, and legal contexts evolve.
The framework spans roughly 2,600 lines of Python across six modules. While built for legal analysis, the underlying patterns (weighted multi-dimension scoring, multi-judge consensus, statistical testing, and document-aware evaluation) are broadly applicable to any high-stakes domain. If you're building LLM systems where correctness truly matters, this is the bar we believe the industry needs to meet.