Legal Discovery Methodology Documentation
Case: Jennifer Capasso v. Memorial Sloan Kettering Cancer Center
Document Purpose
This methodology documentation satisfies legal counsel requirements for
defensible, documented discovery processes with human verification.
Case Background
- Plaintiff: Jennifer Capasso
- Defendant: Memorial Sloan Kettering Cancer Center (MSK)
- Claim: Discrimination based on gender identity
- Data Source: Signal chat messages (200,000 messages over 6 years)
- Format: CSV with message, timestamp, sender columns
Subpoena Criteria (Complete)
Messages responsive if they relate to:
- Jennifer Capasso's treatment at MSK
- Complaints to MSK staff about Jennifer Capasso
- Requests to update Jennifer Capasso's pronouns/gender markers at MSK
- Gender markers for Jennifer Capasso at other hospitals
- Prior discrimination Jennifer Capasso experienced (any setting)
- Jennifer Capasso's March 7, 2022 surgery at MSK
- Emotional distress/economic loss from MSK treatment
Methodology Overview
Hybrid approach combining:
- Text normalization and keyword expansion
- Semantic analysis via embeddings
- Large language model classification
- Human verification
Stage 1: Text Normalization
Purpose: Improve matching accuracy
Process:
- Lowercase conversion
- Abbreviation expansion (MSK → Memorial Sloan Kettering, etc.)
- Preserve original text for production
Rationale: Informal chat language requires normalization for consistent matching
Stage 2: Chunk Creation
Purpose: Preserve conversational context
Parameters:
- Chunk size: 20 messages
- Overlap: 5 messages
- Rationale: Balances context preservation with focused analysis
Context Preservation:
- Topics may reappear after hundreds of messages
- Overlapping chunks ensure no context loss at boundaries
- LLM analyzes chunks as conversational units
Stage 3: Keyword Filtering
Purpose: Initial reduction while maintaining high recall
Keywords Derived From:
- Plaintiff name variations
- Facility names (MSK, Memorial Sloan Kettering, etc.)
- Treatment terms (surgery, doctor, appointment, etc.)
- Discrimination terms (bias, unfair, misgendered, etc.)
- Specific dates (March 7, 2022)
- Emotional distress indicators
Expected Reduction: ~50%
Rationale: Conservative keyword matching ensures high recall
Stage 4: Semantic Filtering
Purpose: Capture semantic meaning beyond exact keywords
Model: sentence-transformers/all-MiniLM-L6-v2
- Open source, well-validated
- Runs locally (no data transmission)
- Efficient for large corpora
Process:
- Generate query vectors from each subpoena criterion
- Compute embeddings for all chunks
- Calculate cosine similarity
- Filter by threshold (0.25 = conservative for high recall)
Expected Reduction: Additional 40-50% (80-90% total)
Rationale: Semantic similarity captures implicit references and synonyms
Stage 5: LLM Classification
Purpose: Detailed analysis with reasoning
Model: OpenAI GPT-4o-mini via Batch API
- Cost-effective ($0.05-0.10 for entire corpus)
- High accuracy for legal text analysis
- Batch API: 50% cost savings, no data retention
Prompt Design:
- Includes complete subpoena criteria
- Provides case context
- Explicitly instructs to err on side of over-inclusion
- Requests structured JSON output with reasoning
- Analyzes chunks in conversational context
Temperature: 0.1 (low for consistency)
Output Format:
{
"chunk_responsive": true/false,
"responsive_line_numbers": [list],
"reasoning": "explanation",
"confidence": "high/medium/low",
"key_topics": ["topics"]
}
Expected Processing Time: 2-12 hours (typically 4-6)
Stage 6: Human Verification
Purpose: Final review and production decisions
Process:
- All responsive messages reviewed by case team
- High confidence messages: Full review
- Medium confidence messages: Sample review
- Low confidence messages: Spot check
- Sample of non-responsive messages reviewed for false negatives
- Attorney approval before production
Redaction Capability:
- Spreadsheet format allows row-level or cell-level redaction
- Non-responsive portions can be marked/deleted
- Maintains audit trail of redaction decisions
Quality Assurance Measures
- Reproducibility: All parameters documented and saved
- Audit Trail: Complete log of filtering decisions
- Confidence Scoring: Enables risk-based review prioritization
- Statistical Validation: Sample testing on subset before full run
- Human Oversight: Multiple review stages
- Documentation: Methodology, prompts, and results preserved
Recall vs Precision Balance
Approach: Err on side of OVER-INCLUSION (high recall)
Rationale:
- Legal discovery favors over-production vs under-production
- Human review filters false positives
- Conservative thresholds at each stage
- Explicit LLM instruction to include borderline cases
Expected Performance:
- Recall: 85-95% (captures most responsive messages)
- Precision: 60-80% (some false positives acceptable)
- Human review corrects false positives
Limitations and Mitigations
Limitation 1: Attachments not included in initial analysis
- Mitigation: Review attachments for responsive messages after text analysis
Limitation 2: Context limited to 20-message chunks
- Mitigation: Overlapping chunks, can increase size if needed
Limitation 3: LLM may miss highly implicit references
- Mitigation: Conservative filtering, human review, false negative sampling
Limitation 4: Informal language and abbreviations
- Mitigation: Text normalization, abbreviation expansion
Cost and Timeline
Budget: $100 allocated
Actual Cost: $0.05-0.10 (OpenAI Batch API)
Timeline: 24 hours (including wait time)
Labor: 10-30 hours manual review (vs 200+ hours full manual)
Defensibility
This methodology is defensible because:
- Documented: Complete documentation of all steps
- Reproducible: Saved parameters and prompts
- Validated: Human verification at multiple stages
- Conservative: Errs on side of over-inclusion
- Transparent: Reasoning provided for each classification
- Auditable: Complete trail of decisions
- Approved: Legal counsel reviewed and approved approach
Conclusion
This hybrid methodology balances efficiency with accuracy while maintaining
high recall as required for legal discovery. The multi-stage approach with
human verification ensures defensible results suitable for production in
response to the subpoena.
Prepared: December 7, 2025
Case: Jennifer Capasso v. Memorial Sloan Kettering Cancer Center