METHODOLOGY_DOCUMENTATION.md 6.6 KB

Legal Discovery Methodology Documentation

Case: Jennifer Capasso v. Memorial Sloan Kettering Cancer Center

Document Purpose

This methodology documentation satisfies legal counsel requirements for defensible, documented discovery processes with human verification.

Case Background

  • Plaintiff: Jennifer Capasso
  • Defendant: Memorial Sloan Kettering Cancer Center (MSK)
  • Claim: Discrimination based on gender identity
  • Data Source: Signal chat messages (200,000 messages over 6 years)
  • Format: CSV with message, timestamp, sender columns

Subpoena Criteria (Complete)

Messages responsive if they relate to:

  1. Jennifer Capasso's treatment at MSK
  2. Complaints to MSK staff about Jennifer Capasso
  3. Requests to update Jennifer Capasso's pronouns/gender markers at MSK
  4. Gender markers for Jennifer Capasso at other hospitals
  5. Prior discrimination Jennifer Capasso experienced (any setting)
  6. Jennifer Capasso's March 7, 2022 surgery at MSK
  7. Emotional distress/economic loss from MSK treatment

Methodology Overview

Hybrid approach combining:

  • Text normalization and keyword expansion
  • Semantic analysis via embeddings
  • Large language model classification
  • Human verification

Stage 1: Text Normalization

Purpose: Improve matching accuracy

Process:

  • Lowercase conversion
  • Abbreviation expansion (MSK → Memorial Sloan Kettering, etc.)
  • Preserve original text for production

Rationale: Informal chat language requires normalization for consistent matching

Stage 2: Chunk Creation

Purpose: Preserve conversational context

Parameters:

  • Chunk size: 20 messages
  • Overlap: 5 messages
  • Rationale: Balances context preservation with focused analysis

Context Preservation:

  • Topics may reappear after hundreds of messages
  • Overlapping chunks ensure no context loss at boundaries
  • LLM analyzes chunks as conversational units

Stage 3: Keyword Filtering

Purpose: Initial reduction while maintaining high recall

Keywords Derived From:

  • Plaintiff name variations
  • Facility names (MSK, Memorial Sloan Kettering, etc.)
  • Treatment terms (surgery, doctor, appointment, etc.)
  • Discrimination terms (bias, unfair, misgendered, etc.)
  • Specific dates (March 7, 2022)
  • Emotional distress indicators

Expected Reduction: ~50%

Rationale: Conservative keyword matching ensures high recall

Stage 4: Semantic Filtering

Purpose: Capture semantic meaning beyond exact keywords

Model: sentence-transformers/all-MiniLM-L6-v2

  • Open source, well-validated
  • Runs locally (no data transmission)
  • Efficient for large corpora

Process:

  1. Generate query vectors from each subpoena criterion
  2. Compute embeddings for all chunks
  3. Calculate cosine similarity
  4. Filter by threshold (0.25 = conservative for high recall)

Expected Reduction: Additional 40-50% (80-90% total)

Rationale: Semantic similarity captures implicit references and synonyms

Stage 5: LLM Classification

Purpose: Detailed analysis with reasoning

Model: OpenAI GPT-4o-mini via Batch API

  • Cost-effective ($0.05-0.10 for entire corpus)
  • High accuracy for legal text analysis
  • Batch API: 50% cost savings, no data retention

Prompt Design:

  • Includes complete subpoena criteria
  • Provides case context
  • Explicitly instructs to err on side of over-inclusion
  • Requests structured JSON output with reasoning
  • Analyzes chunks in conversational context

Temperature: 0.1 (low for consistency)

Output Format:

{
  "chunk_responsive": true/false,
  "responsive_line_numbers": [list],
  "reasoning": "explanation",
  "confidence": "high/medium/low",
  "key_topics": ["topics"]
}

Expected Processing Time: 2-12 hours (typically 4-6)

Stage 6: Human Verification

Purpose: Final review and production decisions

Process:

  1. All responsive messages reviewed by case team
  2. High confidence messages: Full review
  3. Medium confidence messages: Sample review
  4. Low confidence messages: Spot check
  5. Sample of non-responsive messages reviewed for false negatives
  6. Attorney approval before production

Redaction Capability:

  • Spreadsheet format allows row-level or cell-level redaction
  • Non-responsive portions can be marked/deleted
  • Maintains audit trail of redaction decisions

Quality Assurance Measures

  1. Reproducibility: All parameters documented and saved
  2. Audit Trail: Complete log of filtering decisions
  3. Confidence Scoring: Enables risk-based review prioritization
  4. Statistical Validation: Sample testing on subset before full run
  5. Human Oversight: Multiple review stages
  6. Documentation: Methodology, prompts, and results preserved

Recall vs Precision Balance

Approach: Err on side of OVER-INCLUSION (high recall)

Rationale:

  • Legal discovery favors over-production vs under-production
  • Human review filters false positives
  • Conservative thresholds at each stage
  • Explicit LLM instruction to include borderline cases

Expected Performance:

  • Recall: 85-95% (captures most responsive messages)
  • Precision: 60-80% (some false positives acceptable)
  • Human review corrects false positives

Limitations and Mitigations

Limitation 1: Attachments not included in initial analysis

  • Mitigation: Review attachments for responsive messages after text analysis

Limitation 2: Context limited to 20-message chunks

  • Mitigation: Overlapping chunks, can increase size if needed

Limitation 3: LLM may miss highly implicit references

  • Mitigation: Conservative filtering, human review, false negative sampling

Limitation 4: Informal language and abbreviations

  • Mitigation: Text normalization, abbreviation expansion

Cost and Timeline

Budget: $100 allocated Actual Cost: $0.05-0.10 (OpenAI Batch API) Timeline: 24 hours (including wait time) Labor: 10-30 hours manual review (vs 200+ hours full manual)

Defensibility

This methodology is defensible because:

  1. Documented: Complete documentation of all steps
  2. Reproducible: Saved parameters and prompts
  3. Validated: Human verification at multiple stages
  4. Conservative: Errs on side of over-inclusion
  5. Transparent: Reasoning provided for each classification
  6. Auditable: Complete trail of decisions
  7. Approved: Legal counsel reviewed and approved approach

Conclusion

This hybrid methodology balances efficiency with accuracy while maintaining high recall as required for legal discovery. The multi-stage approach with human verification ensures defensible results suitable for production in response to the subpoena.


Prepared: December 7, 2025 Case: Jennifer Capasso v. Memorial Sloan Kettering Cancer Center