# Legal Discovery Methodology Documentation

## Case: Jennifer Capasso v. Memorial Sloan Kettering Cancer Center

### Document Purpose

This methodology documentation satisfies legal counsel requirements for defensible, documented discovery processes with human verification.

### Case Background

- Plaintiff: Jennifer Capasso
- Defendant: Memorial Sloan Kettering Cancer Center (MSK)
- Claim: Discrimination based on gender identity
- Data Source: Signal chat messages (200,000 messages over 6 years)
- Format: CSV with message, timestamp, sender columns

### Subpoena Criteria (Complete)

Messages responsive if they relate to:

1. Jennifer Capasso's treatment at MSK
2. Complaints to MSK staff about Jennifer Capasso
3. Requests to update Jennifer Capasso's pronouns/gender markers at MSK
4. Gender markers for Jennifer Capasso at other hospitals
5. Prior discrimination Jennifer Capasso experienced (any setting)
6. Jennifer Capasso's March 7, 2022 surgery at MSK
7. Emotional distress/economic loss from MSK treatment

### Methodology Overview

Hybrid approach combining:

- Text normalization and keyword expansion
- Semantic analysis via embeddings
- Large language model classification
- Human verification

### Stage 1: Text Normalization

**Purpose**: Improve matching accuracy

**Process**:

- Lowercase conversion
- Abbreviation expansion (MSK → Memorial Sloan Kettering, etc.)
- Preserve original text for production

**Rationale**: Informal chat language requires normalization for consistent matching

### Stage 2: Chunk Creation

**Purpose**: Preserve conversational context

**Parameters**:

- Chunk size: 20 messages
- Overlap: 5 messages
- Rationale: Balances context preservation with focused analysis

**Context Preservation**:

- Topics may reappear after hundreds of messages
- Overlapping chunks ensure no context loss at boundaries
- LLM analyzes chunks as conversational units

### Stage 3: Keyword Filtering

**Purpose**: Initial reduction while maintaining high recall

**Keywords Derived From**:

- Plaintiff name variations
- Facility names (MSK, Memorial Sloan Kettering, etc.)
- Treatment terms (surgery, doctor, appointment, etc.)
- Discrimination terms (bias, unfair, misgendered, etc.)
- Specific dates (March 7, 2022)
- Emotional distress indicators

**Expected Reduction**: ~50%

**Rationale**: Conservative keyword matching ensures high recall

### Stage 4: Semantic Filtering

**Purpose**: Capture semantic meaning beyond exact keywords

**Model**: sentence-transformers/all-MiniLM-L6-v2

- Open source, well-validated
- Runs locally (no data transmission)
- Efficient for large corpora

**Process**:

1. Generate query vectors from each subpoena criterion
2. Compute embeddings for all chunks
3. Calculate cosine similarity
4. Filter by threshold (0.25 = conservative for high recall)

**Expected Reduction**: Additional 40-50% (80-90% total)

**Rationale**: Semantic similarity captures implicit references and synonyms

### Stage 5: LLM Classification

**Purpose**: Detailed analysis with reasoning

**Model**: OpenAI GPT-4o-mini via Batch API

- Cost-effective ($0.05-0.10 for entire corpus)
- High accuracy for legal text analysis
- Batch API: 50% cost savings, no data retention

**Prompt Design**:

- Includes complete subpoena criteria
- Provides case context
- Explicitly instructs to err on side of over-inclusion
- Requests structured JSON output with reasoning
- Analyzes chunks in conversational context

**Temperature**: 0.1 (low for consistency)

**Output Format**:

```json
{
  "chunk_responsive": true/false,
  "responsive_line_numbers": [list],
  "reasoning": "explanation",
  "confidence": "high/medium/low",
  "key_topics": ["topics"]
}
```

**Expected Processing Time**: 2-12 hours (typically 4-6)

### Stage 6: Human Verification

**Purpose**: Final review and production decisions

**Process**:

1. All responsive messages reviewed by case team
2. High confidence messages: Full review
3. Medium confidence messages: Sample review
4. Low confidence messages: Spot check
5. Sample of non-responsive messages reviewed for false negatives
6. Attorney approval before production

**Redaction Capability**:

- Spreadsheet format allows row-level or cell-level redaction
- Non-responsive portions can be marked/deleted
- Maintains audit trail of redaction decisions

### Quality Assurance Measures

1. **Reproducibility**: All parameters documented and saved
2. **Audit Trail**: Complete log of filtering decisions
3. **Confidence Scoring**: Enables risk-based review prioritization
4. **Statistical Validation**: Sample testing on subset before full run
5. **Human Oversight**: Multiple review stages
6. **Documentation**: Methodology, prompts, and results preserved

### Recall vs Precision Balance

**Approach**: Err on side of OVER-INCLUSION (high recall)

**Rationale**:

- Legal discovery favors over-production vs under-production
- Human review filters false positives
- Conservative thresholds at each stage
- Explicit LLM instruction to include borderline cases

**Expected Performance**:

- Recall: 85-95% (captures most responsive messages)
- Precision: 60-80% (some false positives acceptable)
- Human review corrects false positives

### Limitations and Mitigations

**Limitation 1**: Attachments not included in initial analysis

- **Mitigation**: Review attachments for responsive messages after text analysis

**Limitation 2**: Context limited to 20-message chunks

- **Mitigation**: Overlapping chunks, can increase size if needed

**Limitation 3**: LLM may miss highly implicit references

- **Mitigation**: Conservative filtering, human review, false negative sampling

**Limitation 4**: Informal language and abbreviations

- **Mitigation**: Text normalization, abbreviation expansion

### Cost and Timeline

**Budget**: $100 allocated

**Actual Cost**: $0.05-0.10 (OpenAI Batch API)

**Timeline**: 24 hours (including wait time)

**Labor**: 10-30 hours manual review (vs 200+ hours full manual)

### Defensibility

This methodology is defensible because:

1. **Documented**: Complete documentation of all steps
2. **Reproducible**: Saved parameters and prompts
3. **Validated**: Human verification at multiple stages
4. **Conservative**: Errs on side of over-inclusion
5. **Transparent**: Reasoning provided for each classification
6. **Auditable**: Complete trail of decisions
7. **Approved**: Legal counsel reviewed and approved approach

### Conclusion

This hybrid methodology balances efficiency with accuracy while maintaining high recall as required for legal discovery. The multi-stage approach with human verification ensures defensible results suitable for production in response to the subpoena.
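
### Illustrative Sketch: Stage 2 Chunking

The Stage 2 parameters (20-message chunks with a 5-message overlap) imply a sliding window that advances 15 messages at a time. The following is a minimal sketch of that windowing, not the code used in this matter. The function name `make_chunks`, the dictionary keys, and the synthetic messages are assumptions introduced only for illustration.

```python
from typing import Dict, List


def make_chunks(messages: List[str], chunk_size: int = 20, overlap: int = 5) -> List[Dict]:
    """Split messages into overlapping windows (hypothetical helper).

    Each chunk records the half-open range [start, end) of original row
    indices so that chunk-level decisions can be traced back to the CSV.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    stride = chunk_size - overlap  # 20 - 5 = 15 with the documented parameters
    chunks: List[Dict] = []
    start = 0
    while start < len(messages):
        end = min(start + chunk_size, len(messages))
        chunks.append({"start": start, "end": end, "messages": messages[start:end]})
        if end == len(messages):
            break  # final window reached; avoid emitting a redundant tail chunk
        start += stride
    return chunks


# Synthetic stand-in for the message column of the CSV (assumed data).
demo_messages = [f"message {i}" for i in range(50)]
demo_chunks = make_chunks(demo_messages)
print([(c["start"], c["end"]) for c in demo_chunks])
# → [(0, 20), (15, 35), (30, 50)]
```

Recording `start` and `end` for each chunk would let a reviewer map any chunk-level classification back to specific rows in the original export, which is one way to support the audit trail described under Quality Assurance Measures.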

---

Prepared: December 7, 2025

Case: Jennifer Capasso v. Memorial Sloan Kettering Cancer Center