# Signal Chat Discovery - Step-by-Step Guide

## Jennifer Capasso v. Memorial Sloan Kettering Cancer Center

### Prerequisites

- Signal chat exported to CSV with columns: message, timestamp, sender
- Python 3.8+ installed
- OpenAI account with $100 credit
- ~24-hour timeline for the automated pipeline

### Step 1: Setup (15 minutes)

```bash
chmod +x install.sh
./install.sh
```

### Step 2: Run Local Filtering (2-3 hours)

```bash
python signal_chat_discovery_complete.py
```

This will:

- Load your CSV
- Create overlapping chunks (20 messages per chunk, 5-message overlap)
- Apply the keyword filter (~50% reduction)
- Apply the semantic filter (~80-90% total reduction)
- Generate batch_requests.jsonl

Expected output: ~300-500 chunks for LLM processing. (Sketches of these stages, and of the later steps, appear in the appendix at the end of this guide.)

### Step 3: Submit to OpenAI Batch API (5 minutes)

Option A - Via Web Interface:

1. Go to platform.openai.com/batches
2. Click "Create batch"
3. Upload batch_requests.jsonl
4. Wait for completion (2-12 hours, typically 4-6)
5. Download batch_results.jsonl

Option B - Via API:

```python
from openai import OpenAI

client = OpenAI()

batch_input_file = client.files.create(
    file=open("discovery_results/batch_requests.jsonl", "rb"),
    purpose="batch",
)

batch = client.batches.create(
    input_file_id=batch_input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print(f"Batch ID: {batch.id}")
# Check status: client.batches.retrieve(batch.id)
```

### Step 4: Process Results (1 hour)

```python
from signal_chat_discovery_complete import SignalChatDiscovery

discovery = SignalChatDiscovery('signal_messages.csv')
df = discovery.load_and_preprocess()
results_df = discovery.process_batch_results('batch_results.jsonl', df)
```

Output: discovery_results.xlsx

### Step 5: Manual Review

1. Open discovery_results.xlsx
2. Filter by responsive='YES'
3. Review high-confidence messages first
4. Sample the medium- and low-confidence messages
5. Add a 'redacted' column for non-responsive portions
6. Export the final production set

### Cost Breakdown

- Keyword filtering: $0 (local)
- Semantic filtering: $0 (local)
- OpenAI Batch API: $0.05-$0.10
- Total: < $1 (well under the $100 budget)

### Timeline

- Setup: 15 min
- Local filtering: 2-3 hours
- Batch submission: 5 min
- OpenAI processing: 2-12 hours (wait time)
- Results processing: 1 hour
- Manual review: 10-30 hours
- Total: ~24 hours for the automated pipeline; manual review adds the 10-30 hours above

### Troubleshooting

- If CSV columns don't match: check the column names in your CSV
- If filtering is too aggressive: lower the semantic threshold to 0.20
- If filtering is too lenient: raise the semantic threshold to 0.30
- If over budget: use gpt-3.5-turbo instead of gpt-4o-mini

### Quality Assurance

- Spot-check keyword matches
- Verify semantic scores make sense
- Review a sample of LLM classifications
- Test on a small subset first (1,000 messages)
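### Appendix: Reference Sketches

The packaged script already implements each stage below; these sketches are minimal stand-ins for adapting or debugging it, and every function name, keyword, prompt, and file name in them is an illustrative assumption rather than the script's actual internals.

Chunking (Step 2) - with 20-message chunks and a 5-message overlap, each chunk advances by 15 fresh messages:

```python
def make_chunks(messages, size=20, overlap=5):
    """Overlapping windows: each chunk holds up to `size` messages and
    repeats the last `overlap` messages of the previous chunk."""
    step = size - overlap  # 15 fresh messages per chunk
    chunks = []
    for i in range(0, len(messages), step):
        chunks.append("\n".join(messages[i:i + size]))
        if i + size >= len(messages):  # final window reached the end
            break
    return chunks

# e.g. chunks = make_chunks(df["message"].astype(str).tolist())
```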
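Keyword filter (Step 2) - a case-insensitive regex pass over the chunk text; the keyword list is a placeholder, so substitute terms drawn from the discovery requests:

```python
import re

# Hypothetical keyword list -- replace with terms from the discovery requests.
KEYWORDS = ["keyword_one", "keyword_two", "keyword_three"]
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)

def keyword_filter(chunks):
    """Keep only chunks containing at least one keyword."""
    return [c for c in chunks if PATTERN.search(c)]
```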
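Semantic filter (Step 2) - one way to produce scores on the 0.20-0.30 scale used in Troubleshooting is cosine similarity against query embeddings; the embedding model and query wording here are assumptions:

```python
from sentence_transformers import SentenceTransformer, util

# Model choice and queries are assumptions -- tune both to the case.
model = SentenceTransformer("all-MiniLM-L6-v2")
QUERIES = [
    "example discovery topic one",
    "example discovery topic two",
]

def semantic_filter(chunks, threshold=0.25):
    """Keep chunks whose best cosine similarity to any query clears the
    threshold (lower to 0.20 or raise to 0.30 per Troubleshooting)."""
    q_emb = model.encode(QUERIES, convert_to_tensor=True, normalize_embeddings=True)
    c_emb = model.encode(chunks, convert_to_tensor=True, normalize_embeddings=True)
    best = util.cos_sim(c_emb, q_emb).max(dim=1).values  # one score per chunk
    return [c for c, s in zip(chunks, best) if float(s) >= threshold]
```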
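Batch request file (Steps 2-3) - each line of batch_requests.jsonl is one request in the Batch API's documented custom_id/method/url/body envelope; the classification prompt below is a stand-in for whatever the script actually sends:

```python
import json

# Hypothetical prompt -- the real wording lives in the script.
SYSTEM_PROMPT = (
    "You are reviewing chat messages for litigation discovery. "
    "Answer whether the messages are responsive (YES/NO) with a "
    "HIGH/MEDIUM/LOW confidence."
)

def write_batch_requests(chunks, path="discovery_results/batch_requests.jsonl",
                         model="gpt-4o-mini"):
    """Write one Batch API request per chunk."""
    with open(path, "w", encoding="utf-8") as f:
        for i, chunk in enumerate(chunks):
            request = {
                "custom_id": f"chunk-{i}",  # ties results back to chunks
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "temperature": 0,
                    "messages": [
                        {"role": "system", "content": SYSTEM_PROMPT},
                        {"role": "user", "content": chunk},
                    ],
                },
            }
            f.write(json.dumps(request) + "\n")
```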
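Polling and download (Step 3, Option B) - rather than watching the dashboard, you can poll until the batch reaches a terminal state and then save the output file:

```python
import time
from openai import OpenAI

client = OpenAI()

def wait_and_download(batch_id, out_path="batch_results.jsonl", poll_seconds=300):
    """Poll every few minutes; save the results once the batch completes."""
    terminal = {"completed", "failed", "expired", "cancelled"}
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in terminal:
            break
        time.sleep(poll_seconds)
    if batch.status != "completed":
        raise RuntimeError(f"Batch ended with status {batch.status}")
    content = client.files.content(batch.output_file_id)
    with open(out_path, "wb") as f:
        f.write(content.read())
```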
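Parsing results (Step 4) - each line of batch_results.jsonl pairs your custom_id with a standard chat-completion body; a minimal reader looks like this:

```python
import json

def parse_batch_results(path="batch_results.jsonl"):
    """Map each custom_id to the model's reply text; skip failed requests."""
    answers = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if rec.get("error"):
                continue  # triage errored requests separately
            body = rec["response"]["body"]
            answers[rec["custom_id"]] = body["choices"][0]["message"]["content"]
    return answers
```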
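Review queue (Step 5) - a pandas pass that applies the responsive='YES' filter, orders by confidence, and adds the 'redacted' column; the column names and values are assumed to match what the script writes to the spreadsheet:

```python
import pandas as pd

df = pd.read_excel("discovery_results.xlsx")

# Responsive rows, highest confidence first.
order = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}
queue = df[df["responsive"] == "YES"].copy()
queue = queue.sort_values("confidence",
                          key=lambda s: s.str.upper().map(order))
queue["redacted"] = ""  # filled in during manual review
queue.to_excel("review_queue.xlsx", index=False)
```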
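Cost sanity check (Cost Breakdown) - a back-of-envelope estimate; the token counts are rough guesses and the rates reflect gpt-4o-mini Batch API pricing at the time of writing, so verify current pricing before relying on this:

```python
# Batch pricing assumed: $0.075 per 1M input tokens, $0.30 per 1M output
# tokens (half the standard gpt-4o-mini rate).
chunks = 400                 # midpoint of the expected 300-500
in_tokens_per_chunk = 900    # ~20 short messages plus the prompt (rough)
out_tokens_per_chunk = 100

cost = (chunks * in_tokens_per_chunk / 1e6) * 0.075 \
     + (chunks * out_tokens_per_chunk / 1e6) * 0.30
print(f"${cost:.2f}")        # ~$0.04, consistent with the $0.05-$0.10 line item
```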
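Subset test (Quality Assurance) - carve off the first 1,000 messages as a dry-run corpus, then point the pipeline at the sample file before committing the full export:

```python
import pandas as pd

pd.read_csv("signal_messages.csv").head(1000) \
  .to_csv("signal_messages_sample.csv", index=False)
```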