# Signal Chat Discovery - Step-by-Step Guide

## Jennifer Capasso v. Memorial Sloan Kettering Cancer Center

### Prerequisites

- Signal chat exported to CSV with columns: message, timestamp, sender
- Python 3.8+ installed
- OpenAI account with $100 credit
- ~24-hour timeline for the automated pipeline

### Step 1: Setup (15 minutes)

```bash
chmod +x install.sh
./install.sh
```

### Step 2: Run Local Filtering (2-3 hours)

```bash
python signal_chat_discovery_complete.py
```

This will:

- Load your CSV
- Create overlapping chunks (20 messages per chunk, 5-message overlap)
- Apply the keyword filter (~50% reduction)
- Apply the semantic filter (~80-90% total reduction)
- Generate batch_requests.jsonl

Expected output: ~300-500 chunks for LLM processing. (Sketches of these stages, and of the later steps, appear in the appendix at the end of this guide.)

### Step 3: Submit to OpenAI Batch API (5 minutes)

Option A - Via Web Interface:

1. Go to platform.openai.com/batches
2. Click "Create batch"
3. Upload batch_requests.jsonl
4. Wait for completion (2-12 hours, typically 4-6)
5. Download batch_results.jsonl

Option B - Via API:

```python
from openai import OpenAI

client = OpenAI()

batch_input_file = client.files.create(
    file=open("discovery_results/batch_requests.jsonl", "rb"),
    purpose="batch",
)

batch = client.batches.create(
    input_file_id=batch_input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print(f"Batch ID: {batch.id}")
# Check status: client.batches.retrieve(batch.id)
```

### Step 4: Process Results (1 hour)

```python
from signal_chat_discovery_complete import SignalChatDiscovery

discovery = SignalChatDiscovery('signal_messages.csv')
df = discovery.load_and_preprocess()
results_df = discovery.process_batch_results('batch_results.jsonl', df)
```

Output: discovery_results.xlsx

### Step 5: Manual Review

1. Open discovery_results.xlsx
2. Filter by responsive='YES'
3. Review high-confidence messages first
4. Sample the medium- and low-confidence messages
5. Add a 'redacted' column for non-responsive portions
6. Export the final production set

### Cost Breakdown

- Keyword filtering: $0 (local)
- Semantic filtering: $0 (local)
- OpenAI Batch API: $0.05-$0.10
- Total: < $1 (well under the $100 budget)

### Timeline

- Setup: 15 min
- Local filtering: 2-3 hours
- Batch submission: 5 min
- OpenAI processing: 2-12 hours (wait time)
- Results processing: 1 hour
- Manual review: 10-30 hours
- Total: ~24 hours for the automated pipeline; manual review adds the 10-30 hours above

### Troubleshooting

- If CSV columns don't match: check the column names in your CSV
- If filtering is too aggressive: lower the semantic threshold to 0.20
- If filtering is too lenient: raise the semantic threshold to 0.30
- If over budget: use gpt-3.5-turbo instead of gpt-4o-mini

### Quality Assurance

- Spot-check keyword matches
- Verify semantic scores make sense
- Review a sample of LLM classifications
- Test on a small subset first (1,000 messages)
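### Appendix: Reference Sketches

The packaged script already implements each stage below; these sketches are minimal stand-ins for adapting or debugging it, and every function name, keyword, prompt, and file name in them is an illustrative assumption rather than the script's actual internals.

Chunking (Step 2) - with 20-message chunks and a 5-message overlap, each chunk advances by 15 fresh messages:

```python
def make_chunks(messages, size=20, overlap=5):
    """Overlapping windows: each chunk holds up to `size` messages and
    repeats the last `overlap` messages of the previous chunk."""
    step = size - overlap  # 15 fresh messages per chunk
    chunks = []
    for i in range(0, len(messages), step):
        chunks.append("\n".join(messages[i:i + size]))
        if i + size >= len(messages):  # final window reached the end
            break
    return chunks

# e.g. chunks = make_chunks(df["message"].astype(str).tolist())
```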
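Keyword filter (Step 2) - a case-insensitive regex pass over the chunk text; the keyword list is a placeholder, so substitute terms drawn from the discovery requests:

```python
import re

# Hypothetical keyword list -- replace with terms from the discovery requests.
KEYWORDS = ["keyword_one", "keyword_two", "keyword_three"]
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)

def keyword_filter(chunks):
    """Keep only chunks containing at least one keyword."""
    return [c for c in chunks if PATTERN.search(c)]
```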
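Semantic filter (Step 2) - one way to produce scores on the 0.20-0.30 scale used in Troubleshooting is cosine similarity against query embeddings; the embedding model and query wording here are assumptions:

```python
from sentence_transformers import SentenceTransformer, util

# Model choice and queries are assumptions -- tune both to the case.
model = SentenceTransformer("all-MiniLM-L6-v2")
QUERIES = [
    "example discovery topic one",
    "example discovery topic two",
]

def semantic_filter(chunks, threshold=0.25):
    """Keep chunks whose best cosine similarity to any query clears the
    threshold (lower to 0.20 or raise to 0.30 per Troubleshooting)."""
    q_emb = model.encode(QUERIES, convert_to_tensor=True, normalize_embeddings=True)
    c_emb = model.encode(chunks, convert_to_tensor=True, normalize_embeddings=True)
    best = util.cos_sim(c_emb, q_emb).max(dim=1).values  # one score per chunk
    return [c for c, s in zip(chunks, best) if float(s) >= threshold]
```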
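Batch request file (Steps 2-3) - each line of batch_requests.jsonl is one request in the Batch API's documented custom_id/method/url/body envelope; the classification prompt below is a stand-in for whatever the script actually sends:

```python
import json

# Hypothetical prompt -- the real wording lives in the script.
SYSTEM_PROMPT = (
    "You are reviewing chat messages for litigation discovery. "
    "Answer whether the messages are responsive (YES/NO) with a "
    "HIGH/MEDIUM/LOW confidence."
)

def write_batch_requests(chunks, path="discovery_results/batch_requests.jsonl",
                         model="gpt-4o-mini"):
    """Write one Batch API request per chunk."""
    with open(path, "w", encoding="utf-8") as f:
        for i, chunk in enumerate(chunks):
            request = {
                "custom_id": f"chunk-{i}",  # ties results back to chunks
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "temperature": 0,
                    "messages": [
                        {"role": "system", "content": SYSTEM_PROMPT},
                        {"role": "user", "content": chunk},
                    ],
                },
            }
            f.write(json.dumps(request) + "\n")
```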
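Polling and download (Step 3, Option B) - rather than watching the dashboard, you can poll until the batch reaches a terminal state and then save the output file:

```python
import time
from openai import OpenAI

client = OpenAI()

def wait_and_download(batch_id, out_path="batch_results.jsonl", poll_seconds=300):
    """Poll every few minutes; save the results once the batch completes."""
    terminal = {"completed", "failed", "expired", "cancelled"}
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in terminal:
            break
        time.sleep(poll_seconds)
    if batch.status != "completed":
        raise RuntimeError(f"Batch ended with status {batch.status}")
    content = client.files.content(batch.output_file_id)
    with open(out_path, "wb") as f:
        f.write(content.read())
```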
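Parsing results (Step 4) - each line of batch_results.jsonl pairs your custom_id with a standard chat-completion body; a minimal reader looks like this:

```python
import json

def parse_batch_results(path="batch_results.jsonl"):
    """Map each custom_id to the model's reply text; skip failed requests."""
    answers = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if rec.get("error"):
                continue  # triage errored requests separately
            body = rec["response"]["body"]
            answers[rec["custom_id"]] = body["choices"][0]["message"]["content"]
    return answers
```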
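Review queue (Step 5) - a pandas pass that applies the responsive='YES' filter, orders by confidence, and adds the 'redacted' column; the column names and values are assumed to match what the script writes to the spreadsheet:

```python
import pandas as pd

df = pd.read_excel("discovery_results.xlsx")

# Responsive rows, highest confidence first.
order = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}
queue = df[df["responsive"] == "YES"].copy()
queue = queue.sort_values("confidence",
                          key=lambda s: s.str.upper().map(order))
queue["redacted"] = ""  # filled in during manual review
queue.to_excel("review_queue.xlsx", index=False)
```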
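Cost sanity check (Cost Breakdown) - a back-of-envelope estimate; the token counts are rough guesses and the rates reflect gpt-4o-mini Batch API pricing at the time of writing, so verify current pricing before relying on this:

```python
# Batch pricing assumed: $0.075 per 1M input tokens, $0.30 per 1M output
# tokens (half the standard gpt-4o-mini rate).
chunks = 400                 # midpoint of the expected 300-500
in_tokens_per_chunk = 900    # ~20 short messages plus the prompt (rough)
out_tokens_per_chunk = 100

cost = (chunks * in_tokens_per_chunk / 1e6) * 0.075 \
     + (chunks * out_tokens_per_chunk / 1e6) * 0.30
print(f"${cost:.2f}")        # ~$0.04, consistent with the $0.05-$0.10 line item
```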
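Subset test (Quality Assurance) - carve off the first 1,000 messages as a dry-run corpus, then point the pipeline at the sample file before committing the full export:

```python
import pandas as pd

pd.read_csv("signal_messages.csv").head(1000) \
  .to_csv("signal_messages_sample.csv", index=False)
```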