Signal Chat Discovery - Step-by-Step Guide
Jennifer Capasso v. Memorial Sloan Kettering Cancer Center
Prerequisites
- Signal chat exported to CSV with columns: message, timestamp, sender (a schema-check snippet follows this list)
- Python 3.8+ installed
- OpenAI account with $100 credit
- ~24-hour turnaround for the automated steps; manual review takes additional time (see Timeline)
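Once Step 1 has installed pandas, you can confirm the export matches the expected schema with a short check like the one below (the filename signal_messages.csv follows the code in Step 4):

# Sanity-check the Signal export before running the pipeline.
import pandas as pd

df = pd.read_csv("signal_messages.csv")
missing = {"message", "timestamp", "sender"} - set(df.columns)
if missing:
    raise SystemExit(f"CSV is missing expected columns: {missing}")
print(f"{len(df)} messages loaded; columns look correct.")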
Step 1: Setup (15 minutes)
chmod +x install.sh
./install.sh
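If the installer finishes without errors, a quick import check confirms the environment. The package list below is an assumption about what install.sh provisions; adjust it to match your script.

# Verify the core dependencies import cleanly.
# Assumed package list; edit to match what install.sh actually installs.
import importlib

for pkg in ("pandas", "openpyxl", "sentence_transformers", "openai"):
    try:
        importlib.import_module(pkg)
        print(f"OK: {pkg}")
    except ImportError:
        print(f"MISSING: {pkg} -- re-run install.sh or install it manually")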
Step 2: Run Local Filtering (2-3 hours)
python signal_chat_discovery_complete.py
This will:
- Load your CSV
- Create overlapping chunks (20 messages per chunk, 5-message overlap; see the sketch after this list)
- Apply keyword filter (~50% reduction)
- Apply semantic filter (~80-90% total reduction)
- Generate batch_requests.jsonl
Expected output: ~300-500 chunks for LLM processing
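For reference, the windowing arithmetic behind the overlapping chunks looks like this. It is an illustrative sketch only; the real chunking lives in signal_chat_discovery_complete.py.

# Illustrative only: 20-message windows with a 5-message overlap (stride of 15).
def make_chunks(messages, chunk_size=20, overlap=5):
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(messages), stride):
        window = messages[start:start + chunk_size]
        if window:
            chunks.append(window)
        if start + chunk_size >= len(messages):
            break
    return chunks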
Step 3: Submit to OpenAI Batch API (5 minutes)
Option A - Via Web Interface:
- Go to platform.openai.com/batches
- Click "Create batch"
- Upload batch_requests.jsonl
- Wait for completion (2-12 hours, typically 4-6)
- Download batch_results.jsonl
Option B - Via API:
from openai import OpenAI
client = OpenAI()
batch_input_file = client.files.create(
file=open("discovery_results/batch_requests.jsonl", "rb"),
purpose="batch"
)
batch = client.batches.create(
input_file_id=batch_input_file.id,
endpoint="/v1/chat/completions",
completion_window="24h"
)
print(f"Batch ID: {batch.id}")
# Check status: client.batches.retrieve(batch.id)
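To poll for completion and save the output under the filename Step 4 expects, something like the following works (same client and batch objects as above):

# Poll until the batch reaches a terminal state, then save the results.
import time

while True:
    status = client.batches.retrieve(batch.id)
    print(f"Batch status: {status.status}")
    if status.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(300)  # re-check every 5 minutes

if status.status == "completed":
    content = client.files.content(status.output_file_id)
    with open("batch_results.jsonl", "w") as f:
        f.write(content.text)  # Step 4 reads batch_results.jsonl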
Step 4: Process Results (1 hour)
from signal_chat_discovery_complete import SignalChatDiscovery

# Rebuild the preprocessed message DataFrame, then merge in the batch results
discovery = SignalChatDiscovery('signal_messages.csv')
df = discovery.load_and_preprocess()
results_df = discovery.process_batch_results('batch_results.jsonl', df)
Output: discovery_results.xlsx
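A quick tally before opening the spreadsheet helps scope the review. The column names 'responsive' and 'confidence' are assumed to match the output described in Step 5; adjust if your spreadsheet differs.

# Optional: summarize the LLM classifications before manual review.
print(results_df["responsive"].value_counts())
if "confidence" in results_df.columns:
    hits = results_df[results_df["responsive"] == "YES"]
    print(hits["confidence"].value_counts())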
Step 5: Manual Review (10-30 hours)
- Open discovery_results.xlsx
- Filter by responsive='YES'
- Review high confidence messages first
- Sample medium/low confidence
- Add 'redacted' column for non-responsive portions
- Export the final production set (a pandas sketch follows this list)
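A minimal pandas sketch of that workflow is below. The review itself happens in Excel, and the output filename production_set.xlsx is a placeholder.

# Load the review spreadsheet, keep responsive rows, add a redaction column,
# and write a working production set. Adjust filenames to your conventions.
import pandas as pd

review = pd.read_excel("discovery_results.xlsx")
responsive = review[review["responsive"] == "YES"].copy()
responsive["redacted"] = ""  # reviewers note non-responsive portions here
responsive.to_excel("production_set.xlsx", index=False)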
Cost Breakdown
- Keyword filtering: $0 (local)
- Semantic filtering: $0 (local)
- OpenAI Batch API: $0.05-$0.10 (see the estimate sketch after this list)
- Total: < $1 (well under $100 budget)
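A back-of-the-envelope check on the Batch API figure. Every number below is an assumption; confirm current gpt-4o-mini Batch pricing before relying on it.

# Rough cost estimate only -- token counts and rates are assumptions.
chunks = 400                      # midpoint of the 300-500 chunk estimate
input_tokens = chunks * 800       # assumed ~800 input tokens per 20-message chunk
output_tokens = chunks * 100      # assumed short classification responses
in_rate, out_rate = 0.075, 0.30   # assumed $/1M tokens with the 50% batch discount
cost = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
print(f"Estimated Batch API cost: ~${cost:.2f}")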
Timeline
- Setup: 15 min
- Local filtering: 2-3 hours
- Batch submission: 5 min
- OpenAI processing: 2-12 hours (wait time)
- Results processing: 1 hour
- Manual review: 10-30 hours
- Total: ~24 hours to a reviewable spreadsheet; manual review is on top of this
Troubleshooting
- If CSV columns don't match: rename your columns to message, timestamp, and sender, or adjust the load step in signal_chat_discovery_complete.py
- If filtering is too aggressive: lower the semantic threshold to 0.20 (see the sketch after this list)
- If filtering is too lenient: raise the semantic threshold to 0.30
- If over budget: raise the semantic threshold so fewer chunks go to the API; note that gpt-4o-mini is already cheaper per token than gpt-3.5-turbo, so switching models will not reduce cost
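For context on what the threshold controls, here is an illustrative sketch of cosine-similarity filtering with sentence-transformers. The script's actual implementation may differ, and the topic list below is a placeholder.

# Illustrative only: how a semantic threshold gates chunks.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed local embedding model
topics = ["placeholder discovery topic 1", "placeholder discovery topic 2"]
topic_emb = model.encode(topics, convert_to_tensor=True)

def passes_semantic_filter(chunk_text, threshold=0.25):
    chunk_emb = model.encode(chunk_text, convert_to_tensor=True)
    score = util.cos_sim(chunk_emb, topic_emb).max().item()
    # Lowering the threshold (0.20) keeps more chunks; raising it (0.30) keeps fewer.
    return score >= threshold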
Quality Assurance
- Spot-check keyword matches
- Verify semantic scores make sense
- Review sample of LLM classifications
- Test on a small subset first (1,000 messages; a snippet for creating the subset follows)
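To create the test subset (the filename test_subset.csv is a placeholder):

# Slice the first 1,000 messages into a small file for a dry run.
import pandas as pd

pd.read_csv("signal_messages.csv").head(1000).to_csv("test_subset.csv", index=False)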