
Signal Chat Discovery - Step-by-Step Guide

Jennifer Capasso v. Memorial Sloan Kettering Cancer Center

Prerequisites

  • Signal chat exported to CSV with columns: message, timestamp, sender
  • Python 3.8+ installed
  • OpenAI account with $100 credit
  • ~24-hour timeline for the automated pipeline (manual review takes longer; see Timeline)

Step 1: Setup (15 minutes)

chmod +x install.sh
./install.sh

Step 2: Run Local Filtering (2-3 hours)

python signal_chat_discovery_complete.py

This will:

  • Load your CSV
  • Create overlapping chunks (20 messages, 5 overlap)
  • Apply keyword filter (~50% reduction)
  • Apply semantic filter (~80-90% total reduction)
  • Generate batch_requests.jsonl

Expected output: ~300-500 chunks for LLM processing
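The chunking and keyword stages above can be sketched as follows (function names and the keyword list are illustrative, not the script's actual identifiers):

```python
# Sketch of the overlapping-chunk and keyword-filter stages.
# KEYWORDS and the matching rule (case-insensitive substring) are assumptions.
KEYWORDS = {"retaliation", "complaint", "hr"}

def make_chunks(messages, size=20, overlap=5):
    """Split a message list into overlapping chunks of `size` messages."""
    step = size - overlap  # 15-message stride yields a 5-message overlap
    return [messages[i:i + size] for i in range(0, len(messages), step)]

def keyword_hit(chunk, keywords=KEYWORDS):
    """Keep a chunk if any message mentions any keyword."""
    return any(k in msg.lower() for msg in chunk for k in keywords)

msgs = [f"message {i}" for i in range(50)] + ["Filed an HR complaint today"]
chunks = make_chunks(msgs)
kept = [c for c in chunks if keyword_hit(c)]
```

The overlap ensures a conversation that straddles a chunk boundary still appears whole in at least one chunk.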

Step 3: Submit to OpenAI Batch API (5 minutes)

Option A - Via Web Interface:

  1. Go to platform.openai.com/batches
  2. Click "Create batch"
  3. Upload batch_requests.jsonl
  4. Wait for completion (2-12 hours, typically 4-6)
  5. Download batch_results.jsonl

Option B - Via API:

from openai import OpenAI
client = OpenAI()

batch_input_file = client.files.create(
    file=open("discovery_results/batch_requests.jsonl", "rb"),
    purpose="batch"
)

batch = client.batches.create(
    input_file_id=batch_input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

print(f"Batch ID: {batch.id}")
# Check status: client.batches.retrieve(batch.id)

Step 4: Process Results (1 hour)

from signal_chat_discovery_complete import SignalChatDiscovery

discovery = SignalChatDiscovery('signal_messages.csv')
df = discovery.load_and_preprocess()
results_df = discovery.process_batch_results('batch_results.jsonl', df)

Output: discovery_results.xlsx
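Each line of batch_results.jsonl follows OpenAI's batch output shape: a custom_id paired with the model's reply. A minimal parsing sketch, assuming the prompt asked the model to return a small JSON object with responsive and confidence fields (the reply schema is an assumption):

```python
import json

# A synthetic batch_results.jsonl line, in OpenAI's documented batch
# output shape; the inner JSON reply is whatever your prompt requested.
line = json.dumps({
    "custom_id": "chunk-0042",
    "response": {"body": {"choices": [{"message": {"content":
        '{"responsive": "YES", "confidence": "high"}'}}]}},
})

record = json.loads(line)  # outer envelope: custom_id + response
reply = json.loads(
    record["response"]["body"]["choices"][0]["message"]["content"]
)  # inner JSON classification emitted by the model
```

The custom_id is how process_batch_results maps each classification back to its source chunk.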

Step 5: Manual Review

  1. Open discovery_results.xlsx
  2. Filter by responsive='YES'
  3. Review high-confidence messages first
  4. Spot-check a sample of medium- and low-confidence messages
  5. Add 'redacted' column for non-responsive portions
  6. Export final production set
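Steps 2-5 above can be scripted; a pandas sketch, assuming the spreadsheet uses responsive and confidence columns as described (an in-memory frame stands in for pd.read_excel("discovery_results.xlsx")):

```python
import pandas as pd

# Stand-in for: df = pd.read_excel("discovery_results.xlsx")
df = pd.DataFrame({
    "chunk_id": [1, 2, 3, 4],
    "responsive": ["YES", "NO", "YES", "YES"],
    "confidence": ["high", "high", "low", "medium"],
})

# Filter to responsive chunks and sort so high-confidence rows come first.
order = {"high": 0, "medium": 1, "low": 2}
review = (df[df["responsive"] == "YES"]
          .assign(rank=lambda d: d["confidence"].map(order))
          .sort_values("rank")
          .drop(columns="rank"))
review["redacted"] = ""  # column for reviewers to flag non-responsive portions
```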

Cost Breakdown

  • Keyword filtering: $0 (local)
  • Semantic filtering: $0 (local)
  • OpenAI Batch API: $0.05-$0.10
  • Total: < $1 (well under $100 budget)
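A back-of-envelope check of that estimate (token counts and per-token prices below are assumptions; verify against OpenAI's current pricing page):

```python
# Batch pricing is roughly half of standard; these rates are assumed.
PRICE_IN_PER_1M = 0.075    # assumed gpt-4o-mini batch input $/1M tokens
PRICE_OUT_PER_1M = 0.30    # assumed gpt-4o-mini batch output $/1M tokens

chunks = 500               # upper end of the expected 300-500 chunks
tokens_in_per_chunk = 800  # ~20 messages plus the instruction prompt (assumed)
tokens_out_per_chunk = 60  # short JSON classification (assumed)

cost = (chunks * tokens_in_per_chunk / 1e6 * PRICE_IN_PER_1M
        + chunks * tokens_out_per_chunk / 1e6 * PRICE_OUT_PER_1M)
```

Under these assumptions the run lands around four cents, consistent with the $0.05-$0.10 line item above.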

Timeline

  • Setup: 15 min
  • Local filtering: 2-3 hours
  • Batch submission: 5 min
  • OpenAI processing: 2-12 hours (wait time)
  • Results processing: 1 hour
  • Manual review: 10-30 hours
  • Total: ~24 hours for the automated pipeline, plus manual review

Troubleshooting

  • If CSV columns don't match: Rename your export's columns to message, timestamp, sender (or update the names the script expects)
  • If filtering too aggressive: Lower semantic threshold to 0.20
  • If filtering too lenient: Raise semantic threshold to 0.30
  • If over budget: Tighten the semantic threshold to send fewer chunks to the API (gpt-4o-mini is already among the cheapest models; gpt-3.5-turbo costs more per token)
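To see how moving the semantic threshold changes what survives filtering, here is a toy cosine-similarity example (the vectors and chunk names are illustrative; the real pipeline scores chunks against embeddings of case topics):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

topic = [1.0, 0.0]  # stand-in for a case-topic embedding
chunk_scores = {c: cosine(v, topic) for c, v in {
    "chunk-a": [0.9, 0.1],   # clearly on-topic
    "chunk-b": [0.3, 1.0],   # borderline (score ~0.29)
    "chunk-c": [0.0, 1.0],   # off-topic
}.items()}

keep_at_025 = [c for c, s in chunk_scores.items() if s >= 0.25]
keep_at_030 = [c for c, s in chunk_scores.items() if s >= 0.30]
```

Raising the threshold from 0.25 to 0.30 drops the borderline chunk; lowering it keeps more, which is why the threshold is the first knob to turn in either direction.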

Quality Assurance

  • Spot-check keyword matches
  • Verify semantic scores make sense
  • Review sample of LLM classifications
  • Test on small subset first (1000 messages)
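Building the small test subset is a one-liner with pandas (the sample path is illustrative; an in-memory frame stands in for the real CSV here):

```python
import pandas as pd

# Stand-in for: df = pd.read_csv("signal_messages.csv")
df = pd.DataFrame({"message": [f"m{i}" for i in range(5000)],
                   "timestamp": range(5000),
                   "sender": ["a"] * 5000})

sample = df.head(1000)  # first 1000 messages for a smoke-test run
# In practice: sample.to_csv("signal_messages_sample.csv", index=False)
# then point the script at the sample file before the full run.
```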