Signal Chat Discovery - Step-by-Step Guide
Jennifer Capasso v. Memorial Sloan Kettering Cancer Center
Prerequisites
- Signal chat exported to CSV with columns: message, timestamp, sender (a schema-check snippet follows this list)
- Python 3.8+ installed
- OpenAI account with $100 credit
- ~24-hour turnaround for the automated steps; manual review takes additional time (see Timeline)
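Once Step 1 has installed pandas, you can confirm the export matches the expected schema with a short check like the one below (the filename signal_messages.csv follows the code in Step 4):

# Sanity-check the Signal export before running the pipeline.
import pandas as pd

df = pd.read_csv("signal_messages.csv")
missing = {"message", "timestamp", "sender"} - set(df.columns)
if missing:
    raise SystemExit(f"CSV is missing expected columns: {missing}")
print(f"{len(df)} messages loaded; columns look correct.")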
Step 1: Setup (15 minutes)
chmod +x install.sh
./install.sh
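If the installer finishes without errors, a quick import check confirms the environment. The package list below is an assumption about what install.sh provisions; adjust it to match your script.

# Verify the core dependencies import cleanly.
# Assumed package list; edit to match what install.sh actually installs.
import importlib

for pkg in ("pandas", "openpyxl", "sentence_transformers", "openai"):
    try:
        importlib.import_module(pkg)
        print(f"OK: {pkg}")
    except ImportError:
        print(f"MISSING: {pkg} -- re-run install.sh or install it manually")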
Step 2: Run Local Filtering (2-3 hours)
python signal_chat_discovery_complete.py
This will:
- Load your CSV
- Create overlapping chunks (20 messages per chunk, 5-message overlap; see the sketch after this list)
- Apply keyword filter (~50% reduction)
- Apply semantic filter (~80-90% total reduction)
- Generate batch_requests.jsonl
Expected output: ~300-500 chunks for LLM processing
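For reference, the windowing arithmetic behind the overlapping chunks looks like this. It is an illustrative sketch only; the real chunking lives in signal_chat_discovery_complete.py.

# Illustrative only: 20-message windows with a 5-message overlap (stride of 15).
def make_chunks(messages, chunk_size=20, overlap=5):
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(messages), stride):
        window = messages[start:start + chunk_size]
        if window:
            chunks.append(window)
        if start + chunk_size >= len(messages):
            break
    return chunks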
Step 3: Submit to OpenAI Batch API (5 minutes)
Option A - Via Web Interface:
- Go to platform.openai.com/batches
- Click "Create batch"
- Upload batch_requests.jsonl
- Wait for completion (2-12 hours, typically 4-6)
- Download batch_results.jsonl
Option B - Via API:
from openai import OpenAI
client = OpenAI()
batch_input_file = client.files.create(
file=open("discovery_results/batch_requests.jsonl", "rb"),
purpose="batch"
)
batch = client.batches.create(
input_file_id=batch_input_file.id,
endpoint="/v1/chat/completions",
completion_window="24h"
)
print(f"Batch ID: {batch.id}")
# Check status: client.batches.retrieve(batch.id)
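To poll for completion and save the output under the filename Step 4 expects, something like the following works (same client and batch objects as above):

# Poll until the batch reaches a terminal state, then save the results.
import time

while True:
    status = client.batches.retrieve(batch.id)
    print(f"Batch status: {status.status}")
    if status.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(300)  # re-check every 5 minutes

if status.status == "completed":
    content = client.files.content(status.output_file_id)
    with open("batch_results.jsonl", "w") as f:
        f.write(content.text)  # Step 4 reads batch_results.jsonl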
Step 4: Process Results (1 hour)
from signal_chat_discovery_complete import SignalChatDiscovery

# Rebuild the preprocessed message DataFrame, then merge in the batch results
discovery = SignalChatDiscovery('signal_messages.csv')
df = discovery.load_and_preprocess()
results_df = discovery.process_batch_results('batch_results.jsonl', df)
Output: discovery_results.xlsx
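A quick tally before opening the spreadsheet helps scope the review. The column names 'responsive' and 'confidence' are assumed to match the output described in Step 5; adjust if your spreadsheet differs.

# Optional: summarize the LLM classifications before manual review.
print(results_df["responsive"].value_counts())
if "confidence" in results_df.columns:
    hits = results_df[results_df["responsive"] == "YES"]
    print(hits["confidence"].value_counts())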
Step 5: Manual Review (10-30 hours)
- Open discovery_results.xlsx
- Filter by responsive='YES'
- Review high confidence messages first
- Sample medium/low confidence
- Add 'redacted' column for non-responsive portions
- Export the final production set (a pandas sketch follows this list)
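A minimal pandas sketch of that workflow is below. The review itself happens in Excel, and the output filename production_set.xlsx is a placeholder.

# Load the review spreadsheet, keep responsive rows, add a redaction column,
# and write a working production set. Adjust filenames to your conventions.
import pandas as pd

review = pd.read_excel("discovery_results.xlsx")
responsive = review[review["responsive"] == "YES"].copy()
responsive["redacted"] = ""  # reviewers note non-responsive portions here
responsive.to_excel("production_set.xlsx", index=False)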
Cost Breakdown
- Keyword filtering: $0 (local)
- Semantic filtering: $0 (local)
- OpenAI Batch API: $0.05-$0.10 (see the estimate sketch after this list)
- Total: < $1 (well under $100 budget)
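A back-of-the-envelope check on the Batch API figure. Every number below is an assumption; confirm current gpt-4o-mini Batch pricing before relying on it.

# Rough cost estimate only -- token counts and rates are assumptions.
chunks = 400                      # midpoint of the 300-500 chunk estimate
input_tokens = chunks * 800       # assumed ~800 input tokens per 20-message chunk
output_tokens = chunks * 100      # assumed short classification responses
in_rate, out_rate = 0.075, 0.30   # assumed $/1M tokens with the 50% batch discount
cost = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
print(f"Estimated Batch API cost: ~${cost:.2f}")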
Timeline
- Setup: 15 min
- Local filtering: 2-3 hours
- Batch submission: 5 min
- OpenAI processing: 2-12 hours (wait time)
- Results processing: 1 hour
- Manual review: 10-30 hours
- Total: ~24 hours to a reviewable spreadsheet; manual review is on top of this
Troubleshooting
- If CSV columns don't match: rename your columns to message, timestamp, and sender, or adjust the load step in signal_chat_discovery_complete.py
- If filtering is too aggressive: lower the semantic threshold to 0.20 (see the sketch after this list)
- If filtering is too lenient: raise the semantic threshold to 0.30
- If over budget: raise the semantic threshold so fewer chunks go to the API; note that gpt-4o-mini is already cheaper per token than gpt-3.5-turbo, so switching models will not reduce cost
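For context on what the threshold controls, here is an illustrative sketch of cosine-similarity filtering with sentence-transformers. The script's actual implementation may differ, and the topic list below is a placeholder.

# Illustrative only: how a semantic threshold gates chunks.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed local embedding model
topics = ["placeholder discovery topic 1", "placeholder discovery topic 2"]
topic_emb = model.encode(topics, convert_to_tensor=True)

def passes_semantic_filter(chunk_text, threshold=0.25):
    chunk_emb = model.encode(chunk_text, convert_to_tensor=True)
    score = util.cos_sim(chunk_emb, topic_emb).max().item()
    # Lowering the threshold (0.20) keeps more chunks; raising it (0.30) keeps fewer.
    return score >= threshold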
Quality Assurance
- Spot-check keyword matches
- Verify semantic scores make sense
- Review sample of LLM classifications
- Test on a small subset first (1,000 messages; a snippet for creating the subset follows)
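To create the test subset (the filename test_subset.csv is a placeholder):

# Slice the first 1,000 messages into a small file for a dry run.
import pandas as pd

pd.read_csv("signal_messages.csv").head(1000).to_csv("test_subset.csv", index=False)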