# Qwen 3 + Qwen 2.5 Pipeline - Complete Usage Guide

## Overview

This object-oriented pipeline processes Signal chat messages for legal discovery using:

- **Primary Model**: Qwen 3 235B (state-of-the-art, April 2025)
- **Secondary Model**: Qwen 2.5 72B (previously benchmarked at 24.85%)
- **Architecture**: Object-oriented with base classes and inheritance
- **Total Cost**: $515-968 (including attorney labeling)

## Installation

```bash
cd pipeline
pip install -r requirements.txt
```

## Step-by-Step Usage

### Step 1: Run Preprocessing

```bash
python main_pipeline.py /path/to/signal_messages.csv --step preprocess
```

This will:

1. Load and normalize 200K messages
2. Create 20-message chunks with 5-message overlap
3. Apply keyword filtering (~60% reduction)
4. Apply dual-model semantic filtering (~97% total reduction)
5. Select 20 random stratified samples
6. Generate the attorney labeling template
7. Prepare inference requests

**Output**: `pipeline_output/attorney_labeling_template.txt`

### Step 2: Attorney Completes Labeling

The attorney reviews and labels 15-20 sample messages in the template:

- Mark each as RESPONSIVE: YES or NO
- Provide REASONING for the decision
- Note which CRITERIA matched (1-7)

**Time**: 2-2.5 hours

**Cost**: $500-937 @ $250-375/hr

### Step 3: Deploy Models

```python
from pipeline.utils.deployment_helper import ModelDeployer

deployer = ModelDeployer()
deployer.print_deployment_instructions()
```

**On Vast.ai GPU 1 (4 × A100):**

```bash
pip install vllm transformers accelerate
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-235B-Instruct \
    --tensor-parallel-size 4 \
    --quantization awq \
    --port 8000 \
    --max-model-len 4096
```

**On Vast.ai GPU 2 (2 × A100):**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 2 \
    --port 8001 \
    --max-model-len 4096
```

**Cost**: $3.84/hr × 4-8 hours = $15.36-30.72

### Step 4: Run Inference

```bash
python utils/inference_runner.py \
    pipeline_output/dual_qwen_inference_requests.jsonl \
    --qwen3-url http://localhost:8000 \
    --qwen25-url http://localhost:8001
```

This runs inference on both models and saves the results to:

- `pipeline_output/qwen3_results.jsonl`
- `pipeline_output/qwen25_results.jsonl`

### Step 5: Merge Results

```bash
python main_pipeline.py /path/to/signal_messages.csv --step merge \
    --qwen3-results pipeline_output/qwen3_results.jsonl \
    --qwen25-results pipeline_output/qwen25_results.jsonl
```

This merges the two result sets with confidence scoring (a sketch of the logic follows below):

- **High confidence**: Both models agree
- **Medium confidence**: One model flags
- **Low confidence**: Disagreement

**Output**: `pipeline_output/merged_results.json`
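The actual scoring lives in the pipeline's merge step; the sketch below only illustrates how agreement-based tiering of this kind typically works. The JSONL schema (a `chunk_id` field and a three-valued `verdict`) is an assumption for illustration, not the pipeline's real format.

```python
import json

# A minimal sketch of agreement-based confidence tiering, assuming each
# results line looks like {"chunk_id": ..., "verdict": "yes"|"no"|"unsure"}.
# The pipeline's real field names may differ; adjust to match.

def load_verdicts(path: str) -> dict:
    """Load one model's JSONL results, keyed by chunk ID."""
    verdicts = {}
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            verdicts[record["chunk_id"]] = record["verdict"]
    return verdicts

def confidence_tier(v1: str, v2: str) -> str:
    """Map a pair of model verdicts to a review tier."""
    if v1 == v2 and v1 in ("yes", "no"):
        return "high"    # both models agree outright
    if "yes" in (v1, v2) and "no" not in (v1, v2):
        return "medium"  # one model flags; the other is unsure
    return "low"         # disagreement (or both unsure): detailed review

def merge(qwen3_path: str, qwen25_path: str) -> list:
    """Join the two result sets on chunk ID and attach a tier to each."""
    qwen3 = load_verdicts(qwen3_path)
    qwen25 = load_verdicts(qwen25_path)
    return [
        {
            "chunk_id": cid,
            "qwen3": qwen3[cid],
            "qwen25": qwen25[cid],
            "confidence": confidence_tier(qwen3[cid], qwen25[cid]),
        }
        for cid in sorted(qwen3.keys() & qwen25.keys())
    ]
```

Keeping both raw verdicts alongside the tier makes it easy for reviewers to see why a chunk was routed to detailed review.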
## Individual Step Usage

Each step can be run independently:

```python
from pipeline.steps.step1_load_data import DataLoader

loader = DataLoader('signal_messages.csv')
df = loader.execute()
```

## Customization

Edit `pipeline/common_defs.py` to customize:

- Case-specific criteria
- Keyword lists
- Model configurations
- Semantic queries

## Expected Results

For the 200K-message corpus:

- **Recall**: 88-97% (finds most responsive messages)
- **Precision**: 65-85% (acceptable with attorney review)
- **High confidence**: 60-70% of chunks (minimal review)
- **Medium confidence**: 25-35% of chunks (standard review)
- **Low confidence**: 5-10% of chunks (detailed review)

## Troubleshooting

**Issue**: Model deployment fails

- Check GPU memory (Qwen 3 needs 4 × 80GB)
- Verify the vLLM installation
- Check quantization settings

**Issue**: Inference times out

- Increase the timeout in `inference_runner.py`
- Check the model health endpoints (see the sketch after this list)
- Verify network connectivity

**Issue**: Low agreement between models

- Review the few-shot examples
- Adjust semantic thresholds
- Check prompt formatting
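When inference times out or a model appears unhealthy, probe each vLLM server directly before re-running the full job. vLLM's OpenAI-compatible server exposes a `/health` liveness route and the standard `/v1/models` listing; the stdlib-only smoke test below (a sketch, with ports matching the Step 3 deployment commands) checks both servers:

```python
import json
import urllib.request

# Ports match the Step 3 deployment commands.
ENDPOINTS = {
    "qwen3": "http://localhost:8000",
    "qwen25": "http://localhost:8001",
}

for name, base in ENDPOINTS.items():
    # /health returns HTTP 200 once the model has finished loading.
    try:
        status = urllib.request.urlopen(f"{base}/health", timeout=5).status
        print(f"{name}: health={status}")
    except OSError as exc:
        print(f"{name}: unreachable ({exc})")
        continue

    # /v1/models confirms which model the server is actually serving.
    with urllib.request.urlopen(f"{base}/v1/models", timeout=5) as resp:
        served = [m["id"] for m in json.load(resp)["data"]]
        print(f"{name}: serving {served}")
```

A 200 from `/health` plus the expected model ID in `/v1/models` rules out deployment problems and points the investigation at timeouts or prompt size instead.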