# Legal Discovery Pipeline - Object-Oriented Design

An object-oriented pipeline for legal discovery that uses two models: Qwen 3 235B and Qwen 2.5 72B.

## Directory Structure

```
pipeline/
├── common_defs.py          # Common definitions and data classes
├── main_pipeline.py        # Main orchestrator
├── models/
│   └── base.py            # Base classes
├── utils/
│   ├── text_utils.py      # Text processing utilities
│   ├── deployment_helper.py  # Deployment helper
│   └── inference_runner.py   # Inference runner
└── steps/
    ├── step1_load_data.py       # Load and preprocess CSV
    ├── step2_create_chunks.py   # Create overlapping chunks
    ├── step3_keyword_filter.py  # Keyword filtering
    ├── step4_semantic_filter.py # Semantic filtering
    ├── step5_random_sampling.py # Random sampling
    ├── step6_labeling_template.py # Generate template
    ├── step7_inference_prep.py  # Prepare inference
    └── step8_merge_results.py   # Merge results
```
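To make steps 2 and 3 more concrete, here is a minimal sketch of overlapping chunking followed by keyword filtering. The function names, parameters, and default sizes are assumptions for illustration only; the actual code in `step2_create_chunks.py` and `step3_keyword_filter.py` may differ.

```python
from typing import List


def create_overlapping_chunks(
    messages: List[str], chunk_size: int = 20, overlap: int = 5
) -> List[List[str]]:
    """Split messages into windows of chunk_size that share `overlap` items.

    Hypothetical helper: the defaults are placeholders, not the pipeline's values.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    stride = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(messages), stride):
        chunks.append(messages[start:start + chunk_size])
        # Stop once a window reaches the end, so no trailing subset is emitted.
        if start + chunk_size >= len(messages):
            break
    return chunks


def keyword_filter(chunks: List[List[str]], keywords: List[str]) -> List[List[str]]:
    """Keep chunks whose text contains any keyword (case-insensitive substring match)."""
    lowered = [k.lower() for k in keywords]
    kept = []
    for chunk in chunks:
        text = " ".join(chunk).lower()
        if any(k in text for k in lowered):
            kept.append(chunk)
    return kept


if __name__ == "__main__":
    msgs = ["hello", "contract signed", "lunch?", "invoice sent"]
    windows = create_overlapping_chunks(msgs, chunk_size=2, overlap=1)
    print(keyword_filter(windows, ["Contract"]))
```

The overlap means a message near a chunk boundary appears in two windows, so context that spans the boundary is not lost to the later filtering steps.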

## Quick Start

### 1. Run Preprocessing

```bash
python pipeline/main_pipeline.py signal_messages.csv --step preprocess
```

### 2. Attorney Labels Samples

Have the attorney complete the labeling template at `pipeline_output/attorney_labeling_template.txt`.

### 3. Deploy Models

```python
from pipeline.utils.deployment_helper import ModelDeployer
deployer = ModelDeployer()
deployer.print_deployment_instructions()
```

### 4. Run Inference

```bash
python pipeline/utils/inference_runner.py pipeline_output/dual_qwen_inference_requests.jsonl
```

### 5. Merge Results

```bash
python pipeline/main_pipeline.py signal_messages.csv --step merge \
  --qwen3-results pipeline_output/qwen3_results.jsonl \
  --qwen25-results pipeline_output/qwen25_results.jsonl
```
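Conceptually, the merge step joins the two result files on a shared identifier and compares the models' outputs. The sketch below illustrates that idea only; the field names `chunk_id` and `label` are assumptions, since the real result schema is defined by the pipeline and not documented here.

```python
import json
from typing import Dict, Iterable, List


def load_jsonl(lines: Iterable[str]) -> Dict[str, dict]:
    """Parse JSONL lines into a dict keyed by the (assumed) `chunk_id` field."""
    records: Dict[str, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        records[record["chunk_id"]] = record
    return records


def merge_results(qwen3_lines: Iterable[str], qwen25_lines: Iterable[str]) -> List[dict]:
    """Join both models' results by chunk_id and flag whether their labels agree."""
    qwen3 = load_jsonl(qwen3_lines)
    qwen25 = load_jsonl(qwen25_lines)
    merged = []
    for chunk_id in sorted(set(qwen3) | set(qwen25)):
        label_a = qwen3.get(chunk_id, {}).get("label")
        label_b = qwen25.get(chunk_id, {}).get("label")
        merged.append({
            "chunk_id": chunk_id,
            "qwen3_label": label_a,
            "qwen25_label": label_b,
            # A chunk missing from either file never counts as agreement.
            "agree": label_a is not None and label_a == label_b,
        })
    return merged


if __name__ == "__main__":
    a = ['{"chunk_id": "c1", "label": "responsive"}']
    b = ['{"chunk_id": "c1", "label": "responsive"}']
    print(merge_results(a, b))
```

Because open file handles iterate line by line, the same function can be called with the two result files opened via `open(...)`. Chunks where the models disagree, or where one result is missing, are natural candidates for manual review.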

## Configuration

Edit `pipeline/common_defs.py` to customize:
- Case-specific criteria
- Keyword lists
- Model configurations
- Semantic queries
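As a rough illustration of how these four settings might be grouped, the sketch below uses dataclasses. The class and field names (`PipelineConfig`, `ModelConfig`, and so on) are hypothetical and are not taken from `common_defs.py`; only the model names and hourly rates come from this README.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelConfig:
    """Hypothetical per-model settings."""
    name: str
    hourly_cost_usd: float


def _default_models() -> List[ModelConfig]:
    # Model names and hourly rates as listed in this README's cost estimate.
    return [
        ModelConfig("Qwen 3 235B", 2.56),
        ModelConfig("Qwen 2.5 72B", 1.28),
    ]


@dataclass
class PipelineConfig:
    """Hypothetical container for the customizable settings listed above."""
    case_criteria: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    semantic_queries: List[str] = field(default_factory=list)
    models: List[ModelConfig] = field(default_factory=_default_models)


if __name__ == "__main__":
    # Placeholder values; replace with case-specific terms.
    config = PipelineConfig(keywords=["placeholder keyword"])
    print(config)
```

Using `default_factory` gives each config instance its own lists, so editing one case's keywords cannot leak into another.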

## Cost Estimate

- Qwen 3 235B: $2.56/hr × 4-8 hrs = $10.24-20.48
- Qwen 2.5 72B: $1.28/hr × 4-8 hrs = $5.12-10.24
- Total GPU: $15.36-30.72
- Attorney: $500-937
- **Grand Total: $515-968** (rounded to the nearest dollar)
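The arithmetic behind these figures can be reproduced with a short script, which is handy if the hourly rates or runtime change. The function name is illustrative; the rates, hours, and attorney range are the ones listed above.

```python
from typing import Tuple


def cost_range(hourly_rate: float, min_hours: float, max_hours: float) -> Tuple[float, float]:
    """Return the (low, high) cost in USD for a given hourly rate and runtime range."""
    return (round(hourly_rate * min_hours, 2), round(hourly_rate * max_hours, 2))


qwen3 = cost_range(2.56, 4, 8)    # (10.24, 20.48)
qwen25 = cost_range(1.28, 4, 8)   # (5.12, 10.24)

gpu_total = (round(qwen3[0] + qwen25[0], 2), round(qwen3[1] + qwen25[1], 2))  # (15.36, 30.72)

attorney = (500.0, 937.0)
grand_total = (round(gpu_total[0] + attorney[0], 2), round(gpu_total[1] + attorney[1], 2))

if __name__ == "__main__":
    print(f"GPU total:   ${gpu_total[0]:.2f}-{gpu_total[1]:.2f}")
    print(f"Grand total: ${grand_total[0]:.2f}-{grand_total[1]:.2f}")  # 515.36-967.72
```

The attorney fee accounts for nearly all of the total, so changes in GPU runtime have little effect on the overall budget.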