# Legal Discovery Pipeline - Object-Oriented Design

Complete object-oriented pipeline for legal discovery using Qwen 3 235B + Qwen 2.5 72B.

## Directory Structure

```
pipeline/
├── common_defs.py                  # Common definitions and data classes
├── main_pipeline.py                # Main orchestrator
├── models/
│   └── base.py                     # Base classes
├── utils/
│   ├── text_utils.py               # Text processing utilities
│   ├── deployment_helper.py        # Deployment helper
│   └── inference_runner.py         # Inference runner
└── steps/
    ├── step1_load_data.py          # Load and preprocess CSV
    ├── step2_create_chunks.py      # Create overlapping chunks
    ├── step3_keyword_filter.py     # Keyword filtering
    ├── step4_semantic_filter.py    # Semantic filtering
    ├── step5_random_sampling.py    # Random sampling
    ├── step6_labeling_template.py  # Generate template
    ├── step7_inference_prep.py     # Prepare inference
    └── step8_merge_results.py      # Merge results
```

## Quick Start

### 1. Run Preprocessing

```bash
python pipeline/main_pipeline.py signal_messages.csv --step preprocess
```

### 2. Attorney Labels Samples

Complete the template at: `pipeline_output/attorney_labeling_template.txt`

### 3. Deploy Models

```python
from pipeline.utils.deployment_helper import ModelDeployer

deployer = ModelDeployer()
deployer.print_deployment_instructions()
```

### 4. Run Inference

```bash
python pipeline/utils/inference_runner.py pipeline_output/dual_qwen_inference_requests.jsonl
```

### 5. Merge Results

```bash
python pipeline/main_pipeline.py signal_messages.csv --step merge \
    --qwen3-results pipeline_output/qwen3_results.jsonl \
    --qwen25-results pipeline_output/qwen25_results.jsonl
```

## Configuration

Edit `pipeline/common_defs.py` to customize:

- Case-specific criteria
- Keyword lists
- Model configurations
- Semantic queries

## Cost Estimate

- Qwen 3 235B: $2.56/hr × 4-8 hrs = $10.24-20.48
- Qwen 2.5 72B: $1.28/hr × 4-8 hrs = $5.12-10.24
- Total GPU: $15.36-30.72
- Attorney: $500-937
- **Grand Total: $515-968**
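
The arithmetic behind the cost estimate can be reproduced with a short script, which is handy when hourly rates or run times change. This is a minimal sketch rather than part of the pipeline: `GpuRate`, `cost_range`, and `estimate_costs` are hypothetical names introduced here for illustration, and the rates, the 4-8 hour window, and the attorney range are copied from the figures above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuRate:
    """Hypothetical helper type: hourly GPU price for one model deployment."""
    name: str
    usd_per_hour: float


# Figures copied from the Cost Estimate section; adjust for your deployment.
RATES = (
    GpuRate("Qwen 3 235B", 2.56),
    GpuRate("Qwen 2.5 72B", 1.28),
)
HOURS = (4, 8)                 # (min, max) expected run time in hours
ATTORNEY_USD = (500.0, 937.0)  # (min, max) attorney labeling cost


def cost_range(rate, hours):
    """Return (low, high) cost in USD for one model over the hour range."""
    low_hours, high_hours = hours
    return rate.usd_per_hour * low_hours, rate.usd_per_hour * high_hours


def estimate_costs(rates=RATES, hours=HOURS, attorney=ATTORNEY_USD):
    """Sum per-model GPU costs and add the attorney range."""
    per_model = {rate.name: cost_range(rate, hours) for rate in rates}
    gpu_low = sum(low for low, _ in per_model.values())
    gpu_high = sum(high for _, high in per_model.values())
    return {
        "per_model": per_model,
        "gpu_total": (gpu_low, gpu_high),
        "grand_total": (gpu_low + attorney[0], gpu_high + attorney[1]),
    }


if __name__ == "__main__":
    result = estimate_costs()
    for name, (low, high) in result["per_model"].items():
        print(f"{name}: ${low:.2f}-{high:.2f}")
    print("Total GPU: ${:.2f}-{:.2f}".format(*result["gpu_total"]))
    print("Grand total: ${:.0f}-{:.0f}".format(*result["grand_total"]))
```

Keeping the inputs as module-level constants means a changed rate or a longer run only needs one edit, and the totals stay consistent with the per-model lines.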