adri bb1d2edadf parallel universe		4 weeks ago
_docs	dba66894ae all the things	4 weeks ago
_scratch	bda9dd5d5d intiial commmit	1 month ago
_test	bda9dd5d5d intiial commmit	1 month ago
pipeline	bb1d2edadf parallel universe	4 weeks ago
.gitignore	86f5789c71 updated text normalization & new replacements	1 month ago
.python-version	bda9dd5d5d intiial commmit	1 month ago
README.md	bda9dd5d5d intiial commmit	1 month ago
install.sh	bda9dd5d5d intiial commmit	1 month ago
main.py	bda9dd5d5d intiial commmit	1 month ago
pyproject.toml	dba66894ae all the things	4 weeks ago
uv.lock	86f5789c71 updated text normalization & new replacements	1 month ago

Legal Discovery Pipeline - Object-Oriented Design

A complete object-oriented pipeline for legal discovery, built on two models: Qwen 3 235B and Qwen 2.5 72B.

Directory Structure

pipeline/
├── common_defs.py                  # Common definitions and data classes
├── main_pipeline.py                # Main orchestrator
├── models/
│   └── base.py                     # Base classes
├── utils/
│   ├── text_utils.py               # Text processing utilities
│   ├── deployment_helper.py        # Model deployment helper
│   └── inference_runner.py         # Inference runner
└── steps/
    ├── step1_load_data.py          # Load and preprocess CSV
    ├── step2_create_chunks.py      # Create overlapping chunks
    ├── step3_keyword_filter.py     # Keyword filtering
    ├── step4_semantic_filter.py    # Semantic filtering
    ├── step5_random_sampling.py    # Random sampling
    ├── step6_labeling_template.py  # Generate attorney labeling template
    ├── step7_inference_prep.py     # Prepare inference requests
    └── step8_merge_results.py      # Merge results
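
Step 2 creates overlapping chunks. The idea can be sketched roughly as below. This is a minimal illustration, not the code in step2_create_chunks.py: the function name, the chunk_size and overlap parameters, and their default values are all assumptions made for the example.

```python
from typing import List, Sequence


def create_overlapping_chunks(
    messages: Sequence[str], chunk_size: int = 20, overlap: int = 5
) -> List[List[str]]:
    """Split messages into windows of chunk_size that share `overlap` items.

    Hypothetical helper: the real pipeline may chunk by tokens, time,
    or conversation instead of by message count.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    stride = chunk_size - overlap  # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(messages), stride):
        chunks.append(list(messages[start:start + chunk_size]))
        if start + chunk_size >= len(messages):
            break  # the last window already reaches the end of the input
    return chunks


# Ten messages, windows of four, one message shared between neighbours.
example = create_overlapping_chunks([f"m{i}" for i in range(10)], chunk_size=4, overlap=1)
print([len(chunk) for chunk in example])  # prints [4, 4, 4]
```

The overlap means a message near a chunk boundary appears in both neighbouring chunks, so context that spans the boundary is not lost.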

Quick Start

1. Run Preprocessing

python pipeline/main_pipeline.py signal_messages.csv --step preprocess

2. Attorney Labels the Samples

An attorney completes the labeling template at: pipeline_output/attorney_labeling_template.txt

3. Deploy Models

from pipeline.utils.deployment_helper import ModelDeployer
deployer = ModelDeployer()
deployer.print_deployment_instructions()

4. Run Inference

python pipeline/utils/inference_runner.py pipeline_output/dual_qwen_inference_requests.jsonl

5. Merge Results

python pipeline/main_pipeline.py signal_messages.csv --step merge \
  --qwen3-results pipeline_output/qwen3_results.jsonl \
  --qwen25-results pipeline_output/qwen25_results.jsonl
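
As a rough illustration of what merging two JSONL result files can involve, the sketch below pairs records from both models by a shared identifier. It is not the logic in step8_merge_results.py: the field name chunk_id, the output keys, and both function names are hypothetical, and the actual result schema is not documented here.

```python
import json
from typing import Dict, Iterable, List


def index_by_id(lines: Iterable[str], id_field: str = "chunk_id") -> Dict[str, dict]:
    """Parse JSONL lines and index the records by their identifier."""
    records: Dict[str, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        records[str(record[id_field])] = record
    return records


def merge_results(
    qwen3_lines: Iterable[str],
    qwen25_lines: Iterable[str],
    id_field: str = "chunk_id",  # assumed field name, not confirmed by the source
) -> List[dict]:
    """Pair the two models' records and note whether both models answered."""
    qwen3 = index_by_id(qwen3_lines, id_field)
    qwen25 = index_by_id(qwen25_lines, id_field)
    merged = []
    for key in sorted(set(qwen3) | set(qwen25)):
        merged.append({
            id_field: key,
            "qwen3": qwen3.get(key),    # None if this model has no record
            "qwen25": qwen25.get(key),  # None if this model has no record
            "in_both": key in qwen3 and key in qwen25,
        })
    return merged
```

Because open text files iterate line by line, the two arguments could simply be file handles opened on the two result files.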

Configuration

Edit pipeline/common_defs.py to customize:

- Case-specific criteria
- Keyword lists
- Model configurations
- Semantic queries
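
To make the four customization points concrete, here is a purely hypothetical sketch of how such settings could be grouped in a data class. The class name, field names, and the matches_keyword helper are invented for illustration and do not describe the actual contents of pipeline/common_defs.py.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CaseConfig:
    """Hypothetical container for the four customizable settings."""

    case_criteria: str = ""                                     # case-specific criteria
    keywords: List[str] = field(default_factory=list)           # keyword lists
    model_names: List[str] = field(                             # model configurations
        default_factory=lambda: ["Qwen 3 235B", "Qwen 2.5 72B"]
    )
    semantic_queries: List[str] = field(default_factory=list)   # semantic queries

    def matches_keyword(self, text: str) -> bool:
        """Case-insensitive substring match against any configured keyword."""
        lowered = text.lower()
        return any(keyword.lower() in lowered for keyword in self.keywords)


# Example values are placeholders, not real case data.
config = CaseConfig(case_criteria="example criteria", keywords=["Invoice"])
print(config.matches_keyword("The invoice is late"))  # prints True
```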

Cost Estimate

Qwen 3 235B: $2.56/hr × 4-8 hrs = $10.24-20.48
Qwen 2.5 72B: $1.28/hr × 4-8 hrs = $5.12-10.24
Total GPU: $15.36-30.72
Attorney: $500-937
Grand Total: $515-968
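
The arithmetic behind these figures can be reproduced with a short script, which also makes it easy to re-run the estimate with different numbers. The hourly rates, the 4-8 hour range, and the attorney range are taken from the lines above; the function and variable names are illustrative.

```python
from typing import Tuple

Range = Tuple[float, float]


def cost_range(rate_per_hour: float, min_hours: float, max_hours: float) -> Range:
    """Return the (low, high) cost for an hourly rate over a range of hours."""
    return (rate_per_hour * min_hours, rate_per_hour * max_hours)


def add_ranges(a: Range, b: Range) -> Range:
    """Add two (low, high) ranges component-wise."""
    return (a[0] + b[0], a[1] + b[1])


qwen3_cost = cost_range(2.56, 4, 8)    # Qwen 3 235B
qwen25_cost = cost_range(1.28, 4, 8)   # Qwen 2.5 72B
gpu_total = add_ranges(qwen3_cost, qwen25_cost)
attorney_cost = (500.0, 937.0)         # taken directly from the estimate above
grand_total = add_ranges(gpu_total, attorney_cost)

print(f"Total GPU: ${gpu_total[0]:.2f}-{gpu_total[1]:.2f}")        # $15.36-30.72
print(f"Grand Total: ${grand_total[0]:.0f}-{grand_total[1]:.0f}")  # $515-968
```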
