No description

adri bb1d2edadf parallel universe 4 weeks ago
_docs dba66894ae all the things 4 weeks ago
_scratch bda9dd5d5d intiial commmit 1 month ago
_test bda9dd5d5d intiial commmit 1 month ago
pipeline bb1d2edadf parallel universe 4 weeks ago
.gitignore 86f5789c71 updated text normalization & new replacements 1 month ago
.python-version bda9dd5d5d intiial commmit 1 month ago
README.md bda9dd5d5d intiial commmit 1 month ago
install.sh bda9dd5d5d intiial commmit 1 month ago
main.py bda9dd5d5d intiial commmit 1 month ago
pyproject.toml dba66894ae all the things 4 weeks ago
uv.lock 86f5789c71 updated text normalization & new replacements 1 month ago

README.md

Legal Discovery Pipeline - Object-Oriented Design

A complete object-oriented pipeline for legal discovery that uses two models: Qwen 3 235B and Qwen 2.5 72B.

Directory Structure

pipeline/
├── common_defs.py          # Common definitions and data classes
├── main_pipeline.py        # Main orchestrator
├── models/
│   └── base.py            # Base classes
├── utils/
│   ├── text_utils.py      # Text processing utilities
│   ├── deployment_helper.py  # Deployment helper
│   └── inference_runner.py   # Inference runner
└── steps/
    ├── step1_load_data.py       # Load and preprocess CSV
    ├── step2_create_chunks.py   # Create overlapping chunks
    ├── step3_keyword_filter.py  # Keyword filtering
    ├── step4_semantic_filter.py # Semantic filtering
    ├── step5_random_sampling.py # Random sampling
    ├── step6_labeling_template.py # Generate template
    ├── step7_inference_prep.py  # Prepare inference
    └── step8_merge_results.py   # Merge results
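The chunking step (step2_create_chunks.py) is described only as "Create overlapping chunks". As an illustration of that idea, here is a minimal sketch of sliding-window chunking over a list of messages. The function name create_overlapping_chunks and the chunk_size and overlap parameters are assumptions made for this example; they are not taken from the pipeline code.

```python
from typing import List


def create_overlapping_chunks(
    messages: List[str], chunk_size: int = 4, overlap: int = 1
) -> List[List[str]]:
    """Split messages into windows of chunk_size that share `overlap` items.

    Hypothetical helper: the real step2 implementation may differ.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    stride = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(messages), stride):
        chunks.append(messages[start:start + chunk_size])
        # Stop once the current window has reached the end of the input,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(messages):
            break
    return chunks
```

With six messages, chunk_size=4 and overlap=1, this yields two chunks that share the fourth message, so context at a chunk boundary is seen by both chunks.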

Quick Start

1. Run Preprocessing

python pipeline/main_pipeline.py signal_messages.csv --step preprocess

2. Attorney Labels Samples

Have the attorney complete the labeling template at: pipeline_output/attorney_labeling_template.txt

3. Deploy Models

from pipeline.utils.deployment_helper import ModelDeployer
deployer = ModelDeployer()
deployer.print_deployment_instructions()

4. Run Inference

python pipeline/utils/inference_runner.py pipeline_output/dual_qwen_inference_requests.jsonl

5. Merge Results

python pipeline/main_pipeline.py signal_messages.csv --step merge \
  --qwen3-results pipeline_output/qwen3_results.jsonl \
  --qwen25-results pipeline_output/qwen25_results.jsonl
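The README does not describe the format of the two result files or how the merge step combines them. Purely as a sketch of what merging two JSONL result sets can look like, the example below assumes each line is a JSON object with a chunk_id and a label field. Those field names, and the agree flag in the output, are assumptions for illustration and are not taken from step8_merge_results.py.

```python
import json
from typing import Dict, Iterable, List


def index_by_chunk(lines: Iterable[str]) -> Dict[str, dict]:
    """Parse JSONL lines and index the records by their (assumed) chunk_id."""
    records: Dict[str, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        records[record["chunk_id"]] = record
    return records


def merge_results(qwen3_lines: Iterable[str], qwen25_lines: Iterable[str]) -> List[dict]:
    """Join both models' labels per chunk and flag whether they agree."""
    qwen3 = index_by_chunk(qwen3_lines)
    qwen25 = index_by_chunk(qwen25_lines)
    merged: List[dict] = []
    for chunk_id in sorted(qwen3.keys() | qwen25.keys()):
        label3 = qwen3.get(chunk_id, {}).get("label")
        label25 = qwen25.get(chunk_id, {}).get("label")
        merged.append({
            "chunk_id": chunk_id,
            "qwen3_label": label3,
            "qwen25_label": label25,
            # A chunk missing from either file never counts as agreement.
            "agree": label3 is not None and label3 == label25,
        })
    return merged
```

To use it on files, pass open file handles, for example merge_results(open(path_a), open(path_b)). Chunks where the two models disagree, or where one model produced no result, are natural candidates for attorney review.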

Configuration

Edit pipeline/common_defs.py to customize:

  • Case-specific criteria
  • Keyword lists
  • Model configurations
  • Semantic queries
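The actual contents of pipeline/common_defs.py are not shown in this README. As a rough idea of how the four kinds of settings listed above could be grouped, here is a sketch using data classes. The class names CaseConfig and ModelConfig, every field name, and the sample criteria, keywords, and queries are placeholders invented for this example. Only the model names and hourly rates come from this README.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ModelConfig:
    """Hypothetical per-model settings."""
    name: str
    hourly_rate_usd: float


@dataclass
class CaseConfig:
    """Hypothetical container for the customizable settings."""
    criteria: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    semantic_queries: List[str] = field(default_factory=list)
    models: Dict[str, ModelConfig] = field(default_factory=dict)

    def matches_keyword(self, text: str) -> bool:
        """Case-insensitive substring match against the keyword list."""
        lowered = text.lower()
        return any(keyword.lower() in lowered for keyword in self.keywords)


# Placeholder values; replace them with the case-specific ones.
config = CaseConfig(
    criteria=["Messages discussing the disputed contract"],
    keywords=["contract", "invoice"],
    semantic_queries=["discussion of payment terms"],
    models={
        "qwen3": ModelConfig(name="Qwen 3 235B", hourly_rate_usd=2.56),
        "qwen25": ModelConfig(name="Qwen 2.5 72B", hourly_rate_usd=1.28),
    },
)
```

Keeping all case-specific values in one object like this means the step modules can receive a single configuration argument instead of importing scattered constants.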

Cost Estimate

  • Qwen 3 235B: $2.56/hr × 4-8 hrs = $10.24-$20.48
  • Qwen 2.5 72B: $1.28/hr × 4-8 hrs = $5.12-$10.24
  • Total GPU: $15.36-$30.72
  • Attorney: $500-$937
  • Grand Total: $515-$968
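The totals above follow from multiplying each hourly rate by the low and high runtime and then adding the attorney range. The short sketch below reproduces that arithmetic so the estimate can be recomputed if rates or hours change. The constant and function names are chosen for this example, and the figures are the ones listed above.

```python
from typing import Dict, Tuple

# Figures taken from the cost estimate above.
GPU_RATES_USD_PER_HOUR: Dict[str, float] = {
    "Qwen 3 235B": 2.56,
    "Qwen 2.5 72B": 1.28,
}
HOURS_RANGE: Tuple[int, int] = (4, 8)
ATTORNEY_RANGE_USD: Tuple[float, float] = (500.0, 937.0)


def gpu_cost_range(rate: float, hours: Tuple[int, int] = HOURS_RANGE) -> Tuple[float, float]:
    """Return the (low, high) GPU cost for one model, rounded to cents."""
    low_hours, high_hours = hours
    return round(rate * low_hours, 2), round(rate * high_hours, 2)


def total_cost_range() -> Tuple[float, float]:
    """Return the (low, high) grand total: all GPU costs plus attorney time."""
    gpu_low = sum(gpu_cost_range(rate)[0] for rate in GPU_RATES_USD_PER_HOUR.values())
    gpu_high = sum(gpu_cost_range(rate)[1] for rate in GPU_RATES_USD_PER_HOUR.values())
    return (
        round(gpu_low + ATTORNEY_RANGE_USD[0], 2),
        round(gpu_high + ATTORNEY_RANGE_USD[1], 2),
    )


if __name__ == "__main__":
    for model, rate in GPU_RATES_USD_PER_HOUR.items():
        low, high = gpu_cost_range(rate)
        print(f"{model}: ${low:.2f}-${high:.2f}")
    low, high = total_cost_range()
    print(f"Grand total: ${low:.2f}-${high:.2f}")
```

The grand total comes out to 515.36 and 967.72 before rounding to whole dollars, which matches the $515 and $968 bounds listed above.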