Advanced Features

Keyword Identification (Step 0a)

Automatically identify relevant keywords from your data:

from pipeline.steps.step0a_keyword_identification import KeywordIdentifier

# df: your input data (assumed here to be a pandas DataFrame)
identifier = KeywordIdentifier(min_frequency=5, max_keywords=100)
categories = identifier.execute(df)

Output: pipeline_output/keyword_analysis.json and keyword_analysis.txt

Categories:

  • Names
  • Medical terms
  • Locations
  • Actions
  • Emotions
  • Dates
  • Other
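
The min_frequency and max_keywords parameters suggest a frequency-based selection. The following is a minimal sketch of that idea, not the actual KeywordIdentifier implementation: the function name identify_keywords, the tokenization rule, and the sample texts are all assumptions made for illustration.

```python
import re
from collections import Counter


def identify_keywords(texts, min_frequency=5, max_keywords=100):
    """Return (word, count) pairs seen at least min_frequency times.

    Hypothetical helper: the real KeywordIdentifier may tokenize,
    filter, and categorize differently.
    """
    counts = Counter()
    for text in texts:
        # Lowercase the text and count runs of letters/apostrophes as words.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    # most_common() sorts by descending count; keep only frequent words.
    frequent = [(w, c) for w, c in counts.most_common() if c >= min_frequency]
    # Cap the result at max_keywords entries.
    return frequent[:max_keywords]


texts = ["patient visited clinic", "patient called clinic", "patient left"]
keywords = identify_keywords(texts, min_frequency=2, max_keywords=10)
print(keywords)
```

A lower min_frequency surfaces rarer terms at the cost of more noise, while max_keywords caps the size of the report.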

Normalization Analysis (Step 0b)

Analyze text patterns and get suggestions for normalizations:

from pipeline.steps.step0b_normalization_analysis import NormalizationAnalyzer

analyzer = NormalizationAnalyzer()
suggestions = analyzer.execute(df)

Output: pipeline_output/normalization_suggestions.json and normalization_suggestions.txt

Identifies:

  • Abbreviations (dr., appt, etc.)
  • Acronyms (MSK, ER, ICU, etc.)
  • Common misspellings
  • Date/time patterns
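
Two of the pattern types listed above, acronyms and dates, can be approximated with regular expressions. This is only a sketch under stated assumptions: the regexes, the find_patterns helper, and the sample sentences are hypothetical and are not taken from NormalizationAnalyzer.

```python
import re
from collections import Counter

# Assumed patterns for illustration only.
ACRONYM_RE = re.compile(r"\b[A-Z]{2,5}\b")             # e.g. ER, ICU, MSK
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")   # e.g. 3/14/2023


def find_patterns(texts):
    """Count acronym-like and date-like tokens across all texts."""
    found = {"acronyms": Counter(), "dates": Counter()}
    for text in texts:
        found["acronyms"].update(ACRONYM_RE.findall(text))
        found["dates"].update(DATE_RE.findall(text))
    return found


sample = ["Seen in ER on 3/14/2023, moved to ICU.", "ER follow-up 03/20/2023"]
patterns = find_patterns(sample)
print(patterns["acronyms"].most_common())
print(sorted(patterns["dates"]))
```

Counts like these could feed suggestions such as expanding a frequent acronym or unifying two date formats.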

Parallel Inference Processing

Process inference requests 3-4x faster with parallel workers:

from pipeline.utils.parallel_inference_runner import ParallelInferenceRunner

runner = ParallelInferenceRunner(max_workers=4)
runner.run_inference('pipeline_output/dual_qwen_inference_requests.jsonl')

Benefits:

  • 3-4x faster than sequential processing
  • Automatic error handling and retries
  • Progress tracking with tqdm
  • Configurable worker count
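
The combination of a worker pool and retries can be sketched with the standard library alone. This is an illustrative pattern, not the ParallelInferenceRunner source: run_parallel, call_with_retries, and flaky_inference are hypothetical names, and tqdm progress tracking is left out to keep the example dependency-free.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def call_with_retries(func, request, max_retries=3, backoff_seconds=0.0):
    """Call func(request), retrying on any exception up to max_retries times."""
    for attempt in range(1, max_retries + 1):
        try:
            return func(request)
        except Exception:
            if attempt == max_retries:
                raise  # Out of attempts: surface the error to the caller.
            time.sleep(backoff_seconds * attempt)  # Linear backoff.


def run_parallel(requests, func, max_workers=4, max_retries=3):
    """Run func over requests in a thread pool; return (results, errors)."""
    results = [None] * len(requests)
    errors = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its request index to preserve input order.
        futures = {
            pool.submit(call_with_retries, func, req, max_retries): i
            for i, req in enumerate(requests)
        }
        for future in as_completed(futures):
            index = futures[future]
            try:
                results[index] = future.result()
            except Exception as exc:
                errors[index] = exc  # Record failures without stopping the run.
    return results, errors


attempts = {}


def flaky_inference(request):
    """Stand-in for a real inference call; fails once for request 'b'."""
    attempts[request] = attempts.get(request, 0) + 1
    if request == "b" and attempts[request] == 1:
        raise RuntimeError("transient failure")
    return request.upper()


results, errors = run_parallel(["a", "b", "c"], flaky_inference)
print(results, errors)
```

Threads suit this workload when each request spends most of its time waiting on I/O, such as a network call to an inference server.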

Performance:

  • Sequential: ~2-3 requests/second
  • Parallel (4 workers): ~8-12 requests/second
  • For 300 chunks: ~25 minutes (parallel) vs ~100 minutes (sequential)
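
As a quick check of the 300-chunk figures, the speedup and the average wall-clock time per chunk follow directly from the two run times. The helper name compare_runs is hypothetical; the inputs are the approximate figures quoted above.

```python
def compare_runs(num_chunks, sequential_minutes, parallel_minutes):
    """Summarize two wall-clock times for the same number of chunks."""
    return {
        # How many times faster the parallel run finished.
        "speedup": sequential_minutes / parallel_minutes,
        # Average wall-clock seconds spent per chunk in each mode.
        "sequential_seconds_per_chunk": sequential_minutes * 60 / num_chunks,
        "parallel_seconds_per_chunk": parallel_minutes * 60 / num_chunks,
    }


summary = compare_runs(num_chunks=300, sequential_minutes=100, parallel_minutes=25)
print(summary)
```

With these inputs the speedup is 4.0, at the upper end of the 3-4x range.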