## Advanced Features

### Keyword Identification (Step 0a)

Automatically identify relevant keywords from your data:

```python
from pipeline.steps.step0a_keyword_identification import KeywordIdentifier

identifier = KeywordIdentifier(min_frequency=5, max_keywords=100)
categories = identifier.execute(df)
```

**Output**: `pipeline_output/keyword_analysis.json` and `keyword_analysis.txt`

**Categories**:
- Names
- Medical terms
- Locations
- Actions
- Emotions
- Dates
- Other

### Normalization Analysis (Step 0b)

Analyze text patterns and get suggested normalizations:

```python
from pipeline.steps.step0b_normalization_analysis import NormalizationAnalyzer

analyzer = NormalizationAnalyzer()
suggestions = analyzer.execute(df)
```

**Output**: `pipeline_output/normalization_suggestions.json` and `normalization_suggestions.txt`

**Identifies**:
- Abbreviations (dr., appt, etc.)
- Acronyms (MSK, ER, ICU, etc.)
- Common misspellings
- Date/time patterns

### Parallel Inference Processing

Process inference requests 3-4x faster with parallel workers:

```python
from pipeline.utils.parallel_inference_runner import ParallelInferenceRunner

runner = ParallelInferenceRunner(max_workers=4)
runner.run_inference('pipeline_output/dual_qwen_inference_requests.jsonl')
```

**Benefits**:
- 3-4x faster than sequential processing
- Automatic error handling and retries
- Progress tracking with tqdm
- Configurable worker count

**Performance**:
- Sequential: ~2-3 requests/second
- Parallel (4 workers): ~8-12 requests/second
- For 300 chunks: ~25 minutes vs ~100 minutes
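To illustrate what `KeywordIdentifier(min_frequency=5, max_keywords=100)` does conceptually, here is a simplified, standalone sketch of frequency-based keyword selection. This is not the pipeline's actual implementation; the tokenization and thresholds are assumptions for illustration only.

```python
import re
from collections import Counter

def top_keywords(texts, min_frequency=2, max_keywords=10):
    """Frequency-based keyword selection: keep tokens seen at least
    min_frequency times, capped at the max_keywords most common."""
    counts = Counter()
    for text in texts:
        # Naive tokenization; the real step likely does more (e.g. categorization)
        counts.update(re.findall(r"[a-z]+", text.lower()))
    frequent = [(w, n) for w, n in counts.most_common() if n >= min_frequency]
    return frequent[:max_keywords]

texts = ["knee pain after surgery", "knee surgery follow-up", "pain management"]
print(top_keywords(texts))  # [('knee', 2), ('pain', 2), ('surgery', 2)]
```

The real step additionally sorts keywords into the categories listed above (names, medical terms, locations, and so on) before writing `keyword_analysis.json`.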
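The abbreviation and acronym detection described under Step 0b can be sketched with simple pattern matching. This is a hypothetical illustration of the kind of heuristics such an analyzer might use, not the actual `NormalizationAnalyzer` logic; candidates it collects would still need human review (short sentence-final words can be false positives for abbreviations).

```python
import re
from collections import Counter

def find_candidates(texts):
    """Collect abbreviation-like and acronym-like tokens as
    normalization candidates for review."""
    abbrevs, acronyms = Counter(), Counter()
    for text in texts:
        # Short lowercase tokens ending in a period, e.g. "dr.", "appt."
        abbrevs.update(re.findall(r"\b[a-z]{1,5}\.", text))
        # Runs of 2-5 capitals, e.g. "MSK", "ICU"
        acronyms.update(re.findall(r"\b[A-Z]{2,5}\b", text))
    return abbrevs, acronyms

texts = ["Saw dr. Smith in the ICU.", "Follow-up appt. at MSK tomorrow."]
abbrevs, acronyms = find_candidates(texts)
print(dict(abbrevs))   # {'dr.': 1, 'appt.': 1}
print(dict(acronyms))  # {'ICU': 1, 'MSK': 1}
```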
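The speedup from parallel inference comes from overlapping I/O-bound requests across worker threads. A minimal sketch of that pattern using the standard library's `ThreadPoolExecutor` is below; `fake_handler`, the request shape, and the error-recording format are stand-in assumptions, not the `ParallelInferenceRunner` API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_parallel(requests, handler, max_workers=4):
    """Run handler over each request concurrently, collecting
    results (or errors) as workers finish."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(handler, req): req for req in requests}
        for fut in as_completed(futures):
            try:
                results.append(fut.result())
            except Exception as exc:
                # Record the failure alongside its originating request
                results.append({"request": futures[fut], "error": str(exc)})
    return results

# Stand-in handler; a real one would call the inference endpoint
def fake_handler(req):
    return {"id": req["id"], "answer": req["prompt"].upper()}

reqs = [{"id": i, "prompt": f"chunk {i}"} for i in range(8)]
out = run_parallel(reqs, fake_handler, max_workers=4)
print(len(out))  # 8
```

Because model inference is dominated by waiting on the network or the GPU server rather than local CPU work, threads (rather than processes) are enough to keep several requests in flight at once.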