# 📊 MedAI Data Processing Techniques

This document outlines the data processing techniques implemented in the MedAI Processing project for augmenting medical datasets and converting them into a centralized format for LLM fine-tuning.

## 🎯 Project Overview

The MedAI Processing system transforms raw medical datasets into a **centralized fine-tuning format** (JSONL + CSV) with comprehensive data augmentation capabilities. The system processes multiple medical dataset types and applies various enhancement techniques to improve data quality and diversity.

## 🏗️ System Architecture

### Core Components

- **FastAPI Web Service**: RESTful API for dataset processing
- **Multi-LLM Rotator**: NVIDIA API + Google Gemini integration
- **Centralized Writer**: Parallel JSONL + CSV output generation
- **Google Drive Integration**: Automated artifact storage
- **Progress Monitoring**: Real-time job status tracking

### Supported Datasets

1. **HealthCareMagic** (100k medical dialogues)
2. **iCliniq** (10k medical consultations)
3. **PubMedQA-Labelled** (biomedical Q&A with answers)
4. **PubMedQA-Unlabelled** (biomedical Q&A without answers)
5. **PubMedQA-Map** (biomedical Q&A mapping format)

## 🔧 Data Processing Pipeline

### 1. Data Ingestion & Download

- **Hugging Face Hub Integration**: Automatic dataset downloading
- **Format Detection**: JSON/JSONL auto-detection and parsing
- **Caching System**: Local storage with symlink optimization

### 2. Data Cleaning & Preprocessing

#### Text Normalization

- **Unicode Fixing**: `ftfy` library for text encoding issues
- **Whitespace Standardization**: Consistent spacing and line breaks
- **Quote Canonicalization**: Standard quote character conversion
- **Terminal Punctuation**: Ensures proper sentence endings

#### Content Sanitization

- **Length Capping**: Configurable maximum character limit (default: 5000)
- **Language Detection**: English-language validation using `langid`
- **Content Truncation**: Smart sentence-boundary cutting for long texts

### 3. Data Augmentation Techniques

#### LLM-Based Paraphrasing

- **Multi-Model Rotation**: NVIDIA API (primary) + Gemini (fallback)
- **Difficulty Levels**: Easy vs. hard paraphrasing modes
- **Medical Context Preservation**: Maintains clinical terminology accuracy
- **Configurable Ratios**: User-defined augmentation ratios (0.0-1.0); see the sketch at the end of this section

#### Back-Translation Augmentation

- **Pivot Language**: German as the intermediate language
- **Meaning Preservation**: Maintains semantic accuracy through translation cycles
- **Fallback Mechanisms**: Automatic retry with alternative models
- **Quality Control**: Length and content validation

#### Style Standardization

- **Clinical Voice Enforcement**: Neutral, professional medical tone
- **Absolute Language Removal**: Replaces guarantees with probabilistic language
- **Forum Sign-off Removal**: Eliminates informal communication patterns
- **Consistent Punctuation**: Standardized sentence structure
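The paraphrasing and back-translation passes above are ratio-gated and rotate between a primary and a fallback model. Below is a minimal sketch of that selection-and-fallback flow; the wrapper functions and parameter names are hypothetical placeholders, not the project's actual rotator API.

```python
import random
from typing import Optional

# Hypothetical wrappers: the real project routes these calls through its
# multi-LLM rotator (NVIDIA primary, Gemini fallback); the stubs below only
# mark where those calls would happen.
def paraphrase_with_nvidia(text: str, hard: bool = False) -> str:
    raise NotImplementedError("call the NVIDIA-hosted model here")

def paraphrase_with_gemini(text: str, hard: bool = False) -> str:
    raise NotImplementedError("call Gemini here")

def maybe_paraphrase(text: str, ratio: float, rng: random.Random,
                     hard: bool = False, max_chars: int = 5000) -> Optional[str]:
    """Paraphrase `text` with probability `ratio`, rotating to the fallback
    model if the primary call fails. Returns None when the sample is not
    selected or both models fail (the original text is then kept)."""
    if rng.random() >= ratio:  # ratio-based selection, ratio in 0.0-1.0
        return None
    for call in (paraphrase_with_nvidia, paraphrase_with_gemini):
        try:
            candidate = call(text, hard=hard)
            if candidate and len(candidate) <= max_chars:  # simple quality gate
                return candidate
        except Exception:
            continue  # rotate to the next model
    return None
```

Back-translation follows the same ratio-gated pattern, with a round-trip through the German pivot language in place of the single paraphrase call.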
### 4. Data Quality Assurance

#### De-identification (PHI Removal)

- **Email Redaction**: `[REDACTED_EMAIL]` placeholder
- **Phone Number Masking**: `[REDACTED_PHONE]` placeholder
- **URL/IP Address Removal**: `[REDACTED_URL]` and `[REDACTED_IP]` placeholders
- **Configurable Privacy**: Optional PHI removal per dataset

#### Deduplication

- **Fingerprinting Algorithm**: MD5-based content hashing (see the sketch after this section)
- **Multi-Field Matching**: Instruction + Input + Output combination
- **Normalized Comparison**: Case-insensitive, whitespace-normalized matching
- **Performance Optimized**: In-memory, set-based deduplication

#### Consistency Validation

- **LLM-Based QA Check**: Automated answer validation against context
- **Configurable Sampling**: Ratio-based consistency checking (e.g., 0.01)
- **Medical Safety Validation**: Ensures clinical accuracy and safety
- **Failure Tagging**: Marks samples with consistency issues
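The de-identification and deduplication steps above lend themselves to a compact illustration. The sketch below is a hedged approximation: the regular expressions are illustrative, and only the placeholder tokens and the MD5 fingerprint over the normalized instruction/input/output triple are taken from this section; the actual rules live in the `utils/` modules.

```python
import hashlib
import re

# Illustrative PHI patterns -- approximations, not the project's exact rules.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
_PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
_URL = re.compile(r"https?://\S+")
_IP = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def deidentify(text: str) -> str:
    """Replace common PHI patterns with the placeholders used in the output."""
    text = _EMAIL.sub("[REDACTED_EMAIL]", text)
    text = _PHONE.sub("[REDACTED_PHONE]", text)
    text = _URL.sub("[REDACTED_URL]", text)
    text = _IP.sub("[REDACTED_IP]", text)
    return text

def fingerprint(instruction: str, input_text: str, output: str) -> str:
    """MD5 fingerprint over the normalized instruction/input/output triple."""
    norm = " ".join(
        " ".join(part.lower().split())  # case-insensitive, whitespace-normalized
        for part in (instruction, input_text, output)
    )
    return hashlib.md5(norm.encode("utf-8")).hexdigest()

seen: set[str] = set()

def is_duplicate(instruction: str, input_text: str, output: str) -> bool:
    """In-memory, set-based deduplication keyed on the fingerprint."""
    fp = fingerprint(instruction, input_text, output)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```

Records whose fingerprint has already been seen are skipped and counted toward the deduplication metric reported in the output statistics.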
### 5. Advanced Augmentation Features

#### Knowledge Distillation

- **Pseudo-Label Generation**: Creates labels for unlabeled data
- **Fractional Processing**: Configurable fraction of unlabeled samples to distill
- **Single-Prompt Approach**: Efficient single LLM call per sample
- **Length Control**: Maintains reasonable output lengths

#### Multi-Variant Generation

- **Configurable Counts**: 1-3 augmented variants per sample
- **Tagged Augmentations**: Tracks applied augmentation techniques
- **Original Preservation**: Always keeps the base sample
- **Randomized IDs**: Unique identifiers for augmented variants

### 6. Output Generation & Storage

#### Centralized Format

- **SFT Schema**: Standardized supervised fine-tuning format
- **Metadata Preservation**: Source, task type, and augmentation tags
- **Dual Output**: Simultaneous JSONL and CSV generation
- **Memory-Safe Streaming**: Handles large datasets efficiently

#### Storage Integration

- **Local Caching**: `cache/outputs/` directory storage
- **Google Drive Upload**: Automated cloud storage integration
- **Timestamped Naming**: Unique file identification
- **MIME Type Handling**: Proper content type specification

## ⚙️ Configuration Options

### Augmentation Parameters

```python
class AugmentOptions:
    paraphrase_ratio: float = 0.0         # 0.0-1.0
    paraphrase_outputs: bool = False      # Augment model answers
    backtranslate_ratio: float = 0.0      # 0.0-1.0
    style_standardize: bool = True        # Enforce clinical style
    deidentify: bool = True               # Remove PHI
    dedupe: bool = True                   # Remove duplicates
    max_chars: int = 5000                 # Text length limit
    consistency_check_ratio: float = 0.0  # 0.0-1.0
    distill_fraction: float = 0.0         # 0.0-1.0, for unlabeled data
    expand: bool = True                   # Enable augmentation
    max_aug_per_sample: int = 2           # 1-3 variants
```

### Processing Parameters

```python
from typing import Optional

class ProcessParams:
    augment: AugmentOptions             # Augmentation settings
    sample_limit: Optional[int] = None  # Dataset sampling
    seed: int = 42                      # Reproducibility
```

## 📈 Performance & Monitoring

### Progress Tracking

- **Real-time Updates**: Live progress percentage and status messages
- **Background Processing**: Non-blocking job execution
- **State Management**: Thread-safe status tracking
- **Error Handling**: Comprehensive exception logging

### Resource Management

- **API Key Rotation**: Automatic fallback between multiple API keys
- **Rate Limiting**: Configurable request throttling
- **Memory Optimization**: Streaming processing for large datasets
- **Concurrent Processing**: Background task execution

## 🔒 Security & Privacy

### Data Protection

- **PHI Removal**: Automatic sensitive-information redaction
- **Secure Storage**: Google Drive integration with OAuth2
- **Access Control**: Environment-based API key management
- **Audit Logging**: Comprehensive processing logs

### API Security

- **OAuth2 Integration**: Google Drive authentication
- **Token Management**: Secure credential handling
- **Request Validation**: Pydantic model validation
- **Error Sanitization**: Safe error message handling

## 🚀 Usage Examples

### Basic Processing

```bash
# Process HealthCareMagic with default settings
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"augment": {"paraphrase_ratio": 0.1}}' \
  https://binkhoale1812-medai-processing.hf.space/process/healthcaremagic
```

### Advanced Augmentation

```bash
# Process with comprehensive augmentation
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "augment": {
          "paraphrase_ratio": 0.2,
          "backtranslate_ratio": 0.1,
          "paraphrase_outputs": true,
          "style_standardize": true,
          "deidentify": true,
          "dedupe": true,
          "max_chars": 5000,
          "consistency_check_ratio": 0.01,
          "max_aug_per_sample": 3
        },
        "sample_limit": 1000,
        "seed": 42
      }' \
  https://binkhoale1812-medai-processing.hf.space/process/icliniq
```

## 📊 Output Statistics

### Processing Metrics

- **Written Rows**: Total processed samples
- **Paraphrased Inputs**: Count of augmented user inputs
- **Paraphrased Outputs**: Count of augmented model responses
- **Back-translated**: Count of translation-augmented samples
- **Deduplication**: Count of skipped duplicate samples
- **Consistency Failures**: Count of validation failures

### File Outputs

- **JSONL Format**: Structured fine-tuning data with metadata
- **CSV Format**: Simplified tabular representation
- **Google Drive**: Cloud storage with automatic upload
- **Local Cache**: Persistent local storage

## 🔮 Future Enhancements

### Planned Features

- **Additional Dataset Support**: More medical dataset types
- **Advanced Augmentation**: More sophisticated LLM techniques
- **Quality Metrics**: Automated data quality scoring
- **Batch Processing**: Concurrent processing of multiple datasets
- **Custom Schemas**: User-defined output formats

### Scalability Improvements

- **Distributed Processing**: Multi-node processing support
- **Streaming Augmentation**: Real-time data enhancement
- **Caching Optimization**: Improved performance and cost efficiency
- **API Rate Limiting**: Better resource management

## 📚 Technical Dependencies

### Core Libraries

- **FastAPI**: Web framework for API development
- **Hugging Face Hub**: Dataset downloading and management
- **Google GenAI**: Gemini model integration
- **ftfy**: Text encoding and normalization
- **langid**: Language detection
- **orjson**: High-performance JSON processing

### External Services

- **NVIDIA API**: Primary LLM service for paraphrasing
- **Google Gemini**: Fallback LLM service
- **Google Drive**: Cloud storage integration
- **Hugging Face Spaces**: Deployment platform

---

*This document provides a comprehensive overview of the data processing techniques implemented in the MedAI Processing project. For specific implementation details, refer to the individual module files in the `utils/` directory.*