Spaces:
Sleeping
Sleeping
| # 📊 MedAI Data Processing Techniques | |
| This document comprehensively outlines all the data processing techniques implemented in the MedAI Processing project for augmenting and centrally processing medical datasets for LLM fine-tuning. | |
| ## 🎯 Project Overview | |
| The MedAI Processing system transforms raw medical datasets into a **centralized fine-tuning format** (JSONL + CSV) with comprehensive data augmentation capabilities. The system processes multiple medical dataset types and applies various enhancement techniques to improve data quality and diversity. | |
| ## 🏗️ System Architecture | |
| ### Core Components | |
| - **FastAPI Web Service**: RESTful API for dataset processing | |
| - **Multi-LLM Rotator**: NVIDIA API + Google Gemini integration | |
| - **Centralized Writer**: Parallel JSONL + CSV output generation | |
| - **Google Drive Integration**: Automated artifact storage | |
| - **Progress Monitoring**: Real-time job status tracking | |
| ### Supported Datasets | |
| 1. **HealthCareMagic** (100k medical dialogues) | |
| 2. **iCliniq** (10k medical consultations) | |
| 3. **PubMedQA-Labelled** (biomedical Q&A with answers) | |
| 4. **PubMedQA-Unlabelled** (biomedical Q&A without answers) | |
| 5. **PubMedQA-Map** (biomedical Q&A mapping format) | |
| ## 🔧 Data Processing Pipeline | |
| ### 1. Data Ingestion & Download | |
| - **Hugging Face Hub Integration**: Automatic dataset downloading | |
| - **Format Detection**: JSON/JSONL auto-detection and parsing | |
| - **Caching System**: Local storage with symlink optimization | |
| ### 2. Data Cleaning & Preprocessing | |
| #### Text Normalization | |
| - **Unicode Fixing**: `ftfy` library for text encoding issues | |
| - **Whitespace Standardization**: Consistent spacing and line breaks | |
| - **Quote Canonicalization**: Standard quote character conversion | |
| - **Terminal Punctuation**: Ensures proper sentence endings | |
| #### Content Sanitization | |
| - **Length Capping**: Configurable maximum character limits (default: 5000) | |
| - **Language Detection**: English language validation using `langid` | |
| - **Content Truncation**: Smart sentence boundary cutting for long texts | |
| ### 3. Data Augmentation Techniques | |
| #### LLM-Based Paraphrasing | |
| - **Multi-Model Rotation**: NVIDIA API (primary) + Gemini (fallback) | |
| - **Difficulty Levels**: Easy vs. Hard paraphrasing modes | |
| - **Medical Context Preservation**: Maintains clinical terminology accuracy | |
| - **Configurable Ratios**: User-defined augmentation percentages (0.0-1.0) | |
| #### Back-Translation Augmentation | |
| - **Multi-Language Support**: German as intermediate language | |
| - **Meaning Preservation**: Maintains semantic accuracy through translation cycles | |
| - **Fallback Mechanisms**: Automatic retry with alternative models | |
| - **Quality Control**: Length and content validation | |
| #### Style Standardization | |
| - **Clinical Voice Enforcement**: Neutral, professional medical tone | |
| - **Absolute Language Removal**: Replaces guarantees with probabilistic language | |
| - **Forum Sign-off Removal**: Eliminates informal communication patterns | |
| - **Consistent Punctuation**: Standardized sentence structure | |
| ### 4. Data Quality Assurance | |
| #### De-identification (PHI Removal) | |
| - **Email Redaction**: `[REDACTED_EMAIL]` placeholder | |
| - **Phone Number Masking**: `[REDACTED_PHONE]` placeholder | |
| - **URL/IP Address Removal**: `[REDACTED_URL]` and `[REDACTED_IP]` placeholders | |
| - **Configurable Privacy**: Optional PHI removal per dataset | |
| #### Deduplication | |
| - **Fingerprinting Algorithm**: MD5-based content hashing | |
| - **Multi-Field Matching**: Instruction + Input + Output combination | |
| - **Normalized Comparison**: Case-insensitive, whitespace-normalized matching | |
| - **Performance Optimized**: In-memory set-based deduplication | |
| #### Consistency Validation | |
| - **LLM-Based QA Check**: Automated answer validation against context | |
| - **Configurable Sampling**: Ratio-based consistency checking (e.g., 0.01) | |
| - **Medical Safety Validation**: Ensures clinical accuracy and safety | |
| - **Failure Tagging**: Marks samples with consistency issues | |
| ### 5. Advanced Augmentation Features | |
| #### Knowledge Distillation | |
| - **Pseudo-Label Generation**: Creates labels for unlabeled data | |
| - **Fractional Processing**: Configurable percentage for distillation | |
| - **Single-Prompt Approach**: Efficient single LLM call per sample | |
| - **Length Control**: Maintains reasonable output lengths | |
| #### Multi-Variant Generation | |
| - **Configurable Counts**: 1-3 augmented variants per sample | |
| - **Tagged Augmentations**: Tracks applied augmentation techniques | |
| - **Original Preservation**: Always maintains base sample | |
| - **Randomized IDs**: Unique identifiers for augmented variants | |
| ### 6. Output Generation & Storage | |
| #### Centralized Format | |
| - **SFT Schema**: Standardized Supervised Fine-Tuning format | |
| - **Metadata Preservation**: Source, task type, and augmentation tags | |
| - **Dual Output**: Simultaneous JSONL and CSV generation | |
| - **Memory-Safe Streaming**: Handles large datasets efficiently | |
| #### Storage Integration | |
| - **Local Caching**: `cache/outputs/` directory storage | |
| - **Google Drive Upload**: Automated cloud storage integration | |
| - **Timestamped Naming**: Unique file identification | |
| - **MIME Type Handling**: Proper content type specification | |
| ## ⚙️ Configuration Options | |
| ### Augmentation Parameters | |
| ```python | |
| class AugmentOptions: | |
| paraphrase_ratio: float = 0.0 # 0.0-1.0 | |
| paraphrase_outputs: bool = False # Augment model answers | |
| backtranslate_ratio: float = 0.0 # 0.0-1.0 | |
| style_standardize: bool = True # Enforce clinical style | |
| deidentify: bool = True # Remove PHI | |
| dedupe: bool = True # Remove duplicates | |
| max_chars: int = 5000 # Text length limit | |
| consistency_check_ratio: float = 0.0 # 0.0-1.0 | |
| distill_fraction: float = 0.0 # 0.0-1.0 for unlabeled | |
| expand: bool = True # Enable augmentation | |
| max_aug_per_sample: int = 2 # 1-3 variants | |
| ``` | |
| ### Processing Parameters | |
| ```python | |
| class ProcessParams: | |
| augment: AugmentOptions # Augmentation settings | |
| sample_limit: Optional[int] = None # Dataset sampling | |
| seed: int = 42 # Reproducibility | |
| ``` | |
| ## 📈 Performance & Monitoring | |
| ### Progress Tracking | |
| - **Real-time Updates**: Live progress percentage and status messages | |
| - **Background Processing**: Non-blocking job execution | |
| - **State Management**: Thread-safe status tracking | |
| - **Error Handling**: Comprehensive exception logging | |
| ### Resource Management | |
| - **API Key Rotation**: Automatic fallback between multiple API keys | |
| - **Rate Limiting**: Configurable request throttling | |
| - **Memory Optimization**: Streaming processing for large datasets | |
| - **Concurrent Processing**: Background task execution | |
| ## 🔒 Security & Privacy | |
| ### Data Protection | |
| - **PHI Removal**: Automatic sensitive information redaction | |
| - **Secure Storage**: Google Drive integration with OAuth2 | |
| - **Access Control**: Environment-based API key management | |
| - **Audit Logging**: Comprehensive processing logs | |
| ### API Security | |
| - **OAuth2 Integration**: Google Drive authentication | |
| - **Token Management**: Secure credential handling | |
| - **Request Validation**: Pydantic model validation | |
| - **Error Sanitization**: Safe error message handling | |
| ## 🚀 Usage Examples | |
| ### Basic Processing | |
| ```bash | |
| # Process HealthCareMagic with default settings | |
| curl -X POST \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"augment": {"paraphrase_ratio": 0.1}}' \ | |
| https://binkhoale1812-medai-processing.hf.space/process/healthcaremagic | |
| ``` | |
| ### Advanced Augmentation | |
| ```bash | |
| # Process with comprehensive augmentation | |
| curl -X POST \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "augment": { | |
| "paraphrase_ratio": 0.2, | |
| "backtranslate_ratio": 0.1, | |
| "paraphrase_outputs": true, | |
| "style_standardize": true, | |
| "deidentify": true, | |
| "dedupe": true, | |
| "max_chars": 5000, | |
| "consistency_check_ratio": 0.01, | |
| "max_aug_per_sample": 3 | |
| }, | |
| "sample_limit": 1000, | |
| "seed": 42 | |
| }' \ | |
| https://binkhoale1812-medai-processing.hf.space/process/icliniq | |
| ``` | |
| ## 📊 Output Statistics | |
| ### Processing Metrics | |
| - **Written Rows**: Total processed samples | |
| - **Paraphrased Inputs**: Count of augmented user inputs | |
| - **Paraphrased Outputs**: Count of augmented model responses | |
| - **Back-translated**: Count of translation-augmented samples | |
| - **Deduplication**: Count of skipped duplicate samples | |
| - **Consistency Failures**: Count of validation failures | |
| ### File Outputs | |
| - **JSONL Format**: Structured fine-tuning data with metadata | |
| - **CSV Format**: Simplified tabular representation | |
| - **Google Drive**: Cloud storage with automatic upload | |
| - **Local Cache**: Persistent local storage | |
| ## 🔮 Future Enhancements | |
| ### Planned Features | |
| - **Additional Dataset Support**: More medical dataset types | |
| - **Advanced Augmentation**: More sophisticated LLM techniques | |
| - **Quality Metrics**: Automated data quality scoring | |
| - **Batch Processing**: Multiple dataset concurrent processing | |
| - **Custom Schemas**: User-defined output formats | |
| ### Scalability Improvements | |
| - **Distributed Processing**: Multi-node processing support | |
| - **Streaming Augmentation**: Real-time data enhancement | |
| - **Caching Optimization**: Improved performance and cost efficiency | |
| - **API Rate Limiting**: Better resource management | |
| ## 📚 Technical Dependencies | |
| ### Core Libraries | |
| - **FastAPI**: Web framework for API development | |
| - **Hugging Face Hub**: Dataset downloading and management | |
| - **Google GenAI**: Gemini model integration | |
| - **ftfy**: Text encoding and normalization | |
| - **langid**: Language detection | |
| - **orjson**: High-performance JSON processing | |
| ### External Services | |
| - **NVIDIA API**: Primary LLM service for paraphrasing | |
| - **Google Gemini**: Fallback LLM service | |
| - **Google Drive**: Cloud storage integration | |
| - **Hugging Face Spaces**: Deployment platform | |
| --- | |
| *This document provides a comprehensive overview of all data processing techniques implemented in the MedAI Processing project. For specific implementation details, refer to the individual module files in the `utils/` directory.* | |