📊 MedAI Data Processing Techniques
This document outlines the data processing techniques implemented in the MedAI Processing project for augmenting medical datasets and consolidating them into a centralized format for LLM fine-tuning.
🎯 Project Overview
The MedAI Processing system transforms raw medical datasets into a centralized fine-tuning format (JSONL + CSV) with comprehensive data augmentation capabilities. The system processes multiple medical dataset types and applies various enhancement techniques to improve data quality and diversity.
🏗️ System Architecture
Core Components
- FastAPI Web Service: RESTful API for dataset processing
- Multi-LLM Rotator: NVIDIA API + Google Gemini integration
- Centralized Writer: Parallel JSONL + CSV output generation
- Google Drive Integration: Automated artifact storage
- Progress Monitoring: Real-time job status tracking
Supported Datasets
- HealthCareMagic (100k medical dialogues)
- iCliniq (10k medical consultations)
- PubMedQA-Labelled (biomedical Q&A with answers)
- PubMedQA-Unlabelled (biomedical Q&A without answers)
- PubMedQA-Map (biomedical Q&A mapping format)
🔧 Data Processing Pipeline
1. Data Ingestion & Download
- Hugging Face Hub Integration: Automatic dataset downloading
- Format Detection: JSON/JSONL auto-detection and parsing
- Caching System: Local storage with symlink optimization
2. Data Cleaning & Preprocessing
Text Normalization
- Unicode Fixing: `ftfy` library for text encoding issues
- Whitespace Standardization: Consistent spacing and line breaks
- Quote Canonicalization: Standard quote character conversion
- Terminal Punctuation: Ensures proper sentence endings
Content Sanitization
- Length Capping: Configurable maximum character limits (default: 5000)
- Language Detection: English-language validation using `langid`
- Content Truncation: Smart sentence-boundary cutting for long texts
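A minimal sketch of this cleaning pass, assuming straightforward use of `ftfy` and `langid` (the function name and exact truncation heuristic are illustrative, not the project's actual code):

```python
from typing import Optional

import ftfy
import langid

def clean_text(text: str, max_chars: int = 5000) -> Optional[str]:
    """Fix encoding, normalize whitespace, keep English only, cap length."""
    text = ftfy.fix_text(text)            # repair mojibake and bad unicode
    text = " ".join(text.split())         # standardize whitespace
    lang, _ = langid.classify(text)
    if lang != "en":
        return None                       # drop non-English samples
    if len(text) > max_chars:
        cut = text.rfind(". ", 0, max_chars)
        text = text[:cut + 1] if cut > 0 else text[:max_chars]
    if text and text[-1] not in ".!?":
        text += "."                       # ensure terminal punctuation
    return text
```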
3. Data Augmentation Techniques
LLM-Based Paraphrasing
- Multi-Model Rotation: NVIDIA API (primary) + Gemini (fallback)
- Difficulty Levels: Easy vs. Hard paraphrasing modes
- Medical Context Preservation: Maintains clinical terminology accuracy
- Configurable Ratios: User-defined augmentation percentages (0.0-1.0)
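A sketch of ratio-gated paraphrasing, assuming a hypothetical `call_llm` helper that wraps the NVIDIA-primary, Gemini-fallback rotator; the prompt wording below is illustrative, not the project's actual prompts:

```python
import random

def call_llm(prompt: str) -> str:
    """Hypothetical stub for the NVIDIA/Gemini multi-model rotator."""
    raise NotImplementedError

PROMPTS = {
    "easy": "Paraphrase this medical text, keeping clinical terms intact:\n\n{text}",
    "hard": ("Rewrite this medical text with a substantially different structure, "
             "preserving every clinical fact and term:\n\n{text}"),
}

def maybe_paraphrase(text: str, ratio: float, rng: random.Random,
                     difficulty: str = "easy") -> str:
    """Paraphrase roughly `ratio` of samples; pass the rest through unchanged."""
    if rng.random() >= ratio:
        return text
    return call_llm(PROMPTS[difficulty].format(text=text))
```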
Back-Translation Augmentation
- Multi-Language Support: German as intermediate language
- Meaning Preservation: Maintains semantic accuracy through translation cycles
- Fallback Mechanisms: Automatic retry with alternative models
- Quality Control: Length and content validation
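A sketch of one English-German-English round trip, reusing the hypothetical `call_llm` stub from the paraphrasing sketch above:

```python
def backtranslate_en(text: str) -> str:
    """Round-trip EN -> DE -> EN to produce a meaning-preserving variant."""
    german = call_llm(f"Translate to German, preserving medical meaning:\n\n{text}")
    english = call_llm(f"Translate to English, preserving medical meaning:\n\n{german}")
    # quality control: reject empty or implausibly short round-trips
    if not english or len(english) < 0.5 * len(text):
        return text                       # fall back to the original sample
    return english
```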
Style Standardization
- Clinical Voice Enforcement: Neutral, professional medical tone
- Absolute Language Removal: Replaces guarantees with probabilistic language
- Forum Sign-off Removal: Eliminates informal communication patterns
- Consistent Punctuation: Standardized sentence structure
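A hedged sketch of rule-based style standardization; the patterns below are simplified illustrations, not the project's actual rule set:

```python
import re

SIGNOFF_RE = re.compile(r"(?im)^\s*(regards|best wishes|take care)\b.*$")
ABSOLUTES = [
    (re.compile(r"\bwill cure\b", re.I), "may improve"),
    (re.compile(r"\bguaranteed\b", re.I), "likely"),
]

def standardize_style(text: str) -> str:
    text = SIGNOFF_RE.sub("", text)       # strip forum-style sign-offs
    for pattern, softer in ABSOLUTES:
        text = pattern.sub(softer, text)  # probabilistic over absolute language
    return " ".join(text.split())         # consistent spacing
```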
4. Data Quality Assurance
De-identification (PHI Removal)
- Email Redaction: `[REDACTED_EMAIL]` placeholder
- Phone Number Masking: `[REDACTED_PHONE]` placeholder
- URL/IP Address Removal: `[REDACTED_URL]` and `[REDACTED_IP]` placeholders
- Configurable Privacy: Optional PHI removal per dataset
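A sketch of the redaction pass with simplified regular expressions; the project's actual patterns may be stricter, and ordering matters here, since the loose phone pattern would otherwise swallow IP addresses:

```python
import re

PHI_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[REDACTED_EMAIL]"),
    (re.compile(r"https?://\S+"), "[REDACTED_URL]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[REDACTED_IP]"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[REDACTED_PHONE]"),
]

def deidentify(text: str) -> str:
    """Replace PHI-bearing spans with fixed placeholders, in order."""
    for pattern, placeholder in PHI_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```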
Deduplication
- Fingerprinting Algorithm: MD5-based content hashing
- Multi-Field Matching: Instruction + Input + Output combination
- Normalized Comparison: Case-insensitive, whitespace-normalized matching
- Performance Optimized: In-memory set-based deduplication
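A minimal sketch of the fingerprinting approach; field names follow the SFT schema, and `seen` would live for the duration of one processing job:

```python
import hashlib

def fingerprint(instruction: str, input_text: str, output: str) -> str:
    """MD5 over the normalized instruction/input/output triple."""
    normalized = "\x1f".join(
        " ".join(field.lower().split())   # case-insensitive, whitespace-normalized
        for field in (instruction, input_text, output)
    )
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

seen: set[str] = set()

def is_duplicate(sample: dict) -> bool:
    fp = fingerprint(sample["instruction"], sample["input"], sample["output"])
    if fp in seen:
        return True
    seen.add(fp)
    return False
```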
Consistency Validation
- LLM-Based QA Check: Automated answer validation against context
- Configurable Sampling: Ratio-based consistency checking (e.g., 0.01)
- Medical Safety Validation: Ensures clinical accuracy and safety
- Failure Tagging: Marks samples with consistency issues
5. Advanced Augmentation Features
Knowledge Distillation
- Pseudo-Label Generation: Creates labels for unlabeled data
- Fractional Processing: Configurable percentage for distillation
- Single-Prompt Approach: Efficient single LLM call per sample
- Length Control: Maintains reasonable output lengths
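A sketch of fractional pseudo-labeling, again reusing the hypothetical `call_llm` stub; the prompt is illustrative:

```python
import random

def distill_pseudo_labels(samples: list[dict], fraction: float,
                          rng: random.Random) -> None:
    """Generate pseudo-labels for a fraction of unlabeled samples,
    one LLM call per selected sample."""
    for sample in samples:
        if sample.get("output") or rng.random() >= fraction:
            continue  # already labeled, or not selected for distillation
        prompt = f"Answer this biomedical question concisely:\n\n{sample['input']}"
        sample["output"] = call_llm(prompt)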
Multi-Variant Generation
- Configurable Counts: 1-3 augmented variants per sample
- Tagged Augmentations: Tracks applied augmentation techniques
- Original Preservation: Always maintains base sample
- Randomized IDs: Unique identifiers for augmented variants
6. Output Generation & Storage
Centralized Format
- SFT Schema: Standardized Supervised Fine-Tuning format
- Metadata Preservation: Source, task type, and augmentation tags
- Dual Output: Simultaneous JSONL and CSV generation
- Memory-Safe Streaming: Handles large datasets efficiently
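A sketch of the dual-output streaming writer, assuming `orjson` for the JSONL side and the standard-library `csv` module for the tabular side; the CSV column set is an assumption:

```python
import csv
import orjson

def write_centralized(samples, jsonl_path: str, csv_path: str) -> int:
    """Stream samples to JSONL and CSV simultaneously, without holding
    the full dataset in memory."""
    written = 0
    with open(jsonl_path, "wb") as jf, open(csv_path, "w", newline="") as cf:
        writer = csv.DictWriter(cf, fieldnames=["instruction", "input", "output"])
        writer.writeheader()
        for sample in samples:
            jf.write(orjson.dumps(sample) + b"\n")  # full record with metadata
            writer.writerow({k: sample.get(k, "") for k in writer.fieldnames})
            written += 1
    return written
```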
Storage Integration
- Local Caching: `cache/outputs/` directory storage
- Google Drive Upload: Automated cloud storage integration
- Timestamped Naming: Unique file identification
- MIME Type Handling: Proper content type specification
⚙️ Configuration Options
Augmentation Parameters
```python
from pydantic import BaseModel

class AugmentOptions(BaseModel):
    paraphrase_ratio: float = 0.0         # 0.0-1.0
    paraphrase_outputs: bool = False      # Augment model answers
    backtranslate_ratio: float = 0.0      # 0.0-1.0
    style_standardize: bool = True        # Enforce clinical style
    deidentify: bool = True               # Remove PHI
    dedupe: bool = True                   # Remove duplicates
    max_chars: int = 5000                 # Text length limit
    consistency_check_ratio: float = 0.0  # 0.0-1.0
    distill_fraction: float = 0.0         # 0.0-1.0, for unlabeled data
    expand: bool = True                   # Enable augmentation
    max_aug_per_sample: int = 2           # 1-3 variants
```
Processing Parameters
```python
from typing import Optional

class ProcessParams(BaseModel):
    augment: AugmentOptions               # Augmentation settings
    sample_limit: Optional[int] = None    # Dataset sampling cap
    seed: int = 42                        # Reproducibility
```
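For example, a run that paraphrases 20% of inputs over a 1,000-sample slice maps onto these models as:

```python
params = ProcessParams(
    augment=AugmentOptions(paraphrase_ratio=0.2, max_aug_per_sample=3),
    sample_limit=1000,
    seed=42,
)
```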
📈 Performance & Monitoring
Progress Tracking
- Real-time Updates: Live progress percentage and status messages
- Background Processing: Non-blocking job execution
- State Management: Thread-safe status tracking
- Error Handling: Comprehensive exception logging
Resource Management
- API Key Rotation: Automatic fallback between multiple API keys
- Rate Limiting: Configurable request throttling
- Memory Optimization: Streaming processing for large datasets
- Concurrent Processing: Background task execution
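A minimal sketch of key rotation, assuming a simple round-robin over the configured keys; callers would invoke `rotate()` on rate-limit or auth errors before retrying (the class is illustrative, and the project's rotator also spans providers):

```python
import itertools

class KeyRotator:
    """Round-robin over API keys, advancing to the next key on failure."""

    def __init__(self, keys: list[str]):
        self._keys = itertools.cycle(keys)
        self.current = next(self._keys)

    def rotate(self) -> str:
        self.current = next(self._keys)
        return self.current
```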
🔒 Security & Privacy
Data Protection
- PHI Removal: Automatic sensitive information redaction
- Secure Storage: Google Drive integration with OAuth2
- Access Control: Environment-based API key management
- Audit Logging: Comprehensive processing logs
API Security
- OAuth2 Integration: Google Drive authentication
- Token Management: Secure credential handling
- Request Validation: Pydantic model validation
- Error Sanitization: Safe error message handling
🚀 Usage Examples
Basic Processing
```bash
# Process HealthCareMagic with default settings
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"augment": {"paraphrase_ratio": 0.1}}' \
  https://binkhoale1812-medai-processing.hf.space/process/healthcaremagic
```
Advanced Augmentation
```bash
# Process with comprehensive augmentation
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "augment": {
          "paraphrase_ratio": 0.2,
          "backtranslate_ratio": 0.1,
          "paraphrase_outputs": true,
          "style_standardize": true,
          "deidentify": true,
          "dedupe": true,
          "max_chars": 5000,
          "consistency_check_ratio": 0.01,
          "max_aug_per_sample": 3
        },
        "sample_limit": 1000,
        "seed": 42
      }' \
  https://binkhoale1812-medai-processing.hf.space/process/icliniq
```
📊 Output Statistics
Processing Metrics
- Written Rows: Total processed samples
- Paraphrased Inputs: Count of augmented user inputs
- Paraphrased Outputs: Count of augmented model responses
- Back-translated: Count of translation-augmented samples
- Deduplication: Count of skipped duplicate samples
- Consistency Failures: Count of validation failures
File Outputs
- JSONL Format: Structured fine-tuning data with metadata
- CSV Format: Simplified tabular representation
- Google Drive: Cloud storage with automatic upload
- Local Cache: Persistent local storage
🔮 Future Enhancements
Planned Features
- Additional Dataset Support: More medical dataset types
- Advanced Augmentation: More sophisticated LLM techniques
- Quality Metrics: Automated data quality scoring
- Batch Processing: Multiple dataset concurrent processing
- Custom Schemas: User-defined output formats
Scalability Improvements
- Distributed Processing: Multi-node processing support
- Streaming Augmentation: Real-time data enhancement
- Caching Optimization: Improved performance and cost efficiency
- API Rate Limiting: Better resource management
📚 Technical Dependencies
Core Libraries
- FastAPI: Web framework for API development
- Hugging Face Hub: Dataset downloading and management
- Google GenAI: Gemini model integration
- ftfy: Text encoding and normalization
- langid: Language detection
- orjson: High-performance JSON processing
External Services
- NVIDIA API: Primary LLM service for paraphrasing
- Google Gemini: Fallback LLM service
- Google Drive: Cloud storage integration
- Hugging Face Spaces: Deployment platform
This document provides an overview of the data processing techniques implemented in the MedAI Processing project. For specific implementation details, refer to the individual module files in the `utils/` directory.