# 📊 MedAI Data Processing Techniques
This document outlines the data processing techniques implemented in the MedAI Processing project, which augments and centrally processes medical datasets for LLM fine-tuning.
## 🎯 Project Overview
The MedAI Processing system transforms raw medical datasets into a **centralized fine-tuning format** (JSONL + CSV) and can augment them along the way. It handles several medical dataset types and applies cleaning, paraphrasing, back-translation, de-identification, and deduplication to improve data quality and diversity.
## 🏗️ System Architecture
### Core Components
- **FastAPI Web Service**: RESTful API for dataset processing
- **Multi-LLM Rotator**: NVIDIA API + Google Gemini integration
- **Centralized Writer**: Parallel JSONL + CSV output generation
- **Google Drive Integration**: Automated artifact storage
- **Progress Monitoring**: Real-time job status tracking
### Supported Datasets
1. **HealthCareMagic** (100k medical dialogues)
2. **iCliniq** (10k medical consultations)
3. **PubMedQA-Labelled** (biomedical Q&A with answers)
4. **PubMedQA-Unlabelled** (biomedical Q&A without answers)
5. **PubMedQA-Map** (biomedical Q&A mapping format)
## 🔧 Data Processing Pipeline
### 1. Data Ingestion & Download
- **Hugging Face Hub Integration**: Automatic dataset downloading
- **Format Detection**: JSON/JSONL auto-detection and parsing
- **Caching System**: Local storage with symlink optimization
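
The JSON/JSONL auto-detection step can be approximated with the standard library alone. The sketch below is an assumption about how such detection might work, not the project's actual loader: the hypothetical `load_records` helper first tries to parse the whole file as one JSON document and falls back to line-by-line parsing when that fails.

```python
import json
from typing import Any, List


def load_records(path: str) -> List[Any]:
    """Hypothetical helper: load a file that is either JSON or JSONL."""
    with open(path, encoding="utf-8") as fh:
        text = fh.read()
    try:
        # A regular JSON file parses in one go.
        data = json.loads(text)
    except json.JSONDecodeError:
        # Otherwise treat every non-empty line as its own JSON value (JSONL).
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # Wrap a single top-level object so callers always receive a list.
    return data if isinstance(data, list) else [data]
```

This version reads the whole file into memory, which is convenient for a sketch but would need a streaming variant for very large downloads.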
### 2. Data Cleaning & Preprocessing
#### Text Normalization
- **Unicode Fixing**: `ftfy` library for text encoding issues
- **Whitespace Standardization**: Consistent spacing and line breaks
- **Quote Canonicalization**: Standard quote character conversion
- **Terminal Punctuation**: Ensures proper sentence endings
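
As a rough illustration of these normalization steps, the sketch below uses only the standard library. The `normalize_text` name is hypothetical, and `unicodedata.normalize` is merely a stand-in for the broader encoding repair that `ftfy` performs in the project.

```python
import re
import unicodedata

# Map curly quotes to their plain ASCII equivalents.
_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def normalize_text(text: str) -> str:
    """Hypothetical helper: normalize Unicode, quotes, whitespace, and ending."""
    text = unicodedata.normalize("NFC", text)        # stand-in for ftfy's fixes
    text = text.translate(_QUOTES)                   # canonical quote characters
    text = re.sub(r"[ \t]+", " ", text)              # collapse runs of spaces and tabs
    text = re.sub(r"\s*\n\s*", "\n", text).strip()   # tidy line breaks and the edges
    if text and text[-1] not in ".!?":
        text += "."                                  # ensure terminal punctuation
    return text
```

Note that this sketch also collapses blank lines between paragraphs into single line breaks; whether the project does the same is not stated here.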
#### Content Sanitization
- **Length Capping**: Configurable maximum character limits (default: 5000)
- **Language Detection**: English language validation using `langid`
- **Content Truncation**: Smart sentence boundary cutting for long texts
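
Sentence-boundary truncation can be sketched as follows. The `cap_length` helper is hypothetical; it assumes that a sentence ends at `.`, `!`, or `?` and falls back to a hard cut when no such character appears within the limit.

```python
def cap_length(text: str, max_chars: int = 5000) -> str:
    """Hypothetical helper: truncate at the last sentence end within the limit."""
    if len(text) <= max_chars:
        return text
    window = text[:max_chars]
    # Position of the last sentence-ending character inside the window, or -1.
    cut = max(window.rfind(mark) for mark in ".!?")
    if cut == -1:
        return window.rstrip()  # no sentence boundary found: hard cut
    return window[: cut + 1]
```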
### 3. Data Augmentation Techniques
#### LLM-Based Paraphrasing
- **Multi-Model Rotation**: NVIDIA API (primary) + Gemini (fallback)
- **Difficulty Levels**: Easy vs. Hard paraphrasing modes
- **Medical Context Preservation**: Maintains clinical terminology accuracy
- **Configurable Ratios**: User-defined augmentation percentages (0.0-1.0)
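
The primary/fallback rotation and the ratio-based selection can be sketched without committing to any particular client library. In the example below the backends are plain callables; their signature, and the names `paraphrase_with_fallback` and `should_augment`, are assumptions made for illustration.

```python
import random
from typing import Callable, Optional, Sequence

# Assumed backend signature: (text, difficulty) -> paraphrased text.
Paraphraser = Callable[[str, str], str]


def paraphrase_with_fallback(
    text: str, backends: Sequence[Paraphraser], difficulty: str = "easy"
) -> Optional[str]:
    """Try each backend in order and return the first non-empty result."""
    for backend in backends:
        try:
            result = backend(text, difficulty)
        except Exception:
            continue  # this backend failed; fall through to the next one
        if result and result.strip():
            return result.strip()
    return None  # every backend failed; the caller keeps the original text


def should_augment(rng: random.Random, ratio: float) -> bool:
    """Select roughly `ratio` of the samples for augmentation."""
    return rng.random() < ratio
```

Passing a seeded `random.Random` keeps the selection reproducible across runs with the same seed.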
#### Back-Translation Augmentation
- **Pivot Language**: German as the intermediate language
- **Meaning Preservation**: Maintains semantic accuracy through translation cycles
- **Fallback Mechanisms**: Automatic retry with alternative models
- **Quality Control**: Length and content validation
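
A back-translation cycle with a simple length check might look like the sketch below. The `translate` callable, its signature, and the `min_ratio`/`max_ratio` thresholds are assumptions chosen for illustration; the project's actual validation rules are not given in this document.

```python
from typing import Callable, Optional

# Assumed translator signature: (text, source_lang, target_lang) -> translation.
Translator = Callable[[str, str, str], str]


def back_translate(
    text: str,
    translate: Translator,
    pivot: str = "de",
    min_ratio: float = 0.5,
    max_ratio: float = 2.0,
) -> Optional[str]:
    """Translate English -> pivot -> English and validate the round trip."""
    pivot_text = translate(text, "en", pivot)
    round_trip = translate(pivot_text, pivot, "en").strip()
    if not round_trip or round_trip == text.strip():
        return None  # empty or unchanged output adds no diversity
    # Reject results whose length drifts too far from the original.
    ratio = len(round_trip) / max(len(text), 1)
    return round_trip if min_ratio <= ratio <= max_ratio else None
```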
#### Style Standardization
- **Clinical Voice Enforcement**: Neutral, professional medical tone
- **Absolute Language Removal**: Replaces guarantees with probabilistic language
- **Forum Sign-off Removal**: Eliminates informal communication patterns
- **Consistent Punctuation**: Standardized sentence structure
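
Rule-based style standardization can be sketched with a few regular expressions. The patterns and replacements below are illustrative assumptions, not the project's actual rules, and the sign-off rule only fires when the sign-off starts a new sentence.

```python
import re

# Illustrative patterns only; the project's real rule set is not shown here.
# Matches a sign-off that begins a new sentence, plus everything after it.
_SIGN_OFF = re.compile(
    r"(?<=[.!?])\s+(?:best regards|kind regards|regards|hope this helps)\b.*$",
    re.IGNORECASE | re.DOTALL,
)
_ABSOLUTES = {
    r"\bwill definitely\b": "is likely to",
    r"\bguaranteed to\b": "likely to",
    r"\b100% safe\b": "generally considered safe",
}


def standardize_style(text: str) -> str:
    """Hypothetical helper: drop forum sign-offs and soften absolute claims."""
    text = _SIGN_OFF.sub("", text)
    for pattern, replacement in _ABSOLUTES.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text.strip()
```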
### 4. Data Quality Assurance
#### De-identification (PHI Removal)
- **Email Redaction**: `[REDACTED_EMAIL]` placeholder
- **Phone Number Masking**: `[REDACTED_PHONE]` placeholder
- **URL/IP Address Removal**: `[REDACTED_URL]` and `[REDACTED_IP]` placeholders
- **Configurable Privacy**: Optional PHI removal per dataset
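
A minimal regex-based redaction pass using the placeholders listed above is sketched here. The patterns are simplified assumptions (real-world phone and URL formats vary widely), and the `deidentify` name is hypothetical.

```python
import re

# Order matters: emails and URLs are replaced before the looser phone pattern runs.
_PATTERNS = [
    ("[REDACTED_EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("[REDACTED_URL]", re.compile(r"\bhttps?://\S+|\bwww\.\S+", re.IGNORECASE)),
    ("[REDACTED_IP]", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("[REDACTED_PHONE]", re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")),
]


def deidentify(text: str) -> str:
    """Hypothetical helper: replace contact details with placeholders."""
    for placeholder, pattern in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Pattern matching of this kind only catches structured identifiers; names, addresses, and dates would need additional handling.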
#### Deduplication
- **Fingerprinting Algorithm**: MD5-based content hashing
- **Multi-Field Matching**: Instruction + Input + Output combination
- **Normalized Comparison**: Case-insensitive, whitespace-normalized matching
- **Performance Optimized**: In-memory set-based deduplication
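
The fingerprinting approach can be sketched in a few lines. The function and class names are hypothetical, and the field separator is an assumption added so that text shifting between fields does not produce the same hash.

```python
import hashlib
import re


def fingerprint(instruction: str, user_input: str, output: str) -> str:
    """MD5 over the lowercased, whitespace-normalized fields."""
    parts = [
        re.sub(r"\s+", " ", field).strip().lower()
        for field in (instruction, user_input, output)
    ]
    joined = "\x1f".join(parts)  # unit separator keeps field boundaries distinct
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


class Deduper:
    """In-memory, set-based duplicate filter."""

    def __init__(self) -> None:
        self._seen = set()

    def is_duplicate(self, instruction: str, user_input: str, output: str) -> bool:
        key = fingerprint(instruction, user_input, output)
        if key in self._seen:
            return True
        self._seen.add(key)
        return False
```

MD5 is used here purely as a fast content hash, not for any security purpose.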
#### Consistency Validation
- **LLM-Based QA Check**: Automated answer validation against context
- **Configurable Sampling**: Ratio-based consistency checking (e.g., 0.01)
- **Medical Safety Validation**: Ensures clinical accuracy and safety
- **Failure Tagging**: Marks samples with consistency issues
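
Ratio-based consistency checking with failure tagging might be wired up as below. The `judge` callable stands in for the LLM-based check, and the sample field names (`input`, `context`, `output`, `meta`, `tags`) are assumptions for illustration.

```python
import random
from typing import Any, Callable, Dict

# Assumed judge signature: (question, context, answer) -> True if consistent.
Judge = Callable[[str, str, str], bool]


def tag_consistency(
    sample: Dict[str, Any], judge: Judge, rng: random.Random, ratio: float = 0.01
) -> Dict[str, Any]:
    """Check a random fraction of samples and tag the ones that fail."""
    if rng.random() >= ratio:
        return sample  # not selected for checking
    consistent = judge(
        sample.get("input", ""), sample.get("context", ""), sample.get("output", "")
    )
    if not consistent:
        # Record the failure in the sample's metadata rather than dropping it.
        sample.setdefault("meta", {}).setdefault("tags", []).append("consistency_failed")
    return sample
```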
### 5. Advanced Augmentation Features
#### Knowledge Distillation
- **Pseudo-Label Generation**: Creates labels for unlabeled data
- **Fractional Processing**: Configurable percentage for distillation
- **Single-Prompt Approach**: Efficient single LLM call per sample
- **Length Control**: Maintains reasonable output lengths
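
Pseudo-labeling a fraction of unlabeled samples can be sketched as a generator. The `generate` callable stands in for the single LLM call per sample, and the prompt wording, field names, and `distilled` tag are all assumptions.

```python
import random
from typing import Any, Callable, Dict, Iterable, Iterator


def distill_labels(
    samples: Iterable[Dict[str, Any]],
    generate: Callable[[str], str],
    fraction: float,
    rng: random.Random,
    max_chars: int = 5000,
) -> Iterator[Dict[str, Any]]:
    """Fill in missing outputs for roughly `fraction` of the unlabeled samples."""
    for sample in samples:
        if not sample.get("output") and rng.random() < fraction:
            # One prompt, one call per selected sample.
            answer = generate(f"Answer concisely: {sample.get('input', '')}").strip()
            if answer:
                sample = dict(sample, output=answer[:max_chars], aug_tags=["distilled"])
        yield sample
```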
#### Multi-Variant Generation
- **Configurable Counts**: 1-3 augmented variants per sample
- **Tagged Augmentations**: Tracks applied augmentation techniques
- **Original Preservation**: Always maintains base sample
- **Randomized IDs**: Unique identifiers for augmented variants
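
One way to expand a sample into tagged variants while keeping the original is sketched below. The `expand_sample` helper, the `(tag, text)` variant shape, and the `id`/`input`/`aug_tags` field names are assumptions.

```python
import uuid
from typing import Any, Dict, List, Sequence, Tuple


def expand_sample(
    sample: Dict[str, Any],
    variants: Sequence[Tuple[str, str]],
    max_aug_per_sample: int = 2,
) -> List[Dict[str, Any]]:
    """Return the original sample followed by at most N tagged variants."""
    rows = [sample]  # the base sample is always preserved
    for tag, text in variants[:max_aug_per_sample]:
        rows.append(
            dict(
                sample,
                id=f"{sample['id']}-{uuid.uuid4().hex[:8]}",  # randomized suffix
                input=text,
                aug_tags=[tag],
            )
        )
    return rows
```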
### 6. Output Generation & Storage
#### Centralized Format
- **SFT Schema**: Standardized Supervised Fine-Tuning format
- **Metadata Preservation**: Source, task type, and augmentation tags
- **Dual Output**: Simultaneous JSONL and CSV generation
- **Memory-Safe Streaming**: Handles large datasets efficiently
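
Writing JSONL and CSV side by side, one row at a time, keeps memory use flat regardless of dataset size. The sketch below assumes a hypothetical column list (`FIELDS`); the project's actual SFT schema is not spelled out in this document.

```python
import csv
import json
from typing import Any, Dict, Iterable

# Assumed CSV columns; extra keys still land in the JSONL output.
FIELDS = ["instruction", "input", "output", "source", "task"]


def write_dual(rows: Iterable[Dict[str, Any]], jsonl_path: str, csv_path: str) -> int:
    """Stream rows to a JSONL file and a CSV file simultaneously."""
    written = 0
    with open(jsonl_path, "w", encoding="utf-8") as jf, open(
        csv_path, "w", encoding="utf-8", newline=""
    ) as cf:
        writer = csv.DictWriter(cf, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        for row in rows:
            jf.write(json.dumps(row, ensure_ascii=False) + "\n")  # full record
            writer.writerow(row)  # tabular subset; missing fields become ""
            written += 1
    return written
```

Because `rows` can be any iterable, a generator can feed this function without the full dataset ever being held in memory.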
#### Storage Integration
- **Local Caching**: `cache/outputs/` directory storage
- **Google Drive Upload**: Automated cloud storage integration
- **Timestamped Naming**: Unique file identification
- **MIME Type Handling**: Proper content type specification
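
Timestamped file names under `cache/outputs/` can be generated as follows. The naming pattern and the `output_paths` helper are assumptions; only the directory comes from the list above.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional, Tuple


def output_paths(
    dataset: str, when: Optional[datetime] = None, root: str = "cache/outputs"
) -> Tuple[Path, Path]:
    """Build matching JSONL and CSV paths that share one UTC timestamp."""
    stamp = (when or datetime.now(timezone.utc)).strftime("%Y%m%d-%H%M%S")
    base = f"{dataset}-{stamp}"
    return Path(root) / f"{base}.jsonl", Path(root) / f"{base}.csv"
```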
## ⚙️ Configuration Options
### Augmentation Parameters
```python
class AugmentOptions:
    paraphrase_ratio: float = 0.0         # 0.0-1.0
    paraphrase_outputs: bool = False      # Augment model answers
    backtranslate_ratio: float = 0.0      # 0.0-1.0
    style_standardize: bool = True        # Enforce clinical style
    deidentify: bool = True               # Remove PHI
    dedupe: bool = True                   # Remove duplicates
    max_chars: int = 5000                 # Text length limit
    consistency_check_ratio: float = 0.0  # 0.0-1.0
    distill_fraction: float = 0.0         # 0.0-1.0 for unlabeled
    expand: bool = True                   # Enable augmentation
    max_aug_per_sample: int = 2           # 1-3 variants
```
### Processing Parameters
```python
from typing import Optional


class ProcessParams:
    augment: AugmentOptions             # Augmentation settings
    sample_limit: Optional[int] = None  # Dataset sampling
    seed: int = 42                      # Reproducibility
```
## 📈 Performance & Monitoring
### Progress Tracking
- **Real-time Updates**: Live progress percentage and status messages
- **Background Processing**: Non-blocking job execution
- **State Management**: Thread-safe status tracking
- **Error Handling**: Comprehensive exception logging
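
Thread-safe status tracking for a background job can be as small as a dictionary guarded by a lock. The `JobState` class and its field names are hypothetical.

```python
import threading
from typing import Any, Dict


class JobState:
    """Shared job status that worker and request threads can both touch."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._state: Dict[str, Any] = {"status": "idle", "progress": 0.0, "message": ""}

    def update(self, **fields: Any) -> None:
        with self._lock:
            self._state.update(fields)

    def snapshot(self) -> Dict[str, Any]:
        # Return a copy so callers never see a half-updated or shared dict.
        with self._lock:
            return dict(self._state)
```

A status endpoint would return `snapshot()` while the background task calls `update(progress=..., message=...)` as it works.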
### Resource Management
- **API Key Rotation**: Automatic fallback between multiple API keys
- **Rate Limiting**: Configurable request throttling
- **Memory Optimization**: Streaming processing for large datasets
- **Concurrent Processing**: Background task execution
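
Key rotation on failure can be sketched as below. The `KeyRotator` class and `call_with_rotation` helper are hypothetical, and `request` stands in for whatever function performs the API call with a given key.

```python
import threading
from typing import Callable, Sequence


class KeyRotator:
    """Round-robin over a fixed list of API keys."""

    def __init__(self, keys: Sequence[str]) -> None:
        if not keys:
            raise ValueError("at least one API key is required")
        self._keys = list(keys)
        self._index = 0
        self._lock = threading.Lock()

    @property
    def size(self) -> int:
        return len(self._keys)

    def current(self) -> str:
        with self._lock:
            return self._keys[self._index]

    def rotate(self) -> str:
        with self._lock:
            self._index = (self._index + 1) % len(self._keys)
            return self._keys[self._index]


def call_with_rotation(rotator: KeyRotator, request: Callable[[str], str]) -> str:
    """Try the current key, rotating after each failure, once per key at most."""
    last_error = None
    for _ in range(rotator.size):
        try:
            return request(rotator.current())
        except Exception as exc:
            last_error = exc
            rotator.rotate()
    raise RuntimeError("all API keys failed") from last_error
```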
## 🔒 Security & Privacy
### Data Protection
- **PHI Removal**: Automatic sensitive information redaction
- **Secure Storage**: Google Drive integration with OAuth2
- **Access Control**: Environment-based API key management
- **Audit Logging**: Comprehensive processing logs
### API Security
- **OAuth2 Integration**: Google Drive authentication
- **Token Management**: Secure credential handling
- **Request Validation**: Pydantic model validation
- **Error Sanitization**: Safe error message handling
## 🚀 Usage Examples
### Basic Processing
```bash
# Process HealthCareMagic with a paraphrase ratio of 0.1 (all other options keep their defaults)
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"augment": {"paraphrase_ratio": 0.1}}' \
  https://binkhoale1812-medai-processing.hf.space/process/healthcaremagic
```
### Advanced Augmentation
```bash
# Process with comprehensive augmentation
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "augment": {
      "paraphrase_ratio": 0.2,
      "backtranslate_ratio": 0.1,
      "paraphrase_outputs": true,
      "style_standardize": true,
      "deidentify": true,
      "dedupe": true,
      "max_chars": 5000,
      "consistency_check_ratio": 0.01,
      "max_aug_per_sample": 3
    },
    "sample_limit": 1000,
    "seed": 42
  }' \
  https://binkhoale1812-medai-processing.hf.space/process/icliniq
```
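
The same requests can be issued from Python. The sketch below only builds the request using the standard library; the `build_process_request` helper is hypothetical, and the endpoint path and payload keys mirror the `curl` examples above.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "https://binkhoale1812-medai-processing.hf.space"


def build_process_request(
    dataset: str,
    augment: Dict[str, Any],
    sample_limit: Optional[int] = None,
    seed: int = 42,
) -> urllib.request.Request:
    """Build a POST request for the /process/<dataset> endpoint."""
    payload: Dict[str, Any] = {"augment": augment, "seed": seed}
    if sample_limit is not None:
        payload["sample_limit"] = sample_limit
    return urllib.request.Request(
        f"{BASE_URL}/process/{dataset}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_process_request("icliniq", {"paraphrase_ratio": 0.2}, sample_limit=1000)
# To actually submit the job, pass the request to urllib.request.urlopen(req).
```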
## 📊 Output Statistics
### Processing Metrics
- **Written Rows**: Total processed samples
- **Paraphrased Inputs**: Count of augmented user inputs
- **Paraphrased Outputs**: Count of augmented model responses
- **Back-translated**: Count of translation-augmented samples
- **Deduplication**: Count of skipped duplicate samples
- **Consistency Failures**: Count of validation failures
### File Outputs
- **JSONL Format**: Structured fine-tuning data with metadata
- **CSV Format**: Simplified tabular representation
- **Google Drive**: Cloud storage with automatic upload
- **Local Cache**: Persistent local storage
## 🔮 Future Enhancements
### Planned Features
- **Additional Dataset Support**: More medical dataset types
- **Advanced Augmentation**: More sophisticated LLM techniques
- **Quality Metrics**: Automated data quality scoring
- **Batch Processing**: Multiple dataset concurrent processing
- **Custom Schemas**: User-defined output formats
### Scalability Improvements
- **Distributed Processing**: Multi-node processing support
- **Streaming Augmentation**: Real-time data enhancement
- **Caching Optimization**: Improved performance and cost efficiency
- **API Rate Limiting**: Better resource management
## 📚 Technical Dependencies
### Core Libraries
- **FastAPI**: Web framework for API development
- **Hugging Face Hub**: Dataset downloading and management
- **Google GenAI**: Gemini model integration
- **ftfy**: Text encoding and normalization
- **langid**: Language detection
- **orjson**: High-performance JSON processing
### External Services
- **NVIDIA API**: Primary LLM service for paraphrasing
- **Google Gemini**: Fallback LLM service
- **Google Drive**: Cloud storage integration
- **Hugging Face Spaces**: Deployment platform
---
*This document provides a comprehensive overview of all data processing techniques implemented in the MedAI Processing project. For specific implementation details, refer to the individual module files in the `utils/` directory.*