# 📊 MedAI Data Processing Techniques

This document outlines the data processing techniques implemented in the MedAI Processing project for cleaning and augmenting medical datasets and converting them into a single, centralized format for LLM fine-tuning.

## 🎯 Project Overview

The MedAI Processing system transforms raw medical datasets into a **centralized fine-tuning format** (JSONL + CSV) with optional data augmentation. It handles several medical dataset types and applies cleaning, augmentation, and validation techniques to improve data quality and diversity.

## 🏗️ System Architecture

### Core Components
- **FastAPI Web Service**: RESTful API for dataset processing
- **Multi-LLM Rotator**: NVIDIA API + Google Gemini integration
- **Centralized Writer**: Parallel JSONL + CSV output generation
- **Google Drive Integration**: Automated artifact storage
- **Progress Monitoring**: Real-time job status tracking

### Supported Datasets
1. **HealthCareMagic** (100k medical dialogues)
2. **iCliniq** (10k medical consultations)
3. **PubMedQA-Labelled** (biomedical Q&A with answers)
4. **PubMedQA-Unlabelled** (biomedical Q&A without answers)
5. **PubMedQA-Map** (biomedical Q&A mapping format)

## 🔧 Data Processing Pipeline

### 1. Data Ingestion & Download
- **Hugging Face Hub Integration**: Automatic dataset downloading
- **Format Detection**: JSON/JSONL auto-detection and parsing
- **Caching System**: Local storage with symlink optimization
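
The format detection step can be sketched as follows. This is a minimal illustration that assumes detection is based on file content rather than file extension; `load_records` is a hypothetical helper, not the project's actual loader.

```python
import json
from typing import Any, Dict, List


def load_records(text: str) -> List[Dict[str, Any]]:
    """Parse text as one JSON document first, then fall back to JSONL."""
    stripped = text.strip()
    if not stripped:
        return []
    try:
        data = json.loads(stripped)
    except json.JSONDecodeError:
        # Not a single JSON document: treat every non-empty line as one record.
        return [json.loads(line) for line in stripped.splitlines() if line.strip()]
    # A single JSON document is either a list of records or one record.
    return data if isinstance(data, list) else [data]
```

Trying the whole-document parse first keeps pretty-printed JSON files from being misread as JSONL.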

### 2. Data Cleaning & Preprocessing

#### Text Normalization
- **Unicode Fixing**: `ftfy` library for text encoding issues
- **Whitespace Standardization**: Consistent spacing and line breaks
- **Quote Canonicalization**: Standard quote character conversion
- **Terminal Punctuation**: Ensures proper sentence endings
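
A rough sketch of the whitespace, quote, and terminal punctuation steps is shown below. It uses only the standard library, so the Unicode repair that the project delegates to `ftfy` is left out; `normalize_text` and the exact rules are assumptions made for illustration.

```python
import re

# Map curly quotes to their plain ASCII equivalents.
_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def normalize_text(text: str) -> str:
    text = text.translate(_QUOTES)
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)     # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)   # keep at most one blank line
    text = text.strip()
    if text and text[-1] not in ".!?":
        text += "."                          # ensure a sentence ending
    return text
```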

#### Content Sanitization
- **Length Capping**: Configurable maximum character limits (default: 5000)
- **Language Detection**: English language validation using `langid`
- **Content Truncation**: Smart sentence boundary cutting for long texts
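
Length capping with sentence-boundary truncation might look like the sketch below. `truncate_at_sentence` is a hypothetical name, and the boundary rule (the last `. `, `! `, or `? ` inside the limit) is an assumption; the language check with `langid` is not shown.

```python
def truncate_at_sentence(text: str, max_chars: int = 5000) -> str:
    """Cut text to max_chars, preferring the last sentence end inside the limit."""
    if len(text) <= max_chars:
        return text
    window = text[:max_chars]
    # Position just past the last sentence-ending mark inside the window.
    cut = max(window.rfind(mark) for mark in (". ", "! ", "? ")) + 1
    # No boundary found (cut == 0): fall back to a hard cut.
    return window[:cut].rstrip() if cut > 0 else window.rstrip()
```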

### 3. Data Augmentation Techniques

#### LLM-Based Paraphrasing
- **Multi-Model Rotation**: NVIDIA API (primary) + Gemini (fallback)
- **Difficulty Levels**: Easy vs. Hard paraphrasing modes
- **Medical Context Preservation**: Maintains clinical terminology accuracy
- **Configurable Ratios**: User-defined augmentation percentages (0.0-1.0)
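
The interaction between the augmentation ratio and model fallback can be sketched as below. The backends are plain callables standing in for the NVIDIA and Gemini clients; `paraphrase_with_fallback` and `maybe_paraphrase` are hypothetical names, and the real rotator may behave differently.

```python
import random
from typing import Callable, Optional, Sequence

Paraphraser = Callable[[str], Optional[str]]


def paraphrase_with_fallback(text: str, backends: Sequence[Paraphraser]) -> str:
    """Try each backend in order; keep the original text if all of them fail."""
    for backend in backends:
        try:
            result = backend(text)
        except Exception:
            continue  # backend error: move on to the next one
        if result and result.strip():
            return result.strip()
    return text


def maybe_paraphrase(text: str, ratio: float, rng: random.Random,
                     backends: Sequence[Paraphraser]) -> str:
    """Paraphrase with probability `ratio` (0.0-1.0), otherwise pass through."""
    if rng.random() < ratio:
        return paraphrase_with_fallback(text, backends)
    return text
```

Passing a seeded `random.Random` keeps the selection of paraphrased samples reproducible across runs.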

#### Back-Translation Augmentation
- **Pivot Language**: German serves as the intermediate language
- **Meaning Preservation**: Maintains semantic accuracy through translation cycles
- **Fallback Mechanisms**: Automatic retry with alternative models
- **Quality Control**: Length and content validation
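
A back-translation round trip with a simple length check could be sketched like this. The `translate` callable is a stand-in for an LLM call, and the length-ratio thresholds are illustrative assumptions rather than the project's actual values.

```python
from typing import Callable, Optional

# (text, source_lang, target_lang) -> translated text
Translate = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translate, pivot: str = "de",
                   min_ratio: float = 0.5, max_ratio: float = 2.0) -> Optional[str]:
    """Translate English -> pivot -> English; return None if validation fails."""
    pivot_text = translate(text, "en", pivot)
    round_trip = translate(pivot_text, pivot, "en").strip()
    if not round_trip:
        return None
    # Reject outputs whose length drifts too far from the original.
    ratio = len(round_trip) / max(len(text), 1)
    if not (min_ratio <= ratio <= max_ratio):
        return None
    return round_trip
```

Returning `None` lets the caller keep the original sample or retry with another model.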

#### Style Standardization
- **Clinical Voice Enforcement**: Neutral, professional medical tone
- **Absolute Language Removal**: Replaces guarantees with probabilistic language
- **Forum Sign-off Removal**: Eliminates informal communication patterns
- **Consistent Punctuation**: Standardized sentence structure
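
The rule-based parts of style standardization can be approximated with regular expressions. The phrase table and sign-off list below are hypothetical examples, and the sign-off rule is a heuristic that can also strip a short closing sentence that happens to start with one of these words.

```python
import re

# Hypothetical phrase table: absolute claim -> probabilistic wording.
_ABSOLUTES = [
    (re.compile(r"\bwill definitely\b", re.IGNORECASE), "is likely to"),
    (re.compile(r"\bis guaranteed to\b", re.IGNORECASE), "is expected to"),
    (re.compile(r"\b100% safe\b", re.IGNORECASE), "generally considered safe"),
]

# A short trailing sign-off that starts right after a sentence end.
_SIGN_OFF = re.compile(
    r"(?<=[.!?])\s*(?:kind regards|regards|best wishes|take care|thanks)\b[^\n]{0,40}$",
    re.IGNORECASE,
)


def standardize_style(text: str) -> str:
    text = _SIGN_OFF.sub("", text.strip())       # drop forum-style sign-offs
    for pattern, replacement in _ABSOLUTES:
        text = pattern.sub(replacement, text)    # soften absolute claims
    return text
```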

### 4. Data Quality Assurance

#### De-identification (PHI Removal)
- **Email Redaction**: `[REDACTED_EMAIL]` placeholder
- **Phone Number Masking**: `[REDACTED_PHONE]` placeholder
- **URL/IP Address Removal**: `[REDACTED_URL]` and `[REDACTED_IP]` placeholders
- **Configurable Privacy**: Optional PHI removal per dataset
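
A minimal regex-based sketch of the redaction step follows. The placeholders come from the list above; the patterns themselves are assumptions, deliberately simple, and will produce false positives (for example, long digit sequences that are not phone numbers).

```python
import re

# Order matters: emails before URLs, and IPs before phone numbers,
# because the phone pattern would otherwise swallow dotted IP addresses.
_PHI_PATTERNS = [
    ("[REDACTED_EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("[REDACTED_URL]", re.compile(r"\b(?:https?://|www\.)\S+", re.IGNORECASE)),
    ("[REDACTED_IP]", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("[REDACTED_PHONE]", re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")),
]


def deidentify(text: str) -> str:
    """Replace emails, URLs, IP addresses, and phone numbers with placeholders."""
    for placeholder, pattern in _PHI_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Pattern matching of this kind does not catch names, addresses, or dates, so it should be read as a baseline rather than full de-identification.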

#### Deduplication
- **Fingerprinting Algorithm**: MD5-based content hashing
- **Multi-Field Matching**: Instruction + Input + Output combination
- **Normalized Comparison**: Case-insensitive, whitespace-normalized matching
- **Performance Optimized**: In-memory set-based deduplication
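
The fingerprinting approach described above can be sketched as follows. The field separator and the `Deduplicator` class are illustrative assumptions; only the MD5 hashing, the three fields, the normalization, and the in-memory set come from the description.

```python
import hashlib
import re


def fingerprint(instruction: str, user_input: str, output: str) -> str:
    """MD5 over the lowercased, whitespace-normalized field combination."""
    parts = (instruction, user_input, output)
    # A separator that never appears in normal text keeps field boundaries distinct.
    normalized = "\x1f".join(re.sub(r"\s+", " ", p).strip().lower() for p in parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


class Deduplicator:
    """Tracks fingerprints already seen in an in-memory set."""

    def __init__(self) -> None:
        self._seen = set()

    def is_duplicate(self, instruction: str, user_input: str, output: str) -> bool:
        fp = fingerprint(instruction, user_input, output)
        if fp in self._seen:
            return True
        self._seen.add(fp)
        return False
```

MD5 is used here only as a fast content hash, not for security; storing 32-character digests instead of full texts keeps the set small.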

#### Consistency Validation
- **LLM-Based QA Check**: Automated answer validation against context
- **Configurable Sampling**: Ratio-based consistency checking (e.g., 0.01)
- **Medical Safety Validation**: Ensures clinical accuracy and safety
- **Failure Tagging**: Marks samples with consistency issues
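
Ratio-based consistency checking with failure tagging might be structured like the sketch below. The `judge` callable stands in for the LLM-based check, and the field names (`input`, `output`, `tags`) and the `consistency_failed` tag are assumptions.

```python
import random
from typing import Callable, Dict

# (question, answer) -> True if the answer is judged consistent
Judge = Callable[[str, str], bool]


def tag_consistency(sample: Dict, ratio: float, rng: random.Random, judge: Judge) -> Dict:
    """Check a random fraction of samples and tag the ones that fail."""
    if rng.random() >= ratio:
        return sample  # not selected for checking
    if judge(sample["input"], sample["output"]):
        return sample
    # Return a tagged copy instead of mutating or dropping the sample.
    return {**sample, "tags": [*sample.get("tags", []), "consistency_failed"]}
```

Tagging rather than dropping keeps the decision about failed samples with whoever consumes the output.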

### 5. Advanced Augmentation Features

#### Knowledge Distillation
- **Pseudo-Label Generation**: Creates labels for unlabeled data
- **Fractional Processing**: Configurable percentage for distillation
- **Single-Prompt Approach**: Efficient single LLM call per sample
- **Length Control**: Maintains reasonable output lengths
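
Fractional pseudo-labelling could be sketched as below. The `labeler` callable stands in for the single LLM call per sample; the field names, the `distilled` tag, and the seeded selection strategy are assumptions.

```python
import random
from typing import Callable, Dict, List


def distill(samples: List[Dict], fraction: float, labeler: Callable[[str], str],
            seed: int = 42, max_chars: int = 5000) -> List[Dict]:
    """Generate pseudo-labels for a random fraction of unlabeled samples."""
    rng = random.Random(seed)
    k = round(len(samples) * fraction)
    chosen = set(rng.sample(range(len(samples)), k))
    result = []
    for index, sample in enumerate(samples):
        if index in chosen and not sample.get("output"):
            # One labeler call per sample, capped to a reasonable length.
            label = labeler(sample["input"])[:max_chars].strip()
            if label:
                tags = [*sample.get("tags", []), "distilled"]
                sample = {**sample, "output": label, "tags": tags}
        result.append(sample)
    return result
```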

#### Multi-Variant Generation
- **Configurable Counts**: 1-3 augmented variants per sample
- **Tagged Augmentations**: Tracks applied augmentation techniques
- **Original Preservation**: Always maintains base sample
- **Randomized IDs**: Unique identifiers for augmented variants
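
Multi-variant generation can be illustrated with the sketch below. Each augmenter is a `(tag, transform)` pair; the field names (`id`, `input`, `aug_tags`) and the ID scheme are hypothetical.

```python
import uuid
from typing import Callable, Dict, List, Sequence, Tuple

Augmenter = Tuple[str, Callable[[str], str]]  # (tag, transform)


def expand_sample(sample: Dict, augmenters: Sequence[Augmenter],
                  max_aug: int = 2) -> List[Dict]:
    """Return the original sample followed by up to max_aug tagged variants."""
    max_aug = max(1, min(3, max_aug))  # clamp to the documented 1-3 range
    variants = [sample]                # the base sample is always kept
    for tag, transform in augmenters[:max_aug]:
        new_input = transform(sample["input"])
        if new_input == sample["input"]:
            continue  # the augmentation changed nothing: skip it
        variants.append({
            **sample,
            "id": f"{sample['id']}-{uuid.uuid4().hex[:8]}",  # randomized suffix
            "input": new_input,
            "aug_tags": [tag],
        })
    return variants
```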

### 6. Output Generation & Storage

#### Centralized Format
- **SFT Schema**: Standardized Supervised Fine-Tuning format
- **Metadata Preservation**: Source, task type, and augmentation tags
- **Dual Output**: Simultaneous JSONL and CSV generation
- **Memory-Safe Streaming**: Handles large datasets efficiently
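
A streaming dual writer along these lines is sketched below. The column set and the class shape are assumptions; the idea shown is that each row is written to both outputs as soon as it is produced, so the full dataset never sits in memory.

```python
import csv
import json
from typing import Dict, TextIO


class CentralizedWriter:
    """Writes each row to a JSONL stream and a CSV stream at the same time."""

    # Hypothetical CSV columns; the JSONL line keeps every field, including tags.
    FIELDS = ("instruction", "input", "output", "source", "task")

    def __init__(self, jsonl_file: TextIO, csv_file: TextIO) -> None:
        self._jsonl = jsonl_file
        self._csv = csv.DictWriter(csv_file, fieldnames=self.FIELDS, extrasaction="ignore")
        self._csv.writeheader()
        self.rows = 0

    def write(self, row: Dict) -> None:
        self._jsonl.write(json.dumps(row, ensure_ascii=False) + "\n")
        self._csv.writerow(row)  # extra keys are ignored, missing ones left empty
        self.rows += 1
```

When writing to real files, open the CSV file with `newline=""` as the `csv` module documentation recommends.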

#### Storage Integration
- **Local Caching**: `cache/outputs/` directory storage
- **Google Drive Upload**: Automated cloud storage integration
- **Timestamped Naming**: Unique file identification
- **MIME Type Handling**: Proper content type specification

## ⚙️ Configuration Options

### Augmentation Parameters
```python
class AugmentOptions:
    paraphrase_ratio: float = 0.0          # 0.0-1.0
    paraphrase_outputs: bool = False       # Augment model answers
    backtranslate_ratio: float = 0.0       # 0.0-1.0
    style_standardize: bool = True         # Enforce clinical style
    deidentify: bool = True                # Remove PHI
    dedupe: bool = True                    # Remove duplicates
    max_chars: int = 5000                  # Text length limit
    consistency_check_ratio: float = 0.0   # 0.0-1.0
    distill_fraction: float = 0.0          # 0.0-1.0 for unlabeled
    expand: bool = True                    # Enable augmentation
    max_aug_per_sample: int = 2            # 1-3 variants
```

### Processing Parameters
```python
from typing import Optional

class ProcessParams:
    augment: AugmentOptions                # Augmentation settings
    sample_limit: Optional[int] = None     # Dataset sampling
    seed: int = 42                         # Reproducibility
```

## 📈 Performance & Monitoring

### Progress Tracking
- **Real-time Updates**: Live progress percentage and status messages
- **Background Processing**: Non-blocking job execution
- **State Management**: Thread-safe status tracking
- **Error Handling**: Comprehensive exception logging
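
Thread-safe status tracking for a background job can be sketched with a lock around a small state object. `JobState` and `StatusTracker` are hypothetical names, and the fields are an assumption about what a status endpoint would report.

```python
import threading
from dataclasses import asdict, dataclass


@dataclass
class JobState:
    running: bool = False
    progress: float = 0.0
    message: str = "idle"


class StatusTracker:
    """Shared between the background worker and the status endpoint."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._state = JobState()

    def update(self, progress: float, message: str) -> None:
        with self._lock:
            self._state.progress = max(0.0, min(1.0, progress))  # clamp to 0-1
            self._state.message = message
            self._state.running = self._state.progress < 1.0

    def snapshot(self) -> dict:
        # Return a copy so readers never observe a half-updated state.
        with self._lock:
            return asdict(self._state)
```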

### Resource Management
- **API Key Rotation**: Automatic fallback between multiple API keys
- **Rate Limiting**: Configurable request throttling
- **Memory Optimization**: Streaming processing for large datasets
- **Concurrent Processing**: Background task execution
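
API key rotation can be illustrated with the sketch below, which assumes a failed request (for example, an exhausted quota) should move on to the next key. `KeyRotator` is a hypothetical class; request throttling is not shown.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class KeyRotator:
    """Retries a request with the next API key whenever the current one fails."""

    def __init__(self, keys: Sequence[str]) -> None:
        if not keys:
            raise ValueError("at least one API key is required")
        self._keys = list(keys)
        self._index = 0

    def call(self, request: Callable[[str], T]) -> T:
        last_error: Optional[Exception] = None
        for _ in range(len(self._keys)):
            key = self._keys[self._index]
            try:
                return request(key)
            except Exception as exc:
                last_error = exc
                # Advance to the next key, wrapping around at the end.
                self._index = (self._index + 1) % len(self._keys)
        raise RuntimeError("all API keys failed") from last_error
```

The rotator remembers the last working key, so later calls do not retry keys that have already failed in this cycle.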

## 🔒 Security & Privacy

### Data Protection
- **PHI Removal**: Automatic sensitive information redaction
- **Secure Storage**: Google Drive integration with OAuth2
- **Access Control**: Environment-based API key management
- **Audit Logging**: Comprehensive processing logs

### API Security
- **OAuth2 Integration**: Google Drive authentication
- **Token Management**: Secure credential handling
- **Request Validation**: Pydantic model validation
- **Error Sanitization**: Safe error message handling

## 🚀 Usage Examples

### Basic Processing
```bash
# Process HealthCareMagic with 10% input paraphrasing and otherwise default settings
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"augment": {"paraphrase_ratio": 0.1}}' \
  https://binkhoale1812-medai-processing.hf.space/process/healthcaremagic
```

### Advanced Augmentation
```bash
# Process with comprehensive augmentation
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "augment": {
      "paraphrase_ratio": 0.2,
      "backtranslate_ratio": 0.1,
      "paraphrase_outputs": true,
      "style_standardize": true,
      "deidentify": true,
      "dedupe": true,
      "max_chars": 5000,
      "consistency_check_ratio": 0.01,
      "max_aug_per_sample": 3
    },
    "sample_limit": 1000,
    "seed": 42
  }' \
  https://binkhoale1812-medai-processing.hf.space/process/icliniq
```
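
The same request can be built from Python with the standard library. This sketch only constructs the request object; `build_process_request` is a hypothetical helper, and the response format is not described in this document, so nothing is parsed here.

```python
import json
import urllib.request

BASE_URL = "https://binkhoale1812-medai-processing.hf.space"


def build_process_request(dataset: str, payload: dict) -> urllib.request.Request:
    """Build a POST request for the /process/<dataset> endpoint."""
    return urllib.request.Request(
        url=f"{BASE_URL}/process/{dataset}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_process_request(
    "icliniq",
    {"augment": {"paraphrase_ratio": 0.2}, "sample_limit": 1000, "seed": 42},
)
# To send it: urllib.request.urlopen(request)
```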

## 📊 Output Statistics

### Processing Metrics
- **Written Rows**: Total processed samples
- **Paraphrased Inputs**: Count of augmented user inputs
- **Paraphrased Outputs**: Count of augmented model responses
- **Back-translated**: Count of translation-augmented samples
- **Deduplication**: Count of skipped duplicate samples
- **Consistency Failures**: Count of validation failures

### File Outputs
- **JSONL Format**: Structured fine-tuning data with metadata
- **CSV Format**: Simplified tabular representation
- **Google Drive**: Cloud storage with automatic upload
- **Local Cache**: Persistent local storage

## 🔮 Future Enhancements

### Planned Features
- **Additional Dataset Support**: More medical dataset types
- **Advanced Augmentation**: More sophisticated LLM techniques
- **Quality Metrics**: Automated data quality scoring
- **Batch Processing**: Multiple dataset concurrent processing
- **Custom Schemas**: User-defined output formats

### Scalability Improvements
- **Distributed Processing**: Multi-node processing support
- **Streaming Augmentation**: Real-time data enhancement
- **Caching Optimization**: Improved performance and cost efficiency
- **API Rate Limiting**: Better resource management

## 📚 Technical Dependencies

### Core Libraries
- **FastAPI**: Web framework for API development
- **Hugging Face Hub**: Dataset downloading and management
- **Google GenAI**: Gemini model integration
- **ftfy**: Text encoding and normalization
- **langid**: Language detection
- **orjson**: High-performance JSON processing

### External Services
- **NVIDIA API**: Primary LLM service for paraphrasing
- **Google Gemini**: Fallback LLM service
- **Google Drive**: Cloud storage integration
- **Hugging Face Spaces**: Deployment platform

---

*This document provides a comprehensive overview of all data processing techniques implemented in the MedAI Processing project. For specific implementation details, refer to the individual module files in the `utils/` directory.*