
📊 MedAI Data Processing Techniques

This document outlines the data processing techniques implemented in the MedAI Processing project, which augments medical datasets and consolidates them into a centralized format for LLM fine-tuning.

🎯 Project Overview

The MedAI Processing system transforms raw medical datasets into a centralized fine-tuning format (JSONL + CSV) with comprehensive data augmentation capabilities. The system processes multiple medical dataset types and applies various enhancement techniques to improve data quality and diversity.

🏗️ System Architecture

Core Components

  • FastAPI Web Service: RESTful API for dataset processing
  • Multi-LLM Rotator: NVIDIA API + Google Gemini integration
  • Centralized Writer: Parallel JSONL + CSV output generation
  • Google Drive Integration: Automated artifact storage
  • Progress Monitoring: Real-time job status tracking

Supported Datasets

  1. HealthCareMagic (100k medical dialogues)
  2. iCliniq (10k medical consultations)
  3. PubMedQA-Labelled (biomedical Q&A with answers)
  4. PubMedQA-Unlabelled (biomedical Q&A without answers)
  5. PubMedQA-Map (biomedical Q&A mapping format)

🔧 Data Processing Pipeline

1. Data Ingestion & Download

  • Hugging Face Hub Integration: Automatic dataset downloading
  • Format Detection: JSON/JSONL auto-detection and parsing
  • Caching System: Local storage with symlink optimization
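The JSON/JSONL auto-detection step can be sketched as follows. This is a stdlib-only approximation (the project may use orjson internally); `parse_json_or_jsonl` is an illustrative name, not necessarily the actual function in utils/:

```python
import json

def parse_json_or_jsonl(text: str) -> list[dict]:
    """Auto-detect format: try whole-file JSON first, fall back to one record per line."""
    stripped = text.strip()
    try:
        data = json.loads(stripped)
        # A single top-level object is wrapped so callers always get a list
        return data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # Not a single JSON document: treat as JSONL, skipping blank lines
        return [json.loads(line) for line in stripped.splitlines() if line.strip()]
```

Trying the whole-file parse first is cheap and unambiguous: a valid JSON array or object parses in one shot, while JSONL input fails with "extra data" and falls through to line-by-line parsing.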

2. Data Cleaning & Preprocessing

Text Normalization

  • Unicode Fixing: ftfy library for text encoding issues
  • Whitespace Standardization: Consistent spacing and line breaks
  • Quote Canonicalization: Standard quote character conversion
  • Terminal Punctuation: Ensures proper sentence endings
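ftfy handles the encoding repair; the remaining normalization steps can be sketched with the standard library alone. This is a simplified approximation with illustrative rules, not the project's exact implementation:

```python
import re
import unicodedata

def normalize_text(s: str) -> str:
    """Whitespace, quote, and terminal-punctuation normalization (ftfy step omitted)."""
    s = unicodedata.normalize("NFKC", s)                  # unify compatibility forms
    s = s.replace("\u201c", '"').replace("\u201d", '"')   # canonical double quotes
    s = s.replace("\u2018", "'").replace("\u2019", "'")   # canonical single quotes
    s = re.sub(r"[ \t]+", " ", s)                         # collapse runs of spaces/tabs
    s = re.sub(r"\n{3,}", "\n\n", s).strip()              # cap consecutive blank lines
    if s and s[-1] not in ".!?":                          # ensure terminal punctuation
        s += "."
    return s
```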

Content Sanitization

  • Length Capping: Configurable maximum character limits (default: 5000)
  • Language Detection: English language validation using langid
  • Content Truncation: Smart sentence boundary cutting for long texts
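Sentence-boundary truncation can be sketched as below; the 5000-character default matches the documented `max_chars` option, while the boundary heuristic here is an assumption about how the cut is chosen:

```python
def cap_length(text: str, max_chars: int = 5000) -> str:
    """Truncate long text at the last sentence boundary within the limit."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # Prefer cutting after a complete sentence rather than mid-word
    last = max(cut.rfind(". "), cut.rfind("! "), cut.rfind("? "))
    return cut[: last + 1] if last > 0 else cut
```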

3. Data Augmentation Techniques

LLM-Based Paraphrasing

  • Multi-Model Rotation: NVIDIA API (primary) + Gemini (fallback)
  • Difficulty Levels: Easy vs. Hard paraphrasing modes
  • Medical Context Preservation: Maintains clinical terminology accuracy
  • Configurable Ratios: User-defined augmentation percentages (0.0-1.0)
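The primary/fallback rotation between providers can be sketched as a simple ordered loop. The provider callables here are hypothetical stand-ins for the real NVIDIA and Gemini client wrappers:

```python
from typing import Callable

def paraphrase_with_fallback(text: str, providers: list[Callable[[str], str]]) -> str:
    """Try each LLM provider in order; return the first usable result, else the original text."""
    for call in providers:
        try:
            out = call(text)
            if out and out.strip():
                return out.strip()
        except Exception:
            continue  # rotate to the next provider on rate limits or API errors
    return text  # no provider succeeded: keep the sample unaugmented
```

Falling back to the unmodified text (rather than dropping the sample) keeps the pipeline resilient to transient API outages.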

Back-Translation Augmentation

  • Pivot Language: German as the intermediate language (English → German → English)
  • Meaning Preservation: Maintains semantic accuracy through translation cycles
  • Fallback Mechanisms: Automatic retry with alternative models
  • Quality Control: Length and content validation

Style Standardization

  • Clinical Voice Enforcement: Neutral, professional medical tone
  • Absolute Language Removal: Replaces guarantees with probabilistic language
  • Forum Sign-off Removal: Eliminates informal communication patterns
  • Consistent Punctuation: Standardized sentence structure
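The absolute-language and sign-off rules can be sketched as pattern substitutions. The specific replacement table and sign-off phrases below are illustrative examples, not the project's actual rule set:

```python
import re

# Illustrative absolute -> probabilistic substitutions
ABSOLUTES = {
    "always": "typically",
    "never": "rarely",
    "guaranteed": "expected",
    "will cure": "may help treat",
}
# Common forum sign-offs: remove everything from the sign-off to the end
SIGNOFF = re.compile(r"\b(hope this helps|regards|take care|best wishes)[\s\S]*$",
                     re.IGNORECASE)

def standardize_style(text: str) -> str:
    text = SIGNOFF.sub("", text).strip()
    for word, repl in ABSOLUTES.items():
        text = re.sub(rf"\b{word}\b", repl, text, flags=re.IGNORECASE)
    return text
```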

4. Data Quality Assurance

De-identification (PHI Removal)

  • Email Redaction: [REDACTED_EMAIL] placeholder
  • Phone Number Masking: [REDACTED_PHONE] placeholder
  • URL/IP Address Removal: [REDACTED_URL] and [REDACTED_IP] placeholders
  • Configurable Privacy: Optional PHI removal per dataset
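The redaction pass can be sketched as ordered regex substitutions. The exact patterns below are simplified approximations; ordering matters so that URLs are redacted before their embedded IPs, and IPs before the looser phone pattern can match them:

```python
import re

PATTERNS = [
    (re.compile(r"https?://\S+"), "[REDACTED_URL]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[REDACTED_IP]"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[REDACTED_PHONE]"),
]

def deidentify(text: str) -> str:
    """Replace emails, phone numbers, URLs, and IPs with placeholder tokens."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```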

Deduplication

  • Fingerprinting Algorithm: MD5-based content hashing
  • Multi-Field Matching: Instruction + Input + Output combination
  • Normalized Comparison: Case-insensitive, whitespace-normalized matching
  • Performance Optimized: In-memory set-based deduplication
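The fingerprinting and set-based deduplication described above can be sketched as:

```python
import hashlib

def fingerprint(instruction: str, inp: str, output: str) -> str:
    """MD5 over the normalized (lowercased, whitespace-collapsed) combined fields."""
    key = "||".join(" ".join(field.lower().split())
                    for field in (instruction, inp, output))
    return hashlib.md5(key.encode("utf-8")).hexdigest()

seen: set[str] = set()

def is_duplicate(sample: dict) -> bool:
    """True if an equivalent sample was already seen; records the fingerprint otherwise."""
    fp = fingerprint(sample.get("instruction", ""),
                     sample.get("input", ""),
                     sample.get("output", ""))
    if fp in seen:
        return True
    seen.add(fp)
    return False
```

Normalizing before hashing means case and spacing variants of the same dialogue collapse to one fingerprint, at the cost of a 16-byte hash per unique sample held in memory.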

Consistency Validation

  • LLM-Based QA Check: Automated answer validation against context
  • Configurable Sampling: Ratio-based consistency checking (e.g., 0.01)
  • Medical Safety Validation: Ensures clinical accuracy and safety
  • Failure Tagging: Marks samples with consistency issues

5. Advanced Augmentation Features

Knowledge Distillation

  • Pseudo-Label Generation: Creates labels for unlabeled data
  • Fractional Processing: Configurable percentage for distillation
  • Single-Prompt Approach: Efficient single LLM call per sample
  • Length Control: Maintains reasonable output lengths

Multi-Variant Generation

  • Configurable Counts: 1-3 augmented variants per sample
  • Tagged Augmentations: Tracks applied augmentation techniques
  • Original Preservation: Always maintains base sample
  • Randomized IDs: Unique identifiers for augmented variants
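Variant expansion can be sketched as below. The `(name, fn)` augmenter pairs and the field names `id`/`aug_tags` are illustrative assumptions about the sample schema:

```python
import random
import uuid
from typing import Callable

def expand_sample(sample: dict,
                  augmenters: list[tuple[str, Callable[[str], str]]],
                  max_aug: int = 2) -> list[dict]:
    """Always keep the original sample, then append up to max_aug tagged variants."""
    out = [sample]
    chosen = random.sample(augmenters, min(max_aug, len(augmenters)))
    for name, fn in chosen:
        variant = dict(sample)                      # never mutate the base sample
        variant["output"] = fn(sample["output"])
        variant["id"] = uuid.uuid4().hex            # randomized unique identifier
        variant["aug_tags"] = sample.get("aug_tags", []) + [name]
        out.append(variant)
    return out
```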

6. Output Generation & Storage

Centralized Format

  • SFT Schema: Standardized Supervised Fine-Tuning format
  • Metadata Preservation: Source, task type, and augmentation tags
  • Dual Output: Simultaneous JSONL and CSV generation
  • Memory-Safe Streaming: Handles large datasets efficiently
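The dual-output writer can be sketched as a single streaming pass over the samples. The field list is an assumed SFT schema, and stdlib `json` stands in for orjson:

```python
import csv
import json
from typing import Iterable

def write_central(samples: Iterable[dict], jsonl_path: str, csv_path: str) -> None:
    """Stream each sample to JSONL and CSV in one pass, never holding the dataset in memory."""
    fields = ["instruction", "input", "output", "source", "task_type", "aug_tags"]
    with open(jsonl_path, "w", encoding="utf-8") as jf, \
         open(csv_path, "w", newline="", encoding="utf-8") as cf:
        writer = csv.DictWriter(cf, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for sample in samples:
            jf.write(json.dumps(sample, ensure_ascii=False) + "\n")  # full record + metadata
            writer.writerow({k: sample.get(k, "") for k in fields})  # simplified tabular view
```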

Storage Integration

  • Local Caching: cache/outputs/ directory storage
  • Google Drive Upload: Automated cloud storage integration
  • Timestamped Naming: Unique file identification
  • MIME Type Handling: Proper content type specification

⚙️ Configuration Options

Augmentation Parameters

class AugmentOptions:
    paraphrase_ratio: float = 0.0          # 0.0-1.0
    paraphrase_outputs: bool = False       # Augment model answers
    backtranslate_ratio: float = 0.0       # 0.0-1.0
    style_standardize: bool = True         # Enforce clinical style
    deidentify: bool = True                # Remove PHI
    dedupe: bool = True                    # Remove duplicates
    max_chars: int = 5000                  # Text length limit
    consistency_check_ratio: float = 0.0   # 0.0-1.0
    distill_fraction: float = 0.0          # 0.0-1.0 for unlabeled
    expand: bool = True                    # Enable augmentation
    max_aug_per_sample: int = 2            # 1-3 variants

Processing Parameters

class ProcessParams:
    augment: AugmentOptions                # Augmentation settings
    sample_limit: Optional[int] = None     # Dataset sampling
    seed: int = 42                         # Reproducibility

📈 Performance & Monitoring

Progress Tracking

  • Real-time Updates: Live progress percentage and status messages
  • Background Processing: Non-blocking job execution
  • State Management: Thread-safe status tracking
  • Error Handling: Comprehensive exception logging

Resource Management

  • API Key Rotation: Automatic fallback between multiple API keys
  • Rate Limiting: Configurable request throttling
  • Memory Optimization: Streaming processing for large datasets
  • Concurrent Processing: Background task execution
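The request throttling can be sketched as a minimum-interval limiter; this is a simplified single-threaded model, not necessarily the project's actual rate-limiting mechanism:

```python
import time

class Throttle:
    """Enforce a minimum interval between requests (simple rate limiter sketch)."""

    def __init__(self, max_per_sec: float):
        self.interval = 1.0 / max_per_sec
        self.last = 0.0

    def wait(self) -> None:
        """Block until at least `interval` seconds have passed since the last call."""
        now = time.monotonic()
        delay = self.last + self.interval - now
        if delay > 0:
            time.sleep(delay)
        self.last = time.monotonic()
```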

🔒 Security & Privacy

Data Protection

  • PHI Removal: Automatic sensitive information redaction
  • Secure Storage: Google Drive integration with OAuth2
  • Access Control: Environment-based API key management
  • Audit Logging: Comprehensive processing logs

API Security

  • OAuth2 Integration: Google Drive authentication
  • Token Management: Secure credential handling
  • Request Validation: Pydantic model validation
  • Error Sanitization: Safe error message handling

🚀 Usage Examples

Basic Processing

# Process HealthCareMagic with default settings
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"augment": {"paraphrase_ratio": 0.1}}' \
  https://binkhoale1812-medai-processing.hf.space/process/healthcaremagic

Advanced Augmentation

# Process with comprehensive augmentation
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "augment": {
      "paraphrase_ratio": 0.2,
      "backtranslate_ratio": 0.1,
      "paraphrase_outputs": true,
      "style_standardize": true,
      "deidentify": true,
      "dedupe": true,
      "max_chars": 5000,
      "consistency_check_ratio": 0.01,
      "max_aug_per_sample": 3
    },
    "sample_limit": 1000,
    "seed": 42
  }' \
  https://binkhoale1812-medai-processing.hf.space/process/icliniq

📊 Output Statistics

Processing Metrics

  • Written Rows: Total processed samples
  • Paraphrased Inputs: Count of augmented user inputs
  • Paraphrased Outputs: Count of augmented model responses
  • Back-translated: Count of translation-augmented samples
  • Deduplication: Count of skipped duplicate samples
  • Consistency Failures: Count of validation failures

File Outputs

  • JSONL Format: Structured fine-tuning data with metadata
  • CSV Format: Simplified tabular representation
  • Google Drive: Cloud storage with automatic upload
  • Local Cache: Persistent local storage

🔮 Future Enhancements

Planned Features

  • Additional Dataset Support: More medical dataset types
  • Advanced Augmentation: More sophisticated LLM techniques
  • Quality Metrics: Automated data quality scoring
  • Batch Processing: Multiple dataset concurrent processing
  • Custom Schemas: User-defined output formats

Scalability Improvements

  • Distributed Processing: Multi-node processing support
  • Streaming Augmentation: Real-time data enhancement
  • Caching Optimization: Improved performance and cost efficiency
  • API Rate Limiting: Better resource management

📚 Technical Dependencies

Core Libraries

  • FastAPI: Web framework for API development
  • Hugging Face Hub: Dataset downloading and management
  • Google GenAI: Gemini model integration
  • ftfy: Text encoding and normalization
  • langid: Language detection
  • orjson: High-performance JSON processing

External Services

  • NVIDIA API: Primary LLM service for paraphrasing
  • Google Gemini: Fallback LLM service
  • Google Drive: Cloud storage integration
  • Hugging Face Spaces: Deployment platform

This document provides a comprehensive overview of all data processing techniques implemented in the MedAI Processing project. For specific implementation details, refer to the individual module files in the utils/ directory.