MedAI_Processing / README.md
LiamKhoaLe's picture
Upd syntax
fb6b1e8
metadata
title: Medical Processing
emoji: ⚕️
colorFrom: green
colorTo: pink
sdk: docker
pinned: false
license: apache-2.0
short_description: Data processing. Derived from 500k medical knowledge mix

🚀 Quick Access

HF Space

MedDialog-100k

MedDialog-10k

PubMedQA-Labelled

PubMedQA-Unlabelled

PubMedQA-Mapper

🎯 Features

🏠 Dual Mode Operation

  • Local Mode: MedAlpaca-13b model running locally for privacy and cost efficiency
  • Cloud Mode: NVIDIA + Gemini API integration for scalable processing
  • Dynamic Switching: Toggle between modes via environment variables
  • Medical Specialization: MedAlpaca-13b specifically fine-tuned for medical tasks

🔄 Advanced Data Augmentation

  • Paraphrasing: Multi-model rotation (NVIDIA + Gemini) with easy/hard difficulty levels
  • Backtranslation: Vietnamese pivot language for semantic preservation
  • Style Standardization: Clinical voice enforcement and professional medical tone
  • Response Validation: Invalid response detection and retry logic (max 3 attempts)
  • Quality Guards: Length/semantic validation for backtranslation outputs

📊 SFT Data Enrichment

  • Multiple Answer Variants: 2-3 different answers per question for better reasoning
  • Multiple Question Variants: 2-3 different questions per answer for diverse training
  • Cross Combinations: All question × answer variant combinations (up to 9 per sample)
  • Vietnamese Variants: Translated versions of enriched combinations
  • Reasoning Enhancement: Multiple reasoning paths for improved model training

🔍 Quality Assurance

  • Invalid Response Detection: Catches "Fail", "Invalid", "I can't", "Sorry", etc.
  • Retry Logic: Up to 3 attempts with different paraphrasing difficulties
  • Drop Strategy: Samples dropped if retry fails (no fallback answers)
  • Consistency Checking: LLM-based validation of answer quality
  • De-identification: PHI removal with configurable strictness

🎯 RAG Optimization

  • Embedding-Friendly: Concise, direct text optimized for dense retrieval
  • Context Generation: Synthetic context creation when missing
  • Content Cleaning: Conversational element removal for medical focus
  • Length Control: Hard caps on question/answer/context lengths
  • Quality Filtering: Invalid response cleaning for RAG corpora

📋 Supported Datasets

Medical Dialogue

  • HealthCareMagic: 100k medical conversations
  • iCliniq: 10k derived medical Q&A

Biomedical QA

  • PubMedQA-L: Labeled biomedical questions
  • PubMedQA-U: Unlabeled biomedical questions
  • PubMedQA-MAP: Mapped biomedical Q&A pairs

⚙️ Configuration

Mode Selection

# Local Mode (MedAlpaca-13b)
IS_LOCAL=true
HF_TOKEN=your_huggingface_token

# Cloud Mode (NVIDIA/Gemini APIs)
IS_LOCAL=false
NVIDIA_API_1=your_nvidia_key
GEMINI_API_1=your_gemini_key

Augmentation Parameters

class AugmentOptions:
    paraphrase_ratio: float = 0.2          # 0.0-1.0
    paraphrase_outputs: bool = True         # Augment model answers
    backtranslate_ratio: float = 0.1        # 0.0-1.0 (Vietnamese pivot)
    style_standardize: bool = True          # Enforce clinical style
    deidentify: bool = True                 # Remove PHI
    dedupe: bool = True                     # Remove duplicates
    max_chars: int = 5000                   # Text length limit
    consistency_check_ratio: float = 0.05   # 0.0-1.0
    expand: bool = True                     # Enable enrichment
    max_aug_per_sample: int = 2             # 1-3 variants

Processing Modes

  • SFT Processing: Supervised Fine-Tuning format with enrichment
  • RAG Processing: Question-Context-Answer format for retrieval
  • Vietnamese Mode: Complete translation of all text fields

📈 Output Statistics

The system tracks comprehensive statistics:

  • written: Successfully processed samples
  • paraphrased_input/output: Paraphrasing counts
  • backtranslated_input/output: Backtranslation counts
  • dropped_invalid: Samples dropped due to failed retries
  • vietnamese_variants: Vietnamese variants created
  • dedup_skipped: Duplicate samples removed
  • consistency_failed: Samples flagged for quality issues

🔧 Usage

Web Interface

  1. Visit the HF Space
  2. Select dataset and processing mode (SFT/RAG)
  3. Enable Vietnamese translation if needed
  4. Click process button

API Usage

# SFT Processing with Vietnamese translation
curl -X POST "https://huggingface.co/spaces/MedSwin/medai-processing/process/healthcaremagic" \
  -H "Content-Type: application/json" \
  -d '{
    "augment": {
      "paraphrase_ratio": 0.2,
      "backtranslate_ratio": 0.1,
      "paraphrase_outputs": true,
      "style_standardize": true,
      "deidentify": true,
      "dedupe": true,
      "expand": true
    },
    "vietnamese_translation": true
  }'

# RAG Processing
curl -X POST "https://huggingface.co/spaces/MedSwin/medai-processing/rag/healthcaremagic" \
  -H "Content-Type: application/json" \
  -d '{
    "vietnamese_translation": true
  }'

📚 Documentation

📄 License

Apache-2.0 LICENSE