
MedAI_Processing / README.md

LiamKhoaLe

Upd syntax

fb6b1e8 about 1 month ago


metadata

title: Medical Processing
emoji: ⚕️
colorFrom: green
colorTo: pink
sdk: docker
pinned: false
license: apache-2.0
short_description: Data processing. Derived from 500k medical knowledge mix

🚀 Quick Access

🎯 Features

🏠 Dual Mode Operation

Local Mode: MedAlpaca-13b model running locally for privacy and cost efficiency
Cloud Mode: NVIDIA + Gemini API integration for scalable processing
Dynamic Switching: Toggle between modes via environment variables
Medical Specialization: MedAlpaca-13b is fine-tuned specifically for medical tasks

🔄 Advanced Data Augmentation

Paraphrasing: Multi-model rotation (NVIDIA + Gemini) with easy/hard difficulty levels
Backtranslation: Vietnamese pivot language for semantic preservation
Style Standardization: Clinical voice enforcement and professional medical tone
Response Validation: Invalid response detection and retry logic (max 3 attempts)
Quality Guards: Length/semantic validation for backtranslation outputs
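
The length/semantic guard for backtranslation outputs could look roughly like the sketch below. The function name, the thresholds, and the use of token overlap (Jaccard similarity) as a cheap stand-in for a semantic check are all assumptions made for illustration; the project may use an embedding model or an LLM judge instead.

```python
def passes_backtranslation_guard(
    original: str,
    backtranslated: str,
    min_len_ratio: float = 0.5,   # assumed threshold
    max_len_ratio: float = 2.0,   # assumed threshold
    min_overlap: float = 0.3,     # assumed threshold
) -> bool:
    """Accept a backtranslation only if its length and vocabulary stay close to the original."""
    if not original.strip() or not backtranslated.strip():
        return False

    # Length guard: reject outputs that shrank or ballooned too much.
    ratio = len(backtranslated) / len(original)
    if not (min_len_ratio <= ratio <= max_len_ratio):
        return False

    # Crude semantic guard: Jaccard overlap between the two token sets.
    a = set(original.lower().split())
    b = set(backtranslated.lower().split())
    return len(a & b) / len(a | b) >= min_overlap


original = "Take the tablet twice daily after meals"
print(passes_backtranslation_guard(original, "Take the tablet two times daily after meals"))
print(passes_backtranslation_guard(original, "Yes"))
```

Token overlap is only a rough proxy: it penalizes valid rewordings, so a lenient threshold is the safer default if this approach is used at all.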

📊 SFT Data Enrichment

Multiple Answer Variants: 2-3 different answers per question for better reasoning
Multiple Question Variants: 2-3 different questions per answer for diverse training
Cross Combinations: All question × answer variant combinations (up to 9 per sample)
Vietnamese Variants: Translated versions of enriched combinations
Reasoning Enhancement: Multiple reasoning paths for improved model training
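
The "up to 9 per sample" figure follows from pairing at most 3 question variants with at most 3 answer variants. A minimal sketch of that cross combination, with a hypothetical function name and record layout:

```python
from itertools import product
from typing import Dict, List


def cross_combinations(
    questions: List[str], answers: List[str], max_variants: int = 3
) -> List[Dict[str, str]]:
    """Pair every question variant with every answer variant (capped per side)."""
    qs = questions[:max_variants]
    ans = answers[:max_variants]
    return [{"question": q, "answer": a} for q, a in product(qs, ans)]


pairs = cross_combinations(
    ["What causes migraines?", "Why do migraines occur?", "What triggers a migraine?"],
    ["Answer variant A", "Answer variant B", "Answer variant C"],
)
print(len(pairs))  # 3 questions x 3 answers = 9
```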

🔍 Quality Assurance

Invalid Response Detection: Catches "Fail", "Invalid", "I can't", "Sorry", etc.
Retry Logic: Up to 3 attempts with different paraphrasing difficulties
Drop Strategy: Samples dropped if retry fails (no fallback answers)
Consistency Checking: LLM-based validation of answer quality
De-identification: PHI removal with configurable strictness
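
The detection, retry, and drop steps above can be sketched as follows. The marker list is limited to the four examples named in this README (the real list is longer, per the "etc."), and the function names, the `paraphrase` callable, and the easy/hard alternation order are assumptions.

```python
import re
from typing import Callable, Optional

# Markers taken from the README; matched only at the start of the response and
# on a word boundary, so medical terms such as "Failure to thrive" are not flagged.
_INVALID_RE = re.compile(r"^\s*(fail|invalid|i can't|sorry)\b", re.IGNORECASE)


def is_invalid(text: str) -> bool:
    """Return True for empty responses or responses that open with a refusal marker."""
    return not text.strip() or _INVALID_RE.match(text) is not None


def paraphrase_with_retry(
    text: str,
    paraphrase: Callable[[str, str], str],  # hypothetical: (text, difficulty) -> paraphrase
    max_attempts: int = 3,
) -> Optional[str]:
    """Try up to max_attempts, alternating difficulty; return None so the caller drops the sample."""
    difficulties = ("easy", "hard")
    for attempt in range(max_attempts):
        candidate = paraphrase(text, difficulties[attempt % len(difficulties)])
        if not is_invalid(candidate):
            return candidate
    return None  # no fallback answer: the sample is dropped


def fake_paraphrase(text: str, difficulty: str) -> str:
    # Stand-in for a model call: refuses on "easy", succeeds on "hard".
    return "Sorry, I can't help." if difficulty == "easy" else f"[{difficulty}] {text}"


print(paraphrase_with_retry("Take with food.", fake_paraphrase))
```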

🎯 RAG Optimization

Embedding-Friendly: Concise, direct text optimized for dense retrieval
Context Generation: Synthetic context creation when missing
Content Cleaning: Conversational element removal for medical focus
Length Control: Hard caps on question/answer/context lengths
Quality Filtering: Invalid response cleaning for RAG corpora
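
As an illustration of content cleaning combined with a hard length cap, the sketch below strips a leading greeting and truncates at a word boundary. The greeting pattern, the function name, and the cap values are assumptions; the project's actual cleaning rules are not shown in this README.

```python
import re

# Assumed pattern: a greeting word, an optional addressee, then punctuation.
_GREETING_RE = re.compile(r"^(?:hi|hello|dear)(?:\s+\w+)?[,.!]\s*", re.IGNORECASE)


def clean_for_rag(text: str, max_chars: int) -> str:
    """Normalize whitespace, drop a leading greeting, and enforce a hard character cap."""
    text = " ".join(text.split())
    text = _GREETING_RE.sub("", text)
    if len(text) <= max_chars:
        return text

    cut = text[:max_chars]
    if text[max_chars].isspace():
        return cut  # the cap already falls on a word boundary
    # Otherwise back up to the last space so no word is split in half.
    head, sep, _ = cut.rpartition(" ")
    return head if sep else cut


print(clean_for_rag("Hello doctor,   I have a fever.", max_chars=200))
print(clean_for_rag("Aspirin reduces fever and pain", max_chars=12))
```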

📋 Supported Datasets

Medical Dialogue

HealthCareMagic: 100k medical conversations
iCliniq: 10k derived medical Q&A

Biomedical QA

PubMedQA-L: Labeled biomedical questions
PubMedQA-U: Unlabeled biomedical questions
PubMedQA-MAP: Mapped biomedical Q&A pairs

⚙️ Configuration

Mode Selection

# Local Mode (MedAlpaca-13b)
IS_LOCAL=true
HF_TOKEN=your_huggingface_token

# Cloud Mode (NVIDIA/Gemini APIs)
IS_LOCAL=false
NVIDIA_API_1=your_nvidia_key
GEMINI_API_1=your_gemini_key
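
A minimal sketch of how the mode could be resolved from these variables. The function name, the accepted truthy spellings, the cloud default when `IS_LOCAL` is unset, and the idea that cloud mode needs both keys are assumptions, not confirmed behavior.

```python
import os
from typing import List, Mapping, Optional, Tuple

TRUTHY = {"1", "true", "yes", "on"}  # assumed accepted spellings


def resolve_mode(env: Optional[Mapping[str, str]] = None) -> Tuple[str, List[str]]:
    """Return ("local" | "cloud", names of required variables that are missing)."""
    env = os.environ if env is None else env
    is_local = env.get("IS_LOCAL", "false").strip().lower() in TRUTHY
    mode = "local" if is_local else "cloud"
    # Assumption: local mode needs HF_TOKEN; cloud mode needs both API keys.
    required = ("HF_TOKEN",) if is_local else ("NVIDIA_API_1", "GEMINI_API_1")
    missing = [name for name in required if not env.get(name)]
    return mode, missing


print(resolve_mode({"IS_LOCAL": "true", "HF_TOKEN": "your_huggingface_token"}))
print(resolve_mode({"IS_LOCAL": "false", "NVIDIA_API_1": "your_nvidia_key"}))
```

Passing a plain dict instead of reading `os.environ` directly keeps the check easy to exercise without touching the real environment.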

Augmentation Parameters

class AugmentOptions:
    paraphrase_ratio: float = 0.2          # 0.0-1.0
    paraphrase_outputs: bool = True         # Augment model answers
    backtranslate_ratio: float = 0.1        # 0.0-1.0 (Vietnamese pivot)
    style_standardize: bool = True          # Enforce clinical style
    deidentify: bool = True                 # Remove PHI
    dedupe: bool = True                     # Remove duplicates
    max_chars: int = 5000                   # Text length limit
    consistency_check_ratio: float = 0.05   # 0.0-1.0
    expand: bool = True                     # Enable enrichment
    max_aug_per_sample: int = 2             # 1-3 variants
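
One plausible reading of the ratio parameters is that each is a per-sample probability of applying that augmentation. The sketch below illustrates that reading with a trimmed-down copy of the options class; the `@dataclass` decorator, the range validation, and `plan_augmentations` are assumptions added for the example.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class AugmentOptions:
    # Trimmed to the two ratio fields; defaults copied from the README.
    paraphrase_ratio: float = 0.2
    backtranslate_ratio: float = 0.1

    def __post_init__(self) -> None:
        for name in ("paraphrase_ratio", "backtranslate_ratio"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be within 0.0-1.0, got {value}")


def plan_augmentations(n_samples: int, opts: AugmentOptions, seed: int = 0) -> List[List[str]]:
    """Decide, independently per sample, which augmentations to apply."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    plan = []
    for _ in range(n_samples):
        steps = []
        if rng.random() < opts.paraphrase_ratio:
            steps.append("paraphrase")
        if rng.random() < opts.backtranslate_ratio:
            steps.append("backtranslate")
        plan.append(steps)
    return plan


plan = plan_augmentations(1000, AugmentOptions())
print(sum("paraphrase" in steps for steps in plan))  # roughly 20% of 1000
```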

Processing Modes

SFT Processing: Supervised Fine-Tuning format with enrichment
RAG Processing: Question-Context-Answer format for retrieval
Vietnamese Mode: Complete translation of all text fields

📈 Output Statistics

The system tracks the following counters for each run:

written: Successfully processed samples
paraphrased_input/output: Number of inputs and outputs that were paraphrased
backtranslated_input/output: Number of inputs and outputs that were backtranslated
dropped_invalid: Samples dropped due to failed retries
vietnamese_variants: Vietnamese variants created
dedup_skipped: Duplicate samples removed
consistency_failed: Samples flagged for quality issues
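
To show how two of these counters might be maintained, here is a small sketch that deduplicates samples by a normalized hash while updating `written` and `dedup_skipped`. The counter names come from the list above; the fingerprinting scheme, the function names, and the `question`/`answer` field names are assumptions.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Tuple


def fingerprint(sample: Dict[str, str]) -> str:
    """Hash the case- and whitespace-normalized question and answer."""
    parts = (" ".join(sample[key].lower().split()) for key in ("question", "answer"))
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def write_unique(samples: List[Dict[str, str]]) -> Tuple[List[Dict[str, str]], Counter]:
    """Keep the first occurrence of each sample and count what was written or skipped."""
    stats = Counter(written=0, dedup_skipped=0)
    seen = set()
    kept = []
    for sample in samples:
        key = fingerprint(sample)
        if key in seen:
            stats["dedup_skipped"] += 1
            continue
        seen.add(key)
        kept.append(sample)
        stats["written"] += 1
    return kept, stats


kept, stats = write_unique([
    {"question": "What is anemia?", "answer": "A low red blood cell count."},
    {"question": "what is  anemia?", "answer": "A low red blood cell count."},
    {"question": "What is asthma?", "answer": "A chronic airway disease."},
])
print(dict(stats))
```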

🔧 Usage

Web Interface

Open the Hugging Face Space
Select a dataset and a processing mode (SFT or RAG)
Enable Vietnamese translation if needed
Click the process button

API Usage

# SFT Processing with Vietnamese translation
curl -X POST "https://huggingface.co/spaces/MedSwin/medai-processing/process/healthcaremagic" \
  -H "Content-Type: application/json" \
  -d '{
    "augment": {
      "paraphrase_ratio": 0.2,
      "backtranslate_ratio": 0.1,
      "paraphrase_outputs": true,
      "style_standardize": true,
      "deidentify": true,
      "dedupe": true,
      "expand": true
    },
    "vietnamese_translation": true
  }'

# RAG Processing
curl -X POST "https://huggingface.co/spaces/MedSwin/medai-processing/rag/healthcaremagic" \
  -H "Content-Type: application/json" \
  -d '{
    "vietnamese_translation": true
  }'
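
For callers working in Python, the first curl command can be expressed with the standard library as sketched below. The URL and payload fields are copied from the curl examples above; the helper name is hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request
from typing import Any, Dict

# Base URL copied verbatim from the curl examples.
BASE_URL = "https://huggingface.co/spaces/MedSwin/medai-processing"


def build_process_request(
    dataset: str, augment: Dict[str, Any], vietnamese_translation: bool = False
) -> urllib.request.Request:
    """Build (but do not send) the POST request for SFT processing of a dataset."""
    payload = {"augment": augment, "vietnamese_translation": vietnamese_translation}
    return urllib.request.Request(
        url=f"{BASE_URL}/process/{dataset}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_process_request(
    "healthcaremagic",
    {"paraphrase_ratio": 0.2, "backtranslate_ratio": 0.1, "dedupe": True},
    vietnamese_translation=True,
)
print(req.get_method(), req.full_url)

# To actually send it:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```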

📚 Documentation

📄 License

Released under the Apache-2.0 license. See LICENSE.