title: Medical Processing
emoji: ⚕️
colorFrom: green
colorTo: pink
sdk: docker
pinned: false
license: apache-2.0
short_description: Data processing. Derived from 500k medical knowledge mix
🎯 Features
🏠 Dual Mode Operation
- Local Mode: MedAlpaca-13b model running locally for privacy and cost efficiency
- Cloud Mode: NVIDIA + Gemini API integration for scalable processing
- Dynamic Switching: Toggle between modes via environment variables
- Medical Specialization: MedAlpaca-13b specifically fine-tuned for medical tasks
🔄 Advanced Data Augmentation
- Paraphrasing: Multi-model rotation (NVIDIA + Gemini) with easy/hard difficulty levels
- Backtranslation: Vietnamese pivot language for semantic preservation
- Style Standardization: Clinical voice enforcement and professional medical tone
- Response Validation: Invalid response detection and retry logic (max 3 attempts)
- Quality Guards: Length/semantic validation for backtranslation outputs
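The length/semantic guard for backtranslation could look roughly like the sketch below. The function name, the thresholds, and the use of `difflib.SequenceMatcher` are all assumptions for illustration; `SequenceMatcher` measures lexical overlap only and stands in for whatever semantic check the project actually uses.

```python
from difflib import SequenceMatcher


def passes_backtranslation_guard(
    original: str,
    round_trip: str,
    min_len_ratio: float = 0.5,   # hypothetical threshold
    max_len_ratio: float = 2.0,   # hypothetical threshold
    min_similarity: float = 0.3,  # hypothetical threshold
) -> bool:
    """Return True if a round-trip translation looks like a usable variant."""
    if not original.strip() or not round_trip.strip():
        return False

    # Length guard: reject outputs that shrank or ballooned too much.
    len_ratio = len(round_trip) / len(original)
    if not (min_len_ratio <= len_ratio <= max_len_ratio):
        return False

    # Lexical similarity as a cheap proxy for semantic preservation.
    similarity = SequenceMatcher(None, original.lower(), round_trip.lower()).ratio()
    return similarity >= min_similarity
```

A sample that fails the guard would simply keep its original text instead of the backtranslated variant.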
📊 SFT Data Enrichment
- Multiple Answer Variants: 2-3 different answers per question for better reasoning
- Multiple Question Variants: 2-3 different questions per answer for diverse training
- Cross Combinations: All question × answer variant combinations (up to 9 per sample)
- Vietnamese Variants: Translated versions of enriched combinations
- Reasoning Enhancement: Multiple reasoning paths for improved model training
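The "up to 9 per sample" figure follows from pairing at most 3 question variants with at most 3 answer variants. A minimal sketch of that cross product, assuming hypothetical field names `instruction` and `output`:

```python
from itertools import product
from typing import Dict, List


def cross_combinations(
    question_variants: List[str],
    answer_variants: List[str],
    max_variants: int = 3,
) -> List[Dict[str, str]]:
    """Pair every question variant with every answer variant."""
    # dict.fromkeys drops exact duplicates while preserving order.
    questions = list(dict.fromkeys(question_variants))[:max_variants]
    answers = list(dict.fromkeys(answer_variants))[:max_variants]
    return [
        {"instruction": q, "output": a}  # field names are illustrative
        for q, a in product(questions, answers)
    ]
```

With 3 questions and 3 answers this yields 9 pairs; with 2 questions and 3 answers it yields 6.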
🔍 Quality Assurance
- Invalid Response Detection: Catches "Fail", "Invalid", "I can't", "Sorry", etc.
- Retry Logic: Up to 3 attempts with different paraphrasing difficulties
- Drop Strategy: Samples dropped if retry fails (no fallback answers)
- Consistency Checking: LLM-based validation of answer quality
- De-identification: PHI removal with configurable strictness
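The detection, retry, and drop steps above can be sketched as follows. The marker list comes from the examples named above; prefix matching, the alternation between easy and hard difficulty, and the `paraphrase` callable are assumptions rather than the project's actual implementation.

```python
from typing import Callable, Optional

# Markers taken from the examples listed above; the real list may be longer.
INVALID_MARKERS = ("fail", "invalid", "i can't", "sorry")


def is_invalid_response(text: str) -> bool:
    """Treat empty output or output starting with a refusal marker as invalid."""
    stripped = text.strip().lower()
    return not stripped or stripped.startswith(INVALID_MARKERS)


def paraphrase_with_retry(
    text: str,
    paraphrase: Callable[[str, str], str],  # hypothetical: (text, difficulty) -> str
    max_attempts: int = 3,
) -> Optional[str]:
    """Try up to max_attempts times; return None so the caller drops the sample."""
    difficulties = ["easy", "hard"]
    for attempt in range(max_attempts):
        difficulty = difficulties[attempt % len(difficulties)]
        candidate = paraphrase(text, difficulty)
        if not is_invalid_response(candidate):
            return candidate
    return None  # no fallback answer: the sample is dropped
```

Returning `None` instead of the original text mirrors the drop strategy: a sample that never produces a valid response is left out of the output.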
🎯 RAG Optimization
- Embedding-Friendly: Concise, direct text optimized for dense retrieval
- Context Generation: Synthetic context creation when missing
- Content Cleaning: Conversational element removal for medical focus
- Length Control: Hard caps on question/answer/context lengths
- Quality Filtering: Invalid response cleaning for RAG corpora
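A hard length cap can be as simple as the sketch below. The function name is hypothetical and the actual cap values are not stated here; cutting at the last sentence boundary inside the limit is one possible design choice, not necessarily what the Space does.

```python
def hard_cap(text: str, max_chars: int) -> str:
    """Collapse whitespace and truncate text to at most max_chars characters."""
    text = " ".join(text.split())
    if len(text) <= max_chars:
        return text

    cut = text[:max_chars]
    # Prefer ending on a complete sentence if one fits inside the cap.
    end = max(cut.rfind(". "), cut.rfind("? "), cut.rfind("! "))
    if end > 0:
        return cut[: end + 1]
    return cut.rstrip()
```

For example, `hard_cap("Take one tablet daily. Drink water. Rest well.", 30)` returns `"Take one tablet daily."`.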
📋 Supported Datasets
Medical Dialogue
- HealthCareMagic: 100k medical conversations
- iCliniq: 10k derived medical Q&A
Biomedical QA
- PubMedQA-L: Labeled biomedical questions
- PubMedQA-U: Unlabeled biomedical questions
- PubMedQA-MAP: Mapped biomedical Q&A pairs
⚙️ Configuration
Mode Selection
# Local Mode (MedAlpaca-13b)
IS_LOCAL=true
HF_TOKEN=your_huggingface_token
# Cloud Mode (NVIDIA/Gemini APIs)
IS_LOCAL=false
NVIDIA_API_1=your_nvidia_key
GEMINI_API_1=your_gemini_key
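A minimal sketch of how these variables might be read at startup. The function names, the set of accepted truthy strings, the cloud default when `IS_LOCAL` is unset, and the idea that numbered suffixes (`_1`, `_2`, ...) denote several rotating keys are all assumptions.

```python
import os
from typing import Dict, List, Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}


def collect_keys(env: Mapping[str, str], prefix: str) -> List[str]:
    """Gather PREFIX_1, PREFIX_2, ... until the first missing or empty one."""
    keys = []
    index = 1
    while env.get(f"{prefix}_{index}"):
        keys.append(env[f"{prefix}_{index}"])
        index += 1
    return keys


def resolve_mode(env: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Decide between local and cloud mode from environment variables."""
    env = os.environ if env is None else env
    if env.get("IS_LOCAL", "false").strip().lower() in TRUTHY:
        return {"mode": "local", "has_hf_token": bool(env.get("HF_TOKEN"))}
    return {
        "mode": "cloud",
        "nvidia_keys": collect_keys(env, "NVIDIA_API"),
        "gemini_keys": collect_keys(env, "GEMINI_API"),
    }
```

Passing a plain dict instead of `os.environ` makes the switch easy to test without touching the real environment.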
Augmentation Parameters
class AugmentOptions:
    paraphrase_ratio: float = 0.2          # 0.0-1.0
    paraphrase_outputs: bool = True        # Augment model answers
    backtranslate_ratio: float = 0.1       # 0.0-1.0 (Vietnamese pivot)
    style_standardize: bool = True         # Enforce clinical style
    deidentify: bool = True                # Remove PHI
    dedupe: bool = True                    # Remove duplicates
    max_chars: int = 5000                  # Text length limit
    consistency_check_ratio: float = 0.05  # 0.0-1.0
    expand: bool = True                    # Enable enrichment
    max_aug_per_sample: int = 2            # 1-3 variants
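One plausible reading of the ratio parameters is a per-sample probability: with `paraphrase_ratio = 0.2`, roughly one sample in five gets paraphrased. The sketch below illustrates that interpretation only; the function name and the seeding scheme are hypothetical, and the Space may select samples differently.

```python
import random
from typing import List, Sequence


def select_for_augmentation(
    samples: Sequence[object], ratio: float, seed: int = 0
) -> List[bool]:
    """Return one flag per sample, True with probability `ratio`."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    rng = random.Random(seed)  # seeded so runs are reproducible
    # random() returns a value in [0.0, 1.0), so ratio=0 selects nothing
    # and ratio=1 selects everything.
    return [rng.random() < ratio for _ in samples]
```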
Processing Modes
- SFT Processing: Supervised Fine-Tuning format with enrichment
- RAG Processing: Question-Context-Answer format for retrieval
- Vietnamese Mode: Complete translation of all text fields
📈 Output Statistics
The system tracks comprehensive statistics:
- written: Successfully processed samples
- paraphrased_input/output: Paraphrasing counts
- backtranslated_input/output: Backtranslation counts
- dropped_invalid: Samples dropped due to failed retries
- vietnamese_variants: Vietnamese variants created
- dedup_skipped: Duplicate samples removed
- consistency_failed: Samples flagged for quality issues
🔧 Usage
Web Interface
- Visit the HF Space
- Select dataset and processing mode (SFT/RAG)
- Enable Vietnamese translation if needed
- Click the process button
API Usage
# SFT Processing with Vietnamese translation
curl -X POST "https://huggingface.co/spaces/MedSwin/medai-processing/process/healthcaremagic" \
  -H "Content-Type: application/json" \
  -d '{
    "augment": {
      "paraphrase_ratio": 0.2,
      "backtranslate_ratio": 0.1,
      "paraphrase_outputs": true,
      "style_standardize": true,
      "deidentify": true,
      "dedupe": true,
      "expand": true
    },
    "vietnamese_translation": true
  }'
# RAG Processing
curl -X POST "https://huggingface.co/spaces/MedSwin/medai-processing/rag/healthcaremagic" \
  -H "Content-Type: application/json" \
  -d '{
    "vietnamese_translation": true
  }'
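The same requests can be prepared from Python. This sketch only builds the request objects with the standard library and does not send them; the helper name `build_request` is hypothetical, while the URL, paths, and JSON fields are copied from the curl examples above.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://huggingface.co/spaces/MedSwin/medai-processing"


def build_request(
    mode: str,                      # "process" (SFT) or "rag", as in the curl paths
    dataset: str,
    vietnamese_translation: bool = False,
    augment: Optional[dict] = None,
) -> urllib.request.Request:
    """Build a JSON POST request matching the curl examples."""
    if mode not in ("process", "rag"):
        raise ValueError("mode must be 'process' or 'rag'")
    payload = {"vietnamese_translation": vietnamese_translation}
    if augment is not None:
        payload["augment"] = augment
    return urllib.request.Request(
        url=f"{BASE_URL}/{mode}/{dataset}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To actually submit the job, pass the returned object to `urllib.request.urlopen(...)`.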