Commit 65da874
Parent(s): 5dcfc82
Upd README
README.md
CHANGED
|
@@ -6,16 +6,16 @@ colorTo: pink
|
|
| 6 |
sdk: docker
|
| 7 |
pinned: false
|
| 8 |
license: apache-2.0
|
| 9 |
-
short_description:
|
| 10 |
---
|
| 11 |
|
| 12 |
-
## Quick Access
|
| 13 |
|
| 14 |
[HF Space](https://huggingface.co/spaces/MedVietAI/processing)
|
| 15 |
|
| 16 |
[MedDialog-100k](https://huggingface.co/datasets/MedAI-COS30018/MedDialog-EN-100k)
|
| 17 |
|
| 18 |
-
[MedDialog-
|
| 19 |
|
| 20 |
[PubMedQA-Labelled](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-L)
|
| 21 |
|
|
@@ -23,10 +23,127 @@ short_description: Data processing with en-vi translation. Derived from 500k mi
|
|
| 23 |
|
| 24 |
[PubMedQA-Mapper](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-MAP)
|
| 25 |
|
|
|
|
| 26 |
|
| 27 |
-
|
| 28 |
-
|
| 29 |
|
| 30 |
-
## License
|
| 31 |
[Apache-2.0 LICENSE](https://huggingface.co/spaces/MedVietAI/processing/blob/main/LICENSE.txt)
|
| 32 |
|
|
|
|
| 6 |
sdk: docker
|
| 7 |
pinned: false
|
| 8 |
license: apache-2.0
|
| 9 |
+
short_description: Advanced medical data processing with Vietnamese translation, data augmentation, and quality validation
|
| 10 |
---
|
| 11 |
|
| 12 |
+
## 🚀 Quick Access
|
| 13 |
|
| 14 |
[HF Space](https://huggingface.co/spaces/MedVietAI/processing)
|
| 15 |
|
| 16 |
[MedDialog-100k](https://huggingface.co/datasets/MedAI-COS30018/MedDialog-EN-100k)
|
| 17 |
|
| 18 |
+
[MedDialog-10k](https://huggingface.co/datasets/MedAI-COS30018/MedDialog-EN-10k)
|
| 19 |
|
| 20 |
[PubMedQA-Labelled](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-L)
|
| 21 |
|
|
|
|
| 23 |
|
| 24 |
[PubMedQA-Mapper](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-MAP)
|
| 25 |
|
| 26 |
+
## 🎯 Features
|
| 27 |
|
| 28 |
+
### 🔄 Advanced Data Augmentation
|
| 29 |
+
- **Paraphrasing**: Multi-model rotation (NVIDIA + Gemini) with easy/hard difficulty levels
|
| 30 |
+
- **Backtranslation**: Vietnamese pivot language for semantic preservation
|
| 31 |
+
- **Style Standardization**: Clinical voice enforcement and professional medical tone
|
| 32 |
+
- **Response Validation**: Invalid response detection and retry logic (max 3 attempts)
|
| 33 |
+
- **Quality Guards**: Length/semantic validation for backtranslation outputs
|
| 34 |
+
|
| 35 |
+
### 🇻🇳 Vietnamese Translation
|
| 36 |
+
- **Complete Translation**: All text fields translated when Vietnamese mode is enabled
|
| 37 |
+
- **Quality Validation**: Translation quality checks with fallback to original text
|
| 38 |
+
- **SFT Format**: `instruction`, `input`, `output` fields translated
|
| 39 |
+
- **RAG Format**: `question`, `answer`, `context` fields translated
|
| 40 |
+
- **Sanitization**: Repetition reduction and whitespace normalization
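
The sanitization and fallback bullets above could look roughly like the following. This is a minimal sketch, not the Space's actual code: `sanitize`, `translate_with_fallback`, and the `translate` callable are hypothetical names, and the fallback rule (empty or unchanged output) is an assumed stand-in for the real quality checks.

```python
import re
from typing import Callable


def sanitize(text: str) -> str:
    """Collapse whitespace and squeeze immediately repeated words."""
    text = re.sub(r"\s+", " ", text).strip()
    # Reduce runs like "pain pain pain" to a single "pain" (case-insensitive).
    return re.sub(r"\b(\w+)(?: \1\b)+", r"\1", text, flags=re.IGNORECASE)


def translate_with_fallback(text: str, translate: Callable[[str], str]) -> str:
    """Return the sanitized translation, or the original text if it looks unusable."""
    try:
        candidate = sanitize(translate(text))
    except Exception:
        return text  # translator failed: keep the original
    if not candidate or candidate == sanitize(text):
        return text  # empty or untranslated output: keep the original
    return candidate
```

Passing the translator in as a callable keeps the sketch independent of whichever translation backend the pipeline actually uses.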
|
| 41 |
+
|
| 42 |
+
### 📊 SFT Data Enrichment
|
| 43 |
+
- **Multiple Answer Variants**: 2-3 different answers per question for better reasoning
|
| 44 |
+
- **Multiple Question Variants**: 2-3 different questions per answer for diverse training
|
| 45 |
+
- **Cross Combinations**: All question × answer variant combinations (up to 9 per sample)
|
| 46 |
+
- **Vietnamese Variants**: Translated versions of enriched combinations
|
| 47 |
+
- **Reasoning Enhancement**: Multiple reasoning paths for improved model training
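
The "up to 9 per sample" figure follows from pairing at most 3 question variants with at most 3 answer variants. A minimal sketch of that cross product, assuming the question goes in `instruction` and the answer in `output` (the function name and the field mapping are illustrative, not confirmed by the README):

```python
from itertools import product


def cross_combinations(questions: list[str], answers: list[str],
                       max_variants: int = 3) -> list[dict[str, str]]:
    """Pair every question variant with every answer variant."""
    qs = questions[:max_variants]
    ans = answers[:max_variants]
    return [{"instruction": q, "output": a} for q, a in product(qs, ans)]


pairs = cross_combinations(["q1", "q2", "q3"], ["a1", "a2", "a3"])
print(len(pairs))  # 3 questions x 3 answers
```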
|
| 48 |
+
|
| 49 |
+
### 🔍 Quality Assurance
|
| 50 |
+
- **Invalid Response Detection**: Catches "Fail", "Invalid", "I can't", "Sorry", etc.
|
| 51 |
+
- **Retry Logic**: Up to 3 attempts with different paraphrasing difficulties
|
| 52 |
+
- **Drop Strategy**: Samples dropped if retry fails (no fallback answers)
|
| 53 |
+
- **Consistency Checking**: LLM-based validation of answer quality
|
| 54 |
+
- **De-identification**: PHI removal with configurable strictness
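
The detection, retry, and drop bullets above can be sketched together. Everything here is an assumption made for illustration: the marker list only covers the examples quoted in the bullet, the difficulty rotation is invented, and `paraphrase` stands in for whatever model call the pipeline makes.

```python
from typing import Callable, Optional

# Assumed marker list, taken from the examples in the bullet above.
INVALID_MARKERS = ("fail", "invalid", "i can't", "sorry")
DIFFICULTIES = ("easy", "hard", "easy")  # hypothetical rotation, one per attempt


def is_invalid(response: str) -> bool:
    """Flag empty responses and ones that open with a refusal or failure marker."""
    text = response.strip().lower()
    return not text or text.startswith(INVALID_MARKERS)


def paraphrase_with_retry(text: str,
                          paraphrase: Callable[[str, str], str],
                          max_attempts: int = 3) -> Optional[str]:
    """Try up to max_attempts; return None so the caller can drop the sample."""
    for attempt in range(max_attempts):
        difficulty = DIFFICULTIES[attempt % len(DIFFICULTIES)]
        candidate = paraphrase(text, difficulty)
        if not is_invalid(candidate):
            return candidate
    return None
```

A prefix match is deliberately crude: it would also reject a legitimate answer that begins with "Failure to...", so a real detector likely needs a stricter rule.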
|
| 55 |
+
|
| 56 |
+
### 🎯 RAG Optimization
|
| 57 |
+
- **Embedding-Friendly**: Concise, direct text optimized for dense retrieval
|
| 58 |
+
- **Context Generation**: Synthetic context creation when missing
|
| 59 |
+
- **Content Cleaning**: Conversational element removal for medical focus
|
| 60 |
+
- **Length Control**: Hard caps on question/answer/context lengths
|
| 61 |
+
- **Quality Filtering**: Invalid response cleaning for RAG corpora
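
Length control with hard caps might be implemented along these lines. The cap values and helper names below are hypothetical; the README does not state the real limits.

```python
# Hypothetical caps; the README does not state the real limits.
CAPS = {"question": 256, "answer": 1024, "context": 2048}


def cap_text(text: str, limit: int) -> str:
    """Truncate to at most `limit` characters, preferring a word boundary."""
    text = " ".join(text.split())  # normalize whitespace first
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Back up to the last space so a word is not split, if there is one.
    head, sep, _ = cut.rpartition(" ")
    return head if sep else cut


def cap_record(record: dict[str, str]) -> dict[str, str]:
    """Apply the per-field caps, leaving other fields untouched."""
    return {k: cap_text(v, CAPS[k]) if k in CAPS else v for k, v in record.items()}
```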
|
| 62 |
+
|
| 63 |
+
## 📋 Supported Datasets
|
| 64 |
+
|
| 65 |
+
### Medical Dialogue
|
| 66 |
+
- **HealthCareMagic**: 100k medical conversations
|
| 67 |
+
- **iCliniq**: 10k derived medical Q&A
|
| 68 |
+
|
| 69 |
+
### Biomedical QA
|
| 70 |
+
- **PubMedQA-L**: Labeled biomedical questions
|
| 71 |
+
- **PubMedQA-U**: Unlabeled biomedical questions
|
| 72 |
+
- **PubMedQA-MAP**: Mapped biomedical Q&A pairs
|
| 73 |
+
|
| 74 |
+
## ⚙️ Configuration
|
| 75 |
+
|
| 76 |
+
### Augmentation Parameters
|
| 77 |
+
```python
|
| 78 |
+
class AugmentOptions:
|
| 79 |
+
    paraphrase_ratio: float = 0.2  # 0.0-1.0
|
| 80 |
+
    paraphrase_outputs: bool = True  # Augment model answers
|
| 81 |
+
    backtranslate_ratio: float = 0.1  # 0.0-1.0 (Vietnamese pivot)
|
| 82 |
+
    style_standardize: bool = True  # Enforce clinical style
|
| 83 |
+
    deidentify: bool = True  # Remove PHI
|
| 84 |
+
    dedupe: bool = True  # Remove duplicates
|
| 85 |
+
    max_chars: int = 5000  # Text length limit
|
| 86 |
+
    consistency_check_ratio: float = 0.05  # 0.0-1.0
|
| 87 |
+
    expand: bool = True  # Enable enrichment
|
| 88 |
+
    max_aug_per_sample: int = 2  # 1-3 variants
|
| 89 |
+
```
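
One plausible reading of the `*_ratio` options is that each is the per-sample probability of applying that augmentation. The README does not confirm this, so the sketch below is only an illustration of that interpretation; `should_apply` is a hypothetical helper.

```python
import random


def should_apply(ratio: float, rng: random.Random) -> bool:
    """Return True for roughly `ratio` of calls (0.0 = never, 1.0 = always)."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError(f"ratio must be within [0.0, 1.0], got {ratio}")
    return rng.random() < ratio


rng = random.Random(0)  # seeded so a run is reproducible
picked = sum(should_apply(0.2, rng) for _ in range(10_000))
print(f"paraphrased {picked} of 10000 samples")
```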
|
| 90 |
+
|
| 91 |
+
### Processing Modes
|
| 92 |
+
- **SFT Processing**: Supervised Fine-Tuning format with enrichment
|
| 93 |
+
- **RAG Processing**: Question-Context-Answer format for retrieval
|
| 94 |
+
- **Vietnamese Mode**: Complete translation of all text fields
|
| 95 |
+
|
| 96 |
+
## 📈 Output Statistics
|
| 97 |
+
|
| 98 |
+
The system tracks the following statistics:
|
| 99 |
+
- `written`: Successfully processed samples
|
| 100 |
+
- `paraphrased_input/output`: Paraphrasing counts
|
| 101 |
+
- `backtranslated_input/output`: Backtranslation counts
|
| 102 |
+
- `dropped_invalid`: Samples dropped due to failed retries
|
| 103 |
+
- `vietnamese_variants`: Vietnamese variants created
|
| 104 |
+
- `dedup_skipped`: Duplicate samples removed
|
| 105 |
+
- `consistency_failed`: Samples flagged for quality issues
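
Counters like these are commonly kept in a `collections.Counter`. The sketch below assumes that `paraphrased_input/output` and `backtranslated_input/output` each expand to two separate keys; that expansion and the helper name are assumptions.

```python
from collections import Counter

# Key names follow the list above; the "_input"/"_output" split is assumed.
STAT_KEYS = (
    "written", "paraphrased_input", "paraphrased_output",
    "backtranslated_input", "backtranslated_output",
    "dropped_invalid", "vietnamese_variants",
    "dedup_skipped", "consistency_failed",
)


def new_stats() -> Counter:
    """Start every tracked counter at zero so the report always has all keys."""
    return Counter({key: 0 for key in STAT_KEYS})


stats = new_stats()
stats["written"] += 3
stats["dedup_skipped"] += 1
print(dict(stats))
```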
|
| 106 |
+
|
| 107 |
+
## 🔧 Usage
|
| 108 |
+
|
| 109 |
+
### Web Interface
|
| 110 |
+
1. Visit the [HF Space](https://huggingface.co/spaces/MedVietAI/processing)
|
| 111 |
+
2. Select a dataset and processing mode (SFT/RAG)
|
| 112 |
+
3. Enable Vietnamese translation if needed
|
| 113 |
+
4. Click the process button
|
| 114 |
+
|
| 115 |
+
### API Usage
|
| 116 |
+
```bash
|
| 117 |
+
# SFT Processing with Vietnamese translation
|
| 118 |
+
curl -X POST "https://huggingface.co/spaces/MedVietAI/processing/process/healthcaremagic" \
|
| 119 |
+
  -H "Content-Type: application/json" \
|
| 120 |
+
  -d '{
|
| 121 |
+
    "augment": {
|
| 122 |
+
      "paraphrase_ratio": 0.2,
|
| 123 |
+
      "backtranslate_ratio": 0.1,
|
| 124 |
+
      "paraphrase_outputs": true,
|
| 125 |
+
      "style_standardize": true,
|
| 126 |
+
      "deidentify": true,
|
| 127 |
+
      "dedupe": true,
|
| 128 |
+
      "expand": true
|
| 129 |
+
    },
|
| 130 |
+
    "vietnamese_translation": true
|
| 131 |
+
  }'
|
| 132 |
+
|
| 133 |
+
# RAG Processing
|
| 134 |
+
curl -X POST "https://huggingface.co/spaces/MedVietAI/processing/rag/healthcaremagic" \
|
| 135 |
+
  -H "Content-Type: application/json" \
|
| 136 |
+
  -d '{
|
| 137 |
+
    "vietnamese_translation": true
|
| 138 |
+
  }'
|
| 139 |
+
```
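
The same RAG request can be built from Python with only the standard library. This is a sketch under the assumption that the endpoint in the curl example is correct as written; `BASE_URL` and `build_request` are illustrative names, and the request is constructed but not sent.

```python
import json
import urllib.request

# Endpoint copied from the curl example above; adjust if the Space is served elsewhere.
BASE_URL = "https://huggingface.co/spaces/MedVietAI/processing"


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        f"{BASE_URL}/{path.lstrip('/')}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request("rag/healthcaremagic", {"vietnamese_translation": True})
# To send it: urllib.request.urlopen(request, timeout=30)
```

Separating construction from sending makes the payload easy to inspect or log before any network call is made.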
|
| 140 |
+
|
| 141 |
+
## 📚 Documentation
|
| 142 |
+
|
| 143 |
+
- [Request Documentation](https://huggingface.co/spaces/MedVietAI/processing/blob/main/REQUEST.md)
|
| 144 |
+
- [Data Processing Guide](https://huggingface.co/spaces/MedVietAI/processing/blob/main/DATA_PROCESSING.md)
|
| 145 |
+
|
| 146 |
+
## 📄 License
|
| 147 |
|
|
|
|
| 148 |
[Apache-2.0 LICENSE](https://huggingface.co/spaces/MedVietAI/processing/blob/main/LICENSE.txt)
|
| 149 |
|