LiamKhoaLe committed on
Commit
65da874
·
1 Parent(s): 5dcfc82

Upd README

Files changed (1)
  1. README.md +123 -6
README.md CHANGED
@@ -6,16 +6,16 @@ colorTo: pink
6
  sdk: docker
7
  pinned: false
8
  license: apache-2.0
9
- short_description: Data processing with en-vi translation. Derived from 500k mi
10
  ---
11
 
12
- ## Quick Access:
13
 
14
  [HF Space](https://huggingface.co/spaces/MedVietAI/processing)
15
 
16
  [MedDialog-100k](https://huggingface.co/datasets/MedAI-COS30018/MedDialog-EN-100k)
17
 
18
- [MedDialog-100k](https://huggingface.co/datasets/MedAI-COS30018/MedDialog-EN-10k)
19
 
20
  [PubMedQA-Labelled](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-L)
21
 
@@ -23,10 +23,127 @@ short_description: Data processing with en-vi translation. Derived from 500k mi
23
 
24
  [PubMedQA-Mapper](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-MAP)
25
 
 
26
 
27
- ## CURL Request Instruction
28
- [Request Doc](https://huggingface.co/spaces/MedVietAI/processing/blob/main/REQUEST.md)
29
 
30
- ## License
31
  [Apache-2.0 LICENSE](https://huggingface.co/spaces/MedVietAI/processing/blob/main/LICENSE.txt)
32
 
 
6
  sdk: docker
7
  pinned: false
8
  license: apache-2.0
9
+ short_description: Advanced medical data processing with Vietnamese translation, data augmentation, and quality validation
10
  ---
11
 
12
+ ## 🚀 Quick Access
13
 
14
  [HF Space](https://huggingface.co/spaces/MedVietAI/processing)
15
 
16
  [MedDialog-100k](https://huggingface.co/datasets/MedAI-COS30018/MedDialog-EN-100k)
17
 
18
+ [MedDialog-10k](https://huggingface.co/datasets/MedAI-COS30018/MedDialog-EN-10k)
19
 
20
  [PubMedQA-Labelled](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-L)
21
 
 
23
 
24
  [PubMedQA-Mapper](https://huggingface.co/datasets/MedAI-COS30018/PubMedQA-MAP)
25
 
26
+ ## 🎯 Features
27
 
28
+ ### 🔄 Advanced Data Augmentation
29
+ - **Paraphrasing**: Multi-model rotation (NVIDIA + Gemini) with easy/hard difficulty levels
30
+ - **Backtranslation**: Vietnamese pivot language for semantic preservation
31
+ - **Style Standardization**: Clinical voice enforcement and professional medical tone
32
+ - **Response Validation**: Invalid response detection and retry logic (max 3 attempts)
33
+ - **Quality Guards**: Length/semantic validation for backtranslation outputs
34
+
35
+ ### 🇻🇳 Vietnamese Translation
36
+ - **Complete Translation**: All text fields translated when Vietnamese mode is enabled
37
+ - **Quality Validation**: Translation quality checks with fallback to original text
38
+ - **SFT Format**: `instruction`, `input`, `output` fields translated
39
+ - **RAG Format**: `question`, `answer`, `context` fields translated
40
+ - **Sanitization**: Repetition reduction and whitespace normalization
41
+
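A minimal sketch of what the "Sanitization" bullet could look like, assuming that "repetition reduction" means dropping a sentence that immediately repeats the previous one. The function name `sanitize` and the sentence-splitting rule are illustrative assumptions, not the Space's actual code.

```python
import re


def sanitize(text: str) -> str:
    """Collapse whitespace, then drop sentences that immediately repeat.

    Hypothetical helper: the real pipeline may use different rules.
    """
    # Whitespace normalization: any run of spaces/newlines/tabs becomes one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Naive sentence split on ., ! or ? followed by a space.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept = []
    for sentence in sentences:
        # Keep a sentence only if it differs (case-insensitively) from the previous one.
        if not kept or sentence.lower() != kept[-1].lower():
            kept.append(sentence)
    return " ".join(kept)


print(sanitize("Take  rest.\n\nTake rest. Drink water."))
```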
42
+ ### 📊 SFT Data Enrichment
43
+ - **Multiple Answer Variants**: 2-3 different answers per question for better reasoning
44
+ - **Multiple Question Variants**: 2-3 different questions per answer for diverse training
45
+ - **Cross Combinations**: All question × answer variant combinations (up to 9 per sample)
46
+ - **Vietnamese Variants**: Translated versions of enriched combinations
47
+ - **Reasoning Enhancement**: Multiple reasoning paths for improved model training
48
+
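The "Cross Combinations" bullet amounts to a Cartesian product of question and answer variants. The sketch below illustrates that idea with the SFT field names listed in this README (`instruction`, `input`, `output`); the function name `cross_combinations` and the sample strings are hypothetical.

```python
from itertools import product
from typing import Dict, List


def cross_combinations(
    instruction: str,
    question_variants: List[str],
    answer_variants: List[str],
) -> List[Dict[str, str]]:
    """Pair every question variant with every answer variant as SFT records."""
    return [
        {"instruction": instruction, "input": question, "output": answer}
        for question, answer in product(question_variants, answer_variants)
    ]


records = cross_combinations(
    "Answer the patient's question.",
    ["q1", "q2", "q3"],
    ["a1", "a2", "a3"],
)
print(len(records))  # 9, i.e. 3 question variants x 3 answer variants
```

With three variants on each side this yields the "up to 9 per sample" upper bound; fewer variants give proportionally fewer records.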
49
+ ### 🔍 Quality Assurance
50
+ - **Invalid Response Detection**: Catches "Fail", "Invalid", "I can't", "Sorry", etc.
51
+ - **Retry Logic**: Up to 3 attempts with different paraphrasing difficulties
52
+ - **Drop Strategy**: Samples dropped if retry fails (no fallback answers)
53
+ - **Consistency Checking**: LLM-based validation of answer quality
54
+ - **De-identification**: PHI removal with configurable strictness
55
+
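The detection, retry, and drop bullets can be read as one small control loop. The following is a sketch under stated assumptions: the marker list comes from the README, but the prefix-matching rule, the easy/hard alternation order, and the names `is_invalid_response` and `paraphrase_with_retry` are hypothetical.

```python
from typing import Callable, Optional

# Marker phrases quoted in the README; matching by prefix is an assumption.
INVALID_MARKERS = ("fail", "invalid", "i can't", "sorry")


def is_invalid_response(text: str) -> bool:
    """Return True when the text is empty or starts with a known marker."""
    lowered = text.strip().lower()
    return not lowered or lowered.startswith(INVALID_MARKERS)


def paraphrase_with_retry(
    text: str,
    paraphrase: Callable[[str, str], str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Try up to max_attempts paraphrases, alternating difficulty.

    `paraphrase` stands in for the model call and takes (text, difficulty).
    Returning None signals that the caller should drop the sample.
    """
    difficulties = ("easy", "hard")
    for attempt in range(max_attempts):
        difficulty = difficulties[attempt % len(difficulties)]
        candidate = paraphrase(text, difficulty)
        if not is_invalid_response(candidate):
            return candidate
    return None  # no fallback answer is substituted
```

Returning `None` instead of the original text mirrors the "no fallback answers" drop strategy: a sample that never produces a valid paraphrase is skipped rather than written with a placeholder.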
56
+ ### 🎯 RAG Optimization
57
+ - **Embedding-Friendly**: Concise, direct text optimized for dense retrieval
58
+ - **Context Generation**: Synthetic context creation when missing
59
+ - **Content Cleaning**: Conversational element removal for medical focus
60
+ - **Length Control**: Hard caps on question/answer/context lengths
61
+ - **Quality Filtering**: Invalid response cleaning for RAG corpora
62
+
63
+ ## 📋 Supported Datasets
64
+
65
+ ### Medical Dialogue
66
+ - **HealthCareMagic**: 100k medical conversations
67
+ - **iCliniq**: 10k derived medical Q&A
68
+
69
+ ### Biomedical QA
70
+ - **PubMedQA-L**: Labeled biomedical questions
71
+ - **PubMedQA-U**: Unlabeled biomedical questions
72
+ - **PubMedQA-MAP**: Mapped biomedical Q&A pairs
73
+
74
+ ## ⚙️ Configuration
75
+
76
+ ### Augmentation Parameters
77
+ ```python
78
+ class AugmentOptions:
79
+     paraphrase_ratio: float = 0.2  # 0.0-1.0
80
+     paraphrase_outputs: bool = True  # Augment model answers
81
+     backtranslate_ratio: float = 0.1  # 0.0-1.0 (Vietnamese pivot)
82
+     style_standardize: bool = True  # Enforce clinical style
83
+     deidentify: bool = True  # Remove PHI
84
+     dedupe: bool = True  # Remove duplicates
85
+     max_chars: int = 5000  # Text length limit
86
+     consistency_check_ratio: float = 0.05  # 0.0-1.0
87
+     expand: bool = True  # Enable enrichment
88
+     max_aug_per_sample: int = 2  # 1-3 variants
89
+ ```
90
+
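The range comments on the parameters (`0.0-1.0`, `1-3 variants`) suggest bounds that could be checked at construction time. A minimal sketch, assuming a dataclass is acceptable; `ValidatedAugmentOptions` is a hypothetical name, covers only the bounded fields, and is not the class used by the Space.

```python
from dataclasses import dataclass


@dataclass
class ValidatedAugmentOptions:
    """Illustrative subset of AugmentOptions with range checks."""

    paraphrase_ratio: float = 0.2
    backtranslate_ratio: float = 0.1
    consistency_check_ratio: float = 0.05
    max_aug_per_sample: int = 2

    def __post_init__(self) -> None:
        # Ratios are documented as fractions between 0.0 and 1.0.
        for name in ("paraphrase_ratio", "backtranslate_ratio", "consistency_check_ratio"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be within 0.0-1.0, got {value}")
        # The README documents 1-3 variants per sample.
        if not 1 <= self.max_aug_per_sample <= 3:
            raise ValueError(
                f"max_aug_per_sample must be within 1-3, got {self.max_aug_per_sample}"
            )


options = ValidatedAugmentOptions(paraphrase_ratio=0.5)
```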
91
+ ### Processing Modes
92
+ - **SFT Processing**: Supervised Fine-Tuning format with enrichment
93
+ - **RAG Processing**: Question-Context-Answer format for retrieval
94
+ - **Vietnamese Mode**: Complete translation of all text fields
95
+
96
+ ## 📈 Output Statistics
97
+
98
+ The system tracks comprehensive statistics:
99
+ - `written`: Successfully processed samples
100
+ - `paraphrased_input/output`: Paraphrasing counts
101
+ - `backtranslated_input/output`: Backtranslation counts
102
+ - `dropped_invalid`: Samples dropped due to failed retries
103
+ - `vietnamese_variants`: Vietnamese variants created
104
+ - `dedup_skipped`: Duplicate samples removed
105
+ - `consistency_failed`: Samples flagged for quality issues
106
+
107
+ ## 🔧 Usage
108
+
109
+ ### Web Interface
110
+ 1. Visit the [HF Space](https://huggingface.co/spaces/MedVietAI/processing)
111
+ 2. Select dataset and processing mode (SFT/RAG)
112
+ 3. Enable Vietnamese translation if needed
113
+ 4. Click process button
114
+
115
+ ### API Usage
116
+ ```bash
117
+ # SFT Processing with Vietnamese translation
118
+ curl -X POST "https://huggingface.co/spaces/MedVietAI/processing/process/healthcaremagic" \
119
+   -H "Content-Type: application/json" \
120
+   -d '{
121
+     "augment": {
122
+       "paraphrase_ratio": 0.2,
123
+       "backtranslate_ratio": 0.1,
124
+       "paraphrase_outputs": true,
125
+       "style_standardize": true,
126
+       "deidentify": true,
127
+       "dedupe": true,
128
+       "expand": true
129
+     },
130
+     "vietnamese_translation": true
131
+   }'
132
+
133
+ # RAG Processing
134
+ curl -X POST "https://huggingface.co/spaces/MedVietAI/processing/rag/healthcaremagic" \
135
+   -H "Content-Type: application/json" \
136
+   -d '{
137
+     "vietnamese_translation": true
138
+   }'
139
+ ```
140
+
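For callers who prefer Python over curl, the same requests can be built with the standard library. This is a sketch that only constructs the request object and does not send it; the base URL and endpoint paths are copied from the curl examples, while `build_request` is a hypothetical helper.

```python
import json
import urllib.request

# Base URL copied from the curl examples; reachability is not verified here.
BASE_URL = "https://huggingface.co/spaces/MedVietAI/processing"


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the given endpoint path."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(
    "process/healthcaremagic",
    {
        "augment": {"paraphrase_ratio": 0.2, "backtranslate_ratio": 0.1},
        "vietnamese_translation": True,
    },
)
# To actually send it: urllib.request.urlopen(request)
```

Spaces are often served from their own `hf.space` subdomain, so `BASE_URL` may need adjusting if the address from the curl examples does not respond.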
141
+ ## 📚 Documentation
142
+
143
+ - [Request Documentation](https://huggingface.co/spaces/MedVietAI/processing/blob/main/REQUEST.md)
144
+ - [Data Processing Guide](https://huggingface.co/spaces/MedVietAI/processing/blob/main/DATA_PROCESSING.md)
145
+
146
+ ## 📄 License
147
 
 
148
  [Apache-2.0 LICENSE](https://huggingface.co/spaces/MedVietAI/processing/blob/main/LICENSE.txt)
149