# 📊 MedAI Data Processing Techniques

This document outlines the data processing techniques implemented in the MedAI Processing project, which augments medical datasets and converts them into a centralized format for LLM fine-tuning.

## 🎯 Project Overview

The MedAI Processing system transforms raw medical datasets into a **centralized fine-tuning format** (JSONL + CSV) with comprehensive data augmentation capabilities. The system processes multiple medical dataset types and applies various enhancement techniques to improve data quality and diversity.

## 🏗️ System Architecture

### Core Components
- **FastAPI Web Service**: RESTful API for dataset processing
- **Multi-LLM Rotator**: NVIDIA API + Google Gemini integration
- **Centralized Writer**: Parallel JSONL + CSV output generation
- **Google Drive Integration**: Automated artifact storage
- **Progress Monitoring**: Real-time job status tracking

### Supported Datasets
1. **HealthCareMagic** (100k medical dialogues)
2. **iCliniq** (10k medical consultations)
3. **PubMedQA-Labelled** (biomedical Q&A with answers)
4. **PubMedQA-Unlabelled** (biomedical Q&A without answers)
5. **PubMedQA-Map** (biomedical Q&A mapping format)

## 🔧 Data Processing Pipeline

### 1. Data Ingestion & Download
- **Hugging Face Hub Integration**: Automatic dataset downloading
- **Format Detection**: JSON/JSONL auto-detection and parsing
- **Caching System**: Local storage with symlink optimization

### 2. Data Cleaning & Preprocessing

#### Text Normalization
- **Unicode Fixing**: `ftfy` library for text encoding issues
- **Whitespace Standardization**: Consistent spacing and line breaks
- **Quote Canonicalization**: Standard quote character conversion
- **Terminal Punctuation**: Ensures proper sentence endings
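
The normalization steps above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `normalize_text` is a hypothetical name, and the standard library's `unicodedata.normalize` stands in for `ftfy`, which additionally repairs mis-decoded text (mojibake) that NFKC does not touch.

```python
import re
import unicodedata

# Map typographic quotes to their ASCII equivalents.
_QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def normalize_text(text: str) -> str:
    """Hypothetical helper: normalize Unicode, quotes, whitespace, and ending."""
    # NFKC folds compatibility characters (stand-in for ftfy in this sketch).
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(_QUOTES)
    # Collapse runs of spaces and tabs, trim spaces around line breaks,
    # and allow at most one blank line in a row.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    # Append a period when the text lacks terminal punctuation.
    if text and not text.endswith((".", "!", "?")):
        text += "."
    return text
```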

#### Content Sanitization
- **Length Capping**: Configurable maximum character limits (default: 5000)
- **Language Detection**: English language validation using `langid`
- **Content Truncation**: Smart sentence boundary cutting for long texts
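
As an illustration of length capping with sentence-boundary truncation, the sketch below cuts at the last sentence end inside the limit and falls back to a hard cut when no boundary is found in the second half of the window. The function name and the "second half" rule are assumptions made for this example; language detection with `langid` is left out.

```python
def truncate_at_sentence(text: str, max_chars: int = 5000) -> str:
    """Hypothetical helper: cap length, preferring to cut at a sentence end."""
    if len(text) <= max_chars:
        return text
    window = text[:max_chars]
    # Find the last sentence-ending punctuation mark followed by a space.
    cut = max(window.rfind(". "), window.rfind("! "), window.rfind("? "))
    # Fall back to a hard cut if no boundary lies in the second half.
    if cut < max_chars // 2:
        return window.rstrip()
    return window[: cut + 1]
```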

### 3. Data Augmentation Techniques

#### LLM-Based Paraphrasing
- **Multi-Model Rotation**: NVIDIA API (primary) + Gemini (fallback)
- **Difficulty Levels**: Easy vs. Hard paraphrasing modes
- **Medical Context Preservation**: Maintains clinical terminology accuracy
- **Configurable Ratios**: User-defined augmentation percentages (0.0-1.0)
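
The rotation and ratio logic might look something like the sketch below. It assumes each LLM backend is wrapped as a plain callable that takes a prompt and returns text; the function names and the prompt wording are hypothetical, and no real NVIDIA or Gemini client calls are shown.

```python
import random
from typing import Callable, Optional, Sequence

# A backend is any callable that takes a prompt and returns text.
Backend = Callable[[str], str]


def paraphrase_with_fallback(text: str, backends: Sequence[Backend]) -> Optional[str]:
    """Try each backend in order; return the first non-empty result."""
    prompt = f"Paraphrase, preserving all medical terms:\n{text}"
    for backend in backends:
        try:
            result = backend(prompt).strip()
        except Exception:
            continue  # Treat any backend error as a cue to try the next one.
        if result:
            return result
    return None


def maybe_paraphrase(text: str, ratio: float, rng: random.Random,
                     backends: Sequence[Backend]) -> str:
    """Paraphrase with probability `ratio`; otherwise keep the original."""
    if rng.random() >= ratio:
        return text
    return paraphrase_with_fallback(text, backends) or text
```

Passing a seeded `random.Random` keeps the choice of which samples get paraphrased reproducible across runs.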

#### Back-Translation Augmentation
- **Pivot Language**: German as the intermediate language
- **Meaning Preservation**: Maintains semantic accuracy through translation cycles
- **Fallback Mechanisms**: Automatic retry with alternative models
- **Quality Control**: Length and content validation
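
A back-translation round trip with basic quality control could be sketched as below. The `Translator` callable signature and the 0.5 to 2.0 length window are assumptions for illustration; the project's actual thresholds and translation calls are not documented here.

```python
from typing import Callable, Optional

# Assumed signature: (text, source_lang, target_lang) -> translated text.
Translator = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translator, pivot: str = "de") -> Optional[str]:
    """English -> pivot -> English; return None if the result looks unusable."""
    intermediate = translate(text, "en", pivot)
    restored = translate(intermediate, pivot, "en").strip()
    # Reject empty or unchanged round trips.
    if not restored or restored == text.strip():
        return None
    # Reject results that are drastically shorter or longer than the input.
    if not 0.5 <= len(restored) / max(len(text), 1) <= 2.0:
        return None
    return restored
```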

#### Style Standardization
- **Clinical Voice Enforcement**: Neutral, professional medical tone
- **Absolute Language Removal**: Replaces guarantees with probabilistic language
- **Forum Sign-off Removal**: Eliminates informal communication patterns
- **Consistent Punctuation**: Standardized sentence structure

### 4. Data Quality Assurance

#### De-identification (PHI Removal)
- **Email Redaction**: `[REDACTED_EMAIL]` placeholder
- **Phone Number Masking**: `[REDACTED_PHONE]` placeholder
- **URL/IP Address Removal**: `[REDACTED_URL]` and `[REDACTED_IP]` placeholders
- **Configurable Privacy**: Optional PHI removal per dataset
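
A regex-based redaction pass of this kind might look like the following sketch. The patterns are illustrative assumptions, not the project's actual expressions; they are deliberately simple and can over-match (for example, a date followed by a time may be treated as a phone number).

```python
import re

# Order matters: emails, URLs, and IPs are redacted before the looser phone pattern.
_PATTERNS = [
    ("[REDACTED_EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("[REDACTED_URL]", re.compile(r"\bhttps?://\S+|\bwww\.\S+", re.IGNORECASE)),
    ("[REDACTED_IP]", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("[REDACTED_PHONE]", re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")),
]


def deidentify(text: str) -> str:
    """Hypothetical helper: replace each match with its placeholder."""
    for placeholder, pattern in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```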

#### Deduplication
- **Fingerprinting Algorithm**: MD5-based content hashing
- **Multi-Field Matching**: Instruction + Input + Output combination
- **Normalized Comparison**: Case-insensitive, whitespace-normalized matching
- **Performance Optimized**: In-memory set-based deduplication
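
The fingerprinting approach described above can be sketched as follows. The field names `instruction`, `input`, and `output` follow the list above; the function names and the separator character are assumptions for this example.

```python
import hashlib
import re


def fingerprint(instruction: str, user_input: str, output: str) -> str:
    """MD5 of the three fields after lowercasing and whitespace collapsing."""
    parts = (instruction, user_input, output)
    normalized = [re.sub(r"\s+", " ", p).strip().lower() for p in parts]
    # A separator unlikely to occur in text keeps field boundaries distinct.
    joined = "\u241f".join(normalized)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def dedupe(samples):
    """Yield each sample whose fingerprint has not been seen before."""
    seen = set()
    for sample in samples:
        fp = fingerprint(sample["instruction"], sample["input"], sample["output"])
        if fp not in seen:
            seen.add(fp)
            yield sample
```

Storing only the 32-character hex digests keeps the in-memory set small relative to the full text of each sample.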

#### Consistency Validation
- **LLM-Based QA Check**: Automated answer validation against context
- **Configurable Sampling**: Ratio-based consistency checking (e.g., 0.01)
- **Medical Safety Validation**: Ensures clinical accuracy and safety
- **Failure Tagging**: Marks samples with consistency issues
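
Ratio-based checking with failure tagging might be structured like this sketch. The `judge` callable stands in for the LLM-based check, and the `consistency_fail` tag name is hypothetical; the project's actual tag may differ. Note that the function mutates the sample dictionaries in place.

```python
import random
from typing import Callable, Iterable


def check_consistency(samples: Iterable[dict], ratio: float,
                      judge: Callable[[dict], bool], seed: int = 42) -> int:
    """Run `judge` on roughly `ratio` of samples and tag those that fail."""
    rng = random.Random(seed)
    failures = 0
    for sample in samples:
        # Only a sampled subset is sent to the (expensive) judge.
        if rng.random() < ratio and not judge(sample):
            # Hypothetical tag name used for illustration.
            sample.setdefault("tags", []).append("consistency_fail")
            failures += 1
    return failures
```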

### 5. Advanced Augmentation Features

#### Knowledge Distillation
- **Pseudo-Label Generation**: Creates labels for unlabeled data
- **Fractional Processing**: Configurable percentage for distillation
- **Single-Prompt Approach**: Efficient single LLM call per sample
- **Length Control**: Maintains reasonable output lengths

#### Multi-Variant Generation
- **Configurable Counts**: 1-3 augmented variants per sample
- **Tagged Augmentations**: Tracks applied augmentation techniques
- **Original Preservation**: Always maintains base sample
- **Randomized IDs**: Unique identifiers for augmented variants
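
One way to picture multi-variant generation is the sketch below: the original sample is always kept, and each variant receives a tag and a randomized ID suffix. The `augmenters` mapping (tag name to text-transforming callable), the `id` and `tags` fields, and the ID format are assumptions for illustration.

```python
import uuid
from typing import Callable, Dict, List


def expand_sample(sample: dict, augmenters: Dict[str, Callable[[str], str]],
                  max_aug_per_sample: int = 2) -> List[dict]:
    """Return the original sample plus up to `max_aug_per_sample` variants."""
    results = [sample]  # The base sample is always kept.
    for tag, augment in list(augmenters.items())[:max_aug_per_sample]:
        variant = dict(sample)
        variant["input"] = augment(sample["input"])
        # A random suffix gives each variant its own identifier.
        variant["id"] = f"{sample['id']}-{uuid.uuid4().hex[:8]}"
        # Record which technique produced this variant.
        variant["tags"] = [*sample.get("tags", []), tag]
        results.append(variant)
    return results
```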

### 6. Output Generation & Storage

#### Centralized Format
- **SFT Schema**: Standardized Supervised Fine-Tuning format
- **Metadata Preservation**: Source, task type, and augmentation tags
- **Dual Output**: Simultaneous JSONL and CSV generation
- **Memory-Safe Streaming**: Handles large datasets efficiently
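
Dual output with streaming can be sketched using only the standard library. The column list in `FIELDS` is an assumption, since the exact SFT schema is not spelled out here, and `json` is used in place of `orjson` to keep the example dependency-free.

```python
import csv
import json
from typing import Iterable

# Assumed CSV columns; extra keys (e.g. tags) stay in the JSONL only.
FIELDS = ["instruction", "input", "output", "source"]


def write_dual(samples: Iterable[dict], jsonl_path: str, csv_path: str) -> int:
    """Stream samples to JSONL and CSV at once; return the row count."""
    written = 0
    with open(jsonl_path, "w", encoding="utf-8") as jf, \
         open(csv_path, "w", encoding="utf-8", newline="") as cf:
        writer = csv.DictWriter(cf, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        for sample in samples:  # One row at a time, so memory use stays flat.
            jf.write(json.dumps(sample, ensure_ascii=False) + "\n")
            writer.writerow(sample)
            written += 1
    return written
```

Because `samples` can be any iterable, including a generator, the full dataset never needs to be held in memory.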

#### Storage Integration
- **Local Caching**: `cache/outputs/` directory storage
- **Google Drive Upload**: Automated cloud storage integration
- **Timestamped Naming**: Unique file identification
- **MIME Type Handling**: Proper content type specification

## ⚙️ Configuration Options

### Augmentation Parameters
```python
from typing import Optional

from pydantic import BaseModel  # Assumption: request models are Pydantic (see "Request Validation")


class AugmentOptions(BaseModel):
    paraphrase_ratio: float = 0.0          # Fraction of samples to paraphrase (0.0-1.0)
    paraphrase_outputs: bool = False       # Also augment model answers
    backtranslate_ratio: float = 0.0       # Fraction of samples to back-translate (0.0-1.0)
    style_standardize: bool = True         # Enforce clinical style
    deidentify: bool = True                # Remove PHI
    dedupe: bool = True                    # Remove duplicates
    max_chars: int = 5000                  # Text length limit
    consistency_check_ratio: float = 0.0   # Fraction of samples to validate (0.0-1.0)
    distill_fraction: float = 0.0          # Fraction of unlabeled samples to distill (0.0-1.0)
    expand: bool = True                    # Enable augmentation
    max_aug_per_sample: int = 2            # Augmented variants per sample (1-3)
```

### Processing Parameters
```python
class ProcessParams(BaseModel):
    augment: AugmentOptions                # Augmentation settings
    sample_limit: Optional[int] = None     # Dataset sampling
    seed: int = 42                         # Reproducibility
```

## 📈 Performance & Monitoring

### Progress Tracking
- **Real-time Updates**: Live progress percentage and status messages
- **Background Processing**: Non-blocking job execution
- **State Management**: Thread-safe status tracking
- **Error Handling**: Comprehensive exception logging

### Resource Management
- **API Key Rotation**: Automatic fallback between multiple API keys
- **Rate Limiting**: Configurable request throttling
- **Memory Optimization**: Streaming processing for large datasets
- **Concurrent Processing**: Background task execution

## 🔒 Security & Privacy

### Data Protection
- **PHI Removal**: Automatic sensitive information redaction
- **Secure Storage**: Google Drive integration with OAuth2
- **Access Control**: Environment-based API key management
- **Audit Logging**: Comprehensive processing logs

### API Security
- **OAuth2 Integration**: Google Drive authentication
- **Token Management**: Secure credential handling
- **Request Validation**: Pydantic model validation
- **Error Sanitization**: Safe error message handling

## 🚀 Usage Examples

### Basic Processing
```bash
# Process HealthCareMagic with 10% paraphrasing and all other settings at their defaults
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"augment": {"paraphrase_ratio": 0.1}}' \
  https://binkhoale1812-medai-processing.hf.space/process/healthcaremagic
```

### Advanced Augmentation
```bash
# Process with comprehensive augmentation
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "augment": {
      "paraphrase_ratio": 0.2,
      "backtranslate_ratio": 0.1,
      "paraphrase_outputs": true,
      "style_standardize": true,
      "deidentify": true,
      "dedupe": true,
      "max_chars": 5000,
      "consistency_check_ratio": 0.01,
      "max_aug_per_sample": 3
    },
    "sample_limit": 1000,
    "seed": 42
  }' \
  https://binkhoale1812-medai-processing.hf.space/process/icliniq
```
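
For callers who prefer Python over `curl`, the same request can be assembled with the standard library. This is a sketch under the assumption that the endpoint accepts the JSON body shown in the examples above; `build_request` is a hypothetical helper, and the request is only built here, not sent.

```python
import json
import urllib.request

BASE_URL = "https://binkhoale1812-medai-processing.hf.space"


def build_request(dataset: str, params: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for a dataset job."""
    return urllib.request.Request(
        url=f"{BASE_URL}/process/{dataset}",
        data=json.dumps(params).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request(
    "icliniq",
    {"augment": {"paraphrase_ratio": 0.2}, "sample_limit": 1000, "seed": 42},
)
# To submit the job: urllib.request.urlopen(req)
```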

## 📊 Output Statistics

### Processing Metrics
- **Written Rows**: Total processed samples
- **Paraphrased Inputs**: Count of augmented user inputs
- **Paraphrased Outputs**: Count of augmented model responses
- **Back-translated**: Count of translation-augmented samples
- **Deduplication**: Count of skipped duplicate samples
- **Consistency Failures**: Count of validation failures

### File Outputs
- **JSONL Format**: Structured fine-tuning data with metadata
- **CSV Format**: Simplified tabular representation
- **Google Drive**: Cloud storage with automatic upload
- **Local Cache**: Persistent local storage

## 🔮 Future Enhancements

### Planned Features
- **Additional Dataset Support**: More medical dataset types
- **Advanced Augmentation**: More sophisticated LLM techniques
- **Quality Metrics**: Automated data quality scoring
- **Batch Processing**: Multiple dataset concurrent processing
- **Custom Schemas**: User-defined output formats

### Scalability Improvements
- **Distributed Processing**: Multi-node processing support
- **Streaming Augmentation**: Real-time data enhancement
- **Caching Optimization**: Improved performance and cost efficiency
- **API Rate Limiting**: Better resource management

## 📚 Technical Dependencies

### Core Libraries
- **FastAPI**: Web framework for API development
- **Hugging Face Hub**: Dataset downloading and management
- **Google GenAI**: Gemini model integration
- **ftfy**: Text encoding and normalization
- **langid**: Language detection
- **orjson**: High-performance JSON processing

### External Services
- **NVIDIA API**: Primary LLM service for paraphrasing
- **Google Gemini**: Fallback LLM service
- **Google Drive**: Cloud storage integration
- **Hugging Face Spaces**: Deployment platform

---

*This document provides a comprehensive overview of all data processing techniques implemented in the MedAI Processing project. For specific implementation details, refer to the individual module files in the `utils/` directory.*