MedAI_Processing / docs /REQUEST.md
LiamKhoaLe's picture
Upd local setups with dynamic mode setter
a89888b

📑 MedAI Processing – Request Examples

Base URL of the Space:
https://binkhoale1812-medai-processing.hf.space

This Space processes medical datasets into a centralised fine-tuning format (JSONL + CSV) with optional augmentations such as paraphrasing, back-translation, style standardisation, de-identification, and deduplication.


🔹 1. Process HealthCareMagic

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "augment": {
          "paraphrase_ratio": 0.1,
          "backtranslate_ratio": 0.05,
          "paraphrase_outputs": false,
          "style_standardize": true,
          "deidentify": true,
          "dedupe": true,
          "max_chars": 5000
        },
        "sample_limit": 2000,
        "seed": 42
      }' \
  https://binkhoale1812-medai-processing.hf.space/process/healthcaremagic

🔹 2. Process iCliniq

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "augment": {
          "paraphrase_ratio": 0.2,
          "backtranslate_ratio": 0.1,
          "paraphrase_outputs": true,
          "style_standardize": true,
          "deidentify": true,
          "dedupe": true,
          "max_chars": 5000
        },
        "sample_limit": 1500,
        "seed": 123
      }' \
  https://binkhoale1812-medai-processing.hf.space/process/icliniq

🔹 3. Process PubMedQA (Labelled)

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "augment": {
          "paraphrase_ratio": 0.05,
          "backtranslate_ratio": 0.02,
          "paraphrase_outputs": false,
          "style_standardize": true,
          "deidentify": false,
          "dedupe": true,
          "max_chars": 8000
        },
        "sample_limit": 1000,
        "seed": 99
      }' \
  https://binkhoale1812-medai-processing.hf.space/process/pubmedqa_l

🔹 4. Process PubMedQA (Unlabelled)

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "augment": {
          "paraphrase_ratio": 0.05,
          "backtranslate_ratio": 0.05,
          "paraphrase_outputs": false,
          "style_standardize": true,
          "deidentify": true,
          "dedupe": true,
          "max_chars": 7000,
          "consistency_check_ratio": 0.01,
          "distill_fraction": 0.1
        },
        "sample_limit": 500,
        "seed": 7
      }' \
  https://binkhoale1812-medai-processing.hf.space/process/pubmedqa_u

🔹 5. Process PubMedQA (Map)

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "augment": {
          "paraphrase_ratio": 0.1,
          "backtranslate_ratio": 0.05,
          "paraphrase_outputs": true,
          "style_standardize": true,
          "deidentify": true,
          "dedupe": true,
          "max_chars": 6000
        },
        "sample_limit": 1200,
        "seed": 2024
      }' \
  https://binkhoale1812-medai-processing.hf.space/process/pubmedqa_map

🔹 6. Check Current Job Status

curl https://binkhoale1812-medai-processing.hf.space/status

🔹 7. List Generated Artifacts

curl https://binkhoale1812-medai-processing.hf.space/files

✅ Notes

  • Each run outputs both .jsonl and .csv in cache/outputs/ and also uploads them to Google Drive folder ID: 1JvW7its63E58fLxurH8ZdhxzdpcMrMbt

  • augment options can be adjusted per dataset:

    • paraphrase_ratio – % of rows paraphrased (0–1)
    • backtranslate_ratio – % of rows back-translated
    • paraphrase_outputs – whether to also augment model answers
    • style_standardize – enforce neutral, clinical style
    • deidentify – redact PHI (emails, phones, URLs, IPs)
    • dedupe – skip duplicate pairs
    • consistency_check_ratio – run lightweight QA sanity check
    • distill_fraction – generate pseudo-labels for unlabelled data