fix(hf): avoid eager 27GB dataset download on /api/cases (#46)
* docs(bug): add BUG-014 api/cases 500 error investigation
- GET /api/cases returns 500 Internal Server Error
- Health check passes but cases endpoint fails
- Frontend stuck on 'Loading cases...'
- Root cause TBD - need HF Space logs to confirm
- Dataset is PUBLIC (no auth issue)
- datasets library is transitive dep (not missing)
* docs(bug): update BUG-014 with confirmed root cause analysis
Root cause: datasets.load_dataset() downloads entire 27GB dataset
on every /api/cases call. This times out on HF Spaces.
- Verified call chain: routes.py → data/__init__.py → loader.py:224
- Dataset: 149 parquet shards × 184MB = 27.41GB
- HF Space logs won't help - request times out at proxy before logging
- Recommended fix: small demo dataset with 5-10 cases
* docs(bug): BUG-014 final fix - metadata-based case list, streaming for single case
Root cause: datasets.load_dataset() without streaming=True downloads
all 27GB before returning. This times out on HF Spaces.
Fix:
1. list_case_ids() - use load_dataset_builder().info (1s, no download)
2. get_case() - use streaming + early termination (~30-120s per case)
3. Wire hf_cache_dir for persistent caching
Previous audit dismissed this as 'negligible latency' - that was wrong.
The latency is from downloading 27GB, not from '149 cases'.
* fix(hf): avoid eager 27GB dataset download on /api/cases
Root cause: `datasets.load_dataset(dataset_id, split="train")` downloads
the entire 27GB ISLES24 dataset on cold start, causing HF Spaces proxy
timeouts on `/api/cases`.
Fix:
- Add pinned manifest of 149 case IDs (isles24_manifest.py)
- Add Isles24HuggingFaceDataset class that:
- Returns case IDs from manifest (no download)
- Loads single Parquet shard via data_files= for get_case()
- Route ISLES24 dataset loads to the optimized class
Result:
- `/api/cases` returns instantly (no dataset download)
- `get_case()` downloads ~180MB (one shard) instead of 27GB
Closes BUG-014
@@ -0,0 +1,103 @@
# BUG-014: /api/cases Endpoint Timeout

**Status**: ROOT CAUSE CONFIRMED, FIX IDENTIFIED
**Severity**: P1 (Frontend completely broken)
**Discovered**: 2025-12-15
**Reporter**: Manual testing

## Summary

The `/api/cases` endpoint failed on HF Spaces because it triggered an eager ~27GB
dataset download/prepare step just to return a list of case IDs.

The fix keeps the *full* dataset, but changes the data access pattern so:
- `/api/cases` does **zero** dataset downloading
- `get_case(case_id)` downloads **one** Parquet shard (one case), not the full dataset

## Root Cause

**Current code path:**
```text
GET /api/cases
  → routes.py:57 list_case_ids()
  → data/__init__.py:43 load_isles_dataset()
  → loader.py:224 ds = load_dataset(...)  ← DOWNLOADS 27GB EAGERLY
```

**The bug:** `datasets.load_dataset(dataset_id, split="train")` downloads/prepares the full
dataset on cold start. On HF Spaces this regularly exceeds the proxy timeout window, so
the frontend never receives a usable case list.

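For illustration, a simplified sketch of the problematic pattern (the real call lives at `loader.py:224`; the dataset ID shown is the project default):

```python
# Sketch of the eager code path: with only a split selected, load_dataset()
# downloads and prepares every shard before returning a single row.
from datasets import load_dataset

ds = load_dataset("hugging-science/isles24-stroke", split="train")  # ~27GB before this returns
case_ids = ds["subject_id"]  # the only data /api/cases actually needs
```
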
## Previous Misdiagnosis

This was identified in `ARCHITECTURE-AUDIT-2024-12-13.md` as P2-006 and dismissed in `REMAINING-ISSUES-2024-12-13.md`:

> "Dataset reload per request | ACCEPTABLE | Demo scale (149 cases), adds negligible latency"

**This assessment was wrong.** The latency isn't from "149 cases"; it's from an eager
27GB download/prepare step happening in the request path.

## Verified Facts

| Fact | Value | Verification |
|------|-------|--------------|
| Total download | 27.41 GB | `load_dataset_builder().info.download_size` |
| Number of cases | 149 | `info.splits['train'].num_examples` |
| Shards | 149 parquet files | HF Hub API |
| Shard shape | 1 case per parquet file | `load_dataset(..., data_files={...})` returns 1 row |
| Case ID range | `sub-stroke0001` … `sub-stroke0189` (with gaps) | subject_id per shard |

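These numbers can be reproduced without downloading the data; a minimal metadata-only sketch (matching the verification column above):

```python
# Sketch: inspect dataset metadata only; no shard download is triggered.
from datasets import load_dataset_builder

builder = load_dataset_builder("hugging-science/isles24-stroke")
info = builder.info

print(f"download size: {info.download_size / 1e9:.2f} GB")    # ~27.41 GB
print(f"train cases:   {info.splits['train'].num_examples}")  # 149
```
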
## The Fix

### Part 1: Instant Case List (No Download)

**Problem:** `list_case_ids()` was implemented by loading the dataset, which on HF Spaces
meant triggering the full 27GB download/prepare.

**Solution:** Use a pinned manifest of case IDs for the ISLES24 dataset.

Implemented at `src/stroke_deepisles_demo/data/isles24_manifest.py` (pinned to dataset revision).

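The committed manifest is a static file; one way it could be (re)generated is to read only the `subject_id` column of each shard instead of downloading the full dataset. A sketch using `HfFileSystem` + `pyarrow` (the actual generator script is not part of this change):

```python
# Sketch: rebuild the case-ID manifest by reading one column per Parquet shard.
import pyarrow.parquet as pq
from huggingface_hub import HfFileSystem

REPO = "hugging-science/isles24-stroke"
REVISION = "9707a7fca6d3dd1a690de010ec4aed06bdcd0417"
NUM_FILES = 149

fs = HfFileSystem()
case_ids: list[str] = []
for idx in range(NUM_FILES):
    shard = f"datasets/{REPO}@{REVISION}/data/train-{idx:05d}-of-{NUM_FILES:05d}.parquet"
    with fs.open(shard, "rb") as f:
        # Reads the Parquet footer and the subject_id column chunks, not the imaging data.
        table = pq.read_table(f, columns=["subject_id"])
    case_ids.append(table.column("subject_id")[0].as_py())  # one case per shard

print(len(case_ids), case_ids[:3])
```
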
### Part 2: Per-Case Loading by Shard (No Full Download)

**Problem:** `get_case(case_id)` previously loaded the whole dataset, even when only one
case was needed for inference.

**Solution:** For a single case, load exactly one Parquet shard using `data_files=...`,
then materialize DWI/ADC (and optional lesion mask) to temp files.

Implemented in `src/stroke_deepisles_demo/data/loader.py` as `Isles24HuggingFaceDataset`.

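Typical usage from a background job then looks roughly like this (a sketch; `run_inference` is a placeholder for the actual job entrypoint):

```python
# Sketch: per-case access through the optimized class.
from stroke_deepisles_demo.data.loader import Isles24HuggingFaceDataset

with Isles24HuggingFaceDataset() as ds:
    print(len(ds.list_case_ids()))         # 149 IDs straight from the pinned manifest, no download
    case = ds.get_case("sub-stroke0001")   # fetches a single ~180MB shard, not 27GB
    run_inference(case["dwi"], case["adc"])  # temp NIfTI paths; placeholder call
# temp files are removed when the context manager exits
```
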
### Why This Works on HF Spaces

- `/api/cases` becomes a pure metadata response (fast, reliable).
- Per-case data download happens in the background job (fits the async job model).
- No streaming iteration over the full dataset is required.

## Files to Change

| File | Change |
|------|--------|
| `src/stroke_deepisles_demo/data/isles24_manifest.py` | Add pinned case ID manifest + shard mapping |
| `src/stroke_deepisles_demo/data/loader.py` | Add `Isles24HuggingFaceDataset` + route ISLES24 loads to it |

## Implementation Steps

1. Add pinned ISLES24 case ID manifest (no download on `/api/cases`)
2. Load single Parquet shard via `data_files=...` for `get_case(case_id)`
3. Verify `/api/cases` returns immediately on HF Spaces
4. Verify segmentation job downloads only selected case data

## Verification After Fix

```bash
# After deploying fix, from cold start:
time curl -s https://vibecodermcswaggins-stroke-deepisles-demo.hf.space/api/cases
# Should complete quickly with {"cases": ["sub-stroke0001", ...]}
```

## Notes

- The full 27GB dataset IS supported - we're not reducing it
- Case loading happens in background jobs, not blocking the API gateway timeout window
- The core issue was doing full-dataset work inside `/api/cases`
@@ -8,6 +8,11 @@
 
 ## Problem Statement
 
+> **Update (2025-12-15):** HF Spaces requires avoiding eager full-dataset downloads.
+> The implementation now uses a pinned ISLES24 manifest (`src/stroke_deepisles_demo/data/isles24_manifest.py`)
+> for `/api/cases`, and loads a single case via `datasets.load_dataset(..., data_files=...)` (one Parquet shard)
+> instead of `load_dataset(..., split="train")`.
+
 `stroke-deepisles-demo` has a hand-rolled data loading workaround that:
 
 1. **Bypasses `datasets.load_dataset()`** - Uses `HfFileSystem + pyarrow` directly
@@ -0,0 +1,189 @@
"""ISLES'24 Stroke dataset manifest for Hugging Face loading.
|
| 2 |
+
|
| 3 |
+
This project targets the Hugging Face dataset repo `hugging-science/isles24-stroke`.
|
| 4 |
+
On HF Spaces, calling `datasets.load_dataset(dataset_id, split="train")` can trigger
|
| 5 |
+
an eager download/prepare step for ~27GB of Parquet shards, which is not viable for
|
| 6 |
+
fast API endpoints like `/api/cases`.
|
| 7 |
+
|
| 8 |
+
The upstream dataset stores one case per Parquet file at:
|
| 9 |
+
`data/train-00000-of-00149.parquet` ... `data/train-00148-of-00149.parquet`
|
| 10 |
+
|
| 11 |
+
This manifest provides:
|
| 12 |
+
- The authoritative list of available case IDs (subject_ids) for the train split.
|
| 13 |
+
- A stable mapping from case ID → Parquet shard index (and thus data file path).
|
| 14 |
+
|
| 15 |
+
SSOT:
|
| 16 |
+
- Generated from the dataset at the pinned revision below by reading `subject_id`
|
| 17 |
+
from each Parquet file (without downloading the full dataset).
|
| 18 |
+
"""
|
| 19 |
+
|
| 20 |
+
from __future__ import annotations
|
| 21 |
+
|
| 22 |
+
ISLES24_DATASET_ID = "hugging-science/isles24-stroke"
|
| 23 |
+
# Pinned to the dataset revision used to generate this manifest.
|
| 24 |
+
ISLES24_DATASET_REVISION = "9707a7fca6d3dd1a690de010ec4aed06bdcd0417"
|
| 25 |
+
|
| 26 |
+
ISLES24_TRAIN_NUM_FILES = 149
|
| 27 |
+
|
| 28 |
+
# Case IDs in the same order as the Parquet shard filenames (train-00000..., train-00001..., ...).
|
| 29 |
+
ISLES24_TRAIN_CASE_IDS: tuple[str, ...] = (
|
| 30 |
+
"sub-stroke0001",
|
| 31 |
+
"sub-stroke0002",
|
| 32 |
+
"sub-stroke0003",
|
| 33 |
+
"sub-stroke0004",
|
| 34 |
+
"sub-stroke0005",
|
| 35 |
+
"sub-stroke0006",
|
| 36 |
+
"sub-stroke0007",
|
| 37 |
+
"sub-stroke0008",
|
| 38 |
+
"sub-stroke0009",
|
| 39 |
+
"sub-stroke0010",
|
| 40 |
+
"sub-stroke0011",
|
| 41 |
+
"sub-stroke0012",
|
| 42 |
+
"sub-stroke0013",
|
| 43 |
+
"sub-stroke0014",
|
| 44 |
+
"sub-stroke0015",
|
| 45 |
+
"sub-stroke0016",
|
| 46 |
+
"sub-stroke0017",
|
| 47 |
+
"sub-stroke0019",
|
| 48 |
+
"sub-stroke0020",
|
| 49 |
+
"sub-stroke0021",
|
| 50 |
+
"sub-stroke0022",
|
| 51 |
+
"sub-stroke0025",
|
| 52 |
+
"sub-stroke0026",
|
| 53 |
+
"sub-stroke0027",
|
| 54 |
+
"sub-stroke0028",
|
| 55 |
+
"sub-stroke0030",
|
| 56 |
+
"sub-stroke0033",
|
| 57 |
+
"sub-stroke0036",
|
| 58 |
+
"sub-stroke0037",
|
| 59 |
+
"sub-stroke0038",
|
| 60 |
+
"sub-stroke0040",
|
| 61 |
+
"sub-stroke0043",
|
| 62 |
+
"sub-stroke0045",
|
| 63 |
+
"sub-stroke0047",
|
| 64 |
+
"sub-stroke0048",
|
| 65 |
+
"sub-stroke0049",
|
| 66 |
+
"sub-stroke0052",
|
| 67 |
+
"sub-stroke0053",
|
| 68 |
+
"sub-stroke0054",
|
| 69 |
+
"sub-stroke0055",
|
| 70 |
+
"sub-stroke0057",
|
| 71 |
+
"sub-stroke0062",
|
| 72 |
+
"sub-stroke0066",
|
| 73 |
+
"sub-stroke0068",
|
| 74 |
+
"sub-stroke0070",
|
| 75 |
+
"sub-stroke0071",
|
| 76 |
+
"sub-stroke0073",
|
| 77 |
+
"sub-stroke0074",
|
| 78 |
+
"sub-stroke0075",
|
| 79 |
+
"sub-stroke0076",
|
| 80 |
+
"sub-stroke0077",
|
| 81 |
+
"sub-stroke0078",
|
| 82 |
+
"sub-stroke0079",
|
| 83 |
+
"sub-stroke0080",
|
| 84 |
+
"sub-stroke0081",
|
| 85 |
+
"sub-stroke0082",
|
| 86 |
+
"sub-stroke0083",
|
| 87 |
+
"sub-stroke0084",
|
| 88 |
+
"sub-stroke0085",
|
| 89 |
+
"sub-stroke0086",
|
| 90 |
+
"sub-stroke0087",
|
| 91 |
+
"sub-stroke0088",
|
| 92 |
+
"sub-stroke0089",
|
| 93 |
+
"sub-stroke0090",
|
| 94 |
+
"sub-stroke0091",
|
| 95 |
+
"sub-stroke0092",
|
| 96 |
+
"sub-stroke0093",
|
| 97 |
+
"sub-stroke0094",
|
| 98 |
+
"sub-stroke0095",
|
| 99 |
+
"sub-stroke0096",
|
| 100 |
+
"sub-stroke0097",
|
| 101 |
+
"sub-stroke0098",
|
| 102 |
+
"sub-stroke0099",
|
| 103 |
+
"sub-stroke0100",
|
| 104 |
+
"sub-stroke0101",
|
| 105 |
+
"sub-stroke0102",
|
| 106 |
+
"sub-stroke0103",
|
| 107 |
+
"sub-stroke0104",
|
| 108 |
+
"sub-stroke0105",
|
| 109 |
+
"sub-stroke0106",
|
| 110 |
+
"sub-stroke0107",
|
| 111 |
+
"sub-stroke0108",
|
| 112 |
+
"sub-stroke0109",
|
| 113 |
+
"sub-stroke0110",
|
| 114 |
+
"sub-stroke0111",
|
| 115 |
+
"sub-stroke0112",
|
| 116 |
+
"sub-stroke0113",
|
| 117 |
+
"sub-stroke0114",
|
| 118 |
+
"sub-stroke0115",
|
| 119 |
+
"sub-stroke0116",
|
| 120 |
+
"sub-stroke0117",
|
| 121 |
+
"sub-stroke0118",
|
| 122 |
+
"sub-stroke0119",
|
| 123 |
+
"sub-stroke0133",
|
| 124 |
+
"sub-stroke0134",
|
| 125 |
+
"sub-stroke0135",
|
| 126 |
+
"sub-stroke0136",
|
| 127 |
+
"sub-stroke0137",
|
| 128 |
+
"sub-stroke0138",
|
| 129 |
+
"sub-stroke0139",
|
| 130 |
+
"sub-stroke0140",
|
| 131 |
+
"sub-stroke0141",
|
| 132 |
+
"sub-stroke0142",
|
| 133 |
+
"sub-stroke0143",
|
| 134 |
+
"sub-stroke0144",
|
| 135 |
+
"sub-stroke0145",
|
| 136 |
+
"sub-stroke0146",
|
| 137 |
+
"sub-stroke0147",
|
| 138 |
+
"sub-stroke0148",
|
| 139 |
+
"sub-stroke0149",
|
| 140 |
+
"sub-stroke0150",
|
| 141 |
+
"sub-stroke0151",
|
| 142 |
+
"sub-stroke0152",
|
| 143 |
+
"sub-stroke0153",
|
| 144 |
+
"sub-stroke0154",
|
| 145 |
+
"sub-stroke0155",
|
| 146 |
+
"sub-stroke0156",
|
| 147 |
+
"sub-stroke0157",
|
| 148 |
+
"sub-stroke0158",
|
| 149 |
+
"sub-stroke0159",
|
| 150 |
+
"sub-stroke0161",
|
| 151 |
+
"sub-stroke0162",
|
| 152 |
+
"sub-stroke0163",
|
| 153 |
+
"sub-stroke0164",
|
| 154 |
+
"sub-stroke0165",
|
| 155 |
+
"sub-stroke0166",
|
| 156 |
+
"sub-stroke0167",
|
| 157 |
+
"sub-stroke0168",
|
| 158 |
+
"sub-stroke0169",
|
| 159 |
+
"sub-stroke0170",
|
| 160 |
+
"sub-stroke0171",
|
| 161 |
+
"sub-stroke0172",
|
| 162 |
+
"sub-stroke0173",
|
| 163 |
+
"sub-stroke0174",
|
| 164 |
+
"sub-stroke0175",
|
| 165 |
+
"sub-stroke0176",
|
| 166 |
+
"sub-stroke0177",
|
| 167 |
+
"sub-stroke0178",
|
| 168 |
+
"sub-stroke0179",
|
| 169 |
+
"sub-stroke0180",
|
| 170 |
+
"sub-stroke0181",
|
| 171 |
+
"sub-stroke0182",
|
| 172 |
+
"sub-stroke0183",
|
| 173 |
+
"sub-stroke0184",
|
| 174 |
+
"sub-stroke0185",
|
| 175 |
+
"sub-stroke0186",
|
| 176 |
+
"sub-stroke0187",
|
| 177 |
+
"sub-stroke0188",
|
| 178 |
+
"sub-stroke0189",
|
| 179 |
+
)
|
| 180 |
+
|
| 181 |
+
ISLES24_TRAIN_CASE_ID_TO_FILE_INDEX: dict[str, int] = {
|
| 182 |
+
case_id: idx for idx, case_id in enumerate(ISLES24_TRAIN_CASE_IDS)
|
| 183 |
+
}
|
| 184 |
+
|
| 185 |
+
|
| 186 |
+
def isles24_train_data_file(case_id: str) -> str:
|
| 187 |
+
"""Return the Parquet data file path in the HF dataset repo for a given case ID."""
|
| 188 |
+
idx = ISLES24_TRAIN_CASE_ID_TO_FILE_INDEX[case_id]
|
| 189 |
+
return f"data/train-{idx:05d}-of-{ISLES24_TRAIN_NUM_FILES:05d}.parquet"
|
|
@@ -11,6 +11,12 @@ from typing import TYPE_CHECKING, Protocol, Self
 
 from stroke_deepisles_demo.core.logging import get_logger
 from stroke_deepisles_demo.core.types import CaseFiles  # noqa: TC001
+from stroke_deepisles_demo.data.isles24_manifest import (
+    ISLES24_DATASET_ID,
+    ISLES24_DATASET_REVISION,
+    ISLES24_TRAIN_CASE_IDS,
+    isles24_train_data_file,
+)
 
 # Security: Regex for valid ISLES24 subject IDs (defense-in-depth)
 # Expected format: sub-strokeXXXX (e.g., sub-stroke0001)
@@ -154,6 +160,113 @@ class HuggingFaceDatasetWrapper:
         self._temp_dir = None
 
 
+@dataclass
+class Isles24HuggingFaceDataset:
+    """ISLES24 dataset access optimized for HF Spaces.
+
+    Key behavior:
+    - `list_case_ids()` returns from a pinned manifest (no dataset download).
+    - `get_case()` loads exactly one Parquet shard via `data_files=...` (no 27GB eager download).
+
+    This class exists because `datasets.load_dataset(dataset_id, split="train")` can
+    trigger an eager full-dataset download/prepare on cold starts, which is not viable
+    for API endpoints like `/api/cases` on Hugging Face Spaces.
+    """
+
+    dataset_id: str = ISLES24_DATASET_ID
+    token: str | None = None
+    revision: str = ISLES24_DATASET_REVISION
+    _temp_dir: Path | None = field(default=None, repr=False)
+
+    def __len__(self) -> int:
+        return len(ISLES24_TRAIN_CASE_IDS)
+
+    def __enter__(self) -> Self:
+        return self
+
+    def __exit__(self, *args: object) -> None:
+        self.cleanup()
+
+    def list_case_ids(self) -> list[str]:
+        return list(ISLES24_TRAIN_CASE_IDS)
+
+    def get_case(self, case_id: str | int) -> CaseFiles:
+        """Load files for a single ISLES24 case.
+
+        Args:
+            case_id: Case identifier (e.g., "sub-stroke0102") or 0-based integer index.
+        """
+        from datasets import load_dataset
+
+        if isinstance(case_id, int):
+            if case_id < 0 or case_id >= len(ISLES24_TRAIN_CASE_IDS):
+                raise IndexError(f"Case index {case_id} out of range")
+            resolved_case_id = ISLES24_TRAIN_CASE_IDS[case_id]
+        else:
+            resolved_case_id = case_id
+
+        # Security: Validate subject_id before using in path (defense-in-depth)
+        if not _SAFE_SUBJECT_ID_PATTERN.match(resolved_case_id):
+            raise ValueError(
+                f"Invalid subject_id format: {resolved_case_id!r}. Expected format: sub-strokeXXXX"
+            )
+
+        # Load exactly one shard (1 case per parquet file in this dataset)
+        data_file = isles24_train_data_file(resolved_case_id)
+        ds = load_dataset(
+            self.dataset_id,
+            data_files={"train": data_file},
+            split="train",
+            token=self.token,
+            revision=self.revision,
+        )
+        ds = ds.select_columns(["subject_id", "dwi", "adc", "lesion_mask"])
+        if len(ds) != 1:
+            raise RuntimeError(f"Expected 1 row for {resolved_case_id}, got {len(ds)}")
+
+        row = ds[0]
+        subject_id = row["subject_id"]
+        if subject_id != resolved_case_id:
+            raise RuntimeError(
+                f"Unexpected subject_id {subject_id!r} in {data_file} (expected {resolved_case_id!r})"
+            )
+
+        if self._temp_dir is None:
+            self._temp_dir = Path(tempfile.mkdtemp(prefix="isles24_hf_wrapper_"))
+
+        case_dir = self._temp_dir / subject_id
+        case_dir.mkdir(exist_ok=True)
+
+        dwi_path = case_dir / f"{subject_id}_dwi.nii.gz"
+        adc_path = case_dir / f"{subject_id}_adc.nii.gz"
+
+        if not dwi_path.exists():
+            row["dwi"].to_filename(str(dwi_path))
+        if not adc_path.exists():
+            row["adc"].to_filename(str(adc_path))
+
+        case_files: CaseFiles = {
+            "dwi": dwi_path,
+            "adc": adc_path,
+        }
+
+        if row.get("lesion_mask") is not None:
+            mask_path = case_dir / f"{subject_id}_lesion-msk.nii.gz"
+            if not mask_path.exists():
+                row["lesion_mask"].to_filename(str(mask_path))
+            case_files["ground_truth"] = mask_path
+
+        return case_files
+
+    def cleanup(self) -> None:
+        if self._temp_dir and self._temp_dir.exists():
+            try:
+                shutil.rmtree(self._temp_dir)
+            except OSError as e:
+                logger.warning("Failed to cleanup temp directory %s: %s", self._temp_dir, e)
+            self._temp_dir = None
+
+
 def load_isles_dataset(
     source: str | Path | None = None,
     *,
@@ -217,6 +330,9 @@ def load_isles_dataset(
     dataset_id = str(source) if source else settings.hf_dataset_id
     hf_token = token if token is not None else settings.hf_token
 
+    if dataset_id == ISLES24_DATASET_ID:
+        return Isles24HuggingFaceDataset(dataset_id=dataset_id, token=hf_token)
+
     # Load dataset, selecting only necessary columns to minimize decoding overhead
     # We rely on neuroimaging-go-brrrr's Nifti feature for lazy loading if configured,
     # but select_columns ensures we don't touch other modalities.
@@ -0,0 +1,78 @@
"""Unit tests for ISLES24 HF dataset fast-path loader."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
from typing import TYPE_CHECKING
|
| 6 |
+
from unittest.mock import MagicMock, patch
|
| 7 |
+
|
| 8 |
+
import pytest
|
| 9 |
+
|
| 10 |
+
from stroke_deepisles_demo.data.isles24_manifest import (
|
| 11 |
+
ISLES24_DATASET_ID,
|
| 12 |
+
ISLES24_DATASET_REVISION,
|
| 13 |
+
ISLES24_TRAIN_CASE_IDS,
|
| 14 |
+
isles24_train_data_file,
|
| 15 |
+
)
|
| 16 |
+
from stroke_deepisles_demo.data.loader import Isles24HuggingFaceDataset
|
| 17 |
+
|
| 18 |
+
if TYPE_CHECKING:
|
| 19 |
+
from pathlib import Path
|
| 20 |
+
|
| 21 |
+
|
| 22 |
+
def test_list_case_ids_returns_manifest() -> None:
|
| 23 |
+
dataset = Isles24HuggingFaceDataset()
|
| 24 |
+
assert dataset.list_case_ids() == list(ISLES24_TRAIN_CASE_IDS)
|
| 25 |
+
assert len(dataset) == len(ISLES24_TRAIN_CASE_IDS)
|
| 26 |
+
|
| 27 |
+
|
| 28 |
+
def test_get_case_loads_single_parquet_shard(tmp_path: Path) -> None:
|
| 29 |
+
mock_dwi = MagicMock()
|
| 30 |
+
mock_adc = MagicMock()
|
| 31 |
+
|
| 32 |
+
mock_ds = MagicMock()
|
| 33 |
+
mock_ds.select_columns.return_value = mock_ds
|
| 34 |
+
mock_ds.__len__.return_value = 1
|
| 35 |
+
mock_ds.__getitem__.return_value = {
|
| 36 |
+
"subject_id": "sub-stroke0001",
|
| 37 |
+
"dwi": mock_dwi,
|
| 38 |
+
"adc": mock_adc,
|
| 39 |
+
"lesion_mask": None,
|
| 40 |
+
}
|
| 41 |
+
|
| 42 |
+
temp_root = tmp_path / "hf_tmp"
|
| 43 |
+
temp_root.mkdir()
|
| 44 |
+
|
| 45 |
+
with (
|
| 46 |
+
patch("datasets.load_dataset", return_value=mock_ds) as mock_load,
|
| 47 |
+
patch("stroke_deepisles_demo.data.loader.tempfile.mkdtemp", return_value=str(temp_root)),
|
| 48 |
+
):
|
| 49 |
+
dataset = Isles24HuggingFaceDataset(token="hf_token_123")
|
| 50 |
+
with dataset:
|
| 51 |
+
case = dataset.get_case("sub-stroke0001")
|
| 52 |
+
|
| 53 |
+
# Uses pinned dataset settings + per-shard data_files selection.
|
| 54 |
+
mock_load.assert_called_once_with(
|
| 55 |
+
ISLES24_DATASET_ID,
|
| 56 |
+
data_files={"train": isles24_train_data_file("sub-stroke0001")},
|
| 57 |
+
split="train",
|
| 58 |
+
token="hf_token_123",
|
| 59 |
+
revision=ISLES24_DATASET_REVISION,
|
| 60 |
+
)
|
| 61 |
+
|
| 62 |
+
assert case["dwi"].name == "sub-stroke0001_dwi.nii.gz"
|
| 63 |
+
assert case["adc"].name == "sub-stroke0001_adc.nii.gz"
|
| 64 |
+
assert case["dwi"].parent == temp_root / "sub-stroke0001"
|
| 65 |
+
assert case["adc"].parent == temp_root / "sub-stroke0001"
|
| 66 |
+
|
| 67 |
+
# Materializes NIfTI objects via to_filename().
|
| 68 |
+
assert mock_dwi.to_filename.call_count == 1
|
| 69 |
+
assert mock_adc.to_filename.call_count == 1
|
| 70 |
+
|
| 71 |
+
# Temp dir cleaned up by context manager.
|
| 72 |
+
assert not temp_root.exists()
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
def test_get_case_rejects_unknown_case_id() -> None:
|
| 76 |
+
dataset = Isles24HuggingFaceDataset()
|
| 77 |
+
with pytest.raises(KeyError):
|
| 78 |
+
_ = dataset.get_case("sub-stroke9999")
|
|
@@ -2,24 +2,19 @@
 
 from __future__ import annotations
 
-import os
 from typing import TYPE_CHECKING
 from unittest.mock import MagicMock, patch
 
-import pytest
-
 from stroke_deepisles_demo.data.adapter import LocalDataset
-from stroke_deepisles_demo.data.loader import
+from stroke_deepisles_demo.data.loader import (
+    HuggingFaceDatasetWrapper,
+    Isles24HuggingFaceDataset,
+    load_isles_dataset,
+)
 
 if TYPE_CHECKING:
     from pathlib import Path
 
-# Skip tests that download large datasets in CI (limited disk space)
-SKIP_IN_CI = pytest.mark.skipif(
-    os.environ.get("CI") == "true",
-    reason="Skips large HuggingFace downloads in CI (disk space)",
-)
-
 
 def test_load_from_local_returns_local_dataset(synthetic_isles_dir: Path) -> None:
     """Test that loading from local path returns a LocalDataset."""
@@ -51,15 +46,12 @@ def test_load_hf_calls_load_dataset() -> None:
     assert mock_load.call_args[0][0] == "my/dataset"
 
 
-@pytest.mark.integration
-@SKIP_IN_CI
 def test_load_from_huggingface_returns_hf_dataset() -> None:
-    """Test that loading from HuggingFace returns
+    """Test that loading from HuggingFace returns an Isles24HuggingFaceDataset.
 
-
-
+    This should not trigger a full dataset download: the default dataset uses a
+    pinned manifest for case IDs and per-case shard loading.
     """
     with load_isles_dataset() as dataset:  # Default is HuggingFace mode
-        assert isinstance(dataset,
-
-        # Real test might fail if network issue or auth issue
+        assert isinstance(dataset, Isles24HuggingFaceDataset)
+        assert len(dataset.list_case_ids()) == len(dataset)
|