VibecoderMcSwaggins committed on
Commit 7e5ddec · unverified · 1 Parent(s): 34b1b27

fix(hf): avoid eager 27GB dataset download on /api/cases (#46)


* docs(bug): add BUG-014 api/cases 500 error investigation

- GET /api/cases returns 500 Internal Server Error
- Health check passes but cases endpoint fails
- Frontend stuck on 'Loading cases...'
- Root cause TBD - need HF Space logs to confirm
- Dataset is PUBLIC (no auth issue)
- datasets library is transitive dep (not missing)

* docs(bug): update BUG-014 with confirmed root cause analysis

Root cause: datasets.load_dataset() downloads entire 27GB dataset
on every /api/cases call. This times out on HF Spaces.

- Verified call chain: routes.py → data/__init__.py → loader.py:224
- Dataset: 149 parquet shards × 184MB = 27.41GB
- HF Space logs won't help - request times out at proxy before logging
- Recommended fix: small demo dataset with 5-10 cases

* docs(bug): BUG-014 final fix - metadata-based case list, streaming for single case

Root cause: datasets.load_dataset() without streaming=True downloads
all 27GB before returning. This times out on HF Spaces.

Fix:
1. list_case_ids() - use load_dataset_builder().info (1s, no download)
2. get_case() - use streaming + early termination (~30-120s per case)
3. Wire hf_cache_dir for persistent caching

Previous audit dismissed this as 'negligible latency' - that was wrong.
The latency is from downloading 27GB, not from '149 cases'.
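
For reference, a minimal sketch of the metadata-only check described above (assumes the `datasets` library and the `hugging-science/isles24-stroke` repo id pinned later in this commit):

```python
from datasets import load_dataset_builder

# Reads dataset metadata from the Hub; no Parquet shards are downloaded.
builder = load_dataset_builder("hugging-science/isles24-stroke")
print(builder.info.download_size)                 # total download size in bytes (~27.41 GB)
print(builder.info.splits["train"].num_examples)  # 149 cases
```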

* fix(hf): avoid eager 27GB dataset download on /api/cases

Root cause: `datasets.load_dataset(dataset_id, split="train")` downloads
the entire 27GB ISLES24 dataset on cold start, causing HF Spaces proxy
timeouts on `/api/cases`.

Fix:
- Add pinned manifest of 149 case IDs (isles24_manifest.py)
- Add Isles24HuggingFaceDataset class that:
  - Returns case IDs from manifest (no download)
  - Loads single Parquet shard via data_files= for get_case()
- Route ISLES24 dataset loads to the optimized class

Result:
- `/api/cases` returns instantly (no dataset download)
- `get_case()` downloads ~180MB (one shard) instead of 27GB
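
For reference, a minimal sketch of the per-shard access pattern used by `get_case()` (shard path taken from the manifest mapping added below; the first shard corresponds to `sub-stroke0001`):

```python
from datasets import load_dataset

# Fetch a single ~180MB Parquet shard instead of the full 27GB dataset.
ds = load_dataset(
    "hugging-science/isles24-stroke",
    data_files={"train": "data/train-00000-of-00149.parquet"},
    split="train",
)
assert len(ds) == 1  # one case per shard in this dataset
```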

Closes BUG-014

docs/bugs/BUG-014-api-cases-500-error.md ADDED
@@ -0,0 +1,103 @@
+ # BUG-014: /api/cases Endpoint Timeout
+
+ **Status**: ROOT CAUSE CONFIRMED, FIX IDENTIFIED
+ **Severity**: P1 (Frontend completely broken)
+ **Discovered**: 2025-12-15
+ **Reporter**: Manual testing
+
+ ## Summary
+
+ The `/api/cases` endpoint failed on HF Spaces because it triggered an eager ~27GB
+ dataset download/prepare step just to return a list of case IDs.
+
+ The fix keeps the *full* dataset, but changes the data access pattern so:
+ - `/api/cases` does **zero** dataset downloading
+ - `get_case(case_id)` downloads **one** Parquet shard (one case), not the full dataset
+
+ ## Root Cause
+
+ **Current code path:**
+ ```text
+ GET /api/cases
+   → routes.py:57 list_case_ids()
+   → data/__init__.py:43 load_isles_dataset()
+   → loader.py:224 ds = load_dataset(...)  ← DOWNLOADS 27GB EAGERLY
+ ```
+
+ **The bug:** `datasets.load_dataset(dataset_id, split="train")` downloads/prepares the full
+ dataset on cold start. On HF Spaces this regularly exceeds the proxy timeout window, so
+ the frontend never receives a usable case list.
+
+ ## Previous Misdiagnosis
+
+ This was identified in `ARCHITECTURE-AUDIT-2024-12-13.md` as P2-006 and dismissed in `REMAINING-ISSUES-2024-12-13.md`:
+
+ > "Dataset reload per request | ACCEPTABLE | Demo scale (149 cases), adds negligible latency"
+
+ **This assessment was wrong.** The latency isn't from "149 cases"—it's from an eager
+ 27GB download/prepare step happening in the request path.
+
+ ## Verified Facts
+
+ | Fact | Value | Verification |
+ |------|-------|--------------|
+ | Total download | 27.41 GB | `load_dataset_builder().info.download_size` |
+ | Number of cases | 149 | `info.splits['train'].num_examples` |
+ | Shards | 149 parquet files | HF Hub API |
+ | Shard shape | 1 case per parquet file | `load_dataset(..., data_files={...})` returns 1 row |
+ | Case ID range | `sub-stroke0001` … `sub-stroke0189` (with gaps) | subject_id per shard |
+
+ ## The Fix
+
+ ### Part 1: Instant Case List (No Download)
+
+ **Problem:** `list_case_ids()` was implemented by loading the dataset, which on HF Spaces
+ meant triggering the full 27GB download/prepare.
+
+ **Solution:** Use a pinned manifest of case IDs for the ISLES24 dataset.
+
+ Implemented at `src/stroke_deepisles_demo/data/isles24_manifest.py` (pinned to dataset revision).
+
+ ### Part 2: Per-Case Loading by Shard (No Full Download)
+
+ **Problem:** `get_case(case_id)` previously loaded the whole dataset, even when only one
+ case is needed for inference.
+
+ **Solution:** For a single case, load exactly one Parquet shard using `data_files=...`,
+ then materialize DWI/ADC (and optional lesion mask) to temp files.
+
+ Implemented in `src/stroke_deepisles_demo/data/loader.py` as `Isles24HuggingFaceDataset`.
+
+ ### Why This Works on HF Spaces
+
+ - `/api/cases` becomes a pure metadata response (fast, reliable).
+ - Per-case data download happens in the background job (fits the async job model).
+ - No streaming iteration over the full dataset is required.
+
+ ## Files to Change
+
+ | File | Change |
+ |------|--------|
+ | `src/stroke_deepisles_demo/data/isles24_manifest.py` | Add pinned case ID manifest + shard mapping |
+ | `src/stroke_deepisles_demo/data/loader.py` | Add `Isles24HuggingFaceDataset` + route ISLES24 loads to it |
+
+ ## Implementation Steps
+
+ 1. Add pinned ISLES24 case ID manifest (no download on `/api/cases`)
+ 2. Load single Parquet shard via `data_files=...` for `get_case(case_id)`
+ 3. Verify `/api/cases` returns immediately on HF Spaces
+ 4. Verify segmentation job downloads only selected case data
+
+ ## Verification After Fix
+
+ ```bash
+ # After deploying fix, from cold start:
+ time curl -s https://vibecodermcswaggins-stroke-deepisles-demo.hf.space/api/cases
+ # Should complete quickly with {"cases": ["sub-stroke0001", ...]}
+ ```
+
+ ## Notes
+
+ - The full 27GB dataset IS supported - we're not reducing it
+ - Case loading happens in background jobs, not blocking the API gateway timeout window
+ - The core issue was doing full-dataset work inside `/api/cases`
docs/specs/00-data-loading-refactor.md CHANGED
@@ -8,6 +8,11 @@

## Problem Statement

+ > **Update (2025-12-15):** HF Spaces requires avoiding eager full-dataset downloads.
+ > The implementation now uses a pinned ISLES24 manifest (`src/stroke_deepisles_demo/data/isles24_manifest.py`)
+ > for `/api/cases`, and loads a single case via `datasets.load_dataset(..., data_files=...)` (one Parquet shard)
+ > instead of `load_dataset(..., split="train")`.
+
`stroke-deepisles-demo` has a hand-rolled data loading workaround that:

1. **Bypasses `datasets.load_dataset()`** - Uses `HfFileSystem + pyarrow` directly
src/stroke_deepisles_demo/data/isles24_manifest.py ADDED
@@ -0,0 +1,189 @@
+ """ISLES'24 Stroke dataset manifest for Hugging Face loading.
+
+ This project targets the Hugging Face dataset repo `hugging-science/isles24-stroke`.
+ On HF Spaces, calling `datasets.load_dataset(dataset_id, split="train")` can trigger
+ an eager download/prepare step for ~27GB of Parquet shards, which is not viable for
+ fast API endpoints like `/api/cases`.
+
+ The upstream dataset stores one case per Parquet file at:
+ `data/train-00000-of-00149.parquet` ... `data/train-00148-of-00149.parquet`
+
+ This manifest provides:
+ - The authoritative list of available case IDs (subject_ids) for the train split.
+ - A stable mapping from case ID → Parquet shard index (and thus data file path).
+
+ SSOT:
+ - Generated from the dataset at the pinned revision below by reading `subject_id`
+   from each Parquet file (without downloading the full dataset).
+ """
+
+ from __future__ import annotations
+
+ ISLES24_DATASET_ID = "hugging-science/isles24-stroke"
+ # Pinned to the dataset revision used to generate this manifest.
+ ISLES24_DATASET_REVISION = "9707a7fca6d3dd1a690de010ec4aed06bdcd0417"
+
+ ISLES24_TRAIN_NUM_FILES = 149
+
+ # Case IDs in the same order as the Parquet shard filenames (train-00000..., train-00001..., ...).
+ ISLES24_TRAIN_CASE_IDS: tuple[str, ...] = (
+     "sub-stroke0001",
+     "sub-stroke0002",
+     "sub-stroke0003",
+     "sub-stroke0004",
+     "sub-stroke0005",
+     "sub-stroke0006",
+     "sub-stroke0007",
+     "sub-stroke0008",
+     "sub-stroke0009",
+     "sub-stroke0010",
+     "sub-stroke0011",
+     "sub-stroke0012",
+     "sub-stroke0013",
+     "sub-stroke0014",
+     "sub-stroke0015",
+     "sub-stroke0016",
+     "sub-stroke0017",
+     "sub-stroke0019",
+     "sub-stroke0020",
+     "sub-stroke0021",
+     "sub-stroke0022",
+     "sub-stroke0025",
+     "sub-stroke0026",
+     "sub-stroke0027",
+     "sub-stroke0028",
+     "sub-stroke0030",
+     "sub-stroke0033",
+     "sub-stroke0036",
+     "sub-stroke0037",
+     "sub-stroke0038",
+     "sub-stroke0040",
+     "sub-stroke0043",
+     "sub-stroke0045",
+     "sub-stroke0047",
+     "sub-stroke0048",
+     "sub-stroke0049",
+     "sub-stroke0052",
+     "sub-stroke0053",
+     "sub-stroke0054",
+     "sub-stroke0055",
+     "sub-stroke0057",
+     "sub-stroke0062",
+     "sub-stroke0066",
+     "sub-stroke0068",
+     "sub-stroke0070",
+     "sub-stroke0071",
+     "sub-stroke0073",
+     "sub-stroke0074",
+     "sub-stroke0075",
+     "sub-stroke0076",
+     "sub-stroke0077",
+     "sub-stroke0078",
+     "sub-stroke0079",
+     "sub-stroke0080",
+     "sub-stroke0081",
+     "sub-stroke0082",
+     "sub-stroke0083",
+     "sub-stroke0084",
+     "sub-stroke0085",
+     "sub-stroke0086",
+     "sub-stroke0087",
+     "sub-stroke0088",
+     "sub-stroke0089",
+     "sub-stroke0090",
+     "sub-stroke0091",
+     "sub-stroke0092",
+     "sub-stroke0093",
+     "sub-stroke0094",
+     "sub-stroke0095",
+     "sub-stroke0096",
+     "sub-stroke0097",
+     "sub-stroke0098",
+     "sub-stroke0099",
+     "sub-stroke0100",
+     "sub-stroke0101",
+     "sub-stroke0102",
+     "sub-stroke0103",
+     "sub-stroke0104",
+     "sub-stroke0105",
+     "sub-stroke0106",
+     "sub-stroke0107",
+     "sub-stroke0108",
+     "sub-stroke0109",
+     "sub-stroke0110",
+     "sub-stroke0111",
+     "sub-stroke0112",
+     "sub-stroke0113",
+     "sub-stroke0114",
+     "sub-stroke0115",
+     "sub-stroke0116",
+     "sub-stroke0117",
+     "sub-stroke0118",
+     "sub-stroke0119",
+     "sub-stroke0133",
+     "sub-stroke0134",
+     "sub-stroke0135",
+     "sub-stroke0136",
+     "sub-stroke0137",
+     "sub-stroke0138",
+     "sub-stroke0139",
+     "sub-stroke0140",
+     "sub-stroke0141",
+     "sub-stroke0142",
+     "sub-stroke0143",
+     "sub-stroke0144",
+     "sub-stroke0145",
+     "sub-stroke0146",
+     "sub-stroke0147",
+     "sub-stroke0148",
+     "sub-stroke0149",
+     "sub-stroke0150",
+     "sub-stroke0151",
+     "sub-stroke0152",
+     "sub-stroke0153",
+     "sub-stroke0154",
+     "sub-stroke0155",
+     "sub-stroke0156",
+     "sub-stroke0157",
+     "sub-stroke0158",
+     "sub-stroke0159",
+     "sub-stroke0161",
+     "sub-stroke0162",
+     "sub-stroke0163",
+     "sub-stroke0164",
+     "sub-stroke0165",
+     "sub-stroke0166",
+     "sub-stroke0167",
+     "sub-stroke0168",
+     "sub-stroke0169",
+     "sub-stroke0170",
+     "sub-stroke0171",
+     "sub-stroke0172",
+     "sub-stroke0173",
+     "sub-stroke0174",
+     "sub-stroke0175",
+     "sub-stroke0176",
+     "sub-stroke0177",
+     "sub-stroke0178",
+     "sub-stroke0179",
+     "sub-stroke0180",
+     "sub-stroke0181",
+     "sub-stroke0182",
+     "sub-stroke0183",
+     "sub-stroke0184",
+     "sub-stroke0185",
+     "sub-stroke0186",
+     "sub-stroke0187",
+     "sub-stroke0188",
+     "sub-stroke0189",
+ )
+
+ ISLES24_TRAIN_CASE_ID_TO_FILE_INDEX: dict[str, int] = {
+     case_id: idx for idx, case_id in enumerate(ISLES24_TRAIN_CASE_IDS)
+ }
+
+
+ def isles24_train_data_file(case_id: str) -> str:
+     """Return the Parquet data file path in the HF dataset repo for a given case ID."""
+     idx = ISLES24_TRAIN_CASE_ID_TO_FILE_INDEX[case_id]
+     return f"data/train-{idx:05d}-of-{ISLES24_TRAIN_NUM_FILES:05d}.parquet"
src/stroke_deepisles_demo/data/loader.py CHANGED
@@ -11,6 +11,12 @@ from typing import TYPE_CHECKING, Protocol, Self

from stroke_deepisles_demo.core.logging import get_logger
from stroke_deepisles_demo.core.types import CaseFiles  # noqa: TC001
+ from stroke_deepisles_demo.data.isles24_manifest import (
+     ISLES24_DATASET_ID,
+     ISLES24_DATASET_REVISION,
+     ISLES24_TRAIN_CASE_IDS,
+     isles24_train_data_file,
+ )

# Security: Regex for valid ISLES24 subject IDs (defense-in-depth)
# Expected format: sub-strokeXXXX (e.g., sub-stroke0001)

@@ -154,6 +160,113 @@ class HuggingFaceDatasetWrapper:
        self._temp_dir = None


+ @dataclass
+ class Isles24HuggingFaceDataset:
+     """ISLES24 dataset access optimized for HF Spaces.
+
+     Key behavior:
+     - `list_case_ids()` returns from a pinned manifest (no dataset download).
+     - `get_case()` loads exactly one Parquet shard via `data_files=...` (no 27GB eager download).
+
+     This class exists because `datasets.load_dataset(dataset_id, split="train")` can
+     trigger an eager full-dataset download/prepare on cold starts, which is not viable
+     for API endpoints like `/api/cases` on Hugging Face Spaces.
+     """
+
+     dataset_id: str = ISLES24_DATASET_ID
+     token: str | None = None
+     revision: str = ISLES24_DATASET_REVISION
+     _temp_dir: Path | None = field(default=None, repr=False)
+
+     def __len__(self) -> int:
+         return len(ISLES24_TRAIN_CASE_IDS)
+
+     def __enter__(self) -> Self:
+         return self
+
+     def __exit__(self, *args: object) -> None:
+         self.cleanup()
+
+     def list_case_ids(self) -> list[str]:
+         return list(ISLES24_TRAIN_CASE_IDS)
+
+     def get_case(self, case_id: str | int) -> CaseFiles:
+         """Load files for a single ISLES24 case.
+
+         Args:
+             case_id: Case identifier (e.g., "sub-stroke0102") or 0-based integer index.
+         """
+         from datasets import load_dataset
+
+         if isinstance(case_id, int):
+             if case_id < 0 or case_id >= len(ISLES24_TRAIN_CASE_IDS):
+                 raise IndexError(f"Case index {case_id} out of range")
+             resolved_case_id = ISLES24_TRAIN_CASE_IDS[case_id]
+         else:
+             resolved_case_id = case_id
+
+         # Security: Validate subject_id before using in path (defense-in-depth)
+         if not _SAFE_SUBJECT_ID_PATTERN.match(resolved_case_id):
+             raise ValueError(
+                 f"Invalid subject_id format: {resolved_case_id!r}. Expected format: sub-strokeXXXX"
+             )
+
+         # Load exactly one shard (1 case per parquet file in this dataset)
+         data_file = isles24_train_data_file(resolved_case_id)
+         ds = load_dataset(
+             self.dataset_id,
+             data_files={"train": data_file},
+             split="train",
+             token=self.token,
+             revision=self.revision,
+         )
+         ds = ds.select_columns(["subject_id", "dwi", "adc", "lesion_mask"])
+         if len(ds) != 1:
+             raise RuntimeError(f"Expected 1 row for {resolved_case_id}, got {len(ds)}")
+
+         row = ds[0]
+         subject_id = row["subject_id"]
+         if subject_id != resolved_case_id:
+             raise RuntimeError(
+                 f"Unexpected subject_id {subject_id!r} in {data_file} (expected {resolved_case_id!r})"
+             )
+
+         if self._temp_dir is None:
+             self._temp_dir = Path(tempfile.mkdtemp(prefix="isles24_hf_wrapper_"))
+
+         case_dir = self._temp_dir / subject_id
+         case_dir.mkdir(exist_ok=True)
+
+         dwi_path = case_dir / f"{subject_id}_dwi.nii.gz"
+         adc_path = case_dir / f"{subject_id}_adc.nii.gz"
+
+         if not dwi_path.exists():
+             row["dwi"].to_filename(str(dwi_path))
+         if not adc_path.exists():
+             row["adc"].to_filename(str(adc_path))
+
+         case_files: CaseFiles = {
+             "dwi": dwi_path,
+             "adc": adc_path,
+         }
+
+         if row.get("lesion_mask") is not None:
+             mask_path = case_dir / f"{subject_id}_lesion-msk.nii.gz"
+             if not mask_path.exists():
+                 row["lesion_mask"].to_filename(str(mask_path))
+             case_files["ground_truth"] = mask_path
+
+         return case_files
+
+     def cleanup(self) -> None:
+         if self._temp_dir and self._temp_dir.exists():
+             try:
+                 shutil.rmtree(self._temp_dir)
+             except OSError as e:
+                 logger.warning("Failed to cleanup temp directory %s: %s", self._temp_dir, e)
+             self._temp_dir = None
+
+
def load_isles_dataset(
    source: str | Path | None = None,
    *,

@@ -217,6 +330,9 @@ def load_isles_dataset(
    dataset_id = str(source) if source else settings.hf_dataset_id
    hf_token = token if token is not None else settings.hf_token

+     if dataset_id == ISLES24_DATASET_ID:
+         return Isles24HuggingFaceDataset(dataset_id=dataset_id, token=hf_token)
+
    # Load dataset, selecting only necessary columns to minimize decoding overhead
    # We rely on neuroimaging-go-brrrr's Nifti feature for lazy loading if configured,
    # but select_columns ensures we don't touch other modalities.
tests/data/test_isles24_dataset.py ADDED
@@ -0,0 +1,78 @@
+ """Unit tests for ISLES24 HF dataset fast-path loader."""
+
+ from __future__ import annotations
+
+ from typing import TYPE_CHECKING
+ from unittest.mock import MagicMock, patch
+
+ import pytest
+
+ from stroke_deepisles_demo.data.isles24_manifest import (
+     ISLES24_DATASET_ID,
+     ISLES24_DATASET_REVISION,
+     ISLES24_TRAIN_CASE_IDS,
+     isles24_train_data_file,
+ )
+ from stroke_deepisles_demo.data.loader import Isles24HuggingFaceDataset
+
+ if TYPE_CHECKING:
+     from pathlib import Path
+
+
+ def test_list_case_ids_returns_manifest() -> None:
+     dataset = Isles24HuggingFaceDataset()
+     assert dataset.list_case_ids() == list(ISLES24_TRAIN_CASE_IDS)
+     assert len(dataset) == len(ISLES24_TRAIN_CASE_IDS)
+
+
+ def test_get_case_loads_single_parquet_shard(tmp_path: Path) -> None:
+     mock_dwi = MagicMock()
+     mock_adc = MagicMock()
+
+     mock_ds = MagicMock()
+     mock_ds.select_columns.return_value = mock_ds
+     mock_ds.__len__.return_value = 1
+     mock_ds.__getitem__.return_value = {
+         "subject_id": "sub-stroke0001",
+         "dwi": mock_dwi,
+         "adc": mock_adc,
+         "lesion_mask": None,
+     }
+
+     temp_root = tmp_path / "hf_tmp"
+     temp_root.mkdir()
+
+     with (
+         patch("datasets.load_dataset", return_value=mock_ds) as mock_load,
+         patch("stroke_deepisles_demo.data.loader.tempfile.mkdtemp", return_value=str(temp_root)),
+     ):
+         dataset = Isles24HuggingFaceDataset(token="hf_token_123")
+         with dataset:
+             case = dataset.get_case("sub-stroke0001")
+
+         # Uses pinned dataset settings + per-shard data_files selection.
+         mock_load.assert_called_once_with(
+             ISLES24_DATASET_ID,
+             data_files={"train": isles24_train_data_file("sub-stroke0001")},
+             split="train",
+             token="hf_token_123",
+             revision=ISLES24_DATASET_REVISION,
+         )
+
+     assert case["dwi"].name == "sub-stroke0001_dwi.nii.gz"
+     assert case["adc"].name == "sub-stroke0001_adc.nii.gz"
+     assert case["dwi"].parent == temp_root / "sub-stroke0001"
+     assert case["adc"].parent == temp_root / "sub-stroke0001"
+
+     # Materializes NIfTI objects via to_filename().
+     assert mock_dwi.to_filename.call_count == 1
+     assert mock_adc.to_filename.call_count == 1
+
+     # Temp dir cleaned up by context manager.
+     assert not temp_root.exists()
+
+
+ def test_get_case_rejects_unknown_case_id() -> None:
+     dataset = Isles24HuggingFaceDataset()
+     with pytest.raises(KeyError):
+         _ = dataset.get_case("sub-stroke9999")
tests/data/test_loader.py CHANGED
@@ -2,24 +2,19 @@

from __future__ import annotations

- import os
from typing import TYPE_CHECKING
from unittest.mock import MagicMock, patch

- import pytest
-
from stroke_deepisles_demo.data.adapter import LocalDataset
- from stroke_deepisles_demo.data.loader import HuggingFaceDatasetWrapper, load_isles_dataset
+ from stroke_deepisles_demo.data.loader import (
+     HuggingFaceDatasetWrapper,
+     Isles24HuggingFaceDataset,
+     load_isles_dataset,
+ )

if TYPE_CHECKING:
    from pathlib import Path

- # Skip tests that download large datasets in CI (limited disk space)
- SKIP_IN_CI = pytest.mark.skipif(
-     os.environ.get("CI") == "true",
-     reason="Skips large HuggingFace downloads in CI (disk space)",
- )
-

def test_load_from_local_returns_local_dataset(synthetic_isles_dir: Path) -> None:
    """Test that loading from local path returns a LocalDataset."""

@@ -51,15 +46,12 @@ def test_load_hf_calls_load_dataset() -> None:
    assert mock_load.call_args[0][0] == "my/dataset"


- @pytest.mark.integration
- @SKIP_IN_CI
def test_load_from_huggingface_returns_hf_dataset() -> None:
-     """Test that loading from HuggingFace returns a HuggingFaceDatasetWrapper.
+     """Test that loading from HuggingFace returns an Isles24HuggingFaceDataset.

-     Note: Skipped in CI due to large download size (~GB) and limited disk space.
-     Run locally with: pytest -m integration tests/data/test_loader.py
+     This should not trigger a full dataset download: the default dataset uses a
+     pinned manifest for case IDs and per-case shard loading.
    """
    with load_isles_dataset() as dataset:  # Default is HuggingFace mode
-         assert isinstance(dataset, HuggingFaceDatasetWrapper)
-         # We can't guarantee length if we don't mock, but we can check type
-         # Real test might fail if network issue or auth issue
+         assert isinstance(dataset, Isles24HuggingFaceDataset)
+         assert len(dataset.list_case_ids()) == len(dataset)