VibecoderMcSwaggins committed on
Commit 7e5ddec · unverified · 1 Parent(s): 34b1b27

fix(hf): avoid eager 27GB dataset download on /api/cases (#46)


* docs(bug): add BUG-014 api/cases 500 error investigation

- GET /api/cases returns 500 Internal Server Error
- Health check passes but cases endpoint fails
- Frontend stuck on 'Loading cases...'
- Root cause TBD - need HF Space logs to confirm
- Dataset is PUBLIC (no auth issue)
- datasets library is transitive dep (not missing)

* docs(bug): update BUG-014 with confirmed root cause analysis

Root cause: datasets.load_dataset() downloads entire 27GB dataset
on every /api/cases call. This times out on HF Spaces.

- Verified call chain: routes.py → data/__init__.py → loader.py:224
- Dataset: 149 parquet shards × 184MB = 27.41GB
- HF Space logs won't help - request times out at proxy before logging
- Recommended fix: small demo dataset with 5-10 cases

* docs(bug): BUG-014 final fix - metadata-based case list, streaming for single case

Root cause: datasets.load_dataset() without streaming=True downloads
all 27GB before returning. This times out on HF Spaces.

Fix:
1. list_case_ids() - use load_dataset_builder().info (1s, no download)
2. get_case() - use streaming + early termination (~30-120s per case)
3. Wire hf_cache_dir for persistent caching

Previous audit dismissed this as 'negligible latency' - that was wrong.
The latency is from downloading 27GB, not from '149 cases'.
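
For reference, a minimal sketch of the metadata-only check described above (assumes the `datasets` library and the `hugging-science/isles24-stroke` repo id pinned later in this commit):

```python
from datasets import load_dataset_builder

# Reads dataset metadata from the Hub; no Parquet shards are downloaded.
builder = load_dataset_builder("hugging-science/isles24-stroke")
print(builder.info.download_size)                 # total download size in bytes (~27.41 GB)
print(builder.info.splits["train"].num_examples)  # 149 cases
```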

* fix(hf): avoid eager 27GB dataset download on /api/cases

Root cause: `datasets.load_dataset(dataset_id, split="train")` downloads
the entire 27GB ISLES24 dataset on cold start, causing HF Spaces proxy
timeouts on `/api/cases`.

Fix:
- Add pinned manifest of 149 case IDs (isles24_manifest.py)
- Add Isles24HuggingFaceDataset class that:
  - Returns case IDs from manifest (no download)
  - Loads single Parquet shard via data_files= for get_case()
- Route ISLES24 dataset loads to the optimized class

Result:
- `/api/cases` returns instantly (no dataset download)
- `get_case()` downloads ~180MB (one shard) instead of 27GB
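
For reference, a minimal sketch of the per-shard access pattern used by `get_case()` (shard path taken from the manifest mapping added below; the first shard corresponds to `sub-stroke0001`):

```python
from datasets import load_dataset

# Fetch a single ~180MB Parquet shard instead of the full 27GB dataset.
ds = load_dataset(
    "hugging-science/isles24-stroke",
    data_files={"train": "data/train-00000-of-00149.parquet"},
    split="train",
)
assert len(ds) == 1  # one case per shard in this dataset
```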

Closes BUG-014

docs/bugs/BUG-014-api-cases-500-error.md ADDED
@@ -0,0 +1,103 @@
+ # BUG-014: /api/cases Endpoint Timeout
+
+ **Status**: ROOT CAUSE CONFIRMED, FIX IDENTIFIED
+ **Severity**: P1 (Frontend completely broken)
+ **Discovered**: 2025-12-15
+ **Reporter**: Manual testing
+
+ ## Summary
+
+ The `/api/cases` endpoint failed on HF Spaces because it triggered an eager ~27GB
+ dataset download/prepare step just to return a list of case IDs.
+
+ The fix keeps the *full* dataset, but changes the data access pattern so:
+ - `/api/cases` does **zero** dataset downloading
+ - `get_case(case_id)` downloads **one** Parquet shard (one case), not the full dataset
+
+ ## Root Cause
+
+ **Current code path:**
+ ```text
+ GET /api/cases
+   → routes.py:57 list_case_ids()
+   → data/__init__.py:43 load_isles_dataset()
+   → loader.py:224 ds = load_dataset(...)  ← DOWNLOADS 27GB EAGERLY
+ ```
+
+ **The bug:** `datasets.load_dataset(dataset_id, split="train")` downloads/prepares the full
+ dataset on cold start. On HF Spaces this regularly exceeds the proxy timeout window, so
+ the frontend never receives a usable case list.
+
+ ## Previous Misdiagnosis
+
+ This was identified in `ARCHITECTURE-AUDIT-2024-12-13.md` as P2-006 and dismissed in `REMAINING-ISSUES-2024-12-13.md`:
+
+ > "Dataset reload per request | ACCEPTABLE | Demo scale (149 cases), adds negligible latency"
+
+ **This assessment was wrong.** The latency isn't from "149 cases"—it's from an eager
+ 27GB download/prepare step happening in the request path.
+
+ ## Verified Facts
+
+ | Fact | Value | Verification |
+ |------|-------|--------------|
+ | Total download | 27.41 GB | `load_dataset_builder().info.download_size` |
+ | Number of cases | 149 | `info.splits['train'].num_examples` |
+ | Shards | 149 parquet files | HF Hub API |
+ | Shard shape | 1 case per parquet file | `load_dataset(..., data_files={...})` returns 1 row |
+ | Case ID range | `sub-stroke0001` … `sub-stroke0189` (with gaps) | subject_id per shard |
+
+ ## The Fix
+
+ ### Part 1: Instant Case List (No Download)
+
+ **Problem:** `list_case_ids()` was implemented by loading the dataset, which on HF Spaces
+ meant triggering the full 27GB download/prepare.
+
+ **Solution:** Use a pinned manifest of case IDs for the ISLES24 dataset.
+
+ Implemented at `src/stroke_deepisles_demo/data/isles24_manifest.py` (pinned to dataset revision).
+
+ ### Part 2: Per-Case Loading by Shard (No Full Download)
+
+ **Problem:** `get_case(case_id)` previously loaded the whole dataset, even when only one
+ case is needed for inference.
+
+ **Solution:** For a single case, load exactly one Parquet shard using `data_files=...`,
+ then materialize DWI/ADC (and optional lesion mask) to temp files.
+
+ Implemented in `src/stroke_deepisles_demo/data/loader.py` as `Isles24HuggingFaceDataset`.
+
+ ### Why This Works on HF Spaces
+
+ - `/api/cases` becomes a pure metadata response (fast, reliable).
+ - Per-case data download happens in the background job (fits the async job model).
+ - No streaming iteration over the full dataset is required.
+
+ ## Files to Change
+
+ | File | Change |
+ |------|--------|
+ | `src/stroke_deepisles_demo/data/isles24_manifest.py` | Add pinned case ID manifest + shard mapping |
+ | `src/stroke_deepisles_demo/data/loader.py` | Add `Isles24HuggingFaceDataset` + route ISLES24 loads to it |
+
+ ## Implementation Steps
+
+ 1. Add pinned ISLES24 case ID manifest (no download on `/api/cases`)
+ 2. Load single Parquet shard via `data_files=...` for `get_case(case_id)`
+ 3. Verify `/api/cases` returns immediately on HF Spaces
+ 4. Verify segmentation job downloads only selected case data
+
+ ## Verification After Fix
+
+ ```bash
+ # After deploying fix, from cold start:
+ time curl -s https://vibecodermcswaggins-stroke-deepisles-demo.hf.space/api/cases
+ # Should complete quickly with {"cases": ["sub-stroke0001", ...]}
+ ```
+
+ ## Notes
+
+ - The full 27GB dataset IS supported - we're not reducing it
+ - Case loading happens in background jobs, not blocking the API gateway timeout window
+ - The core issue was doing full-dataset work inside `/api/cases`
docs/specs/00-data-loading-refactor.md CHANGED
@@ -8,6 +8,11 @@

## Problem Statement

+ > **Update (2025-12-15):** HF Spaces requires avoiding eager full-dataset downloads.
+ > The implementation now uses a pinned ISLES24 manifest (`src/stroke_deepisles_demo/data/isles24_manifest.py`)
+ > for `/api/cases`, and loads a single case via `datasets.load_dataset(..., data_files=...)` (one Parquet shard)
+ > instead of `load_dataset(..., split="train")`.
+
`stroke-deepisles-demo` has a hand-rolled data loading workaround that:

1. **Bypasses `datasets.load_dataset()`** - Uses `HfFileSystem + pyarrow` directly
src/stroke_deepisles_demo/data/isles24_manifest.py ADDED
@@ -0,0 +1,189 @@
+ """ISLES'24 Stroke dataset manifest for Hugging Face loading.
+
+ This project targets the Hugging Face dataset repo `hugging-science/isles24-stroke`.
+ On HF Spaces, calling `datasets.load_dataset(dataset_id, split="train")` can trigger
+ an eager download/prepare step for ~27GB of Parquet shards, which is not viable for
+ fast API endpoints like `/api/cases`.
+
+ The upstream dataset stores one case per Parquet file at:
+ `data/train-00000-of-00149.parquet` ... `data/train-00148-of-00149.parquet`
+
+ This manifest provides:
+ - The authoritative list of available case IDs (subject_ids) for the train split.
+ - A stable mapping from case ID → Parquet shard index (and thus data file path).
+
+ SSOT:
+ - Generated from the dataset at the pinned revision below by reading `subject_id`
+   from each Parquet file (without downloading the full dataset).
+ """
+
+ from __future__ import annotations
+
+ ISLES24_DATASET_ID = "hugging-science/isles24-stroke"
+ # Pinned to the dataset revision used to generate this manifest.
+ ISLES24_DATASET_REVISION = "9707a7fca6d3dd1a690de010ec4aed06bdcd0417"
+
+ ISLES24_TRAIN_NUM_FILES = 149
+
+ # Case IDs in the same order as the Parquet shard filenames (train-00000..., train-00001..., ...).
+ ISLES24_TRAIN_CASE_IDS: tuple[str, ...] = (
+     "sub-stroke0001",
+     "sub-stroke0002",
+     "sub-stroke0003",
+     "sub-stroke0004",
+     "sub-stroke0005",
+     "sub-stroke0006",
+     "sub-stroke0007",
+     "sub-stroke0008",
+     "sub-stroke0009",
+     "sub-stroke0010",
+     "sub-stroke0011",
+     "sub-stroke0012",
+     "sub-stroke0013",
+     "sub-stroke0014",
+     "sub-stroke0015",
+     "sub-stroke0016",
+     "sub-stroke0017",
+     "sub-stroke0019",
+     "sub-stroke0020",
+     "sub-stroke0021",
+     "sub-stroke0022",
+     "sub-stroke0025",
+     "sub-stroke0026",
+     "sub-stroke0027",
+     "sub-stroke0028",
+     "sub-stroke0030",
+     "sub-stroke0033",
+     "sub-stroke0036",
+     "sub-stroke0037",
+     "sub-stroke0038",
+     "sub-stroke0040",
+     "sub-stroke0043",
+     "sub-stroke0045",
+     "sub-stroke0047",
+     "sub-stroke0048",
+     "sub-stroke0049",
+     "sub-stroke0052",
+     "sub-stroke0053",
+     "sub-stroke0054",
+     "sub-stroke0055",
+     "sub-stroke0057",
+     "sub-stroke0062",
+     "sub-stroke0066",
+     "sub-stroke0068",
+     "sub-stroke0070",
+     "sub-stroke0071",
+     "sub-stroke0073",
+     "sub-stroke0074",
+     "sub-stroke0075",
+     "sub-stroke0076",
+     "sub-stroke0077",
+     "sub-stroke0078",
+     "sub-stroke0079",
+     "sub-stroke0080",
+     "sub-stroke0081",
+     "sub-stroke0082",
+     "sub-stroke0083",
+     "sub-stroke0084",
+     "sub-stroke0085",
+     "sub-stroke0086",
+     "sub-stroke0087",
+     "sub-stroke0088",
+     "sub-stroke0089",
+     "sub-stroke0090",
+     "sub-stroke0091",
+     "sub-stroke0092",
+     "sub-stroke0093",
+     "sub-stroke0094",
+     "sub-stroke0095",
+     "sub-stroke0096",
+     "sub-stroke0097",
+     "sub-stroke0098",
+     "sub-stroke0099",
+     "sub-stroke0100",
+     "sub-stroke0101",
+     "sub-stroke0102",
+     "sub-stroke0103",
+     "sub-stroke0104",
+     "sub-stroke0105",
+     "sub-stroke0106",
+     "sub-stroke0107",
+     "sub-stroke0108",
+     "sub-stroke0109",
+     "sub-stroke0110",
+     "sub-stroke0111",
+     "sub-stroke0112",
+     "sub-stroke0113",
+     "sub-stroke0114",
+     "sub-stroke0115",
+     "sub-stroke0116",
+     "sub-stroke0117",
+     "sub-stroke0118",
+     "sub-stroke0119",
+     "sub-stroke0133",
+     "sub-stroke0134",
+     "sub-stroke0135",
+     "sub-stroke0136",
+     "sub-stroke0137",
+     "sub-stroke0138",
+     "sub-stroke0139",
+     "sub-stroke0140",
+     "sub-stroke0141",
+     "sub-stroke0142",
+     "sub-stroke0143",
+     "sub-stroke0144",
+     "sub-stroke0145",
+     "sub-stroke0146",
+     "sub-stroke0147",
+     "sub-stroke0148",
+     "sub-stroke0149",
+     "sub-stroke0150",
+     "sub-stroke0151",
+     "sub-stroke0152",
+     "sub-stroke0153",
+     "sub-stroke0154",
+     "sub-stroke0155",
+     "sub-stroke0156",
+     "sub-stroke0157",
+     "sub-stroke0158",
+     "sub-stroke0159",
+     "sub-stroke0161",
+     "sub-stroke0162",
+     "sub-stroke0163",
+     "sub-stroke0164",
+     "sub-stroke0165",
+     "sub-stroke0166",
+     "sub-stroke0167",
+     "sub-stroke0168",
+     "sub-stroke0169",
+     "sub-stroke0170",
+     "sub-stroke0171",
+     "sub-stroke0172",
+     "sub-stroke0173",
+     "sub-stroke0174",
+     "sub-stroke0175",
+     "sub-stroke0176",
+     "sub-stroke0177",
+     "sub-stroke0178",
+     "sub-stroke0179",
+     "sub-stroke0180",
+     "sub-stroke0181",
+     "sub-stroke0182",
+     "sub-stroke0183",
+     "sub-stroke0184",
+     "sub-stroke0185",
+     "sub-stroke0186",
+     "sub-stroke0187",
+     "sub-stroke0188",
+     "sub-stroke0189",
+ )
+
+ ISLES24_TRAIN_CASE_ID_TO_FILE_INDEX: dict[str, int] = {
+     case_id: idx for idx, case_id in enumerate(ISLES24_TRAIN_CASE_IDS)
+ }
+
+
+ def isles24_train_data_file(case_id: str) -> str:
+     """Return the Parquet data file path in the HF dataset repo for a given case ID."""
+     idx = ISLES24_TRAIN_CASE_ID_TO_FILE_INDEX[case_id]
+     return f"data/train-{idx:05d}-of-{ISLES24_TRAIN_NUM_FILES:05d}.parquet"
src/stroke_deepisles_demo/data/loader.py CHANGED
@@ -11,6 +11,12 @@ from typing import TYPE_CHECKING, Protocol, Self

from stroke_deepisles_demo.core.logging import get_logger
from stroke_deepisles_demo.core.types import CaseFiles  # noqa: TC001
+ from stroke_deepisles_demo.data.isles24_manifest import (
+     ISLES24_DATASET_ID,
+     ISLES24_DATASET_REVISION,
+     ISLES24_TRAIN_CASE_IDS,
+     isles24_train_data_file,
+ )

# Security: Regex for valid ISLES24 subject IDs (defense-in-depth)
# Expected format: sub-strokeXXXX (e.g., sub-stroke0001)

@@ -154,6 +160,113 @@ class HuggingFaceDatasetWrapper:
        self._temp_dir = None


+ @dataclass
+ class Isles24HuggingFaceDataset:
+     """ISLES24 dataset access optimized for HF Spaces.
+
+     Key behavior:
+     - `list_case_ids()` returns from a pinned manifest (no dataset download).
+     - `get_case()` loads exactly one Parquet shard via `data_files=...` (no 27GB eager download).
+
+     This class exists because `datasets.load_dataset(dataset_id, split="train")` can
+     trigger an eager full-dataset download/prepare on cold starts, which is not viable
+     for API endpoints like `/api/cases` on Hugging Face Spaces.
+     """
+
+     dataset_id: str = ISLES24_DATASET_ID
+     token: str | None = None
+     revision: str = ISLES24_DATASET_REVISION
+     _temp_dir: Path | None = field(default=None, repr=False)
+
+     def __len__(self) -> int:
+         return len(ISLES24_TRAIN_CASE_IDS)
+
+     def __enter__(self) -> Self:
+         return self
+
+     def __exit__(self, *args: object) -> None:
+         self.cleanup()
+
+     def list_case_ids(self) -> list[str]:
+         return list(ISLES24_TRAIN_CASE_IDS)
+
+     def get_case(self, case_id: str | int) -> CaseFiles:
+         """Load files for a single ISLES24 case.
+
+         Args:
+             case_id: Case identifier (e.g., "sub-stroke0102") or 0-based integer index.
+         """
+         from datasets import load_dataset
+
+         if isinstance(case_id, int):
+             if case_id < 0 or case_id >= len(ISLES24_TRAIN_CASE_IDS):
+                 raise IndexError(f"Case index {case_id} out of range")
+             resolved_case_id = ISLES24_TRAIN_CASE_IDS[case_id]
+         else:
+             resolved_case_id = case_id
+
+         # Security: Validate subject_id before using in path (defense-in-depth)
+         if not _SAFE_SUBJECT_ID_PATTERN.match(resolved_case_id):
+             raise ValueError(
+                 f"Invalid subject_id format: {resolved_case_id!r}. Expected format: sub-strokeXXXX"
+             )
+
+         # Load exactly one shard (1 case per parquet file in this dataset)
+         data_file = isles24_train_data_file(resolved_case_id)
+         ds = load_dataset(
+             self.dataset_id,
+             data_files={"train": data_file},
+             split="train",
+             token=self.token,
+             revision=self.revision,
+         )
+         ds = ds.select_columns(["subject_id", "dwi", "adc", "lesion_mask"])
+         if len(ds) != 1:
+             raise RuntimeError(f"Expected 1 row for {resolved_case_id}, got {len(ds)}")
+
+         row = ds[0]
+         subject_id = row["subject_id"]
+         if subject_id != resolved_case_id:
+             raise RuntimeError(
+                 f"Unexpected subject_id {subject_id!r} in {data_file} (expected {resolved_case_id!r})"
+             )
+
+         if self._temp_dir is None:
+             self._temp_dir = Path(tempfile.mkdtemp(prefix="isles24_hf_wrapper_"))
+
+         case_dir = self._temp_dir / subject_id
+         case_dir.mkdir(exist_ok=True)
+
+         dwi_path = case_dir / f"{subject_id}_dwi.nii.gz"
+         adc_path = case_dir / f"{subject_id}_adc.nii.gz"
+
+         if not dwi_path.exists():
+             row["dwi"].to_filename(str(dwi_path))
+         if not adc_path.exists():
+             row["adc"].to_filename(str(adc_path))
+
+         case_files: CaseFiles = {
+             "dwi": dwi_path,
+             "adc": adc_path,
+         }
+
+         if row.get("lesion_mask") is not None:
+             mask_path = case_dir / f"{subject_id}_lesion-msk.nii.gz"
+             if not mask_path.exists():
+                 row["lesion_mask"].to_filename(str(mask_path))
+             case_files["ground_truth"] = mask_path
+
+         return case_files
+
+     def cleanup(self) -> None:
+         if self._temp_dir and self._temp_dir.exists():
+             try:
+                 shutil.rmtree(self._temp_dir)
+             except OSError as e:
+                 logger.warning("Failed to cleanup temp directory %s: %s", self._temp_dir, e)
+             self._temp_dir = None
+
+
def load_isles_dataset(
    source: str | Path | None = None,
    *,

@@ -217,6 +330,9 @@ def load_isles_dataset(
    dataset_id = str(source) if source else settings.hf_dataset_id
    hf_token = token if token is not None else settings.hf_token

+     if dataset_id == ISLES24_DATASET_ID:
+         return Isles24HuggingFaceDataset(dataset_id=dataset_id, token=hf_token)
+
    # Load dataset, selecting only necessary columns to minimize decoding overhead
    # We rely on neuroimaging-go-brrrr's Nifti feature for lazy loading if configured,
    # but select_columns ensures we don't touch other modalities.
tests/data/test_isles24_dataset.py ADDED
@@ -0,0 +1,78 @@
+ """Unit tests for ISLES24 HF dataset fast-path loader."""
+
+ from __future__ import annotations
+
+ from typing import TYPE_CHECKING
+ from unittest.mock import MagicMock, patch
+
+ import pytest
+
+ from stroke_deepisles_demo.data.isles24_manifest import (
+     ISLES24_DATASET_ID,
+     ISLES24_DATASET_REVISION,
+     ISLES24_TRAIN_CASE_IDS,
+     isles24_train_data_file,
+ )
+ from stroke_deepisles_demo.data.loader import Isles24HuggingFaceDataset
+
+ if TYPE_CHECKING:
+     from pathlib import Path
+
+
+ def test_list_case_ids_returns_manifest() -> None:
+     dataset = Isles24HuggingFaceDataset()
+     assert dataset.list_case_ids() == list(ISLES24_TRAIN_CASE_IDS)
+     assert len(dataset) == len(ISLES24_TRAIN_CASE_IDS)
+
+
+ def test_get_case_loads_single_parquet_shard(tmp_path: Path) -> None:
+     mock_dwi = MagicMock()
+     mock_adc = MagicMock()
+
+     mock_ds = MagicMock()
+     mock_ds.select_columns.return_value = mock_ds
+     mock_ds.__len__.return_value = 1
+     mock_ds.__getitem__.return_value = {
+         "subject_id": "sub-stroke0001",
+         "dwi": mock_dwi,
+         "adc": mock_adc,
+         "lesion_mask": None,
+     }
+
+     temp_root = tmp_path / "hf_tmp"
+     temp_root.mkdir()
+
+     with (
+         patch("datasets.load_dataset", return_value=mock_ds) as mock_load,
+         patch("stroke_deepisles_demo.data.loader.tempfile.mkdtemp", return_value=str(temp_root)),
+     ):
+         dataset = Isles24HuggingFaceDataset(token="hf_token_123")
+         with dataset:
+             case = dataset.get_case("sub-stroke0001")
+
+         # Uses pinned dataset settings + per-shard data_files selection.
+         mock_load.assert_called_once_with(
+             ISLES24_DATASET_ID,
+             data_files={"train": isles24_train_data_file("sub-stroke0001")},
+             split="train",
+             token="hf_token_123",
+             revision=ISLES24_DATASET_REVISION,
+         )
+
+     assert case["dwi"].name == "sub-stroke0001_dwi.nii.gz"
+     assert case["adc"].name == "sub-stroke0001_adc.nii.gz"
+     assert case["dwi"].parent == temp_root / "sub-stroke0001"
+     assert case["adc"].parent == temp_root / "sub-stroke0001"
+
+     # Materializes NIfTI objects via to_filename().
+     assert mock_dwi.to_filename.call_count == 1
+     assert mock_adc.to_filename.call_count == 1
+
+     # Temp dir cleaned up by context manager.
+     assert not temp_root.exists()
+
+
+ def test_get_case_rejects_unknown_case_id() -> None:
+     dataset = Isles24HuggingFaceDataset()
+     with pytest.raises(KeyError):
+         _ = dataset.get_case("sub-stroke9999")
tests/data/test_loader.py CHANGED
@@ -2,24 +2,19 @@

from __future__ import annotations

- import os
from typing import TYPE_CHECKING
from unittest.mock import MagicMock, patch

- import pytest
-
from stroke_deepisles_demo.data.adapter import LocalDataset
- from stroke_deepisles_demo.data.loader import HuggingFaceDatasetWrapper, load_isles_dataset
+ from stroke_deepisles_demo.data.loader import (
+     HuggingFaceDatasetWrapper,
+     Isles24HuggingFaceDataset,
+     load_isles_dataset,
+ )

if TYPE_CHECKING:
    from pathlib import Path

- # Skip tests that download large datasets in CI (limited disk space)
- SKIP_IN_CI = pytest.mark.skipif(
-     os.environ.get("CI") == "true",
-     reason="Skips large HuggingFace downloads in CI (disk space)",
- )
-

def test_load_from_local_returns_local_dataset(synthetic_isles_dir: Path) -> None:
    """Test that loading from local path returns a LocalDataset."""

@@ -51,15 +46,12 @@ def test_load_hf_calls_load_dataset() -> None:
    assert mock_load.call_args[0][0] == "my/dataset"


- @pytest.mark.integration
- @SKIP_IN_CI
def test_load_from_huggingface_returns_hf_dataset() -> None:
-     """Test that loading from HuggingFace returns a HuggingFaceDatasetWrapper.
+     """Test that loading from HuggingFace returns an Isles24HuggingFaceDataset.

-     Note: Skipped in CI due to large download size (~GB) and limited disk space.
-     Run locally with: pytest -m integration tests/data/test_loader.py
+     This should not trigger a full dataset download: the default dataset uses a
+     pinned manifest for case IDs and per-case shard loading.
    """
    with load_isles_dataset() as dataset:  # Default is HuggingFace mode
-         assert isinstance(dataset, HuggingFaceDatasetWrapper)
-         # We can't guarantee length if we don't mock, but we can check type
-         # Real test might fail if network issue or auth issue
+         assert isinstance(dataset, Isles24HuggingFaceDataset)
+         assert len(dataset.list_case_ids()) == len(dataset)