{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2f80b2a3",
   "metadata": {},
   "source": [
    "# VertebralBodiesCT-Neighbors — Dataset Generation Notebook\n",
    "\n",
    "This notebook documents and demonstrates how we generated the training data for the VertebralBodiesCT-Neighbors model. We publish this notebook (not the actual images/labels) to make the process transparent and reproducible.\n",
    "\n",
    "## TL;DR\n",
    "- Task: Create 3D crops around an index (center) vertebra including its two neighbors (above/below).\n",
    "- Inputs per crop (2 channels):\n",
    "  1) CT crop (`*_0000.nii.gz`)\n",
    "  2) Center vertebra marker (small sphere at centroid, `*_0001.nii.gz`)\n",
    "- Labels (mutually exclusive): 0=background, 1=other, 2=center, 3=above, 4=below, 5=ignore.\n",
    "- Bounding boxes: fixed symmetric padding (8 mm, 22 mm, 70 mm for x,y,z) with optional dynamic expansion (+10 mm per iteration, up to 6) if a neighbor touches a crop border.\n",
    "- Ignore label: used to mask regions above T1 as being an out-of-scope area.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed208ae0",
   "metadata": {},
   "source": [
    "# Creation of Bounding Box Images and Labels\n",
    "\n",
    "- Aim: create a dataset for the training of a nnUNet\n",
    "- Method: create bounding boxes around each *index vertebra* including its adjacent neighbors\n",
    "- Orientation follows (z, y, x)\n",
    "- Processing: Multiprocessing implementation using external module for macOS compatibility\n",
    "\n",
    "## Model Input / Output\n",
    "- Input: vertebral bodies with neighbor context, 2 channels\n",
    "  - image *_0000.nii.gz*\n",
    "  - center vertebra marker (small sphere) *_0001.nii.gz*\n",
    "\n",
    "- Output labels (VertebralBodiesCT-Neighbors training format):\n",
    "  - 0: background\n",
    "  - 1: other vertebrae (any vertebra except center/above/below)\n",
    "  - 2: center vertebra\n",
    "  - 3: above vertebra\n",
    "  - 4: below vertebra\n",
    "  - 5: ignore region (areas above T1)\n",
    "\n",
    "## Bounding Boxes (Fixed Padding):\n",
    "Based on q95 neighbor distance analysis, using **fixed symmetric padding** for simplicity and efficiency:\n",
    "- X-axis (Left-Right): pad_x = 8mm\n",
    "- Y-axis (Anterior-Posterior): pad_y = 22mm\n",
    "- Z-axis (Cranial-Caudal): pad_z = 70mm\n",
    "\n",
    "## Dynamic Expansion:\n",
    "- Border extension rule: After projecting neighbor labels into the crop, if the neighbor mask touches the crop boundary within 1 voxel and the image has room, expand that axis by +10mm iteratively (max 6 iterations)\n",
    "- Clamp boxes to image bounds at the end (after any expansion)\n",
    "\n",
    "## Algorithm\n",
    "1. Load CT and labels\n",
    "2. Loop through all vertebrae, for each vertebra i present:\n",
    "   - Extract spacing to transform mm into voxels\n",
    "   - Create training labels (0=bg, 1=other, 2=center, 3=above, 4=below)\n",
    "   - Generate center vertebra marker (small sphere at centroid)\n",
    "   - Compute bounding box with fixed padding (8mm/22mm/70mm)\n",
    "   - Apply dynamic expansion if neighbors touch boundaries\n",
    "   - Apply ignore mask for regions above T1 (label 5)\n",
    "   - Extract and save images and labels:\n",
    "     - CT crop (images/{filename}_{i}_0000.nii.gz)\n",
    "     - center marker (images/{filename}_{i}_0001.nii.gz)\n",
    "     - neighbor labels (labels/{filename}_{i}.nii.gz)\n",
    "\n",
    "## Input Labels (Original Segmentation)\n",
    "\n",
    "```json\n",
    " \"labels\": {\n",
    "   \"background\": 0,\n",
    "    \"T1\": 1,\n",
    "    \"T2\": 2,\n",
    "    \"T3\": 3,\n",
    "    \"T4\": 4,\n",
    "    \"T5\": 5,\n",
    "    \"T6\": 6,\n",
    "    \"T7\": 7,\n",
    "    \"T8\": 8,\n",
    "    \"T9\": 9,\n",
    "    \"T10\": 10,\n",
    "    \"T11\": 11,\n",
    "    \"T12\": 12,\n",
    "    \"L1\": 13,\n",
    "    \"L2\": 14,\n",
    "    \"L3\": 15,\n",
    "    \"L4\": 16,\n",
    "    \"L5\": 17,\n",
    "    \"L6\": 18,\n",
    "    \"sacrum\": 19,\n",
    "    \"coccyx\": 20,\n",
    "    \"T13\": 21\n",
    " }\n",
    "```\n",
    "\n",
    "## Output Labels (VertebralBodiesCT-Neighbors Training)\n",
    "```json\n",
    " \"labels\": {\n",
    "   \"background\": 0,\n",
    "   \"other_vertebrae\": 1,\n",
    "   \"center_vertebra\": 2,\n",
    "   \"above_vertebra\": 3,\n",
    "   \"below_vertebra\": 4,\n",
    "   \"ignore\": 5\n",
    " }\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93de4fa3",
   "metadata": {},
   "source": [
    "# Load dependencies and files"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "77aadb7c",
   "metadata": {},
   "source": [
    "## Inputs, Outputs, and Directory Layout\n",
    "\n",
    "Inputs\n",
    "- CT volumes and vertebrae labelmaps (thoracic/lumbar bodies) following your local layout.\n",
    "- The notebook expects separate training and testing folders (imagesTr/imagesTs and labelsTr/labelsTs) similar to nnU-Net’s structure.\n",
    "\n",
    "Outputs\n",
    "- Two-channel image crops and multi-class labels written under a `VertebralBodiesCT-Neighbors` subfolder next to your dataset root:\n",
    "  - `.../VertebralBodiesCT-Neighbors/imagesTr/Case_XXX_YYYY_0000.nii.gz` (CT)\n",
    "  - `.../VertebralBodiesCT-Neighbors/imagesTr/Case_XXX_YYYY_0001.nii.gz` (center marker)\n",
    "  - `.../VertebralBodiesCT-Neighbors/labelsTr/Case_XXX_YYYY.nii.gz` (classes 0..5)\n",
    "\n",
    "## How to Run\n",
    "1. Adjust paths at the top of the notebook (dataset root, input images/labels folders, output folders).\n",
    "2. Ensure required packages are installed: SimpleITK, numpy, pandas, tqdm.\n",
    "3. On macOS, the notebook sets multiprocessing start method to `spawn` for stability.\n",
    "4. Run all cells. Processing is parallelized; logs are written to a `logs` directory under the output dataset root.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6c41287c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import libraries\n",
    "from pathlib import Path\n",
    "from tqdm import tqdm\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import SimpleITK as sitk\n",
    "import logging\n",
    "from math import ceil\n",
    "from datetime import datetime\n",
    "import multiprocessing as mp\n",
    "from multiprocessing import Pool, Manager, Lock\n",
    "import os\n",
    "import sys\n",
    "\n",
    "# Configure multiprocessing for macOS (use 'spawn' to avoid issues)\n",
    "if sys.platform == 'darwin':  # macOS\n",
    "    mp.set_start_method('spawn', force=True)\n",
    "\n",
    "# Get number of CPU cores (reserve 1-2 cores for system)\n",
    "n_cores = max(1, mp.cpu_count() - 1)\n",
    "print(f\"Available CPU cores: {mp.cpu_count()}, using: {n_cores}\")\n",
    "\n",
    "# Configure logging to both file and console\n",
    "log_dir = Path('./logs')\n",
    "log_dir.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "# Create log filename with timestamp\n",
    "timestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n",
    "log_file = log_dir / f\"{timestamp}.log\"\n",
    "\n",
    "# Configure logging\n",
    "logging.basicConfig(\n",
    "    level=logging.DEBUG,\n",
    "    format='%(asctime)s - %(processName)s - %(levelname)s - %(message)s',\n",
    "    handlers=[\n",
    "        logging.FileHandler(log_file),  # Save to file\n",
    "        logging.StreamHandler()         # Print to console (only INFO and above)\n",
    "    ]\n",
    ")\n",
    "\n",
    "# Set console handler to INFO level only\n",
    "console_handler = logging.getLogger().handlers[1]\n",
    "console_handler.setLevel(logging.INFO)\n",
    "\n",
    "print(f\"Debug logging will be saved to: {log_file}\")\n",
    "\n",
    "# define directories\n",
    "dir_root = Path('./dataset')\n",
    "dir_sub_imagesTr = dir_root / 'VertebralBodiesCT-Labels/imagesTr'\n",
    "dir_sub_imagesTs = dir_root / 'VertebralBodiesCT-Labels/imagesTs'\n",
    "dir_sub_labelsTr = dir_root / 'VertebralBodiesCT-Labels/labelsTr'\n",
    "dir_sub_labelsTs = dir_root / 'VertebralBodiesCT-Labels/labelsTs'\n",
    "\n",
    "# define output directory\n",
    "dir_output_imagesTr = dir_root / 'VertebralBodiesCT-Neighbors-Labels' / 'imagesTr'\n",
    "dir_output_imagesTs = dir_root / 'VertebralBodiesCT-Neighbors-Labels' / 'imagesTs'\n",
    "dir_output_labelsTr = dir_root / 'VertebralBodiesCT-Neighbors-Labels' / 'labelsTr'\n",
    "dir_output_labelsTs = dir_root / 'VertebralBodiesCT-Neighbors-Labels' / 'labelsTs'\n",
    "\n",
    "# create output directories if they do not exist\n",
    "dir_output_imagesTr.mkdir(parents=True, exist_ok=True)\n",
    "dir_output_imagesTs.mkdir(parents=True, exist_ok=True)\n",
    "dir_output_labelsTr.mkdir(parents=True, exist_ok=True)\n",
    "dir_output_labelsTs.mkdir(parents=True, exist_ok=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7ca3629d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# get all image files in both directories, training and testing \n",
    "imagesTr_paths = [path for path in dir_sub_imagesTr.glob('*.nii.gz') if not path.name.startswith('.')]\n",
    "imagesTs_paths = [path for path in dir_sub_imagesTs.glob('*.nii.gz') if not path.name.startswith('.')]\n",
    "images_paths = [(path, 'Tr') for path in imagesTr_paths] + [(path, 'Ts') for path in imagesTs_paths]\n",
    "images_paths.sort(key=lambda x: x[0].name)\n",
    "print(f'images: n={len(images_paths)}, {images_paths[:3]}')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9ad4bcfd",
   "metadata": {},
   "source": [
    "## Processing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3949bcf1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import processing functions from external module\n",
    "# This is necessary for multiprocessing with 'spawn' method on macOS\n",
    "from DatasetGeneration_mp import process_single_case\n",
    "\n",
    "def update_progress(result):\n",
    "    \"\"\"Callback function for progress updates\"\"\"\n",
    "    if result['status'] == 'success':\n",
    "        print(f\"✓ Completed {result['case_id']}: {result['vertebrae_count']} vertebrae\")\n",
    "    else:\n",
    "        print(f\"✗ Failed {result['case_id']}: {result['error']}\")\n",
    "\n",
    "\n",
    "# Prepare arguments for parallel processing\n",
    "process_args = []\n",
    "for image_zip in images_paths:\n",
    "    args = (image_zip, dir_sub_labelsTr, dir_sub_labelsTs, \n",
    "            dir_output_imagesTr, dir_output_imagesTs, \n",
    "            dir_output_labelsTr, dir_output_labelsTs)\n",
    "    process_args.append(args)\n",
    "\n",
    "print(f\"Starting parallel processing of {len(process_args)} cases using {n_cores} cores...\")\n",
    "\n",
    "# Process with multiprocessing\n",
    "if __name__ == '__main__' or 'ipykernel' in sys.modules:  # Jupyter notebook compatibility\n",
    "    start_time = datetime.now()\n",
    "    \n",
    "    with Pool(processes=n_cores) as pool:\n",
    "        # Use imap for better progress tracking\n",
    "        results = []\n",
    "        with tqdm(total=len(process_args), desc=\"Processing cases\") as pbar:\n",
    "            for result in pool.imap(process_single_case, process_args):\n",
    "                results.append(result)\n",
    "                update_progress(result)\n",
    "                pbar.update(1)\n",
    "    \n",
    "    end_time = datetime.now()\n",
    "    processing_time = end_time - start_time\n",
    "    \n",
    "    # Calculate statistics\n",
    "    successful_cases = [r for r in results if r['status'] == 'success']\n",
    "    failed_cases = [r for r in results if r['status'] == 'failed']\n",
    "    total_vertebrae = sum(r['vertebrae_count'] for r in successful_cases)\n",
    "    \n",
    "    print(f\"\\n{'='*50}\")\n",
    "    print(f\"PROCESSING COMPLETED\")\n",
    "    print(f\"{'='*50}\")\n",
    "    print(f\"Total time: {processing_time}\")\n",
    "    print(f\"Total cases: {len(results)}\")\n",
    "    print(f\"Successful: {len(successful_cases)}\")\n",
    "    print(f\"Failed: {len(failed_cases)}\")\n",
    "    print(f\"Total vertebrae processed: {total_vertebrae}\")\n",
    "    print(f\"Average vertebrae per case: {total_vertebrae/len(successful_cases):.1f}\")\n",
    "    print(f\"Processing speed: {len(successful_cases)/processing_time.total_seconds()*60:.1f} cases/min\")\n",
    "    \n",
    "    if failed_cases:\n",
    "        print(f\"\\nFailed cases:\")\n",
    "        for case in failed_cases[:5]:  # Show first 5 failures\n",
    "            print(f\"  - {case['case_id']}: {case['error']}\")\n",
    "        if len(failed_cases) > 5:\n",
    "            print(f\"  ... and {len(failed_cases)-5} more\")\n",
    "    \n",
    "    # Save processing log\n",
    "    log_data = {\n",
    "        'processing_time': str(processing_time),\n",
    "        'total_cases': len(results),\n",
    "        'successful_cases': len(successful_cases),\n",
    "        'failed_cases': len(failed_cases),\n",
    "        'total_vertebrae': total_vertebrae,\n",
    "        'results': results\n",
    "    }\n",
    "    \n",
    "    log_summary_file = log_dir / f\"processing_summary_{timestamp}.json\"\n",
    "    import json\n",
    "    with open(log_summary_file, 'w') as f:\n",
    "        json.dump(log_data, f, indent=2, default=str)\n",
    "    \n",
    "    print(f\"\\nDetailed log saved to: {log_file}\")\n",
    "    print(f\"Summary saved to: {log_summary_file}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "bodycomposition-postprocessing",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}