CUDA OOM on 96GB VRAM during loading
Description
I am trying to load the Qwen3-Next-80B-A3B-Instruct-FP8 model on a single NVIDIA RTX 6000 GPU (approx. 95GB VRAM).
Although the FP8 weights should be roughly 80GB and the GPU has ~95GB of VRAM, loading crashes at 25% (shard 2/8) with a CUDA OOM error.
The error message indicates severe memory fragmentation:
"Of the allocated memory 21.12 GiB is allocated by PyTorch, and 73.21 GiB is reserved by PyTorch but unallocated."
It seems PyTorch reserves almost all available VRAM immediately but fails to allocate new segments for the subsequent shards, even though the actual used memory is only ~21GB at that point.
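To make the accounting concrete, the figures in the error message can be reconciled as follows (a rough sketch using the numbers from the log below; I am assuming the ~0.6 GiB remainder is non-PyTorch memory such as the CUDA context):

```python
# Figures taken from the OOM message (all GiB).
total_capacity = 94.97   # "GPU 0 has a total capacity of 94.97 GiB"
in_use = 94.96           # "this process has 94.96 GiB memory in use"
allocated = 21.12        # allocated by PyTorch (live tensors)
reserved_unused = 73.21  # reserved by PyTorch but unallocated

# The caching allocator's reservation = live tensors + cached-but-free segments.
reserved = allocated + reserved_unused
print(f"reserved by PyTorch: {reserved:.2f} GiB")  # 94.33 GiB

# Whatever remains is non-PyTorch memory (CUDA context, driver, etc.).
other = in_use - reserved
print(f"non-PyTorch overhead: {other:.2f} GiB")  # ~0.63 GiB
```

So nearly the entire card is tied up in the allocator's reservation while only ~21 GiB holds actual tensors.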
Environment
- Model: Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 (local path)
- GPU: NVIDIA RTX 6000 (Ada/Blackwell generation), 94.97 GiB total capacity
- Libraries: transformers (latest), torch (2.x), accelerate installed
- CUDA: 13.0
Reproduction Code
from transformers import AutoModel
import torch

# Path to the downloaded FP8 model
MODEL_PATH = "/path/to/Qwen3-Next-80B-A3B-Instruct-FP8"

# Simple loading with auto device map
model = AutoModel.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    device_map="auto",
)
Error Log
Loading checkpoint shards:  25%|██▌       | 2/8 [00:10<00:30, 5.02s/it]
...
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 94.97 GiB of which 1.88 MiB is free. Including non-PyTorch memory, this process has 94.96 GiB memory in use. Of the allocated memory 21.12 GiB is allocated by PyTorch, and 73.21 GiB is reserved by PyTorch but unallocated.
Question
Given that the GPU has significantly more memory (~95GB) than the theoretical model size (~80GB), why does the loader end up with 73GB of reserved-but-unallocated memory this early in the process?
Are there specific quantization_config settings or environment variables required to load this FP8 checkpoint correctly without triggering this fragmentation?
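For reference, the kind of setting I have in mind is PyTorch's caching-allocator config, e.g. expandable segments, which is documented as a way to reduce fragmentation. A minimal sketch of how it would be applied (untested with this model, so I do not know whether it is the right fix here):

```python
import os

# PYTORCH_CUDA_ALLOC_CONF must be set before the first CUDA allocation
# (easiest: before importing torch / calling from_pretrained).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# ...then load exactly as in the reproduction code above:
# model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True,
#                                   device_map="auto")
```

Is this the recommended approach for large FP8 checkpoints, or is a dedicated quantization_config needed instead?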