DynamicVLA: DOM (full fine-tune checkpoint)

A DynamicVLA policy trained on the DOM dataset (hzxie/DOM) for dynamic-object manipulation.

⚠️ Mid-training checkpoint (~epoch 16, train loss ≈ 0.0007–0.002). Self-contained and eval-ready (includes normalization buffers), but optimizer/scheduler state is not included, so optimizer momentum cannot be resumed from this file.

Files in this repo

  • model.safetensors + config.json (root): latest checkpoint (~epoch 16, a mid-epoch step snapshot, refreshed as training proceeds).
  • epoch0005/, epoch0010/: clean epoch-milestone checkpoints, saved at the end of those epochs; load with subfolder="epoch0005" etc. The folder name uses the internal 0-based epoch_idx, so folder index N corresponds to the log's "Epoch N+1" (e.g. epoch0010 = the completed "Epoch 11").
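
The folder-naming rule above is easy to get off by one, so here is a minimal sketch of the conversion in both directions. The helper names are hypothetical and not part of the repository; the four-digit zero padding is inferred from the folder names epoch0005 and epoch0010.

```python
import re


def folder_to_log_epoch(folder: str) -> int:
    """Map a checkpoint folder such as 'epoch0010' to the epoch number shown in the log."""
    match = re.fullmatch(r"epoch(\d+)/?", folder)
    if match is None:
        raise ValueError(f"not an epoch folder: {folder!r}")
    epoch_idx = int(match.group(1))  # internal index used in the folder name
    return epoch_idx + 1             # the log reports "Epoch N+1"


def log_epoch_to_folder(log_epoch: int) -> str:
    """Map a log epoch number (1, 2, ...) back to the folder name (assumed 4-digit padding)."""
    if log_epoch < 1:
        raise ValueError("log epochs start at 1")
    return f"epoch{log_epoch - 1:04d}"


print(folder_to_log_epoch("epoch0010"))  # 11
print(log_epoch_to_folder(6))            # epoch0005
```

The returned folder name can be passed as the subfolder argument when loading a milestone checkpoint.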

Model

  • Architecture: DynamicVLA = SmolLM2-360M VLM backbone (16 layers) + FastViT vision encoder + flow-matching action expert (cross-attention bridge, temporal-attention fusion).
  • Full fine-tune: vision, text, and connector are unfrozen (freeze_* = False), so all 430M parameters are trainable (the stock config freezes the backbone and trains only ~99M).
  • Action chunk / horizon 20, 2 observation steps, delta actions, images padded to 384×384, cameras opst_cam + wrist_cam.

Training

  • Hardware: 8× NVIDIA H200.
  • Effective global batch 1280 = 80/GPU × 8 GPUs × grad-accum 2 (matches the paper's effective batch; the paper used 32× A100 × 40/GPU = 1280).
  • AdamW, lr 1e-4, betas (0.9, 0.95), wd 1e-10, cosine schedule + 1000 warmup steps.
  • This run does only the paper's mid-training stage on DOM (no COYO vision-language pre-training, no real-robot post-training), from off-the-shelf SmolLM2-360M + FastVLM-0.5B init.
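
The batch arithmetic and learning-rate schedule listed above can be checked with a small sketch. This is not the training code: it assumes linear warmup followed by cosine decay to zero, and total_steps is a hypothetical value, since the card does not state the run length.

```python
import math


def effective_batch(per_gpu: int, num_gpus: int, grad_accum: int = 1) -> int:
    """Samples contributing to one optimizer step."""
    return per_gpu * num_gpus * grad_accum


def lr_at(step: int, total_steps: int, base_lr: float = 1e-4,
          warmup_steps: int = 1000, min_lr: float = 0.0) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay (assumed shape)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # clamp so steps past the end stay at min_lr
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


print(effective_batch(80, 8, grad_accum=2))  # this run: 1280
print(effective_batch(40, 32))               # paper setup: 1280

TOTAL_STEPS = 10_000  # hypothetical run length, for illustration only
for step in (0, 999, 5_500, 10_000):
    print(step, lr_at(step, TOTAL_STEPS))
```

Both configurations reach the same effective batch, which is why the learning rate can be kept at the same value as in the paper's setup.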

Load / evaluate

from policies.dynamicvla.modeling_dynamicvla import DynamicVLAPolicy
policy = DynamicVLAPolicy.from_pretrained("mickeykang/dynamic-vla-DOM")               # latest (~epoch 16)
# policy = DynamicVLAPolicy.from_pretrained("mickeykang/dynamic-vla-DOM", subfolder="epoch0010")
policy.eval().cuda()

from_pretrained restores the normalization buffers from model.safetensors, so no dataset is needed to load/infer. For the DOM benchmark, serve with scripts/inference.py -p <dir> against the Isaac Lab simulations/evaluate.py eval server.

Notes

  • DOM contains some corrupt/truncated videos; a local utils/datasets.py resilience patch (substitute a valid sample on any decode error) is needed to train on the full set, but not to load/eval this checkpoint.
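
The resilience patch itself is not included here. As a rough sketch of the idea (substitute a valid sample when decoding fails), a wrapper around any indexable dataset could look like the following. The class name, the retry limit, and the random choice of replacement index are all assumptions, not the contents of the actual utils/datasets.py patch.

```python
import random


class ResilientDataset:
    """Wrap an indexable dataset and substitute another sample when loading one fails."""

    def __init__(self, dataset, max_retries: int = 10, seed: int = 0):
        self.dataset = dataset
        self.max_retries = max_retries
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        # Keep normal out-of-range behaviour instead of masking it as a decode error.
        if not 0 <= index < len(self.dataset):
            raise IndexError(index)
        for _ in range(self.max_retries + 1):
            try:
                return self.dataset[index]
            except Exception:
                # Any decode error: fall back to a randomly chosen replacement index.
                index = self.rng.randrange(len(self.dataset))
        raise RuntimeError("no decodable sample found within the retry limit")
```

Catching every exception mirrors the "any decode error" wording above; a stricter version would catch only the decoder's own exception types and log each substitution so the corrupt files can be tracked.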
Safetensors model size: 0.4B params. Tensor types: F32, BF16.