LightOnOCR-2-1B-bbox-FP8

An FP8 dynamically quantized version of lightonai/LightOnOCR-2-1B-bbox for faster inference.

Quantization Details

  • Method: FP8 dynamic quantization using llmcompressor
  • Scheme: FP8_DYNAMIC (weights and activations)
  • Ignored layers: lm_head
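
For reproducibility, here is a minimal sketch of how such a checkpoint is typically produced with llmcompressor's oneshot flow. The loader class and output directory are assumptions, not the exact script used for this release:

from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "lightonai/LightOnOCR-2-1B-bbox"

# Assumption: the base model loads via AutoModelForCausalLM; a multimodal
# checkpoint may require a different transformers auto-class.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

# FP8_DYNAMIC quantizes weights statically and activations dynamically,
# so no calibration dataset is needed; lm_head is kept at full precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)
model.save_pretrained("LightOnOCR-2-1B-bbox-FP8", save_compressed=True)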

Performance

Benchmarked on an NVIDIA B200 GPU with vLLM 0.14:

Precision   Throughput    Quality
BF16        1,570 tok/s   Baseline
FP8         2,023 tok/s   ✓ Match

A 1.29x throughput speedup with matching output quality.
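
The throughput figures can be reproduced with a simple offline timing run. The sketch below is an assumption about the methodology (a batch of identical single-image requests, timed end to end), not the exact harness used for the numbers above; the image URL is a placeholder:

import time
from vllm import LLM, SamplingParams

llm = LLM(model="richarddavison/LightOnOCR-2-1B-bbox-FP8", limit_mm_per_prompt={"image": 1})
sampling = SamplingParams(temperature=0.2, max_tokens=2048)

# A batch of identical page requests; swap in real document images.
page = {"type": "image_url", "image_url": {"url": "https://example.com/document.png"}}
conversations = [[{"role": "user", "content": [page]}] for _ in range(32)]

start = time.perf_counter()
results = llm.chat(conversations, sampling)
elapsed = time.perf_counter() - start

# Throughput counted over generated tokens only.
generated = sum(len(r.outputs[0].token_ids) for r in results)
print(f"{generated / elapsed:.0f} tok/s")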

Usage with vLLM

Start an OpenAI-compatible server:

vllm serve richarddavison/LightOnOCR-2-1B-bbox-FP8 \
    --limit-mm-per-prompt '{"image": 1}'
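
The running server can then be queried with any OpenAI-compatible client. A minimal sketch using the openai Python package, assuming the default port 8000 and the placeholder image URL from the offline example below:

from openai import OpenAI

# vLLM's server speaks the OpenAI chat API; the key is unused but required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="richarddavison/LightOnOCR-2-1B-bbox-FP8",
    messages=[{
        "role": "user",
        "content": [{"type": "image_url", "image_url": {"url": "https://example.com/document.png"}}],
    }],
    temperature=0.2,
    max_tokens=2048,
)
print(response.choices[0].message.content)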
For offline batch inference, use the Python API directly:

from vllm import LLM, SamplingParams

# Cap each prompt at one image, mirroring the serve flag above.
llm = LLM(model="richarddavison/LightOnOCR-2-1B-bbox-FP8", limit_mm_per_prompt={"image": 1})
sampling = SamplingParams(temperature=0.2, max_tokens=2048)

# One conversation per document page; the prompt is the image itself.
conversations = [[{
    "role": "user",
    "content": [{"type": "image_url", "image_url": {"url": "https://example.com/document.png"}}]
}]]

results = llm.chat(conversations, sampling)
print(results[0].outputs[0].text)

License

Apache 2.0 (same as base model)
