LightOnOCR-2-1B-bbox-FP8

An FP8 dynamically quantized version of lightonai/LightOnOCR-2-1B-bbox for faster inference.

Quantization Details

  • Method: FP8 dynamic quantization using llmcompressor
  • Scheme: FP8_DYNAMIC (weights and activations)
  • Ignored layers: lm_head
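
For reproducibility, here is a minimal sketch of how such a checkpoint is typically produced with llmcompressor's oneshot flow. The loader class and output directory are assumptions, not the exact script used for this release:

from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "lightonai/LightOnOCR-2-1B-bbox"

# Assumption: the base model loads via AutoModelForCausalLM; a multimodal
# checkpoint may require a different transformers auto-class.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

# FP8_DYNAMIC quantizes weights statically and activations dynamically,
# so no calibration dataset is needed; lm_head is kept at full precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)
model.save_pretrained("LightOnOCR-2-1B-bbox-FP8", save_compressed=True)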

Performance

Benchmarked on an NVIDIA B200 GPU with vLLM 0.14:

Precision   Throughput    Quality
BF16        1,570 tok/s   Baseline
FP8         2,023 tok/s   ✓ Match

A 1.29x throughput speedup with matching output quality.
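
The throughput figures can be reproduced with a simple offline timing run. The sketch below is an assumption about the methodology (a batch of identical single-image requests, timed end to end), not the exact harness used for the numbers above; the image URL is a placeholder:

import time
from vllm import LLM, SamplingParams

llm = LLM(model="richarddavison/LightOnOCR-2-1B-bbox-FP8", limit_mm_per_prompt={"image": 1})
sampling = SamplingParams(temperature=0.2, max_tokens=2048)

# A batch of identical page requests; swap in real document images.
page = {"type": "image_url", "image_url": {"url": "https://example.com/document.png"}}
conversations = [[{"role": "user", "content": [page]}] for _ in range(32)]

start = time.perf_counter()
results = llm.chat(conversations, sampling)
elapsed = time.perf_counter() - start

# Throughput counted over generated tokens only.
generated = sum(len(r.outputs[0].token_ids) for r in results)
print(f"{generated / elapsed:.0f} tok/s")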

Usage with vLLM

Start an OpenAI-compatible server:

vllm serve richarddavison/LightOnOCR-2-1B-bbox-FP8 \
    --limit-mm-per-prompt '{"image": 1}'
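
The running server can then be queried with any OpenAI-compatible client. A minimal sketch using the openai Python package, assuming the default port 8000 and the placeholder image URL from the offline example below:

from openai import OpenAI

# vLLM's server speaks the OpenAI chat API; the key is unused but required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="richarddavison/LightOnOCR-2-1B-bbox-FP8",
    messages=[{
        "role": "user",
        "content": [{"type": "image_url", "image_url": {"url": "https://example.com/document.png"}}],
    }],
    temperature=0.2,
    max_tokens=2048,
)
print(response.choices[0].message.content)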
For offline batch inference, use the Python API directly:

from vllm import LLM, SamplingParams

# Cap each prompt at one image, mirroring the serve flag above.
llm = LLM(model="richarddavison/LightOnOCR-2-1B-bbox-FP8", limit_mm_per_prompt={"image": 1})
sampling = SamplingParams(temperature=0.2, max_tokens=2048)

# One conversation per document page; the prompt is the image itself.
conversations = [[{
    "role": "user",
    "content": [{"type": "image_url", "image_url": {"url": "https://example.com/document.png"}}]
}]]

results = llm.chat(conversations, sampling)
print(results[0].outputs[0].text)

License

Apache 2.0 (same as base model)
