# LightOnOCR-2-1B-bbox-FP8
An FP8 dynamically quantized version of [lightonai/LightOnOCR-2-1B-bbox](https://huggingface.co/lightonai/LightOnOCR-2-1B-bbox) for faster inference.
## Quantization Details
- Method: FP8 dynamic quantization using [llmcompressor](https://github.com/vllm-project/llm-compressor) (see the sketch below)
- Scheme: FP8_DYNAMIC (FP8 weights, dynamic per-token FP8 activations)
- Ignored layers: `lm_head`
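For reference, this is roughly how such a checkpoint is produced with llmcompressor. It is a minimal sketch, not the exact script used for this model: the `AutoModelForImageTextToText` class and the save directory name are assumptions.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "lightonai/LightOnOCR-2-1B-bbox"

# NOTE: the exact Auto class for this model is an assumption; adjust if needed.
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# FP8_DYNAMIC needs no calibration data: weights are quantized offline,
# activations are quantized per token at runtime.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = "LightOnOCR-2-1B-bbox-FP8"
model.save_pretrained(save_dir)
processor.save_pretrained(save_dir)
```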
## Performance

Benchmarked on an NVIDIA B200 GPU with vLLM 0.14:
| Precision | Throughput (tok/s) | Output quality |
|---|---|---|
| BF16 | 1,570 | Baseline |
| FP8 | 2,023 | Matches baseline |
1.29x speedup with identical output quality.
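A rough way to sanity-check these numbers offline is shown below. This is a minimal sketch, not the original benchmark harness: the image URLs, batch size of 32, and single-GPU setup are assumptions.

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="richarddavison/LightOnOCR-2-1B-bbox-FP8",
          limit_mm_per_prompt={"image": 1})
sampling = SamplingParams(temperature=0.2, max_tokens=2048)

# Hypothetical page URLs; substitute your own documents.
conversations = [[{
    "role": "user",
    "content": [{"type": "image_url",
                 "image_url": {"url": f"https://example.com/page_{i}.png"}}],
}] for i in range(32)]

start = time.perf_counter()
results = llm.chat(conversations, sampling)
elapsed = time.perf_counter() - start

# Aggregate generated tokens across all requests to get tok/s.
generated = sum(len(r.outputs[0].token_ids) for r in results)
print(f"{generated / elapsed:.0f} tok/s")
```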
## Usage with vLLM
Serve the model with an OpenAI-compatible API:

```bash
vllm serve richarddavison/LightOnOCR-2-1B-bbox-FP8 \
  --limit-mm-per-prompt '{"image": 1}'
```
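Once the server is up, you can query it with any OpenAI-compatible client. A sketch, assuming the default local endpoint at `http://localhost:8000/v1` and a placeholder document URL:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="richarddavison/LightOnOCR-2-1B-bbox-FP8",
    messages=[{
        "role": "user",
        "content": [{"type": "image_url",
                     "image_url": {"url": "https://example.com/document.png"}}],
    }],
    temperature=0.2,
    max_tokens=2048,
)
print(resp.choices[0].message.content)
```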
For offline batch inference, use the `LLM` API directly:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="richarddavison/LightOnOCR-2-1B-bbox-FP8",
    limit_mm_per_prompt={"image": 1},  # one image per prompt
)
sampling = SamplingParams(temperature=0.2, max_tokens=2048)

# One conversation per document image.
conversations = [[{
    "role": "user",
    "content": [{"type": "image_url",
                 "image_url": {"url": "https://example.com/document.png"}}],
}]]

results = llm.chat(conversations, sampling)
print(results[0].outputs[0].text)
```
## License

Apache 2.0 (same as the base model).