arxiv:2510.19482

ELUTQ: Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices

Published on Oct 22

Authors:

Abstract

ELUTQ, a novel quantization framework using Hierarchical Linear Quantization, improves weight quantization efficiency and reduces perplexity for large language models, enabling deployment on edge devices with limited resources.

AI-generated summary

Weight quantization effectively reduces memory consumption and enables the deployment of large language models on CPU-based edge devices, yet existing hardware-friendly methods often rely on uniform quantization, which suffers from poor weight-distribution fitting and high dequantization overhead under low-bit settings. In this paper, we propose ELUTQ, an efficient quantization framework featuring a novel quantization format termed Hierarchical Linear Quantization (HLQ). HLQ is designed to better capture the statistical characteristics of weights without increasing the computational cost of bit-serial LUT-based GEMM operations, thereby eliminating dequantization overhead. HLQ is orthogonal to existing quantization algorithms. For the LLaMA3.1-8B model, when combined with post-training quantization, HLQ improves uniform quantization by achieving approximately 8 percent perplexity reduction at 3-bit precision and 85 percent perplexity reduction at 2-bit precision. When combined with efficient finetuning techniques, HLQ further improves model accuracy. We also integrate a disk-offload technique into ELUTQ, enabling it to complete the quantization of LLaMA3.1-70B using only 64 GB of CPU memory and 48 GB of VRAM, significantly reducing the hardware requirements for large-scale model quantization. To enable efficient deployment on edge devices, ELUTQ provides high-performance CPU kernels to support end-to-end inference. Under a 4-thread configuration with batch size 1, our 2-bit quantized LLaMA2-7B model achieves a throughput of more than 25 tokens per second on an Apple M2 chip. All the code is available at https://github.com/Nkniexin/ELUTQ.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.19482 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2510.19482 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2510.19482 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.