Llama-3.2-1B - ADPQ 4-bit Quantized

This work is part of a master's thesis. The library used for quantization is available as auto-adpq:

pip install auto-adpq

Model Description

This is a compressed version of meta-llama/Llama-3.2-1B created using 4-bit quantization.

This model was quantized to reduce VRAM usage and increase inference speed while preserving most of the original model's performance.
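
As a rough sense of why 4-bit helps (the numbers below are assumed for illustration, not measured on this checkpoint): 1B BF16 parameters take about 2.3 GiB, while packed 4-bit weights plus one BF16 scale per group of 128 weights take about 0.6 GiB.

n_params = 1.24e9                       # approximate Llama-3.2-1B parameter count
group_size = 128                        # assumed quantization group size
bf16 = n_params * 2                     # 2 bytes per BF16 weight
q4 = n_params * 0.5 + (n_params / group_size) * 2  # 4-bit weights + BF16 scales

print(f"BF16 : {bf16 / 2**30:.2f} GiB")  # ~2.31 GiB
print(f"4-bit: {q4 / 2**30:.2f} GiB")    # ~0.60 GiB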

Quantization Details

  • Original Model: meta-llama/Llama-3.2-1B
  • Quantization Method: ADPQ (Adaptive Quantization with data-free calibration)
  • Precision: 4-bit
  • Simulated: Yes (weights are quantized and dequantized back to BF16; see the sketch after this list)
  • Target Hardware: Compatible with consumer GPUs (e.g., RTX 3060/4090)
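
"Simulated" means the weights are quantized and immediately dequantized, so the tensors stay in BF16 but only take values on the 4-bit grid. A minimal sketch of symmetric group-wise fake quantization (illustrative only; this mirrors the config in the script below, not the library's internals):

import torch

def fake_quantize_symmetric(w, group_size=128, q_bit=4):
    # Quantize-dequantize a 2-D weight tensor per group of `group_size` columns.
    qmax = 2 ** (q_bit - 1) - 1                          # 7 for signed 4-bit
    rows, cols = w.shape                                 # cols must be divisible by group_size
    wg = w.reshape(rows, cols // group_size, group_size)
    scale = wg.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / qmax
    q = torch.clamp(torch.round(wg / scale), -qmax - 1, qmax)
    return (q * scale).reshape_as(w)                     # same shape/dtype, 4-bit-valued

w = torch.randn(256, 512)
w_sim = fake_quantize_symmetric(w).to(torch.bfloat16)    # what a "simulated" tensor looks like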

How to Use

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Tfloow/Llama-3.2-1B-adpq-4bit-sim"

model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0]))
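
Note: because the quantization here is simulated (data_packing=False in the script below), the uploaded tensors are stored dequantized in BF16. The checkpoint reproduces the quantized weight values for quality evaluation; the packed 4-bit footprint is what the estimate above approximates.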

Performance

Model                                  PPL (lower is better)
unsloth/Meta-Llama-3.2-1B              6.5546
unsloth/Meta-Llama-3.2-1B-bnb-4bit     6.9971
unsloth/Meta-Llama-3.2-1B-adpq         7.5700
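
The evaluation dataset and protocol are not stated above; a common recipe for perplexity numbers like these is fixed-window scoring on the WikiText-2 test split, along these lines (illustrative, not necessarily the exact script used):

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Tfloow/Llama-3.2-1B-adpq-4bit-sim"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Concatenate the test split and score fixed-length windows
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

window, nlls = 2048, []
for i in range(0, ids.size(1) - window, window):
    chunk = ids[:, i : i + window]
    with torch.no_grad():
        nlls.append(model(chunk, labels=chunk).loss)  # mean NLL of the window

print(f"PPL: {torch.stack(nlls).mean().exp().item():.4f}")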

How was the model quantized?


import torch
from transformers import AutoModelForCausalLM

from auto_adpq import Auto_AdpQ, AutoAdpQConfig

model_name = "meta-llama/Llama-3.2-1B"

# Set up the Auto-AdpQ configuration
group_size = 128  # assumed value; the group size used for this upload was not recorded
adpq_config = AutoAdpQConfig(
    group_size=group_size,
    n_iters=30,  # Seems quite slow otherwise
    alpha=0.08,
    device="cpu",
    q_bit=4,                        # 4-bit precision
    data_packing=False,             # simulated quantization: keep dequantized BF16 weights
    symmetrical_quantization=True,  # symmetric grid, as sketched above
)

adpq = Auto_AdpQ(config=adpq_config)


model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

print(model.dtype)

quantized_model = adpq.quantize_model_multithreaded(model, max_workers=16)

adpq.save_pretrained("quantized/")
adpq.fuse_model_from_pretrained(model, "quantized/")

model.push_to_hub(f"Tfloow/{model_name.split('/')[-1]}-adpq-4bit-sim")
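
The save/fuse pair above suggests the saved tensors can be re-applied to a fresh copy of the base model later. A sketch of that reload path, reusing only the calls shown in the script (the directory layout is assumed to match save_pretrained above, and the config must match the one used at quantization time):

import torch
from transformers import AutoModelForCausalLM
from auto_adpq import Auto_AdpQ, AutoAdpQConfig

# Same settings as at quantization time (group_size of 128 assumed, as above)
adpq = Auto_AdpQ(config=AutoAdpQConfig(
    group_size=128, q_bit=4, device="cpu",
    data_packing=False, symmetrical_quantization=True,
))

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", torch_dtype=torch.bfloat16
)
adpq.fuse_model_from_pretrained(model, "quantized/")  # re-apply the saved quantized weights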