# gpt-oss-20b-reap-0.5-mxfp4
This repository contains a quantized version of the openai/gpt-oss-20b model.
## Model Description
This model is a pruned and quantized version of the openai/gpt-oss-20b model.
- **Original Model:** openai/gpt-oss-20b
- **Pruning Method:** `reap` with a compression ratio of `0.5`
- **Quantization Method:** MXFP4 weight-only quantization
- **Dataset used for pruning/quantization:** theblackcat102/evol-codealpaca-v1
The quantization process targeted only the expert layers of the model, skipping the self-attention and router layers. This is standard practice for Mixture-of-Experts (MoE) models: the experts hold the bulk of the parameters, while attention and routing tend to be more sensitive to reduced precision.
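The skip rule described above can be sketched as a simple module-name filter. The helper and the example module names below are illustrative assumptions, not the exact gpt-oss checkpoint layout or the actual quantization script:

```python
# Hypothetical helper mirroring the layer-selection rule: quantize MoE
# expert weights, skip self-attention and router layers.

def should_quantize(module_name: str) -> bool:
    """Return True if a module's weights should be MXFP4-quantized."""
    skipped = ("self_attn", "router")  # attention and routing stay in higher precision
    if any(part in module_name for part in skipped):
        return False
    return "experts" in module_name    # only expert (FFN) weights are targeted

# Illustrative module names, not the real checkpoint layout:
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.router",
    "model.layers.0.mlp.experts.gate_up_proj",
    "model.layers.0.mlp.experts.down_proj",
]
targets = [n for n in names if should_quantize(n)]  # only the two expert projections
```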
## Usage
You can load this model with the `transformers` library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sandeshrajx/gpt-oss-20b-reap-0.5-mxfp4"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Use the model for inference
# ...
```
### Pruning Commands Used

The first pass runs the observer only, collecting expert activation statistics over the calibration samples; the second pass performs the REAP pruning at a 0.5 compression ratio:

```shell
python ./reap/src/reap/prune.py \
    --model-name "openai/gpt-oss-20b" \
    --run_observer_only true \
    --samples_per_category 32

python ./reap/src/reap/prune.py \
    --model-name "openai/gpt-oss-20b" \
    --compression-ratio 0.5 \
    --prune-method reap
```
## Quantization Details
The model weights have been quantized to MXFP4 format using the NVIDIA Model Optimizer. This converts the weights to a low-precision block-scaled format, reducing the memory footprint and improving inference speed while aiming to preserve model accuracy.
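To make the format concrete, here is an illustrative sketch of the block-scaling idea behind MXFP4 (per the OCP Microscaling spec, 4-bit E2M1 elements sharing one power-of-two scale per 32-element block). This is not the NVIDIA Model Optimizer implementation, and the scale-selection rule below is a simplification:

```python
# Toy MXFP4-style block quantization: each block of values shares one
# power-of-two scale, and each element is rounded to a 4-bit FP4 (E2M1) code.
import math

# The 8 non-negative FP4 (E2M1) magnitudes; the sign bit doubles this to 16 codes.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(values):
    """Quantize one block (up to 32 floats) to (shared_scale, fp4_values)."""
    amax = max(abs(v) for v in values)
    # Shared power-of-two scale chosen so the block max fits under the top
    # FP4 magnitude (6.0). Real implementations may pick the scale differently.
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0)) if amax > 0 else 1.0
    quantized = []
    for v in values:
        mag = min(abs(v) / scale, 6.0)
        nearest = min(FP4_GRID, key=lambda g: abs(g - mag))  # round to the FP4 grid
        quantized.append(math.copysign(nearest, v))
    return scale, quantized

def dequantize_block(scale, quantized):
    return [scale * q for q in quantized]

block = [0.1, -0.02, 0.37, 0.5]
scale, q = quantize_block(block)
approx = dequantize_block(scale, q)  # lossy reconstruction of the block
```

Because only one exponent is stored per block and each element takes 4 bits, the format needs roughly 4.25 bits per weight, at the cost of the rounding error visible in the reconstruction above.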
Note: The quantization script specifically targets the `experts` layers for quantization, skipping the `self_attn` and `router` layers as configured.
## License
(Please specify the license of the original model and any modifications)