gpt-oss-20b-reap-0.5-mxfp4

This repository contains a pruned and quantized version of the openai/gpt-oss-20b model.

Model Description

This model is a pruned and quantized version of the openai/gpt-oss-20b model.

  • Original Model: openai/gpt-oss-20b
  • Pruning Method: REAP, with a compression ratio of 0.5
  • Quantization Method: MXFP4 weight-only quantization
  • Calibration Dataset (pruning/quantization): theblackcat102/evol-codealpaca-v1

The quantization process targeted only the model's expert layers, skipping the self-attention and router layers. This is standard practice for Mixture-of-Experts (MoE) models: the experts hold most of the parameters, so quantizing them captures most of the size reduction while leaving the more precision-sensitive attention and routing paths untouched.

Usage

You can load this model using the transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sandeshrajx/gpt-oss-20b-reap-0.5-mxfp4"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Run a simple generation
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Pruning Commands Used (the first pass runs the activation observer only; the second applies REAP pruning):

python ./reap/src/reap/prune.py \
    --model-name "openai/gpt-oss-20b" \
    --run_observer_only true \
    --samples_per_category 32

python ./reap/src/reap/prune.py \
    --model-name "openai/gpt-oss-20b" \
    --compression-ratio 0.5 \
    --prune-method reap
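For intuition, a compression ratio of 0.5 removes half of the experts in each MoE layer. Below is a minimal sketch of ratio-based expert selection, assuming a per-expert saliency score has already been computed by the observer pass; the scoring itself is REAP-specific and not reproduced here, and the function name is hypothetical:

```python
def select_experts(saliency, compression_ratio):
    """Return indices of the experts kept after pruning at the given ratio.

    `saliency` is a hypothetical per-expert importance score; with a
    compression ratio of 0.5, the top half of the experts survive.
    """
    n_keep = max(1, round(len(saliency) * (1 - compression_ratio)))
    # Rank experts by saliency, keep the best n_keep, restore original order
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:n_keep])
```

With four experts scored [0.1, 0.9, 0.5, 0.2] and a 0.5 ratio, experts 1 and 2 would be retained.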

Quantization Details

The model weights have been quantized to the MXFP4 format using the NVIDIA Model Optimizer. MXFP4 is a microscaling format in which weights are stored as 4-bit floating-point (E2M1) values in blocks of 32 elements, with each block sharing a single power-of-two (E8M0) scale. This reduces the memory footprint and can improve inference speed while aiming to preserve model accuracy.
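The numerics of MXFP4 can be illustrated with a small quantize/dequantize round trip. This is a sketch of the arithmetic only, not the packed storage format or the Model Optimizer's actual implementation:

```python
import numpy as np

# Magnitudes representable by a signed FP4 (E2M1) element
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_roundtrip(block):
    """Quantize a weight block to MXFP4 values and dequantize them again.

    MXFP4 stores each element as a signed E2M1 code and gives every block
    (32 elements in the MX spec) one shared power-of-two scale.
    """
    block = np.asarray(block, dtype=np.float64)
    amax = np.abs(block).max()
    if amax == 0.0:
        return np.zeros_like(block)
    # Smallest power-of-two scale bringing the largest magnitude within
    # the E2M1 maximum of 6.0 (mimics the shared E8M0 block scale)
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0))
    # Round each scaled magnitude to the nearest representable value
    idx = np.abs(np.abs(block)[:, None] / scale - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(block) * FP4_GRID[idx] * scale
```

Values that land exactly on the scaled E2M1 grid survive the round trip unchanged; everything else is rounded to the nearest representable neighbor, which is where the accuracy loss comes from.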

Note: The quantization script targets only the expert layers, skipping the self_attn and router layers as configured.
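This skip rule amounts to a simple name filter over the model's modules. A minimal sketch, assuming transformers-style dotted module names (the exact paths below are illustrative, not taken from the actual checkpoint):

```python
def should_quantize(module_name):
    """Decide whether a module's weights should be quantized.

    Only expert weights are quantized; anything under a self_attn or
    router path is kept at its original precision.
    """
    skipped = ("self_attn", "router")
    return "experts" in module_name and not any(s in module_name for s in skipped)
```

For example, an expert projection such as "model.layers.0.mlp.experts.3.down_proj" would be quantized, while attention and router weights would be left alone.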

License

(Please specify the license of the original model and any modifications)


Safetensors

  • Model size: 11B params
  • Tensor types: BF16, U8