# gpt-oss-20b-reap-0.5-mxfp4
This repository contains a quantized version of the openai/gpt-oss-20b model.
## Model Description
This model is a pruned and quantized version of the openai/gpt-oss-20b model.
- **Original Model:** openai/gpt-oss-20b
- **Pruning Method:** `reap` with a compression ratio of `0.5`
- **Quantization Method:** MXFP4 weight-only quantization
- **Dataset used for pruning/quantization:** theblackcat102/evol-codealpaca-v1
The quantization process targeted only the expert layers of the model, skipping the self-attention and router layers. This is standard practice for Mixture-of-Experts (MoE) models: the experts hold the bulk of the parameters, while attention and routing tend to be more sensitive to reduced precision.
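The skip rule described above can be sketched as a simple module-name filter. The helper and the example module names below are illustrative assumptions, not the exact gpt-oss checkpoint layout or the actual quantization script:

```python
# Hypothetical helper mirroring the layer-selection rule: quantize MoE
# expert weights, skip self-attention and router layers.

def should_quantize(module_name: str) -> bool:
    """Return True if a module's weights should be MXFP4-quantized."""
    skipped = ("self_attn", "router")  # attention and routing stay in higher precision
    if any(part in module_name for part in skipped):
        return False
    return "experts" in module_name    # only expert (FFN) weights are targeted

# Illustrative module names, not the real checkpoint layout:
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.router",
    "model.layers.0.mlp.experts.gate_up_proj",
    "model.layers.0.mlp.experts.down_proj",
]
targets = [n for n in names if should_quantize(n)]  # only the two expert projections
```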
## Usage
You can load this model with the `transformers` library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sandeshrajx/gpt-oss-20b-reap-0.5-mxfp4"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Use the model for inference
# ...
```
### Pruning Commands Used

The first pass runs the observer only, collecting expert activation statistics over the calibration samples; the second pass performs the REAP pruning at a 0.5 compression ratio:

```shell
python ./reap/src/reap/prune.py \
    --model-name "openai/gpt-oss-20b" \
    --run_observer_only true \
    --samples_per_category 32

python ./reap/src/reap/prune.py \
    --model-name "openai/gpt-oss-20b" \
    --compression-ratio 0.5 \
    --prune-method reap
```
## Quantization Details
The model weights have been quantized to MXFP4 format using the NVIDIA Model Optimizer. This converts the weights to a low-precision block-scaled format, reducing the memory footprint and improving inference speed while aiming to preserve model accuracy.
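To make the format concrete, here is an illustrative sketch of the block-scaling idea behind MXFP4 (per the OCP Microscaling spec, 4-bit E2M1 elements sharing one power-of-two scale per 32-element block). This is not the NVIDIA Model Optimizer implementation, and the scale-selection rule below is a simplification:

```python
# Toy MXFP4-style block quantization: each block of values shares one
# power-of-two scale, and each element is rounded to a 4-bit FP4 (E2M1) code.
import math

# The 8 non-negative FP4 (E2M1) magnitudes; the sign bit doubles this to 16 codes.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(values):
    """Quantize one block (up to 32 floats) to (shared_scale, fp4_values)."""
    amax = max(abs(v) for v in values)
    # Shared power-of-two scale chosen so the block max fits under the top
    # FP4 magnitude (6.0). Real implementations may pick the scale differently.
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0)) if amax > 0 else 1.0
    quantized = []
    for v in values:
        mag = min(abs(v) / scale, 6.0)
        nearest = min(FP4_GRID, key=lambda g: abs(g - mag))  # round to the FP4 grid
        quantized.append(math.copysign(nearest, v))
    return scale, quantized

def dequantize_block(scale, quantized):
    return [scale * q for q in quantized]

block = [0.1, -0.02, 0.37, 0.5]
scale, q = quantize_block(block)
approx = dequantize_block(scale, q)  # lossy reconstruction of the block
```

Because only one exponent is stored per block and each element takes 4 bits, the format needs roughly 4.25 bits per weight, at the cost of the rounding error visible in the reconstruction above.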
Note: The quantization script specifically targets the `experts` layers for quantization, skipping the `self_attn` and `router` layers as configured.
## License
(Please specify the license of the original model and any modifications)