GptOssDense


GptOssDense is a dense variant of the GptOss model architecture. Where GptOss uses a Mixture-of-Experts (MoE) layer with a learned router in each block, GptOssDense replaces it with a standard dense feed-forward network (FFN).

✅ Verified to work with trust_remote_code=True on stable transformers (v4.40+)

Model Architecture

  • Attention: Same as GptOss with sliding window attention and sink tokens
  • MLP: Dense FFN with GLU activation (instead of MoE with router)
  • Activation: Same GLU activation as GptOss experts: (up + 1) * gate * sigmoid(gate * alpha) where alpha=1.702
  • Normalization: RMSNorm
  • RoPE: YaRN (Yet another RoPE extensioN)
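The activation listed above can be sketched as a standalone scalar function. The clamping with the SwiGLU limit of 7.0 (listed in the configuration section) is an assumption modeled on the GptOss expert implementation, not taken verbatim from this repo's code:

```python
import math

ALPHA = 1.702  # alpha from the activation formula above
LIMIT = 7.0    # SwiGLU limit from the configuration section (assumed clamp)

def glu_activation(gate: float, up: float, alpha: float = ALPHA, limit: float = LIMIT) -> float:
    """(up + 1) * gate * sigmoid(gate * alpha), with an assumed clamp scheme:
    gate capped from above, up clipped to [-limit, limit]."""
    gate = min(gate, limit)
    up = max(-limit, min(up, limit))
    return (up + 1.0) * gate / (1.0 + math.exp(-alpha * gate))

# At gate = 0 the sigmoid gate zeroes the output entirely.
print(glu_activation(0.0, 1.0))  # 0.0
```

For large positive `gate` the sigmoid saturates, so the output approaches `(up + 1) * limit`.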

Key Differences from GptOss

| Feature    | GptOss                           | GptOssDense              |
|------------|----------------------------------|--------------------------|
| MLP type   | Mixture-of-Experts               | Dense FFN                |
| Router     | Yes                              | No                       |
| Experts    | Multiple (128)                   | Single                   |
| Parameters | More (due to multiple experts)   | Fewer                    |
| Inference  | Routes tokens to top-k experts   | Single FFN for all tokens |
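The "single FFN for all tokens" distinction can be illustrated with a minimal forward pass. The weight names and the toy sizes here are hypothetical (the real module ships with the repo's remote code, and the real config uses 2880/2880):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, intermediate = 8, 16  # toy sizes for illustration only

# Hypothetical parameter names for a gate/up/down GLU FFN.
w_gate = rng.standard_normal((hidden, intermediate)) * 0.02
w_up = rng.standard_normal((hidden, intermediate)) * 0.02
w_down = rng.standard_normal((intermediate, hidden)) * 0.02

def dense_mlp(x: np.ndarray, alpha: float = 1.702) -> np.ndarray:
    # No router: every token position goes through the same projections.
    gate, up = x @ w_gate, x @ w_up
    act = (up + 1.0) * gate / (1.0 + np.exp(-alpha * gate))
    return act @ w_down

tokens = rng.standard_normal((4, hidden))  # 4 token states in, 4 out
out = dense_mlp(tokens)
print(out.shape)  # (4, 8)
```

Unlike the MoE version, there is no top-k selection step, so every token pays the same compute cost.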

Usage

Quick Start - Random Initialization

Try the model with randomly initialized weights (outputs will be random):

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
import torch

# Load config and tokenizer
config = AutoConfig.from_pretrained("marksverdhei/gpt-oss-dense", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("marksverdhei/gpt-oss-dense")

# Initialize model with random weights
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
model.eval()

# Generate text (output will be random since the model is not trained)
prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=20,
        do_sample=True,
        temperature=1.0,
        top_k=50,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Example output: "Hello, how are you? pronunci bhithCiudadstdafxipseігlanders導 conveyoruviainn"
# (random tokens since the model is not trained)
```

Loading Pre-trained Weights (when available)

Once model weights are uploaded to the repository:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model with weights
model = AutoModelForCausalLM.from_pretrained(
    "marksverdhei/gpt-oss-dense",
    trust_remote_code=True
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("marksverdhei/gpt-oss-dense")

# Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```

With transformers fork

Using the marksverdhei/transformers fork where GptOssDense is registered:

Install the fork:

```shell
pip install git+https://github.com/marksverdhei/transformers.git
```

Then the classes can be imported directly:

```python
from transformers import GptOssDenseForCausalLM, GptOssDenseConfig

config = GptOssDenseConfig()
model = GptOssDenseForCausalLM(config)
```

Model Configuration

Matches openai/gpt-oss-20b configuration (dense variant):

  • Hidden size: 2880
  • Intermediate size: 2880
  • Number of layers: 24
  • Number of attention heads: 64
  • Number of key-value heads: 8
  • Head dimension: 64
  • Vocabulary size: 201,088
  • Max position embeddings: 131,072
  • Initial context length: 4,096
  • Sliding window: 128
  • RoPE type: YaRN with factor 32.0
  • SwiGLU limit: 7.0
  • Total parameters: ~2.4B
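These sizes are consistent with the stated total. A back-of-the-envelope count, ignoring norms and biases and assuming an untied LM head, lands near 2.4B:

```python
hidden, inter = 2880, 2880
layers, heads, kv_heads, head_dim = 24, 64, 8, 64
vocab = 201_088

attn = layers * (hidden * heads * head_dim            # q_proj
                 + 2 * hidden * kv_heads * head_dim   # k_proj + v_proj
                 + heads * head_dim * hidden)         # o_proj
mlp = layers * 3 * hidden * inter                     # gate/up/down projections
emb = 2 * vocab * hidden                              # embedding + untied lm_head (assumed)

total = attn + mlp + emb
print(f"~{total / 1e9:.2f}B parameters")  # ~2.39B parameters
```

The MoE variant multiplies the MLP term by the number of experts (128), which is where most of the parameter savings come from.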

License

Apache 2.0

Citation

If you use this model, please cite the original GptOss work and acknowledge this dense variant.
