Why GraniteMoeHybridForCausalLM?
Hi,
I can run this model using the normal transformers library. Because that is quite slow, I tried optimized inference frameworks like "transformers optimum nvidia" and "exllamav2". In both cases I get the error that "GraniteMoeHybridForCausalLM" is not supported.
Why is this the architecture of this model, anyway? According to the model card, the whole point of this version (without the H in the name) is that it is a "simple" dense transformer model without hybrid/mamba layers or a mixture of experts, precisely to allow broader compatibility.
How should I proceed?
@liebedan
Thanks for raising this concern. You're absolutely right that the architecture name GraniteMoeHybrid is confusing here. For this model (and others that use the dense/non-hybrid configuration), the name is not at all accurate, and the architecture is functionally equivalent to Granite (or GraniteMoeShared if using MoE without hybrid layers). The reason this was all consolidated under GraniteMoeHybrid was consistency in the hparams and tensor names at training and loading time. Having a single superset architecture makes model management much simpler, but (as you point out) it means that any extension needs to have the full superset architecture implemented in order to load these models.
It is certainly possible to map the hparams and tensor names to their corresponding names in Granite or GraniteMoeShared, but that would require additional work and a modified config.json. This is exactly what I did when adding conversion support in llama.cpp's convert_hf_to_gguf.py: https://github.com/ggml-org/llama.cpp/blob/master/convert_hf_to_gguf.py#L8259.
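If you just want to confirm that a given checkpoint really is one of these dense "GraniteMoeHybrid in name only" models, a quick config check is enough (the path below is a placeholder):

from transformers import AutoConfig

# Placeholder path; point this at the checkpoint you want to inspect.
cfg = AutoConfig.from_pretrained("/path/to/granitemoehybrid/model")

print(cfg.architectures)       # ["GraniteMoeHybridForCausalLM"], even for the dense variant
print(set(cfg.layer_types))    # {"attention"} -> no mamba layers
print(cfg.num_local_experts)   # 0 -> no routed experts, only the shared MLP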
Thank you for your response, Gabe!
While I can somewhat follow your reasoning, I see this as a barrier for other frameworks. E.g. your modification shows how to add "backwards support" if there is already proper GraniteMoeHybrid support. But frameworks that don't have that support yet would need to implement a "limited support" for GraniteMoeHybrid that actually supports neither MoE nor hybrid layers. Also, this is well outside of my capabilities when it comes to TensorRT-LLM / optimum or exllamav2. I could only raise an issue there.
btw: in the meantime I got it running using the llama.cpp server with the GGUF Q4_K_S variant, and it's faster by a factor of 8 compared to plain transformers (around 70 t/s instead of 8-9 t/s on a T4). A minimal query sketch is at the end of this post.
Unfortunately, my current stack relies on onnxruntime_genai, and the only ONNX variant I can find on HF doesn't come with a proper config.
Other than that, I really love the results I get so far, especially in terms of instruction following for tool calling.
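In case it's useful to anyone else, here's a minimal sketch of how the running llama.cpp server can be queried from Python via its OpenAI-compatible endpoint (host, port, and prompt are placeholders, not my exact setup):

import requests

# Assumes llama-server is already running locally with the GGUF model loaded,
# e.g. started with something like: llama-server -m <model>.gguf --port 8080
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # OpenAI-compatible endpoint of llama-server
    json={
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])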
> While I can somewhat follow your reasoning, I see this as a barrier for other frameworks. E.g. your modification shows how to add "backwards support" if there is already proper GraniteMoeHybrid support. But frameworks that don't have that support yet would need to implement a "limited support" for GraniteMoeHybrid that actually supports neither MoE nor hybrid layers. Also, this is well outside of my capabilities when it comes to TensorRT-LLM / optimum or exllamav2. I could only raise an issue there.
Yes, this is definitely a barrier caused by the choice to use the consolidated architecture. We're always working to get the architecture supported in as many places as possible, and we'll keep pushing to improve this. In the meantime, I think it might make sense for us to put together a script that converts these non-MoE / non-hybrid models back to their corresponding "simplified" architecture class. I'll see how tricky that is and let you know.
Ok, I got a little help from Bob (shameless plug):
#!/usr/bin/env python3
"""
Script to convert GraniteMoeHybrid models to Granite models.
This script converts models with the GraniteMoeHybridForCausalLM architecture to GraniteForCausalLM.
It only works for models where:
- All layers are set to "attention" (no mamba layers)
- num_local_experts is set to 0 (no MoE experts, only shared experts)
Usage:
python convert_granitemoehybrid_to_granite.py \
--input_model /path/to/granitemoehybrid/model \
--output_model /path/to/output/granite/model
"""
import argparse
import json
import os
import shutil
from pathlib import Path
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
def validate_model_eligibility(config):
"""
Validate that the model is eligible for conversion.
Args:
config: The GraniteMoeHybrid config
Returns:
bool: True if eligible, False otherwise
"""
# Check if all layers are attention layers
layer_types = config.layer_types
if layer_types is None:
print("β Error: layer_types is None. Cannot determine layer types.")
return False
if not all(layer_type == "attention" for layer_type in layer_types):
print(f"β Error: Not all layers are 'attention' layers. Found: {set(layer_types)}")
print(f" Layer types: {layer_types}")
return False
# Check if num_local_experts is 0
if config.num_local_experts != 0:
print(f"β Error: num_local_experts must be 0, but got {config.num_local_experts}")
return False
print("β
Model is eligible for conversion:")
print(f" - All {len(layer_types)} layers are 'attention' layers")
print(f" - num_local_experts = {config.num_local_experts}")
return True
def convert_config(hybrid_config):
"""
Convert GraniteMoeHybrid config to Granite config.
Args:
hybrid_config: The GraniteMoeHybrid config
Returns:
dict: Granite config dictionary
"""
# Start with common parameters
granite_config = {
"model_type": "granite",
"vocab_size": hybrid_config.vocab_size,
"hidden_size": hybrid_config.hidden_size,
"intermediate_size": hybrid_config.shared_intermediate_size, # Use shared_intermediate_size
"num_hidden_layers": hybrid_config.num_hidden_layers,
"num_attention_heads": hybrid_config.num_attention_heads,
"num_key_value_heads": hybrid_config.num_key_value_heads,
"hidden_act": hybrid_config.hidden_act,
"max_position_embeddings": hybrid_config.max_position_embeddings,
"initializer_range": hybrid_config.initializer_range,
"rms_norm_eps": hybrid_config.rms_norm_eps,
"use_cache": hybrid_config.use_cache,
"pad_token_id": hybrid_config.pad_token_id,
"bos_token_id": hybrid_config.bos_token_id,
"eos_token_id": hybrid_config.eos_token_id,
"tie_word_embeddings": hybrid_config.tie_word_embeddings,
"rope_parameters": hybrid_config.rope_parameters,
"attention_bias": hybrid_config.attention_bias,
"attention_dropout": hybrid_config.attention_dropout,
"mlp_bias": False, # GraniteMoeHybrid doesn't have this, default to False
"embedding_multiplier": hybrid_config.embedding_multiplier,
"logits_scaling": hybrid_config.logits_scaling,
"residual_multiplier": hybrid_config.residual_multiplier,
"attention_multiplier": hybrid_config.attention_multiplier,
}
# Copy over any additional attributes that might be in the config
for key in ["architectures", "torch_dtype", "_name_or_path"]:
if hasattr(hybrid_config, key):
value = getattr(hybrid_config, key)
if key == "architectures" and value:
# Update architecture name
granite_config[key] = ["GraniteForCausalLM"]
elif key == "torch_dtype":
granite_config[key] = str(value).split(".")[-1]
else:
granite_config[key] = value
return granite_config
def convert_state_dict(hybrid_state_dict, num_layers):
"""
Convert GraniteMoeHybrid state dict to Granite state dict.
The main difference is in the MLP layer naming:
- GraniteMoeHybrid: model.layers.{i}.shared_mlp.input_linear.weight
- Granite: model.layers.{i}.mlp.gate_proj.weight and model.layers.{i}.mlp.up_proj.weight
Args:
hybrid_state_dict: The GraniteMoeHybrid state dict
num_layers: Number of layers in the model
Returns:
dict: Granite state dict
"""
granite_state_dict = {}
for key, value in hybrid_state_dict.items():
new_key = key
# Convert MLP layer names from shared_mlp to mlp
if "shared_mlp" in key:
if "input_linear.weight" in key:
# The input_linear has shape [intermediate_size*2, hidden_size]
# We need to split it into gate_proj and up_proj
# Each should have shape [intermediate_size, hidden_size]
chunk_size = value.shape[0] // 2
gate_weight = value[:chunk_size, :]
up_weight = value[chunk_size:, :]
# Add both projections
gate_key = key.replace("shared_mlp.input_linear", "mlp.gate_proj")
up_key = key.replace("shared_mlp.input_linear", "mlp.up_proj")
granite_state_dict[gate_key] = gate_weight
granite_state_dict[up_key] = up_weight
continue
elif "output_linear" in key:
# output_linear -> down_proj
new_key = key.replace("shared_mlp.output_linear", "mlp.down_proj")
granite_state_dict[new_key] = value
return granite_state_dict
def main():
    parser = argparse.ArgumentParser(description="Convert GraniteMoeHybrid model to Granite model")
    parser.add_argument(
        "--input_model",
        type=str,
        required=True,
        help="Path to the input GraniteMoeHybrid model directory"
    )
    parser.add_argument(
        "--output_model",
        type=str,
        required=True,
        help="Path to save the converted Granite model"
    )
    parser.add_argument(
        "--safe_serialization",
        action=argparse.BooleanOptionalAction,  # also accepts --no_safe_serialization
        default=True,
        help="Whether to save using safetensors format (default: True)"
    )
    args = parser.parse_args()

    input_path = Path(args.input_model)
    output_path = Path(args.output_model)

    if not input_path.exists():
        print(f"Error: Input model path does not exist: {input_path}")
        return

    print(f"Loading model from: {input_path}")

    # Load the config first to validate
    config = AutoConfig.from_pretrained(input_path, trust_remote_code=True)

    print("\nModel configuration:")
    print(f"  Architecture: {config.architectures}")
    print(f"  Model type: {config.model_type}")
    print(f"  Hidden size: {config.hidden_size}")
    print(f"  Num layers: {config.num_hidden_layers}")
    print(f"  Shared intermediate size: {config.shared_intermediate_size}")

    # Validate eligibility
    print("\nValidating model eligibility...")
    if not validate_model_eligibility(config):
        print("\nModel is not eligible for conversion. Exiting.")
        return

    # Load the full model
    print("\nLoading full model (this may take a while)...")
    model = AutoModelForCausalLM.from_pretrained(
        input_path,
        trust_remote_code=True,
        torch_dtype=torch.float16 if hasattr(config, "torch_dtype") else None
    )

    # Convert config
    print("\nConverting configuration...")
    granite_config_dict = convert_config(config)

    # Convert state dict
    print("\nConverting model weights...")
    hybrid_state_dict = model.state_dict()
    granite_state_dict = convert_state_dict(hybrid_state_dict, config.num_hidden_layers)
    print(f"  Original state dict keys: {len(hybrid_state_dict)}")
    print(f"  Converted state dict keys: {len(granite_state_dict)}")

    # Create output directory
    output_path.mkdir(parents=True, exist_ok=True)

    # Save the converted config
    print(f"\nSaving converted config to: {output_path}")
    config_path = output_path / "config.json"
    with open(config_path, "w") as f:
        json.dump(granite_config_dict, f, indent=2)

    # Load as Granite model and save
    print("\nLoading as Granite model and saving...")
    from transformers import GraniteConfig, GraniteForCausalLM
    granite_config = GraniteConfig(**granite_config_dict)
    granite_model = GraniteForCausalLM(granite_config)

    # Load the converted state dict
    missing_keys, unexpected_keys = granite_model.load_state_dict(granite_state_dict, strict=False)
    if missing_keys:
        print(f"Warning: Missing keys in converted model: {missing_keys}")
    if unexpected_keys:
        print(f"Warning: Unexpected keys in converted model: {unexpected_keys}")

    # Save the model
    granite_model.save_pretrained(
        output_path,
        safe_serialization=args.safe_serialization
    )

    # Copy tokenizer files if they exist
    print("\nCopying tokenizer files...")
    tokenizer_files = [
        "tokenizer.json",
        "tokenizer_config.json",
        "special_tokens_map.json",
        "vocab.json",
        "merges.txt",
        "tokenizer.model",
        "chat_template.jinja",
    ]
    for filename in tokenizer_files:
        src_file = input_path / filename
        if src_file.exists():
            dst_file = output_path / filename
            shutil.copy2(src_file, dst_file)
            print(f"  Copied {filename}")

    # Copy generation config if it exists
    gen_config_file = input_path / "generation_config.json"
    if gen_config_file.exists():
        shutil.copy2(gen_config_file, output_path / "generation_config.json")
        print("  Copied generation_config.json")

    print("\nConversion complete!")
    print(f"  Converted model saved to: {output_path}")
    print("\nYou can now load the model with:")
    print("  from transformers import AutoModelForCausalLM, AutoTokenizer")
    print(f"  model = AutoModelForCausalLM.from_pretrained('{output_path}')")
    print(f"  tokenizer = AutoTokenizer.from_pretrained('{output_path}')")

if __name__ == "__main__":
    main()
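A quick way to sanity-check a converted checkpoint is to compare next-token logits between the original and converted directories; the paths below just mirror the placeholders from the script's docstring:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder paths, matching the docstring above.
orig_dir = "/path/to/granitemoehybrid/model"
conv_dir = "/path/to/output/granite/model"

tok = AutoTokenizer.from_pretrained(orig_dir)
inputs = tok("The capital of France is", return_tensors="pt")

orig = AutoModelForCausalLM.from_pretrained(orig_dir, torch_dtype=torch.float16, trust_remote_code=True)
conv = AutoModelForCausalLM.from_pretrained(conv_dir, torch_dtype=torch.float16)

with torch.no_grad():
    logits_orig = orig(**inputs).logits[:, -1, :]
    logits_conv = conv(**inputs).logits[:, -1, :]

# If the tensor remapping is correct, the next-token logits should agree closely
# (small numerical differences are possible because the MLP is computed differently).
print("max abs diff:", (logits_orig - logits_conv).abs().max().item())
print("same argmax:", torch.equal(logits_orig.argmax(-1), logits_conv.argmax(-1)))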