Why GraniteMoeHybridForCausalLM?

#4
by liebedan

Hi,

I can run this model with the plain transformers library. Because that is quite slow, I tried optimized inference frameworks like Optimum-NVIDIA and ExLlamaV2. In both cases I get an error saying that "GraniteMoeHybridForCausalLM" is not supported.

Why is this the architecture of this model anyway? According to the model card, the whole point of this version (without the "H" in the name) is that it is a "simple" dense transformer model without the hybrid/Mamba layers and the mixture of experts, precisely to allow broader compatibility.

How should I proceed?

IBM Granite org

@liebedan Thanks for raising this concern. You're absolutely right that the architecture name GraniteMoeHybrid is confusing here. For this model (and others that use the dense/non-hybrid configuration), the name is not accurate at all, and the architecture is functionally equivalent to Granite (or GraniteMoeShared when using MoE without the hybrid layers). Everything was consolidated under GraniteMoeHybrid for consistency in the hparams and tensor names at training and loading time. Having a single superset architecture makes model management much simpler, but (as you point out) it means that any downstream framework needs the full superset architecture implemented in order to load these models.
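
For example, you can verify from the config alone that this checkpoint uses none of the hybrid or MoE machinery. A minimal sketch (the model id is a placeholder for this repo), checking the same two fields the conversion discussion below revolves around:

from transformers import AutoConfig

# Placeholder id: point this at the repo (or a local download) of this model.
cfg = AutoConfig.from_pretrained("ibm-granite/<this-model>")

print(cfg.architectures)       # ['GraniteMoeHybridForCausalLM']
print(set(cfg.layer_types))    # {'attention'}  -> no mamba layers
print(cfg.num_local_experts)   # 0              -> no routed experts, dense MLP only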

It is certainly possible to map the hparams and tensor names to their corresponding names in Granite or GraniteMoeShared, but that would require additional work and a modified config.json. This is exactly what I did when adding conversion support in llama.cpp's convert_hf_to_gguf.py: https://github.com/ggml-org/llama.cpp/blob/master/convert_hf_to_gguf.py#L8259.
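
For the dense case, the tensor-name side of that mapping is small; roughly the following (the dict name is just for illustration, and the fused input_linear weight additionally has to be split in half row-wise, as the script later in this thread does):

# Rough sketch of the rename for the dense (attention-only, no routed experts) case.
# shared_mlp.input_linear is a fused [2*intermediate, hidden] weight whose halves
# become gate_proj and up_proj; output_linear maps 1:1 to down_proj.
SHARED_MLP_RENAMES = {
    "shared_mlp.input_linear": ("mlp.gate_proj", "mlp.up_proj"),  # split row-wise
    "shared_mlp.output_linear": "mlp.down_proj",
}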

Thank you for your response, Gabe!

While I can somewhat follow your reasoning, I see this as a barrier for other frameworks. Your modification, for example, shows how to add "backwards support" where proper GraniteMoeHybrid support already exists. But frameworks that don't have that support yet would need to implement a "limited" GraniteMoeHybrid that actually supports neither MoE nor hybrid layers. Also, this is far outside my capabilities when it comes to TensorRT-LLM / optimum or exllamav2; I could only raise an issue there.

btw: in the meantime I got it running with the llama.cpp server and the GGUF Q4_K_S variant, and it's faster by a factor of 8 compared to plain transformers (around 70 t/s instead of 8-9 t/s on a T4).
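
In case it helps anyone else, the llama.cpp server exposes an OpenAI-compatible endpoint, so the client side stays simple. A minimal sketch, assuming the server is already running on the default local port (port and prompt are placeholders):

import requests

# Query a locally running llama.cpp server (llama-server) through its
# OpenAI-compatible chat completions endpoint.
resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "List three prime numbers."}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
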
Unfortunately my current stack relies on onnxruntime_genai, and the only ONNX variant I can find on HF doesn't come with a proper config.

Other than that, I really love the results I get so far, especially in terms of instruction following for tool calling. πŸ˜„

IBM Granite org

While I can somewhat follow your reasoning, I see this as a barrier for other frameworks. Your modification, for example, shows how to add "backwards support" where proper GraniteMoeHybrid support already exists. But frameworks that don't have that support yet would need to implement a "limited" GraniteMoeHybrid that actually supports neither MoE nor hybrid layers. Also, this is far outside my capabilities when it comes to TensorRT-LLM / optimum or exllamav2; I could only raise an issue there.

Yes, this is definitely a barrier caused by the choice to use the consolidated architecture. We're always trying to get the architecture supported in as many places as possible, and we'll keep working to improve this. In the meantime, I think it makes sense for us to put together a script that creates a backwards-converted version of these non-MoE / non-hybrid models in their corresponding "simplified" architecture class. I'll see how tricky that is and let you know.

IBM Granite org

Ok, I got a little help from Bob (shameless plug):

#!/usr/bin/env python3
"""
Script to convert GraniteMoeHybrid models to Granite models.

This script converts models with the GraniteMoeHybridForCausalLM architecture to GraniteForCausalLM.
It only works for models where:
- All layers are set to "attention" (no mamba layers)
- num_local_experts is set to 0 (no MoE experts, only shared experts)

Usage:
    python convert_granitemoehybrid_to_granite.py \
        --input_model /path/to/granitemoehybrid/model \
        --output_model /path/to/output/granite/model
"""

import argparse
import json
import os
import shutil
from pathlib import Path

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer


def validate_model_eligibility(config):
    """
    Validate that the model is eligible for conversion.

    Args:
        config: The GraniteMoeHybrid config

    Returns:
        bool: True if eligible, False otherwise
    """
    # Check if all layers are attention layers
    layer_types = config.layer_types
    if layer_types is None:
        print("❌ Error: layer_types is None. Cannot determine layer types.")
        return False

    if not all(layer_type == "attention" for layer_type in layer_types):
        print(f"❌ Error: Not all layers are 'attention' layers. Found: {set(layer_types)}")
        print(f"   Layer types: {layer_types}")
        return False

    # Check if num_local_experts is 0
    if config.num_local_experts != 0:
        print(f"❌ Error: num_local_experts must be 0, but got {config.num_local_experts}")
        return False

    print("βœ… Model is eligible for conversion:")
    print(f"   - All {len(layer_types)} layers are 'attention' layers")
    print(f"   - num_local_experts = {config.num_local_experts}")

    return True


def convert_config(hybrid_config):
    """
    Convert GraniteMoeHybrid config to Granite config.

    Args:
        hybrid_config: The GraniteMoeHybrid config

    Returns:
        dict: Granite config dictionary
    """
    # Start with common parameters
    granite_config = {
        "model_type": "granite",
        "vocab_size": hybrid_config.vocab_size,
        "hidden_size": hybrid_config.hidden_size,
        "intermediate_size": hybrid_config.shared_intermediate_size,  # Use shared_intermediate_size
        "num_hidden_layers": hybrid_config.num_hidden_layers,
        "num_attention_heads": hybrid_config.num_attention_heads,
        "num_key_value_heads": hybrid_config.num_key_value_heads,
        "hidden_act": hybrid_config.hidden_act,
        "max_position_embeddings": hybrid_config.max_position_embeddings,
        "initializer_range": hybrid_config.initializer_range,
        "rms_norm_eps": hybrid_config.rms_norm_eps,
        "use_cache": hybrid_config.use_cache,
        "pad_token_id": hybrid_config.pad_token_id,
        "bos_token_id": hybrid_config.bos_token_id,
        "eos_token_id": hybrid_config.eos_token_id,
        "tie_word_embeddings": hybrid_config.tie_word_embeddings,
        "rope_parameters": hybrid_config.rope_parameters,
        "attention_bias": hybrid_config.attention_bias,
        "attention_dropout": hybrid_config.attention_dropout,
        "mlp_bias": False,  # GraniteMoeHybrid doesn't have this, default to False
        "embedding_multiplier": hybrid_config.embedding_multiplier,
        "logits_scaling": hybrid_config.logits_scaling,
        "residual_multiplier": hybrid_config.residual_multiplier,
        "attention_multiplier": hybrid_config.attention_multiplier,
    }

    # Copy over any additional attributes that might be in the config
    for key in ["architectures", "torch_dtype", "_name_or_path"]:
        if hasattr(hybrid_config, key):
            value = getattr(hybrid_config, key)
            if key == "architectures" and value:
                # Update architecture name
                granite_config[key] = ["GraniteForCausalLM"]
            elif key == "torch_dtype":
                granite_config[key] = str(value).split(".")[-1]
            else:
                granite_config[key] = value

    return granite_config


def convert_state_dict(hybrid_state_dict, num_layers):
    """
    Convert GraniteMoeHybrid state dict to Granite state dict.

    The main difference is in the MLP layer naming:
    - GraniteMoeHybrid: model.layers.{i}.shared_mlp.input_linear.weight
    - Granite: model.layers.{i}.mlp.gate_proj.weight and model.layers.{i}.mlp.up_proj.weight

    Args:
        hybrid_state_dict: The GraniteMoeHybrid state dict
        num_layers: Number of layers in the model

    Returns:
        dict: Granite state dict
    """
    granite_state_dict = {}

    for key, value in hybrid_state_dict.items():
        new_key = key

        # Convert MLP layer names from shared_mlp to mlp
        if "shared_mlp" in key:
            if "input_linear.weight" in key:
                # The input_linear has shape [intermediate_size*2, hidden_size]
                # We need to split it into gate_proj and up_proj
                # Each should have shape [intermediate_size, hidden_size]
                chunk_size = value.shape[0] // 2
                gate_weight = value[:chunk_size, :]
                up_weight = value[chunk_size:, :]

                # Add both projections
                gate_key = key.replace("shared_mlp.input_linear", "mlp.gate_proj")
                up_key = key.replace("shared_mlp.input_linear", "mlp.up_proj")

                granite_state_dict[gate_key] = gate_weight
                granite_state_dict[up_key] = up_weight
                continue

            elif "output_linear" in key:
                # output_linear -> down_proj
                new_key = key.replace("shared_mlp.output_linear", "mlp.down_proj")

        granite_state_dict[new_key] = value

    return granite_state_dict


def main():
    parser = argparse.ArgumentParser(description="Convert GraniteMoeHybrid model to Granite model")
    parser.add_argument(
        "--input_model",
        type=str,
        required=True,
        help="Path to the input GraniteMoeHybrid model directory"
    )
    parser.add_argument(
        "--output_model",
        type=str,
        required=True,
        help="Path to save the converted Granite model"
    )
    parser.add_argument(
        "--safe_serialization",
        action="store_true",
        default=True,
        help="Whether to save using safetensors format (default: True)"
    )

    args = parser.parse_args()

    input_path = Path(args.input_model)
    output_path = Path(args.output_model)

    if not input_path.exists():
        print(f"❌ Error: Input model path does not exist: {input_path}")
        return

    print(f"πŸ”„ Loading model from: {input_path}")

    # Load the config first to validate
    config = AutoConfig.from_pretrained(input_path, trust_remote_code=True)

    print(f"\nπŸ“‹ Model configuration:")
    print(f"   Architecture: {config.architectures}")
    print(f"   Model type: {config.model_type}")
    print(f"   Hidden size: {config.hidden_size}")
    print(f"   Num layers: {config.num_hidden_layers}")
    print(f"   Shared intermediate size: {config.shared_intermediate_size}")

    # Validate eligibility
    print(f"\nπŸ” Validating model eligibility...")
    if not validate_model_eligibility(config):
        print("\n❌ Model is not eligible for conversion. Exiting.")
        return

    # Load the full model
    print(f"\nπŸ“₯ Loading full model (this may take a while)...")
    model = AutoModelForCausalLM.from_pretrained(
        input_path,
        trust_remote_code=True,
        # Load in the dtype recorded in the checkpoint instead of forcing float16
        torch_dtype="auto",
    )

    # Convert config
    print(f"\nπŸ”§ Converting configuration...")
    granite_config_dict = convert_config(config)

    # Convert state dict
    print(f"\nπŸ”§ Converting model weights...")
    hybrid_state_dict = model.state_dict()
    granite_state_dict = convert_state_dict(hybrid_state_dict, config.num_hidden_layers)

    print(f"   Original state dict keys: {len(hybrid_state_dict)}")
    print(f"   Converted state dict keys: {len(granite_state_dict)}")

    # Create output directory
    output_path.mkdir(parents=True, exist_ok=True)

    # Save the converted config
    print(f"\nπŸ’Ύ Saving converted config to: {output_path}")
    config_path = output_path / "config.json"
    with open(config_path, "w") as f:
        json.dump(granite_config_dict, f, indent=2)

    # Load as Granite model and save
    print(f"\nπŸ’Ύ Loading as Granite model and saving...")
    from transformers import GraniteConfig, GraniteForCausalLM

    granite_config = GraniteConfig(**granite_config_dict)
    granite_model = GraniteForCausalLM(granite_config)

    # Load the converted state dict
    missing_keys, unexpected_keys = granite_model.load_state_dict(granite_state_dict, strict=False)

    if missing_keys:
        print(f"⚠️  Warning: Missing keys in converted model: {missing_keys}")
    if unexpected_keys:
        print(f"⚠️  Warning: Unexpected keys in converted model: {unexpected_keys}")

    # Save the model
    granite_model.save_pretrained(
        output_path,
        safe_serialization=args.safe_serialization
    )

    # Copy tokenizer files if they exist
    print(f"\nπŸ“„ Copying tokenizer files...")
    tokenizer_files = [
        "tokenizer.json",
        "tokenizer_config.json",
        "special_tokens_map.json",
        "vocab.json",
        "merges.txt",
        "tokenizer.model",
        "chat_template.jinja",
    ]

    for filename in tokenizer_files:
        src_file = input_path / filename
        if src_file.exists():
            dst_file = output_path / filename
            shutil.copy2(src_file, dst_file)
            print(f"   βœ… Copied {filename}")

    # Copy generation config if it exists
    gen_config_file = input_path / "generation_config.json"
    if gen_config_file.exists():
        shutil.copy2(gen_config_file, output_path / "generation_config.json")
        print(f"   βœ… Copied generation_config.json")

    print(f"\nβœ… Conversion complete!")
    print(f"   Converted model saved to: {output_path}")
    print(f"\nπŸ§ͺ You can now load the model with:")
    print(f"   from transformers import AutoModelForCausalLM, AutoTokenizer")
    print(f"   model = AutoModelForCausalLM.from_pretrained('{output_path}')")
    print(f"   tokenizer = AutoTokenizer.from_pretrained('{output_path}')")


if __name__ == "__main__":
    main()
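
If you want to double-check the result, a quick numerical sanity check (paths are placeholders) is to compare logits between the original and the converted checkpoint:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder paths; the point is just to confirm the two checkpoints agree numerically.
hybrid = AutoModelForCausalLM.from_pretrained(
    "/path/to/granitemoehybrid/model", trust_remote_code=True, torch_dtype="auto"
)
granite = AutoModelForCausalLM.from_pretrained(
    "/path/to/output/granite/model", torch_dtype="auto"
)
tok = AutoTokenizer.from_pretrained("/path/to/granitemoehybrid/model")

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    a = hybrid(**inputs).logits
    b = granite(**inputs).logits
print(torch.max(torch.abs(a - b)))  # should be (near) zero if the conversion is faithful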
