MiniMind Max2: Efficient Edge-Deployed Language Models

Architecture: Mixture of Experts + Grouped Query Attention for Maximum Efficiency


Overview

MiniMind Max2 is a family of efficient language models designed for edge deployment, inspired by MiniMax-01's architecture. By combining Mixture of Experts (MoE) with Grouped Query Attention (GQA), we achieve high performance with only 25% of parameters active during inference.

Key Features

| Feature | Description |
|---|---|
| MoE Architecture | 8 experts with top-2 routing (25% activation) |
| GQA Optimization | 4:1 query-to-key head ratio for memory efficiency (sketched below) |
| Edge Ready | Android NDK support with JNI bindings |
| Multiple Formats | SafeTensors, GGUF, and ONNX export support |
| FP8 Support | Optimized for FP8 quantization |
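In the 4:1 GQA layout, four query heads share each key/value head, so the KV cache shrinks fourfold. Below is a minimal PyTorch sketch of the idea, using max2-nano's 16 query / 4 key-value heads; it is an illustration, not the model's actual attention kernel.

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    """q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)."""
    n_rep = q.shape[1] // k.shape[1]       # 16 / 4 = 4 query heads per KV head
    k = k.repeat_interleave(n_rep, dim=1)  # expand KV heads to match Q heads
    v = v.repeat_interleave(n_rep, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(1, 16, 8, 64)  # 16 query heads, head_dim 64
k = torch.randn(1, 4, 8, 64)   # only 4 KV heads -> 4x smaller KV cache
v = torch.randn(1, 4, 8, 64)
print(gqa_attention(q, k, v).shape)  # torch.Size([1, 16, 8, 64])
```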

Model Variants

| Model | Total Params | Active Params | Layers | Hidden | Experts | Use Case |
|---|---|---|---|---|---|---|
| max2-nano | 500M | 125M | 12 | 1024 | 8 | Mobile/IoT |
| max2-lite | 1.5B | 375M | 20 | 2048 | 8 | Edge devices |
| max2-pro | 3B | 750M | 28 | 3072 | 8 | High-performance edge |
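The Active Params column follows from the top-2-of-8 routing: 2/8 of the expert weights fire per token. A back-of-envelope check of the table figures (a simplification that treats the expert FFNs as the whole budget and ignores the always-active attention and embedding weights):

```python
# Rough check of the 25% activation figure: top-2 of 8 experts.
variants = {"max2-nano": 500e6, "max2-lite": 1.5e9, "max2-pro": 3e9}
num_experts, experts_per_token = 8, 2
for name, total in variants.items():
    active = total * experts_per_token / num_experts
    print(f"{name}: {active / 1e6:.0f}M active of {total / 1e6:.0f}M total "
          f"({active / total:.0%})")
```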

Architecture Details

```
┌─────────────────────────────────────────────────────────────────┐
│                    MiniMind Max2 Architecture                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Input Tokens                                                    │
│       │                                                          │
│       ▼                                                          │
│  ┌─────────────────────────────────────────┐                    │
│  │  Token Embedding + RoPE Positional Enc  │                    │
│  └─────────────────────────────────────────┘                    │
│       │                                                          │
│       ▼                                                          │
│  ╔═══════════════════════════════════════════════════════════╗  │
│  ║              Transformer Block (×N layers)                 ║  │
│  ║  ┌─────────────────────────────────────────────────────┐  ║  │
│  ║  │                    RMSNorm                          │  ║  │
│  ║  └─────────────────────────────────────────────────────┘  ║  │
│  ║       │                                                    ║  │
│  ║       ▼                                                    ║  │
│  ║  ┌─────────────────────────────────────────────────────┐  ║  │
│  ║  │        Grouped Query Attention (GQA)                │  ║  │
│  ║  │   ┌────────┐  ┌────────┐  ┌────────┐               │  ║  │
│  ║  │   │Q Heads │  │K Heads │  │V Heads │               │  ║  │
│  ║  │   │  (48)  │  │  (12)  │  │  (12)  │               │  ║  │
│  ║  │   └────────┘  └────────┘  └────────┘               │  ║  │
│  ║  └─────────────────────────────────────────────────────┘  ║  │
│  ║       │                                                    ║  │
│  ║       ▼  (+Residual)                                       ║  │
│  ║  ┌─────────────────────────────────────────────────────┐  ║  │
│  ║  │                    RMSNorm                          │  ║  │
│  ║  └─────────────────────────────────────────────────────┘  ║  │
│  ║       │                                                    ║  │
│  ║       ▼                                                    ║  │
│  ║  ┌─────────────────────────────────────────────────────┐  ║  │
│  ║  │           Mixture of Experts (MoE)                  │  ║  │
│  ║  │  ┌────────────────────────────────────────────┐    │  ║  │
│  ║  │  │              Router (Top-2)                │    │  ║  │
│  ║  │  └────────────────────────────────────────────┘    │  ║  │
│  ║  │       │                                             │  ║  │
│  ║  │       ▼                                             │  ║  │
│  ║  │  ┌──────┐┌──────┐┌──────┐┌──────┐    ┌──────┐     │  ║  │
│  ║  │  │Exp 1 ││Exp 2 ││Exp 3 ││Exp 4 │....│Exp 8 │     │  ║  │
│  ║  │  │SwiGLU││SwiGLU││SwiGLU││SwiGLU│    │SwiGLU│     │  ║  │
│  ║  │  └──────┘└──────┘└──────┘└──────┘    └──────┘     │  ║  │
│  ║  └─────────────────────────────────────────────────────┘  ║  │
│  ║       │                                                    ║  │
│  ║       ▼  (+Residual)                                       ║  │
│  ╚═══════════════════════════════════════════════════════════╝  │
│       │                                                          │
│       ▼                                                          │
│  ┌─────────────────────────────────────────┐                    │
│  │      Final RMSNorm + LM Head            │                    │
│  └─────────────────────────────────────────┘                    │
│       │                                                          │
│       ▼                                                          │
│  Output Logits (vocab_size: 102,400)                            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```
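The head counts in the diagram (48 query, 12 key/value) illustrate the same 4:1 GQA ratio at a larger width; max2-nano itself uses 16 query and 4 key-value heads, as listed in the configuration below. The MoE block routes each token to 2 of its 8 SwiGLU experts and mixes the expert outputs by the renormalized router weights. A minimal sketch with max2-nano sizes (hidden 1024, expert intermediate 1408); this is illustrative, not the shipped implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    def __init__(self, hidden=1024, inter=1408):
        super().__init__()
        self.gate = nn.Linear(hidden, inter, bias=False)
        self.up = nn.Linear(hidden, inter, bias=False)
        self.down = nn.Linear(inter, hidden, bias=False)

    def forward(self, x):
        # SwiGLU: gated activation, as in the expert boxes above
        return self.down(F.silu(self.gate(x)) * self.up(x))

class Top2MoE(nn.Module):
    def __init__(self, hidden=1024, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList(SwiGLUExpert(hidden) for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, hidden)
        probs = self.router(x).softmax(dim=-1)
        weights, idx = probs.topk(self.top_k, dim=-1)      # top-2 experts per token
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize the pair
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                      # tokens whose k-th pick is e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(1) * expert(x[mask])
        return out

print(Top2MoE()(torch.randn(4, 1024)).shape)  # torch.Size([4, 1024])
```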

Quick Start

Installation

```bash
pip install torch transformers safetensors
```

Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "fariasultana/MiniMind",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("fariasultana/MiniMind")

# Generate text
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```

Using the API

```python
from huggingface_hub import InferenceClient

client = InferenceClient("fariasultana/MiniMind-API")
response = client.text_generation("Explain quantum computing in simple terms")
print(response)
```

Technical Specifications

Model Configuration (max2-nano)

```yaml
Architecture:
  hidden_size: 1024
  num_layers: 12
  num_attention_heads: 16
  num_key_value_heads: 4  # GQA ratio 4:1
  intermediate_size: 2816

MoE Configuration:
  num_experts: 8
  num_experts_per_token: 2  # Top-2 routing
  expert_intermediate_size: 1408

Efficiency:
  total_parameters: 500M
  active_parameters: 125M  # 25% activation
  activation_ratio: 0.25

Training:
  max_sequence_length: 32768
  vocab_size: 102400
  rope_theta: 10000.0
```
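A few quantities implied by this configuration, such as the per-head dimension and the KV-cache savings from GQA (illustrative arithmetic, assuming an fp16 cache):

```python
# Quantities implied by the max2-nano config above (illustrative).
hidden_size, n_heads, n_kv_heads = 1024, 16, 4
n_layers, max_seq = 12, 32768

head_dim = hidden_size // n_heads   # 1024 / 16 = 64
gqa_ratio = n_heads // n_kv_heads   # 4:1 -> 4x smaller KV cache than full MHA

# KV cache: 2 tensors (K and V) per layer, n_kv_heads * head_dim values each
kv_values_per_token = 2 * n_layers * n_kv_heads * head_dim
fp16_bytes = kv_values_per_token * max_seq * 2
print(f"head_dim={head_dim}, GQA ratio {gqa_ratio}:1")
print(f"fp16 KV cache at full 32k context: {fp16_bytes / 2**20:.0f} MiB")  # 384 MiB
```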

Evaluation Results

| Benchmark | max2-nano | max2-lite | max2-pro |
|---|---|---|---|
| HellaSwag | 41.2% | 52.8% | 61.4% |
| ARC-Challenge | 29.8% | 38.5% | 45.2% |
| MMLU | 26.7% | 35.2% | 42.8% |
| TruthfulQA | 38.5% | 44.2% | 48.6% |
| Winogrande | 52.8% | 58.4% | 63.1% |

Export Formats

GGUF (llama.cpp)

```bash
python -m scripts.export --model max2-nano --format gguf --output model.gguf
```
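Assuming the exported file is llama.cpp-compatible, it can then be loaded with llama-cpp-python; the path and parameters below are examples, not part of the export script:

```python
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=4096)  # context size is illustrative
out = llm("The future of AI is", max_tokens=50)
print(out["choices"][0]["text"])
```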

ONNX

```bash
python -m scripts.export --model max2-nano --format onnx --output model.onnx
```
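A single forward pass through the exported graph with onnxruntime; the input name used here is an assumption, so inspect `sess.get_inputs()` for the names your export actually produces:

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")
print([i.name for i in sess.get_inputs()])  # check the real input names first
input_ids = np.array([[1, 2, 3]], dtype=np.int64)
outputs = sess.run(None, {"input_ids": input_ids})
print(outputs[0].shape)  # expected (1, seq_len, vocab_size)
```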

Android Deployment

```bash
python -m scripts.export --model max2-nano --format android --output ./android_export
```

Citation

```bibtex
@misc{minimind-max2-2024,
  title={MiniMind Max2: Efficient Language Models for Edge Deployment},
  author={Matrix Agent},
  year={2024},
  howpublished={\url{https://huggingface.co/fariasultana/MiniMind}}
}
```


License

Apache 2.0 - See LICENSE for details.


Built with efficiency in mind for the edge AI revolution