---
base_model: zai-org/AutoGLM-Phone-9B-Multilingual
library_name: gguf
license: other
license_name: glm-4
tags:
- gguf
- llama.cpp
- vision
- multimodal
- autoglm
- phone-agent
- android
- gui-agent
pipeline_tag: text-generation
---

# AutoGLM-Phone-9B-Multilingual (GGUF Quantizations)

This is a **GGUF** quantized version of [zai-org/AutoGLM-Phone-9B-Multilingual](https://huggingface.co/zai-org/AutoGLM-Phone-9B-Multilingual), optimized for local inference with llama.cpp.

**Includes vision encoder (mmproj)** for multimodal capabilities and GUI agent tasks.

## 📦 Model Files

| File | Quantization | Size | VRAM | Description |
|------|-------------|------|------|-------------|
| `AutoGLM-Phone-9B-Multilingual-q4_k_m.gguf` | Q4_K_M | 5.7G | ~10GB | Performance balanced |
| `AutoGLM-Phone-9B-Multilingual-q5_k_m.gguf` | Q5_K_M | 6.6G | ~11GB | High quality |
| `AutoGLM-Phone-9B-Multilingual-q6_k.gguf` | Q6_K | 7.7G | ~12GB | Excellent quality |
| `AutoGLM-Phone-9B-Multilingual-q8_0.gguf` | Q8_0 | 9.4G | ~14GB | Best quality |
| `mmproj-AutoGLM-Phone-9B-Multilingual-F16.gguf` | F16 | 1.7G | - | Vision Encoder (required) |

**Total storage**: ~31GB (all quantizations + vision encoder)

## 🚀 Quick Start

### 1. Install llama.cpp

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
make llama-server
```

### 2. Download Model

```bash
huggingface-cli download gannima/AutoGLM-Phone-9B-Multilingual-GGUF \
    --local-dir ./AutoGLM-Phone-9B-Multilingual \
    --local-dir-use-symlinks False
```

### 3. Run Server

```bash
./llama-server \
    -m AutoGLM-Phone-9B-Multilingual/AutoGLM-Phone-9B-Multilingual-q8_0.gguf \
    --mmproj AutoGLM-Phone-9B-Multilingual/mmproj-AutoGLM-Phone-9B-Multilingual-F16.gguf \
    -c 32768 \
    -ngl 99 \
    --flash-attn on \
    --host 0.0.0.0 \
    --port 8080
```

### 4. Use with Open-AutoGLM

```bash
cd Open-AutoGLM
python main.py \
    --base-url http://localhost:8080 \
    --model "AutoGLM-Phone-9B-Multilingual" \
    --apikey dummy \
    "打开设置应用" \
    --max-steps 20
```

## 💻 Hardware Requirements

### Quick Reference (Tested on RTX 4090)

| Quantization | Model Size | Vision Encoder | Total | Actual VRAM* | Quality |
|--------------|------------|----------------|-------|--------------|---------|
| Q4_K_M | 5.7G | 1.7G | ~7.4G | ~10GB | Good |
| Q5_K_M | 6.6G | 1.7G | ~8.3G | ~11GB | Very Good |
| Q6_K | 7.7G | 1.7G | ~9.4G | ~12GB | Excellent |
| Q8_0 | 9.4G | 1.7G | ~11.1G | ~14GB | Best |

\*VRAM usage measured with `--flash-attn on` and all layers on GPU (`-ngl 99`)

### System Requirements

- **OS**: Linux (Ubuntu 22.04+ recommended), Windows 11 with WSL2
- **RAM**: 32GB+ system memory recommended
- **Storage**: SSD with sufficient space for model files
- **CUDA**: 12.0+ for GPU acceleration
- **llama.cpp**: Latest version with GLM4V support (PR #18042 merged)

### Performance Notes

- **Flash Attention**: Enabled by default for better performance
- **KV Cache**: Quantized to Q8_0 to reduce memory usage
- **Batch Size**: Optimized for RTX 4090 (adjust based on your GPU)
- **Context**: Supports up to 32K tokens with M-RoPE
- **All layers on GPU**: Set `-ngl 99` to offload all transformer layers to GPU

## 🎯 Recommended Usage

### For GUI Agent Tasks (Recommended)
Use **Q5_K_M** or **Q6_K** for the best balance between quality and performance:
- Better reasoning accuracy
- Faster inference than Q8_0
- Lower VRAM usage

### For Maximum Quality
Use **Q8_0** when:
- You want the highest possible accuracy
- Running on RTX 4090 or better
- Complex multi-step GUI automation tasks

### For Consumer GPUs
Use **Q4_K_M** when:
- Limited VRAM (12GB cards like RTX 4070)
- Need faster inference
- Running on gaming GPUs

## 📄 License

This model is governed by the **GLM-4 License**. Please refer to the original model repository for details:
https://huggingface.co/zai-org/AutoGLM-Phone-9B-Multilingual

## 🙏 Acknowledgments

- **Original Model**: [zai-org/AutoGLM-Phone-9B-Multilingual](https://huggingface.co/zai-org/AutoGLM-Phone-9B-Multilingual)
- **Conversion Tool**: [llama.cpp](https://github.com/ggml-org/llama.cpp)
- **GLM4V Support**: [PR #18042](https://github.com/ggml-org/llama.cpp/pull/18042)

---

**Conversion Date**: 2025-12-29
**llama.cpp Version**: latest (with GLM4V support)
**Tested Hardware**: RTX 4090 24GB