---
base_model: zai-org/AutoGLM-Phone-9B-Multilingual
library_name: gguf
license: other
license_name: glm-4
tags:
- gguf
- llama.cpp
- vision
- multimodal
- autoglm
- phone-agent
- android
- gui-agent
pipeline_tag: text-generation
---

# AutoGLM-Phone-9B-Multilingual (GGUF Quantizations)

This is a **GGUF** quantized version of [zai-org/AutoGLM-Phone-9B-Multilingual](https://huggingface.co/zai-org/AutoGLM-Phone-9B-Multilingual), optimized for local inference with llama.cpp.

**Includes vision encoder (mmproj)** for multimodal capabilities and GUI agent tasks.

## 📦 Model Files

| File | Quantization | Size | VRAM | Description |
|------|-------------|------|------|-------------|
| `AutoGLM-Phone-9B-Multilingual-q4_k_m.gguf` | Q4_K_M | 5.7G | ~10GB | Balanced performance |
| `AutoGLM-Phone-9B-Multilingual-q5_k_m.gguf` | Q5_K_M | 6.6G | ~11GB | High quality |
| `AutoGLM-Phone-9B-Multilingual-q6_k.gguf` | Q6_K | 7.7G | ~12GB | Excellent quality |
| `AutoGLM-Phone-9B-Multilingual-q8_0.gguf` | Q8_0 | 9.4G | ~14GB | Best quality |
| `mmproj-AutoGLM-Phone-9B-Multilingual-F16.gguf` | F16 | 1.7G | - | Vision Encoder (required) |

**Total storage**: ~31GB (all quantizations + vision encoder)

## 🚀 Quick Start

### 1. Install llama.cpp

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Build with CMake (llama.cpp no longer ships a Makefile build); GGML_CUDA enables GPU offload
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j --target llama-server
```

### 2. Download Model

```bash
huggingface-cli download gannima/AutoGLM-Phone-9B-Multilingual-GGUF \
  --local-dir ./AutoGLM-Phone-9B-Multilingual
```

### 3. Run Server

```bash
./build/bin/llama-server \
  -m AutoGLM-Phone-9B-Multilingual/AutoGLM-Phone-9B-Multilingual-q8_0.gguf \
  --mmproj AutoGLM-Phone-9B-Multilingual/mmproj-AutoGLM-Phone-9B-Multilingual-F16.gguf \
  -c 32768 \
  -ngl 99 \
  --flash-attn on \
  --host 0.0.0.0 \
  --port 8080
```

### 4. Use with Open-AutoGLM

```bash
cd Open-AutoGLM
# The task prompt "打开设置应用" means "Open the Settings app"
python main.py \
  --base-url http://localhost:8080 \
  --model "AutoGLM-Phone-9B-Multilingual" \
  --apikey dummy \
  "打开设置应用" \
  --max-steps 20
```

## 💻 Hardware Requirements

### Quick Reference (Tested on RTX 4090)

| Quantization | Model Size | Vision Encoder | Total | Actual VRAM* | Quality |
|--------------|------------|----------------|-------|--------------|---------|
| Q4_K_M | 5.7G | 1.7G | ~7.4G | ~10GB | Good |
| Q5_K_M | 6.6G | 1.7G | ~8.3G | ~11GB | Very Good |
| Q6_K | 7.7G | 1.7G | ~9.4G | ~12GB | Excellent |
| Q8_0 | 9.4G | 1.7G | ~11.1G | ~14GB | Best |

\*VRAM usage measured with `--flash-attn on` and all layers on GPU (`-ngl 99`)

### System Requirements

- **OS**: Linux (Ubuntu 22.04+ recommended) or Windows 11 with WSL2
- **RAM**: 32GB+ system memory recommended
- **Storage**: SSD with enough free space for the model files (~31GB for all quantizations plus the vision encoder)
- **CUDA**: 12.0+ for GPU acceleration
- **llama.cpp**: Latest version with GLM4V support (PR #18042 merged)

### Performance Notes

- **Flash Attention**: Enabled with `--flash-attn on` for better performance
- **KV Cache**: Can be quantized to Q8_0 (`--cache-type-k q8_0 --cache-type-v q8_0`) to reduce memory usage
- **Batch Size**: Optimized for RTX 4090 (adjust based on your GPU)
- **Context**: Supports up to 32K tokens with M-RoPE
- **All layers on GPU**: Set `-ngl 99` to offload all transformer layers to GPU

## 🎯 Recommended Usage

### For GUI Agent Tasks (Recommended)

Use **Q5_K_M** or **Q6_K** for the best balance between quality and performance:

- Better reasoning accuracy than Q4_K_M
- Faster inference than Q8_0
- Lower VRAM usage than Q8_0

### For Maximum Quality

Use **Q8_0** when:

- You want the highest possible accuracy
- You are running on an RTX 4090 or better
- You are running complex multi-step GUI automation tasks

### For Consumer GPUs

Use **Q4_K_M** when:

- VRAM is limited (12GB cards like the RTX 4070)
- You need faster inference
- You are running on gaming GPUs

## 📄 License

This model is governed by the **GLM-4 License**. Please refer to the original model repository for details: https://huggingface.co/zai-org/AutoGLM-Phone-9B-Multilingual

## 🙏 Acknowledgments

- **Original Model**: [zai-org/AutoGLM-Phone-9B-Multilingual](https://huggingface.co/zai-org/AutoGLM-Phone-9B-Multilingual)
- **Conversion Tool**: [llama.cpp](https://github.com/ggml-org/llama.cpp)
- **GLM4V Support**: [PR #18042](https://github.com/ggml-org/llama.cpp/pull/18042)

---

**Conversion Date**: 2025-12-29

**llama.cpp Version**: latest (with GLM4V support)

**Tested Hardware**: RTX 4090 24GB
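
As a rough aid for choosing among the quantizations listed under Hardware Requirements, the measured VRAM figures can be turned into a small lookup. This is a minimal sketch, not part of llama.cpp or Open-AutoGLM: the function name `pick_quantization` and the `headroom_gb` safety margin are hypothetical, and the numbers are copied from the RTX 4090 measurements in this card, so actual usage on other GPUs, context sizes, or flags may differ.

```python
# VRAM figures (GB) copied from the Hardware Requirements table in this card
# (measured on an RTX 4090 with --flash-attn on and -ngl 99).
# Ordered from highest to lowest quality.
MEASURED_VRAM_GB = [
    ("Q8_0", 14),
    ("Q6_K", 12),
    ("Q5_K_M", 11),
    ("Q4_K_M", 10),
]


def pick_quantization(available_vram_gb, headroom_gb=0.0):
    """Return the highest-quality quantization whose measured VRAM fits.

    headroom_gb is a hypothetical safety margin reserved for the desktop,
    other processes, or measurement differences on other hardware.
    Returns None when even Q4_K_M does not fit in the remaining budget.
    """
    budget = available_vram_gb - headroom_gb
    for name, vram_gb in MEASURED_VRAM_GB:
        if vram_gb <= budget:
            return name
    return None


if __name__ == "__main__":
    for vram in (8, 12, 16, 24):
        print(f"{vram} GB -> {pick_quantization(vram, headroom_gb=2)}")
```

With a 2 GB margin, a 12GB card maps to Q4_K_M, which matches the consumer GPU recommendation above; with no margin the same card would map to Q6_K, leaving nothing spare for other processes.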