And where is the GGUF file itself?
Give the guy some time; new repos always get made first so the upload scripts can do their job.
It may not even be possible to convert it yet.
Sorry, I launched this last night thinking it was the exact same model as Mistral 7B, so everything should be fine. However, they are using a slightly different (though not that different) tokenizer.
I am helping test this PR; once it's resolved, it should be pretty quick :)
https://github.com/ggerganov/llama.cpp/pull/8579
Of course, the PR is ready to be merged. So hopefully it will be ready today :)
is merged ;)
The PR seems to be just a piece of support for Mistral-Nemo-Instruct-2407. It may need a few more PRs.
I'll keep an eye on it and upload the quants the moment it's possible.
@MaziyarPanahi Does that mean it's just a workaround and not a fix?
It's not a workaround; it's just one part of the solution to the whole "Support Mistral-Nemo-Instruct-2407 128K" issue.
If you try to use this part only, the model will start loading, but then it will fail with a wrong tensor shape error because Mistral-Nemo uses non-standard tensor shapes.
The llama.cpp team is currently working on this part of the issue.
The last PR is merged and the models are being uploaded!
Can confirm that they work :3
I've tested Q4_K_S with b3437 and it's coherent to 16K, with cache quant too
Nice!!!! Love to see how far we can go with the context length here! :D
Thanks for the fine quants!
I threw a friend's 450-page Ph.D. dissertation (just over 50k tokens) at the Q8_0 and it generally returned a rough summary. I can almost fit 128k context on my 3090 Ti (24 GB VRAM); I had to dial it back just a bit to avoid OOM when offloading all layers.
I'll likely use this model to experiment with quickly generating summaries of medium-sized chunks of text (up to 16k or 32k tokens).
Runtime
$ ./llama-server --version
version: 3441 (081fe431)
built with cc (GCC) 14.1.1 20240522 for x86_64-pc-linux-gnu
$ ./llama-server \
--model "../models/MaziyarPanahi/Mistral-Nemo-Instruct-2407-GGUF/Mistral-Nemo-Instruct-2407.Q8_0.gguf" \
--n-gpu-layers 41 \
--ctx-size 102400 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--threads 8 \
--flash-attn \
--mlock \
--n-predict -1 \
--host 127.0.0.1 \
--port 8080
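For anyone wondering why 102400 fits but the full 128k does not quite, here is a rough back-of-the-envelope sketch of the KV cache size. The architecture numbers (40 layers, 8 KV heads, head dimension 128) are my assumption from the published Mistral-Nemo config and are not stated in this thread; the q8_0 block size (34 bytes per 32 values) is the ggml layout as I understand it. The estimate ignores the model weights, compute buffers, and any other overhead.

```python
# Rough KV cache size estimate for llama.cpp.
# All architecture values below are ASSUMPTIONS, not taken from this thread.
N_LAYERS = 40      # assumed number of transformer layers
N_KV_HEADS = 8     # assumed number of key/value heads (grouped-query attention)
HEAD_DIM = 128     # assumed per-head dimension

# Bytes per block of 32 cached values.
# f16: 32 values * 2 bytes; q8_0: 32 int8 values plus one fp16 scale.
BLOCK_BYTES = {"f16": 64, "q8_0": 34}


def kv_cache_bytes(ctx_size: int, cache_type: str = "f16") -> int:
    """Estimate the bytes needed for the K and V caches at a given context size."""
    values_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # K and V together
    total_values = ctx_size * values_per_token
    return total_values // 32 * BLOCK_BYTES[cache_type]


def gib(n_bytes: int) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 2**30


for ctx in (102400, 131072):
    for cache_type in ("f16", "q8_0"):
        size = gib(kv_cache_bytes(ctx, cache_type))
        print(f"ctx={ctx} cache={cache_type}: {size:.1f} GiB")
```

Under these assumptions the q8_0 cache comes to roughly 8.3 GiB at 102400 tokens and roughly 10.6 GiB at 131072, which is in line with having to dial the context back a bit on a 24 GB card once the Q8_0 weights are loaded too.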
Client Config
{
  "temperature": 0.2,
  "top_k": 40,
  "top_p": 0.95,
  "min_p": 0.05,
  "repeat_penalty": 1.1,
  "n_predict": -1,
  "seed": -1
}
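In case it helps, here is a minimal sketch of how these settings could be sent to the server from Python using only the standard library. It assumes the server is listening on 127.0.0.1:8080 as in the command above and that you are using llama-server's `/completion` endpoint; the helper names `build_payload` and `complete` are made up for illustration.

```python
import json
import urllib.request

# Sampling settings copied from the client config in this post.
SAMPLING = {
    "temperature": 0.2,
    "top_k": 40,
    "top_p": 0.95,
    "min_p": 0.05,
    "repeat_penalty": 1.1,
    "n_predict": -1,
    "seed": -1,
}


def build_payload(prompt: str) -> bytes:
    """Merge the prompt with the sampling settings and encode as JSON."""
    return json.dumps({"prompt": prompt, **SAMPLING}).encode("utf-8")


def complete(prompt: str, base_url: str = "http://127.0.0.1:8080") -> str:
    """POST the prompt to the /completion endpoint and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/completion",
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["content"]
```

With the server running, something like `complete("[INST] Summarize the following text: ... [/INST]")` should return the generated text as a string.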
Mixtral Prompt Format
Make sure to use the correct Mixtral prompt format, taking care to preserve whitespace and deciding whether (and how) to fudge in a "system prompt".
With the wrong prompt format, e.g. ChatML, it sometimes evaluates the entire prompt and immediately returns end-of-string, generating nothing.
[INST] Just tell it what to do here without system prompt and keep the space in front. [/INST]
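A small helper along these lines keeps the whitespace consistent. This is only a sketch: the function name `format_prompt` is made up, and prepending the system text to the instruction is my assumption about one way to "fudge in" a system prompt, not something the model card confirms.

```python
from typing import Optional


def format_prompt(instruction: str, system: Optional[str] = None) -> str:
    """Wrap an instruction in [INST] ... [/INST] tags.

    Prepending the system text to the instruction is an ASSUMPTION about how
    to work in a system prompt; the exact layout is not specified here.
    """
    body = instruction.strip()
    if system:
        body = f"{system.strip()}\n\n{body}"
    # One space after [INST] and one before [/INST], matching the example line.
    return f"[INST] {body} [/INST]"


print(format_prompt("Summarize the following text in five bullet points."))
```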
Example Timings
INFO [ print_timings] prompt eval time = 34172.36 ms / 51617 tokens ( 0.66 ms per token, 1510.49 tokens per second) | tid="125836361121792" timestamp=1721676458 id_slot=0 id_task=1010 t_prompt_processing=34172.36 n_prompt_tokens_processed=51617 t_token=0.6620369258190131 n_tokens_second=1510.489764242212
INFO [ print_timings] generation eval time = 25648.80 ms / 557 runs ( 46.05 ms per token, 21.72 tokens per second) | tid="125836361121792" timestamp=1721676458 id_slot=0 id_task=1010 t_token_generation=25648.798 n_decoded=557 t_token=46.04811131059246 n_tokens_second=21.716417276162417
INFO [ print_timings] total time = 59821.16 ms | tid="125836361121792" timestamp=1721676458 id_slot=0 id_task=1010 t_prompt_processing=34172.36 t_token_generation=25648.798 t_total=59821.157999999996
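The per-token and per-second figures in those log lines are just the elapsed time divided by the token count (and the inverse). A quick sketch to recompute them from the raw numbers, with helper names of my own choosing:

```python
def tokens_per_second(elapsed_ms: float, n_tokens: int) -> float:
    """Throughput: tokens divided by elapsed time in seconds."""
    return n_tokens / (elapsed_ms / 1000.0)


def ms_per_token(elapsed_ms: float, n_tokens: int) -> float:
    """Latency: elapsed milliseconds divided by token count."""
    return elapsed_ms / n_tokens


# Raw values taken from the print_timings lines in this post.
prompt_ms, prompt_tokens = 34172.36, 51617
gen_ms, gen_tokens = 25648.798, 557

prompt_tps = tokens_per_second(prompt_ms, prompt_tokens)
gen_tps = tokens_per_second(gen_ms, gen_tokens)
total_ms = prompt_ms + gen_ms

print(f"prompt: {ms_per_token(prompt_ms, prompt_tokens):.2f} ms/token, {prompt_tps:.2f} tokens/s")
print(f"generation: {ms_per_token(gen_ms, gen_tokens):.2f} ms/token, {gen_tps:.2f} tokens/s")
print(f"total: {total_ms:.2f} ms")
```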
Cheers!