And where is the GGUF file itself?

by Anonimus12345678902 - opened Jul 18, 2024

Discussion

Anonimus12345678902

Jul 18, 2024

And where is the GGUF file itself?

Henk717

Jul 18, 2024 • edited Jul 18, 2024

Give the guy some time; new repos always get made first so the upload scripts can do their job.
It may not even be possible to convert it yet.

MaziyarPanahi

Owner Jul 19, 2024

Sorry, I launched this last night thinking it's the exact same model as Mistral 7B, so it should all be fine. However, it uses a slightly different tokenizer.

I am helping test this PR; once it's resolved, it should be pretty quick :)
https://github.com/ggerganov/llama.cpp/pull/8579

Abdelhak

Jul 20, 2024

@MaziyarPanahi Kindly let us know when the quants are ready :)
Thank you.

MaziyarPanahi

Owner Jul 20, 2024

Of course, the PR is ready to be merged. So hopefully it will be ready today :)

mirek190

Jul 20, 2024

It's merged ;)

MaziyarPanahi

Owner Jul 20, 2024

The PR seems to be just one piece of the support for Mistral-Nemo-Instruct-2407. It may need a few more PRs.
I'll keep an eye on it and upload the quants the moment it's possible.

Abdelhak

Jul 20, 2024

> The PR seems to be just one piece of the support for Mistral-Nemo-Instruct-2407. It may need a few more PRs.
> I'll keep an eye on it and upload the quants the moment it's possible.

@MaziyarPanahi Does that mean it's just a workaround and not a fix?

segaa

Jul 20, 2024 • edited Jul 20, 2024

> @MaziyarPanahi Does that mean it's just a workaround and not a fix?

It's not a workaround; it's just one part of the solution to the whole "Support Mistral-Nemo-Instruct-2407 128K" issue.
If you try to use only this part, the model will start loading but then fail with a wrong tensor shape error, because Mistral-Nemo uses non-standard tensor shapes.
The llama.cpp team is currently working on that part of the issue.

MaziyarPanahi

Owner Jul 22, 2024

The last PR is merged and the models are being uploaded!

saishf

Jul 22, 2024

> The last PR is merged and the models are being uploaded!

Can confirm that they work :3
I've tested Q4_K_S with llama.cpp build b3437 and it's coherent up to 16K context, with cache quantization enabled too.

MaziyarPanahi

Owner Jul 22, 2024

Nice!!!! Love to see how far we can go with the context length here! :D

ubergarm

Jul 22, 2024 • edited Jul 22, 2024

Thanks for the fine quants!

I threw a friend's 450-page Ph.D. dissertation (just over 50k tokens) at the Q8_0 and it generally returned a rough summary. I can almost fit 128k context on my 3090 Ti (24 GB VRAM); I had to dial it back just a bit to avoid OOM when offloading all layers.

I'll likely use this model to experiment with quickly generating summaries of medium-sized chunks of text (up to 16k or 32k tokens).

Runtime

$ ./llama-server --version
version: 3441 (081fe431)
built with cc (GCC) 14.1.1 20240522 for x86_64-pc-linux-gnu

$ ./llama-server \
    --model "../models/MaziyarPanahi/Mistral-Nemo-Instruct-2407-GGUF/Mistral-Nemo-Instruct-2407.Q8_0.gguf" \
    --n-gpu-layers 41 \
    --ctx-size 102400 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --threads 8 \
    --flash-attn \
    --mlock \
    --n-predict -1 \
    --host 127.0.0.1 \
    --port 8080
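
For a rough sense of why a 102400-token context fits on a 24 GB card while the full 128k was a bit too much, the KV cache size can be estimated from the settings in the command above. This is a back-of-the-envelope sketch and not part of the original post: the layer count (40), KV head count (8), and head dimension (128) are assumptions taken from the published Mistral-Nemo config, and q8_0 is assumed to store 32 values in 34 bytes. Model weights, compute buffers, and other overhead are ignored.

```python
# Rough KV-cache size estimate for --ctx-size with --cache-type-k/v q8_0.
# ASSUMPTIONS (not stated in the thread): 40 layers, 8 KV heads, head dim 128,
# and a q8_0 block of 32 int8 values plus one fp16 scale (34 bytes per 32 values).

GIB = 1024 ** 3


def kv_cache_bytes(ctx_size, n_layers=40, n_kv_heads=8, head_dim=128,
                   bytes_per_value=34 / 32):
    # Each token stores one K and one V vector per layer.
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    return ctx_size * values_per_token * bytes_per_value


for ctx in (102400, 131072):
    print(f"ctx={ctx}: {kv_cache_bytes(ctx) / GIB:.1f} GiB KV cache")
```

Under these assumptions the cache comes to roughly 8.3 GiB at 102400 tokens and roughly 10.6 GiB at 131072 tokens, so the difference of a couple of GiB is plausibly what tips a fully offloaded Q8_0 model over the 24 GB limit.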

Client Config

{
    "temperature": 0.2,
    "top_k": 40,
    "top_p": 0.95,
    "min_p": 0.05,
    "repeat_penalty": 1.1,
    "n_predict": -1,
    "seed": -1
}
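
As a minimal sketch of how these settings could be sent to the server started above, the snippet below posts them to llama-server's `/completion` endpoint using only the Python standard library. The post does not say which client was used, so the endpoint choice, the `SERVER_URL` constant, and the helper names `build_request` and `complete` are assumptions made for illustration.

```python
import json
import urllib.request

# Host and port match the --host/--port flags in the launch command.
SERVER_URL = "http://127.0.0.1:8080/completion"

# Sampling settings copied from the client config in the post.
SAMPLING = {
    "temperature": 0.2,
    "top_k": 40,
    "top_p": 0.95,
    "min_p": 0.05,
    "repeat_penalty": 1.1,
    "n_predict": -1,
    "seed": -1,
}


def build_request(prompt: str) -> urllib.request.Request:
    """Build a POST request carrying the prompt plus the sampling settings."""
    body = json.dumps({"prompt": prompt, **SAMPLING}).encode("utf-8")
    return urllib.request.Request(
        SERVER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the request to a running llama-server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["content"]
```

With the server running, `complete("[INST] Summarize the following text: ... [/INST]")` would return the generated text; nothing is sent until `complete` is called.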

Mixtral Prompt Format

Make sure to use the correct Mixtral prompt format, being mindful of preserving whitespace and of whether and how you fudge in a "system prompt".

With the wrong prompt format, e.g. ChatML, the model sometimes evaluates the entire prompt and immediately returns end of string, generating nothing.

[INST] Just tell it what to do here without system prompt and keep the space in front. [/INST]
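
A small helper makes the whitespace in the template above harder to get wrong. This is only a sketch of the single-turn case without a system prompt, exactly as in the example line; the function name `format_instruction` is hypothetical, and multi-turn or system-prompt handling is deliberately left out because the post does not spell it out.

```python
def format_instruction(instruction: str) -> str:
    """Wrap a single instruction in [INST] ... [/INST] with no system prompt.

    The space after "[INST]" and the space before "[/INST]" are kept on purpose,
    mirroring the example line in the post.
    """
    return f"[INST] {instruction.strip()} [/INST]"


print(format_instruction("Summarize the text below in five bullet points."))
```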

Example Timings

INFO [           print_timings] prompt eval time     =   34172.36 ms / 51617 tokens (    0.66 ms per token,  1510.49 tokens per second) | tid="125836361121792" timestamp=1721676458 id_slot=0 id_task=1010 t_prompt_processing=34172.36 n_prompt_tokens_processed=51617 t_token=0.6620369258190131 n_tokens_second=1510.489764242212
INFO [           print_timings] generation eval time =   25648.80 ms /   557 runs   (   46.05 ms per token,    21.72 tokens per second) | tid="125836361121792" timestamp=1721676458 id_slot=0 id_task=1010 t_token_generation=25648.798 n_decoded=557 t_token=46.04811131059246 n_tokens_second=21.716417276162417
INFO [           print_timings]           total time =   59821.16 ms | tid="125836361121792" timestamp=1721676458 id_slot=0 id_task=1010 t_prompt_processing=34172.36 t_token_generation=25648.798 t_total=59821.157999999996
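
The per-token and per-second figures in these log lines follow directly from the elapsed time and the token count. The sketch below recomputes them from shortened copies of the first two lines; the regular expression and the `throughput` helper are illustrative assumptions, not part of llama.cpp.

```python
import re

# Shortened copies of the two timing lines from the log above.
LOG_LINES = [
    "prompt eval time     =   34172.36 ms / 51617 tokens",
    "generation eval time =   25648.80 ms /   557 runs",
]

PATTERN = re.compile(r"(?P<label>[\w ]+?)\s*=\s*(?P<ms>[\d.]+) ms /\s*(?P<count>\d+)")


def throughput(line: str):
    """Return (label, ms per token, tokens per second) for one timing line."""
    m = PATTERN.search(line)
    ms = float(m["ms"])
    count = int(m["count"])
    return m["label"].strip(), ms / count, count / (ms / 1000.0)


for line in LOG_LINES:
    label, ms_per_token, tokens_per_second = throughput(line)
    print(f"{label}: {ms_per_token:.2f} ms/token, {tokens_per_second:.2f} tokens/s")
```

Rounded to two decimals, this reproduces the logged 0.66 ms per token and 1510.49 tokens per second for prompt processing, and 46.05 ms per token and 21.72 tokens per second for generation.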

Cheers!
