MXFP4_MOE

by marcelone - opened Aug 16, 2025

Discussion

marcelone

Aug 16, 2025

Please add MXFP4_MOE.

Jeol

Jinx org Aug 18, 2025

As far as I know, the modified GPT-OSS can no longer support MXFP4. If you know of any way to support MXFP4, please let us know.

Best,
Jinx Team.

Maria99934

Aug 18, 2025

Hi! It exists: someone made one, and I downloaded it but then deleted it due to lack of space.

In the end I decided to settle on it, since it was the best fit, but by then it was no longer on Hugging Face; it had probably been deleted.

Maria99934

Aug 18, 2025 • edited Aug 18, 2025

Your Jinx versions are great! They support thinking in Russian, which helps a lot in my work.
The original GPT-OSS thinks only in English.

Thanks to the JINX team!

Jeol

Jinx org Aug 18, 2025

Thanks for the information. I found that it does support MXFP4_MOE, so I converted an MXFP4_MOE version and manually tested some queries. The safety filter comes back out of nowhere, and I don't know why, so I will skip making an MXFP4_MOE version for now. Maybe you can try it yourself.
https://github.com/ggml-org/llama.cpp/blob/618575c5825d7d4f170e686e772178d2aae148ae/tools/quantize/quantize.cpp#L25

Best,
Jinx Team

marcelone

Aug 21, 2025 • edited Aug 21, 2025

Try adding --tensor-type ".*ffn_gate_exps.weight=Q8_0" (or Q4_1 instead of Q8_0).

Lissanro

Aug 24, 2025 • edited Aug 24, 2025

Notably, the "safety" filter coming back with MXFP4 was also reported for another uncensored model: https://huggingface.co/huizimao/gpt-oss-20b-uncensored-mxfp4 (its model card mentions that the refusal rate increases many times over when MXFP4 is used). So this may be a general GPT-OSS problem rather than one specific to Jinx. My guess is that the model was fine-tuned in BF16, and quantizing to MXFP4 loses part of the fine-tuning to rounding errors, perhaps because an MXFP4 quant needs further fine-tuning directly on its MXFP4 weights after quantization to give the best results.

In any case, thanks to Jinx for sharing a great uncensored model. I think this is the first uncensored GPT-OSS model that has preserved intelligence properly. It would be great if a 120B version could be made too, if possible (even with just standard quantization, it would still be awesome).
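
The rounding hypothesis in the post above can be illustrated with a toy simulation. This is only a sketch, not llama.cpp's actual MXFP4 kernel: it assumes blocks of 32 values with one shared power-of-two scale and FP4 (E2M1) elements, following the OCP Microscaling convention as I understand it, and it uses random numbers in place of real model weights. The function names and the `delta_scale` value are made up for the illustration. It measures how often a small "fine-tuning" delta changes the quantized value at all.

```python
import math
import random

# Magnitudes representable by an FP4 E2M1 element (sign is handled separately).
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # assumption: one shared scale per 32 elements


def quantize_block_mxfp4(block):
    """Round one block to FP4 with a shared power-of-two scale, then dequantize."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    # Shared scale: align the block maximum with FP4's largest exponent (2).
    scale = 2.0 ** (math.floor(math.log2(max_abs)) - 2)
    out = []
    for x in block:
        target = min(abs(x) / scale, 6.0)  # clamp to the largest FP4 magnitude
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        out.append(math.copysign(nearest * scale, x))
    return out


def quantize_mxfp4(weights):
    """Quantize a flat list of weights block by block."""
    out = []
    for i in range(0, len(weights), BLOCK_SIZE):
        out.extend(quantize_block_mxfp4(weights[i:i + BLOCK_SIZE]))
    return out


def fraction_of_deltas_lost(num_weights=4096, delta_scale=0.01, seed=0):
    """Fraction of weights whose quantized value ignores the fine-tuning delta."""
    rng = random.Random(seed)
    base = [rng.gauss(0.0, 1.0) for _ in range(num_weights)]
    tuned = [w + rng.gauss(0.0, delta_scale) for w in base]  # hypothetical small update
    q_base = quantize_mxfp4(base)
    q_tuned = quantize_mxfp4(tuned)
    unchanged = sum(1 for a, b in zip(q_base, q_tuned) if a == b)
    return unchanged / num_weights


if __name__ == "__main__":
    lost = fraction_of_deltas_lost()
    print(f"{lost:.1%} of weights quantize to the same value before and after the update")
```

In this toy setting most small deltas land inside the same rounding bucket and disappear after quantization. That is consistent with the guess above but does not confirm it for the real model, where the size and structure of the fine-tuning update are unknown.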

Joseph717171

Aug 24, 2025 • edited Aug 24, 2025

Try upcasting the model weights to F32 (convert the model to F32 instead of BF16), and then quantize to GGUF MXFP4_MOE. 😋
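
For anyone who wants to try this, the two steps might look roughly like the sketch below. It only builds and prints the command lines (a dry run), and it rests on several assumptions: the paths and the helper name are hypothetical, the commands are meant to be run from a llama.cpp checkout that has been built, and the optional `--tensor-type` override is the one suggested earlier in this thread. Check `convert_hf_to_gguf.py --help` and `llama-quantize --help` against your own build before running anything.

```python
import shlex


def build_mxfp4_commands(model_dir, work_dir, tensor_type_overrides=()):
    """Return two command lines: HF checkpoint -> F32 GGUF, then F32 GGUF -> MXFP4_MOE."""
    f32_gguf = f"{work_dir}/model-f32.gguf"          # hypothetical output name
    mxfp4_gguf = f"{work_dir}/model-mxfp4_moe.gguf"  # hypothetical output name

    # Step 1: convert with F32 output instead of BF16.
    convert = [
        "python", "convert_hf_to_gguf.py", model_dir,
        "--outtype", "f32",
        "--outfile", f32_gguf,
    ]

    # Step 2: quantize the F32 GGUF to MXFP4_MOE, with optional per-tensor overrides.
    quantize = ["./build/bin/llama-quantize"]
    for override in tensor_type_overrides:
        quantize += ["--tensor-type", override]
    quantize += [f32_gguf, mxfp4_gguf, "MXFP4_MOE"]

    return [convert, quantize]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in build_mxfp4_commands("./Jinx-gpt-oss-20b", "./out"):
        print(shlex.join(cmd))
```

Passing `tensor_type_overrides=[".*ffn_gate_exps.weight=Q8_0"]` adds the override to the quantize step. Keep in mind that the F32 intermediate file is about twice the size of a BF16 one.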

Joseph717171

Aug 24, 2025 • edited Aug 24, 2025

https://huggingface.co/Joseph717171/Jinx-gpt-OSS-20B-GGUF

Thanks to the Jinx team for your wonderful work on GPT-OSS-20B. I upcast the model weights to F32 and then quantized to GGUF MXFP4_MOE.

ParthProLegend

Oct 11, 2025

Is the refusal problem when using MXFP4_MOE still present?

Jeol

Jinx org Oct 11, 2025

Hi @ParthProLegend, you can try the MXFP4 model we provide. It worked fine when we tested it. Looking forward to hearing your feedback!

ParthProLegend

Oct 14, 2025 • edited Oct 14, 2025

I tested them (Q5 and MXFP4). MXFP4 performs better than Q5, and the "safety" filter is NOT coming back. MXFP4 also gives higher-quality outputs than Q5. Thanks for your excellent work, man, much appreciated.
