Guidance Needed: GPT-OSS 20B Fine-Tuning with Unsloth → GGUF → Ollama → Triton (vLLM / TensorRT-LLM)

#225
by GauravEA - opened

I am currently fine-tuning the GPT-OSS 20B model using Unsloth with Hugging Face TRL's SFTTrainer.

Long-term goal: serve the model in production using Triton with either vLLM or TensorRT-LLM as the backend.

Short-term: an initial deployment with Ollama (GGUF).

Current challenge
GPT-OSS uses a Harmony-style chat template, which includes:

A developer role

Explicit EOS handling

Thinking / analysis channels

A tool / function-calling structure
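For context, the Harmony layout can be sketched roughly as below. Token names (`<|start|>`, `<|channel|>`, `<|message|>`, `<|end|>`, `<|return|>`) follow OpenAI's published Harmony format; this is an illustrative renderer only, not the official Jinja template, and it omits tool calls and the system header:

```python
# Illustrative sketch of the Harmony message layout -- in practice, use
# tokenizer.apply_chat_template with the model's shipped Jinja template.

def render_harmony(messages):
    """Render a list of {role, content} dicts into Harmony-style text.

    Assistant turns may carry a 'channel' key ('analysis' for internal
    reasoning, 'final' for the user-visible answer).
    """
    parts = []
    for m in messages:
        channel = f"<|channel|>{m['channel']}" if "channel" in m else ""
        # The final assistant message ends with <|return|> (the stop
        # token at inference time); all other turns end with <|end|>.
        end = "<|return|>" if m.get("channel") == "final" else "<|end|>"
        parts.append(f"<|start|>{m['role']}{channel}<|message|>{m['content']}{end}")
    return "".join(parts)

prompt = render_harmony([
    {"role": "developer", "content": "Answer briefly."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "channel": "analysis", "content": "Trivial sum."},
    {"role": "assistant", "channel": "final", "content": "4"},
])
```

This is why EOS handling is subtle: stopping on `<|end|>` would truncate generation after the analysis channel, while `<|return|>` marks the end of the whole response.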

When converting the fine-tuned model to GGUF and deploying it in Ollama using the default GPT-OSS Modelfile, I am running into ambiguity around:

Whether the default Jinja chat template provided by GPT-OSS should be modified for Ollama compatibility

How to correctly handle:

EOS token behavior

Internal reasoning / analysis channels

Developer role alignment

How to do this without degrading the model’s default performance or alignment
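On the EOS question specifically, one hedged starting point is to keep the TEMPLATE block from the stock GPT-OSS Modelfile (e.g. via `ollama show gpt-oss:20b --modelfile`) and only point FROM at the fine-tuned GGUF, registering the Harmony end-of-response token as a stop sequence. A minimal sketch (file names are placeholders, and the stop tokens are an assumption to verify against the model's tokenizer):

```python
# Sketch of a minimal Ollama Modelfile for a Harmony-format GGUF.
# Key point: stop on <|return|> (end of the whole response), not on
# <|end|>, which would cut generation off after the analysis channel.

modelfile = """\
FROM ./gpt-oss-20b-finetuned.gguf
PARAMETER stop <|return|>
"""

with open("Modelfile", "w") as f:
    f.write(modelfile)

# then: ollama create gpt-oss-ft -f Modelfile
```

In practice you would append the stock TEMPLATE block to this file rather than writing a new one, to preserve the default role and channel handling.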

Constraints / Intent

I already have training data prepared strictly in system / user / assistant format

I want to:

Preserve GPT-OSS’s native behavior as much as possible

Perform accurate, non-destructive fine-tuning

Avoid hacks that work short-term but break compatibility with vLLM / TensorRT-LLM later
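One non-destructive way to reconcile system/user/assistant data with Harmony is to remap roles before tokenization: in Harmony, the system turn is reserved for model meta-configuration, and what most datasets call a "system" prompt typically belongs in the developer role. A sketch (the mapping is an assumption to verify against the model's shipped chat template; `to_harmony_roles` is a hypothetical helper):

```python
# Sketch: adapt system/user/assistant training records to Harmony's
# role conventions, leaving the content untouched.

def to_harmony_roles(example):
    mapped = []
    for msg in example["messages"]:
        # Dataset "system" prompts become developer instructions;
        # user and assistant roles pass through unchanged.
        role = "developer" if msg["role"] == "system" else msg["role"]
        mapped.append({"role": role, "content": msg["content"]})
    return {"messages": mapped}

record = {"messages": [
    {"role": "system", "content": "You are a support bot."},
    {"role": "user", "content": "Reset my password."},
    {"role": "assistant", "content": "Sure, here are the steps."},
]}
harmony_record = to_harmony_roles(record)
```

Because the remapped records still go through the model's own chat template, this avoids hand-editing the template itself.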

What I’m looking for

Has anyone successfully:

Fine-tuned GPT-OSS

Converted it to GGUF

Deployed it with Ollama

While preserving the Harmony template behavior?

If yes:

Did you modify the chat template / Modelfile?

How did you handle EOS + reasoning channels?

Any pitfalls to avoid to keep it production-ready for Triton later?

Any concrete guidance, references, or proven setups would be extremely helpful.

Following up on this.

Hi @GauravEA - you can use Triton Inference Server with the vLLM backend.
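For the Triton + vLLM path, the model repository layout centers on a `model.json` of vLLM engine arguments. A sketch (paths and values are placeholders to tune for your GPU; note that vLLM loads the merged Hugging Face checkpoint here, not the GGUF, so keep the HF-format weights around):

```python
# Sketch of a Triton vLLM-backend model repository: version directory
# 1/ holds model.json with vLLM AsyncEngine arguments.
import json
import os

engine_args = {
    "model": "./gpt-oss-20b-finetuned",  # merged HF checkpoint, not GGUF
    "gpu_memory_utilization": 0.9,
    "max_model_len": 8192,
}

os.makedirs("model_repository/gpt_oss/1", exist_ok=True)
with open("model_repository/gpt_oss/1/model.json", "w") as f:
    json.dump(engine_args, f, indent=2)
```

A `config.pbtxt` naming the vLLM backend sits alongside the version directory; because vLLM applies the checkpoint's own chat template, the Harmony behavior preserved during fine-tuning carries over without Ollama-specific template edits.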

I am working on the same setup, but I want to deploy on Ollama first; for production I am going with Triton + vLLM.
