TrimKV-Phi-3-mini-128k-instruct

TRIM-KV is a learnable key–value (KV) cache eviction strategy designed to make large language model (LLM) inference more efficient over long horizons.

This model is a version of Phi-3-mini-128k-instruct enhanced with TRIM-KV, as presented in the paper Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs.

The core idea behind TRIM-KV is to learn the intrinsic importance of each key–value pair at creation time, which we call token retention, and then decay this importance exponentially over time to mimic standard inference running with eviction.
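As a rough illustration of the idea above, the sketch below simulates a cache with a fixed budget. It assumes each token i receives a retention score beta_i in (0, 1) when it is created and that its importance at step t is beta_i ** (t - i). The function name, the example scores, and this exact decay form are illustrative assumptions, not the TRIM-KV implementation.

```python
def evict_by_retention(retention, memory_size):
    """Simulate a bounded cache and return the positions kept at the end.

    retention[i] is the score assigned to token i at creation time.
    Assumed decay rule (illustrative): at step t, token i is worth
    retention[i] ** (t - i), so a new token always starts at 1.0.
    """
    cache = []  # positions currently held in the cache
    for t in range(len(retention)):
        cache.append(t)
        if len(cache) > memory_size:
            # Evict the token whose decayed importance is lowest right now.
            victim = min(cache, key=lambda i: retention[i] ** (t - i))
            cache.remove(victim)
    return cache


# Hypothetical retention scores for six tokens.
betas = [0.99, 0.50, 0.95, 0.40, 0.90, 0.60]
print(evict_by_retention(betas, memory_size=3))  # prints [0, 4, 5]
```

Token 0 survives because its score decays slowly, while tokens with low scores are dropped soon after they enter the cache.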

The retention score is query-agnostic and captures the long-term utility of tokens. This is different from attention scores, which are query-dependent: they capture the short-term utility for predicting the next token and are recomputed at every step.
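The contrast can be made concrete with a toy example. The sketch below computes softmax attention weights over two cached keys for two different queries and compares them with a fixed per-token score; the vectors and the `retention` values are made up for illustration and are not taken from the model.

```python
import math


def attention_weights(query, keys):
    """Softmax over dot products between one query and each cached key."""
    logits = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


keys = [[1.0, 0.0], [0.0, 1.0]]
retention = [0.9, 0.3]  # hypothetical scores, fixed when each token is created

w_a = attention_weights([2.0, 0.0], keys)  # this query favors token 0
w_b = attention_weights([0.0, 2.0], keys)  # this query favors token 1

print("attention for query A:", [round(w, 3) for w in w_a])
print("attention for query B:", [round(w, 3) for w in w_b])
print("retention (same for both queries):", retention)
```

The attention ranking of the two tokens flips between the queries, whereas the retention ranking does not depend on the query at all.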

Paper: Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Code: GitHub - ngocbh/trimkv
arXiv: 2512.03324

Getting Started

Requirements

Python 3.11 or higher
PyTorch 2.7.0 or higher
FlashAttention 2.7.2.post1 or higher
Transformers 4.57.1
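A quick way to compare an environment against the list above is to query installed package metadata. This is a minimal sketch using only the standard library; the distribution names (`torch`, `flash-attn`, `transformers`) are assumed to match the PyPI packages, and the comparison looks only at the leading numeric components, so suffixes such as `.post1` or `+cu126` are ignored.

```python
import re
import sys
from importlib import metadata


def numeric_version(text):
    """Leading numeric components, e.g. '2.7.2.post1' -> (2, 7, 2)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the numeric part of `installed` is at least that of `minimum`."""
    a, b = numeric_version(installed), numeric_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(name):
    """Installed version string for a distribution, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


# Minimums copied from the requirements list; names are assumed PyPI names.
MINIMUMS = {"torch": "2.7.0", "flash-attn": "2.7.2.post1"}

print("python >= 3.11:", sys.version_info >= (3, 11))
for name, minimum in MINIMUMS.items():
    found = installed_version(name)
    ok = found is not None and meets_minimum(found, minimum)
    print(f"{name}: found {found}, need >= {minimum}: {'ok' if ok else 'check'}")
print("transformers: found", installed_version("transformers"), "(listed: 4.57.1)")
```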

Installation

pip install trimkv

Quick Start

import torch
from trimkv.models.phi3 import TrimKVPhi3ForCausalLM
from trimkv.cache_utils import TrimKVCache
from transformers import AutoTokenizer

model_path = "ngocbh/TrimKV-Phi-3-mini-128k-instruct"

model = TrimKVPhi3ForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    load_trimkv_weights=True,
    use_cache=True,
    device_map="cuda",
    trust_remote_code=True,
)

# Configure TRIM-KV settings
model.config._attn_implementation = "flash_attention_2"
model.config.compress_memory = True
model.config.memory_size = 512
model.config.buffer_size = 128

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    use_fast=True,
    padding_side="left",
)

# Call model.generate as usual. TRIM-KV relies on TrimKVCache under the hood,
# so pass a TrimKVCache instance as past_key_values in the model.generate call.
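To get a feel for what the `memory_size` and `buffer_size` settings above could mean in bytes, here is a back-of-the-envelope calculation. It rests on assumptions not stated in this card: the layer and head dimensions come from the public Phi-3-mini config (32 layers, 32 KV heads, head dimension 96), the cache is stored in bfloat16 at batch size 1, and the bounded cache is taken to hold `memory_size + buffer_size` tokens per head.

```python
# Phi-3-mini dimensions from its public config (assumed, not stated in this card).
NUM_LAYERS = 32
NUM_KV_HEADS = 32
HEAD_DIM = 96
BYTES_PER_VALUE = 2  # bfloat16


def kv_cache_bytes(num_tokens):
    """Bytes needed to store keys and values for `num_tokens` tokens, batch size 1."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * num_tokens


full = kv_cache_bytes(128 * 1024)   # full 128k-token context, no eviction
bounded = kv_cache_bytes(512 + 128)  # assumed budget: memory_size + buffer_size

print(f"full cache:    {full / 2**30} GiB")     # prints 48.0 GiB
print(f"bounded cache: {bounded / 2**20} MiB")  # prints 240.0 MiB
```

Under these assumptions the bounded cache is about 200 times smaller than a full 128k-token cache. The actual footprint depends on how the library interprets the two settings.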

Citation

@article{bui2025cache,
  title={Cache What Lasts: Token Retention for Memory-Bounded {KV} Cache in {LLMs}},
  author={Bui, Ngoc and Sharma, Shubham and Lamba, Simran and Mishra, Saumitra and Ying, Rex},
  journal={arXiv preprint arXiv:2512.03324},
  year={2025}
}
@article{bui2025make,
  title={Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction},
  author={Bui, Ngoc and Nguyen, Hieu Trung and Cohan, Arman and Ying, Rex},
  journal={arXiv preprint arXiv:2512.03324},
  year={2025}
}