Instructions to use swiss-ai/Apertus-70B-Instruct-2509 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use swiss-ai/Apertus-70B-Instruct-2509 with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("text-generation", model="swiss-ai/Apertus-70B-Instruct-2509")
    messages = [
        {"role": "user", "content": "Who are you?"},
    ]
    pipe(messages)

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("swiss-ai/Apertus-70B-Instruct-2509")
    model = AutoModelForCausalLM.from_pretrained("swiss-ai/Apertus-70B-Instruct-2509")
    messages = [
        {"role": "user", "content": "Who are you?"},
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Inference
- HuggingChat
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use swiss-ai/Apertus-70B-Instruct-2509 with vLLM:
Install from pip and serve model
    # Install vLLM from pip:
    pip install vllm

    # Start the vLLM server:
    vllm serve "swiss-ai/Apertus-70B-Instruct-2509"

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "swiss-ai/Apertus-70B-Instruct-2509",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
Use Docker
    docker model run hf.co/swiss-ai/Apertus-70B-Instruct-2509
- SGLang
How to use swiss-ai/Apertus-70B-Instruct-2509 with SGLang:
Install from pip and serve model
    # Install SGLang from pip:
    pip install sglang

    # Start the SGLang server:
    python3 -m sglang.launch_server \
        --model-path "swiss-ai/Apertus-70B-Instruct-2509" \
        --host 0.0.0.0 \
        --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "swiss-ai/Apertus-70B-Instruct-2509",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
Use Docker images
    docker run --gpus all \
        --shm-size 32g \
        -p 30000:30000 \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HF_TOKEN=<secret>" \
        --ipc=host \
        lmsysorg/sglang:latest \
        python3 -m sglang.launch_server \
            --model-path "swiss-ai/Apertus-70B-Instruct-2509" \
            --host 0.0.0.0 \
            --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "swiss-ai/Apertus-70B-Instruct-2509",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
- Docker Model Runner
How to use swiss-ai/Apertus-70B-Instruct-2509 with Docker Model Runner:
    docker model run hf.co/swiss-ai/Apertus-70B-Instruct-2509
GPU-Setup
What GPU setup do you recommend for running Apertus ideally, and what setup might still do the trick?
Apertus is part of an R&D initiative, and the parameters of the technical deployment of our models are largely a matter of third-party software and hardware choices. There is no general guidance on system requirements.

Speaking from my own experience, it is possible to get the 8B model to run quite well on consumer-level hardware. 12 GB of VRAM is advisable, but quantized versions exist that fit into much smaller amounts of memory. To get the full 70B version of Apertus to run, you will ideally need an industrial-strength graphics card. Again, it depends a lot on what kind of performance you expect, how many users you will serve, and so on.

Some reviews online may provide relevant performance benchmarks (tokens per second). I would suggest contacting some of the providers that deploy Apertus and doing your own experiments, and please share any useful advice that you hear about.
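As a rough way to reason about the memory figures above, here is a minimal back-of-the-envelope sketch, not official sizing guidance. It assumes the nominal parameter counts implied by the model names (8 billion and 70 billion; the real counts may differ somewhat) and counts only the weights. The KV cache, activations, and runtime overhead come on top and grow with context length and the number of concurrent users. The function name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Nominal sizes read off the model names (assumption, not exact parameter counts).
MODELS = {"Apertus-8B": 8e9, "Apertus-70B": 70e9}

# Common storage precisions: 16-bit (bf16/fp16) and typical 8-bit / 4-bit quantization.
PRECISIONS = {"16-bit": 16, "8-bit": 8, "4-bit": 4}

for name, params in MODELS.items():
    for label, bits in PRECISIONS.items():
        gb = weight_memory_gb(params, bits)
        print(f"{name} @ {label}: ~{gb:.0f} GB for weights only")
```

Under these assumptions, the 8B model needs about 16 GB for 16-bit weights, 8 GB at 8-bit, and 4 GB at 4-bit, which is why quantized variants can fit on smaller consumer cards. The 70B model comes out at roughly 140 GB at 16-bit and 35 GB at 4-bit before any overhead, so it generally calls for data-center hardware or several GPUs.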