---
license: other
license_name: thestageai-elastic
base_model:
- Qwen/Qwen2.5-7B-Instruct
base_model_relation: quantized
pipeline_tag: text-generation
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

# Elastic model: Qwen2.5-7B-Instruct


## Overview

---

ElasticModels are models produced by TheStage AI ANNA (Automated Neural Networks Accelerator). ANNA lets you trade off model size, latency, and quality with a single slider by routing different compression algorithms to different layers. For each model, we provide a series of optimized variants:

- **XL**: Mathematically equivalent neural network, optimized with our DNN compiler.
- **L**: Near lossless model, with less than 1% degradation obtained on corresponding benchmarks.
- **M**: Faster model, with accuracy degradation less than 1.5%.
- **S**: The fastest model, with accuracy degradation less than 2%.
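
As a rough illustration of how the tiers above might be chosen programmatically, the sketch below maps a tolerated accuracy degradation (in percent) to the fastest tier whose stated bound fits the budget. `pick_mode`, `DEGRADATION_BOUNDS`, and `SPEED_ORDER` are hypothetical names and not part of the ElasticModels SDK. The bounds are copied from the list above, and XL is treated as 0% because it is described as mathematically equivalent.

```python
# Hypothetical helper, not part of the ElasticModels SDK.
# Upper bounds on accuracy degradation (percent), taken from the tier list.
DEGRADATION_BOUNDS = {"S": 2.0, "M": 1.5, "L": 1.0, "XL": 0.0}

# Tiers ordered from fastest to slowest.
SPEED_ORDER = ["S", "M", "L", "XL"]


def pick_mode(max_degradation_pct: float) -> str:
    """Return the fastest tier whose stated bound fits within the budget."""
    if max_degradation_pct < 0:
        raise ValueError("max_degradation_pct must be non-negative")
    for mode in SPEED_ORDER:
        if DEGRADATION_BOUNDS[mode] <= max_degradation_pct:
            return mode
    return "XL"
```

The returned string could then be passed as the `mode` argument when loading the model.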

Models can be accessed via the TheStage AI Python SDK (ElasticModels) or deployed as Docker containers with REST API endpoints (see the Serving with Docker Image section).

## Installation

---


### System Requirements

---

| **Property** | **Value** |
| --- | --- |
| **GPU** | L40s, RTX 5090, H100, RTX 4090 |
| **Python Version** | 3.10-3.12 |
| **CPU** | Intel/AMD x86_64 |
| **CUDA Version** | 12.8+ |
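
The requirements above can be checked before installation. The following is a minimal sketch: `python_ok` and `cuda_ok` are hypothetical helpers, and the CUDA version string is assumed to come from something like `torch.version.cuda` or the output of `nvidia-smi`.

```python
import sys

# Values taken from the requirements table above.
SUPPORTED_PYTHON = {(3, 10), (3, 11), (3, 12)}
MIN_CUDA = (12, 8)


def python_ok(version_info=sys.version_info) -> bool:
    """Check that the interpreter is Python 3.10, 3.11, or 3.12."""
    return (version_info[0], version_info[1]) in SUPPORTED_PYTHON


def cuda_ok(cuda_version: str) -> bool:
    """Check a 'major.minor' CUDA version string against the 12.8 minimum."""
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    return (major, minor) >= MIN_CUDA


if __name__ == "__main__":
    print("Python supported:", python_ok())
```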


### TheStage AI Access token setup

---

Install TheStage AI CLI and set up your API token:

```bash
pip install thestage
thestage config set --access-token <YOUR_ACCESS_TOKEN>
```

### ElasticModels installation

---

Install TheStage Elastic Models package:

```bash
pip install 'thestage-elastic-models[nvidia,cudnn]' \
    --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install --force-reinstall --no-deps nvidia-cudnn-frontend==1.18.0
```

## Usage example

---


Elastic Models provides the same interface as HuggingFace Transformers. Here is an example of how to use the Qwen2.5-7B-Instruct model:

```python
import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

# A Hugging Face token is currently required because the original
# weights are used for some layers, along with the original
# model configuration
model_name = "Qwen/Qwen2.5-7B-Instruct"
hf_token = ''
device = torch.device("cuda")

# Create the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=hf_token
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    mode='S'
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

# Inference works the same way as with the transformers library
prompt = "Describe basics of DNNs quantization."
messages = [
  {
    "role": "system",
    "content": "You are a search bot, answer on user text queries."
  },
  {
    "role": "user",
    "content": prompt
  }
]

chat_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

# Tokenize the prompt and move the tensors to the GPU
inputs = tokenizer(chat_prompt, return_tensors="pt").to(device)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_length=500)

input_len = inputs['input_ids'].shape[1]
generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

# Print the question and the generated answer
print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
```


## Quality Benchmarks

---


We have used the `lm_eval` library to validate the models. For each model size (S, M, L, XL), we have run the following tasks: MMLU, PIQA, Arc Challenge, Winogrande.

![Quality Benchmarking](https://cdn.thestage.ai/production/cms_file_upload/1773422559-0c9621c5-9e7f-4c81-8698-70f6d6872cb5/Elastic_Qwen2.5_7B_Instruct_MMLU.png)

### Quality Benchmark Results

---

| **Metric/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8 (int8)** |
| --- | --- | --- | --- | --- | --- | --- |
| **Arc Challenge** | 54.2 | 55.2 | 55.3 | 54.9 | 54.7 | 41.7 |
| **MMLU** | 71.5 | 71.6 | 71.9 | 71.9 | 71.8 | 64.6 |
| **PIQA** | 78.3 | 79.9 | 79.5 | 79.5 | 79.6 | 67.1 |
| **Winogrande** | 70.4 | 70.3 | 71.5 | 70.4 | 71.0 | 53.1 |
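
To relate these scores to the degradation bounds quoted in the overview, the sketch below computes the largest drop of each size relative to the original model. It assumes that degradation is measured relative to the original score, which the text does not state explicitly. `relative_change_pct` and `worst_relative_drop` are hypothetical helpers, and the numbers are copied from the table above.

```python
# Accuracy (%) copied from the quality benchmark table.
RESULTS = {
    "Arc Challenge": {"S": 54.2, "M": 55.2, "L": 55.3, "XL": 54.9, "Original": 54.7, "W8A8": 41.7},
    "MMLU": {"S": 71.5, "M": 71.6, "L": 71.9, "XL": 71.9, "Original": 71.8, "W8A8": 64.6},
    "PIQA": {"S": 78.3, "M": 79.9, "L": 79.5, "XL": 79.5, "Original": 79.6, "W8A8": 67.1},
    "Winogrande": {"S": 70.4, "M": 70.3, "L": 71.5, "XL": 70.4, "Original": 71.0, "W8A8": 53.1},
}


def relative_change_pct(score: float, baseline: float) -> float:
    """Relative change versus the baseline, in percent (negative = worse)."""
    return (score - baseline) / baseline * 100.0


def worst_relative_drop(mode: str) -> float:
    """Largest relative drop across tasks, in percent (positive = worse)."""
    return max(
        -relative_change_pct(task[mode], task["Original"])
        for task in RESULTS.values()
    )


if __name__ == "__main__":
    for mode in ("S", "M", "L", "XL", "W8A8"):
        print(f"{mode}: worst relative drop {worst_relative_drop(mode):.2f}%")
```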


## Datasets

---


- **MMLU**: Measures model performance on a diverse set of multiple-choice questions covering various academic subjects, testing general knowledge and reasoning.
- **PIQA**: Evaluates physical commonsense reasoning by asking the model to choose the most plausible solution to everyday physical problems.
- **Arc Challenge**: Assesses scientific and factual reasoning using challenging multiple-choice questions from the AI2 Reasoning Challenge dataset.
- **Winogrande**: Tests commonsense understanding and pronoun resolution through sentences requiring the model to identify the correct referent.

## Metrics

---


- **Accuracy**: Accuracy measures the proportion of model predictions that exactly match the correct answers across evaluation tasks.
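
As a minimal illustration of the metric (the `lm_eval` library computes it internally, so this is only a sketch with a hypothetical `accuracy` helper):

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


if __name__ == "__main__":
    # Three of the four multiple-choice answers match the references.
    print(accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"]))
```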


## Latency Benchmarks

---


We measured TPS (tokens per second) for each model size using 100 input tokens and 300 output tokens.

![Latency Benchmarking](https://cdn.thestage.ai/production/cms_file_upload/1773422585-3065316c-5c07-4430-befb-61daac95f712/Elastic_Qwen2.5_7B_Instruct_latency.png)

### Latency Benchmark Results

---

Tokens per second for different model sizes on various GPUs.

| **GPU/Model Size** | **S** | **M** | **L** | **XL** | **Original** | **W8A8 (int8)** |
| --- | --- | --- | --- | --- | --- | --- |
| **H100** | 184 | 177 | 157 | 138 | 62 | 201 |
| **L40s** | 72 | 67 | 57 | 48 | 42 | 78 |
| **B200** | 239 | 232 | 216 | 199 | 114 | N/A |
| **GeForce RTX 5090** | 141 | N/A | N/A | N/A | 66 | N/A |
| **GeForce RTX 4090** | 95 | N/A | N/A | N/A | 45 | N/A |
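
The throughput figures can be converted into speedups over the original model. The sketch below uses the values from the table above and leaves out the N/A entries. `speedup` is a hypothetical helper.

```python
# Tokens per second copied from the latency table (N/A entries omitted).
TPS = {
    "H100": {"S": 184, "M": 177, "L": 157, "XL": 138, "Original": 62},
    "L40s": {"S": 72, "M": 67, "L": 57, "XL": 48, "Original": 42},
    "B200": {"S": 239, "M": 232, "L": 216, "XL": 199, "Original": 114},
    "GeForce RTX 5090": {"S": 141, "Original": 66},
    "GeForce RTX 4090": {"S": 95, "Original": 45},
}


def speedup(gpu: str, mode: str) -> float:
    """Throughput of the given size divided by the original model's throughput."""
    return TPS[gpu][mode] / TPS[gpu]["Original"]


if __name__ == "__main__":
    for gpu, row in TPS.items():
        for mode in row:
            if mode != "Original":
                print(f"{gpu} {mode}: {speedup(gpu, mode):.2f}x")
```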


## Benchmarking Methodology

---


The benchmarking was performed on a single GPU with a batch size of 1. Each model was run for 10 iterations, and the average latency was calculated.

> **Algorithm summary:**
> 1. Load the Qwen2.5-7B-Instruct model with the specified size (S, M, L, XL, original).
> 2. Move the model to the GPU.
> 3. Prepare a sample prompt for text generation.
> 4. Run the model for a number of iterations (e.g., 10) and measure the time taken for each iteration. On each iteration:
>    - Synchronize the GPU to flush any previous operations.
>    - Record the start time.
>    - Generate the text using the model.
>    - Synchronize the GPU again.
>    - Record the end time and calculate the TTFT (time to first token) and TPS (tokens per second) for that iteration.
> 5. Calculate the average TTFT and TPS over all iterations.
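
The timing loop in the summary above can be sketched as follows. This is not the exact benchmarking script: `benchmark_tps` is a hypothetical helper that takes any generation callable plus an optional synchronization callable (for example `torch.cuda.synchronize`), and it measures TPS only. TTFT is omitted; one common approximation, not confirmed as the method used here, is to time a generation limited to a single new token.

```python
import time
from statistics import mean


def benchmark_tps(generate_fn, num_new_tokens, iterations=10, sync_fn=None):
    """Average tokens per second of generate_fn over several iterations.

    generate_fn: zero-argument callable that generates num_new_tokens tokens.
    sync_fn: optional callable that waits for pending GPU work to finish.
    """
    if sync_fn is None:
        sync_fn = lambda: None  # no-op when running without a GPU
    tps_values = []
    for _ in range(iterations):
        sync_fn()                      # flush previously queued operations
        start = time.perf_counter()    # record the start time
        generate_fn()                  # generate the text
        sync_fn()                      # wait for generation to finish
        elapsed = time.perf_counter() - start
        tps_values.append(num_new_tokens / elapsed)
    return mean(tps_values)
```

With the model from the usage example, a call might look like `benchmark_tps(lambda: model.generate(**inputs, max_new_tokens=300, min_new_tokens=300), 300, sync_fn=torch.cuda.synchronize)`, which fixes the output length at 300 tokens.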


## Serving with Docker Image

---


For serving on NVIDIA GPUs, we provide ready-to-go Docker containers with OpenAI-compatible API endpoints.
With these containers you can set up an inference endpoint on any cloud or serverless provider, as well as on on-premise servers.
You can also use this container to run inference through TheStage AI platform.

### Prebuilt image from ECR

---

Pull the Docker image and start the inference container:

```bash
docker pull public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.2.0-llm-24.09c
```
```bash
docker run --rm -ti \
  --name serving_thestage_model \
  -p 8000:80 \
  -e AUTH_TOKEN=<AUTH_TOKEN> \
  -e MODEL_REPO=Qwen/Qwen2.5-7B-Instruct \
  -e MODEL_SIZE=<MODEL_SIZE> \
  -e MODEL_BATCH=<MAX_BATCH_SIZE> \
  -e HUGGINGFACE_ACCESS_TOKEN=<HUGGINGFACE_ACCESS_TOKEN> \
  -e THESTAGE_AUTH_TOKEN=<THESTAGE_ACCESS_TOKEN> \
  -v /mnt/hf_cache:/root/.cache/huggingface \
  public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.2.0-llm-24.09c
```

| **Parameter**              | **Description**                                                                                      |
|----------------------------|------------------------------------------------------------------------------------------------------|
| `<MODEL_SIZE>`             | Available: S, M, L, XL.                                                                              |
| `<MAX_BATCH_SIZE>`         | Maximum batch size to process in parallel.                                                           |
| `<HUGGINGFACE_ACCESS_TOKEN>` | Hugging Face access token.                                                                         |
| `<THESTAGE_ACCESS_TOKEN>`  | TheStage token generated on the platform (Profile -> Access tokens).                                 |
| `<AUTH_TOKEN>`             | Token for endpoint authentication. You can set it to any random string; it must match the value used by the client. |

## Invocation

---


You can invoke the endpoint using `curl` as follows (the bearer token must match the `AUTH_TOKEN` passed to the container):

```bash
curl -X POST 'http://127.0.0.1:8000/v1/chat/completions' \
    -H 'Authorization: Bearer 123' \
    -H 'Content-Type: application/json' \
    -H "X-Model-Name: qwen-2-5-7b-instruct-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>-paged" \
    -d '{
        "messages":[{"role":"user","content":"Define AI"}]
    }'
```

Or using the OpenAI Python client:

```python
from openai import OpenAI

BASE_URL = "http://<your_ip>/v1"
API_KEY  = "123"
MODEL    = "qwen-2-5-7b-instruct-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>-paged"

client = OpenAI(
    api_key=API_KEY,
    base_url=BASE_URL,
    default_headers={"X-Model-Name": MODEL}
)

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": "Define AI"}
    ]
)

print(response.choices[0].message.content)
```

## Endpoint Parameters

---


### Method

---

> **POST** `/v1/chat/completions`

### Header Parameters

---

> `Authorization`: `string`
>
> Bearer token for authentication. Should match the `AUTH_TOKEN` set during container startup.

> `Content-Type`: `string`
>
> Must be set to `application/json`.

> `X-Model-Name`: `string`
>
> Specifies the model to use for generation. Format: `qwen-2-5-7b-instruct-<size>-bs<batch_size>-paged` (as used in the invocation examples), where `<size>` is one of `S`, `M`, `L`, `XL`, `original` and `<batch_size>` is the maximum batch size configured during container startup.

### Input Body

---

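> `messages`: `array`
>
> The chat messages to complete. Each item is an object with a `role` and a `content` field, for example `{"role": "user", "content": "Define AI"}`.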
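
Putting the method, headers, and body together, a request can be assembled with the Python standard library alone. This is a sketch: `build_chat_request` is a hypothetical helper, and the `-paged` suffix in the model name follows the invocation examples above.

```python
import json
import urllib.request


def build_chat_request(base_url, auth_token, model_size, batch_size, user_text):
    """Build a POST request for /v1/chat/completions with the required headers."""
    model_name = f"qwen-2-5-7b-instruct-{model_size}-bs{batch_size}-paged"
    body = json.dumps(
        {"messages": [{"role": "user", "content": user_text}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {auth_token}",  # must match AUTH_TOKEN
            "Content-Type": "application/json",
            "X-Model-Name": model_name,
        },
    )
```

Sending the request with `urllib.request.urlopen(request)` and parsing the JSON response should give the same `choices[0].message.content` field that the OpenAI client example reads, assuming the container is running and reachable at `base_url`.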


## Deploy on Modal

---


For more details, see the [Modal deployment](https://docs.thestage.ai/tutorials/source/modal_thestage.html) tutorial.

### Clone modal serving code

---

```shell
git clone https://github.com/TheStageAI/ElasticModels.git
cd ElasticModels/examples/modal
```

### Configuration of environment variables

---

Set your environment variables in `modal_serving.py`:

```python
# modal_serving.py

ENVS = {
    "MODEL_REPO": "Qwen/Qwen2.5-7B-Instruct",
    "MODEL_BATCH": "4",
    "THESTAGE_AUTH_TOKEN": "",
    "HUGGINGFACE_ACCESS_TOKEN": "",
    "PORT": "80",
    "PORT_HEALTH": "80",
    "HF_HOME": "/cache/huggingface",
}
```
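To avoid committing tokens to the file, the two token fields could be filled from local environment variables instead. The sketch below is one possible approach and not part of the provided `modal_serving.py`: `build_envs` is a hypothetical helper, and it assumes the local variables use the same names as the container variables.

```python
import os

# Names of the secrets expected in the local environment (assumed to match
# the container variable names).
REQUIRED_SECRETS = ("THESTAGE_AUTH_TOKEN", "HUGGINGFACE_ACCESS_TOKEN")


def build_envs(environ=None):
    """Return the ENVS dictionary with tokens read from the environment."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_SECRETS if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "MODEL_REPO": "Qwen/Qwen2.5-7B-Instruct",
        "MODEL_BATCH": "4",
        "THESTAGE_AUTH_TOKEN": environ["THESTAGE_AUTH_TOKEN"],
        "HUGGINGFACE_ACCESS_TOKEN": environ["HUGGINGFACE_ACCESS_TOKEN"],
        "PORT": "80",
        "PORT_HEALTH": "80",
        "HF_HOME": "/cache/huggingface",
    }
```

`ENVS = build_envs()` could then replace the literal dictionary, provided both variables are exported in the shell that runs the deployment.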

### Configuration of GPUs

---

Set your desired GPU type and autoscaling variables in `modal_serving.py`:

```python
# modal_serving.py

@app.function(
    image=image,
    gpu="B200",
    min_containers=8,
    max_containers=8,
    timeout=10000,
    ephemeral_disk=600 * 1024,
    volumes={"/opt/project/.cache": HF_CACHE},
    startup_timeout=60*20
)
@modal.web_server(
    80,
    label="Qwen/Qwen2.5-7B-Instruct-test",
    startup_timeout=60*20
)
def serve():
    pass
```

### Run serving

---

```shell
modal serve modal_serving.py
```


## Links

---

* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
* __Contact email__: contact@thestage.ai