Update README.md

## 🚀 Quick Start

### SGLang

#### Installation
```bash
uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"
```
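
To confirm that the PR build is the one that actually got installed, a quick sanity check (assuming the package exposes a `__version__` attribute, as recent SGLang releases do):

```python
# Sanity check: make sure the sglang build installed above is importable.
# Assumes sglang exposes __version__ (true for recent releases).
import sglang

print(sglang.__version__)
```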

#### Launch Server
```bash
# Optional: enable schedule overlapping (experimental, may not be stable)
# export SGLANG_ENABLE_SPEC_V2=1
# export SGLANG_ENABLE_DFLASH_SPEC_V2=1
# export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1

python -m sglang.launch_server \
    --model-path openai/gpt-oss-20b \
    ...
    --trust-remote-code
```
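
Before sending requests, it can help to wait until the server is up; the sketch below polls the OpenAI-compatible `/v1/models` route on the default port used in this README (the polling helper itself is not part of the original instructions):

```python
# Poll the OpenAI-compatible /v1/models route until the server responds.
# Assumes the SGLang server launched above is on localhost:30000.
import time
import urllib.request

URL = "http://localhost:30000/v1/models"

for _ in range(60):  # wait up to ~60 seconds
    try:
        with urllib.request.urlopen(URL, timeout=1) as resp:
            print("server ready:", resp.status)
            break
    except OSError:
        time.sleep(1)
else:
    raise SystemExit("server did not come up in time")
```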

#### Usage

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=2048,
    temperature=0.0,
)
print(response.choices[0].message.content)
```
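
The endpoint also honors the standard OpenAI streaming flag, which is convenient for interactive use. A streaming variant of the same request (only `stream=True` is new relative to the snippet above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Same request as above, streamed chunk by chunk via the standard OpenAI flag.
stream = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=2048,
    temperature=0.0,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```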

### vLLM

#### Installation

```bash
# Install the release wheel, then upgrade to the nightly build
uv pip install vllm
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
```

#### Launch Server

```bash
vllm serve openai/gpt-oss-20b \
    --speculative-config '{"method": "dflash", "model": "z-lab/gpt-oss-20b-DFlash", "num_speculative_tokens": 7}' \
    --attention-backend flash_attn \
    --max-num-batched-tokens 32768
```
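
The same configuration should map onto vLLM's offline `LLM` API, which takes a `speculative_config` dict mirroring the CLI flag above. This offline form is not in the original README, so treat it as a sketch under that assumption:

```python
# Offline (non-server) sketch of the same DFlash setup.
# ASSUMPTION: vllm.LLM accepts speculative_config as a dict, mirroring
# the --speculative-config CLI flag used in the serve command above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-20b",
    speculative_config={
        "method": "dflash",
        "model": "z-lab/gpt-oss-20b-DFlash",
        "num_speculative_tokens": 7,
    },
    max_num_batched_tokens=32768,
)
outputs = llm.generate(
    ["Write a quicksort in Python."],
    SamplingParams(temperature=0.0, max_tokens=2048),
)
print(outputs[0].outputs[0].text)
```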

#### Usage

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=2048,
    temperature=0.0,
)
print(response.choices[0].message.content)
```

## Evaluation

We use a **block size of 8 (7 draft tokens)** during speculation. DFlash consistently achieves high acceptance lengths and speedups across different concurrency levels. All experiments are conducted using **SGLang** on a single **H200 GPU**.
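
For intuition on those numbers: with a block size of 8, each target-model verification step can emit at most 8 tokens (7 accepted draft tokens plus one token from the target model), so the average acceptance length per step bounds the achievable speedup before overheads. A back-of-envelope sketch, where the acceptance value is a placeholder rather than a measured result:

```python
# Back-of-envelope speedup bound for block-wise speculation.
B = 8      # block size: 7 draft tokens + 1 token from the target model
tau = 6.0  # HYPOTHETICAL average tokens emitted per verification step (placeholder)

# Plain decoding emits 1 token per target forward pass; block speculation
# emits up to B. Ignoring draft/verification overhead, speedup ~ tau, capped at B.
ideal_speedup = min(tau, B)
print(f"block size {B}: ~{ideal_speedup:.1f}x ideal speedup (upper bound {B}x)")
```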