Commit d53f655 (verified), committed by jianchen0311 · 1 Parent(s): d1b9919

Update README.md

Files changed (1)
  1. README.md +55 -5
README.md CHANGED
@@ -34,18 +34,18 @@ For all samples, the response portion was regenerated using the target model `op
  ## 🚀 Quick Start
 
  ### SGLang
- DFlash is now supported on SGLang. And vLLM integration is currently in progress.
 
  #### Installation
  ```bash
  uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"
  ```
 
- #### Inference
+ #### Launch Server
  ```bash
- export SGLANG_ENABLE_SPEC_V2=1
- export SGLANG_ENABLE_DFLASH_SPEC_V2=1
- export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
+ # Optional: enable schedule overlapping (experimental, may not be stable)
+ # export SGLANG_ENABLE_SPEC_V2=1
+ # export SGLANG_ENABLE_DFLASH_SPEC_V2=1
+ # export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
 
  python -m sglang.launch_server \
  --model-path openai/gpt-oss-20b \
@@ -58,6 +58,56 @@ python -m sglang.launch_server \
  --trust-remote-code
  ```
 
+ #### Usage
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+ model="openai/gpt-oss-20b",
+ messages=[{"role": "user", "content": "Write a quicksort in Python."}],
+ max_tokens=2048,
+ temperature=0.0,
+ )
+ print(response.choices[0].message.content)
+ ```
+
+ ### vLLM
+
+ #### Installation
+
+ ```bash
+ uv pip install vllm
+ uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
+ ```
+
+ #### Launch Server
+
+ ```bash
+ vllm serve openai/gpt-oss-20b \
+ --speculative-config '{"method": "dflash", "model": "z-lab/gpt-oss-20b-DFlash", "num_speculative_tokens": 7}' \
+ --attention-backend flash_attn \
+ --max-num-batched-tokens 32768
+ ```
+
+ #### Usage
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+ model="openai/gpt-oss-20b",
+ messages=[{"role": "user", "content": "Write a quicksort in Python."}],
+ max_tokens=2048,
+ temperature=0.0,
+ )
+ print(response.choices[0].message.content)
+ ```
+
  ## Evaluation
  We use a **block size of 8 (7 draft tokens)** during speculation. DFlash consistently achieves high acceptance lengths and speedups across different concurrency levels. All experiments are conducted using **SGLang** on a single **H200 GPU**.
 