Hanzo Dev committed
Commit 3e34c30 · 1 Parent(s): 664f1e2

Create unified Zen training space for all models and datasets

Files changed (3)
  1. README.md +141 -5
  2. app.py +373 -0
  3. requirements.txt +16 -0
README.md CHANGED
@@ -1,12 +1,148 @@
  ---
  title: Zen Training
- emoji: 🐢
- colorFrom: red
  colorTo: purple
  sdk: gradio
- sdk_version: 5.49.1
  app_file: app.py
- pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
  title: Zen Training
+ emoji: 🧘
+ colorFrom: blue
  colorTo: purple
  sdk: gradio
+ sdk_version: 4.0.0
  app_file: app.py
+ pinned: true
+ license: apache-2.0
+ hardware: a10g-large
  ---

+ # 🧘 Zen Training Space
+
+ **Unified Training Platform for All Zen Models**
+
+ Train any Zen model with any dataset combination from HuggingFace. Everything runs directly from HF datasets - no local storage needed!
+
+ ## 🎯 Features
+
+ ### Supported Models
+
+ **Language Models:**
+ - `zen-nano` (0.6B) - Edge deployment
+ - `zen-eco` (4B) - Balanced performance
+ - `zen-omni` (7B) - Multi-task
+ - `zen-coder` (14B) - Code generation
+ - `zen-next` (32B) - Frontier performance
+
+ **Vision-Language Models:**
+ - `zen-vl-4b` - Efficient VL with function calling
+ - `zen-vl-8b` - Enhanced VL capabilities
+ - `zen-vl-30b` - Maximum VL performance
+
+ ### Supported Datasets
+
+ **Agent Training (ADP):**
+ - AgentTuning OS/KG/DB (~15k samples)
+ - Synatra (99k agent trajectories)
+ - Code Feedback (66k samples)
+ - Go Browse (27k web interactions)
+
+ **Function Calling:**
+ - xLAM 60k (Salesforce high-quality function calling)
+
+ **Instruction Tuning:**
+ - Alpaca (52k instruction samples)
+
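+ Any of these can be streamed straight from the Hub. A minimal sketch (dataset IDs and config names as registered in this Space's `app.py`):
+
+ ```python
+ from datasets import load_dataset
+
+ # Stream the Synatra config of the ADP collection - nothing is stored locally
+ adp = load_dataset("neulab/agent-data-collection", "synatra",
+                    split="train", streaming=True)
+
+ # Single-config datasets need no config name
+ xlam = load_dataset("Salesforce/xlam-function-calling-60k",
+                     split="train", streaming=True)
+
+ for i, sample in enumerate(adp):
+     if i >= 3:
+         break
+     print(sample)  # peek at the trajectory schema
+ ```
+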
+ ## 🚀 How to Use
+
+ 1. **Select Model**: Choose from language or vision-language models
+ 2. **Select Datasets**: Check multiple datasets to combine them
+ 3. **Configure Training**: Set epochs, batch size, learning rate, max samples
+ 4. **Set Output Repo**: Specify HuggingFace repo for trained model
+ 5. **Start Training**: Click the button and monitor logs
+
+ ## ⚙️ Training Configuration
+
+ ### Recommended Settings
+
+ **4B Models (A10G - 24GB):**
+ - Batch Size: 1-2
+ - Max Samples: 10,000-30,000
+ - Time: 4-8 hours
+ - Cost: ~$3-5
+
+ **8B Models (A100 - 40GB):**
+ - Batch Size: 2-4
+ - Max Samples: 30,000-50,000
+ - Time: 8-12 hours
+ - Cost: ~$15-20
+
+ **32B Models (A100 - 80GB):**
+ - Batch Size: 1-2
+ - Max Samples: 50,000-100,000
+ - Time: 20-30 hours
+ - Cost: ~$50-80
+
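+ These settings map onto `transformers.TrainingArguments` one-for-one. A sketch for the 4B / A10G profile (close to the defaults in `app.py`; `gradient_accumulation_steps` and `hub_model_id` are illustrative additions):
+
+ ```python
+ from transformers import TrainingArguments
+
+ args = TrainingArguments(
+     output_dir="./training-output",
+     num_train_epochs=3,
+     per_device_train_batch_size=1,       # 1-2 fits a 4B model in 24GB
+     gradient_accumulation_steps=8,       # recovers a larger effective batch
+     learning_rate=2e-5,
+     bf16=True,                           # A10G supports bfloat16
+     logging_steps=10,
+     save_steps=100,                      # lower this for long runs (see Tips)
+     push_to_hub=True,
+     hub_model_id="your-org/your-model",  # placeholder
+ )
+ ```
+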
+ ## 📊 Dataset Combinations
+
+ ### For Agent Training:
+ ```
+ ADP Synatra (80%) + xLAM (20%)
+ = Strong agent + quality function calling
+ ```
+
+ ### For Code Models:
+ ```
+ Code Feedback (70%) + Alpaca (30%)
+ = Code expertise + general instruction following
+ ```
+
+ ### For VL Models:
+ ```
+ ADP (all configs) + xLAM
+ = Complete vision-language agent training
+ ```
+
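+ Mixes like these can be expressed with `datasets.interleave_datasets`. A hedged sketch of the 80/20 agent recipe (slice sizes and the `text` normalization are illustrative, not a tested pipeline):
+
+ ```python
+ from datasets import load_dataset, interleave_datasets
+
+ # Small non-streaming slices keep the sketch cheap to run
+ synatra = load_dataset("neulab/agent-data-collection", "synatra", split="train[:800]")
+ xlam = load_dataset("Salesforce/xlam-function-calling-60k", split="train[:200]")
+
+ # interleave_datasets needs a shared schema, so flatten each source
+ # to a single "text" column first
+ def to_text(ds):
+     return ds.map(lambda ex: {"text": str(ex)}, remove_columns=ds.column_names)
+
+ mixed = interleave_datasets([to_text(synatra), to_text(xlam)],
+                             probabilities=[0.8, 0.2], seed=42)
+ ```
+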
+ ## 🔒 Requirements
+
+ - HuggingFace Pro account (for GPU access)
+ - Write access to output repository
+ - HF_TOKEN secret set in Space settings
+
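+ Inside the Space the token is read from the environment. A quick sanity check (standard `huggingface_hub` calls):
+
+ ```python
+ import os
+ from huggingface_hub import login, whoami
+
+ login(token=os.environ["HF_TOKEN"])  # the secret set in Space settings
+ print(whoami()["name"])              # confirms the token authenticates
+ ```
+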
+ ## 💡 Tips
+
+ 1. **Start Small**: Test with 1,000 samples first
+ 2. **Mix Datasets**: Combine complementary datasets for best results
+ 3. **Monitor Logs**: Watch for OOM errors and adjust batch size
+ 4. **Save Often**: Lower save_steps for longer training runs
+
+ ## 📚 Resources
+
+ - **Website**: https://zenlm.org
+ - **GitHub**: https://github.com/zenlm
+ - **Models**: https://huggingface.co/zenlm
+ - **Datasets**:
+   - [ADP](https://huggingface.co/datasets/neulab/agent-data-collection)
+   - [xLAM](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)
+
+ ## 📄 License
+
+ Apache 2.0
+
+ ## 🙏 Citations
+
+ ```bibtex
+ @software{zen-training-2025,
+   title={Zen Training: Unified Training Platform for Zen Models},
+   author={Zen AI Team},
+   year={2025},
+   url={https://huggingface.co/spaces/zenlm/zen-training}
+ }
+
+ @article{adp2025,
+   title={Agent Data Protocol},
+   author={NeuLab},
+   journal={arXiv preprint arXiv:2510.24702},
+   year={2025}
+ }
+
+ @dataset{xlam2024,
+   title={xLAM Function Calling Dataset},
+   author={Salesforce Research},
+   year={2024}
+ }
+ ```
app.py ADDED
@@ -0,0 +1,373 @@
+ """
+ Zen Training Space - Unified Training for All Zen Models
+ Train any Zen model with any dataset combination from HuggingFace
+ """
+
+ import json
+ import os
+ import traceback
+ from typing import List
+
+ import gradio as gr
+ import torch
+ from datasets import Dataset, load_dataset
+ from transformers import (AutoModelForCausalLM, AutoModelForVision2Seq,
+                           AutoProcessor, AutoTokenizer,
+                           DataCollatorForLanguageModeling,
+                           Trainer, TrainingArguments)
+
+ # Model configurations
+ MODELS = {
+     "Language Models": {
+         "zen-nano-0.6b": {"hf_id": "zenlm/zen-nano-0.6b", "type": "language", "size": "0.6B", "context": "32K"},
+         "zen-eco-4b-instruct": {"hf_id": "zenlm/zen-eco-4b-instruct", "type": "language", "size": "4B", "context": "32K"},
+         "zen-eco-4b-agent": {"hf_id": "zenlm/zen-eco-4b-agent", "type": "language", "size": "4B", "context": "32K"},
+         "zen-omni-7b": {"hf_id": "zenlm/zen-omni-7b", "type": "language", "size": "7B", "context": "32K"},
+         "zen-coder-14b": {"hf_id": "zenlm/zen-coder-14b", "type": "language", "size": "14B", "context": "128K"},
+         "zen-next-32b": {"hf_id": "zenlm/zen-next-32b", "type": "language", "size": "32B", "context": "32K"},
+     },
+     "Vision-Language Models": {
+         "zen-vl-4b-instruct": {"hf_id": "zenlm/zen-vl-4b-instruct", "type": "vision-language", "size": "4B", "context": "32K"},
+         "zen-vl-8b-instruct": {"hf_id": "zenlm/zen-vl-8b-instruct", "type": "vision-language", "size": "8B", "context": "32K"},
+         "zen-vl-30b-instruct": {"hf_id": "zenlm/zen-vl-30b-instruct", "type": "vision-language", "size": "30B", "context": "32K"},
+     },
+ }
+
+ # Dataset configurations
+ DATASETS = {
+     "Agent Training": {
+         "ADP - AgentTuning OS": {"hf_id": "neulab/agent-data-collection", "config": "agenttuning_os", "size": "~5k samples"},
+         "ADP - AgentTuning KG": {"hf_id": "neulab/agent-data-collection", "config": "agenttuning_kg", "size": "~5k samples"},
+         "ADP - AgentTuning DB": {"hf_id": "neulab/agent-data-collection", "config": "agenttuning_db", "size": "~5k samples"},
+         "ADP - Synatra": {"hf_id": "neulab/agent-data-collection", "config": "synatra", "size": "99k samples"},
+         "ADP - Code Feedback": {"hf_id": "neulab/agent-data-collection", "config": "code_feedback", "size": "66k samples"},
+         "ADP - Go Browse": {"hf_id": "neulab/agent-data-collection", "config": "go-browse-wa", "size": "27k samples"},
+     },
+     "Function Calling": {
+         "xLAM Function Calling 60k": {"hf_id": "Salesforce/xlam-function-calling-60k", "config": None, "size": "60k samples"},
+     },
+     "Instruction Tuning": {
+         "Alpaca": {"hf_id": "tatsu-lab/alpaca", "config": None, "size": "52k samples"},
+     },
+ }
+
+ def train_model(
+     model_name: str,
+     selected_datasets: List[str],
+     max_samples: int,
+     epochs: int,
+     batch_size: int,
+     learning_rate: float,
+     output_repo: str,
+ ):
+     """Main training function"""
+     try:
+         logs = []
+
+         def log(msg):
+             print(msg)
+             logs.append(msg)
+             yield "\n".join(logs)
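+         # log() is itself a generator: each call records the message and
+         # yields the accumulated transcript, so the `yield from log(...)`
+         # calls below stream incremental output to the Gradio Textbox.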
+
+         yield from log("=" * 80)
+         yield from log("🧘 ZEN TRAINING SPACE")
+         yield from log("=" * 80)
+         yield from log("")
+
+         # GPU info
+         yield from log(f"🎮 GPU Available: {torch.cuda.is_available()}")
+         if torch.cuda.is_available():
+             yield from log(f"   Device: {torch.cuda.get_device_name(0)}")
+             yield from log(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB")
+         yield from log("")
+
+         # Dropdown values arrive as "Category / name", so strip the category
+         # prefix before looking up the bare model name
+         model_name = model_name.split(" / ")[-1]
+
+         # Find model config
+         model_config = None
+         for category in MODELS.values():
+             if model_name in category:
+                 model_config = category[model_name]
+                 break
+
+         if not model_config:
+             yield from log(f"❌ Model {model_name} not found")
+             return
+
+         yield from log(f"📦 Loading model: {model_name}")
+         yield from log(f"   HF ID: {model_config['hf_id']}")
+         yield from log(f"   Size: {model_config['size']}")
+         yield from log(f"   Type: {model_config['type']}")
+
+         # Load model - use the task-specific Auto classes so the model
+         # carries a language-modeling head the Trainer can optimize
+         if model_config['type'] == "vision-language":
+             model = AutoModelForVision2Seq.from_pretrained(
+                 model_config['hf_id'],
+                 torch_dtype=torch.bfloat16,
+                 device_map="auto",
+                 trust_remote_code=True,
+             )
+             processor = AutoProcessor.from_pretrained(model_config['hf_id'])
+         else:
+             model = AutoModelForCausalLM.from_pretrained(
+                 model_config['hf_id'],
+                 torch_dtype=torch.bfloat16,
+                 device_map="auto",
+                 trust_remote_code=True,
+             )
+             processor = AutoTokenizer.from_pretrained(model_config['hf_id'])
+
+         yield from log("✅ Model loaded")
+         yield from log("")
+
+         # Load datasets
+         yield from log("📚 Loading datasets...")
+         all_datasets = []
+
+         for dataset_name in selected_datasets:
+             # Checkbox values also arrive as "Category / name" - strip the prefix
+             dataset_name = dataset_name.split(" / ")[-1]
+
+             # Find dataset config
+             dataset_config = None
+             for category in DATASETS.values():
+                 if dataset_name in category:
+                     dataset_config = category[dataset_name]
+                     break
+
+             if not dataset_config:
+                 yield from log(f"⚠️ Dataset {dataset_name} not found, skipping")
+                 continue
+
+             yield from log(f"   Loading: {dataset_name}")
+             yield from log(f"   HF ID: {dataset_config['hf_id']}")
+
+             try:
+                 if dataset_config['config']:
+                     ds = load_dataset(
+                         dataset_config['hf_id'],
+                         dataset_config['config'],
+                         split="train",
+                         streaming=True,
+                     )
+                 else:
+                     ds = load_dataset(
+                         dataset_config['hf_id'],
+                         split="train",
+                         streaming=True,
+                     )
+
+                 # Take an even share of the sample budget per dataset
+                 samples = []
+                 for i, example in enumerate(ds):
+                     if i >= max_samples // len(selected_datasets):
+                         break
+                     samples.append(example)
+
+                 all_datasets.extend(samples)
+                 yield from log(f"   ✅ Loaded {len(samples)} samples")
+
+             except Exception as e:
+                 yield from log(f"   ❌ Error: {e}")
+
+         yield from log(f"\n✅ Total samples loaded: {len(all_datasets)}")
+         yield from log("")
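+
+         # The raw samples are dicts with heterogeneous, dataset-specific
+         # schemas, which the Trainer cannot consume directly. Minimal sketch
+         # of a fix (an editor assumption, not part of the original commit):
+         # serialize each sample to JSON text and tokenize it. A real run
+         # should map each dataset's schema onto the model's chat template.
+         yield from log("🔤 Tokenizing samples...")
+         tokenizer = getattr(processor, "tokenizer", processor)  # VL processors wrap a tokenizer
+         train_ds = Dataset.from_list(
+             [{"text": json.dumps(ex, default=str)} for ex in all_datasets]
+         )
+         train_ds = train_ds.map(
+             lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
+             remove_columns=["text"],
+         )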
+
+         # Training setup
+         yield from log("⚙️ Training Configuration:")
+         yield from log(f"   Epochs: {epochs}")
+         yield from log(f"   Batch Size: {batch_size}")
+         yield from log(f"   Learning Rate: {learning_rate}")
+         yield from log(f"   Samples: {len(all_datasets)}")
+         yield from log(f"   Output: {output_repo}")
+         yield from log("")
+
+         training_args = TrainingArguments(
+             output_dir="./training-output",
+             num_train_epochs=epochs,
+             per_device_train_batch_size=batch_size,
+             learning_rate=learning_rate,
+             logging_steps=10,
+             save_steps=100,
+             bf16=True,
+             push_to_hub=True,
+             hub_model_id=output_repo,
+             report_to="tensorboard",
+         )
+
+         # Create trainer; the collator pads each batch and copies input_ids
+         # to labels for causal-LM loss
+         trainer = Trainer(
+             model=model,
+             args=training_args,
+             train_dataset=train_ds if len(all_datasets) > 0 else None,
+             data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
+         )
+
+         # Train!
+         yield from log("🔥 TRAINING STARTED")
+         yield from log("=" * 80)
+
+         result = trainer.train()
+
+         yield from log("")
+         yield from log("=" * 80)
+         yield from log("✅ TRAINING COMPLETED!")
+         yield from log("=" * 80)
+         yield from log(f"📊 Final Loss: {result.training_loss:.4f}")
+         yield from log(f"☁️ Model uploaded to: {output_repo}")
+         yield from log("")
+         yield from log("🎉 SUCCESS!")
+
+     except Exception as e:
+         yield from log(f"\n❌ ERROR: {str(e)}")
+         yield from log(f"\n{traceback.format_exc()}")
+
+ # Build Gradio Interface
+ with gr.Blocks(title="Zen Training Space", theme=gr.themes.Soft()) as demo:
+     gr.Markdown("""
+     # 🧘 Zen Training Space
+     ### Unified Training Platform for All Zen Models
+
+     Train any Zen model with any dataset combination from HuggingFace.
+     All datasets are loaded directly from HF - no local storage needed!
+     """)
+
+     with gr.Row():
+         with gr.Column(scale=1):
+             gr.Markdown("### 1. Select Model")
+
+             model_choice = gr.Dropdown(
+                 choices=[f"{cat} / {model}" for cat in MODELS for model in MODELS[cat]],
+                 label="Model",
+                 value="Vision-Language Models / zen-vl-4b-instruct",
+             )
+
+             gr.Markdown("### 2. Select Datasets")
+
+             dataset_choices = gr.CheckboxGroup(
+                 choices=[f"{cat} / {ds}" for cat in DATASETS for ds in DATASETS[cat]],
+                 label="Datasets",
+                 value=[
+                     "Agent Training / ADP - Synatra",
+                     "Function Calling / xLAM Function Calling 60k",
+                 ],
+             )
+
+             gr.Markdown("### 3. Training Config")
+
+             max_samples = gr.Slider(100, 100000, value=10000, step=100, label="Max Samples")
+             epochs = gr.Slider(1, 10, value=3, step=1, label="Epochs")
+             batch_size = gr.Slider(1, 8, value=1, step=1, label="Batch Size")
+             learning_rate = gr.Number(value=2e-5, label="Learning Rate")
+
+             output_repo = gr.Textbox(
+                 value="zenlm/zen-vl-4b-agent-custom",
+                 label="Output Repository (HuggingFace)",
+             )
+
+             train_btn = gr.Button("🚀 Start Training", variant="primary", size="lg")
+
+         with gr.Column(scale=2):
+             gr.Markdown("### Training Logs")
+             output = gr.Textbox(lines=35, max_lines=50, show_label=False)
+
+     train_btn.click(
+         train_model,
+         inputs=[
+             model_choice,
+             dataset_choices,
+             max_samples,
+             epochs,
+             batch_size,
+             learning_rate,
+             output_repo,
+         ],
+         outputs=output,
+     )
+
+     gr.Markdown("""
+     ---
+     ### 📊 Available Models
+     - **Language**: nano (0.6B), eco (4B), omni (7B), coder (14B), next (32B)
+     - **Vision-Language**: zen-vl (4B, 8B, 30B)
+
+     ### 📚 Available Datasets
+     - **Agent Training**: ADP (220k+ trajectories across 15+ configs)
+     - **Function Calling**: xLAM (60k high-quality examples)
+     - **Instruction**: Alpaca (52k samples)
+
+     ### 💰 Cost Estimates (HF Pro GPU)
+     - 4B model: $3-5 for 10k samples
+     - 8B model: $8-12 for 10k samples
+     - 32B model: $30-50 for 10k samples
+     """)
+
+ if __name__ == "__main__":
+     demo.launch(server_name="0.0.0.0", server_port=7860)
requirements.txt ADDED
@@ -0,0 +1,16 @@
+ transformers>=4.57.1
+ torch>=2.0.0
+ torchvision>=0.15.0
+ datasets>=2.14.0
+ accelerate>=0.27.0
+ pillow>=10.0.0
+ gradio>=4.0.0
+ huggingface-hub>=0.20.0
+ tensorboard>=2.15.0
+ pydantic>=2.0.0
+ qwen-vl-utils
+ av
+ opencv-python
+ decord
+ sentencepiece
+ protobuf