Hanzo Dev committed
Commit 3e34c30 · 1 Parent(s): 664f1e2

Create unified Zen training space for all models and datasets

Files changed (3)
  1. README.md +141 -5
  2. app.py +373 -0
  3. requirements.txt +16 -0
README.md CHANGED
@@ -1,12 +1,148 @@
  ---
  title: Zen Training
- emoji: 🐢
- colorFrom: red
  colorTo: purple
  sdk: gradio
- sdk_version: 5.49.1
  app_file: app.py
- pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
  title: Zen Training
+ emoji: 🧘
+ colorFrom: blue
  colorTo: purple
  sdk: gradio
+ sdk_version: 4.0.0
  app_file: app.py
+ pinned: true
+ license: apache-2.0
+ hardware: a10g-large
  ---

+ # 🧘 Zen Training Space
+
+ **Unified Training Platform for All Zen Models**
+
+ Train any Zen model with any dataset combination from HuggingFace. Everything runs directly from HF datasets - no local storage needed!
+
+ ## 🎯 Features
+
+ ### Supported Models
+
+ **Language Models:**
+ - `zen-nano` (0.6B) - Edge deployment
+ - `zen-eco` (4B) - Balanced performance
+ - `zen-omni` (7B) - Multi-task
+ - `zen-coder` (14B) - Code generation
+ - `zen-next` (32B) - Frontier performance
+
+ **Vision-Language Models:**
+ - `zen-vl-4b` - Efficient VL with function calling
+ - `zen-vl-8b` - Enhanced VL capabilities
+ - `zen-vl-30b` - Maximum VL performance
+
+ ### Supported Datasets
+
+ **Agent Training (ADP):**
+ - AgentTuning OS/KG/DB (~15k samples)
+ - Synatra (99k agent trajectories)
+ - Code Feedback (66k samples)
+ - Go Browse (27k web interactions)
+
+ **Function Calling:**
+ - xLAM 60k (Salesforce high-quality function calling)
+
+ **Instruction Tuning:**
+ - Alpaca (52k instruction samples)
+
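+ Any of these can be streamed straight from the Hub. A minimal sketch (dataset IDs and config names as registered in this Space's `app.py`):
+
+ ```python
+ from datasets import load_dataset
+
+ # Stream the Synatra config of the ADP collection - nothing is stored locally
+ adp = load_dataset("neulab/agent-data-collection", "synatra",
+                    split="train", streaming=True)
+
+ # Single-config datasets need no config name
+ xlam = load_dataset("Salesforce/xlam-function-calling-60k",
+                     split="train", streaming=True)
+
+ for i, sample in enumerate(adp):
+     if i >= 3:
+         break
+     print(sample)  # peek at the trajectory schema
+ ```
+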
+ ## 🚀 How to Use
+
+ 1. **Select Model**: Choose from language or vision-language models
+ 2. **Select Datasets**: Check multiple datasets to combine them
+ 3. **Configure Training**: Set epochs, batch size, learning rate, max samples
+ 4. **Set Output Repo**: Specify HuggingFace repo for trained model
+ 5. **Start Training**: Click the button and monitor logs
+
+ ## ⚙️ Training Configuration
+
+ ### Recommended Settings
+
+ **4B Models (A10G - 24GB):**
+ - Batch Size: 1-2
+ - Max Samples: 10,000-30,000
+ - Time: 4-8 hours
+ - Cost: ~$3-5
+
+ **8B Models (A100 - 40GB):**
+ - Batch Size: 2-4
+ - Max Samples: 30,000-50,000
+ - Time: 8-12 hours
+ - Cost: ~$15-20
+
+ **32B Models (A100 - 80GB):**
+ - Batch Size: 1-2
+ - Max Samples: 50,000-100,000
+ - Time: 20-30 hours
+ - Cost: ~$50-80
+
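+ These settings map onto `transformers.TrainingArguments` one-for-one. A sketch for the 4B / A10G profile (close to the defaults in `app.py`; `gradient_accumulation_steps` and `hub_model_id` are illustrative additions):
+
+ ```python
+ from transformers import TrainingArguments
+
+ args = TrainingArguments(
+     output_dir="./training-output",
+     num_train_epochs=3,
+     per_device_train_batch_size=1,       # 1-2 fits a 4B model in 24GB
+     gradient_accumulation_steps=8,       # recovers a larger effective batch
+     learning_rate=2e-5,
+     bf16=True,                           # A10G supports bfloat16
+     logging_steps=10,
+     save_steps=100,                      # lower this for long runs (see Tips)
+     push_to_hub=True,
+     hub_model_id="your-org/your-model",  # placeholder
+ )
+ ```
+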
+ ## 📊 Dataset Combinations
+
+ ### For Agent Training:
+ ```
+ ADP Synatra (80%) + xLAM (20%)
+ = Strong agent + quality function calling
+ ```
+
+ ### For Code Models:
+ ```
+ Code Feedback (70%) + Alpaca (30%)
+ = Code expertise + general instruction following
+ ```
+
+ ### For VL Models:
+ ```
+ ADP (all configs) + xLAM
+ = Complete vision-language agent training
+ ```
+
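+ Mixes like these can be expressed with `datasets.interleave_datasets`. A hedged sketch of the 80/20 agent recipe (slice sizes and the `text` normalization are illustrative, not a tested pipeline):
+
+ ```python
+ from datasets import load_dataset, interleave_datasets
+
+ # Small non-streaming slices keep the sketch cheap to run
+ synatra = load_dataset("neulab/agent-data-collection", "synatra", split="train[:800]")
+ xlam = load_dataset("Salesforce/xlam-function-calling-60k", split="train[:200]")
+
+ # interleave_datasets needs a shared schema, so flatten each source
+ # to a single "text" column first
+ def to_text(ds):
+     return ds.map(lambda ex: {"text": str(ex)}, remove_columns=ds.column_names)
+
+ mixed = interleave_datasets([to_text(synatra), to_text(xlam)],
+                             probabilities=[0.8, 0.2], seed=42)
+ ```
+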
+ ## 🔒 Requirements
+
+ - HuggingFace Pro account (for GPU access)
+ - Write access to output repository
+ - HF_TOKEN secret set in Space settings
+
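+ Inside the Space the token is read from the environment. A quick sanity check (standard `huggingface_hub` calls):
+
+ ```python
+ import os
+ from huggingface_hub import login, whoami
+
+ login(token=os.environ["HF_TOKEN"])  # the secret set in Space settings
+ print(whoami()["name"])              # confirms the token authenticates
+ ```
+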
+ ## 💡 Tips
+
+ 1. **Start Small**: Test with 1,000 samples first
+ 2. **Mix Datasets**: Combine complementary datasets for best results
+ 3. **Monitor Logs**: Watch for OOM errors and adjust batch size
+ 4. **Save Often**: Lower save_steps for longer training runs
+
+ ## 📚 Resources
+
+ - **Website**: https://zenlm.org
+ - **GitHub**: https://github.com/zenlm
+ - **Models**: https://huggingface.co/zenlm
+ - **Datasets**:
+   - [ADP](https://huggingface.co/datasets/neulab/agent-data-collection)
+   - [xLAM](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)
+
+ ## 📄 License
+
+ Apache 2.0
+
+ ## 🙏 Citations
+
+ ```bibtex
+ @software{zen-training-2025,
+   title={Zen Training: Unified Training Platform for Zen Models},
+   author={Zen AI Team},
+   year={2025},
+   url={https://huggingface.co/spaces/zenlm/zen-training}
+ }
+
+ @article{adp2025,
+   title={Agent Data Protocol},
+   author={NeuLab},
+   journal={arXiv preprint arXiv:2510.24702},
+   year={2025}
+ }
+
+ @dataset{xlam2024,
+   title={xLAM Function Calling Dataset},
+   author={Salesforce Research},
+   year={2024}
+ }
+ ```
app.py ADDED
@@ -0,0 +1,373 @@
+ """
+ Zen Training Space - Unified Training for All Zen Models
+ Train any Zen model with any dataset combination from HuggingFace
+ """
+
+ import json
+ import os
+ import traceback
+ from typing import List
+
+ import gradio as gr
+ import torch
+ from datasets import Dataset, load_dataset
+ from transformers import (AutoModelForCausalLM, AutoModelForVision2Seq,
+                           AutoProcessor, AutoTokenizer,
+                           DataCollatorForLanguageModeling,
+                           Trainer, TrainingArguments)
+
+ # Model configurations
+ MODELS = {
+     "Language Models": {
+         "zen-nano-0.6b": {"hf_id": "zenlm/zen-nano-0.6b", "type": "language", "size": "0.6B", "context": "32K"},
+         "zen-eco-4b-instruct": {"hf_id": "zenlm/zen-eco-4b-instruct", "type": "language", "size": "4B", "context": "32K"},
+         "zen-eco-4b-agent": {"hf_id": "zenlm/zen-eco-4b-agent", "type": "language", "size": "4B", "context": "32K"},
+         "zen-omni-7b": {"hf_id": "zenlm/zen-omni-7b", "type": "language", "size": "7B", "context": "32K"},
+         "zen-coder-14b": {"hf_id": "zenlm/zen-coder-14b", "type": "language", "size": "14B", "context": "128K"},
+         "zen-next-32b": {"hf_id": "zenlm/zen-next-32b", "type": "language", "size": "32B", "context": "32K"},
+     },
+     "Vision-Language Models": {
+         "zen-vl-4b-instruct": {"hf_id": "zenlm/zen-vl-4b-instruct", "type": "vision-language", "size": "4B", "context": "32K"},
+         "zen-vl-8b-instruct": {"hf_id": "zenlm/zen-vl-8b-instruct", "type": "vision-language", "size": "8B", "context": "32K"},
+         "zen-vl-30b-instruct": {"hf_id": "zenlm/zen-vl-30b-instruct", "type": "vision-language", "size": "30B", "context": "32K"},
+     },
+ }
+
+ # Dataset configurations
+ DATASETS = {
+     "Agent Training": {
+         "ADP - AgentTuning OS": {"hf_id": "neulab/agent-data-collection", "config": "agenttuning_os", "size": "~5k samples"},
+         "ADP - AgentTuning KG": {"hf_id": "neulab/agent-data-collection", "config": "agenttuning_kg", "size": "~5k samples"},
+         "ADP - AgentTuning DB": {"hf_id": "neulab/agent-data-collection", "config": "agenttuning_db", "size": "~5k samples"},
+         "ADP - Synatra": {"hf_id": "neulab/agent-data-collection", "config": "synatra", "size": "99k samples"},
+         "ADP - Code Feedback": {"hf_id": "neulab/agent-data-collection", "config": "code_feedback", "size": "66k samples"},
+         "ADP - Go Browse": {"hf_id": "neulab/agent-data-collection", "config": "go-browse-wa", "size": "27k samples"},
+     },
+     "Function Calling": {
+         "xLAM Function Calling 60k": {"hf_id": "Salesforce/xlam-function-calling-60k", "config": None, "size": "60k samples"},
+     },
+     "Instruction Tuning": {
+         "Alpaca": {"hf_id": "tatsu-lab/alpaca", "config": None, "size": "52k samples"},
+     },
+ }
+
+ def train_model(
+     model_name: str,
+     selected_datasets: List[str],
+     max_samples: int,
+     epochs: int,
+     batch_size: int,
+     learning_rate: float,
+     output_repo: str,
+ ):
+     """Main training function"""
+     try:
+         logs = []
+
+         def log(msg):
+             print(msg)
+             logs.append(msg)
+             yield "\n".join(logs)
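+         # log() is itself a generator: each call records the message and
+         # yields the accumulated transcript, so the `yield from log(...)`
+         # calls below stream incremental output to the Gradio Textbox.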
+
+         yield from log("=" * 80)
+         yield from log("🧘 ZEN TRAINING SPACE")
+         yield from log("=" * 80)
+         yield from log("")
+
+         # GPU info
+         yield from log(f"🎮 GPU Available: {torch.cuda.is_available()}")
+         if torch.cuda.is_available():
+             yield from log(f"   Device: {torch.cuda.get_device_name(0)}")
+             yield from log(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB")
+         yield from log("")
+
+         # Dropdown values arrive as "Category / name", so strip the category
+         # prefix before looking up the bare model name
+         model_name = model_name.split(" / ")[-1]
+
+         # Find model config
+         model_config = None
+         for category in MODELS.values():
+             if model_name in category:
+                 model_config = category[model_name]
+                 break
+
+         if not model_config:
+             yield from log(f"❌ Model {model_name} not found")
+             return
+
+         yield from log(f"📦 Loading model: {model_name}")
+         yield from log(f"   HF ID: {model_config['hf_id']}")
+         yield from log(f"   Size: {model_config['size']}")
+         yield from log(f"   Type: {model_config['type']}")
+
+         # Load model - use the task-specific Auto classes so the model
+         # carries a language-modeling head the Trainer can optimize
+         if model_config['type'] == "vision-language":
+             model = AutoModelForVision2Seq.from_pretrained(
+                 model_config['hf_id'],
+                 torch_dtype=torch.bfloat16,
+                 device_map="auto",
+                 trust_remote_code=True,
+             )
+             processor = AutoProcessor.from_pretrained(model_config['hf_id'])
+         else:
+             model = AutoModelForCausalLM.from_pretrained(
+                 model_config['hf_id'],
+                 torch_dtype=torch.bfloat16,
+                 device_map="auto",
+                 trust_remote_code=True,
+             )
+             processor = AutoTokenizer.from_pretrained(model_config['hf_id'])
+
+         yield from log("✅ Model loaded")
+         yield from log("")
+
+         # Load datasets
+         yield from log("📚 Loading datasets...")
+         all_datasets = []
+
+         for dataset_name in selected_datasets:
+             # Checkbox values also arrive as "Category / name" - strip the prefix
+             dataset_name = dataset_name.split(" / ")[-1]
+
+             # Find dataset config
+             dataset_config = None
+             for category in DATASETS.values():
+                 if dataset_name in category:
+                     dataset_config = category[dataset_name]
+                     break
+
+             if not dataset_config:
+                 yield from log(f"⚠️ Dataset {dataset_name} not found, skipping")
+                 continue
+
+             yield from log(f"   Loading: {dataset_name}")
+             yield from log(f"   HF ID: {dataset_config['hf_id']}")
+
+             try:
+                 if dataset_config['config']:
+                     ds = load_dataset(
+                         dataset_config['hf_id'],
+                         dataset_config['config'],
+                         split="train",
+                         streaming=True,
+                     )
+                 else:
+                     ds = load_dataset(
+                         dataset_config['hf_id'],
+                         split="train",
+                         streaming=True,
+                     )
+
+                 # Take an even share of the sample budget per dataset
+                 samples = []
+                 for i, example in enumerate(ds):
+                     if i >= max_samples // len(selected_datasets):
+                         break
+                     samples.append(example)
+
+                 all_datasets.extend(samples)
+                 yield from log(f"   ✅ Loaded {len(samples)} samples")
+
+             except Exception as e:
+                 yield from log(f"   ❌ Error: {e}")
+
+         yield from log(f"\n✅ Total samples loaded: {len(all_datasets)}")
+         yield from log("")
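+
+         # The raw samples are dicts with heterogeneous, dataset-specific
+         # schemas, which the Trainer cannot consume directly. Minimal sketch
+         # of a fix (an editor assumption, not part of the original commit):
+         # serialize each sample to JSON text and tokenize it. A real run
+         # should map each dataset's schema onto the model's chat template.
+         yield from log("🔤 Tokenizing samples...")
+         tokenizer = getattr(processor, "tokenizer", processor)  # VL processors wrap a tokenizer
+         train_ds = Dataset.from_list(
+             [{"text": json.dumps(ex, default=str)} for ex in all_datasets]
+         )
+         train_ds = train_ds.map(
+             lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
+             remove_columns=["text"],
+         )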
+
+         # Training setup
+         yield from log("⚙️ Training Configuration:")
+         yield from log(f"   Epochs: {epochs}")
+         yield from log(f"   Batch Size: {batch_size}")
+         yield from log(f"   Learning Rate: {learning_rate}")
+         yield from log(f"   Samples: {len(all_datasets)}")
+         yield from log(f"   Output: {output_repo}")
+         yield from log("")
+
+         training_args = TrainingArguments(
+             output_dir="./training-output",
+             num_train_epochs=epochs,
+             per_device_train_batch_size=batch_size,
+             learning_rate=learning_rate,
+             logging_steps=10,
+             save_steps=100,
+             bf16=True,
+             push_to_hub=True,
+             hub_model_id=output_repo,
+             report_to="tensorboard",
+         )
+
+         # Create trainer; the collator pads each batch and copies input_ids
+         # to labels for causal-LM loss
+         trainer = Trainer(
+             model=model,
+             args=training_args,
+             train_dataset=train_ds if len(all_datasets) > 0 else None,
+             data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
+         )
+
+         # Train!
+         yield from log("🔥 TRAINING STARTED")
+         yield from log("=" * 80)
+
+         result = trainer.train()
+
+         yield from log("")
+         yield from log("=" * 80)
+         yield from log("✅ TRAINING COMPLETED!")
+         yield from log("=" * 80)
+         yield from log(f"📊 Final Loss: {result.training_loss:.4f}")
+         yield from log(f"☁️ Model uploaded to: {output_repo}")
+         yield from log("")
+         yield from log("🎉 SUCCESS!")
+
+     except Exception as e:
+         yield from log(f"\n❌ ERROR: {str(e)}")
+         yield from log(f"\n{traceback.format_exc()}")
+
+ # Build Gradio Interface
+ with gr.Blocks(title="Zen Training Space", theme=gr.themes.Soft()) as demo:
+     gr.Markdown("""
+     # 🧘 Zen Training Space
+     ### Unified Training Platform for All Zen Models
+
+     Train any Zen model with any dataset combination from HuggingFace.
+     All datasets are loaded directly from HF - no local storage needed!
+     """)
+
+     with gr.Row():
+         with gr.Column(scale=1):
+             gr.Markdown("### 1. Select Model")
+
+             model_choice = gr.Dropdown(
+                 choices=[f"{cat} / {model}" for cat in MODELS for model in MODELS[cat]],
+                 label="Model",
+                 value="Vision-Language Models / zen-vl-4b-instruct",
+             )
+
+             gr.Markdown("### 2. Select Datasets")
+
+             dataset_choices = gr.CheckboxGroup(
+                 choices=[f"{cat} / {ds}" for cat in DATASETS for ds in DATASETS[cat]],
+                 label="Datasets",
+                 value=[
+                     "Agent Training / ADP - Synatra",
+                     "Function Calling / xLAM Function Calling 60k",
+                 ],
+             )
+
+             gr.Markdown("### 3. Training Config")
+
+             max_samples = gr.Slider(100, 100000, value=10000, step=100, label="Max Samples")
+             epochs = gr.Slider(1, 10, value=3, step=1, label="Epochs")
+             batch_size = gr.Slider(1, 8, value=1, step=1, label="Batch Size")
+             learning_rate = gr.Number(value=2e-5, label="Learning Rate")
+
+             output_repo = gr.Textbox(
+                 value="zenlm/zen-vl-4b-agent-custom",
+                 label="Output Repository (HuggingFace)",
+             )
+
+             train_btn = gr.Button("🚀 Start Training", variant="primary", size="lg")
+
+         with gr.Column(scale=2):
+             gr.Markdown("### Training Logs")
+             output = gr.Textbox(lines=35, max_lines=50, show_label=False)
+
+     train_btn.click(
+         train_model,
+         inputs=[
+             model_choice,
+             dataset_choices,
+             max_samples,
+             epochs,
+             batch_size,
+             learning_rate,
+             output_repo,
+         ],
+         outputs=output,
+     )
+
+     gr.Markdown("""
+     ---
+     ### 📊 Available Models
+     - **Language**: nano (0.6B), eco (4B), omni (7B), coder (14B), next (32B)
+     - **Vision-Language**: zen-vl (4B, 8B, 30B)
+
+     ### 📚 Available Datasets
+     - **Agent Training**: ADP (220k+ trajectories across 15+ configs)
+     - **Function Calling**: xLAM (60k high-quality examples)
+     - **Instruction**: Alpaca (52k samples)
+
+     ### 💰 Cost Estimates (HF Pro GPU)
+     - 4B model: $3-5 for 10k samples
+     - 8B model: $8-12 for 10k samples
+     - 32B model: $30-50 for 10k samples
+     """)
+
+ if __name__ == "__main__":
+     demo.launch(server_name="0.0.0.0", server_port=7860)
requirements.txt ADDED
@@ -0,0 +1,16 @@
+ transformers>=4.57.1
+ torch>=2.0.0
+ torchvision>=0.15.0
+ datasets>=2.14.0
+ accelerate>=0.27.0
+ pillow>=10.0.0
+ gradio>=4.0.0
+ huggingface-hub>=0.20.0
+ tensorboard>=2.15.0
+ pydantic>=2.0.0
+ qwen-vl-utils
+ av
+ opencv-python
+ decord
+ sentencepiece
+ protobuf