# Llama 3.2 Fine-tuning - Memory-Safe Version
A production-ready notebook for fine-tuning Meta's Llama 3.2 1B model using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) with extensive memory management and crash prevention.
## Overview

This project implements a two-stage fine-tuning pipeline for Llama 3.2 focused on content safety:

1. **Supervised Fine-Tuning (SFT)** - teaches instruction following
2. **Direct Preference Optimization (DPO)** - aligns the model with safety preferences
### Key Features

- ✅ **Memory-safe design** - prevents kernel crashes on limited GPU memory
- ✅ **Version-pinned packages** - reproducible environment setup
- ✅ **Aggressive memory management** - optimized for the Google Colab free tier
- ✅ **Extensive error handling** - clear troubleshooting messages
- ✅ **Step-by-step execution** - safe incremental progress
- ✅ **Production-ready** - upload directly to the Hugging Face Hub
## Requirements

### Minimum Requirements
- GPU: NVIDIA T4 (16GB VRAM) or better
- RAM: High-RAM runtime (if available)
- Platform: Google Colab (recommended) or local setup with CUDA
- Storage: ~5GB for model checkpoints
### Recommended Setup
- GPU: NVIDIA A100 (40GB+ VRAM)
- Platform: Google Colab Pro/Pro+ for longer sessions
- Internet: Stable connection for dataset download and model upload
## Quick Start

### Option 1: Google Colab (Recommended)
1. Open the notebook in Google Colab
2. Enable GPU: `Runtime → Change runtime type → T4 GPU`
3. (Optional) Enable High-RAM: `Edit → Notebook settings → High-RAM`
4. Run cells sequentially from Step 1 to Step 17
5. **Important:** Restart the runtime after Step 3 (package installation)
### Option 2: Local Setup
```bash
# Clone the repository
git clone https://github.com/YOUR_USERNAME/llama-3.2-safe-finetuning.git
cd llama-3.2-safe-finetuning

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install torch transformers datasets accelerate peft trl bitsandbytes scipy

# Launch Jupyter
jupyter notebook llama_3_2_minimal_safe.ipynb
```
## Training Pipeline

### Dataset
- Source: NVIDIA Aegis AI Content Safety Dataset 2.0
- Size: 500 samples (configurable)
- Split: 80% SFT / 20% DPO
- Focus: Safe, responsible AI responses
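
As a hedged sketch (the notebook's Step 8 may filter or reformat columns differently), the subset and split can be produced with the `datasets` library like this:

```python
from datasets import load_dataset

# Sketch only: load Aegis 2.0, take the configurable 500-sample subset,
# and carve out the 80/20 SFT/DPO split described above. The split name
# "train" and the lack of preprocessing here are assumptions.
dataset = load_dataset("nvidia/Aegis-AI-Content-Safety-Dataset-2.0", split="train")
subset = dataset.shuffle(seed=42).select(range(500))    # NUM_SAMPLES
split = subset.train_test_split(test_size=0.2, seed=42)
sft_data, dpo_data = split["train"], split["test"]      # 80% SFT / 20% DPO
```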
### Training Configuration
| Parameter | SFT | DPO |
|---|---|---|
| Epochs | 2 | 1 |
| Batch Size | 1 | 1 |
| Gradient Accumulation | 8 | 8 |
| Learning Rate | 1e-5 | 5e-7 |
| Max Sequence Length | 1024 | 1024 |
| LoRA r | 8 | 8 |
| LoRA alpha | 16 | 16 |
| Training Time | ~20-30 min | ~10-20 min |
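
For orientation, here is how the table might translate into TRL training configs, assuming a recent TRL release that provides `SFTConfig` and `DPOConfig`; argument names shift between TRL versions (notably the max-sequence-length field), so treat this as a sketch rather than the notebook's exact code.

```python
from trl import SFTConfig, DPOConfig

# Sketch mirroring the table above; output dirs match the project structure.
sft_args = SFTConfig(
    output_dir="sft_output",
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    bf16=True,
    gradient_checkpointing=True,
)
dpo_args = DPOConfig(
    output_dir="dpo_output",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    bf16=True,
    gradient_checkpointing=True,
)
```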
### LoRA Configuration

```python
LORA_R = 8           # Rank
LORA_ALPHA = 16      # Alpha scaling
LORA_DROPOUT = 0.05  # Dropout rate
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]
```
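
These constants map onto PEFT roughly as follows; `bias` and `task_type` are assumed defaults, not values confirmed by the notebook.

```python
from peft import LoraConfig

# Sketch: wiring the constants above into a PEFT config.
peft_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=TARGET_MODULES,
    bias="none",            # assumption
    task_type="CAUSAL_LM",  # assumption
)
```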
### Memory Optimizations
- 4-bit NF4 quantization
- Gradient checkpointing
- BF16 mixed precision
- Aggressive garbage collection
- Optimized batch sizes
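
Assuming the standard `transformers` + `bitsandbytes` route (not necessarily the notebook's exact cell), the first three optimizations combine like this:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with BF16 compute, as listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    quantization_config=bnb_config,
    device_map="auto",
)
model.gradient_checkpointing_enable()  # trade compute for memory during training
```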
## Project Structure

```
llama-3.2-safe-finetuning/
├── llama_3_2_minimal_safe.ipynb   # Main training notebook
├── README.md                      # This file
├── LICENSE                        # Llama 3.2 Community License
├── requirements.txt               # Python dependencies (if local)
├── sft_output/                    # SFT training checkpoints
├── dpo_output/                    # DPO training checkpoints
├── llama-3.2-1b-sft/              # Final SFT model
├── llama-3.2-1b-sft-dpo/          # Final merged model
└── model_card.md                  # Generated model card for HF Hub
```
## Configuration
Update these variables in Step 5 before training:
```python
# Model configuration
MODEL_NAME = "meta-llama/Llama-3.2-1B"
DATASET_NAME = "nvidia/Aegis-AI-Content-Safety-Dataset-2.0"
NEW_MODEL_NAME = "Llama-3.2-1B-Aegis-SFT-DPO"

# ⚠️ UPDATE THIS!
HF_USERNAME = "ahczhg"  # Your Hugging Face username

# Memory-optimized settings
MAX_SEQ_LENGTH = 1024   # Reduce to 512 if you hit memory issues
NUM_SAMPLES = 500       # Reduce to 200-300 if needed
```
## Step-by-Step Guide

### Steps Overview
| Step | Description | Time | Can Skip? |
|---|---|---|---|
| 1 | Environment setup | <1 min | ❌ |
| 2 | GPU verification | <1 min | ❌ |
| 3 | Package installation | 3-5 min | ❌ |
| 4 | Import libraries | <1 min | ❌ |
| 5 | Configuration | <1 min | ❌ |
| 6 | Hugging Face login | <1 min | ❌ |
| 7 | Utility functions | <1 min | ❌ |
| 8 | Load dataset | 1-2 min | ❌ |
| 9 | Load tokenizer | 1-2 min | ❌ |
| 10 | Load model | 2-5 min | ❌ |
| 11 | SFT setup | <1 min | ❌ |
| 12 | SFT training | 15-30 min | ❌ |
| 13 | Save SFT model | 1-2 min | ❌ |
| 14 | DPO preparation | 2-5 min | ✅ Optional |
| 15 | DPO training | 10-20 min | ✅ Optional |
| 16 | Save DPO model | 1-2 min | ✅ Optional |
| 17 | Upload to HF Hub (sketch below) | 5-15 min | ✅ Optional |
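
Step 17 boils down to something like the following sketch, where `model`, `tokenizer`, `HF_USERNAME`, and `NEW_MODEL_NAME` come from the earlier steps; the notebook may also upload the generated model card.

```python
from huggingface_hub import login

# Sketch of the Step 17 upload; requires a write-access token.
login()  # or login(token="hf_...")
repo_id = f"{HF_USERNAME}/{NEW_MODEL_NAME}"
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
```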
### Critical Notes

- **Restart the runtime after Step 3** - this is mandatory!
- **Run cells in order** - don't skip early steps
- **Monitor memory** - watch GPU usage in Step 10; a cleanup helper is sketched below
- **Accept the Llama license** - visit https://huggingface.co/meta-llama/Llama-3.2-1B
- **DPO is optional** - you can stop after Step 13 with an SFT-only model
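
The Step 7 utilities are not reproduced in this README, but a cleanup helper of this general shape (an assumption, not the notebook's exact code) is what the notes above refer to:

```python
import gc
import torch

# Assumed shape of a Step 7-style cleanup utility.
def clear_memory():
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        used = torch.cuda.memory_allocated() / 1e9
        total = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"GPU memory: {used:.1f} / {total:.1f} GB")
```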
## Usage Examples

### Basic Inference
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the fine-tuned model
model_name = "ahczhg/Llama-3.2-1B-Aegis-SFT-DPO"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Prepare the prompt
messages = [{"role": "user", "content": "What is artificial intelligence?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Generate
outputs = model.generate(
    inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

# Decode
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
### Multi-turn Conversation
```python
messages = [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."},
    {"role": "user", "content": "Can you give an example?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
# do_sample=True is required for temperature to take effect
outputs = model.generate(inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Troubleshooting

### Common Issues
#### 1. Out of Memory (OOM) Error

**Symptoms:** `RuntimeError: CUDA out of memory`

**Solutions:**
- Reduce `BATCH_SIZE` to 1 (in Step 5)
- Reduce `MAX_SEQ_LENGTH` to 512 or 768
- Reduce `NUM_SAMPLES` to 200-300
- Enable the High-RAM runtime in Colab
- Upgrade to an A100 GPU
#### 2. Kernel Crash During Model Loading

**Symptoms:** Colab session disconnects at Step 10

**Solutions:**
- Restart the runtime: `Runtime → Restart runtime`
- Clear memory before loading: run the Step 7 utilities
- Ensure you're using a T4 GPU or better
- Close other browser tabs to free system memory
#### 3. Import Errors After Step 3

**Symptoms:** `ImportError: cannot import name...`

**Solutions:**
- Did you restart the runtime? This is mandatory after Step 3!
- Run `Runtime → Restart runtime`
- Re-run all cells from Step 1
#### 4. Hugging Face Authentication Failed

**Symptoms:** `401 Unauthorized` during login

**Solutions:**
- Get a write-access token: https://huggingface.co/settings/tokens
- Accept the Llama license: https://huggingface.co/meta-llama/Llama-3.2-1B
- Re-run Step 6 with the new token
#### 5. Dataset Download Timeout

**Symptoms:** Stuck downloading the dataset in Step 8

**Solutions:**
- Check your internet connection
- Restart the runtime and try again
- Reduce `NUM_SAMPLES` to 200
- Use a smaller dataset
#### 6. Training Loss Not Decreasing

**Symptoms:** Loss stays constant or increases

**Solutions:**
- Increase the learning rate to 2e-5 (SFT) or 1e-6 (DPO)
- Increase the number of epochs
- Check data quality in Step 8
- Verify the LoRA target modules are correct
### Performance Optimization

#### Speed Up Training

```python
# In Step 5, adjust:
BATCH_SIZE = 2         # If you have >16GB VRAM
GRAD_ACCUM = 4         # Reduce if batch size is increased
MAX_SEQ_LENGTH = 768   # Shorter sequences = faster
NUM_SAMPLES = 300      # Fewer samples = faster
```
#### Improve Model Quality

```python
# In Step 5, adjust:
SFT_EPOCHS = 3      # More epochs
DPO_EPOCHS = 2      # More DPO training
NUM_SAMPLES = 1000  # More training data
LORA_R = 16         # Larger LoRA rank
LORA_ALPHA = 32     # Keep alpha at 2x the rank
```
## Resources
### Related Projects
- AMD Instella-3B-Instruct - Inspiration for SFT+DPO approach
- Axolotl - Advanced fine-tuning framework
- LLaMA-Factory - Easy-to-use LLM fine-tuning
## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/improvement`)
3. Commit your changes (`git commit -am 'Add new feature'`)
4. Push to the branch (`git push origin feature/improvement`)
5. Open a Pull Request
### Areas for Improvement
- Add evaluation metrics (BLEU, ROUGE, perplexity)
- Support for multi-GPU training
- Automatic hyperparameter tuning
- Integration with W&B/TensorBoard
- Add more datasets
- Quantization for deployment (GGUF, GPTQ)
## License
This project is licensed under the Llama 3.2 Community License Agreement.
Key points:
- ✅ Commercial use allowed (with restrictions)
- ✅ Modification and distribution permitted
- ❌ Cannot be used to train other large language models without permission
- ❌ Monthly active users >700M require a special license
Full license text: See LICENSE file or https://huggingface.co/meta-llama/Llama-3.2-1B
## Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: Your contact email (optional)
## Acknowledgments
- Meta AI - For the Llama 3.2 foundation model
- NVIDIA - For the Aegis AI Content Safety Dataset
- Hugging Face - For transformers, TRL, PEFT, and datasets libraries
- Google Colab - For free GPU compute resources
- AMD - For the Instella training methodology inspiration
## Citation
If you use this project in your research or work, please cite:
```bibtex
@misc{llama32_safe_finetuning,
  author    = {Community Contributor},
  title     = {Llama 3.2 Fine-tuning - Memory-Safe Version},
  year      = {2024},
  publisher = {GitHub},
  url       = {https://github.com/YOUR_USERNAME/llama-3.2-safe-finetuning}
}
```
⭐ Star this repo if you find it useful!