# Llama 3.2 Fine-tuning - Memory-Safe Version
A production-ready notebook for fine-tuning Meta's Llama 3.2 1B model using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) with extensive memory management and crash prevention.
## Overview

This project implements a two-stage fine-tuning pipeline for Llama 3.2 focused on content safety:

1. **Supervised Fine-Tuning (SFT)** - teaches instruction following
2. **Direct Preference Optimization (DPO)** - aligns the model with safety preferences
### Key Features

- ✅ **Memory-safe design** - prevents kernel crashes on limited GPU memory
- ✅ **Version-pinned packages** - reproducible environment setup
- ✅ **Aggressive memory management** - optimized for the Google Colab free tier
- ✅ **Extensive error handling** - clear troubleshooting messages
- ✅ **Step-by-step execution** - safe incremental progress
- ✅ **Production-ready** - upload directly to the Hugging Face Hub
## Requirements

### Minimum Requirements
- GPU: NVIDIA T4 (16GB VRAM) or better
- RAM: High-RAM runtime (if available)
- Platform: Google Colab (recommended) or local setup with CUDA
- Storage: ~5GB for model checkpoints
### Recommended Setup
- GPU: NVIDIA A100 (40GB+ VRAM)
- Platform: Google Colab Pro/Pro+ for longer sessions
- Internet: Stable connection for dataset download and model upload
## Quick Start

### Option 1: Google Colab (Recommended)
1. Open the notebook in Google Colab
2. Enable GPU: `Runtime → Change runtime type → T4 GPU`
3. (Optional) Enable High-RAM: `Edit → Notebook settings → High-RAM`
4. Run cells sequentially from Step 1 to Step 17
5. **Important:** Restart the runtime after Step 3 (package installation)
### Option 2: Local Setup
```bash
# Clone the repository
git clone https://github.com/YOUR_USERNAME/llama-3.2-safe-finetuning.git
cd llama-3.2-safe-finetuning

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install torch transformers datasets accelerate peft trl bitsandbytes scipy

# Launch Jupyter
jupyter notebook llama_3_2_minimal_safe.ipynb
```
## Training Pipeline

### Dataset
- Source: NVIDIA Aegis AI Content Safety Dataset 2.0
- Size: 500 samples (configurable)
- Split: 80% SFT / 20% DPO
- Focus: Safe, responsible AI responses
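
As a hedged sketch (the notebook's Step 8 may filter or reformat columns differently), the subset and split can be produced with the `datasets` library like this:

```python
from datasets import load_dataset

# Sketch only: load Aegis 2.0, take the configurable 500-sample subset,
# and carve out the 80/20 SFT/DPO split described above. The split name
# "train" and the lack of preprocessing here are assumptions.
dataset = load_dataset("nvidia/Aegis-AI-Content-Safety-Dataset-2.0", split="train")
subset = dataset.shuffle(seed=42).select(range(500))    # NUM_SAMPLES
split = subset.train_test_split(test_size=0.2, seed=42)
sft_data, dpo_data = split["train"], split["test"]      # 80% SFT / 20% DPO
```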
### Training Configuration
| Parameter | SFT | DPO |
|---|---|---|
| Epochs | 2 | 1 |
| Batch Size | 1 | 1 |
| Gradient Accumulation | 8 | 8 |
| Learning Rate | 1e-5 | 5e-7 |
| Max Sequence Length | 1024 | 1024 |
| LoRA r | 8 | 8 |
| LoRA alpha | 16 | 16 |
| Training Time | ~20-30 min | ~10-20 min |
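
For orientation, here is how the table might translate into TRL training configs, assuming a recent TRL release that provides `SFTConfig` and `DPOConfig`; argument names shift between TRL versions (notably the max-sequence-length field), so treat this as a sketch rather than the notebook's exact code.

```python
from trl import SFTConfig, DPOConfig

# Sketch mirroring the table above; output dirs match the project structure.
sft_args = SFTConfig(
    output_dir="sft_output",
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    bf16=True,
    gradient_checkpointing=True,
)
dpo_args = DPOConfig(
    output_dir="dpo_output",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    bf16=True,
    gradient_checkpointing=True,
)
```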
### LoRA Configuration

```python
LORA_R = 8           # Rank
LORA_ALPHA = 16      # Alpha scaling
LORA_DROPOUT = 0.05  # Dropout rate
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]
```
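
These constants map onto PEFT roughly as follows; `bias` and `task_type` are assumed defaults, not values confirmed by the notebook.

```python
from peft import LoraConfig

# Sketch: wiring the constants above into a PEFT config.
peft_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=TARGET_MODULES,
    bias="none",            # assumption
    task_type="CAUSAL_LM",  # assumption
)
```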
### Memory Optimizations
- 4-bit NF4 quantization
- Gradient checkpointing
- BF16 mixed precision
- Aggressive garbage collection
- Optimized batch sizes
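
Assuming the standard `transformers` + `bitsandbytes` route (not necessarily the notebook's exact cell), the first three optimizations combine like this:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with BF16 compute, as listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    quantization_config=bnb_config,
    device_map="auto",
)
model.gradient_checkpointing_enable()  # trade compute for memory during training
```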
## Project Structure

```
llama-3.2-safe-finetuning/
├── llama_3_2_minimal_safe.ipynb   # Main training notebook
├── README.md                      # This file
├── LICENSE                        # Llama 3.2 Community License
├── requirements.txt               # Python dependencies (if local)
├── sft_output/                    # SFT training checkpoints
├── dpo_output/                    # DPO training checkpoints
├── llama-3.2-1b-sft/              # Final SFT model
├── llama-3.2-1b-sft-dpo/          # Final merged model
└── model_card.md                  # Generated model card for HF Hub
```
## Configuration
Update these variables in Step 5 before training:
```python
# Model configuration
MODEL_NAME = "meta-llama/Llama-3.2-1B"
DATASET_NAME = "nvidia/Aegis-AI-Content-Safety-Dataset-2.0"
NEW_MODEL_NAME = "Llama-3.2-1B-Aegis-SFT-DPO"

# ⚠️ UPDATE THIS!
HF_USERNAME = "ahczhg"  # Your Hugging Face username

# Memory-optimized settings
MAX_SEQ_LENGTH = 1024   # Reduce to 512 if you hit memory issues
NUM_SAMPLES = 500       # Reduce to 200-300 if needed
```
## Step-by-Step Guide

### Steps Overview
| Step | Description | Time | Can Skip? |
|---|---|---|---|
| 1 | Environment setup | <1 min | ❌ |
| 2 | GPU verification | <1 min | ❌ |
| 3 | Package installation | 3-5 min | ❌ |
| 4 | Import libraries | <1 min | ❌ |
| 5 | Configuration | <1 min | ❌ |
| 6 | Hugging Face login | <1 min | ❌ |
| 7 | Utility functions | <1 min | ❌ |
| 8 | Load dataset | 1-2 min | ❌ |
| 9 | Load tokenizer | 1-2 min | ❌ |
| 10 | Load model | 2-5 min | ❌ |
| 11 | SFT setup | <1 min | ❌ |
| 12 | SFT training | 15-30 min | ❌ |
| 13 | Save SFT model | 1-2 min | ❌ |
| 14 | DPO preparation | 2-5 min | ✅ Optional |
| 15 | DPO training | 10-20 min | ✅ Optional |
| 16 | Save DPO model | 1-2 min | ✅ Optional |
| 17 | Upload to HF Hub (sketch below) | 5-15 min | ✅ Optional |
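
Step 17 boils down to something like the following sketch, where `model`, `tokenizer`, `HF_USERNAME`, and `NEW_MODEL_NAME` come from the earlier steps; the notebook may also upload the generated model card.

```python
from huggingface_hub import login

# Sketch of the Step 17 upload; requires a write-access token.
login()  # or login(token="hf_...")
repo_id = f"{HF_USERNAME}/{NEW_MODEL_NAME}"
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
```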
### Critical Notes

- **Restart the runtime after Step 3** - this is mandatory!
- **Run cells in order** - don't skip early steps
- **Monitor memory** - watch GPU usage in Step 10; a cleanup helper is sketched below
- **Accept the Llama license** - visit https://huggingface.co/meta-llama/Llama-3.2-1B
- **DPO is optional** - you can stop after Step 13 with an SFT-only model
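
The Step 7 utilities are not reproduced in this README, but a cleanup helper of this general shape (an assumption, not the notebook's exact code) is what the notes above refer to:

```python
import gc
import torch

# Assumed shape of a Step 7-style cleanup utility.
def clear_memory():
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        used = torch.cuda.memory_allocated() / 1e9
        total = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"GPU memory: {used:.1f} / {total:.1f} GB")
```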
## Usage Examples

### Basic Inference
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the fine-tuned model
model_name = "ahczhg/Llama-3.2-1B-Aegis-SFT-DPO"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Prepare the prompt
messages = [{"role": "user", "content": "What is artificial intelligence?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Generate
outputs = model.generate(
    inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

# Decode
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
### Multi-turn Conversation
```python
messages = [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."},
    {"role": "user", "content": "Can you give an example?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
# do_sample=True is required for temperature to take effect
outputs = model.generate(inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Troubleshooting

### Common Issues
#### 1. Out of Memory (OOM) Error

**Symptoms:** `RuntimeError: CUDA out of memory`

**Solutions:**
- Reduce `BATCH_SIZE` to 1 (in Step 5)
- Reduce `MAX_SEQ_LENGTH` to 512 or 768
- Reduce `NUM_SAMPLES` to 200-300
- Enable the High-RAM runtime in Colab
- Upgrade to an A100 GPU
#### 2. Kernel Crash During Model Loading

**Symptoms:** Colab session disconnects at Step 10

**Solutions:**
- Restart the runtime: `Runtime → Restart runtime`
- Clear memory before loading: run the Step 7 utilities
- Ensure you're using a T4 GPU or better
- Close other browser tabs to free system memory
#### 3. Import Errors After Step 3

**Symptoms:** `ImportError: cannot import name...`

**Solutions:**
- Did you restart the runtime? This is mandatory after Step 3!
- Run `Runtime → Restart runtime`
- Re-run all cells from Step 1
#### 4. Hugging Face Authentication Failed

**Symptoms:** `401 Unauthorized` during login

**Solutions:**
- Get a write-access token: https://huggingface.co/settings/tokens
- Accept the Llama license: https://huggingface.co/meta-llama/Llama-3.2-1B
- Re-run Step 6 with the new token
#### 5. Dataset Download Timeout

**Symptoms:** Stuck downloading the dataset in Step 8

**Solutions:**
- Check your internet connection
- Restart the runtime and try again
- Reduce `NUM_SAMPLES` to 200
- Use a smaller dataset
#### 6. Training Loss Not Decreasing

**Symptoms:** Loss stays constant or increases

**Solutions:**
- Increase the learning rate to 2e-5 (SFT) or 1e-6 (DPO)
- Increase the number of epochs
- Check data quality in Step 8
- Verify the LoRA target modules are correct
### Performance Optimization

#### Speed Up Training

```python
# In Step 5, adjust:
BATCH_SIZE = 2         # If you have >16GB VRAM
GRAD_ACCUM = 4         # Reduce if batch size is increased
MAX_SEQ_LENGTH = 768   # Shorter sequences = faster
NUM_SAMPLES = 300      # Fewer samples = faster
```
#### Improve Model Quality

```python
# In Step 5, adjust:
SFT_EPOCHS = 3      # More epochs
DPO_EPOCHS = 2      # More DPO training
NUM_SAMPLES = 1000  # More training data
LORA_R = 16         # Larger LoRA rank
LORA_ALPHA = 32     # Keep alpha at 2x the rank
```
## Resources
### Related Projects
- AMD Instella-3B-Instruct - Inspiration for SFT+DPO approach
- Axolotl - Advanced fine-tuning framework
- LLaMA-Factory - Easy-to-use LLM fine-tuning
## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/improvement`)
3. Commit your changes (`git commit -am 'Add new feature'`)
4. Push to the branch (`git push origin feature/improvement`)
5. Open a Pull Request
### Areas for Improvement
- Add evaluation metrics (BLEU, ROUGE, perplexity)
- Support for multi-GPU training
- Automatic hyperparameter tuning
- Integration with W&B/TensorBoard
- Add more datasets
- Quantization for deployment (GGUF, GPTQ)
## License
This project is licensed under the Llama 3.2 Community License Agreement.
Key points:
- ✅ Commercial use allowed (with restrictions)
- ✅ Modification and distribution permitted
- ❌ Cannot be used to train other large language models without permission
- ❌ Monthly active users >700M require a special license
Full license text: See LICENSE file or https://huggingface.co/meta-llama/Llama-3.2-1B
## Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: Your contact email (optional)
## Acknowledgments
- Meta AI - For the Llama 3.2 foundation model
- NVIDIA - For the Aegis AI Content Safety Dataset
- Hugging Face - For transformers, TRL, PEFT, and datasets libraries
- Google Colab - For free GPU compute resources
- AMD - For the Instella training methodology inspiration
## Citation
If you use this project in your research or work, please cite:
```bibtex
@misc{llama32_safe_finetuning,
  author    = {Community Contributor},
  title     = {Llama 3.2 Fine-tuning - Memory-Safe Version},
  year      = {2024},
  publisher = {GitHub},
  url       = {https://github.com/YOUR_USERNAME/llama-3.2-safe-finetuning}
}
```
⭐ Star this repo if you find it useful!