Llama 3.2 Fine-tuning - Memory-Safe Version

License: Llama 3.2 Community License | Python 3.8+ | Google Colab

A production-ready notebook for fine-tuning Meta's Llama 3.2 1B model using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) with extensive memory management and crash prevention.

🎯 Overview

This project implements a two-stage fine-tuning pipeline for Llama 3.2 focused on content safety:

  1. Supervised Fine-Tuning (SFT) - Teaching instruction following
  2. Direct Preference Optimization (DPO) - Aligning with safety preferences
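
In TRL terms, the two stages boil down to roughly the sketch below. This is an illustration rather than the notebook's exact code: model, the dataset variables, and the sft_args/dpo_args config objects are placeholders, and TRL argument names shift between releases.

from trl import SFTTrainer, DPOTrainer

# Stage 1: supervised fine-tuning on instruction/response pairs
sft_trainer = SFTTrainer(
    model=model,                   # quantized Llama 3.2 1B with LoRA adapters (placeholder)
    args=sft_args,                 # SFTConfig built in the configuration step (placeholder)
    train_dataset=sft_dataset,     # formatted instruction data (placeholder)
    processing_class=tokenizer,    # called "tokenizer" in older TRL releases (placeholder)
)
sft_trainer.train()

# Stage 2: preference optimization on (prompt, chosen, rejected) triples
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,       # start DPO from the SFT result
    args=dpo_args,                 # DPOConfig with a much lower learning rate (placeholder)
    train_dataset=dpo_dataset,     # preference pairs derived from the safety data (placeholder)
    processing_class=tokenizer,
)
dpo_trainer.train()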

Key Features

  • ✅ Memory-safe design - Prevents kernel crashes on limited GPU memory
  • ✅ Version-pinned packages - Reproducible environment setup
  • ✅ Aggressive memory management - Optimized for Google Colab free tier
  • ✅ Extensive error handling - Clear troubleshooting messages
  • ✅ Step-by-step execution - Safe incremental progress
  • ✅ Production-ready - Upload directly to Hugging Face Hub

📋 Requirements

Minimum Requirements

  • GPU: NVIDIA T4 (16GB VRAM) or better
  • RAM: High-RAM runtime (if available)
  • Platform: Google Colab (recommended) or local setup with CUDA
  • Storage: ~5GB for model checkpoints

Recommended Setup

  • GPU: NVIDIA A100 (40GB+ VRAM)
  • Platform: Google Colab Pro/Pro+ for longer sessions
  • Internet: Stable connection for dataset download and model upload

🚀 Quick Start

Option 1: Google Colab (Recommended)

  1. Open the notebook in Google Colab
  2. Enable GPU: Runtime → Change runtime type → T4 GPU
  3. (Optional) Enable High-RAM: Edit → Notebook settings → High-RAM
  4. Run cells sequentially from Step 1 to Step 17
  5. Important: Restart runtime after Step 3 (package installation)
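
Before moving on, a quick check along these lines (a minimal PyTorch sketch of the kind of verification the notebook's Step 2 performs) confirms the runtime actually sees a T4 or better:

import torch

# Fail early if no CUDA GPU is attached to the runtime
assert torch.cuda.is_available(), "No GPU detected - enable one under Runtime -> Change runtime type"

# Report device name and total VRAM (a T4 reports roughly 15-16 GB)
props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}")
print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")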

Option 2: Local Setup

# Clone the repository
git clone https://github.com/YOUR_USERNAME/llama-3.2-safe-finetuning.git
cd llama-3.2-safe-finetuning

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies (or use the pinned versions: pip install -r requirements.txt)
pip install torch transformers datasets accelerate peft trl bitsandbytes scipy

# Launch Jupyter
jupyter notebook llama_3_2_minimal_safe.ipynb

📊 Training Pipeline

Dataset

The notebook fine-tunes on NVIDIA's Aegis AI Content Safety Dataset 2.0 (nvidia/Aegis-AI-Content-Safety-Dataset-2.0, configured in Step 5), using a small subset (NUM_SAMPLES = 500 by default) to keep training within Colab's memory and time limits.
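
A minimal loading sketch is shown below; the split name and subsetting are assumptions, and the notebook's Step 8 handles the real preprocessing:

from datasets import load_dataset

# Load the safety dataset named in Step 5 (split name assumed to be "train")
dataset = load_dataset("nvidia/Aegis-AI-Content-Safety-Dataset-2.0", split="train")

# Keep a small subset to stay within Colab memory and time limits (Step 5: NUM_SAMPLES)
dataset = dataset.select(range(min(500, len(dataset))))
print(dataset)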

Training Configuration

Parameter               SFT          DPO
Epochs                  2            1
Batch Size              1            1
Gradient Accumulation   8            8
Learning Rate           1e-5         5e-7
Max Sequence Length     1024         1024
LoRA r                  8            8
LoRA alpha              16           16
Training Time           ~20-30 min   ~10-20 min
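
As a rough illustration of how these hyperparameters map onto TRL, the sketch below wires them into SFTConfig and DPOConfig. Field names (especially the maximum-sequence-length option) differ between TRL releases, so treat this as a sketch rather than the notebook's exact code:

from trl import SFTConfig, DPOConfig

# SFT stage: 2 epochs, effective batch size 1 x 8 = 8, learning rate 1e-5
sft_args = SFTConfig(
    output_dir="sft_output",
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    bf16=True,
    gradient_checkpointing=True,
)

# DPO stage: 1 epoch, same effective batch size, much lower learning rate (5e-7)
dpo_args = DPOConfig(
    output_dir="dpo_output",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    bf16=True,
    gradient_checkpointing=True,
)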

LoRA Configuration

LoRA_R = 8              # Rank
LoRA_ALPHA = 16         # Alpha scaling
LoRA_DROPOUT = 0.05     # Dropout rate
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]
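
These constants typically feed a PEFT LoraConfig along the lines of the following sketch (model is assumed to be the already-loaded base model; this is not necessarily the notebook's exact code):

from peft import LoraConfig, get_peft_model

# Build the LoRA adapter configuration from the constants above
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

# Attach the adapters to the (already loaded) base model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights is trainable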

Memory Optimizations

  • 4-bit NF4 quantization
  • Gradient checkpointing
  • BF16 mixed precision
  • Aggressive garbage collection
  • Optimized batch sizes
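
Concretely, the quantized loading path looks roughly like the standard transformers/bitsandbytes recipe below (the notebook's exact arguments may differ):

import gc
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with BF16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Trade compute for memory during backpropagation
model.gradient_checkpointing_enable()

# Aggressive cleanup between stages
gc.collect()
torch.cuda.empty_cache()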

πŸ“ Project Structure

llama-3.2-safe-finetuning/
├── llama_3_2_minimal_safe.ipynb   # Main training notebook
├── README.md                      # This file
├── LICENSE                        # Llama 3.2 Community License
├── requirements.txt               # Python dependencies (if local)
├── sft_output/                    # SFT training checkpoints
├── dpo_output/                    # DPO training checkpoints
├── llama-3.2-1b-sft/              # Final SFT model
├── llama-3.2-1b-sft-dpo/          # Final merged model
└── model_card.md                  # Generated model card for HF Hub

🔧 Configuration

Update these variables in Step 5 before training:

# Model Configuration
MODEL_NAME = "meta-llama/Llama-3.2-1B"
DATASET_NAME = "nvidia/Aegis-AI-Content-Safety-Dataset-2.0"
NEW_MODEL_NAME = "Llama-3.2-1B-Aegis-SFT-DPO"

# ⚠️ UPDATE THIS!
HF_USERNAME = "ahczhg"  # Your Hugging Face username

# Memory-optimized settings
MAX_SEQ_LENGTH = 1024   # Reduce to 512 if memory issues
NUM_SAMPLES = 500       # Reduce to 200-300 if needed

📖 Step-by-Step Guide

Steps Overview

Step   Description            Time        Can Skip
1      Environment setup      <1 min      ❌
2      GPU verification       <1 min      ❌
3      Package installation   3-5 min     ❌
4      Import libraries       <1 min      ❌
5      Configuration          <1 min      ❌
6      HuggingFace login      <1 min      ❌
7      Utility functions      <1 min      ❌
8      Load dataset           1-2 min     ❌
9      Load tokenizer         1-2 min     ❌
10     Load model             2-5 min     ❌
11     SFT setup              <1 min      ❌
12     SFT training           15-30 min   ❌
13     Save SFT model         1-2 min     ❌
14     DPO preparation        2-5 min     ✅ Optional
15     DPO training           10-20 min   ✅ Optional
16     Save DPO model         1-2 min     ✅ Optional
17     Upload to HF Hub       5-15 min    ✅ Optional
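
Step 17's upload amounts to roughly the following (a minimal sketch that assumes you are already logged in and reuses the model, tokenizer, and Step 5 names from earlier steps):

# Push the merged model and tokenizer to your Hub namespace (names from Step 5)
repo_id = f"{HF_USERNAME}/{NEW_MODEL_NAME}"  # e.g. "ahczhg/Llama-3.2-1B-Aegis-SFT-DPO"
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)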

Critical Notes

  1. Restart runtime after Step 3 - This is mandatory!
  2. Run cells in order - Don't skip early steps
  3. Monitor memory - Watch GPU usage in Step 10 (see the helper sketch after this list)
  4. Accept Llama license - Visit https://huggingface.co/meta-llama/Llama-3.2-1B
  5. DPO is optional - You can stop after Step 13 with SFT-only model
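
For note 3, a memory-report helper along these lines makes it easy to watch GPU usage around model loading and training (names here are illustrative; Step 7 defines the notebook's own utilities):

import gc
import torch

def report_gpu_memory(tag=""):
    """Print allocated and reserved GPU memory, e.g. before and after loading the model."""
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"[{tag}] allocated: {allocated:.2f} GB | reserved: {reserved:.2f} GB")

def free_gpu_memory():
    """Aggressively release cached GPU memory between steps."""
    gc.collect()
    torch.cuda.empty_cache()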

💡 Usage Examples

Basic Inference

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model
model_name = "ahczhg/Llama-3.2-1B-Aegis-SFT-DPO"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Prepare prompt
messages = [{"role": "user", "content": "What is artificial intelligence?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Generate
outputs = model.generate(
    inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

# Decode
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Multi-turn Conversation

messages = [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."},
    {"role": "user", "content": "Can you give an example?"}
]

inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

πŸ› Troubleshooting

Common Issues

1. Out of Memory (OOM) Error

Symptoms: RuntimeError: CUDA out of memory

Solutions:

  • Reduce BATCH_SIZE to 1 (in Step 5)
  • Reduce MAX_SEQ_LENGTH to 512 or 768
  • Reduce NUM_SAMPLES to 200-300
  • Enable High-RAM runtime in Colab
  • Upgrade to A100 GPU

2. Kernel Crash During Model Loading

Symptoms: Colab session disconnects at Step 10

Solutions:

  • Restart runtime: Runtime → Restart runtime
  • Clear memory before loading: Run Step 7 utilities
  • Ensure you're using T4 GPU or better
  • Close other browser tabs to free system memory

3. Import Errors After Step 3

Symptoms: ImportError: cannot import name...

Solutions:

  • Did you restart runtime? This is mandatory after Step 3!
  • Run: Runtime → Restart runtime
  • Re-run all cells from Step 1

4. HuggingFace Authentication Failed

Symptoms: 401 Unauthorized during login

Solutions:

  • Create an access token at https://huggingface.co/settings/tokens (write access is needed for the Step 17 upload)
  • Make sure you have accepted the Llama 3.2 license at https://huggingface.co/meta-llama/Llama-3.2-1B
  • Re-run the login cell in Step 6 and paste the token when prompted
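
If the notebook's login cell keeps failing, logging in explicitly from code is a reasonable fallback (a minimal sketch using huggingface_hub; the token string is a placeholder):

from huggingface_hub import login

# Paste a token created at https://huggingface.co/settings/tokens (placeholder shown here)
login(token="hf_...")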

5. Dataset Download Timeout

Symptoms: Stuck downloading dataset in Step 8

Solutions:

  • Check internet connection
  • Restart runtime and try again
  • Reduce NUM_SAMPLES to 200
  • Use a smaller dataset

6. Training Loss Not Decreasing

Symptoms: Loss stays constant or increases

Solutions:

  • Increase learning rate to 2e-5 (SFT) or 1e-6 (DPO)
  • Increase number of epochs
  • Check data quality in Step 8
  • Verify LoRA target modules are correct

Performance Optimization

Speed Up Training

# In Step 5, adjust:
BATCH_SIZE = 2              # If you have >16GB VRAM
GRAD_ACCUM = 4              # Reduce if batch size increased
MAX_SEQ_LENGTH = 768        # Shorter sequences = faster
NUM_SAMPLES = 300           # Fewer samples = faster

Improve Model Quality

# In Step 5, adjust:
SFT_EPOCHS = 3              # More epochs
DPO_EPOCHS = 2              # More DPO training
NUM_SAMPLES = 1000          # More training data
LORA_R = 16                 # Larger LoRA rank
LORA_ALPHA = 32             # Match 2x rank

📚 Resources

Documentation

Tutorials

Related Projects

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/improvement)
  3. Commit your changes (git commit -am 'Add new feature')
  4. Push to the branch (git push origin feature/improvement)
  5. Open a Pull Request

Areas for Improvement

  • Add evaluation metrics (BLEU, ROUGE, perplexity)
  • Support for multi-GPU training
  • Automatic hyperparameter tuning
  • Integration with W&B/TensorBoard
  • Add more datasets
  • Quantization for deployment (GGUF, GPTQ)

📄 License

This project is licensed under the Llama 3.2 Community License Agreement.

Key points:

  • ✅ Commercial use allowed (with restrictions)
  • ✅ Modification and distribution permitted
  • ❌ Cannot use to train other large language models without permission
  • ❌ Monthly active users >700M require special license

Full license text: See LICENSE file or https://huggingface.co/meta-llama/Llama-3.2-1B

📞 Support

🙏 Acknowledgments

  • Meta AI - For the Llama 3.2 foundation model
  • NVIDIA - For the Aegis AI Content Safety Dataset
  • Hugging Face - For transformers, TRL, PEFT, and datasets libraries
  • Google Colab - For free GPU compute resources
  • AMD - For the Instella training methodology inspiration

⭐ Citation

If you use this project in your research or work, please cite:

@misc{llama32_safe_finetuning,
  author = {Community Contributor},
  title = {Llama 3.2 Fine-tuning - Memory-Safe Version},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/YOUR_USERNAME/llama-3.2-safe-finetuning}
}

Built with ❤️ for the open-source AI community
⭐ Star this repo if you find it useful!