πŸš€ QVAC Cross-Platform LoRA Adapters

Fine-tuned LoRA adapters trained with qvac-finetune, the first truly cross-platform LoRA fine-tuning framework for Large Language Models. These adapters run on any GPU (Adreno, Mali, Apple Silicon, AMD, Intel, NVIDIA) via the Vulkan, Metal, or CUDA backends.

πŸ“¦ Available Adapters

πŸš€ Empowering the Community with Open Resources

To accelerate development and innovation, Tether Data is publicly releasing:


🎯 Quick Start Guide

Option 1: Direct Inference (Recommended)

Use the adapter directly without merging. This is quicker to set up and avoids writing a second full-size model file to disk.

Step 1: Download Platform-Specific Binary

πŸ–₯️ Linux/Windows (AMD/Intel/NVIDIA)
# Download Vulkan binary (works on all GPUs)
wget https://github.com/tetherto/qvac-finetune/releases/download/v1.0/qvac-linux-vulkan-x64-v1.0.zip
unzip qvac-linux-vulkan-x64-v1.0.zip
cd qvac-linux-vulkan-x64-v1.0
🍎 macOS (Apple Silicon)
# Download Metal binary
curl -L https://github.com/tetherto/qvac-finetune/releases/download/v1.0/qvac-macos-apple-silicon-v1.0.zip -o qvac-macos.zip
unzip qvac-macos.zip
cd qvac-macos-apple-silicon-v1.0
πŸ“± Android (Termux)
# Download Adreno/Mali binary
wget https://github.com/tetherto/qvac-finetune/releases/download/v1.0/qvac-android-adreno-arm64-v1.0.zip
unzip qvac-android-adreno-arm64-v1.0.zip
cd qvac-android-adreno-arm64-v1.0
export LD_LIBRARY_PATH=.
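
Whichever platform you are on, it's worth a quick sanity check that the binary runs before downloading models. A minimal check, assuming the release bundle places llama-cli under bin/ as the later steps do (on Termux, install the download tools first with pkg install wget unzip if they are missing):

# Confirm the expected bundle layout
ls bin/

# Print build/backend info to confirm the binary executes
./bin/llama-cli --version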

Step 2: Download Base Model & Adapter

Choose a model and download both the base model and its matching adapter:

# Create directories
mkdir -p models adapters

# === CHOOSE ONE MODEL ===

# Option 1: Qwen3-1.7B (recommended for most use cases)
wget https://huggingface.co/Qwen/Qwen3-1.7B-GGUF/resolve/main/qwen3-1_7b-q8_0.gguf -O models/base.gguf
wget https://huggingface.co/qvac/finetune/resolve/main/qwen3-1.7b-qkvo-ffn-lora-adapter.gguf -O adapters/adapter.gguf
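
Before running inference, confirm both downloads completed and look sane. A quick check (the sizes are rough expectations, not published values; verify checksums only if the upstream repos publish them):

# The Q8_0 base model should be on the order of a couple of GB; the adapter is much smaller
ls -lh models/base.gguf adapters/adapter.gguf

# Optional: record checksums to compare against published values, if any
sha256sum models/base.gguf adapters/adapter.gguf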

Step 3: Run Inference with Adapter

# Interactive chat mode
./bin/llama-cli \
  -m models/base.gguf \
  --lora adapters/adapter.gguf \
  -ngl 999 \
  -c 2048 \
  --temp 0.7 \
  -p "Q: Does vitamin D supplementation prevent fractures?\nA:"

# Single prompt mode
./bin/llama-cli \
  -m models/base.gguf \
  --lora adapters/adapter.gguf \
  -ngl 999 \
  -p "Explain the mechanism of action for beta-blockers in treating hypertension."

Expected Output:

Q: Does vitamin D supplementation prevent fractures?
A: Yes. Rationale: Meta-analysis of randomized controlled trials shows that 
vitamin D supplementation, particularly when combined with calcium, significantly 
reduces the risk of hip fractures and other non-vertebral fractures in elderly 
populations...

Option 2: Merge Adapter into Base Model

Merge the adapter permanently into the base model when you want a single file for distribution, or when you don't need to switch adapters.

Step 1-2: Same as Option 1

Follow Steps 1-2 from Option 1 to download binaries and models.

Step 3: Export & Merge Adapter

# Export LoRA adapter to base model format
./bin/llama-export-lora \
  -m models/base.gguf \
  --lora adapters/adapter.gguf \
  -o models/merged.gguf

# Verify merged model
ls -lh models/merged.gguf

Step 4: Run Inference with Merged Model

# Use merged model directly (no --lora flag needed)
./bin/llama-cli \
  -m models/merged.gguf \
  -ngl 999 \
  -c 2048 \
  -p "Q: What are the contraindications for aspirin therapy?\nA:"

Custom Temperature & Sampling

Fine-tune the generation parameters for your use case:

# Flag notes (kept out of the command: a trailing "\" followed by a comment
# breaks the shell line continuation):
#   --temp 0.3      lower = more focused (good for medical)
#   --top-p 0.9     nucleus sampling
#   --top-k 40      top-k sampling
#   -n 512          max tokens to generate
./bin/llama-cli \
  -m models/base.gguf \
  --lora adapters/adapter.gguf \
  -ngl 999 \
  --temp 0.3 \
  --top-p 0.9 \
  --top-k 40 \
  --repeat-penalty 1.1 \
  -n 512 \
  -p "Your prompt"

Recommended settings for biomedical Q&A:

  • Temperature 0.3-0.5: focused, factual answers
  • Temperature 0.7-0.9: more varied, creative explanations (a small preset wrapper is sketched below)
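
If you switch between these two regimes often, a small wrapper keeps the presets in one place. A minimal sketch (the qa() helper name and preset values are illustrative, not part of the release):

# Hypothetical helper: qa "question" [factual|creative]
qa() {
  local prompt="$1" mode="${2:-factual}"
  local temp=0.4
  [ "$mode" = "creative" ] && temp=0.8
  ./bin/llama-cli \
    -m models/base.gguf \
    --lora adapters/adapter.gguf \
    -ngl 999 --temp "$temp" --top-p 0.9 -n 512 \
    -p "Q: ${prompt}"$'\n'"A:"
}

qa "Does vitamin D supplementation prevent fractures?"          # factual preset
qa "Explain how beta-blockers lower blood pressure." creative   # creative preset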

Batch Processing

Process multiple prompts from a file:

# Create prompts file
cat > prompts.txt << 'EOF'
Q: Does vitamin D supplementation prevent fractures?
Q: Is aspirin effective for primary prevention of cardiovascular disease?
Q: Do statins reduce mortality in patients with heart failure?
EOF

# Process all prompts (IFS= and -r keep whitespace and backslashes intact)
while IFS= read -r prompt; do
  echo "=== Processing: $prompt ==="
  ./bin/llama-cli \
    -m models/base.gguf \
    --lora adapters/adapter.gguf \
    -ngl 999 \
    --temp 0.4 \
    -p "$prompt"$'\n'"A:"
  echo ""
done < prompts.txt
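
A small variation saves each answer to its own file, which helps when processing long prompt lists. Sketch below (the output file naming is illustrative):

# Save each completion as answer_001.txt, answer_002.txt, ...
i=0
while IFS= read -r prompt; do
  i=$((i + 1))
  out=$(printf 'answer_%03d.txt' "$i")
  ./bin/llama-cli \
    -m models/base.gguf \
    --lora adapters/adapter.gguf \
    -ngl 999 --temp 0.4 \
    -p "$prompt"$'\n'"A:" > "$out" 2>/dev/null
  echo "Wrote $out"
done < prompts.txt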

πŸ“‹ Command Line Reference

Essential Flags

Flag      Description                  Example               Default
-m        Base model path (required)   -m model.gguf         -
--lora    LoRA adapter path            --lora adapter.gguf   none
-ngl      GPU layers (999 = all)       -ngl 999              0
-c        Context size                 -c 2048               512
-p        Prompt text                  -p "Question"         -
--temp    Temperature (0-2)            --temp 0.7            0.8
-n        Max tokens to generate       -n 512                -1
-b        Batch size                   -b 512                512
-fa       Flash attention              -fa off               on

Mobile-Specific Flags

For Android/iOS with limited memory:

# Flag notes:
#   -ngl 99     partial GPU offload
#   -c 512      smaller context
#   -b 128      smaller batch
#   -fa off     disable flash attention (Vulkan)
#   -ub 128     smaller micro-batch (physical batch size)
./bin/llama-cli \
  -m model.gguf \
  --lora adapter.gguf \
  -ngl 99 \
  -c 512 \
  -b 128 \
  -fa off \
  -ub 128

🌍 Cross-Platform Compatibility

Supported Platforms

These adapters work identically across:

Platform     Hardware                    Backend   Status
βœ… Android   Qualcomm Adreno, ARM Mali   Vulkan    Supported
βœ… iOS       Apple A-series              Metal     Supported
βœ… macOS     Apple M1/M2/M3/M4           Metal     Supported
βœ… Linux     AMD, Intel, NVIDIA          Vulkan    Supported
βœ… Windows   AMD, Intel, NVIDIA          Vulkan    Supported
βœ… CPU       Any x86_64, ARM64           CPU       Fallback

No Conversion Needed

Unlike traditional frameworks:

  • ❌ No need to convert between different frameworks
  • ❌ No platform-specific model formats
  • ❌ No separate training for each device
  • βœ… Train once, run everywhere!
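
Because the same GGUF adapter file is used everywhere, the only platform-specific piece is the binary itself. A sketch of picking the right v1.0 release asset from a shell script (the uname mapping is an assumption, and Android/Termux is detected via the TERMUX_VERSION environment variable):

# Choose a release asset based on the host platform (illustrative mapping)
BASE_URL="https://github.com/tetherto/qvac-finetune/releases/download/v1.0"
case "$(uname -s)-$(uname -m)" in
  Darwin-arm64)  ASSET="qvac-macos-apple-silicon-v1.0.zip" ;;
  Linux-x86_64)  ASSET="qvac-linux-vulkan-x64-v1.0.zip" ;;
  Linux-aarch64)
    # Termux sets TERMUX_VERSION on Android
    if [ -n "$TERMUX_VERSION" ]; then
      ASSET="qvac-android-adreno-arm64-v1.0.zip"
    else
      echo "No prebuilt asset listed for this platform" >&2; exit 1
    fi ;;
  *) echo "Unsupported platform" >&2; exit 1 ;;
esac
echo "Downloading $BASE_URL/$ASSET"
curl -L "$BASE_URL/$ASSET" -o qvac.zip && unzip qvac.zip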

πŸ“š Additional Resources

Documentation

Platform-Specific Guides

Community


πŸ” Troubleshooting

Common Issues

1. "DeviceLost" error on Android/Adreno:

# Use smaller batch size and disable flash attention
./bin/llama-cli -m model.gguf --lora adapter.gguf -ngl 99 -c 512 -b 128 -ub 128 -fa off

2. Out of Memory (OOM) errors:

# Reduce context size or use smaller model
./bin/llama-cli -m model.gguf --lora adapter.gguf -ngl 50 -c 512

3. Slow inference on mobile:

# Offload fewer layers to GPU
./bin/llama-cli -m model.gguf --lora adapter.gguf -ngl 20

4. Adapter not loading:

# Verify adapter file exists and matches model architecture
ls -lh adapters/
./bin/llama-cli -m model.gguf --lora adapter.gguf --verbose
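
When the right -ngl value for a device isn't obvious, one pragmatic approach is to step it down until a short generation succeeds. A rough sketch (this assumes llama-cli exits non-zero when the backend fails, which is typical for OOM and DeviceLost crashes):

# Try progressively fewer GPU layers until a short generation succeeds
for ngl in 999 50 20 0; do
  echo "Trying -ngl $ngl"
  if ./bin/llama-cli -m model.gguf --lora adapter.gguf \
       -ngl "$ngl" -c 512 -n 16 -p "test" > /dev/null 2>&1; then
    echo "Works with -ngl $ngl"
    break
  fi
done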

πŸ“ Citation

If you use these adapters in your research, please cite:

@article{qvac-finetune,
  title={An Edge-First Generalized LLM LoRA Fine-Tuning Framework for Heterogeneous GPUs},
  author={Subash and Akshay and Patrik and Milan and Nurman},
  journal={arXiv preprint},
  year={2025}
}

πŸ“„ License


πŸ™ Acknowledgments

  • llama.cpp - Foundation inference engine by Georgi Gerganov
  • LoRA - Parameter-efficient fine-tuning method (Hu et al., 2021)
  • PubMedQA - Biomedical dataset source (Jin et al., 2019)
  • Qwen Team - Base models
  • Google - Gemma base models
  • Hardware vendors who provided testing devices

Making LLM fine-tuning accessible to everyone, everywhere

From smartphones to datacenters β€’ No vendor lock-in β€’ Privacy-preserving

⭐ Star the qvac-rnd-fabric-llm-finetune repo if you find it useful!
