QVAC Cross-Platform LoRA Adapters
Fine-tuned LoRA adapters trained using qvac-finetune, the first truly cross-platform LoRA fine-tuning framework for Large Language Models. These adapters work on any GPU (Adreno, Mali, Apple Silicon, AMD, Intel, NVIDIA) using Vulkan, Metal, or CUDA backends.
Available Adapters
| Adapter | Size | Base Model |
|---|---|---|
| Qwen3-0.6B LoRA Adapter | 20.5 MB | Qwen3-0.6B-GGUF |
| Qwen3-1.7B LoRA Adapter | 35.1 MB | Qwen3-1.7B-GGUF |
| Qwen3-4B LoRA Adapter | 60+ MB | Qwen3-4B-GGUF |
| Gemma3-1B LoRA Adapter | 26.6 MB | google/gemma-3-1b-gguf |
| Gemma3-4B LoRA Adapter | 60.2 MB | google/gemma-3-4b-gguf |
Empowering the Community with Open Resources
To accelerate development and innovation, Tether Data is publicly releasing:
Multi-Platform Binaries
- qvac-rnd-fabric-llm-finetune
Source Code (Work-in-Progress)
- qvac-fabric-llm.cpp (fabric-llm-finetune branch)
Currently experimental and intended for developers who want to extend the solution to other LLM models.
Quick Start Guide
Option 1: Direct Inference (Recommended)
Use the adapter directly without merging; this is faster to set up and avoids storing a second full-size copy of the model on disk.
Step 1: Download Platform-Specific Binary
Linux/Windows (AMD/Intel/NVIDIA)
# Download Vulkan binary (works on all GPUs)
wget https://github.com/tetherto/qvac-finetune/releases/download/v1.0/qvac-linux-vulkan-x64-v1.0.zip
unzip qvac-linux-vulkan-x64-v1.0.zip
cd qvac-linux-vulkan-x64-v1.0
macOS (Apple Silicon)
# Download Metal binary
curl -L https://github.com/tetherto/qvac-finetune/releases/download/v1.0/qvac-macos-apple-silicon-v1.0.zip -o qvac-macos.zip
unzip qvac-macos.zip
cd qvac-macos-apple-silicon-v1.0
Android (Termux)
# Download Adreno/Mali binary
wget https://github.com/tetherto/qvac-finetune/releases/download/v1.0/qvac-android-adreno-arm64-v1.0.zip
unzip qvac-android-adreno-arm64-v1.0.zip
cd qvac-android-adreno-arm64-v1.0
export LD_LIBRARY_PATH=.
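Regardless of platform, it is worth confirming the binary runs before downloading any models. A minimal check, assuming the release zip uses the ./bin/llama-cli layout shown in the steps below:
# Print version and build info; a clean exit confirms the binary and its
# bundled GPU libraries (Vulkan/Metal loaders) can be located.
./bin/llama-cli --version
# On Android/Termux, export LD_LIBRARY_PATH=. first (as above), otherwise
# the shared libraries will not resolve.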
Step 2: Download Base Model & Adapter
Choose your model and download both base model and adapter:
# Create directories
mkdir -p models adapters
# === CHOOSE ONE MODEL ===
# Option 1: Qwen3-1.7B (recommended for most use cases)
wget https://huggingface.co/Qwen/Qwen3-1.7B-GGUF/resolve/main/qwen3-1_7b-q8_0.gguf -O models/base.gguf
wget https://huggingface.co/qvac/finetune/resolve/main/qwen3-1.7b-qkvo-ffn-lora-adapter.gguf -O adapters/adapter.gguf
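As an alternative to raw wget URLs, the same files can be fetched with the Hugging Face CLI. This sketch assumes huggingface_hub is installed and infers the repo IDs and filenames from the URLs above, so adjust them if the hosting repos differ:
# pip install -U huggingface_hub   (provides the huggingface-cli tool)
huggingface-cli download Qwen/Qwen3-1.7B-GGUF qwen3-1_7b-q8_0.gguf --local-dir models
huggingface-cli download qvac/finetune qwen3-1.7b-qkvo-ffn-lora-adapter.gguf --local-dir adapters
# The rest of this guide expects models/base.gguf and adapters/adapter.gguf,
# so either rename the downloaded files or adjust the paths in later commands.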
Step 3: Run Inference with Adapter
# Interactive chat mode
./bin/llama-cli \
-m models/base.gguf \
--lora adapters/adapter.gguf \
-ngl 999 \
-c 2048 \
--temp 0.7 \
-p "Q: Does vitamin D supplementation prevent fractures?\nA:"
# Single prompt mode
./bin/llama-cli \
-m models/base.gguf \
--lora adapters/adapter.gguf \
-ngl 999 \
-p "Explain the mechanism of action for beta-blockers in treating hypertension."
Expected Output:
Q: Does vitamin D supplementation prevent fractures?
A: Yes. Rationale: Meta-analysis of randomized controlled trials shows that
vitamin D supplementation, particularly when combined with calcium, significantly
reduces the risk of hip fractures and other non-vertebral fractures in elderly
populations...
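A quick way to confirm the adapter is actually being applied is to run the same prompt with and without --lora and compare the answers. A small sketch reusing the paths from the steps above (the sampling values are illustrative):
PROMPT="Q: Does vitamin D supplementation prevent fractures?\nA:"
echo "--- Base model only ---"
./bin/llama-cli -m models/base.gguf -ngl 999 -c 2048 --temp 0.3 -n 256 -p "$PROMPT"
echo "--- Base model + LoRA adapter ---"
./bin/llama-cli -m models/base.gguf --lora adapters/adapter.gguf -ngl 999 -c 2048 --temp 0.3 -n 256 -p "$PROMPT"
If the adapter loads correctly, the second run should be noticeably closer to the answer-plus-rationale style shown in the expected output.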
Option 2: Merge Adapter into Base Model
Merge the adapter permanently into the base model for distribution or if you don't need to switch adapters.
Step 1-2: Same as Option 1
Follow Steps 1-2 from Option 1 to download binaries and models.
Step 3: Export & Merge Adapter
# Export LoRA adapter to base model format
./bin/llama-export-lora \
-m models/base.gguf \
--lora adapters/adapter.gguf \
-o models/merged.gguf
# Verify merged model
ls -lh models/merged.gguf
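llama-export-lora can also merge the adapter at a reduced (or increased) strength. This sketch assumes the bundled build supports the --lora-scaled flag, as standard llama.cpp builds do; the 0.5 scale is only an example:
# Merge the adapter at half strength; a scale of 1.0 is equivalent to the
# plain --lora merge above.
./bin/llama-export-lora \
-m models/base.gguf \
--lora-scaled adapters/adapter.gguf 0.5 \
-o models/merged-half.gguf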
Step 4: Run Inference with Merged Model
# Use merged model directly (no --lora flag needed)
./bin/llama-cli \
-m models/merged.gguf \
-ngl 999 \
-c 2048 \
-p "Q: What are the contraindications for aspirin therapy?\nA:"
Custom Temperature & Sampling
Fine-tune the generation parameters for your use case:
# --temp 0.3            lower = more focused (good for medical)
# --top-p 0.9           nucleus sampling
# --top-k 40            top-k sampling
# -n 512                max tokens to generate
# (comments cannot follow a trailing backslash, so they are listed here)
./bin/llama-cli \
-m models/base.gguf \
--lora adapters/adapter.gguf \
-ngl 999 \
--temp 0.3 \
--top-p 0.9 \
--top-k 40 \
--repeat-penalty 1.1 \
-n 512 \
-p "Your prompt"
Recommended settings for biomedical Q&A (two ready-to-copy presets are sketched below):
- Temperature 0.3-0.5: deterministic, factual answers
- Temperature 0.7-0.9: longer, more varied explanations
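Expressed as commands, the two presets might look like this; the values are simply midpoints of the ranges above and the prompts are reused from elsewhere in this guide, so treat them as starting points rather than tuned settings:
# Factual preset: short, deterministic answers
./bin/llama-cli -m models/base.gguf --lora adapters/adapter.gguf -ngl 999 \
--temp 0.4 --top-p 0.9 --repeat-penalty 1.1 -n 256 \
-p "Q: Do statins reduce mortality in patients with heart failure?\nA:"
# Explanatory preset: longer, more varied output
./bin/llama-cli -m models/base.gguf --lora adapters/adapter.gguf -ngl 999 \
--temp 0.8 --top-p 0.95 -n 512 \
-p "Explain the mechanism of action for beta-blockers in treating hypertension."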
Batch Processing
Process multiple prompts from a file:
# Create prompts file
cat > prompts.txt << 'EOF'
Q: Does vitamin D supplementation prevent fractures?
Q: Is aspirin effective for primary prevention of cardiovascular disease?
Q: Do statins reduce mortality in patients with heart failure?
EOF
# Process all prompts
cat prompts.txt | while IFS= read -r prompt; do
echo "=== Processing: $prompt ==="
# Redirect stdin so llama-cli cannot consume the remaining prompts from the pipe
./bin/llama-cli \
-m models/base.gguf \
--lora adapters/adapter.gguf \
-ngl 999 \
--temp 0.4 \
-p "$prompt\nA:" < /dev/null
echo ""
done
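For larger batches it may be easier to start llama-server (also part of standard llama.cpp builds) once with the adapter loaded and drive it over its OpenAI-compatible HTTP API; the flags below mirror the llama-cli ones and may need adjusting for your release:
# Start the server with the adapter applied
./bin/llama-server -m models/base.gguf --lora adapters/adapter.gguf -ngl 999 -c 2048 --port 8080 &
sleep 15   # give the server time to load the model
# Send one prompt as a chat completion request (repeat per prompt, or script it)
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"Q: Does vitamin D supplementation prevent fractures?"}],"temperature":0.4,"max_tokens":256}'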
Command Line Reference
Essential Flags
| Flag | Description | Example | Default |
|---|---|---|---|
| `-m` | Base model path (REQUIRED) | `-m model.gguf` | - |
| `--lora` | LoRA adapter path | `--lora adapter.gguf` | none |
| `-ngl` | GPU layers (999 = all) | `-ngl 999` | 0 |
| `-c` | Context size | `-c 2048` | 512 |
| `-p` | Prompt text | `-p "Question"` | - |
| `--temp` | Temperature (0-2) | `--temp 0.7` | 0.8 |
| `-n` | Max tokens to generate | `-n 512` | -1 |
| `-b` | Batch size | `-b 512` | 512 |
| `-fa` | Flash attention | `-fa off` | on |
Mobile-Specific Flags
For Android/iOS with limited memory:
# -ngl 99    partial GPU offload
# -c 512     smaller context
# -b 128     smaller logical batch
# -fa off    disable flash attention (Vulkan)
# -ub 128    smaller physical (micro-)batch
./bin/llama-cli \
-m model.gguf \
--lora adapter.gguf \
-ngl 99 \
-c 512 \
-b 128 \
-fa off \
-ub 128
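To pick a workable -ngl value for a specific phone, llama-bench (bundled with standard llama.cpp builds) reports prompt-processing and token-generation throughput without needing a prompt file; a minimal sketch comparing two offload levels:
# Compare throughput at two GPU offload levels and keep the faster, stabler one
./bin/llama-bench -m model.gguf -ngl 99 -p 128 -n 64
./bin/llama-bench -m model.gguf -ngl 20 -p 128 -n 64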
Cross-Platform Compatibility
Supported Platforms
These adapters work identically across:
| Platform | Hardware | Backend | Status |
|---|---|---|---|
| Android | Qualcomm Adreno, ARM Mali | Vulkan | Supported |
| iOS | Apple A-series | Metal | Supported |
| macOS | Apple M1/M2/M3/M4 | Metal | Supported |
| Linux | AMD, Intel, NVIDIA | Vulkan | Supported |
| Windows | AMD, Intel, NVIDIA | Vulkan | Supported |
| CPU | Any x86_64, ARM64 | CPU | Fallback |
No Conversion Needed
Unlike traditional frameworks:
- No need to convert between different frameworks
- No platform-specific model formats
- No separate training for each device
- Train once, run everywhere!
Additional Resources
Documentation
- qvac-finetune Repository
- Complete Documentation
- Detailed Benchmarks
- Training Guide
Platform-Specific Guides
- Android Setup Guide
- macOS Setup Guide
- iOS Setup Guide
- Linux Setup Guide
Community
- Discussion Forum
- Issue Tracker
- Release Notes
Troubleshooting
Common Issues
1. "DeviceLost" error on Android/Adreno:
# Use smaller batch size and disable flash attention
./bin/llama-cli -m model.gguf --lora adapter.gguf -ngl 99 -c 512 -b 128 -ub 128 -fa off
2. Out of Memory (OOM) errors:
# Reduce context size or use smaller model
./bin/llama-cli -m model.gguf --lora adapter.gguf -ngl 50 -c 512
3. Slow inference on mobile:
# Offload fewer layers to GPU
./bin/llama-cli -m model.gguf --lora adapter.gguf -ngl 20
4. Adapter not loading:
# Verify adapter file exists and matches model architecture
ls -lh adapters/
./bin/llama-cli -m model.gguf --lora adapter.gguf --verbose
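If the adapter still refuses to load, two quick file-level checks can narrow things down; the strings-based architecture check is only a rough heuristic, not an official tool:
# A valid GGUF adapter starts with the magic bytes "GGUF"
head -c 4 adapters/adapter.gguf; echo
# Rough check that adapter and base model target the same architecture
# (both should mention e.g. "qwen3" or "gemma3" in their metadata strings)
strings adapters/adapter.gguf | grep -m1 -i -E "qwen|gemma"
strings models/base.gguf | grep -m1 -i -E "qwen|gemma"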
Citation
If you use these adapters in your research, please cite:
@article{qvac-finetune,
  title={An Edge-First Generalized LLM LoRA Fine-Tuning Framework for Heterogeneous GPUs},
  author={Subash and Akshay and Patrik and Milan and Nurman},
  journal={arXiv preprint},
  year={2025}
}
License
- LoRA Adapters: Apache 2.0 License
- Base Models:
  - Qwen3: Apache 2.0 License
  - Gemma3: Gemma Terms of Use
- Training Framework (qvac-fabric-llm): Apache 2.0 License
Acknowledgments
- llama.cpp - Foundation inference engine by Georgi Gerganov
- LoRA - Parameter-efficient fine-tuning method (Hu et al., 2021)
- PubMedQA - Biomedical dataset source (Jin et al., 2019)
- Qwen Team - Base models
- Google - Gemma base models
- Hardware vendors who provided testing devices
Making LLM fine-tuning accessible to everyone, everywhere
From smartphones to datacenters • No vendor lock-in • Privacy-preserving
Star the qvac-rnd-fabric-llm-finetune repo if you find it useful!