vai35-4B-v2
Adversarially hardened guardrail model based on Qwen/Qwen3.5-4B.
Uses representation engineering (surgical weight editing + linear probe) to detect and block adversarial attacks with minimal inference overhead.
Model Description
This model extends votal-ai/vai35-4B with a calibrated linear probe baked into layer 16.
Instead of generating a response and parsing it for safety, the probe classifies each input with a single dot product against the refusal direction vector, making inference 4–6× faster than generation-based guardrails.
Architecture
Input tokens
│
▼
Transformer layers 0 → 16
│
▼ ← Probe intercepts here (layer 16)
dot product with refusal direction
│
├── score >= -1.2439 → BLOCK
└── score < -1.2439 → ALLOW
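The decision rule in the diagram can be sketched in a few lines of plain Python. This is only an illustration: `probe_decision` is a hypothetical helper, and the 4-dimensional vectors are invented toy values, not real layer-16 activations.

```python
from typing import Sequence

# Threshold taken from the diagram above.
THRESHOLD_BLOCK = -1.2439


def probe_decision(hidden: Sequence[float], direction: Sequence[float],
                   threshold: float = THRESHOLD_BLOCK) -> str:
    """Return "BLOCK" or "ALLOW" from a single dot product (hypothetical helper)."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and direction must have the same length")
    score = sum(h * d for h, d in zip(hidden, direction))
    # Scores at or above the threshold are blocked, as in the diagram.
    return "BLOCK" if score >= threshold else "ALLOW"


# Toy vectors, invented for illustration only.
direction = [0.5, 0.5, 0.5, 0.5]
print(probe_decision([2.0, 1.0, 1.5, 0.5], direction))      # score = 2.5  -> BLOCK
print(probe_decision([-1.0, -2.0, -0.5, -1.5], direction))  # score = -2.5 -> ALLOW
```

A score exactly equal to the threshold is blocked, because the comparison is `>=`.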
Evaluation Results
Overall Metrics
| Metric | Score | 95% CI |
|---|---|---|
| AUC-ROC | 0.9997 | [0.9987, 1.0] |
| F1 | 0.984 | [0.9617, 1.0] |
| Precision | 1.0 | — |
| Recall | 0.9684 | — |
| FPR (harmful) | 0.0 | — |
| FPR (safe prompts) | 0.0 | — |
Regression test (safe prompts): PASS ✓ — 0/50 false positives
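For readers who want to see how a bootstrap interval like the F1 one above might be obtained, here is a minimal sketch. The model card does not say which bootstrap variant was used, so the percentile method, the number of resamples, and the seed below are all assumptions. The labels and predictions are rebuilt from the column totals of the per-category table (92 TP, 0 FP, 3 FN, 95 TN).

```python
import random


def f1_score(labels, preds):
    """F1 for binary labels where 1 = harmful (should be blocked)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(labels, preds, metric=f1_score, n_resamples=2000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap CI (assumed method): resample examples with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([labels[i] for i in idx], [preds[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# 95 harmful prompts (92 blocked, 3 missed) and 95 safe prompts (none blocked).
labels = [1] * 95 + [0] * 95
preds = [1] * 92 + [0] * 3 + [0] * 95

print(round(f1_score(labels, preds), 4))  # 0.984
print(bootstrap_ci(labels, preds))        # (lower, upper); depends on the seed
```

The exact interval depends on the resampling scheme and seed, so it should not be expected to match the reported bounds digit for digit.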
Per-Category Results
| Category | TP | FP | FN | TN | F1 |
|---|---|---|---|---|---|
| alignment_hacking | 3 | 0 | 1 | 4 | 0.857 ✓ |
| authorized_user | 5 | 0 | 0 | 5 | 1.000 ✓ |
| bad_chain | 3 | 0 | 0 | 3 | 1.000 ✓ |
| code_injection | 7 | 0 | 0 | 7 | 1.000 ✓ |
| compound_instruction | 4 | 0 | 0 | 4 | 1.000 ✓ |
| context_ignoring | 5 | 0 | 0 | 5 | 1.000 ✓ |
| context_switching | 5 | 0 | 1 | 6 | 0.909 ✓ |
| defined_dictionary | 6 | 0 | 0 | 6 | 1.000 ✓ |
| few_shot_attack | 3 | 0 | 1 | 4 | 0.857 ✓ |
| image_prompting | 3 | 0 | 0 | 3 | 1.000 ✓ |
| indirect_injection | 5 | 0 | 0 | 5 | 1.000 ✓ |
| payload_splitting | 6 | 0 | 0 | 6 | 1.000 ✓ |
| pretending | 5 | 0 | 0 | 5 | 1.000 ✓ |
| recursive_injection | 4 | 0 | 0 | 4 | 1.000 ✓ |
| refusal_suppression | 5 | 0 | 0 | 5 | 1.000 ✓ |
| simple_instruction | 6 | 0 | 0 | 6 | 1.000 ✓ |
| special_case | 8 | 0 | 0 | 8 | 1.000 ✓ |
| task_deflection | 6 | 0 | 0 | 6 | 1.000 ✓ |
| virtualization | 3 | 0 | 0 | 3 | 1.000 ✓ |
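The overall metrics can be cross-checked against this table by summing the confusion counts. The sketch below copies the (TP, FP, FN, TN) values from the rows above and recomputes precision, recall, F1, and false positive rate; the variable and function names are illustrative.

```python
# (TP, FP, FN, TN) per category, copied from the per-category table.
COUNTS = {
    "alignment_hacking": (3, 0, 1, 4),
    "authorized_user": (5, 0, 0, 5),
    "bad_chain": (3, 0, 0, 3),
    "code_injection": (7, 0, 0, 7),
    "compound_instruction": (4, 0, 0, 4),
    "context_ignoring": (5, 0, 0, 5),
    "context_switching": (5, 0, 1, 6),
    "defined_dictionary": (6, 0, 0, 6),
    "few_shot_attack": (3, 0, 1, 4),
    "image_prompting": (3, 0, 0, 3),
    "indirect_injection": (5, 0, 0, 5),
    "payload_splitting": (6, 0, 0, 6),
    "pretending": (5, 0, 0, 5),
    "recursive_injection": (4, 0, 0, 4),
    "refusal_suppression": (5, 0, 0, 5),
    "simple_instruction": (6, 0, 0, 6),
    "special_case": (8, 0, 0, 8),
    "task_deflection": (6, 0, 0, 6),
    "virtualization": (3, 0, 0, 3),
}


def f1(tp, fp, fn):
    """F1 from confusion counts; returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Sum each column across all 19 categories.
tp, fp, fn, tn = (sum(col) for col in zip(*COUNTS.values()))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
fpr = fp / (fp + tn)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
# TP=92 FP=0 FN=3 TN=95
print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1(tp, fp, fn):.4f} fpr={fpr:.4f}")
# precision=1.0000 recall=0.9684 f1=0.9840 fpr=0.0000
```

The totals reproduce the overall precision, recall, and F1 reported in the metrics table.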
Attack Categories Covered (19/19)
| # | Category | Description |
|---|---|---|
| 1 | simple_instruction | Bare direct asks (baseline floor) |
| 2 | context_ignoring | "Ignore all previous instructions" |
| 3 | compound_instruction | Harmful ask buried in benign compound task |
| 4 | special_case | Claimed exceptions and clearances |
| 5 | few_shot_attack | Fabricated prior compliance examples |
| 6 | refusal_suppression | "Do not refuse", penalty threats |
| 7 | context_switching | Bait with safe topic, pivot to harmful |
| 8 | task_deflection | Textbooks, quizzes, warning labels as cover |
| 9 | payload_splitting | Request split across turns/fragments |
| 10 | defined_dictionary | Redefine "banana" = harmful topic |
| 11 | indirect_injection | Via URLs, emails, JSON |
| 12 | recursive_injection | Nested prompt-within-a-prompt |
| 13 | code_injection | Hidden in function stubs and comments |
| 14 | virtualization | Root terminal, sandbox, retro-AI |
| 15 | pretending | Character/entity roleplay |
| 16 | alignment_hacking | Exploit helpfulness/honesty drives |
| 17 | authorized_user | Fake tokens, employee IDs |
| 18 | bad_chain | False syllogisms and flawed reasoning |
| 19 | image_prompting | Simulated OCR, whiteboard, fake uploads |
Usage
Load and run the probe
import torch
import json
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import hf_hub_download
# Load model
model_id = "votal-ai/vai35-4B-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    output_hidden_states=True,
)
model.eval()
# Load probe config
probe_path = hf_hub_download(repo_id=model_id, filename="probe_config.json")
with open(probe_path) as f:
    probe = json.load(f)
direction = torch.tensor(probe["global_direction"]).float().to("cuda")
threshold = probe["threshold_block"]
best_layer = probe["best_layer"]
def classify(text: str) -> dict:
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=512).to("cuda")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # Project the last token's hidden state at the probe layer onto the refusal direction
    score = (out.hidden_states[best_layer][0, -1, :].float() @ direction).item()
    blocked = score >= threshold
    return {"blocked": blocked, "score": round(score, 4), "threshold": threshold}
# Examples
print(classify("How do I make a bomb?"))
# {'blocked': True, 'score': 3.12, 'threshold': -1.2439}
print(classify("What is the capital of France?"))
# {'blocked': False, 'score': -2.34, 'threshold': -1.2439}
Serve with vLLM
vllm serve votal-ai/vai35-4B-v2 \
  --host 0.0.0.0 --port 8000 \
  --dtype bfloat16 --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 8192 \
  --language-model-only
Probe Config
The probe_config.json file contains:
{
  "best_layer": 16,
  "threshold_block": -1.2439,
  "global_direction": [...],   // (hidden_size,) refusal direction vector
  "categories": {              // per-category direction vectors
    "simple_instruction": {"layer": 16, "direction": [...], "score": 4.8},
    ...
  },
  "calibration": {
    "auc_roc": 0.9997,
    "f1": 0.984,
    "precision": 1.0,
    "recall": 0.9684
  }
}
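Before using a downloaded config, it can help to check its shape and to see how the per-category directions could be applied. The sketch below is an assumption-laden illustration: `check_probe_config` and `category_scores` are hypothetical helpers, the toy config uses a 3-dimensional hidden size with invented vectors and an invented second category score, and the model card does not state how the per-category directions are meant to be used at inference time.

```python
REQUIRED_KEYS = ("best_layer", "threshold_block", "global_direction", "categories")


def check_probe_config(cfg: dict, hidden_size: int) -> None:
    """Raise if the config lacks a documented key or a vector has the wrong length."""
    missing = [key for key in REQUIRED_KEYS if key not in cfg]
    if missing:
        raise KeyError(f"probe config is missing keys: {missing}")
    if len(cfg["global_direction"]) != hidden_size:
        raise ValueError("global_direction length does not match hidden_size")
    for name, entry in cfg["categories"].items():
        if len(entry["direction"]) != hidden_size:
            raise ValueError(f"direction for {name!r} does not match hidden_size")


def category_scores(hidden, cfg: dict) -> dict:
    """Project one hidden state onto every per-category direction.

    Assumes `hidden` was taken at the layer named in each category entry.
    """
    return {
        name: sum(h * d for h, d in zip(hidden, entry["direction"]))
        for name, entry in cfg["categories"].items()
    }


# Toy config with the documented layout; all vectors are invented.
toy_cfg = {
    "best_layer": 16,
    "threshold_block": -1.2439,
    "global_direction": [1.0, 0.0, 0.0],
    "categories": {
        "simple_instruction": {"layer": 16, "direction": [1.0, 0.0, 0.0], "score": 4.8},
        "pretending": {"layer": 16, "direction": [0.0, 1.0, 0.0], "score": 4.1},
    },
}

check_probe_config(toy_cfg, hidden_size=3)
scores = category_scores([0.5, 2.0, -1.0], toy_cfg)
print(scores)                       # {'simple_instruction': 0.5, 'pretending': 2.0}
print(max(scores, key=scores.get))  # pretending
```

Taking the highest-scoring category is one possible way to label a blocked input; it is shown here only as a design option.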
Training Methodology
- Dataset: 570 contrastive pairs across 19 adversarial attack categories
- Direction finding: mean difference of hidden states at each layer (harmful − safe, so that harmful inputs score higher, consistent with the `score >= threshold` block rule), normalized
- Layer selection: layer 16 had the highest separability score (gap/spread)
- Threshold calibration: optimised on a 20% held-out eval set, targeting recall ≥ 0.95
- Regression testing: verified a 0% false positive rate on 50 benign prompts
No gradient updates. No SFT. Pure representation engineering.
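The three core steps (direction finding, layer selection, threshold calibration) can be sketched on toy data as follows. Several details are assumptions rather than the authors' exact procedure: the direction is oriented as harmful minus safe so that higher scores mean BLOCK, matching the `score >= threshold` rule in the usage code (flip the sign if the stored direction follows the opposite convention); "gap/spread" is interpreted as the difference of mean projections divided by the sum of the two standard deviations; and the threshold is the largest value that still meets the target recall. All function names and the 2-dimensional activations are invented.

```python
import math
import statistics


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def unit_direction(harmful, safe):
    """Normalized difference of class means, oriented so harmful scores higher."""
    diff = [h - s for h, s in zip(mean_vector(harmful), mean_vector(safe))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def project(rows, direction):
    """Dot product of each row with the direction."""
    return [sum(x * d for x, d in zip(row, direction)) for row in rows]


def separability(harmful_scores, safe_scores):
    """gap / spread (assumed definition: mean gap over summed std devs)."""
    gap = statistics.mean(harmful_scores) - statistics.mean(safe_scores)
    spread = statistics.pstdev(harmful_scores) + statistics.pstdev(safe_scores)
    return gap / spread if spread else float("inf")


def calibrate_threshold(harmful_scores, target_recall=0.95):
    """Largest threshold t with (harmful scores >= t) / n >= target_recall."""
    ranked = sorted(harmful_scores, reverse=True)
    k = math.ceil(target_recall * len(ranked))  # harmful items that must be blocked
    return ranked[k - 1]


# Toy activations per layer: {layer: (harmful_rows, safe_rows)}. Invented values.
layers = {
    8: ([[0.5, 1.0], [0.2, -1.0], [0.4, 0.3]],
        [[-0.1, 0.8], [0.1, -0.9], [-0.3, 0.2]]),
    16: ([[2.0, 1.0], [3.0, 1.0], [2.5, 0.0]],
         [[-2.0, 0.0], [-3.0, 1.0], [-2.5, -1.0]]),
}

results = {}
for layer, (harmful, safe) in layers.items():
    direction = unit_direction(harmful, safe)
    h_scores, s_scores = project(harmful, direction), project(safe, direction)
    results[layer] = (separability(h_scores, s_scores), direction, h_scores, s_scores)

best_layer = max(results, key=lambda layer: results[layer][0])
_, direction, h_scores, s_scores = results[best_layer]
threshold = calibrate_threshold(h_scores, target_recall=0.95)
print(best_layer)  # 16
```

In practice the threshold would be calibrated on held-out scores rather than the data used to find the direction, as the methodology above describes.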
Eval Files
| File | Description |
|---|---|
| eval/ci_report.txt | Bootstrap 95% confidence intervals |
| eval/eval_report.json | Per-category confusion matrix |
| eval/regression_report.json | Safe prompt false positive report |
| eval/confidence_intervals.json | Full bootstrap CI data |
| probe/layer_scores.json | Signal strength per transformer layer |
Citation
@misc{votal-ai-vai35-guardrail,
  title={vai35-4B-v2: Adversarially Hardened Guardrail Model},
  author={Votal AI},
  year={2026},
  url={https://huggingface.co/votal-ai/vai35-4B-v2}
}