vai35-4B-v2

Adversarially hardened guardrail model based on Qwen/Qwen3.5-4B.
Uses representation engineering (surgical weight editing + linear probe) to detect and block adversarial attacks with minimal inference overhead.

Model Description

This model extends votal-ai/vai35-4B with a calibrated linear probe baked into layer 16.
Instead of generating a response and parsing it for safety, the probe classifies inputs via a single dot product against the refusal direction vector — achieving 4–6× faster inference than generation-based guardrails.

Architecture

Input tokens
    │
    ▼
Transformer layers 0 → 16
    │
    ▼  ← Probe intercepts here (layer 16)
    dot product with refusal direction
    │
    ├── score >= -1.2439 → BLOCK
    └── score < -1.2439  → ALLOW

Evaluation Results

Overall Metrics

Metric	Score	95% CI
AUC-ROC	0.9997	[0.9987, 1.0]
F1	0.984	[0.9617, 1.0]
Precision	1.0	—
Recall	0.9684	—
FPR (harmful)	0.0	—
FPR (safe prompts)	0.0%	—

Regression test (safe prompts): PASS ✓ — 0/50 false positives

Per-Category Results

Category	TP	FN	TN	F1
alignment_hacking	3	1	4	0.857 ✓
authorized_user	5	0	5	1.000 ✓
bad_chain	3	0	3	1.000 ✓
code_injection	7	0	7	1.000 ✓
compound_instruction	4	0	4	1.000 ✓
context_ignoring	5	0	5	1.000 ✓
context_switching	5	1	6	0.909 ✓
defined_dictionary	6	0	6	1.000 ✓
few_shot_attack	3	1	4	0.857 ✓
image_prompting	3	0	3	1.000 ✓
indirect_injection	5	0	5	1.000 ✓
payload_splitting	6	0	6	1.000 ✓
pretending	5	0	5	1.000 ✓
recursive_injection	4	0	4	1.000 ✓
refusal_suppression	5	0	5	1.000 ✓
simple_instruction	6	0	6	1.000 ✓
special_case	8	0	8	1.000 ✓
task_deflection	6	0	6	1.000 ✓
virtualization	3	0	3	1.000 ✓

Attack Categories Covered (19/19)

#	Category	Description
1	`simple_instruction`	Bare direct asks — baseline floor
2	`context_ignoring`	"Ignore all previous instructions"
3	`compound_instruction`	Harmful ask buried in benign compound task
4	`special_case`	Claimed exceptions and clearances
5	`few_shot_attack`	Fabricated prior compliance examples
6	`refusal_suppression`	"Do not refuse", penalty threats
7	`context_switching`	Bait with safe topic, pivot to harmful
8	`task_deflection`	Textbooks, quizzes, warning labels as cover
9	`payload_splitting`	Request split across turns/fragments
10	`defined_dictionary`	Redefine "banana" = harmful topic
11	`indirect_injection`	Via URLs, emails, JSON
12	`recursive_injection`	Nested prompt-within-a-prompt
13	`code_injection`	Hidden in function stubs and comments
14	`virtualization`	Root terminal, sandbox, retro-AI
15	`pretending`	Character/entity roleplay
16	`alignment_hacking`	Exploit helpfulness/honesty drives
17	`authorized_user`	Fake tokens, employee IDs
18	`bad_chain`	False syllogisms and flawed reasoning
19	`image_prompting`	Simulated OCR, whiteboard, fake uploads

Usage

Load and run the probe

import torch
import json
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import hf_hub_download

# Load model
model_id = "votal-ai/vai35-4B-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    output_hidden_states=True,
)
model.eval()

# Load probe config
probe_path = hf_hub_download(repo_id=model_id, filename="probe_config.json")
with open(probe_path) as f:
    probe = json.load(f)

direction  = torch.tensor(probe["global_direction"]).float().to("cuda")
threshold  = probe["threshold_block"]
best_layer = probe["best_layer"]

def classify(text: str) -> dict:
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=512).to("cuda")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    score   = (out.hidden_states[best_layer][0, -1, :].float() @ direction).item()
    blocked = score >= threshold
    return {"blocked": blocked, "score": round(score, 4), "threshold": threshold}

# Examples
print(classify("How do I make a bomb?"))
# {'blocked': True, 'score': 3.12, 'threshold': -1.2439}

print(classify("What is the capital of France?"))
# {'blocked': False, 'score': -2.34, 'threshold': -1.2439}

Serve with vLLM

vllm serve votal-ai/vai35-4B-v2 \
  --host 0.0.0.0 --port 8000 \
  --dtype bfloat16 --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 8192 \
  --language-model-only

Probe Config

The probe_config.json file contains:

{
  "best_layer": 16,
  "threshold_block": -1.2439,
  "global_direction": [...],   // (hidden_size,) refusal direction vector
  "categories": {             // per-category direction vectors
    "simple_instruction": {"layer": 16, "direction": [...], "score": 4.8},
    ...
  },
  "calibration": {
    "auc_roc": 0.9997,
    "f1": 0.984,
    "precision": 1.0,
    "recall": 0.9684
  }
}

Training Methodology

Dataset — 570 contrastive pairs across 19 adversarial attack categories
Direction finding — Mean difference of hidden states at each layer (safe − harmful), normalized
Layer selection — Layer 16 selected by separability score (gap/spread = highest signal)
Threshold calibration — Optimised on 20% held-out eval set targeting recall ≥ 0.95
Regression testing — Verified 0% false positive rate on 50 benign prompts

No gradient updates. No SFT. Pure representation engineering.

Eval Files

File	Description
`eval/ci_report.txt`	Bootstrap 95% confidence intervals
`eval/eval_report.json`	Per-category confusion matrix
`eval/regression_report.json`	Safe prompt false positive report
`eval/confidence_intervals.json`	Full bootstrap CI data
`probe/layer_scores.json`	Signal strength per transformer layer

Citation

@misc{votal-ai-vai35-guardrail,
  title={vai35-4B-v2: Adversarially Hardened Guardrail Model},
  author={Votal AI},
  year={2026},
  url={https://huggingface.co/votal-ai/vai35-4B-v2}
}

Downloads last month: 4,137

Safetensors

Model size

5B params

Tensor type

BF16

F32

Model tree for votal-ai/vai35-4B-v2

Base model

Qwen/Qwen3.5-4B-Base

Finetuned

Qwen/Qwen3.5-4B

Finetuned

(243)

this model