vai35-4B-v2

Adversarially hardened guardrail model based on Qwen/Qwen3.5-4B.
Uses representation engineering (surgical weight editing + linear probe) to detect and block adversarial attacks with minimal inference overhead.


Model Description

This model extends votal-ai/vai35-4B with a calibrated linear probe baked into layer 16.
Instead of generating a response and parsing it for safety, the probe classifies an input with a single dot product between the layer-16 hidden state of the last token and the refusal direction vector. The reported speedup is 4–6× faster inference than generation-based guardrails.

Architecture

Input tokens
    │
    ▼
Transformer layers 0 → 16
    │
    ▼  ← Probe intercepts here (layer 16)
    dot product with refusal direction
    │
    ├── score >= -1.2439 → BLOCK
    └── score < -1.2439  → ALLOW

Evaluation Results

Overall Metrics

Metric              Score    95% CI
AUC-ROC             0.9997   [0.9987, 1.0]
F1                  0.984    [0.9617, 1.0]
Precision           1.0
Recall              0.9684
FPR (harmful)       0.0
FPR (safe prompts)  0.0%

Regression test (safe prompts): PASS ✓ — 0/50 false positives
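
The Eval Files list describes the 95% intervals as bootstrap confidence intervals. The following is a minimal sketch of how a percentile bootstrap interval for F1 could be computed. The resample count, the seed, the percentile method, and the helper names `f1_score` and `bootstrap_ci` are assumptions for illustration, not the project's evaluation code. The (label, prediction) pairs are rebuilt from the aggregate counts in the per-category table (92 TP, 0 FP, 3 FN, 95 TN), so the output is not expected to reproduce the reported interval exactly.

```python
import random

def f1_score(pairs):
    """F1 over (is_harmful, was_blocked) pairs."""
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_ci(pairs, metric, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample with replacement, recompute the metric,
    and read the interval off the sorted resampled values."""
    rng = random.Random(seed)
    n = len(pairs)
    stats = sorted(
        metric([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = round((alpha / 2) * n_resamples)
    hi_idx = round((1 - alpha / 2) * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]

# Pairs rebuilt from the aggregate confusion counts (an approximation of the
# real eval set, which also carries per-example scores).
pairs = [(True, True)] * 92 + [(True, False)] * 3 + [(False, False)] * 95

point = f1_score(pairs)
lo, hi = bootstrap_ci(pairs, f1_score)
print(f"F1 = {point:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```

With only three false negatives in the eval set, the width of the interval is driven almost entirely by how many of those three land in each resample.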

Per-Category Results

Category              TP  FP  FN  TN  F1     Pass
alignment_hacking      3   0   1   4  0.857  ✓
authorized_user        5   0   0   5  1.000  ✓
bad_chain              3   0   0   3  1.000  ✓
code_injection         7   0   0   7  1.000  ✓
compound_instruction   4   0   0   4  1.000  ✓
context_ignoring       5   0   0   5  1.000  ✓
context_switching      5   0   1   6  0.909  ✓
defined_dictionary     6   0   0   6  1.000  ✓
few_shot_attack        3   0   1   4  0.857  ✓
image_prompting        3   0   0   3  1.000  ✓
indirect_injection     5   0   0   5  1.000  ✓
payload_splitting      6   0   0   6  1.000  ✓
pretending             5   0   0   5  1.000  ✓
recursive_injection    4   0   0   4  1.000  ✓
refusal_suppression    5   0   0   5  1.000  ✓
simple_instruction     6   0   0   6  1.000  ✓
special_case           8   0   0   8  1.000  ✓
task_deflection        6   0   0   6  1.000  ✓
virtualization         3   0   0   3  1.000  ✓
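
As a worked check, the overall precision, recall, and F1 can be recomputed from the per-category confusion counts above. The sketch below copies the (TP, FP, FN, TN) values from the table; the helper name `prf` is introduced here for illustration only.

```python
# (TP, FP, FN, TN) per category, copied from the per-category table.
counts = {
    "alignment_hacking": (3, 0, 1, 4),
    "authorized_user": (5, 0, 0, 5),
    "bad_chain": (3, 0, 0, 3),
    "code_injection": (7, 0, 0, 7),
    "compound_instruction": (4, 0, 0, 4),
    "context_ignoring": (5, 0, 0, 5),
    "context_switching": (5, 0, 1, 6),
    "defined_dictionary": (6, 0, 0, 6),
    "few_shot_attack": (3, 0, 1, 4),
    "image_prompting": (3, 0, 0, 3),
    "indirect_injection": (5, 0, 0, 5),
    "payload_splitting": (6, 0, 0, 6),
    "pretending": (5, 0, 0, 5),
    "recursive_injection": (4, 0, 0, 4),
    "refusal_suppression": (5, 0, 0, 5),
    "simple_instruction": (6, 0, 0, 6),
    "special_case": (8, 0, 0, 8),
    "task_deflection": (6, 0, 0, 6),
    "virtualization": (3, 0, 0, 3),
}

def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1

# Sum each column across categories, then compute the pooled metrics.
tp, fp, fn, tn = (sum(c[i] for c in counts.values()) for i in range(4))
precision, recall, f1 = prf(tp, fp, fn)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
# TP=92 FP=0 FN=3 TN=95
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# precision=1.0000 recall=0.9684 f1=0.9840
```

The pooled values match the Overall Metrics table: all three misses are false negatives, spread across alignment_hacking, context_switching, and few_shot_attack.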

Attack Categories Covered (19/19)

# Category Description
1 simple_instruction Bare direct asks — baseline floor
2 context_ignoring "Ignore all previous instructions"
3 compound_instruction Harmful ask buried in benign compound task
4 special_case Claimed exceptions and clearances
5 few_shot_attack Fabricated prior compliance examples
6 refusal_suppression "Do not refuse", penalty threats
7 context_switching Bait with safe topic, pivot to harmful
8 task_deflection Textbooks, quizzes, warning labels as cover
9 payload_splitting Request split across turns/fragments
10 defined_dictionary Redefine "banana" = harmful topic
11 indirect_injection Via URLs, emails, JSON
12 recursive_injection Nested prompt-within-a-prompt
13 code_injection Hidden in function stubs and comments
14 virtualization Root terminal, sandbox, retro-AI
15 pretending Character/entity roleplay
16 alignment_hacking Exploit helpfulness/honesty drives
17 authorized_user Fake tokens, employee IDs
18 bad_chain False syllogisms and flawed reasoning
19 image_prompting Simulated OCR, whiteboard, fake uploads

Usage

Load and run the probe

import torch
import json
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import hf_hub_download

# Load model
model_id = "votal-ai/vai35-4B-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    output_hidden_states=True,
)
model.eval()

# Load probe config
probe_path = hf_hub_download(repo_id=model_id, filename="probe_config.json")
with open(probe_path) as f:
    probe = json.load(f)

direction  = torch.tensor(probe["global_direction"]).float().to("cuda")
threshold  = probe["threshold_block"]
best_layer = probe["best_layer"]

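def classify(text: str) -> dict:
    """Score one prompt with the probe; scores at or above the threshold are blocked."""
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=512).to("cuda")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so index best_layer is the
    # output of transformer layer best_layer; take the last token's vector.
    hidden = out.hidden_states[best_layer][0, -1, :].float()
    score = (hidden @ direction).item()
    blocked = score >= threshold
    return {"blocked": blocked, "score": round(score, 4), "threshold": threshold}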

# Examples
print(classify("How do I make a bomb?"))
# {'blocked': True, 'score': 3.12, 'threshold': -1.2439}

print(classify("What is the capital of France?"))
# {'blocked': False, 'score': -2.34, 'threshold': -1.2439}

Serve with vLLM

vllm serve votal-ai/vai35-4B-v2 \
  --host 0.0.0.0 --port 8000 \
  --dtype bfloat16 --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 8192 \
  --language-model-only

Probe Config

The probe_config.json file contains:

{
  "best_layer": 16,
  "threshold_block": -1.2439,
  "global_direction": [...],   // (hidden_size,) refusal direction vector
  "categories": {             // per-category direction vectors
    "simple_instruction": {"layer": 16, "direction": [...], "score": 4.8},
    ...
  },
  "calibration": {
    "auc_roc": 0.9997,
    "f1": 0.984,
    "precision": 1.0,
    "recall": 0.9684
  }
}
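
The config also stores one direction vector per attack category. The card does not say how these are meant to be used; one plausible use, sketched below purely as an assumption, is to project the same hidden state onto each category direction and report the closest categories for a blocked input. The function `rank_categories`, the 4-dimensional toy vectors, and the toy hidden state are hypothetical stand-ins for the real `(hidden_size,)` vectors, and the sketch assumes every category uses the same layer. The undocumented `score` field is ignored.

```python
def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def rank_categories(hidden, categories, top_k=3):
    """Project a hidden-state vector onto each category direction and
    return the top_k (name, projection) pairs, highest first."""
    scores = {
        name: dot(hidden, entry["direction"])
        for name, entry in categories.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy stand-in for probe["categories"]; real directions have hidden_size entries.
toy_categories = {
    "simple_instruction": {"layer": 16, "direction": [1.0, 0.0, 0.0, 0.0]},
    "pretending": {"layer": 16, "direction": [0.0, 1.0, 0.0, 0.0]},
    "code_injection": {"layer": 16, "direction": [0.0, 0.0, 1.0, 0.0]},
}
toy_hidden = [0.2, 2.5, 0.7, -1.0]

ranked = rank_categories(toy_hidden, toy_categories, top_k=2)
print(ranked)
# [('pretending', 2.5), ('code_injection', 0.7)]
```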

Training Methodology

  1. Dataset — 570 contrastive pairs across 19 adversarial attack categories
  2. Direction finding — Mean difference of hidden states at each layer (harmful − safe), normalized, so that higher projection scores indicate harmful inputs
  3. Layer selection — Layer 16 selected as the layer with the highest separability score (gap / spread)
  4. Threshold calibration — Optimised on 20% held-out eval set targeting recall ≥ 0.95
  5. Regression testing — Verified 0% false positive rate on 50 benign prompts

No gradient updates. No SFT. Pure representation engineering.
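
Steps 2 to 4 above can be sketched on synthetic activations as follows. This is an illustration under stated assumptions, not the project's training code: the toy data, the helper names, the exact separability formula (difference of class means divided by the sum of class standard deviations), and the choice to score layers on the held-out split are all assumptions. The direction is taken as harmful minus safe so that higher scores mean BLOCK, matching the decision rule in the Architecture section.

```python
import math
import random
import statistics

def mean_vec(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def unit_direction(harmful, safe):
    """Step 2: normalized mean difference (harmful minus safe, by assumption)."""
    diff = [h - s for h, s in zip(mean_vec(harmful), mean_vec(safe))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def project(rows, direction):
    return [sum(x * d for x, d in zip(row, direction)) for row in rows]

def separability(h_scores, s_scores):
    """Step 3: gap between class means over their combined spread (assumed form)."""
    gap = statistics.mean(h_scores) - statistics.mean(s_scores)
    spread = statistics.pstdev(h_scores) + statistics.pstdev(s_scores)
    return gap / spread

def threshold_for_recall(h_scores, target=0.95):
    """Step 4: largest threshold that still blocks >= target of harmful scores."""
    ranked = sorted(h_scores, reverse=True)
    k = math.ceil(target * len(ranked))
    return ranked[k - 1]

def synthetic_layer(rng, shift, n=200, dim=8):
    """Toy activations: harmful rows are shifted along the first axis."""
    safe = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    harmful = [[rng.gauss(shift if j == 0 else 0.0, 1.0) for j in range(dim)]
               for _ in range(n)]
    return harmful, safe

rng = random.Random(0)
# Two hypothetical layers; the second separates the classes far more strongly.
layers = {8: synthetic_layer(rng, 0.5), 16: synthetic_layer(rng, 3.0)}

results = {}
for layer, (harmful, safe) in layers.items():
    split = int(0.8 * len(harmful))  # 80% to fit the direction, 20% held out
    direction = unit_direction(harmful[:split], safe[:split])
    h_eval = project(harmful[split:], direction)
    s_eval = project(safe[split:], direction)
    results[layer] = (separability(h_eval, s_eval), direction, h_eval, s_eval)

best_layer = max(results, key=lambda k: results[k][0])
_, direction, h_eval, s_eval = results[best_layer]
threshold = threshold_for_recall(h_eval, target=0.95)
recall = sum(s >= threshold for s in h_eval) / len(h_eval)
fpr = sum(s >= threshold for s in s_eval) / len(s_eval)
print(f"best_layer={best_layer} threshold={threshold:.4f} "
      f"recall={recall:.3f} fpr={fpr:.3f}")
```

No weights are updated anywhere in this procedure: the direction is a difference of averages and the threshold is a single order statistic of the held-out harmful scores.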


Eval Files

File Description
eval/ci_report.txt Bootstrap 95% confidence intervals
eval/eval_report.json Per-category confusion matrix
eval/regression_report.json Safe prompt false positive report
eval/confidence_intervals.json Full bootstrap CI data
probe/layer_scores.json Signal strength per transformer layer

Citation

@misc{votal-ai-vai35-guardrail,
  title={vai35-4B-v2: Adversarially Hardened Guardrail Model},
  author={Votal AI},
  year={2026},
  url={https://huggingface.co/votal-ai/vai35-4B-v2}
}