# vqwen-qformer-tiktok (v7)
A MiniGPT-4-style vision-language model specialized for TikTok "sludge" content detection with grounded, factually-accurate explanations.
Loads as a stock `Blip2ForConditionalGeneration`; no `trust_remote_code` needed.
## Architecture

- Vision tower: `Salesforce/blip2-opt-2.7b` ViT-G/14-224 (frozen)
- Q-Former: `Salesforce/blip2-opt-2.7b` 32 pretrained query tokens (frozen)
- Linear projector: 768 → 2560 (trained)
- LLM: `Qwen/Qwen3-4B` with r=16 LoRA merged in (trained)
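For orientation, a minimal sketch of the data path this implies (illustrative tensor shapes only, not the checkpoint's module names; the LoRA deltas are already merged into the LLM):

```python
import torch
import torch.nn as nn

# The projector is the only newly initialized weight in this stack:
# 768  = Q-Former hidden size (32 frozen query tokens per image)
# 2560 = Qwen3-4B embedding width
projector = nn.Linear(768, 2560)

query_output = torch.randn(1, 32, 768)   # frozen Q-Former output for one image
visual_prompt = projector(query_output)  # (1, 32, 2560), prepended to text embeddings
```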
## Quick start
```python
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, AutoProcessor

model_id = "alpharomercoma/vqwen-qformer-tiktok"
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
).eval()
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("my_frame.jpg").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Is this sludge content? Answer yes or no."},
    ],
}]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)  # match model dtype

out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
## Capabilities (all tested)
| Task | Prompt | Output |
|---|---|---|
| Binary classify | "Is this sludge content?" | Yes. / No. |
| Layout label | "What layout type is shown?" | vertical / picture-in-picture / collage / single_scene / … |
| Describe | "Describe this frame." | 1-3 sentence scene description, grounded in visible content only |
| Explain | "Why?" (multi-turn after classify) | Structural reasoning tied to the visible panes |
| Refuse specifics | "What specific show is on the top pane?" | Acknowledges it cannot identify specific shows/games/creators from one frame |
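The multi-turn explain row works through ordinary chat history. A sketch reusing the objects from the quick start above (the follow-up wording is up to you):

```python
# Turn 1: classify.
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Is this sludge content? Answer yes or no."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
answer = processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Turn 2: append the model's answer and ask "Why?", then generate again.
messages += [
    {"role": "assistant", "content": [{"type": "text", "text": answer}]},
    {"role": "user", "content": [{"type": "text", "text": "Why?"}]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```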
## Training recipe

Stage 1 (feature alignment, projector only) on `liuhaotian/LLaVA-Pretrain` (558K):
standard MiniGPT-4 setup. Global batch 256, LR 1e-4 cosine, warmup 0.03, bf16, 1 epoch.
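A minimal sketch of that stage-1 setup, reusing `model` from the quick start (`language_projection` is the BLIP-2 projector attribute in transformers; data loader and loss are omitted):

```python
import torch

# Freeze everything, then unfreeze only the vision->LLM projector.
for p in model.parameters():
    p.requires_grad = False
for p in model.language_projection.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(model.language_projection.parameters(), lr=1e-4)
# Cosine decay with 3% warmup, global batch 256, bf16, 1 epoch (per the recipe above).
```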
Stage 2 — TikTok distillation (v7):

- Teacher labels: `Qwen/Qwen3-VL-30B-A3B-Instruct` labeled 8,487 TikTok frames (1,700 videos × up to 5 frames) with structured JSON: `{sludge, layout, description, explanation}`. Grounded descriptions with zero fabricated show/game/channel names.
- Cross-validation: `google/gemma-3-27b-it` judged 250 teacher↔GT disagreements and sided with the teacher on 232/250 (92.8%), confirming the GT carries ~10-12% semantic label noise under the strict sludge definition. Teacher labels are therefore used as the classification ground truth for training.
- Conversation mix (per frame, one random task; see the sampler sketch after this list):
  - 25% classify (yes/no)
  - 15% layout (category)
  - 20% describe (teacher description)
  - 35% coupled: 2-turn classify + explain, forcing internal consistency
  - 5% refuse: "what specific show" → explicit refusal template
- LoRA: r=16, α=32, dropout=0.15 on Qwen3 (q/k/v/o/gate/up/down) = 35M trainable parameters.
- Optimizer: bf16 fused AdamW, cosine schedule with warmup 0.03, LR 2e-4 (LoRA) / 2e-5 (projector).
- Schedule: batch 128 × 10 epochs = 530 steps on an H200. Eval every 25 steps, `load_best_model_at_end` on eval_loss.
- Runtime: ~12 min for training + ~1.5 h for teacher labeling (bs=8 batched).
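A sketch of the per-frame task sampler implied by the conversation mix above; the prompt and refusal wording are illustrative, and `label` stands for one frame's teacher JSON:

```python
import random

TASK_MIX = [("classify", 0.25), ("layout", 0.15), ("describe", 0.20),
            ("coupled", 0.35), ("refuse", 0.05)]

def build_turns(label: dict) -> list[tuple[str, str]]:
    """label: one frame's teacher JSON {sludge, layout, description, explanation}."""
    task = random.choices([t for t, _ in TASK_MIX],
                          weights=[w for _, w in TASK_MIX])[0]
    yes_no = "Yes." if label["sludge"] else "No."
    if task == "classify":
        return [("Is this sludge content? Answer yes or no.", yes_no)]
    if task == "layout":
        return [("What layout type is shown?", label["layout"])]
    if task == "describe":
        return [("Describe this frame.", label["description"])]
    if task == "coupled":  # 2-turn classify + explain, forcing consistency
        return [("Is this sludge content? Answer yes or no.", yes_no),
                ("Why?", label["explanation"])]
    # refuse: illustrative refusal wording
    return [("What specific show is on the top pane?",
             "I cannot reliably identify the specific show, game, or creator "
             "from this single frame.")]
```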
## Evaluation — held-out 300 videos (never seen during training)
Two benchmarks:
| Benchmark | Correct | Accuracy |
|---|---|---|
| vs original GT labels | 267 / 300 | 89.0% |
| vs Gemma-3-validated cleaned labels | 290 / 300 | 96.7% |
The 7.7-pp gap reflects measurable noise in the human labels (Gemma-3 sided with the teacher on 92.8% of human/teacher disagreements). On the strict sludge definition, v7 reaches 96.7% classification accuracy with zero hallucinated show/game/channel mentions, the original goal of this iteration arc.
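For reference, both rows reduce to exact-match accuracy over parsed yes/no answers; a trivial sketch:

```python
def to_bool(answer: str) -> bool:
    # The model answers "Yes." / "No."; treat anything starting with "yes" as positive.
    return answer.strip().lower().startswith("yes")

def accuracy(preds: list[bool], labels: list[bool]) -> float:
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# vs original GT labels:               267/300 -> 89.0%
# vs Gemma-3-validated cleaned labels: 290/300 -> 96.7%
```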
## Explanation samples (no hallucinated content)

- Sludge frame (vertical layout): "Yes, this is sludge. The frame is vertically split into two distinct sections. The top pane shows an animated character holding a sign with '10' and a green mark; the bottom pane shows red splatters on a black background."
- Non-sludge frame: "No, this is not sludge. The frame shows a single continuous scene without the multi-pane structure characteristic of sludge content."
- "What specific show is on the top pane?" (any frame) → "I cannot reliably identify the specific show, game, or creator from this single frame. What I can see is …"
## Limitations
- English-only output.
- Single-frame reasoning — no temporal / multi-frame understanding.
- Will not name specific shows/games even when a human could read them off the image, because the training labels explicitly suppressed that.
- Trained on short vertical TikTok-style content; generalization to other multi-pane media (PC streams, IPTV mosaics) is untested.
## Credits

- Teacher: `Qwen/Qwen3-VL-30B-A3B-Instruct`
- Judge (label QA): `google/gemma-3-27b-it`
- Base vision: `Salesforce/blip2-opt-2.7b` (ViT-G + Q-Former)
- Base LLM: `Qwen/Qwen3-4B`
- Recipe: MiniGPT-4 (stage-1 alignment), LLaVA-1.5 LoRA (stage-2 instruction tuning), and teacher-student distillation from a frontier VLM with cross-judge validation.
## License

Apache 2.0 for the trained deltas (LoRA + projector). Base models retain their original licenses: Salesforce/blip2-opt-2.7b (BSD-3), Qwen/Qwen3-4B (Apache 2.0).