# vqwen-qformer-tiktok (v7)
A MiniGPT-4-style vision-language model specialized for TikTok "sludge" content detection with grounded, factually-accurate explanations.
Loads as a stock `Blip2ForConditionalGeneration`; no `trust_remote_code` needed.
## Architecture

- Vision tower: `Salesforce/blip2-opt-2.7b` ViT-G/14-224 (frozen)
- Q-Former: `Salesforce/blip2-opt-2.7b` 32 pretrained query tokens (frozen)
- Linear projector: 768 → 2560 (trained)
- LLM: `Qwen/Qwen3-4B` with r=16 LoRA merged in (trained)
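For orientation, a minimal sketch of the data path this implies (illustrative tensor shapes only, not the checkpoint's module names; the LoRA deltas are already merged into the LLM):

```python
import torch
import torch.nn as nn

# The projector is the only newly initialized weight in this stack:
# 768  = Q-Former hidden size (32 frozen query tokens per image)
# 2560 = Qwen3-4B embedding width
projector = nn.Linear(768, 2560)

query_output = torch.randn(1, 32, 768)   # frozen Q-Former output for one image
visual_prompt = projector(query_output)  # (1, 32, 2560), prepended to text embeddings
```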
## Quick start
```python
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, AutoProcessor

model_id = "alpharomercoma/vqwen-qformer-tiktok"
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
).eval()
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("my_frame.jpg").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Is this sludge content? Answer yes or no."},
    ],
}]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)  # match model dtype

out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
## Capabilities (all tested)
| Task | Prompt | Output |
|---|---|---|
| Binary classify | "Is this sludge content?" | Yes. / No. |
| Layout label | "What layout type is shown?" | vertical / picture-in-picture / collage / single_scene / … |
| Describe | "Describe this frame." | 1-3 sentence scene description, grounded in visible content only |
| Explain | "Why?" (multi-turn after classify) | Structural reasoning tied to the visible panes |
| Refuse specifics | "What specific show is on the top pane?" | Acknowledges it cannot identify specific shows/games/creators from one frame |
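The multi-turn explain row works through ordinary chat history. A sketch reusing the objects from the quick start above (the follow-up wording is up to you):

```python
# Turn 1: classify.
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Is this sludge content? Answer yes or no."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
answer = processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Turn 2: append the model's answer and ask "Why?", then generate again.
messages += [
    {"role": "assistant", "content": [{"type": "text", "text": answer}]},
    {"role": "user", "content": [{"type": "text", "text": "Why?"}]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```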
## Training recipe

Stage 1 (feature alignment, projector only) on `liuhaotian/LLaVA-Pretrain` (558K):
standard MiniGPT-4 setup. Global batch 256, LR 1e-4 cosine, warmup 0.03, bf16, 1 epoch.
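A minimal sketch of that stage-1 setup, reusing `model` from the quick start (`language_projection` is the BLIP-2 projector attribute in transformers; data loader and loss are omitted):

```python
import torch

# Freeze everything, then unfreeze only the vision->LLM projector.
for p in model.parameters():
    p.requires_grad = False
for p in model.language_projection.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(model.language_projection.parameters(), lr=1e-4)
# Cosine decay with 3% warmup, global batch 256, bf16, 1 epoch (per the recipe above).
```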
Stage 2 — TikTok distillation (v7):

- Teacher labels: `Qwen/Qwen3-VL-30B-A3B-Instruct` labeled 8,487 TikTok frames (1,700 videos × up to 5 frames) with structured JSON: `{sludge, layout, description, explanation}`. Grounded descriptions with zero fabricated show/game/channel names.
- Cross-validation: `google/gemma-3-27b-it` judged 250 teacher↔GT disagreements and sided with the teacher on 232/250 (92.8%), confirming the GT carries ~10-12% semantic label noise under the strict sludge definition. Teacher labels are therefore used as the classification ground truth for training.
- Conversation mix (per frame, one random task; see the sampler sketch after this list):
  - 25% classify (yes/no)
  - 15% layout (category)
  - 20% describe (teacher description)
  - 35% coupled: 2-turn classify + explain, forcing internal consistency
  - 5% refuse: "what specific show" → explicit refusal template
- LoRA: r=16, α=32, dropout=0.15 on Qwen3 (q/k/v/o/gate/up/down) = 35M trainable parameters.
- Optimizer: bf16 fused AdamW, cosine schedule with warmup 0.03, LR 2e-4 (LoRA) / 2e-5 (projector).
- Schedule: batch 128 × 10 epochs = 530 steps on an H200. Eval every 25 steps, `load_best_model_at_end` on eval_loss.
- Runtime: ~12 min for training + ~1.5 h for teacher labeling (bs=8 batched).
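A sketch of the per-frame task sampler implied by the conversation mix above; the prompt and refusal wording are illustrative, and `label` stands for one frame's teacher JSON:

```python
import random

TASK_MIX = [("classify", 0.25), ("layout", 0.15), ("describe", 0.20),
            ("coupled", 0.35), ("refuse", 0.05)]

def build_turns(label: dict) -> list[tuple[str, str]]:
    """label: one frame's teacher JSON {sludge, layout, description, explanation}."""
    task = random.choices([t for t, _ in TASK_MIX],
                          weights=[w for _, w in TASK_MIX])[0]
    yes_no = "Yes." if label["sludge"] else "No."
    if task == "classify":
        return [("Is this sludge content? Answer yes or no.", yes_no)]
    if task == "layout":
        return [("What layout type is shown?", label["layout"])]
    if task == "describe":
        return [("Describe this frame.", label["description"])]
    if task == "coupled":  # 2-turn classify + explain, forcing consistency
        return [("Is this sludge content? Answer yes or no.", yes_no),
                ("Why?", label["explanation"])]
    # refuse: illustrative refusal wording
    return [("What specific show is on the top pane?",
             "I cannot reliably identify the specific show, game, or creator "
             "from this single frame.")]
```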
## Evaluation — held-out 300 videos (never seen during training)
Two benchmarks:
| Benchmark | Correct | Accuracy |
|---|---|---|
| vs original GT labels | 267 / 300 | 89.0% |
| vs Gemma-3-validated cleaned labels | 290 / 300 | 96.7% |
The 7.7-pp gap reflects measurable noise in the human labels (Gemma-3 sided with the teacher on 92.8% of human/teacher disagreements). On the strict sludge definition, v7 reaches 96.7% classification accuracy with zero hallucinated show/game/channel mentions, the original goal of this iteration arc.
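For reference, both rows reduce to exact-match accuracy over parsed yes/no answers; a trivial sketch:

```python
def to_bool(answer: str) -> bool:
    # The model answers "Yes." / "No."; treat anything starting with "yes" as positive.
    return answer.strip().lower().startswith("yes")

def accuracy(preds: list[bool], labels: list[bool]) -> float:
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# vs original GT labels:               267/300 -> 89.0%
# vs Gemma-3-validated cleaned labels: 290/300 -> 96.7%
```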
## Explanation samples (no hallucinated content)

- Sludge frame (vertical layout): "Yes, this is sludge. The frame is vertically split into two distinct sections. The top pane shows an animated character holding a sign with '10' and a green mark; the bottom pane shows red splatters on a black background."
- Non-sludge frame: "No, this is not sludge. The frame shows a single continuous scene without the multi-pane structure characteristic of sludge content."
- "What specific show is on the top pane?" (any frame) → "I cannot reliably identify the specific show, game, or creator from this single frame. What I can see is …"
## Limitations
- English-only output.
- Single-frame reasoning — no temporal / multi-frame understanding.
- Will not name specific shows/games even when a human could read them off the image, because the training labels explicitly suppressed that.
- Trained on short vertical TikTok-style content; generalization to other multi-pane media (PC streams, IPTV mosaics) is untested.
## Credits

- Teacher: `Qwen/Qwen3-VL-30B-A3B-Instruct`
- Judge (label QA): `google/gemma-3-27b-it`
- Base vision: `Salesforce/blip2-opt-2.7b` (ViT-G + Q-Former)
- Base LLM: `Qwen/Qwen3-4B`
- Recipe: MiniGPT-4 (stage-1 alignment), LLaVA-1.5 LoRA (stage-2 instruction tuning), and teacher-student distillation from a frontier VLM with cross-judge validation.
## License

Apache 2.0 for the trained deltas (LoRA + projector). Base models retain their original licenses: Salesforce/blip2-opt-2.7b (BSD-3), Qwen/Qwen3-4B (Apache 2.0).