vqwen-qformer-tiktok (v7)

A MiniGPT-4-style vision-language model specialized for TikTok "sludge" content detection with grounded, factually-accurate explanations.

Loads as a stock Blip2ForConditionalGeneration — no trust_remote_code.

Architecture

  • Vision tower: Salesforce/blip2-opt-2.7b ViT-G/14-224 (frozen)
  • Q-Former: Salesforce/blip2-opt-2.7b 32 pretrained query tokens (frozen)
  • Linear projector: 768 → 2560 (trained)
  • LLM: Qwen/Qwen3-4B with r=16 LoRA merged in (trained)
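
The only trained connector between the frozen vision stack and the LLM is the linear projector. A minimal sketch of the data flow, using the shapes listed above (variable names are illustrative, not the checkpoint's module names):

```python
import torch
from torch import nn

# The frozen Q-Former emits 32 query tokens of width 768 per image (see above).
qformer_out = torch.randn(1, 32, 768)

# Trained linear projector: 768 -> 2560 (the Qwen3-4B hidden size).
projector = nn.Linear(768, 2560)

# The projected tokens are spliced into the LLM's input embeddings.
visual_embeds = projector(qformer_out)
print(visual_embeds.shape)  # torch.Size([1, 32, 2560])
```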

Quick start

import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, AutoProcessor

model_id = "alpharomercoma/vqwen-qformer-tiktok"
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
).eval()
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("my_frame.jpg").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Is this sludge content? Answer yes or no."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
# Cast pixel values to match the bf16 model weights.
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Capabilities (all tested)

| Task | Prompt | Output |
| --- | --- | --- |
| Binary classify | "Is this sludge content?" | Yes. / No. |
| Layout label | "What layout type is shown?" | vertical / picture-in-picture / collage / single_scene / … |
| Describe | "Describe this frame." | 1-3 sentence scene description, grounded in visible content only |
| Explain | "Why?" (multi-turn after classify) | Structural reasoning tied to the visible panes |
| Refuse specifics | "What specific show is on the top pane?" | Acknowledges it cannot identify specific shows/games/creators from one frame |
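
The two-turn classify-then-explain flow runs through the same chat template as the Quick start. A sketch of the message structure (the assistant turn shown is illustrative; in practice you would insert the decoded first-turn output):

```python
# Turn 1: classify. Turn 2 appends the model's answer and asks "Why?".
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Is this sludge content? Answer yes or no."},
    ]},
    # Illustrative first-turn answer; replace with the real decoded output.
    {"role": "assistant", "content": [{"type": "text", "text": "Yes."}]},
    {"role": "user", "content": [{"type": "text", "text": "Why?"}]},
]
# Feed through processor.apply_chat_template(...) and generate as in Quick start.
```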

Training recipe

Stage 1 — feature alignment (projector only) on liuhaotian/LLaVA-Pretrain (558 K image-text pairs): standard MiniGPT-4 setup. Global batch 256, LR 1e-4 with cosine decay, warmup ratio 0.03, bf16, 1 epoch.

Stage 2 — TikTok distillation (v7):

  • Teacher labels: Qwen/Qwen3-VL-30B-A3B-Instruct labeled 8,487 TikTok frames (1,700 videos × up to 5 frames) with structured JSON: {sludge, layout, description, explanation}. Descriptions are grounded strictly in visible content, with zero fabricated show/game/channel names.
  • Cross-validation: google/gemma-3-27b-it judged 250 teacher↔GT disagreements and sided with the teacher on 232/250 (92.8%), consistent with ~10-12% semantic label noise in the GT under the strict sludge definition. The teacher's labels are therefore used as the classification ground truth for training.
  • Conv mix (per frame, one random task):
    • 25% classify (yes/no)
    • 15% layout (category)
    • 20% describe (teacher description)
    • 35% coupled — 2-turn classify + explain, forcing internal consistency
    • 5% refuse — "what specific show" → explicit refusal template
  • LoRA: r=16 α=32 dropout=0.15 on Qwen3 (q/k/v/o/gate/up/down) = 35 M trainable.
  • Optimizer: bf16 AdamW fused, cosine warmup 0.03, LR 2e-4 (LoRA) / 2e-5 (projector)
  • Schedule: batch 128 × 10 epochs = 530 steps on an H200. Eval every 25 steps, load_best_model_at_end on eval_loss.
  • Runtime: ~12 min for training + ~1.5 h for teacher labeling (bs=8 batched).
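
The per-frame task mix above can be reproduced with a simple weighted draw; a sketch with the stated percentages (the function name is illustrative, not from the training code):

```python
import random

# Conversation-mix weights from stage 2 (percent per frame).
TASK_WEIGHTS = {
    "classify": 25,
    "layout": 15,
    "describe": 20,
    "coupled": 35,   # 2-turn classify + explain
    "refuse": 5,
}

def sample_task(rng: random.Random) -> str:
    """Draw one training task for a frame, proportional to the mix above."""
    tasks, weights = zip(*TASK_WEIGHTS.items())
    return rng.choices(tasks, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {t: 0 for t in TASK_WEIGHTS}
for _ in range(10_000):
    counts[sample_task(rng)] += 1
print(counts)  # roughly 2500 / 1500 / 2000 / 3500 / 500
```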

Evaluation — held-out 300 videos (never seen during training)

Two benchmarks:

| Benchmark | Correct | Accuracy |
| --- | --- | --- |
| vs original GT labels | 267 / 300 | 89.0 % |
| vs Gemma-3-validated cleaned labels | 290 / 300 | 96.7 % |

The 7.7-pp gap reflects measurable noise in the human labels (Gemma-3 sided with the teacher on 92.8 % of human/teacher disagreements). On the strict sludge definition, v7 is at 96.7 % classify accuracy with zero hallucinated show/game/channel mentions — the original goal of this iteration arc.
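
The headline numbers follow directly from the counts above:

```python
# Accuracy against the two label sets, and the gap in percentage points.
acc_gt = 267 / 300     # 0.89    -> 89.0 %
acc_clean = 290 / 300  # ~0.9667 -> 96.7 %
gap_pp = (acc_clean - acc_gt) * 100
print(f"{acc_gt:.1%}  {acc_clean:.1%}  gap={gap_pp:.1f} pp")  # 89.0%  96.7%  gap=7.7 pp
```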

Explanation samples (no hallucinated content)

Sludge frame (vertical layout): "Yes, this is sludge. The frame is vertically split into two distinct sections. The top pane shows an animated character holding a sign with '10' and a green mark; the bottom pane shows red splatters on a black background."

Non-sludge frame: "No, this is not sludge. The frame shows a single continuous scene without the multi-pane structure characteristic of sludge content."

"What specific show is on the top pane?" (any frame) → "I cannot reliably identify the specific show, game, or creator from this single frame. What I can see is …"

Limitations

  • English-only output.
  • Single-frame reasoning — no temporal / multi-frame understanding.
  • Will not name specific shows/games even when a human could read them off the image, because the training labels explicitly suppressed that.
  • Trained on short vertical TikTok-style content; generalization to other multi-pane media (PC streams, IPTV mosaics) is untested.

Credits

  • Teacher: Qwen/Qwen3-VL-30B-A3B-Instruct
  • Judge (label QA): google/gemma-3-27b-it
  • Base vision: Salesforce/blip2-opt-2.7b (ViT-G + Q-Former)
  • Base LLM: Qwen/Qwen3-4B
  • Recipe: MiniGPT-4 (stage-1 alignment), LLaVA-1.5 (stage-2 LoRA instruction tuning), and teacher-student distillation from a frontier VLM with cross-judge validation.

License

Apache 2.0 for the trained deltas (LoRA + projector). Base models retain their original licenses: Salesforce/blip2-opt-2.7b (BSD-3), Qwen/Qwen3-4B (Apache 2.0).
