YOLOv8n Face Detection — Bengali Video Caption Pipeline

Model Summary

Fine-tuned YOLOv8n for face detection in Bengali video frames.
Part of a Bengali video auto-captioning system that combines:

This model → detects visible faces in frames
Tmanna/whisper-bengali-final → Bengali speech-to-text
Rule-based visibility filter → shows caption only when face is on screen

Pipeline Architecture

Video
  ↓
Audio ──→ Whisper (Tmanna/whisper-bengali-final) ──→ Bengali caption text
  ↓
Frames ──→ THIS MODEL (YOLOv8 Face Detection)
  ↓
Visibility Filter (rule-based: len(boxes) > 0)
  ↓
IF face visible → overlay Bengali caption
ELSE            → skip caption for this frame
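
The flow above could be sketched in Python as follows. This is a minimal illustration, not the author's implementation: `CaptionSegment` and `captions_per_frame` are hypothetical names, and the sketch assumes that the speech-to-text step yields timestamped segments and that frames are sampled at a fixed `fps`.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class CaptionSegment:
    """Hypothetical container for one timestamped speech-to-text segment."""
    start: float  # seconds
    end: float    # seconds
    text: str


def captions_per_frame(
    segments: Sequence[CaptionSegment],
    face_visible: Sequence[bool],
    fps: float,
) -> List[Optional[str]]:
    """Return the caption to overlay on each frame, or None to skip it."""
    overlay: List[Optional[str]] = []
    for index, visible in enumerate(face_visible):
        timestamp = index / fps
        # Find the segment (if any) being spoken at this frame's timestamp.
        active = next(
            (s.text for s in segments if s.start <= timestamp < s.end), None
        )
        # Visibility rule from the diagram: caption only when a face is visible.
        overlay.append(active if visible else None)
    return overlay
```

Here `face_visible` holds one boolean per frame, produced by the face detector, so the caption text and the visibility decision stay independent until the final overlay step.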

Training Details

Parameter	Value
Base Model	`yolov8n.pt` (Ultralytics)
Dataset	lylmsc/wider-face-for-yolo-training
Classes	1 (`face`)
Epochs	50
Image Size	640 × 640
Batch Size	32
Hardware	2× NVIDIA T4 (Kaggle)
Optimizer	AdamW + Cosine LR decay
Early Stop	patience=10
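
For reference, the settings in the table map onto Ultralytics training arguments roughly as shown here. This is a sketch, not the author's training script: the dataset config path `wider_face.yaml` is a hypothetical placeholder, and mapping "Cosine LR decay" to `cos_lr=True` and the two T4 GPUs to `device=[0, 1]` is an assumption.

```python
# Training arguments reconstructed from the table (assumed mapping).
TRAIN_ARGS = {
    "data": "wider_face.yaml",  # hypothetical path to the dataset config
    "epochs": 50,
    "imgsz": 640,
    "batch": 32,
    "device": [0, 1],           # assumed: the two T4 GPUs
    "optimizer": "AdamW",
    "cos_lr": True,             # assumed to correspond to cosine LR decay
    "patience": 10,             # early stopping patience
}


def train():
    """Fine-tune yolov8n.pt with the arguments above (requires ultralytics)."""
    # Imported lazily so the configuration can be inspected without the package.
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")
    return model.train(**TRAIN_ARGS)
```

The single `face` class is declared in the dataset config rather than in the training call.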

Evaluation Results

Metric	Score
mAP@0.5	0.6994
mAP@0.5:0.95	0.3665
Precision	0.8556
Recall	0.6104
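
As a quick worked example, precision and recall can be combined into a single F1 score. The value computed here is derived from the table, not separately reported, and assumes both numbers were measured at the same confidence threshold.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.8556, 0.6104)
print(f"F1 = {f1:.4f}")  # F1 = 0.7125
```

Because recall is lower than precision, missed faces are likelier than false detections, which in this pipeline would tend to hide a caption rather than show one wrongly.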

Usage

from ultralytics import YOLO

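# Assumes this repo ID resolves to the fine-tuned weights; if it does not,
# download the .pt file first and pass its local path to YOLO() instead.
model = YOLO("Tmanna/yolov8-face-bengali-video")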

# On a video frame (numpy array or image path)
results = model("frame.jpg", conf=0.25)

faces = results[0].boxes
if len(faces) > 0:
    print("Face visible → show Bengali caption")
else:
    print("No face → skip caption")

Visibility Filter (no extra model needed)

def should_show_caption(frame_path, face_model, conf=0.25):
    results = face_model(frame_path, conf=conf, verbose=False)
    return len(results[0].boxes) > 0   # True = show, False = hide
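
Applied to a whole clip, the same rule yields one show/hide flag per frame. A minimal sketch, assuming frames are supplied as any iterable of image paths or arrays; `visibility_flags` is a hypothetical helper name, not part of the released code.

```python
from typing import Any, Callable, Iterable, List


def visibility_flags(
    frames: Iterable[Any],
    face_model: Callable[..., Any],
    conf: float = 0.25,
) -> List[bool]:
    """Run the face detector on each frame and record whether any face was found."""
    flags = []
    for frame in frames:
        results = face_model(frame, conf=conf, verbose=False)
        # Same rule as the single-frame filter: at least one box means "show".
        flags.append(len(results[0].boxes) > 0)
    return flags
```

The resulting list can then be paired with caption timestamps to decide which frames receive an overlay.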

What This Model Does NOT Handle

Module	Status	Reason
Bengali STT	✅ Handled by `Tmanna/whisper-bengali-final`	Pre-existing model
Translation	❌ Not needed	English→Bengali not required
Speaker Diarization	⚠️ Optional	Not in core pipeline
Face Tracking (DeepSORT)	⚠️ Optional	Not in core pipeline
Active Speaker Detection	🔴 Hard/Optional	Not in core pipeline
