YOLOv8n Face Detection: Bengali Video Caption Pipeline
Model Summary
Fine-tuned YOLOv8n for face detection in Bengali video frames (single class: face).
Part of a Bengali video auto-captioning system that combines:
- This model: detects visible faces in frames
- Tmanna/whisper-bengali-final: Bengali speech-to-text
- Rule-based visibility filter: shows the caption only when a face is on screen
Pipeline Architecture
Video
  │
  ├── Audio ──→ Whisper (Tmanna/whisper-bengali-final) ──→ Bengali caption text
  │
  └── Frames ──→ THIS MODEL (YOLOv8 Face Detection)
                   │
                   ↓
          Visibility Filter (rule-based: len(boxes) > 0)
                   │
                   ↓
          IF face visible → overlay Bengali caption
          ELSE            → skip caption for this frame
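The gating step at the bottom of the diagram can be sketched in plain Python. This is only an illustration under stated assumptions: `CaptionSegment` and `caption_for_frame` are hypothetical names, and the caption text is assumed to arrive as timed segments (start, end, text) from the speech-to-text stage. The face count per frame would come from this model.

```python
from dataclasses import dataclass


@dataclass
class CaptionSegment:
    """A timed caption chunk (hypothetical structure, not part of this repo)."""
    start: float  # seconds
    end: float    # seconds
    text: str


def caption_for_frame(frame_index, fps, segments, face_count):
    """Return the caption to overlay on this frame, or None to skip it."""
    # Visibility filter: same rule as len(boxes) > 0 in the diagram.
    if face_count <= 0:
        return None
    timestamp = frame_index / fps
    for seg in segments:
        if seg.start <= timestamp < seg.end:
            return seg.text
    return None  # a face is visible, but no caption covers this timestamp


# Placeholder captions and face counts, for illustration only.
segments = [
    CaptionSegment(0.0, 2.0, "caption A"),
    CaptionSegment(2.0, 4.0, "caption B"),
]
face_counts = [1, 0, 2, 1]  # detections on frames 0, 25, 50, 75 at 25 fps
overlays = [
    caption_for_frame(i * 25, 25.0, segments, n)
    for i, n in enumerate(face_counts)
]
print(overlays)
```

In this example the second sampled frame has no detected face, so its caption is skipped even though a segment covers that timestamp.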
Training Details
| Parameter | Value |
|---|---|
| Base Model | yolov8n.pt (Ultralytics) |
| Dataset | lylmsc/wider-face-for-yolo-training |
| Classes | 1 (face) |
| Epochs | 50 |
| Image Size | 640 × 640 |
| Batch Size | 32 |
| Hardware | 2× NVIDIA T4 (Kaggle) |
| Optimizer | AdamW + Cosine LR decay |
| Early Stop | patience=10 |
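For readers who want to reproduce a similar run, the table maps roughly onto Ultralytics `train()` keyword arguments as sketched below. This is not the original training script: the dataset YAML path `wider_face.yaml` is a hypothetical placeholder, and any setting not listed in the table is left at the library default.

```python
def build_train_args(data_yaml):
    """Collect the hyperparameters from the table as train() keyword arguments."""
    return {
        "data": data_yaml,     # dataset config file (path is an assumption)
        "epochs": 50,
        "imgsz": 640,
        "batch": 32,
        "device": [0, 1],      # two GPUs, matching the 2x T4 setup
        "optimizer": "AdamW",
        "cos_lr": True,        # cosine learning-rate decay
        "patience": 10,        # early stopping
    }


def train(data_yaml="wider_face.yaml"):  # hypothetical path
    # Imported lazily so build_train_args() can be inspected without ultralytics.
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")
    return model.train(**build_train_args(data_yaml))


# train()  # uncomment to launch training
```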
Evaluation Results
| Metric | Score |
|---|---|
| mAP@0.5 | 0.6994 |
| mAP@0.5:0.95 | 0.3665 |
| Precision | 0.8556 |
| Recall | 0.6104 |
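Precision and recall can be combined into a single F1 score as a quick sanity check. The snippet below only does arithmetic on the two values reported above; it is not an additional evaluation result.

```python
precision = 0.8556
recall = 0.6104

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")
```

The result is roughly 0.71. Because precision is well above recall, the detector is more likely to miss a face than to report a false one, so in this pipeline errors would tend to show up as skipped captions rather than captions shown on face-free frames.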
Usage
from ultralytics import YOLO

# Load the fine-tuned face detector
model = YOLO("Tmanna/yolov8-face-bengali-video")

# Run detection on a single frame with a 0.25 confidence threshold
results = model("frame.jpg", conf=0.25)
faces = results[0].boxes

if len(faces) > 0:
    print("Face visible → show Bengali caption")
else:
    print("No face → skip caption")
Visibility Filter (no extra model needed)
def should_show_caption(frame_path, face_model, conf=0.25):
    """Return True if at least one face is detected in the frame."""
    results = face_model(frame_path, conf=conf, verbose=False)
    return len(results[0].boxes) > 0
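The same rule extends from a single frame to a sequence of frames. The sketch below is one possible way to do that, not part of the published pipeline: `visibility_flags`, `make_yolo_counter`, and the `every_n` frame-skipping parameter are hypothetical additions. The detector is passed in as a callable so the loop can be tried without loading the model.

```python
def visibility_flags(frames, count_faces, every_n=1):
    """Return one boolean per frame; True means a caption may be shown.

    frames:      iterable of images (for example arrays read with OpenCV)
    count_faces: callable returning the number of faces detected in a frame
    every_n:     run detection on every n-th frame, reuse the result in between
    """
    flags = []
    visible = False
    for i, frame in enumerate(frames):
        if i % every_n == 0:
            visible = count_faces(frame) > 0
        flags.append(visible)
    return flags


def make_yolo_counter(face_model, conf=0.25):
    """Wrap an Ultralytics model so it fits the count_faces interface."""
    def count_faces(frame):
        results = face_model(frame, conf=conf, verbose=False)
        return len(results[0].boxes)
    return count_faces


# Stub detector for illustration: each "frame" is just its own face count.
print(visibility_flags([0, 1, 0, 0, 1], count_faces=lambda n: n))
```

With the real model, `count_faces` would be `make_yolo_counter(model)`. Setting `every_n` above 1 trades detection cost for slower reaction when a face enters or leaves the frame.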
What This Model Does NOT Handle
| Module | Status | Reason |
|---|---|---|
| Bengali STT | ✅ Handled by Tmanna/whisper-bengali-final | Pre-existing model |
| Translation | ❌ Not needed | English→Bengali not required |
| Speaker Diarization | ⚠️ Optional | Not in core pipeline |
| Face Tracking (DeepSORT) | ⚠️ Optional | Not in core pipeline |
| Active Speaker Detection | 🔴 Hard/Optional | Not in core pipeline |