SerendibLLM Writing Style Classifier 🏆

Sinhala writing style classifier that outperforms SinLlama, the previous state of the art, by 34.0 macro F1 points.

📊 Benchmark Results

| Model | Macro F1 |
|---|---|
| SerendibLLM-writing-head (ours) | 92.887% |
| SinLlama (baseline) | 58.893% |

✅ +34.0 percentage points of macro F1 over SinLlama

📰 Per-Class Results

| Category | Precision | Recall | F1 |
|---|---|---|---|
| ACADEMIC | 0.95 | 0.96 | 0.96 |
| BLOG | 0.95 | 0.90 | 0.92 |
| NEWS | 0.90 | 0.94 | 0.92 |
| CREATIVE | 0.91 | 0.91 | 0.91 |
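
As a quick sanity check, macro F1 is the unweighted mean of the per-class F1 scores. The sketch below recomputes it from the rounded values in the table above and also derives the gap to the baseline. Because the per-class figures are rounded to two decimals, the result (about 0.9275) only approximates the reported 92.887%. The variable names are illustrative and not part of the model's code.

```python
# Per-class F1 values as reported (rounded to two decimals).
per_class_f1 = {"ACADEMIC": 0.96, "BLOG": 0.92, "NEWS": 0.92, "CREATIVE": 0.91}

# Macro F1: every class counts equally, regardless of its size.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Reported headline numbers, expressed as fractions.
reported_macro_f1 = 0.92887
baseline_macro_f1 = 0.58893
gap_points = (reported_macro_f1 - baseline_macro_f1) * 100  # percentage points

print(f"macro F1 from rounded table: {macro_f1:.4f}")
print(f"gap to baseline: {gap_points:.1f} points")
```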

🏗️ Architecture

  • Base model: Chamaka8/Serendip-LLM-CPT-SFT-v2 (8B LLaMA, Sinhala CPT+SFT)
  • LoRA: r=32, alpha=64, all projection layers
  • Classifier head: Linear(4096 → 4) saved as classifier_head.pt
  • Training: 8 epochs, cosine LR schedule, MAX_LEN=256
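
To make the numbers above concrete, here is a minimal sketch that writes the LoRA settings as a plain dictionary and derives two quantities from them: the standard LoRA scaling factor (alpha / r) and the parameter count of the classifier head. The `target_modules` names are an assumption: the card only says "all projection layers", and the list below uses the usual LLaMA module names.

```python
# Hyperparameters stated in this card.
lora_settings = {
    "r": 32,
    "lora_alpha": 64,
    # Assumption: "all projection layers" maps to the usual LLaMA module names.
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
}

# Standard LoRA multiplies the low-rank update by alpha / r.
scaling = lora_settings["lora_alpha"] / lora_settings["r"]

# Linear(4096 -> 4): one weight per (feature, class) pair plus one bias per class.
HIDDEN_SIZE, NUM_CLASSES = 4096, 4
head_params = HIDDEN_SIZE * NUM_CLASSES + NUM_CLASSES

print(f"LoRA scaling: {scaling}")
print(f"classifier head parameters: {head_params}")
```

If the assumed module names match the base model, the same dictionary could be passed to `peft.LoraConfig(**lora_settings)`. The card does not state dropout or other options, so they are left out here.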

🚀 Quick Start

import torch
import torch.nn as nn
from transformers import PreTrainedTokenizerFast, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
from huggingface_hub import hf_hub_download

BASE   = "Chamaka8/Serendip-LLM-CPT-SFT-v2"
LORA   = "Chamaka8/SerendibLLM-v2-writing-head"
LABELS = ["ACADEMIC", "BLOG", "NEWS", "CREATIVE"]
TOKEN  = "your_hf_token"

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = PreTrainedTokenizerFast.from_pretrained(LORA, token=TOKEN)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

base_model = AutoModelForCausalLM.from_pretrained(BASE, token=TOKEN,
               quantization_config=bnb, device_map={"": "cuda:0"})
model = PeftModel.from_pretrained(base_model, LORA, token=TOKEN)
model.eval()

head_path = hf_hub_download(repo_id=LORA, filename="classifier_head.pt", token=TOKEN)
classifier = nn.Linear(4096, 4).to("cuda:0").float()
classifier.load_state_dict(torch.load(head_path, map_location="cpu"))
classifier.eval()

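def classify(text):
    """Return the predicted writing-style label for a single text."""
    enc = tokenizer(
        f"Classify writing style as ACADEMIC, BLOG, NEWS or CREATIVE: {text}",
        return_tensors="pt", truncation=True, max_length=256, padding="max_length"
    ).to("cuda:0")
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    hidden   = out.hidden_states[-1]                  # (batch, seq_len, 4096)
    # Index of the last non-padding token in each row (padding is on the right).
    last_idx = enc["attention_mask"].sum(dim=1) - 1
    rows     = torch.arange(hidden.size(0), device=hidden.device)
    last     = hidden[rows, last_idx].float()         # (batch, 4096)
    return LABELS[classifier(last).argmax(dim=-1).item()]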

print(classify("අද පාර්ලිමේන්තුවේ දී ගනු ලැබූ තීරණය..."))  # NEWS
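
The Quick Start pools the sequence by taking the hidden state of the last non-padding token. The sketch below isolates that step on dummy tensors so it can be checked without loading the 8B model. `pool_last_token` is a hypothetical helper name, and the tiny hidden size of 3 stands in for the real 4096. It assumes right padding, as set on the tokenizer above.

```python
import torch

def pool_last_token(hidden, attention_mask):
    """Pick the hidden state of the last non-padding token in each row.

    Assumes right padding, so the last real token sits at index
    (number of attended positions - 1).
    """
    last_idx = attention_mask.sum(dim=1) - 1
    rows = torch.arange(hidden.size(0), device=hidden.device)
    return hidden[rows, last_idx]

# Dummy batch: 2 sequences, 5 positions, hidden size 3.
hidden = torch.arange(2 * 5 * 3, dtype=torch.float32).reshape(2, 5, 3)
attention_mask = torch.tensor([[1, 1, 1, 0, 0],   # 3 real tokens, 2 pads
                               [1, 1, 1, 1, 1]])  # no padding

pooled = pool_last_token(hidden, attention_mask)
print(pooled.shape)  # one pooled vector per sequence
```

The row index is built from the batch size instead of being fixed, so the same pooling works for batched inference as well as for a single text.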

📋 Training Details

| Parameter | Value |
|---|---|
| Base model | Chamaka8/Serendip-LLM-CPT-SFT-v2 |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Epochs | 8 |
| Batch size | 8 |
| Classifier LR | 2e-4 |
| LoRA LR | 5e-5 |
| Max sequence length | 256 |
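
For a rough picture of what these settings imply, the sketch below derives the number of optimizer steps and evaluates a cosine learning-rate schedule for the two learning rates. Several details are assumptions, since the card does not state them: no warmup, decay all the way to zero, no gradient accumulation, and the last partial batch of each epoch kept.

```python
import math

TRAIN_EXAMPLES = 7596  # training split size from the Dataset section
BATCH_SIZE = 8
EPOCHS = 8
CLASSIFIER_LR = 2e-4
LORA_LR = 5e-5

# Assumption: the final partial batch is kept, so steps per epoch round up.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

def cosine_lr(step, base_lr, total=total_steps):
    """Cosine decay from base_lr at step 0 to 0 at the final step (no warmup assumed)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total))

for step in (0, total_steps // 2, total_steps):
    print(step, cosine_lr(step, CLASSIFIER_LR), cosine_lr(step, LORA_LR))
```

Both parameter groups follow the same curve shape. The classifier head simply starts four times higher than the LoRA weights.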

📚 Dataset

hf-tuner/Sinhala-writing-styles

  • 4 classes: Academic, Blog, News, Creative
  • 7596 train / 844 test (balanced)
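
The split sizes can be unpacked with a little arithmetic. This sketch assumes that "balanced" means every class has the same number of examples. Under that reading, both splits divide evenly across the four classes and the test set is one tenth of the data.

```python
TRAIN, TEST, NUM_CLASSES = 7596, 844, 4

# Assumption: "balanced" means equal class counts in each split.
train_per_class, train_remainder = divmod(TRAIN, NUM_CLASSES)
test_per_class, test_remainder = divmod(TEST, NUM_CLASSES)

# Share of all examples held out for testing.
test_fraction = TEST / (TRAIN + TEST)

print(train_per_class, test_per_class, test_fraction)
```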