hf-tuner/Sinhala-writing-styles
A Sinhala writing-style classifier that substantially outperforms SinLlama, the previous state of the art, on macro F1.
| Model | Macro F1 |
|---|---|
| SerendibLLM-writing-head (ours) | 92.887% |
| SinLlama (baseline) | 58.893% |
+34.0 percentage points over SinLlama (92.887% vs. 58.893% macro F1)
| Category | Precision | Recall | F1 |
|---|---|---|---|
| ACADEMIC | 0.95 | 0.96 | 0.96 |
| BLOG | 0.95 | 0.90 | 0.92 |
| NEWS | 0.90 | 0.94 | 0.92 |
| CREATIVE | 0.91 | 0.91 | 0.91 |
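Macro F1 is the unweighted mean of the per-class F1 scores, so the headline figure can be sanity-checked against the per-category table. The following is a minimal sketch that uses the rounded two-decimal values from the table, so it only approximates the reported 92.887%, which was presumably computed from unrounded scores. The names `per_class_f1` and `macro_f1` are illustrative, not part of the released code.

```python
# Per-class F1 scores copied from the table (rounded to two decimals).
per_class_f1 = {"ACADEMIC": 0.96, "BLOG": 0.92, "NEWS": 0.92, "CREATIVE": 0.91}


def macro_f1(scores):
    """Unweighted mean of per-class F1 scores."""
    return sum(scores.values()) / len(scores)


macro = macro_f1(per_class_f1)
print(f"macro F1 from rounded table values: {macro:.4f}")

# Gap between the two reported macro F1 figures, in percentage points.
gap = 92.887 - 58.893
print(f"gap over SinLlama: {gap:.1f} percentage points")
```

Because every class counts equally in the mean, a weak class pulls the score down regardless of how many examples it has.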
import torch
import torch.nn as nn
from transformers import PreTrainedTokenizerFast, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
from huggingface_hub import hf_hub_download
BASE = "Chamaka8/Serendip-LLM-CPT-SFT-v2"
LORA = "Chamaka8/SerendibLLM-v2-writing-head"
LABELS = ["ACADEMIC", "BLOG", "NEWS", "CREATIVE"]
TOKEN = "your_hf_token"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = PreTrainedTokenizerFast.from_pretrained(LORA, token=TOKEN)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
base_model = AutoModelForCausalLM.from_pretrained(
    BASE, token=TOKEN, quantization_config=bnb, device_map={"": "cuda:0"}
)
model = PeftModel.from_pretrained(base_model, LORA, token=TOKEN)
model.eval()
head_path = hf_hub_download(repo_id=LORA, filename="classifier_head.pt", token=TOKEN)
classifier = nn.Linear(4096, 4).to("cuda:0").float()
classifier.load_state_dict(torch.load(head_path, map_location="cpu"))
classifier.eval()
def classify(text):
    enc = tokenizer(
        f"Classify writing style as ACADEMIC, BLOG, NEWS or CREATIVE: {text}",
        return_tensors="pt", truncation=True, max_length=256, padding="max_length"
    ).to("cuda:0")
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
        hidden = out.hidden_states[-1]
        # Index of the last non-padding token (valid because padding_side is "right")
        seq_len = enc["attention_mask"].sum(dim=1) - 1
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        last = hidden[batch_idx, seq_len].float()
        return LABELS[classifier(last).argmax(dim=-1).item()]

# Input means roughly: "The decision taken in Parliament today..."
print(classify("අද පාර්ලිමේන්තුවේ දී ගනු ලැබූ තීරණය..."))  # NEWS
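The usage code feeds the classifier head the hidden state of the last real (non-padding) token. With right padding, that position is the number of attended tokens minus one. The sketch below isolates this pooling step on a toy tensor so the index arithmetic can be checked on CPU without loading the model. The helper name `last_token_states` and the toy shapes are assumptions for illustration only.

```python
import torch


def last_token_states(hidden, attention_mask):
    """Pick the hidden state of the last non-padding token in each sequence.

    Assumes right padding, so (number of real tokens - 1) is the position
    of the last real token.
    """
    last_idx = attention_mask.sum(dim=1) - 1                      # (batch,)
    batch_idx = torch.arange(hidden.size(0), device=hidden.device)
    return hidden[batch_idx, last_idx]                            # (batch, hidden_dim)


# Toy batch: 2 sequences, 4 positions, hidden size 3 (the real model uses 4096).
hidden = torch.arange(2 * 4 * 3, dtype=torch.float32).reshape(2, 4, 3)
attention_mask = torch.tensor([[1, 1, 1, 0],   # 3 real tokens -> position 2
                               [1, 1, 0, 0]])  # 2 real tokens -> position 1
pooled = last_token_states(hidden, attention_mask)
print(pooled)
```

Building the batch index from `hidden.size(0)` rather than a fixed `1` lets the same indexing work for batches of more than one text.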
| Parameter | Value |
|---|---|
| Base model | Chamaka8/Serendip-LLM-CPT-SFT-v2 |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Epochs | 8 |
| Batch size | 8 |
| Classifier LR | 2e-4 |
| LoRA LR | 5e-5 |
| Max sequence length | 256 |
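The table lists separate learning rates for the classifier head and the LoRA adapter. One common way to get this in PyTorch is a single optimizer with two parameter groups. The sketch below shows that pattern under stated assumptions: the optimizer type (AdamW) is not given in the source, and the two small `nn.Linear` modules are hypothetical stand-ins for the real LoRA weights and the 4096-to-4 head.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: in the real setup these would be the trainable LoRA
# adapter weights and the 4096 -> 4 classification head.
lora_params = nn.Linear(16, 16)   # placeholder for LoRA weights
classifier = nn.Linear(16, 4)     # placeholder for the classifier head

# One optimizer, two parameter groups, each with its learning rate from the table.
# AdamW is an assumption; the source does not state which optimizer was used.
optimizer = torch.optim.AdamW([
    {"params": lora_params.parameters(), "lr": 5e-5},  # LoRA LR
    {"params": classifier.parameters(), "lr": 2e-4},   # Classifier LR
])

for group in optimizer.param_groups:
    print(group["lr"])
```

A higher rate for the freshly initialized head and a lower one for the adapter is a typical choice, since the head starts from random weights while the adapter perturbs a pretrained model.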
Base model
Chamaka8/serendib-llm-cpt-llama3-8b