BERTabaporu Toxicity Classifier
BERTabaporu Toxicity Classifier is a BERT-based model for scoring the toxicity level of a given Portuguese text string. It was trained on the Portuguese Toxicity Qwen Annotations dataset.
Details
For training, we added a classification head with a single regression output to pablocosta/bertabaporu-large-uncased. Only the classification head was trained, i.e., the rest of the model was frozen.
- Dataset: Portuguese Toxicity Qwen Annotations
- Language: Portuguese
- Number of Training Epochs: 20
- Batch size: 256
- Optimizer: torch.optim.AdamW
- Learning Rate: 3e-4
- Eval Metric: f1-score
This repository has the source code used to train this model.
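The setup described above (a frozen encoder, a single-output regression head, AdamW with a learning rate of 3e-4) can be sketched with a toy model. This is not the training code from the repository: `ToyRegressor`, the small linear layer standing in for the BERT encoder, the random data, and the MSE loss are all assumptions made for illustration.

```python
import torch
from torch import nn


class ToyRegressor(nn.Module):
    """Hypothetical stand-in: a small 'encoder' plus a single-output head."""

    def __init__(self, hidden_size: int = 16):
        super().__init__()
        # Placeholder for the pretrained encoder (assumption, not the real backbone)
        self.encoder = nn.Sequential(nn.Linear(8, hidden_size), nn.Tanh())
        # Classification head with a single regression output
        self.classifier = nn.Linear(hidden_size, 1)

    def forward(self, x):
        return self.classifier(self.encoder(x)).squeeze(-1)


torch.manual_seed(0)
model = ToyRegressor()

# Freeze the encoder so that only the head receives gradient updates
for param in model.encoder.parameters():
    param.requires_grad = False

# Hand only the trainable parameters to the optimizer
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=3e-4
)

# Random features and targets in [0, 4], purely illustrative
features = torch.randn(4, 8)
targets = torch.tensor([0.0, 1.0, 2.0, 4.0])

encoder_before = [p.detach().clone() for p in model.encoder.parameters()]
head_before = model.classifier.weight.detach().clone()

# One optimization step with an MSE loss (the loss choice is an assumption)
loss = nn.functional.mse_loss(model(features), targets)
loss.backward()
optimizer.step()

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

After the step, the encoder weights are unchanged and only the head parameters have moved, which is the behavior the frozen-backbone setup relies on.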
Evaluation Results
Confusion Matrix
| True \ Predicted | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 16673 | 611 | 28 | 1 | 0 |
| 2 | 635 | 668 | 129 | 7 | 0 |
| 3 | 107 | 303 | 300 | 53 | 0 |
| 4 | 15 | 49 | 104 | 122 | 30 |
| 5 | 3 | 5 | 10 | 31 | 116 |
- Precision: 0.6509
- Recall: 0.5809
- F1 Macro: 0.6093
- Accuracy: 0.8939
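As a sanity check, the summary metrics can be recomputed from the confusion matrix. The sketch below assumes rows are true labels, columns are predicted labels, and that precision, recall, and F1 are macro-averaged over the five classes. The card does not state this orientation; it is inferred from the fact that the recomputed values agree with the reported ones to about three decimal places.

```python
# Confusion matrix copied from the table above.
# Assumption: rows = true labels, columns = predicted labels.
confusion = [
    [16673, 611, 28, 1, 0],
    [635, 668, 129, 7, 0],
    [107, 303, 300, 53, 0],
    [15, 49, 104, 122, 30],
    [3, 5, 10, 31, 116],
]

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)

# Per-class support (row sums) and per-class prediction counts (column sums)
row_sums = [sum(row) for row in confusion]
col_sums = [sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)]
diagonal = [confusion[k][k] for k in range(n_classes)]

# Per-class metrics
per_class_precision = [diagonal[k] / col_sums[k] for k in range(n_classes)]
per_class_recall = [diagonal[k] / row_sums[k] for k in range(n_classes)]
per_class_f1 = [2 * diagonal[k] / (row_sums[k] + col_sums[k]) for k in range(n_classes)]

# Accuracy and macro averages (unweighted mean over classes)
accuracy = sum(diagonal) / total
macro_precision = sum(per_class_precision) / n_classes
macro_recall = sum(per_class_recall) / n_classes
macro_f1 = sum(per_class_f1) / n_classes

print(f"precision={macro_precision:.4f} recall={macro_recall:.4f} "
      f"f1={macro_f1:.4f} accuracy={accuracy:.4f}")
```

The per-class values show that class 1 dominates the evaluation set, which is why accuracy is much higher than the macro-averaged scores.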
Usage
Here's an example of how to use the BERTabaporu Toxicity Classifier:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("Polygl0t/portuguese-bertabaporu-large-toxicity-classifier")
model = AutoModelForSequenceClassification.from_pretrained("Polygl0t/portuguese-bertabaporu-large-toxicity-classifier")
model.to(device)
text = "Coloque aqui o seu texto ..."
encoded_input = tokenizer(text, return_tensors="pt", padding="longest", truncation=True).to(device)
with torch.no_grad():
    model_output = model(**encoded_input)

# One regression value per input text
logits = model_output.logits.squeeze(-1).float().cpu().numpy()
raw_score = logits.tolist()[0]

# Scores target the range [0, 4]; the raw regression output can fall slightly outside it.
# To convert to the range [1, 5], we simply add 1 to the score.
score = raw_score + 1

print({
    "text": text,
    "score": score,
    # Clamp to [0, 4], round to the nearest integer, then shift to [1, 5]
    "int_score": int(round(max(0, min(raw_score, 4)))) + 1,
})
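The post-processing step in the example above (clamp the raw regression output to [0, 4], round it, then shift to [1, 5]) can be isolated in a small helper. The function name `to_int_score` and the sample values are hypothetical and only illustrate the conversion.

```python
def to_int_score(raw_score: float) -> int:
    """Map a raw regression output to an integer toxicity label in [1, 5]."""
    # Clamp to the target range [0, 4] so out-of-range outputs stay valid
    clamped = max(0.0, min(raw_score, 4.0))
    # Round to the nearest integer, then shift from [0, 4] to [1, 5]
    return int(round(clamped)) + 1


# Illustrative raw outputs, including values outside [0, 4]
examples = [-0.3, 0.2, 1.7, 3.6, 5.1]
labels = [to_int_score(x) for x in examples]
print(labels)
```

Note that Python's built-in `round` rounds exact halves to the nearest even integer, so a raw score of exactly 2.5 maps to label 3 rather than 4.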
Cite as 🤗
@misc{correa2026tucano2cool,
title={{Tucano 2 Cool: Better Open Source LLMs for Portuguese}},
author={Nicholas Kluge Corr{\^e}a and Aniket Sen and Shiza Fatimah and Sophia Falk and Lennard Landgraf and Julia Kastner and Lucie Flek},
year={2026},
eprint={2603.03543},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.03543},
}
Acknowledgments
Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.
We also gratefully acknowledge the access granted to the Marvin cluster, hosted by the University of Bonn, along with the support provided by its High Performance Computing & Analytics Lab.
License
BERTabaporu Toxicity Classifier is licensed under the Apache License, Version 2.0. For more details, see the LICENSE file.