roberta-finetune-slangs

Fine-tuned RoBERTa model for sentiment analysis of internet slang, abbreviations, and short words, based on the research paper:

Sahil Kamath, Vaishnavi Padiya, Sonia D'Silva, Nilesh Patil, Meera Narvekar. TeenSenti - A novel approach for sentiment analysis of short words and slangs.

Model description

This model is fine-tuned from a pre-trained RoBERTa transformer to classify the sentiment of sentences containing informal internet expressions such as slang, abbreviations, and short forms. The goal is to address the gap in existing sentiment analysis models, which often fail to interpret modern linguistic nuances.

Key features:

  • Handles slang and short words with contextual understanding.
  • Trained using a custom slang dictionary integrated into the dataset.
  • Outperforms the base twitter-roberta-base-sentiment model on slang-heavy datasets.
  • Designed for social media, product reviews, and informal text analysis.
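
A minimal inference sketch using the transformers pipeline API. The repository id matches this model card; the exact label names returned depend on the model config and are shown here only illustratively:

```python
# Minimal inference sketch via the transformers pipeline API. The repo id
# matches this model card; returned label names depend on the model config.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="spectre0108/roberta-finetune-slangs",
)

for text in ["Team India ftw", "I h8 that person"]:
    result = classifier(text)[0]
    print(f"{text!r} -> {result['label']} ({result['score']:.3f})")
```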

Intended uses & limitations

Intended uses

  • Sentiment classification for texts containing slang or abbreviations.
  • Social media monitoring, brand sentiment analysis, or content moderation where informal language is common.

Limitations

  • Optimized for slang/abbreviation-heavy English text; performance may degrade on formal or domain-specific corpora.
  • Slang evolves rapidly; periodic retraining is recommended for sustained accuracy.

Training and evaluation data

  • Dataset: Custom-curated TeenSenti dataset of ~20,000 sentences.
  • Each slang term is paired with both positive and negative example sentences, which were generated and verified.
  • Dataset split: 80% training / 20% testing, performed per slang term so that sentences do not overlap between splits (see the split sketch after this list).
  • Examples include terms like "ftw" ("for the win") and "h8" ("hate").
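
One way such a per-term split can be implemented is sketched below; the (sentence, label, slang_term) tuple layout and the split_per_term helper are hypothetical, not the paper's code:

```python
# Hypothetical sketch of the per-slang-term 80/20 split described above.
# `examples` is assumed to be a list of (sentence, label, slang_term) tuples.
import random
from collections import defaultdict

def split_per_term(examples, train_frac=0.8, seed=42):
    rng = random.Random(seed)
    by_term = defaultdict(list)
    for sentence, label, term in examples:
        by_term[term].append((sentence, label))

    train, test = [], []
    for rows in by_term.values():
        rng.shuffle(rows)
        cut = int(len(rows) * train_frac)
        train.extend(rows[:cut])  # 80% of each term's sentences
        test.extend(rows[cut:])   # remaining 20%; every term appears in both splits
    return train, test
```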

Training procedure

Preprocessing

  • Custom tokenizer preserving slang and short words from the slang dictionary.
  • Tokenization and text processing using Hugging Face transformers.
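
A sketch of one way to keep slang terms intact during tokenization, assuming the base checkpoint is cardiffnlp/twitter-roberta-base-sentiment (the model this card compares against) and using an illustrative slang list; the authors' exact tokenizer customization may differ:

```python
# Sketch: keep slang terms as single tokens by adding them to the vocabulary.
# The base checkpoint and slang list here are assumptions for illustration.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base)

slang_terms = ["ftw", "h8", "smh", "idk"]  # illustrative subset of the dictionary
num_added = tokenizer.add_tokens(slang_terms)
model.resize_token_embeddings(len(tokenizer))  # grow embeddings for new tokens

print(f"Added {num_added} slang tokens")
print(tokenizer.tokenize("Team India ftw"))  # 'ftw' survives as one token
```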

Training hyperparameters

  • Optimizer: AdamW
  • Learning rate schedule: triangular policy
  • Batch size: 16
  • Epochs: 4
  • Max sequence length: 128
  • Precision: float32 (mixed precision not used)
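
A training-loop sketch matching these hyperparameters, reusing `model` from the preprocessing sketch above and assuming a `train_loader` that yields tokenized batches with labels; PyTorch's CyclicLR with mode="triangular" implements the triangular policy, and the base/max learning rates and step size below are assumptions:

```python
# Training-loop sketch matching the hyperparameters above. `model` comes from
# the preprocessing sketch; `train_loader` (batch size 16, max length 128) is
# assumed to yield dicts with input_ids, attention_mask, and labels.
from torch.optim import AdamW
from torch.optim.lr_scheduler import CyclicLR

optimizer = AdamW(model.parameters(), lr=1e-5)  # peak LR is an assumption
scheduler = CyclicLR(
    optimizer,
    base_lr=1e-6,           # assumed lower bound of the triangle
    max_lr=2e-5,            # assumed upper bound
    step_size_up=500,       # assumed steps per half-cycle
    mode="triangular",
    cycle_momentum=False,   # required for Adam-family optimizers
)

model.train()
for epoch in range(4):  # Epochs: 4
    for batch in train_loader:
        optimizer.zero_grad()
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        scheduler.step()
```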

Evaluation

Compared to the base twitter-roberta-base-sentiment model:

| Example sentence   | Base model prediction     | Fine-tuned prediction      |
|--------------------|---------------------------|----------------------------|
| "Team India ftw"   | Neutral                   | Positive                   |
| "I h8 that person" | Negative (low confidence) | Negative (high confidence) |

The fine-tuned model achieves:

  • Accuracy: 0.93
  • F1-score: 0.925
  • Precision: 0.92
  • Recall: 0.93
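
These metrics can be computed with scikit-learn as sketched below; the placeholder label arrays and the "weighted" averaging are assumptions, not the paper's exact evaluation script:

```python
# Sketch of metric computation with scikit-learn; the label arrays are
# placeholders and the "weighted" averaging is an assumption.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0]  # gold labels from the test split (placeholder)
y_pred = [1, 0, 1, 0, 0]  # model predictions (placeholder)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
```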

Framework versions

  • transformers: 4.35.2
  • torch: 2.x
  • tokenizers: 0.15.0
  • datasets: 2.x

Citation

If you use this model, please cite:

@INPROCEEDINGS{10582077,
  author={Kamath, Sahil and Padiya, Vaishnavi and D'Silva, Sonia and Patil, Nilesh and Narvekar, Meera},
  booktitle={2024 International Conference on Advances in Modern Age Technologies for Health and Engineering Science (AMATHE)}, 
  title={TeenSenti - A novel approach for sentiment analysis of short words and slangs}, 
  year={2024},
  volume={},
  number={},
  pages={1-8},
  keywords={Deep learning;Sentiment analysis;Dictionaries;Accuracy;Reviews;Navigation;Oral communication;Sentiment Analysis;Slang;Short Words;NLP;FastText Embeddings},
  doi={10.1109/AMATHE61652.2024.10582077}}