updates
README.md
CHANGED
@@ -84,10 +84,10 @@ Please reach out to Alexandra DeLucia (aadelucia at jhu.edu) or open an issue if
 The language of Twitter differs significantly from that of other domains commonly included in large language model training.
 While tweets are typically multilingual and contain informal language, including emoji and hashtags, most pre-trained
 language models for Twitter are either monolingual, adapted from other domains rather than trained exclusively on Twitter,
-or are trained on a limited amount of in-domain Twitter data.We introduce Bernice, the first multilingual RoBERTa language
+or are trained on a limited amount of in-domain Twitter data. We introduce Bernice, the first multilingual RoBERTa language
 model trained from scratch on 2.5 billion tweets with a custom tweet-focused tokenizer. We evaluate on a variety of monolingual
 and multilingual Twitter benchmarks, finding that our model consistently exceeds or matches the performance of a variety of models
-adapted to social media data as well as strong multilingual baselines, despite being trained on less data overall.We posit that it is
+adapted to social media data as well as strong multilingual baselines, despite being trained on less data overall. We posit that it is
 more efficient compute- and data-wise to train completely on in-domain data with a specialized domain-specific tokenizer.
 
 ## Training data
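A minimal usage sketch for the model described in this README, assuming it is published on the Hugging Face Hub as `jhu-clsp/bernice` (the Hub id is not stated in this diff; substitute the actual one):

```python
# Minimal sketch: load Bernice and embed a tweet with Hugging Face transformers.
# The model id below is an assumption, not taken from this diff.
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "jhu-clsp/bernice"  # assumed Hub id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)  # tweet-focused tokenizer
model = AutoModel.from_pretrained(MODEL_ID)          # multilingual RoBERTa encoder

tweet = "Just watched the #WorldCup final 🎉⚽"
inputs = tokenizer(tweet, return_tensors="pt")
outputs = model(**inputs)

# Mean-pool the last hidden states to get a single tweet-level embedding.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # (1, hidden_size)
```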