NeoDictaBERT-bilingual: Pushing the Frontier of BERT models in Hebrew
Following the success of ModernBERT and NeoBERT, we set out to train a Hebrew version of NeoBERT.
Introducing NeoDictaBERT-bilingual: a next-generation BERT-style model trained on a mixture of Hebrew and English data. See the technical report cited below for details.
Supported Context Length: 4,096 tokens (~2,700 Hebrew words)
Trained on a total of 612B tokens with a context length of 1,024, and another 122B tokens with a context length of 4,096.
This is the base model pretrained on both English and Hebrew. You can access the base model pretrained only on Hebrew here.
Sample usage:
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained('dicta-il/neodictabert-bilingual')
model = AutoModelForMaskedLM.from_pretrained('dicta-il/neodictabert-bilingual', trust_remote_code=True)
model.eval()

# English gloss: "In 1948 Ephraim Kishon completed his [MASK] in metal sculpture and art history and began publishing humorous articles"
sentence = 'בשנת 1948 השלים אפרים קישון את [MASK] בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים'
output = model(tokenizer.encode(sentence, return_tensors='pt'))

# the [MASK] token is at index 7 (counting [CLS] at index 0)
top_2 = torch.topk(output.logits[0, 7, :], 2)[1]
print('\n'.join(tokenizer.convert_ids_to_tokens(top_2)))  # should print לימודיו / הכשרתו ("his studies" / "his training")
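For quick experiments, the same masked-token prediction can also be run through the generic fill-mask pipeline. This is a minimal sketch and not part of the official usage above; it assumes the pipeline wrapper works with the model's custom code (trust_remote_code=True is still required):

from transformers import pipeline

# hedged sketch: the generic fill-mask pipeline instead of calling the model directly
fill_mask = pipeline('fill-mask', model='dicta-il/neodictabert-bilingual', trust_remote_code=True)

sentence = 'בשנת 1948 השלים אפרים קישון את [MASK] בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים'
for pred in fill_mask(sentence, top_k=2):
    print(pred['token_str'], pred['score'])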
Performance
Please see our technical report for performance metrics. The model outperforms previous SOTA models on almost all benchmarks, with a noticeable jump in QA scores, which indicates a much deeper semantic understanding.
In addition, the model shows strong results on retrieval tasks, outperforming other multilingual models of similar size. See the technical report for more details.
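As a rough illustration of how the encoder might be used for retrieval, the sketch below mean-pools the last hidden state into sentence embeddings and ranks passages by cosine similarity. This is an assumption-laden sketch, not a recipe from the report: it assumes the remote code exposes the base encoder via AutoModel and returns last_hidden_state.

from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

# assumption: the base encoder is reachable via AutoModel and returns last_hidden_state
tokenizer = AutoTokenizer.from_pretrained('dicta-il/neodictabert-bilingual')
model = AutoModel.from_pretrained('dicta-il/neodictabert-bilingual', trust_remote_code=True)
model.eval()

def embed(texts):
    # mean-pool token embeddings, ignoring padding positions
    batch = tokenizer(texts, padding=True, truncation=True, max_length=4096, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch['attention_mask'].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

query = embed(['Who was Ephraim Kishon?'])
passages = embed(['Ephraim Kishon was an Israeli satirist.', 'The Dead Sea is the lowest place on Earth.'])
scores = F.cosine_similarity(query, passages)  # higher score = more relevant passage
print(scores)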
Citation
If you use NeoDictaBERT in your research, please cite "NeoDictaBERT: Pushing the Frontier of BERT models for Hebrew":
BibTeX:
@misc{shmidman2025neodictabertpushingfrontierbert,
      title={NeoDictaBERT: Pushing the Frontier of BERT models for Hebrew},
      author={Shaltiel Shmidman and Avi Shmidman and Moshe Koppel},
      year={2025},
      eprint={2510.20386},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.20386},
}
License
This work is licensed under a Creative Commons Attribution 4.0 International License.
