NeoDictaBERT-bilingual: Pushing the Frontier of BERT models in Hebrew

Following the success of ModernBERT and NeoBERT, we set out to train a Hebrew version of NeoBERT.

Introducing NeoDictaBERT-bilingual: a next-generation BERT-style model with 0.5B parameters, trained on a mixture of Hebrew and English data. See the technical report cited below for details.

Supported Context Length: 4,096 tokens (~2,700 Hebrew words)

Trained on a total of 612B tokens with a context length of 1,024, and another 122B tokens with a context length of 4,096.
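To make use of the full 4,096-token context window, pass the usual transformers truncation arguments when tokenizing. A minimal sketch (the long_text variable and printed shape are illustrative, not from the model card):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/neodictabert-bilingual')

long_text = '...'  # any long Hebrew or English document
# cap the input at the supported context of 4,096 tokens
enc = tokenizer(long_text, truncation=True, max_length=4096, return_tensors='pt')
print(enc['input_ids'].shape)  # at most (1, 4096)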

This is the base model pretrained on both English and Hebrew. You can access the base model pretrained only on Hebrew here.

Sample usage:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/neodictabert-bilingual')
model = AutoModelForMaskedLM.from_pretrained('dicta-il/neodictabert-bilingual', trust_remote_code=True)
model.eval()

# "In 1948, Ephraim Kishon completed his [MASK] in metal sculpture and in art
# history, and began publishing humorous articles"
sentence = 'בשנת 1948 השלים אפרים קישון את [MASK] בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים'

input_ids = tokenizer.encode(sentence, return_tensors='pt')
with torch.no_grad():
    output = model(input_ids)

# locate the [MASK] token instead of hard-coding its position
mask_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
top_2 = torch.topk(output.logits[0, mask_index, :], 2)[1]
print('\n'.join(tokenizer.convert_ids_to_tokens(top_2)))  # should print לימודיו / הכשרתו
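The same prediction can also be obtained with the standard fill-mask pipeline; a minimal sketch, assuming the model's remote code is compatible with it:

from transformers import pipeline

fill_mask = pipeline('fill-mask', model='dicta-il/neodictabert-bilingual', trust_remote_code=True)
for pred in fill_mask(sentence, top_k=2):
    print(pred['token_str'], pred['score'])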

Performance

Please see our technical report for performance metrics. The model outperforms previous state-of-the-art models on almost all benchmarks, with a noticeable jump in QA scores, which indicates a much deeper semantic understanding.

In addition, the model shows strong results on retrieval tasks, outperforming other multilingual models of similar size. See the technical report for more details.
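For retrieval-style use, one common recipe is mean-pooled hidden states compared by cosine similarity. The sketch below assumes the encoder exposes a standard last_hidden_state; the pooling choice is an illustration, not necessarily the setup evaluated in the report:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/neodictabert-bilingual')
encoder = AutoModel.from_pretrained('dicta-il/neodictabert-bilingual', trust_remote_code=True)
encoder.eval()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, max_length=4096, return_tensors='pt')
    with torch.no_grad():
        # assumes the remote code returns last_hidden_state like a standard BERT encoder
        hidden = encoder(**batch).last_hidden_state
    # mean-pool over non-padding tokens
    mask = batch['attention_mask'].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

docs = embed(['משפט ראשון', 'A second sentence'])
query = embed(['שאילתה'])
print(torch.nn.functional.cosine_similarity(query, docs))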

Citation

If you use NeoDictaBERT in your research, please cite NeoDictaBERT: Pushing the Frontier of BERT models for Hebrew

BibTeX:

@misc{shmidman2025neodictabertpushingfrontierbert,
      title={NeoDictaBERT: Pushing the Frontier of BERT models for Hebrew}, 
      author={Shaltiel Shmidman and Avi Shmidman and Moshe Koppel},
      year={2025},
      eprint={2510.20386},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.20386}, 
}

License

This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
