Model Card for DA-BERT_Old_News_V3

DA-BERT_Old_News_V3 is the third version of a transformer trained on Danish historical texts from the period of Danish Absolutism (1660-1849). It was developed by researchers at Aalborg University. The aim is to provide a domain-specific model that captures meaning from texts far enough removed in time that they no longer read like contemporary Danish.

The model is DanskBERT fine-tuned on a masked language modeling (MLM) task. Training data: ENO (Enevældens Nyheder Online), a corpus of news articles, announcements and advertisements from Danish and Norwegian newspapers from the period 1762 to 1848. The model was fine-tuned on version 1.0 of the ENO dataset, which consists of 4.9 million texts amounting to 474 million words. The data was created using a tailored Transkribus PyLaia model and has a word-level error rate of around 5%.

Model Details

Architecture: DanskBERT

Fine-tuning Objective: Masked Language Modeling (MLM)

Sequence Length: 512 tokens

Tokenizer: Custom WordPiece tokenizer, trained on the ENO dataset


Model Description

Third installment of the DA-Old-News models, this time trained on the full ENO dataset. It still utilizes a custom tokenizer to enhance the model's vocabulary coverage, OOV rates, and orthographic sensitivity. Consequently, the pretrained DanskBERT tokenizer was not employed, as its modern lexical inventory would have resulted in substantial token fragmentation and poor coverage of the archaic language in the training data.

  • Developed by: CALDISS, AAU
  • Shared by: Johan Heinsen
  • Model type: BERT-type architecture with custom tokenizer
  • Language(s) (NLP): Danish
  • License: MIT
  • Finetuned from model: Vesteinn/DanskBERT

Model Sources

Uses

This model is designed for:

Domain-specific masked token prediction

Embedding extraction for semantic search

Further fine-tuning; additional fine-tuning is needed to address specific use-cases.

The model is mostly intended for research purposes in the historical domain, though its use is not limited to history.

The model can also serve as a baseline for further fine-tuning a historical BERT-based language model for Danish or other Scandinavian languages, for textual or literary purposes.
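For embedding extraction, a common approach is attention-mask-aware mean pooling over the model's last hidden states. The sketch below implements only the pooling step in NumPy, assuming arrays shaped as in the transformers convention (batch, sequence, hidden); the actual model call is omitted.

```python
import numpy as np

def mean_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean-pool token embeddings into one vector per sequence, ignoring padding.

    last_hidden_state: (batch, seq_len, hidden) token embeddings.
    attention_mask:    (batch, seq_len), 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, :, None].astype(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(axis=1)   # padding positions contribute zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid division by zero
    return summed / counts

# Toy example: batch of 1, seq_len 3 (last position is padding), hidden size 2.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(hidden, mask))  # the padded position is excluded from the mean
```

The resulting vectors can then be compared with cosine similarity for semantic search.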

Direct Use

Downstream Use

Out-of-Scope Use

Bias, Risks, and Limitations

The model is heavily tied to the historical period its training data comes from. Performance on masked token prediction will vary when the model is applied to modern Danish or to other Scandinavian languages; further fine-tuning is needed for such uses. The training data consists of newspapers, so a bias towards this type of material, and therefore a particular manner of writing, is inherent to the model. Newspapers are defined by highly literal language, so performance will also vary on materials defined by figurative language. Small biases and risks also exist in the model due to errors from the creation of the corpus: as mentioned, the approximately 5% word-level error rate carries over into the pre-trained model. Further work on addressing these biases and risks is planned.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

How to Get Started with the Model

You can use this model directly with the transformers fill-mask pipeline:

from transformers import pipeline

# The model is hosted on the Hugging Face Hub as CALDISS-AAU/DA-BERT_Old_News_V3
fill_mask = pipeline("fill-mask", model="CALDISS-AAU/DA-BERT_Old_News_V3")

fill_mask(
    "I Tiisdags er hertil lykkelig hiemkommen den ene Grønlands Farer,som førtes "
    "af Commandeur Tagholm, samme har gjort en fordeelagtig [MASK], og skal have "
    "hiembragt henved 130 Qvardeeler Robbe=Spæk; Denne Fangst har de paa en Tid "
    "af 11 Dage gjort, fra den 14 til d. 25. April Syv andre fremmede Skibe "
    "som tillige med dem vare den indtrofne, have ogsaa gjort god Reyse."
)

Training Details

Training Data

Trained on the ENO (Enevældens Nyheder Online) dataset, version 1.0: 4.9 million texts from Danish and Norwegian newspapers (1762-1848), amounting to 474 million words.

Training Procedure

Preprocessing [optional]

[More Information Needed]

Training Hyperparameters

Training parameters:

    eval_strategy="steps",
    overwrite_output_dir=False,
    per_device_train_batch_size=16,    # training batch size
    gradient_accumulation_steps=4,     # accumulate gradients before updating the weights
    per_device_eval_batch_size=32,     # evaluation batch size
    logging_steps=500,
    learning_rate=5e-5,
    save_steps=10000,
    max_steps=120000,
    save_total_limit=10,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    fp16=torch.cuda.is_available(),
    warmup_ratio=0.06,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
    save_on_each_node=False,
    ddp_find_unused_parameters=False,
    optim="adamw_torch",

  • Training regime: fp16 mixed precision was used
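Assuming these values map onto a transformers.TrainingArguments call (the output_dir below is a placeholder, not the original path), the core of the configuration can be sketched as:

```python
import torch
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="da-bert-old-news-v3",   # placeholder output path
    eval_strategy="steps",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,      # effective train batch size: 16 * 4 = 64 per device
    per_device_eval_batch_size=32,
    logging_steps=500,
    learning_rate=5e-5,
    save_steps=10_000,
    max_steps=120_000,
    save_total_limit=10,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    fp16=torch.cuda.is_available(),     # fp16 mixed precision when a GPU is present
    warmup_ratio=0.06,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    optim="adamw_torch",
)
```

These arguments would then be passed to a transformers Trainer together with the model, tokenizer, and an MLM data collator.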

Speeds, Sizes, Times [optional]

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

[More Information Needed]

Results

[More Information Needed]

Summary

Model Examination [optional]

[More Information Needed]

Technical Specifications [optional]

Model Architecture and Objective

[More Information Needed]

Compute Infrastructure

[More Information Needed]

Hardware

[More Information Needed]

Software

[More Information Needed]

Citation [optional]

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Model Card Authors [optional]

SirMappel, JohanHeinsen

Model Card Contact
