Humit-Tagger XS
The official release of the Norwegian morphology tagger Humit-Tagger as a Hugging Face model.
This specific version of the tagger is based on Norbert3-xs.
The aim of this model is to make Humit-Tagger available as a Hugging Face model with all the functionality that the original code supports. In addition to morphological tagging, this model supports Nynorsk/Bokmål language identification, provided by this repository.
This model adds four classification layers on top of the base model. These layers perform language identification, morphological classification, lemmatization classification, and sentence boundary detection.
The large version scores overall about 1% higher in accuracy than the smallest (xs) version. Depending on the available CPU/GPU power, the following sizes can be used:
Humit-tagger sizes:
The Humit-Tagger sizes follow the sizes of Norbert3.
Loading Model
This model implements custom functionality, such as the tag and identify_language functions and the helper functions they rely on.
To provide this functionality, the model uses a custom wrapper.
Therefore, the model must be loaded with trust_remote_code=True.
The model can be loaded as follows:
```python
from transformers import AutoModel

humit_tagger = AutoModel.from_pretrained("Humit-Oslo/humit-tagger-xs", trust_remote_code=True)
```
When creating the model, batch_size and device can be passed as parameters:

```python
humit_tagger = AutoModel.from_pretrained("Humit-Oslo/humit-tagger-xs", trust_remote_code=True, batch_size=16, device="cuda")
```
The batch size should be a power of 2 and can be set to higher values, such as 32 or 64, if the model is loaded on a powerful GPU. If device is not set, the model will place itself on the first CUDA device if one exists, otherwise on the CPU. A specific device can be given, such as "cuda:0" or "cuda:1".
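The device fallback described above can be sketched as follows. This is an assumption about the behaviour, not the wrapper's actual code, and the function name `pick_device` is hypothetical:

```python
def pick_device(requested=None):
    """Sketch of the default device fallback: honour an explicit device,
    otherwise prefer the first CUDA device, otherwise fall back to CPU.
    This is an illustration, not the wrapper's real implementation."""
    if requested is not None:
        return requested
    try:
        import torch  # only consulted when no device was requested
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"

print(pick_device("cuda:1"))  # cuda:1
```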
Functions and parameters
The model provides two functions: tag and identify_language. The tag function performs morphological tagging of the input. The identify_language function identifies the language of the input, returning "nn" for Nynorsk and "bm" for Bokmål. The two functions accept similar parameters.
| parameter | .tag supports | .identify_language supports | options | default | purpose |
|---|---|---|---|---|---|
| inp | yes | yes | | None | the input. No need to give the parameter name if it is passed as the first positional argument. |
| lang | yes | no | "nn", "bm", "au" | "au" | the language of the tags. "au" tries to identify the language automatically from the input. |
| input_directory | yes | yes | | None | apply the function recursively to the files in input_directory. |
| output_directory | yes | yes | | None | write the output recursively into output_directory. The written files get the extension ".tagged" or ".lang", depending on the function called. |
| one_sentence_per_line | yes | yes | True / False | False | skip sentence boundary detection and treat each line of the input or the input file(s) as a sentence. |
| lang_per_sentence | yes | no | True / False | False | identify the language per sentence and output the tags according to the language identified for that sentence. If this is not set and lang is "au", the whole input (or a whole file if input_directory is used) is used to identify the language. |
| write_output_to | yes | yes | a file path, a file handle, or "list" | sys.stdout | where to write the output. If a file path is provided, the output is written to that file, overwriting it. If a file handle is provided, the output is written there. If "list" is given, the function returns a Python list. |
| output_tsv | yes | yes | True / False | False | the output format. The default is JSON; if multiple sentences exist, each output line is a single valid JSON object, but the output as a whole is not. This option cannot be combined with write_output_to="list". |
| lang_per_item | no | yes | True / False | False | treat each item of an input list as a separate input for language identification. |
| fast_mode | no | yes | True / False | False | identify the languages of the files in the input directory in fast mode, which uses only the beginning of each file. This is much faster for many files, but less accurate than when this parameter is set to False. |
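Since the default JSON output for multiple sentences is one standalone JSON object per line (not one JSON document), it should be parsed line by line. A minimal sketch; the field names in the sample are placeholders, not the tagger's actual schema:

```python
import json

# Hypothetical multi-sentence output: one JSON object per line.
raw_output = """\
{"sentence": 1, "tokens": ["Dette", "er", "ei", "setning", "."]}
{"sentence": 2, "tokens": ["Her", "er", "ei", "til", "."]}
"""

# The output as a whole is not valid JSON, so parse each line separately.
sentences = [json.loads(line) for line in raw_output.splitlines() if line.strip()]
print(len(sentences))  # 2
```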
Several example use cases:
Tag one sentence
```python
humit_tagger.tag("Dette er en norsk setning.")
```
Tag a list of sentences
```python
humit_tagger.tag(["Dette er en norsk setning.", "Dette er en annen norsk setning."])
```
Tag a file
```python
with open("path/to/file", "r") as f:
    humit_tagger.tag(f.read())
```
Tag all files recursively in a directory
Here, input_directory and output_directory must be given as parameters. All files that can be read in text mode will be tagged, and the output will be written to output_directory with the same directory and sub-directory structure. The output files keep the original names with ".tagged" appended. Any existing files will be overwritten.
```python
humit_tagger.tag(input_directory="path/to/input/directory", output_directory="path/to/output/directory")
```
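The mirrored output tree can then be traversed to collect the tagged files. In this sketch the directory contents are fabricated for illustration; a real run of the directory-tagging call above would produce the ".tagged" files:

```python
import os
import tempfile

# Fabricate an output tree like the one the tagger would mirror.
out_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(out_dir, "sub"), exist_ok=True)
for name in ("a.txt.tagged", os.path.join("sub", "b.txt.tagged")):
    with open(os.path.join(out_dir, name), "w") as f:
        f.write("")

# Walk the tree and gather every ".tagged" output file.
tagged_files = sorted(
    os.path.join(root, name)
    for root, _, names in os.walk(out_dir)
    for name in names
    if name.endswith(".tagged")
)
print(len(tagged_files))  # 2
```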
Language identification
```python
humit_tagger.identify_language("Eg elskar snø.")
```
Language identification of multiple sentences:
```python
humit_tagger.identify_language(["Jeg elsker snø.", "Eg elskar snø."])
```
Recursive language identification of all files in a directory
```python
humit_tagger.identify_language(input_directory="path/to/input/directory")
```
Cite us
```bibtex
@inproceedings{haug-etal-2023-integrating,
    title = "Rules and neural nets for morphological tagging of {N}orwegian - Results and challenges",
    author = "Haug, Dag and
      Yildirim, Ahmet and
      Hagen, Kristin and
      N{\o}klestad, Anders",
    editor = {Alum{\"a}e, Tanel and
      Fishel, Mark},
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.43/",
    pages = "425--435",
    abstract = "This paper reports on efforts to improve the Oslo-Bergen Tagger for Norwegian morphological tagging. We train two deep neural network-based taggers using the recently introduced Norwegian pre-trained encoder (a BERT model for Norwegian). The first network is a sequence-to-sequence encoder-decoder and the second is a sequence classifier. We test both these configurations in a hybrid system where they combine with the existing rule-based system, and on their own. The sequence-to-sequence system performs better in the hybrid configuration, but the classifier system performs so well that combining it with the rules is actually slightly detrimental to performance."
}
```
Model tree for Humit-Oslo/humit-tagger-xs
Base model: ltg/norbert3-xs