Humit-Tagger XS
The official release of the Norwegian morphology tagger Humit-Tagger as a Hugging Face model.
This specific version of the tagger is based on Norbert3-xs.
The aim of this model is to make Humit-Tagger available as a Hugging Face model with all the functionality that the original code supports. In addition to morphological tagging, this model supports Nynorsk/Bokmål language identification, provided by this repository.
This model adds four classification layers on top of the base model. These layers perform language identification, morphological classification, lemmatization classification, and sentence boundary detection.
The large version scores overall about 1% higher in accuracy than the smallest (xs) version. Depending on the available CPU/GPU power, the following sizes can be used:
Humit-tagger sizes:
The Humit-Tagger sizes follow the sizes of Norbert3.
Loading Model
This model implements custom functionality, such as the tag and identify_language functions and the helper functions they rely on.
To provide this functionality, the model uses a custom wrapper.
Therefore, the model must be loaded with trust_remote_code=True.
The model can be loaded as follows:
```python
from transformers import AutoModel

humit_tagger = AutoModel.from_pretrained("Humit-Oslo/humit-tagger-xs", trust_remote_code=True)
```
When creating the model, batch_size and device can be passed as parameters:

```python
humit_tagger = AutoModel.from_pretrained("Humit-Oslo/humit-tagger-xs", trust_remote_code=True, batch_size=16, device="cuda")
```
The batch size should be a power of 2 and can be set to higher values, such as 32 or 64, if the model is loaded on a powerful GPU. If device is not set, the model will place itself on the first CUDA device if one exists, otherwise on the CPU. A specific device can be given, such as "cuda:0" or "cuda:1".
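The device fallback described above can be sketched as follows. This is an assumption about the behaviour, not the wrapper's actual code, and the function name `pick_device` is hypothetical:

```python
def pick_device(requested=None):
    """Sketch of the default device fallback: honour an explicit device,
    otherwise prefer the first CUDA device, otherwise fall back to CPU.
    This is an illustration, not the wrapper's real implementation."""
    if requested is not None:
        return requested
    try:
        import torch  # only consulted when no device was requested
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"

print(pick_device("cuda:1"))  # cuda:1
```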
Functions and parameters
The model provides two functions: tag and identify_language. The tag function performs morphological tagging of the input. The identify_language function identifies the language of the input, returning "nn" for Nynorsk and "bm" for Bokmål. The two functions accept similar parameters.
| parameter | .tag supports | .identify_language supports | options | default | purpose |
|---|---|---|---|---|---|
| inp | yes | yes | | None | the input. No need to give the parameter name if it is passed as the first positional argument. |
| lang | yes | no | "nn", "bm", "au" | "au" | the language of the tags. "au" tries to identify the language automatically from the input. |
| input_directory | yes | yes | | None | apply the function recursively to the files in input_directory. |
| output_directory | yes | yes | | None | write the output recursively into output_directory. The written files get the extension ".tagged" or ".lang", depending on the function called. |
| one_sentence_per_line | yes | yes | True / False | False | skip sentence boundary detection and treat each line of the input or the input file(s) as a sentence. |
| lang_per_sentence | yes | no | True / False | False | identify the language per sentence and output the tags according to the language identified for that sentence. If this is not set and lang is "au", the whole input (or a whole file if input_directory is used) is used to identify the language. |
| write_output_to | yes | yes | a file path, a file handle, or "list" | sys.stdout | where to write the output. If a file path is provided, the output is written to that file, overwriting it. If a file handle is provided, the output is written there. If "list" is given, the function returns a Python list. |
| output_tsv | yes | yes | True / False | False | the output format. The default is JSON; if multiple sentences exist, each output line is a single valid JSON object, but the output as a whole is not. This option cannot be combined with write_output_to="list". |
| lang_per_item | no | yes | True / False | False | treat each item of an input list as a separate input for language identification. |
| fast_mode | no | yes | True / False | False | identify the languages of the files in the input directory in fast mode, which uses only the beginning of each file. This is much faster for many files, but less accurate than when this parameter is set to False. |
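Since the default JSON output for multiple sentences is one standalone JSON object per line (not one JSON document), it should be parsed line by line. A minimal sketch; the field names in the sample are placeholders, not the tagger's actual schema:

```python
import json

# Hypothetical multi-sentence output: one JSON object per line.
raw_output = """\
{"sentence": 1, "tokens": ["Dette", "er", "ei", "setning", "."]}
{"sentence": 2, "tokens": ["Her", "er", "ei", "til", "."]}
"""

# The output as a whole is not valid JSON, so parse each line separately.
sentences = [json.loads(line) for line in raw_output.splitlines() if line.strip()]
print(len(sentences))  # 2
```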
Several example use cases:
Tag one sentence
```python
humit_tagger.tag("Dette er en norsk setning.")
```
Tag a list of sentences
```python
humit_tagger.tag(["Dette er en norsk setning.", "Dette er en annen norsk setning."])
```
Tag a file
```python
with open("path/to/file", "r") as f:
    humit_tagger.tag(f.read())
```
Tag all files recursively in a directory
Here, input_directory and output_directory must be given as parameters. All files that can be read in text mode will be tagged, and the output will be written to output_directory with the same directory and sub-directory structure. The output files keep the original names with ".tagged" appended. Any existing files will be overwritten.
```python
humit_tagger.tag(input_directory="path/to/input/directory", output_directory="path/to/output/directory")
```
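The mirrored output tree can then be traversed to collect the tagged files. In this sketch the directory contents are fabricated for illustration; a real run of the directory-tagging call above would produce the ".tagged" files:

```python
import os
import tempfile

# Fabricate an output tree like the one the tagger would mirror.
out_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(out_dir, "sub"), exist_ok=True)
for name in ("a.txt.tagged", os.path.join("sub", "b.txt.tagged")):
    with open(os.path.join(out_dir, name), "w") as f:
        f.write("")

# Walk the tree and gather every ".tagged" output file.
tagged_files = sorted(
    os.path.join(root, name)
    for root, _, names in os.walk(out_dir)
    for name in names
    if name.endswith(".tagged")
)
print(len(tagged_files))  # 2
```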
Language identification
```python
humit_tagger.identify_language("Eg elskar snø.")
```
Language identification of multiple sentences:
```python
humit_tagger.identify_language(["Jeg elsker snø.", "Eg elskar snø."])
```
Recursive language identification of all files in a directory
```python
humit_tagger.identify_language(input_directory="path/to/input/directory")
```
Cite us
```bibtex
@inproceedings{haug-etal-2023-integrating,
    title = "Rules and neural nets for morphological tagging of {N}orwegian - Results and challenges",
    author = "Haug, Dag and
      Yildirim, Ahmet and
      Hagen, Kristin and
      N{\o}klestad, Anders",
    editor = {Alum{\"a}e, Tanel and
      Fishel, Mark},
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.43/",
    pages = "425--435",
    abstract = "This paper reports on efforts to improve the Oslo-Bergen Tagger for Norwegian morphological tagging. We train two deep neural network-based taggers using the recently introduced Norwegian pre-trained encoder (a BERT model for Norwegian). The first network is a sequence-to-sequence encoder-decoder and the second is a sequence classifier. We test both these configurations in a hybrid system where they combine with the existing rule-based system, and on their own. The sequence-to-sequence system performs better in the hybrid configuration, but the classifier system performs so well that combining it with the rules is actually slightly detrimental to performance."
}
```
Model tree for Humit-Oslo/humit-tagger-xs
Base model: ltg/norbert3-xs