NLP pipeline for Croatian and Serbian

A Python module comprising a tokeniser, a part-of-speech/MSD tagger, a lemmatiser, a dependency parser, and a named entity recogniser for most South Slavic languages. For Croatian and Serbian, there are models for processing both standard and non-standard Internet texts. The estimated accuracy of morphosyntactic tagging is ~94%, and that of lemmatisation is ~99%. Dependency parsing has a labeled attachment score of ~0.9, while named entity recognition achieves a micro-F1 of ~0.9.
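For readers unfamiliar with the two scores quoted for parsing and named entity recognition, the following is a minimal sketch of how a labeled attachment score and a micro-averaged F1 are commonly computed. It is not the evaluation code used for this pipeline: the function names, the tuple layouts, and the toy data are assumptions made for illustration only.

```python
from typing import List, Set, Tuple

Arc = Tuple[int, str]            # (head index, dependency label) for one token
Entity = Tuple[int, int, str]    # (start token, end token, entity type)


def labeled_attachment_score(gold: List[Arc], pred: List[Arc]) -> float:
    """Share of tokens whose predicted head AND label both match the gold arc."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted parses must cover the same tokens")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)


def micro_f1(gold: Set[Entity], pred: Set[Entity]) -> float:
    """Micro-averaged F1: counts are pooled over all entity types first."""
    true_pos = len(gold & pred)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only (hypothetical, not from the paper).
gold_arcs = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
pred_arcs = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "nmod")]
las = labeled_attachment_score(gold_arcs, pred_arcs)

gold_ents = {(0, 1, "PER"), (4, 5, "LOC"), (7, 8, "ORG")}
pred_ents = {(0, 1, "PER"), (4, 5, "ORG"), (7, 8, "ORG")}
f1 = micro_f1(gold_ents, pred_ents)

print(f"LAS = {las:.2f}, micro-F1 = {f1:.2f}")
```

In this toy case one of four arcs has the wrong label and one of three entities has the wrong type, so both scores fall below 1. An entity counts as correct only if its span and its type both match, which is the usual strict convention.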

Author
Nikola Ljubešić
Publications
The experiments behind this pipeline are described in the following paper: Nikola Ljubešić and Kaja Dobrovoljc (2019). What Does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing. Florence, Italy. pp. 29-34.