A Python module comprising a tokeniser, a part-of-speech/MSD tagger, a lemmatiser, a dependency parser, and a named entity recogniser for most South Slavic languages. For Croatian and Serbian there are models for processing both standard and non-standard Internet texts. The estimated accuracy of morphosyntactic tagging for this tool is ~94%, while for lemmatisation the accuracy is ~99%. Dependency parsing has a labeled attachment score of ~0.9, while named entity recognition achieves a micro-F1 of ~0.9.
Category: Croatian
-
NLP pipeline for Croatian and Serbian
Author: Nikola Ljubešić
Publications: The experiments yielding this pipeline are described in the following paper: Nikola Ljubešić and Kaja Dobrovoljc (2019). What Does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing. Florence, Italy. pp. 29-34. [Link] [.bib]
-
ReLDI-NormTagNER-hr 2.1
ReLDI-NormTagNER-hr 2.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation, and named entity recognition of non-standard Croatian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness).
Authors: Nikola Ljubešić, Tomaž Erjavec, Vuk Batanović, Maja Miličević, Tanja Samardžić
Availability: For local use, a full-text version of the corpus can be downloaded from the CLARIN.SI repository.
Publication: The corpus construction is (partially) described in the following paper:
Miličević, M. and N. Ljubešić (2016). Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets. Slovenščina 2.0 4(2). [Link]
-
Stemmers for Serbian and Croatian: SCStemmers
This package is a Java reimplementation of four previously published stemming algorithms for Serbian and Croatian:
- The greedy and the optimal subsumption-based stemmer for Serbian, by Vlado Kešelj and Danko Šipka
- A refinement of the greedy subsumption-based stemmer, by Nikola Milošević
- A “Simple stemmer for Croatian v0.1”, by Nikola Ljubešić and Ivan Pandžić
All the stemmers expect the input text to be formatted in UTF-8. Their outputs are also UTF-8 encoded.
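The general idea behind rule-based suffix stripping can be illustrated with a minimal Python sketch. This is not a port of any of the four algorithms in SCStemmers (which is a Java package); the suffix list, the minimum stem length, and the names `strip_suffix` and `stem_text` are assumptions chosen only for illustration.

```python
# Hypothetical suffix list, for illustration only; the published stemmers
# rely on much larger, carefully derived rule sets.
SUFFIXES = ["ama", "ima", "om", "em", "a", "e", "i", "u"]


def strip_suffix(word, min_stem=3):
    """Remove the longest listed suffix that leaves at least `min_stem` characters."""
    lowered = word.lower()
    # Try longer suffixes first so that "ama" wins over "a".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem:
            return lowered[: -len(suffix)]
    return lowered


def stem_text(text):
    """Stem every whitespace-separated token of a (Unicode) string."""
    return [strip_suffix(token) for token in text.split()]
```

Python strings are Unicode, so text read from a UTF-8 file (for example with `open(path, encoding="utf-8")`) can be passed to `stem_text` directly.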
Author: Vuk Batanović
Availability: The package and more extensive documentation can be downloaded from the SCStemmers GitHub repository.
Publications: The SCStemmers package was introduced in:
Vuk Batanović, Boško Nikolić, Milan Milosavljević (2016). Reliable Baselines for Sentiment Analysis in Resource-Limited Languages: The Serbian Movie Review Dataset. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pp. 2688-2696, Portorož, Slovenia. [Link] [.bib]
The original papers describing each implemented stemming algorithm are:
- For the greedy and the optimal subsumption-based stemmer for Serbian: Vlado Kešelj, Danko Šipka (2008). A Suffix Subsumption-Based Approach to Building Stemmers and Lemmatizers for Highly Inflectional Languages with Sparse Resources. Infotheca 9(1-2), pp. 23a-33a. [Link]
- For the refinement of the greedy subsumption-based stemmer: Nikola Milošević (2012). Stemmer for Serbian language. arXiv preprint arXiv:1209.4471. [Link]
- For the “Simple stemmer for Croatian v0.1”: Nikola Ljubešić, Damir Boras, Ozren Kubelka (2007). Retrieving Information in Croatian: Building a Simple and Efficient Rule-Based Stemmer. Digital Information and Heritage, pp. 313–320. [Link]
-
Croatian and Serbian lemmatiser [legacy]
This tool is considered a legacy tool because the NLP pipeline achieves better results on the same task; the pipeline, however, is not yet available as a web service.
A tool for automatic lemmatisation (returning the base or dictionary form of an inflected word). The tool first looks a word up in the hrLex/srLex lexicons; for out-of-vocabulary words (OOVs) it falls back on a predictive model trained on available corpora and lexicons.
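The lookup-then-predict strategy described above can be sketched as follows. This is a simplified illustration, not the lemmatiser's actual code: the toy lexicon, the MSD tags, keying the lookup on a (word form, MSD) pair, and the suffix-rewrite rules standing in for the trained predictive model are all assumptions.

```python
# Toy lexicon keyed by (word form, MSD); entries are illustrative,
# not copied from hrLex or srLex.
LEXICON = {
    ("kuće", "Ncfsg"): "kuća",
    ("gradovima", "Ncmpd"): "grad",
}

# Stand-in for the predictive model: (suffix to remove, suffix to add) rules.
FALLBACK_RULES = [("ovima", ""), ("ama", "a"), ("e", "a")]


def lemmatise(form, msd):
    """Return the lexicon lemma if known, otherwise guess one for the OOV form."""
    lowered = form.lower()
    if (lowered, msd) in LEXICON:
        return LEXICON[(lowered, msd)]
    # Out-of-vocabulary word: apply the first matching rewrite rule.
    for remove, add in FALLBACK_RULES:
        if lowered.endswith(remove) and len(lowered) > len(remove):
            return lowered[: len(lowered) - len(remove)] + add
    return lowered
```

For example, `lemmatise("kuće", "Ncfsg")` is answered from the lexicon, while `lemmatise("školama", "Ncfpd")` goes through the fallback rules.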
Author: Nikola Ljubešić
Availability: The lemmatiser is freely available in three forms:
- For local use, the code and models of the lemmatiser can be downloaded from this GitHub repository.
- The lemmatiser web service can be used online, via our web interface that can be found here.
- Our web service can be accessed from our Python library, which can also be downloaded from the CLARIN.SI GitHub repository. Instructions on how to install the ReLDI library from GitHub can be found here (in Serbian). Alternatively, the easiest way to install it is through PyPI from the command line. (Detailed instructions are also on GitHub.)
The third option, i.e. using the ReLDI Python library, is the recommended one for handling larger amounts of data.
-
Croatian annotated corpus: hr500k
hr500k is a reference training corpus of Croatian that consists of 900 documents divided into 24 794 sentences, or 506 457 tokens. It is an extension of previous training corpora for Croatian, such as SETimes.HR and SETimes.HR+.
The corpus is manually annotated on the following levels:
- Token, sentence, and document segmentation
- Morphosyntax
- Lemmas
- Dependency syntax
- Semantic roles
- Named entities
The entire corpus was annotated with regard to morphosyntax and lemmas. The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here.
Dependency syntax is annotated according to the Universal Dependencies v2 specification (UDv2) and covers roughly the first two fifths of hr500k, i.e. the first 197 028 tokens of the corpus.
Semantic roles are annotated in the oldest part of the corpus, namely the first 163 documents / 83 630 tokens, which come from the original SETimes.HR corpus.
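The named entity level listed above is stored in hr500k as one IOB2 tag per token. A minimal Python sketch of how such tags can be grouped back into entity spans is given here; the function name `iob2_to_spans` and the example sentence are illustrative assumptions, not material taken from the corpus.

```python
def iob2_to_spans(tokens, tags):
    """Group IOB2 tags into (entity_type, start, end_exclusive, text) spans.

    `tags` is a list such as ["B-PER", "I-PER", "O"]. An I- tag that does
    not continue an open entity of the same type is ignored.
    """
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing entity
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((etype, start, i, " ".join(tokens[start:i])))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


tokens = ["Ivan", "Horvat", "živi", "u", "Zagrebu", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
spans = iob2_to_spans(tokens, tags)
```

Because the type is everything after the two-character prefix, a hyphenated type such as DERIV-PER is handled the same way as PER or LOC.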
Named entity annotations cover the entire hr500k and are encoded in the IOB2 format, with five NE types considered: people (PER), person derivatives (DERIV-PER), locations (LOC), organizations (ORG), and miscellaneous entities (MISC).
Authors: Nikola Ljubešić, Željko Agić, Filip Klubička, Vuk Batanović, Tomaž Erjavec
Availability: For local use, a full-text version of hr500k can be downloaded from the CLARIN.SI repository. The corpus can also be accessed via the NoSketch Engine, as well as via KonText.
Publications: The compilation of the corpus is described in the following paper:
Nikola Ljubešić, Željko Agić, Filip Klubička, Vuk Batanović, and Tomaž Erjavec (2018). hr500k – A Reference Training Corpus of Croatian. In Proceedings of the Conference on Language Technologies & Digital Humanities 2018 (JT-DH 2018), pp. 154-161, Ljubljana, Slovenia. [Link]
-
Croatian lexicon: hrLex
hrLex is an inflectional lexicon of Croatian.
The size of the lexicon is 164,206 lemmas, or 6,427,709 4,970,520 surface forms.
Each entry in the lexicon is an 8-tuple: (word form, lemma, MSD, MSD features, UPOS, morphological features, absolute frequency, in-million frequency). The frequencies were estimated on the Croatian web corpus hrWaC.
The set of morphosyntactic tags used in the lexicon follows the MULTEXT-East V6 tagset for the Serbo-Croatian macro-language, available here.
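The 8-tuple structure of an entry can be modelled as in the following Python sketch. It assumes that the raw text file holds one tab-separated entry per line, which is not stated above, and it reads "in-million frequency" as occurrences per million corpus tokens. The sample line used in the usage note, the class name `LexEntry`, and the helper names are hypothetical.

```python
from typing import NamedTuple


class LexEntry(NamedTuple):
    form: str
    lemma: str
    msd: str
    msd_features: str
    upos: str
    morph_features: str
    abs_freq: int
    freq_per_million: float


def parse_entry(line):
    """Parse one lexicon line, assuming eight tab-separated fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    return LexEntry(*fields[:6], int(fields[6]), float(fields[7]))


def to_per_million(abs_freq, corpus_tokens):
    """Convert an absolute frequency to a per-million-token frequency."""
    return abs_freq * 1_000_000 / corpus_tokens
```

With a made-up line such as `"kuće\tkuća\tNcfsg\tType=common\tNOUN\tCase=Gen|Gender=Fem|Number=Sing\t1400\t1.0"`, `parse_entry` yields an entry whose `lemma` is `kuća`; a word seen 1,400 times in a 1.4-billion-token corpus has a per-million frequency of 1.0.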
Authors: Nikola Ljubešić
Availability: For local use, hrLex can be downloaded as a raw text file here.
hrLex can also be accessed and queried via our web services, which can also be used as an API (application programming interface).
Publications: The lexicon and its construction process are described in detail in the following paper:
Nikola Ljubešić, Filip Klubička, Željko Agić, Ivo-Pavao Jazbec (2016). New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). Portorož, Slovenia. [Link] [.bib]
-
Croatian web corpus: hrWaC
hrWaC is a web corpus collected from the .hr top-level domain. The 2.1 version of the corpus contains 1.4 billion tokens. The corpus is automatically annotated on the diacritic restoration, morphosyntax and lemma layers. The dependency syntax layer will be added in version 2.2.
The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here.
Authors: Nikola Ljubešić, Filip Klubička
Availability: For local use, a full-text version of hrWaC can be downloaded here.
hrWaC can also be accessed and queried online, via the noSketchEngine web interface available here.
-
Diacritic restoration tool
A tool for automatic diacritic restoration on text with potentially missing diacritics (e.g. it turns kuca into kuća if necessary). Reported accuracy of the tool: 99.5% on standard language and 99.2% on non-standard language.
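As a rough illustration of the task, the following Python sketch restores diacritics by mapping each diacritic-free form to the spelling most often seen for it in training text. This is a deliberately simple unigram baseline and an assumption of this example, not the method used by the tool: it ignores context, so it cannot tell when kuca should stay kuca. The function names and the tiny training list are hypothetical.

```python
from collections import Counter, defaultdict

# Map Croatian/Serbian Latin letters with diacritics to their plain counterparts.
STRIP = str.maketrans("čćžšđČĆŽŠĐ", "cczsdCCZSD")


def strip_diacritics(word):
    return word.translate(STRIP)


def build_model(training_tokens):
    """Map each diacritic-free form to its most frequent attested spelling."""
    counts = defaultdict(Counter)
    for token in training_tokens:
        counts[strip_diacritics(token)][token] += 1
    return {plain: forms.most_common(1)[0][0] for plain, forms in counts.items()}


def restore(tokens, model):
    """Replace each token by its preferred spelling; unknown tokens pass through."""
    return [model.get(token, token) for token in tokens]


model = build_model(["kuća", "kuća", "kuca", "šuma"])
restored = restore(["kuca", "suma", "pas"], model)
```

Here `restored` becomes `["kuća", "šuma", "pas"]`: the first two tokens are rewritten because the diacritised spelling dominates in the training list, and the unseen token is left unchanged.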
Authors: Nikola Ljubešić, Tomaž Erjavec, Darja Fišer
Availability: The tool is freely available in two forms:
- The code and models of the tool can be downloaded from this GitHub repository.
- Our web service can be accessed from our Python library, which can also be downloaded from the CLARIN.SI GitHub repository. Instructions on how to install the ReLDI library from GitHub can be found here (in Serbian). Alternatively, the easiest way to install it is through PyPI from the command line. (Detailed instructions are also on GitHub.)
The second option, i.e. using the ReLDI Python library, is the recommended one for handling larger amounts of data.
-
Croatian and Serbian tokeniser [legacy]
This tool is considered a legacy tool because the NLP pipeline achieves better results on the same task; the pipeline, however, is not yet available as a web service.
A tool for automatic tokenisation (dividing text into words and sentences). It was engineered through iterative runs on representative datasets and features modes for both standard and non-standard language.
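The two steps named above, splitting text into words and grouping words into sentences, can be sketched with a small regular-expression tokeniser in Python. The pattern, the special handling of URLs, mentions and hashtags (as a nod to non-standard text), and the function names are assumptions of this example and do not reproduce the rules of the actual tokeniser.

```python
import re

# Order matters: URLs and @mentions/#hashtags are tried before plain words,
# and any remaining non-space character becomes a token of its own.
TOKEN_RE = re.compile(r"https?://\S+|[@#]\w+|\w+(?:[-']\w+)*|[^\w\s]")


def tokenise(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def split_sentences(tokens):
    """Group tokens into sentences, closing one at '.', '!' or '?'."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # keep a trailing sentence without final punctuation
        sentences.append(current)
    return sentences
```

A naive rule like this splits after every full stop, including those in abbreviations and ordinal numbers, which is one reason why engineered tokenisers are tuned on representative data.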
Authors: Nikola Ljubešić, Tomaž Erjavec
Availability: The tokeniser is freely available in three forms:
- For local use, the tokeniser can be downloaded from this GitHub repository.
- The tokeniser can be used online, via our web interface that can be found here.
- Our web service can be accessed from our Python library, which can also be downloaded from the CLARIN.SI GitHub repository. Instructions on how to install the ReLDI library from GitHub can be found here (in Serbian). Alternatively, the easiest way to install it is through PyPI from the command line. (Detailed instructions are also on GitHub.)
The third option, i.e. using the ReLDI Python library, is the recommended one for handling larger amounts of data.
-
Croatian and Serbian part of speech (POS) and morphosyntactic (MSD) tagger [legacy]
This tool is considered a legacy tool because the NLP pipeline achieves better results on the same task; the pipeline, however, is not yet available as a web service.
A tool for automatic annotation on the morphosyntactic level. It is capable of tagging both Croatian and Serbian as models for both languages are present in the tool.
The tagger is based on the CRF algorithm trained on a 500,000-token Croatian training corpus and the hrLex/srLex lexicons for each respective language.
The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here.
Accuracies calculated on test sets for each language:
- Croatian: 92.53%
- Serbian: 92.33%
Author: Nikola Ljubešić
Availability: The tagger is freely available in three forms:
- For local use, the code and models of the tagger can be downloaded from this GitHub repository.
- The tagger web service can be used online, via our web interface that can be found here.
- Our web service can be accessed from our Python library, which can also be downloaded from the CLARIN.SI GitHub repository. Instructions on how to install the ReLDI library from GitHub can be found here (in Serbian). Alternatively, the easiest way to install it is through PyPI from the command line. (Detailed instructions are also on GitHub.)
The third option, i.e. using the ReLDI Python library, is the recommended one for handling larger amounts of data.
Publications: The tagger and its construction process are described in detail in the following paper:
Nikola Ljubešić, Filip Klubička, Željko Agić, Ivo-Pavao Jazbec (2016). New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). Portorož, Slovenia. [Link] [.bib]