Category: Resource type

  • Croatian annotated corpus: hr500k

    hr500k is a reference training corpus of Croatian that consists of 900 documents divided into 24 794 sentences, or 506 457 tokens. It is an extension of previous training corpora for Croatian, such as SETimes.HR and SETimes.HR+.
    The corpus is manually annotated on the following levels:

    • Token, sentence, and document segmentation
    • Morphosyntax
    • Lemmas
    • Dependency syntax
    • Semantic roles
    • Named entities

    The entire corpus was annotated with regard to morphosyntax and lemmas. The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here.
    Dependency syntax is annotated according to the Universal Dependencies v2 (UDv2) specification and covers roughly the first two fifths of hr500k, i.e. the first 197 028 tokens of the corpus.
    Semantic roles are annotated in the oldest part of the corpus, namely the first 163 documents / 83 630 tokens, which come from the original SETimes.HR corpus.
    Named entity annotations cover the entire hr500k and are encoded in the IOB2 format, with five NE types considered – people (PER), person derivatives (DERIV-PER), locations (LOC), organizations (ORG), and miscellaneous entities (MISC).
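
    The IOB2 encoding mentioned above can be read back into entity spans roughly as follows. This is a minimal sketch: the function name `iob2_to_spans` is hypothetical, and the sample tag sequence is made up for illustration rather than taken from hr500k.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert a sentence's IOB2 tags into (type, start, end) spans; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # A B- tag, an O tag, or an I- tag of a different type closes any open entity.
        closes = (
            tag.startswith("B-")
            or tag == "O"
            or (tag.startswith("I-") and tag[2:] != current)
        )
        if closes:
            if current is not None:
                spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
        elif tag.startswith("I-") and current is None:
            # An I- tag without a matching B- is invalid in strict IOB2;
            # this sketch leniently treats it as the start of an entity.
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Illustrative tags for a six-token sentence (hypothetical example).
example_tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
print(iob2_to_spans(example_tags))  # [('PER', 0, 2), ('LOC', 4, 5)]
```

    Because the entity type is everything after the two-character prefix, hyphenated types such as DERIV-PER are handled without special cases.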

    Authors
    Nikola Ljubešić, Željko Agić, Filip Klubička, Vuk Batanović, Tomaž Erjavec
    Availability
    For local use, a full-text version of hr500k can be downloaded from the CLARIN.SI repository. The corpus can also be accessed via the NoSketch Engine, as well as via KonText.
    Publications
    The compilation of the corpus is described in the following paper:
    Nikola Ljubešić, Željko Agić, Filip Klubička, Vuk Batanović, and Tomaž Erjavec (2018). hr500k – A Reference Training Corpus of Croatian. In Proceedings of the Conference on Language Technologies & Digital Humanities 2018 (JT-DH 2018), pp. 154-161, Ljubljana, Slovenia. [Link]
  • Serbian lexicon: srLex

    srLex is an inflectional lexicon of Serbian.
    The size of the lexicon is 169,328 lemmas, or 6,905,941 surface forms.
    Each entry in the lexicon consists of a (word form, lemma, MSD, MSD features, UPOS, morphological features, absolute frequency, in-million frequency) 8-tuple. The frequencies were estimated on the Serbian web corpus srWaC.
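
    An entry of this shape could be loaded into a typed record along the following lines. This is only a sketch: it assumes one entry per line with tab-separated fields in the order listed above, which should be checked against the downloaded file. The names `LexEntry` and `parse_entry`, and all values in the sample line, are hypothetical.

```python
from typing import NamedTuple


class LexEntry(NamedTuple):
    word_form: str
    lemma: str
    msd: str
    msd_features: str
    upos: str
    morph_features: str
    abs_freq: int
    per_million_freq: float


def parse_entry(line: str) -> LexEntry:
    """Parse one lexicon line, assuming eight tab-separated fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    # The first six fields stay as strings; the two frequencies are numeric.
    return LexEntry(*fields[:6], int(fields[6]), float(fields[7]))


# Made-up sample line; the feature strings and frequencies are illustrative only.
sample = "kuće\tkuća\tNcfsg\tType=common|Gender=feminine\tNOUN\tCase=Gen|Gender=Fem|Number=Sing\t1234\t2.5\n"
entry = parse_entry(sample)
print(entry.lemma, entry.upos, entry.abs_freq)
```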

    The set of morphosyntactic tags used in the lexicon follows the MULTEXT-East V6 tagset for the Serbo-Croatian macrolanguage, available here.

    Authors
    Nikola Ljubešić
    Availability
    For local use, srLex can be downloaded as a raw text file here.
    srLex can also be accessed and queried via our web services, which can also be used as an API (application programming interface).
    Publications
    The lexicon and its construction process have been described in detail in the following paper:
    Nikola Ljubešić, Filip Klubička, Željko Agić, Ivo-Pavao Jazbec (2016). New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). Portorož, Slovenia. [Link] [.bib]
  • Croatian lexicon: hrLex

    hrLex is an inflectional lexicon of Croatian.
    The size of the lexicon is 164,206 lemmas, or 6,427,709 4,970,520 surface forms.
    Each entry in the lexicon consists of a (word form, lemma, MSD, MSD features, UPOS, morphological features, absolute frequency, in-million frequency) 8-tuple. The frequencies were estimated on the Croatian web corpus hrWaC.
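
    The relation between the two frequency fields can be sketched as below. The formula is the usual per-million normalisation and is an assumption here, since the exact procedure used for the lexicon is not stated; `per_million` is a hypothetical helper, and the corpus size used in the example is simply the 1.4 billion tokens quoted for hrWaC 2.1, which may not be the version the lexicon frequencies were estimated on.

```python
def per_million(abs_freq: int, corpus_tokens: int) -> float:
    """Normalise an absolute frequency to occurrences per million corpus tokens."""
    if corpus_tokens <= 0:
        raise ValueError("corpus_tokens must be positive")
    # Multiply before dividing to keep round cases exact in floating point.
    return abs_freq * 1_000_000 / corpus_tokens


# A form seen 1,400 times in a 1.4-billion-token corpus occurs once per million tokens.
print(per_million(1_400, 1_400_000_000))  # 1.0
```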

    The set of morphosyntactic tags used in the lexicon follows the MULTEXT-East V6 tagset for the Serbo-Croatian macrolanguage, available here.

    Authors
    Nikola Ljubešić
    Availability
    For local use, hrLex can be downloaded as a raw text file here.
    hrLex can also be accessed and queried via our web services, which can also be used as an API (application programming interface).
    Publications
    The lexicon and its construction process have been described in detail in the following paper:
    Nikola Ljubešić, Filip Klubička, Željko Agić, Ivo-Pavao Jazbec (2016). New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). Portorož, Slovenia. [Link] [.bib]
  • Serbian web corpus: srWaC

    srWaC is a web corpus collected from the .rs top-level domain. The 1.1 version of the corpus contains 555 million tokens. The corpus is automatically annotated on the diacritic restoration, morphosyntax and lemma layers. The dependency syntax layer will be added in version 1.2.

    The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here.

    Authors
    Nikola Ljubešić, Filip Klubička
    Availability
    For local use, a full-text version of srWaC can be downloaded here.
    srWaC can also be accessed and queried online, via the NoSketch Engine web interface available here.
    Publications
    The compilation of the 1.0 version of the corpus is described in the following paper:
    Nikola Ljubešić, Filip Klubička (2014). {bs,hr,sr}WaC — Web corpora of Bosnian, Croatian and Serbian. Proceedings of the 9th Web as Corpus Workshop (WaC-9). Gothenburg, Sweden. [Link] [.bib]
  • Croatian web corpus: hrWaC

    hrWaC is a web corpus collected from the .hr top-level domain. The 2.1 version of the corpus contains 1.4 billion tokens. The corpus is automatically annotated on the diacritic restoration, morphosyntax and lemma layers. The dependency syntax layer will be added in version 2.2.

    The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here.

    Authors
    Nikola Ljubešić, Filip Klubička
    Availability
    For local use, a full-text version of hrWaC can be downloaded here.
    hrWaC can also be accessed and queried online, via the NoSketch Engine web interface available here.
    Publications
    The compilation of the 1.0 version of the corpus is described in the following paper:
    Nikola Ljubešić, Filip Klubička (2014). {bs,hr,sr}WaC — Web corpora of Bosnian, Croatian and Serbian. Proceedings of the 9th Web as Corpus Workshop (WaC-9). Gothenburg, Sweden. [Link] [.bib]
  • Diacritic restoration tool

    A tool for automatic diacritic restoration on text with potentially missing diacritics (e.g. it turns kuca into kuća if necessary). Reported accuracy of the tool: 99.5% on standard language and 99.2% on non-standard language.
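
    To make the task concrete, the following toy sketch restores diacritics by mapping each stripped form to its most frequent diacritised variant in a reference word list. It is a frequency-only illustration and not the tool's actual method, which is described in the paper cited below; all function names and the tiny word list are hypothetical.

```python
from collections import Counter, defaultdict

# Map diacritised letters to their plain counterparts (đ is simplified to d here).
STRIP = str.maketrans("čćžšđČĆŽŠĐ", "cczsdCCZSD")


def strip_diacritics(word: str) -> str:
    return word.translate(STRIP)


def build_table(corpus_tokens):
    """For each stripped form, remember the most frequent original form."""
    counts = defaultdict(Counter)
    for tok in corpus_tokens:
        counts[strip_diacritics(tok)][tok] += 1
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}


def restore(tokens, table):
    """Replace each token by the preferred form of its entry, if there is one."""
    return [table.get(tok, tok) for tok in tokens]


# Tiny made-up reference list in which "kuća" is more frequent than "kuca".
table = build_table(["kuća", "kuća", "kuca", "je", "u"])
print(restore(["moja", "kuca", "je"], table))  # ['moja', 'kuća', 'je']
```

    A table like this always picks the same variant, whereas "kuca" is itself a valid word; deciding whether restoration is necessary in a given sentence requires context, which this sketch does not model.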

    Authors
    Nikola Ljubešić, Tomaž Erjavec, Darja Fišer
    Availability
    The tool is freely available in two forms:
    1. The code and models of the tool can be downloaded from this GitHub repository.
    2. Our web service can be accessed from our Python library, which can also be downloaded from the CLARIN.SI GitHub repository. Instructions on how to install the ReLDI library from GitHub can be found here (in Serbian). Alternatively, the easiest way to install it is through PyPI from the command line; detailed instructions are also on GitHub.

    The second option, i.e. using the ReLDI Python library, is the recommended one for handling larger amounts of data.

    Publications
    Nikola Ljubešić, Tomaž Erjavec, and Darja Fišer (2016). Corpus-Based Diacritic Restoration for South Slavic Languages. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). Portorož, Slovenia. [Link] [.bib]
  • Croatian and Serbian tokeniser [legacy]

    This tool is considered legacy: the NLP pipeline achieves better results on the same task, although the pipeline is not yet available as a web service.

    A tool for automatic tokenisation (dividing text into words and sentences). It was engineered through iterative runs on representative datasets and features modes for both standard and non-standard language.
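
    As a rough illustration of what tokenisation produces, the sketch below splits text into word and punctuation tokens and groups them into sentences. It is a deliberately naive regex-based example, not the tool's rule set: it ignores abbreviations, URLs, emoticons and other cases that a tokeniser for standard and non-standard language has to handle. The names `TOKEN_RE`, `SENT_END` and `tokenise` are hypothetical.

```python
import re

# A token is either a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
# Punctuation that this sketch treats as always ending a sentence.
SENT_END = {".", "!", "?"}


def tokenise(text: str):
    """Return a list of sentences, each a list of token strings."""
    sentences, current = [], []
    for tok in TOKEN_RE.findall(text):
        current.append(tok)
        if tok in SENT_END:
            sentences.append(current)
            current = []
    if current:
        # Keep trailing tokens that are not followed by sentence-final punctuation.
        sentences.append(current)
    return sentences


print(tokenise("Ovo je kuća. Gdje je Ivan?"))
# [['Ovo', 'je', 'kuća', '.'], ['Gdje', 'je', 'Ivan', '?']]
```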

    Authors
    Nikola Ljubešić, Tomaž Erjavec
    Availability
    The tokeniser is freely available in three forms:
    1. For local use, the tokeniser can be downloaded from this GitHub repository.
    2. The tokeniser can be used online, via our web interface that can be found here.
    3. Our web service can be accessed from our Python library, which can also be downloaded from the CLARIN.SI GitHub repository. Instructions on how to install the ReLDI library from GitHub can be found here (in Serbian). Alternatively, the easiest way to install it is through PyPI from the command line; detailed instructions are also on GitHub.

    The third option, i.e. using the ReLDI Python library, is the recommended one for handling larger amounts of data.

  • Croatian and Serbian part of speech (POS) and morphosyntactic (MSD) tagger [legacy]

    This tool is considered legacy: the NLP pipeline achieves better results on the same task, although the pipeline is not yet available as a web service.

    A tool for automatic annotation on the morphosyntactic level. It can tag both Croatian and Serbian, as it includes models for both languages.
    The tagger is based on conditional random fields (CRF), trained on a 500,000-token Croatian training corpus and on the hrLex and srLex lexicons for the respective languages.

    The set of morphosyntactic tags used by the tagger follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here.
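
    MULTEXT-East MSD tags are positional: the first character gives the part of speech and each following character encodes one attribute. The sketch below decodes noun tags only. The attribute table is an excerpt written from memory and should be checked against the tagset documentation before use; the name `decode_noun_msd` is hypothetical.

```python
# Assumed positional layout for nouns (after the leading "N"); verify against the tagset.
NOUN_POSITIONS = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural"}),
    ("Case", {
        "n": "nominative", "g": "genitive", "d": "dative", "a": "accusative",
        "v": "vocative", "l": "locative", "i": "instrumental",
    }),
    ("Animate", {"n": "no", "y": "yes"}),
]


def decode_noun_msd(msd: str) -> dict:
    """Expand a noun MSD tag such as 'Ncfsg' into attribute-value pairs."""
    if not msd.startswith("N"):
        raise ValueError("this sketch only handles noun tags")
    features = {}
    for (name, values), code in zip(NOUN_POSITIONS, msd[1:]):
        if code == "-":
            # A hyphen marks an attribute that does not apply or is unspecified.
            continue
        # Unknown codes are kept as-is rather than guessed.
        features[name] = values.get(code, code)
    return features


print(decode_noun_msd("Ncfsg"))
# {'Type': 'common', 'Gender': 'feminine', 'Number': 'singular', 'Case': 'genitive'}
```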

    Accuracies calculated on test sets for each language:
    • Croatian: 92.53%
    • Serbian: 92.33%
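
    Accuracy figures of this kind are typically computed per token, counting a prediction as correct only when the full tag matches the gold tag. The sketch below assumes that definition, which the description above does not spell out; `tagging_accuracy` and the sample tag sequences are hypothetical.

```python
def tagging_accuracy(gold, predicted):
    """Share of tokens whose predicted tag equals the gold tag exactly."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Made-up four-token example in which one full MSD tag is wrong (case differs).
gold = ["Ncfsn", "Vmr3s", "Sl", "Ncmsl"]
pred = ["Ncfsn", "Vmr3s", "Sl", "Ncmsn"]
print(tagging_accuracy(gold, pred))  # 0.75
```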
    Author
    Nikola Ljubešić
    Availability
    The tagger is freely available in three forms:
    1. For local use, the code and models of the tagger can be downloaded from this GitHub repository.
    2. The tagger web service can be used online, via our web interface that can be found here.
    3. Our web service can be accessed from our Python library, which can also be downloaded from the CLARIN.SI GitHub repository. Instructions on how to install the ReLDI library from GitHub can be found here (in Serbian). Alternatively, the easiest way to install it is through PyPI from the command line; detailed instructions are also on GitHub.

    The third option, i.e. using the ReLDI Python library, is the recommended one for handling larger amounts of data.

    Publications
    The tagger and its construction process have been described in detail in the following paper:
    Nikola Ljubešić, Filip Klubička, Željko Agić, Ivo-Pavao Jazbec (2016). New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). Portorož, Slovenia. [Link] [.bib]