Croatian annotated corpus: hr500k

hr500k is a reference training corpus of Croatian consisting of 900 documents, 24 794 sentences, and 506 457 tokens. It extends earlier training corpora for Croatian, such as SETimes.HR and SETimes.HR+.
The corpus is manually annotated on the following levels:

  • Token, sentence, and document segmentation
  • Morphosyntax
  • Lemmas
  • Dependency syntax
  • Semantic roles
  • Named entities

The entire corpus is annotated for morphosyntax and lemmas. The morphosyntactic tags follow the revised MULTEXT-East V5 tagset for Croatian and Serbian.
Dependency syntax is annotated according to the Universal Dependencies v2 specification (UDv2) and covers the first two fifths of hr500k, i.e. the first 197 028 tokens of the corpus.
Semantic roles are annotated in the oldest part of the corpus, namely the first 163 documents / 83 630 tokens, which come from the original SETimes.HR corpus.
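To put the partial annotation layers in proportion, the token counts quoted above can be compared against the corpus total. The following is a minimal sketch that uses only the figures stated on this page; the constant and function names are illustrative and not part of any corpus tooling.

```python
# Token counts as stated in the corpus description.
TOTAL_TOKENS = 506_457   # whole hr500k corpus
DEP_TOKENS = 197_028     # tokens with dependency syntax (UDv2)
SRL_TOKENS = 83_630      # tokens with semantic role annotation


def coverage(part: int, total: int = TOTAL_TOKENS) -> float:
    """Return the share of the corpus covered by an annotation layer."""
    return part / total


if __name__ == "__main__":
    print(f"Dependency syntax: {coverage(DEP_TOKENS):.1%}")
    print(f"Semantic roles:    {coverage(SRL_TOKENS):.1%}")
```

Under these figures the dependency layer covers roughly 38.9% of the tokens, which matches the "two fifths" description, and the semantic role layer covers roughly 16.5%.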
Named entity annotations cover the entire hr500k and are encoded in the IOB2 format. Five entity types are distinguished: people (PER), person derivatives (DERIV-PER), locations (LOC), organizations (ORG), and miscellaneous entities (MISC).
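In IOB2, the first token of every entity is tagged B-TYPE, each following token of the same entity is tagged I-TYPE, and tokens outside any entity are tagged O. The sketch below shows one way to turn such a tag sequence into entity spans. It assumes tags are plain strings such as "B-PER" or "I-DERIV-PER"; the exact casing and column layout in the distributed files may differ, and the function name and example sentence are illustrative rather than taken from the corpus.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def iob2_to_spans(tags: List[str]) -> List[Span]:
    """Group a sequence of IOB2 tags into entity spans."""
    spans: List[Span] = []
    ent_type, start = None, None
    for i, tag in enumerate(tags):
        # Split only on the first hyphen so that "B-DERIV-PER" yields
        # prefix "B" and label "DERIV-PER".
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == ent_type
        if not continues:
            # Close the entity that was open, if any.
            if ent_type is not None:
                spans.append((ent_type, start, i))
            # "B-" opens a new entity; a stray "I-" with no matching
            # open entity is leniently treated as an opening tag too.
            if prefix in ("B", "I"):
                ent_type, start = label, i
            else:
                ent_type, start = None, None
    if ent_type is not None:
        spans.append((ent_type, start, len(tags)))
    return spans


if __name__ == "__main__":
    # Illustrative sentence, not taken from hr500k.
    tokens = ["Ivan", "Horvat", "živi", "u", "Zagrebu", "."]
    tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
    for ent, s, e in iob2_to_spans(tags):
        print(ent, tokens[s:e])
```

Spans use an exclusive end index, so `tokens[start:end]` recovers the entity text directly.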

Authors
Nikola Ljubešić, Željko Agić, Filip Klubička, Vuk Batanović, Tomaž Erjavec

Availability
For local use, a full-text version of hr500k can be downloaded from the CLARIN.SI repository. The corpus can also be accessed via the NoSketch Engine, as well as via KonText.

Publications
The compilation of the corpus is described in the following paper:
Nikola Ljubešić, Željko Agić, Filip Klubička, Vuk Batanović, and Tomaž Erjavec (2018). hr500k – A Reference Training Corpus of Croatian. In Proceedings of the Conference on Language Technologies & Digital Humanities 2018 (JT-DH 2018), pp. 154-161, Ljubljana, Slovenia.