Serbian annotated corpus: SETimes.SR

SETimes.SR is a reference training corpus of Serbian texts collected from the SETimes parallel news corpus.
It contains 163 documents divided into 3,891 sentences, or 86,726 tokens.
The corpus is manually annotated on the following levels:

  • Token, sentence, and document segmentation
  • Morphosyntax
  • Lemmas
  • Dependency syntax
  • Named entities

The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Bosnian, Croatian and Serbian.
Dependency syntax is annotated according to the Universal Dependencies v2 specification (UDv2).
Named entity annotations are encoded in the IOB2 format, with five NE types considered – people (PER), person derivatives (DERIV-PER), locations (LOC), organizations (ORG), and miscellaneous entities (MISC).
Further information about the corpus can be found in its GitHub repository.
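To illustrate the IOB2 encoding mentioned above, the following is a minimal Python sketch that groups a sentence's NE tags into entity spans. The function name `iob2_to_spans` and the example sentence are invented for illustration and are not taken from the corpus or its tooling; the sketch also assumes the tags of one sentence have already been read into a list, since the file layout is not described here.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group IOB2 tags into (entity_type, start, end) spans, end exclusive.

    Hypothetical helper: in IOB2 every entity starts with a B- tag,
    continues with I- tags of the same type, and O marks non-entity tokens.
    """
    spans = []
    current_type = None  # type of the entity currently open, if any
    start = 0
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_type is not None:
                spans.append((current_type, start, i))
            # Strip only the two-character prefix so that hyphenated
            # types such as DERIV-PER stay intact.
            current_type = tag[2:]
            start = i
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the open entity.
            continue
        else:
            # "O", or an I- tag that does not continue the open entity
            # (invalid in IOB2; here it closes the entity and is ignored).
            if current_type is not None:
                spans.append((current_type, start, i))
            current_type = None
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans


# Invented example sentence, not taken from SETimes.SR.
tokens = ["Ana", "Petrović", "živi", "u", "Beogradu", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
for ne_type, begin, end in iob2_to_spans(tags):
    print(ne_type, " ".join(tokens[begin:end]))
```

The prefix is removed by slicing rather than by splitting on the hyphen, because the DERIV-PER type itself contains a hyphen.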

Authors
Vuk Batanović, Nikola Ljubešić, Tanja Samardžić

Availability
For local use, a full-text version of SETimes.SR can be downloaded from the CLARIN.SI repository. SETimes.SR is also available in the Serbian UD treebank repository. In addition, the corpus can be accessed via the NoSketch Engine, as well as via KonText.

Publications
The compilation of the corpus is described in the following paper:
Vuk Batanović, Nikola Ljubešić, and Tanja Samardžić (2018). SETimes.SR – A Reference Training Corpus of Serbian. In Proceedings of the Conference on Language Technologies & Digital Humanities 2018 (JT-DH 2018), pp. 11-17, Ljubljana, Slovenia.

Additional information regarding the UD annotation of this corpus is available in the following paper:
Tanja Samardžić, Mirjana Starović, Željko Agić, and Nikola Ljubešić (2017). Universal Dependencies for Serbian in Comparison with Croatian and Other Slavic Languages. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, Valencia, Spain.