SETimes.SR is a reference training corpus of Serbian texts collected from the SETimes parallel news corpus.
It contains 163 documents, comprising 3891 sentences and 86 726 tokens.
The corpus is manually annotated on the following levels:
- Token, sentence, and document segmentation
- Morphosyntax
- Lemmas
- Dependency syntax
- Named entities
The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Bosnian, Croatian and Serbian, available here.
Dependency syntax is annotated according to the Universal Dependencies v2 (UDv2) specification.
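To illustrate what UDv2 dependency annotation looks like in practice, the sketch below reads one sentence in CoNLL-U, the ten-column tab-separated format commonly used for UD treebanks, and lists its dependency edges. It assumes the annotations are available in CoNLL-U, which is not stated above. The sample sentence, the names `SAMPLE`, `parse_conllu_sentence`, and `dependency_edges`, and the underscores in the XPOS and FEATS columns are illustrative and are not taken from the corpus.

```python
# Column names of the CoNLL-U format, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Made-up example sentence (NOT from SETimes.SR); XPOS and FEATS are left as "_".
SAMPLE = (
    "# text = Vlada je odobrila budžet.\n"
    "1\tVlada\tvlada\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tje\tbiti\tAUX\t_\t_\t3\taux\t_\t_\n"
    "3\todobrila\todobriti\tVERB\t_\t_\t0\troot\t_\t_\n"
    "4\tbudžet\tbudžet\tNOUN\t_\t_\t3\tobj\t_\t_\n"
    "5\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
)


def parse_conllu_sentence(block):
    """Return one dict per syntactic word in a single CoNLL-U sentence block."""
    rows = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        row = dict(zip(FIELDS, cols))
        if not row["id"].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        rows.append(row)
    return rows


def dependency_edges(rows):
    """Return (dependent form, relation, head form) triples; head 0 is ROOT."""
    forms = {int(r["id"]): r["form"] for r in rows}
    forms[0] = "ROOT"
    return [(r["form"], r["deprel"], forms[int(r["head"])]) for r in rows]


edges = dependency_edges(parse_conllu_sentence(SAMPLE))
for dependent, relation, head in edges:
    print(f"{dependent} --{relation}--> {head}")
```

For work beyond a quick inspection, a dedicated CoNLL-U reader is likely a better choice than this hand-rolled parser.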
Named entity annotations are encoded in the IOB2 format, with five NE types considered – people (PER), person derivatives (DERIV-PER), locations (LOC), organizations (ORG), and miscellaneous entities (MISC).
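In IOB2, the first token of every entity is tagged `B-TYPE`, each following token of the same entity is tagged `I-TYPE`, and tokens outside any entity are tagged `O`. The sketch below groups such tags into entity spans. The function name `iob2_to_spans`, the example sentence, and the exact tag spellings (for instance upper-case `B-PER`) are assumptions made for illustration; the spellings used in the corpus files may differ.

```python
from typing import List, Optional, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert a sequence of IOB2 tags into (type, start, end) spans.

    `start` is inclusive and `end` is exclusive, both as token indices.
    """
    spans: List[Tuple[str, int, int]] = []
    current: Optional[str] = None  # type of the entity currently open
    start = 0
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current is not None:
                spans.append((current, start, i))
            current, start = tag[2:], i
        elif tag.startswith("I-") and current == tag[2:]:
            continue  # the open entity continues
        else:
            # "O", or an I- tag that does not continue the open entity
            # (ill-formed in IOB2; it is ignored here).
            if current is not None:
                spans.append((current, start, i))
            current = None
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Made-up example sentence (NOT from SETimes.SR).
tokens = ["Ana", "Petrović", "radi", "u", "Beogradu", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]

spans = iob2_to_spans(tags)
for ne_type, begin, end in spans:
    print(ne_type, " ".join(tokens[begin:end]))
```

Because the type is taken as everything after the two-character prefix, hyphenated types such as `DERIV-PER` are handled without special treatment.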
Further information about the corpus can be found in its GitHub repository.
Vuk Batanović, Nikola Ljubešić, and Tanja Samardžić (2018). SETimes.SR – A Reference Training Corpus of Serbian. In Proceedings of the Conference on Language Technologies & Digital Humanities 2018 (JT-DH 2018), pp. 11-17, Ljubljana, Slovenia.
Additional information regarding the UD annotation of this corpus is available in the following paper:
Tanja Samardžić, Mirjana Starović, Željko Agić, and Nikola Ljubešić (2017). Universal Dependencies for Serbian in Comparison with Croatian and Other Slavic Languages. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing. Valencia, Spain.