Category: Corpora

  • Serbian short-text sentiment analysis dataset: SentiComments.SR

    The SentiComments.SR dataset includes the following three corpora:

    • The main SentiComments.SR corpus, consisting of 3490 movie-related comments
    • The movie verification corpus, consisting of 464 movie-related comments
    • The book verification corpus, consisting of 173 book-related comments

    The main SentiComments.SR corpus was built from comments written in Serbian by visitors to the kakavfilm.com movie review website. The movie verification corpus comments were sourced from two other Serbian movie review websites – gledajme.rs and happynovisad.com. The book verification corpus comments were also sourced from the happynovisad.com website. Comments whose token count (under basic whitespace tokenization) exceeded a predefined upper bound were discarded, as were comments not written in Serbian.
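    The length filter described above can be sketched as follows. This is a minimal illustration only: the function names, the bound of 5 tokens, and the sample comments are hypothetical, since the actual upper bound is not stated here, and the language filter is not shown.

```python
def within_token_limit(comment: str, max_tokens: int) -> bool:
    """Return True if the comment has at most max_tokens whitespace-separated tokens."""
    # str.split() with no arguments splits on any run of whitespace.
    return len(comment.split()) <= max_tokens


def filter_comments(comments, max_tokens):
    """Keep only the comments that do not exceed the token bound."""
    return [c for c in comments if within_token_limit(c, max_tokens)]


# Hypothetical bound and invented sample comments, for illustration only.
sample = ["Odličan film!", "Ovo je jedan veoma dugačak komentar o filmu"]
kept = filter_comments(sample, max_tokens=5)
```

    Here the second sample comment has eight whitespace-separated tokens, so only the first one is kept.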

    Six sentiment labels were used in dataset annotation: +1, -1, +M, -M, +NS, and -NS, with the addition of an ‘s’ label suffix denoting the presence of sarcasm. The annotation principles used to assign sentiment labels to items in SentiComments.SR are described in the papers listed in the Publications section. The main SentiComments.SR corpus was annotated by two annotators working together, and therefore contains a single, unified sentiment label for each comment. The verification corpora were used to evaluate the quality, efficiency, and cost-effectiveness of the annotation framework, which is why they contain separate sentiment labels from each of six annotators.
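    As a rough sketch of how such labels might be handled in code, the snippet below splits a label into its base sentiment label and a sarcasm flag. It assumes that the suffix is a lowercase 's' appended directly to the label string (e.g. "-NSs"); the actual file format may differ, so the repository documentation should be consulted. The names `parse_label` and `ParsedLabel` are hypothetical.

```python
from typing import NamedTuple

# The six base labels listed in the description above.
BASE_LABELS = {"+1", "-1", "+M", "-M", "+NS", "-NS"}


class ParsedLabel(NamedTuple):
    base: str        # one of the six sentiment labels
    sarcastic: bool  # True if the 's' suffix was present


def parse_label(raw: str) -> ParsedLabel:
    """Split a label string into its base label and sarcasm flag.

    Assumes the sarcasm marker is a lowercase 's' at the end of the string.
    """
    label = raw.strip()
    sarcastic = label.endswith("s")  # "+NS" ends in uppercase 'S', so it is not matched
    base = label[:-1] if sarcastic else label
    if base not in BASE_LABELS:
        raise ValueError(f"Unknown sentiment label: {raw!r}")
    return ParsedLabel(base, sarcastic)
```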

    Author
    Vuk Batanović
    Availability
    The corpus and its documentation can be found on the SentiComments.SR GitHub repository.
    Publications
    Vuk Batanović, Miloš Cvetanović, Boško Nikolić (2020). A versatile framework for resource-limited sentiment articulation, annotation and analysis of short texts. PLoS ONE 15(11): e0242050. [Link]
    Vuk Batanović (2020). A methodology for solving semantic tasks in the processing of short texts written in natural languages with limited resources. PhD thesis, University of Belgrade – School of Electrical Engineering. [Link]  (contains the full annotation guidelines in Serbian)
  • Serbian semantic textual similarity news corpus: STS.news.sr

    The Serbian Semantic Textual Similarity News Corpus (STS.news.sr) consists of 1192 pairs of sentences in Serbian, or around 64 thousand tokens, gathered from news sources on the web and written in the Serbian Latin script. Each sentence pair was manually annotated with fine-grained semantic similarity scores on the 0-5 scale. The final scores were obtained by averaging the individual scores of five annotators.

    The sentence pairs in this dataset were taken from the Serbian Paraphrase Corpus (paraphrase.sr). The annotation methodology generally followed the one established in the SemEval STS shared tasks (2012-2017). Annotation instructions used in the creation of the STS.news.sr corpus are available here. The STSAnno tool was used in the annotation process.

    The average annotator self-agreement score, expressed in terms of the Pearson correlation coefficient r, is 0.93. The average inter-rater correlation between an annotator and the averaged scores of all other annotators is 0.92, which is effectively the upper bound for STS model performance on this dataset.
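    The inter-rater figure above compares each annotator with the average of all the others. A minimal sketch of that computation is given below, using invented toy scores rather than corpus data. The function names are hypothetical, and self-agreement (an annotator re-scoring the same pairs) is not covered.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def leave_one_out_correlations(scores):
    """scores[i][j] is annotator i's score for sentence pair j.

    For each annotator, correlate their scores with the mean of the
    remaining annotators' scores on every pair.
    """
    results = []
    for i, own in enumerate(scores):
        others = [row for k, row in enumerate(scores) if k != i]
        others_mean = [sum(col) / len(col) for col in zip(*others)]
        results.append(pearson(own, others_mean))
    return results


# Toy scores on the 0-5 scale: 3 hypothetical annotators, 4 sentence pairs.
toy = [
    [0, 2, 4, 5],
    [1, 2, 3, 5],
    [0, 3, 4, 4],
]
per_annotator = leave_one_out_correlations(toy)
average_r = sum(per_annotator) / len(per_annotator)
```

    Averaging the per-annotator values gives a single figure of the kind reported for the corpus. A model scored against the averaged gold labels would be compared against that same figure.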

    Author
    Vuk Batanović
    Availability
    The corpus and its documentation can be found on the STS.news.sr GitHub repository.
    Publications
    Vuk Batanović, Miloš Cvetanović, Boško Nikolić (2018). Fine-grained Semantic Textual Similarity for Serbian. Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pp. 1370-1378, Miyazaki, Japan. [Link][.bib]
  • Serbian paraphrase corpus: paraphrase.sr

    The Serbian Paraphrase Corpus (paraphrase.sr) consists of 1194 pairs of sentences gathered from news sources on the web. Each sentence pair was manually annotated with a binary similarity score that indicates whether the sentences in the pair are semantically similar enough to be considered close paraphrases. The corpus contains 553 sentence pairs deemed to be semantically equivalent (46.31% of the total number), and 641 semantically diverse pairs (53.69% of the total number).

    Author
    Vuk Batanović
    Availability
    The corpus and its documentation can be found on the paraphrase.sr GitHub repository.
    Publications
    • Vuk Batanović, Bojan Furlan, Boško Nikolić (2011). A software system for determining the semantic similarity of short texts in Serbian. Proceedings of the 19th Telecommunications forum (TELFOR 2011), pp. 1249-1252, Belgrade, Serbia. [Link]
    • Bojan Furlan, Vuk Batanović, Boško Nikolić (2013). Semantic similarity of short texts in languages with a deficient natural language processing support. Decision Support Systems, Vol. 55, No. 3, pp. 710-719. [Link]
  • ReLDI-NormTagNER-sr 2.1

    ReLDI-NormTagNER-sr 2.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation, and named entity recognition of non-standard Serbian. Each tweet also carries automatically assigned standardness levels (T = technical standardness, L = linguistic standardness).

    Authors
    Nikola Ljubešić, Tomaž Erjavec, Vuk Batanović, Maja Miličević, Tanja Samardžić
    Availability
    For local use, a full-text version of the corpus can be downloaded from the CLARIN.SI repository.
    Publication
    The corpus construction is (partially) described in the following paper:
    Miličević, M. and N. Ljubešić (2016). Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets. Slovenščina 2.0, 4(2). [Link]
  • ReLDI-NormTagNER-hr 2.1

    ReLDI-NormTagNER-hr 2.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation, and named entity recognition of non-standard Croatian. Each tweet also carries automatically assigned standardness levels (T = technical standardness, L = linguistic standardness).

    Authors
    Nikola Ljubešić, Tomaž Erjavec, Vuk Batanović, Maja Miličević, Tanja Samardžić
    Availability
    For local use, a full-text version of the corpus can be downloaded from the CLARIN.SI repository.
    Publication
    The corpus construction is (partially) described in the following paper:
    Miličević, M. and N. Ljubešić (2016). Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets. Slovenščina 2.0, 4(2). [Link]
  • Serbian movie review dataset: SerbMR

    The Serbian Movie Review Dataset collection consists of three movie review datasets in Serbian which were constructed for the task of sentiment analysis:

    • Collected movie reviews in Serbian (ISLRN 252-457-966-231-5) – an imbalanced collection of 4725 movie reviews in Serbian.
    • SerbMR-2C – The Serbian Movie Review Dataset (2 Classes) (ISLRN 016-049-192-514-1) – a two-class balanced dataset that contains 1682 movie reviews (841 positive and 841 negative).
    • SerbMR-3C – The Serbian Movie Review Dataset (3 Classes) (ISLRN 229-533-271-984-0) – a three-class balanced dataset that contains 2523 movie reviews (841 positive, 841 neutral, and 841 negative).
    Author
    Vuk Batanović
    Availability
    All corpora, along with extensive documentation, can be downloaded from the SerbMR GitHub repository.
    Publications

    Vuk Batanović, Boško Nikolić, Milan Milosavljević (2016). Reliable Baselines for Sentiment Analysis in Resource-Limited Languages: The Serbian Movie Review Dataset. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pp. 2688-2696, Portorož, Slovenia. [Link] [.bib]

  • Serbian annotated corpus: SETimes.SR

    SETimes.SR is a reference training corpus of Serbian texts collected from the SETimes parallel news corpus.
    It contains 163 documents divided into 3891 sentences, or 86 726 tokens.
    The corpus is manually annotated on the following levels:

    • Token, sentence, and document segmentation
    • Morphosyntax
    • Lemmas
    • Dependency syntax
    • Named entities

    The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Bosnian, Croatian and Serbian, available here.
    Dependency syntax is annotated according to the Universal Dependencies v2 specification (UDv2).
    Named entity annotations are encoded in the IOB2 format, with five NE types considered – people (PER), person derivatives (DERIV-PER), locations (LOC), organizations (ORG), and miscellaneous entities (MISC).
    Further information about the corpus can be found on its GitHub repository.
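
    To illustrate the IOB2 encoding mentioned above, the sketch below converts a tag sequence into entity spans. The exact tag spelling (e.g. "B-PER", "I-DERIV-PER") is an assumption based on the type names listed here and may differ in the distributed files; the function name and the sample sentence are invented for illustration.

```python
def iob2_to_spans(tags):
    """Convert a list of IOB2 tags into (type, start, end) spans, end exclusive.

    In IOB2 every entity starts with a 'B-' tag, continues with 'I-' tags
    of the same type, and 'O' marks tokens outside any entity.
    """
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an 'I-' tag of the same type.
        continues = tag.startswith("I-") and tag[2:] == current
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        # Only a 'B-' tag opens a new span; a stray 'I-' tag is ignored.
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Invented example sentence, not taken from the corpus.
tokens = ["Novak", "Đoković", "je", "iz", "Beograda"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
entities = [(etype, " ".join(tokens[s:e])) for etype, s, e in iob2_to_spans(tags)]
```

    Slicing the type from the third character onward keeps hyphenated types such as DERIV-PER intact.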

    Authors
    Vuk Batanović, Nikola Ljubešić, Tanja Samardžić
    Availability
    For local use, a full-text version of SETimes.SR can be downloaded from the CLARIN.SI repository. SETimes.SR is also available on the Serbian UD treebank repository. In addition, the corpus can be accessed via the NoSketch Engine, as well as via KonText.
    Publications
    The compilation of the corpus is described in the following paper:
    Vuk Batanović, Nikola Ljubešić, and Tanja Samardžić (2018). SETimes.SR – A Reference Training Corpus of Serbian. In Proceedings of the Conference on Language Technologies & Digital Humanities 2018 (JT-DH 2018), pp. 11-17, Ljubljana, Slovenia. [Link]

    Additional information regarding the UD annotation of this corpus is available in the following paper:
    Tanja Samardžić, Mirjana Starović, Željko Agić, Nikola Ljubešić (2017). Universal Dependencies for Serbian in Comparison with Croatian and Other Slavic Languages. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing. Valencia, Spain. [Link] [.bib]

  • Croatian annotated corpus: hr500k

    hr500k is a reference training corpus of Croatian that consists of 900 documents divided into 24 794 sentences, or 506 457 tokens. It is an extension of previous training corpora for Croatian, such as SETimes.HR and SETimes.HR+.
    The corpus is manually annotated on the following levels:

    • Token, sentence, and document segmentation
    • Morphosyntax
    • Lemmas
    • Dependency syntax
    • Semantic roles
    • Named entities

    The entire corpus was annotated with regard to morphosyntax and lemmas. The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here.
    Dependency syntax is annotated according to the Universal Dependencies v2 specification (UDv2) and covers the first two fifths of hr500k, i.e. the first 197 028 tokens of the corpus.
    Semantic roles are annotated in the oldest part of the corpus, namely the first 163 documents / 83 630 tokens, which come from the original SETimes.HR corpus.
    Named entity annotations cover the entire hr500k and are encoded in the IOB2 format, with five NE types considered – people (PER), person derivatives (DERIV-PER), locations (LOC), organizations (ORG), and miscellaneous entities (MISC).

    Authors
    Nikola Ljubešić, Željko Agić, Filip Klubička, Vuk Batanović, Tomaž Erjavec
    Availability
    For local use, a full-text version of hr500k can be downloaded from the CLARIN.SI repository. The corpus can also be accessed via the NoSketch Engine, as well as via KonText.
    Publications
    The compilation of the corpus is described in the following paper:
    Nikola Ljubešić, Željko Agić, Filip Klubička, Vuk Batanović, and Tomaž Erjavec (2018). hr500k – A Reference Training Corpus of Croatian. In Proceedings of the Conference on Language Technologies & Digital Humanities 2018 (JT-DH 2018), pp. 154-161, Ljubljana, Slovenia. [Link]
  • Serbian web corpus: srWaC

    srWaC is a web corpus collected from the .rs top-level domain. The 1.1 version of the corpus contains 555 million tokens. The corpus is automatically annotated on the diacritic restoration, morphosyntax and lemma layers. The dependency syntax layer will be added in version 1.2.

    The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here.

    Authors
    Nikola Ljubešić, Filip Klubička
    Availability
    For local use, a full-text version of srWaC can be downloaded here.
    srWaC can also be accessed and queried online, via the NoSketch Engine web interface available here.
    Publications
    The compilation of the 1.0 version of the corpus is described in the following paper:
    Nikola Ljubešić, Filip Klubička (2014). {bs,hr,sr}WaC — Web corpora of Bosnian, Croatian and Serbian. Proceedings of the 9th Web as Corpus Workshop (WaC-9). Gothenburg, Sweden. [Link] [.bib]
  • Croatian web corpus: hrWaC

    hrWaC is a web corpus collected from the .hr top-level domain. The 2.1 version of the corpus contains 1.4 billion tokens. The corpus is automatically annotated on the diacritic restoration, morphosyntax and lemma layers. The dependency syntax layer will be added in version 2.2.

    The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here.

    Authors
    Nikola Ljubešić, Filip Klubička
    Availability
    For local use, a full-text version of hrWaC can be downloaded here.
    hrWaC can also be accessed and queried online, via the NoSketch Engine web interface available here.
    Publications
    The compilation of the 1.0 version of the corpus is described in the following paper:
    Nikola Ljubešić, Filip Klubička (2014). {bs,hr,sr}WaC — Web corpora of Bosnian, Croatian and Serbian. Proceedings of the 9th Web as Corpus Workshop (WaC-9). Gothenburg, Sweden. [Link] [.bib]