Serbian paraphrase corpus: paraphrase.sr

The Serbian Paraphrase Corpus (paraphrase.sr) consists of 1194 sentence pairs gathered from news sources on the web. Each pair was manually annotated with a binary similarity score that indicates whether its two sentences are semantically similar enough to be considered close paraphrases. The corpus contains 553 pairs deemed semantically equivalent (46.31% of the total) and 641 pairs deemed semantically different (53.69% of the total).
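The class balance quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the counts stated in the description; the constant names and the `share` helper are illustrative assumptions and are not part of the corpus or its tooling.

```python
# Counts taken from the corpus description; the names are illustrative only.
TOTAL_PAIRS = 1194
PARAPHRASE_PAIRS = 553
NON_PARAPHRASE_PAIRS = 641


def share(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


# The two classes should account for every pair in the corpus.
assert PARAPHRASE_PAIRS + NON_PARAPHRASE_PAIRS == TOTAL_PAIRS

paraphrase_share = share(PARAPHRASE_PAIRS, TOTAL_PAIRS)
non_paraphrase_share = share(NON_PARAPHRASE_PAIRS, TOTAL_PAIRS)

print(paraphrase_share, non_paraphrase_share)  # 46.31 53.69
```

Because the two classes are close to balanced, a classifier that always predicts the larger class would be correct on roughly 53.69% of the pairs, which may serve as a simple reference point when reading reported accuracy figures.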

Author
Vuk Batanović
Availability
The corpus and its documentation are available in the paraphrase.sr GitHub repository.
Publications
  • Vuk Batanović, Bojan Furlan, Boško Nikolić (2011). A software system for determining the semantic similarity of short texts in Serbian. Proceedings of the 19th Telecommunications Forum (TELFOR 2011), pp. 1249-1252, Belgrade, Serbia.
  • Bojan Furlan, Vuk Batanović, Boško Nikolić (2013). Semantic similarity of short texts in languages with a deficient natural language processing support. Decision Support Systems, Vol. 55, No. 3, pp. 710-719.