Freely available data sets for training and testing language models created as outcomes of the ReLDI Center projects.
Text parsing (upstream NLP)

NEWS
Serbian SETimes.SR 2.0
Croatian hr500k 2.0
Data sets created as part of the initial ReLDI project. In addition to manually annotated lemmas and morphosyntactic labels following the MULTEXT-East specifications, these corpora also contain syntactic annotation following Universal Dependencies (UD), as well as basic categories of named entities. The newspaper texts were taken from the SETimes portal. The creation of the corpora is described in the following papers for the Serbian and Croatian corpus. This paper describes the process of syntactic annotation.

LEGAL TEXTS
Serbian variants
(Ekavian and Ijekavian)
Data set created in cooperation with the School of Electrical Engineering Innovation Center at the University of Belgrade, as part of the COMtext.SR project. This project is funded by a local community of companies and foundations. The corpus contains representative legal/administrative texts collected with the help of the Karanović & Partners law office. These texts were manually annotated with lemmas and morphosyntactic labels following the MULTEXT-East specifications. The data set also features manual annotation of named entities following a detailed, custom scheme.

TWITTER / X
Serbian ReLDI-NormTagNER-sr 3.0
Croatian ReLDI-NormTagNER-hr 3.0
Data sets created as part of the initial ReLDI project, in cooperation with the JANES project. The corpora contain manually annotated lemmas, morphosyntactic labels following the MULTEXT-East specifications, basic categories of named entities, as well as normalisation of non-standard writing. Text samples were collected using the TweetCaT tool. The construction of the corpus is described in this paper.
Natural Language Understanding

SEMANTIC SIMILARITY
Serbian, news articles STS.news.sr
Data set created as part of Vuk Batanović’s doctoral dissertation with the support of the initial ReLDI project. It contains pairs of sentences that were manually assigned semantic similarity scores from 0 to 5, according to the SemEval annotation scheme. The construction of this data set is described in this paper.
Serbian, news articles CLSS.news.sr
Data set created as part of the AVANTES project. It contains pairs of texts of different lengths (phrase/sentence, sentence/paragraph) that were manually assigned semantic similarity scores from 0 to 4, according to the SemEval annotation scheme. The construction of the data set is described in this paper.

SENTIMENT ANALYSIS
Serbian, film reviews SentiComments.SR
Data set created as part of Vuk Batanović’s doctoral dissertation, with the support of the initial ReLDI project. It contains short film reviews manually assigned a sentiment score encoding polarity (positive/negative), subjectivity (objective/subjective), mixed sentiment/ambiguity, as well as the presence of sarcasm. The construction of this data set is described in this paper.
Serbian, film reviews SerbMR
Data set created as part of Vuk Batanović’s doctoral dissertation. It contains film reviews that have been automatically assigned a polarity rating (positive/neutral/negative) based on the reviewers’ 1-10 ratings. The construction of the data set is described in this paper.
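The automatic labelling described above can be pictured as a simple threshold rule over the 1-10 rating. The sketch below is only an illustration: the function name and the two cut-off values are hypothetical and are not taken from the SerbMR documentation, so the paper should be consulted for the thresholds actually used.

```python
def rating_to_polarity(rating: int, negative_max: int = 4, positive_min: int = 7) -> str:
    """Map a 1-10 reviewer rating to a polarity label.

    The default cut-offs (4 and 7) are hypothetical placeholders,
    not the thresholds used to build SerbMR.
    """
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    if rating <= negative_max:
        return "negative"
    if rating >= positive_min:
        return "positive"
    return "neutral"


if __name__ == "__main__":
    # Print the label assigned to a low, a middle and a high rating.
    for r in (2, 5, 9):
        print(r, rating_to_polarity(r))
```

Keeping the cut-offs as parameters makes it easy to reproduce a different split once the real thresholds are known.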

COMMON SENSE REASONING
Serbian, translation of COPA
Translation of the English COPA data set, created in cooperation with CLASSLA as one of the priority data sets for the development of artificial intelligence in Serbian. It contains triplets of sentences where each premise is assigned one true and one false conclusion. The ReLDI centre also contributed to the creation of the Macedonian version of this set and of some of the dialect versions included in DIALECT-COPA.
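As a rough picture of the triplet structure, one item can be modelled as a premise with two candidate sentences and a label marking the correct one. The sketch below is a minimal illustration: the class name, the field names and the English example are assumptions made for this sketch and are not drawn from the released files.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CopaItem:
    """One premise with two candidate sentences (field names are assumed)."""
    premise: str
    choice1: str
    choice2: str
    label: int  # 0 if choice1 is the correct candidate, 1 if choice2

    def correct_choice(self) -> str:
        return self.choice1 if self.label == 0 else self.choice2


def accuracy(items: Sequence[CopaItem], predictions: Sequence[int]) -> float:
    """Share of items where the predicted index equals the gold label."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    hits = sum(1 for item, pred in zip(items, predictions) if item.label == pred)
    return hits / len(items)


# Invented English example, for illustration only.
example = CopaItem(
    premise="It started to rain.",
    choice1="The man opened his umbrella.",
    choice2="The man put on sunglasses.",
    label=0,
)
```

A model is then scored by predicting 0 or 1 for each item and passing the predictions to `accuracy`.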
Speech data

JUŽNE VESTI
Serbian, audio and transcript
Data set for fine-tuning speech-to-text models, created by CLASSLA with the help of the ReLDI centre. It contains speech samples from the “15 Minutes” broadcast of the Južne vesti portal. The original transcripts are automatically aligned with the audio signal.

MAK NA KONAC
Serbian and Croatian, audio
Data set intended for objective comparison of modern speech-to-text models, developed in cooperation with CLASSLA. It currently contains audio samples of radio and video broadcasts from the portals Peščanik, Južne vesti and Radio Student Zagreb, with a total duration of 15 hours. More information in this publication.

Data look-up and processing via CLARIN.SI
A text parsing interface that uses large language models trained on our data for Serbian and Croatian. The current annotator is a newer, improved version of the former ReLDIanno.