On 28-30 September 2023 the ReLDI Centre and several members of the ReLDI network will be in Rijeka for the CLARC 2023 conference, organised by the Center for Language Research of the Faculty of Humanities and Social Sciences at the University of Rijeka. The conference topic is “Language and Language Data”, which we will address from a ReLDI perspective through plenary talks by Maja Miličević Petrović and Nikola Ljubešić, a panel on the recently completed UPSKILLS project, and a roundtable on “Linguistics and Large Language Models”. We thank the organisers for having us and look forward to seeing many familiar faces and meeting new ones!
Author: admin
-
The story of ReLDI in a CLARIN book
The ventures of ReLDI are now described in a book chapter, available in open access!
Within the newly published book CLARIN. The Infrastructure for Language Resources, which marks CLARIN’s 10th anniversary as a European Research Infrastructure Consortium, Nikola Ljubešić, Tomaž Erjavec, Maja Miličević Petrović and Tanja Samardžić tell the story of how ReLDI came about, how it help(ed|s) the development of technologies for South Slavic languages, and how its initial contacts with colleagues from Slovenia grew into a long-standing collaboration with CLARIN.SI and the CLASSLA knowledge centre. Together we are stronger!
-
ReLDI @ JTDH 2022
On 15 and 16 September 2022 the ReLDI Centre and the ReLDI network had several representatives at the JTDH conference in Ljubljana. We are especially proud of two young researchers from Belgrade – Natalija Tomić and Ružica Farmakovski – who presented work born from ReLDI events and supervised by Tanja Samardžić and Nikola Ljubešić. Check out the proceedings for more details!
An important moment in the conference was also a ReLDI-CLASSLA get-together. The rather spontaneous meeting started as a get-to-meet-everyone-in-a-circle (first picture), and developed into an NLP 101 talk by Boshko Koloski (second picture) and a rather wide and interesting discussion. We hope that everyone’s impressions are as positive as ours!
-
ReLDI becomes an associate partner on a University of Rijeka project
The ReLDI centre for linguistic data will participate, as an associate partner, in a project dedicated to language technologies and digital text processing (Cr. “Jezične tehnologije i digitalna obrada teksta”). The project has been approved within the call “UNIRI CLASS – Open Personalised Education”, and its objective is to introduce a new minor through a collaboration of multiple UNIRI component units. The project coordinator is Benedikt Perak.
-
Collaboration with CLASSLA and the “ReLDI effect”
The ReLDI centre is continuing the collaboration with CLARIN.SI’s Knowledge centre for South Slavic languages – CLASSLA. In fact, the recent success story published by CLASSLA is dedicated to the importance of collaborations and synergies, and the term “ReLDI effect” is used to describe synergistic effects between different projects. You can read the full story at https://www.clarin.si/info/k-centre/success-stories/.
In addition, the ReLDI centre has recently been involved in the development of the first open speech-to-text system for Croatian, coordinated by CLASSLA. The system is available at https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr, where it is possible to try out some examples, but also to upload or record one’s own speech. The work continues, and ParlaSpeech-HR is planned to be published in early 2022. All this is the result of joint efforts by Nikola Ljubešić, Ivo-Pavao Jazbec, Vuk Batanović, Lenka Bajčetić, Danijel Korzinek and Peter Rupnik, but would not have been possible without a wider collaboration around the ParlaMint project, with Darja Fišer, Tomaž Erjavec, Maciej Ogrodniczuk and Petya Osenova.
-
Materials from the Workshop on regional markedness in text available
The materials from “Regional variation in gender marking: a hands-on tutorial on extracting data from corpora” are now available for download from https://github.com/clarinsi/workshop_reg_mark. These materials provide an introduction to the process of using corpora to study a linguistic (and not only linguistic) problem, with information on:
- how to find (comparable) South Slavic corpora in the CLARIN.SI repository
- how to explore corpora through the noSketchEngine and KonText concordancers
- how to study gender marking looking at frequencies of feminine and masculine nouns describing occupations, and at the distribution of feminine and masculine forms of different verbs
- how to draw conclusions about gender bias in society based on corpus results
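To give a flavour of the frequency comparison mentioned in the list above, here is a minimal sketch that computes the share of feminine forms for an occupation from concordancer hit counts. It is not taken from the workshop materials: the occupation names and all counts are invented placeholders, and `feminine_share` is a hypothetical helper.

```python
# Hypothetical hit counts per occupation: (feminine forms, masculine forms).
# These numbers are invented placeholders, not results from any corpus.
counts = {
    "occupation_a": (120, 480),
    "occupation_b": (300, 100),
}


def feminine_share(feminine, masculine):
    """Return the proportion of feminine forms among all gendered forms."""
    total = feminine + masculine
    if total == 0:
        raise ValueError("no occurrences to compare")
    return feminine / total


shares = {name: feminine_share(f, m) for name, (f, m) in counts.items()}
for name, share in sorted(shares.items()):
    print(f"{name}: {share:.1%} feminine")
```

Comparing such shares across comparable corpora is one simple way to spot regional differences; whether a difference is meaningful still depends on corpus size and composition.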
The materials were prepared by Mirjana Starović and Tanja Samardžić as part of the online workshop held on 6 and 7 November 2021, organised by the University of Zurich – URPP “Language and Space”, the CLARIN knowledge centre for South Slavic languages – CLASSLA and the ReLDI centre. The programme also included a keynote talk by Yves Scherrer from the University of Helsinki, Darja Fišer’s presentation of opportunities for student presentations at the JTDH Language Technologies and Digital Humanities Conference, and an Interactive workshop on regional variation in text led by Sara Košutar, Larissa Schmidt and Leyla Feiner.
The workshop saw the participation of around 30 students and colleagues divided between GatherTown and Zoom, with lively and fun interactive sessions, and some surprising findings. A follow-up mentoring session for students took place on 16 December 2021.
For CLASSLA accounts of the workshop, see here:
-
Workshop on regional markedness in text, 6-7 November 2021
The University of Zurich – URPP “Language and Space”, the CLARIN knowledge centre for South Slavic languages – CLASSLA and the ReLDI centre are organising an online workshop dedicated to regional markedness in text. The workshop will take place on Zoom, 6-7 November 2021. The details, including the programme and the registration link, can be found here. Students are particularly encouraged to apply!
-
NLP pipeline for Croatian and Serbian
A Python module comprising a tokeniser, a part-of-speech/MSD tagger, a lemmatiser, a dependency parser, and a named entity recogniser for most South Slavic languages. For Croatian and Serbian there are models for processing both standard and Internet non-standard texts. The estimated accuracy of morphosyntactic tagging for this tool is ~94%, while for lemmatisation the accuracy is ~99%. Dependency parsing has a labeled attachment score of ~0.9, while named entity recognition achieves a micro-F1 of ~0.9.
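For readers unfamiliar with the two scores quoted above, the following sketch shows the standard definitions of labeled attachment score (LAS) and micro-averaged F1 on toy data. It does not use the pipeline itself and is not necessarily how the reported figures were computed; the function names and the example annotations are assumptions made for illustration.

```python
def labeled_attachment_score(gold, predicted):
    """Share of tokens whose predicted (head, relation) pair matches gold."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("empty input")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def micro_f1(gold_entities, predicted_entities):
    """Micro-averaged F1 over pooled (start, end, type) entity spans."""
    gold_set, pred_set = set(gold_entities), set(predicted_entities)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Toy sentence of four tokens: one (head index, relation) pair per token.
gold_parse = [(2, "nsubj"), (0, "root"), (4, "amod"), (2, "obj")]
pred_parse = [(2, "nsubj"), (0, "root"), (2, "amod"), (2, "obj")]
las = labeled_attachment_score(gold_parse, pred_parse)  # 3 of 4 tokens match

# Toy entity spans: one exact match, one wrong type, one spurious span.
gold_ner = [(0, 1, "PER"), (5, 6, "LOC")]
pred_ner = [(0, 1, "PER"), (5, 6, "ORG"), (8, 9, "LOC")]
f1 = micro_f1(gold_ner, pred_ner)

print(f"LAS = {las:.2f}, micro-F1 = {f1:.2f}")
```

A token counts towards LAS only if both its head and its relation label are right, and an entity counts as correct only if its span and type both match.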
Author: Nikola Ljubešić
Publications: The experiments yielding this pipeline have been described in the following paper: Nikola Ljubešić and Kaja Dobrovoljc (2019). What Does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing. Florence, Italy. pp. 29-34. [Link] [.bib]
-
Serbian short-text sentiment analysis dataset: SentiComments.SR
The SentiComments.SR dataset includes the following three corpora:
- The main SentiComments.SR corpus, consisting of 3490 movie-related comments
- The movie verification corpus, consisting of 464 movie-related comments
- The book verification corpus, consisting of 173 book-related comments
The main SentiComments.SR corpus was constructed from comments written in Serbian by visitors of the kakavfilm.com movie review website. The movie verification corpus comments were sourced from two other Serbian movie review websites – gledajme.rs and happynovisad.com. The book verification corpus comments were also sourced from the happynovisad.com website. Comments exceeding a predefined upper bound on token count (using basic whitespace tokenization) were discarded, as were comments not written in Serbian.
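The length filter described above can be sketched in a few lines. This is an illustration only, not the dataset's actual preprocessing code: the threshold `MAX_TOKENS` is a hypothetical value (the actual bound is not stated here), and the language filter is left out.

```python
MAX_TOKENS = 50  # hypothetical threshold; the actual upper bound is not given here


def within_token_limit(comment, max_tokens=MAX_TOKENS):
    """Keep a comment only if its whitespace token count is within the bound."""
    return len(comment.split()) <= max_tokens


comments = ["Odličan film!", "reč " * 60]  # a short comment and a 60-token one
kept = [c for c in comments if within_token_limit(c)]
print(len(kept), "of", len(comments), "comments kept")
```

Calling `str.split()` with no arguments splits on any run of whitespace, which matches the basic whitespace tokenization mentioned above.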
Six sentiment labels were used in dataset annotation: +1, -1, +M, -M, +NS, and -NS, with the addition of an ‘s’ label suffix denoting the presence of sarcasm. The annotation principles used to assign sentiment labels to items in SentiComments.SR are described in the papers listed in the Publications section. The main SentiComments.SR corpus was annotated by two annotators working together, and therefore contains a single, unified sentiment label for each comment. The verification corpora were used to evaluate the quality, efficiency, and cost-effectiveness of the annotation framework, which is why they contain separate sentiment labels for six annotators.
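When working with the annotations, it can help to split each label into its base sentiment label and the sarcasm flag. The sketch below assumes that the ‘s’ suffix is a lowercase letter appended directly to one of the six base labels (for example `+NSs`); `SentimentLabel` and `parse_label` are hypothetical names, not part of the dataset's tooling.

```python
from typing import NamedTuple

# The six base labels listed above.
BASE_LABELS = {"+1", "-1", "+M", "-M", "+NS", "-NS"}


class SentimentLabel(NamedTuple):
    base: str        # one of the six base labels
    sarcastic: bool  # True if the label carried the 's' suffix


def parse_label(raw):
    """Split a raw label such as '+NSs' into base label and sarcasm flag."""
    raw = raw.strip()
    # Assumption: sarcasm is marked by a trailing lowercase 's'.
    sarcastic = raw.endswith("s")
    base = raw[:-1] if sarcastic else raw
    if base not in BASE_LABELS:
        raise ValueError(f"unknown sentiment label: {raw!r}")
    return SentimentLabel(base, sarcastic)


for raw in ["+1", "-Ms", "+NSs"]:
    print(raw, "->", parse_label(raw))
```

Because the base labels end in an uppercase letter or a digit, a trailing lowercase `s` is unambiguous under this assumption.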
Author: Vuk Batanović
Availability: The corpus and its documentation can be found on the SentiComments.SR GitHub repository.
Publications:
- Vuk Batanović, Miloš Cvetanović, Boško Nikolić (2020). A versatile framework for resource-limited sentiment articulation, annotation and analysis of short texts. PLoS ONE 15(11): e0242050. [Link]
- Vuk Batanović (2020). A methodology for solving semantic tasks in the processing of short texts written in natural languages with limited resources. PhD thesis, University of Belgrade – School of Electrical Engineering. [Link] (contains the full annotation guidelines in Serbian)
-
ReLDI @ INTERSLAVIC
On 26 February 2021, ReLDI will participate in the conference Internationalisms in Slavic as a window into the architecture of grammar – InterSlavic 2020/2021, organised by the University of Graz. The abstract and the presentation can be found here, and in case you wish to join us and the rest of the conference, the registration instructions are available here.