Tanja Samardžić, PD Dr.

I am a senior researcher at the IDSIA NLP Group, senior deputy lecturer at the University of Geneva and Privatdozentin at the University of Zurich. I am also one of the co-founders of ReLDI Centre Belgrade. I hold a PhD in Computational linguistics from the University of Geneva, where I studied in the Computational Learning and Computational Linguistics (CLCL) group. After the PhD, I was the Head of the Text Group and a lab director (alternating) of the Language and Space Lab at the University of Zurich (2013-2024), a Visiting Scholar at the University of Cambridge (2024), and a Visiting Researcher at the IT University Copenhagen (2022).

With a background in linguistic theory and machine learning, I am committed to advancing the use of computational methods in the study of language. Currently, I serve as an ACL Rolling Review Senior Area Chair, EACL 2026 Faculty SRW Chair, a UniDive COST Action Managing Committee Member and an External Governing Board Member of the SMASH Postdoctoral Program.

PUBLICATIONS

2025                                                                                                                                                                  

Pelloni, O., R. van der Goot, P. Ranacher, I. Vulic and T. Samardžić (2025). “Subword symmetry in natural languages”. R. Soc. Open Sci. 12, 250295.

Kanjirangat, V., T. Samardžić, Lj. Dolamic, and F. Rinaldi (2025). “Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks”. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Suzhou, China. Senior Area Chair Highlight

Van Der Goot, R., E. Ploeger, V. Blaschke, and T. Samardžić (2025). “DistaLs: a Comprehensive Collection of Language Distance Measures“. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Suzhou, China.

Hopton, Z. W., Y. Scherrer, and T. Samardžić (2025). “Functional Lexicon in Subword Tokenization”. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2025). Association for Computational Linguistics, Albuquerque, New Mexico, USA. Area Chair Nomination for the best paper

Goldman, O., L. Weissweiler, K. Acar, D. Alves, A. Baczkowska, G. Eryigit, L. Krippnerová, A. Pagano, T. Samardžić, L. Talamo, A. Wróblewska, D. Zeman, J. Nivre, and R. Tsarfaty (2025). “Findings of the UniDive 2025 shared task on multilingual Morpho-Syntactic Parsing”. In Proceedings of The UniDive 2025 Shared Task on Multilingual Morpho-Syntactic Parsing. Association for Computational Linguistics, Ljubljana, Slovenia.

Samardžić, T. (2025). “Stable Mood-Tense-Aspect Patterns Observed in the CLARIN.SI Repository”. In Donzé, A.E., T. Ihsane and E. Haeberli (Eds.) Generative Grammar in Geneva 12, Studies in honour of Genoveva Puskás. Department of Linguistics of the University of Geneva.

Samardžić, T. (2025). “Kako izmeriti performanse računarskih modela za konverziju govora u tekst?” In Gudurić, S. (Ed.) Lingvistički mozaik: primenjena lingvistika u čast Vesni Polovini, Primenjena lingvistika u čast 6. Belgrade, Novi Sad, Serbia.

2024                                                                                                                                                                  

Samardžić, T., X. Gutierrez-Vasques, C. Bentz, S. Moran, and O. Pelloni (2024). “A Measure for Transparent Comparison of Linguistic Diversity in Multilingual NLP Data Sets”. In Findings of the Association for Computational Linguistics: NAACL 2024. Association for Computational Linguistics, Mexico City, Mexico.

Attieh, J., Z. Hopton, Y. Scherrer, and T. Samardžić (2024). “System Description of the NordicsAlps Submission to the AmericasNLP 2024 Machine Translation Shared Task”. In Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024). Association for Computational Linguistics, Mexico City, Mexico. Overall Winner!

Kanjirangat, V., T. Samardžić, Lj. Dolamic, and F. Rinaldi (2024). “NLP_DI at NADI 2024 shared task: Multi-label Arabic Dialect Classifications with an Unsupervised Cross-Encoder”. In Proceedings of The Second Arabic Natural Language Processing Conference. Association for Computational Linguistics, Bangkok, Thailand (online).

Samardžić, T., P. Rupnik, M. Starović, N. Ljubešić (2024). “Mak na konac: A Multi-Reference Speech-To-Text Benchmark for Croatian and Serbian“. In Conference on Language Technologies and Digital Humanities (JTDH), Ljubljana, Slovenia.

Bajčetić, L., T. Samardžić, and  V. Batanović (2024). “Lemmatizing Serbian and Croatian via String Edit Prediction“. In Conference on Language Technologies and Digital Humanities (JTDH), Ljubljana, Slovenia.

2023                                                                                                                                                                   

Gutierrez-Vasques, X., C. Bentz, and T. Samardžić (2023). “Languages through the Looking Glass of BPE Compression”. Computational Linguistics 49(4).

Plüss, M., J. Deriu, Y. Schraner, C. Paonessa, J. Hartmann, L. Schmidt, C. Scheller, M. Hürlimann, T. Samardžić, M. Vogel, and M. Cieliebak (2023). “STT4SG-350: A Speech Corpus for All Swiss German Dialect Regions“. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL2023). Association for Computational Linguistics, Toronto, Canada.

Kanjirangat, V., T. Samardžić, F. Rinaldi and Lj. Dolamic (2023). “Optimizing the Size of Subword Vocabularies in Dialect Classification”. In Proceedings of the Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023). Association for Computational Linguistics, Dubrovnik, Croatia.

2022                                                                                                                                                                   

Samardžić, T., X. Gutierrez-Vasques, R. van der Goot, M. Müller-Eberstein, O. Pelloni and B. Plank (2022). “On language spaces, scales and cross-lingual transfer of UD parsers“. In Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL). Association for Computational Linguistics, Abu Dhabi, UAE.

Pelloni, O., A. Shaitarova and T. Samardžić (2022). “Subword evenness (SuE) as a predictor of cross-lingual transfer to low-resource languages“. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Abu Dhabi, UAE.

Kanjirangat, V., T. Samardžić, F. Rinaldi and Lj. Dolamic (2022). “Early guessing for dialect identification“. In Findings of the Association for Computational Linguistics: EMNLP 2022. Association for Computational Linguistics, Abu Dhabi, UAE.

Bentz, C., X. Gutierrez-Vasques, O. Sozinova and T. Samardžić (2022). “Complexity trade-offs and equi-complexity in natural languages: A meta-analysis“. Linguistic Vanguard.

Ljubešić, N., M. Miličević Petrović, T. Erjavec and T. Samardžić (2022). “Together we are stronger: Bootstrapping language technology infrastructure for South Slavic languages with CLARIN.SI“. Monographic Publication about CLARIN ERIC (CLARIN Book 2022). In D. Fišer and A. Witt (eds.) CLARIN: The Infrastructure for Language Resources. Berlin, Boston: De Gruyter, pp. 429-456.

Moran S., C. Bentz, X. Gutierrez-Vasques, O. Sozinova and T. Samardžić (2022). “TeDDi Sample: Text data diversity sample for language comparison and multilingual NLP“. In Proceedings of The International Conference on Language Resources and Evaluation (LREC), Marseille, France, 1150–1158.

2021                                                                                                                                                                    

Samardžić, T. and N. Ljubešić (2021). “Data collection and representation for similar languages, varieties and dialects”. In M. Zampieri and P. Nakov (eds.) Similar Languages, Varieties, and Dialects: A Computational Perspective, Studies in Natural Language Processing. Cambridge University Press.

Ruzsics, T.,  O. Sozinova, X. Gutierrez-Vasques and  T. Samardžić (2021). “Interpretability for morphological inflection: from character-level predictions to subword-level rules“. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 3189–3201.

Gutierrez-Vasques, X., C. Bentz, O. Sozinova, and  T. Samardžić (2021). “From characters to words: the turning point of BPE merges“. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 3454–3468.

2020                                                                                                                                                                    

Nigmatulina, I., T. Kew, T. Samardžić (2020). “ASR for non-standardised languages with dialectal variation: the case of Swiss German“. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial2020), COLING 2020 Barcelona, Spain.

Kew, T., I. Nigmatulina, L. Nagele, T. Samardžić (2020). “UZH TILT: A Kaldi recipe for Swiss German speech to standard German text“. In Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS). Zurich, Switzerland.

Schmidt, L., L. Linder, S. Djambazovska, A. Lazaridis, T. Samardžić, C. Musat (2020). “A Swiss German Dictionary: Variation in Speech and Writing”. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). Marseille, France.

2019                                                                                                                                                                    

Ruzsics, T., M. Lusetti, A. Göhring, T. Samardžić, and E. Stark (2019). “Neural text normalization with adapted decoding and PoS features”. Natural Language Engineering 25(5), 585-605.

Scherrer, Y., T. Samardžić, E. Glaser (2019). “Digitising Swiss German — How to process and study a polycentric spoken language“. Language Resources and Evaluation 53, 735-769.

Ljubešić, N., M. Miličević Petrović, and T. Samardžić (2019). “Borders and boundaries in Bosnian, Croatian, Montenegrin and Serbian: Twitter data to the rescue”. Journal of Linguistic Geography 6(2), 100-124.

Ljubešić, N., M. Miličević Petrović, and T. Samardžić (2019). “Language accommodation on Twitter: The case of Serbian“. Slavistična revija 67(1), 87-106.  (In Croatian)

2018                                                                                                                                                                    

Samardžić, T. and P. Merlo (2018). “Probability of external causation: an empirical account of cross-linguistic variation in lexical causatives”. Linguistics 56(5), 895-939.

Lusetti, M., T. Ruzsics, A. Göhring, T. Samardžić, and E. Stark (2018). “Encoder-decoder methods for text normalization”. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), COLING 2018, Santa Fe, NM, USA, 18-28.

Samardžić, T.,  M. Cieliebak, and J. M. Deriu (2018). “Future Actions for Swiss German — Workshop Results at SwissText 2018“. In Proceedings of the 3rd Swiss Text Analytics Conference (SwissText 2018), Winterthur, Switzerland, 95-99.

Zampieri, M., S. Malmasi, P. Nakov, A. Ali, S. Shon, J. Glass, Y. Scherrer, T. Samardžić, N. Ljubešić, J. Tiedemann, C. van der Lee, S. Grondelaers, N. Oostdijk, A. van den Bosch, R. Kumar, B. Lahiri, and M. Jain (2018). “Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign”. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), COLING 2018. Santa Fe, NM, USA, 1-17.

Batanović, V., N. Ljubešić, and T. Samardžić (2018). “SETimes.SR – A reference training corpus of Serbian“. In Proceedings of the Conference on Language Technologies & Digital Humanities 2018, Ljubljana, Slovenia, 11-18.

Vuković, T. and T. Samardžić (2018). “Areal distribution of the post-positive article in Timok dialect of Torlak”. In Timok: Field Research in Folklore and Language 2015-2017, Knjaževac, Serbia: Public Library Knjaževac, 181-201 (In Serbian).

2017                                                                                                                                                                    

Ruzsics, T. and T. Samardžić (2017). “Neural sequence-to-sequence learning of internal word structure”. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). Vancouver, Canada, 184-194.

Derungs, C. and T. Samardžić (2017). “Are prominent mountains frequently mentioned in text? Exploring the spatial expressiveness of text frequency”. International Journal of Geographical Information Science 32(5), 856-873.

Bentz, C., D. Alikaniotis, T. Samardžić, and P. Buttery (2017). “Variation in word frequency distributions: Definitions, measures and implications for a corpus-based language typology”. Journal of Quantitative Linguistics 24(2-3), 128-162.

Samardžić, T., M. Starović, Ž. Agić, and N. Ljubešić (2017). “Universal dependencies for Serbian in comparison with Croatian and other Slavic languages”. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing. Valencia, Spain, 39-44.

2016                                                                                                                                                                      

Ljubešić, N., T. Samardžić, and C. Derungs (2016). “TweetGeo  — A tool for collecting, processing and analysing geo-encoded linguistic data“. In Proceedings of the 26th International Conference on Computational Linguistics (COLING2016). Osaka, Japan.

Bentz, C., T. Ruzsics, A. Koplenig, and T. Samardžić (2016). “A comparison between morphological complexity measures: Typological data vs. language corpora”. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC). Osaka, Japan.

Samardžić, T., Y. Scherrer, and E. Glaser (2016). “ArchiMob – A corpus of spoken Swiss German”. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia.

Samardžić, T. and M. Miličević (2016). “A framework for automatic acquisition of Croatian and Serbian verb aspect from corpora”. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia.

Ljubešić, N., T. Erjavec, D. Fišer, T. Samardžić, M. Miličević, F. Klubička, and F. Petkovski (2016). “Easily accessible language technologies for Slovene, Croatian and Serbian“. In Proceedings of the Conference on Language Technologies & Digital Humanities. Ljubljana, Slovenia.

OLDER publications on ORCID

TEACHING

2022 – present Linguistics Department, Computer Science Department, University of Geneva (replacing Prof. Paola Merlo):
– Natural language processing (MSc, MA)
– Empirical methods in language processing (MSc, MA)

2014 – 2023 Institute of Computational Linguistics, University of Zurich:
– Techniques of semantic processing (MA)
2022 Doctoral Programme in Applied Linguistics, ZHAW School of Applied Linguistics and the USI Faculty of Communication, Culture and Society:
– Designing empirical studies in linguistics
– Career outlook with a PhD in linguistics
2020 – 2021 Institute of Computational Linguistics, University of Zurich:
– Processing non-standard language (BA, MA)
2018 – 2019 Linguistics Department, Computer Science Department, University of Geneva (replacing Paola Merlo):
– Natural language processing (MA)
– Empirical methods in language processing (Neural sequence-to-sequence methods, MSc, MA)
2019 Institute of Computational Linguistics, University of Zurich:
– Programming for linguists (Python and R) (MA)
2019 German Department, University of Zurich:
– Automatic text processing for the study of Swiss German (BA, MA)
2017 – 2018 Institute of Computational Linguistics, University of Zurich:
– Linked and multilingual resources (MA)
2014 – 2017 Institute of Computational Linguistics, University of Zurich:
– Cross-linguistic transfer of lexical semantic representations (MA)
2014 – 2015 Institute of Slavic Languages, University of Bern:
– Automatic analysis of the languages of former Yugoslavia (BA)
2012 – 2013 Linguistics Department, LATL, University of Geneva:
– Empirical methods and script languages (Python) (MA)
– Artificial intelligence (BA)
2004 – 2012 Department of General Linguistics, University of Belgrade:
– Introduction to general linguistics (BA)
– Applied linguistics (Awk) (BA)
– Introduction to mark-up languages (BA)
– Discourse analysis (BA)
– Pragmatics (BA)
– Methodology of linguistic research (BA)
2000 – 2004 Department of Serbian Language, University of Belgrade:
– Contemporary Serbian III — Syntax (BA)
– Computational and mathematical linguistics (BA)

GRANTS AND SCHOLARSHIPS

2020 – 2023 Movetia grant 2020-01MT-1-KA203074246a5 UPgrading the SKIlls of Linguistics and Language Students — UPSKILLS (PI)
2018 – 2022 SNSF grant 176305 Non-randomness in morphological diversity: A computational approach based on multilingual corpora (PI)
2018 – 2019 Movetia grant 0012 Revisiting research training in linguistics: theory, logic, method (PI)
2016 – 2017 Hasler foundation grant 16038 Basic natural language processing for Swiss German texts (PI)
2015 – 2017 SNSF grant 160501 Regional linguistic data initiative (PI)
2008 Scholarship of the Department of General Linguistics, University of Geneva
2006 – 2008 Scholarship of the Swiss Federal Commission for Foreign Students
2002 Sasakawa scholarship for young leaders, realized at the universities of Birmingham, Duisburg, and Belgrade
2000 Serbian Ministry of Science and Technology research scholarship, realized at the Institute for Serbian Language
1995 – 1999 Serbian Ministry of Education students’ scholarship

INVITED TALKS

Jun 2025 University of Exeter, NLP and vision online seminars, Understanding text tokenisation across diverse languages
Nov 2024 Ontario Tech University, Lee Language Lab (online), Measuring Linguistic diversity with LangDive
Jun 2023 University of Lausanne, QUALICO 2023, Subword tokenization as a method for discovering and comparing linguistic structures 
Feb 2023 University of Bologna, Languages as geometric shapes 
Oct 2022 VarDial Workshop (Gyeongju, Republic of Korea), Data-centric vs. model-centric solutions for dialect identification 
Apr 2022 University of Helsinki, Text-based measures of language similarity
Feb 2022 IT University Copenhagen, Language families and similarity
Sep 2021 SIGTYP lecture series, Language sampling
Jun 2021 Mexican NLP Summer School 2021, Language (de)standardisation and NLP
Nov 2020 University of Milano-Bicocca, Searching for subword units in language processing and linguistic theory
Feb 2020 University of Geneva, Interpretable word splits in language processing
May 2017 University of Zurich, The impact of world knowledge on the use and the morphology of verbs
Mar 2017 University of Munich, Verb aspect as linguistic encoding of time: a computational cross-linguistic approach
Feb 2017 University of Geneva, The impact of geography on toponym frequency
Nov 2015 University of Zagreb, Steps in building a corpus of Swiss German
Sep 2015 BSNLP workshop (Hissar, Bulgaria), Aspect-based learning of event duration using parallel corpora
Jun 2015 ACQDIV Project kickoff workshop (Kappel Abbey), The Bayesian learning framework
Nov 2013 University of Stuttgart, Likelihood of external causation and the cross-linguistic variation in lexical causatives

EDUCATION

2008 – 2013 PhD in Computational linguistics, University of Geneva
Thesis: Dynamics, causation, duration in the predicate-argument structure of verbs: A computational approach based on parallel corpora. Supervised by Prof. Paola Merlo.
2006 – 2008 Postgraduate Studies in Computational linguistics, DEA, University of Geneva
Thesis: Light verbs and the lexical category bias of their complements. Supervised by Prof. Paola Merlo.
1999 – 2004 Graduate Studies in Linguistics, MA Degree, University of Belgrade
Thesis: Reflexivization of transitive three valence verbs in novištokavski standard language diasystem. Supervised by Prof. Ljubomir Popović. (in Serbian)
1994 – 1999 Diploma in Serbian Language, Literature and Linguistics, Faculty of Philology, University of Belgrade

TECHNICAL SKILLS

Programming: Python, Perl, Awk
Shell: Unix/Linux
Data analysis: R, Python
Mark-up: XML, XHTML, LaTeX

LANGUAGES

Serbian (native); English, French (fluent); German, Italian (intermediate); Slovenian (passive)