Stemmers for Serbian and Croatian: SCStemmers

This package is a Java reimplementation of four previously published stemming algorithms for Serbian and Croatian:

  • The greedy and the optimal subsumption-based stemmers for Serbian, by Vlado Kešelj and Danko Šipka
  • A refinement of the greedy subsumption-based stemmer, by Nikola Milošević
  • A “Simple stemmer for Croatian v0.1”, by Nikola Ljubešić and Ivan Pandžić

All the stemmers expect UTF-8 encoded input text and produce UTF-8 encoded output.
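Because the JVM's default charset can differ between platforms, it is safer to name UTF-8 explicitly when reading text for a stemmer and when writing its output. The following is a minimal sketch of that I/O step only, using standard Java classes. The class name Utf8Io and its helper methods are hypothetical and not part of SCStemmers, and the stemming call itself is left as a comment because this page does not describe the package's API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of SCStemmers.
public class Utf8Io {

    // Reads a whole file, decoding its bytes explicitly as UTF-8.
    static String readUtf8(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    // Writes text to a file, encoding it explicitly as UTF-8.
    static void writeUtf8(Path path, String text) throws IOException {
        Files.write(path, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path input = Files.createTempFile("stemmer-input", ".txt");
        Path output = Files.createTempFile("stemmer-output", ".txt");

        // Unicode escapes (\u0161 = š, \u0107 = ć) keep this source file
        // independent of the compiler's source encoding.
        writeUtf8(input, "Ke\u0161elj, \u0160ipka, Milo\u0161evi\u0107");

        String text = readUtf8(input);

        // A stemmer from the package would be applied to `text` here.
        // The call is omitted because its signature is not given on this page.
        String processed = text;

        writeUtf8(output, processed);

        Files.delete(input);
        Files.delete(output);
    }
}
```

Naming the charset on both sides avoids silent corruption of characters such as š, ć, č, ž, and đ on systems whose default encoding is not UTF-8.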

Author

Vuk Batanović

Availability

The package and more extensive documentation can be downloaded from the SCStemmers GitHub repository.

Publications

The SCStemmers package was introduced in:

Vuk Batanović, Boško Nikolić, Milan Milosavljević (2016). Reliable Baselines for Sentiment Analysis in Resource-Limited Languages: The Serbian Movie Review Dataset. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pp. 2688-2696, Portorož, Slovenia. [Link] [.bib]

The original papers describing each implemented stemming algorithm are:

  • For the greedy and the optimal subsumption-based stemmer for Serbian: Vlado Kešelj, Danko Šipka (2008). A Suffix Subsumption-Based Approach to Building Stemmers and Lemmatizers for Highly Inflectional Languages with Sparse Resources. Infotheca 9(1-2), pp. 23a-33a. [Link]
  • For the refinement of the greedy subsumption-based stemmer: Nikola Milošević (2012). Stemmer for Serbian language. arXiv preprint arXiv:1209.4471. [Link]
  • For the “Simple stemmer for Croatian v0.1”: Nikola Ljubešić, Damir Boras, Ozren Kubelka (2007). Retrieving Information in Croatian: Building a Simple and Efficient Rule-Based Stemmer. Digital Information and Heritage, pp. 313–320. [Link]