This package is a Java reimplementation of four previously published stemming algorithms for Serbian and Croatian:
- The greedy and the optimal subsumption-based stemmers for Serbian, by Vlado Kešelj and Danko Šipka
- A refinement of the greedy subsumption-based stemmer, by Nikola Milošević
- A “Simple stemmer for Croatian v0.1”, by Nikola Ljubešić and Ivan Pandžić
All the stemmers expect the input text to be encoded in UTF-8. Their output is also UTF-8 encoded.
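Because the stemmers work on UTF-8 text, it may help to read and write files with an explicit charset rather than relying on the platform default, which is not UTF-8 on every system. The following is a minimal sketch using only the Java standard library. The class name `Utf8RoundTrip`, the method `process`, and the `transform` parameter are hypothetical names introduced for illustration; they are not part of the SCStemmers API. The `transform` argument simply marks the place where a stemming call would go.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.UnaryOperator;

public class Utf8RoundTrip {

    // Reads inputFile as UTF-8, applies the given transformation to the text,
    // and writes the result to outputFile as UTF-8.
    // "transform" is a hypothetical stand-in for a stemmer call; it is an
    // assumption of this sketch, not an SCStemmers class or method.
    public static void process(Path inputFile, Path outputFile,
                               UnaryOperator<String> transform) throws IOException {
        String text = new String(Files.readAllBytes(inputFile), StandardCharsets.UTF_8);
        String result = transform.apply(text);
        Files.write(outputFile, result.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("input", ".txt");
        Path out = Files.createTempFile("output", ".txt");

        // Sample text with Serbian/Croatian diacritics, written with Unicode
        // escapes (\u010C = Č, \u0161 = š) so the source compiles the same
        // way regardless of the compiler's source-file encoding.
        String sample = "\u010Citamo knjige o \u0161umama";
        Files.write(in, sample.getBytes(StandardCharsets.UTF_8));

        // Identity transformation used as a placeholder for real stemming.
        process(in, out, text -> text);

        // With the identity placeholder, the output bytes match the input bytes.
        boolean same = java.util.Arrays.equals(
                Files.readAllBytes(in), Files.readAllBytes(out));
        System.out.println("Round trip preserved bytes: " + same);
    }
}
```

Passing `StandardCharsets.UTF_8` explicitly at both ends keeps characters such as č, ć, š, ž and đ intact on systems whose default encoding differs.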
Author
Vuk Batanović
Availability
The package and more extensive documentation can be downloaded from the SCStemmers GitHub repository.
Publications
The SCStemmers package was introduced in:
Vuk Batanović, Boško Nikolić, Milan Milosavljević (2016). Reliable Baselines for Sentiment Analysis in Resource-Limited Languages: The Serbian Movie Review Dataset. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pp. 2688-2696, Portorož, Slovenia. [Link] [.bib]
The original papers describing each implemented stemming algorithm are:
- For the greedy and the optimal subsumption-based stemmers for Serbian: Vlado Kešelj, Danko Šipka (2008). A Suffix Subsumption-Based Approach to Building Stemmers and Lemmatizers for Highly Inflectional Languages with Sparse Resources. Infotheca 9(1-2), pp. 23a-33a. [Link]
- For the refinement of the greedy subsumption-based stemmer: Nikola Milošević (2012). Stemmer for Serbian language. arXiv preprint arXiv:1209.4471. [Link]
- For the “Simple stemmer for Croatian v0.1”: Nikola Ljubešić, Damir Boras, Ozren Kubelka (2007). Retrieving Information in Croatian: Building a Simple and Efficient Rule-Based Stemmer. Digital Information and Heritage, pp. 313–320. [Link]