This tool is considered legacy: the NLP pipeline achieves better results on the same task, although the pipeline is not yet available as a web service.
A tool for automatic tokenisation (dividing text into words and sentences). It was engineered through iterative runs on representative datasets and features modes for both standard and non-standard language.
Authors
Nikola Ljubešić, Tomaž Erjavec
Availability
The tokeniser is freely available in three forms:
- For local use, the tokeniser can be downloaded from this GitHub repository.
- The tokeniser can be used online, via our web interface that can be found here.
- Our web service can be accessed through our Python library, which can also be downloaded from the CLARIN.SI GitHub repository. Instructions on how to install the ReLDI library from GitHub can be found here (in Serbian). Alternatively, the easiest way to install it is through PyPI from the command line (detailed instructions are also on GitHub).
The third option, i.e. using the ReLDI Python library, is the recommended way to handle larger amounts of data; a minimal sketch follows below.
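For orientation, the sketch below shows what the PyPI route might look like. The package name, import path, and `Tokeniser` class are assumptions made for illustration only (none of these names come from this page); refer to the linked GitHub instructions for the actual interface.

```python
# Minimal sketch of tokenising text through the ReLDI Python library.
# NOTE: the package name "reldi", the import path, and the Tokeniser
# API below are assumptions for illustration, not the documented
# interface; consult the CLARIN.SI GitHub instructions for the
# actual names.
#
# Install from PyPI first (package name assumed):
#     pip install reldi

from reldi.tokeniser import Tokeniser  # hypothetical import path

tokeniser = Tokeniser('sl')  # language code, e.g. Slovene

# Tokenise a short text; assumed to return one token list per sentence.
for sentence in tokeniser.tokenise('To je stavek. To je drugi stavek.'):
    print(sentence)
```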