ReLDI-NormTagNER-hr 2.1

ReLDI-NormTagNER-hr 2.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation, and named entity recognition of non-standard Croatian. In addition to the manual annotation, each tweet carries two automatically assigned standardness levels (T = technical standardness, L = linguistic standardness).
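The annotation layers listed above can be pictured as a small data model. The sketch below is only an illustration: the description does not state the file format, so the tab-separated layout, the `# T=… L=…` header line, the field order, the integer standardness values, and the names `Token`, `Tweet`, and `parse_tweet` are all assumptions made for this example. The sample line is invented, not taken from the corpus; consult the files in the CLARIN.SI repository for the actual format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    form: str   # surface form as written in the tweet
    norm: str   # normalised (standard) form
    lemma: str  # lemma
    msd: str    # morphosyntactic tag
    ner: str    # named-entity label


@dataclass
class Tweet:
    tech_std: int                  # T = technical standardness
    ling_std: int                  # L = linguistic standardness
    sentences: List[List[Token]]   # sentence segmentation


def parse_tweet(text: str) -> Tweet:
    """Parse one tweet from the ASSUMED layout: a '# T=<n> L=<n>' header,
    one tab-separated token per line, blank lines between sentences."""
    lines = text.strip().splitlines()
    header = dict(item.split("=") for item in lines[0].lstrip("# ").split())
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in lines[1:]:
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(Token(*line.split("\t")))
    if current:
        sentences.append(current)
    return Tweet(int(header["T"]), int(header["L"]), sentences)


# Invented one-token sample in the assumed layout (not corpus data).
SAMPLE = "# T=1 L=3\ntviterasi\ttviteraši\ttviteraš\tNcmpn\tO\n"
tweet = parse_tweet(SAMPLE)
```

Keeping the tweet-level T and L values separate from the token-level fields mirrors the description: standardness is assigned per tweet, while the other layers apply to individual tokens.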

Authors
Nikola Ljubešić, Tomaž Erjavec, Vuk Batanović, Maja Miličević, Tanja Samardžić
Availability
For local use, a full-text version of the corpus can be downloaded from the CLARIN.SI repository.
Publication
The corpus construction is (partially) described in the following paper:
Miličević, M. and N. Ljubešić (2016). Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets. Slovenščina 2.0, 4(2).