Serbian web corpus: srWaC

srWaC is a web corpus collected from the .rs top-level domain. Version 1.1 of the corpus contains 555 million tokens. The corpus is automatically annotated at three layers: diacritic restoration, morphosyntax and lemmas. A dependency syntax layer will be added in version 1.2.

The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here.
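
As a rough illustration of how the morphosyntactic layer might be used, the Python sketch below reads one token line and maps the first character of its MULTEXT-East tag to a coarse word class. The tab-separated column order (token, tag, lemma), the helper names and the sample line are assumptions made for this example, not a description of the distributed file format, so the downloaded files should be inspected before relying on it.

```python
# Hypothetical helpers for illustration; they are not part of the srWaC distribution.

# In MULTEXT-East tags the first character encodes the word class.
# Only the core categories are listed; anything else falls back to "other".
MTE_CATEGORIES = {
    "N": "Noun",
    "V": "Verb",
    "A": "Adjective",
    "P": "Pronoun",
    "R": "Adverb",
    "S": "Adposition",
    "C": "Conjunction",
    "M": "Numeral",
    "Q": "Particle",
    "I": "Interjection",
    "Y": "Abbreviation",
    "X": "Residual",
}


def msd_category(msd: str) -> str:
    """Return the coarse word class encoded by the first character of a tag."""
    if not msd:
        return "other"
    return MTE_CATEGORIES.get(msd[0], "other")


def parse_token_line(line: str, columns=("token", "msd", "lemma")) -> dict:
    """Split one tab-separated token line into a dict.

    The column order is an assumption; pass a different `columns` tuple
    if the actual files are laid out differently.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} columns, got {len(fields)}")
    return dict(zip(columns, fields))


if __name__ == "__main__":
    # Invented sample line, not taken from the corpus.
    row = parse_token_line("kuća\tNcfsn\tkuća\n")
    print(row["token"], row["lemma"], msd_category(row["msd"]))
```

Keeping the column order as a parameter means the same function can be reused if the annotation layers turn out to be stored in a different order.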

Authors
Nikola Ljubešić, Filip Klubička
Availability
For local use, a full-text version of srWaC can be downloaded here.
srWaC can also be accessed and queried online, via the noSketchEngine web interface available here.
Publications
The compilation of version 1.0 of the corpus is described in the following paper:
Nikola Ljubešić, Filip Klubička (2014). {bs,hr,sr}WaC — Web corpora of Bosnian, Croatian and Serbian. Proceedings of the 9th Web as Corpus Workshop (WaC-9). Gothenburg, Sweden. [Link] [.bib]