srWaC is a web corpus collected from the .rs top-level domain. Version 1.1 of the corpus contains 555 million tokens. The corpus is automatically annotated with diacritic restoration, morphosyntactic tags and lemmas. A dependency syntax layer will be added in version 1.2.
The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here.
Authors
Nikola Ljubešić, Filip Klubička
Availability
For local use, a full-text version of srWaC can be downloaded here.
srWaC can also be accessed and queried online, via the noSketchEngine web interface available here.
Publications