hrWaC is a web corpus collected from the .hr top-level domain. Version 2.1 of the corpus contains 1.4 billion tokens. The corpus is automatically annotated on three layers: diacritic restoration, morphosyntax and lemmas. A dependency syntax layer will be added in version 2.2.
The morphosyntactic tags used in the corpus follow the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here.
Authors
Nikola Ljubešić, Filip Klubička
Availability
For local use, a full-text version of hrWaC can be downloaded here.
hrWaC can also be accessed and queried online, via the noSketchEngine web interface available here.
Publications