BasiLex-corpus
The BasiLex corpus is an annotated collection of texts written for children aged four to twelve.
- version 1.0 (2015)
- Tellings, A., Hulsbosch, M., Vermeer, A. & van den Bosch, A. (2015). BasiLex: an 11.5-million words corpus of Dutch texts written for children. Computational Linguistics in the Netherlands Journal 4, 191-208
- Download page
BasiScript-corpus
The BasiScript corpus is an annotated collection of texts written by children aged four to twelve.
Dutch Audio Description Corpus
The Dutch Audio Description (AD) corpus contains the transcribed texts of 39 audio-described Dutch films and TV series, totalling 154,570 words and 3,074 minutes of video. The corpus was used to extract quantitative data on the language of AD: frequency counts of parts of speech, words, lemmas and collocations, along with other text statistics such as reading speed, word and sentence length, text readability and type-token ratios (a statistical measure of lexical variety).
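As an illustration of the kind of text statistics mentioned above, the following is a minimal sketch in Python. It is not the method used by the corpus authors: the function name `text_statistics`, the regex-based tokenisation and the sample sentence are all assumptions made for this example, and a real study would normally use a proper Dutch tokeniser and lemmatiser.

```python
import re
from collections import Counter


def text_statistics(text: str) -> dict:
    """Compute a few simple text statistics (hypothetical helper)."""
    # Assumption: a token is a lowercase run of letters; digits and
    # punctuation are ignored. This is a naive stand-in for real tokenisation.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Assumption: sentences end at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)

    return {
        "tokens": n_tokens,
        "types": n_types,
        # Type-token ratio: distinct word forms divided by total words.
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "mean_word_length": sum(map(len, tokens)) / n_tokens if n_tokens else 0.0,
        "mean_sentence_length": n_tokens / len(sentences) if sentences else 0.0,
        "most_common": counts.most_common(3),
    }


# Made-up sample text, not taken from the corpus.
sample = "De man loopt naar de deur. De deur gaat open."
stats = text_statistics(sample)
print(stats)
```

For the sample text there are 10 tokens and 7 distinct types, so the type-token ratio is 0.7. Because the ratio falls as texts get longer, comparisons are only meaningful between samples of similar size.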
Dutch Book Reviews Dataset (DBRD)
The DBRD (pronounced dee-bird) contains over 110k book reviews along with associated binary sentiment polarity labels. It is strongly influenced by the Large Movie Review Dataset and is intended as a benchmark for sentiment classification in Dutch.