Other corpora

From K-Dutch ATO

BasiLex-corpus

The BasiLex corpus is an annotated collection of texts written for children aged four to twelve.

BasiScript-corpus

The BasiScript corpus is an annotated collection of texts written by children aged four to twelve.

Dutch Audio Description Corpus

The Dutch Audio Description corpus contains the transcribed texts of 39 audio-described Dutch films and TV series, totalling 154,570 words and 3,074 minutes of video. The corpus was used to extract quantitative data on the language of audio description (AD): frequency counts of parts of speech, words, lemmas and collocations, as well as other text statistics such as reading speed, word and sentence length, text readability and type-token ratios (a statistical measure of lexical variety).
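As an illustration of the statistics mentioned above, the following is a minimal sketch of how a type-token ratio, a mean word length and a words-per-minute figure could be computed. It is not the method used by the corpus authors: the regex tokenizer, the function names and the sample sentence are assumptions made for this example, and the words-per-minute figure divides by total video length, which is not the same as the speaking or reading rate of the description itself.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive regex tokenizer)."""
    return re.findall(r"\w+", text.lower())


def type_token_ratio(tokens):
    """Number of distinct word forms (types) divided by number of tokens."""
    if not tokens:
        raise ValueError("cannot compute a type-token ratio for an empty text")
    return len(set(tokens)) / len(tokens)


def mean_word_length(tokens):
    """Average number of characters per token."""
    if not tokens:
        raise ValueError("cannot compute a mean word length for an empty text")
    return sum(len(token) for token in tokens) / len(tokens)


def words_per_minute(word_count, minutes):
    """Words divided by duration in minutes."""
    if minutes <= 0:
        raise ValueError("duration must be positive")
    return word_count / minutes


# Hypothetical sample sentence, not taken from the corpus.
sample = "De man loopt naar de deur. De deur gaat open."
tokens = tokenize(sample)

print(type_token_ratio(tokens))   # 7 types / 10 tokens = 0.7
print(mean_word_length(tokens))   # 34 characters / 10 tokens = 3.4

# Corpus-level totals quoted above: 154,570 words over 3,074 minutes of video.
print(round(words_per_minute(154_570, 3_074), 2))  # 50.28
```

Note that the type-token ratio is sensitive to text length, so values are only comparable between samples of similar size.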

Dutch Book Reviews Dataset (DBRD)

The DBRD (pronounced dee-bird) dataset contains over 110k book reviews, each with an associated binary sentiment polarity label. It is heavily influenced by the Large Movie Review Dataset and is intended as a benchmark for sentiment classification in Dutch.