Other corpora

From K-Dutch ATO
Revision as of 14:21, 28 January 2022 by Griet

BasiLex-corpus

The BasiLex corpus is an annotated collection of texts written for children aged four to twelve.

BasiScript-corpus

The BasiScript corpus is an annotated collection of texts written by children aged four to twelve.

Dutch Audio Description Corpus

The Dutch Audio Description corpus contains the transcribed audio description (AD) texts of 39 audio-described Dutch films and TV series, in total 154,570 words and 3,074 minutes of video. The corpus was used to extract quantitative data on the language of AD: frequency counts of parts of speech, words, lemmas and collocations, as well as other relevant text statistics such as reading speed, word and sentence length, text readability and type-token ratio (a statistical measure of lexical variety).
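
To make two of the statistics mentioned above concrete, here is a minimal Python sketch of a type-token ratio and a mean word length. It is an illustration only, not the procedure used by the corpus compilers: the regex tokenizer, the function names and the sample sentence are all assumptions made for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of letters as tokens.

    A deliberately simple tokenizer (assumption); a real corpus study
    would more likely use a dedicated Dutch tokenizer.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def type_token_ratio(tokens):
    """Number of distinct word forms (types) divided by number of tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def mean_word_length(tokens):
    """Average token length in characters."""
    if not tokens:
        return 0.0
    return sum(len(t) for t in tokens) / len(tokens)


# Hypothetical sample text, not taken from the corpus.
sample = "De man loopt naar de deur. De deur gaat open."
tokens = tokenize(sample)

print(len(tokens))               # 10 tokens
print(type_token_ratio(tokens))  # 7 types / 10 tokens = 0.7
print(mean_word_length(tokens))
```

Note that the type-token ratio depends on text length (longer texts tend to score lower), so values are only comparable between samples of similar size.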

Dutch Book Reviews Dataset (DBRD)

The DBRD (pronounced dee-bird) dataset contains over 110k book reviews along with associated binary sentiment polarity labels. It is greatly influenced by the Large Movie Review Dataset and intended as a benchmark for sentiment classification in Dutch.