Reference corpora

From Clarin K-Centre

Corpus Hedendaags Nederlands

A collection of more than 800,000 texts taken from newspapers, magazines, news broadcasts and legal writings (1814-2013).

The corpus is a combination of the 5, 27 and 38 Million Words Corpora and the PAROLE Corpus, supplemented with newspaper texts from NRC and De Standaard (until 2013).

Lassy Large

The Lassy Large Corpus is a collection of written texts consisting of approximately 700 million words with automatically generated annotations. The lemmas and POS tags were generated with Tadpole (now Frog), and the syntactic dependency structures were generated with Alpino.

SoNaR corpus

The SoNaR corpus is a text corpus consisting of two parts, namely SoNaR-500 and SoNaR-1.

SoNaR-500 contains more than 500 million words of text from various domains and genres. All texts were tokenized, POS tagged and lemmatized. The named entities were also labeled. All SoNaR-500 annotations were generated automatically.

SoNaR-1 is largely a subset of SoNaR-500 and contains 1 million words. It carries several types of semantic annotation: named entity labeling, co-reference annotation, and the annotation of spatial and temporal relationships. All SoNaR-1 annotations were manually verified.

The new media texts (tweets, chats and text messages), which were also collected within the framework of the STEVIN project SoNaR, are not part of the SoNaR corpus 1.0 and are available separately as the SoNaR New Media Corpus.