Spoken corpora

From Clarin K-Centre

Spoken corpora are corpora that consist of spoken data or material based on spoken data.

Boarnsterhim Corpus (BHC)

The Boarnsterhim Corpus consists of 250 hours of speech in both West Frisian and Dutch by the same sample of bilingual speakers. The corpus contains original recordings from 1982-1984 and a replication study recorded 35 years later. The data collection spans speech of four generations, and combines panel and trend data.

  • 42.6 MB
  • version 1.0 (2020)
  • data set from 1982-1984 + replication 35 years later
  • Download page

Spoken Dutch Corpus - Corpus Gesproken Nederlands

Almost 9 million words of contemporary spoken Dutch from native speakers in Flanders and the Netherlands.

The speech recordings are aligned with several transcriptions (e.g. orthographic, phonetic) and annotations (syntax, POS tags). Metadata, lexica, frequency lists, and Corex, a tool for exploring the data, are included.

IFA Spoken Language Corpus

The IFA Spoken Language Corpus is a free (GPL) database of hand-segmented Dutch speech. It was constructed with off-the-shelf software using speech from 8 speakers in a variety of speaking styles. The corpus totals 50,000 words (41 minutes per speaker); speech acquisition and preparation took around 3 person-weeks per speaker.

JASMIN Speech Corpus

A corpus of contemporary Dutch (Dutch/Flemish) as spoken by children of different age groups, elderly people, and non-native speakers with different mother tongues. It also includes speech from human-machine interaction.