RUV TV unknown speakers

About the RUV TV unknown speakers corpus
---------------------------

The RUV TV unknown speakers corpus is 281 hours of TV data from six RÚV TV shows. The data contains 221,759 utterances from various unlabelled speakers. The text is normalized. The data is aligned and segmented, ready for ASR training. Audio conditions vary between recordings.

This data set is published by the Icelandic National Broadcasting Service - Ríkisútvarpið (RÚV) and was created jointly by RÚV and Reykjavík University.

This work is licensed under the Creative Commons Attribution 4.0 International License.

This is a broadcast dataset collected from RÚV by Reykjavík University in 2019-2020. All episodes in this dataset aired in 2019 at the latest. All episodes were recorded as digital originals. The text originates from RÚV subtitles (.vtt) and teletext (888). Audio files are 16 kHz single-channel FLAC files created from the original .mp4 episodes. The alignment was done using the Kaldi Speech Recognition Toolkit (https://github.com/kaldi-asr/kaldi) and the scripts from our alignment repository (https://github.com/cadia-lvl/alignment-and-segmentation).

This dataset was released in February 2022 (2022-02).

The dataset contains data from the following 6 shows:
Fréttir kl. 19:00 - prime time news
Kastljós - news commentary
Kiljan - literature discussion
Krakkafréttir - news for children
Menningin - arts and culture show
Stundin Okkar - children's variety show

This dataset complements the RÚV TV data. There are no overlapping episodes:
Helgadottir, Inga Run; Fong, Judy Yum; Gudnason, Jon; et al., 2020, RÚV TV data, CLARIN-IS, http://hdl.handle.net/20.500.12537/93.

The structure of the corpus
---------------------------

|
. - docs/
|     . - README.txt
|
. - data/
|     . - metadata.tsv
|     . - text
|     . - audio/
|           . - Frettirkl1900/
|                 . - 4942689/
|                       . - 4942689-00000.flac
|                       . - ...
|           . - Kastljos/
|           . - Kiljan/
|           . - Krakkafrettir/
|           . - Menningin/
|           . - StundinOkkar/
|
. - filename.filetype

- metadata.tsv - This is a tab-separated file containing utterance_id, episode_id, show_id, and duration (seconds). The path of an audio file can be constructed from the show_id, episode_id, and utterance_id (data/audio/show_id/episode_id/utterance_id.flac). Within each show, the episode numbers are sequential, meaning episode 4813755 of Kiljan aired before 4813757.
- text - This is a text file in the format required for Kaldi's data directories. Each line contains the utterance_id followed by the text spoken within the utterance. Unrecognized words are represented with UNK.

Statistics
----------

6 TV shows
281 hrs
221766 utterances

Authors
-------

Reykjavík University
    Judy Y Fong - judy@judyyfong.xyz
    Inga Run Helgadottir
    Helga Svala Sigurðardóttir
    Michal Borsky
    Ragnheiður Þórhallsdóttir
    Jon Gudnason - jg@ru.is

The Icelandic National Broadcasting Service - Ríkisútvarpið (RÚV)
    Helga Lara Thorsteinsdottir

Acknowledgements
----------------

This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science and Culture.

License
-------

This dataset is licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
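
Example: locating audio files and transcripts
---------------------------------------------

The path rule given for metadata.tsv (data/audio/show_id/episode_id/utterance_id.flac) and the Kaldi-style text file can be read with a few lines of Python. The following is a minimal sketch, not an official loader. It assumes that the columns of metadata.tsv appear in the order utterance_id, episode_id, show_id, duration, that a header row may or may not be present, and that show_id matches the directory name under data/audio/. The function names (audio_path, read_metadata, read_text) are hypothetical and not part of the corpus.

```python
import csv
from pathlib import PurePosixPath


def audio_path(show_id, episode_id, utterance_id, root="data/audio"):
    """Build data/audio/show_id/episode_id/utterance_id.flac."""
    return str(PurePosixPath(root) / show_id / episode_id / f"{utterance_id}.flac")


def read_metadata(lines):
    """Yield one dict per metadata.tsv row.

    Assumed column order: utterance_id, episode_id, show_id, duration.
    A header row starting with "utterance_id" is skipped if present.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0] == "utterance_id":
            continue
        utterance_id, episode_id, show_id, duration = row[:4]
        yield {
            "utterance_id": utterance_id,
            "episode_id": episode_id,
            "show_id": show_id,
            "duration": float(duration),  # seconds
            "path": audio_path(show_id, episode_id, utterance_id),
        }


def read_text(lines):
    """Map utterance_id to transcript for a Kaldi-style text file.

    Each line is expected to hold the utterance_id, whitespace, then the text.
    """
    transcripts = {}
    for line in lines:
        parts = line.strip().split(maxsplit=1)
        if not parts:
            continue
        transcripts[parts[0]] = parts[1] if len(parts) > 1 else ""
    return transcripts
```

With the corpus unpacked, the functions could be used as `with open("data/metadata.tsv", encoding="utf-8", newline="") as f: rows = list(read_metadata(f))` and `with open("data/text", encoding="utf-8") as f: transcripts = read_text(f)`. Both functions accept any iterable of lines, so they can also be tried on a few hand-written rows first.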