dc.contributor.author | Pálsson, Ragnar |
dc.contributor.author | Sigurðardóttir, Helga Svala |
dc.contributor.author | Þórhallsdóttir, Ragnheiður |
dc.contributor.author | Borsky, Michal |
dc.contributor.author | Þorsteinsdóttir, Helga Lára |
dc.contributor.author | Helgadóttir, Inga Rún |
dc.contributor.author | Guðnason, Jón |
dc.contributor.author | Daníelsson, Hilmar |
dc.contributor.author | Fong, Judy Y |
dc.contributor.author | Gunnarsson, Þorsteinn Daði |
dc.date.accessioned | 2022-02-22T14:31:36Z |
dc.date.available | 2022-02-22T14:31:36Z |
dc.date.issued | 2022-02-17 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/193 |
dc.description | The Icelandic broadcast speech corpus comprises 193 hours of radio and TV data from RÚV. The radio data consists of episodes of Spegillinn, morning news, evening news, Morgunútvarpið, Morgunvaktin and Samfélagið. The TV data consists of episodes of Kastljós. All the data is from episodes broadcast between January 2020 and August 2021. The data contains 40,746 utterances from 1,360 speakers. The text is normalized. The data is aligned and segmented, ready for ASR training. The data set includes both prompted speech (e.g. from the news) and conversational speech (e.g. Morgunvaktin and Kastljós). This data set is published by RÚV, transcribed by Creditinfo and aligned at Reykjavík University with the help of Tiro's automatic speech recognizer. Special thanks to Tiro for supplying transcriptions with per-word timestamps, which were essential in the alignment process. This work is licensed under the Creative Commons Attribution 4.0 International License. |
dc.language.iso | isl |
dc.publisher | Reykjavík University |
dc.publisher | The Icelandic National Broadcasting Service - Ríkisútvarpið (RÚV) |
dc.publisher | Creditinfo Fjölmiðlavaktin ehf. |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.subject | broadcast |
dc.subject | radio broadcast |
dc.subject | tv broadcast |
dc.subject | automatic speech recognition |
dc.subject | ruv radio |
dc.subject | ruv tv |
dc.title | Icelandic broadcast speech |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | audio |
has.files | yes |
branding | Clarin IS Repository |
contact.person | Ragnar Pálsson; ragnarp@ru.is; Reykjavík University |
contact.person | Jón Guðnason; jg@ru.is; Reykjavík University |
sponsor | Ministry of Education, Science and Culture; Transcribe and align radio and TV material (H2); Language Technology for Icelandic 2019-2023; nationalFunds |
size.info | 66 GB |
size.info | 193 hours |
size.info | 40746 utterances |
size.info | 1966745 words |
files.size | 27663814912 |
files.count | 10 |
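The size fields above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the record: the function names are illustrative, and the averages are only approximate because the 193-hour figure is rounded. It suggests that the 25.76 GB download size shown on this page is files.size expressed in binary gigabytes (2**30 bytes).

```python
# Figures copied from the size.info and files.size fields of this record.
TOTAL_HOURS = 193
UTTERANCES = 40_746
WORDS = 1_966_745
FILES_SIZE_BYTES = 27_663_814_912


def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to binary gigabytes (1 GiB = 2**30 bytes)."""
    return n_bytes / 2**30


def mean_utterance_seconds(hours: float, utterances: int) -> float:
    """Average utterance duration in seconds (approximate: hours is rounded)."""
    return hours * 3600 / utterances


def mean_words_per_utterance(words: int, utterances: int) -> float:
    """Average number of words per utterance."""
    return words / utterances


if __name__ == "__main__":
    print(f"download size: {bytes_to_gib(FILES_SIZE_BYTES):.2f} GiB")
    print(f"mean utterance length: {mean_utterance_seconds(TOTAL_HOURS, UTTERANCES):.2f} s")
    print(f"mean words per utterance: {mean_words_per_utterance(WORDS, UTTERANCES):.2f}")
```

The separate size.info value of 66 GB is not derived here; the record does not say how it was measured.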
Files in this item
Download all files in item (25.76 GB)
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name | Size | Format | Description | MD5 |
README.txt | 3.83 KB | Text file | Unknown | bee3bc2439355fee77156ab87078562a |
metadata.tsv | 14.31 MB | Unknown | Unknown | 43c1705145d20497f3bad1560502206a |
speaker_information.tsv | 49.51 KB | Unknown | Unknown | 512657b60add3e71843b85ab0d7ce0c7 |
cut_audio_RELEASE.z01 | 3.91 GB | Unknown | Unknown | 6b3ab042fa8ad276fced07352676f406 |
cut_audio_RELEASE.z02 | 3.91 GB | Unknown | Unknown | 459be745d78ac230ed815eb5f00d3e06 |
cut_audio_RELEASE.z03 | 3.91 GB | Unknown | Unknown | aa5de9a1fef84ae23f3782584d124c9b |
cut_audio_RELEASE.z04 | 3.91 GB | Unknown | Unknown | 436e5ccac3187b72d021888739979eac |
cut_audio_RELEASE.z05 | 3.91 GB | Unknown | Unknown | 141bdc0612d9d9cfcae1aafb553a507f |
cut_audio_RELEASE.z06 | 3.91 GB | Unknown | Unknown | 420de02e960b61d27b7045a8a1c1c45c |
cut_audio_RELEASE.zip | 2.31 GB | application/zip | Unknown | 096ae66512323726f3b3ecaffd9673af |

README.txt preview:

Icelandic broadcast speech

About the Icelandic broadcast speech corpus
---------------------------

The Icelandic broadcast speech corpus is 193 hours of radio and TV data from RÚV. The radio data consists of episodes of Spegillinn, morning news, evening news, Morgunútvarpið, Morgunvaktin and Samfélagið. The TV data consists of episodes of Kastljós. All the data is from episodes broadcast in the period from January 2020 to August 2021. The data contains 40,746 utterances from 1,360 speakers. The data is aligned and segmented, ready for ASR training. The data set includes both prompted speech (e.g. from the News) and conversational speech (e.g. Morgunvaktin and Kastljósið). This data set is published by RÚV, transcribed by Creditinfo and aligned at Reykjavik University with the help of Tiro's automatic speech recognizer. Special thanks to Tiro for supplying transcriptions with per-word timestamps from their automatic speech recognizer, which were essential in the alig . . .
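The MD5 values in the file listing can be used to confirm that each download arrived intact, which matters most for the multi-gigabyte archive parts. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `verify_downloads` and the directory layout (all ten files in one folder) are assumptions for illustration, not something the repository provides.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing in this record.
EXPECTED_MD5 = {
    "README.txt": "bee3bc2439355fee77156ab87078562a",
    "metadata.tsv": "43c1705145d20497f3bad1560502206a",
    "speaker_information.tsv": "512657b60add3e71843b85ab0d7ce0c7",
    "cut_audio_RELEASE.z01": "6b3ab042fa8ad276fced07352676f406",
    "cut_audio_RELEASE.z02": "459be745d78ac230ed815eb5f00d3e06",
    "cut_audio_RELEASE.z03": "aa5de9a1fef84ae23f3782584d124c9b",
    "cut_audio_RELEASE.z04": "436e5ccac3187b72d021888739979eac",
    "cut_audio_RELEASE.z05": "141bdc0612d9d9cfcae1aafb553a507f",
    "cut_audio_RELEASE.z06": "420de02e960b61d27b7045a8a1c1c45c",
    "cut_audio_RELEASE.zip": "096ae66512323726f3b3ecaffd9673af",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to 'ok', 'mismatch' or 'missing'."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == want:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # "." is a placeholder for the folder the files were downloaded into.
    for name, status in verify_downloads(".").items():
        print(f"{status:9s} {name}")
```

Any file reported as a mismatch is best downloaded again before the split archive (the .z01 to .z06 parts plus the final .zip) is unpacked, since a single damaged part is likely to make extraction fail.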