Spjallromur - Icelandic Conversational Speech

About the Spjallrómur corpus
----------------------------
Spjallrómur is an open source conversational speech corpus for speech technology development. The corpus is 21 hrs and 20 mins long, with 54 conversations and 102 speakers in total. The data was collected over one year (September 2020 - September 2021) by Reykjavík University. The corpus has two parts: the first contains full conversations and the second contains half conversations.

The dataset was primarily created for automatic speech recognition, but due to its conversational nature it can also be used in other speech technology fields such as speaker identification, speaker diarization, and conversational language modeling.

Spjallrómur was collected using a custom-made online chat platform called spjall, which is Icelandic for chat. Each speaker used their own microphone and device; some microphones picked up background noise such as neighboring speakers or other speakers. The audio from each microphone was saved to a separate .wav audio file.

There are two speakers per conversation. The speaker set contains both native and non-native Icelandic speakers. All speakers are adults. Each conversation has two sets of files (demographics metadata, audio file, and transcript), one set for each speaker. Due to some network lag, there is sometimes a small difference in length between the two audio files in a conversation. As there was a limited number of participants, some speakers may be in more than one conversation. The text has not been aligned with the audio.

The full conversations part contains 48 full conversations (19 hrs, 96 speakers).
The half conversations part contains 6 partial conversations (2 hrs 20 mins, 6 speakers).

Personally identifiable information has been redacted in the audio with a 400 Hz beep and replaced with XXX in the transcript. Non-words are marked with () or []. Partial words are marked with [HIK: ..].

The structure of the corpus
---------------------------
| . - docs/
|     . - spjallromur_README.txt
|     . - manual_transcripts.json
| . - data/
|     . - half_conversations/
|         . - /
|             . - speaker_x_convo__demographics.json
|             . - speaker_x_convo__transcript.json
|             . - speaker_x_convo_.wav
|     . - full_conversations/
|         . - /
|             . - speaker_a_convo__demographics.json
|             . - speaker_a_convo__transcript.json
|             . - speaker_a_convo_.wav
|             . - speaker_b_convo__demographics.json
|             . - speaker_b_convo__transcript.json
|             . - speaker_b_convo_.wav

Session IDs are 8-character hexadecimal identifiers (e.g. 0f2c315c, 3ac74ae1).
Half conversations have one speaker per session (speaker_a or speaker_b).
Full conversations have two speakers per session (speaker_a and speaker_b).

* speaker_x_convo_.wav - Each audio file is a 16-bit, 16000 Hz, single-channel WAVE file. It contains the voice of speaker x in that conversation.
* speaker_x_convo__transcript.json - JSON file of the transcript and corresponding metadata, generated within the Tiro text editor (https://tal.tiro.is). Word-level segments with timing. Metadata includes: name (filename), fileType, languageCode, recordingDuration (float seconds), speakers, audio_file, session_id.
* speaker_x_convo__demographics.json - Contains session_id (8-character identifier), the speaker's age, gender, and audio duration in seconds.

Gender and age are in Icelandic. Here's the mapping to English:

gender
    id: 'kona', name: 'female'
    id: 'karl', name: 'male'
    id: 'annad', name: 'other'

age group
    id: 'unglingur', name: '18-19'
    id: 'tvitugt', name: '20-29'
    id: 'thritugt', name: '30-39'
    id: 'fertugt', name: '40-49'
    id: 'fimmtugt', name: '50-59'
    id: 'sextugt', name: '60-69'
    id: 'sjotugt', name: '70-79'
    id: 'attraett', name: '80-89'
    id: 'niraett', name: '90+'

Transcript formats
------------------
Two transcript formats are provided:

1. Automatic transcripts (*_transcript.json) - One file per speaker per session, in each session folder. Word-level granularity with segments containing timed words.
   All 54 sessions (48 full + 6 half) have individual transcripts.

2. Manual transcripts (docs/manual_transcripts.json) - A subset of full conversations (21 sessions, ~7.9 hrs) has been manually transcribed by three transcribers (Baldur, Hrafnhildur, Selma). Seven sessions (~3.0 hrs) have two or more transcriptions by different transcribers; use transcript_id (session_id_transcriber) to distinguish entries.
   Each conversation entry contains: session_id, transcript_id, speakers, speaker_a and speaker_b (each with recordingDuration, audio_file, fileType), transcriber, and turns (speaker, text, startTime, endTime in float seconds).
   When multiple transcribers transcribed the same session, they appear as separate entries in the consolidated file.

   Transcription methodology differed by transcriber:
   - Selma transcribed by listening to each channel (speaker A, speaker B) separately. Speaker identity was thus known in advance.
   - Baldur and Hrafnhildur transcribed from a mix of the two channels. They performed transcription and speaker diarization at the same time.

   There is no word-level segmentation; turn boundaries reflect each transcriber's own judgment of where one utterance ends and the next begins.

Authors
-------
Reykjavík University

Judy Y Fong - judy@judyyfong.xyz
Staffan Hedström
Ólafur Helgi Jónsson
Lára Margrét H. Hólmfriðardóttir
Sunneva Þorsteinsdóttir
Málfriður Anna Eiríksdóttir
David Erik Mollberg
Eydís Huld Magnúsdóttir
Ragnheiður Þórhallsdóttir
Jon Gudnason - jg@ru.is

Acknowledgements
----------------
Special thanks to the other members of the Language and Voice Lab (https://lvl.ru.is), the student employees, Róbert Kjaran, and Magnús Teitsson.

This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science and Culture.
This project was funded in part by the Icelandic Directorate of Labour's student summer job program in 2021.

Citations
---------
@misc{fong-spjallromur,
    title = {Spjallromur - Icelandic Conversational Speech},
    author = {Fong, Judy Y and Hedstr{\"o}m, Staffan and J{\'o}nsson, {\'O}lafur Helgi and H{\'o}lmfri{\dh}ard{\'o}ttir, L{\'a}ra Margr{\'e}t H. and {\TH}orsteinsd{\'o}ttir, Sunneva and Eir{\'{\i}}ksd{\'o}ttir, M{\'a}lfri{\dh}ur Anna and Mollberg, David Erik and Magn{\'u}sd{\'o}ttir, Eyd{\'{\i}}s Huld and {\TH}{\'o}rhallsd{\'o}ttir, Ragnhei{\dh}ur and Gudnason, Jon},
    url = {},
    note = {{CLARIN}-{IS}},
    copyright = {Creative Commons - Attribution 4.0 International ({CC} {BY} 4.0)},
    year = {2022}
}

License
-------
This dataset is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
(https://creativecommons.org/licenses/by/4.0/)
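
Example: translating demographics values
----------------------------------------
The gender and age group ids in the demographics files are Icelandic, so a small lookup can be handy when preparing the metadata for analysis. The following is a minimal Python sketch, not part of the corpus tooling. The two dictionaries are copied from the mapping given in this README. The JSON field names "gender" and "age", the helper names, and the sample record are assumptions made for illustration; check an actual *_demographics.json file and adjust the keys if they differ.

```python
import json
from pathlib import Path

# Mappings copied from the "gender" and "age group" tables in this README.
GENDER_EN = {"kona": "female", "karl": "male", "annad": "other"}
AGE_GROUP_EN = {
    "unglingur": "18-19",
    "tvitugt": "20-29",
    "thritugt": "30-39",
    "fertugt": "40-49",
    "fimmtugt": "50-59",
    "sextugt": "60-69",
    "sjotugt": "70-79",
    "attraett": "80-89",
    "niraett": "90+",
}


def translate_demographics(record, gender_key="gender", age_key="age"):
    """Return a copy of `record` with gender and age group ids in English.

    `gender_key` and `age_key` are ASSUMED field names, not confirmed by
    the README. Values that are missing from the mappings are left as-is.
    """
    translated = dict(record)
    if gender_key in translated:
        value = translated[gender_key]
        translated[gender_key] = GENDER_EN.get(value, value)
    if age_key in translated:
        value = translated[age_key]
        translated[age_key] = AGE_GROUP_EN.get(value, value)
    return translated


def load_demographics(path):
    """Read one *_demographics.json file and translate it (hypothetical helper)."""
    with Path(path).open(encoding="utf-8") as handle:
        return translate_demographics(json.load(handle))


if __name__ == "__main__":
    # Made-up record for illustration only; it is not taken from the corpus.
    sample = {"session_id": "00000000", "gender": "kona", "age": "thritugt"}
    print(translate_demographics(sample))
```

Unknown ids are passed through unchanged rather than raising an error, so unexpected values stay visible in the output instead of being silently dropped.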