--------------------------------------------------------------------------------
Parallel Speech Recordings for Icelandic L1 and L2 speakers
--------------------------------------------------------------------------------

Language : Icelandic
Authors  : Caitlin Richter, Þorsteinn Daði Gunnarsson, Bjarni Barkarson,
           Kolbrún Friðriksdóttir, Branislav Bédi, Jón Guðnason

--------------------------------------------------------------------------------
Description
--------------------------------------------------------------------------------

This release of data is a parallel learner speech corpus containing the same
material read aloud by both native and non-native Icelandic speakers. The corpus
contains 28,747 utterances (26.2 hours) of mostly unverified speech recordings
in Icelandic.

The corpus is the result of a crowd-sourcing effort run by the Language and
Voice Lab (LVL) at Reykjavik University, in cooperation with University of
Iceland and Árni Magnússon Institute for Icelandic Studies. The recording
process took place from October 2021 through October 2024. The present edition
of the corpus was authorized for release in November 2024.

The aim is to create an open-source speech corpus to enable research and
development for Icelandic Language Technology and Computer Assisted Language
Learning. The corpus consists of audio recordings and a metadata file containing
the prompts read by the participants.

For more open resources developed by the Language and Voice Lab (LVL), see the
GitHub and Hugging Face repositories at
https://github.com/cadia-lvl/samromur-asr and
https://huggingface.co/language-and-voice-lab

--------------------------------------------------------------------------------
Corpus Characteristics
--------------------------------------------------------------------------------

- Speech in the corpus has not been validated for adherence to text prompts.
- The utterances were recorded by a smartphone or web app.
- Participants self-reported their age group, gender, native language, and
  Icelandic proficiency level.
- Participants' ages range from 8 years up to the 70-79 age group. 26 speakers
  are under 18 and 144 are 18 or older.
- The corpus contains 28,747 utterances from 170 speakers, totalling 26.2 hours.
- The number of female speakers is 112, and the number of male speakers is 58.
  No speakers in this collection had other or unknown gender information.
- The number of utterances from female speakers is 17,994, and the number from
  male speakers is 10,753.
- The corpus is NOT split into train, dev, and test sets.
- If any of the information in the metadata is unavailable, this is indicated
  with NAN in the metadata file.

--------------------------------------------------------------------------------
Collection Procedure
--------------------------------------------------------------------------------

The data was collected using the website https://samromur.is, the code of which
is available at https://github.com/cadia-lvl/samromur. The collection procedure
is described in "Samrómur: Crowd-sourcing Data Collection for Icelandic Speech
Recognition" [1].

Each time a device visits the website for the first time, it is assigned a
client ID. This client ID, together with the combination of gender, age, and
native language, was used to assign the speaker ID. If any of these variables
were changed, a new speaker ID was also created.

The corpus is distributed with a metadata file with detailed information on each
utterance and speaker. The metadata file is encoded as UTF-8 Unicode.

The original audio was collected at 16, 44.1, 48, or 96 kHz sampling rate as
*.wav files, according to participants' devices.

Each recording contains one read prompt from a script. The script contains 483
unique prompts consisting of 1,901 tokens and 897 word types. The corpus
contains at least 50 recordings from each speaker, and up to 461. There are at
least 2 recordings of each prompt, and up to 117.
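
The speaker ID rule described above (client ID plus self-reported gender, age,
and native language) can be illustrated with a minimal sketch. This is NOT the
Samrómur implementation: the class name, the field values, and the use of
sequential integers as speaker IDs are all assumptions made for illustration.

```python
from itertools import count


class SpeakerRegistry:
    """Hypothetical illustration of the speaker ID rule; not the Samrómur code."""

    def __init__(self):
        self._ids = {}          # (client_id, gender, age, native_language) -> speaker ID
        self._next = count(1)   # assumption: speaker IDs are sequential integers

    def speaker_id(self, client_id, gender, age, native_language):
        # The speaker ID depends on the client ID and all three self-reported
        # variables, so changing any one of them yields a new speaker ID.
        key = (client_id, gender, age, native_language)
        if key not in self._ids:
            self._ids[key] = next(self._next)
        return self._ids[key]


registry = SpeakerRegistry()
first = registry.speaker_id("client-a", "female", "20-29", "Polish")
same = registry.speaker_id("client-a", "female", "20-29", "Polish")
changed = registry.speaker_id("client-a", "female", "30-39", "Polish")
```

Here `first` and `same` are equal, while `changed` differs because the age group
was altered. One consequence of this rule is that a single person who edits
their profile, or who records from several devices, may appear under more than
one speaker ID.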

The prompts were produced by linguists and teachers of Icelandic as a second
language at University of Iceland and Árni Magnússon Institute for Icelandic
Studies. The prompt list covers the main phonetic exercises currently presented
to students at University of Iceland and has been considered with respect to
both phonetics and pedagogy.

--------------------------------------------------------------------------------
Data Format Specifics
--------------------------------------------------------------------------------

- Text : The corpus does not contain separate transcription or prompt files.
         The metadata file contains the prompts in their original text form, as
         the participants saw them, and also in their normalized form.
- Audio: The distributed audio files are encoded as 16 bit linear PCM,
         1 channel, *.wav format. The sampling rate is 16 kHz, 44.1 kHz, 48 kHz,
         or 96 kHz, depending on the device used for recording; the original
         audio is distributed in this release. The audio for the utterances is
         located in the audio folder, which contains one folder per speaker ID.
         The audio files inside use the following naming convention:
         {speaker_ID}-{utterance_ID}.wav.

--------------------------------------------------------------------------------
Citation
--------------------------------------------------------------------------------

When publishing results based on the corpus please refer to:

Richter et al. "Parallel Speech Recordings for Icelandic L1 and L2 speakers".
Web Download. Reykjavik University: Language and Voice Lab, 2024.

Contact: Jon Gudnason (jg@ru.is)
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/legalcode)

--------------------------------------------------------------------------------
Acknowledgements
--------------------------------------------------------------------------------

This project was funded by the Ministry of Culture and Business Affairs
Project code: Parallel Speech Recordings for Icelandic L1 and L2 speakers
Project name: Language Technology for Icelandic

Special thanks to the assisting LVL members and summer students for all the hard
work.

--------------------------------------------------------------------------------
Stats for the dataset
--------------------------------------------------------------------------------

Language background and gender split:

|                  | Speakers | Recordings |
| ---------------- | -------- | ---------- |
| L1-Icelandic:    | 79       | 13987      |
| L2-Icelandic:    | 91       | 14760      |
| ---------------- | -------- | ---------- |
| Female:          | 112      | 17994      |
| Male:            | 58       | 10753      |
| ---------------- | -------- | ---------- |

Detailed language background of speakers:

| -------- | ---------- | -------------- |
| Speakers | Recordings | Language       |
| -------- | ---------- | -------------- |
| 79       | 13987      | Icelandic      |
| 24       | 3706       | English        |
| 7        | 1374       | Russian        |
| 5        | 1106       | French         |
| 8        | 1098       | Polish         |
| 4        | 985        | Italian        |
| 8        | 902        | Spanish        |
| 7        | 888        | German         |
| 4        | 830        | Hungarian      |
| 2        | 486        | Latvian        |
| 2        | 474        | Persian        |
| 1        | 460        | Other          |
| 1        | 425        | Ukrainian      |
| 3        | 392        | Serbo-Croatian |
| 2        | 354        | Czech          |
| 1        | 297        | Romanian       |
| 2        | 210        | Filipino       |
| 2        | 188        | Slovak         |
| 1        | 172        | Norwegian      |
| 2        | 124        | Danish         |
| 1        | 70         | Vietnamese     |
| 1        | 60         | Portuguese     |
| 1        | 56         | Catalan        |
| 1        | 53         | Greek          |
| 1        | 50         | Turkish        |
| -------- | ---------- | -------------- |

Total speakers and utterances:

Speakers: 170
Utterances: 28,747
Average utterance length: 3.28 s
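
The figures in this section can be cross-checked against each other. The sketch
below copies the numbers from the tables by hand (it does not read the metadata
file, whose column names are not documented here) and confirms that each
breakdown sums to 170 speakers and 28,747 recordings, and that 26.2 hours over
28,747 utterances gives the stated average length. Variable and function names
are illustrative only.

```python
# All figures below are copied by hand from the tables in this README.
TOTAL_SPEAKERS = 170
TOTAL_RECORDINGS = 28_747
TOTAL_HOURS = 26.2

# Each entry maps a group to (speakers, recordings).
by_background = {"L1-Icelandic": (79, 13_987), "L2-Icelandic": (91, 14_760)}
by_gender = {"Female": (112, 17_994), "Male": (58, 10_753)}
by_language = {
    "Icelandic": (79, 13_987), "English": (24, 3_706), "Russian": (7, 1_374),
    "French": (5, 1_106), "Polish": (8, 1_098), "Italian": (4, 985),
    "Spanish": (8, 902), "German": (7, 888), "Hungarian": (4, 830),
    "Latvian": (2, 486), "Persian": (2, 474), "Other": (1, 460),
    "Ukrainian": (1, 425), "Serbo-Croatian": (3, 392), "Czech": (2, 354),
    "Romanian": (1, 297), "Filipino": (2, 210), "Slovak": (2, 188),
    "Norwegian": (1, 172), "Danish": (2, 124), "Vietnamese": (1, 70),
    "Portuguese": (1, 60), "Catalan": (1, 56), "Greek": (1, 53),
    "Turkish": (1, 50),
}


def totals(table):
    """Return (total speakers, total recordings) for one breakdown table."""
    speakers = sum(s for s, _ in table.values())
    recordings = sum(r for _, r in table.values())
    return speakers, recordings


# Every breakdown should account for the whole corpus.
for table in (by_background, by_gender, by_language):
    assert totals(table) == (TOTAL_SPEAKERS, TOTAL_RECORDINGS)

# All non-Icelandic native languages together should equal the L2 row.
l2_only = {k: v for k, v in by_language.items() if k != "Icelandic"}
assert totals(l2_only) == by_background["L2-Icelandic"]

# Average utterance length in seconds: total duration / number of utterances.
average_seconds = TOTAL_HOURS * 3600 / TOTAL_RECORDINGS
print(f"{average_seconds:.2f} s per utterance")  # prints "3.28 s per utterance"
```

Because the total duration is given to one decimal place of an hour, the
average computed this way is only approximate; it agrees with the stated
3.28 s to two decimal places.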

--------------------------------------------------------------------------------
References
--------------------------------------------------------------------------------

[1] Mollberg et al. "Samrómur: Crowd-sourcing Data Collection for Icelandic
    Speech Recognition," 12th International Conference on Language Resources
    and Evaluation (LREC), France, 2020.

--------------------------------------------------------------------------------