dc.contributor.author Mollberg, David Erik
dc.contributor.author Jónsson, Ólafur Helgi
dc.contributor.author Þorsteinsdóttir, Sunneva
dc.contributor.author Guðmundsdóttir, Jóhanna Vigdís
dc.contributor.author Steingrímsson, Steinþór
dc.contributor.author Magnúsdóttir, Eydís Huld
dc.contributor.author Fong, Judy Y
dc.contributor.author Borsky, Michal
dc.contributor.author Gudnason, Jon
dc.date.accessioned 2022-02-01T09:11:16Z
dc.date.available 2022-02-01T09:11:16Z
dc.date.issued 2021-05
dc.identifier.uri http://hdl.handle.net/20.500.12537/189
dc.description This is the first release of data from the Samrómur collection. The corpus contains 100,000 validated speech recordings (145 hours) in Icelandic. It is the result of a crowd-sourcing effort run by the Language and Voice Lab (LVL) at the artificial intelligence center of Reykjavik University, in cooperation with Almannarómur, the center for language technology in Iceland. Recording started in October 2019 and was still ongoing as of May 2021, when this release was authorized. The corpus contains 8,392 different speakers and is split into train, dev, and test subsets with no speaker overlap. The subset lengths are: train = 114 h, test = 15 h, dev = 15 h. Each subset contains folders that correspond to speaker IDs, and the audio files inside use the following naming convention: {speaker_ID}-{utterance_ID}.flac. The average recording length is 5.2 seconds.
dc.language.iso isl
dc.publisher Reykjavík University
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://github.com/cadia-lvl/samromur
dc.subject audio corpus
dc.subject speech recognition
dc.subject automatic speech recognition
dc.title Samromur 21.05
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType audio
has.files yes
branding Clarin IS Repository
contact.person Jon Gudnason jg@ru.is Reykjavík University
sponsor Ministry of Education, Science and Culture Data recording using Eyra/Samrómur (H1) Language Technology for Icelandic 2019-2023 nationalFunds
size.info 145 hours
size.info 8 GB
size.info 100000 utterances
files.size 7046081539
files.count 4
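
The naming convention given in dc.description ({speaker_ID}-{utterance_ID}.flac, inside a folder named after the speaker ID) can be parsed with a few lines of Python. This is a minimal sketch, not code from the corpus authors: the names `Utterance` and `parse_utterance_path` are hypothetical, the sample IDs are made up, and it assumes that paths look like `<subset>/<speaker_ID>/<file>` and that the speaker ID itself contains no hyphen.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class Utterance(NamedTuple):
    """One recording, identified by the pieces of its path (hypothetical type)."""
    subset: str        # expected to be "train", "dev", or "test"
    speaker_id: str
    utterance_id: str


def parse_utterance_path(path: str) -> Utterance:
    """Split '<subset>/<speaker_ID>/<speaker_ID>-<utterance_ID>.flac' into parts."""
    p = PurePosixPath(path)
    if p.suffix != ".flac":
        raise ValueError(f"not a FLAC file: {path}")
    if len(p.parts) < 3:
        raise ValueError(f"expected <subset>/<speaker_ID>/<file>: {path}")

    # Assumption: the speaker ID contains no hyphen, so the first
    # hyphen in the file stem separates the two IDs.
    speaker_id, sep, utterance_id = p.stem.partition("-")
    if not sep or not speaker_id or not utterance_id:
        raise ValueError(f"file name does not match the convention: {path}")

    subset, folder = p.parts[-3], p.parts[-2]
    # The record states that folders correspond to speaker IDs.
    if folder != speaker_id:
        raise ValueError(f"folder {folder!r} does not match speaker {speaker_id!r}")
    return Utterance(subset, speaker_id, utterance_id)


if __name__ == "__main__":
    # Made-up IDs, for illustration only.
    print(parse_utterance_path("train/000123/000123-0004567.flac"))
```

Checking the folder name against the speaker ID in the file name is a cheap way to catch files that were moved between speaker folders after extraction.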


 Files in this item

 Download all files in item (6.56 GB)
This item is publicly available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name                      Size       Format           Description           MD5
samromur_21.05.multi.zip  719.67 MB  application/zip  main file for corpus  91c7fd4c7b97d53178a000f61d27d372
samromur_21.05.multi.z01  1.95 GB    Unknown          part 1                f1b0cca62f6099b413dc5ebadc84cba9
samromur_21.05.multi.z02  1.95 GB    Unknown          part 2                d6a4a56db8d85d79cddbaa3a900fbf0d
samromur_21.05.multi.z03  1.95 GB    Unknown          part 3                35b90a6aa5ac4d40c58ded8a5f0ead5d
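
Because the archive is split into four parts, it may be worth checking each download against the MD5 values listed for this item before extracting. The following is a minimal sketch using Python's standard `hashlib`. The function names `md5_of_file` and `verify_downloads` are hypothetical, and the sketch assumes all four files were saved under their original names in one local directory.

```python
import hashlib
from pathlib import Path
from typing import Dict

# File names and MD5 checksums as listed for this item.
EXPECTED_MD5 = {
    "samromur_21.05.multi.zip": "91c7fd4c7b97d53178a000f61d27d372",
    "samromur_21.05.multi.z01": "f1b0cca62f6099b413dc5ebadc84cba9",
    "samromur_21.05.multi.z02": "d6a4a56db8d85d79cddbaa3a900fbf0d",
    "samromur_21.05.multi.z03": "35b90a6aa5ac4d40c58ded8a5f0ead5d",
}


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte parts never sit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected: Dict[str, str] = EXPECTED_MD5) -> Dict[str, str]:
    """Map each expected file name to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == want:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # Assumption: the downloads are in the current working directory.
    for name, status in verify_downloads(".").items():
        print(f"{status:9s} {name}")
```

MD5 is used here only because it is the checksum the repository publishes. It detects corrupted or truncated downloads but is not a defense against deliberate tampering.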
