dc.contributor.author | Mollberg, David Erik |
dc.contributor.author | Jónsson, Ólafur Helgi |
dc.contributor.author | Þorsteinsdóttir, Sunneva |
dc.contributor.author | Guðmundsdóttir, Jóhanna Vigdís |
dc.contributor.author | Steingrímsson, Steinþór |
dc.contributor.author | Magnúsdóttir, Eydís Huld |
dc.contributor.author | Fong, Judy Y |
dc.contributor.author | Borsky, Michal |
dc.contributor.author | Gudnason, Jon |
dc.date.accessioned | 2022-02-01T09:11:16Z |
dc.date.available | 2022-02-01T09:11:16Z |
dc.date.issued | 2021-05 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/189 |
dc.description | This is the first release of data from the Samrómur collection. The corpus contains 100,000 validated speech recordings in Icelandic (145 hours). It is the result of a crowd-sourcing effort run by the Language and Voice Lab (LVL) at Reykjavik University, in cooperation with Almannarómur, Center for Language Technology. Recording started in October 2019 and was still ongoing as of May 2021. This release was authorized in May 2021. The corpus contains 8,392 different speakers and is split into train, dev, and test subsets with no speaker overlap. Subset lengths are: train = 114 h, test = 15 h, dev = 15 h. Each subset contains folders that correspond to speaker IDs, and the audio files inside use the following naming convention: {speaker_ID}-{utterance_ID}.flac. The average recording length is 5.2 seconds. |
dc.language.iso | isl |
dc.publisher | Reykjavík University |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://github.com/cadia-lvl/samromur |
dc.subject | audio corpus |
dc.subject | speech recognition |
dc.subject | automatic speech recognition |
dc.title | Samromur 21.05 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | audio |
has.files | yes |
branding | Clarin IS Repository |
contact.person | Jon Gudnason, jg@ru.is, Reykjavík University |
sponsor | Ministry of Education, Science and Culture; Data recording using Eyra/Samrómur (H1); Language Technology for Icelandic 2019-2023; nationalFunds |
size.info | 145 hours |
size.info | 8 gb |
size.info | 100000 utterances |
files.size | 7046081539 |
files.count | 4 |
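
The description field states that each subset holds one folder per speaker ID and that audio files are named {speaker_ID}-{utterance_ID}.flac. A minimal Python sketch of how that layout could be indexed is shown below. It is not part of the corpus tooling: the function names are hypothetical, and it assumes that speaker IDs contain no hyphen and that the extracted subsets sit in directories such as `train`, `dev`, and `test`.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Set, Tuple


def parse_recording_name(filename: str) -> Tuple[str, str]:
    """Split '{speaker_ID}-{utterance_ID}.flac' into (speaker_ID, utterance_ID).

    Assumption: the speaker ID itself contains no hyphen, so the first
    hyphen separates the two IDs.
    """
    stem, dot, ext = filename.rpartition(".")
    if not dot or ext.lower() != "flac":
        raise ValueError(f"not a .flac file name: {filename!r}")
    speaker_id, sep, utterance_id = stem.partition("-")
    if not sep or not speaker_id or not utterance_id:
        raise ValueError(f"unexpected file name: {filename!r}")
    return speaker_id, utterance_id


def index_subset(subset_dir: Path) -> Dict[str, List[str]]:
    """Map each speaker ID to its utterance IDs for one subset directory."""
    index: Dict[str, List[str]] = defaultdict(list)
    # One folder per speaker, FLAC files directly inside it.
    for flac in sorted(subset_dir.glob("*/*.flac")):
        speaker_id, utterance_id = parse_recording_name(flac.name)
        index[speaker_id].append(utterance_id)
    return dict(index)


def shared_speakers(a: Dict[str, List[str]], b: Dict[str, List[str]]) -> Set[str]:
    """Speakers present in both indexes; expected to be empty across subsets."""
    return set(a) & set(b)
```

Calling `shared_speakers(index_subset(Path("train")), index_subset(Path("test")))` on an extracted copy would be one way to confirm the stated absence of speaker overlap; the directory paths here are placeholders.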
Files in this item
Total download size: 6.56 GB. Publicly available under Creative Commons - Attribution 4.0 International (CC BY 4.0).
Name | Size | Format | Description | MD5 |
samromur_21.05.multi.zip | 719.67 MB | application/zip | main file for corpus | 91c7fd4c7b97d53178a000f61d27d372 |
samromur_21.05.multi.z01 | 1.95 GB | Unknown | part 1 | f1b0cca62f6099b413dc5ebadc84cba9 |
samromur_21.05.multi.z02 | 1.95 GB | Unknown | part 2 | d6a4a56db8d85d79cddbaa3a900fbf0d |
samromur_21.05.multi.z03 | 1.95 GB | Unknown | part 3 | 35b90a6aa5ac4d40c58ded8a5f0ead5d |
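
Because the archive is split into four parts, each with a published MD5 checksum, it can be useful to verify the downloads before extracting. The following Python sketch shows one way to do that with the standard library. The file names and checksums are copied from the listing above; the function names and the download directory are assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict

# File names and MD5 checksums as published in this record.
EXPECTED_MD5 = {
    "samromur_21.05.multi.zip": "91c7fd4c7b97d53178a000f61d27d372",
    "samromur_21.05.multi.z01": "f1b0cca62f6099b413dc5ebadc84cba9",
    "samromur_21.05.multi.z02": "d6a4a56db8d85d79cddbaa3a900fbf0d",
    "samromur_21.05.multi.z03": "35b90a6aa5ac4d40c58ded8a5f0ead5d",
}


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(download_dir: Path,
                     expected: Dict[str, str] = EXPECTED_MD5) -> Dict[str, str]:
    """Return {file name: 'ok' | 'mismatch' | 'missing'} for each expected file."""
    results = {}
    for name, want in expected.items():
        path = Path(download_dir) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == want:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

A call such as `verify_downloads(Path("downloads"))` (hypothetical directory) should report `ok` for all four files before extraction is attempted. Note that Python's `zipfile` module does not handle multi-disk ZIP archives, so the parts would likely need to be extracted with an external tool that supports split archives, with all four files kept in the same directory.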