dc.contributor.author | Hedström, Staffan |
dc.contributor.author | Fong, Judy Y. |
dc.contributor.author | Þórhallsdóttir, Ragnheiður |
dc.contributor.author | Mollberg, David Erik |
dc.contributor.author | Guðmundsson, Smári Freyr |
dc.contributor.author | Jónsson, Ólafur Helgi |
dc.contributor.author | Þorsteinsdóttir, Sunneva |
dc.contributor.author | Magnúsdóttir, Eydís Huld |
dc.contributor.author | Gudnason, Jon |
dc.date.accessioned | 2022-09-26T14:15:57Z |
dc.date.available | 2022-09-26T14:15:57Z |
dc.date.issued | 2022-07-01 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/265 |
dc.description | This release of data from the Samrómur collection contains all collected data not present in other releases. The data is mostly UNVERIFIED. It contains 2,159,314 speech recordings (2,233 hours) in Icelandic, of which 84,161 have been verified. About 700,000 utterances have been scored with marosijo; the score indicates how likely the audio is to match the transcript. For more information about marosijo, see [1] and [2]. The corpus contains 17,984 unique speakers. The corpus is NOT split into train, dev and test subsets; for such subsets, see other Samrómur releases. All demographic information is self-reported. The dataset contains folders that correspond to speaker IDs, and the audio files inside use the following naming convention: {speaker_ID}-{utterance_ID}.flac. The average utterance length is 3.7 seconds. [1] Gudnason et al., "Building ASR corpora using Eyra", https://www.isca-speech.org/archive/pdfs/interspeech_2017/gunason17_interspeech.pdf [2] Guðmundsson, Smári Freyr, "Samrómur automated verification wrapup", https://github.com/cadia-lvl/samromur-tools/tree/master/QualityCheck |
dc.language.iso | isl |
dc.publisher | Reykjavik University |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://github.com/cadia-lvl/samromur |
dc.subject | audio |
dc.subject | corpus |
dc.subject | automatic speech recognition |
dc.subject | speaker verification |
dc.subject | speaker identification |
dc.title | Samromur Unverified 22.07 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | audio |
has.files | yes |
branding | Clarin IS Repository |
demo.uri | http://openslr.org/128/ |
contact.person | Jon Gudnason jg@ru.is Reykjavík University |
sponsor | Ministry of Education, Science and Culture; Data recording using Eyra/Samrómur (H1); Language Technology for Icelandic 2019-2022; nationalFunds |
size.info | 2233 hours |
size.info | 2159314 utterances |
size.info | 125 GB |
files.size | 116369268880 |
files.count | 12 |
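The naming convention given in the description ({speaker_ID}-{utterance_ID}.flac, stored in folders that correspond to speaker IDs) lends itself to a small indexing helper. The following is a minimal sketch, not part of the release: the function names `parse_recording_name` and `index_by_speaker` are hypothetical, and it assumes that the first hyphen in a file name separates the speaker ID from the utterance ID.

```python
from collections import defaultdict
from pathlib import Path


def parse_recording_name(path):
    """Split '{speaker_ID}-{utterance_ID}.flac' into (speaker_id, utterance_id)."""
    p = Path(path)
    if p.suffix.lower() != ".flac":
        raise ValueError(f"not a FLAC file: {p.name}")
    # Assumption: the first hyphen separates the two IDs.
    speaker_id, sep, utterance_id = p.stem.partition("-")
    if not sep or not speaker_id or not utterance_id:
        raise ValueError(f"unexpected file name: {p.name}")
    return speaker_id, utterance_id


def index_by_speaker(corpus_root):
    """Map each speaker ID to the sorted utterance IDs found under corpus_root."""
    index = defaultdict(list)
    # Search recursively so the per-speaker folders are picked up at any depth.
    for flac in Path(corpus_root).rglob("*.flac"):
        speaker_id, utterance_id = parse_recording_name(flac)
        index[speaker_id].append(utterance_id)
    return {speaker: sorted(utts) for speaker, utts in index.items()}
```

Since the release is not split into train, dev and test subsets, a per-speaker index of this kind is one possible starting point for building speaker-disjoint splits.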
Files in this item
This item is publicly available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name | Size | Format | Description | MD5 |
README.txt | 8.27 KB | Text file | Readme | 15c9ed914997d9b270c0d3a1a53f8199 |
--------------------------------------------------------------------------------
Samrómur Unverified 22.07
--------------------------------------------------------------------------------
Language : Icelandic
Authors : Staffan Hedström, Judy Y. Fong, Ragnheiður Þórhallsdóttir, David Erik Mollberg, Smári Freyr Guðmundsson, Ólafur Helgi Jónsson, Sunneva Þorsteinsdóttir, Eydís Huld Magnúsdóttir, Jon Gudnason
Recommended use : speech recognition, speaker verification, speaker identification and speaker enrollment
--------------------------------------------------------------------------------
Description
--------------------------------------------------------------------------------
This release of data from the Samrómur collection contains all available utterances from native Icelandic speakers. Only parts of the data have been validated. The corpus contains . . .
Name | Size | Format | Description | MD5 |
samromur_unverified_22.07.zip | 8.38 GB | application/zip | main | 153e2b3a4b2df668a6fefddb33e0e7b1 |
samromur_unverified_22.07.z01 | 10 GB | Unknown | part 1 | 946eab26f38d64255ce1a1fb36279634 |
samromur_unverified_22.07.z02 | 10 GB | Unknown | part 2 | a6151869209ad44edf18d7c3ad6fc224 |
samromur_unverified_22.07.z03 | 10 GB | Unknown | part 3 | f6a738e32d7350adf4e15fea8a1a7371 |
samromur_unverified_22.07.z04 | 10 GB | Unknown | part 4 | 1e25d01ca1bef77c345b6b5059eabe53 |
samromur_unverified_22.07.z05 | 10 GB | Unknown | part 5 | ca82c4d87402dc2fdd03e1d019d08c37 |
samromur_unverified_22.07.z06 | 10 GB | Unknown | part 6 | 1984ca305f35faa5d802852884f70649 |
samromur_unverified_22.07.z07 | 10 GB | Unknown | part 7 | 60bf661650292e501f89dba14c2e7fb0 |
samromur_unverified_22.07.z08 | 10 GB | Unknown | part 8 | 97e2601e8222277636693668c2f41aca |
samromur_unverified_22.07.z09 | 10 GB | Unknown | part 9 | dcc2ca46998ceec4ccbc3c54dba27188 |
samromur_unverified_22.07.z10 | 10 GB | Unknown | part 10 | 9f1a7cbac890f263d992cae85fa4ebf4 |
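Because the archive is distributed as one .zip file plus ten parts of about 10 GB each, it may be worth checking every download against the MD5 values listed above before extracting. The following is a minimal sketch using only Python's standard `hashlib` module; the names `EXPECTED_MD5`, `md5_of` and `verify_downloads` are hypothetical, and the checksums are copied from the file listing of this item.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing of this item.
EXPECTED_MD5 = {
    "README.txt": "15c9ed914997d9b270c0d3a1a53f8199",
    "samromur_unverified_22.07.zip": "153e2b3a4b2df668a6fefddb33e0e7b1",
    "samromur_unverified_22.07.z01": "946eab26f38d64255ce1a1fb36279634",
    "samromur_unverified_22.07.z02": "a6151869209ad44edf18d7c3ad6fc224",
    "samromur_unverified_22.07.z03": "f6a738e32d7350adf4e15fea8a1a7371",
    "samromur_unverified_22.07.z04": "1e25d01ca1bef77c345b6b5059eabe53",
    "samromur_unverified_22.07.z05": "ca82c4d87402dc2fdd03e1d019d08c37",
    "samromur_unverified_22.07.z06": "1984ca305f35faa5d802852884f70649",
    "samromur_unverified_22.07.z07": "60bf661650292e501f89dba14c2e7fb0",
    "samromur_unverified_22.07.z08": "97e2601e8222277636693668c2f41aca",
    "samromur_unverified_22.07.z09": "dcc2ca46998ceec4ccbc3c54dba27188",
    "samromur_unverified_22.07.z10": "9f1a7cbac890f263d992cae85fa4ebf4",
}


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so a 10 GB part never sits in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(download_dir, expected=EXPECTED_MD5):
    """Return {file name: 'ok' | 'missing' | 'mismatch'} for every expected file."""
    results = {}
    for name, want in expected.items():
        path = Path(download_dir) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of(path) == want:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

A call such as `verify_downloads("downloads")`, where `downloads` is a placeholder directory name, reports the status of each file. The .z01 to .z10 parts and the .zip file form a single split archive, so all of them should sit in the same directory before extraction.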