#############################################################################
#########           Althingi Parliamentary speech corpus            #########
#########          http://hdl.handle.net/20.500.12537/277           #########
#############################################################################

THE FILES

The corpus contains two packages, delivered in two compressed files:

A) althingi_texts.tar.gz:

The file althingi_texts.tar.gz contains the training, evaluation and
development sets and two language models (a pruned trigram model, used in
decoding, and an unpruned constant ARPA 5-gram model, used for rescoring
decoding results).

The three sets are located in the folders train, dev and eval. Each folder
contains five files:
- segments: links each text segment to its place in the audio files
- spk2gender: lists all the speakers and their gender
- spk2utt: lists all the speakers and their utterances/segments
- text: lists the ID of each segment and its text
- utt2spk: maps each segment to its speaker

The two models are located in the folders lang_3gsmall and lang_5glarge.

The file metadata.csv contains a comma-separated list of the speaker, the
timestamp of the beginning of the speech, the timestamp of the end of the
speech, the number of the parliamentary session, the status of the speaker
and the ID of the file.

The file name_id_gender.tsv contains the name of each speaker, their
abbreviation and their gender.

The file pron_dict.txt contains the pronunciation dictionary. It is based on
an edited version of Hjal’s pronunciation dictionary
(http://hdl.handle.net/20.500.12537/198), plus common words from the Althingi
texts and from Málrómur. It currently contains ~181,000 words.

B) althingi_upptokur.zip:

The folder 'audio' contains the recordings of all the speeches.
The folder 'text_bb' contains the texts of all the speeches in XML format.
The folder 'text_endanlegt' contains the texts after cleaning.
The file metadata.csv contains the same information as the file metadata.csv
in althingi_texts.tar.gz.

Due to its size, this package was split into four parts during compression.
To unzip it, first use the zip command to combine the split zip files into a
single zip archive:

    zip -F althingi_upptokur.zip --out althingi_upptokur.single-archive.zip

Then use unzip to open the combined archive:

    unzip althingi_upptokur.single-archive.zip


ABOUT THE ALTHINGI PARLIAMENTARY SPEECH CORPUS

This is an aligned and segmented corpus of 6493 Althingi recordings with 196
speakers. The recordings consist of 199,614 segments, with an average
duration of 9.8 s. A file called segments links each text segment to its
place in the audio files. The total duration of the data set is 542 hours and
25 minutes, and it contains 4,583,751 word tokens.

The corpus is split into a training set, a development set and an evaluation
set. The training set contains speeches from 2005 to 2015, with a total
duration of 514.5 hours. The speeches from 2016 were split evenly between the
development and evaluation sets, with 14 hours in each. The evaluation set is
cleaner than the development set, and both are cleaner than the training set.

The pronunciation dictionary is based on an edited version of Hjal’s
pronunciation dictionary (E. Rögnvaldsson, 2003), which is available at
Málföng, plus common words from the Althingi texts and from Málrómur
(J. Guðnason et al., 2012). It currently contains ~181,000 words. Sequitur’s
grapheme-to-phoneme converter (M. Bisani et al., 2008), trained on the edited
pronunciation dictionary from Hjal plus the Málrómur data, was used to get
the phonemes for the new words from the Althingi data.

The language models were built using transcripts of Althingi speeches dating
back to 2003, excluding speeches from 2016. One is a pruned trigram model,
used in decoding.
The other one is an unpruned constant ARPA 5-gram model, used for rescoring
decoding results.

Using this data, pronunciation dictionary and language models, an automatic
speech recognizer with a 10.23% word error rate has been developed. This
error rate was obtained using an acoustic model based on a lattice-free
maximum mutual information neural network architecture with both time-delay
and long short-term memory layers. It is based on the Switchboard recipe in
the Kaldi toolkit (D. Povey et al., 2011)
(https://github.com/kaldi-asr/kaldi/tree/master/egs/swbd). Our training
recipe from start to finish will be made public soon.

-----------

When publishing results based on the texts in the corpus please refer to:
Inga Rún Helgadóttir, Róbert Kjaran, Anna Björk Nikulásdóttir and Jón
Guðnason, 2017. Building an ASR corpus using Althingi’s Parliamentary
Speeches. Proceedings of Interspeech 2017.

____________________________________________________________

[TRANSLATED FROM THE ICELANDIC]

THE FILES:

See the section THE FILES above.


ABOUT THE ALTHINGI DATA

The data consist of 6493 Althingi speeches from 196 speakers. They are
aligned and split into segments of a suitable size for training. The average
duration of a segment is 9.8 s. A file called segments links each text
segment to its place in the audio files. The total duration of the audio is
542 hours and 25 minutes, and the text contains just under 4.6 million words.

The corpus is split into a training set and two test sets, “dev” and “eval”.
The training set is 514.5 hours long and contains data from 2005-2015. The
speech segments from 2016 were split evenly between the test sets, with 14
hours in each. The “eval” set is cleaner than the “dev” set, and both are
cleaner than the training data.

The pronunciation dictionary is an improved version of the pronunciation
dictionary of the Hjal project (E. Rögnvaldsson, 2003), which is available at
Málföng, with common words added from the Althingi speeches from 2003-2015
and from the Málrómur corpus (J. Guðnason et al., 2012). It contains about
181,000 words. The pronunciations of the new words were obtained with a
Sequitur G2P model (M. Bisani et al., 2008) that had been trained on the
pronunciation dictionary of the Hjal project plus the Málrómur data.

Althingi texts from 2003-2015 were used to build the language models. One is
a small 3-gram model, used in decoding. The other is a large 5-gram model in
“constant-arpa format”, used to rescore the initial decoding.

These data, the pronunciation dictionary and the language models were used to
train a speech recognizer with a 10.23% word error rate. The acoustic model
was based on a combination of time-delay deep neural networks (TD-DNN) and
long short-term memory deep neural networks (LSTM-DNN). The Switchboard
recipe in the Kaldi toolkit (D. Povey et al., 2011)
(https://github.com/kaldi-asr/kaldi/tree/master/egs/swbd) was used to train
the acoustic model. The Althingi recipe will be made publicly available soon.

------

When publishing results obtained using the data, please cite:
Inga Rún Helgadóttir, Róbert Kjaran, Anna Björk Nikulásdóttir and Jón
Guðnason, 2017. Building an ASR corpus using Althingi’s Parliamentary
Speeches. Proceedings of Interspeech 2017.
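
____________________________________________________________

EXAMPLE: READING THE DATA FOLDERS

The files in the train, dev and eval folders can be inspected with a few
lines of Python. The following is a minimal sketch and is not part of the
corpus distribution. It assumes that segments and text follow the usual Kaldi
data-directory layout (segments: "<segment-id> <recording-id> <start-seconds>
<end-seconds>", text: "<segment-id> <transcript>"), which this README does
not state explicitly. The function names read_segments, read_text and
summarize are hypothetical.

```python
from pathlib import Path


def read_segments(path):
    """Read a segments file, ASSUMING the usual Kaldi layout:
    <segment-id> <recording-id> <start-seconds> <end-seconds>
    """
    segments = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 4:
                continue  # skip blank or malformed lines
            seg_id, rec_id, start, end = parts
            segments[seg_id] = (rec_id, float(start), float(end))
    return segments


def read_text(path):
    """Read a text file, ASSUMING the usual Kaldi layout:
    <segment-id> <transcript ...>
    """
    texts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split(maxsplit=1)
            if not parts:
                continue  # skip blank lines
            seg_id = parts[0]
            texts[seg_id] = parts[1].strip() if len(parts) > 1 else ""
    return texts


def summarize(data_dir):
    """Return segment count, total hours, mean segment length and token count
    for one data folder (for example train, dev or eval)."""
    data_dir = Path(data_dir)
    segments = read_segments(data_dir / "segments")
    texts = read_text(data_dir / "text")
    total_seconds = sum(end - start for _, start, end in segments.values())
    n_tokens = sum(len(t.split()) for t in texts.values())
    return {
        "segments": len(segments),
        "hours": total_seconds / 3600.0,
        "mean_segment_seconds": total_seconds / len(segments) if segments else 0.0,
        "word_tokens": n_tokens,
    }
```

Calling summarize("train") from the directory where althingi_texts.tar.gz was
extracted (the exact path depends on where you unpacked it) gives figures
that can be compared with the totals quoted in this README. Tokens are counted
by whitespace splitting, which may differ slightly from the tokenization
behind the published word count.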