dc.contributor.author | Friðriksdóttir, Steinunn Rut |
dc.contributor.author | Ingason, Anton Karl |
dc.date.accessioned | 2020-02-05T14:24:37Z |
dc.date.available | 2020-02-05T14:24:37Z |
dc.date.issued | 2020-01-28 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/13 |
dc.description | The Icelandic Confusion Set Corpus (ICoSC) is available under a CC-BY licence. It was compiled over the course of three months in 2019 by Steinunn Rut Friðriksdóttir and Anton Karl Ingason of the language technology department at the University of Iceland. Included in the ICoSC are CSV spreadsheets containing all collected confusion sets of each category and their frequencies. The spreadsheets are organized so that, for each set, the total frequency of each candidate is given along with the frequency of each possible PoS tag for that candidate. The seventh and eighth columns of the tables contain binary values indicating whether the confusion set is grammatically disjoint (all PoS tags differ for the two candidates) or grammatically identical (all PoS tags are identical for the two candidates). The final column shows the frequency of the less frequent candidate of the set, which can be used to determine which sets are viable in an experiment. Also included are text files containing the list of words from each category and text files containing all sentence examples from the IGC which contain the words for each category. As the n/nn examples are by far the most frequent confusion sets, the corpus also includes a word list and sentence examples for the 55 most frequent sets. All files have UTF-8 encoding. The ICoSC consists of the following categories of confusion sets, selected for their linguistic properties as homophones separated orthographically by a single letter. The categories are: - 196 pairs containing y/i (leyti ’extent’ / leiti ’search’). - 150 pairs containing ý/í (sýn ’vision’ / sín ’theirs (possessive reflexive)’). - 1203 pairs containing nn/n (forvitinn ’curious (masc.)’ / forvitin ’curious (fem.)’). - 24 pairs containing hv/kv (hvað ’what’ / kvað ’chanted’). - 42 pairs containing rð/ðr (veðri ’weather (dative)’ / verði ’will become’). - 110 pairs containing rr/r (klárri ’smart (indef. fem. dative)’ / klári ’smart (def. masc. nominative)’). - 8 pairs commonly confused by Icelandic speakers, e.g. mig/mér (me (accusative) / me (dative)). |
dc.language.iso | isl |
dc.publisher | Háskóli Íslands |
dc.relation.isreferencedby | https://www.insticc.org/Primoris/Resources/PaperPdf.ashx?idPaper=93715 |
dc.relation.isreplacedby | http://hdl.handle.net/20.500.12537/19 |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://github.com/steinunnfridriks/ICoSC |
dc.subject | homophones |
dc.subject | confusion sets |
dc.subject | context dependency |
dc.subject | rich morphology |
dc.subject | disambiguation |
dc.title | The Icelandic Confusion Set Corpus (ICoSC) 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | Clarin IS Repository |
contact.person | Steinunn Rut Friðriksdóttir srf2@hi.is Háskóli Íslands |
files.size | 225339230 |
files.count | 1 |
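The description defines a confusion set as grammatically disjoint when the two candidates share no PoS tag, and grammatically identical when they have exactly the same PoS tags. A minimal Python sketch of that classification follows. The function name and the tag strings in the usage example are hypothetical and are not taken from the corpus; the sketch only illustrates the two definitions and is not the authors' implementation.

```python
def classify_confusion_set(tags_a, tags_b):
    """Classify a pair of candidates by comparing their sets of PoS tags.

    tags_a, tags_b: iterables of PoS tag strings observed for each candidate.
    Returns "identical", "disjoint" or "neither".
    """
    a, b = set(tags_a), set(tags_b)
    if a == b:
        # Every tag of one candidate also occurs for the other.
        return "identical"
    if a.isdisjoint(b):
        # The candidates have no tag in common.
        return "disjoint"
    # Some tags are shared, others are not.
    return "neither"


# Hypothetical tag labels, for illustration only.
example = classify_confusion_set({"NOUN"}, {"VERB"})
```

A pair that shares some but not all tags falls into neither class, which matches the three-way split suggested by the file names in the "Most common words - frequency tables" folder.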
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
- Name: ICoSC.zip
- Size: 214.9 MB
- Format: application/zip
- Description: zip folder containing word lists, sentence examples and frequency tables
- MD5: c52924f781f87beccd60ecb518069cd7
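The MD5 value listed for ICoSC.zip can be used to check that a download is intact. The sketch below uses Python's standard `hashlib` module; the function name `md5_of_file` and the local path `ICoSC.zip` are assumptions made for illustration.

```python
import hashlib

# Checksum published in this record for ICoSC.zip.
EXPECTED_MD5 = "c52924f781f87beccd60ecb518069cd7"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large archive is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage, assuming the archive was saved as "ICoSC.zip" in the working directory:
# print(md5_of_file("ICoSC.zip") == EXPECTED_MD5)
```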
- ICoSC
- Frequency tables
- rr_freq.csv (11 kB)
- y_freq.csv (17 kB)
- various_freq.csv (1 kB)
- nn_freq.csv (101 kB)
- y_elongated_freq.csv (13 kB)
- rddr_freq.csv (4 kB)
- nn_mostfreq_freq.csv (7 kB)
- hvkv_freq.csv (2 kB)
- Wordlists
- y_wordlist.txt (3 kB)
- y_elongated_wordlist.txt (2 kB)
- hvkv_wordlist.txt (464 B)
- rr_wordlist.txt (2 kB)
- rddr_wordlist.txt (890 B)
- nn_mostfreq_wordlist.txt (836 B)
- various_wordlist.txt (99 B)
- nn_wordlist.txt (27 kB)
- README.md (4 kB)
- Most common words - frequency tables
- Grammatically identical.csv (652 B)
- NeitherGInorGD.csv (4 kB)
- Grammaticaly disjoint.csv (10 kB)
- Sentence examples
- y_sent.txt (107 MB)
- nn_mostfreq_sent.txt (147 MB)
- hvkv_sent.txt (16 MB)
- various_sent.txt (87 MB)
- ylong_sent.txt (96 MB)
- nn_sent.txt (389 MB)
- rddr_sent.txt (42 MB)
- rr_sent.txt (69 MB)
- Frequency tables
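According to the description, the final column of each frequency table holds the frequency of the less frequent candidate, which can be used to decide which confusion sets are viable in an experiment. The Python sketch below filters rows on that column. The exact column layout, the presence of a header row, the comma delimiter and the sample numbers are all assumptions made for illustration; only the position of the final column, and of the two binary flags in the seventh and eighth columns, comes from the description.

```python
import csv
import io


def read_rows(handle, has_header=True):
    """Read a frequency table into a list of rows (lists of strings).

    has_header is an assumption; set it to False if the file has no header row.
    """
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the assumed header row
    return [row for row in reader if row]


def viable_sets(rows, min_frequency):
    """Keep rows whose final column (frequency of the rarer candidate)
    is at least min_frequency."""
    return [row for row in rows if int(row[-1]) >= min_frequency]


# Hypothetical sample: invented column names and invented frequencies.
# Columns 7 and 8 (indices 6 and 7) stand in for the binary
# grammatically-disjoint and grammatically-identical flags.
SAMPLE = """cand_a,freq_a,tags_a,cand_b,freq_b,tags_b,gd,gi,min_freq
leyti,900,x,leiti,40,y,1,0,40
hvað,5000,x,kvað,700,y,1,0,700
"""

rows = read_rows(io.StringIO(SAMPLE))
kept = viable_sets(rows, min_frequency=100)
print([row[0] for row in kept])
```

For a real table, an open file could be passed instead of the `io.StringIO` object, for example `open("y_freq.csv", encoding="utf-8", newline="")`, since the record states that all files are UTF-8 encoded.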