dc.contributor.author | Ingólfsdóttir, Svanhvít Lilja |
dc.contributor.author | Guðjónsson, Ásmundur Alma |
dc.contributor.author | Loftsson, Hrafn |
dc.date.accessioned | 2021-09-29T11:49:19Z |
dc.date.available | 2021-09-29T11:49:19Z |
dc.date.issued | 2020-06-12 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/140 |
dc.description | This Icelandic named entity (NE) corpus, MIM-GOLD-NER, is a version of the MIM-GOLD 1.0 corpus tagged for NEs. Over 48 thousand NEs (48,331) are tagged in this corpus of about one million tokens (1,005,688), which can be used for training named entity recognizers for Icelandic. The MIM-GOLD-NER corpus was developed at Reykjavik University in 2018–2020, funded by the Strategic Research and Development Programme for Language Technology (LT). Two LT students were in charge of annotating the corpus and of training named entity recognizers using machine learning methods.

A semi-automatic approach was used for annotating the corpus. Lists of Icelandic person names, location names, and company names were compiled and used to extract and classify as many named entities as possible. Regular expressions were then used to find certain numerical entities in the corpus. After this automatic pre-processing step, the whole corpus was reviewed manually to correct any errors.

The corpus is tagged for eight named entity types:
- PERSON: names of humans, animals and other beings, real or fictional.
- LOCATION: names of locations, real or fictional, such as buildings, streets and places. This includes all geographical and geopolitical entities such as cities, countries, counties and regions, as well as planets and other outer-space entities.
- ORGANIZATION: companies and other organizations, public or private, real or fictional, including schools, churches, swimming pools, community centers, musical groups and other affiliations.
- MISCELLANEOUS: proper nouns that do not belong to the previous three categories, such as products, book and movie titles, and events (wars, sports tournaments, festivals, concerts, etc.).
- DATE: absolute temporal units of a full day or longer, such as days, months, years and centuries, written either numerically or alphabetically.
- TIME: absolute temporal units shorter than a full day, such as seconds, minutes or hours, written either numerically or alphabetically.
- MONEY: exact monetary amounts in any currency, written either numerically or alphabetically.
- PERCENT: percentages, written either numerically or alphabetically.

MIM-GOLD-NER is intended for training named entity recognizers for Icelandic. It is in the CoNLL format, and the position of each token within an NE is marked using the BIO tagging format. The corpus is provided in the same format as MIM-GOLD, grouped by source, and additionally in an 80-10-10 train-validation-test split. It can be used in its entirety, or models can be trained on the subsets of text types that best fit the intended domain.

The Named Entity Corpus is distributed with the same special user license as MIM-GOLD, which is based on the MIM license, since the texts in MIM-GOLD were sampled from the MIM corpus. |
dc.language.iso | isl |
dc.publisher | Reykjavik University |
dc.relation.replaces | http://hdl.handle.net/20.500.12537/42 |
dc.relation.isreplacedby | http://hdl.handle.net/20.500.12537/230 |
dc.rights | Icelandic Mim Gold Standard for Named Entity Recognition (NER) |
dc.rights.uri | https://repository.clarin.is/repository/xmlui/page/license-mim-gold-ner |
dc.rights.label | PUB |
dc.subject | named entity recognition |
dc.subject | named entity corpus |
dc.subject | named entities |
dc.subject | information extraction |
dc.subject | gold standard |
dc.title | MIM-GOLD-NER – named entity recognition corpus (21.09) |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | Clarin IS Repository |
contact.person | Svanhvít Ingólfsdóttir svanhviti16@ru.is Reykjavik University |
sponsor | RANNÍS 180027-5301 The Strategic Research and Development Programme for Language Technology nationalFunds |
size.info | 1005688 tokens |
size.info | 48331 other |
files.size | 5161874 |
files.count | 2 |
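The description field states that the corpus is in the CoNLL format with BIO tags. A minimal sketch of reading such data and collecting entity spans could look like the following. It assumes one token per line with the token in the first column and the tag in the last, blank lines between sentences, and tags of the form `B-<Type>` / `I-<Type>` / `O`. The function names, the sample sentence, and the exact tag spellings (`Person`, `Location`) are illustrative assumptions, not taken from the corpus files.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Sentence = List[Tuple[str, str]]  # (token, BIO tag) pairs


def read_conll(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences as lists of (token, tag); a blank line ends a sentence."""
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        parts = line.split()
        # Assumption: first column is the token, last column is the BIO tag.
        sentence.append((parts[0], parts[-1]))
    if sentence:
        yield sentence


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (entity text, entity type) spans from the BIO tags."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    current_type: Optional[str] = None
    for token, tag in sentence:
        prefix, _, etype = tag.partition("-")
        if prefix == "I" and tokens and etype == current_type:
            tokens.append(token)  # continue the open span
            continue
        if tokens:  # any other tag closes the open span
            entities.append((" ".join(tokens), current_type))
            tokens, current_type = [], None
        if prefix in ("B", "I") and etype:
            # A stray I- tag with no open span is treated as a span start.
            tokens, current_type = [token], etype
    if tokens:
        entities.append((" ".join(tokens), current_type))
    return entities


# Invented sample sentence, not copied from the corpus.
SAMPLE = "Jón\tB-Person\nJónsson\tI-Person\nbýr\tO\ní\tO\nReykjavík\tB-Location\n.\tO\n"

if __name__ == "__main__":
    for sent in read_conll(SAMPLE.splitlines()):
        print(extract_entities(sent))
```

To read a real file from one of the archives, the same functions can be applied to an open file handle, for example `read_conll(open(path, encoding="utf-8"))`, after checking that the column layout matches the assumption above.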
Files in this item
This item is publicly available and licensed under: Icelandic Mim Gold Standard for Named Entity Recognition (NER)
Name | train-valid-test-split.zip |
Size | 2.48 MB |
Format | application/zip |
Description | train-valid-test-split |
MD5 | d523ed9fe22b8b4b350a578cd02455d1 |
Name | by-source.zip |
Size | 2.45 MB |
Format | application/zip |
Description | by-source |
MD5 | ed3ed3ef4ae99b2a694e12e12b97f425 |
- laws.txt (1 B)
- radio_tv_news.txt (1 B)
- mbl.txt (1 B)
- fbl.txt (1 B)
- websites.txt (1 B)
- blog.txt (1 B)
- school_essays.txt (1 B)
- webmedia.txt (1 B)
- scienceweb.txt (1 B)
- emails.txt (1 B)
- written-to-be-spoken.txt (1 B)
- adjudications.txt (1 B)
- books.txt (1 B)
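
The MD5 values listed for the two archives can be used to check a download before unpacking it. The following is a minimal sketch using Python's standard `hashlib` module. The checksums are copied from the file listing in this record; the helper names `md5_of_file` and `verify`, and the assumption that the archives keep their listed file names locally, are illustrative.

```python
import hashlib
from pathlib import Path

# Checksums as listed for the two archives in this record.
EXPECTED_MD5 = {
    "train-valid-test-split.zip": "d523ed9fe22b8b4b350a578cd02455d1",
    "by-source.zip": "ed3ed3ef4ae99b2a694e12e12b97f425",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path):
    """True if the file name is a listed archive and its MD5 matches."""
    expected = EXPECTED_MD5.get(Path(path).name)
    return expected is not None and md5_of_file(path) == expected


if __name__ == "__main__":
    # Assumption: the archives were saved to the current directory.
    for name in EXPECTED_MD5:
        if Path(name).exists():
            print(name, "OK" if verify(name) else "MISMATCH")
        else:
            print(name, "not found")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy.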