
 
dc.contributor.author Ingólfsdóttir, Svanhvít Lilja
dc.contributor.author Guðjónsson, Ásmundur Alma
dc.contributor.author Loftsson, Hrafn
dc.date.accessioned 2022-06-13T10:29:09Z
dc.date.available 2022-06-13T10:29:09Z
dc.date.issued 2020-06-12
dc.identifier.uri http://hdl.handle.net/20.500.12537/230
dc.description This Icelandic named entity (NE) corpus, MIM-GOLD-NER, is a version of the MIM-GOLD 21.05 corpus (https://repository.clarin.is/repository/xmlui/handle/20.500.12537/113) tagged for NEs. Over 48 thousand NEs are tagged in this corpus of one million tokens, which can be used for training named entity recognizers for Icelandic.

The MIM-GOLD-NER corpus was developed at Reykjavik University in 2018–2020, funded by the Strategic Research and Development Programme for Language Technology (LT). Two LT students were in charge of annotating the corpus and of training named entity recognizers using machine learning methods.

A semi-automatic approach was used for annotating the corpus. Lists of Icelandic person names, location names, and company names were compiled and used for extracting and classifying as many named entities as possible. Regular expressions were then used to find certain numerical entities in the corpus. After this automatic pre-processing step, the whole corpus was reviewed manually to correct any errors.

The corpus is tagged for eight named entity types:
PERSON – names of humans, animals and other beings, real or fictional.
LOCATION – names of locations, real or fictional, e.g. buildings, streets and places, as well as all geographical and geopolitical entities such as cities, countries, counties and regions, and planets and other outer-space entities.
ORGANIZATION – companies and other organizations, public or private, real or fictional, including schools, churches, swimming pools, community centers, musical groups and other affiliations.
MISCELLANEOUS – proper nouns that do not belong to the previous three categories, such as products, book and movie titles, and events such as wars, sports tournaments, festivals and concerts.
DATE – absolute temporal units of a full day or longer, such as days, months, years and centuries, written either numerically or alphabetically.
TIME – absolute temporal units shorter than a full day, such as seconds, minutes or hours, written either numerically or alphabetically.
MONEY – exact monetary amounts in any currency, written either numerically or alphabetically.
PERCENT – percentages, written either numerically or alphabetically.

MIM-GOLD-NER is intended for training named entity recognizers for Icelandic. It is in the CoNLL format, and the position of each token within an NE is marked using the BIO tagging format. The corpus is provided in the same format as MIM-GOLD, grouped by source. It can be used in its entirety, or models can be trained on the subsets of text types that best fit the intended domain.

While previous versions of MIM-GOLD-NER (https://repository.clarin.is/repository/xmlui/handle/20.500.12537/140) were compatible with older versions of MIM-GOLD, the current version uses the same tokenization as MIM-GOLD 21.05. Hjalti Daníelsson converted the corpus to the current version of MIM-GOLD.

The Named Entity Corpus is distributed with the same special user license as MIM-GOLD, which is based on the MIM license, since the texts in MIM-GOLD were sampled from the MIM corpus.
dc.description This named entity corpus, MIM-GOLD-NER, is a named-entity-tagged version of the MIM-GOLD 21.05 corpus (https://repository.clarin.is/repository/xmlui/handle/20.500.12537/113). More than 48 thousand proper names and other named entities have been tagged in this one-million-token corpus, for use in training named entity recognizers for Icelandic.

The MIM-GOLD-NER corpus was created at Reykjavik University in 2018-2020, with a grant from the Strategic Research and Development Programme for Language Technology. Two language technology students annotated the entities in the corpus and trained named entity recognition models.

A semi-automatic method was used to annotate the named entities in the text. Lists of Icelandic person names, place names and company names were used to find and classify proper names, and regular expressions were applied to find numerical entities in the corpus. After this pre-processing step, the corpus was reviewed manually to correct errors.

Eight categories of named entities are tagged in the corpus:
PERSON (person names) - names of people, animals and other beings, real or fictional.
LOCATION (places) - names of places, both real and fictional, such as buildings, street names and place names. All kinds of geographical and administrative entities such as cities, countries, counties and other regions, as well as planets and other objects in space.
ORGANIZATION (companies and institutions) - companies and other institutions, real or fictional. Schools, churches, swimming pools, community centers, bands, other associations.
MISCELLANEOUS (miscellaneous) - proper names that do not belong in the three preceding categories, such as product names, book titles and films, and events such as wars, sports tournaments, festivals, concerts, etc.
DATE (dates) - exact temporal units spanning at least a full day, such as days, months or centuries, written with digits or letters.
TIME (times) - exact temporal units shorter than one day, such as seconds, minutes and hours, written with digits or letters.
MONEY (amounts) - exact amounts in any currency, written with digits or letters.
PERCENT (percentages) - percentages, written with digits or letters.

MIM-GOLD-NER is intended for training named entity recognizers for Icelandic. The corpus is in the CoNLL format, and the position of each token within a named entity is marked using the BIO format. The corpus is laid out in the same way as MIM-GOLD, grouped by source. It can be used in its entirety, or models can be trained on parts of it depending on the domain.

This version of MIM-GOLD-NER is tokenized in the same way as MIM-GOLD 21.05, whereas earlier versions (https://repository.clarin.is/repository/xmlui/handle/20.500.12537/140) are compatible with older versions of MIM-GOLD. Hjalti Daníelsson aligned the corpus with the new version of MIM-GOLD (21.05).

The named entity corpus is distributed under the same special user license as MIM-GOLD, which is based on the license of the MIM corpus, since the texts in MIM-GOLD are taken from it.
dc.language.iso isl
dc.publisher Reykjavik University
dc.relation.isreferencedby https://en.ru.is/kennarar/hrafn/papers/Named_entity_recognition_SLSP.pdf
dc.relation.replaces http://hdl.handle.net/20.500.12537/140
dc.rights Icelandic Mim Gold Standard for Named Entity Recognition (NER)
dc.rights.uri https://repository.clarin.is/repository/xmlui/page/license-mim-gold-ner
dc.rights.label PUB
dc.subject named entity recognition
dc.subject named entity corpus
dc.subject named entities
dc.subject information extraction
dc.subject gold standard
dc.title MIM-GOLD-NER 2.0 – named entity recognition corpus (22.06) (2022-06-10)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding Clarin IS Repository
contact.person Svanhvít Ingólfsdóttir svanhviti16@ru.is Reykjavik University
sponsor RANNÍS 180027-5301 The Strategic Research and Development Programme for Language Technology nationalFunds
size.info 1058643 tokens
files.size 2560426
files.count 1
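
The description states that the corpus is in CoNLL format with BIO tags. The following is a minimal sketch of how such a file could be read and turned into entity spans. It assumes one token per line with the tag in the last whitespace-separated column and a blank line between sentences; the exact column layout and the spelling of the tag labels (for example `B-Person`) are assumptions that should be checked against the README in the archive. The sample sentence is made up for illustration and is not taken from the corpus.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, str]  # (word form, BIO tag)


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Group CoNLL-style lines into sentences, splitting on blank lines."""
    sentence: List[Token] = []
    for raw in lines:
        if not raw.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = raw.split()
        # Assumption: first column is the token, last column is the tag.
        sentence.append((columns[0], columns[-1]))
    if sentence:
        yield sentence


def extract_entities(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Collect (entity text, entity type) pairs from one BIO-tagged sentence."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    etype = None
    # A sentinel "O" token closes any entity still open at the end.
    for token, tag in sentence + [("", "O")]:
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == etype:
            tokens.append(token)  # continue the current entity
            continue
        if tokens:
            entities.append((" ".join(tokens), etype))
        if prefix in ("B", "I") and label:
            # "B-" starts an entity; a stray "I-" is treated leniently as a start.
            tokens, etype = [token], label
        else:
            tokens, etype = [], None
    return entities


# Made-up example lines (hypothetical tag spelling).
sample = [
    "Hrafn\tB-Person",
    "Loftsson\tI-Person",
    "starfar\tO",
    "við\tO",
    "Háskólann\tB-Organization",
    "í\tI-Organization",
    "Reykjavík\tI-Organization",
    ".\tO",
    "",
]

for sent in read_sentences(sample):
    print(extract_entities(sent))
```

To process a real file from the archive, the same functions could be applied to an open file handle, e.g. `read_sentences(open("blog.txt", encoding="utf-8"))`, assuming UTF-8 encoding.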


 Files in this item

This item is Publicly Available and licensed under: Icelandic Mim Gold Standard for Named Entity Recognition (NER)
Name MIM-GOLD-2_0.zip
Size 2.44 MB
Format application/zip
Description Unknown
MD5 2110a6b863680ccc62c62d5698875abf
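
The MD5 value listed above can be used to check that the downloaded archive is intact. A minimal sketch using Python's standard `hashlib` module follows; the local path passed to `verify` is an assumption about where the file was saved, and the helper names `md5_of_file` and `verify` are illustrative only.

```python
import hashlib

# Checksum as listed in this record for MIM-GOLD-2_0.zip.
EXPECTED_MD5 = "2110a6b863680ccc62c62d5698875abf"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str = EXPECTED_MD5) -> bool:
    """True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify("MIM-GOLD-2_0.zip")` would return `True` for an uncorrupted download saved under that name in the current directory. MD5 is suitable here only as an integrity check against accidental corruption, not as a security guarantee.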
File Preview
  • MIM-GOLD-2_0
    • laws.txt
    • radio_tv_news.txt
    • mbl.txt
    • README
    • fbl.txt
    • websites.txt
    • school_essays.txt
    • blog.txt
    • webmedia.txt
    • scienceweb.txt
    • emails.txt
    • written-to-be-spoken.txt
    • adjudications.txt
    • books.txt
