dc.contributor.author Ingólfsdóttir, Svanhvít Lilja
dc.contributor.author Guðjónsson, Ásmundur Alma
dc.contributor.author Loftsson, Hrafn
dc.date.accessioned 2021-09-29T11:49:19Z
dc.date.available 2021-09-29T11:49:19Z
dc.date.issued 2020-06-12
dc.identifier.uri http://hdl.handle.net/20.500.12537/140
dc.description This Icelandic named entity (NE) corpus, MIM-GOLD-NER, is a version of the MIM-GOLD 1.0 corpus tagged for NEs. Over 48 thousand NEs are tagged in this corpus of one million tokens, which can be used for training named entity recognizers for Icelandic.

The MIM-GOLD-NER corpus was developed at Reykjavik University in 2018–2020, funded by the Strategic Research and Development Programme for Language Technology (LT). Two LT students were in charge of annotating the corpus and of training named entity recognizers using machine learning methods.

A semi-automatic approach was used for annotating the corpus. Lists of Icelandic person names, location names, and company names were compiled and used for extracting and classifying as many named entities as possible. Regular expressions were then used to find certain numerical entities in the corpus. After this automatic pre-processing step, the whole corpus was reviewed manually to correct any errors.

The corpus is tagged for eight named entity types:
  • PERSON: names of humans, animals and other beings, real or fictional.
  • LOCATION: names of locations, real or fictional, e.g. buildings, streets and places. This includes all geographical and geopolitical entities such as cities, countries, counties and regions, as well as planet names and other outer space entities.
  • ORGANIZATION: companies and other organizations, public or private, real or fictional. This includes schools, churches, swimming pools, community centers, musical groups and other affiliations.
  • MISCELLANEOUS: proper nouns that do not belong to the previous three categories, such as products, book and movie titles, and events such as wars, sports tournaments, festivals and concerts.
  • DATE: absolute temporal units of a full day or longer, such as days, months, years and centuries, written either numerically or alphabetically.
  • TIME: absolute temporal units shorter than a full day, such as seconds, minutes or hours, written either numerically or alphabetically.
  • MONEY: exact monetary amounts in any currency, written either numerically or alphabetically.
  • PERCENT: percentages, written either numerically or alphabetically.

MIM-GOLD-NER is intended for training named entity recognizers for Icelandic. It is in the CoNLL format, and the position of each token within the NE is marked using the BIO tagging format. The corpus is provided in the same format as MIM-GOLD, grouped by source, and additionally in an 80-10-10 train-validation-test split. It can be used in its entirety, or models can be trained on the subsets of text types that best fit the intended domain.

The Named Entity Corpus is distributed with the same special user license as MIM-GOLD, which is based on the MIM license, since the texts in MIM-GOLD were sampled from the MIM corpus.
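The description states that the corpus is in CoNLL format with BIO tags. A minimal Python sketch of reading such a file into entity spans follows. It assumes one token per line with the tag in the last whitespace-separated column, a blank line between sentences, and tags of the form B-Label, I-Label and O. The exact column layout and label spellings are assumptions and should be checked against the distributed files. The sample lines are illustrative and are not taken from the corpus.

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]


def read_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences as lists of (token, tag); a blank line ends a sentence."""
    sentence: Sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        # Assumption: token in the first column, BIO tag in the last column.
        sentence.append((columns[0], columns[-1]))
    if sentence:
        yield sentence


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (entity text, label) pairs."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = None
    for token, tag in sentence:
        prefix, _, name = tag.partition("-")
        if prefix == "I" and tokens and name == label:
            tokens.append(token)  # continue the open entity
            continue
        if tokens:  # any other tag closes the open entity
            entities.append((" ".join(tokens), label))
            tokens, label = [], None
        if prefix in ("B", "I") and name:
            # A stray I- tag is treated leniently as the start of a new entity.
            tokens, label = [token], name
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities


# Illustrative sample, not taken from the corpus.
SAMPLE = [
    "Hrafn\tB-Person", "Loftsson\tI-Person", "starfar\tO", "við\tO",
    "Háskólann\tB-Organization", "í\tI-Organization",
    "Reykjavík\tI-Organization", ".\tO",
    "",
    "Árið\tO", "2020\tB-Date",
]

if __name__ == "__main__":
    for sent in read_sentences(SAMPLE):
        print(extract_entities(sent))
```

To process a real file, the same functions can be applied to an open file handle, for example `read_sentences(open(path, encoding="utf-8"))`, where `path` points to one of the extracted text files.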
dc.language.iso isl
dc.publisher Reykjavik University
dc.relation.replaces http://hdl.handle.net/20.500.12537/42
dc.relation.isreplacedby http://hdl.handle.net/20.500.12537/230
dc.rights Icelandic Mim Gold Standard for Named Entity Recognition (NER)
dc.rights.uri https://repository.clarin.is/repository/xmlui/page/license-mim-gold-ner
dc.rights.label PUB
dc.subject named entity recognition
dc.subject named entity corpus
dc.subject named entities
dc.subject information extraction
dc.subject gold standard
dc.title MIM-GOLD-NER – named entity recognition corpus (21.09)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding Clarin IS Repository
contact.person Svanhvít Ingólfsdóttir svanhviti16@ru.is Reykjavik University
sponsor RANNÍS 180027-5301 The Strategic Research and Development Programme for Language Technology nationalFunds
size.info 1005688 tokens
size.info 48331 other
files.size 5161874
files.count 2


 Files in this item

 Download all files in item (4.92 MB)
This item is Publicly Available and licensed under: Icelandic Mim Gold Standard for Named Entity Recognition (NER)
Name: train-valid-test-split.zip
Size: 2.48 MB
Format: application/zip
Description: train-valid-test-split
MD5: d523ed9fe22b8b4b350a578cd02455d1

Name: by-source.zip
Size: 2.45 MB
Format: application/zip
Description: by-source
MD5: ed3ed3ef4ae99b2a694e12e12b97f425
File preview:
  • __MACOSX
    • ._written-to-be-spoken.txt-1 B
    • ._fbl.txt-1 B
    • ._laws.txt-1 B
    • ._emails.txt-1 B
    • ._blog.txt-1 B
    • ._websites.txt-1 B
    • ._radio_tv_news.txt-1 B
    • ._webmedia.txt-1 B
    • ._scienceweb.txt-1 B
    • ._books.txt-1 B
    • ._school_essays.txt-1 B
    • ._adjudications.txt-1 B
    • ._mbl.txt-1 B
    • laws.txt-1 B
    • radio_tv_news.txt-1 B
    • mbl.txt-1 B
    • fbl.txt-1 B
    • websites.txt-1 B
    • blog.txt-1 B
    • school_essays.txt-1 B
    • webmedia.txt-1 B
    • scienceweb.txt-1 B
    • emails.txt-1 B
    • written-to-be-spoken.txt-1 B
    • adjudications.txt-1 B
    • books.txt-1 B
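
The MD5 values listed for the two archives can be used to check a download for corruption. A minimal sketch using Python's standard `hashlib` module follows. It assumes the archives were saved under their listed names in a single directory. The helper names `md5_of_file` and `verify` are hypothetical.

```python
import hashlib
from pathlib import Path

# Checksums as listed in this item record.
EXPECTED_MD5 = {
    "train-valid-test-split.zip": "d523ed9fe22b8b4b350a578cd02455d1",
    "by-source.zip": "ed3ed3ef4ae99b2a694e12e12b97f425",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each archive name to True if it exists and matches its listed MD5."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify().items():
        print(name, "OK" if ok else "missing or checksum mismatch")
```

MD5 is adequate here for detecting accidental corruption. It is not a guarantee against deliberate tampering.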
