dc.contributor.author | Daðason, Jón Friðrik |
dc.contributor.author | Loftsson, Hrafn |
dc.contributor.author | Sigurðardóttir, Salome Lilja |
dc.contributor.author | Björnsson, Þorsteinn |
dc.date.accessioned | 2022-09-30T13:34:35Z |
dc.date.available | 2022-09-30T13:34:35Z |
dc.date.issued | 2021-01-11 |
dc.identifier.uri | http://hdl.handle.net/20.500.12537/285 |
dc.description | IceSum is a collection of 1,000 Icelandic news articles from mbl.is that have been manually annotated with summaries. The corpus contains local (50%), world (26%), business (14%) and sports (10%) news articles published between 1998 and 2019. The summaries are extractive and consist of sentences and sentence fragments from the original articles. Each article in the corpus is listed with a unique ID, title, original text, extractive summary, category, publication date and source URL. This version of IceSum includes a script for generating training data for the TransformerSum library with the same training, validation and test splits that were used in the original paper. Jón Friðrik Daðason, Salome Lilja Sigurðardóttir and Þorsteinn Björnsson contributed to this project under the supervision of Hrafn Loftsson. |
dc.language.iso | isl |
dc.publisher | Reykjavik University |
dc.relation.isreferencedby | https://aclanthology.org/2021.naacl-srw.2/ |
dc.relation.replaces | http://hdl.handle.net/20.500.12537/96 |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://github.com/cadia-lvl/icesum |
dc.subject | text corpus |
dc.subject | summarization |
dc.subject | news articles |
dc.title | IceSum - Icelandic Text Summarization Corpus (22.09) |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | Clarin IS Repository |
contact.person | Jón Friðrik Daðason jond19@ru.is Reykjavik University |
contact.person | Hrafn Loftsson hrafn@ru.is Reykjavik University |
sponsor | Strategic Research and Development Programme for Language Technology 180037-5301 Sjálfvirk samantekt íslensks texta ("Automatic Text Summarization for Icelandic") nationalFunds |
size.info | 1000 articles |
files.size | 784369 |
files.count | 1 |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
- Name: icesum.zip
- Size: 765.99 KB
- Format: application/zip
- MD5: 188e54da0a1347d3d086e8c128b7a1a1
- icesum
  - readme.txt (4 kB)
  - requirements.txt (22 B)
  - process_transformersum.py (6 kB)
  - icesum.json (2 MB)
  - splits.json (15 kB)
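
The MD5 value listed for icesum.zip can be used to check that a download is intact. The following is a minimal sketch, assuming the archive has been saved as icesum.zip in the current directory; that path and the helper names md5_of_file and verify_archive are illustrative and not part of the IceSum distribution.

```python
import hashlib
from pathlib import Path

# MD5 value taken from this record's file listing.
EXPECTED_MD5 = "188e54da0a1347d3d086e8c128b7a1a1"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("icesum.zip")  # assumed local download location
    if archive.exists():
        print("checksum OK" if verify_archive(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

MD5 is suitable here only for detecting accidental corruption during transfer; it is not a defence against deliberate tampering.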