Guidelines for Data Submission
- Names of files and directories
- File extensions
- File compression
- Controlled values
- Language codes
- Dates and times
- Accepted binary formats
- Audiovisual annotation
- Audiovisual source language data
- Image source language data
- Lexical resource
- Statistical data
- Text annotation
- Textual source language data
- XML schemas
- Compression and packaging
- Encoding of textual files
- Character encoding
- Plain text files
- Tabular data
- HTML documents
- XML documents
- TEI documents
- Linguistic annotation vocabularies
The CLARIN-IS repository does not usually accept entries without data (i.e. without bitstreams attached to the entry). The guidelines below describe how deposited language resources should be structured, which formats the CLARIN-IS repository accepts, and which standards should be used as annotation formats in textual language resource files.
Names of files and directories
Filenames and directory names should only contain ASCII letters, digits, the hyphen ("-") and period (".") characters. They should not contain spaces, underscores, brackets, quotes, dollars, slashes, colons, or other punctuation characters (except hyphen and period), nor accented letters or other non-ASCII characters. Examples of good filenames are "news.v1.zip", "ParlaMint-IS.xml", "TextNormalization-statistics.tsv".
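The naming rule above can be checked automatically before depositing. The following is a minimal sketch, not an official CLARIN-IS tool; the function names `is_valid_name` and `invalid_components` are hypothetical and chosen only for illustration.

```python
import re

# Allowed characters per the rule above: ASCII letters, digits, hyphen, period.
_ALLOWED_NAME = re.compile(r"[A-Za-z0-9.-]+")


def is_valid_name(name: str) -> bool:
    """Return True if a single file or directory name uses only allowed characters."""
    return _ALLOWED_NAME.fullmatch(name) is not None


def invalid_components(path: str) -> list[str]:
    """Return the components of a '/'-separated relative path that break the rule."""
    return [part for part in path.split("/") if part and not is_valid_name(part)]


if __name__ == "__main__":
    for candidate in ("news.v1.zip", "ParlaMint-IS.xml", "my file.txt", "corpus_v1.zip"):
        print(candidate, "ok" if is_valid_name(candidate) else "rename needed")
```

Checking each path component separately makes it easy to report exactly which directory or file inside a resource needs renaming.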
File extensions
Standard or commonly recognised file extensions should be used, such as ".txt", ".xml", ".jpg". Double extensions can be used (e.g. "igc.tei.xml" or "igc.TEI.zip") to indicate that a file is in a standard encoding, or that an archive contains files of a certain type.
In the rest of this document, the preferred extensions are given next to the file types.
File compression
When resources deposited in the repository are compressed, a complete directory should be compressed, and the name of the compressed file should match the directory it unpacks into. For example, the file "IGC-Parla.21.05.zip" should unpack into the directory "IGC-Parla.21.05/", which then contains the files and possibly subdirectories. It is recommended that the directory also contain a README text file that gives the title of the resource, its handle, and a short description.
CLARIN.IS prefers ZIP (.zip) files, but also accepts gzip-compressed TAR archives (.tgz) or, for single files, GNU zip (.gz).
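One way to produce an archive that unpacks into a directory of the same name is Python's standard `shutil.make_archive`. This is a sketch under the assumption that the resource already sits in a single directory; the helper name `pack_directory` is hypothetical.

```python
import shutil
from pathlib import Path


def pack_directory(directory: str) -> Path:
    """Zip `directory` so that the archive unpacks into a directory of the same name.

    For "IGC-Parla.21.05" this writes "IGC-Parla.21.05.zip" next to the
    directory, and every entry in the archive starts with "IGC-Parla.21.05/".
    """
    src = Path(directory).resolve()
    archive = shutil.make_archive(
        base_name=str(src),        # make_archive appends ".zip" itself
        format="zip",
        root_dir=str(src.parent),  # archive paths are relative to the parent ...
        base_dir=src.name,         # ... so they keep the top-level directory name
    )
    return Path(archive)
```

Passing the parent as `root_dir` and the directory name as `base_dir` is what keeps the top-level directory inside the archive; archiving from within the directory itself would scatter the files into whatever location the user unpacks in.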
Controlled values
Language codes
When the data (or a filename) needs to refer to a particular language, language codes should be used rather than language names. Where one exists, the two-letter ISO 639-1 code should be used; for languages that do not have a two-letter code, the three-letter ISO 639-3 code should be used.
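The preference for two-letter codes can be expressed as a simple lookup with a fallback. The sketch below assumes a small hand-written mapping purely for illustration; a real script would load the complete ISO 639 tables, and the names `ISO_639_3_TO_1` and `preferred_code` are hypothetical.

```python
# Tiny illustrative excerpt only; not a complete ISO 639 table.
ISO_639_3_TO_1 = {
    "isl": "is",  # Icelandic
    "eng": "en",  # English
    "dan": "da",  # Danish
}


def preferred_code(iso639_3: str) -> str:
    """Return the ISO 639-1 code if one is known, else the ISO 639-3 code itself."""
    code = iso639_3.lower()
    return ISO_639_3_TO_1.get(code, code)


if __name__ == "__main__":
    print(preferred_code("isl"))  # Icelandic has a two-letter code
    print(preferred_code("non"))  # Old Norse has no two-letter code
```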
Dates and times
All dates and times that appear in a machine-processable context should follow ISO 8601, i.e. “2020-12-28” for a date, “23:21:21” for a time, and “2020-12-28T23:21:21” for a combination of the two.
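In Python, for instance, the standard `datetime` module produces and parses these forms directly. This is only a sketch; the timestamp value is an arbitrary example.

```python
from datetime import datetime

# An arbitrary example timestamp.
stamp = datetime(2020, 12, 28, 23, 21, 21)

date_part = stamp.date().isoformat()                    # "2020-12-28"
time_part = stamp.time().isoformat(timespec="seconds")  # "23:21:21"
combined = stamp.isoformat(timespec="seconds")          # "2020-12-28T23:21:21"

# Parsing the combined form gives back the same value.
assert datetime.fromisoformat(combined) == stamp
```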
Accepted binary formats
Below is a list of the formats accepted by CLARIN.IS. The formats have been grouped into functional domains. Each item in the list is also a link to further information about the format, usually the one given in CLARIN's Standards Information System (SIS), accessible at https://standards.clarin.eu/.
- Audiovisual Annotation
- Audiovisual Source Language Data
- Image Source Language Data
- Lexical Resource
- Statistical Data
- Text Annotation
- Textual Source Language Data
- XML schemas
- Compression and packaging
Encoding of textual files
As most repository submissions involve files that are essentially text files (including numeric data, source program files, XML files, etc.), we explain in more detail here how such files should be encoded.
Character encoding
CLARIN.IS accepts only Unicode files. We do not accept files in 8-bit encodings, such as ISO 8859 or Windows code pages. Files should be encoded in UTF-8; the exception is text in non-Latin scripts, such as Japanese, which may use UTF-16.
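Files in a legacy 8-bit encoding can be detected and converted before deposit. The sketch below assumes the depositor knows the original encoding (it cannot be guessed reliably); the helper names `is_utf8` and `convert_to_utf8` are hypothetical.

```python
from pathlib import Path


def is_utf8(path) -> bool:
    """Return True if the whole file decodes as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def convert_to_utf8(src, dst, source_encoding: str) -> None:
    """Re-encode `src` from `source_encoding` (e.g. "iso-8859-1") to UTF-8.

    Working on bytes leaves line endings exactly as they were.
    """
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

Note that a successful UTF-8 decode does not prove the content is correct, only that the bytes are valid UTF-8, so converted files are still worth a quick visual check.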
Plain text files
For unstructured text, we accept plain text files (.txt). Trivial formatting conventions, for example that a line break indicates a new paragraph or that text in square brackets is a transcriber comment, can also be used, as long as they are explained in a README file.
Tabular data
For spreadsheet or database-like data, we accept commonly used formats such as tab-separated (.txt/.tsv/.tab) and comma-separated (.txt/.csv) values. Tabular files should contain a header row, and the data should be accompanied by a README file explaining the meaning of the columns.
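A tab-separated file with a header row can be written with Python's standard `csv` module. This is a minimal sketch; the helper name `write_tsv` and the column names in the usage example are hypothetical.

```python
import csv


def write_tsv(path, header, rows) -> None:
    """Write `rows` to a UTF-8 tab-separated file, with `header` as the first row."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical columns; their meaning would be explained in the README.
    write_tsv(
        "TextNormalization-statistics.tsv",
        ["word", "frequency"],
        [["og", 1520], ["að", 1432]],
    )
```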
Annotated corpora can be submitted in the CoNLL-U format (.conllu) used by the Universal Dependencies project.
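CoNLL-U stores one token per line in ten tab-separated fields, with comment lines starting with "#" and a blank line after each sentence. The following reader is a minimal sketch for illustration, not a full implementation of the format; the function name `read_conllu` and the sample annotation are made up.

```python
# The ten CoNLL-U fields, in order.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def read_conllu(text: str) -> list[list[dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dictionaries."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):
            values = line.split("\t")
            if len(values) != len(CONLLU_FIELDS):
                raise ValueError(f"expected 10 fields, got {len(values)}: {line!r}")
            current.append(dict(zip(CONLLU_FIELDS, values)))
    if current:
        sentences.append(current)
    return sentences


# Made-up two-token sample for illustration.
SAMPLE = (
    "# text = Hundurinn sefur\n"
    "1\tHundurinn\thundur\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tsefur\tsofa\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)
```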
HTML documents
We do not accept HTML (.html/.htm) documents as primary data; however, they can be used to document the entry, e.g. to explain the structure of the data or its linguistic annotation. Such HTML documents should be valid according to some version of HTML (preferably XHTML) and self-sufficient, i.e. if CSS is used, it should preferably be embedded in the HTML file(s) or stored together with them.
XML documents
By far the most common submission format is XML (.xml), which allows for richly and hierarchically structured text data. CLARIN.IS accepts any valid XML document, provided that:
- the schema used to validate the document is well known and publicly available from a stable location, together with its documentation, e.g. RDF/XML (.rdf) or ELAN (.eaf);
- or the schema, including its documentation, is part of the repository entry.
We accept schemas in any XML schema definition language, i.e. DTD (.dtd), RelaxNG (.rng/.rnc) and W3C XML Schema (.xsd), as well as Schematron (.xml).
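As a first check before depositing, an XML file can be tested for well-formedness with Python's standard library. Note the limitation: this sketch does not validate against a DTD, RelaxNG or W3C schema, which requires a separate validating tool. The function name `well_formedness_error` is hypothetical.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(path):
    """Return None if the file parses as XML, else the parser's error message.

    This checks well-formedness only; schema validity is a separate step.
    """
    try:
        ET.parse(path)
    except ET.ParseError as error:
        return str(error)
    return None
```

A file that fails this check cannot be valid against any schema, so it is a cheap filter to run over a whole directory before full validation.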
TEI documents
The preferred XML encoding of CLARIN.IS repository entries is TEI (.tei/.xml), i.e. using the Text Encoding Initiative Guidelines for encoding structured language resources, such as language corpora, machine-readable dictionaries, text-critical editions, etc.
When the type of the deposited language resource is covered by one of the standard or best-practice customisations of the TEI, such as ISO 24624:2016 for transcriptions of spoken language, TEI Lex0 for dictionaries, or Parla-CLARIN for corpora of parliamentary debates, that schema should be used in preference to a bespoke or generic TEI encoding.
Linguistic annotation vocabularies
Most language corpora are annotated on various levels with linguistic categories. These categories must be documented, either at stable external URLs or together with the repository entry, i.e. in included files or, especially with TEI-encoded corpora, as part of the corpus document itself.