YARN/Format

YARN Data Formats

A notable feature of Yet Another RussNet (YARN) is the openness of its data and of the formats used to exchange data on both the input and output sides. Format specifications and examples are available at https://github.com/russianwordnet/yarn-formats.

Data export

The thesaurus is available in several machine-readable formats: the CSV dump contains synsets and their metadata, and the XML dump contains the annotated portion of the thesaurus. Dumps are generated daily at midnight Yekaterinburg Time and reflect the state of the resource at that moment.

The following notation is used for coding grammatical information: n — noun, a — adjective, v — verb. The thesaurus does not contain word surface forms, only word lemmas.

CSV Format

The synsets of the thesaurus are available in CSV format at http://russianword.net/yarn-synsets.csv. Columns are separated by commas, and the first line is the header. Each subsequent line contains the following fields:

  • id — unique identifier of the synset, which can also be used as part of the URL path in the HTTP interface: http://russianword.net/synsets/<id>.
  • words — list of the words in the synset, delimited by semicolons (;).
  • grammar — general grammatical descriptor of the synset.
  • domain — domain of the synset (where available).

Here is a sample from the file yarn-synsets.csv:

id,words,grammar,domain
1,автомашина;машина;колёса;драндулет;авто;автомобиль;тачка,n,транспортное

This sample describes the synset that is available through the HTTP interface at http://russianword.net/synsets/1. It belongs to the domain «транспортное» (transport/vehicles) and contains the nouns «автомашина», «машина», «колёса», «драндулет», «авто», «автомобиль», «тачка».
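The field layout described above can be read with a few lines of Python. This is only a minimal sketch: it parses the sample row shown on this page from an in-memory string, and the names `read_synsets`, `GRAMMAR_LABELS` and `SAMPLE` are hypothetical helpers introduced for illustration, not part of YARN.

```python
import csv
import io

# Grammatical codes as listed on this page: n, a, v.
GRAMMAR_LABELS = {"n": "noun", "a": "adjective", "v": "verb"}

# The sample from yarn-synsets.csv quoted above.
SAMPLE = """id,words,grammar,domain
1,автомашина;машина;колёса;драндулет;авто;автомобиль;тачка,n,транспортное
"""


def read_synsets(stream):
    """Yield one dict per synset row of a yarn-synsets.csv style stream."""
    for row in csv.DictReader(stream):
        yield {
            "id": row["id"],
            # Words inside the "words" column are separated by semicolons.
            "words": row["words"].split(";"),
            # Unknown codes are passed through unchanged.
            "grammar": GRAMMAR_LABELS.get(row["grammar"], row["grammar"]),
            # The domain is optional, so an empty cell becomes None.
            "domain": row["domain"] or None,
        }


synsets = list(read_synsets(io.StringIO(SAMPLE)))
for synset in synsets:
    print(synset["id"], synset["grammar"], synset["words"])
```

To process the real dump, the same function could be given a file opened with `encoding="utf-8"` and `newline=""`, assuming the dump is UTF-8 encoded.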

XML Format

Yet Another RussNet stores more than the words of each synset: the XML dump also carries dictionary labels, definitions and usage examples.

The XML dump is available at http://russianword.net/yarn.xml and complies with the XSD schema yarn.xsd. We will be happy to accept a pull request that translates this schema into English.

Schematically, the tree of a document looks like this:

+ yarn
|--+ words
|  |--+ wordEntry(id)
|     |-- word    — lemma
|     |-- grammar — grammatical descriptor
|     |-- url     — URL of the source
|--+ synsets
   |--+ synsetEntry(id)
      |--+ word(ref → wordEntry)
         |-- mark       — dictionary label
         |-- definition — definition of the word in the synset
         |-- example    — word usage example

An example of a document that follows this schema: yarn.xml.
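To illustrate how the `ref` links in the tree above tie synsets to word entries, here is a rough sketch that resolves them with Python's standard `xml.etree.ElementTree`. The XML fragment is invented for illustration and only mirrors the schematic tree: the identifier values, the absence of an XML namespace, and `id` and `ref` being plain attributes are all assumptions, so yarn.xsd remains the authoritative reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the schematic tree; not taken from yarn.xml.
SAMPLE = """<yarn>
  <words>
    <wordEntry id="w1">
      <word>автомобиль</word>
      <grammar>n</grammar>
    </wordEntry>
    <wordEntry id="w2">
      <word>машина</word>
      <grammar>n</grammar>
    </wordEntry>
  </words>
  <synsets>
    <synsetEntry id="s1">
      <word ref="w1"/>
      <word ref="w2"/>
    </synsetEntry>
  </synsets>
</yarn>"""

root = ET.fromstring(SAMPLE)

# Map each wordEntry id to its lemma.
lemmas = {
    entry.get("id"): entry.findtext("word")
    for entry in root.iterfind("words/wordEntry")
}

# Resolve every word reference inside a synset to the lemma it points to.
synsets = {
    entry.get("id"): [lemmas[word.get("ref")] for word in entry.iterfind("word")]
    for entry in root.iterfind("synsets/synsetEntry")
}

for synset_id, words in synsets.items():
    print(synset_id, words)
```

If the real dump declares a namespace, the element paths would need to be qualified accordingly.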

Data import

The data import procedure is described in detail on the page YARN/Словари (YARN/Dictionaries).

Lexicon

The lexicon format is XML and largely resembles the data export format. The difference is that it contains only words, not synsets. The lexicon complies with the schema yarn-raw-lexicon.xsd. An example of a lexicon file: yarn-raw-lexicon.xml.

Synonyms

Lists of synonyms are written in CSV format with two columns, word1 and word2; each row represents a pair of synonyms. The comma is used as the delimiter. Example: yarn-raw-synonyms.csv. A sample from the file:

word1,word2
актёр,артист
актриса,артист
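As a sketch of how such a pair list might be consumed, the following Python snippet loads the sample above into a lookup table. The function name `load_synonyms` is hypothetical, and treating each pair as symmetric (word1 is a synonym of word2 and vice versa) is an assumption of this example rather than something the format states.

```python
import csv
import io
from collections import defaultdict

# The sample from yarn-raw-synonyms.csv quoted above.
SAMPLE = """word1,word2
актёр,артист
актриса,артист
"""


def load_synonyms(stream):
    """Build a word -> set of synonyms mapping from word1,word2 rows."""
    graph = defaultdict(set)
    for row in csv.DictReader(stream):
        first, second = row["word1"], row["word2"]
        # Assumption: synonymy is symmetric, so store both directions.
        graph[first].add(second)
        graph[second].add(first)
    return graph


graph = load_synonyms(io.StringIO(SAMPLE))
print(sorted(graph["артист"]))
```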

Frequencies

Word frequencies are represented in the format of the file freqrnc2012.csv, which is based on materials of the Russian National Corpus: http://hsemysql.wikispaces.com/aggregation.

Notes

This page is a translation of the Russian page YARN/Формат.