YARN Data Formats
A notable feature of Yet Another RussNet is the openness of its data and of the formats used for data exchange on both the input and output sides. Format specifications and examples can be found here: https://github.com/russianwordnet/yarn-formats.
Several machine-readable formats are available for the thesaurus: the CSV dump contains synsets and their metadata, and the XML dump contains the annotated portion of the thesaurus. Dumps are generated daily at midnight Yekaterinburg Time and reflect the state of the resource at that moment.
The following notation is used for coding grammatical information:
n: noun,
a: adjective,
v: verb.
The thesaurus does not contain word surface forms, only word lemmas.
Synsets of the thesaurus are available in CSV format here: http://russianword.net/yarn-synsets.csv. Columns are delimited by commas, and the first line is the header. Each line contains the following fields:
id: unique identifier of the synset, which can also be used as part of the URL path in the HTTP interface (for example, http://russianword.net/synsets/1).
words: list of words in the synset, with a semicolon (;) as the delimiter.
grammar: general grammatical descriptor of the synset.
domain: domain of the synset (where available).
Here is a sample from the file
The sample describes synset 1, which is available through the HTTP interface at http://russianword.net/synsets/1. It belongs to the domain «транспортное» (transport/vehicles) and contains the nouns «автомашина», «машина», «колёса», «драндулет», «авто», «автомобиль», «тачка».
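As an illustration of how such a dump might be consumed, here is a minimal Python sketch that reads the four fields described above and splits the words field on semicolons. The row in SAMPLE is hypothetical: it is reconstructed from the description of synset 1, and the word order and exact header spelling are assumptions rather than a copy of the real file. The read_synsets helper and the GRAMMAR mapping are likewise illustrative names.

```python
import csv
import io

# Hypothetical row reconstructed from the description of synset 1; the word
# order and the exact header spelling are assumptions, not the real dump.
SAMPLE = (
    "id,words,grammar,domain\n"
    "1,автомашина;машина;колёса;драндулет;авто;автомобиль;тачка,n,транспортное\n"
)

# Grammatical codes as listed in the notation section.
GRAMMAR = {"n": "noun", "a": "adjective", "v": "verb"}


def read_synsets(stream):
    """Yield one dict per synset row, with the words field split on ';'."""
    for row in csv.DictReader(stream):
        yield {
            "id": row["id"],
            "words": [w for w in row["words"].split(";") if w],
            # Fall back to the raw code if it is not one of n, a, v.
            "grammar": GRAMMAR.get(row["grammar"], row["grammar"]),
            # The domain is only present where available.
            "domain": row["domain"] or None,
            # The id doubles as the last segment of the HTTP interface path.
            "url": "http://russianword.net/synsets/" + row["id"],
        }


synsets = list(read_synsets(io.StringIO(SAMPLE)))
print(synsets[0]["url"], synsets[0]["words"])
```

To process the real dump, the same function can be given a file opened with open("yarn-synsets.csv", encoding="utf-8", newline=""), assuming the file is UTF-8 encoded.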
Yet Another RussNet contains more than just words in synsets.
Schematically, the tree of a document looks like this:
+ yarn
|--+ words
|  |--+ wordEntry(id)
|     |-- word: lemma
|     |-- grammar: grammatical descriptor
|     |-- url: URL of the source
|--+ synsets
   |--+ synsetEntry(id)
      |--+ word(ref → wordEntry)
         |-- mark: dictionary label
         |-- definition: definition of the word in the synset
         |-- example: word usage example
An example of a document implementing this schema: yarn.xml.
The data import procedure is described in detail on the page YARN/Словари.
The lexicon format is XML and largely resembles the data export format. The difference is that it contains only words, not synsets. The lexicon complies with the schema yarn-raw-lexicon.xsd. Example of a lexicon file: yarn-raw-lexicon.xml.
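To show how the word references in a synset resolve to lemmas, here is a minimal Python sketch using the standard xml.etree.ElementTree module. The document in SAMPLE is hypothetical and only follows the element names from the tree; treating id and ref as XML attributes and omitting any XML namespace are assumptions, so the real yarn.xml may need adjusted paths. The resolve_synsets helper is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical document following the schematic tree. Treating `id` and `ref`
# as attributes and using no XML namespace are assumptions.
SAMPLE = """\
<yarn>
  <words>
    <wordEntry id="w1">
      <word>автомобиль</word>
      <grammar>n</grammar>
    </wordEntry>
    <wordEntry id="w2">
      <word>машина</word>
      <grammar>n</grammar>
    </wordEntry>
  </words>
  <synsets>
    <synsetEntry id="s1">
      <word ref="w1">
        <definition>placeholder definition</definition>
      </word>
      <word ref="w2"/>
    </synsetEntry>
  </synsets>
</yarn>
"""


def resolve_synsets(root):
    """Map each synsetEntry id to the lemmas its word references point to."""
    # First pass: index every wordEntry by its id.
    lemmas = {
        entry.get("id"): entry.findtext("word")
        for entry in root.iterfind("./words/wordEntry")
    }
    # Second pass: follow each ref inside a synsetEntry back to its lemma.
    result = {}
    for synset in root.iterfind("./synsets/synsetEntry"):
        result[synset.get("id")] = [
            lemmas[word.get("ref")] for word in synset.iterfind("word")
        ]
    return result


root = ET.fromstring(SAMPLE)
synsets = resolve_synsets(root)
print(synsets)
```

Keeping words and synsets in separate passes mirrors the layout of the tree, where a synset stores only references and the lemma itself lives in the words section.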
Lists of synonyms are written in CSV format with two columns, word1 and word2, which together represent a pair of synonyms. A comma is used as the delimiter. Example: yarn-raw-synonyms.csv. Sample of the file:
word1,word2
актёр,артист
актриса,артист
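A file of this shape can be loaded into a simple lookup table. The following Python sketch is one possible way to do it; the load_synonyms helper is an illustrative name, and storing each pair in both directions (treating synonymy as symmetric) is an assumption made here, not something the format itself states.

```python
import csv
import io
from collections import defaultdict

# The three lines of the sample file, header included.
SAMPLE = "word1,word2\nактёр,артист\nактриса,артист\n"


def load_synonyms(stream):
    """Build a word -> set of synonyms table from word1,word2 pairs."""
    synonyms = defaultdict(set)
    for row in csv.DictReader(stream):
        first, second = row["word1"], row["word2"]
        # Assumption: a pair is symmetric, so record it in both directions.
        synonyms[first].add(second)
        synonyms[second].add(first)
    return synonyms


synonyms = load_synonyms(io.StringIO(SAMPLE))
print(synonyms["артист"])
```

With the sample rows, «артист» ends up linked to both «актёр» and «актриса», while each of those two words is linked only to «артист».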
Word frequencies are represented in the format of the file freqrnc2012.csv, which is based on the materials of the Russian National Corpus: http://hsemysql.wikispaces.com/aggregation.
This page is a translation of the Russian page YARN/Формат.