Methods and Tools for Development of the Russian Reference Corpus
Serge Sharoff
University of Leeds
Talk map
– History of development of Russian corpora
– What is different from the BNC:
  – the text typology (metatextual annotation)
  – the proportion of domains and genres of texts
  – the scheme of morphological annotations
  – the query language
The history
– (Zasorina, 1977): a corpus-based frequency dictionary
– (Lönngren, 1993): Uppsala corpus
– The Computer Fund of Russian Language (1985-)
– Modern corpora (2002):
  – Modern fiction (500 kW) with morphological annotations
  – News wires (200 kW) with syntactic annotations
  – Newspapers (200 kW) with genre annotations
Differences from the BNC: Text typology
– EAGLES (Sinclair, 1996) and TEI guidelines
– Internal parameters
  – I1: domain
  – I2: style
– External parameters
  – E1: origin
  – E2: state
  – E3: aims (audience and outcome intended)
E1: the origin of a text
– the year of text creation
– the authorship (single|multiple|corporate)
– the author's age (child|teen|young|mid|senior)
– the author's sex (male|female)
– the author's place of origin
E2: the appearance of the text
– the mode (written|spoken|w-to-be-spoken|electronic)
– the hierarchy of types for written texts:
  – printed: books / newspapers / magazines / ephemera
  – typed (all sorts of reports and documentation)
  – correspondence: official / personal
E3.1: the audience of the text
– the size of the audience
  – private: 2 / 3 / 5 / 6-20
  – public: small / medium / large / very large
– the age of the audience
– the constituency of the audience: general / informed / specialist
E3.2: the intended outcome of the text
– discussion: polemic / position statements / arguments
– recommendation: reports / advice / legal documents
– recreation:
  – fiction (general, detective, scifi, love, humour, drama…)
  – nonfiction (biography, memoirs, letters)
– information
– instruction (textbooks, practical books)
Internal parameters
– I1: domains (a BNC-derived list)
– I2: styles
  – Fiction: neutral / regional / lowly / official / individual
  – Nonfiction: neutral / formal / informal / academic
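The typology can be pictured as a record attached to each text. The following is only a sketch of that idea: the class name `TextDescription`, its field names, and the `problems` helper are hypothetical and are not taken from the corpus tools. The allowed values are the ones listed on the slides; the I1 domain list and I2 style are left as free strings.

```python
from dataclasses import dataclass
from typing import List, Optional

# Allowed values copied from the slides; all names here are hypothetical.
AUTHORSHIP = {"single", "multiple", "corporate"}
AUTHOR_AGE = {"child", "teen", "young", "mid", "senior"}
AUTHOR_SEX = {"male", "female"}
MODE = {"written", "spoken", "w-to-be-spoken", "electronic"}
CONSTITUENCY = {"general", "informed", "specialist"}
OUTCOME = {"discussion", "recommendation", "recreation",
           "information", "instruction"}


@dataclass
class TextDescription:
    year: int                          # E1: year of text creation
    authorship: str                    # E1
    mode: str                          # E2
    constituency: str                  # E3.1
    outcome: str                       # E3.2
    domain: str                        # I1: free string in this sketch
    style: str                         # I2: free string in this sketch
    author_age: Optional[str] = None   # E1, may be unknown
    author_sex: Optional[str] = None   # E1, may be unknown

    def problems(self) -> List[str]:
        """Return the names of fields whose values are outside the typology."""
        checks = [
            ("authorship", self.authorship, AUTHORSHIP),
            ("mode", self.mode, MODE),
            ("constituency", self.constituency, CONSTITUENCY),
            ("outcome", self.outcome, OUTCOME),
        ]
        # Optional fields are checked only when a value is present.
        if self.author_age is not None:
            checks.append(("author_age", self.author_age, AUTHOR_AGE))
        if self.author_sex is not None:
            checks.append(("author_sex", self.author_sex, AUTHOR_SEX))
        return [name for name, value, allowed in checks if value not in allowed]


novel = TextDescription(year=1998, authorship="single", mode="written",
                        constituency="general", outcome="recreation",
                        domain="Imaginative", style="neutral")
print(novel.problems())
```

Keeping each parameter as a closed set of values makes it cheap to reject a mistyped annotation before it enters the corpus.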
The Systemic Coder for annotating
The comparison of coverage
Domain                     BNC      BOKR
Spoken                     10.7 %    5 %
Imaginative                16.7 %   30 %
Politics (world affairs)   18.9 %   15 %
Commerce                    7.6 %    5 %
Natural sciences            3.8 %    5 %
Applied sciences            7.2 %   10 %
Social sciences            14.2 %   12 %
Art                         6.8 %    5 %
Leisure                    11.2 %   10 %
Belief and thought          3.1 %    3 %
Morphosyntactic annotation: facts
– Rich inflectional morphology: 6 cases, 3 genders, 2 numbers: 36 feature bundles for adjectives (144 for participles)
– Many ambiguities:
  – horosho – adj,neutr,sing|adv|predicative
  – znakomoj – adj|noun, gen|dat|loc, sing
  – knigi – [sing,gen]|[plur,nom]
– Shallow parsing can decrease the ambiguity: horosho znakomoj knigi (well-known book)
– Reduction of the ambiguity: 60% -> 30% (grammatical), 30% -> 20% (lexical)
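As an illustration of how agreement can reduce the ambiguity in "znakomoj knigi", the sketch below intersects the candidate readings of the two words. It is a toy example, not the shallow parser used for the corpus: the `agree` function and the tag spellings are assumptions, and only the readings shown on the slide are encoded.

```python
from itertools import product

# Tag labels are this sketch's own spelling of the standard Russian categories.
CASES = ["nom", "gen", "dat", "acc", "ins", "loc"]
GENDERS = ["masc", "fem", "neut"]
NUMBERS = ["sing", "plur"]

# 6 cases x 3 genders x 2 numbers gives the 36 adjective bundles.
ADJECTIVE_BUNDLES = list(product(CASES, GENDERS, NUMBERS))

# Candidate (number, case) readings, copied from the slide.
ZNAKOMOJ = {("sing", "gen"), ("sing", "dat"), ("sing", "loc")}
KNIGI = {("sing", "gen"), ("plur", "nom")}


def agree(modifier, head):
    """Keep only the readings on which both words agree in number and case."""
    return modifier & head


print(len(ADJECTIVE_BUNDLES))
print(agree(ZNAKOMOJ, KNIGI))
```

Under these assumptions the three readings of "znakomoj" and the two readings of "knigi" collapse to the single shared reading, singular genitive.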
The annotation scheme
– Requirements:
  – representation of relevant morphosyntactic facts
  – compact representation of the ambiguity
  – easy indexing and searching
– The solution is the TEI scheme with some modifications: xxx
An example of the annotation
Mne bylo ochen' zhalko svoih chasov, … (I was very sorry about losing my watch, …)
Мне было очень жалко своих часов
The query interface
Other activities
– a corpus of classic Russian ( )
– a parallel corpus of translations from/into Russian
– a corpus of old Russian (X-XIII centuries)
– a Russian dependency treebank
Advertisements
– Russian Standard (existing 500 kW)
– A corpus of newspaper texts (200 kW)
– A frequency dictionary (from a 40 MW corpus)
– BOKR corpus description (100 MW)