Methods and Tools for Development of the Russian Reference Corpus
Serge Sharoff
University of Leeds


Talk map
– History of development of Russian corpora
– What is different from the BNC:
   – the text typology (metatextual annotation)
   – the proportion of domains and genres of texts
   – the scheme of morphological annotations
   – the query language

The history
– (Zasorina, 1977): a corpus-based frequency dictionary
– (Lönngren, 1993): Uppsala corpus
– The Computer Fund of Russian Language (1985-)
– Modern corpora (2002):
   – Modern fiction (500 kW) with morph. annotations
   – News wires (200 kW) with syntactic annotations
   – Newspapers (200 kW) with genre annotations

Differences from the BNC: Text typology
– EAGLES (Sinclair, 1996) and TEI guidelines
– Internal parameters
   – I1 – domain
   – I2 – style
– External parameters
   – E1 – origin
   – E2 – state
   – E3 – aims (audience and outcome intended)

E1: the origin of a text
– the year of text creation
– the authorship (single|multiple|corporate)
– the author's age (child|teen|young|mid|senior)
– the author's sex (male|female)
– the place of author's origin

E2: the appearance of the text
– the mode (written|spoken|w-to-be-spoken|electronic)
– the hierarchy of types for written texts:
   – printed: books / newspapers / magazines / ephemera
   – typed (all sorts of reports and documentation)
   – correspondence: official / personal

E3.1: the audience of the text
– the size of the audience
   – private: 2 / 3 / 5 / 6-20
   – public: small / medium / large / very large
– the age of the audience
– the constituency of the audience: general / informed / specialist

E3.2: the intended outcome of the text
– discussion: polemic / position statements / arguments
– recommendation: reports / advice / legal documents
– recreation:
   – fiction (general, detective, scifi, love, humour, drama…)
   – nonfiction (biography, memoirs, letters)
– information
– instruction (textbooks, practical books)

Internal parameters
– I1: domains (a BNC-derived list)
– I2: styles
   – Fiction: neutral / regional / lowly / official / individual
   – Nonfiction: neutral / formal / informal / academic
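The external and internal parameters above could be held in a small record with closed value sets. The following is only a sketch: the value sets are copied from the slides, but the class name TextDescription, its field names, and the problems() helper are hypothetical and are not taken from the corpus tools themselves.

```python
from dataclasses import dataclass
from typing import List

# Closed value sets as listed on the slides (E1, E2, E3).
# The dictionary keys are hypothetical field names chosen for this sketch.
ALLOWED = {
    "authorship": {"single", "multiple", "corporate"},
    "author_age": {"child", "teen", "young", "mid", "senior"},
    "author_sex": {"male", "female"},
    "mode": {"written", "spoken", "w-to-be-spoken", "electronic"},
    "constituency": {"general", "informed", "specialist"},
    "outcome": {"discussion", "recommendation", "recreation",
                "information", "instruction"},
}


@dataclass
class TextDescription:
    """Metatextual description of one corpus text (illustrative only)."""
    year: int            # E1: year of text creation
    authorship: str      # E1
    author_age: str      # E1
    author_sex: str      # E1
    mode: str            # E2
    constituency: str    # E3.1
    outcome: str         # E3.2
    domain: str          # I1, left as free text in this sketch
    style: str           # I2, left as free text in this sketch

    def problems(self) -> List[str]:
        """Return 'field=value' strings for values outside the closed sets."""
        return [f"{field}={getattr(self, field)!r}"
                for field, allowed in ALLOWED.items()
                if getattr(self, field) not in allowed]
```

A description whose values all come from the lists yields an empty problem list; a value such as mode="printed" is reported, because "printed" belongs to the hierarchy of written types and not to the mode.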

The Systemic Coder for annotating

The comparison of coverage

Domain                     BNC       BOKR
Spoken                     10.7 %     5 %
Imaginative                16.7 %    30 %
Politics (world affairs)   18.9 %    15 %
Commerce                    7.6 %     5 %
Natural sciences            3.8 %     5 %
Applied sciences            7.2 %    10 %
Social sciences            14.2 %    12 %
Art                         6.8 %     5 %
Leisure                    11.2 %    10 %
Belief and thought          3.1 %     3 %
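To see where the two designs diverge most, the coverage figures can be compared directly. The sketch below only restates the percentages from the slide; the names COVERAGE and differences are hypothetical.

```python
# Percentages copied from the comparison slide: domain -> (BNC %, BOKR %).
COVERAGE = {
    "Spoken": (10.7, 5.0),
    "Imaginative": (16.7, 30.0),
    "Politics (world affairs)": (18.9, 15.0),
    "Commerce": (7.6, 5.0),
    "Natural sciences": (3.8, 5.0),
    "Applied sciences": (7.2, 10.0),
    "Social sciences": (14.2, 12.0),
    "Art": (6.8, 5.0),
    "Leisure": (11.2, 10.0),
    "Belief and thought": (3.1, 3.0),
}


def differences(coverage):
    """Return (domain, BOKR minus BNC) pairs, largest absolute shift first."""
    diffs = [(domain, round(bokr - bnc, 1))
             for domain, (bnc, bokr) in coverage.items()]
    return sorted(diffs, key=lambda pair: abs(pair[1]), reverse=True)


if __name__ == "__main__":
    for domain, shift in differences(COVERAGE):
        print(f"{domain:26s} {shift:+.1f}")
```

The largest shifts are the higher share of imaginative texts and the lower share of spoken texts in BOKR.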

Morphosyntactic annotation: facts
– Rich inflectional morphology: 6 cases, 3 genders, 2 numbers, giving 36 feature bundles for adjectives (144 for participles)
– Many ambiguities:
   – horosho – adj,neutr,sing|adv|predicative
   – znakomoj – adj|noun, gen|dat|loc, sing
   – knigi – [sing,gen]|[plur,nom]
– Shallow parsing can decrease the ambiguity: horosho znakomoj knigi (well-known book)
– Reduction of the ambiguity: 60% -> 30% (grammatical), 30% -> 20% (lexical)
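The figures above can be illustrated with a toy example. The sketch below enumerates the 36 adjective feature bundles and then applies a single adjective-noun agreement check to the readings of znakomoj and knigi listed on the slide. The agreement check is an assumption made for illustration, not the shallow parser used for the corpus, and all function and variable names are hypothetical.

```python
from itertools import product

CASES = ["nom", "gen", "dat", "acc", "ins", "loc"]
GENDERS = ["masc", "fem", "neut"]
NUMBERS = ["sing", "plur"]


def adjective_bundles():
    """All case x gender x number combinations: 6 * 3 * 2 = 36."""
    return list(product(CASES, GENDERS, NUMBERS))


# Readings as (part of speech, case, number), taken from the slide.
ZNAKOMOJ = {(pos, case, "sing")
            for pos in ("adj", "noun")
            for case in ("gen", "dat", "loc")}
KNIGI = {("noun", "gen", "sing"), ("noun", "nom", "plur")}


def agree(modifier, head):
    """Keep adjective readings of `modifier` and noun readings of `head`
    that share case and number (a toy agreement rule)."""
    pairs = [(m, h) for m in modifier for h in head
             if m[0] == "adj" and h[0] == "noun" and m[1:] == h[1:]]
    return {m for m, _ in pairs}, {h for _, h in pairs}


if __name__ == "__main__":
    print(len(adjective_bundles()))
    print(agree(ZNAKOMOJ, KNIGI))
```

Under this rule the six readings of znakomoj and the two readings of knigi each collapse to a single genitive singular reading, which is the kind of reduction the slide attributes to shallow parsing.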

The annotation scheme
– Requirements:
   – representation of relevant morphosyntactic facts
   – compact representation of the ambiguity
   – easy indexing and searching
– The solution is the TEI scheme with some modifications: xxx

An example of the annotation
Mne bylo ochen' zhalko svoih chasov, … (I was very sorry about losing my watch, …)
Мне было очень жалко своих часов

The query interface

Other activities
– a corpus of classic Russian ( )
– a parallel corpus of translations from/into Russian
– a corpus of old Russian (X-XIII centuries)
– a Russian dependency treebank

Advertisements
– Russian Standard (existing 500 kW)
– A corpus of newspaper texts (200 kW)
– A frequency dictionary (from a 40 MW corpus)
– BOKR corpus description (100 MW)