

Language Data Resources About Corpora

J. Sinclair: “Language looks rather different when you look at a lot of it at once.”
P. Eisner: “Do you know it, this language of yours? One would think that if I am to love something, I must know it. Yet you do not know Czech, and when I say so, it is neither an accusation nor even a reproach. You cannot know it and encompass it; nobody has ever fully managed that…”

Merriam-Webster OnLine:

Corpus
F. Čermák: corpus – a structured, unified (and often also tagged) large collection of language data
T. McEnery: Corpus data – the raw fuel of NLP

Corpus linguistics
A study of language that includes all processes related to the processing, usage and analysis of written or spoken machine-readable corpora.
Corpus linguistics is a relatively modern term for a methodology based on examples of ‘real life’ language use.
Corpus linguistics is not a language theory.

Corpora classification
A. by medium:
– printed, electronic text, digitized speech, video
B. by design method:
– balanced vs. special
C. by language variables:
– monolingual vs. multilingual
– original vs. translations
– native speaker vs. learner
D. by language evolution:
– synchronic vs. diachronic
E. plain vs. annotated

Balanced corpora (?)
T. McEnery: “Sampling is inescapable.”
Proportions corresponding to real language usage
– Is that possible?
Criteria for choosing styles, genres, and eventually concrete texts?
reception (a few authors, large audience) vs. perception (production of a large community of language users)
N. Chomsky: “Any natural corpus will be skewed. Some sentences won’t occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description would be no more than a mere list.”

Corpus size
Brown Corpus – 1 MW, i.e. one million words (1964)
British National Corpus – 100 MW (1994)
Cosmas – 1.6 GW, i.e. 1.6 billion words (2004)

Exercise Could you estimate the amount of Czech texts (measured in running words) available on the Internet?

Antecedents of corpora
Excerption slips
– for Czech, collected systematically from 1911
Electronic corpus of Czech texts
– 1970s
– around 500 kW (thousand words)

Corpus annotation
K. Pala: “Annotating consists of adding selected linguistic information to an existing corpus of written or spoken language. Typically, this is done by some kind of coding being attached (semi)automatically or manually to the electronic representation of the text.”
Raw texts: difficult to exploit
Solution: gradual “information adding” (more exactly: adding information in an explicit, machine-tractable form)
Annotation → ease of exploitation + reusability

Criticism of corpus annotation
Corpus annotations produce impure corpora
– forced interpretations
Consistency vs. accuracy

Czech National Corpus
ÚČNK (Institute of the Czech National Corpus) founded in 1994
diachronic section th century – DIAKORP
synchronic section – from around 1900
– written language – 100 MW in SYN2000
– spoken language – Prague Spoken Corpus (PMK), Brno Spoken Corpus (BMK)
– dialects

Czech National Corpus

SYN2000

Preprocessing
Collect textual material
– electronic form
– scanning + OCR
– trend: WWW as a corpus
Conversion and cleaning
– unified format (problem: losing some information)
– unified encoding (problem: encoding detection)
Document classification
Document segmentation
– segmentation on sentence boundaries (problem: tables, direct speech…)
– tokenization on word boundaries (problem: what is a word?)
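
Example (not part of the original slides): the two segmentation steps can be illustrated with a minimal Python sketch. It is a rough illustration, not a production segmenter. The abbreviation list, the function names split_sentences and tokenize, and the regular expressions are all assumptions made for this example. The tokenizer simply decides that hyphenated and apostrophized forms count as one word, which is one possible answer to "what is a word?".

```python
import re

# Hypothetical abbreviation list; a real segmenter would need a far larger one.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Dr.", "Mr.", "Mrs.", "No."}


def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace and an uppercase letter,
    unless the token carrying the punctuation is a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.end()].strip()
        last_token = candidate.split()[-1]
        if last_token in ABBREVIATIONS:
            continue  # e.g. "Dr. Smith": not a sentence boundary
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def tokenize(sentence):
    """Return word tokens (keeping internal hyphens and apostrophes)
    and single punctuation marks as separate tokens."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)
```

Tables and direct speech, the problems named on the slide, are not handled here; they usually require layout information or quotation tracking on top of rules like these.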

(Morphological) Tagging
(1) Morphological analysis
– for each word form, list all possible lemma+tag pairs (or a list of sequences of such pairs, if tokenization is not straightforward)
(2) Disambiguation
– choose one lemma+tag pair
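
Example (not part of the original slides): the two-step scheme can be sketched in Python as follows. Everything in it is assumed for illustration: the toy lexicon, the tagset (DET, NOUN, VERB, AUX), and the tag-bigram preference scores. Real taggers learn such preferences from annotated corpora, and the greedy left-to-right choice below is only one simple disambiguation strategy.

```python
# Toy lexicon (hypothetical): word form -> all possible (lemma, tag) pairs.
LEXICON = {
    "the": [("the", "DET")],
    "can": [("can", "NOUN"), ("can", "VERB"), ("can", "AUX")],
    "rusts": [("rust", "VERB"), ("rust", "NOUN")],
}

# Hypothetical preference scores for (previous tag, current tag) bigrams.
BIGRAM_SCORE = {
    ("<s>", "DET"): 3,
    ("DET", "NOUN"): 3,
    ("NOUN", "VERB"): 2,
    ("NOUN", "NOUN"): 1,
}


def analyze(word_form):
    """Step 1: list all lemma+tag pairs for a word form."""
    return LEXICON.get(word_form.lower(), [(word_form, "UNKNOWN")])


def disambiguate(tokens):
    """Step 2: greedily choose one lemma+tag pair per token,
    preferring the tag that scores best after the previously chosen tag."""
    previous_tag = "<s>"  # sentence-start marker
    result = []
    for token in tokens:
        candidates = analyze(token)
        lemma, tag = max(
            candidates,
            key=lambda pair: BIGRAM_SCORE.get((previous_tag, pair[1]), 0),
        )
        result.append((token, lemma, tag))
        previous_tag = tag
    return result


if __name__ == "__main__":
    for token, lemma, tag in disambiguate(["The", "can", "rusts"]):
        print(token, lemma, tag)
```

With this toy data, "can" is analyzed three ways and resolved to a noun because it follows a determiner.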

Parallel corpora
texts and their translations into another language (or into several languages)
added value: alignment
– explicit pairing of corresponding chunks of text
– ideally diagonal
– often just sentence-level alignment
– automatic alignment? anchor points, word pairs, …
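
Example (not part of the original slides): one common family of automatic sentence aligners relies on the observation that long sentences tend to be translated by long sentences. The Python sketch below is a simplified length-based dynamic-programming aligner in that spirit. It is not the method used for any corpus mentioned in these slides, and the cost function and the penalty values (skip_penalty, merge_penalty) are arbitrary assumptions. It allows 1-1, 1-0, 0-1, 2-1 and 1-2 pairings and returns the cheapest path through the alignment matrix, which stays close to the diagonal when the translation is faithful.

```python
def align(src, tgt, skip_penalty=20, merge_penalty=5):
    """Align two lists of sentences by character length.

    Returns a list of (source_indices, target_indices) pairs.
    Cost of a pairing = |length difference| + penalty for non 1-1 pairings.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j]: cheapest alignment of the first i source and first j target sentences
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    # (source sentences consumed, target sentences consumed, penalty)
    beads = [
        (1, 1, 0),
        (1, 0, skip_penalty),
        (0, 1, skip_penalty),
        (2, 1, merge_penalty),
        (1, 2, merge_penalty),
    ]
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                new_cost = cost[i][j] + abs(src_len - tgt_len) + penalty
                if new_cost < cost[ni][nj]:
                    cost[ni][nj] = new_cost
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the pairings.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    pairs.reverse()
    return pairs
```

Each returned pair lists the indices of the source and target sentences grouped together; an empty list on one side means a sentence was left without a counterpart. Anchor points such as numbers or known word pairs, mentioned on the slide, could be added as extra terms in the cost.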

MULTEXT-EAST
Multilingual Text Tools and Corpora for Central and Eastern European Languages
Lexical resources
– entry: word form + lemma + MSD
– MSD – morphosyntactic description (e.g. Ncms – Noun common masculine singular)
Annotated multilingual corpus
– translations of George Orwell's "1984", about 100 kW
– Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene, as well as English (hub language)
– (and recently also Croatian, Lithuanian, Resian, Romanian, Russian, Slovene)
– hand-validated sentence alignment
Version 3 released in 2004 (publicly available)
TEI P4 XML
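
Example (not part of the original slides): an MSD is positional, so the first character gives the category and each following character gives the value of one attribute. The Python sketch below decodes only the example from the slide, Ncms. The tables are deliberately limited to the values that example shows and are not the full MULTEXT-East specification; the function name decode_msd is made up for this illustration. A hyphen in a position is treated as "attribute not applicable", which is the convention used in MSDs.

```python
# Partial, illustrative tables covering only the "Ncms" example;
# the real specifications define many more categories, attributes and values.
CATEGORY = {"N": "Noun"}
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common"}),
    ("Gender", {"m": "masculine"}),
    ("Number", {"s": "singular"}),
]


def decode_msd(msd):
    """Decode a noun MSD string into a dictionary of named features."""
    if not msd or msd[0] not in CATEGORY:
        raise ValueError(f"unsupported MSD in this sketch: {msd!r}")
    features = {"Category": CATEGORY[msd[0]]}
    for (name, values), code in zip(NOUN_ATTRIBUTES, msd[1:]):
        if code == "-":  # attribute not applicable in this position
            continue
        features[name] = values.get(code, f"unknown code {code!r}")
    return features


if __name__ == "__main__":
    print(decode_msd("Ncms"))
```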

Prague Czech-English Dependency Treebank (PCEDT)
Czech translation of 21,600 English sentences from the Wall Street Journal part of the Penn Treebank 3 corpus
Czech-English corpus of plain text from Reader's Digest consisting of 53,000 parallel sentences
automatically morphologically annotated and parsed into two levels (analytical and tectogrammatical) of dependency structures
Available via LDC

E. Brill: “More data is more important than better algorithms.”
E. Charniak: “Future is in statistics.”