Approaches to Computational Lexicography for German Varieties *
Andrea Abel, Stefanie Anstein
LCT day FUB, May 15th, 2008
* Paper to be presented at Euralex 2008

Corpora built for linguistic varieties of a pluricentric language such as German are an indispensable resource for a detailed and systematic variety comparison and dictionary development. We present desiderata and suggestions as well as methods from computational linguistics to systematically apply variety corpora for the enrichment, i.e. confirmation, extension and generation, of lexical entries in distinctive variant dictionaries for German. Examples are the variant dictionaries developed by Ammon et al. (2004) and Abfalterer (2007), where we focus on the South Tyrolean German language.

On the one hand, we conducted a systematic frequency analysis in newspaper variety corpora for approved lists of South Tyrolean special vocabulary in order to refine, where possible, the corresponding dictionary entries with corpus evidence. On the other hand, we filtered the list of words in our South Tyrolean corpus that could not be lemmatised by a tool developed for the variety in Germany. After removing special vocabulary collected for the South Tyrolean variety in other projects (e.g. legal terms), the remaining list was manually checked for possible new variant dictionary entries. In this way, as an innovative approach to variety corpus lexicography, a huge amount of data is filtered automatically so that only the relevant data needs to be investigated in detail. In addition, we semi-automatically extracted lexical cooccurrences from our two newspaper corpora and compared their frequencies, on the assumption that cooccurrences with a high frequency in the South Tyrolean corpus and a very low frequency in the corpus from Germany merit closer investigation. With these three methods we were able not only to refine dictionary entries for South Tyrolean German, but also to add new ones. The findings on variants can be re-used for further corpus annotation, which in turn yields better resources for computational variant lexicography of the kind described; this approach is also to be extended to more complex levels of linguistic description.

Ammon, U. et al. (2004): Variantenwörterbuch des Deutschen. Die Standardsprache in Österreich, der Schweiz und Deutschland sowie in Liechtenstein, Luxemburg, Ostbelgien und Südtirol.
Abfalterer, H. (2007): Der Südtiroler Sonderwortschatz aus plurizentrischer Sicht. Lexikalisch-semantische Besonderheiten im Standarddeutsch Südtirols.

Related Work

Variety corpora
German: DWDS-Korpus (DE), Austrian Academy Corpus (AT), Schweizer Text Korpus (CH), Korpus Südtirol (IT) -> 'C4' platform
English: International Corpus of English (ICE), London-Lund Corpus, ICAME etc.
French: Trésor de la Langue Française Informatisé (au Quebec) etc.
Spanish: Corpus del Español etc.
...

German variant dictionaries: Ammon et al. (2004), Abfalterer (2007)

German variety in South Tyrol
Studies: on language contact phenomena and particularities on the lexical and partly the morpho-syntactical level (e.g. Rizzo-Bauer 1962, Riedmann 1972, Pernstich 1984, Forer/Moser 1988, Lanthaler 1995, Ammon et al. 2004, Abfalterer 2007); hardly any on the syntagmatic level (e.g. collocations, idioms), the textual level (e.g. Riehl 1997) or on translated texts (e.g. Putzer 1984)
Interpretation of language contact phenomena: shift from research based on criticism of contact phenomena as impairment of language (e.g. Riedmann 1972) to description of "special vocabularies" (e.g. Ammon et al. 2004, Abfalterer 2007); on the purely lexical level, fewer particularities than assumed (see e.g. Ammon 2001)
Methods: manual examination and excerption of references (e.g. Riedmann 1972, Riehl 1997); consultation of informants, relevant literature and dictionaries (e.g. Abfalterer 2007); Internet as resource for additional evidence (e.g. Abfalterer 2007, Bickel 2000); now: corpus linguistics (Korpus Südtirol, 'C4' initiative)

Requirements

Desiderata for corpus lexicography
content (confirmation and enrichment of existing data, addition of new data) and data modelling (e.g. special notes, frequency labels)
methods for data acquisition (improvement and refinement of existing tools as well as development of new specific tools)
data presentation (e.g. online dictionaries with direct links to corpus data)

Research requirements on South Tyrolean German
large-scale investigations on a lexical, syntagmatic and textual level
intralinguistic comparison to other German varieties
use of state-of-the-art corpus linguistic methods and technologies

Resources

Korpus Südtirol (FUB, Eurac, UIBK) -> subcorpus 'Dolomiten' (IT), 66 million tokens (Dolo)
Corpus 'Frankfurter Rundschau' (D), 40 million tokens (FR)
Both corpora: tokenised, PoS-tagged, lemmatised, chunked; queried with CQP
data from project 'Datenbank zum Südtiroler Deutsch' IBK
lists of special vocabulary ('South Tyrolisms', legal terms, proper names etc.)

Methods

1. 'South Tyrolisms': counting 'South Tyrolisms' (Abfalterer 2007) in the two corpora and extracting words with 'suspicious' frequencies ...
2. Tagger 'unknowns': filtering of the 'unknowns' in the Dolomiten corpus, yielding new special vocabulary 'candidates'
3. Continuous and discontinuous cooccurrences (Adj+N, Prep+N; Subj+Pred, Pred+Obj): extraction and comparison of cooccurrences in the two corpora ...

Examples:
weißer Stimmzettel: Dolo 81 vs. FR 2
allgemeine Klasse: Dolo 522 vs. FR 0
innerhalb : Dolo 420 vs. FR 0
...

Outlook

enhance the corpora to be compared and their annotation
develop more tools for the semi-automatic comparison of varieties on the basis of corpora
systematize exemplary findings on the South Tyrolean variety
investigate 'South Tyrolisms' and their collocators, phraseologisms
compare synthetical and analytical constructions
analyse 'cause' and 'origin' for certain phenomena (e.g. language contact, language variation over time)
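
The frequency comparison between the two corpora, as used for the 'South Tyrolisms' and the cooccurrences, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that raw counts have already been obtained (for instance from CQP queries) and normalises them by the corpus sizes given under Resources, since the two corpora differ in size. The names `per_million`, `compare` and `suspicious`, the smoothing constant and the ratio threshold are hypothetical choices made for illustration only; the two example counts are the complete ones quoted in the transcript.

```python
from dataclasses import dataclass

# Corpus sizes in tokens, as stated under Resources.
DOLO_TOKENS = 66_000_000  # 'Dolomiten' subcorpus (South Tyrol)
FR_TOKENS = 40_000_000    # 'Frankfurter Rundschau' (Germany)

# Raw counts quoted in the transcript: item -> (Dolo count, FR count).
EXAMPLE_COUNTS = {
    "weißer Stimmzettel": (81, 2),
    "allgemeine Klasse": (522, 0),
}


@dataclass
class Comparison:
    item: str
    dolo_per_million: float
    fr_per_million: float
    ratio: float  # smoothed Dolo/FR ratio of relative frequencies


def per_million(count: float, corpus_tokens: int) -> float:
    """Relative frequency: occurrences per one million tokens."""
    return count * 1_000_000 / corpus_tokens


def compare(item: str, dolo_count: int, fr_count: int,
            smoothing: float = 0.5) -> Comparison:
    """Compare one item across the two corpora.

    `smoothing` is a hypothetical add-k constant (not from the poster);
    it keeps the ratio finite when an item never occurs in one corpus.
    """
    ratio = (per_million(dolo_count + smoothing, DOLO_TOKENS)
             / per_million(fr_count + smoothing, FR_TOKENS))
    return Comparison(
        item,
        per_million(dolo_count, DOLO_TOKENS),
        per_million(fr_count, FR_TOKENS),
        ratio,
    )


def suspicious(counts: dict[str, tuple[int, int]],
               min_ratio: float = 10.0) -> list[Comparison]:
    """Return items far more frequent in Dolo than in FR, strongest first.

    `min_ratio` is an assumed threshold; a real study would tune it or
    use a statistical association measure instead.
    """
    results = [compare(item, d, f) for item, (d, f) in counts.items()]
    flagged = [r for r in results if r.ratio >= min_ratio]
    return sorted(flagged, key=lambda r: r.ratio, reverse=True)


if __name__ == "__main__":
    for c in suspicious(EXAMPLE_COUNTS):
        print(f"{c.item}: Dolo {c.dolo_per_million:.2f} vs. "
              f"FR {c.fr_per_million:.2f} per million (ratio {c.ratio:.1f})")
```

A plain ratio of relative frequencies is used here because it is easy to read. The flagged items are only candidates and, as described in the abstract, would still be checked manually before any dictionary entry is changed.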