Corpora built for linguistic varieties of a pluricentric language such as German are an indispensable resource for a detailed and systematic variety comparison.


Approaches to Computational Lexicography for German Varieties *
Andrea Abel, Stefanie Anstein - {aabel|sanstein}@eurac.edu - LCT day FUB - May 15th, 2008
* Paper to be presented at Euralex 2008

Corpora built for linguistic varieties of a pluricentric language such as German are an indispensable resource for a detailed and systematic variety comparison and dictionary development. We present desiderata and suggestions as well as methods from computational linguistics to systematically apply variety corpora for the enrichment, i.e. confirmation, extension and generation, of lexical entries in distinctive variant dictionaries for German. Examples are the variant dictionaries developed by Ammon et al. (2004) and Abfalterer (2007); we focus on South Tyrolean German.

On the one hand, we conducted a systematic frequency analysis in newspaper variety corpora for approved lists of South Tyrolean special vocabulary in order to refine, where possible, the corresponding dictionary entries with corpus evidence. On the other hand, we filtered the list of words in our South Tyrolean corpus that could not be lemmatised by a tool developed for the variety in Germany. After removing special vocabulary collected for the South Tyrolean variety in other projects (e.g. legal terms), the remaining list was manually checked for possible new variant dictionary entries. As an innovative approach in variety corpus lexicography, this also automatically filters a huge amount of data so that only the relevant data has to be investigated in detail. In addition, we semi-automatically extracted lexical cooccurrences from our two newspaper corpora and compared their frequencies, on the assumption that cooccurrences with high frequency in the South Tyrolean corpus and very low frequency in the corpus from Germany are worth investigating more closely.

With these three methods we were able not only to refine dictionary entries for South Tyrolean German, but also to add new ones. The findings on variants can be re-used for further corpus annotation, which in turn yields better resources for computational variant lexicography of the kind described. The approach is also to be extended to more complex levels of linguistic description.

Related Work

Variety Corpora
- German: DWDS-Korpus (DE), Austrian Academy Corpus (AT), Schweizer Text Korpus (CH), Korpus Südtirol (IT) → ‘C4’ platform
- English: International Corpus of English (ICE), London-Lund Corpus, ICAME etc.
- French: Trésor de la Langue Française Informatisé (au Quebec) etc.
- Spanish: Corpus del Español etc.
- ...

German variant dictionaries
- Ammon, U. et al. (2004): Variantenwörterbuch des Deutschen. Die Standardsprache in Österreich, der Schweiz und Deutschland sowie in Liechtenstein, Luxemburg, Ostbelgien und Südtirol.
- Abfalterer, H. (2007): Der Südtiroler Sonderwortschatz aus plurizentrischer Sicht. Lexikalisch-semantische Besonderheiten im Standarddeutsch Südtirols.

German variety in South Tyrol
- Studies: on language contact phenomena and particularities on the lexical and partly morpho-syntactic level (e.g. Rizzo-Bauer 1962, Riedmann 1972, Pernstich 1984, Forer/Moser 1988, Lanthaler 1995, Ammon et al. 2004, Abfalterer 2007); hardly any on the syntagmatic level (e.g. collocations, idioms), the textual level (e.g. Riehl 1997) or on translated texts (e.g. Putzer 1984)
- Interpretation of language contact phenomena: shift from research based on criticism of contact phenomena as impairment of language (e.g. Riedmann 1972) → description of “special vocabularies” (e.g. Ammon et al. 2004, Abfalterer 2007); on the purely lexical level: fewer particularities than assumed (see e.g. Ammon 2001)
- Methods: manual examination and excerption of references (e.g. Riedmann 1972, Riehl 1997); consultation of informants, relevant literature and dictionaries (e.g. Abfalterer 2007); Internet as resource for additional evidence (e.g. Abfalterer 2007, Bickel 2000); now: corpus linguistics (Korpus Südtirol, ‘C4’ initiative)

Requirements

Desiderata for corpus lexicography
- content (confirmation and enrichment of existing data, addition of new data) and data modelling (e.g. special notes, frequency labels)
- methods for data acquisition (improvement and refinement of existing tools as well as development of new specific tools)
- data presentation (e.g. online dictionaries with direct links to corpus data)

Research requirements on South Tyrolean German
- large-scale investigations on a lexical, syntagmatic and textual level
- intralinguistic comparison to other German varieties
- use of state-of-the-art corpus linguistic methods and technologies

Resources
- Korpus Südtirol (FUB, Eurac, UIBK) → Subcorpus ‘Dolomiten’ (IT), 66 mio tokens (‘Dolo’)
- Corpus ‘Frankfurter Rundschau’ (D), 40 mio tokens (‘FR’)
- Dolo and FR: tokenised, PoS-tagged, lemmatised, chunked; queried with CQP
- data from project ‘Datenbank zum Südtiroler Deutsch’ IBK
- lists of special vocabulary (‘South Tyrolisms’, legal terms, proper names etc.)

Methods
1. ‘South Tyrolisms’: counting ‘South Tyrolisms’ (Abfalterer 2007) in the two corpora and extracting words with ‘suspicious’ frequencies...
2. Tagger ‘unknowns’: filtering of the ‘unknowns’ in the Dolomiten corpus, yielding new special vocabulary ‘candidates’
3. Continuous and discontinuous cooccurrences (Adj+N, Prep+N; Subj+Pred, Pred+Obj): extraction and comparison of cooccurrences in the two corpora...
   weißer Stimmzettel: Dolo 81 vs. FR 2
   allgemeine Klasse: Dolo 522 vs. FR 0
   innerhalb : Dolo 420 vs. FR 0...

Outlook
- enhance corpora to be compared and their annotation
- develop more tools for the semi-automatic comparison of varieties on the basis of corpora
- systematize exemplary findings on South Tyrolean variety
- investigate ‘South Tyrolisms’ and their collocators, phraseologisms
- compare synthetic and analytic constructions
- analyse ‘cause’ and ‘origin’ for certain phenomena (e.g. language contact, language variation over time)
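
The frequency comparison behind the ‘South Tyrolisms’ counts and the cooccurrence comparison could be sketched roughly as follows. The corpus sizes and the two example counts are taken from the transcript; everything else is an assumption made for illustration, in particular the per-million normalisation, the add-0.5 smoothing, the `min_ratio` threshold and all function names. The transcript does not say which measure the authors actually used.

```python
from dataclasses import dataclass

# Corpus sizes as reported in the transcript.
DOLO_TOKENS = 66_000_000  # 'Dolomiten' subcorpus (South Tyrol)
FR_TOKENS = 40_000_000    # 'Frankfurter Rundschau' corpus (Germany)


@dataclass
class Comparison:
    item: str
    dolo_pmw: float  # occurrences per million tokens in Dolomiten
    fr_pmw: float    # occurrences per million tokens in Frankfurter Rundschau
    ratio: float     # smoothed Dolo/FR ratio (illustrative measure, an assumption)


def per_million(count: float, corpus_size: int) -> float:
    """Normalise a raw count so corpora of different size are comparable."""
    return count * 1_000_000 / corpus_size


def compare(item: str, dolo_count: int, fr_count: int, smoothing: float = 0.5) -> Comparison:
    # Smoothing keeps the ratio finite when an item never occurs in FR.
    ratio = (per_million(dolo_count + smoothing, DOLO_TOKENS)
             / per_million(fr_count + smoothing, FR_TOKENS))
    return Comparison(item,
                      per_million(dolo_count, DOLO_TOKENS),
                      per_million(fr_count, FR_TOKENS),
                      ratio)


def candidates(counts: dict, min_ratio: float = 10.0) -> list:
    """Return items that are much more frequent in Dolomiten, strongest first."""
    results = [compare(item, dolo, fr) for item, (dolo, fr) in counts.items()]
    flagged = [r for r in results if r.ratio >= min_ratio]
    return sorted(flagged, key=lambda r: r.ratio, reverse=True)


# Raw counts (Dolo, FR) from the transcript.
counts = {
    "weißer Stimmzettel": (81, 2),
    "allgemeine Klasse": (522, 0),
}

for r in candidates(counts):
    print(f"{r.item}: {r.dolo_pmw:.2f} vs. {r.fr_pmw:.2f} per million, ratio {r.ratio:.1f}")
```

Normalising before comparing matters here because the two corpora differ in size (66 vs. 40 million tokens), so raw counts alone would overstate the Dolomiten side. The flagged items are only candidates; as described in the abstract, they would still need manual checking.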

