Corpus Linguistics with and for Computational Linguistics (Korpuslinguistik mit und für Computerlinguistik)

Martin Volk
Universität Zürich
Eurospider Information Technology AG

Sources for linguistic information
- Introspection (own usage and judgement)
- Usage and judgement by others
  - Questioning (goal-driven)
    - interview
    - questionnaire
  - Observation ('involuntary' utterances)
    - spoken utterances (→ corpora)
    - written utterances (→ corpora)
    - observation: first-language acquisition (cursory observation)
    - experimental observation: e.g. eye movement tests
Martin Volk 28 November 2018

What is a corpus?
- a text collection
- a representative text collection
- a representative and structured text collection
- a representative, structured and annotated text collection
- ...

Example
Is 'ob' used as a preposition in German?
- Introspection: Rothenburg ob der Tauber
- Dictionary (Wahrig. Deutsches Wörterbuch. 1996): Präp. mit Dativ; veraltet (preposition with dative; archaic); ob dem Wasserfall
- Web: Google 'ob dem'
  - Sage (legend): Der Wilde Jäger ob dem Neuenburgersee
- Corpus
  - Check 'ob' in IdS-Corpus!!
  - Program for CZ corpus: perl ../find_special_sent.perl cz94*snv

Corpus Examples
- CZ94: ... fiel schier vom Stuhl ob der Äusserung eines Ozeanologen ...
- CZ94: Bei manchem Ölgiganten kam ob der Ergebnisse gar Euphorie auf.
- CZ94: ... rieben sich vergnügt die Hände ob des zu erwartenden Schlagabtauschs.
- ob is a preposition with genitive!!
- in the CZ corpus, 'ob' is tagged as a preposition 21 times (some of these tags are obviously incorrect)
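A corpus check of this kind can be approximated with a simple pattern search. The following is only a minimal sketch in Python; it is not the find_special_sent.perl script mentioned earlier, whose behaviour is not documented here. The function name, the list of article forms, and the fourth sample sentence are assumptions made for illustration; the first three sentences are the CZ94 examples.

```python
import re

# Assumed article forms that may follow the preposition 'ob'
# (dative or genitive); this list is illustrative, not exhaustive.
ARTICLES = ("der", "des", "dem")

# 'ob' as a whole word, directly followed by one of the article forms.
OB_PREP = re.compile(r"\bob\s+(" + "|".join(ARTICLES) + r")\b", re.IGNORECASE)


def find_ob_candidates(sentences):
    """Return (sentence, article) pairs where 'ob' directly precedes an article."""
    hits = []
    for sentence in sentences:
        match = OB_PREP.search(sentence)
        if match:
            hits.append((sentence, match.group(1).lower()))
    return hits


sample = [
    "... fiel schier vom Stuhl ob der Äusserung eines Ozeanologen ...",
    "Bei manchem Ölgiganten kam ob der Ergebnisse gar Euphorie auf.",
    "... rieben sich vergnügt die Hände ob des zu erwartenden Schlagabtauschs.",
    "Er fragte, ob sie kommt.",  # invented: 'ob' as a conjunction, no article follows
]

for sentence, article in find_ob_candidates(sample):
    print(article, "|", sentence)
```

The pattern only yields candidates: a clause such as "ob der Mann kommt" would also match, so the hits still need manual inspection or part-of-speech tags.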

History of Corpus Linguistics
- collections of text were widely used in the 19th century and in the first half of the 20th century
  - language acquisition
  - orthography (letter frequency)
  - field linguistics → American Structuralism (influential until 1960)

Chomsky's criticism:
- Speakers produce and understand infinitely many new sentences/words.
- Therefore the new research goal is to describe the underlying language faculty of a speaker (= universal grammar): competence rather than performance.

Chomsky's criticism:
- Every collection of texts is a collection of performance data, and so many factors contribute to it that it cannot be used to model competence.
- A corpus is necessarily skewed. Some sentences won't occur because they are obvious, false or impolite.

theoretical linguistics
- competence (what is grammatical?)
- introspection
- indefinitely many types, productivity
- grammatical vs. ungrammatical
corpus linguistics
- performance (what is attested?)
- instances
- finite number of types
- degrees of grammaticality

Corpus research in Linguistics
- Lexicography (Dictionaries)
- Grammaticography (Reference grammars)
- Learner corpora: Language acquisition
- Parallel corpora: Translation

Construction of Corpora
Written text is easier to obtain than spoken text. Some examples:
- Newspapers
- Fiction (e.g. fairy tales)
- Technical Literature (e.g. manuals, medicine)
- Personal letters
- Advertising (incl. political propaganda)
- Belief and Thought (e.g. bible)

Corpora of spoken language
- Spontaneous spoken language
  - recording of dialogues (e.g. telephone conversation)
- Prepared spoken language
  - Public speeches (e.g. in parliament)
  - Radio or TV news
- Spoken utterances must be transcribed for linguistic research.

Size of corpora
- Brown Corpus for English (1964, 1 million words)
- LIMAS-Corpus for German (1970, 1 million words)
- British National Corpus (1995, 100 million words)
- Cosmas corpus (2002, > 100 million words)

Brown Corpus (1964)
- 500 texts from 15 different text types, with 2000 words each

British National Corpus
- 90% written English, 10% spoken English
- 3209 texts from 10 different written text types and 6 spoken text types
- with < 40,000 words each
- → multi-purpose corpus

Other considerations
- Time frame of the corpus
- Native and non-native speakers
- Sociolinguistic variables
  - Gender
  - Age
  - Education
  - Dialect
  - Social context and relationships

Types of corpora
- Raw texts
- Automatically annotated corpora
  - Texts with Part-of-Speech tags
  - Partially parsed texts
- Manually annotated corpora
  - Treebank
  - FrameNet

Types of Corpora
- Balanced Corpora vs. special corpora
- Spoken vs. written language
- Monolingual vs. Multilingual Corpora
- Parallel vs. comparable corpora

Corpora in Computational Linguistics
- Linguistic Facts: 'Peter Smith' → 'Smith' is a last name
- Linguistic Rules/Heuristics: if a word can be a determiner or a pronoun and the following word is a noun, then it must be a determiner.
- Linguistic Preferences: if an English prepositional phrase starts with 'of', then it attaches to the preceding noun with 90% probability.
[Diagram labels: annotation; learning; Facts, Rules, Preferences]
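The determiner/pronoun heuristic can be written down as a small disambiguation step over tokens that carry sets of possible tags. This is a minimal sketch under stated assumptions: the tag names DET, PRON, NOUN and VERB, the function name, and the reading of "the following word is a noun" as "the next token is unambiguously a noun" are all choices made here for illustration.

```python
def apply_determiner_rule(tokens):
    """Apply the determiner/pronoun heuristic.

    tokens: list of (word, set_of_possible_tags) pairs.
    Returns a new list in which a token that could be DET or PRON is
    reduced to DET when the next token is unambiguously a NOUN.
    """
    result = []
    for i, (word, tags) in enumerate(tokens):
        # Tags of the following token; empty set at the end of the sentence.
        next_tags = tokens[i + 1][1] if i + 1 < len(tokens) else set()
        if {"DET", "PRON"} <= tags and next_tags == {"NOUN"}:
            tags = {"DET"}
        result.append((word, set(tags)))
    return result


sentence = [("that", {"DET", "PRON"}), ("house", {"NOUN"}), ("is", {"VERB"})]
print(apply_determiner_rule(sentence))
```

A token that stays ambiguous after the rule (for example 'that' at the end of a sentence) is passed through unchanged, so further rules or preferences can still decide.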

My Motivation for Corpus Linguistics
- Attempt to build a parser for German
- But: problems with ambiguities!!
- Therefore: Learn attachment preferences from a corpus!
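Learning attachment preferences can be sketched as relative-frequency estimation over manually disambiguated examples. The code below is an illustration only, not the method actually used for the German parser: the function name, the labels 'noun' and 'verb', and the toy counts are invented for the example.

```python
from collections import Counter, defaultdict


def learn_attachment_preferences(examples):
    """Estimate attachment preferences per preposition.

    examples: iterable of (preposition, attachment) pairs, where attachment
    is 'noun' or 'verb' as decided by a human annotator.
    Returns {preposition: {attachment: relative frequency}}.
    """
    counts = defaultdict(Counter)
    for prep, attachment in examples:
        counts[prep.lower()][attachment] += 1

    preferences = {}
    for prep, counter in counts.items():
        total = sum(counter.values())
        preferences[prep] = {site: n / total for site, n in counter.items()}
    return preferences


# Invented toy annotations, not real corpus counts.
toy_examples = (
    [("of", "noun")] * 9 + [("of", "verb")]
    + [("with", "verb")] * 3 + [("with", "noun")]
)
print(learn_attachment_preferences(toy_examples))
```

With realistic data the estimates would need smoothing and would usually also condition on the verb and the nouns involved, not on the preposition alone.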

Corpora vs. Test suites
A test suite
- is a collection of manually constructed and selected sentences.
- is used for testing computational grammars and parsers.
- reduces the amount of testing.
- points to specific problems of the NLP system.

Basic problems in CL
- Knowledge is missing (too little information)
  - e.g. unknown words
- Ambiguities (too much information)
  - e.g. in syntax: attachment preferences

Corpora in Computational Linguistics
Widespread use of (manually) annotated material for measuring progress!
Some examples from COLING 2002:
- Treebanks to train and test probabilistic grammars
- Enriching treebanks with dependency information
- Automatic error detection in PoS-Tagged Corpora
- SENSEVAL data to train and test word sense disambiguation programs

Possible Student Tasks
- Which German prepositions take a noun without a determiner? (e.g. pro, via)
- When is mit used as an adverb? (e.g. )
- What is the distribution of separable verb prefixes in German?
- How often are relative clauses introduced with welche(r)?
- How often are present participle forms used in German?
- What kind of foreign language material is in the corpus?
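A first pass at the welche(r) question could count forms of 'welch-' that directly follow a comma in raw text. This is a rough heuristic sketch, not a definition of a relative clause; the function name and the sample sentences are invented for illustration.

```python
import re
from collections import Counter

# A form of 'welch-' (welche, welcher, welches, welchen, welchem)
# directly after a comma is taken as a relative-clause candidate.
WELCH_AFTER_COMMA = re.compile(r",\s+(welche[rsnm]?)\b", re.IGNORECASE)


def count_welch_candidates(text):
    """Count comma-introduced 'welch-' forms, keyed by lowercased word form."""
    return Counter(m.group(1).lower() for m in WELCH_AFTER_COMMA.finditer(text))


# Invented sample text, not taken from any of the corpora.
text = (
    "Die Firma, welche den Rechner baut, wächst. "
    "Der Mann, welcher dort steht, lacht. "
    "Welche Firma wächst?"
)
print(count_welch_candidates(text))
```

The heuristic misses relative clauses introduced by a preposition (", mit welcher ...") and overcounts indirect questions (", welche Firma ..."), so the counts are an upper or lower bound depending on the text and should be checked against tagged or parsed data.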

Possible Student Tasks
- Create a small parallel corpus (e.g. with various versions of 'Alice in Wonderland' or National Geographic).
- Create a small corpus of spoken language (e.g. by transcribing one episode of 'Big Brother').
- Create a small treebank with the ANNOTATE tool.

What corpora do we have for German?
Raw text
- ComputerZeitung (about 1.3 million words per year)
- ComputerZeitung
- iX
- Tages-Anzeiger 2000

Information in TagesAnzeiger
- Date
- Category (Sport, Politics, Culture, Economics etc.)
- Author
- Title vs. Text

What corpora do we have for German?
- Syntactically Annotated Text (Treebanks)
  - NEGRA treebank (20,000 sentences)
  - ComputerZeitung treebank (3,000 sentences)
- Text with manually corrected PoS tags
  - 50,000 sentences from University speeches
  - others

The goal
- If you can walk, you can dance.
- If you can talk, you can sing.
- If you can parse, you can understand.
(Hans Uszkoreit, COLING 2002)

Acknowledgement
Some slides were highly influenced by or even copied from Anke Lüdeling's course "Introduction to Corpus Linguistics" at
