Presentation is loading. Please wait.

Presentation is loading. Please wait.

Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Computational Approaches to Lexical Collocations Stefan Evert IMS Stuttgart Brigitte Krenn.

Similar presentations


Presentation on theme: "Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Computational Approaches to Lexical Collocations Stefan Evert IMS Stuttgart Brigitte Krenn."— Presentation transcript:

1 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Computational Approaches to Lexical Collocations Stefan Evert IMS Stuttgart Brigitte Krenn ÖFAI Vienna

2 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Collocations What kind of animal are they? What are they good for? How can we find them? How likely are we to find them?

3 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Schedule Monday: Introduction to Collocations Tuesday: Extraction of Coocurrence Data Wednesday: Association Measures (AMs) Thursday: Evaluation of AMs Friday: Significance of Result Differences

4 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Collocations Terminology & Definitions Firth's Notion of Collocation ``Meaning by collocation is an abstraction at the syntagmatic level and is not directly concerned with the conceptual or idea approach to the meaning of words.'' ``One of the meanings of night is its collocability with dark, and of dark, of course, its collocation with night.' (Firth 1957)

5 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Collocations Terminology & Definitions Choueka's Notion of Collocation "A collocation is defined as a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning cannot be derived directly from the meaning or connotation of its components." (Choueka 1988)

6 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Collocations Terminology & Definitions Summing up, we have 2 different views 1.Firth: collocations as lexical proximities in text 2.Choueka: collocations as syntactic and semantic units, semantic irregularity

7 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Terminology Idioms preferably used in the English literature, e.g. Bar-Hillel:55, Hockett:58, Katz;Postal:63, Healey:68,Makkai:72.

8 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Terminology Phraseological Units, (Ge.: Phraseologismen) a widely used generic term in the German literature, e.g. BurgerEA:82, Fleischer:82.

9 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Terminology Light-verb Constructions, Support-verb Constructions refer to very particular phenomena, cross-categorisation with idioms

10 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Terminology Terms stemming from Computational Linguistics, e.g. multi-word lexemes, e.g. Tschichold:97, BreidtEA:96. multi-word expressions, e.g. Segond;Tapanainen:95 non-compositional compounds, e.g. Melamed:97

11 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Terminology Collocation e.g, red herring, kick the bucket, dark-night, dog-bark Base, Collocate elements if a collocation, the base selects its collocates, it is not always clear to define what the base is, e.g. red-herring, kick-bucket

12 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Terminology Collocation Phrase grammatical unit which is constituted by a base and its collocates or by two or more collocates, e.g. a red herring, kick the bucket, unexpectedly kick the bucket

13 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Summing up Terminology and definitions are influenced by different linguistic traditions computational linguistic applications

14 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Summing up The phenomena covered are manifold: lexical proximities in texts syntactic and semantic units semantic irregularity syntactic rigidity

15 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Observations: Generativity Collocations range from completely fixed to syntactically flexible constructions. Syntactic restrictions usually coincide with semantic restrictions and thus are indicators for the degree of lexicalization of a particular word combination.

16 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Observations: Generativity Particular word combinations are associated with specific restrictions that cannot be inferred from standard rules of grammar and thus need to be stored together with the collocation.

17 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Observations: Recurrence Within corpora, the proportion of collocations is larger among highly recurrent word combination than among infrequent ones. Recurrence is an effect of lexicalization.

18 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Observations: Idiomaticity Semantic opacity is not sufficient for the definition of collocations as there exists a variety of conventionalized word combinations that range from fully compositional ones like Hut aufsetzen (`put on a hat'), Jacke anziehen (`put on a jacket') to semantically opaque ones like ins Gras beissen (`bite into the grass' literal meaning, `die' idiomatic meaning).

19 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Observations: Words, Multi-words or Phrases Collocations can be word level phenomena phrase level phenomena (collocation phrase) Collocation phrases consist of the lexically determined words (collocates) only or contain additional lexically underspecified material.

20 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Word-level Collocations Adjective- and Adverb-Like Collocations nichts desto trotz (`nonetheless') adverb fix und fertig (`exhausted') adjective Preposition-Like Collocations im Lauf(e), im Zuge (`during') an Hand (`with the help of')

21 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Word-level Collocations Noun-Like Collocations Rotes Kreuz (Red Cross) Wiener Sängerknaben (Vienna choir boys) Hinz und Kunz (`every Tom, Dick and Harry') Sequences where the nouns are duplicated Schulter an Schulter (shoulder to shoulder), Kopf an Kopf (neck and neck)

22 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Phrase-level Collocations Modal constructions sich (nicht) lumpen lassen (`to splash out') Verb-object combinations übers Ohr hauen (`take somebody for a ride') unter die Lupe nehmen (`take a close look at') zum Vorschein bringen (`bring something to the light') des Weges kommen (`to approach') Lügen strafen (`prove somebody a liar')

23 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Phrase-level Collocations Copula constructions guten Glaubens sein (`be in good faith') auf Draht sein (`be on the ball') Proverbs Morgenstund hat Gold im Mund (morning hour has gold in the mouth, the early bid catches the worm) wissen, wo der Barthel den Most holt (know where the Barthel the cider fetches, `know every trick in the book')

24 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Summing up Structural dependency the collocates of a collocation are syntactic dependents, thus knowledge of syntactic structure is a precondition for accurate collocation identification. Syntactic context may help to discriminate literal and collocational readings, see for instance im Lauf, im Zug where a genitive to the right is a strong indicator for collocational reading.

25 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Summing up Markedness morphologically or syntactically marked constructions like seemingly incomplete syntactic structure or archaic e-suffix are suitable indicators for collocations, see im Laufe, im Zuge for e-suffix and zu Recht, an Hand for incomplete syntactic structures.

26 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Summing up Single-word versus multi-word units single-word occurrences of word combinations indicate word-level collocations, see for instance zu Recht, zurecht. Syntactic rigidity is an important indicator for collocations see for instance Hinz und Kunz, an und für sich, fix und fertig, Kopf an Kopf.

27 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Identification of Collocations: Operationalizable Criteria over proportionally high recurrence of collocational word combinations compared to noncollocational word combinations in corpora lexical determination of the collocates of a collocation collocations constitute grammatical units grammatical restrictions in the collocation phrases

28 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Positional N-Grams numerical span typically AMs work on bi-grams, i.e., all pairs within a certain span are considered pairs are not considered! e.g.: w i-3... w i-1 w i w i+1... w i+3 (span size is 6)

29 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Positional N-Grams grammatical span e.g. we assume that collocational relations do not exceed sentence boundaries ð i.e., span size = sentence size

30 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Positional N-Grams: Problems huge amounts of cooccurrence data need to be managed ðthis requires special algorithms (e.g. Yamamoto,Church 2001)

31 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Positional N-Grams: Problems definition of span size is crucial If the span size is kept small, it is unlikely to properly cover nonadjacent collocates of structurally flexible collocations. Enlarging the span size leads to an increase of candidate collocations including an increase of noisy data which need to be discarded in a further processing step.

32 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Positional N-Grams: Problems High amount of noise in the collocation data due to inappropriate span size

33 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Positional N-Grams: Problems High amount of noise in the collocation data due to over-proportional frequency of function words within texts ðuse stop word lists

34 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Positional N-Grams: Problems High amount of noise in the collocation data due to insensitivity to punctuation ðuse a sentence as the largest unit within which the collocates of a collocation may occur

35 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Positional N-Grams: Problems High amount of noise in the collocation data due to insensitivity to parts-of-speech ðknowing parts-of-speech allows a large number of syntactically invalid n-grams to be excluded beforehand

36 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Positional N-Grams: Problems High amount of noise in the collocation data due to insensitivity to syntactic structure ðfurther improvement of the appropriateness of the collocation candidates selected is achieved by the availability of structural and/or dependency information

37 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Proposal (Partial) replacement of positional n-grams by relational n-grams What does it imply? In which cases do we want/need it?

38 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS From Positional to Relational N-Grams positional with part-of-speech, eg. (Evert,Kermes 2003) pairs within certain span

39 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS From Positional to Relational N-Grams positional with part-of-speech and lexical knowledge, e.g. Breidt 93 sentence final pairs past participles must be part of a list of 16 German support verbs e.g. kommen, bringen, stehen, stellen,...

40 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS From Positional to Relational N-Grams partially relational, e.g. Krenn 2000 extraction of PN-combinations from PPs extraction of main verbs combination of PN-pairs and verbs co-occurring in a sentence

41 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS From Positional to Relational N-Grams Result a theoretical maximum of PNV combinations, i.e., verbs are duplicated in sentences that contain more than one PP, PPs are duplicated in sentences where more than one main verb is found. ðThis has effects on counting!

42 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Relational N-Grams Assumption The collocates of a collocation are syntactically related!

43 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Relational N-Grams Requirements part-of-speech tagging partial parsing (phrase chunking) full parsing Evert,Kermes 2003

44 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Relational N-Grams Problems full parsing of free text is error prone there are cases where collocation extraction from fully parsed text is less accurate

45 Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Summary Use application dependent notion of collocation as opposed to Firth/Choueka Extract collocations from relational data (relational n-grams) Interested in collocation phrases and their collocates Consider grammatically homogenous data (e.g. Adj-N, PP-V) Recurrence/cooccurrence frequency is main criterion for collocation extraction (statistical approaches)


Download ppt "Stefan Evert, IMS - Uni Stuttgart Brigitte Krenn, ÖFAI Wien IMS Computational Approaches to Lexical Collocations Stefan Evert IMS Stuttgart Brigitte Krenn."

Similar presentations


Ads by Google