Co-occurrence and collocation


1 Co-occurrence and collocation

2 1. Definitions Firth (1951), Modes of Meaning: « Words shall be known by the company they keep ». Collocation designates both a process or state (the act of collocation or the state of being collocated) and the result of that process (an arrangement or juxtaposition, especially of linguistic elements such as words).

3 Crystal (1991), A Dictionary of Linguistics and Phonetics: « a habitual co-occurrence of individual lexical items ». Co-occurrence may be fortuitous, whereas collocation reflects collective usage. Collocation is a type of lexical constraint: « the language user has available to him or her a large number of semi-preconstructed phrases that constitute single choices » (Sinclair 1991).

4 A collocation is an arbitrary and recurrent word combination (Smadja, 1990)
A contrastive view of collocation, as expressed by F.J. Hausmann (1990): « […] the idiosyncrasy of a collocation is only definitively revealed from the standpoint of another language which, to express the same fact, combines different words » (idiosyncrasy: a structural or behavioral characteristic peculiar to an individual or a group).

5 2. Types of collocations Halliday (1966: 151, 157) argues that the collocational patterns of lexical items can lead to generalizations at the lexical level. If certain items belong to the same set, then they can be regarded as “a single lexical item”:

6 A strong argument, he argued strongly, the strength of his argument and his argument was strengthened [can all be regarded] as instances of one and the same syntagmatic relation. What is abstracted is an item strong, having the scatter strong, strongly, strength, strengthened, which collocates with argue (argument).

7 Sinclair (1991) proposes two principles:
The grammatical level is represented by the “open-choice principle”, which sees “language text as the result of a very large number of complex choices ... the only restraint [being] grammaticalness” (cf. Colorless green ideas sleep furiously). The “idiom principle” represents the lexical level and accounts for “the restraints that are not captured by the open-choice model”.

8 Three factors that determine the categorization of a lexical combination
- the degree of probability that the items will co-occur
- the degree of fixity of the combination (i.e. grammatical restrictions)
- the degree to which the meaning of the combination can be derived from the meaning of its constituent parts

9 The terms idiom and collocation (as well as their shadings) are used by different linguists with different definitions. Most linguists would agree that kick the bucket is an idiom, whereas [make / reach / take] a decision are collocations.

10 An example : the verb carry in OH
[person, animal] porter [bag, shopping, load, news, message]
[vehicle, pipe, wire, vein] transporter; [wind, tide, current, stream] emporter
comporter [warning, guarantee, review, report]
supporter [weight, load, traffic]
l'emporter dans [state, region, constituency]; remporter [battle, match]
the motion was carried by 20 votes to 13: la motion l'a emporté par 20 votes contre 13
Idioms: to be carried away by sth: être emballé[!] par qch; to get carried away[!]: s'emballer[!], se laisser emporter.

11 Woods, E. & McLeod, N. (1990) Using English Grammar, Prentice Hall.
Woods & McLeod suggest the following continuum (from most to least predictable/fixed):
- Idioms (do not allow for substitution of their elements, nor for grammatical or syntactic alterations)
- Collocations (roughly predictable word combinations with some restrictions)
- Colligations (generalisable classes of collocations, for which at least one construct is specified by category rather than as a distinct lexical item)
- Free combinations (compositional and productive)

12 Oxford Dictionary of Current Idiomatic English, Vol. 2 - English Idioms (1983), Oxford University Press
Presents a continuum from idiom to non-idiom. Distinguishes between pure idioms (totally fixed) and figurative idioms (allowing for some variation):
- blow a fuse, as a figurative idiom, can only be used in the active form.
- the idiomatic sense of blow one's own [horn / trumpet] is not activated in the absence of own.
Collocations (non-idioms) are divided between restricted (or semi-idioms) and open.

13 Restricted collocations “allow a degree of lexical variation” (one element has a figurative sense not found outside that limited context, whereas the other appears in a familiar, literal sense; cf. carry a motion). In open collocations, elements are freely combinable and are used in a common literal sense.

14 Functional expressions
(1) More haste less speed. (proverb)
(2) Unaccustomed as I am to public speaking .. (speech formula)
(3) You name it, we've got it. (slogan)
(4) When in Rome. (abbreviated proverb)
Composite units
(5) blow a trumpet (open collocation)
(6) blow a fuse (restricted collocation)
(7) blow your own trumpet (figurative idiom)
(8) blow the gaff (pure idiom) = vendre la mèche

15 Cruse, D. A. (1986) Lexical Semantics, Cambridge University Press.
Cruse distinguishes between idioms (“lexically complex” units, constituting a “single minimal semantic constituent”, e.g. kick the bucket) and collocations (“sequences of lexical items which habitually co-occur”, each lexical item being a “semantic constituent”). He also introduces bound collocations (expressions “whose constituents do not like to be separated”) as a “transitional area bordering on idiom”.

16 Benson, M., Benson, E. & Ilson, R. (1986) Lexicographic Description of English (Studies in Language Companion Series, No 14), John Benjamins Publishing Company.
Benson, M., Benson, E. & Ilson, R. (1986) The BBI Combinatory Dictionary of English, John Benjamins Publishing Company.

17 Benson, Benson & Ilson distinguish between grammatical and lexical collocations.
Grammatical collocations have a node followed by a subordinate unit (which is often a preposition): refer to, reliance on, proud of. In lexical collocations, both components have equal lexical status (ADJ-N, VB-N, ADV-ADJ, VB-ADV).

18 Sinclair, J. McH. (1991) Corpus, Concordance, Collocation, Oxford University Press.
Sinclair defines two types of collocation (upward / downward), depending on the relative frequency of the two words, considered in the order in which they occur. « give sb an edge » is a downward collocation, because « give » is more frequently used than « edge ».
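The upward / downward distinction can be sketched as a simple frequency comparison. This is only an illustration: the function name, the "neutral" label for equal frequencies and the frequency figures are assumptions, not values or terms taken from Sinclair.

```python
def collocation_direction(node_freq: int, collocate_freq: int) -> str:
    """Classify a node/collocate pair by relative corpus frequency.

    'downward': the collocate is less frequent than the node.
    'upward':   the collocate is more frequent than the node.
    'neutral':  equal frequencies (a label chosen for this sketch).
    """
    if collocate_freq < node_freq:
        return "downward"
    if collocate_freq > node_freq:
        return "upward"
    return "neutral"


# Hypothetical corpus frequencies, for illustration only.
give_freq, edge_freq = 120_000, 4_000
direction = collocation_direction(node_freq=give_freq, collocate_freq=edge_freq)
print(direction)
```

With « give » as the node and « edge » as a rarer collocate, the pair is classified as downward, which matches the slide's example.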

19 Clas (1994) : « Collocations et langues de spécialité » in Meta, XXXIX, 4.
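- V+N: prononcer un discours 'deliver a speech' (support verb)
- N+ADJ: rude épreuve 'severe ordeal', marque distinctive 'distinguishing mark'
- ADV+ADJ: grièvement blessé 'seriously injured'
- VB+ADV: recommander chaudement 'recommend warmly'
- N+V: la cloche sonne 'the bell rings', le chat miaule 'the cat miaows'
- Quantity marking of the noun: un troupeau de vaches 'a herd of cows', une pincée de sel 'a pinch of salt'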

20 Critique: the first category is too restrictive: set a record would count as a collocation, but not [beat / break / hold] a record. The last two categories allow almost no variation.

21 Example of lexicalisation choices (
Out of 55 nouns that co-occur with « emergency » at least 10 times in the BNC, only 14 can be found in both OH and RC: brake, measure, operation, repair as collocations; case, center, exit, landing, powers, ration, room, service, services, ward as compound nouns.

22 What makes a collocation worth learning for an EFL learner?
Collocations that involve a verb and its typical object (drive a car, read a book) can usually be inferred. Some verbs generate an unlimited number of collocates (buy a car, buy a book…). What makes a collocation worth memorizing is the fact that the verb takes on another meaning (buy a story, buy time).

23 What makes a collocation remarkable (salient) is the fact that one of its components has few collocates (cf. the Tact Z-score formula). Consequently, it makes more sense for an EFL learner to learn « downward collocations » grouped under the collocate rather than the node (record / beat, break, hold, set).

24 Mel’čuk’s lexical functions
Lexical functions are the main principle underlying Mel’čuk’s Meaning-Text Theory. They are meant to describe « institutionalized » lexical relations. Wanner (1996) gives examples of such relations: aircraft and crew, sheep and flock, bachelor and confirmed, mountain and peak, influence and exert, attention and pay.

25 The list includes both syntagmatically and paradigmatically related pairs of words.
Mel’čuk admits that even though all LF-covered phrases are collocations, his model does not cover some collocations when the logical relation between their components cannot be readily inferred (as with assurance maladie and assurance vie).

26 LFs only cover bigrams. Their aim is to cover syntagmatic and paradigmatic relations between words within a formalized notation system. The concept is meant to be applied to a wide variety of languages. Standard LFs include 36 syntagmatic LFs that belong to 4 distinct categories: nominal, adjectival/adverbial, prepositional and verbal.

27 Nominal LFs
28. Centr [Lat. centrum – ‘the center / culmination of’] Centr(crisis)=the peak Centr(desert)=the heart

28 Adjectival/Adverbial LFs
29. Magn [Lat. magnus – ‘big, great’] Magn(naked)=stark Magn(thin)=as a rake
33. Bon [Lat. bonus – ‘good’] Bon(aid)=valuable Bon(proposal)=tempting

29 Prepositional LFs
35. Locin [being in ‘place’] Locin(height)=at [a height of…]
36. Locab [moving away from ‘place’] Locab(height)=from [a height of…]
37. Locad [moving into ‘place’] Locad(height)=to [a height of…]

30 Verbal LFs
59. Degrad [Lat. degradare – ‘lower, degrade’] Degrad(clothes)=wear off Degrad(house)=become dilapidated Degrad(temper)=fray
60. Son [Lat. sonare – ‘sound’] Son(dog)=bark Son(waterfall)=roar
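The LF(keyword)=value notation used on these slides behaves like a lookup from a function name and a keyword to a collocate. A minimal sketch of that idea follows; the dictionary layout and the helper name `apply_lf` are assumptions made for illustration and are not part of Mel’čuk's formalism. The entries are only the examples quoted above.

```python
from typing import Optional

# Each lexical function maps a keyword to the value quoted on the slides.
LEXICAL_FUNCTIONS = {
    "Centr": {"crisis": "the peak", "desert": "the heart"},
    "Magn": {"naked": "stark", "thin": "as a rake"},
    "Bon": {"aid": "valuable", "proposal": "tempting"},
    "Locin": {"height": "at"},
    "Locab": {"height": "from"},
    "Locad": {"height": "to"},
    "Degrad": {
        "clothes": "wear off",
        "house": "become dilapidated",
        "temper": "fray",
    },
    "Son": {"dog": "bark", "waterfall": "roar"},
}


def apply_lf(function: str, keyword: str) -> Optional[str]:
    """Return the value of LF(keyword), or None if the pair is not listed."""
    return LEXICAL_FUNCTIONS.get(function, {}).get(keyword)


print(apply_lf("Magn", "naked"))   # stark
print(apply_lf("Son", "dog"))      # bark
print(apply_lf("Magn", "dog"))     # None: pair not in this small table
```

A flat table like this makes the restriction visible: a value exists only for pairs that have been explicitly recorded, which is why collocations have to be listed rather than computed.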

31 Dirk Siepmann: Collocation, Colligation and Encoding Dictionaries. Part I: Lexicological Aspects, IJL (4)
Linguistic ‘intertextuality’: the meaning of one text and its constituent elements depends on millions of other texts using similar or identical elements. Textual meaning is thus created by the interplay of two types of repetition: (a) collocation (in the largest possible sense, including colligation and phraseology); (b) cohesion.

32 The subject of collocation has been approached from two main angles:
- the semantically-based approaches (Benson 1986, Mel’čuk 1998, Hausmann 2003), which assume a particular meaning relationship between the constituents of a collocation
- the frequency-oriented approach (Sinclair 1991)

33 A few of Siepmann’s opinions
Only the frequency-based approach can provide a heuristic for discovering the entire class of co-occurrences; in a way, it is safe from refutation, but empty. By contrast, the semantically-based approach is fragmentary – it cannot account for all possible cases.
Heuristic. Etymology: German heuristisch, from New Latin heuristicus, from Greek heuriskein ‘to discover’; akin to Old Irish fo-fúair ‘he found’. Date: 1821. Involving or serving as an aid to learning, discovery, or problem-solving by experimental and especially trial-and-error methods (heuristic techniques, a heuristic assumption); also: of or relating to exploratory problem-solving techniques that utilize self-educating techniques (as the evaluation of feedback) to improve performance (a heuristic computer program).

34 A purely pragmatic approach relying on the extralinguistic context cannot explain a large number of co-occurrences operating at the level of semantic features. What is needed is an extension of the semantically-based approach that will take account of strings of regular syntactic composition which form a sense unit with a relatively stable meaning.

35 ‘Lexical bundles’ (Biber et al. 1999) such as je sais que c’est or it's been will not be included among the class of collocations. Although such sequences may perform similar or identical functions across a range of texts, they have no meaning ‘by themselves’.

36 there are good […] reasons for subsuming under the notion of collocation such colligational patterns as regarde où tu vas, dans les colonnes de (+ name of newspaper or magazine) or si elle est prise à temps (referring to an illness), which have so far been regarded as free sequences of words subject only to general rules of syntax and semantics.

37 Are collocations always binary ?
It is accepted wisdom among European researchers that collocations are binary units, and this is probably true for the majority of the class (e.g. take a step, launch an appeal). […] threatening to this view are irreducible three-element collocations such as the following:
(2) the car holds the road well
(3) avoir un geste déplacé -> (?)avoir un geste
recevoir un accueil chaleureux -> (?)recevoir un accueil

38 hold the road (subject: tyre), tomber à gros flocons (subject: neige), emporter la conviction (subject: argument) or eine Kurve machen (subject: Straße). [With such collocations] it [is] difficult to identify a standard lexical function (in the sense of Mel’čuk) that can provide a systematic link between the verb and the noun; this is because the entire collocation is semantically dependent on a specific subject.

39 Directionality
- The assumption of directionality (or of a hierarchical relationship between the constituents of the collocation) seems obvious with items such as table + lay / set or money + withdraw.
- Yet even such textbook examples of collocational theory as célibataire + endurci (‘confirmed bachelor’) may be viewed as bidirectional, since the adjective endurci combines with any noun carrying the semantic feature [+ figé dans son comportement] (‘set in one's ways’): criminel, catholique, Parisien.

40 Berry-Rogghe’s Z-score
The Z-score is an indication of the probability that two words will co-occur within a certain span.
P = total frequency of the collocate / length of the text
E = P x length of the mini-text
Standard deviation = SQR (E x (1 - P))
Z-score = (frequency of the collocate in the mini-text - E) / standard deviation
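The four steps of the Z-score can be written as one short function. This is a sketch that follows the formulas on this slide; the function and parameter names are assumptions, and the figures in the usage lines are invented for illustration rather than taken from the corpus.

```python
import math


def z_score(observed: int, collocate_total: int,
            text_length: int, mini_text_length: int) -> float:
    """Z-score of a collocate, following the steps on the slide.

    observed         -- frequency of the collocate inside the mini-text
    collocate_total  -- frequency of the collocate in the whole text
    text_length      -- number of tokens in the whole text
    mini_text_length -- number of tokens in the mini-text (span around the node)
    """
    p = collocate_total / text_length          # P: probability of the collocate
    expected = p * mini_text_length            # E: expected frequency in the mini-text
    std_dev = math.sqrt(expected * (1 - p))    # standard deviation
    return (observed - expected) / std_dev


# Hypothetical figures: a collocate seen 100 times in a 10,000-token text,
# 10 of them inside a 400-token mini-text.
z = z_score(observed=10, collocate_total=100,
            text_length=10_000, mini_text_length=400)
print(round(z, 3))
```

Here E = 0.01 x 400 = 4, so the collocate occurs more than twice as often near the node as chance would predict, and the score is positive.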


42 Concordance of « lit » in the expression faire le lit de
6286 infectieux semblent faire le lit des localisations
7774 | dont on | sait qu'ils font le lit du cancer.
21884 et cartilagineuses qui feront le lit de l' arthrose.
21939 qui | vont | faire le lit de l' arthrose.
27952 | personnalité peuvent faire le lit de véritables maladies
21146 détérioration dentaire et fait le lit de l' ATS
32987 | vieillissement artériel fait le lit de l' ATS
8847 de l' oreillette gauche faisant le lit des troubles rythmiques ;
17440 organes des sens, | peut faire le lit de délires d'

43 Z-scores of the collocates of « lit ». BEFORE 2, AFTER 0. Mini-text: 268. Total Text: […]
Collocate   Collocate Freq.   Type Freq.   Z-score
repos       13                300          64.077
au          46                9517         39.336
feront      1                 11           25.781
garder      2                 47           24.903
le          35                36247        13.646
faire       5                 1124         12.384
faisant                       178          6.263
font                          244          5.300
fait                          2185         3.120
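The "BEFORE 2, AFTER 0" setting means that the mini-text is made up of the two tokens preceding each occurrence of the node. A minimal sketch of how such a mini-text and its collocate counts could be built is given below; the function name, the whitespace tokenisation and the toy sentence (pieced together from the concordance fragments) are assumptions, not the procedure of the software used for the slide.

```python
from collections import Counter
from typing import List, Tuple


def mini_text_counts(tokens: List[str], node: str,
                     before: int = 2, after: int = 0) -> Tuple[Counter, int]:
    """Count collocates in a span around each occurrence of `node`.

    Returns the collocate frequencies and the total size of the mini-text.
    """
    counts: Counter = Counter()
    size = 0
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - before):i]     # tokens before the node
        right = tokens[i + 1:i + 1 + after]     # tokens after the node
        window = left + right
        counts.update(window)
        size += len(window)
    return counts, size


# Toy text, for illustration only.
tokens = "ils font le lit du cancer et vont faire le lit de l' arthrose".split()
counts, size = mini_text_counts(tokens, "lit", before=2, after=0)
print(counts.most_common(), size)
```

The resulting counts and mini-text size are the two inputs a Z-score calculation needs in addition to the whole-text frequencies.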

44 The Mutual Information (MI) score
The word post co-occurs with many words, among which are "the", "office" and "mortem".
f(office) = 5237
f(the) = […]
f(mortem) = 51
(f = overall frequency in the Birmingham Corpus)

45 Joint frequency for those three words is as follows :
j(the) = 1583
j(office) = 297
j(mortem) = 51
These joint frequencies can be compared with what would be expected under the null hypothesis.

46 THE NULL HYPOTHESIS The word post has no effect whatsoever on its lexical environment, and the frequencies of the words surrounding post will be exactly the same whether post is present or not. Under this hypothesis, the expected co-occurrence of post and the is calculated as:
(f(post) * span) * relative_freq(the) = (2579 * 8) * (1 / 20) = 20632 / 20 ≈ 1031

47 The MI Score is the logarithm of the ratio between observed co-occurrence and expected co-occurrence
For post and the, it is log(1583/1031) ≈ 0.19 (base-10 logarithm).
The expected joint frequency for post and office is (f(post) * span) * relative_freq(office) = 2579 * 8 * 5237/20m ≈ 5.4. The observed joint frequency is 297. Hence the MI score is about log(55) ≈ 1.74.
For mortem, the expected joint frequency is about 0.05, so the MI score is log(51/0.05) ≈ 3.
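The calculation for post + the and post + mortem can be reproduced with a few lines of code. This is a sketch under stated assumptions: the function names are invented, the logarithm is taken in base 10, the corpus size of 20 million comes from the "20m" on this slide, and f(the) = 1,000,000 is only what the relative frequency of 1/20 implies for a corpus of that size.

```python
import math

CORPUS_SIZE = 20_000_000   # "20m" on the slide
F_POST = 2579              # overall frequency of the node "post"
SPAN = 8                   # size of the window around the node


def expected_cooccurrence(f_node: int, span: int,
                          f_collocate: int, corpus_size: int) -> float:
    """Expected joint frequency under the null hypothesis."""
    return (f_node * span) * (f_collocate / corpus_size)


def mi_score(observed: float, expected: float) -> float:
    """Base-10 log of observed over expected co-occurrence."""
    return math.log10(observed / expected)


# "the": relative frequency 1/20, i.e. 1,000,000 in 20m words (derived, see above).
e_the = expected_cooccurrence(F_POST, SPAN, 1_000_000, CORPUS_SIZE)
mi_the = mi_score(1583, e_the)

# "mortem": f = 51, joint frequency = 51.
e_mortem = expected_cooccurrence(F_POST, SPAN, 51, CORPUS_SIZE)
mi_mortem = mi_score(51, e_mortem)

print(round(e_the, 1), round(mi_the, 2))
print(round(e_mortem, 3), round(mi_mortem, 1))
```

The very frequent word "the" co-occurs with post only slightly more often than chance predicts, whereas "mortem", which never appears without post, scores far higher.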

48 The mutual-information score for a two-word collocation is a base-2 logarithm of the ratio of the probability of occurrence of the two-word collocation to the product of the probabilities of occurrence of the first word and of the second word. T-scores differ from mutual information scores in being scaled by an estimate of the variance (they tend to correct skewed MI scores that are due to a low number of occurrences).
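The slide gives no formula for the t-score. One widely used formulation, assumed here, divides the difference between observed and expected co-occurrence by the square root of the observed count. The sketch below applies it to the rounded expected values given earlier for post + the (1031) and post + mortem (0.05); the function name is an assumption.

```python
import math


def t_score(observed: float, expected: float) -> float:
    """t-score as (observed - expected) / sqrt(observed).

    The square root of the observed count serves as the variance-based
    scaling factor (an assumed, commonly used formulation).
    """
    return (observed - expected) / math.sqrt(observed)


t_the = t_score(1583, 1031)      # frequent pair, modest observed/expected ratio
t_mortem = t_score(51, 0.05)     # rare pair, very high observed/expected ratio

print(round(t_the, 2), round(t_mortem, 2))
```

Under this formulation post + the scores about 13.9 and post + mortem about 7.1, the opposite of the MI ranking: the t-score favours pairs backed by many occurrences, while MI favours exclusive but possibly rare pairs.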
