Co-occurrence and collocation
1. Definitions
Firth (1951), Modes of Meaning: « Words shall be known by the company they keep ».
Collocation designates both a process or state (the act of collocating, or the state of being collocated) and the result of that process (an arrangement or juxtaposition, especially of linguistic elements such as words).
Crystal (1991), A Dictionary of Linguistics and Phonetics: « a habitual co-occurrence of individual lexical items ».
Co-occurrence may be fortuitous, whereas collocation reflects collective usage.
Collocation is a type of lexical constraint: « the language user has available to him or her a large number of semi-preconstructed phrases that constitute single choices » (Sinclair, 1991).
A collocation is an arbitrary and recurrent word combination (Smadja, 1990).
A contrastive view of collocation, as expressed by F.J. Hausmann (1990): « […] the idiosyncrasy of a collocation becomes fully apparent only from the standpoint of another language, which combines different words to express the same fact » (idiosyncrasy: a structural or behavioral characteristic peculiar to an individual or a group).
2. Types of collocations Halliday (1966: 151, 157) argues that the collocational patterns of lexical items can lead to generalizations at the lexical level. If certain items belong to the same set, then they can be regarded as a single lexical item:
A strong argument, he argued strongly, the strength of his argument and his argument was strengthened [can all be regarded] as instances of one and the same syntagmatic relation. What is abstracted is an item strong, having the scatter strong, strongly, strength, strengthened, which collocates with argue (argument).
Sinclair (1991) proposes two principles:
– The open-choice principle represents the grammatical level: it sees language text as the result of a very large number of complex choices... the only restraint [being] grammaticalness (cf. Colorless green ideas sleep furiously).
– The idiom principle represents the lexical level and accounts for the restraints that are not captured by the open-choice model.
Three factors determine the categorization of a lexical combination:
– the degree of probability that the items will co-occur
– the degree of fixity of the combination (i.e. grammatical restrictions)
– the degree to which the meaning of the combination can be derived from the meaning of its constituent parts
The terms idiom and collocation (as well as their shadings) are used by different linguists with different definitions. Most linguists would agree that kick the bucket is an idiom, whereas [make / reach / take] a decision are collocations.
An example: the verb carry in OH
– [person, animal] porter [bag, shopping, load, news, message]
– [vehicle, pipe, wire, vein] transporter; [wind, tide, current, stream] emporter
– comporter [warning, guarantee, review, report]
– supporter [weight, load, traffic]
– l'emporter dans [state, region, constituency]; remporter [battle, match]
– the motion was carried by 20 votes to 13: la motion l'a emporté par 20 votes contre 13
Idioms: to be carried away by sth: être emballé[!] par qch; to get carried away[!]: s'emballer[!], se laisser emporter.
Woods, E. & McLeod, N. (1990) Using English Grammar, Prentice Hall.
Woods & McLeod suggest the following continuum (from most to least predictable/fixed):
– Idioms (allow neither substitution of their elements nor grammatical or syntactic alterations)
– Collocations (roughly predictable word combinations with some restrictions)
– Colligations (generalisable classes of collocations, for which at least one construct is specified by category rather than as a distinct lexical item)
– Free combinations (compositional and productive)
Oxford Dictionary of Current Idiomatic English, Vol. 2 - English Idioms (1983), Oxford University Press.
The dictionary presents a continuum from idiom to non-idiom and distinguishes between pure idioms (totally fixed) and figurative idioms (allowing for some variation):
- blow a fuse, as a figurative idiom, can only be used in the active form.
- the idiomatic sense of blow one's own [horn / trumpet] is not activated in the absence of own.
Collocations (non-idioms) are divided between restricted (or semi-idioms) and open.
Restricted collocations allow a degree of lexical variation: one element has a figurative sense not found outside that limited context, whereas the other appears in a familiar, literal sense (cf. carry a motion).
In open collocations, the elements are freely combinable and are used in a common literal sense.
Word Combinations (Howarth, 1993: A Phraseological Approach to Academic Writing)
Functional expressions
(1) More haste less speed. (proverb)
(2) Unaccustomed as I am to public speaking... (speech formula)
(3) You name it, we've got it. (slogan)
(4) When in Rome. (abbreviated proverb)
Composite units
(5) blow a trumpet (open collocation)
(6) blow a fuse (restricted collocation)
(7) blow your own trumpet (figurative idiom)
(8) blow the gaff (pure idiom) = vendre la mèche
Cruse, D. A. (1986) Lexical Semantics, Cambridge University Press.
Cruse distinguishes between idioms (lexically complex units constituting a single minimal semantic constituent, e.g. kick the bucket) and collocations (sequences of lexical items which habitually co-occur, each lexical item being a semantic constituent). He also introduces bound collocations (expressions whose constituents do not like to be separated) as a transitional area bordering on idiom.
Benson, M., Benson, E. & Ilson, R. (1986) Lexicographic Description of English (Studies in Language Companion Series, No 14), John Benjamins Publishing Company. Benson, M., Benson, E. & Ilson, R. (1986) The BBI Combinatory Dictionary of English, John Benjamins Publishing Company.
B, B & I distinguish between grammatical and lexical collocations.
Grammatical collocations have a node followed by a subordinate unit (which is often a preposition): refer to, reliance on, proud of.
In lexical collocations, both components have equal lexical status (ADJ-N, VB-N, ADV-ADJ, VB-ADV).
Sinclair, J. McH. (1991) Corpus, Concordance, Collocation, Oxford University Press. Defines two types of collocations (upward / downward) depending on the relative frequency of the two words considered in the order in which they occur. « give sb an edge » is a downward collocation, because « give » is more frequently used than « edge ».
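The upward / downward distinction comes down to comparing two corpus frequencies, which can be sketched in a few lines of Python. This is only an illustration: the function name, the "neutral" label for equal frequencies and the frequency figures are assumptions made for the example, not values taken from Sinclair.

```python
def collocation_direction(node_freq: int, collocate_freq: int) -> str:
    """Classify a node/collocate pair by relative corpus frequency."""
    if collocate_freq < node_freq:
        return "downward"   # the collocate is rarer than the node
    if collocate_freq > node_freq:
        return "upward"     # the collocate is more frequent than the node
    return "neutral"        # equal frequencies (label chosen for this sketch)

# Hypothetical frequencies, chosen only so that "give" outnumbers "edge".
freqs = {"give": 120_000, "edge": 9_000}

# With "give" as the node, "edge" is a downward collocate.
print(collocation_direction(freqs["give"], freqs["edge"]))
# With "edge" as the node, "give" is an upward collocate.
print(collocation_direction(freqs["edge"], freqs["give"]))
```

Swapping node and collocate flips the label, which is why the direction always has to be stated relative to a chosen node.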
Clas (1994): « Collocations et langues de spécialité », in Meta, XXXIX, 4.
– V+N: prononcer un discours (support verb)
– N+ADJ: rude épreuve, marque distinctive
– ADV+ADJ: grièvement blessé
– VB+ADV: recommander chaudement
– N+V: la cloche sonne, le chat miaule
– Quantity marking of the noun: un troupeau de vaches, une pincée de sel
Critique: the first category is too restrictive: set a record would count as a collocation, but not [beat / break / hold] a record. The last two categories allow almost no variation.
Example of lexicalization choices (http://pie.usna.edu/explore.html)
Out of 55 nouns that co-occur with « emergency » at least 10 times in the BNC, only 14 can be found in both OH and RC:
– brake, measure, operation, repair for collocations
– case, center, exit, landing, powers, ration, room, service, services, ward for compound nouns.
What makes a collocation worth learning for an EFL learner?
Collocations that involve a verb and its typical object (drive a car, read a book) can usually be inferred. Some verbs generate an infinite number of collocates (buy a car, buy a book…). What makes a collocation worth memorizing is the fact that the verb takes on another meaning (buy a story, buy time).
What makes a collocation remarkable (salient) is the fact that one of its components has few collocates (cf. the Tact Z-score formula).
Consequently, it makes more sense for an EFL learner to learn « downward collocations » grouped under the collocate rather than the node (record / beat, break, hold, set).
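Grouping verb + noun pairs under the noun, as suggested above, amounts to building a small index. The sketch below uses only pairs that appear as examples in this presentation; the function and variable names are made up for the illustration.

```python
from collections import defaultdict

# Verb + noun pairs taken from the examples in this presentation.
pairs = [
    ("beat", "record"), ("break", "record"), ("hold", "record"), ("set", "record"),
    ("make", "decision"), ("reach", "decision"), ("take", "decision"),
]

def group_under_noun(verb_noun_pairs):
    """Index each noun with the sorted list of verbs it combines with."""
    index = defaultdict(set)
    for verb, noun in verb_noun_pairs:
        index[noun].add(verb)
    return {noun: sorted(verbs) for noun, verbs in index.items()}

# Print one line per noun, in the "record / beat, break, hold, set" layout.
for noun, verbs in group_under_noun(pairs).items():
    print(f"{noun} / {', '.join(verbs)}")
```

A learner's word list organised this way has one entry per noun, so the verbs that must be memorized together are seen together.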
Mel'čuk's lexical functions
Lexical functions are the main principle underlying Mel'čuk's Meaning-Text Theory. They are meant to describe « institutionalized » lexical relations. Wanner (1996) gives examples of such relations: aircraft and crew, sheep and flock, bachelor and confirmed, mountain and peak, influence and exert, attention and pay.
The list includes both syntagmatically and paradigmatically related pairs of words. Mel'čuk admits that even though all phrases covered by LFs are collocations, his model does not cover some collocations, namely those in which the logical relation between the components cannot be readily inferred (as with assurance maladie and assurance vie).
LFs only cover bigrams. Their aim is to cover syntagmatic and paradigmatic relations between words within a formalized notation system. The concept is meant to be applied to a wide variety of languages. Standard LFs include 36 syntagmatic LFs that belong to 4 distinct categories: nominal, adjectival/adverbial, prepositional and verbal.
Nominal LF: Centr [Lat. centrum: the center or culmination of]
– Centr(crisis) = the peak
– Centr(desert) = the heart
Prepositional LFs
35. Loc in [being in place] – Loc in (height) = at [a height of…]
36. Loc ab [moving away from place] – Loc ab (height) = from [a height of…]
37. Loc ad [moving into place] – Loc ad (height) = to [a height of…]
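A lexical function behaves like a mapping from a keyword to the value conventionally used with it, so the examples above can be stored as a small lookup table. This is only a sketch: the dictionary layout, the underscore spellings (Loc_in and so on) and the helper name apply_lf are assumptions made for the example, not Mel'čuk's notation.

```python
# Each lexical function maps a keyword to its institutionalized value.
# The entries are limited to the examples quoted in this presentation.
LEXICAL_FUNCTIONS = {
    "Centr": {"crisis": "the peak", "desert": "the heart"},
    "Loc_in": {"height": "at"},
    "Loc_ab": {"height": "from"},
    "Loc_ad": {"height": "to"},
}

def apply_lf(name, keyword):
    """Return the value of LF `name` for `keyword`, or None if not recorded."""
    return LEXICAL_FUNCTIONS.get(name, {}).get(keyword)

print(apply_lf("Centr", "crisis"))   # the peak
print(apply_lf("Loc_ab", "height"))  # from
print(apply_lf("Centr", "height"))   # None: no value recorded for this pair
```

The None case mirrors the point made earlier: a pair that the table does not record is simply outside the coverage of the model.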
Dirk Siepmann: Collocation, Colligation and Encoding Dictionaries. Part I: Lexicological Aspects, IJL (4).
Linguistic intertextuality: the meaning of one text and its constituent elements depends on millions of other texts using similar or identical elements. Textual meaning is thus created by the interplay of two types of repetition:
– (a) collocation (in the largest possible sense, including colligation and phraseology)
– (b) cohesion.
The subject of collocation has been approached from two main angles:
– the semantically-based approaches (Benson 1986, Mel'čuk 1998, Hausmann 2003), which assume a particular meaning relationship between the constituents of a collocation
– the frequency-oriented approach (Sinclair 1991)
A few of Siepmann's opinions
Only the frequency-based approach can provide a heuristic for discovering the entire class of co-occurrences; in a way, it is safe from refutation, but empty. By contrast, the semantically-based approach is fragmentary: it cannot account for all possible cases.
A purely pragmatic approach relying on the extralinguistic context cannot explain a large number of co-occurrences operating at the level of semantic features. What is needed is an extension of the semantically-based approach that will take account of strings of regular syntactic composition which form a sense unit with a relatively stable meaning.
Lexical bundles (Biber et al. 1999) such as je sais que c'est or it's been will not be included among the class of collocations. Although such sequences may perform similar or identical functions across a range of texts, they have no meaning by themselves.
there are good […] reasons for subsuming under the notion of collocation such colligational patterns as regarde où tu vas, dans les colonnes de (+ name of newspaper or magazine) or si elle est prise à temps (referring to an illness), which have so far been regarded as free sequences of words subject only to general rules of syntax and semantics.
Are collocations always binary?
It is accepted wisdom among European researchers that collocations are binary units, and this is probably true for the majority of the class (e.g. take a step, launch an appeal). […] threatening to this view are irreducible three-element collocations such as the following:
– (2) the car holds the road well
– (3) avoir un geste déplacé (make an inappropriate gesture) -> (?)avoir un geste
– recevoir un accueil chaleureux (receive a warm welcome) -> (?)recevoir un accueil
hold the road (subject: tyre), tomber à gros flocons (subject: neige), emporter la conviction (subject: argument) or eine Kurve machen (subject: Straße): [With such collocations] it [is] difficult to identify a standard lexical function (in the sense of Mel'čuk) that can provide a systematic link between the verb and the noun; this is because the entire collocation is semantically dependent on a specific subject.
Directionality
The assumption of directionality (or of a hierarchical relationship between the constituents of the collocation) seems obvious with items such as table + lay / set or money + withdraw.
Yet even such textbook examples of collocational theory as célibataire + endurci (confirmed bachelor) may be viewed as bidirectional, since the adjective endurci combines with any noun carrying the semantic feature [+ figé dans son comportement] (set in one's behaviour): criminel, catholique, Parisien.
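Berry-Rogghe's Z-score
The Z-score indicates how far the observed number of co-occurrences of two words within a certain span departs from the number expected by chance (the mini-text being the set of words that fall within the span around the node).
P = total frequency of the collocate / length of the text
E = P x length of the mini-text
Standard deviation = SQR(E x (1 - P))
Z-score = (frequency of the collocate in the mini-text - E) / standard deviation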
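The four steps of the Z-score formula translate directly into code. The sketch below follows the formula as given on this slide; the function name, parameter names and the figures in the usage example are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

def z_score(collocate_total_freq, text_length, mini_text_length, collocate_mini_freq):
    """Z-score of a collocate, following the four steps of the formula above."""
    p = collocate_total_freq / text_length     # P: chance of meeting the collocate
    expected = p * mini_text_length            # E: expected count in the mini-text
    std_dev = math.sqrt(expected * (1 - p))    # standard deviation
    if std_dev == 0:
        raise ValueError("standard deviation is zero; the Z-score is undefined")
    return (collocate_mini_freq - expected) / std_dev

# Hypothetical figures: a collocate occurring 20 times in a 100-word text,
# and 9 times in a 25-word mini-text.
# P = 0.2, E = 5, standard deviation = sqrt(5 x 0.8) = 2, Z = (9 - 5) / 2
print(round(z_score(20, 100, 25, 9), 2))  # 2.0
```

A collocate that occurs in the mini-text exactly as often as chance predicts gets a Z-score of 0; the further the observed count exceeds E, the higher the score.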
Concordance of « lit » in the expression faire le lit de
6286 infectieux semblent faire le lit des localisations
7774 | dont on | sait qu'ils font le lit du cancer
et cartilagineuses qui feront le lit de l' arthrose
qui | vont | faire le lit de l' arthrose
| personnalité peuvent faire le lit de véritables maladies
détérioration dentaire et fait le lit de l'ATS
| vieillissement artériel fait le lit de l' ATS
8847 de l' oreillette gauche faisant le lit des troubles rythmiques ;
organes des sens, | peut faire le lit de délires d'
Z-scores of the collocates of « lit » (span: 2 words before, 0 after; mini-text: 268)
[Table: Collocate, Collocate Freq., Type Freq., Z-score, for the collocates repos, au, feront, garder, le, faire, faisant, font, fait]
The Mutual Information (MI) score
The word post co-occurs with many words, among which are "the", "office" and "mortem".
f(office) = 5237
f(the) = about 1 million (as implied by the relative frequency of 1/20 used for the expected value)
f(mortem) = 51
(f = overall frequency in the Birmingham Corpus)
The joint frequency of post with each of those three words is as follows:
j(the) = 1583
j(office) = 297
j(mortem) = 51
These frequencies can be compared with what would be expected under the null hypothesis.
THE NULL HYPOTHESIS
The word post has no effect whatsoever on its lexical environment: the frequencies of the words surrounding post are exactly what they would be whether or not post were present.
The expected co-occurrence of post and the is calculated as:
(f(post) * span) * relative_freq(the) = (2579 * 8) * (1 / 20) = 20632 / 20 ≈ 1031
The MI score is the logarithm of the ratio between observed co-occurrence and expected co-occurrence (base-10 logarithms are used in the figures here).
For post and the, it is log(1583 / 1031) ≈ 0.19.
The expected joint frequency for post and office is (f(post) * span) * relative_freq(office) = 2579 * 8 * 5237 / 20m ≈ 5.4. The observed joint frequency is 297, so the MI score is about log(297 / 5.4) = log(55) ≈ 1.74.
For mortem, the expected joint frequency is 2579 * 8 * 51 / 20m ≈ 0.05, and the MI score is log(51 / 0.05) ≈ 3.
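The calculation can be reproduced with a short script. It is a sketch under stated assumptions: relative_freq(word) is taken to be f(word) divided by the corpus size, as the null hypothesis defines it; the corpus size is the "20m" of the worked example; and f(the) = 1,000,000 is not given on the slide but derived from the relative frequency of 1/20. All names are made up for the illustration.

```python
import math

CORPUS_SIZE = 20_000_000   # "20m" in the worked example
SPAN = 8                   # window size used in the worked example
F_POST = 2579              # overall frequency of the node word "post"

# word: (overall frequency f, joint frequency j with "post")
# f("the") is an assumption derived from the relative frequency 1/20.
COUNTS = {
    "the": (1_000_000, 1583),
    "office": (5237, 297),
    "mortem": (51, 51),
}

def expected_joint(f_word):
    """Expected co-occurrence with "post" under the null hypothesis."""
    return F_POST * SPAN * f_word / CORPUS_SIZE

def mi_score(joint, f_word, base=10):
    """Logarithm of observed over expected co-occurrence."""
    return math.log(joint / expected_joint(f_word), base)

for word, (f, j) in COUNTS.items():
    print(f"{word:7s} expected={expected_joint(f):9.2f}  MI={mi_score(j, f):.2f}")
```

Calling mi_score with base=2 gives the scores on the more usual base-2 scale; the ranking of the three collocates is the same either way.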
The mutual-information score for a two-word collocation is conventionally the base-2 logarithm of the ratio of the probability of the two words occurring together to the product of the probabilities of each word occurring on its own; a different logarithm base only rescales the scores. T-scores differ from mutual information scores in being scaled by an estimate of the variance (they tend to correct skewed MI scores that are due to a low number of occurrences).
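The contrast between the two measures can be seen by computing both for the same pairs. The slide does not give a t-score formula, so the sketch below assumes one common formulation, t = (observed - expected) / sqrt(observed). The figures for post come from the worked example, f(the) = 1,000,000 is again an assumption derived from the relative frequency of 1/20, and all names are hypothetical.

```python
import math

F_POST, SPAN, CORPUS_SIZE = 2579, 8, 20_000_000

def expected_joint(f_word):
    """Expected co-occurrence with "post" under the null hypothesis."""
    return F_POST * SPAN * f_word / CORPUS_SIZE

def mi_score(observed, expected):
    """Base-2 mutual information: log2(observed / expected)."""
    return math.log2(observed / expected)

def t_score(observed, expected):
    """Assumed formulation: difference scaled by sqrt(observed)."""
    return (observed - expected) / math.sqrt(observed)

# word: (joint frequency with "post", overall frequency)
PAIRS = {"the": (1583, 1_000_000), "mortem": (51, 51)}

for word, (joint, f) in PAIRS.items():
    e = expected_joint(f)
    print(f"{word:7s} MI={mi_score(joint, e):5.2f}  t={t_score(joint, e):5.2f}")
```

On these figures MI ranks mortem far above the, while the t-score ranks the above mortem: with only 51 occurrences, the evidence for mortem is strong in proportion but rests on few observations.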