Presentation on theme: "Corpora in lexical studies"— Presentation transcript:
1 Corpora in lexical studies
Corpus Linguistics
Richard Xiao
2 Aims of this session
Lecture
  Corpus-based lexicography
  Collocation and colligation
Lab session
  Collocation using WST
  Collocation using AntConc
  Collocation and colligation in Xaira
  Using the BNCweb to study collocation
3 Corpus revolution in lexicographic and lexical studies
Lexicographic and lexical studies are the greatest beneficiaries of corpora
Corpora have “revolutionised” dictionary making and reference publishing
It is now nearly unheard of for new dictionaries and new editions of old dictionaries published from the 1990s onwards not to claim to be based on corpus data
4 Why use corpora in dictionary making?
Machine-readable corpora allow dictionary makers to extract all authentic, typical examples of the usage of a lexical item from a large body of text in a few seconds
Corpora allow dictionary makers to select entries based on frequency information
Corpora can readily provide frequency information and collocation information for readers
Textual (e.g. register, genre and domain) and sociolinguistic (e.g. user gender and age) information encoded in corpora allows lexicographers to give a more accurate description of the usage of a lexical item
5 Why use corpora in dictionary making?
Corpus annotations such as part-of-speech tagging and word sense disambiguation also enable a more sensible grouping of polysemous words and homographs
A “monitor corpus” allows lexicographers to track subtle changes in the meaning and usage of a lexical item so as to keep their dictionaries up to date
Corpus evidence can complement or refute the intuitions of individual lexicographers, which are not always reliable because intuitions can be biased
6 Five emphases
Changes brought about by corpora to dictionaries and other reference books: five “emphases” (Hunston 2002)
  an emphasis on frequency
  an emphasis on collocation and phraseology
  an emphasis on variation
  an emphasis on lexis in grammar
  an emphasis on authenticity
7 Top 1000 written / spoken words Authentic examples
8 Corpus-based learner dictionaries
First ‘fully corpus-based’ dictionary
  Collins Cobuild English Dictionary (1987)
Some corpus-based learner dictionaries
  Longman Dictionary of Contemporary English (3rd edition)
  Oxford Advanced Learner’s Dictionary (OALD, 5th edition)
  Cambridge International Dictionary of English (1st edition)
10 Collocation
Collocation is among the linguistic concepts which have benefited most from advances in corpus linguistics
What is collocation?
  strong tea, powerful car (Halliday 1976)
  “collocations of a given word are statements of the habitual or customary places of that word…the company that words keep” (Firth 1968: 181-2)
  “One of the meanings of night is its collocability with dark” (Firth 1957: 196)
  “a frequent co-occurrence of two lexical items in the language” (Greenbaum 1974: 82)
  expel a school child vs. cashier an army officer
  “I propose to bring forward as a technical term, meaning by collocation, and apply the test of collocability” (Firth 1957: 194)
11 Meaning by collocation
“There is frequently so high a degree of interdependence between lexemes which tend to occur in texts in collocation with one another that their potentiality for collocation is reasonably described as being part of their meaning” (Lyons 1977: 613)
A complete description of the meaning of a word would have to include the other word or words that collocate with it
“You shall know a word by the company it keeps!” (Firth 1968: 179)
Collocation is part of word meaning
12 Two types of collocation
Coherence collocation vs. neighbourhood (horizontal) collocation (Scott 1998)
Coherence collocation
  Collocates associated with a word (e.g. letter – stamp, post office)
Neighbourhood collocation
  Words which do actually co-occur with the word (e.g. letter – my, this, a, etc.)
13 Coherence collocation
“A cover term for the cohesion that results from the co-occurrence of lexical items that are in some way or other typically associated with one another, because they tend to occur in similar environments.” (Halliday & Hasan 1976: 287)
  candle – flame – flicker
  hair – comb – curl – wave
  sky – sunshine – cloud – rain
Difficult to measure using a statistical formula
14 Neighbourhood collocation
Collocation in corpus linguistics
Structure of collocation: the collocation window
  “We may use the term node to refer to an item whose collocations we are studying, and we may then define a span as the number of lexical items on each side of a node that we consider relevant to that node. Items in the environment set by the span we will call collocates.” (Sinclair 1966: 415)
Casual vs. significant collocation
  Significant collocation: collocation that occurs more frequently than would be expected (in a statistical sense) on the basis of the individual items
n.b. Neighbourhood (horizontal) collocations can include some coherence collocations
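The node, span and collocate terminology can be made concrete with a minimal Python sketch. The function name `collocates`, the whitespace tokenisation and the sample sentence are assumptions made for illustration; they are not taken from the slides or from any of the tools used in the lab session.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count every token found within `span` positions of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - span):i]        # up to `span` items to the left
        right = tokens[i + 1:i + 1 + span]       # up to `span` items to the right
        counts.update(left + right)
    return counts

# Illustrative sentence only; a real study would read a corpus file.
text = "on the stroke of midnight the clock struck and on the stroke of noon it stopped"
tokens = text.lower().split()

window_counts = collocates(tokens, "stroke", span=2)
print(window_counts.most_common())
```

Raw window counts like these only show casual co-occurrence. Deciding whether a collocate is significant requires comparing the observed count with the count expected from the frequencies of the individual items.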
15 Intuition vs. collocation
Greenbaum (1974): “people disagree on collocations” in introspection-based elicitation experiments
Although “collocation can be observed informally” on the basis of intuitions, “it is more reliable to measure it statistically, and for this a corpus is essential” (Hunston 2002: 68)
Intuition is often a poor guide to collocation
  “because each of us has only a partial knowledge of the language, we have prejudices and preferences, our memory is weak, our imagination is powerful (so we can conceive of possible contexts for the most implausible utterances), and we tend to notice unusual words or structures but often overlook ordinary ones” (Krishnamurthy 2000: 32-33)
Collocation can be measured on the basis of co-occurrence statistics (MI, z, t, LL, etc.); more discussion to follow
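As a rough illustration of the co-occurrence statistics named on this slide, the sketch below computes MI, t, z and log-likelihood (LL) for a single node-collocate pair. It uses commonly cited textbook formulations, which may differ in detail from what WST, AntConc, Xaira or BNCweb implement. The function name `association_scores`, its parameters and the example counts are assumptions, and the expected frequency is the simplest one (adjacent words, no adjustment for span width).

```python
import math

def association_scores(o11, f_node, f_coll, n):
    """Return MI, t, z and log-likelihood for one node-collocate pair.

    o11    : observed co-occurrences of node and collocate (must be > 0)
    f_node : corpus frequency of the node
    f_coll : corpus frequency of the collocate
    n      : corpus size in tokens
    Assumption: expected frequency is f_node * f_coll / n, i.e. a
    one-position window; wider spans need a scaled expectation.
    """
    expected = f_node * f_coll / n
    mi = math.log2(o11 / expected)              # mutual information
    t = (o11 - expected) / math.sqrt(o11)       # t score
    z = (o11 - expected) / math.sqrt(expected)  # z score (common simplification)

    # 2x2 contingency table for the log-likelihood ratio:
    # rows = node present/absent, columns = collocate present/absent.
    observed = [
        [o11, f_node - o11],
        [f_coll - o11, n - f_node - f_coll + o11],
    ]
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    ll = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = rows[i] * cols[j] / n
            if o > 0:                           # 0 * log(0) is treated as 0
                ll += o * math.log(o / e)
    ll *= 2
    return {"MI": mi, "t": t, "z": z, "LL": ll}

# Invented counts: the pair co-occurs 8 times where 2 would be expected.
scores = association_scores(o11=8, f_node=100, f_coll=200, n=10_000)
print(scores)
```

With these invented counts the expected frequency is 2, so MI is log2(8/2) = 2. When the observed count equals the expected count, all four scores are zero.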
16 Collocation is syntagmatic
Langue (language system): paradigmatic
  famous boots. On the stroke of full time the
  Stoke the lead on the stroke of half-time with a goal
  Smith sin-binned on the stroke of half-time, added a
  clinched their win on the stroke of lunch after resuming
  chase by declaring on the stroke of lunch. <p> With a lead
  expectant crowd, on the stroke of midday. The bird
  hour began not upon the stroke of midnight but upon the
  of midnight but upon the stroke of noon. There was,
  booked in advance. On the stroke of seven, a gong summons
  Promptly on the stroke of six 'clock, the chooks
  from Edinburgh on the stroke of the Millennium.
Parole (utterance): syntagmatic
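Concordance lines of this kind can be produced with a simple keyword-in-context (KWIC) routine. The sketch below is a minimal illustration, not the method of any particular concordancer; the function name `kwic`, the context width and the sample text (loosely adapted from the lines on this slide) are assumptions.

```python
def kwic(tokens, node, width=4):
    """Return keyword-in-context lines for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() != node:
            continue
        left = " ".join(tokens[max(0, i - width):i])
        right = " ".join(tokens[i + 1:i + 1 + width])
        # Right-align the left context so the node forms a vertical column.
        lines.append(f"{left:>30} [{tok}] {right}")
    return lines

# Illustrative text only; a real study would read a corpus file.
sample = ("Promptly on the stroke of six the gong sounded and "
          "the hour began not upon the stroke of midnight").split()

concordance = kwic(sample, "stroke")
for line in concordance:
    print(line)
```

Reading across a line shows the syntagmatic relations of the node; reading down the column to the right of the node shows the items that can fill the same slot, which is the paradigmatic dimension.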
17 Collocation vs. colligation
Collocation
  Relationship between a lexical item and other lexical items
  Relationship between words at the lexical level
  E.g. very collocates with good
Colligation
  Relationship between a lexical item and a grammatical category
  Relationship between words at the grammatical level
  E.g. very colligates with ADJ
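The difference between the two relationships can be sketched by counting, for the same node, both the words and the part-of-speech tags that follow it. The tagged tokens below are hand-made toy data with a simplified tagset, and `right_neighbours` is a hypothetical helper; a real study would use a tagged corpus such as the BNC.

```python
from collections import Counter

# Hand-tagged toy data (hypothetical); not drawn from any corpus.
tagged = [
    ("a", "DET"), ("very", "ADV"), ("good", "ADJ"), ("idea", "NOUN"),
    ("she", "PRON"), ("sings", "VERB"), ("very", "ADV"), ("well", "ADV"),
    ("it", "PRON"), ("is", "VERB"), ("very", "ADV"), ("good", "ADJ"),
    ("a", "DET"), ("very", "ADV"), ("old", "ADJ"), ("house", "NOUN"),
]

def right_neighbours(tagged_tokens, node):
    """Count the words and the tags found immediately after `node`."""
    words, tags = Counter(), Counter()
    for (word, _), (next_word, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if word.lower() == node:
            words[next_word.lower()] += 1   # lexical level: collocates
            tags[next_tag] += 1             # grammatical level: colligates
    return words, tags

collocate_counts, colligate_counts = right_neighbours(tagged, "very")
print("collocates:", collocate_counts.most_common())
print("colligates:", colligate_counts.most_common())
```

In this toy sample the word counts point to good as a collocate of very, while the tag counts point to ADJ as the category it colligates with.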
38 Rank by frequency
“Sweet Maxwell” is a personal name.
Frequent words crowd into the top of the collocate list: are they genuine collocates?
39 Rank by the t test
Also focusing on frequent words?
40 Rank by MI
Infrequent words at the top of the list
How useful are they (especially to English learners)?
n.b. “Sweet Afton” is a phrase from the lyrics expressing the beauty of the River Afton; “sweet nothings” means romantic, loving talk between sweethearts; “sweet marjoram” is the name of a herb used in cooking.
41 Rank by the z score
Like MI, the z score over-estimates infrequent items (e.g. nothings, afton, marjoram)
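The way the choice of measure reorders a collocate list can be sketched with a few invented counts. All figures below are made up for illustration and are not the corpus results shown on the slides. The formulas are common textbook versions of MI, t and z, with the expected frequency computed for adjacent words only.

```python
import math

N = 1_000_000      # corpus size (invented)
F_NODE = 1_000     # frequency of the node "sweet" (invented)

# collocate: (corpus frequency, co-occurrences with the node), all invented
counts = {
    "and": (25_000, 60),
    "smell": (500, 30),
    "nothings": (10, 5),
}

def scores(f_coll, observed):
    """Score one collocate against the node under the independence assumption."""
    expected = F_NODE * f_coll / N
    return {
        "frequency": observed,
        "MI": math.log2(observed / expected),
        "t": (observed - expected) / math.sqrt(observed),
        "z": (observed - expected) / math.sqrt(expected),
    }

table = {word: scores(*vals) for word, vals in counts.items()}

def rank(measure):
    """Return the collocates ordered from highest to lowest score."""
    return sorted(table, key=lambda w: table[w][measure], reverse=True)

for measure in ("frequency", "t", "MI", "z"):
    print(measure, rank(measure))
```

With these invented counts, raw frequency puts the function word and first, MI and z put the rare item nothings first, and the t score puts nothings last.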