1 Computational Lexicography: Mapping Meaning onto Use Patrick Hanks Brandeis University and Berlin-Brandenburg Academy of Sciences.

Slides:



Advertisements
Similar presentations
You may be a victim of. Are you anxious and worried about what will happen when you and your significant other are together? Apart? Are you the subject.
Advertisements

Eye of the Storm Unit 3 Week 4.
A Student’s Guide to Methodology
TEST-TAKING STRATEGIES FOR THE OHIO ACHIEVEMENT READING ASSESSMENT
Gallup Q12 Definitions Notes to Managers
Week 11 Review: Statistical Model A statistical model for some data is a set of distributions, one of which corresponds to the true unknown distribution.
Language Chapter 3 Content.
1 The Generative Lexicon (GL) meets Corpus Pattern Analysis (CPA) Patrick Hanks Institute of Formal and Applied Linguistics, Charles University in Prague,
1 Why do CPA? Patrick Hanks Research Institute for Information and Language Processing, University of Wolverhampton; Bristol Centre for Linguistics, University.
CL Research ACL Pattern Dictionary of English Prepositions (PDEP) Ken Litkowski CL Research 9208 Gue Road Damascus,
Introduction to corpus session General corpora Rosamund Moon: lexicography, polysemy data Alice Deignan Specialised corpora Elena Semino Andreas Musolff.
THERE IS NO GENERAL METHOD OR FORMULA WHICH IS ‘CORRECT’. YOU CAN PROBABLY IGNORE SOME OF THIS ADVICE AND STILL WRITE A GOOD ESSAY… BUT FOLLOWING IT MAY.
Seven possibly controversial but hopefully useful rules Paul De Boeck K.U.Leuven.
Automatic Metaphor Interpretation as a Paraphrasing Task Ekaterina Shutova Computer Lab, University of Cambridge NAACL 2010.
January 12, Statistical NLP: Lecture 2 Introduction to Statistical NLP.
Term 2 Week 3 Semantics.
August 23, 2010 Grammars and Lexicons How do linguists study grammar?
Talking about your homework News story? –What made you choose…? One of your words? –What made you choose…? (Give your vocabulary books to another student.
Language, Mind, and Brain by Ewa Dabrowska Chapter 2: Language processing: speed and flexibility.
PSY 369: Psycholinguistics Some basic linguistic theory part3.
Guidelines to Publishing in IO Journals: A US perspective Lois Tetrick, Editor Journal of Occupational Health Psychology.
Meaning and Language Part 1.
Spelling Lists.
Spelling Lists. Unit 1 Spelling List write family there yet would draw become grow try really ago almost always course less than words study then learned.
thinking hats Six of Prepared by Eman A. Al Abdullah ©
RESEARCH DESIGN.
1 Syntagmatic Preferences Patrick Hanks Masaryk University In honour of Yorick Wilks BCS, London, June 22, 2007.
15 Powerful Habits Make You The Winner!!!.
© 2008 McGraw-Hill Higher Education The Statistical Imagination Chapter 9. Hypothesis Testing I: The Six Steps of Statistical Inference.
McEnery, T., Xiao, R. and Y.Tono Corpus-based language studies. Routledge. Unit A 2. Representativeness, balance and sampling (pp13-21)
Peer Review Expectations Practice With Sentence Types.
1 Corpus Pattern Analysis: Word Meaning and Word Use Patrick Hanks Brandeis University and Berlin-Brandenburg Academy of Sciences.
1 How People use words to make meanings __ How to compute the meaning of natural language utterances Patrick Hanks Professor in Lexicography University.
The DVC project: Disambiguation of Verbs by Collocation ____ an introduction to the linguistic theory of norms and exploitations Patrick Hanks Research.
1 How to Compute the Meaning of Natural Language Utterances Patrick Hanks, Research Institute of Information and Language Processing, University of Wolverhampton.
1 Computational Lexicography: Mapping Meaning onto Use Patrick Hanks Institute of Formal and Applied Linguistics, Charles University in Prague Masaryk.
Corpus Linguistics meets Lexical Semantic Theory James Pustejovsky Brandeis University University of Pavia December 15, 2004.
Fundamentals of Data Analysis Lecture 4 Testing of statistical hypotheses.
Tools in Media Research In every research work, if is essential to collect factual material or data unknown or untapped so far. They can be obtained from.
Word senses Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds, Sussex.
Word Sense Disambiguation (WSD)
Big Idea 1: The Practice of Science Description A: Scientific inquiry is a multifaceted activity; the processes of science include the formulation of scientifically.
Methods of Science Section 1.1. Methods of Science 3 areas of science: Life, Earth, Physical –What is involved in each? Scientific Explanations- not always.
Practice Examples 1-4. Def: Semantics is the study of Meaning in Language  Definite conclusions Can be arrived at concerning meaning.  Careful thinking.
ORAL EXAMINATION INDIVIDUAL PRESENTATION What you should know about Part one? What is an individual presentation? What is an individual presentation?
1 Statistical NLP: Lecture 9 Word Sense Disambiguation.
The Problem page, Coherence, ideology How an ideological message is conveyed through language, and particularly through the following aspects of textual.
인공지능 연구실 황명진 FSNLP Introduction. 2 The beginning Linguistic science 의 4 부분 –Cognitive side of how human acquire, produce, and understand.
1 Statistical NLP: Lecture 7 Collocations. 2 Introduction 4 Collocations are characterized by limited compositionality. 4 Large overlap between the concepts.
Computational linguistics A brief overview. Computational Linguistics might be considered as a synonym of automatic processing of natural language, since.
Disambiguation Read J & M Chapter 17.1 – The Problem Washington Loses Appeal on Steel Duties Sue caught the bass with the new rod. Sue played the.
1 LIN 1310B Introduction to Linguistics Prof: Nikolay Slavkov TA: Qinghua Tang CLASS 16, March 6, 2007.
Scientific Investigations The Nature of Scientific Research.
1 Corpus Pattern Analysis (CPA) Patrick Hanks Research Institute of Information and Language Processing, University of Wolverhampton ***
Sight Words.
Corpus search What are the most common words in English
The Nature of Knowledge. Thick Concept When a short definition is not enough, it is called a thick concept word. It can only be understood through experience.
Copyright © 2010, 2007, 2004 Pearson Education, Inc. Chapter 19 Confidence Intervals for Proportions.
SIMS 296a-4 Text Data Mining Marti Hearst UC Berkeley SIMS.
PSY 432: Personality Chapter 1: What is Personality?
Fundamentals of Data Analysis Lecture 4 Testing of statistical hypotheses pt.1.
Plagiarism Miss H. 2008/2009. The entire content of this presentation comes from TurnItIn.com Turnitin allows free distribution and non-profit use of.
Presented by The Solutions Group Decision Making Tools.
SEMANTICS DEFINITION: Semantics is the study of MEANING in LANGUAGE Try to get yourself into the habit of careful thinking about your language and the.
Use of Concordancers A corpus (plural corpora) – a large collection of texts, written or spoken, stored on a computer. A concordancer – a computer programme.
What do we mean by the word “knowledge?”
A CORPUS-BASED STUDY OF COLLOCATIONS OF HIGH-FREQUENCY VERB —— MAKE
Statistical NLP: Lecture 9
Fact and Opinion: Is There Really a Difference
Statistical NLP : Lecture 9 Word Sense Disambiguation
Presentation transcript:

1 Computational Lexicography: Mapping Meaning onto Use Patrick Hanks Brandeis University and Berlin-Brandenburg Academy of Sciences

2 Two senses of “computational lexicography” 1.Exploiting published dictionaries for use in new computer programs 2.Using computer programs to create new dictionaries

3 Using dictionaries for computational purposes Inventory of the words of a language + tokenization, lemmatization Word class recognition (noun vs. verb vs. adj.) –but dictionaries don’t give comparative frequencies –see, sees n. district of a bishop: 136 in BNC. –see, sees vb. perceive: 118,500 in BNC. Word sense disambiguation –assumes that dictionary sense distinctions are reliable. –dictionaries don’t give comparative frequencies!

4 Word Sense Disambiguation Lesk (1986): ‘How to tell a pine cone from an ice cream cone’, using OALD definitions: pine 1. kind of evergreen tree with needle-shaped leaves. 2. waste away through sorrow or illness. cone 1. a solid object with a round flat base and sides that slope up to a point… 2. something of this shape whether solid or hollow. 3. a piece of thin crisp biscuit shaped like a cone, which you can put ice cream in to eat it. 4. the fruit of certain evergreen trees.

5 Scaling up Lesk’s Approach to WSD Mark Stevenson and Yorick Wilks (1999): ‘Combining weak knowledge sources for sense disambiguation’ – using LDOCE: –Part of speech tags –Selectional “restrictions” (preferences) –Subject Field tags –Lexical items in definition text (like Lesk) They claim “over 90% success when tested against a specially created corpus”.

6 Other approaches to WSD Ide and Veronis (1990): constructed a large neural network based on an MRD (Collins) –Combinatorial explosion. No selectivity: it couldn’t cope with a whole discourse, nor even a long sentence. Yarowsky et al. (1992): “One sense per discourse” Yarowsky (1993): “One sense per collocation” –In both cases, claiming high probability, not certainty.

7 Other resources used in WSD Miller and Fellbaum: WordNet (thesaurus based) –intuition based, not corpus based. –“senses” are actually nodes in an inheritance hierarchy, so not suitable for serious disambiguation. Levin verb classes –Levin’s classes are not empirically well founded –Levin’s alternations are an important contribution FrameNet –proceeds frame by frame, not word by word. –uses the corpus as a fishpond.

8 Other problems No general agreement on what counts as a word sense No clear criteria in dictionaries for distinguishing one sense from another Very little syntagmatic information in dictionaries

9 Lumping and splitting Most dictionaries are splitters. E.g. why did OALD 1963 make these two senses (cone)? 1. a solid object with a round flat base and sides that slope up to a point… 2. something of this shape whether solid or hollow. Why not: a solid or hollow object with a round flat base and sides that slope up to a point (?) This problem endlessly multiplied.

10 Dictionaries before corpora Based on collections of citations (from literary texts) In some dictionaries, examples were – and are – based on introspection, not taken from actual texts. Murray (OED, 1878): “The editor and his assistants have to spend precious hours searching for examples of common everyday words. Thus, … we have 50 examples of abusion, but of abuse not 5.”

11 Definitions Before corpora: attempt to state necessary conditions for the meaning of each word. It was assumed (wrongly) that this would enable people to use words correctly. Definitions in dictionaries of the future: associate meanings with words in context, not words in isolation.

12 Implicatures: taking stereotypes seriously If someone files a lawsuit, they activate a procedure asking a court for justice. When a pilot files a flight plan, he or she informs ground control of the intended route and obtains permission to begin flying. … When a group of people file into a room or other place, they walk in one behind the other. (12 more such definitions of file, verb.)

13 The problem: deciding relevant context Peter treated Mary. Peter treated Mary for her asthma. Peter treated Mary badly. Peter treated Mary with respect. Peter treated Mary with antibiotics. Peter treated Mary to his views on George W. Bush Peter treated the woodwork with creosote. (See treat_for_presentation.txt )

14 A theoretical breakthrough Fillmore (1975): ‘An alternative to checklist theories of meaning’: “measure meaning, not by statements of necessary and sufficient conditions, but by resemblance to a prototype.”

15 The CPA method CPA: Corpus Pattern Analysis (based on TNE: the Theory of Norms and Exploitations). 1. Create a sample concordance (KWIC index): –~ 300 examples of actual uses of the word –from a ‘balanced’ corpus (i.e. general language) [We use the British National Corpus, 100 million words, and the Associated Press Newswire for , 150 million words] –or a ‘relevant’ corpus (i.e. domain-specific) 2. Classify every line in the sample, by context. 3. Take further samples if necessary. 4. Use introspection to interpret data, but not to create data.

16 Sample from a concordance incessant noise and bustle had abated. It seemed everyone was up after dawn the storm suddenly abated. Ruth was there waiting when Thankfully, the storm had abated, at least for the moment, and storm outside was beginning to abate, but the sky was still ominous Fortunately, much of the fuss has abated, but not before hundreds of, after the shock had begun to abate, the vision of Benedict's been arrested and street violence abated, the ruling party stopped he declared the recession to be abating, only hours before the ‘ soft landing’ in which inflation abates but growth continues moderate the threshold. The fearful noise abated in its intensity, trailed ability. However, when the threat abated in 1989 with a ceasefire in bag to the ocean. The storm was abating rapidly, the evening sky ferocity of sectarian politics abated somewhat between 1931 and storm. By dawn the weather had abated though the sea was still angry the dispute showed no sign of abating yesterday. Crews in

17 The Importance of Context “You shall know a word by the company it keeps” – J. R. Firth. Corpus analysis can show what company our words keep. Frequency alone is not enough: “of the” is a frequent collocation – but not interesting! “storm abated” is less frequent, but more interesting. Contrasted with “threat abated”, it can give a different meaning to the verb abate. So we need a way of measuring the statistical significance of collocations.

18 Mutual information A way of computing the statistical significance of two words in collocation. Compares the actual co-occurrence of two words in a corpus with chance. Church and Hanks (1990): ‘Word Association Norms, Mutual Information, and Lexicography’ in Computational Linguistics 16:1. Kilgarriff and Tugwell (2001): “Waspbench, word sketches” ACL, Toulouse.

19 In CPA, every line in the sample must be classified The classes are: Norms Exploitations Alternations Names (Midnight Storm: name of a horse, not a storm) Mentions (to mention a word or phrase is not to use it) Errors (e.g. learned mistyped as leaned) Unassignables

20 Methodological precepts Focus on the probable. On the basis of what has happened, predict what is likely to happen. Don’t look for necessary conditions for the meaning of a word. (There aren’t any.) Don’t try to account for all possibilities. Use prototype theory to account for probable meanings. Don’t ever say “all and only”.

21 Norms How the words are normally used. Descriptive (not prescriptive). Norms are discovered by systematic, empirical Corpus Pattern Analysis (CPA).

22 Exploitations People don’t just say the same thing, using the same words repeatedly. They also exploit norms in order to say new things, or in order to say old things in new and interesting ways. Exploitations include metaphor, ellipsis, word creation, and other figures of speech. Exploitations are a form of creativity.

23 Example of a CPA verb norm abate/VBNC frequency: 185 in 100m. 1. [[Event = Storm]] abate [NO OBJ](11%) 2. [[Event = Flood]] abate [NO OBJ] (4%) 3. [[Event = Fever]] abate [NO OBJ] (2%) 4. [[Event = Problem]] abate [NO OBJ] (44%) 5. [[Emotion = Negative]] abate [NO OBJ] (20%) 6. [[Person | Action]] abate [[State = Nuisance]] (19%) (Domain: Law)

24 [[Event = Storm]] abate [ NO OBJ ] dry kit and go again.The storm abates a bit, and there is no problem in ling.Thankfully, the storm had abated, at least for the moment, and the sting his time until the storm abated but also endangering his life, Ge storm outside was beginning to abate, but the sky was still ominously o bag to the ocean.The storm was abating rapidly, the evening sky clearin after dawn the storm suddenly abated.Ruth was there waiting when the h t he wait until the rain storm abated.She had her way and Corbett went storm.By dawn the weather had abated though the sea was still angry, i lcolm White, and the gales had abated: Yachting World had performed the he rain, which gave no sign of abating, knowing her options were limite n became a downpour that never abated all day.My only protection was ned away, the roar of the wind abating as he drew the hatch closed behi

25 [[Event = Problem]] abate [ NO OBJ ] ‘ soft landing’ in which inflation abates but growth continues modera Fortunately, much of the fuss has abated, but not before hundreds of the threshold. The fearful noise abated in its intensity, trailed incessant noise and bustle had abated. It seemed everyone was up ability. However, when the threat abated in 1989 with a ceasefire in the Intifada shows little sign of abating. It is a cliche to say that h he declared the recession to be abating, only hours before the pub he ferocity of sectarian politics abated somewhat between 1931 and 1 been arrested and street violence abated, the ruling party stopped b the dispute showed no sign of abating yesterday. Crews in

26 [[Emotion = Negative]] abate [ NO OBJ ] (selected lines) ript on the table and his anxiety abated a little.This talented, if that her initial awkwardness had abated # for she had never seen a es if some inner pressure doesn't abate.He wanted to play at the fun Baker in the foyer and my anxiety abated.He seemed disappointed and hained at the time.When the agony abated he was prepared to laugh wi self; the pain gradually began to abate spontaneously, a great relie ght, after the shock had begun to abate, the vision of Benedict's sn y calm, control it!) The fear was abating, the trembling beginning t his dark eyes. That fear did not abate when, briefly, he halted. For AN EXPLOITATION OF THIS NORM: isapproval, his kindlier feelings abated, to be replaced by a resurg (“kindlier feelings” are normally positive, not negative.)

27 Part of the lexical set [[Event = Problem]] as subject of ‘abate’ From BNC: {fuss, problem, tensions, fighting, price war, hysterical media clap-trap, disruption, slump, inflation, recession, the Mozart frenzy, working-class militancy, hostility, intimidation, ferocity of sectarian politics, diplomatic isolation, dispute, …} From AP: {threat, crisis, fighting, hijackings, protests, tensions, violence, bloodshed, problem, crime, guerrilla attacks, turmoil, shelling, shooting, artillery duels, fire-code violations, unrest, inflationary pressures, layoffs, bloodletting, revolution, murder of foreigners, public furor, eruptions, bad publicity, outbreak, jeering, criticism, infighting, risk, crisis, …} (All these are kinds of problem.)

28 Part of the lexical set [[Emotion = Negative]] as subject of ‘abate’ From BNC: {anxiety, fear, emotion, rage, anger, fury, pain, agony, feelings,…} From AP: {rage, anger, panic, animosity, concern, …}

29 A domain-specific norm: [[Person | Action]] abate [[Nuisance]] ( DOMAIN: Law. Register: Jargon ) o undertake further measures to abate the odour, and in Attorney Ge us methods were contemplated to abate the odour from a maggot farm s specified are insufficient to abate the odour then in any further as the inspector is striving to abate the odour, no action will be t practicable means be taken to abate any existing odour nuisance, ll equipment to prevent, and or abate odour pollution would probabl rmation alleging the failure to abate a statutory nuisance without t I would urge you at least to abate the nuisance of bugles forthw way that the nuisance could be abated, but the decision is the dec otherwise the nuisance is to be abated.They have full jurisdiction ion, or the local authority may abate the nuisance and do whatever

30 Lexical sets are contrastive Different lexical sets generate different meanings. Lexical sets are not like syntactic structures. In principle, lexical sets are open-ended, but most have high-value best examples. In practice, a lexical set may have only 1 or 2 members, e.g. take a {look | glance}. No certainties in word meaning; only probabilities. … but probabilities can be measured.

31 A more complicated verb: ‘take’ 61 phrasal verb patterns, e.g. [[Person]] take [[Garment]] off [[Plane]] take off [[Human Group]] take [[Business]] over 105 light verb uses (with specific objects), e.g. [[Event]] take place [[Person]] take {photograph | photo | snaps | picture} [[Person]] take {the plunge} 18 ‘heavy verb’ uses, e.g. [[Person]] take [[PhysObj]] [Adv[Direction]] 13 adverbial patterns, e.g. [[Person]] take [[TopType]] seriously [[Human Group]] take [[Child]] {into care} TOTAL: 204, and growing (but slowly)

32 A fine distinction: ‘take + place’ [[Event]] take {place}: A meeting took place. [[Person 1]] take {[[Person 2]]’s place}: –George took Bill’s place. [[Person]] take {[COREF POSDET] place}: Wilkinson took his place among the greats of the game. [[Person=Competitor]] take {[ORDINAL] place}: The Germans took first place.

33 Noun norms Norms for nouns are different in kind from norms for verbs. Adjectives and prepositions are more like verbs than nouns. A different analytical apparatus is required for nouns. Prototype statements for each true noun can be derived from a corpus. Examples for the noun ‘storm’ follow.

34 Storm (literal meaning) (1) WHAT DO STORMS DO? Storms blow. Storms rage. Storms lash coastlines. Storms batter ships and places. Storms hit ships and places. Storms ravage coastlines and other places.

35 Storm (literal meaning) (2) BEGINNING OF A STORM: Before it begins, a storm is brewing, gathering, or impending. There is often a calm or a lull before a storm. Storms last for a certain period of time. Storms break. END OF A STORM: Storms abate. Storms subside. Storms pass.

36 Storm (literal meaning) (3) WHAT HAPPENS TO PEOPLE IN A STORM? People can weather, survive, or ride (out) a storm. Ships and people may get caught in a storm.

37 Storm (literal meaning) (4) WHAT KINDS OF STORMS ARE THERE? There are thunder storms, electrical storms, rain storms, hail storms, snow storms, winter storms, dust storms, sand storms, tropical storms… Storms are violent, severe, raging, howling, terrible, disastrous, fearful, ferocious…

38 Storm (literal meaning) (5) OTHER ASSOCIATIONS OF ‘STORM’: Storms, especially snow storms, may be heavy. An unexpected storm is a freak storm. The centre of a storm is called the eye of the storm. A major storm is remembered as the great storm (of [[Year]]). STORMS ARE ASSOCIATED WITH rain, wind, hurricanes, gales, and floods.

39 Why norms are important These statements about storm are stereotypical (prototypical). They are corpus-derived (empirically well founded). They represent central and typical beliefs about storms: the ‘meaning’ of storm. This is where syntax meets semantics.

40 Other kinds of exploitation besides metaphor: e.g. ellipsis I hazarded various Stuartesque destinations like Bali, Florence, Istanbul …. (The prototype is: [[Person]] hazard {guess}) Other collocates: hazard a comment, hazard a definition “Perhaps it’s in the kitchen,” she hazarded.

41 Goals of CPA To create an inventory of semantically motivated syntagmatic patterns (normal clauses), focused on verbs, in various languages. To write programs for creating lexical sets by computational cluster analysis. To collect evidence for the principles that govern the exploitations of norms.

42 Conclusions Meanings are best associated with normal contexts, rather than words in isolation. Normal contexts correlate statistically significant collocations in different clause roles. The whole language system is probabilistic and preferential. The probabilities can be analysed in a new kind of dictionary – a syntagmatic dictionary.