Thesauruses for Natural Language Processing Adam Kilgarriff Lexicography MasterClass and University of Brighton.

1 Thesauruses for Natural Language Processing Adam Kilgarriff Lexicography MasterClass and University of Brighton

2

3 Outline Definition Uses for NLP WASPS thesaurus web thesauruses Argument: words not word senses Evaluation proposals Cyborgs

4 What is a thesaurus? a resource that groups words according to similarity

5 Manual and automatic Manual –Roget, WordNets, many publishers Automatic –Sparck Jones (1960s), Grefenstette (1994), Lin (1998), Lee (1999) –aka distributional –two words are similar if they occur in same contexts Are they comparable?

6 Thesauruses in NLP sparse data

7 Thesauruses in NLP sparse data does x go with y? –don’t know, they have never been seen together New question: does x+friends go with y+friends –indirect evidence for x and y –thesaurus tells us who friends are –“backing off”

8 Relevant in: Parsing –PP-attachment –conjunction scope Bridging anaphors Text cohesion Word sense disambiguation (WSD) Speech understanding Spelling correction

9 Speech understanding He’s as headstrong as an alleg***** in the upwaters of the Yangtze

10 Speech understanding He’s as headstrong as an alleg***** in the upwaters of the Yangtze allegory?

11 Speech understanding He’s as headstrong as an alleg***** in the upwaters of the Yangtze allegory? alligator?

12 Speech understanding He’s as headstrong as an alleg***** in the upwaters of the Yangtze allegory? in upwaters? No alligator? in upwaters? No

13 Speech understanding He’s as headstrong as an alleg***** in the upwaters of the Yangtze allegory? in upwaters? No alligator? in upwaters? No allegory+friends in upwaters? No alligator+friends in upwaters? Yes
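
The "x+friends" backing-off idea can be sketched as follows. This is a minimal illustration only: the co-occurrence counts, the neighbour lists and the function names are all hypothetical, not taken from the talk or from any real thesaurus.

```python
from collections import Counter

# Hypothetical co-occurrence counts (word, context word) from some corpus.
PAIR_COUNTS = Counter({
    ("crocodile", "upwaters"): 3,
    ("caiman", "upwaters"): 2,
    ("fable", "chapter"): 5,
})

# Hypothetical thesaurus: each word mapped to its nearest neighbours ("friends").
FRIENDS = {
    "alligator": ["crocodile", "caiman"],
    "allegory": ["fable", "parable"],
    "upwaters": ["headwaters"],
}


def with_friends(word):
    """Return the word together with its thesaurus neighbours."""
    return [word] + FRIENDS.get(word, [])


def backed_off_count(x, y):
    """Use the direct count of (x, y) if the pair was seen;
    otherwise back off to the count summed over x+friends and y+friends."""
    direct = PAIR_COUNTS[(x, y)]
    if direct > 0:
        return direct
    return sum(PAIR_COUNTS[(a, b)]
               for a in with_friends(x)
               for b in with_friends(y))


for candidate in ("allegory", "alligator"):
    print(candidate, backed_off_count(candidate, "upwaters"))
```

Neither candidate has direct evidence with "upwaters", so the decision rests entirely on the neighbours supplied by the thesaurus.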

14 PP-attachment investigate stromatolite with microscope/speckles –microscope: verb attachment –speckles: noun attachment inspect jasper with spectrometer –which?

15 PP attachment (cont) compare frequencies of –

16 PP attachment (cont) compare frequencies of – both zero? Try –

17 Conjunction scope Compare –old boots and shoes –old boots and apples

18 Conjunction scope Compare –old boots and shoes –old boots and apples Are the shoes old?

19 Conjunction scope Compare –old boots and shoes –old boots and apples Are the shoes old? Are the apples old?

20 Conjunction scope Compare –old boots and shoes –old boots and apples Are the shoes old? Are the apples old? Hypothesis: –wide scope only when words are similar

21 Conjunction scope Compare –old boots and shoes –old boots and apples Are the shoes old? Are the apples old? Hypothesis: –wide scope only when words are similar hard problem: thesaurus might help
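
The hypothesis on this slide could be tested with a rule as simple as the sketch below. The neighbour sets and function names are hypothetical, and treating "similar" as "one word appears in the other's neighbour list" is an assumption made for illustration.

```python
# Hypothetical thesaurus entries: each word mapped to a set of near neighbours.
NEIGHBOURS = {
    "boots": {"shoes", "trainers", "sandals"},
    "shoes": {"boots", "sandals", "socks"},
    "apples": {"pears", "oranges", "plums"},
}


def are_similar(a, b):
    """Treat two words as similar if either lists the other as a neighbour."""
    return b in NEIGHBOURS.get(a, set()) or a in NEIGHBOURS.get(b, set())


def modifier_scope(noun1, noun2):
    """Guess the scope of a modifier in 'ADJ noun1 and noun2':
    'wide' (modifies both nouns) only when the nouns are similar."""
    return "wide" if are_similar(noun1, noun2) else "narrow"


print(modifier_scope("boots", "shoes"))   # old [boots and shoes]
print(modifier_scope("boots", "apples"))  # [old boots] and apples
```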

22 Bridging anaphor resolution –Maria bought a large apple. The fruit was red and crisp. fruit and apple co-refer

23 Bridging anaphor resolution –Maria bought a large apple. The fruit was red and crisp. fruit and apple co-refer How to find co-referring terms?

24 Text cohesion words on same theme –same segment change in theme of words –new segment same theme: same thesaurus class

25 Word Sense Disambiguation (WSD) pike: fish or weapon –We caught a pike this afternoon probably no direct evidence for –catch pike probably is direct evidence for –catch {pike,carp,bream,cod,haddock,…}
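
A rough sketch of the class-based evidence described on this slide: the verb is scored against every member of each candidate class rather than against "pike" alone. The counts, the class memberships and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical verb-object counts; note that ("catch", "pike") is unseen.
VERB_OBJECT_COUNTS = Counter({
    ("catch", "carp"): 4,
    ("catch", "cod"): 7,
    ("catch", "bream"): 2,
    ("wield", "spear"): 3,
    ("wield", "halberd"): 1,
})

# Hypothetical thesaurus classes containing the ambiguous word "pike".
SENSE_CLASSES = {
    "fish": ["pike", "carp", "bream", "cod", "haddock"],
    "weapon": ["pike", "spear", "halberd", "lance"],
}


def class_evidence(verb, sense):
    """Sum the verb-object counts over every member of the sense class."""
    return sum(VERB_OBJECT_COUNTS[(verb, noun)] for noun in SENSE_CLASSES[sense])


def disambiguate(verb):
    """Pick the class with the most evidence for this verb.
    On a tie, max() returns the first class in SENSE_CLASSES."""
    scores = {sense: class_evidence(verb, sense) for sense in SENSE_CLASSES}
    return max(scores, key=scores.get)


print(disambiguate("catch"))
print(disambiguate("wield"))
```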

26 WordNet, Roget widely used for all the above

27 The WASPS thesaurus –credit: David Tugwell –EPSRC grant K8931 POS-tag, lemmatise and parse the BNC (100M words) Find all grammatical relations – 70 million triples

28 WASPS thesaurus (cont) Similarity: – one point similarity between beer and wine count all points of similarity between all pairs of words weight according to frequencies –product of MI: Lin (1998)
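
One possible reading of this slide, sketched in code: every (relation, collocate) context shared by two words is a point of similarity, and each point is weighted by the product of the two words' mutual-information scores for that context. This is an assumption based on the slide's wording, not a confirmed description of the WASPS implementation, and it omits the normalisation in Lin's (1998) full measure. The triples and all names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical grammatical-relation triples: (word, relation, collocate).
TRIPLES = [
    ("beer", "object_of", "drink"),
    ("beer", "object_of", "drink"),
    ("wine", "object_of", "drink"),
    ("beer", "object_of", "brew"),
    ("wine", "object_of", "pour"),
    ("beer", "object_of", "pour"),
    ("book", "object_of", "read"),
    ("book", "object_of", "read"),
    ("book", "object_of", "write"),
    ("letter", "object_of", "write"),
    ("letter", "object_of", "read"),
]

triple_counts = Counter(TRIPLES)
word_rel_counts = Counter((w, r) for w, r, _ in TRIPLES)
rel_coll_counts = Counter((r, c) for _, r, c in TRIPLES)
rel_counts = Counter(r for _, r, _ in TRIPLES)


def mutual_information(word, rel, coll):
    """Log ratio of the observed triple count to the count expected
    if word and collocate were independent given the relation."""
    joint = triple_counts[(word, rel, coll)]
    if joint == 0:
        return 0.0
    expected = (word_rel_counts[(word, rel)] * rel_coll_counts[(rel, coll)]
                / rel_counts[rel])
    return math.log(joint / expected)


def features(word):
    """Map each (relation, collocate) context of `word` to its MI,
    keeping only contexts with a positive score."""
    scores = {}
    for (w, rel, coll) in triple_counts:
        if w == word:
            mi = mutual_information(w, rel, coll)
            if mi > 0:
                scores[(rel, coll)] = mi
    return scores


def similarity(word1, word2):
    """Sum, over shared contexts, the product of the two MI scores."""
    f1, f2 = features(word1), features(word2)
    shared = f1.keys() & f2.keys()
    return sum(f1[ctx] * f2[ctx] for ctx in shared)


print(similarity("beer", "wine"))
print(similarity("beer", "book"))
```

Ranking every other word by `similarity` to a target word would give that word's nearest-neighbour list.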

29 Word Sketches one-page summary of a word’s grammatical and collocational behaviour demo: http://wasps.itri.bton.ac.uk the Sketch Engine –input any corpus –generate word sketches and thesaurus –just available now

30 Nearest neighbours to zebra

31 Nearest neighbours zebra: giraffe buffalo hippopotamus rhinoceros gazelle antelope cheetah hippo leopard kangaroo crocodile deer rhino herbivore tortoise primate hyena camel scorpion macaque elephant mammoth alligator carnivore squirrel tiger newt chimpanzee monkey

32

33 exception: exemption limitation exclusion instance modification restriction recognition extension contrast addition refusal example clause indication definition error restraint reference objection consideration concession distinction variation occurrence anomaly offence jurisdiction implication analogy pot: bowl pan jar container dish jug mug tin tub tray bag saucepan bottle basket bucket vase plate kettle teapot glass spoon soup box can cake tea packet pipe cup

34 VERBS measure determine assess calculate decrease monitor increase evaluate reduce detect estimate indicate analyse exceed vary test observe define record reflect affect obtain generate predict enhance alter examine quantify relate adjust boil simmer heat cook fry bubble cool stir warm steam sizzle bake flavour spill soak roast taste pour dry wash chop melt freeze scald consume burn mix ferment scorch soften

35 ADJECTIVES hypnotic haunting piercing expressionless dreamy monotonous seductive meditative emotive comforting expressive mournful healing indistinct unforgettable unreadable harmonic prophetic steely sensuous soothing malevolent irresistible restful insidious expectant demonic incessant inhuman spooky pink purple yellow red blue white pale brown green grey coloured bright scarlet orange cream black crimson thick soft dark striped thin golden faded matching embroidered silver warm mauve damp

36 Nearest neighbours crane: winch heron tractor truck swan winch: crane mast rigging pump tractor swan: heron crane gull tern curlew heron: tern gull swan crane flamingo

37 no clustering (tho’ could be done) no hierarchy (tho’ could be done) rhythm all on the web: http://wasps.itri.bton.ac.uk –registration required

38 The web an enormous linguist’s playground –Computational Linguistics Special Issue, Kilgarriff and Grefenstette (eds) 29 (3) (coming soon)

39 Google sets http://labs.google.com/sets Input: zebra giraffe buffalo

40 Google sets http://labs.google.com/sets Input: zebra giraffe buffalo kudu hyena impala leopard hippo waterbuck elephant cheetah eland

41 Google sets http://labs.google.com/sets Input: harbin beijing nanking

42 Google sets http://labs.google.com/sets Input: harbin beijing nanking Output: shanghai chengdu guangzhou hangzhou changchun zhejiang kunming dalian jinan fuzhou

43 Tree structure Roget –all human knowledge as tree structure –1000 top categories subdivisions –like this »etc

44 Directories and thesauruses Yahoo, http://www.yahoo.com Open directory project, http://dmoz.org –all human activity as tree structure plus corpus at every node –gather corpus, identify domain vocabulary Gonzalo and colleagues, Madrid, CL Special Issue Agirre and colleagues, ‘topic signatures’

45 Words and word senses automatic thesauruses –words

46 Words and word senses automatic thesauruses –words manual thesauruses –simple hierarchy is appealing –homonyms

47 Words and word senses automatic thesauruses –words manual thesauruses –simple hierarchy is appealing –homonyms –“aha! objects must be word senses”

48 Problems Theoretical Practical

49 Theoretical

50

51

52 Wittgenstein Don’t ask for the meaning, ask for the use

53 Practical

54 Problems Practical –a thesaurus is a tool –if the tool organises word senses you must do WSD before you can use it –WSD: state of the art, optimal conditions: 80%.

55 Problems Practical –a thesaurus is a tool –if the tool organises word senses you must do WSD before you can use it –WSD: state of the art, optimal conditions: 80% “To use this tool, first replace one fifth of your input with junk”

56 Avoid word senses

57 This word has three meanings/senses

58 Avoid word senses This word has three meanings/senses This word has three kinds of use –well founded –empirical –we can study it

59 sorry, Roget

60 sorry, AI

61 AI model for NLP: –NLP turns text into meanings –AI reasons over meanings –word meanings are concepts in an ontology –a Roget-like thesaurus is (to a good approximation) an ontology –Guarino: “cleansing” WordNet If a thesaurus groups words in their various uses (not meanings) –not the sort of thing AI can reason over

62 sorry, AI “linguistic expressions prompt for meanings rather than express meanings” –Fauconnier and Turner 2003 It would be nice if … But …

63 Evaluation manual thesauruses –not done automatic thesauruses: attempts –pseudo-disambiguation (Lee 1999) –with ref to manual ones (Lin 1998)
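
A sketch of a pseudo-disambiguation test in its general form: each held-out pair is presented alongside a confounder, and the model is scored on how often it prefers the real pair. The details of Lee's (1999) setup are not given in the slides, so the tie handling, the toy scores and all names here are assumptions.

```python
def pseudo_disambiguation_accuracy(test_items, score):
    """test_items: list of (noun, true_verb, confounder_verb).
    score(noun, verb) returns the model's plausibility for the pair.
    A tie counts as half a correct answer (an assumption of this sketch)."""
    correct = 0.0
    for noun, true_verb, confounder in test_items:
        s_true = score(noun, true_verb)
        s_conf = score(noun, confounder)
        if s_true > s_conf:
            correct += 1.0
        elif s_true == s_conf:
            correct += 0.5
    return correct / len(test_items)


# Hypothetical plausibility scores standing in for a thesaurus-based model.
TOY_SCORES = {("beer", "drink"): 0.9, ("book", "read"): 0.8}


def toy_score(noun, verb):
    """Look up a toy score; unseen pairs score 0.0."""
    return TOY_SCORES.get((noun, verb), 0.0)


TEST_ITEMS = [
    ("beer", "drink", "read"),
    ("book", "read", "drink"),
    ("wine", "drink", "write"),  # unseen noun: both scores are 0.0, a tie
]

print(pseudo_disambiguation_accuracy(TEST_ITEMS, toy_score))
```

Swapping `toy_score` for scores derived from different thesauruses would allow them to be compared on the same test items.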

64 Task-based evaluation

65 Parsing –PP-attachment –conjunction scope Bridging anaphors Text cohesion Word sense disambiguation (WSD) Speech understanding Spelling correction

66 What is performance at the task –with no thesaurus –with Roget –with WordNet –with WASPS

67 Plans set up evaluation tasks theseval web-based thesaurus –Open Directory Project hierarchies campaign

68 Cyborgs Robots: will they take over? Rod Brooks’s answer: –Wrong question: greatest advances are in what the human+computer ensemble can do

69 Cyborgs A creature that is partly human and partly machine –Macmillan English Dictionary

70

71

72

73

74 Cyborgs and the Information Society The thesaurus-making agent is part human (for precision), part computer (for recall).

75 Summary: Thesauruses for NLP Definition Uses for NLP WASPS thesaurus web thesauruses Argument: words not word senses Evaluation proposals Cyborgs

76 Thesaurus-makers of the future?

