
1 Parsing
See:
– R Garside, G Leech & A McEnery (eds), Corpus Annotation, London (1997): Longman, chapters 11 (Bateman et al) and 12 (Garside & Rayson)
– G Kennedy, An Introduction to Corpus Linguistics, London (1998): Longman, pp
– CF Meyer, English Corpus Linguistics, Cambridge (2002): CUP, pp
– R Mitkov (ed), The Oxford Handbook of Computational Linguistics, Oxford (2003): OUP, chapter 4 (Kaplan)
– J Allen, Natural Language Understanding (2nd ed), (1994): Addison Wesley

2 Parsing
POS tags give information about the individual words and their internal form (eg sing vs plur, tense of verb)
An additional level of information concerns the way the words relate to each other
– the overall structure of each sentence
– the relationships between the words
This can be achieved by parsing the corpus

3 Parsing – overview
What sort of information does parsing add?
What are the difficulties relating to parsing?
How is parsing done?
Parsing and corpora
– partial parsing, chunking
– stochastic parsing
– treebanks

4 Structural information
Parsing adds information about sentence structure and constituents
Allows us to see what constructions words enter into
– eg transitivity, passivization, argument structure for verbs
Allows us to see how words function relative to each other
– eg what words can modify / be modified by other words

5
[S[N Nemo_NP1,_, [N the_AT killer_NN1 whale_NN1 N],_, [Fr[N who_PNQS N][V 'd_VHD grown_VVN [J too_RG big_JJ [P for_IF [N his_APP$ pool_NN1 [P on_II [N Clacton_NP1 Pier_NNL1 N]P]N]P]J]V]Fr]N],_, [V has_VHZ arrived_VVN safely_RR [P at_II [N his_APP$ new_JJ home_NN1 [P in_II [N Windsor_NP1 [ safari_NN1 park_NNL1 ]N]P]N]P]V]._. S]
Nemo, the killer whale, who’d grown too big for his pool on Clacton Pier, has arrived safely at his new home in Windsor safari park.
[the slide also shows this parse drawn as a tree diagram]

6
[the parse tree from slide 5 again]
given this verb, what kinds of things can be subject?

7
[the parse tree from slide 5 again]
verb with adjective complement: what verbs can participate in this construction? with what adjectives? any other constraints?

8
[the parse tree from slide 5 again]
verb with PP complement: what verbs with what prepositions? any constraints on noun? (see the sketch below)
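Questions like these can be put to a parsed corpus in code. The following is a minimal sketch using NLTK and its bundled Penn Treebank sample (a different annotation scheme from the CLAWS-style parse above); it collects (verb, preposition) pairs from VPs that contain a PP daughter.

    import nltk
    from nltk.corpus import treebank   # 10% WSJ sample; requires nltk.download('treebank')

    # One way to ask "what verbs occur with what prepositions?": walk every VP,
    # pick out its verb and any PP daughter, and record the (verb, preposition) pair.
    pairs = set()
    for tree in treebank.parsed_sents():
        for vp in tree.subtrees(lambda t: t.label() == 'VP'):
            kids = [k for k in vp if isinstance(k, nltk.Tree)]
            verb = next((k[0] for k in kids if k.label().startswith('VB')), None)
            for pp in (k for k in kids if k.label().startswith('PP')):
                prep = next((w for w, tag in pp.pos() if tag == 'IN'), None)
                if verb and prep:
                    pairs.add((verb.lower(), prep.lower()))

    print(sorted(pairs)[:10])   # a sample of the verb–preposition pairs found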

9 Parsing: difficulties
Besides lexical ambiguities (usually resolved by the tagger), language can be structurally ambiguous
– global ambiguities due to ambiguous words and/or alternative possible combinations
– local ambiguities, especially due to attachment ambiguities, and other combinatorial possibilities
– sheer weight of alternatives available in the absence of (much) knowledge

10 Global ambiguities
Individual words can be ambiguous as to category
In combination with each other this can lead to ambiguity:
– Time flies like an arrow
– Gas pump prices rose last time oil stocks fell

11 Local ambiguities
Structure of individual constituents may be given, but how they fit together can be in doubt
Classic example of PP attachment
– The man saw the girl with the telescope
The man saw the girl …
– in the park
– with a statue of the general
– on a horse
– with a sword
– on a stand
– in the morning
– with a red dress
– with a telescope
Many other attachments are potentially ambiguous
– relative clauses, adverbs, parentheticals, etc
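The two readings of the classic example can be demonstrated with a small chart parser. The toy grammar below is invented purely for illustration; it licenses both attachments, so the parser returns two trees for the same sentence.

    import nltk

    # A toy grammar that allows a PP to attach either to the VP
    # (instrument reading) or to the object NP (the girl has the telescope)
    grammar = nltk.CFG.fromstring("""
        S  -> NP VP
        NP -> Det N | Det N PP
        VP -> V NP | V NP PP
        PP -> P NP
        Det -> 'the'
        N -> 'man' | 'girl' | 'telescope'
        V -> 'saw'
        P -> 'with'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse('the man saw the girl with the telescope'.split()):
        print(tree)   # two trees: PP attached under VP, and under the object NP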

12 Difficulties
Broad coverage is necessary for parsing corpora of real text
Long sentences:
– structures are very complex
– ambiguities proliferate
It is difficult (even for a human) to verify whether a parse is correct
– because it is complex
– because it may be genuinely ambiguous

13 How to parse
Traditionally (in linguistics):
– hand-written grammar
– usually narrow coverage
– linguists are interested in theoretical issues regarding syntax
Even in computational linguistics
– interest is (was?) in parsing algorithms
In either case, grammars typically used a small set of categories (N, V, Adj etc)

14 Lack of knowledge
Humans are very good at disambiguating
In fact they rarely even notice the ambiguity
Usually, only one reading “makes sense”
They use a combination of
– linguistic knowledge
– common-sense (real-world) knowledge
– contextual knowledge
Only the first is available to computers, and then only in a limited way

15 Parsing corpora
Using a tagger as a front-end changes things:
– Richer set of grammatical categories, which reflect some morphological information
– Hand-written grammars are more difficult, though, because many generalisations are lost (eg we now need many more rules for NP)
– Disambiguation done by the tagger in some sense pre-empts work that you might have expected the parser to do
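As a rough illustration of the first point, here is the kind of fine-grained output a tagger front-end hands to the parser. This is a minimal sketch using NLTK's default tagger (Penn tagset, not the CLAWS tags used on slide 5); the exact output may vary slightly.

    import nltk   # requires nltk.download('averaged_perceptron_tagger')

    # Fine-grained categories (NNP vs NN, VBZ vs VBN ...) rather than
    # the bare N/V/Adj of a traditional hand-written grammar
    print(nltk.pos_tag('Nemo has arrived safely at his new home'.split()))
    # roughly: [('Nemo', 'NNP'), ('has', 'VBZ'), ('arrived', 'VBN'), ('safely', 'RB'),
    #           ('at', 'IN'), ('his', 'PRP$'), ('new', 'JJ'), ('home', 'NN')]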

16 Parsing corpora
Impact of the broad-coverage requirement:
– Broad coverage means that many more constructions are covered by the grammar
– This increases ambiguity massively
Partial parsing may be sufficient for some needs
The availability of corpora permits (and encourages) a stochastic approach

17 Partial parsing
Identification of constituents (noun phrases, verb groups, PPs) is often quite robust …
Only fitting them together can be difficult
Although some information is lost, identifying “chunks” can be useful
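A minimal chunking sketch, assuming NLTK: the regular-expression grammar below (an invented pattern, not a standard one) marks NP chunks over POS tags and simply leaves everything else unattached.

    import nltk

    # NP chunk = optional determiner, any adjectives, one or more nouns
    chunker = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN.*>+}")

    tagged = [('the', 'DT'), ('killer', 'NN'), ('whale', 'NN'),
              ('has', 'VBZ'), ('arrived', 'VBN'), ('safely', 'RB')]
    print(chunker.parse(tagged))
    # (S (NP the/DT killer/NN whale/NN) has/VBZ arrived/VBN safely/RB)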

18 Stochastic parsing
Like ordinary parsing, but competing rules are assigned a probability score
Scores can be used to compare (and favour) alternative parses
Where do the probabilities come from?
S → NP VP .80
S → aux NP VP .15
S → VP .05
NP → det n .20
NP → det adj n .35
NP → n .20
NP → adj n .15
NP → pro .10
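Rule scores like these translate directly into a probabilistic CFG. In the sketch below the S and NP probabilities are the ones from the slide, while the VP rule and the lexical rules are invented just to make the grammar complete enough to run.

    import nltk

    grammar = nltk.PCFG.fromstring("""
        S  -> NP VP      [0.80]
        S  -> Aux NP VP  [0.15]
        S  -> VP         [0.05]
        NP -> Det N      [0.20]
        NP -> Det Adj N  [0.35]
        NP -> N          [0.20]
        NP -> Adj N      [0.15]
        NP -> Pro        [0.10]
        VP -> V NP       [1.0]
        Det -> 'the' [1.0]
        Adj -> 'old' [1.0]
        N -> 'man' [0.5] | 'boat' [0.5]
        Pro -> 'he' [1.0]
        V -> 'saw' [1.0]
        Aux -> 'did' [1.0]
    """)

    # The Viterbi parser returns the single most probable parse
    parser = nltk.ViterbiParser(grammar)
    for tree in parser.parse('he saw the old boat'.split()):
        print(tree)   # the tree is printed together with its probability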

19 Where do the probabilities come from?
1) Use a corpus of already parsed sentences: a “treebank”
– Best-known example is the Penn Treebank (Marcus et al)
– Available from the Linguistic Data Consortium
– Based on the Brown corpus + 1m words of Wall Street Journal text + the Switchboard corpus
– Count all occurrences of each rule variation (e.g. each NP rule) and divide by the total number of NP rules
– Very laborious, so of course it is done automatically
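With NLTK's treebank sample this counting takes a few lines; a sketch (the exact probabilities depend on the sample used):

    import nltk
    from nltk.corpus import treebank   # 10% WSJ sample shipped with NLTK

    # Maximum-likelihood estimation: count every production in the treebank,
    # then divide each rule's count by the total for its left-hand side
    productions = [p for t in treebank.parsed_sents() for p in t.productions()]
    grammar = nltk.induce_pcfg(nltk.Nonterminal('S'), productions)

    for prod in grammar.productions(lhs=nltk.Nonterminal('NP'))[:5]:
        print(prod)   # e.g. NP -> DT NN [...]; the numbers depend on the sample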

20 Where do the probabilities come from?
2) Create your own treebank from your own corpus
– Easy if all sentences are unambiguous: just count the (successful) rule applications
– When there are ambiguities, rules which contribute to the ambiguity have to be counted separately and weighted

21 Where do the probabilities come from?
3) Learn them as you go along
– Again, this assumes some way of identifying the correct parse in case of ambiguity
– Each time a rule is successfully used, its probability is adjusted
– You have to start with some estimated probabilities, e.g. all equal
– Does need human intervention, otherwise rules become self-fulfilling prophecies
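A sketch of the idea, assuming parses arrive as nltk.Tree objects; the add-alpha smoothing and the n_rules_for_lhs parameter are invented stand-ins for the initial "all equal" estimates.

    from collections import Counter

    # "Learn as you go": every time a (human-verified) parse is accepted,
    # bump the counts of the rules it used; probabilities are recomputed
    # from the counts on demand
    rule_counts = Counter()
    lhs_counts = Counter()

    def record_parse(tree):
        """tree: an nltk.Tree for the parse judged correct."""
        for prod in tree.productions():
            rule_counts[prod] += 1
            lhs_counts[prod.lhs()] += 1

    def rule_prob(prod, alpha=1.0, n_rules_for_lhs=10):
        # add-alpha smoothing stands in for the initial "all equal" estimates;
        # without it, unseen rules get zero probability and the counts become
        # a self-fulfilling prophecy
        return (rule_counts[prod] + alpha) / (lhs_counts[prod.lhs()] + alpha * n_rules_for_lhs)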

22 Bootstrapping the grammar
Start with a basic grammar, possibly written by hand, with all rules equally probable
Parse a small amount of text, then correct it manually
– this may involve correcting the trees and/or changing the grammar
Learn new probabilities from this small treebank
Parse another (similar) amount of text, then correct it manually
Adjust the probabilities based on the old and new trees combined
Repeat until the grammar stabilizes
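The loop might look like the following sketch: the hypothetical helper correct() stands in for the manual correction step, the convergence test is left out (it iterates over a fixed sequence of batches instead), and the grammar is assumed to cover every input sentence.

    import nltk

    def correct(tree):
        return tree   # placeholder: a human would fix the tree here

    def bootstrap(batches, grammar, start=nltk.Nonterminal('S')):
        """Iteratively grow a treebank and re-estimate the rule probabilities."""
        bank = []
        for batch in batches:                      # batch: a list of token lists
            parser = nltk.ViterbiParser(grammar)
            trees = (next(parser.parse(sent), None) for sent in batch)
            bank.extend(correct(t) for t in trees if t is not None)
            # re-estimate over the old and new trees combined
            productions = [p for t in bank for p in t.productions()]
            grammar = nltk.induce_pcfg(start, productions)
        return grammar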

23 Treebanks – some examples (with links)
Penn Treebank – perhaps the best known
– Wall Street Journal corpus, Brown corpus; >1m words
International Corpus of English (ICE)
Lancaster Parsed Corpus and Lancaster-Leeds treebank
– parsed excerpts from LOB; 140k and 45k words resp.
Susanne Corpus, Christine Corpus, Lucy Corpus
– related to the Lancaster corpora; developed by Geoffrey Sampson
Verbmobil treebanks
– parallel treebanks (Eng, Ger, Jap) used in a speech MT project
LinGO Redwoods – HPSG-based parsing of Verbmobil data
Multi-Treebank
– parses in various frameworks of 60 sentences
The PARC 700 Dependency Bank
– LFG parses of 700 sentences also found in the Penn treebank
CHILDES
– Brown Eve corpus of children’s speech samples with dependency annotation

