CSA2050 Introduction to Computational Linguistics Parsing I.

1 CSA2050 Introduction to Computational Linguistics Parsing I

2 Apr 2008 -- MRCSA2050 - Parsing I2 Why Is Syntax Important? The presidential candidate who was extremely popular smiled broadly. How many presidential candidates are implied? 1 or >1?

3 Why Is Syntax Important? The presidential candidate, who was extremely popular, smiled broadly. How many presidential candidates are implied? 1 or >1?

4 Why Is Syntax Important? The presidential candidate, who was extremely popular, smiled broadly. The presidential candidate who was extremely popular smiled broadly. …because the syntactic structure has an important bearing on the meaning

5 PP Attachment The policemen saw a burglar with a gun. The policemen saw a burglar with a telescope. A PP can modify V or N. In the first case, it modifies N (the burglar has the gun). In the second, it modifies V (the seeing is done with the telescope).

6 PP modifies V The/D policemen/N saw/V the/D burglar/N with/P a/D telescope/N [Figure: parse tree with nodes S, NP, VP, PP, NP, in which the PP attaches to the VP]

7 PP modifies N The/D policemen/N saw/V a/D burglar/N with/P a/D gun/N [Figure: parse tree with nodes S, NP, VP, PP, NP, in which the PP attaches inside the object NP]

8 Issue In general, how can we determine whether a prepositional phrase modifies the preceding noun or verb? A knowledge-based approach must encode, for example, that burglars often have guns and that people can see things with a telescope, plus a lot of other things. The alternative is a statistical approach.

9 PP Attachment – Statistical Approach The Prepositional Phrase Attachment Corpus, included with NLTK as ppattach, makes it possible for us to study this question systematically. It is derived from the IBM-Lancaster Treebank of Computer Manuals and the Penn Treebank, and distils only the essential information about PP attachment.

10 Corpus Example Sentence Original: Four of the five surviving workers have asbestos-related diseases, including three with recently diagnosed cancer. Compare "including three with recently diagnosed cancer" versus "including three by adding two and one".

11 Distilled Information in Corpus Original: Four of the five surviving workers have asbestos-related diseases, including three with recently diagnosed cancer. ppattach corpus entry: 16 including three with cancer N, where 16 is the i/d, "including" the head verb, "three" the head of the object, "with" the preposition, "cancer" the head of the PP's NP, and the final field the attachment (N or V).
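The six-field layout described on this slide can be read with a few lines of Python. This is only a sketch: the class name `PPInstance` and its field names are labels chosen here for illustration, and the only input assumed is the whitespace-separated entry shown on the slide.

```python
from typing import NamedTuple


class PPInstance(NamedTuple):
    """One distilled entry; field names are illustrative labels, not a fixed API."""
    sent_id: str      # the i/d of the source sentence
    verb: str         # head verb
    noun1: str        # head of the object NP
    prep: str         # preposition
    noun2: str        # head of the PP's NP
    attachment: str   # "N" or "V"


def parse_line(line: str) -> PPInstance:
    """Split one whitespace-separated entry into its six fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    return PPInstance(*fields)


example = parse_line("16 including three with cancer N")
print(example)
```

Keeping the sentence i/d as a string avoids assuming anything about how identifiers are numbered in the corpus.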

12 Further examples 47830 allow visits between families N; 47830 allow visits on peninsula V; 42457 acquired interest in firm N; 42457 acquired interest in 1986 V; etc.

13 Minimal Pair Extraction NLTK contains primitives that allow us to extract minimal pairs where we hold NP1, PREP and NP2 constant and get different attachments depending on the verb, e.g. received (NP offer) (PP from group): V; rejected (NP offer (PP from group)): N. Compare "receive x from y" with "reject x".
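The idea of a minimal pair can be sketched without the corpus itself. The snippet below works on a handful of hand-typed records taken from the slides, stored as (verb, noun1, prep, noun2, attachment) tuples; that tuple layout and the function name `minimal_pairs` are assumptions made for illustration, not NLTK's own interface.

```python
from collections import defaultdict

# (verb, noun1, prep, noun2, attachment) records copied from the slides.
RECORDS = [
    ("received", "offer", "from", "group", "V"),
    ("rejected", "offer", "from", "group", "N"),
    ("allow", "visits", "between", "families", "N"),
    ("allow", "visits", "on", "peninsula", "V"),
]


def minimal_pairs(records):
    """Group by (noun1, prep, noun2); keep groups attested with both attachments."""
    table = defaultdict(lambda: defaultdict(set))
    for verb, noun1, prep, noun2, attachment in records:
        table[(noun1, prep, noun2)][attachment].add(verb)
    return {
        key: {att: sorted(verbs) for att, verbs in by_attachment.items()}
        for key, by_attachment in table.items()
        if len(by_attachment) > 1
    }


pairs = minimal_pairs(RECORDS)
for key, by_attachment in pairs.items():
    print(key, by_attachment)
```

Only the "offer from group" context survives the filter, because the two "allow visits" records differ in preposition and second noun and so do not hold the context constant.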

14 Why Syntactic Structure? It helps to make explicit how a sentence says who did what to whom. The fierce dog bit the man. The key idea is to identify noun phrases around the verb. We can do this in terms of sequences of POS tags, e.g. D JJ* N. But there are limitations to this approach. The child with a fierce dog bit the man. Here the child is doing the biting, but the D JJ* N sequence immediately preceding “bit” is still “a fierce dog”, so the pattern wrongly picks the fierce dog as the thing doing the biting.
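The limitation can be reproduced with a small experiment. The sketch below matches the pattern D JJ* N against the tag sequence that ends immediately before the verb; the hand-tagged sentences and the helper name `np_before_verb` are assumptions made for illustration.

```python
import re


def np_before_verb(tagged, verb):
    """Return the words of the D JJ* N sequence ending right before `verb`."""
    words = [word for word, _ in tagged]
    verb_index = words.index(verb)
    # One tag per token, each followed by a space, e.g. "D JJ N ".
    tag_string = "".join(tag + " " for _, tag in tagged[:verb_index])
    match = re.search(r"\bD (?:JJ )*N $", tag_string)
    if match is None:
        return None
    length = len(match.group().split())  # number of tokens the pattern covered
    return words[verb_index - length:verb_index]


simple = [("The", "D"), ("fierce", "JJ"), ("dog", "N"),
          ("bit", "V"), ("the", "D"), ("man", "N")]
tricky = [("The", "D"), ("child", "N"), ("with", "P"), ("a", "D"),
          ("fierce", "JJ"), ("dog", "N"), ("bit", "V"), ("the", "D"), ("man", "N")]

print(np_before_verb(simple, "bit"))
print(np_before_verb(tricky, "bit"))
```

For the second sentence the pattern returns "a fierce dog" even though the subject is "the child with a fierce dog", which is exactly the failure the slide describes.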

15 Constituency We could repair this with a more complex regular expression such as DT JJ* NN (IN DT JJ* NN)*. But this is defeated by: The seagull that attacked the child with the fierce dog bit the man. The basic problem is that we need a richer notion of constituency: how the words fit together to form a noun phrase.

16 Recursion – Central Embedding The dog barked

17 Recursion – Central Embedding The dog barked The dog the cat scratched barked

18 Recursion – Central Embedding The dog barked The dog the cat scratched barked The dog the cat the horse liked scratched barked.

19 Recursion – Central Embedding The dog barked The dog the cat scratched barked The dog the cat the horse liked scratched barked. The dog the cat the horse the man rode liked scratched barked.
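The pattern in these examples is that each new noun is inserted in the middle and its verb is added in mirror-image order. A short recursive sketch makes this explicit; the function names and the list `PAIRS` are illustrative choices, with the noun and verb pairings read off the slide.

```python
# Each noun is paired with the verb it is the subject of.
PAIRS = [("dog", "barked"), ("cat", "scratched"), ("horse", "liked"), ("man", "rode")]


def embed(pairs):
    """Build 'the N1 [the N2 ... V2] V1' by nesting the remaining pairs inside."""
    (noun, verb), rest = pairs[0], pairs[1:]
    middle = f" {embed(rest)}" if rest else ""
    return f"the {noun}{middle} {verb}"


def sentence(depth):
    """Return the sentence with `depth` levels of central embedding."""
    text = embed(PAIRS[:depth + 1])
    return text[0].upper() + text[1:]


for depth in range(4):
    print(sentence(depth))
```

The nesting means the first noun must be matched with the last verb, a dependency that grows without bound as depth increases.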

20 Chomsky Hierarchy

21 CFG Review A CFG is a 4-tuple (N, Σ, P, S), where: N is a set of non-terminal symbols (the category labels); Σ is a set of terminal symbols (e.g., lexical items); P is a set of productions of the form A → α, where A is a non-terminal and α is a string of symbols from (N ∪ Σ)* (i.e., a string of terminals and/or non-terminals); S is the start symbol. A derivation of a string from a non-terminal A in N is the result or trace of successively applying individual productions in P, starting from A.

22 Different Derivations for the Same Sentence Derivation 1: NP ⇒ Det N PP ⇒ the N PP ⇒ the dog PP ⇒ the dog P NP ⇒ the dog with NP ⇒ the dog with Det N ⇒ the dog with a N ⇒ the dog with a telescope. Derivation 2: NP ⇒ Det N PP ⇒ Det N P NP ⇒ Det N with NP ⇒ The N with NP ⇒ The N with a N
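Derivation 1 always rewrites the leftmost non-terminal. As a rough sketch, the trace can be regenerated by applying a chosen list of productions in that order; the function name `leftmost_derivation` and the explicit `STEPS` list are assumptions made here, with the productions themselves read off the slide.

```python
NONTERMINALS = {"NP", "PP", "Det", "N", "P"}

# Productions applied in Derivation 1, in order, as (LHS, RHS) pairs.
STEPS = [
    ("NP", ["Det", "N", "PP"]),
    ("Det", ["the"]),
    ("N", ["dog"]),
    ("PP", ["P", "NP"]),
    ("P", ["with"]),
    ("NP", ["Det", "N"]),
    ("Det", ["a"]),
    ("N", ["telescope"]),
]


def leftmost_derivation(start, steps):
    """Apply each production to the leftmost non-terminal; return every form."""
    form = [start]
    trace = [" ".join(form)]
    for lhs, rhs in steps:
        index = next(i for i, sym in enumerate(form) if sym in NONTERMINALS)
        if form[index] != lhs:
            raise ValueError(f"leftmost non-terminal is {form[index]}, not {lhs}")
        form[index:index + 1] = rhs  # replace the symbol by the rule's RHS
        trace.append(" ".join(form))
    return trace


trace = leftmost_derivation("NP", STEPS)
print(" => ".join(trace))
```

Derivation 2 uses the same productions but picks a different non-terminal to rewrite at each step, so both derivations correspond to the same tree.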

23 What Does Context Free Mean? The LHS of a rule is just one symbol, so it can be rewritten regardless of the symbols around it. Can have NP -> Det N. Cannot have X NP Y -> X Det N Y.

24 Grammar Symbols Symbols of the grammar fall into three categories: 1. Non Terminal Symbols 2. Terminal Symbols 3. Parts of Speech We will sometimes not distinguish between 2 and 3

25 Technical Aspects of CFGs Rules have the form LHS -> RHS. The LHS comprises exactly one NT symbol; the RHS is any combination of NT and T symbols. Finite state (type 3) grammars have tighter restrictions: the LHS comprises exactly one NT symbol; the RHS is a combination of T symbols with at most one NT. Right linear grammar: the NT must come at the extreme right. Left linear grammar: the NT must come at the extreme left.
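These restrictions can be checked mechanically, one rule at a time. The sketch below follows the standard definitions, in which a right-linear rule has its single non-terminal at the right end of the RHS (A -> a B) and a left-linear rule has it at the left end (A -> B a). The function name `classify` and the returned labels are illustrative choices.

```python
def classify(lhs, rhs, nonterminals):
    """Label a single rule LHS -> RHS, given as lists of symbols."""
    if len(lhs) != 1 or lhs[0] not in nonterminals:
        return "not context-free"
    nt_positions = [i for i, sym in enumerate(rhs) if sym in nonterminals]
    if not nt_positions:
        return "regular (terminals only)"
    if len(nt_positions) == 1:
        if nt_positions[0] == len(rhs) - 1:
            return "right-linear"   # non-terminal at the extreme right
        if nt_positions[0] == 0:
            return "left-linear"    # non-terminal at the extreme left
    return "context-free"


NT = {"NP", "Det", "N", "A", "B"}
print(classify(["NP"], ["Det", "N"], NT))
print(classify(["X", "NP", "Y"], ["X", "Det", "N", "Y"], NT))
print(classify(["A"], ["a", "B"], NT))
print(classify(["A"], ["B", "a"], NT))
```

A grammar as a whole is type 3 only if all of its rules are right-linear (or terminal-only), or all are left-linear (or terminal-only); mixing the two kinds can take it outside the regular languages, which this per-rule check does not detect.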

26 A Simple Grammar + Lexicon grammar: S → NP VP; NP → N; VP → V NP. lexicon: V → kicks; N → John; N → Bill. Parse tree for "John kicks Bill": [S [NP [N John]] [VP [V kicks] [NP [N Bill]]]]

27 Grammar versus Parser A grammar/lexicon defines a relation between sentences generated by the grammar and their respective syntactic structures. The grammar does not tell us how to actually go about discovering the structure of a sentence. A parsing algorithm is an effective procedure for carrying out that discovery. A parser implements a parsing algorithm. One example is recursive descent parsing.
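To make the grammar/parser distinction concrete, here is a minimal recursive descent parser in plain Python for the simple grammar and lexicon from slide 26. It is a sketch rather than the course's own implementation: the dictionary encoding of the grammar and the function names are assumptions made here, and the approach would loop forever on a left-recursive rule such as NP → NP PP.

```python
# Grammar and lexicon from slide 26; any symbol without an entry is a terminal.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"]],
    "VP": [["V", "NP"]],
    "V": [["kicks"]],
    "N": [["John"], ["Bill"]],
}


def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for each way `symbol` matches tokens from `pos`."""
    if symbol not in GRAMMAR:  # terminal: must equal the next token
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:  # try each production top-down
        for children, end in parse_sequence(rhs, tokens, pos):
            yield (symbol, *children), end


def parse_sequence(symbols, tokens, pos):
    """Yield (list_of_trees, next_pos) for matching `symbols` left to right."""
    if not symbols:
        yield [], pos
        return
    for tree, mid in parse(symbols[0], tokens, pos):
        for rest, end in parse_sequence(symbols[1:], tokens, mid):
            yield [tree] + rest, end


def parse_sentence(text):
    """Return all trees rooted in S that cover the whole sentence."""
    tokens = text.split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]


print(parse_sentence("John kicks Bill"))
print(parse_sentence("kicks John Bill"))
```

The grammar is pure data; the two mutually recursive functions are the parsing algorithm, and swapping in a different algorithm would leave `GRAMMAR` untouched.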

