
1 CS60057 Speech & Natural Language Processing - Autumn 2007 - Lecture 16, 5 September 2007

2 Parsing with features
- We need to constrain the rules in CFGs, for example:
  - to coerce agreement within and between constituents
  - to pass features around
  - to enforce subcategorisation constraints
- Features can be easily added to our grammars
- Later we'll see that feature bundles can completely replace constituents

3 Parsing with features
- Rules can stipulate values, or placeholders (variables) for values
- Features can be used within the rule, or passed up via the mother nodes
- Example: subject-verb agreement
  S → NP VP   [if NP and VP agree in number]
  - the number of the NP depends on the noun and/or determiner
  - the number of the VP depends on the verb
  S → NP(num=X) VP(num=X)
  NP(num=X) → det(num=X) n(num=X)
  VP(num=X) → v(num=X) NP(num=?)
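As a rough illustration (not part of the original slides), the num=X sharing above can be mimicked with a tiny bottom-up check in Python; the rule handling and lexicon here are invented for the example.

    # Minimal sketch: number agreement via a shared feature variable.
    # The lexicon below is illustrative, not from the lecture.
    LEXICON = {
        "this": ("det", {"num": "sg"}),
        "these": ("det", {"num": "pl"}),
        "man": ("n", {"num": "sg"}),
        "men": ("n", {"num": "pl"}),
    }

    def unify_num(a, b):
        """Return the shared num value, or None if the values clash."""
        if a is None:   # unbound variable: take the other value
            return b
        if b is None:
            return a
        return a if a == b else None   # clash signals failure

    def build_np(det_word, n_word):
        """NP(num=X) -> det(num=X) n(num=X): succeed only if num unifies."""
        _, det_feats = LEXICON[det_word]
        _, n_feats = LEXICON[n_word]
        num = unify_num(det_feats.get("num"), n_feats.get("num"))
        return None if num is None else ("NP", {"num": num})

    print(build_np("this", "man"))   # ('NP', {'num': 'sg'})
    print(build_np("these", "man"))  # None: agreement failure blocks the parse

An unbound value, like the one for the determiner the on the next slides, would simply be a missing num entry here, so it combines with either value.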

4 Declarative nature of features
The same rules can be used in various ways:
- To build an NP only if det and n agree (bottom-up)
- When generating an NP, to choose an n which agrees with the det, if working left-to-right (top-down)
- To show that the num value for an NP comes from its components (percolation)
- To ensure that the num value is correctly set when generating an NP (inheritance)
- To block ill-formed input
Rule: NP(num=X) → det(num=X) n(num=X)
Lexicon: this det(num=sg), these det(num=pl), the det(num=?), man n(num=sg), men n(num=pl)
Example tree: [NP(num=sg) det(num=sg) this, n(num=sg) man]; n(num=pl) men cannot combine with this, since determiner and noun would disagree.

5 Use of variables
- Unbound (unassigned) variables (i.e. variables with a free value): the can combine with any value for num
- Unification means that here the num value for the is set to sg
Rule: NP(num=X) → det(num=X) n(num=X)
Lexicon: this det(num=sg), these det(num=pl), the det(num=?), man n(num=sg), men n(num=pl)
Example tree: [NP(num=sg) det(num=?) the, n(num=sg) man]

6 Parsing with features
- Features must be compatible
- The formalism should allow features to remain unspecified
- Feature mismatch can be used to block false analyses, and to disambiguate, e.g. they can fish ~ he can fish ~ he cans fish
- The formalism may have attribute-value pairs, or rely on argument position, e.g.
  NP(_num,_sem) → det(_num) n(_num,_sem)
  an = det(sing)
  the = det(_num)
  man = n(sing,hum)

7 Parsing with features
- Using features to impose subcategorization constraints:
  VP → v          e.g. dance
  VP → v NP       e.g. eat
  VP → v NP NP    e.g. give
  VP → v PP       e.g. wait (for)
- With features:
  VP(_num) → v(_num,intr)
  VP(_num) → v(_num,trans) NP
  VP(_num) → v(_num,ditrans) NP NP
  VP(_num) → v(_num,prepobj(_case)) PP(_case)
  PP(_case) → prep(_case) NP
- Lexicon:
  dance = v(plur,intr)
  dances = v(sing,intr)
  danced = v(_num,intr)
  waits = v(sing,prepobj(for))
  for = prep(for)

8 Parsing with features (top-down)
Grammar:
  S → NP(_num) VP(_num)
  NP(_num) → det(_num) n(_num)
  VP(_num) → v(_num,intrans)
  VP(_num) → v(_num,trans) NP(_1)
Lexicon: the = det(_num), man = n(sing), shot = v(sing,trans), those = det(pl), elephants = n(pl)
Parsing "the man shot those elephants" top-down:
- Expand S → NP(_num) VP(_num), then NP(_num) → det(_num) n(_num)
- the = det(_num) leaves _num unbound; man = n(sing) instantiates _num = sing
- With _num = sing, try VP(sing) → v(sing,intrans): shot = v(sing,trans), so this fails
- Try VP(sing) → v(sing,trans) NP(_1): shot = v(sing,trans) succeeds
- Expand NP(_1) → det(_1) n(_1); those = det(pl) and elephants = n(pl) instantiate _1 = pl

9 Feature structures
- Instead of attaching features to the symbols, we can parse with symbols made up entirely of attribute-value pairs: "feature structures"
- Can be used in the same way as seen previously
- In general: [ATTR1 VAL1, ATTR2 VAL2, ATTR3 VAL3]
- Values can be atomic, e.g. [CAT NP, NUMBER SG, PERSON 3] ...
- ... or embedded feature structures, e.g. [CAT NP, AGR [NUM SG, PERS 3]]

10 Unification; Probabilistic CFG (August 31, 2006)

11 Feature Structures
- A set of feature-value pairs
- No feature occurs in more than one feature-value pair (a partial function from features to values)
- Circular structures are prohibited

12 Structured Feature Structure
Part of a third-person singular NP:
  [CAT NP, AGR [NUM SG, PERS 3]]

13 Reentrant Feature Structure
- Two features can share a feature structure as value. This is not the same thing as their having equivalent values!
- Two distinct feature structure values: each feature has its own copy of the structure
- One shared value (reentrant feature structure): both features point to the same structure, marked with a coindexation tag such as [1]

14 They can be coindexed:
  [CAT S,
   HEAD [AGR [1] [NUM SG, PERS 3],
         SUBJ [AGR [1]]]]

15 Parsing with feature structures
- Grammar rules can specify assignments to, or equations between, feature structures
- Expressed as "feature paths", e.g. HEAD.AGR.NUM = SG
  [CAT S,
   HEAD [AGR [1] [NUM SG, PERS 3],
         SUBJ [AGR [1]]]]

16 Number Agreement (Subject-Verb, Determiner-Noun)
Grammar:
  s → np(Num), vp(Num).
  np(Num) → name(Num).
  np(Num) → n(Num).
  np(Num) → det(Num), n(Num).
  vp(Num) → v(Num).
  vp(Num) → v(Num), np.
  vp(Num) → v(Num), np, np.
Lexicon:
  name(sing) → [john].    name(sing) → [mary].
  det(sing) → [a].        det(sing) → [the].    det(plur) → [the].
  n(sing) → [dog].        n(plur) → [dogs].
  v(sing) → [snores].     v(plur) → [snore].
  v(sing) → [sees].       v(plur) → [see].

17 Verb Subcategorization (Transitivity)
Grammar:
  s → np, vp.
  np → name.
  np → n.
  np → det, n.
  vp → v(intrans).
  vp → v(trans), np.
  vp → v(ditrans), np, np.
Lexicon:
  name → [john].    name → [mary].
  det → [a].        det → [the].
  n → [dog].        n → [dogs].
  v(intrans) → [snores].
  v(trans) → [sees].
  v(ditrans) → [gives].

18 Subsumption (⊑)
- A (partial) ordering of feature structures
- Based on relative specificity
- The second structure carries less information, but is more general than (i.e. subsumes) the first

19 Subsumption
- A more abstract (less specific) feature structure subsumes an equally or more specific one.
- A feature structure F subsumes a feature structure G (F ⊑ G) if and only if:
  - For every feature x in F, F(x) ⊑ G(x) (where F(x) means the value of the feature x in F)
  - For all paths p and q in F such that F(p) = F(q), it is also the case that G(p) = G(q)
- An atomic feature structure neither subsumes nor is subsumed by another atomic feature structure.
- Variables subsume all other feature structures.
- Informally: F subsumes G (F ⊑ G) if every part of F subsumes the corresponding part of G.

20 Subsumption Example
Consider the feature structures (1), (2) and (3) shown as AVMs on the slide:
- (1) ⊑ (3) and (2) ⊑ (3), but there is no subsumption relation between (1) and (2)

21 Feature Structures in the Grammar
- We will incorporate feature structures and the unification process as follows:
  - All constituents (non-terminals) will be associated with feature structures.
  - Sets of unification constraints will be associated with grammar rules, and these constraints must be satisfied for a rule to apply.
- These attachments accomplish the following goals:
  - To associate feature structures with both lexical items and instances of grammatical categories.
  - To guide the composition of feature structures for larger grammatical constituents based on the feature structures of their component parts.
  - To enforce compatibility constraints between specified parts of grammatical constructions.

22 Feature unification
Feature structures can be unified if:
- They have like-named attributes that have the same value:
  [NUM SG] ⊔ [NUM SG] = [NUM SG]
- Like-named attributes that are "open" get the value assigned:
  [CAT NP, NUMBER ??, PERSON 3] ⊔ [NUMBER SG, PERSON 3] = [CAT NP, NUMBER SG, PERSON 3]

23 Feature unification
- Complementary features are brought together:
  [CAT NP, NUMBER SG] ⊔ [PERSON 3] = [CAT NP, NUMBER SG, PERSON 3]
- Unification is recursive:
  [CAT NP, AGR [NUM SG]] ⊔ [CAT NP, AGR [PERS 3]] = [CAT NP, AGR [NUM SG, PERS 3]]
- Coindexed structures are identical (not just copies): an assignment to one affects all
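The recursive behaviour above can be sketched with nested dictionaries in Python; this is an illustrative, non-destructive version (coindexing and the pointer-based merge come later in the lecture), and the names are invented for the example.

    # Illustrative sketch: non-destructive unification of nested attribute-value
    # structures represented as Python dicts. Atomic values must match exactly;
    # like-named embedded structures are unified recursively.
    FAIL = None

    def unify(f, g):
        if not isinstance(f, dict) or not isinstance(g, dict):
            return f if f == g else FAIL          # atomic values: must be identical
        result = dict(f)                          # copy so the inputs are untouched
        for attr, gval in g.items():
            if attr in result:
                sub = unify(result[attr], gval)   # recurse on shared attributes
                if sub is FAIL:
                    return FAIL
                result[attr] = sub
            else:
                result[attr] = gval               # complementary features are added
        return result

    print(unify({"CAT": "NP", "AGR": {"NUM": "SG"}},
                {"CAT": "NP", "AGR": {"PERS": "3"}}))
    # {'CAT': 'NP', 'AGR': {'NUM': 'SG', 'PERS': '3'}}
    print(unify({"NUM": "SG"}, {"NUM": "PL"}))    # None: unification fails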

24 Example
Rule:
  [CAT NP, AGR _1 ⊔ _2, SEM _3] → [CAT DET, AGR _1] [CAT N, AGR _2, SEM _3]
Lexicon:
  a:   [CAT DET, AGR [VAL INDEF, NUM SG]]
  the: [CAT DET, AGR [VAL DEF]]
  man: [CAT N, LEX "man", AGR [NUM SG], SEM HUM]

25 Example: "the man"
- the = [CAT DET, AGR [VAL DEF]] unifies with the DET daughter, so _1 = [VAL DEF]
- man = [CAT N, LEX "man", AGR [NUM SG], SEM HUM] unifies with the N daughter, so _2 = [NUM SG] and _3 = HUM
- Result for the NP "the man": [CAT NP, AGR [VAL DEF, NUM SG], SEM HUM]

26 Example: "a man"
- a = [CAT DET, AGR [VAL INDEF, NUM SG]] unifies with the DET daughter, so _1 = [VAL INDEF, NUM SG]
- man = [CAT N, LEX "man", AGR [NUM SG], SEM HUM] unifies with the N daughter, so _2 = [NUM SG] and _3 = HUM
- _1 ⊔ _2 = [VAL INDEF, NUM SG], so the NP "a man" is [CAT NP, AGR [VAL INDEF, NUM SG], SEM HUM]

27 Types and inheritance
- Feature typing allows us to constrain the possible values a feature can have, e.g. num = {sing, plur}
  - Allows grammars to be checked for consistency, and can make parsing easier
- We can express general "feature co-occurrence conditions" ...
- ... and "feature inheritance rules"
- Both of these allow us to make the grammar more compact

28 Co-occurrence conditions and inheritance rules
- General rules (beyond simple unification) which apply automatically, and so do not need to be stated (and repeated) in each rule or lexical entry
- Examples:
  [cat=np] → [num=??, gen=??, case=??]
  [cat=v, num=sg] → [tns=pres]
  [attr1=val1] → [attr2=val2]

29 Inheritance rules
- Inheritance rules can be overridden, e.g.
  [cat=n] → [gen=??, sem=??]
  sex = {male, female}, gen = {masc, fem, neut}
  [cat=n, gen=fem, sem=hum] → [sex=female]
  uxor: [cat=n, gen=fem, sem=hum]
  agricola: [cat=n, gen=fem, sem=hum, sex=male]

30 Unification in Linguistics
- Lexical Functional Grammar (if interested, see the PARGRAM project)
- GPSG, HPSG
- Construction Grammar
- Categorial Grammar

31 Unification
- Joins the contents of two feature structures into one new structure (the union of the two originals).
- The result is the most general feature structure subsumed by both.
- The unification of two contradictory feature structures is undefined (unification fails).

32 Unification Constraints
- Each grammar rule will be associated with a set of unification constraints:
  β0 → β1 ... βn   {set of unification constraints}
- Each unification constraint will be in one of the following forms:
  <βi feature path> = atomic value
  <βi feature path> = <βj feature path>

33 Unification Constraints: Example
For example, the rule
  S → NP VP   (only if the number of the NP is equal to the number of the VP)
will be represented as follows:
  S → NP VP
    <NP NUMBER> = <VP NUMBER>
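One way to picture such path equations in code (an illustrative sketch only; the rule format and helper names below are invented, not from the lecture or any parser library) is to store each constraint as a pair of feature paths and check them against the daughters' feature structures:

    # Sketch: a rule carries unification constraints as pairs of feature paths.
    # A path like ("NP", "NUMBER") means "the NUMBER feature of the NP daughter".
    def follow(path, structures):
        """Walk a feature path, starting from the named constituent."""
        value = structures[path[0]]
        for feat in path[1:]:
            value = value.get(feat)
            if value is None:
                return None
        return value

    def satisfies(rule_constraints, structures):
        """Check every <path1> = <path2> constraint against the daughters."""
        for p1, p2 in rule_constraints:
            v1, v2 = follow(p1, structures), follow(p2, structures)
            if v1 is not None and v2 is not None and v1 != v2:
                return False      # a hard clash blocks the rule
        return True

    # S -> NP VP with <NP NUMBER> = <VP NUMBER>
    constraints = [(("NP", "NUMBER"), ("VP", "NUMBER"))]
    daughters = {"NP": {"CAT": "NP", "NUMBER": "SG"},
                 "VP": {"CAT": "VP", "NUMBER": "SG"}}
    print(satisfies(constraints, daughters))   # True
    daughters["VP"]["NUMBER"] = "PL"
    print(satisfies(constraints, daughters))   # False

A full treatment would unify rather than merely compare, so that an unspecified value gets bound; the simple equality check is enough to show the blocking behaviour.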

34 Agreement Constraints
  S → NP VP
    <NP AGREEMENT> = <VP AGREEMENT>
  S → Aux NP VP
    <Aux AGREEMENT> = <NP AGREEMENT>
  NP → Det NOMINAL
    <Det AGREEMENT> = <NOMINAL AGREEMENT>
  NOMINAL → Noun
    <NOMINAL AGREEMENT> = <Noun AGREEMENT>
  VP → Verb NP
    <VP AGREEMENT> = <Verb AGREEMENT>

35 Agreement Constraints: Lexicon Entries
  Aux → does
    <Aux AGREEMENT NUMBER> = SG
    <Aux AGREEMENT PERSON> = 3
  Aux → do
    <Aux AGREEMENT NUMBER> = PL
  Det → these
    <Det AGREEMENT NUMBER> = PL
  Det → this
    <Det AGREEMENT NUMBER> = SG
  Verb → serves
    <Verb AGREEMENT NUMBER> = SG
    <Verb AGREEMENT PERSON> = 3
  Verb → serve
    <Verb AGREEMENT NUMBER> = PL
  Noun → flights
    <Noun AGREEMENT NUMBER> = PL
  Noun → flight
    <Noun AGREEMENT NUMBER> = SG

36 Head Features
- Certain features are copied from children to parent in feature structures. Example: the AGREEMENT feature of NOMINAL is copied into NP. The features of most grammatical categories are copied from one of the children to the parent.
- The child that provides the features is called the head of the phrase, and the features copied are referred to as head features.
- A verb is the head of a verb phrase, and a nominal is the head of a noun phrase. We may reflect this in the feature structures as follows:
  NP → Det NOMINAL
    <NP HEAD> = <NOMINAL HEAD>
  VP → Verb NP
    <VP HEAD> = <Verb HEAD>

37 Subcategorization Constraints
- For verb phrases, we can represent subcategorization constraints using three techniques:
  - Atomic subcat symbols
  - Encoding subcat lists as feature structures
  - Minimal rule approach (using lists directly)
- We may use any of these representations.

38 Atomic Subcat Symbols
  VP → Verb
    <VP HEAD> = <Verb HEAD>
    <VP HEAD SUBCAT> = INTRANS
  VP → Verb NP
    <VP HEAD> = <Verb HEAD>
    <VP HEAD SUBCAT> = TRANS
  VP → Verb NP NP
    <VP HEAD> = <Verb HEAD>
    <VP HEAD SUBCAT> = DITRANS
  Verb → slept
    <Verb HEAD SUBCAT> = INTRANS
  Verb → served
    <Verb HEAD SUBCAT> = TRANS
  Verb → gave
    <Verb HEAD SUBCAT> = DITRANS

39 Encoding Subcat Lists as Features
  Verb → gave
    <Verb HEAD SUBCAT FIRST CAT> = NP
    <Verb HEAD SUBCAT SECOND CAT> = NP
    <Verb HEAD SUBCAT THIRD> = END
  VP → Verb NP NP
    <VP HEAD> = <Verb HEAD>
    <VP HEAD SUBCAT THIRD> = END
- We are only encoding lists using positional features

40 Minimal Rule Approach
- In fact, we do not need symbols like SECOND and THIRD; they are just used to encode lists. We can use lists directly (as in LISP), e.g.
  <Verb HEAD SUBCAT FIRST CAT> = NP
  <Verb HEAD SUBCAT REST> = END

41 Subcategorization Frames for Lexical Entries
- We can represent the subcategorization frames of lexical entries (verbs); want, for example, has two frames:
  Verb → want
    <Verb HEAD SUBCAT FIRST CAT> = NP
  Verb → want
    <Verb HEAD SUBCAT FIRST CAT> = VP
    <Verb HEAD SUBCAT FIRST FORM> = INFINITIVE

42 Implementing Unification
- The representation we have used so far cannot support the destructive merger aspect of the unification algorithm.
- For this reason, we add additional fields (additional edges in the DAGs) to our feature structures.
- Each feature structure will consist of two fields:
  - Content field: can be NULL or contain an ordinary feature structure.
  - Pointer field: can be NULL or contain a pointer to another feature structure.
- If the pointer field of a DAG is NULL, the content field contains the actual feature structure to be processed.
- If the pointer field of a DAG is not NULL, the destination of that pointer is the actual feature structure to be processed.

43 Extended Feature Structures (example shown on the slide)

44 Extended DAG
(diagram on slide: the DAG for [NUM SG, PER 3], with each node carrying a content (C) field and a pointer (P) field, pointers initially NULL)

45 Unification of Extended DAGs
(diagram on slide: unifying the extended DAGs for [NUM SG] and [PER 3])

46 Unification of Extended DAGs (cont.)
(diagram on slide: after unification, one DAG's pointer field is set to point at the other, so both now describe [NUM SG, PER 3])

47 Unification Algorithm
function UNIFY(f1, f2) returns fstructure or failure
  f1real ← real contents of f1   /* dereference f1 */
  f2real ← real contents of f2   /* dereference f2 */
  if f1real is Null then
    { f1.pointer ← f2; return f2 }
  else if f2real is Null then
    { f2.pointer ← f1; return f1 }
  else if f1real and f2real are identical then
    { f1.pointer ← f2; return f2 }
  else if f1real and f2real are complex feature structures then
    { f2.pointer ← f1
      for each feature in f2real do
        { otherfeature ← find or create a feature corresponding to feature in f1real
          if UNIFY(feature.value, otherfeature.value) returns failure then
            return failure }
      return f1 }
  else return failure
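A runnable Python rendering of this content/pointer scheme might look as follows; it is a sketch of the idea rather than the lecture's exact data structures, and the class and function names are my own.

    # Sketch of the pointer-based (destructive) unification described above.
    # Each node has a content field (None, an atomic value, or a dict of
    # feature -> node) and a pointer field used for forwarding after a merge.
    class Node:
        def __init__(self, content=None):
            self.content = content      # None, atom, or {feature: Node}
            self.pointer = None

        def deref(self):
            """Follow pointer chains to the node that really holds the content."""
            node = self
            while node.pointer is not None:
                node = node.pointer
            return node

    FAIL = "failure"

    def unify(f1, f2):
        f1, f2 = f1.deref(), f2.deref()
        if f1 is f2:
            return f1
        if f1.content is None:                 # f1 is an unspecified structure
            f1.pointer = f2
            return f2
        if f2.content is None:
            f2.pointer = f1
            return f1
        if f1.content == f2.content and not isinstance(f1.content, dict):
            f1.pointer = f2                    # identical atoms
            return f2
        if isinstance(f1.content, dict) and isinstance(f2.content, dict):
            f2.pointer = f1                    # merge f2's features into f1
            for feat, val in f2.content.items():
                other = f1.content.setdefault(feat, Node())
                if unify(val, other) is FAIL:
                    return FAIL
            return f1
        return FAIL

    # [AGR [NUM SG]] unified with [AGR [PER 3]]
    a = Node({"AGR": Node({"NUM": Node("SG")})})
    b = Node({"AGR": Node({"PER": Node("3")})})
    assert unify(a, b) is not FAIL
    print(sorted(a.deref().content["AGR"].deref().content))  # ['NUM', 'PER']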

48 Example: Unification of Complex Structures (shown as DAGs on the slide)

49 Example: Unification of Complex Structures (cont.)
(diagram on slide: extended DAGs with content and pointer fields for structures involving SUB, AGR, NUM SG and PER 3)

50 Parsing with Unification Constraints
- Let us assume that we have augmented our grammar with sets of unification constraints.
- What changes do we need to make to a parser to make use of them?
  - Building feature structures and associating them with sub-trees
  - Unifying feature structures when sub-trees are created
  - Blocking ill-formed constituents

51 Earley Parsing with Unification Constraints
- What do we have to do to integrate unification constraints into the Earley parser?
  - Building feature structures (represented as DAGs) and associating them with states in the chart
  - Unifying feature structures as states are advanced in the chart
  - Blocking ill-formed states from entering the chart
- The main change is in the COMPLETER function of the Earley parser: this routine invokes the unifier to unify two feature structures.

52 Building Feature Structures
The rule
  NP → Det NOMINAL
    <Det AGREEMENT> = <NOMINAL AGREEMENT>
corresponds to a DAG in which the rule's constituents and the shared AGREEMENT value appear as edges (shown as a diagram on the slide).

53 Augmenting States with DAGs
- Each state gets an additional field containing the DAG that represents the feature structure corresponding to the state.
- When a rule is first used by PREDICTOR to create a state, the DAG associated with the state is simply the DAG retrieved from the rule.
- For example:
  S → • NP VP, [0,0], [], Dag1   where Dag1 is the feature structure corresponding to S → NP VP
  NP → • Det NOMINAL, [0,0], [], Dag2   where Dag2 is the feature structure corresponding to NP → Det NOMINAL

54 What does COMPLETER do?
- When COMPLETER advances the dot in a state, it should unify the feature structure of the newly completed state with the appropriate part of the feature structure of the state being advanced.
- If this unification is successful, the new state gets the result of the unification as its DAG and is entered into the chart. If it fails, nothing is entered into the chart.

55 A Completion Example
Parsing the phrase "that flight", after that has been processed:
  NP → Det • NOMINAL, [0,1], [S_Det], Dag1
A newly completed state:
  NOMINAL → Noun •, [1,2], [S_Noun], Dag2
To advance the dot in the NP state, the parser unifies the feature structure found under the NOMINAL feature of Dag2 with the feature structure found under the NOMINAL feature of Dag1.

56 Earley Parse
function EARLEY-PARSE(words, grammar) returns chart
  ENQUEUE((γ → • S, [0,0], dagγ), chart[0])
  for i from 0 to LENGTH(words) do
    for each state in chart[i] do
      if INCOMPLETE?(state) and NEXT-CAT(state) is not a part of speech then
        PREDICTOR(state)
      elseif INCOMPLETE?(state) and NEXT-CAT(state) is a part of speech then
        SCANNER(state)
      else
        COMPLETER(state)
    end
  end
  return chart

57 Predictor and Scanner
procedure PREDICTOR((A → α • B β, [i,j], dagA))
  for each (B → γ) in GRAMMAR-RULES-FOR(B, grammar) do
    ENQUEUE((B → • γ, [j,j], dagB), chart[j])
  end

procedure SCANNER((A → α • B β, [i,j], dagA))
  if B ∈ PARTS-OF-SPEECH(word[j]) then
    ENQUEUE((B → word[j] •, [j,j+1], dagB), chart[j+1])
  end

58 Completer and UnifyStates
procedure COMPLETER((B → γ •, [j,k], dagB))
  for each (A → α • B β, [i,j], dagA) in chart[j] do
    if newdag ← UNIFY-STATES(dagB, dagA, B) ≠ fails then
      ENQUEUE((A → α B • β, [i,k], newdag), chart[k])
  end

procedure UNIFY-STATES(dag1, dag2, cat)
  dag1cp ← CopyDag(dag1)
  dag2cp ← CopyDag(dag2)
  return UNIFY(FollowPath(cat, dag1cp), FollowPath(cat, dag2cp))
end
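The copying discipline in UNIFY-STATES can be illustrated with a small Python sketch (names invented; the unifier is the simple non-destructive dict version from the earlier sketch, inlined so this snippet stands alone). Copying before unifying is what keeps a failed unification from corrupting states already in the chart.

    import copy

    def unify(f, g):
        if not isinstance(f, dict) or not isinstance(g, dict):
            return f if f == g else None
        out = dict(f)
        for k, v in g.items():
            out[k] = unify(out[k], v) if k in out else v
            if out[k] is None:
                return None
        return out

    def follow_path(path, dag):
        for feat in path:
            dag = dag.get(feat, {})
        return dag

    def unify_states(dag_completed, dag_advancing, cat):
        d1 = copy.deepcopy(dag_completed)      # work on copies only
        d2 = copy.deepcopy(dag_advancing)
        merged = unify(follow_path([cat], d1), follow_path([cat], d2))
        if merged is None:
            return None                        # failure: the state is not enqueued
        d2[cat] = merged                       # the advanced state keeps the merged value
        return d2

    chart_dag = {"NP": {"AGREEMENT": {"NUMBER": "SG"}}}
    completed = {"NP": {"AGREEMENT": {"PERSON": "3"}}}
    print(unify_states(completed, chart_dag, "NP"))
    # the chart copy now carries both PERSON and NUMBER under AGREEMENT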

59 Enqueue
procedure ENQUEUE(state, chart-entry)
  if state is not subsumed by a state already in chart-entry then
    add state at the end of chart-entry
  end

60 Probabilistic Parsing (slides by Markus Dickinson, Georgetown University)

61 Motivation and Outline
- Previously, we used CFGs to parse with, but:
  - Some ambiguous sentences could not be disambiguated, and we would like to know the most likely parse
  - How do we get such grammars? Do we write them ourselves? Maybe we could use a corpus ...
- Where we're going:
  - Probabilistic Context-Free Grammars (PCFGs)
  - Lexicalized PCFGs
  - Dependency Grammars

62 Statistical Parsing
- Basic idea:
  - Start with a treebank, a collection of sentences with syntactic annotation, i.e., already-parsed sentences
  - Examine which parse trees occur frequently
  - Extract grammar rules corresponding to those parse trees, estimating the probability of each grammar rule from its frequency
- That is, we'll have a CFG augmented with probabilities

63 Using Probabilities to Parse
- P(T): probability of a particular parse tree
- P(T) = ∏_{n∈T} p(r(n)), i.e., the product of the probabilities of all the rules r used to expand each node n in the parse tree
- Example: given the probabilities on p. 449, compute the probability of the parse tree shown on the slide

64 Computing probabilities
- We have the following rules and probabilities (adapted from Figure 12.1):
  S → VP        0.05
  VP → V NP     0.40
  NP → Det N    0.20
  V → book      0.30
  Det → that    0.05
  N → flight    0.25
- P(T) = P(S → VP) * P(VP → V NP) * ... * P(N → flight)
       = 0.05 * 0.40 * 0.20 * 0.30 * 0.05 * 0.25 = 0.000015, or 1.5 × 10^-5
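A few lines of Python reproduce the arithmetic (the rule table is the one on the slide; representing rules as strings is just for the sketch):

    import math

    # Rule probabilities from the slide, keyed by the rule itself.
    rule_prob = {
        "S -> VP": 0.05, "VP -> V NP": 0.40, "NP -> Det N": 0.20,
        "V -> book": 0.30, "Det -> that": 0.05, "N -> flight": 0.25,
    }

    # The parse tree for "book that flight" uses each rule exactly once,
    # so P(T) is the product over the rules used to expand its nodes.
    rules_used = list(rule_prob)
    p_t = math.prod(rule_prob[r] for r in rules_used)
    print(p_t)   # approximately 1.5e-05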

65 Using probabilities
- So, the probability for that parse is 0.000015. What's the big deal?
  - Probabilities are useful for comparing with other probabilities
  - Whereas we couldn't decide between two parses using a regular CFG, we now can
- For example, TWA flights is ambiguous between two separate NPs (cf. I gave [NP John] [NP money]) and one NP:
  A: [book [TWA] [flights]]
  B: [book [TWA flights]]
- Probabilities allow us to choose parse B (see Figure 12.2)

66 Obtaining the best parse
- Call the best parse T(S), where S is your sentence: the tree with the highest probability, i.e.
  T(S) = argmax_{T ∈ parse-trees(S)} P(T)
- We can use the Cocke-Younger-Kasami (CYK) algorithm to calculate the best parse
  - CYK is a form of dynamic programming
  - CYK is a chart parser, like the Earley parser

67 The CYK algorithm
- Base case:
  - Add words to the chart
  - Store P(A → w_i) for every category A in the chart
- Recursive case (this is what makes it dynamic programming, because we only calculate B and C once):
  - Rules must be of the form A → B C, i.e., exactly two items on the RHS (Chomsky Normal Form, CNF)
  - Get the probability for A at this node by multiplying the probabilities for B and for C by P(A → B C): P(B) * P(C) * P(A → B C)
  - For a given A, only keep the maximum probability (again, this is dynamic programming)
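A compact probabilistic CYK sketch in Python (the toy CNF grammar below is invented for illustration and is not the textbook grammar):

    # Probabilistic CYK over a grammar in Chomsky Normal Form (CNF).
    # best[(i, j)][A] = highest probability of category A spanning words i..j-1.
    from collections import defaultdict

    lexical = {("V", "book"): 0.3, ("Det", "that"): 0.05, ("N", "flight"): 0.25}
    binary = {("S", "V", "NP"): 0.05, ("NP", "Det", "N"): 0.20}  # toy CNF rules

    def cyk(words):
        n = len(words)
        best = defaultdict(dict)
        for i, w in enumerate(words):                     # base case: the words
            for (cat, word), p in lexical.items():
                if word == w:
                    best[(i, i + 1)][cat] = p
        for span in range(2, n + 1):                      # recursive case
            for i in range(0, n - span + 1):
                j = i + span
                for k in range(i + 1, j):                 # split point
                    for (a, b, c), p_rule in binary.items():
                        p = best[(i, k)].get(b, 0.0) * best[(k, j)].get(c, 0.0) * p_rule
                        if p > best[(i, j)].get(a, 0.0):  # keep only the max for A
                            best[(i, j)][a] = p
        return best[(0, n)]

    print(cyk(["book", "that", "flight"]))  # {'S': 3.75e-05} (approximately)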

68 Problems with PCFGs
- It's still only a CFG, so dependencies on non-CFG information are not captured
  - e.g., pronouns are more likely to be subjects than objects:
    P[(NP → Pronoun) | NP = subj] >> P[(NP → Pronoun) | NP = obj]
- It ignores lexical information (statistics), which is usually crucial for disambiguation:
  (T1) America sent [[250,000 soldiers] [into Iraq]]
  (T2) America sent [250,000 soldiers] [into Iraq]
  send with an into-PP is always attached high (T2) in the PTB!
- To handle lexical information, we'll turn to lexicalized PCFGs

69 Lexicalized Grammars
- How is head information passed up in a syntactic analysis? e.g., VP[head *1] → V[head *1] NP
- If you follow this down all the way to the bottom of a tree, you wind up with a head word
- In some sense, we can say that Book that flight is not just an S, but an S rooted in book
- Thus, book is the headword of the whole sentence
- By adding headword information to nonterminals, we wind up with a lexicalized grammar

70 Lexicalized PCFGs
- Lexicalized parse trees:
  - Each PCFG rule in a tree is augmented to identify one RHS constituent as the head daughter
  - The headword for a node is set to the headword of its head daughter
  (tree on the slide, with nodes annotated [book] and [flight])

71 Incorporating Head Probabilities: Wrong Way
- Simply adding the headword w to a node won't work: the node A becomes A[w], e.g.
  P(A[w] → β | A) = Count(A[w] → β) / Count(A)
- The probabilities are too small, i.e., we don't have a big enough corpus to estimate them:
  VP(dumped) → VBD(dumped) NP(sacks) PP(into)   3 × 10^-10
  VP(dumped) → VBD(dumped) NP(cats) PP(into)    8 × 10^-11
- These probabilities are tiny, and others will never occur

72 Incorporating Head Probabilities: Right Way
- Previously, we conditioned on the mother node A: P(A → β | A)
- Now, we condition on the mother node and the headword of A, h(A): P(A → β | A, h(A))
- We're no longer conditioning simply on the mother category A, but on the mother category when h(A) is the head, e.g. P(VP → VBD NP PP | VP, dumped)

73 Calculating rule probabilities
- We'll write the probability more generally as P(r(n) | n, h(n)), where n = node, r = rule, and h = headword
- We calculate this by comparing how many times the rule occurs with h(n) as the headword versus how many times the mother/headword combination appears in total:
  P(VP → VBD NP PP | VP, dumped) = C(VP(dumped) → VBD NP PP) / Σ_β C(VP(dumped) → β)
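A quick illustration of that relative-frequency estimate in Python (the tiny list of observed expansions is made up for the example):

    from collections import Counter

    # Hypothetical observed (mother, headword, expansion) events from a treebank.
    events = [
        ("VP", "dumped", "VBD NP PP"),
        ("VP", "dumped", "VBD NP PP"),
        ("VP", "dumped", "VBD NP"),
        ("VP", "gave",   "VBD NP NP"),
    ]
    counts = Counter(events)

    def rule_prob(mother, head, expansion):
        """P(mother -> expansion | mother, head) as a relative frequency."""
        numerator = counts[(mother, head, expansion)]
        denominator = sum(c for (m, h, _), c in counts.items()
                          if m == mother and h == head)
        return numerator / denominator

    print(rule_prob("VP", "dumped", "VBD NP PP"))  # 2/3, about 0.667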

74 Adding info about word-word dependencies
- We want to take into account one other factor: the probability of a word being a headword (in a given context), P(h(n) = word | ...)
- We condition this probability on two things: (1) the category of the node n, and (2) the headword of the mother, h(m(n)):
  P(h(n) = word | n, h(m(n))), shortened as P(h(n) | n, h(m(n))), e.g. P(sacks | NP, dumped)
- What we're really doing is factoring in how words relate to each other
- We will call this a dependency relation later: here, sacks is dependent on dumped

75 Putting it all together
- See p. 459 for an example lexicalized parse tree for workers dumped sacks into a bin
- For rules r, category n, head h, mother m:
  P(T) = ∏_{n∈T} p(r(n) | n, h(n)) * p(h(n) | n, h(m(n)))
  - p(r(n) | n, h(n)): subcategorization information, e.g. P(VP → VBD NP PP | VP, dumped)
  - p(h(n) | n, h(m(n))): dependency information between words, e.g. P(sacks | NP, dumped)

76 Dependency Grammar
- Capturing relations between words (e.g. dumped and sacks) is moving in the direction of dependency grammar (DG)
- In DG, there is no such thing as constituency
- The structure of a sentence is purely the binary relations between words
- John loves Mary is represented as:
  LOVE → JOHN
  LOVE → MARY
  where A → B means that B depends on A

77 Dependency parsing

78 Dependency Grammar/Parsing
- A sentence is parsed by relating each word to the other words in the sentence which depend on it.
- The idea of dependency structure goes back a long way, to Pāṇini's grammar (c. 5th century BCE)
- Constituency is a new-fangled, 20th-century invention
- Modern work is often linked to the work of L. Tesniere (1959); it is the dominant approach in the "East" (Eastern bloc/East Asia)
- Among the earliest kinds of parsers in NLP, even in the US: David Hays, one of the founders of computational linguistics, built an early (first?) dependency parser (Hays 1962)

79 Dependency structure
- Words are linked from head (regent) to dependent
- Warning! Some people draw the arrows one way, some the other way (Tesniere has them point from head to dependent).
- Usually a fake ROOT ($$) is added so that every word is a dependent
- Example (diagram on slide): $$ Shaw Publishing acquired 30 % of American City in March

80 Relation between a CFG parse and a dependency parse
- A dependency grammar has a notion of a head; officially, CFGs don't
- But modern linguistic theory and all modern statistical parsers (Charniak, Collins, Stanford, ...) do, via hand-written phrasal "head rules":
  - The head of a Noun Phrase is a noun/number/adj/...
  - The head of a Verb Phrase is a verb/modal/...
- The head rules can be used to extract a dependency parse from a CFG parse (follow the heads), as sketched below.
- A phrase structure tree can be obtained from a dependency tree, but the dependents are flat (no VP!)
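A toy sketch of that extraction in Python (the mini head-rule table and the bracketed tree are invented for illustration, not the head rules of any particular parser):

    # Extract dependencies from a constituency tree by "following the heads".
    # A tree is (label, [children]); a leaf is (pos_tag, word).
    HEAD_RULES = {          # which child category supplies the head (toy table)
        "S": ["VP", "NP"], "VP": ["VBD", "VB"], "NP": ["NN", "NNP"], "PP": ["IN"],
    }

    def head_word(tree):
        label, children = tree
        if isinstance(children, str):                 # leaf: (tag, word)
            return children
        for cat in HEAD_RULES.get(label, []):         # first matching rule wins
            for child in children:
                if child[0] == cat:
                    return head_word(child)
        return head_word(children[0])                 # fallback: leftmost child

    def dependencies(tree, deps=None):
        label, children = tree
        deps = [] if deps is None else deps
        if isinstance(children, str):
            return deps
        head = head_word(tree)
        for child in children:
            dep = head_word(child)
            if dep != head:
                deps.append((head, dep))              # head -> dependent
            dependencies(child, deps)
        return deps

    tree = ("S", [("NP", [("NNP", "John")]),
                  ("VP", [("VBD", "saw"), ("NP", [("NNP", "Mary")])])])
    print(dependencies(tree))  # [('saw', 'John'), ('saw', 'Mary')]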

81 Propagating head words
A small set of rules propagates heads, e.g. for John Smith, the president of IBM, announced his resignation yesterday:
  (S(announced)
    (NP(Smith) (NNP John) (NNP Smith)
      (NP(president) (NP (DT the) (NN president))
        (PP(of) (IN of) (NP (NNP IBM)))))
    (VP(announced) (VBD announced)
      (NP(resignation) (PRP$ his) (NN resignation))
      (NP (NN yesterday))))

82 Extracted structure
- NB: not all dependencies are shown here
- Dependencies are inherently untyped, though some work, like Collins (1996), types them using the phrasal categories
- (diagram on slide: the dependencies extracted from the tree above, e.g. [John Smith] and [the president] of [IBM] attached via S → NP VP, and [his resignation], [yesterday] attached to announced via VP → VBD NP)

83 Dependency Conditioning Preferences
Sources of information:
- bilexical dependencies
- distance of dependencies: a word's dependents (adjuncts, arguments) tend to fall near it in the string
- valency of heads (number of dependents)
(The next six slides are based on slides by Jason Eisner and Noah Smith.)

84 Probabilistic dependency grammar: generative model
1. Start with left wall $
2. Generate root w0
3. Generate left children w-1, w-2, ..., w-ℓ from the FSA λ_w0
4. Generate right children w1, w2, ..., wr from the FSA ρ_w0
5. Recurse on each wi for i in {-ℓ, ..., -1, 1, ..., r}, sampling αi (steps 2-4)
6. Return α-ℓ ... α-1 w0 α1 ... αr
(diagram on slide: the head w0 with its left and right child sequences)
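A head-outward sampler for this kind of model could be sketched as follows (illustrative only: the simple child tables stand in for the λ/ρ automata, and all words and probabilities are invented):

    import random

    # Head-outward generative sketch: each head generates left and right child
    # sequences; STOP ends a sequence. The tables are made-up stand-ins for the
    # lambda/rho automata of the model.
    LEFT = {"eat": [("We", 0.7), ("STOP", 0.3)], "We": [("STOP", 1.0)],
            "sandwich": [("the", 0.8), ("STOP", 0.2)], "the": [("STOP", 1.0)]}
    RIGHT = {"eat": [("sandwich", 0.6), ("STOP", 0.4)], "We": [("STOP", 1.0)],
             "sandwich": [("STOP", 1.0)], "the": [("STOP", 1.0)]}

    def sample_children(table, head):
        children = []
        while True:
            words, probs = zip(*table[head])
            child = random.choices(words, probs)[0]
            if child == "STOP":
                return children
            children.append(child)

    def generate(head):
        """Recursively generate the subtree rooted at head, returning its yield."""
        left = [generate(c) for c in sample_children(LEFT, head)]
        right = [generate(c) for c in sample_children(RIGHT, head)]
        return [w for sub in reversed(left) for w in sub] + [head] + \
               [w for sub in right for w in sub]

    print(" ".join(generate("eat")))   # a sampled string such as "We eat the sandwich"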

85 Naïve Recognition/Parsing
(diagram on slide, using the example sentence "It takes two to tango": naïve recognition considers O(n^5) combinations of head, child and span indices, or O(n^5 N^3) with N nonterminals)

86 Dependency Grammar Cubic Recognition/Parsing (Eisner & Satta, 1999)
- Triangles: a span over words, where the tall side of the triangle is the head, the other side is a dependent, and no non-head word is expecting more dependents
- Trapezoids: a span over words, where the larger side is the head, the smaller side is a dependent, and the smaller side is still looking for dependents on its side of the trapezoid

87 Dependency Grammar Cubic Recognition/Parsing (Eisner & Satta, 1999)
(diagram on slide for "It takes two to tango")
- One trapezoid per dependency.
- A triangle is a head with some left (or right) subtrees.

88 Cubic Recognition/Parsing (Eisner & Satta, 1999)
(diagram on slide: triangles and trapezoids over spans [i,j] are combined at split points k, giving O(n^3) combinations; assembling the final goal item over [0,n] takes O(n) combinations)
- This gives O(n^3) dependency grammar parsing

89 Evaluation of Dependency Parsing
Simply use (labeled) dependency accuracy. Example (word positions 1-5):
  GOLD                      PARSED
  1 2 We       SUBJ         1 2 We       SUBJ
  2 0 eat      ROOT         2 0 eat      ROOT
  3 5 the      DET          3 4 the      DET
  4 5 cheese   MOD          4 2 cheese   OBJ
  5 2 sandwich SUBJ         5 2 sandwich PRED
  Accuracy = number of correct dependencies / total number of dependencies = 2 / 5 = 0.40 (40%)
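Scoring this is a simple comparison; the sketch below computes labeled accuracy for the example above (the tuple format (dependent position, head position, label) is just how this sketch encodes the table):

    # Labeled dependency accuracy: a dependency is correct only if both the head
    # attachment and the label match the gold standard.
    gold = [(1, 2, "SUBJ"), (2, 0, "ROOT"), (3, 5, "DET"),
            (4, 5, "MOD"), (5, 2, "SUBJ")]
    parsed = [(1, 2, "SUBJ"), (2, 0, "ROOT"), (3, 4, "DET"),
              (4, 2, "OBJ"), (5, 2, "PRED")]

    correct = sum(g == p for g, p in zip(gold, parsed))
    print(f"Labeled accuracy: {correct}/{len(gold)} = {correct / len(gold):.2f}")
    # Labeled accuracy: 2/5 = 0.40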

90 McDonald et al. (2005 ACL): Online Large-Margin Training of Dependency Parsers
- Builds a discriminative dependency parser, which can condition on rich features in that context
  - Best-known recent dependency parser; lots of recent dependency parsing activity connected with the CoNLL 2006/2007 shared tasks
- Doesn't/can't report constituent LP/LR, but evaluating correct dependencies:
  - Accuracy is similar to, but a fraction below, dependencies extracted from Collins: 90.9% vs. 91.4%; combining them gives 92.2% [all lengths]
  - Stanford parser on sentences of length up to 40: pure generative dependency model 85.0%, lexicalized factored parser 91.0%

91 McDonald et al. (2005 ACL): Online Large-Margin Training of Dependency Parsers
- The score of a parse is the sum of the scores of its dependencies
- Each dependency score is a linear function of features times weights
- Feature weights are learned by MIRA, an online large-margin algorithm (but you could think of it as using a perceptron or maxent classifier)
- Features cover:
  - Head and dependent word and POS separately
  - Head and dependent word and POS bigram features
  - Words between head and dependent
  - Length and direction of dependency
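The edge-factored scoring idea can be illustrated in a few lines of Python (feature templates and weights below are invented toy stand-ins, and MIRA training itself is not shown):

    # Edge-factored scoring: score(parse) = sum over dependencies of w . f(edge).
    weights = {
        "hw=eat|dw=We": 1.25, "hpos=V|dpos=PRON": 0.75,
        "hw=eat|dw=sandwich": 1.5, "hpos=V|dpos=N": 0.5, "dist=3": -0.25,
    }

    def edge_features(head, dep):
        hw, hpos, hi = head
        dw, dpos, di = dep
        return [f"hw={hw}|dw={dw}", f"hpos={hpos}|dpos={dpos}", f"dist={abs(hi - di)}"]

    def score_parse(dependencies):
        return sum(weights.get(feat, 0.0)
                   for head, dep in dependencies
                   for feat in edge_features(head, dep))

    eat = ("eat", "V", 2)
    deps = [(eat, ("We", "PRON", 1)), (eat, ("sandwich", "N", 5))]
    print(score_parse(deps))   # 1.25 + 0.75 + 1.5 + 0.5 - 0.25 = 3.75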

92 Extracting grammatical relations from statistical constituency parsers [de Marneffe et al., LREC 2006]
- Exploit the high-quality syntactic analysis done by statistical constituency parsers to get the grammatical relations [typed dependencies]
- Dependencies are generated by pattern-matching rules
- Example (diagram on slide): for Bills on ports and immigration were submitted by Senator Brownback, the constituency parse yields typed dependencies such as nsubjpass(submitted, Bills), auxpass(submitted, were), agent(submitted, Brownback), nn(Brownback, Senator), prep_on(Bills, ports) and cc_and(ports, immigration)

93 Evaluating Parser Output
- Dependency relations are also useful for comparing parser output to a treebank
- Traditional measures of parser accuracy:
  - Labeled bracketing precision: # correct constituents in parse / # constituents in parse
  - Labeled bracketing recall: # correct constituents in parse / # constituents in treebank parse
- There are known problems with these measures, so people are trying to use dependency-based measures instead: how many dependency relations did the parse get correct?

