Inside-outside reestimation from partially bracketed corpora F. Pereira and Y. Schabes ACL 30, 1992 CS730b김병창 NLP Lab. 1998. 10. 29.

NLP Lab., POSTECH

Contents
- Motivation
- Partially Bracketed Text
- Grammar Reestimation
  - The Inside-Outside Algorithm
  - The Extended Algorithm
  - Complexity
- Experimental Evaluation
  - Inferring the Palindrome Language
  - Experiments on the ATIS Corpus
- Conclusions and Further Work

Motivation I
- Very simple method for learning SCFGs [Charniak]
  - Generate all possible SCFG rules
  - Assign some initial probabilities
  - Run the training algorithm on a sample text (raw text)
  - Remove the rules with zero probability
- Difficulties in using SCFGs
  - Time complexity: O(n^3 |w|^3), where n is the number of nonterminals and w is the training sentence (cf. O(s^2 |w|) for training an HMM with s states)
  - Bad convergence properties: the larger the number of nonterminals, the worse
  - Linguistically reasonable structure is inferred only by chance

Motivation II
- Extension of the Inside-Outside algorithm
  - Infers grammars from a partially parsed corpus
  - Advantages
    - Constituent boundary information is reflected in the grammar
    - Reduced number of training iterations
    - Better time complexity

Partially Bracketed Text
- Example
  - (((VB (DT NNS (IN ((NN) (NN CD)))))) .)
  - (((List (the fares (for ((flight) (number 891)))))) .)
- Notations
  - Corpus C = { c | c = (w, B) }, where w is a string and B is a bracketing of w
  - w = w_1 w_2 ... w_i w_{i+1} ... w_j ... w_|w|
  - Span (i, j) delimits the substring _i w_j = w_{i+1} ... w_j
  - Consistent: no two spans in a bracketing overlap
  - Compatible: the union of two bracketings is consistent
  - Valid: a span is compatible with a bracketing
  - Span in a derivation α_0 ⇒ α_1 ⇒ ... ⇒ α_m = w
    - If j = m, the span of w_i in α_j is (i-1, i)
    - If j < m, α_j = β A γ, and α_{j+1} = β X_1 ... X_k γ, the span of A in α_j is (i_1, j_k), where (i_l, j_l) is the span of X_l in α_{j+1}
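The notions of consistent, compatible, and valid can be sketched in a few lines of Python. This is an illustrative sketch, not code from the paper: the function names are hypothetical, spans are (i, j) pairs over word boundaries, and two spans are taken to overlap when they cross without one containing the other.

```python
def overlaps(a, b):
    """True if spans a and b cross (neither nested nor disjoint)."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j

def is_consistent(bracketing):
    """A bracketing is consistent if no two of its spans overlap."""
    spans = list(bracketing)
    return not any(overlaps(a, b) for a in spans for b in spans)

def is_compatible(b1, b2):
    """Two bracketings are compatible if their union is consistent."""
    return is_consistent(set(b1) | set(b2))

def is_valid(span, bracketing):
    """A span is valid for a bracketing if it overlaps none of its spans."""
    return not any(overlaps(span, s) for s in bracketing)

# Spans read off the example "(((List (the fares (for ((flight) (number 891)))))) .)"
# with words indexed 0..7 ("List" = 0, "." = 7).
B = {(0, 8), (0, 7), (1, 7), (3, 7), (4, 7), (4, 5), (5, 7)}
print(is_consistent(B))      # the treebank bracketing itself
print(is_valid((1, 3), B))   # "the fares"
print(is_valid((2, 4), B))   # "fares for" crosses (3, 7)
```

Here "the fares" is valid although it is not bracketed in the corpus, which is what makes the bracketing partial: unbracketed spans are allowed as long as they cross no existing bracket.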

Grammar Reestimation
- Uses of the reestimation algorithm
  - Refining the parameter estimates of an SCFG derived by other means
  - Inferring a grammar from scratch
- Grammar inference
  - Given a set N of nonterminals and a set Σ of terminals, with n = |N| and t = |Σ|: N = {A_1, ..., A_n}, Σ = {b_1, ..., b_t}
  - A CNF SCFG over N, Σ has n^3 + nt probabilities
    - B_{p,q,r} on A_p → A_q A_r: n^3 of them
    - U_{p,m} on A_p → b_m: nt of them
- Meaning of rule probabilities: intuition of context-freeness
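The parameter count can be made concrete with a small sketch that builds the two tables B and U for a CNF SCFG. The random initialization, the function names, and the nested-list layout are assumptions made for illustration; the only constraint relied on is that, for each nonterminal A_p, the probabilities of all its rules sum to 1.

```python
import random

def num_parameters(n, t):
    """Number of rule probabilities in a full CNF SCFG: n^3 binary + n*t lexical."""
    return n ** 3 + n * t

def init_cnf_scfg(n, t, seed=0):
    """Return (B, U) with random probabilities, normalized per left-hand side.

    B[p][q][r] is the probability of A_p -> A_q A_r,
    U[p][m] is the probability of A_p -> b_m.
    """
    rng = random.Random(seed)
    B = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    U = [[0.0] * t for _ in range(n)]
    for p in range(n):
        # One weight per rule with left-hand side A_p, then normalize.
        weights = [rng.random() for _ in range(n * n + t)]
        total = sum(weights)
        for q in range(n):
            for r in range(n):
                B[p][q][r] = weights[q * n + r] / total
        for m in range(t):
            U[p][m] = weights[n * n + m] / total
    return B, U

B, U = init_cnf_scfg(n=5, t=2)
print(num_parameters(5, 2))    # initial grammar size in the palindrome experiment
print(num_parameters(15, 48))  # initial grammar size in the ATIS experiment
```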

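The Inside-Outside Algorithm
- Definition of inner (e) and outer (f) probabilities
  [Figure: two parse-tree diagrams. Inner probability: nonterminal i spanning words s..t. Outer probability: start symbol S over words 1..T, with nonterminal i covering s..t and the words 1..s-1 and t+1..T outside it.]
- Special thanks to ohwoog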
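The formulas on this slide were images and did not survive extraction. As a hedged reconstruction, the conventional definitions of the two quantities are given below. The reading of the indices (nonterminal i, words s through t, sentence length T, start symbol S) is an assumption based on the labels that remain from the figure.

```latex
% Inner probability: nonterminal i derives the words w_s ... w_t
e(s, t, i) = P\left( i \Rightarrow^{*} w_s \dots w_t \mid G \right)

% Outer probability: the start symbol S derives everything outside w_s ... w_t,
% with nonterminal i standing in for the span s..t
f(s, t, i) = P\left( S \Rightarrow^{*} w_1 \dots w_{s-1} \; i \; w_{t+1} \dots w_T \mid G \right)
```

The product e(s, t, i) f(s, t, i) is then the probability that the sentence is generated with nonterminal i covering exactly the words s through t, which is the quantity the reestimation step accumulates.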

The Extended Algorithm
- Compatible function: c(i, j) = 1 if span (i, j) is valid for the bracketing B, and 0 otherwise
- Extended algorithm
  - See Table 1.
  - Inside probabilities: (1), (2); the compatible function is used in (2).
  - Outside probabilities: (3), (4); the compatible function is used in (4).
  - Parameter reestimation: (5), (6); same as in the original algorithm.
- Stopping criterion
  - Stop when the decrease in the cross-entropy estimate becomes negligible.
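The role of the compatible function in the inside pass can be sketched as follows. This is a minimal pure-Python sketch, not the paper's Table 1: the function names and the nested-list layout (B[p][q][r] for A_p → A_q A_r, U[p][m] for A_p → b_m) are assumptions. The inside probability of a span is forced to 0 whenever the span crosses a bracket, and otherwise sums over all split points as in the original algorithm.

```python
def overlaps(a, b):
    """True if spans a and b cross (neither nested nor disjoint)."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j

def inside(words, B, U, bracketing=()):
    """Inside probabilities I[i][k][p] for span (i, k) and nonterminal A_p.

    words is a list of terminal indices; bracketing is a collection of spans.
    With an empty bracketing this reduces to the ordinary inside pass.
    """
    n, length = len(U), len(words)
    I = [[[0.0] * n for _ in range(length + 1)] for _ in range(length + 1)]
    # Base case: single words, I_p(i-1, i) = U[p][m] for w_i = b_m.
    for i, m in enumerate(words):
        for p in range(n):
            I[i][i + 1][p] = U[p][m]
    # Recursive case, shortest spans first.
    for width in range(2, length + 1):
        for i in range(length - width + 1):
            k = i + width
            # Compatible function: skip spans that cross a bracket.
            if any(overlaps((i, k), s) for s in bracketing):
                continue
            for j in range(i + 1, k):
                for p in range(n):
                    for q in range(n):
                        for r in range(n):
                            I[i][k][p] += B[p][q][r] * I[i][j][q] * I[j][k][r]
    return I

# Toy grammar (hypothetical): A_0 -> A_0 A_0 (0.5), A_0 -> a (0.5).
B, U = [[[0.5]]], [[0.5]]
plain = inside([0, 0, 0], B, U)
bracketed = inside([0, 0, 0], B, U, bracketing={(0, 2)})
print(plain[0][3][0], bracketed[0][3][0])
```

With the bracket (0, 2), the span (1, 3) is ruled out, so only the left-branching analysis of the three-word string contributes to the inside probability of the whole sentence.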

Complexity
- Complexity of the original algorithm: O(|w|^3) for each sentence
  - Computing the inside probabilities, computing the outside probabilities, and reestimating the rule probabilities each take O(|w|^3) per sentence
- Complexity of the extended algorithm: O(|w|) in the best case
  - The best case is a full binary bracketing B of a string w
    - There are O(|w|) spans in B
    - There is only one split point for each (i, k)
    - Each valid span must be a member of B
  - Preprocessing
    - Enumerate the valid spans and split points
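The best-case claim can be checked on a small example by enumerating valid spans and split points directly. The sketch below is deliberately naive (it tests every span against every bracket) and only illustrates the counts, not an efficient preprocessing step; the function names are hypothetical.

```python
def overlaps(a, b):
    """True if spans a and b cross (neither nested nor disjoint)."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j

def valid_spans_and_splits(length, bracketing):
    """Return the valid spans and, for each valid span wider than one word,
    the split points j such that (i, j) and (j, k) are both valid."""
    valid = {
        (i, k)
        for i in range(length)
        for k in range(i + 1, length + 1)
        if not any(overlaps((i, k), s) for s in bracketing)
    }
    splits = {
        (i, k): [j for j in range(i + 1, k) if (i, j) in valid and (j, k) in valid]
        for (i, k) in valid
        if k - i >= 2
    }
    return valid, splits

# Four words, fully bracketed as ((w1 w2) (w3 w4)), versus no brackets at all.
full = {(0, 4), (0, 2), (2, 4)}
valid_full, splits_full = valid_spans_and_splits(4, full)
valid_none, splits_none = valid_spans_and_splits(4, set())
print(sum(len(js) for js in splits_full.values()))
print(sum(len(js) for js in splits_none.values()))
```

In the fully bracketed case each multi-word span has exactly one split point, so the work per sentence grows with the number of brackets rather than with the cube of the sentence length.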

Experimental Evaluation
- Two experiments
  - Artificial language: palindromes
  - Natural language: Penn Treebank
- Evaluation
  - Bracketing accuracy: the proportion of phrases in an analysis that are compatible with the treebank bracketing
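Bracketing accuracy as defined above can be sketched as a short function. This is an illustration under stated assumptions: a phrase is represented as an (i, j) span, a phrase counts as compatible when it crosses no treebank bracket, and the predicted spans in the example are made up for demonstration.

```python
def overlaps(a, b):
    """True if spans a and b cross (neither nested nor disjoint)."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j

def bracketing_accuracy(predicted_spans, treebank_bracketing):
    """Proportion of predicted phrases that cross no treebank bracket."""
    predicted = list(predicted_spans)
    if not predicted:
        return 0.0
    compatible = sum(
        1 for span in predicted
        if not any(overlaps(span, gold) for gold in treebank_bracketing)
    )
    return compatible / len(predicted)

# Treebank spans for "List the fares for flight number 891 ." (words 0..7).
gold = {(0, 8), (0, 7), (1, 7), (3, 7), (4, 7), (4, 5), (5, 7)}
# Hypothetical parser output: two phrases respect the brackets, two cross them.
predicted = [(0, 8), (0, 2), (2, 4), (4, 7)]
print(bracketing_accuracy(predicted, gold))
```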

Inferring the Palindrome Language
- L = { w w^R | w ∈ {a, b}* }
- Initial grammar: 135 rules (= 5^3 + 5*2)
- Training with 100 sentences
- Inferred grammar: a correct grammar for the palindrome language
- Bracketing accuracy: above 90% (100% in several cases)
  - With unbracketed training: 15% to 69%

Experiments on the ATIS Corpus
- ATIS (Air Travel Information System) corpus: 770 sentences (7812 words)
  - 700 sentences for training, 70 for testing (901 words)
- Initial grammar: 4095 rules (= 15^3 + 15*48)
  - 15 nonterminals, 48 terminal symbols for POS tags
- Bracketing accuracy: 90.36% after 75 iterations
  - With unbracketed training: 37.35%
- In case (A)
  - (Delta flight number): not compatible
  - (the cheapest): linguistically wrong; lack of information
  - 16 incompatibles in G_R
- In case (B)
  - Fully compatible
  - 9 incompatibles in G_R

Conclusions and Further Work
- The use of a partially bracketed corpus can
  - reduce the number of iterations needed for convergence
  - find good solutions
  - infer grammars specifying linguistically reasonable constituent boundaries
  - reduce time complexity (linear in the best case)
- Further extensions
  - Determining the sensitivity to
    - the initial probability assignments
    - the training corpus
    - lack or misplacement of brackets
  - Larger terminal vocabularies
