
Bootstrapping Feature-Rich Dependency Parsers with Entropic Priors David A. Smith Jason Eisner Johns Hopkins University.


1 Bootstrapping Feature-Rich Dependency Parsers with Entropic Priors. David A. Smith and Jason Eisner, Johns Hopkins University

2 Only Connect… [Diagram: a trained (dependency) parser sits at the center, feeding downstream applications (textual entailment, LM, IE, lexical semantics, MT) and learning from training trees, raw text, parallel & comparable corpora, and out-of-domain text (Weischedel 2004; Quirk et al. 2005; Pantel & Lin 2002).]

3 Outline: Bootstrapping Parsers. What kind of parser should we train? How should we train it semi-supervised? Does it work? (initial experiments) How can we incorporate other knowledge?

4 Re-estimation: EM or Viterbi EM. [Diagram: the trained parser parses the raw text and is retrained on its own output.]

5 Re-estimation: EM or Viterbi EM (iterate the process). Oops! Not much supervised training, so most of these parses were bad. Retraining on all of them overwhelms the good supervised data.

6 Simple Bootstrapping: Self-Training. So only retrain on "good" parses...

7 Simple Bootstrapping: Self-Training. So only retrain on "good" parses... at least, those the parser itself thinks are good. (Can we trust it? We'll see...)

8 Why Might This Work? Sure, now we avoid harming the parser with bad training. But why do we learn anything new from the unsupervised data? After training, the training parses have many features with positive weights and few features with negative weights. But unsupervised parses have few positive or negative features and mostly unknown features (words or situations not seen in the training data). Still, sometimes there are enough positive features to be sure it's the right parse.

9 Why Might This Work? Sure, we avoid bad guesses that harm the parser. But why do we learn anything new from the unsupervised data? Sometimes there are enough positive features to be sure it's the right parse. Now, retraining the weights makes the gray (and red) features of that parse greener.

10 Why Might This Work? Retraining the weights makes the gray (and red) features of the winning parse greener... and makes features redder for the "losing" parses of this sentence (not shown). Learning!

11 This Story Requires Many Redundant Features! Bootstrapping for WSD (Yarowsky 1995): lots of contextual features → success. Co-training for parsing (Steedman et al. 2003): feature-poor parsers → disappointment. Self-training for parsing (McClosky et al. 2006): feature-poor parsers → disappointment; a reranker with more features → success. More features → more chances to identify the correct parse even when we're undertrained.

12 This Story Requires Many Redundant Features! So, let's bootstrap a feature-rich parser! In our experiments so far, we follow McDonald et al. (2005): our model has 450 million features (on Czech), pruned down to 90 million frequent features; about 200 are considered per possible edge. (Note: even more features are proposed at the end of the talk.) More features → more chances to identify the correct parse even when we're undertrained.

13 Edge-Factored Parsers (McDonald et al. 2005). No global features of a parse: each feature is attached to some edge. Simple; allows fast O(n²) or O(n³) parsing. Running example: Byl jasný studený dubnový den a hodiny odbíjely třináctou ("It was a bright cold day in April, and the clocks were striking thirteen").

14 Edge-Factored Parsers (McDonald et al. 2005). Is this a good edge (jasný attached to den)? Yes, lots of green...

15 Edge-Factored Parsers. Is this a good edge? Feature: jasný → den ("bright day").

16 Edge-Factored Parsers. Is this a good edge? Features so far: jasný → den ("bright day"); jasný → N ("bright NOUN"). (Tag sequence: V A A A N J N V C.)

17 Edge-Factored Parsers. Is this a good edge? Features so far: jasný → den; jasný → N; A → N.

18 Edge-Factored Parsers. Is this a good edge? Features so far: jasný → den; jasný → N; A → N; A → N preceding a conjunction.

19 Edge-Factored Parsers. How about this competing edge (jasný attached to hodiny)? Not as good, lots of red...

20 Edge-Factored Parsers. How about this competing edge? Feature: jasný → hodiny ("bright clocks")... undertrained...

21 Edge-Factored Parsers. Competing-edge features so far: jasný → hodiny ("bright clocks")... undertrained...; jasn- → hodi- ("bright clock," stems only). (Stem sequence: být- jasn- stud- dubn- den- a- hodi- odbí- třin-.)

22 Edge-Factored Parsers. Competing-edge features so far: jasný → hodiny; jasn- → hodi- (stems only); A plural → N singular.

23 Edge-Factored Parsers. Competing-edge features so far: jasný → hodiny; jasn- → hodi- (stems only); A plural → N singular; A → N where N follows a conjunction.

24 Edge-Factored Parsers (McDonald et al. 2005). Which edge is better: "bright day" or "bright clocks"?

25 Edge-Factored Parsers. Which edge is better? Score of an edge e = θ · features(e), where θ is our current weight vector. Standard algorithms find the valid parse with the maximum total score. (Lemma sequence: být jasný studený dubnový den a hodiny odbít třináct.)

26 Edge-Factored Parsers. Score of an edge e = θ · features(e); standard algorithms find the valid parse with the maximum total score. [Diagram: some pairs of edges can't both appear (one parent per word; no crossing links), and some triples can't all appear (no cycles).] Thus, an edge may lose (or win) because of a consensus of other edges. Retraining then learns to reduce (or increase) its score.
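
To make the edge-factored scoring concrete, here is a minimal Python sketch. The feature templates, weight values, and helper names (edge_features, edge_score) are illustrative assumptions, not the actual feature set or code of McDonald et al. (2005) or of this paper; and the final loop only picks each word's best head independently, whereas a real decoder would run Eisner's algorithm (projective) or Chu-Liu/Edmonds (non-projective MST) over the same edge scores to enforce the tree constraints above.

```python
from collections import defaultdict

def edge_features(words, tags, head, child):
    """Toy edge-factored feature templates: word pair, tag pair, word + tag."""
    return [
        f"w:{words[head]}>{words[child]}",
        f"t:{tags[head]}>{tags[child]}",
        f"wt:{words[head]}>{tags[child]}",
    ]

def edge_score(theta, words, tags, head, child):
    """score(e) = theta . features(e): sum the weights of the edge's features."""
    return sum(theta[f] for f in edge_features(words, tags, head, child))

# Hypothetical weight vector (the real model has ~90M pruned features).
theta = defaultdict(float)
theta["w:den>jasný"] = 2.0       # "bright day": well-trained, strongly positive
theta["t:N>A"] = 1.5             # noun head with adjective child
theta["w:hodiny>jasný"] = -0.5   # "bright clocks": rarely seen, slightly negative

words = ["<ROOT>", "Byl", "jasný", "studený", "dubnový", "den",
         "a", "hodiny", "odbíjely", "třináctou"]
tags = ["<ROOT>", "V", "A", "A", "A", "N", "J", "N", "V", "C"]

# Naive illustration only: choose each word's best-scoring head independently.
# (A real decoder enforces one parent per word, no cycles, and, if projective,
#  no crossing links, via Eisner's or Chu-Liu/Edmonds' algorithm.)
for child in range(1, len(words)):
    best_head = max((h for h in range(len(words)) if h != child),
                    key=lambda h: edge_score(theta, words, tags, h, child))
    print(f"{words[child]} <- {words[best_head]}")
```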

27 Only Connect… [Same diagram as slide 2: the trained parser, the tasks it feeds, and the data it can learn from.]

28 Only retrain on "good" parses... at least, those the parser itself thinks are good. Can we recast this declaratively?

29 Can we recast this declaratively? [Diagram of the generic bootstrapping loop: a seed set trains a classifier; the classifier labels examples; examples with high confidence are selected; these form a new labeled set for retraining.]

30 Bootstrapping as Optimization. Maximize a function on supervised and unsupervised data: try to predict the supervised parses, and try to be confident on the unsupervised parses (entropy regularization: Brand 1999; Grandvalet & Bengio 2004; Jiao et al. 2006). Yesterday's talk showed how to compute these quantities for non-projective models; see Hwa (2001) for projective tree entropy.
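
As a sketch of that objective (the exact form, including the Rényi generalization and the weighting of the two terms, is given in the paper; γ here is just an assumed trade-off constant), the training criterion combines supervised log-likelihood on the labeled set L with a negative-entropy "confidence" term on the unlabeled set U:

```latex
F(\theta) \;=\; \sum_{(x_i,\,y_i)\in\mathcal{L}} \log p_\theta(y_i \mid x_i)
\;-\; \gamma \sum_{x_j\in\mathcal{U}} H\bigl(p_\theta(\cdot \mid x_j)\bigr),
\qquad
H(p) \;=\; -\sum_{y} p(y)\,\log p(y).
```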

31 Claim: gradient descent on this objective function works like bootstrapping. When we're pretty sure the true parse is A or B, we reduce the entropy H by becoming even surer (≈ "retraining" on the example). When we're not sure, the example doesn't affect θ (≈ not retraining on the example). [Plot of H against p: H ≈ 0 when we're sure of parse A or sure of parse B, H ≈ 1 when we're not sure; the arrows show ∂H/∂p pushing confident examples toward certainty.]
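
For intuition, restrict to the two-parse picture on this slide (this is only the slide's toy case, not the paper's full gradient over all parses): if p is the probability of parse A and 1 - p that of parse B, then

```latex
H(p) \;=\; -\,p\log p \;-\; (1-p)\log(1-p),
\qquad
\frac{\partial H}{\partial p} \;=\; \log\frac{1-p}{p},
```

so the gradient vanishes at p = 1/2 (an unsure example exerts no pull on θ), while minimizing H pushes an already-confident p further toward 0 or 1, just as if we had retrained on that example's favored parse.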

32 Claim: gradient descent on this objective function works like bootstrapping. In the paper, we generalize: replace Shannon entropy H(·) with Rényi entropy H_α(·). This gives us a tunable parameter α: we connect to Abney's view of bootstrapping (α = 0), obtain a Viterbi variant (limit as α → ∞), obtain a Gini variant (α = 2), and still get Shannon entropy (limit as α → 1). Rényi entropy is also easier to compute in some circumstances.
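
A small Python sketch of the Rényi entropy H_α and the special cases named on this slide. It assumes, purely for illustration, that the posterior over candidate parses is available as an explicit probability vector; in the actual parser these quantities are computed without enumerating parses (e.g., with inside-outside or matrix-tree machinery, as referenced elsewhere in the talk).

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Rényi entropy H_alpha of a discrete distribution p (entries sum to 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # conventionally 0 * log 0 = 0
    if np.isclose(alpha, 1.0):         # limit alpha -> 1: Shannon entropy
        return float(-np.sum(p * np.log(p)))
    if np.isinf(alpha):                # limit alpha -> inf: min-entropy ("Viterbi" variant)
        return float(-np.log(np.max(p)))
    if alpha == 0:                     # alpha = 0: log of the number of parses with p > 0
        return float(np.log(len(p)))
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

# alpha = 2 is the Gini variant: -log(expected 0/1 gain) = -log(sum of p^2).
posterior = [0.7, 0.2, 0.1]                    # hypothetical posterior over three parses
print(renyi_entropy(posterior, 2))             # -log(0.49 + 0.04 + 0.01) ≈ 0.616
print(renyi_entropy(posterior, 1))             # Shannon entropy ≈ 0.802
print(renyi_entropy(posterior, float("inf")))  # -log(0.7) ≈ 0.357
```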

33 Experimental Questions. Are confident parses (or edges) actually good for retraining? Does bootstrapping help accuracy? What is being learned?

34 Experimental Design (ridiculously small pilot experiments, sorry). Czech, German, and Spanish (some Bulgarian); CoNLL-X dependency trees; non-projective (MST) parsing; hundreds of millions of features. Supervised training sets of 100 and 1000 trees; unparsed but tagged sets of 2k to 70k sentences. Stochastic gradient descent: first optimize just likelihood on the seed set, then optimize likelihood + confidence criterion on all data, stopping when accuracy peaks on development data.
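
The training regime can be sketched as the following two-phase loop. The helper callables (log_likelihood_grad, entropy_grad, dev_accuracy) are hypothetical placeholders for the model-specific machinery, and the optimizer details (plain SGD, fixed learning rate, simple early stopping) are assumptions for illustration, not the settings used in the experiments.

```python
import numpy as np

def train(theta, seed_trees, unlabeled_sents, dev_trees,
          log_likelihood_grad, entropy_grad, dev_accuracy,
          gamma=1.0, lr=0.01, sup_epochs=10, max_epochs=100):
    """Two-phase SGD sketch: supervised warm-up on the seed set, then
    bootstrapping by also minimizing parse entropy (maximizing confidence)
    on the unlabeled sentences."""
    theta = np.asarray(theta, dtype=float).copy()

    # Phase 1: optimize just the likelihood of the supervised seed set.
    for _ in range(sup_epochs):
        for x, y in seed_trees:
            theta += lr * log_likelihood_grad(theta, x, y)

    # Phase 2: likelihood + confidence criterion on all data, stopping
    # when accuracy peaks on the development data (early stopping).
    best_theta, best_acc = theta.copy(), dev_accuracy(theta, dev_trees)
    for _ in range(max_epochs):
        for x, y in seed_trees:
            theta += lr * log_likelihood_grad(theta, x, y)
        for x in unlabeled_sents:
            theta -= lr * gamma * entropy_grad(theta, x)  # push entropy down

        acc = dev_accuracy(theta, dev_trees)
        if acc <= best_acc:
            break
        best_theta, best_acc = theta.copy(), acc

    return best_theta
```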

35 Are confident parses accurate? [Chart: correlation of entropy with accuracy for four confidence criteria: Shannon entropy; "Viterbi" self-training; Gini = -log(expected 0/1 gain); and log(# of parses), which favors short sentences as in Abney's Yarowsky algorithm. Correlations shown: -.26, -.32, -.27, -.25.]

36 How Accurate Is Bootstrapping? [Chart: accuracy with a 100-tree supervised set (baseline) vs. bootstrapping with +2K, +37K, and +71K unsupervised sentences; gains marked significant on a paired permutation test.]

37 How Does Bootstrapping Learn? [Precision/recall chart; precision around 90%.] Maybe enough precision so retraining doesn't hurt, and maybe enough recall so retraining will learn new things.

38 Bootstrapping vs. EM. 100 training trees, with 100 dev trees for model selection; two ways to add unsupervised data. We compare on a feature-poor model that EM can handle (DMV). [Chart: accuracy on Bulgarian, German, and Spanish for EM (joint), the supervised baselines MLE (joint) and MLE (cond.), and Boot. (cond.).]

39 There's No Data Like More Data. [Same diagram as slide 2: the trained parser, its downstream tasks, and its possible data sources.]

40 "Token" Projection. What if some sentences have parallel text? Should we project 1-best English dependencies (Hwa et al. 2004)? Problems: imperfect or free translation, imperfect parse, imperfect alignment. No. Just use the parallel text to get further noisy features. (Running example with its English translation: Byl jasný studený dubnový den a hodiny odbíjely třináctou / It was a bright cold day in April, and the clocks were striking thirteen.)

41 "Token" Projection. What if some sentences have parallel text? Feature: this candidate edge probably aligns to some English link (A → N).

42 "Token" Projection. Feature: this candidate edge probably aligns to some English path (N → in → N). Cf. "quasi-synchronous grammars" (Smith & Eisner, 2006).

43 "Type" Projection. Can we use world knowledge, e.g., from comparable corpora? [Parsed Gigaword examples of the clock → strike dependency: "...will no longer be royal when the clock strikes midnight." "But when the clock strikes 11 a.m. and the race cars rocket..." "...vehicles and pedestrians after the clock struck eight." "...when the clock of a no-passenger Airbus A-320 struck..." "...born right after the clock struck 12:00 p.m. of December..." "...as the clock in Madrid's Plaza del Sol strikes 12 times."]

44 "Type" Projection. Can we use world knowledge, e.g., from comparable corpora? Feature: hodiny and odbíjely probably translate as English words that usually link as N → V when cosentential. [Diagram lists candidate translations, e.g., Byl: be, exist, subsist; jasný: bright, broad, cheerful, pellucid, straight, ...; studený: cold, fresh, hyperborean, stone-cold; dubnový: April; den: day, daytime; a: and, plus; hodiny: clock, meter, metre; odbíjely: strike; třináctou: thirteen.]

45 Conclusions. A declarative view of bootstrapping as entropy minimization. Improvements in parser accuracy with feature-rich models. Easily added features from alternative data sources, e.g., comparable text. In future: consider also the WSD decision-list learner: is it important for learning robust feature weights?

46 Thanks. Noah Smith, Keith Hall, the anonymous reviewers, and Ryan McDonald for making his code available.

47 Extra slides …

48 Dependency Treebanks

49 A Supervised CoNLL-X System. What system was this?

50 How Does Bootstrapping Learn? [Charts: supervised, iteration 1; supervised, iteration 10; bootstrapping with R_2 (Rényi, α = 2); bootstrapping with R_∞ (Rényi, α = ∞).]

51 How Does Bootstrapping Learn?

    Updated      M feat.   Acc. [%]      Updated      M feat.   Acc. [%]
    all           15.5      64.3         none           0        60.9
    seed           1.4      64.1         non-seed      14.1      44.7
    non-lex.       3.5      64.4         lexical       12.0      59.9
    non-bilex.    12.6      64.4         bilexical      2.9      61.0

52 Review: Yarowsky's bootstrapping algorithm. [Table taken from Yarowsky (1995) for the target word "plant": seed examples cover the life sense (1%) and the manufacturing sense (1%); the remaining 98% are unlabeled.]

53 Review: Yarowsky's bootstrapping algorithm. [Figure taken from Yarowsky (1995).] The result should be a good classifier, unless we accidentally learned some bad cues along the way that polluted the original sense distinction.

54 Review: Yarowsky's bootstrapping algorithm. [Figure taken from Yarowsky (1995).] Learn a classifier that distinguishes A from B. It will notice features like "animal" → A, "automate" → B.

55 Review: Yarowsky's bootstrapping algorithm. [Figure taken from Yarowsky (1995).] That confidently classifies some of the remaining examples. Now learn a new classifier and repeat... and repeat...

56 Bootstrapping: Pivot Features. Examples: "Sat beside the river bank," "Sat on the bank," "Run on the bank"; "quick and sly fox," "sly and crafty fox," "quick gait of the sly fox." Lots of overlapping features, vs. a PCFG (McClosky et al.).

57 Bootstrapping as Optimization (Abney 2004). Given a "labeling" distribution p̃, the log likelihood to maximize is the expected log-likelihood under p̃. On labeled data, p̃ is 1 at the label and 0 elsewhere; thus we recover ordinary supervised training.
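
The transcript drops the slide's equation; as a sketch consistent with Abney (2004) and the surrounding text, the quantity being maximized is the p̃-expected log-likelihood, which reduces to the ordinary supervised log-likelihood when p̃ puts probability 1 on the gold label y_x:

```latex
\ell(\theta) \;=\; \sum_{x} \sum_{y} \tilde{p}(y \mid x)\,\log p_\theta(y \mid x),
\qquad
\tilde{p}(y \mid x) = \mathbb{1}[y = y_x]
\;\Longrightarrow\;
\ell(\theta) = \sum_{x} \log p_\theta(y_x \mid x).
```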

58 Triangular Trade. [Triangle diagram connecting Data, Features, Models, and Objectives, with labels: words, tags, translations, ...; globally normalized LL; projective/non-projective; parent prediction; inside/outside; matrix-tree; EM; Abney's K; entropy regularization; derivational (Rényi) entropy; ???]

