1 Conditional Random Fields. Advanced Statistical Methods in NLP, Ling 572, February 9, 2012.

2 Roadmap: graphical models (modeling independence; models revisited); generative & discriminative models; conditional random fields (linear-chain models, skip-chain models).

3-6 Preview. Conditional random fields: an undirected graphical model, due to Lafferty, McCallum, and Pereira (2001). A discriminative model; supports integration of rich feature sets. Allows a range of dependency structures (linear-chain, skip-chain, general) and can encode long-distance dependencies. Used in diverse NLP sequence labeling tasks: named entity recognition, coreference resolution, etc.

7 Graphical Models

8-11 Graphical Models. A graphical model is a simple, graphical notation for conditional independence: a probabilistic model where the graph structure denotes conditional independence between random variables. Nodes: random variables. Edges: dependency relations between random variables. Model types: Bayesian networks, Markov random fields.

12-15 Modeling (In)dependence. Bayesian network: a directed acyclic graph (DAG). Nodes = random variables. An arc means "directly influences" (conditional dependency): a child depends on its parent(s). No incoming arcs = independent (only a priori). Parents of X = pi(X); for each X we need P(X | pi(X)).

16-18 Example I (figure from Russell & Norvig, AIMA).

19-23 Simple Bayesian Network MCBN1 (nodes A, B, C, D, E). A = only a priori; B depends on A; C depends on A; D depends on B, C; E depends on C. Need: P(A), P(B|A), P(C|A), P(D|B,C), P(E|C). Truth table sizes: 2, 2*2, 2*2, 2*2*2, 2*2.

24 Holmes Example (Pearl). Holmes is worried that his house will be burgled. For the time period of interest, there is a 10^-4 a priori chance of this happening, and Holmes has installed a burglar alarm to try to forestall this event. The alarm is 95% reliable in sounding when a burglary happens, but also has a false positive rate of 1%. Holmes' neighbor, Watson, is 90% sure to call Holmes at his office if the alarm sounds, but he is also a bit of a practical joker and, knowing Holmes' concern, might (30%) call even if the alarm is silent. Holmes' other neighbor Mrs. Gibbons is a well-known lush and often befuddled, but Holmes believes that she is four times more likely to call him if there is an alarm than not.

25-29 Holmes Example: Model. There are four binary random variables: B: whether Holmes' house has been burgled; A: whether his alarm sounded; W: whether Watson called; G: whether Gibbons called. (Graph: B -> A, A -> W, A -> G.)

30 Holmes Example: Tables.
P(B): B=#t 0.0001, B=#f 0.9999.
P(A|B): B=#t: A=#t 0.95, A=#f 0.05; B=#f: A=#t 0.01, A=#f 0.99.
P(W|A): A=#t: W=#t 0.90, W=#f 0.10; A=#f: W=#t 0.30, W=#f 0.70.
P(G|A): A=#t: G=#t 0.40, G=#f 0.60; A=#f: G=#t 0.10, G=#f 0.90.
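As a concrete illustration (not part of the original slides), here is a minimal sketch that encodes these four CPTs and answers a query such as P(B | W=true) by brute-force enumeration over the hidden variables. The CPT numbers come from the table above; the query and helper names are assumptions for illustration only.

```python
# Sketch: brute-force inference in the Holmes Bayesian network (B -> A -> {W, G}).
# CPT values are taken from the slide; the query/helper code is illustrative only.
from itertools import product

P_B = {True: 0.0001, False: 0.9999}
P_A_given_B = {True: {True: 0.95, False: 0.05},
               False: {True: 0.01, False: 0.99}}
P_W_given_A = {True: {True: 0.90, False: 0.10},
               False: {True: 0.30, False: 0.70}}
P_G_given_A = {True: {True: 0.40, False: 0.60},
               False: {True: 0.10, False: 0.90}}

def joint(b, a, w, g):
    """P(B=b, A=a, W=w, G=g) = P(b) P(a|b) P(w|a) P(g|a)."""
    return P_B[b] * P_A_given_B[b][a] * P_W_given_A[a][w] * P_G_given_A[a][g]

def prob_burglary_given_watson(w=True):
    """P(B=true | W=w), summing out A and G."""
    num = sum(joint(True, a, w, g) for a, g in product([True, False], repeat=2))
    den = sum(joint(b, a, w, g) for b, a, g in product([True, False], repeat=3))
    return num / den

if __name__ == "__main__":
    print(prob_burglary_given_watson(True))  # small: Watson alone is weak evidence
```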

31-33 Bayes' Nets: Markov Property. Bayes' nets satisfy the local Markov property: each variable is conditionally independent of its non-descendants given its parents.

34-38 Simple Bayesian Network MCBN1 (nodes A, B, C, D, E). A = only a priori; B depends on A; C depends on A; D depends on B, C; E depends on C. P(A,B,C,D,E) = P(A) P(B|A) P(C|A) P(D|B,C) P(E|C). There exist algorithms for training and inference on BNs.
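A minimal sketch of this factorization for MCBN1 follows. Only the structure comes from the slide; the probability values below are invented purely for illustration.

```python
# Sketch: the MCBN1 factorization P(A,B,C,D,E) = P(A) P(B|A) P(C|A) P(D|B,C) P(E|C).
# All probability values here are invented; only the dependency structure is from the slide.
from itertools import product

P_A = {True: 0.3, False: 0.7}
P_B_given_A = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}
P_C_given_A = {True: {True: 0.5, False: 0.5}, False: {True: 0.2, False: 0.8}}
P_D_given_BC = {(True, True): {True: 0.9, False: 0.1},
                (True, False): {True: 0.6, False: 0.4},
                (False, True): {True: 0.4, False: 0.6},
                (False, False): {True: 0.05, False: 0.95}}
P_E_given_C = {True: {True: 0.7, False: 0.3}, False: {True: 0.1, False: 0.9}}

def joint(a, b, c, d, e):
    """P(A,B,C,D,E) as the product of the five local tables."""
    return (P_A[a] * P_B_given_A[a][b] * P_C_given_A[a][c]
            * P_D_given_BC[(b, c)][d] * P_E_given_C[c][e])

# Sanity check: the joint sums to 1 over all 2**5 assignments.
assert abs(sum(joint(*v) for v in product([True, False], repeat=5)) - 1.0) < 1e-9
```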

39-41 Naïve Bayes Model. A Bayes' net encoding conditional independence of the features given the class: the class node Y is the parent of feature nodes f_1, f_2, ..., f_k.
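A minimal classification sketch corresponding to this structure: the prediction is argmax_y P(y) * prod_i P(f_i | y). The toy class priors and likelihood tables below are assumptions for illustration, not from the slides.

```python
# Sketch: Naive Bayes decision rule y* = argmax_y P(y) * prod_i P(f_i | y).
# The toy class priors and feature likelihoods are invented for illustration.
import math

prior = {"politics": 0.5, "sports": 0.5}
likelihood = {                      # P(feature | class), one table per class
    "politics": {"rifle": 0.02, "vote": 0.05, "game": 0.001},
    "sports":   {"rifle": 0.001, "vote": 0.002, "game": 0.06},
}

def classify(features):
    """Pick the class maximizing the log joint: log P(y) + sum_i log P(f_i | y)."""
    best, best_score = None, float("-inf")
    for y, p_y in prior.items():
        score = math.log(p_y) + sum(math.log(likelihood[y].get(f, 1e-8)) for f in features)
        if score > best_score:
            best, best_score = y, score
    return best

print(classify(["rifle", "vote"]))  # -> "politics" with these toy numbers
```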

42-47 Hidden Markov Model. A Bayesian network where y_t depends on y_{t-1} and x_t depends on y_t: a chain of state nodes y_1, y_2, ..., y_k, each emitting an observation x_1, x_2, ..., x_k.
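The joint probability this graph encodes is p(x, y) = prod_t p(y_t | y_{t-1}) p(x_t | y_t). A minimal sketch, with transition and emission values invented for illustration:

```python
# Sketch: HMM joint probability p(x, y) = prod_t p(y_t | y_{t-1}) * p(x_t | y_t).
# Transition and emission values are invented for illustration.
start = {"N": 0.6, "V": 0.4}                                     # p(y_1)
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}   # p(y_t | y_{t-1})
emit = {"N": {"flies": 0.2, "time": 0.5},
        "V": {"flies": 0.4, "time": 0.1}}                        # p(x_t | y_t)

def hmm_joint(states, words):
    p = start[states[0]] * emit[states[0]][words[0]]
    for t in range(1, len(states)):
        p *= trans[states[t - 1]][states[t]] * emit[states[t]][words[t]]
    return p

print(hmm_joint(["N", "V"], ["time", "flies"]))  # 0.6 * 0.5 * 0.7 * 0.4
```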

48-50 Generative Models. Both Naïve Bayes and HMMs are generative models. "We use the term generative model to refer to a directed graphical model in which the outputs topologically precede the inputs, that is, no x in X can be a parent of an output y in Y." (Sutton & McCallum, 2006). The state y generates an observation (instance) x. Maximum Entropy and linear-chain Conditional Random Fields (CRFs) are, respectively, their discriminative model counterparts.

51-52 Markov Random Fields (aka Markov networks). Graphical representation of a probabilistic model using an undirected graph. Can represent cyclic dependencies (vs. the DAG of a Bayesian network, which can represent induced dependencies). Also satisfy a local Markov property: P(X | all other variables) = P(X | ne(X)), where ne(X) are the neighbors of X.

53-55 Factorizing MRFs. Many MRFs can be analyzed in terms of cliques. Clique: in an undirected graph G(V,E), a clique is a subset of vertices such that for every pair of vertices v_i, v_j in the subset, the edge (v_i, v_j) is in E. A maximal clique cannot be extended; a maximum clique is the largest clique in G. [Example graph over nodes A, B, C, D, E due to F. Xia.]

56-58 MRFs. Given an undirected graph G(V,E) over random variables X, and the cliques over G, cl(G), the joint distribution is defined as a product of potentials over the cliques. [Example graph due to F. Xia.]
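The factorization itself appeared as an image on the original slide; the standard form, reconstructed here rather than copied from the transcript, is:

```latex
p(x) = \frac{1}{Z} \prod_{c \in cl(G)} \phi_c(x_c),
\qquad
Z = \sum_{x'} \prod_{c \in cl(G)} \phi_c(x'_c)
```

where each phi_c is a non-negative potential function over the variables in clique c and Z is the normalization constant (partition function).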

59-60 Conditional Random Fields. Definition due to Lafferty et al., 2001: Let G = (V,E) be a graph such that Y = (Y_v) for v in V, so that Y is indexed by the vertices of G. Then (X,Y) is a conditional random field in case, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w != v) = p(Y_v | X, Y_w, w ~ v), where w ~ v means that w and v are neighbors in G. A CRF is a Markov random field globally conditioned on the observation X, and has the form shown below.
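The form referred to on the slide appeared as an image; a standard reconstruction (following Lafferty et al., 2001 and Sutton & McCallum, 2006, not copied from the transcript) is:

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})} \prod_{c \in cl(G)} \phi_c(\mathbf{y}_c, \mathbf{x}),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \prod_{c \in cl(G)} \phi_c(\mathbf{y}'_c, \mathbf{x})
```

In the linear-chain case the potentials are exponentiated weighted feature sums, giving

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\Big( \sum_{t} \sum_{k} \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}, t) \Big)
```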

61-64 Linear-Chain CRF. CRFs can have arbitrary graphical structure, but the most common form is the linear chain, which supports sequence modeling for many NLP sequence labeling problems: named entity recognition (NER), coreference, etc. It is similar to combining HMM-style sequence structure with a MaxEnt model: it supports sequence structure like an HMM (but HMMs can't handle rich feature structure) and supports rich, overlapping features like MaxEnt (but MaxEnt doesn't directly support sequence labeling).

65 Discriminative & Generative: model perspectives (figure from Sutton & McCallum).

66-68 Linear-Chain CRFs: Feature Functions. In MaxEnt: f: X x Y -> {0,1}, e.g. f_j(x,y) = 1 if x = "rifle" and y = talk.politics.guns, 0 otherwise. In CRFs: f: Y x Y x X x T -> R, e.g. f_k(y_t, y_{t-1}, x, t) = 1 if y_t = V and y_{t-1} = N and x_t = "flies", 0 otherwise; frequently an indicator function, for efficiency.
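A minimal sketch of what such feature functions look like in code. The particular features are the slide's examples; the function names and the scoring helper are hypothetical, added here for illustration.

```python
# Sketch: MaxEnt vs. CRF feature functions as Python callables.
# Both are indicator (0/1) features; names and framing are illustrative only.
def f_maxent_rifle_guns(x, y):
    """MaxEnt-style feature over a whole instance: f: X x Y -> {0,1}."""
    return 1.0 if "rifle" in x and y == "talk.politics.guns" else 0.0

def f_crf_verb_after_noun_flies(y_t, y_prev, x, t):
    """Linear-chain CRF feature: f: Y x Y x X x T -> R (here an indicator)."""
    return 1.0 if y_t == "V" and y_prev == "N" and x[t] == "flies" else 0.0

def sequence_score(features, weights, x, y):
    """Unnormalized linear-chain score: sum over positions of weighted features."""
    return sum(w * f(y[t], y[t - 1] if t > 0 else "<s>", x, t)
               for t in range(len(x))
               for f, w in zip(features, weights))

print(sequence_score([f_crf_verb_after_noun_flies], [1.5],
                     ["time", "flies"], ["N", "V"]))  # 1.5
```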

71-73 Linear-Chain CRFs: Training & Decoding. Training: learn the weights λ_j; the approach is similar to MaxEnt, e.g. L-BFGS. Decoding: compute the label sequence that optimizes P(y|x); can use approaches like those for HMMs, e.g. Viterbi (see the sketch below).
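A minimal Viterbi decoding sketch for a linear-chain CRF, assuming a per-position factor score(y_prev, y_t, x, t) = sum_k lambda_k f_k(y_t, y_prev, x, t) is already available. Everything here is illustrative, not the course implementation.

```python
# Sketch: Viterbi decoding for a linear-chain CRF.
# `score(y_prev, y, x, t)` is assumed to return the log-space factor
# sum_k lambda_k * f_k(y, y_prev, x, t); the toy score below is for illustration.
def viterbi(x, labels, score, start="<s>"):
    n = len(x)
    delta = [{y: score(start, y, x, 0) for y in labels}]  # best score ending in y at t
    backptr = [{}]
    for t in range(1, n):
        delta.append({})
        backptr.append({})
        for y in labels:
            best_prev = max(labels, key=lambda yp: delta[t - 1][yp] + score(yp, y, x, t))
            delta[t][y] = delta[t - 1][best_prev] + score(best_prev, y, x, t)
            backptr[t][y] = best_prev
    # Follow back-pointers from the best final label.
    y_last = max(labels, key=lambda y: delta[n - 1][y])
    path = [y_last]
    for t in range(n - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    return list(reversed(path))

# Toy usage with an arbitrary scoring function.
toy_score = lambda yp, y, x, t: 1.0 if (y == "V") == (x[t] == "flies") else 0.0
print(viterbi(["time", "flies"], ["N", "V"], toy_score))  # ['N', 'V']
```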

74 Skip-Chain CRFs

75-78 Motivation. Long-distance dependencies: linear-chain CRFs, HMMs, beam search, etc. all make very local Markov assumptions (preceding label; current data given current label), which is good for some tasks. However, longer context can be useful, e.g. in NER: repeated capitalized words should get the same tag.

79-83 Skip-Chain CRFs. Basic approach: augment the linear-chain CRF model with long-distance 'skip edges' and add evidence from both endpoints. Which edges? Identical words, words with the same stem? How many edges? Not too many: more edges increase inference cost.

84-87 Skip-Chain CRF Model. Two clique templates: the standard linear-chain template and the skip-edge template (factorization shown below).
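The factorization appeared as images on the original slides; a standard reconstruction (following Sutton & McCallum's skip-chain formulation, not copied from the transcript) is:

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, \mathbf{x})
    \prod_{(u,v) \in \mathcal{I}} \Phi_{uv}(y_u, y_v, \mathbf{x})
```

where I is the set of skip-edge endpoint pairs, Psi_t are the linear-chain factors, and Phi_uv are the skip-edge factors, each an exponentiated weighted sum of feature functions.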

88-90 Skip-Chain NER. Named entity recognition task: start time, end time, speaker, location, in a corpus of seminar announcement emails. All approaches use orthographic, gazetteer, and POS features within a window of the preceding and following 4 words. Skip-chain CRFs add skip edges between identical capitalized words.
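A minimal sketch of how such skip edges could be collected for one document. This is a hypothetical helper for illustration, not the edge-selection code used in the cited experiments.

```python
# Sketch: connect repeated capitalized tokens with skip edges, as in skip-chain NER.
def skip_edges(tokens):
    """Return (i, j) pairs linking later occurrences of a capitalized word to its first one."""
    first_seen = {}
    edges = []
    for i, tok in enumerate(tokens):
        if tok[:1].isupper():
            if tok in first_seen:
                edges.append((first_seen[tok], i))
            else:
                first_seen[tok] = i
    return edges

tokens = "Speaker : Jane Doe . Jane will talk at 3 pm".split()
print(skip_edges(tokens))  # [(2, 5)] linking the two occurrences of "Jane"
```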

91 NER Features

92 Skip-Chain NER Results. Skip chain improves substantially on 'speaker' recognition, with a slight reduction in accuracy for times.

93-97 Summary. Conditional random fields (CRFs): undirected graphical models (compare with Bayesian networks and Markov random fields). Linear-chain models: HMM sequence structure + MaxEnt feature models. Skip-chain models: augment with longer-distance dependencies. Pros: good performance. Cons: compute intensive.

98 HW #5

99 HW #5: Beam Search. Apply beam search to MaxEnt sequence decoding. Task: POS tagging. Given files: test data (usual format), boundary file (sentence lengths), model file. Comparisons: different topN, topK, beam_width settings.

100 Tag Context. Following Ratnaparkhi '96, the model uses the previous tag (prevT=tag) and the previous tag bigram (prevTwoTags=tag_{i-2}+tag_{i-1}). These are NOT in the data file; you compute them on the fly (see the sketch below). Notes: due to sparseness, a bigram may not appear in the model file; skip it. These are feature functions: for a different candidate tag on the same word, the weights will differ.
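A minimal sketch of constructing these two context features for one beam-search hypothesis. The feature names prevT= and prevTwoTags= follow the slide; the helper function and its handling of missing features are assumptions, not the assignment's reference code.

```python
# Sketch: build the Ratnaparkhi-style tag-context features for one hypothesis
# during beam search, skipping features absent from the model (sparseness).
def tag_context_features(prev_tags, model_features):
    """prev_tags: tags already assigned to the preceding words (may be shorter at the
    start of a sentence); model_features: set of feature names present in the model."""
    feats = []
    if len(prev_tags) >= 1:
        feats.append("prevT=" + prev_tags[-1])
    if len(prev_tags) >= 2:
        feats.append("prevTwoTags=" + prev_tags[-2] + "+" + prev_tags[-1])
    # A bigram feature may not appear in the model file; skip unknown features.
    return [f for f in feats if f in model_features]

model_feats = {"prevT=DT", "prevT=NN", "prevTwoTags=DT+NN"}
print(tag_context_features(["DT", "NN"], model_feats))
# -> ['prevT=NN', 'prevTwoTags=DT+NN']
```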

101 Uncertainty. Real-world tasks are partially observable, stochastic, and extremely complex. Probabilities capture "ignorance & laziness": we lack relevant facts and conditions, and we fail to enumerate all conditions and exceptions.

102 Motivation. Uncertainty in medical diagnosis: diseases produce symptoms; in diagnosis, observed symptoms => disease identification. Uncertainties: symptoms may not occur, symptoms may not be reported, and diagnostic tests are not perfect (false positives, false negatives). How do we estimate confidence?

