Conditional Random Fields
Advanced Statistical Methods in NLP, Ling 572, February 9, 2012

Roadmap
- Graphical models: modeling independence
- Models revisited: generative & discriminative models
- Conditional random fields: linear-chain models, skip-chain models

Preview
Conditional random fields:
- Undirected graphical model, due to Lafferty, McCallum, and Pereira, 2001
- Discriminative model; supports integration of rich feature sets
- Allows a range of dependency structures (linear-chain, skip-chain, general); can encode long-distance dependencies
- Used for diverse NLP sequence labeling tasks: named entity recognition, coreference resolution, etc.

Graphical Models

Graphical Models
A graphical model is a simple, graphical notation for conditional independence: a probabilistic model in which the graph structure denotes conditional independence between random variables.
- Nodes: random variables
- Edges: dependency relations between random variables
Model types: Bayesian networks, Markov random fields

Modeling (In)dependence
Bayesian network:
- Directed acyclic graph (DAG)
- Nodes = random variables
- Arcs = conditional dependency: a child depends on its parent(s); an arc reads as "directly influences"
- No incoming arcs = independent (a priori only)
- For each node X, we need P(X | Parents(X))

Example I (Russell & Norvig, AIMA)
(The example network figure is not preserved in this text.)

Simple Bayesian Network MCBN1
Variables: A, B, C, D, E
- A: a priori only
- B depends on A
- C depends on A
- D depends on B, C
- E depends on C
Need (truth-table sizes for binary variables):
- P(A): 2
- P(B|A): 2*2
- P(C|A): 2*2
- P(D|B,C): 2*2*2
- P(E|C): 2*2

Holmes Example (Pearl)
Holmes is worried that his house will be burgled. For the time period of interest, there is a 10^-4 a priori chance of this happening, and Holmes has installed a burglar alarm to try to forestall this event. The alarm is 95% reliable in sounding when a burglary happens, but also has a false positive rate of 1%. Holmes' neighbor, Watson, is 90% sure to call Holmes at his office if the alarm sounds, but he is also a bit of a practical joker and, knowing Holmes' concern, might (30%) call even if the alarm is silent. Holmes' other neighbor, Mrs. Gibbons, is a well-known lush and often befuddled, but Holmes believes that she is four times more likely to call him if there is an alarm than not.

Holmes Example: Model
There are four binary random variables:
- B: whether Holmes' house has been burgled
- A: whether his alarm sounded
- W: whether Watson called
- G: whether Gibbons called
Network structure: B -> A, A -> W, A -> G

Holmes Example: Tables
P(B):    B=#t: 0.0001    B=#f: 0.9999
P(A|B):  B=#t: A=#t 0.95, A=#f 0.05    B=#f: A=#t 0.01, A=#f 0.99
P(W|A):  A=#t: W=#t 0.90, W=#f 0.10    A=#f: W=#t 0.30, W=#f 0.70
P(G|A):  A=#t: G=#t 0.40, G=#f 0.60    A=#f: G=#t 0.10, G=#f 0.90
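
The tables above fully determine the network, so queries can be answered by enumeration. A minimal sketch, using the slide's CPT values with Python's True/False standing in for #t/#f (the query P(B=#t | W=#t) is chosen for illustration):

```python
# Inference by enumeration on the Holmes network (B -> A -> {W, G}).
from itertools import product

P_B = {True: 0.0001, False: 0.9999}
P_A_given_B = {True: {True: 0.95, False: 0.05},     # P_A_given_B[b][a]
               False: {True: 0.01, False: 0.99}}
P_W_given_A = {True: {True: 0.90, False: 0.10},     # P_W_given_A[a][w]
               False: {True: 0.30, False: 0.70}}
P_G_given_A = {True: {True: 0.40, False: 0.60},     # P_G_given_A[a][g]
               False: {True: 0.10, False: 0.90}}

def joint(b, a, w, g):
    """Joint probability via the chain-rule factorization of the network."""
    return P_B[b] * P_A_given_B[b][a] * P_W_given_A[a][w] * P_G_given_A[a][g]

def posterior_burglary_given_watson():
    """P(B=#t | W=#t), summing out A and G."""
    num = sum(joint(True, a, True, g) for a, g in product([True, False], repeat=2))
    den = sum(joint(b, a, True, g) for b, a, g in product([True, False], repeat=3))
    return num / den
```

Even after Watson calls, the posterior probability of a burglary stays well under 1%: the 30% practical-joker rate swamps the tiny 10^-4 prior.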

Bayes' Nets: Markov Property
Bayes' nets satisfy the local Markov property: variables are conditionally independent of their non-descendants given their parents.

Simple Bayesian Network MCBN1
A: a priori only; B depends on A; C depends on A; D depends on B, C; E depends on C
P(A,B,C,D,E) = P(A) P(B|A) P(C|A) P(D|B,C) P(E|C)
There exist algorithms for training and inference on Bayesian networks.

Naïve Bayes Model
Bayes' net with conditional independence of the features given the class:
Y -> f_1, f_2, f_3, ..., f_k
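
Under this structure, classification multiplies the class prior by each feature's class-conditional probability. A minimal sketch with made-up priors and likelihoods (the class and feature names are illustrative, not from the slides):

```python
# Naive Bayes: pick argmax_y P(y) * prod_i P(f_i | y), in log space for stability.
import math

priors = {"spam": 0.4, "ham": 0.6}
likelihood = {                      # P(feature present | class)
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham":  {"free": 0.2, "meeting": 0.7},
}

def classify(features):
    """Return the class maximizing the log of the NB joint score."""
    scores = {}
    for y, prior in priors.items():
        scores[y] = math.log(prior) + sum(math.log(likelihood[y][f]) for f in features)
    return max(scores, key=scores.get)
```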

Hidden Markov Model
A Bayesian network where:
- y_t depends on y_{t-1}
- x_t depends on y_t
States y_1 -> y_2 -> y_3 -> ... -> y_k, each y_t emitting its observation x_t.

Generative Models
Both Naïve Bayes and HMMs are generative models.
"We use the term generative model to refer to a directed graphical model in which the outputs topologically precede the inputs, that is, no x in X can be a parent of an output y in Y." (Sutton & McCallum, 2006)
The state y generates an observation (instance) x.
Maximum Entropy and linear-chain Conditional Random Fields (CRFs) are, respectively, their discriminative model counterparts.
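
"State y generates an observation x" can be made concrete by sampling from an HMM: draw each state given the previous one, then draw an emission from that state. The toy POS-style parameters below are illustrative, not from the slides:

```python
# Sampling from an HMM illustrates the generative story of the model.
import random

random.seed(0)
states = ["N", "V"]
trans = {"<s>": {"N": 0.8, "V": 0.2},   # P(y_t | y_{t-1}); <s> is a start symbol
         "N": {"N": 0.4, "V": 0.6},
         "V": {"N": 0.7, "V": 0.3}}
emit = {"N": {"flies": 0.3, "time": 0.7},  # P(x_t | y_t)
        "V": {"flies": 0.6, "time": 0.4}}

def draw(dist):
    """Sample one outcome from a {outcome: prob} dict."""
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r < acc:
            return outcome
    return outcome  # guard against floating-point rounding

def sample(length):
    """Generate (states, observations) of the given length."""
    y, ys, xs = "<s>", [], []
    for _ in range(length):
        y = draw(trans[y])
        ys.append(y)
        xs.append(draw(emit[y]))
    return ys, xs
```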

Markov Random Fields
Also known as Markov networks:
- Graphical representation of a probabilistic model as an undirected graph
- Can represent cyclic dependencies (vs. the DAG of a Bayesian network, which can represent induced dependencies)
- Also satisfy the local Markov property: P(X | all other variables) = P(X | ne(X)), where ne(X) are the neighbors of X

Factorizing MRFs (example due to F. Xia)
Many MRFs can be analyzed in terms of cliques:
- Clique: in an undirected graph G(V,E), a subset of vertices V' of V such that for every pair of vertices v_i, v_j in V', the edge (v_i, v_j) is in E
- Maximal clique: a clique that cannot be extended
- Maximum clique: the largest clique in G
(The slide's example graph over A, B, C, D, E is not preserved in this text.)
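
The clique definition translates directly into a pairwise check. Since the slide's actual graph was not preserved, the edge set below is a made-up stand-in over the same vertex names:

```python
# Brute-force clique check over a small undirected graph (illustrative edge set).
from itertools import combinations

edges = {("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E"), ("C", "E")}

def connected(u, v):
    """Undirected adjacency: an edge may be stored in either order."""
    return (u, v) in edges or (v, u) in edges

def is_clique(vertices):
    """Every pair of vertices in the subset must share an edge."""
    return all(connected(u, v) for u, v in combinations(vertices, 2))
```

With this edge set, {A, B, C} and {C, D, E} are cliques (here also maximal), while {A, B, C, D} is not, since A and D are not adjacent.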

MRFs
Given an undirected graph G(V,E) with random variables X and cliques cl(G) over G, the distribution factorizes over the cliques:
P(X) = (1/Z) * product over c in cl(G) of Phi_c(X_c)
(example due to F. Xia)

Conditional Random Fields
Definition due to Lafferty et al., 2001: Let G = (V,E) be a graph such that Y = (Y_v), v in V, so that Y is indexed by the vertices of G. Then (X,Y) is a conditional random field in case, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w != v) = p(Y_v | X, Y_w, w ~ v), where w ~ v means that w and v are neighbors in G.
A CRF is a Markov random field globally conditioned on the observation X, and has the form:
p(y | x) = (1/Z(x)) * product over c in cl(G) of Phi_c(y_c, x)

Linear-Chain CRF
CRFs can have arbitrary graphical structure, but the most common form is the linear chain, which supports sequence modeling for many NLP sequence labeling problems: named entity recognition (NER), coreference.
Similar to combining HMM sequence structure with a MaxEnt model:
- Supports sequence structure like an HMM, but HMMs can't handle rich feature structure
- Supports rich, overlapping features like MaxEnt, but MaxEnt doesn't directly support sequence labeling

Discriminative & Generative
Model perspectives (Sutton & McCallum)

Linear-Chain CRFs
Feature functions:
- In MaxEnt: f: X x Y -> {0,1}, e.g. f_j(x,y) = 1 if x = "rifle" and y = talk.politics.guns, 0 otherwise
- In CRFs: f: Y x Y x X x T -> R, e.g. f_k(y_t, y_{t-1}, x, t) = 1 if y_t = V and y_{t-1} = N and x_t = "flies", 0 otherwise; frequently an indicator function, for efficiency
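
The two feature-function shapes above can be written out directly. These mirror the slide's examples; the function names themselves are illustrative:

```python
# A MaxEnt feature over (x, y), and a CRF feature over (y_t, y_prev, x, t).
def f_maxent(x, y):
    """1 if the document contains 'rifle' and the class is talk.politics.guns."""
    return 1 if "rifle" in x and y == "talk.politics.guns" else 0

def f_crf(y_t, y_prev, x, t):
    """1 if the current tag is V, the previous tag is N, and the word is 'flies'."""
    return 1 if y_t == "V" and y_prev == "N" and x[t] == "flies" else 0
```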

Linear-Chain CRFs
(The model equations on these slides are not preserved in this text.)
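
As a reconstruction of the lost equations: the standard linear-chain CRF distribution (Lafferty et al., 2001; Sutton & McCallum, 2006), in the notation of the feature-function slide, is

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y_t, y_{t-1}, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y'_t, y'_{t-1}, x, t) \Big)
```

where Z(x) normalizes over all candidate label sequences y' for the observation x.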

Linear-chain CRFs: Training & Decoding
Training: learn the weights λ_j; approach similar to MaxEnt, e.g. L-BFGS.
Decoding: compute the label sequence that optimizes P(y|x); can use approaches like those for HMMs, e.g. Viterbi.

Skip-Chain CRFs

Motivation
Long-distance dependencies:
- Linear-chain CRFs, HMMs, beam search, etc. all make local Markov assumptions: the preceding label, and the current data given the current label
- Good for some tasks; however, longer context can be useful
- E.g. in NER, repeated capitalized words should get the same tag

Skip-Chain CRFs
Basic approach: augment the linear-chain CRF model with long-distance "skip edges", adding evidence from both endpoints.
- Which edges? Identical words, words with the same stem?
- How many edges? Not too many: more edges increase inference cost.
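
One plausible edge-selection heuristic from the choices above (and the one the NER slides use) is to connect positions holding identical capitalized words. A sketch; the function name is illustrative:

```python
# Collect skip edges between positions with identical capitalized tokens.
from itertools import combinations

def skip_edges(tokens):
    """Return (i, j) index pairs, i < j, joining identical capitalized words."""
    edges = []
    for i, j in combinations(range(len(tokens)), 2):
        if tokens[i] == tokens[j] and tokens[i][:1].isupper():
            edges.append((i, j))
    return edges
```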

Skip-Chain CRF Model
Two clique templates:
- The standard linear-chain template
- A skip-edge template

Skip-Chain NER
Named entity recognition task: start time, end time, speaker, location, in a corpus of seminar announcement emails.
- All approaches: orthographic, gazetteer, and POS features within a window of the preceding and following 4 words
- Skip-chain CRFs: skip edges between identical capitalized words

NER Features
(The slide's feature table is not preserved in this text.)

Skip-Chain NER Results
Skip-chain CRFs improve substantially on "speaker" recognition, with a slight reduction in accuracy for times.

Summary
Conditional random fields (CRFs):
- Undirected graphical model; compare with Bayesian networks, Markov random fields
- Linear-chain models: HMM sequence structure + MaxEnt feature models
- Skip-chain models: augmented with longer-distance dependencies
- Pros: good performance
- Cons: compute intensive

HW #5

HW #5: Beam Search
Apply beam search to MaxEnt sequence decoding. Task: POS tagging.
Given files:
- test data: usual format
- boundary file: sentence lengths
- model file
Comparisons: different topN, topK, beam_width.
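
The core of the assignment can be sketched as follows: keep only the top `beam_width` partial tag sequences at each position. This is a generic sketch, not the assignment's required interface; `score_fn` stands in for the MaxEnt local log-probability:

```python
# Beam-search decoding for sequence labeling.
def beam_search(words, tags, score_fn, beam_width=2):
    """Return the highest-scoring tag sequence found within the beam."""
    beam = [([], 0.0)]                        # (partial tag sequence, cumulative score)
    for t in range(len(words)):
        candidates = []
        for seq, s in beam:
            for y in tags:
                candidates.append((seq + [y], s + score_fn(seq, y, t)))
        # Prune: keep only the best `beam_width` hypotheses.
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beam[0][0]

def toy_score(seq, y, t):
    """Illustrative scorer that favors the determiner-noun reading."""
    gold = ["D", "N"]
    return 0.0 if y == gold[t] else -1.0
```

Note the trade-off the comparisons probe: a wider beam explores more hypotheses (closer to exact Viterbi) at higher cost.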

Tag Context
Following Ratnaparkhi '96, the model uses the previous tag (prevT=tag) and the previous tag bigram (prevTwoTags=tag_{i-2}+tag_{i-1}). These are NOT in the data file; you compute them on the fly.
Notes:
- Due to sparseness, a bigram may not appear in the model file. Skip it.
- These are feature functions: for a different candidate tag on the same word, the weights will differ.
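
Computing the two context features on the fly might look like the sketch below. The prevT=/prevTwoTags= formats follow the slide; using "BOS" as the padding tag for sentence-initial positions is an assumption, not specified here:

```python
# Build Ratnaparkhi-style tag-context feature names during decoding.
def tag_context_features(prev_tags):
    """Return the two context feature names given the tags decoded so far."""
    t1 = prev_tags[-1] if len(prev_tags) >= 1 else "BOS"
    t2 = prev_tags[-2] if len(prev_tags) >= 2 else "BOS"
    return ["prevT=" + t1, "prevTwoTags=" + t2 + "+" + t1]
```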

Uncertainty
Real-world tasks are partially observable, stochastic, and extremely complex. Probabilities capture "ignorance & laziness":
- Ignorance: we lack relevant facts and conditions
- Laziness: we fail to enumerate all conditions and exceptions

Motivation
Uncertainty in medical diagnosis:
- Diseases produce symptoms; in diagnosis, observed symptoms => disease ID
- Uncertainties: symptoms may not occur; symptoms may not be reported; diagnostic tests are not perfect (false positives, false negatives)
How do we estimate confidence?
