1 Conditional Random Fields. Advanced Statistical Methods in NLP, Ling 572, February 9, 2012.

2 Roadmap: graphical models (modeling independence; models revisited); generative & discriminative models; conditional random fields (linear-chain models, skip-chain models).

3-6 Preview. Conditional random fields: an undirected graphical model, due to Lafferty, McCallum, and Pereira (2001). A discriminative model; supports integration of rich feature sets. Allows a range of dependency structures (linear-chain, skip-chain, general) and can encode long-distance dependencies. Used in diverse NLP sequence labeling tasks: named entity recognition, coreference resolution, etc.

7 Graphical Models

8-11 Graphical Models. A graphical model is a simple, graphical notation for conditional independence: a probabilistic model where the graph structure denotes conditional independence between random variables. Nodes: random variables. Edges: dependency relations between random variables. Model types: Bayesian networks, Markov random fields.

12-15 Modeling (In)dependence. Bayesian network: a directed acyclic graph (DAG). Nodes = random variables. An arc means "directly influences" (conditional dependency): a child depends on its parent(s). No incoming arcs = independent (only a priori). Parents of X = pi(X); for each X we need P(X | pi(X)).

16-18 Example I (figure from Russell & Norvig, AIMA).

19-23 Simple Bayesian Network MCBN1 (nodes A, B, C, D, E). A = only a priori; B depends on A; C depends on A; D depends on B, C; E depends on C. Need: P(A), P(B|A), P(C|A), P(D|B,C), P(E|C). Truth table sizes: 2, 2*2, 2*2, 2*2*2, 2*2.

24 Holmes Example (Pearl). Holmes is worried that his house will be burgled. For the time period of interest, there is a 10^-4 a priori chance of this happening, and Holmes has installed a burglar alarm to try to forestall this event. The alarm is 95% reliable in sounding when a burglary happens, but also has a false positive rate of 1%. Holmes' neighbor, Watson, is 90% sure to call Holmes at his office if the alarm sounds, but he is also a bit of a practical joker and, knowing Holmes' concern, might (30%) call even if the alarm is silent. Holmes' other neighbor Mrs. Gibbons is a well-known lush and often befuddled, but Holmes believes that she is four times more likely to call him if there is an alarm than not.

25-29 Holmes Example: Model. There are four binary random variables: B: whether Holmes' house has been burgled; A: whether his alarm sounded; W: whether Watson called; G: whether Gibbons called. (Graph: B -> A, A -> W, A -> G.)

30 Holmes Example: Tables.
P(B): B=#t 0.0001, B=#f 0.9999.
P(A|B): B=#t: A=#t 0.95, A=#f 0.05; B=#f: A=#t 0.01, A=#f 0.99.
P(W|A): A=#t: W=#t 0.90, W=#f 0.10; A=#f: W=#t 0.30, W=#f 0.70.
P(G|A): A=#t: G=#t 0.40, G=#f 0.60; A=#f: G=#t 0.10, G=#f 0.90.
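As a concrete illustration (not part of the original slides), here is a minimal sketch that encodes these four CPTs and answers a query such as P(B | W=true) by brute-force enumeration over the hidden variables. The CPT numbers come from the table above; the query and helper names are assumptions for illustration only.

```python
# Sketch: brute-force inference in the Holmes Bayesian network (B -> A -> {W, G}).
# CPT values are taken from the slide; the query/helper code is illustrative only.
from itertools import product

P_B = {True: 0.0001, False: 0.9999}
P_A_given_B = {True: {True: 0.95, False: 0.05},
               False: {True: 0.01, False: 0.99}}
P_W_given_A = {True: {True: 0.90, False: 0.10},
               False: {True: 0.30, False: 0.70}}
P_G_given_A = {True: {True: 0.40, False: 0.60},
               False: {True: 0.10, False: 0.90}}

def joint(b, a, w, g):
    """P(B=b, A=a, W=w, G=g) = P(b) P(a|b) P(w|a) P(g|a)."""
    return P_B[b] * P_A_given_B[b][a] * P_W_given_A[a][w] * P_G_given_A[a][g]

def prob_burglary_given_watson(w=True):
    """P(B=true | W=w), summing out A and G."""
    num = sum(joint(True, a, w, g) for a, g in product([True, False], repeat=2))
    den = sum(joint(b, a, w, g) for b, a, g in product([True, False], repeat=3))
    return num / den

if __name__ == "__main__":
    print(prob_burglary_given_watson(True))  # small: Watson alone is weak evidence
```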

31-33 Bayes' Nets: Markov Property. Bayes' nets satisfy the local Markov property: each variable is conditionally independent of its non-descendants given its parents.

34-38 Simple Bayesian Network MCBN1 (nodes A, B, C, D, E). A = only a priori; B depends on A; C depends on A; D depends on B, C; E depends on C. P(A,B,C,D,E) = P(A) P(B|A) P(C|A) P(D|B,C) P(E|C). There exist algorithms for training and inference on BNs.
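A minimal sketch of this factorization for MCBN1 follows. Only the structure comes from the slide; the probability values below are invented purely for illustration.

```python
# Sketch: the MCBN1 factorization P(A,B,C,D,E) = P(A) P(B|A) P(C|A) P(D|B,C) P(E|C).
# All probability values here are invented; only the dependency structure is from the slide.
from itertools import product

P_A = {True: 0.3, False: 0.7}
P_B_given_A = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}
P_C_given_A = {True: {True: 0.5, False: 0.5}, False: {True: 0.2, False: 0.8}}
P_D_given_BC = {(True, True): {True: 0.9, False: 0.1},
                (True, False): {True: 0.6, False: 0.4},
                (False, True): {True: 0.4, False: 0.6},
                (False, False): {True: 0.05, False: 0.95}}
P_E_given_C = {True: {True: 0.7, False: 0.3}, False: {True: 0.1, False: 0.9}}

def joint(a, b, c, d, e):
    """P(A,B,C,D,E) as the product of the five local tables."""
    return (P_A[a] * P_B_given_A[a][b] * P_C_given_A[a][c]
            * P_D_given_BC[(b, c)][d] * P_E_given_C[c][e])

# Sanity check: the joint sums to 1 over all 2**5 assignments.
assert abs(sum(joint(*v) for v in product([True, False], repeat=5)) - 1.0) < 1e-9
```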

39-41 Naïve Bayes Model. A Bayes' net encoding conditional independence of the features given the class: the class node Y is the parent of feature nodes f_1, f_2, ..., f_k.
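A minimal classification sketch corresponding to this structure: the prediction is argmax_y P(y) * prod_i P(f_i | y). The toy class priors and likelihood tables below are assumptions for illustration, not from the slides.

```python
# Sketch: Naive Bayes decision rule y* = argmax_y P(y) * prod_i P(f_i | y).
# The toy class priors and feature likelihoods are invented for illustration.
import math

prior = {"politics": 0.5, "sports": 0.5}
likelihood = {                      # P(feature | class), one table per class
    "politics": {"rifle": 0.02, "vote": 0.05, "game": 0.001},
    "sports":   {"rifle": 0.001, "vote": 0.002, "game": 0.06},
}

def classify(features):
    """Pick the class maximizing the log joint: log P(y) + sum_i log P(f_i | y)."""
    best, best_score = None, float("-inf")
    for y, p_y in prior.items():
        score = math.log(p_y) + sum(math.log(likelihood[y].get(f, 1e-8)) for f in features)
        if score > best_score:
            best, best_score = y, score
    return best

print(classify(["rifle", "vote"]))  # -> "politics" with these toy numbers
```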

42-47 Hidden Markov Model. A Bayesian network where y_t depends on y_{t-1} and x_t depends on y_t: a chain of state nodes y_1, y_2, ..., y_k, each emitting an observation x_1, x_2, ..., x_k.
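The joint probability this graph encodes is p(x, y) = prod_t p(y_t | y_{t-1}) p(x_t | y_t). A minimal sketch, with transition and emission values invented for illustration:

```python
# Sketch: HMM joint probability p(x, y) = prod_t p(y_t | y_{t-1}) * p(x_t | y_t).
# Transition and emission values are invented for illustration.
start = {"N": 0.6, "V": 0.4}                                     # p(y_1)
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}   # p(y_t | y_{t-1})
emit = {"N": {"flies": 0.2, "time": 0.5},
        "V": {"flies": 0.4, "time": 0.1}}                        # p(x_t | y_t)

def hmm_joint(states, words):
    p = start[states[0]] * emit[states[0]][words[0]]
    for t in range(1, len(states)):
        p *= trans[states[t - 1]][states[t]] * emit[states[t]][words[t]]
    return p

print(hmm_joint(["N", "V"], ["time", "flies"]))  # 0.6 * 0.5 * 0.7 * 0.4
```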

48-50 Generative Models. Both Naïve Bayes and HMMs are generative models. "We use the term generative model to refer to a directed graphical model in which the outputs topologically precede the inputs, that is, no x in X can be a parent of an output y in Y." (Sutton & McCallum, 2006). The state y generates an observation (instance) x. Maximum Entropy and linear-chain Conditional Random Fields (CRFs) are, respectively, their discriminative model counterparts.

51-52 Markov Random Fields (aka Markov networks). Graphical representation of a probabilistic model using an undirected graph. Can represent cyclic dependencies (vs. the DAG of a Bayesian network, which can represent induced dependencies). Also satisfy a local Markov property: P(X | all other variables) = P(X | ne(X)), where ne(X) are the neighbors of X.

53-55 Factorizing MRFs. Many MRFs can be analyzed in terms of cliques. Clique: in an undirected graph G(V,E), a clique is a subset of vertices such that for every pair of vertices v_i, v_j in the subset, the edge (v_i, v_j) is in E. A maximal clique cannot be extended; a maximum clique is the largest clique in G. [Example graph over nodes A, B, C, D, E due to F. Xia.]

56-58 MRFs. Given an undirected graph G(V,E) over random variables X, and the cliques over G, cl(G), the joint distribution is defined as a product of potentials over the cliques. [Example graph due to F. Xia.]
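The factorization itself appeared as an image on the original slide; the standard form, reconstructed here rather than copied from the transcript, is:

```latex
p(x) = \frac{1}{Z} \prod_{c \in cl(G)} \phi_c(x_c),
\qquad
Z = \sum_{x'} \prod_{c \in cl(G)} \phi_c(x'_c)
```

where each phi_c is a non-negative potential function over the variables in clique c and Z is the normalization constant (partition function).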

59-60 Conditional Random Fields. Definition due to Lafferty et al., 2001: Let G = (V,E) be a graph such that Y = (Y_v) for v in V, so that Y is indexed by the vertices of G. Then (X,Y) is a conditional random field in case, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w != v) = p(Y_v | X, Y_w, w ~ v), where w ~ v means that w and v are neighbors in G. A CRF is a Markov random field globally conditioned on the observation X, and has the form shown below.
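The form referred to on the slide appeared as an image; a standard reconstruction (following Lafferty et al., 2001 and Sutton & McCallum, 2006, not copied from the transcript) is:

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})} \prod_{c \in cl(G)} \phi_c(\mathbf{y}_c, \mathbf{x}),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \prod_{c \in cl(G)} \phi_c(\mathbf{y}'_c, \mathbf{x})
```

In the linear-chain case the potentials are exponentiated weighted feature sums, giving

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\Big( \sum_{t} \sum_{k} \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}, t) \Big)
```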

61-64 Linear-Chain CRF. CRFs can have arbitrary graphical structure, but the most common form is the linear chain, which supports sequence modeling for many NLP sequence labeling problems: named entity recognition (NER), coreference, etc. It is similar to combining HMM-style sequence structure with a MaxEnt model: it supports sequence structure like an HMM (but HMMs can't handle rich feature structure) and supports rich, overlapping features like MaxEnt (but MaxEnt doesn't directly support sequence labeling).

65 Discriminative & Generative: model perspectives (figure from Sutton & McCallum).

66-68 Linear-Chain CRFs: Feature Functions. In MaxEnt: f: X x Y -> {0,1}, e.g. f_j(x,y) = 1 if x = "rifle" and y = talk.politics.guns, 0 otherwise. In CRFs: f: Y x Y x X x T -> R, e.g. f_k(y_t, y_{t-1}, x, t) = 1 if y_t = V and y_{t-1} = N and x_t = "flies", 0 otherwise; frequently an indicator function, for efficiency.
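A minimal sketch of what such feature functions look like in code. The particular features are the slide's examples; the function names and the scoring helper are hypothetical, added here for illustration.

```python
# Sketch: MaxEnt vs. CRF feature functions as Python callables.
# Both are indicator (0/1) features; names and framing are illustrative only.
def f_maxent_rifle_guns(x, y):
    """MaxEnt-style feature over a whole instance: f: X x Y -> {0,1}."""
    return 1.0 if "rifle" in x and y == "talk.politics.guns" else 0.0

def f_crf_verb_after_noun_flies(y_t, y_prev, x, t):
    """Linear-chain CRF feature: f: Y x Y x X x T -> R (here an indicator)."""
    return 1.0 if y_t == "V" and y_prev == "N" and x[t] == "flies" else 0.0

def sequence_score(features, weights, x, y):
    """Unnormalized linear-chain score: sum over positions of weighted features."""
    return sum(w * f(y[t], y[t - 1] if t > 0 else "<s>", x, t)
               for t in range(len(x))
               for f, w in zip(features, weights))

print(sequence_score([f_crf_verb_after_noun_flies], [1.5],
                     ["time", "flies"], ["N", "V"]))  # 1.5
```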

71-73 Linear-Chain CRFs: Training & Decoding. Training: learn the weights λ_j; the approach is similar to MaxEnt, e.g. L-BFGS. Decoding: compute the label sequence that optimizes P(y|x); can use approaches like those for HMMs, e.g. Viterbi (see the sketch below).
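A minimal Viterbi decoding sketch for a linear-chain CRF, assuming a per-position factor score(y_prev, y_t, x, t) = sum_k lambda_k f_k(y_t, y_prev, x, t) is already available. Everything here is illustrative, not the course implementation.

```python
# Sketch: Viterbi decoding for a linear-chain CRF.
# `score(y_prev, y, x, t)` is assumed to return the log-space factor
# sum_k lambda_k * f_k(y, y_prev, x, t); the toy score below is for illustration.
def viterbi(x, labels, score, start="<s>"):
    n = len(x)
    delta = [{y: score(start, y, x, 0) for y in labels}]  # best score ending in y at t
    backptr = [{}]
    for t in range(1, n):
        delta.append({})
        backptr.append({})
        for y in labels:
            best_prev = max(labels, key=lambda yp: delta[t - 1][yp] + score(yp, y, x, t))
            delta[t][y] = delta[t - 1][best_prev] + score(best_prev, y, x, t)
            backptr[t][y] = best_prev
    # Follow back-pointers from the best final label.
    y_last = max(labels, key=lambda y: delta[n - 1][y])
    path = [y_last]
    for t in range(n - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    return list(reversed(path))

# Toy usage with an arbitrary scoring function.
toy_score = lambda yp, y, x, t: 1.0 if (y == "V") == (x[t] == "flies") else 0.0
print(viterbi(["time", "flies"], ["N", "V"], toy_score))  # ['N', 'V']
```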

74 Skip-Chain CRFs

75-78 Motivation. Long-distance dependencies: linear-chain CRFs, HMMs, beam search, etc. all make very local Markov assumptions (preceding label; current data given current label), which is good for some tasks. However, longer context can be useful, e.g. in NER: repeated capitalized words should get the same tag.

79-83 Skip-Chain CRFs. Basic approach: augment the linear-chain CRF model with long-distance 'skip edges' and add evidence from both endpoints. Which edges? Identical words, words with the same stem? How many edges? Not too many: more edges increase inference cost.

84-87 Skip-Chain CRF Model. Two clique templates: the standard linear-chain template and the skip-edge template (factorization shown below).
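The factorization appeared as images on the original slides; a standard reconstruction (following Sutton & McCallum's skip-chain formulation, not copied from the transcript) is:

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, \mathbf{x})
    \prod_{(u,v) \in \mathcal{I}} \Phi_{uv}(y_u, y_v, \mathbf{x})
```

where I is the set of skip-edge endpoint pairs, Psi_t are the linear-chain factors, and Phi_uv are the skip-edge factors, each an exponentiated weighted sum of feature functions.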

88-90 Skip-Chain NER. Named entity recognition task: start time, end time, speaker, location, in a corpus of seminar announcement emails. All approaches use orthographic, gazetteer, and POS features within a window of the preceding and following 4 words. Skip-chain CRFs add skip edges between identical capitalized words.
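A minimal sketch of how such skip edges could be collected for one document. This is a hypothetical helper for illustration, not the edge-selection code used in the cited experiments.

```python
# Sketch: connect repeated capitalized tokens with skip edges, as in skip-chain NER.
def skip_edges(tokens):
    """Return (i, j) pairs linking later occurrences of a capitalized word to its first one."""
    first_seen = {}
    edges = []
    for i, tok in enumerate(tokens):
        if tok[:1].isupper():
            if tok in first_seen:
                edges.append((first_seen[tok], i))
            else:
                first_seen[tok] = i
    return edges

tokens = "Speaker : Jane Doe . Jane will talk at 3 pm".split()
print(skip_edges(tokens))  # [(2, 5)] linking the two occurrences of "Jane"
```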

91 NER Features

92 Skip-Chain NER Results. Skip chain improves substantially on 'speaker' recognition, with a slight reduction in accuracy for times.

93-97 Summary. Conditional random fields (CRFs): undirected graphical models (compare with Bayesian networks and Markov random fields). Linear-chain models: HMM sequence structure + MaxEnt feature models. Skip-chain models: augment with longer-distance dependencies. Pros: good performance. Cons: compute intensive.

98 HW #5

99 HW #5: Beam Search. Apply beam search to MaxEnt sequence decoding. Task: POS tagging. Given files: test data (usual format), boundary file (sentence lengths), model file. Comparisons: different topN, topK, beam_width settings.

100 Tag Context. Following Ratnaparkhi '96, the model uses the previous tag (prevT=tag) and the previous tag bigram (prevTwoTags=tag_{i-2}+tag_{i-1}). These are NOT in the data file; you compute them on the fly (see the sketch below). Notes: due to sparseness, a bigram may not appear in the model file; skip it. These are feature functions: for a different candidate tag on the same word, the weights will differ.
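A minimal sketch of constructing these two context features for one beam-search hypothesis. The feature names prevT= and prevTwoTags= follow the slide; the helper function and its handling of missing features are assumptions, not the assignment's reference code.

```python
# Sketch: build the Ratnaparkhi-style tag-context features for one hypothesis
# during beam search, skipping features absent from the model (sparseness).
def tag_context_features(prev_tags, model_features):
    """prev_tags: tags already assigned to the preceding words (may be shorter at the
    start of a sentence); model_features: set of feature names present in the model."""
    feats = []
    if len(prev_tags) >= 1:
        feats.append("prevT=" + prev_tags[-1])
    if len(prev_tags) >= 2:
        feats.append("prevTwoTags=" + prev_tags[-2] + "+" + prev_tags[-1])
    # A bigram feature may not appear in the model file; skip unknown features.
    return [f for f in feats if f in model_features]

model_feats = {"prevT=DT", "prevT=NN", "prevTwoTags=DT+NN"}
print(tag_context_features(["DT", "NN"], model_feats))
# -> ['prevT=NN', 'prevTwoTags=DT+NN']
```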

101 Uncertainty. Real-world tasks are partially observable, stochastic, and extremely complex. Probabilities capture "ignorance & laziness": we lack relevant facts and conditions, and we fail to enumerate all conditions and exceptions.

102 Motivation. Uncertainty in medical diagnosis: diseases produce symptoms; in diagnosis, observed symptoms => disease identification. Uncertainties: symptoms may not occur, symptoms may not be reported, and diagnostic tests are not perfect (false positives, false negatives). How do we estimate confidence?

