Conditional Random Fields


1 Conditional Random Fields
Mark Stamp

2 Intro
Hidden Markov Models (HMMs) are used in bioinformatics, natural language processing, speech recognition, malware detection/analysis, and many, many other applications. Bottom line: HMMs are very useful. Everybody knows that!

3 Generic HMM
Recall that the A matrix drives a Markov process, which implies that Xi depends only on Xi-1. The B matrix holds the observation probabilities; note that the probability of Oi depends only on Xi.
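
To make the notation concrete, here is a minimal Python sketch (toy A, B, and π values chosen purely for illustration, not taken from the slides) of an HMM as λ = (A, B, π), used to generate a short observation sequence. Note how each state is sampled from the previous state alone, and each observation from the current state alone.

    import numpy as np

    rng = np.random.default_rng(1)

    A = np.array([[0.7, 0.3],        # A[i, j] = P(X_t = j | X_{t-1} = i)
                  [0.4, 0.6]])
    B = np.array([[0.1, 0.4, 0.5],   # B[i, k] = P(O_t = k | X_t = i)
                  [0.7, 0.2, 0.1]])
    pi = np.array([0.6, 0.4])        # initial state distribution

    def generate(T):
        """Generate (states, observations) of length T from the model (A, B, pi)."""
        x = rng.choice(2, p=pi)
        states, obs = [x], [rng.choice(3, p=B[x])]
        for _ in range(T - 1):
            x = rng.choice(2, p=A[x])           # X_t depends only on X_{t-1}
            states.append(x)
            obs.append(rng.choice(3, p=B[x]))   # O_t depends only on X_t
        return states, obs

    print(generate(5))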

4 HMM Limitations
Assumptions: the observation depends only on the current state, and the current state depends only on the previous state. This is a strong independence assumption, and often it is not realistic. An observation can depend on several states, and/or the current state might depend on several previous states.

5 HMMs
Within the HMM framework, we can increase N, the number of hidden states, and/or use a higher order Markov process. "Order 2" means the hidden state depends on the 2 immediately previous hidden states, and order > 1 loosens the independence constraint. More hidden states give more "breadth"; higher order gives increased "depth".

6 Beyond HMMs
HMMs do not fit some situations, for example, arbitrary dependencies among state transitions and/or observations. Here, we focus on a generalization of the HMM known as the Conditional Random Field (CRF). There are other generalizations, and we mention a few, staying mostly focused on the "big picture".

7 HMM Revisited
This slide illustrates the graph structure of an HMM; that is, an HMM is a directed line graph. Can other types of graphs work? Would they make sense?

8 MEMM
In an HMM, the observation sequence O is related to the states X via the B matrix, and O affects X in training, not in scoring. We might want X to depend on O in scoring. In a Maximum Entropy Markov Model (MEMM), the state Xi is a function of Xi-1 and Oi. The MEMM is focused on "problem 2", that is, determining the (hidden) states.

9 Generic MEMM
How does this differ from an HMM? The state Xi is a function of Xi-1 and Oi. We cannot generate Oi using the MEMM, while we can do so using an HMM. While an HMM can be used to generate observation sequences that fit a given model, your humble author is not aware of many applications where this feature is particularly useful…

10 MEMM vs HMM
HMM: find the "best" state sequence X, that is, solve HMM Problem 2. The solution is the X that maximizes P(X|O), which is proportional to Π P(Oi|Xi) Π P(Xi|Xi-1). MEMM: find the "best" state sequence X, i.e., the X that maximizes P(X|O) = Π P(Xi|Xi-1,Oi), where P(x|y,o) = 1/Z(o,y) exp(Σ wj fj(o,x)). Note the form of the MEMM probability function, which is very different from the HMM case. Also note that for the MEMM, the observation directly affects the probability P(X|O).
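
To make the MEMM probability concrete, here is a small Python sketch (the states, feature functions, and weights are made up for illustration; in a real MEMM the weights come from maximum entropy training and the features typically also look at the previous state). It computes P(x | y, o) = exp(Σ wj fj(o, x)) / Z(o, y) for one transition.

    import math

    states = ["H", "C"]

    def features(o, x):
        # Toy binary feature functions f_j(o, x); purely illustrative.
        return [
            1.0 if (o == "hot" and x == "H") else 0.0,
            1.0 if (o == "cold" and x == "C") else 0.0,
        ]

    weights = [1.5, 2.0]   # w_j, normally learned by training

    def memm_prob(x, y, o):
        """P(x | y, o) = exp(sum_j w_j f_j(o, x)) / Z(o, y)."""
        def score(xp):
            return math.exp(sum(w * f for w, f in zip(weights, features(o, xp))))
        z = sum(score(xp) for xp in states)   # Z: normalize over successor states
        return score(x) / z

    # Probability of moving to state H from previous state C, given observation "hot"
    print(memm_prob("H", "C", "hot"))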

11 MEMM vs HMM
Note the Σ wj fj(o,x) in the MEMM probability. This sum is over the entire sequence, so any useful feature of the input observation can affect the probability. The MEMM is more "general" in this sense, as compared to the HMM. But the MEMM creates a new problem, one that does not occur in the HMM.

12 Label Bias Problem
The MEMM uses dynamic programming (DP), also known as the Viterbi algorithm. The HMM (problem 2) does not use DP; the HMM α-pass uses a sum, while DP uses a max. In the MEMM, probability is "conserved": probability must be split among successor states (not so in the HMM). Is this good or bad?
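
A minimal sketch of the sum-vs-max distinction (same toy A, B, π values as the earlier HMM sketch; illustrative code, not taken from any particular text): the HMM α-pass sums over all state sequences, while the Viterbi/DP recursion keeps only the best path.

    import numpy as np

    A = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
    B = np.array([[0.1, 0.4, 0.5],
                  [0.7, 0.2, 0.1]])
    pi = np.array([0.6, 0.4])
    obs = [0, 1, 2, 1]                   # an example observation sequence

    def forward_sum(A, B, pi, obs):
        """HMM alpha pass: sums over paths, giving P(O | model)."""
        alpha = pi * B[:, obs[0]]
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]
        return alpha.sum()

    def viterbi_max(A, B, pi, obs):
        """DP / Viterbi: maximizes over paths, giving the best single-path probability."""
        delta = pi * B[:, obs[0]]
        for o in obs[1:]:
            delta = (delta[:, None] * A).max(axis=0) * B[:, o]
        return delta.max()

    print(forward_sum(A, B, pi, obs))    # sum over all state sequences
    print(viterbi_max(A, B, pi, obs))    # max over state sequences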

13 Label Bias Problem
If there is only one possible successor state in the MEMM, all probability is passed along to that state; in effect, the observation is ignored. More generally, if one successor dominates, the observation doesn't matter much. The CRF solves the label bias problem of the MEMM, so the observation matters. We won't go into the details here…

14 Label Bias Problem Example
Consider Hot (H), Cold (C), and Medium (M) states, where nearly all of the probability mass leaving M goes to H. [Diagram: state transition graph over H, M, and C, with the M-to-H transition nearly certain.] In the M state, the observation does little in the MEMM, while the observation can matter more in the HMM. In the CRF, the transition from M to H is taken (almost surely), but the observation at M could still affect the resulting probability. In contrast, in the MEMM, all of the probability that arrives at M must be passed along to its successor state, regardless of the observation at M.

15 Conditional Random Fields
CRFs are a generalization of HMMs, a generalization to other (undirected) graphs. The linear chain CRF is the simplest case, but the model also generalizes to arbitrary undirected graphs; that is, we can have arbitrary dependencies between states and observations.

16 Simplest Case of CRF
How is it different from the HMM/MEMM? More things can depend on each other. The case illustrated is a linear chain CRF; more general graph structures can also work (see the sketch below).
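
As a rough illustration of what a linear chain CRF computes, here is a brute-force sketch (states, feature functions, and weights are all made up for illustration): P(X|O) = exp(Σi Σj wj fj(xi-1, xi, O, i)) / Z(O), where Z(O) sums the same expression over every possible state sequence. Real implementations compute Z(O) with dynamic programming rather than by enumeration.

    import itertools, math

    states = ["H", "C"]

    def features(x_prev, x_curr, obs, i):
        # Toy feature functions f_j(x_{i-1}, x_i, O, i); purely illustrative.
        return [
            1.0 if x_curr == "H" and obs[i] == "hot" else 0.0,
            1.0 if (x_prev, x_curr) == ("H", "H") else 0.0,
        ]

    weights = [2.0, 0.5]   # w_j, normally learned from labeled data

    def score(x_seq, obs):
        """Unnormalized log-score: sum over sequence positions and features."""
        total = 0.0
        for i in range(len(obs)):
            x_prev = x_seq[i - 1] if i > 0 else None
            total += sum(w * f for w, f in zip(weights, features(x_prev, x_seq[i], obs, i)))
        return total

    def crf_prob(x_seq, obs):
        """P(X | O): normalize over all state sequences (brute force, for clarity only)."""
        z = sum(math.exp(score(xs, obs)) for xs in itertools.product(states, repeat=len(obs)))
        return math.exp(score(x_seq, obs)) / z

    obs = ["hot", "hot", "cold"]
    print(crf_prob(("H", "H", "C"), obs))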

17 Another View
Next, we consider a deeper connection between HMMs and CRFs. But first, we need some background: Naïve Bayes and logistic regression. These topics are very useful in their own right… …so wake up and pay attention!

18 What Are We Doing Here?
Recall that O is the observation and X is the state. Ideally, we want to model P(X,O), i.e., all possible interactions of the Xs and Os. But P(X,O) involves lots of parameters (like the complete covariance matrix), lots of data for "training", and too much work to train; generally, this problem is intractable. For example, we probably don't care too much about all of the possible interactions between the observations, so it would make sense not to expend a lot of effort trying to model these interactions.

19 What to Do?
Simplify, simplify, simplify… We need to make the problem tractable and then hope we get decent results. In Naïve Bayes, we assume independence; in regression analysis, we try to fit a specific function to the data. Eventually, we'll see that this is relevant with respect to HMMs and CRFs.

20 Naïve Bayes
Why is it "naïve"? We assume the features in X are independent. That is probably not true, but it simplifies things, and it often works well in practice. Why does independence simplify? Recall covariance: for X = (x1,…,xn) and Y = (y1,…,yn), if the means are 0, then Cov(X,Y) = (x1y1 + … + xnyn) / n.

21 Naïve Bayes
Independence implies that the covariance is 0. If so, only the diagonal elements of the covariance matrix are non-zero, and we only need means and variances, not the entire covariance matrix. That means far fewer parameters to estimate and a lot less data needed for training. Bottom line: a practical solution. Note that independence implies the covariance is 0, but a covariance of 0 does not, in general, imply independence.
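
A quick numerical illustration (a sketch with randomly generated data, not from the slides): when features are generated independently, the estimated off-diagonal covariances are near 0, so keeping only the diagonal, that is, the variances along with the means, loses little.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10000, 3))    # three independently generated features

    cov = np.cov(X, rowvar=False)      # full 3x3 sample covariance matrix
    print(np.round(cov, 3))            # off-diagonal entries are near 0

    diag_only = np.diag(np.diag(cov))  # what Naive Bayes effectively keeps
    print(np.round(diag_only, 3))      # just the variances (means are kept separately)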

22 Naïve Bayes
Why is it "Bayes"? Because it uses Bayes' Theorem. That is, P(A|B) = P(B|A) P(A) / P(B). More generally, P(Ai|B) = P(B|Ai) P(Ai) / Σj P(B|Aj) P(Aj), where the Aj form a partition.

23 Bayes Formula Example
Consider a test for an illegal drug. If you use the drug, the test is 98% positive (TPR = sensitivity); if you don't use it, the test is 99% negative (TNR = specificity). In the overall population, 5/1000 use the drug. Let A = uses the drug and B = tests positive. Then P(A|B) = P(B|A) P(A) / (P(B|A) P(A) + P(B|Ac) P(Ac)) = (.98 × .005) / (.98 × .005 + .01 × .995) ≈ .33 = 33%.
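
A quick check of this arithmetic (just a sketch to verify the 33% figure):

    p_b_given_a = 0.98       # P(positive | uses drug), the sensitivity
    p_b_given_not_a = 0.01   # P(positive | does not use) = 1 - specificity
    p_a = 0.005              # P(uses drug) = 5/1000

    p_a_given_b = (p_b_given_a * p_a) / (p_b_given_a * p_a + p_b_given_not_a * (1 - p_a))
    print(round(p_a_given_b, 2))       # 0.33, i.e., about 33%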

24 Naïve Bayes
Why is this relevant? Suppose we classify based on an observation O. Compute P(X|O) = P(O|X) P(X) / P(O), where X is one possible class (state) and P(O|X) is easy to compute. Repeat for all possible classes X; the biggest probability gives the most likely class X. We can ignore P(O) since it is constant across classes.
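
A minimal classifier along these lines (a sketch only: the data and labels are made up, and P(O|X) is modeled with independent Gaussians as one convenient choice). It picks the class that maximizes P(O|X) P(X), ignoring P(O) since it is the same for every class.

    import numpy as np

    # Toy labeled data (illustrative): two classes, two features
    X_train = np.array([[1.0, 2.1], [0.9, 1.9], [3.0, 3.9], [3.2, 4.1]])
    labels = np.array([0, 0, 1, 1])

    classes = np.unique(labels)
    means = {c: X_train[labels == c].mean(axis=0) for c in classes}
    variances = {c: X_train[labels == c].var(axis=0) + 1e-6 for c in classes}
    priors = {c: (labels == c).mean() for c in classes}

    def log_posterior(o, c):
        """log of P(O|X=c) P(X=c); P(O) is dropped since it is constant across classes."""
        m, v = means[c], variances[c]
        log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * v) + (o - m) ** 2 / v)
        return log_likelihood + np.log(priors[c])

    o_new = np.array([1.1, 2.0])
    print(max(classes, key=lambda c: log_posterior(o_new, c)))   # most likely class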

25 Regression Analysis
Generically, regression is a method for measuring the relationship between 2 or more things, e.g., house price vs size. First, we consider linear regression, since it's the simplest case. Then we consider logistic regression, which is more complicated but often more useful, and is used for binary classifiers.

26 Linear Regression
Suppose x is house square footage and y is sale price (x could be a vector of observations instead). The points in the plot represent recent sales results. How do we use this info, given a house to sell, or given a recent sale?

27 Linear Regression
[Plot: recent sales data with a best-fit line.] The blue line is the "best fit": minimum squared error, perpendicular distance, linear least squares. What good is it? Given a new point, how well does it fit in? Given x, predict y. This sounds familiar…
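
A minimal least squares sketch (the house sizes and prices are invented for illustration): fit a line to the (x, y) points and use it to predict y for a new x.

    import numpy as np

    # Toy data: house size (sq ft) vs sale price
    x = np.array([1000, 1500, 2000, 2500, 3000])
    y = np.array([200_000, 260_000, 330_000, 390_000, 450_000])

    slope, intercept = np.polyfit(x, y, deg=1)   # least squares fit of a line

    x_new = 2200
    print(slope * x_new + intercept)             # predicted price for a 2200 sq ft house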

28 Regression Analysis
In many problems, there are only 2 outcomes, i.e., a binary classifier: for example, malware vs benign, or "malware of a specific type" vs "other". Then x is an observation (vector), but each y is either 0 or 1. Linear regression is not so good here (why?). A better idea is logistic regression: fit a logistic function instead of a line.

29 Binary Classification
Suppose we compute a score for many files, with the score on the x-axis and the output on the y-axis: 1 if the file is malware, 0 if the file is "other". Linear regression is not very useful here.

30 Binary Classification
Instead of a line, use a function better suited to 0/1 data: the logistic function, where the transition from 0 to 1 is more abrupt than for a line. Why is this better? Less of the range is wasted between 0 and 1.

31 Logistic Regression
The logistic function is F(t) = 1 / (1 + e^-t). Its input ranges from -∞ to ∞ and its output from 0 to 1, so it can be interpreted as a probability P(t). Here, t = b0 + b1 x, or t = b0 + b1 x1 + … + bm xm; i.e., x is the observation.
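
A small sketch of the logistic function and the score t = b0 + b1 x1 + … + bm xm (the coefficient values below are made up; in practice they come from training):

    import math

    def logistic(t):
        """F(t) = 1 / (1 + e^-t): maps any real t to a value in (0, 1)."""
        return 1.0 / (1.0 + math.exp(-t))

    b = [0.5, 1.2, -0.7]     # b0, b1, b2 (illustrative coefficients)
    x = [2.0, 1.5]           # an observation vector (x1, x2)

    t = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
    print(logistic(t))       # a value in (0, 1), interpreted as a probability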

32 Logistic Regression
Instead of fitting a line to the data, we fit a logistic function, and instead of least squares error, we measure "deviance", the distance from the ideal case (where the ideal is the "saturated model"). An iterative process finds the parameters, i.e., the best-fit F(t) for the data points. Training is more complex than in the linear case, but the result is better suited to binary classification. In fact, finding the parameters is much more complex than the linear least squares algorithm used in linear regression.

33 Conditional Probability
Recall that we would like to model P(X,O). Observe that P(X,O) includes all relationships between the Xs and Os: too complex, too many parameters, too… So we settle for P(X|O): a lot fewer parameters, the problem is tractable, and it works well in practice.

34 Generative vs Discriminative
We are interested in P(X|O). Generative models focus on P(O|X) P(X), as in Naïve Bayes (without the denominator). Discriminative models focus directly on P(X|O), like logistic regression. What are the tradeoffs?

35 Generative vs Discriminative
Naïve Bayes is a generative model, since it uses P(O|X) P(X); generative models are good in the unsupervised case, i.e., with unlabeled data. Logistic regression is discriminative: it deals directly with P(X|O), so there is no need to expend effort modeling O, which leaves more freedom to model X. The unsupervised discriminative case is an "active area of research". In principle, there are fewer parameters of concern in the discriminative case, but we have efficient algorithms in the generative case, so perhaps the tradeoff here is "advantages in theory" vs "efficiency in practice".

36 HMM and Naïve Bayes
What are the connection(s) between NB and the HMM? Recall HMM problem 2: for a given O, find the "best" (hidden) state sequence X. We use P(X|O) to determine the best X, and the alpha pass is used in solving problem 2. Looking closely at the alpha pass, it is based on computing P(O|X) P(X), with probabilities from the model λ. Note that the "alpha pass" is usually known as the forward algorithm and the "beta pass" is the backward algorithm.

37 HMM and Naïve Bayes
What are the connection(s) between NB and the HMM? The HMM can be viewed as a sequential version of Naïve Bayes: classifications over a series of observations, where the HMM also uses information about state transitions. Conversely, Naïve Bayes is a "static" version of the HMM. Bottom line: the HMM is a generative model.
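
To see the "sequential Naïve Bayes" view concretely, here is a sketch (same toy A, B, π values as before; not from the slides) that scores one fixed state sequence X against an observation sequence O as P(X, O) = P(X) P(O|X), a product of transition probabilities and per-position emission probabilities.

    import numpy as np

    A = np.array([[0.7, 0.3], [0.4, 0.6]])            # P(X_i | X_{i-1})
    B = np.array([[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]])  # P(O_i | X_i)
    pi = np.array([0.6, 0.4])                         # P(X_1)

    def joint_prob(states, obs):
        """P(X, O) for one state sequence: product of transition and emission terms."""
        p = pi[states[0]] * B[states[0], obs[0]]
        for i in range(1, len(obs)):
            p *= A[states[i - 1], states[i]] * B[states[i], obs[i]]
        return p

    print(joint_prob([0, 0, 1], [0, 1, 2]))   # like Naive Bayes, but chained over states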

38 CRF and Logistic Regression
What is the connection between the CRF and regression? The linear chain CRF is a sequential version of logistic regression: classification over a series of observations, where the CRF also uses information about state transitions. Conversely, logistic regression can be viewed as a static (linear chain) CRF. Bottom line: the CRF is a discriminative model.

39 Generative vs Discriminative
Naïve Bayes and logistic regression form a "generative-discriminative pair". The HMM and the (linear chain) CRF form another generative-discriminative pair, the sequential versions of the pair above. Are there other such pairs? Yes, based on further generalizations. What's more general than sequential?

40 General CRF
A CRF can be defined on any (undirected) graph structure, not just a linear chain. In a general CRF, training and scoring are not as efficient, so the linear chain CRF is used most in practice. In special cases, it might be worth considering a more general CRF; determining such a structure is very problem-specific.

41 Generative Directed Model
We can view the HMM as defined on a (directed) line graph, and we could consider a similar process on more general (directed) graph structures. This more general case is known as a "generative directed model". The algorithms (training, scoring, etc.) are not as efficient in the more general case.

42 Generative-Discriminative Pair
The generative directed model is, as the name implies, a generative model, while the general CRF is a discriminative model. So, this gives us a 3rd generative-discriminative pair. A summary is on the next slide…

43 Generative-Discriminative Pairs
[Table: the three generative-discriminative pairs: Naïve Bayes and logistic regression; HMM and linear chain CRF; generative directed model and general CRF.]

44 HCRF
Yes, you guessed it: the Hidden Conditional Random Field. So, what is hidden? To be continued…

45 Algorithms
Where are the algorithms? This is a CS class, after all… Yes, CRF algorithms do exist, but they are omitted here, since a lot of background is needed and it would take too long to cover it all; we've got better things to do. So, just use existing implementations. It's your lucky day…

46 References
E. Chen, Introduction to conditional random fields
Y. Ko, Maximum entropy Markov models and conditional random fields
A. Quattoni, Tutorial on conditional random fields for sequence prediction
The blog by E. Chen is the easiest to read of these references. The slides by Y. Ko are also fairly readable and good, especially with respect to the algorithms. Many other sources can be found online, but most (including some of the references listed here) are challenging to read (to put it mildly…).

47 References
C. Sutton and A. McCallum, An introduction to conditional random fields, Foundations and Trends in Machine Learning, 4(4), 2011
H.M. Wallach, Conditional random fields: An introduction, 2004

