
1 Machine Learning for Sequential Data
Thomas G. Dietterich
Department of Computer Science, Oregon State University, Corvallis, Oregon 97331
http://www.cs.orst.edu/~tgd

2 Outline • Sequential Supervised Learning • Research Issues • Methods for Sequential Supervised Learning • Concluding Remarks

3 Some Example Learning Problems • Cellular Telephone Fraud • Part-of-speech Tagging • Information Extraction from the Web • Hyphenation for Word Processing

4 Cellular Telephone Fraud • Given the sequence of recent telephone calls, can we determine which calls (if any) are fraudulent?

5 Part-of-Speech Tagging • Given an English sentence, can we assign a part of speech to each word? • "Do you want fries with that?"

6 Information Extraction from the Web
Example seminar announcement: "Srinivasan Seshan (Carnegie Mellon University) Making Virtual Worlds Real. Tuesday, June 4, 2002, 2:00 PM, 322 Sieg. Research Seminar"
Token labels: name = "Srinivasan Seshan"; affiliation = "Carnegie Mellon University"; title = "Making Virtual Worlds Real"; date = "Tuesday, June 4, 2002"; time = "2:00 PM"; location = "322 Sieg"; event-type = "Research Seminar"; all other tokens labeled *

7 Hyphenation • "porcupine" → "001010000" (por-cu-pine: a 1 marks a letter after which a hyphen is allowed)

8 Sequential Supervised Learning (SSL) • Given: A set of training examples of the form (X_i, Y_i), where X_i = ⟨x_{i,1}, …, x_{i,T_i}⟩ and Y_i = ⟨y_{i,1}, …, y_{i,T_i}⟩ are sequences of length T_i • Find: A function f for predicting new sequences: Y = f(X)
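A minimal sketch of how SSL training data can be represented in code; the names (ssl_examples, PredictFn) are illustrative, not from the talk. Each example pairs an input sequence with an equally long label sequence, and the learned f must return one label per input element.

```python
from typing import Callable, List, Sequence, Tuple

# One SSL example: (X_i, Y_i), two sequences of the same length T_i.
Example = Tuple[Sequence[str], Sequence[str]]

ssl_examples: List[Example] = [
    (["Do", "you", "want", "fries", "with", "that"],
     ["verb", "pron", "verb", "noun", "prep", "pron"]),
]

# The learner must produce f: X -> Y with len(f(X)) == len(X).
PredictFn = Callable[[Sequence[str]], List[str]]
```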

9 Examples as Sequential Supervised Learning
Domain                 | Input X_i           | Output Y_i
Telephone Fraud        | sequence of calls   | sequence of labels {ok, fraud}
Part-of-speech Tagging | sequence of words   | sequence of parts of speech
Information Extraction | sequence of tokens  | sequence of field labels {name, …}
Hyphenation            | sequence of letters | sequence of {0,1} (1 = hyphen ok)

10 Two Kinds of Relationships • Relationships between the x_t's and y_t's (example: "Friday" is usually a "date") • Relationships among the y_t's (example: "name" is usually followed by "affiliation") • SSL can (and should) exploit both kinds of information

11 Two Other Tasks that are Not SSL • Sequence Classification • Time-Series Prediction

12 Sequence Classification • Given an input sequence, assign one label to the entire sequence • Example: Recognize a person from their handwriting. Input sequence: sequence of pen strokes. Output label: name of the person

13 Time-Series Prediction • Given: A sequence ⟨y_1, …, y_t⟩, predict y_{t+1} • Example: Predict the unemployment rate for next month based on the history of unemployment rates

14 Key Differences • In SSL, there is one label y_{i,t} for each input x_{i,t} • In SSL, we are given the entire X sequence before we need to predict any of the y_t values • In SSL, we do not have any of the true y values when we predict y_{t+1}

15 Outline • Sequential Supervised Learning • Research Issues • Methods for Sequential Supervised Learning • Concluding Remarks

16 Research Issues for SSL • Loss Functions: How do we measure performance? • Feature Selection and Long-distance Interactions: How do we model relationships among the y_t's, especially long-distance effects? • Computational Cost: How do we make it efficient?

17 Basic Loss Functions • Count the number of entire sequences Y_i correctly predicted (i.e., every y_{i,t} must be right) • Count the number of individual labels y_{i,t} correctly predicted

18 More Complex Loss Functions • Telephone fraud example: the input is a sequence of phone calls, and the true labels switch from 0 (ok) to 1 (fraud) partway through the sequence: 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 • Loss is computed from where the first "fraudulent" prediction occurs
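One way such a loss could be written down, as a hedged sketch of my own rather than the talk's exact definition: charge one unit for every truly fraudulent call that goes by before the first "fraud" prediction, plus a fixed penalty for an alarm raised before any real fraud.

```python
def fraud_loss(true_labels, predicted_labels, false_alarm_cost=5.0):
    """Illustrative delay-based loss; labels are 0 (ok) and 1 (fraud)."""
    if 1 in predicted_labels:
        first_alarm = predicted_labels.index(1)
        if 1 not in true_labels[: first_alarm + 1]:
            return false_alarm_cost              # alarm before any real fraud
        first_fraud = true_labels.index(1)
        return max(0, first_alarm - first_fraud)  # fraudulent calls missed before detection
    return float(sum(true_labels))               # never detected: all fraud missed
```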

19 More Complex Loss Functions (2) • Hyphenation: false positives are very bad; need at least one correct hyphen near the middle of the word

20 Hyphenation Loss • Perfect: "qual-i-fi-ca-tion" • Very good: "quali-fi-cation" • OK: "quali-fication", "qualifi-cation" • Worse: "qual-ification", "qualifica-tion" • Very bad: "qua-lification", "qualificatio-n"

21 Feature Selection and Long-Distance Effects • Any solution to SSL must employ some form of divide-and-conquer • How do we determine the information relevant for predicting y_t?

22 Long Distance Effects • Consider the text-to-speech problem: "photograph" → / f-Ot@graf- /, "photography" → / f-@tAgr@f-i / • The letter "y" changes the pronunciation of all vowels in the word!

23 Standard Feature Selection Methods • Wrapper method with forward selection or backward elimination • Optimize feature weights • Measures of feature influence • Fit simple models to test for relevance

24 Wrapper Methods • Stepwise regression • Wrapper methods (Kohavi et al.) • Problem: Very inefficient with large numbers of possible features

25 Optimizing the Feature Weights • Start with all features in the model • Encourage the learning algorithm to remove irrelevant features • Problem: There are too many possible features; we can't include them all in the model

26 Measures of Feature Influence • Importance of single features: mutual information, correlation • Importance of feature subsets: schema racing (Moore et al.), RELIEFF (Kononenko et al.) • Question: Will subset methods scale to thousands of features?
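For the single-feature measures mentioned above, here is a small sketch of scoring one discrete feature by its empirical mutual information with the label. The code is illustrative, not from the talk; inputs are parallel lists of feature values and labels.

```python
import numpy as np

def mutual_information(feature_values, labels):
    """Empirical mutual information (in nats) between a discrete feature and the label."""
    xs, ys = np.asarray(feature_values), np.asarray(labels)
    mi = 0.0
    for x in np.unique(xs):
        for y in np.unique(ys):
            p_xy = np.mean((xs == x) & (ys == y))   # joint frequency
            p_x, p_y = np.mean(xs == x), np.mean(ys == y)
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi
```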

27 Fitting Simple Models • Fit simple models using all of the features, then analyze the resulting model to determine feature importance: belief networks and Markov blanket analysis, L_1 Support Vector Machines • Prediction: These will be the most practical methods

28 Outline • Sequential Supervised Learning • Research Issues • Methods for Sequential Supervised Learning • Concluding Remarks

29 Methods for Sequential Supervised Learning • Sliding Windows • Recurrent Sliding Windows • Hidden Markov Models and company: Maximum Entropy Markov Models, Input-Output HMMs, Conditional Random Fields

30 Sliding Windows
Sentence (padded): ___ Do you want fries with that ___
• ___ Do you → verb
• Do you want → pron
• you want fries → verb
• want fries with → noun
• fries with that → prep
• with that ___ → pron
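A sketch of how the window construction above can be implemented. The half-width of 1 and the "___" padding follow the example on the slide; the function name and tuple encoding are illustrative.

```python
def make_windows(tokens, labels, half_width=1, pad="___"):
    """Return one (window, label) training pair per sequence position."""
    padded = [pad] * half_width + list(tokens) + [pad] * half_width
    examples = []
    for t, label in enumerate(labels):
        window = tuple(padded[t:t + 2 * half_width + 1])  # tokens around position t
        examples.append((window, label))
    return examples

pairs = make_windows(
    ["Do", "you", "want", "fries", "with", "that"],
    ["verb", "pron", "verb", "noun", "prep", "pron"],
)
# pairs[0] == (("___", "Do", "you"), "verb")
```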

31 Properties of Sliding Windows • Converts SSL to ordinary supervised learning • Only captures the relationship between (part of) X and y_t; does not explicitly model relations among the y_t's • Does not capture long-distance interactions • Assumes each window is independent

32 Recurrent Sliding Windows
Sentence (padded): ___ Do you want fries with that ___
• ___ Do you, ___ → verb
• Do you want, verb → pron
• you want fries, pron → verb
• want fries with, verb → noun
• fries with that, noun → prep
• with that ___, prep → pron

33 Recurrent Sliding Windows • Key Idea: Include y_t as an input feature when computing y_{t+1} • During training: Use the correct value of y_t, or train iteratively (especially recurrent neural networks) • During evaluation: Use the predicted value of y_t
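A sketch of evaluation-time prediction with a recurrent sliding window, left to right, where the previous predicted label is appended to each window. The classifier and its predict_one method are an assumed interface standing in for any trained base learner.

```python
def predict_recurrent(tokens, classifier, half_width=1, pad="___"):
    """Predict labels left to right, feeding each prediction into the next window."""
    padded = [pad] * half_width + list(tokens) + [pad] * half_width
    predictions = []
    prev_label = pad  # no earlier prediction at the first position
    for t in range(len(tokens)):
        window = tuple(padded[t:t + 2 * half_width + 1]) + (prev_label,)
        prev_label = classifier.predict_one(window)  # assumed method on the base learner
        predictions.append(prev_label)
    return predictions
```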

34 Properties of Recurrent Sliding Windows • Captures relationships among the y's, but only in one direction! • Results on text-to-speech (percent correct):
Method          | Direction  | Words | Letters
sliding window  | none       | 12.5% | 69.6%
recurrent s. w. | left-right | 17.0% | 67.9%
recurrent s. w. | right-left | 24.4% | 74.2%

35 Hidden Markov Models
[Graphical model: a label chain y_1 → y_2 → y_3, with each y_t emitting its observation x_t]
• y_t's are generated as a Markov chain • x_t's are generated independently (as in naïve Bayes or Gaussian classifiers)

36 Hidden Markov Models (2) • Models both the x_t ↔ y_t relationships and the y_t ↔ y_{t+1} relationships • Does not handle long-distance effects: everything must be captured by the current label y_t • Does not permit rich X ↔ y_t relationships: unlike the sliding window, we can't use several x_t's to predict y_t

37 Using HMMs • Training: Extremely simple, because the y_t's are known on the training set • Execution: Dynamic programming methods. If the loss function depends on the whole sequence, use the Viterbi algorithm: argmax_Y P(Y | X). If the loss function depends on individual y_t predictions, use the forward-backward algorithm: argmax_{y_t} P(y_t | X)
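A minimal Viterbi sketch for a trained HMM with K labels, assuming the initial-state vector pi, transition matrix trans, and emission tables emit[k][x] = P(x | y = k) are already estimated. The names and data layout are illustrative; the talk does not prescribe an implementation.

```python
import numpy as np

def viterbi(xs, pi, trans, emit):
    """Return the label sequence maximizing P(Y | X) for observation list xs."""
    K, T = len(pi), len(xs)
    delta = np.zeros((T, K))            # best log-score of a path ending in label k at t
    back = np.zeros((T, K), dtype=int)  # backpointers for path recovery
    delta[0] = np.log(pi) + np.log([emit[k][xs[0]] for k in range(K)])
    for t in range(1, T):
        for k in range(K):
            scores = delta[t - 1] + np.log(trans[:, k])
            back[t, k] = int(np.argmax(scores))
            delta[t, k] = scores[back[t, k]] + np.log(emit[k][xs[t]])
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                   # argmax_Y P(Y | X)
```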

38 HMM Alternatives: Maximum Entropy Markov Models
[Graphical model: labels form a chain y_1 → y_2 → y_3, and each y_t is additionally conditioned on the observed input x_t]

39 MEMM Properties • Permits complex X ↔ y_t relationships by employing a sparse maximum entropy model of P(y_{t+1} | X, y_t): P(y_{t+1} | X, y_t) ∝ exp(Σ_b λ_b f_b(X, y_t, y_{t+1})), where f_b is a boolean feature • Training can be expensive (gradient descent or iterative scaling)
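A sketch of the MEMM's local distribution as defined above: the score of each candidate next label is a weighted sum of boolean features, normalized over the label set with a stabilized softmax. The feature functions, weights, and the position argument t are assumed inputs for illustration.

```python
import numpy as np

def memm_local_probs(X, t, y_prev, labels, features, weights):
    """P(y_{t+1} | X, y_t) proportional to exp(sum_b lambda_b * f_b(X, y_t, y_{t+1}))."""
    scores = np.array([
        sum(w * f(X, t, y_prev, y_next) for f, w in zip(features, weights))
        for y_next in labels
    ])
    scores = np.exp(scores - scores.max())  # subtract max for numerical stability
    return scores / scores.sum()            # one probability per candidate label
```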

40 HMM Alternatives (2): Input/Output HMM
[Graphical model: a chain of hidden states h_1 → h_2 → h_3; each h_t depends on the input x_t and produces the output label y_t]

41 IOHMM Properties • Hidden states permit "memory" of long-distance effects (beyond what is captured by the class labels) • As with the MEMM, arbitrary features of the input X can be used to predict y_t

42 Label Bias Problem • Forward models that are normalized at each step exhibit a problem • Consider a domain with only two sequences: "rib" → "111" and "rob" → "222" • Consider what happens when an MEMM sees the sequence "rib"

43 Label Bias Problem (2) • After "r", both labels 1 and 2 have the same probability • After "i", label 2 must still send all of its probability forward, even though it was expecting "o" • Result: both output strings "111" and "222" have the same probability
[Diagram: two label paths from the start, 1-1-1 reading "r i b" and 2-2-2 reading "r o b"]
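The same argument can be checked numerically. This toy sketch is my own construction, not code from the talk: the locally normalized next-label distributions are read off the two training pairs, and because each state must pass all of its probability mass forward, "rib" scores the labelings "111" and "222" identically.

```python
start = {"1": 0.5, "2": 0.5}  # after seeing 'r', both labels tie
# P(next label | current label, next letter); in training, label 1 is only
# ever followed by label 1, and label 2 only by label 2, regardless of letter.
local = {
    ("1", "i"): {"1": 1.0}, ("1", "o"): {"1": 1.0}, ("1", "b"): {"1": 1.0},
    ("2", "i"): {"2": 1.0}, ("2", "o"): {"2": 1.0}, ("2", "b"): {"2": 1.0},
}

def sequence_prob(word, labels):
    """Product of locally normalized label transitions along the word."""
    p = start[labels[0]]
    for letter, prev, cur in zip(word[1:], labels, labels[1:]):
        p *= local[(prev, letter)].get(cur, 0.0)
    return p

print(sequence_prob("rib", "111"))  # 0.5
print(sequence_prob("rib", "222"))  # 0.5  -- same score: the label bias problem
```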

44 Conditional Random Fields
[Graphical model: undirected edges y_1 - y_2 - y_3 among the labels, all conditioned on the inputs x_1, x_2, x_3]
• The y_t's form a Markov Random Field conditioned on X

45 Representing the CRF parameters • Each undirected arc y_t ↔ y_{t+1} represents a potential function: M(y_t, y_{t+1} | X) = exp[Σ_a λ_a f_a(y_t, y_{t+1}, X) + Σ_b μ_b g_b(y_t, X)], where f_a and g_b are arbitrary boolean features

46 Using CRFs
P(Y | X) ∝ M(y_1, y_2 | X) · M(y_2, y_3 | X) · … · M(y_{T-1}, y_T | X)
• Training: Gradient descent or iterative scaling
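A sketch of scoring one label sequence with these potentials: multiply M(y_t, y_{t+1} | X) along the chain and normalize by the sum over all label sequences, which a chain of matrix products computes. The potential(t, X) interface returning a K x K matrix is an assumption for illustration; start and stop potentials are omitted, as on the slide.

```python
import numpy as np

def crf_prob(Y, X, potential, K):
    """P(Y | X) for a chain CRF; Y is a list of integer label indices in [0, K)."""
    T = len(Y)
    score = 1.0
    for t in range(T - 1):
        score *= potential(t, X)[Y[t], Y[t + 1]]  # M(y_t, y_{t+1} | X)
    # Partition function Z(X): sum over all label paths via matrix products.
    alpha = np.ones(K)
    for t in range(T - 1):
        alpha = alpha @ potential(t, X)
    return score / alpha.sum()
```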

47 CRFs on Part-of-Speech Tagging (error rates in percent; Lafferty, McCallum & Pereira, 2001)
                        | HMM   | MEMM  | CRF
baseline                | 5.69  | 6.37  | 5.55
spelling features       | 5.69  | 4.87  | 4.27
spelling features (OOV) | 45.99 | 26.99 | 23.76

48 Summary of Methods
Issue             | SW  | RSW    | HMM | MEMM | IOHMM | CRF
x_t ↔ y_t         | YES | YES    | YES | YES  | YES   | YES
y_t ↔ y_{t+1}     | NO  | Partly | YES | YES  | YES   | YES
long dist?        | NO  | Partly | NO  | NO   | YES?  | NO
X ↔ y_t rich?     | YES | YES    | NO  | YES  | YES   | YES
efficient?        | YES | YES    | YES | YES? | NO    | NO
label bias ok?    | YES | YES    | YES | NO   | NO    | YES

49 Loss Functions and Training • Kakade, Teh & Roweis (2002) show that if the loss function depends only on errors of y_t, then MEMMs, IOHMMs, and CRFs should be trained to maximize the likelihood P(y_t | X) instead of P(Y | X) or P(X, Y)

50 Concluding Remarks • Many applications of pattern recognition can be formalized as Sequential Supervised Learning • Many methods have been developed specifically for SSL, but none is perfect • Similar issues arise in other complex learning problems (e.g., spatial and relational data)

