
Slide 1: CS 446: Machine Learning. Dan Roth, University of Illinois, Urbana-Champaign. danr@illinois.edu, http://L2R.cs.uiuc.edu/~danr, 3322 SC.

Slide 2: CS446: Policies.
- Cheating: No. We take it very seriously.
- Homework: collaboration is encouraged, but you have to write your own solution/program. (Please don't use old solutions.)
- Late policy: you have a credit of 4 days (4*24 hours); that's it.
- Grading: possibly separate for grads/undergrads. 5% class work; 25% homework; 30% midterm; 40% final. Projects: 25% (4 hours).
- Questions?
See the Info page; note also the Schedule page and our Notes on the main page.

Slide 3: CS446 Team.
- Dan Roth (3323 Siebel): Tuesday, 2:00 PM - 3:00 PM.
- TAs (4407 Siebel):
  - Kai-Wei Chang: Tuesday 5:00pm-6:00pm (3333 SC)
  - Chen-Tse Tsai: Tuesday 8:00pm-9:00pm (3333 SC)
  - Bryan Lunt: Tuesday 8:00pm-9:00pm (1121 SC)
  - Rongda Zhu: Thursday 4:00pm-5:00pm (2205 SC)
- Discussion sections (starting next week; no Monday section this week):
  - Mondays 5:00pm-6:00pm, 3405 SC, Bryan Lunt [A-F]
  - Tuesdays 5:00pm-6:00pm, 3401 SC, Chen-Tse Tsai [G-L]
  - Wednesdays 5:30pm-6:30pm, 3405 SC, Kai-Wei Chang [M-S]
  - Thursdays 5:00pm-6:00pm, 3403 SC, Rongda Zhu [T-Z]

Slide 4: CS446 on the web.
- Check our class website for the schedule, slides, videos, and policies: http://l2r.cs.uiuc.edu/~danr/Teaching/CS446-14/index.html
- Sign up and participate in our Piazza forum for announcements and discussions: https://piazza.com/class#fall2014/cs446
- Log on to Compass to submit assignments and get your grades: https://compass2g.illinois.edu
Registration to class. Homework: no need to submit Hw0; later assignments are submitted electronically.

Slide 5: What is Learning? The Badges Game. This is an example of the key learning protocol: supervised learning. First question: are you sure you got it right? Why? Issues:
- Prediction or modeling?
- Representation
- Problem setting
- Background knowledge
- When did learning take place?
- Algorithm

Slide 6: Supervised Learning. We consider systems that apply a function f() to input items x and return an output y = f(x). Diagram: Input x ∈ X (an item x drawn from an input space X) → System y = f(x) → Output y ∈ Y (an item y drawn from an output space Y).

Slide 7: Supervised Learning. In (supervised) machine learning, we deal with systems whose f(x) is learned from examples. (Same diagram: Input x ∈ X → System y = f(x) → Output y ∈ Y.)

Slide 8: Why use learning? We typically use machine learning when the function f(x) we want the system to apply is too complex to program by hand.

Slide 9: Supervised learning. Diagram: Input x ∈ X (an item x drawn from an instance space X) → Learned Model y = g(x) → Output y ∈ Y (an item y drawn from a label space Y). Target function: y = f(x).

Slide 10: Supervised learning: Training. Give the learner examples in D_train; the learner returns a model g(x). Labeled training data D_train: (x_1, y_1), (x_2, y_2), …, (x_N, y_N) → Learning Algorithm → learned model g(x). Can you suggest other learning protocols?

Slide 11: Supervised learning: Testing. Reserve some labeled data for testing. Labeled test data D_test: (x'_1, y'_1), (x'_2, y'_2), …, (x'_M, y'_M).

Slide 12: Supervised learning: Testing. Split the labeled test data D_test into raw test data X_test = x'_1, x'_2, …, x'_M and test labels Y_test = y'_1, y'_2, …, y'_M.

Slide 13: Supervised learning: Testing. Apply the learned model g(x) to the raw test data X_test to obtain predicted labels g(X_test) = g(x'_1), g(x'_2), …, g(x'_M). Can you use the test data otherwise? Can you suggest other evaluation protocols?

Slide 14: Supervised learning: Testing. Evaluate the model by comparing the predicted labels g(X_test) against the test labels Y_test.
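
A minimal sketch of the train/test protocol on slides 10-14, assuming a toy placeholder learner (a memorize-or-majority rule, not an algorithm from the course) and made-up Boolean data; only the interface (learn on D_train, evaluate g on D_test) mirrors the slides.

```python
# Sketch: supervised training followed by evaluation on held-out labeled data.
from collections import Counter

def learn(d_train):
    """Return a model g(x) learned from labeled examples (x, y)."""
    memory = dict(d_train)                        # remember every training pair
    default = Counter(y for _, y in d_train).most_common(1)[0][0]
    return lambda x: memory.get(x, default)       # fall back to the majority label

def evaluate(g, d_test):
    """Fraction of test examples on which the prediction g(x') matches y'."""
    return sum(g(x) == y for x, y in d_test) / len(d_test)

d_train = [((0, 0, 1, 0), 0), ((0, 0, 1, 1), 1), ((1, 0, 0, 1), 1)]
d_test  = [((0, 1, 0, 1), 0), ((1, 0, 1, 1), 1)]

g = learn(d_train)
print("test accuracy:", evaluate(g, d_test))
```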

Slide 15: The Badges game.

Slide 16: What is Learning? The Badges Game (repeat of slide 5).

Slide 17: The Badges game. Attendees at the 1994 Machine Learning conference were given name badges labeled with + or −. What function was used to assign these labels? Examples: + Naoki Abe, − Eric Baum.

Slide 18: Training data. + Naoki Abe, − Myriam Abramson, + David W. Aha, + Kamal M. Ali, − Eric Allender, + Dana Angluin, − Chidanand Apte, + Minoru Asada, + Lars Asker, + Javed Aslam, + Jose L. Balcazar, − Cristina Baroglio, + Peter Bartlett, − Eric Baum, + Welton Becket, − Shai Ben-David, + George Berg, + Neil Berkman, + Malini Bhandaru, + Bir Bhanu, + Reinhard Blasig, − Avrim Blum, − Anselm Blumer, + Justin Boyan, + Carla E. Brodley, + Nader Bshouty, − Wray Buntine, − Andrey Burago, + Tom Bylander, + Bill Byrne, − Claire Cardie, + John Case, + Jason Catlett, − Philip Chan, − Zhixiang Chen, − Chris Darken.

Slide 19: Raw test data. Gerald F. DeJong, Chris Drummond, Yolanda Gil, Attilio Giordana, Jiarong Hong, J. R. Quinlan, Priscilla Rasmussen, Dan Roth, Yoram Singer, Lyle H. Ungar.

Slide 20: Labeled test data. + Gerald F. DeJong, − Chris Drummond, + Yolanda Gil, − Attilio Giordana, + Jiarong Hong, − J. R. Quinlan, − Priscilla Rasmussen, + Dan Roth, + Yoram Singer, − Lyle H. Ungar.

Slide 21: Course Overview.
- Introduction: basic problems and questions.
- A detailed example: linear threshold units.
- Two basic paradigms: PAC (risk minimization); Bayesian theory.
- Learning protocols: online/batch; supervised/unsupervised/semi-supervised.
- Algorithms: decision trees (C4.5); [rules and ILP (Ripper, Foil)]; linear threshold units (Winnow, Perceptron; Boosting; SVMs; kernels); neural networks; probabilistic representations (naïve Bayes, Bayesian trees, density estimation); unsupervised/semi-supervised: EM, clustering, dimensionality reduction.

Slide 22: Supervised Learning. Given: examples (x, f(x)) of some unknown function f. Find: a good approximation of f.
- x provides some representation of the input. The process of mapping a domain element into a representation is called feature extraction (hard; ill-understood; important). x ∈ {0,1}^n or x ∈ R^n.
- The target function (label): f(x) ∈ {−1, +1} is binary classification; f(x) ∈ {1, 2, 3, …, k−1} is multi-class classification; f(x) ∈ R is regression.

Slide 23: Supervised Learning: Examples.
- Disease diagnosis: x = properties of a patient (symptoms, lab tests); f = the disease (or maybe: the recommended therapy).
- Part-of-speech tagging: x = an English sentence (e.g., "The can will rust"); f = the part of speech of a word in the sentence.
- Face recognition: x = bitmap picture of a person's face; f = the name of the person (or maybe: a property of the person).
- Automatic steering: x = bitmap picture of the road surface in front of the car; f = degrees to turn the steering wheel.
Many problems that do not seem like classification problems can be decomposed into classification problems, e.g., semantic role labeling.

Slide 24: Key Issues in Machine Learning.
- Modeling: how do we formulate application problems as machine learning problems? How do we represent the data? Learning protocols (where are the data and labels coming from?).
- Representation: what are good hypothesis spaces? Is there any rigorous way to find these? Any general approach?
- Algorithms: what are good algorithms? How do we define success? Generalization vs. overfitting. The computational problem.

Slide 25: Using supervised learning.
- What is our instance space? (Gloss: what kind of features are we using?)
- What is our label space? (Gloss: what kind of learning task are we dealing with?)
- What is our hypothesis space? (Gloss: what kind of model are we learning?)
- What learning algorithm do we use? (Gloss: how do we learn the model from the labeled data?)
- What is our loss function/evaluation metric? (Gloss: how do we measure success?)

Slide 26: 1. The instance space X. Designing an appropriate instance space X is crucial for how well we can predict y. (Diagram: Input x ∈ X → Learned Model y = g(x) → Output y ∈ Y.)

Slide 27: 1. The instance space X. When we apply machine learning to a task, we first need to define the instance space X. Instances x ∈ X are defined by features:
- Boolean features: does this email contain the word 'money'?
- Numerical features: how often does 'money' occur in this email? What is the width/height of this bounding box?

Slide 28: What's X for the Badges game? Possible features: gender/age/country of the person? Length of their first or last name? Does the name contain the letter 'x'? How many vowels does their name contain? Is the n-th letter a vowel?

Slide 29: X as a vector space. X is an N-dimensional vector space (e.g., R^N); each dimension is one feature. Each x is a feature vector (hence the boldface x). Think of x = [x_1 … x_N] as a point in X. [Figure: a point plotted against axes x_1 and x_2.]

Slide 30: From feature templates to vectors. When designing features, we often think in terms of templates, not individual features. Template "What is the 2nd letter?": Naoki (2nd letter 'a') → [1 0 0 0 …], Abe (2nd letter 'b') → [0 1 0 0 …], Scrooge (2nd letter 'c') → [0 0 1 0 …]. Template "What is the i-th letter?": Abe → [1 0 0 0 0 … 0 1 0 0 0 0 … 0 0 0 0 1 …], with 26*2 positions in each group, and the number of groups an upper bound on the length of names.
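
A sketch of the "What is the i-th letter?" template as a binary vector. The MAX_LEN bound and the case-insensitive 26-letter alphabet per position are assumptions for illustration (the slide suggests 26*2 positions per group, e.g., both cases); only the template idea itself comes from the slide.

```python
# Sketch: one group of letter indicators per character position of the name.
import string

MAX_LEN = 10                       # assumed upper bound on name length
ALPHABET = string.ascii_lowercase  # one indicator per letter, per position

def letter_template(name):
    """Map a name to a binary vector of length MAX_LEN * len(ALPHABET)."""
    vec = [0] * (MAX_LEN * len(ALPHABET))
    for i, ch in enumerate(name.lower()[:MAX_LEN]):
        if ch in ALPHABET:
            vec[i * len(ALPHABET) + ALPHABET.index(ch)] = 1
    return vec

x = letter_template("Abe")
print(sum(x), "active features out of", len(x))   # 3 active indicators
```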

Slide 31: Good features are essential. The choice of features is crucial for how well a task can be learned. In many application areas (language, vision, etc.), a lot of work goes into designing suitable features; this requires domain expertise. CS446 can't teach you what specific features to use for your task, but we will touch on some general principles.

Slide 32: 2. The label space Y. The label space Y determines what kind of supervised learning task we are dealing with. (Diagram: Input x ∈ X → Learned Model y = g(x) → Output y ∈ Y.)

Slide 33: Supervised learning tasks I. Output labels y ∈ Y are categorical:
- Binary classification: two possible labels.
- Multiclass classification: k possible labels.
These are the focus of CS446. But output labels y ∈ Y may also be structured objects (sequences of labels, parse trees, etc.): structure learning (CS546, next semester).

Slide 34: Supervised learning tasks II. Output labels y ∈ Y are numerical:
- Regression (linear/polynomial): labels are continuous-valued; learn a linear/polynomial function f(x).
- Ranking: labels are ordinal; learn an ordering f(x1) > f(x2) over the input.

Slide 35: 3. The model g(x). We need to choose what kind of model we want to learn. (Diagram: Input x ∈ X → Learned Model y = g(x) → Output y ∈ Y.)

Slide 36: A Learning Problem. An unknown function y = f(x1, x2, x3, x4) of four Boolean inputs. Can you learn this function? What is it?
Example | x1 x2 x3 x4 | y
   1    |  0  0  1  0 | 0
   2    |  0  1  0  0 | 0
   3    |  0  0  1  1 | 1
   4    |  1  0  0  1 | 1
   5    |  0  1  1  0 | 0
   6    |  1  1  0  0 | 0
   7    |  0  1  0  1 | 0

Slide 37: Hypothesis Space. Complete ignorance: there are 2^16 = 65536 possible Boolean functions over four input features. We can't figure out which one is correct until we've seen every possible input-output pair. After seven examples we still have 2^9 possibilities for f. Is learning possible? The table lists all 16 possible inputs; the seven observed ones carry their labels, the other nine are unknown ('?'):
x1 x2 x3 x4 | y
 0  0  0  0 | ?
 0  0  0  1 | ?
 0  0  1  0 | 0
 0  0  1  1 | 1
 0  1  0  0 | 0
 0  1  0  1 | 0
 0  1  1  0 | 0
 0  1  1  1 | ?
 1  0  0  0 | ?
 1  0  0  1 | 1
 1  0  1  0 | ?
 1  0  1  1 | ?
 1  1  0  0 | 0
 1  1  0  1 | ?
 1  1  1  0 | ?
 1  1  1  1 | ?
In general, there are |Y|^|X| possible functions f(x) from the instance space X to the label space Y. Learners typically consider only a subset of these functions, called the hypothesis space H.

Slide 38: Hypothesis Space (2). Simple rules: there are only 16 simple conjunctive rules of the form y = x_i ∧ x_j ∧ x_k. No simple conjunctive rule explains the data, and the same is true for simple clauses (disjunctions). Each candidate rule is ruled out by a counterexample from the training data:
Rule                    | Counterexample
y = c                   | (the data contains both labels)
y = x1                  | 1100 (label 0)
y = x2                  | 0100 (label 0)
y = x3                  | 0110 (label 0)
y = x4                  | 0101 (label 0)
y = x1 ∧ x2             | 1100 (label 0)
y = x1 ∧ x3             | 0011 (label 1)
y = x1 ∧ x4             | 0011 (label 1)
y = x2 ∧ x3             | 0011 (label 1)
y = x2 ∧ x4             | 0011 (label 1)
y = x3 ∧ x4             | 1001 (label 1)
y = x1 ∧ x2 ∧ x3        | 0011 (label 1)
y = x1 ∧ x2 ∧ x4        | 0011 (label 1)
y = x1 ∧ x3 ∧ x4        | 0011 (label 1)
y = x2 ∧ x3 ∧ x4        | 0011 (label 1)
y = x1 ∧ x2 ∧ x3 ∧ x4   | 0011 (label 1)

Slide 39: Hypothesis Space (3). m-of-n rules: there are 32 possible rules of the form "y = 1 if and only if at least m of the following n variables are 1". Checking each against the training data (table entries give the index of a contradicting training example; '-' means no such rule), a consistent hypothesis is found: 2-of-{x1, x3, x4}.
Variables         | 1-of | 2-of | 3-of | 4-of
{x1}              |  3   |  -   |  -   |  -
{x2}              |  2   |  -   |  -   |  -
{x3}              |  1   |  -   |  -   |  -
{x4}              |  7   |  -   |  -   |  -
{x1, x2}          |  2   |  3   |  -   |  -
{x1, x3}          |  1   |  3   |  -   |  -
{x1, x4}          |  6   |  3   |  -   |  -
{x2, x3}          |  2   |  3   |  -   |  -
{x2, x4}          |  2   |  3   |  -   |  -
{x3, x4}          |  1   |  4   |  -   |  -
{x1, x2, x3}      |  1   |  3   |  3   |  -
{x1, x2, x4}      |  2   |  3   |  3   |  -
{x1, x3, x4}      |  1   | consistent | 3 | -
{x2, x3, x4}      |  1   |  5   |  3   |  -
{x1, x2, x3, x4}  |  1   |  5   |  3   |  3
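
A sketch that brute-forces slide 39's search: it enumerates all 32 "at least m of n" rules over the four variables and tests each against the seven training examples from slide 36. The code itself is illustrative (the slides do not give an algorithm for this); it should report 2-of-{x1, x3, x4} as the consistent hypothesis.

```python
# Sketch: enumerate m-of-n rules and keep the ones consistent with the data.
from itertools import combinations

# (x1, x2, x3, x4) -> y, examples 1..7 from slide 36
data = [((0, 0, 1, 0), 0), ((0, 1, 0, 0), 0), ((0, 0, 1, 1), 1),
        ((1, 0, 0, 1), 1), ((0, 1, 1, 0), 0), ((1, 1, 0, 0), 0),
        ((0, 1, 0, 1), 0)]

def consistent(vars_idx, m):
    """True if 'at least m of the chosen variables are 1' matches every label."""
    return all((sum(x[i] for i in vars_idx) >= m) == y for x, y in data)

for n in range(1, 5):
    for vars_idx in combinations(range(4), n):
        for m in range(1, n + 1):                 # 32 rules in total
            if consistent(vars_idx, m):
                names = ", ".join(f"x{i + 1}" for i in vars_idx)
                print(f"consistent: {m}-of-{{{names}}}")
```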

Slide 40: Views of Learning.
- Learning is the removal of our remaining uncertainty: if we knew that the unknown function was an m-of-n Boolean function, we could use the training data to infer which function it is.
- Learning requires guessing a good, small hypothesis class: we can start with a very small class and enlarge it until it contains a hypothesis that fits the data.
- We could be wrong! Our prior knowledge might be wrong: y = x4 ∧ one-of(x1, x3) is also consistent. Our guess of the hypothesis class could be wrong. If this is the unknown function, then we will make errors when we are given new examples and asked to predict the value of the function.

Slide 41: General strategies for Machine Learning.
- Develop representation languages for expressing concepts. They serve to limit the expressivity of the target models. E.g., functional representations (m-of-n), grammars, stochastic models.
- Develop flexible hypothesis spaces: nested collections of hypotheses (decision trees, neural networks); get flexibility by augmenting the feature space.
- In either case: develop algorithms for finding a hypothesis in our hypothesis space that fits the data, and hope that they will generalize well.

Slide 42: Terminology.
- Target function (concept): the true function f: X → {…labels…}.
- Concept: a Boolean function. Examples for which f(x) = 1 are positive examples; those for which f(x) = 0 are negative examples (instances).
- Hypothesis: a proposed function h, believed to be similar to f; the output of our learning algorithm.
- Hypothesis space: the space of all hypotheses that can, in principle, be output by the learning algorithm.
- Classifier: a discrete-valued function produced by the learning algorithm. The possible values of f, {1, 2, …, K}, are the classes or class labels. (In most algorithms the classifier will actually return a real-valued function that we'll have to interpret.)
- Training examples: a set of examples of the form {(x, f(x))}.

Slide 43: Key Issues in Machine Learning (repeat of slide 24; the Algorithms bullet adds a pointer to "The Bio Example").

Slide 44: Example: Generalization vs. Overfitting. What is a tree? A botanist: "A tree is something with leaves I've seen before." Her brother: "A tree is a green thing." Neither will generalize well.

Slide 45: CS446: Policies (repeat of slide 2).

Slide 46: CS446 Team (repeat of slide 3).

Slide 47: CS446 on the web (repeat of slide 4).

Slide 48: An Example. "I don't know {whether, weather} to laugh or cry." How can we make this a learning problem? We will look for a function F: Sentences → {whether, weather}. We need to define the domain of this function better. An option: for each word w in English, define a Boolean feature x_w, where x_w = 1 iff w is in the sentence. This maps a sentence to a point in {0,1}^50,000. In this space, some points are "whether" points and some are "weather" points. This is the Modeling step. (Learning protocol? Supervised? Unsupervised? What is the hypothesis space?)
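
A sketch of this Boolean bag-of-words mapping. The tiny vocabulary and whitespace tokenizer are illustrative stand-ins for the 50,000-word English vocabulary the slide assumes; the input sentence is shown with the confusable word left out, since that word is what we want to predict.

```python
# Sketch: map a sentence to a point in {0,1}^|V| with one feature x_w per word w.
vocabulary = ["i", "don't", "know", "to", "laugh", "or", "cry",
              "the", "is", "nice", "today", "report"]

def to_boolean_features(sentence):
    """x_w = 1 iff word w occurs in the sentence."""
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

x = to_boolean_features("I don't know ___ to laugh or cry")
print(dict(zip(vocabulary, x)))
```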

Slide 49: Representation Step: What's Good? Learning problem: find a function that best separates the data. What function? What's best? (How do we find it?) A possibility: define the learning problem to be: find a (linear) function that best separates the data. Linear means linear in the feature space; x is the data representation and w is the classifier: y = sgn(w^T x). (Considerations: memorizing vs. learning; accuracy vs. simplicity; how well will you do? on what? impact on generalization.)

Slide 50: Expressivity. f(x) = sgn(x · w − θ) = sgn(Σ_{i=1..n} w_i x_i − θ).
- Many functions are linear:
  - Conjunctions: y = x1 ∧ x3 ∧ x5 is y = sgn(1·x1 + 1·x3 + 1·x5 − 3), with w = (1, 0, 1, 0, 1) and θ = 3.
  - At least m of n: y = "at least 2 of {x1, x3, x5}" is y = sgn(1·x1 + 1·x3 + 1·x5 − 2), with w = (1, 0, 1, 0, 1) and θ = 2.
- Many functions are not:
  - XOR: y = (x1 ∧ x2) ∨ (¬x1 ∧ ¬x2).
  - Non-trivial DNF: y = (x1 ∧ x2) ∨ (x3 ∧ x4).
- But they can be made linear. (Probabilistic classifiers as well.)
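
A small sketch checking the two linear cases on this slide over all 32 Boolean inputs. The convention that the unit predicts 1 when the weighted sum is at least θ (i.e., sgn of a non-negative argument counts as positive) is an assumption made here to resolve the boundary case.

```python
# Sketch: conjunctions and "at least m of n" as linear threshold functions.
from itertools import product

w = (1, 0, 1, 0, 1)

def ltu(x, w, theta):
    """Predict 1 iff sum_i w_i x_i - theta >= 0 (assumed tie-breaking at 0)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - theta >= 0 else 0

for x in product((0, 1), repeat=5):
    conj = x[0] and x[2] and x[4]                 # x1 AND x3 AND x5
    atleast2 = int(x[0] + x[2] + x[4] >= 2)       # at least 2 of {x1, x3, x5}
    assert ltu(x, w, theta=3) == int(conj)
    assert ltu(x, w, theta=2) == atleast2
print("both Boolean functions match their linear threshold form")
```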

Slide 51: Exclusive-OR (XOR). y = (x1 ∧ x2) ∨ (¬x1 ∧ ¬x2). In general, a parity function: x_i ∈ {0,1}, f(x1, x2, …, xn) = 1 iff Σ x_i is even. This function is not linearly separable. [Figure: points in the (x1, x2) plane showing that no line separates the two classes.]

Slide 52: Functions Can be Made Linear. Data are not linearly separable in one dimension; they are not separable if you insist on using a specific class of functions. [Figure: points on a one-dimensional x axis with the two classes interleaved.]

Slide 53: Blown Up Feature Space. The same data are separable in the ⟨x, x²⟩ space. Key issue: representation, i.e., what features to use. Computationally, this can be done implicitly (kernels), but that is not always ideal.

Slide 54: Functions Can be Made Linear: Discrete Case. (A real weather/whether example.) Original function: y = x1x2x4 ∨ x2x4x5 ∨ x1x3x7 over the space X = ⟨x1, x2, …, xn⟩. Input transformation: move to a new space Y = {y1, y2, …} = {x_i, x_i x_j, x_i x_j x_k, …} of monomials. The new discriminator, y3 ∨ y4 ∨ y7 (a disjunction over three of the new coordinates), is functionally simpler.
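
A sketch of this input transformation, assuming the three relevant new coordinates are the monomials x1x2x4, x2x4x5, and x1x3x7 (the slide's y3, y4, y7 naming is not reproduced here); it verifies over all 2^7 inputs that the DNF becomes a linear threshold function in the monomial space.

```python
# Sketch: blow up X into monomial features so a DNF becomes linear.
from itertools import combinations, product

def monomial_features(x, degree=3):
    """All products of up to 'degree' distinct coordinates of x, keyed by name."""
    feats = {}
    for d in range(1, degree + 1):
        for idx in combinations(range(len(x)), d):
            name = "".join(f"x{i + 1}" for i in idx)
            val = 1
            for i in idx:
                val *= x[i]
            feats[name] = val
    return feats

# The original Boolean function over 7 variables...
f = lambda x: (x[0] & x[1] & x[3]) | (x[1] & x[3] & x[4]) | (x[0] & x[2] & x[6])

# ...equals a linear threshold over three of the monomial coordinates.
for x in product((0, 1), repeat=7):
    feats = monomial_features(x)
    linear = int(feats["x1x2x4"] + feats["x2x4x5"] + feats["x1x3x7"] >= 1)
    assert linear == f(x)
print("the DNF is linear in the monomial feature space")
```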

Slide 55: Third Step: How to Learn? A possibility: local search. Start with a linear threshold function; see how well you are doing; correct; repeat until you converge. There are other ways that do not search directly in the hypothesis space but instead directly compute the hypothesis.

Slide 56: A General Framework for Learning. Goal: predict an unobserved output value y ∈ Y based on an observed input vector x ∈ X. Estimate a functional relationship y ≈ f(x) from a set {(x, y)_i}, i = 1..n. Most relevant is classification, y ∈ {0,1} (or y ∈ {1, 2, …, k}); within the same framework we can also talk about regression, y ∈ R. What do we want f(x) to satisfy? We want to minimize the risk: L(f()) = E_{X,Y}([f(x) ≠ y]), where E_{X,Y} denotes the expectation with respect to the true distribution. (Simple loss function: the number of mistakes; [...] is an indicator function.)

Slide 57: A General Framework for Learning (II). We want to minimize the loss L(f()) = E_{X,Y}([f(X) ≠ Y]), where E_{X,Y} denotes the expectation with respect to the true distribution. We cannot minimize this loss. Instead, we try to minimize the empirical classification error: for a set of training examples {(X_i, Y_i)}, i = 1..n, try to minimize L'(f()) = (1/n) Σ_i [f(X_i) ≠ Y_i]. (Issue I: why/when is this good enough? Not now.) This minimization problem is typically NP-hard. To alleviate this computational problem, minimize a new function, a convex upper bound of the classification error function I(f(x), y) = [f(x) ≠ y] = {1 when f(x) ≠ y; 0 otherwise}. Side note: if the distribution over X × Y is known, predict y = argmax_y P(y|x); this is the best possible (the optimal Bayes error).

Slide 58: Algorithmic View of Learning: an Optimization Problem. A loss function L(f(x), y) measures the penalty incurred by a classifier f on example (x, y). There are many different loss functions one could define:
- Misclassification error: L(f(x), y) = 0 if f(x) = y; 1 otherwise.
- Squared loss: L(f(x), y) = (f(x) − y)^2.
- Input-dependent loss: L(f(x), y) = 0 if f(x) = y; c(x) otherwise.
A continuous convex loss function allows a simpler optimization algorithm. [Plot: a convex loss L as a function of f(x) − y.]

Slide 59: Loss. Here f(x) ∈ R is the prediction and y ∈ {−1, 1} is the correct value.
- 0-1 loss: L(y, f(x)) = ½(1 − sgn(y f(x))).
- Log loss: L(y, f(x)) = (1/ln 2) log(1 + exp{−y f(x)}).
- Hinge loss: L(y, f(x)) = max(0, 1 − y f(x)).
- Square loss: L(y, f(x)) = (y − f(x))^2.
[Plot: the four losses; the x-axis is y f(x) for the 0-1, log, and hinge losses, and (y − f(x) + 1) for the square loss.]
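
A sketch computing the four losses on this slide for a few illustrative margins y·f(x); the sample prediction values are made up, and the sign-at-zero convention in the 0-1 loss is an implementation detail, not something the slide specifies.

```python
# Sketch: the 0-1, log, hinge, and square losses as functions of y and f(x).
import math

def zero_one_loss(y, fx):
    return 0.5 * (1 - math.copysign(1, y * fx))          # 1/2 (1 - sgn(y f(x)))

def log_loss(y, fx):
    return math.log1p(math.exp(-y * fx)) / math.log(2)   # (1/ln 2) ln(1 + e^{-y f(x)})

def hinge_loss(y, fx):
    return max(0.0, 1 - y * fx)

def square_loss(y, fx):
    return (y - fx) ** 2

y = 1
for fx in (-2.0, -0.5, 0.5, 2.0):
    print(f"f(x)={fx:+.1f}  0-1={zero_one_loss(y, fx):.2f}  "
          f"log={log_loss(y, fx):.2f}  hinge={hinge_loss(y, fx):.2f}  "
          f"square={square_loss(y, fx):.2f}")
```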

Slide 60: Example: putting it all together: a learning algorithm.

Slide 61: Third Step: How to Learn? (repeat of slide 55).

Slide 62: Learning Linear Separators (LTU). f(x) = sgn(x^T · w − θ) = sgn(Σ_{i=1..n} w_i x_i − θ), where x^T = (x1, x2, …, xn) ∈ {0,1}^n is the feature-based encoding of the data point, w^T = (w1, w2, …, wn) ∈ R^n is the target function, and θ determines the shift with respect to the origin. [Figure: a separating hyperplane with normal vector w and offset θ.]

Slide 63: Canonical Representation. f(x) = sgn(w^T · x − θ) = sgn(Σ_{i=1..n} w_i x_i − θ). Note that sgn(w^T · x − θ) ≡ sgn((w')^T · x'), where x' = (x, −1) and w' = (w, θ). We have moved from an n-dimensional representation to an (n+1)-dimensional one, but now we can look for hyperplanes that go through the origin.
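
A tiny sketch of this augmentation, assuming the x' = (x, −1), w' = (w, θ) convention stated above; the particular numbers are illustrative.

```python
# Sketch: fold the threshold theta into an extra coordinate so the
# hyperplane passes through the origin in (n+1) dimensions.
def augment(x, w, theta):
    x_prime = list(x) + [-1.0]
    w_prime = list(w) + [theta]
    return x_prime, w_prime

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, w, theta = [1.0, 0.0, 1.0], [0.5, -0.2, 0.8], 1.0
x_p, w_p = augment(x, w, theta)
assert abs((dot(w, x) - theta) - dot(w_p, x_p)) < 1e-12
print("w.x - theta =", dot(w, x) - theta, "=", dot(w_p, x_p))
```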

Slide 64: General Learning Principle. Our goal is to minimize the expected risk J(w) = E_{X,Y} Q(x, y, w). We cannot do it. Instead, we approximate J(w) using a finite training set of independent samples (x_i, y_i): J(w) ≈ (1/m) Σ_{i=1..m} Q(x_i, y_i, w), via a batch gradient descent algorithm. That is, we successively compute estimates w_t of the optimal parameter vector w: w_{t+1} = w_t − r ∇J(w) = w_t − (r/m) Σ_{i=1..m} ∇Q(x_i, y_i, w). (The loss Q is a function of x^T w and y; the risk J is a function of w.)

Slide 65: LMS: An Optimization Algorithm. Our hypothesis space is the collection of linear threshold units. Loss function: squared loss, LMS (Least Mean Square, L2): Q(x, y, w) = ½(w^T x − y)^2.

Slide 66: Gradient Descent. We use gradient descent to determine the weight vector that minimizes J(w) = Err(w); fixing the set D of examples, J = Err is a function of the w_j. At each step, the weight vector is modified in the direction that produces the steepest descent along the error surface. [Figure: the error surface E(w) over w, with successive iterates w1, w2, w3, w4 descending toward the minimum.]

Slide 67: LMS: An Optimization Algorithm. (Notation: i (subscript) is the vector component; j (superscript) is time; d is the example index.) Assumption: x ∈ R^n; u ∈ R^n is the target weight vector; the target (label) is t_d = u · x. Noise has been added, so possibly no weight vector is consistent with the data. Let w^(j) be the current weight vector. Our prediction on the d-th example x_d is o_d = w^(j) · x_d. Let t_d be the target value for this example (a real value, representing u · x_d). The error the current hypothesis makes on the data set is E(w) = ½ Σ_d (t_d − o_d)^2.

Slide 68: Gradient Descent. To find the best direction in the weight space we compute the gradient of E with respect to each of the components of w: ∇E(w) = [∂E/∂w_1, …, ∂E/∂w_n]. This vector specifies the direction that produces the steepest increase in E; we want to modify w in the opposite direction, w ← w + Δw, where (with a fixed step size R): Δw = −R ∇E(w).

Slide 69: Gradient Descent: LMS. We have E(w) = ½ Σ_d (t_d − o_d)^2, so ∂E/∂w_i = ∂/∂w_i ½ Σ_d (t_d − o_d)^2 = Σ_d (t_d − o_d) ∂/∂w_i (t_d − w · x_d). Therefore: ∂E/∂w_i = −Σ_d (t_d − o_d) x_{i,d}.

Slide 70: Gradient Descent: LMS. Weight update rule: Δw_i = R Σ_d (t_d − o_d) x_{i,d}. Gradient descent algorithm for training linear units:
- Start with an initial random weight vector.
- For every example d with target value t_d: evaluate the linear unit o_d = w · x_d; update w by adding Δw_i to each component.
- Continue until E is below some threshold.
This algorithm always converges to a local minimum of J(w) for small enough steps. Here (LMS for linear regression) the surface contains only a single global minimum, so the algorithm converges to a weight vector with minimum error, regardless of whether the examples are linearly separable. The surface may have local minima if the loss function is different.
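
A sketch of this batch rule: on every pass, accumulate (t_d − o_d) x_d over the whole set D and move w by R times the sum. The data set, step size R, and number of epochs below are illustrative choices, not values from the course.

```python
# Sketch: batch gradient descent for the LMS loss Q = 1/2 (w.x - y)^2.
def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def batch_lms(data, dim, rate=0.05, epochs=200):
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for x, t in data:                         # sum over the whole set D
            err = t - predict(w, x)               # (t_d - o_d)
            for i in range(dim):
                grad[i] += err * x[i]
        w = [wi + rate * gi for wi, gi in zip(w, grad)]
    return w

# noisy targets generated from an (unknown to the learner) u = (1.0, -2.0)
data = [((1.0, 0.0), 1.1), ((0.0, 1.0), -1.9), ((1.0, 1.0), -0.9), ((2.0, 1.0), 0.2)]
print("learned w:", batch_lms(data, dim=2))
```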

Slide 71: Incremental (Stochastic) Gradient Descent: LMS. Weight update rule: Δw_i = R (t_d − o_d) x_{i,d}. Gradient descent algorithm for training linear units:
- Start with an initial random weight vector.
- For every example d with target value t_d: evaluate the linear unit; update w incrementally by adding Δw_i to each component (update without summing over all the data).
- Continue until E is below some threshold.
We dropped the averaging operation: instead of averaging the gradient of the loss over the complete training set, choose a sample (x, y) at random and update w_t.

Slide 72: Incremental (Stochastic) Gradient Descent: LMS. Weight update rule: Δw_i = R (t_d − o_d) x_{i,d}, with the averaging operation dropped. In general this does not converge to the global minimum; decreasing R with time guarantees convergence. But online algorithms are sometimes advantageous. This is sometimes called "online" since we don't need a reference to a training set: observe, predict, get feedback.
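
A sketch of the incremental rule on slides 71-72: pick one example at random and update w with R (t_d − o_d) x_d, without summing over the data. The particular decreasing step-size schedule, data, and iteration count are illustrative assumptions, meant only to echo the convergence remark above.

```python
# Sketch: stochastic (incremental, "online") gradient descent for LMS.
import random

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_lms(data, dim, rate=0.2, steps=5000, seed=0):
    rng = random.Random(seed)
    w = [0.0] * dim
    for t in range(1, steps + 1):
        x, target = rng.choice(data)              # observe one example
        err = target - predict(w, x)              # predict, then get feedback
        step = rate / (1 + 0.01 * t)              # decrease R over time
        w = [wi + step * err * xi for wi, xi in zip(w, x)]
    return w

data = [((1.0, 0.0), 1.1), ((0.0, 1.0), -1.9), ((1.0, 1.0), -0.9), ((2.0, 1.0), 0.2)]
print("learned w:", sgd_lms(data, dim=2))
```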

Slide 73: Learning Rates and Convergence. In the general (non-separable) case the learning rate R must decrease to zero to guarantee convergence. The learning rate is called the step size; there are more sophisticated algorithms that choose the step size automatically and converge faster. Choosing a better starting point also has an impact. Gradient descent and its stochastic version are very simple algorithms, but almost all the algorithms we will learn in this class can be traced back to gradient descent algorithms for different loss functions and different hypothesis spaces.

Slide 74: Computational Issues. Assume the data is linearly separable. Sample complexity: suppose we want to ensure that our LTU has an error rate (on new examples) of less than ε with high probability (at least 1 − δ). How large must m (the number of examples) be in order to achieve this? It can be shown that for n-dimensional problems m = O(1/ε [ln(1/δ) + (n+1) ln(1/ε)]). Computational complexity: what can be said? It can be shown that there exists a polynomial-time algorithm for finding a consistent LTU (by reduction to linear programming). [Contrast this with the NP-hardness of 0-1 loss optimization.] (Online algorithms have an inverse quadratic dependence on the margin.)

Slide 75: Other Methods for LTUs.
- Fisher linear discriminant: a direct computation method.
- Probabilistic methods (naïve Bayes): produce a stochastic classifier that can be viewed as a linear threshold unit.
- Winnow/Perceptron: multiplicative/additive update algorithms with some sparsity properties in the function space (a large number of irrelevant attributes) or the feature space (sparse examples).
- Logistic regression, SVMs, and many other algorithms.

