
1 1 Second Order Learning Koby Crammer Department of Electrical Engineering ECML PKDD 2013 Prague

2 Thanks Mark Dredze Alex Kulesza Avihai Mejer Edward Moroshko Francesco Orabona Fernando Pereira Yoram Singer Nina Vaitz 2

3 3 Tutorial Context Online Learning Tutorial Optimization Theory Real-World Data SVMs

4 4 Outline Background: –Online learning + notation –Perceptron –Stochastic-gradient descent –Passive-aggressive Second-Order Algorithms –Second order Perceptron –Confidence-Weighted and AROW –AdaGrad Properties –Kernels –Analysis Empirical Evaluation –Synthetic –Real Data

5 5 Online Learning Tyrannosaurus rex

6 6 Online Learning Triceratops

7 7 Online Learning Tyrannosaurus rex Velociraptor

8 8 Formal Setting – Binary Classification Instances –Images, Sentences Labels –Parse tree, Names Prediction rule –Linear prediction rules Loss –No. of mistakes

9 9 Predictions Discrete Predictions: –Hard to optimize Continuous predictions: –Label –Confidence

10 10 Loss Functions Natural Loss: –Zero-One loss: Real-valued-predictions loss: –Hinge loss: –Exponential loss (Boosting) –Log loss (Max Entropy, Boosting)
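
Written out in standard notation (the slide's own formulas are images and not recoverable from this transcript), with ŷ = w·x the real-valued prediction, the two losses above are:

```latex
\ell_{0/1}(y,\hat{y}) = \mathbf{1}\!\left[\, y\,\hat{y} \le 0 \,\right],
\qquad
\ell_{\mathrm{hinge}}(y,\hat{y}) = \max\!\left\{0,\; 1 - y\,\hat{y}\right\}.
```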

11 11 Loss Functions (plot: Zero-One Loss and Hinge Loss)

12 12 Online Framework Initialize Classifier Algorithm works in rounds On each round the online algorithm: – Receives an input instance – Outputs a prediction – Receives a feedback label – Computes loss – Updates the prediction rule Goal: –Suffer small cumulative loss

13 Online Learning Maintain Model M Get Instance x Predict Label ŷ=M(x) Get True Label y Suffer Loss l(y,ŷ) Update Model M
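
A minimal Python sketch of this protocol, assuming a linear model and the hinge loss for the "suffer loss" step; the function and variable names (`online_learn`, `update`, `stream`) are illustrative, not from the slides:

```python
import numpy as np

def online_learn(stream, d, update):
    """Generic online protocol: predict, get the true label, suffer loss, update.

    stream : iterable of (x, y) pairs, x a d-dimensional array, y in {-1, +1}
    update : callable (w, x, y) -> new weight vector (the learning rule)
    """
    w = np.zeros(d)                                     # maintain model M (a weight vector)
    mistakes, cumulative_loss = 0, 0.0
    for x, y in stream:                                 # get instance x
        y_hat = np.sign(w @ x)                          # predict label y_hat = M(x)
        mistakes += int(y_hat != y)                     # zero-one loss
        cumulative_loss += max(0.0, 1.0 - y * (w @ x))  # hinge loss l(y, y_hat)
        w = update(w, x, y)                             # update model M
    return w, mistakes, cumulative_loss
```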

14 14 Linear Classifiers Any features W.l.o.g. binary classifiers of the form (notation abuse)

15 15 Linear Classifiers (cntd.) Prediction: Confidence in prediction:

16 16 Linear Classifiers Input Instance to be classified Weight vector of classifier

17 17 Margin Margin of an example with respect to the classifier: Note: The set is separable iff there exists such that
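
In standard notation (a reconstruction; the slide's symbols are images), the margin and the separability condition read:

```latex
\gamma\big(w; (x, y)\big) = y\,(w \cdot x),
\qquad
\{(x_i, y_i)\}_{i} \text{ is separable}
\iff
\exists\, u,\ \|u\| = 1,\ \gamma > 0:\;\; y_i\,(u \cdot x_i) \ge \gamma \ \ \forall i.
```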

18 18 Geometrical Interpretation

19 19 Geometrical Interpretation

20 20 Geometrical Interpretation

21 21 Geometrical Interpretation Margin >0 Margin <<0 Margin <0 Margin >>0

22 22 Hinge Loss

23 23 Why Online Learning? Fast Memory efficient - process one example at a time Simple to implement Formal guarantees – Mistake bounds Online to Batch conversions No statistical assumptions Adaptive Not as good as well-designed batch algorithms

24 24 Outline Background: –Online learning + notation –Perceptron –Stochastic-gradient descent –Passive-aggressive Second-Order Algorithms –Second order Perceptron –Confidence-Weighted and AROW –AdaGrad Properties –Kernels –Analysis Empirical Evaluation –Synthetic –Real Data

25 25 The Perceptron Algorithm If No-Mistake –Do nothing If Mistake –Update Margin after update: Rosenblatt 1958
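
A short Python sketch of one Perceptron round as described above, with the learning rate fixed to 1 as in the classic algorithm (variable names are mine):

```python
import numpy as np

def perceptron_step(w, x, y):
    """One Perceptron round: do nothing on a correct prediction, add y*x on a mistake."""
    if y * (w @ x) <= 0:       # mistake (non-positive margin)
        w = w + y * x          # update; the margin on (x, y) grows by ||x||^2
    return w
```

Used with the generic online loop above, e.g. `online_learn(stream, d, perceptron_step)`.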

26 26 Geometrical Interpretation

27 27 Outline Background: –Online learning + notation –Perceptron –Stochastic-gradient descent –Passive-aggressive Second-Order Algorithms –Second order Perceptron –Confidence-Weighted and AROW –AdaGrad Properties –Kernels –Analysis Empirical Evaluation –Synthetic –Real Data

28 Gradient Descent Consider the batch problem Simple algorithm: –Initialize –Iterate, for –Compute –Set 28


30 Stochastic Gradient Descent Consider the batch problem Simple algorithm: –Initialize –Iterate, for –Pick a random index –Compute –Set 30


32 Stochastic Gradient Descent Hinge loss The gradient Simple algorithm: –Initialize –Iterate, for –Pick a random index –If then else –Set 32 The perceptron is stochastic gradient descent on a sum of hinge losses with a specific order of examples
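
A minimal sketch of this connection in Python, assuming the data is a list of (x, y) pairs; with step size 1 and updates only when the hinge loss is positive, each step is exactly a Perceptron-style update w ← w + y·x:

```python
import numpy as np

def sgd_hinge(data, d, eta=1.0, epochs=1, rng=None):
    """Stochastic (sub)gradient descent on a sum of hinge losses -- a sketch.

    data : list of (x, y) pairs, y in {-1, +1}; eta : step size (assumed value).
    """
    rng = rng or np.random.default_rng(0)
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(len(data)):   # pick a random index
            x, y = data[i]
            if y * (w @ x) < 1.0:              # hinge loss is positive: subgradient is -y*x
                w = w + eta * y * x            # step in the negative subgradient direction
            # else: subgradient is zero, nothing to do
    return w
```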

33 33 Outline Background: –Online learning + notation –Perceptron –Stochastic-gradient descent –Passive-aggressive Second-Order Algorithms –Second order Perceptron –Confidence-Weighted and AROW –AdaGrad Properties –Kernels –Analysis Empirical Evaluation –Synthetic –Real Data

34 34 Motivation Perceptron: No guarantees on the margin after the update PA: Enforce a minimal non-zero margin after the update In particular: –If the margin is large enough (1), then do nothing –If the margin is less than one, update such that the margin after the update is enforced to be one

35 35 Input Space

36 36 Input Space vs. Version Space Input Space: –Points are input data –One constraint is induced by weight vector –Primal space –Half space = all input examples that are classified correctly by a given predictor (weight vector) Version Space: –Points are weight vectors –One constraint is induced by input data –Dual space –Half space = all predictors (weight vectors) that correctly classify a given input example

37 37 Weight Vector (Version) Space The algorithm forces to reside in this region

38 38 Passive Step Nothing to do. already resides on the desired side.

39 39 Aggressive Step The algorithm projects on the desired half-space

40 40 Aggressive Update Step Set to be the solution of the following optimization problem : Solution:
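
In the standard PA notation of Crammer et al. (a reconstruction from the cited literature; the slide's own formulas are images), the problem and its closed-form solution are:

```latex
w_{t+1} = \arg\min_{w}\ \tfrac{1}{2}\,\|w - w_t\|^2
\quad \text{s.t.} \quad y_t\,(w \cdot x_t) \ge 1,
\qquad
w_{t+1} = w_t + \tau_t\, y_t\, x_t,
\quad
\tau_t = \frac{\max\{0,\ 1 - y_t\,(w_t \cdot x_t)\}}{\|x_t\|^2}.
```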

41 41 Perceptron vs. PA Common Update : Perceptron Passive-Aggressive

42 42 Perceptron vs. PA Margin Error No-Error, Small Margin No-Error, Large Margin

43 43 Perceptron vs. PA

44 44 Outline Background: –Online learning + notation –Perceptron –Stochastic-gradient descent –Passive-aggressive Second-Order Algorithms –Second order Perceptron –Confidence-Weighted and AROW –AdaGrad Properties –Kernels –Analysis Empirical Evaluation –Synthetic –Real Data

45 45 Geometrical Assumption All examples are bounded in a ball of radius R

46 46 Separability There exists a unit vector that classifies the data correctly

47 Perceptron's Mistake Bound Simple case: positive points, negative points, separating hyperplane The number of mistakes the algorithm makes is bounded by: 47
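
The classic form of this bound, for examples of radius at most R separated with margin γ (the slide's formulas are images, so this is the standard statement rather than a literal transcription):

```latex
\#\{\text{mistakes}\} \;\le\; \left(\frac{R}{\gamma}\right)^{2}.
```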

48 48 Geometrical Motivation

49 SGD on such data 49

50 50 Outline Background: –Online learning + notation –Perceptron –Stochastic-gradient descent –Passive-aggressive Second-Order Algorithms –Second order Perceptron –Confidence-Weighted and AROW –AdaGrad Properties –Kernels –Analysis Empirical Evaluation –Synthetic –Real Data

51 Second Order Perceptron Assume all inputs are given Compute whitening matrix Run the Perceptron on whitened data New whitening matrix 51 Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005

52 Second Order Perceptron Bound: Same simple case: Thus Bound is : 52 Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005

53 53 Second Order Perceptron If No-Mistake –Do nothing If Mistake –Update Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005
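
A Python sketch of the Second Order Perceptron along these lines (the regularization parameter `a` and all names are assumptions for illustration; this uses a direct solve rather than incremental inverse updates):

```python
import numpy as np

def second_order_perceptron(stream, d, a=1.0):
    """Sketch of the Second Order Perceptron (Cesa-Bianchi, Conconi, Gentile, 2005)."""
    v = np.zeros(d)          # sum of y_t * x_t over mistake rounds
    S = np.zeros((d, d))     # sum of x_t x_t^T over mistake rounds
    mistakes = 0
    for x, y in stream:
        S_aug = S + np.outer(x, x)                     # include the current instance
        w = np.linalg.solve(a * np.eye(d) + S_aug, v)  # whitened weight vector
        y_hat = np.sign(w @ x) or 1.0                  # predict (ties broken as +1)
        if y_hat != y:                                 # mistake: keep the example
            v = v + y * x
            S = S_aug
            mistakes += 1
    return v, S, mistakes
```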

54 SGD on whitened data 54

55 55 Outline Background: –Online learning + notation –Perceptron –Stochastic-gradient descent –Passive-aggressive Second-Order Algorithms –Second order Perceptron –Confidence-Weighted and AROW –AdaGrad Properties –Kernels –Analysis Empirical Evaluation –Synthetic –Real Data

56 56 Span-based Update Rules The weight vector is a linear combination of examples Two rate schedules (many, many others exist): –Perceptron algorithm, conservative –Passive-Aggressive (Annotations on the update formula: feature-value of input instance; target label, either -1 or 1; learning rate; weight of feature f)

57 57 Sentiment Classification Who needs this Simpsons book? You DOOOOOOOO This is one of the most extraordinary volumes I've ever encountered …. Exhaustive, informative, and ridiculously entertaining, it is the best accompaniment to the best television show …. … Very highly recommended! Pang, Lee, Vaithyanathan, EMNLP 2002

58 58 Sentiment Classification Many positive reviews with the word best W best Later negative review –boring book – best if you want to sleep in seconds A linear update will reduce both W best and W boring But best appeared more often than boring The model knows more about best than about boring Better to reduce the weights at different rates: W boring more than W best

59 59 Natural Language Processing Big datasets, large number of features Many features are only weakly correlated with target label Linear classifiers: features are associated with word-counts Heavy-tailed feature distribution (plot: feature counts vs. feature rank)

60 Natural Language Processing

61 61 New Prediction Models Gaussian distributions over weight vectors The covariance is either full or diagonal In NLP we have many features and use a diagonal covariance

62 62 Classification Given a new example Stochastic: –Draw a weight vector –Make a prediction Collective: –Average weight vector –Average margin –Average prediction

63 63 The Margin is a Random Variable The signed margin is a random 1-d Gaussian Thus:
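
Concretely, for a Gaussian weight vector the signed margin is a one-dimensional Gaussian (standard derivation, with Φ the normal CDF):

```latex
w \sim \mathcal{N}(\mu, \Sigma)
\;\Longrightarrow\;
M = y\,(w \cdot x) \sim \mathcal{N}\!\big(y\,(\mu \cdot x),\ x^{\top}\Sigma\,x\big),
\qquad
\Pr[M \ge 0] = \Phi\!\left(\frac{y\,(\mu \cdot x)}{\sqrt{x^{\top}\Sigma\,x}}\right).
```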

64 64 Distribution over Linear Models (figure labels: linear model, example, mean weight-vector)

65 65 Weight Vector (Version) Space The algorithm forces most of the values of to reside in this region

66 66 Passive Step Nothing to do, most of the weight vectors already classify the example correctly

67 67 Aggressive Step The algorithm projects the current Gaussian distribution on the half-space The mean is moved beyond the mistake-line (Large Margin) The covariance is shrunk in the direction of the input example

68 68 Projection Update Vectors (aka PA): Distributions (New Update) : Confidence Parameter
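
In the notation of the CW papers cited below (a reconstruction, with η the confidence parameter), the two projections are:

```latex
\text{PA:}\quad w_{t+1} = \arg\min_{w}\ \tfrac12\|w - w_t\|^2
\ \ \text{s.t.}\ \ y_t\,(w \cdot x_t) \ge 1;
\qquad
\text{CW:}\quad (\mu_{t+1}, \Sigma_{t+1}) = \arg\min_{\mu,\Sigma}\
\mathrm{D}_{\mathrm{KL}}\!\big(\mathcal{N}(\mu,\Sigma)\,\big\|\,\mathcal{N}(\mu_t,\Sigma_t)\big)
\ \ \text{s.t.}\ \ \Pr_{w \sim \mathcal{N}(\mu,\Sigma)}\!\big[y_t\,(w \cdot x_t) \ge 0\big] \ge \eta.
```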

69 69 Sum of two divergences of parameters: Convex in both arguments simultaneously Divergence Matrix Itakura-Saito Divergence Mahalanobis Distance

70 70 Constraint Probabilistic Constraint : Equivalent Margin Constraint : Convex in, concave in Solutions: –Linear approximation –Change variables to get a convex formulation –Relax (AROW) Dredze, Crammer, Pereira. ICML 2008 Crammer, Dredze, Pereira. NIPS 2008 Crammer, Dredze, Kulesza. NIPS 2009

71 71 Convexity Change variables Equivalent convex formulation Crammer, Dredze, Pereira. NIPS 2008

72 72 AROW PA: CW : Similar update form as CW Crammer, Dredze, Kulesza. NIPS 2009
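
The AROW relaxation replaces the probabilistic constraint with two regularized penalty terms (written here in the standard form from the cited paper, with ℓ_h the hinge loss and r > 0 a regularization parameter):

```latex
(\mu_{t+1}, \Sigma_{t+1}) = \arg\min_{\mu,\Sigma}\
\mathrm{D}_{\mathrm{KL}}\!\big(\mathcal{N}(\mu,\Sigma)\,\big\|\,\mathcal{N}(\mu_t,\Sigma_t)\big)
+ \frac{1}{2r}\,\ell_h^{2}\big(y_t,\ \mu \cdot x_t\big)
+ \frac{1}{2r}\, x_t^{\top}\Sigma\, x_t.
```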

73 73 The Update Optimization update can be solved analytically Coefficients depend on specific algorithm

74 Definitions 74

75 Updates AROW, CW (Change Variables), CW (Linearization) 75
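
A Python sketch of one full-covariance AROW update of this form (the closed-form coefficients follow the published AROW update; the parameter `r` and all names are illustrative):

```python
import numpy as np

def arow_step(mu, Sigma, x, y, r=1.0):
    """One AROW round: passive if the margin is large, otherwise update mean and covariance."""
    margin = y * (mu @ x)
    if margin >= 1.0:                        # passive step
        return mu, Sigma
    Sx = Sigma @ x
    beta = 1.0 / (x @ Sx + r)                # confidence-dependent step-size factor
    alpha = (1.0 - margin) * beta            # scaled hinge loss
    mu = mu + alpha * y * Sx                 # move the mean toward correct classification
    Sigma = Sigma - beta * np.outer(Sx, Sx)  # shrink the covariance along x
    return mu, Sigma
```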

76 76 Per-feature Learning Rate Reducing the learning rate and eigenvalues of covariance matrix

77 77 Diagonal Matrix Given a matrix we define to be only the diagonal part of the matrix, Make matrix diagonal Make inverse diagonal
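
Two literal readings of these operations in numpy (a sketch; the slide's exact definitions are images, so "make inverse diagonal" is interpreted here as keeping only the diagonal of the inverse):

```python
import numpy as np

def make_matrix_diagonal(A):
    """Keep only the diagonal part of A."""
    return np.diag(np.diag(A))

def make_inverse_diagonal(A):
    """Keep only the diagonal part of A's inverse (assumed interpretation)."""
    return np.diag(np.diag(np.linalg.inv(A)))
```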

78 78 Outline Background: –Online learning + notation –Perceptron –Stochastic-gradient descent –Passive-aggressive Second-Order Algorithms –Second order Perceptron –Confidence-Weighted and AROW –AdaGrad Properties –Kernels –Analysis Empirical Evaluation –Synthetic –Real Data

79 (Back to) Stochastic Gradient Descent Consider the batch problem Simple algorithm: –Initialize –Iterate, for –Pick a random index –Compute –Set 79

80 Adaptive Stochastic Gradient Descent Consider the batch problem Simple algorithm: –Initialize –Iterate, for –Pick a random index –Compute –Set 80 Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010

81 Adaptive Stochastic Gradient Descent Very general! Can be used to solve with various regularizations The matrix A can be either full or diagonal Comes with convergence and regret bounds Similar performance to AROW Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010
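
A sketch of the diagonal variant in Python (the common AdaGrad recipe; `eta`, `eps`, and the loss-gradient callback are assumptions, not taken from the slides):

```python
import numpy as np

def adagrad(grad, w0, data, eta=0.1, eps=1e-8, epochs=1):
    """Diagonal AdaGrad: per-feature step sizes from accumulated squared gradients.

    grad : callable (w, x, y) -> gradient of the per-example loss at w
    w0   : initial weight vector (float array)
    """
    w = w0.astype(float).copy()
    G = np.zeros_like(w)                       # running sum of squared gradients
    for _ in range(epochs):
        for x, y in data:
            g = grad(w, x, y)
            G += g * g                         # accumulate per-coordinate statistics
            w -= eta * g / (np.sqrt(G) + eps)  # per-feature learning rate
    return w
```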

82 Adaptive Stochastic Gradient Descent SGD vs. AdaGrad Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010

83 Special Case of a General Framework Any loss function Assume: Convex in first argument, non-negative Algorithm: online convex programming with shifting link function Orabona and Crammer, NIPS 2010

84 Special Case of a General Framework Orabona and Crammer, NIPS 2010

85 Our Algorithms as a Special Case Loss: Regularization functions

86 86 Outline Background: –Online learning + notation –Perceptron –Stochastic-gradient descent –Passive-aggressive Second-Order Algorithms –Second order Perceptron –Confidence-Weighted and AROW –AdaGrad Properties –Kernels –Analysis Empirical Evaluation –Synthetic –Real Data

87 87 Kernels

88 Proof Show that we can write Induction 88

89 Proof (cntd) By update rule : Thus 89

90 Proof (cntd) By update rule : 90

91 Proof (cntd) Thus 91

92 92 Outline Background: –Online learning + notation –Perceptron –Stochastic-gradient descent –Passive-aggressive Second-Order Algorithms –Second order Perceptron –Confidence-Weighted and AROW –AdaGrad Properties –Kernels –Analysis Empirical Evaluation –Synthetic –Real Data

93 93 Properties Eigenvalues of covariance matrix monotonically decrease Mean of signed-margin increases; variance decreases

94 94 Statistical Interpretation Margin Constraint : Distribution over weight-vectors : Assume input is corrupted with Gaussian noise

95 95 Statistical Interpretation (figure: version space vs. input space; input instance, linear separator, example, mean weight-vector, good vs. bad realization)

96 96 Mistake Bound For any reference weight vector, the number of mistakes made by AROW is upper bounded by where – set of example indices with a mistake – set of example indices with an update but not a mistake – Orabona and Crammer, NIPS 2010

97 97 Comment I Separable case and no updates: where

98 98 Comment II For large the bound becomes: When no updates are performed: Perceptron

99 Bound for Diagonal Algorithm No. of mistakes is bounded by Is low when either a feature is rare or non-informative Exactly as in NLP … Orabona and Crammer, NIPS 2010

100 100 Outline Background: –Online learning + notation –Perceptron –Stochastic-gradient descent –Passive-aggressive Second-Order Algorithms –Second order Perceptron –Confidence-Weighted and AROW –AdaGrad Properties –Kernels –Analysis Empirical Evaluation –Synthetic –Real Data

101 101 Synthetic Data 20 features 2 informative (rotated skewed Gaussian) 18 noisy Using a single feature is as good as a random prediction

102 102 Synthetic Data (cntd.) Distribution after 50 examples (x1)

103 103 Synthetic Data (no noise) Perceptron PA SOP CW-full CW-diag

104 104 Synthetic Data (10% noise)

105 105 Outline Background: –Online learning + notation –Perceptron –Stochastic-gradient descent –Passive-aggressive Second-Order Algorithms –Second order Perceptron –Confidence-Weighted and AROW –AdaGrad Properties –Kernels –Analysis Empirical Evaluation –Synthetic –Real Data

106 106 Data Sentiment –Sentiment reviews from 6 Amazon domains ( Blitzer et al ) –Classify a product review as either positive or negative Reuters, pairs of labels –Three divisions: Insurance: Life vs. Non-Life, Business Services: Banking vs. Financial, Retail Distribution: Specialist Stores vs. Mixed Retail. –Bag of words representation with binary features. 20 News Groups, pairs of labels –Three divisions: comp.sys.ibm.pc.hardware vs. comp.sys.mac.hardware instances, sci.electronics vs. sci.med instances, and talk.politics.guns vs. talk.politics.mideast instances. –Bag of words representation with binary features.

107 107 Experimental Design Online to batch : –Multiple passes over the training data –Evaluate on a different test set after each pass –Compute error/accuracy Set parameter using held-out data 10 Fold Cross-Validation ~2000 instances per problem Balanced class-labels

108 108 Results vs Online- Sentiment StdDev and Variance – always better than baseline Variance – 5/6 significantly better

109 109 Results vs Online – 20NG + Reuters StdDev and Variance – always better than baseline Variance – 4/6 significantly better

110 110 Results vs Batch - Sentiment always better than batch methods 3/6 significantly better

111 111 Results vs Batch - 20NG + Reuters 5/6 better than batch methods 3/5 significantly better, 1/1 significantly worse


115 115 Results - Sentiment CW is better (5/6 cases), statistically significant (4/6) CW benefits less from many passes (plots: accuracy vs. passes of training data, PA vs. CW for each dataset)

116 116 Results – Reuters + 20NG CW is better (5/6 cases), statistically significant (4/6) CW benefits less from many passes (plots: accuracy vs. passes of training data, PA vs. CW for each dataset)

117 117 Error Reduction by Multiple Passes PA benefits more from multiple passes (8/12) Amount of benefit is data dependent

118 Bayesian Logistic Regression (T. Jaakkola and M. Jordan) Based on the variational approximation Conceptually decoupled update Function of the margin/hinge-loss (table: mean and covariance updates, BLR vs. CW/AROW) 118

119 Algorithms Summary (2nd Order / 1st Order): SOP / Perceptron; CW+AROW / PA; AdaGrad / SGD; LR / Logistic Regression Different motivation, similar algorithms All algorithms can be kernelized Work well for data that is NOT isotropic / symmetric State-of-the-art results in various domains Accompanied by theory 119

