Download presentation

Presentation is loading. Please wait.

Published byLillian Bennett Modified over 4 years ago

1
Sublinear-time Algorithms for Machine Learning Ken Clarkson Elad Hazan David Woodruff IBM Almaden Technion IBM Almaden

2
Linear Classification - margin

3
Linear Classification n vectors in d dimensions: A 1,…,A n 2 R d Can assume norms of the A i are bounded by 1 Labels y 1,…,y n 2 {-1,1} Find vector x such that: 8 i 2 [n], sign(A i x) = y i

4
The Perceptron Algorithm

5
[Rosenblatt 1957, Novikoff 1962, Minsky&Papert 1969] Iteratively: 1. Find vector A i for which sign(A i x) y i 2. Add A i to x: Note: can assume all y i = +1 by multiplying A i by y i

6
Thm [Novikoff 1962]: converges in 1/ 2 iterations Proof: Let x * be the optimal hyperplane, for which A i x * >=

7
For n vectors in d dimensions: 1/ 2 iterations Each – n £ d time, total time: New algorithm: Sublinear time with high probability, leading order term improvement (in running times, poly-log factors are omitted)

8
Why is it surprising ? - margin

9
More results - O(n/ε 2 + d/ε) time alg for minimum enclosing ball (MEB) assuming norms of input points known - Sublinear time kernel versions, e.g., polynomial kernel of deg q - Poly-log space / low pass / sublinear time algorithms for these problems All running times are tight up to polylog factors (we give information theoretic lower bounds)

10
Talk outline - primal-dual optimization in learning - l 2 sampling - MEB - Kernels

11
A Primal-dual Perceptron ´ = £ ~(ε) Iteratively: 1. Primal player supplies hyperplane x t 2. Dual player supplies distribution p t 3. Updates: x t+1 = x t + ´ i n p t (i) A i p t+1 (i) = p t (i) ¢ e - ´ A i x t

12
The Primal-dual Perceptron distribution over examples

13
Optimization via learning game offline optimization problem Player 1: Low regret alg Player 2: Low regret alg Converges to the min-max solution Classification problem: max x in B min i in [n] A i x = max x in B min p in Δ p A i x = min p in Δ max x in B p A i x Reduction Low regret algorithm = after many game iterations, average payoff -> best attainable payoff in hindsight of a fixed strategy

14
Thm: iterations to converge to -approximate solution is bounded by T for which: Total time = # iterations £ time-per-iteration Advantages: - Generic optimization - Easy to apply randomization Player 1 regret: Tε · max x 2 B t p t A x · t p t A x t + Regret 1 Player 2 regret: t p t Ax t · min p 2 ¢ t p A x t + Regret 2 So, min i in [n] t A i x t ¸ Tε – Regret 1 – Regret 2 Output t x t /T Player 1 regret: Tε · max x 2 B t p t A x · t p t A x t + Regret 1 Player 2 regret: t p t Ax t · min p 2 ¢ t p A x t + Regret 2 So, min i in [n] t A i x t ¸ Tε – Regret 1 – Regret 2 Output t x t /T

15
A Primal-dual Perceptron Iteratively: 1. Primal player supplies hyperplane x t 2. Dual player supplies distribution p t 3. Updates: # iterations via regret of OGD/MW: x t+1 = x t + ´ i n p t (i) A i p t+1 (i) = p t (i) ¢ e - ´ A i x t

16
A Primal-dual Perceptron Total time ? Speed up via randomization: 1. Sufficient to look at one example 2. Sufficient to obtain crude estimates of inner products.

17
l 2 sampling Consider two vectors from the d-dim sphere u, v - Sample coordinate i w.p. v i 2 - Return Notice that - Expectation is correct - Variance at most one (magnitude can be d) - Time: O(d)

18
The Primal-Dual Perceptron Iteratively: 1. Primal player supplies hyperplane x t, l 2 -sample from x t 2. Dual player supplies distribution p t, sample from it i t 3. Updates: Important: preprocess x t only once for all estimates Running time: O((n + d)/ε 2 ) p t+1 (i) = p t (i) ¢ e - ´ A i x t p t+1 (i+1) Ã p t (i) ¢ (1- ´ l 2 -sample(A i x t ) + ´ 2 l 2 -sample(A i x t ) 2 ) x t+1 = x t + ´ A i t

19
Analysis Some difficulties: - Non-trivial regret analysis due to sampling - Need new multiplicative update analysis for bounded variance - Analysis shows good solution with constant probability - Need a way to verify a solution to get high probability

20
Streaming Implementation - See rows one at a time - Cant afford to store x t or p t - Want few passes, poly(log n/ε) space, and sublinear time - Want to output succinct representation of hyperplane - list of 1/ε 2 row indices - In t-th iteration, when l 2 -sampling from x t, use the same index j t for all n rows - Store samples i 1, …, i T of rows chosen by dual player, and j 1, …, j T of l 2 -sampling indices of primal player - Sample in a stream using known algorithms

21
Lower Bound - Consider n x d matrix - First 1/ε 2 rows contain a random position equal to ε and all other values are 0 - Each of the remaining n-1/ε 2 rows is a copy of a random row among the first 1/ε 2 - With probability ½, - choose a random row, and replace the value ε by –ε With probability ½, - do nothing - Deciding which case youre in requires reading (n+d)/ ε 2 entries

22
MEB (minimum enclosing ball)

23
A Primal-dual algorithm Iteratively: 1. Primal player supplies point x t 2. Dual player supplies distribution p t 3. Updates: # iterations via regret of OGD/MW: x t+1 = x t + ´ i=1 n p t (i) (A i – x t ) p t+1 (i) = p t (i) ¢ e ´ ||A i -x t || 2

24
l 2 -sampling speed up Iteratively: 1. Primal player supplies point x t 2. Dual player supplies distribution p t 3. Updates: # iterations via regret of OGD/MW: x t+1 = x t + ´ A i t p t+1 (i) = p t (i) (1+ ´ l 2 -sample(||A i -x t || 2 ) + ´ 2 l 2 -sample(||A i -x t| | 2 ) 2

25
Regret speed up Updates: # iterations remains But only in -fraction we have to do O(d) work, though in all iterations we do O(n) work O(n/ε 2 + d/ε) total time with probability ε: x t+1 = x t + ´ A i t p t+1 (i) = p t (i) (1+ ´ l 2 -sample(|A i -x t | 2 ) + ´ 2 l 2 -sample(|A i -x t | 2 ) 2

26
Kernels

27
Map input to higher dimensional space via non-linear mapping. i.e. polynomial: Classification via linear classifier in new space. Efficient classification and optimization if inner products can be computed efficiently (the kernel function)

28
The Primal-Dual Perceptron Iteratively: 1. Primal player supplies hyperplane x t, l 2 sample from x t 2. Dual player supplies distribution p t, sample from it i t 3. Updates: x t+1 Ã x t + ´ A i t

29
The Primal-Dual Kernel Perceptron Iteratively: 1. Primal player supplies hyperplane x t, l 2 -sample from x t 2. Dual player supplies distribution p t, sample from it i t 3. Updates: x t+1 Ã x t + ´ © (A i t )

30
l 2 sampling for kernels Polynomial kernel: Kernel l 2 -sample = q independent l 2 -samples of x T y Running time increases by q Can also use Taylor expansion, to do, say, Gaussian kernels

Similar presentations

OK

Virtual Vector Machine for Bayesian Online Classification Yuan (Alan) Qi CS & Statistics Purdue June, 2009 Joint work with T.P. Minka and R. Xiang.

Virtual Vector Machine for Bayesian Online Classification Yuan (Alan) Qi CS & Statistics Purdue June, 2009 Joint work with T.P. Minka and R. Xiang.

© 2018 SlidePlayer.com Inc.

All rights reserved.

To make this website work, we log user data and share it with processors. To use this website, you must agree to our Privacy Policy, including cookie policy.

Ads by Google