Download presentation

Presentation is loading. Please wait.

Published byLillian Bennett Modified over 2 years ago

1
Sublinear-time Algorithms for Machine Learning Ken Clarkson Elad Hazan David Woodruff IBM Almaden Technion IBM Almaden

2
Linear Classification - margin

3
Linear Classification n vectors in d dimensions: A 1,…,A n 2 R d Can assume norms of the A i are bounded by 1 Labels y 1,…,y n 2 {-1,1} Find vector x such that: 8 i 2 [n], sign(A i x) = y i

4
The Perceptron Algorithm

5
[Rosenblatt 1957, Novikoff 1962, Minsky&Papert 1969] Iteratively: 1. Find vector A i for which sign(A i x) y i 2. Add A i to x: Note: can assume all y i = +1 by multiplying A i by y i

6
Thm [Novikoff 1962]: converges in 1/ 2 iterations Proof: Let x * be the optimal hyperplane, for which A i x * >=

7
For n vectors in d dimensions: 1/ 2 iterations Each – n £ d time, total time: New algorithm: Sublinear time with high probability, leading order term improvement (in running times, poly-log factors are omitted)

8
Why is it surprising ? - margin

9
More results - O(n/ε 2 + d/ε) time alg for minimum enclosing ball (MEB) assuming norms of input points known - Sublinear time kernel versions, e.g., polynomial kernel of deg q - Poly-log space / low pass / sublinear time algorithms for these problems All running times are tight up to polylog factors (we give information theoretic lower bounds)

10
Talk outline - primal-dual optimization in learning - l 2 sampling - MEB - Kernels

11
A Primal-dual Perceptron ´ = £ ~(ε) Iteratively: 1. Primal player supplies hyperplane x t 2. Dual player supplies distribution p t 3. Updates: x t+1 = x t + ´ i n p t (i) A i p t+1 (i) = p t (i) ¢ e - ´ A i x t

12
The Primal-dual Perceptron distribution over examples

13
Optimization via learning game offline optimization problem Player 1: Low regret alg Player 2: Low regret alg Converges to the min-max solution Classification problem: max x in B min i in [n] A i x = max x in B min p in Δ p A i x = min p in Δ max x in B p A i x Reduction Low regret algorithm = after many game iterations, average payoff -> best attainable payoff in hindsight of a fixed strategy

14
Thm: iterations to converge to -approximate solution is bounded by T for which: Total time = # iterations £ time-per-iteration Advantages: - Generic optimization - Easy to apply randomization Player 1 regret: Tε · max x 2 B t p t A x · t p t A x t + Regret 1 Player 2 regret: t p t Ax t · min p 2 ¢ t p A x t + Regret 2 So, min i in [n] t A i x t ¸ Tε – Regret 1 – Regret 2 Output t x t /T Player 1 regret: Tε · max x 2 B t p t A x · t p t A x t + Regret 1 Player 2 regret: t p t Ax t · min p 2 ¢ t p A x t + Regret 2 So, min i in [n] t A i x t ¸ Tε – Regret 1 – Regret 2 Output t x t /T

15
A Primal-dual Perceptron Iteratively: 1. Primal player supplies hyperplane x t 2. Dual player supplies distribution p t 3. Updates: # iterations via regret of OGD/MW: x t+1 = x t + ´ i n p t (i) A i p t+1 (i) = p t (i) ¢ e - ´ A i x t

16
A Primal-dual Perceptron Total time ? Speed up via randomization: 1. Sufficient to look at one example 2. Sufficient to obtain crude estimates of inner products.

17
l 2 sampling Consider two vectors from the d-dim sphere u, v - Sample coordinate i w.p. v i 2 - Return Notice that - Expectation is correct - Variance at most one (magnitude can be d) - Time: O(d)

18
The Primal-Dual Perceptron Iteratively: 1. Primal player supplies hyperplane x t, l 2 -sample from x t 2. Dual player supplies distribution p t, sample from it i t 3. Updates: Important: preprocess x t only once for all estimates Running time: O((n + d)/ε 2 ) p t+1 (i) = p t (i) ¢ e - ´ A i x t p t+1 (i+1) Ã p t (i) ¢ (1- ´ l 2 -sample(A i x t ) + ´ 2 l 2 -sample(A i x t ) 2 ) x t+1 = x t + ´ A i t

19
Analysis Some difficulties: - Non-trivial regret analysis due to sampling - Need new multiplicative update analysis for bounded variance - Analysis shows good solution with constant probability - Need a way to verify a solution to get high probability

20
Streaming Implementation - See rows one at a time - Cant afford to store x t or p t - Want few passes, poly(log n/ε) space, and sublinear time - Want to output succinct representation of hyperplane - list of 1/ε 2 row indices - In t-th iteration, when l 2 -sampling from x t, use the same index j t for all n rows - Store samples i 1, …, i T of rows chosen by dual player, and j 1, …, j T of l 2 -sampling indices of primal player - Sample in a stream using known algorithms

21
Lower Bound - Consider n x d matrix - First 1/ε 2 rows contain a random position equal to ε and all other values are 0 - Each of the remaining n-1/ε 2 rows is a copy of a random row among the first 1/ε 2 - With probability ½, - choose a random row, and replace the value ε by –ε With probability ½, - do nothing - Deciding which case youre in requires reading (n+d)/ ε 2 entries

22
MEB (minimum enclosing ball)

23
A Primal-dual algorithm Iteratively: 1. Primal player supplies point x t 2. Dual player supplies distribution p t 3. Updates: # iterations via regret of OGD/MW: x t+1 = x t + ´ i=1 n p t (i) (A i – x t ) p t+1 (i) = p t (i) ¢ e ´ ||A i -x t || 2

24
l 2 -sampling speed up Iteratively: 1. Primal player supplies point x t 2. Dual player supplies distribution p t 3. Updates: # iterations via regret of OGD/MW: x t+1 = x t + ´ A i t p t+1 (i) = p t (i) (1+ ´ l 2 -sample(||A i -x t || 2 ) + ´ 2 l 2 -sample(||A i -x t| | 2 ) 2

25
Regret speed up Updates: # iterations remains But only in -fraction we have to do O(d) work, though in all iterations we do O(n) work O(n/ε 2 + d/ε) total time with probability ε: x t+1 = x t + ´ A i t p t+1 (i) = p t (i) (1+ ´ l 2 -sample(|A i -x t | 2 ) + ´ 2 l 2 -sample(|A i -x t | 2 ) 2

26
Kernels

27
Map input to higher dimensional space via non-linear mapping. i.e. polynomial: Classification via linear classifier in new space. Efficient classification and optimization if inner products can be computed efficiently (the kernel function)

28
The Primal-Dual Perceptron Iteratively: 1. Primal player supplies hyperplane x t, l 2 sample from x t 2. Dual player supplies distribution p t, sample from it i t 3. Updates: x t+1 Ã x t + ´ A i t

29
The Primal-Dual Kernel Perceptron Iteratively: 1. Primal player supplies hyperplane x t, l 2 -sample from x t 2. Dual player supplies distribution p t, sample from it i t 3. Updates: x t+1 Ã x t + ´ © (A i t )

30
l 2 sampling for kernels Polynomial kernel: Kernel l 2 -sample = q independent l 2 -samples of x T y Running time increases by q Can also use Taylor expansion, to do, say, Gaussian kernels

31

Similar presentations

© 2016 SlidePlayer.com Inc.

All rights reserved.

Ads by Google