# Sublinear-time Algorithms for Machine Learning Ken Clarkson Elad Hazan David Woodruff IBM Almaden Technion IBM Almaden.

## Presentation on theme: "Sublinear-time Algorithms for Machine Learning Ken Clarkson Elad Hazan David Woodruff IBM Almaden Technion IBM Almaden."— Presentation transcript:

Linear Classification - margin

Linear Classification n vectors in d dimensions: A 1,…,A n 2 R d Can assume norms of the A i are bounded by 1 Labels y 1,…,y n 2 {-1,1} Find vector x such that: 8 i 2 [n], sign(A i x) = y i

The Perceptron Algorithm

[Rosenblatt 1957, Novikoff 1962, Minsky&Papert 1969] Iteratively: 1. Find vector A i for which sign(A i x) y i 2. Add A i to x: Note: can assume all y i = +1 by multiplying A i by y i

Thm [Novikoff 1962]: converges in 1/ 2 iterations Proof: Let x * be the optimal hyperplane, for which A i x * >=

For n vectors in d dimensions: 1/ 2 iterations Each – n £ d time, total time: New algorithm: Sublinear time with high probability, leading order term improvement (in running times, poly-log factors are omitted)

Why is it surprising ? - margin

More results - O(n/ε 2 + d/ε) time alg for minimum enclosing ball (MEB) assuming norms of input points known - Sublinear time kernel versions, e.g., polynomial kernel of deg q - Poly-log space / low pass / sublinear time algorithms for these problems All running times are tight up to polylog factors (we give information theoretic lower bounds)

Talk outline - primal-dual optimization in learning - l 2 sampling - MEB - Kernels

A Primal-dual Perceptron ´ = £ ~(ε) Iteratively: 1. Primal player supplies hyperplane x t 2. Dual player supplies distribution p t 3. Updates: x t+1 = x t + ´ i n p t (i) A i p t+1 (i) = p t (i) ¢ e - ´ A i x t

The Primal-dual Perceptron distribution over examples

Optimization via learning game offline optimization problem Player 1: Low regret alg Player 2: Low regret alg Converges to the min-max solution Classification problem: max x in B min i in [n] A i x = max x in B min p in Δ p A i x = min p in Δ max x in B p A i x Reduction Low regret algorithm = after many game iterations, average payoff -> best attainable payoff in hindsight of a fixed strategy

Thm: iterations to converge to -approximate solution is bounded by T for which: Total time = # iterations £ time-per-iteration Advantages: - Generic optimization - Easy to apply randomization Player 1 regret: Tε · max x 2 B t p t A x · t p t A x t + Regret 1 Player 2 regret: t p t Ax t · min p 2 ¢ t p A x t + Regret 2 So, min i in [n] t A i x t ¸ Tε – Regret 1 – Regret 2 Output t x t /T Player 1 regret: Tε · max x 2 B t p t A x · t p t A x t + Regret 1 Player 2 regret: t p t Ax t · min p 2 ¢ t p A x t + Regret 2 So, min i in [n] t A i x t ¸ Tε – Regret 1 – Regret 2 Output t x t /T

A Primal-dual Perceptron Iteratively: 1. Primal player supplies hyperplane x t 2. Dual player supplies distribution p t 3. Updates: # iterations via regret of OGD/MW: x t+1 = x t + ´ i n p t (i) A i p t+1 (i) = p t (i) ¢ e - ´ A i x t

A Primal-dual Perceptron Total time ? Speed up via randomization: 1. Sufficient to look at one example 2. Sufficient to obtain crude estimates of inner products.

l 2 sampling Consider two vectors from the d-dim sphere u, v - Sample coordinate i w.p. v i 2 - Return Notice that - Expectation is correct - Variance at most one (magnitude can be d) - Time: O(d)

The Primal-Dual Perceptron Iteratively: 1. Primal player supplies hyperplane x t, l 2 -sample from x t 2. Dual player supplies distribution p t, sample from it i t 3. Updates: Important: preprocess x t only once for all estimates Running time: O((n + d)/ε 2 ) p t+1 (i) = p t (i) ¢ e - ´ A i x t p t+1 (i+1) Ã p t (i) ¢ (1- ´ l 2 -sample(A i x t ) + ´ 2 l 2 -sample(A i x t ) 2 ) x t+1 = x t + ´ A i t

Analysis Some difficulties: - Non-trivial regret analysis due to sampling - Need new multiplicative update analysis for bounded variance - Analysis shows good solution with constant probability - Need a way to verify a solution to get high probability

Streaming Implementation - See rows one at a time - Cant afford to store x t or p t - Want few passes, poly(log n/ε) space, and sublinear time - Want to output succinct representation of hyperplane - list of 1/ε 2 row indices - In t-th iteration, when l 2 -sampling from x t, use the same index j t for all n rows - Store samples i 1, …, i T of rows chosen by dual player, and j 1, …, j T of l 2 -sampling indices of primal player - Sample in a stream using known algorithms

Lower Bound - Consider n x d matrix - First 1/ε 2 rows contain a random position equal to ε and all other values are 0 - Each of the remaining n-1/ε 2 rows is a copy of a random row among the first 1/ε 2 - With probability ½, - choose a random row, and replace the value ε by –ε With probability ½, - do nothing - Deciding which case youre in requires reading (n+d)/ ε 2 entries

MEB (minimum enclosing ball)

A Primal-dual algorithm Iteratively: 1. Primal player supplies point x t 2. Dual player supplies distribution p t 3. Updates: # iterations via regret of OGD/MW: x t+1 = x t + ´ i=1 n p t (i) (A i – x t ) p t+1 (i) = p t (i) ¢ e ´ ||A i -x t || 2

l 2 -sampling speed up Iteratively: 1. Primal player supplies point x t 2. Dual player supplies distribution p t 3. Updates: # iterations via regret of OGD/MW: x t+1 = x t + ´ A i t p t+1 (i) = p t (i) (1+ ´ l 2 -sample(||A i -x t || 2 ) + ´ 2 l 2 -sample(||A i -x t| | 2 ) 2

Regret speed up Updates: # iterations remains But only in -fraction we have to do O(d) work, though in all iterations we do O(n) work O(n/ε 2 + d/ε) total time with probability ε: x t+1 = x t + ´ A i t p t+1 (i) = p t (i) (1+ ´ l 2 -sample(|A i -x t | 2 ) + ´ 2 l 2 -sample(|A i -x t | 2 ) 2

Kernels

Map input to higher dimensional space via non-linear mapping. i.e. polynomial: Classification via linear classifier in new space. Efficient classification and optimization if inner products can be computed efficiently (the kernel function)

The Primal-Dual Perceptron Iteratively: 1. Primal player supplies hyperplane x t, l 2 sample from x t 2. Dual player supplies distribution p t, sample from it i t 3. Updates: x t+1 Ã x t + ´ A i t

The Primal-Dual Kernel Perceptron Iteratively: 1. Primal player supplies hyperplane x t, l 2 -sample from x t 2. Dual player supplies distribution p t, sample from it i t 3. Updates: x t+1 Ã x t + ´ © (A i t )

l 2 sampling for kernels Polynomial kernel: Kernel l 2 -sample = q independent l 2 -samples of x T y Running time increases by q Can also use Taylor expansion, to do, say, Gaussian kernels