Sublinear-time Algorithms for Machine Learning
Ken Clarkson (IBM Almaden), Elad Hazan (Technion), David Woodruff (IBM Almaden)

Linear Classification - margin

Linear Classification
n vectors in d dimensions: A_1, …, A_n ∈ R^d; can assume the norms of the A_i are bounded by 1
Labels y_1, …, y_n ∈ {-1, 1}
Find a vector x such that for all i ∈ [n], sign(A_i x) = y_i

The Perceptron Algorithm

[Rosenblatt 1957, Novikoff 1962, Minsky & Papert 1969]
Iteratively:
1. Find a vector A_i for which sign(A_i x) ≠ y_i
2. Add A_i to x: x ← x + A_i
Note: can assume all y_i = +1 by multiplying A_i by y_i
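Below is a minimal NumPy sketch of this update rule; the function name, iteration cap, and stopping rule are illustrative choices, not part of the talk.

```python
import numpy as np

def perceptron(A, y, max_iters=10_000):
    """Classic Perceptron. A: n x d array with rows of norm <= 1; y: labels in {-1, +1}."""
    B = A * y[:, None]                        # fold labels in: we now need B_i x > 0 for all i
    x = np.zeros(A.shape[1])
    for _ in range(max_iters):
        violated = np.where(B @ x <= 0)[0]    # indices of misclassified examples
        if violated.size == 0:
            return x                          # x separates all examples
        x = x + B[violated[0]]                # add a violating example to x
    return x
```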

Thm [Novikoff 1962]: converges in 1/ε² iterations
Proof: Let x* be the optimal hyperplane, for which A_i x* ≥ ε for all i. Each update increases ⟨x, x*⟩ by at least ε while increasing ||x||² by at most 1, so after T updates Tε ≤ ⟨x, x*⟩ ≤ ||x|| ≤ √T, giving T ≤ 1/ε².

For n vectors in d dimensions: 1/ε² iterations, each taking n × d time, for a total of O(nd/ε²) time
New algorithm: O((n + d)/ε²), sublinear time with high probability, a leading-order-term improvement (in running times, poly-log factors are omitted)

Why is it surprising? (figure: margin)

More results
- O(n/ε² + d/ε) time algorithm for minimum enclosing ball (MEB), assuming the norms of the input points are known
- Sublinear-time kernel versions, e.g., polynomial kernel of degree q
- Poly-log space / low-pass / sublinear-time algorithms for these problems
All running times are tight up to polylog factors (we give information-theoretic lower bounds)

Talk outline
- Primal-dual optimization in learning
- ℓ₂ sampling
- MEB
- Kernels

A Primal-dual Perceptron (η = Θ̃(ε))
Iteratively:
1. Primal player supplies hyperplane x_t
2. Dual player supplies distribution p_t
3. Updates:
   x_{t+1} = x_t + η Σ_{i=1..n} p_t(i) A_i
   p_{t+1}(i) = p_t(i) · e^{-η A_i x_t}
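In code, these exact (un-sampled) updates look roughly as follows. This is a sketch: the step size η = ε, the projection of x back into the unit ball B, and averaging the iterates for the output are assumptions made to keep it self-contained, not the talk's exact constants.

```python
import numpy as np

def primal_dual_perceptron(A, eps, T):
    """Exact primal-dual updates: a gradient step on the primal x and a
    multiplicative-weights (exponential) update on the dual distribution p."""
    n, d = A.shape
    eta = eps                                # eta = Theta~(eps) per the slide; the constant is illustrative
    x, p = np.zeros(d), np.ones(n) / n
    iterates = []
    for _ in range(T):
        iterates.append(x.copy())
        margins = A @ x
        p *= np.exp(-eta * margins)          # dual: upweight examples with small margin
        p /= p.sum()
        x = x + eta * (p @ A)                # primal: step toward the p-weighted average example
        if np.linalg.norm(x) > 1.0:          # keep x in the unit ball B
            x /= np.linalg.norm(x)
    return np.mean(iterates, axis=0)         # output the average iterate, as in the analysis
```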

The Primal-dual Perceptron (figure: distribution over examples)

Optimization via a learning game
Offline optimization problem → repeated game: Player 1 runs a low-regret algorithm, Player 2 runs a low-regret algorithm, and the play converges to the min-max solution.
Classification problem:
max_{x ∈ B} min_{i ∈ [n]} A_i x = max_{x ∈ B} min_{p ∈ Δ} pᵀA x = min_{p ∈ Δ} max_{x ∈ B} pᵀA x
Reduction: a low-regret algorithm guarantees that after many game iterations, the average payoff converges to the best payoff attainable in hindsight by a fixed strategy.

Thm: the number of iterations to converge to an ε-approximate solution is bounded by the T for which:
Player 1 regret: Tε ≤ max_{x ∈ B} Σ_t p_t A x ≤ Σ_t p_t A x_t + Regret_1
Player 2 regret: Σ_t p_t A x_t ≤ min_{p ∈ Δ} Σ_t p A x_t + Regret_2
So min_{i ∈ [n]} Σ_t A_i x_t ≥ Tε − Regret_1 − Regret_2; output Σ_t x_t / T
Total time = # iterations × time-per-iteration
Advantages:
- Generic optimization
- Easy to apply randomization

A Primal-dual Perceptron
Iteratively:
1. Primal player supplies hyperplane x_t
2. Dual player supplies distribution p_t
3. Updates:
   x_{t+1} = x_t + η Σ_{i=1..n} p_t(i) A_i
   p_{t+1}(i) = p_t(i) · e^{-η A_i x_t}
# iterations bounded via the regret of OGD/MW

A Primal-dual Perceptron
Total time? Speed up via randomization:
1. Sufficient to look at one example
2. Sufficient to obtain crude estimates of inner products.

ℓ₂ sampling
Consider two vectors u, v from the d-dimensional sphere:
- Sample coordinate i w.p. v_i²
- Return u_i / v_i
Notice that:
- The expectation is correct: Σ_i v_i² (u_i / v_i) = u·v
- Variance is at most one (though the magnitude of a single sample can be large, e.g., d)
- Time: O(d)
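A sketch of this estimator in NumPy, written for general (not necessarily unit-norm) v; unbiasedness follows since E = Σ_i (v_i²/||v||²) · (u_i ||v||²/v_i) = u·v. The function name is illustrative.

```python
import numpy as np

def l2_sample(u, v, rng=np.random.default_rng()):
    """One l2-sample estimate of <u, v>: pick coordinate i with probability
    v_i^2 / ||v||^2 and return u_i * ||v||^2 / v_i (u_i / v_i when ||v|| = 1)."""
    v_sq = np.dot(v, v)
    i = rng.choice(len(v), p=v**2 / v_sq)     # coordinate sampled proportionally to v_i^2
    return u[i] * v_sq / v[i]
```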

The Primal-Dual Perceptron (sublinear version)
Iteratively:
1. Primal player supplies hyperplane x_t and ℓ₂-samples from x_t
2. Dual player supplies distribution p_t and samples i_t from it
3. Updates:
   x_{t+1} = x_t + η A_{i_t}
   p_{t+1}(i) ← p_t(i) · (1 − η·ℓ₂-sample(A_i x_t) + η²·ℓ₂-sample(A_i x_t)²)
   (replacing the exact update p_{t+1}(i) = p_t(i) · e^{-η A_i x_t})
Important: preprocess x_t only once for all estimates
Running time: O((n + d)/ε²)
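Putting the pieces together, here is a sketch of the sampled variant. The step size, the clipping of large estimates, and the normalization safeguards are assumptions made so the sketch runs end to end; they are not the paper's exact choices.

```python
import numpy as np

def sublinear_perceptron(A, eps, T, rng=np.random.default_rng()):
    """Sampled primal-dual perceptron: one l2-sampling index of x_t serves all n rows
    (dual update), and one row sampled from p_t updates x_t (primal update)."""
    n, d = A.shape
    eta = eps
    x, p = np.zeros(d), np.ones(n) / n
    iterates = []
    for _ in range(T):
        iterates.append(x.copy())
        x_sq = np.dot(x, x)
        if x_sq > 0:
            j = rng.choice(d, p=x**2 / x_sq)          # preprocess x_t once: one index for all rows
            est = A[:, j] * x_sq / x[j]               # l2-sample estimates of A_i x_t for every i
        else:
            est = np.zeros(n)
        est = np.clip(est, -1.0 / eta, 1.0 / eta)     # assumption: clip to keep updates bounded
        p *= 1 - eta * est + eta**2 * est**2          # second-order multiplicative update
        p /= p.sum()
        i_t = rng.choice(n, p=p)                      # dual player samples one row
        x = x + eta * A[i_t]                          # primal update touches only d coordinates
        if np.linalg.norm(x) > 1.0:                   # keep x in the unit ball B
            x /= np.linalg.norm(x)
    return np.mean(iterates, axis=0)
```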

Analysis: some difficulties
- Non-trivial regret analysis due to sampling
- Need a new multiplicative-update analysis for bounded variance
- The analysis shows a good solution with constant probability
- Need a way to verify a solution to get high probability

Streaming Implementation
- See rows one at a time
- Can't afford to store x_t or p_t
- Want few passes, poly(log n/ε) space, and sublinear time
- Want to output a succinct representation of the hyperplane: a list of 1/ε² row indices
- In the t-th iteration, when ℓ₂-sampling from x_t, use the same index j_t for all n rows
- Store the samples i_1, …, i_T of rows chosen by the dual player, and j_1, …, j_T of ℓ₂-sampling indices of the primal player
- Sample in a stream using known algorithms

Lower Bound
- Consider an n × d matrix
- The first 1/ε² rows each contain a random position equal to ε; all other values are 0
- Each of the remaining n − 1/ε² rows is a copy of a random row among the first 1/ε²
- With probability ½, choose a random row and replace the value ε by −ε; with probability ½, do nothing
- Deciding which case you're in requires reading Ω((n + d)/ε²) entries
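For concreteness, a sketch that generates the two cases of this construction, following the bullets literally; the function name and RNG choices are illustrative.

```python
import numpy as np

def hard_instance(n, d, eps, flip, rng=np.random.default_rng()):
    """Lower-bound instance: k = 1/eps^2 base rows, each with a single entry eps
    at a random position; the remaining n - k rows copy random base rows.
    If flip is True, one uniformly random row has its eps entry replaced by -eps."""
    k = int(round(1 / eps**2))
    A = np.zeros((n, d))
    A[np.arange(k), rng.integers(d, size=k)] = eps   # base rows
    A[k:] = A[rng.integers(k, size=n - k)]           # copies of random base rows
    if flip:                                         # the "with probability 1/2" branch
        r = rng.integers(n)
        A[r, A[r] == eps] = -eps                     # flip the eps entry of one random row
    return A
```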

MEB (minimum enclosing ball)

A Primal-dual algorithm for MEB
Iteratively:
1. Primal player supplies point x_t
2. Dual player supplies distribution p_t
3. Updates:
   x_{t+1} = x_t + η Σ_{i=1..n} p_t(i) (A_i − x_t)
   p_{t+1}(i) = p_t(i) · e^{η ||A_i − x_t||²}
# iterations bounded via the regret of OGD/MW
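As with the perceptron, a sketch of these exact (un-sampled) MEB updates; the step size, initialization at the centroid, and iterate averaging are assumptions.

```python
import numpy as np

def primal_dual_meb(A, eps, T):
    """Exact primal-dual MEB updates: the center x steps toward the p-weighted points,
    while p exponentially upweights points that are far from the current center."""
    n, d = A.shape
    eta = eps                                              # assumption: Theta(eps)-scale step size
    x, p = A.mean(axis=0), np.ones(n) / n
    iterates = []
    for _ in range(T):
        iterates.append(x.copy())
        diffs = A - x
        dists_sq = np.einsum('ij,ij->i', diffs, diffs)     # ||A_i - x_t||^2 for every i
        p *= np.exp(eta * dists_sq)                        # dual: far points get more weight
        p /= p.sum()
        x = x + eta * (p @ diffs)                          # primal: step toward weighted far points
    return np.mean(iterates, axis=0)
```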

ℓ₂-sampling speed up
Iteratively:
1. Primal player supplies point x_t
2. Dual player supplies distribution p_t
3. Updates:
   x_{t+1} = x_t + η A_{i_t}
   p_{t+1}(i) = p_t(i) · (1 + η·ℓ₂-sample(||A_i − x_t||²) + η²·ℓ₂-sample(||A_i − x_t||²)²)
# iterations bounded via the regret of OGD/MW

Regret speed up
Updates:
   with probability ε: x_{t+1} = x_t + η A_{i_t}
   p_{t+1}(i) = p_t(i) · (1 + η·ℓ₂-sample(||A_i − x_t||²) + η²·ℓ₂-sample(||A_i − x_t||²)²)
The # of iterations stays the same, but O(d) work (the x update) is needed only in an ε-fraction of iterations, while O(n) work (the p update) is done in every iteration
O(n/ε² + d/ε) total time

Kernels

Map the input to a higher-dimensional space via a non-linear mapping Φ, e.g., a polynomial map. Classification is then done via a linear classifier in the new space. Classification and optimization are efficient if inner products ⟨Φ(x), Φ(y)⟩ can be computed efficiently (the kernel function).
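For the degree-q polynomial kernel used below, the kernel function is just a power of the input-space inner product; a tiny sketch:

```python
import numpy as np

def poly_kernel(x, y, q):
    """Degree-q polynomial kernel: k(x, y) = (x . y)^q, equal to the inner product
    <Phi(x), Phi(y)> for an implicit feature map Phi of (weighted) degree-q monomials,
    computed in O(d) time instead of feature-space-dimension time."""
    return np.dot(x, y) ** q
```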

The Primal-Dual Perceptron
Iteratively:
1. Primal player supplies hyperplane x_t and ℓ₂-samples from x_t
2. Dual player supplies distribution p_t and samples i_t from it
3. Updates: x_{t+1} ← x_t + η A_{i_t}

The Primal-Dual Kernel Perceptron
Iteratively:
1. Primal player supplies hyperplane x_t and ℓ₂-samples from x_t
2. Dual player supplies distribution p_t and samples i_t from it
3. Updates: x_{t+1} ← x_t + η Φ(A_{i_t})

ℓ₂ sampling for kernels
Polynomial kernel: a kernel ℓ₂-sample is the product of q independent ℓ₂-samples of xᵀy
Running time increases by a factor of q
Can also use a Taylor expansion to handle, say, Gaussian kernels
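A sketch of that kernel ℓ₂-sample for the degree-q polynomial kernel: multiplying q independent ℓ₂-sample estimates of uᵀv gives an unbiased estimate of (uᵀv)^q by independence. The function name is illustrative.

```python
import numpy as np

def l2_sample_poly_kernel(u, v, q, rng=np.random.default_rng()):
    """Unbiased estimate of the degree-q polynomial kernel (u . v)^q:
    the product of q independent l2-sample estimates of u . v."""
    v_sq = np.dot(v, v)
    idx = rng.choice(len(v), size=q, p=v**2 / v_sq)   # q independent coordinate samples
    return np.prod(u[idx] * v_sq / v[idx])
```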