
Online Learning (And Other Cool Stuff)
Your guide: Avrim Blum, Carnegie Mellon University
[Machine Learning Summer School 2012]

Itinerary
Stop 1: Minimizing regret and combining advice.
– Randomized Wtd Majority / Multiplicative Weights alg
– Connections to game theory
Stop 2: Extensions
– Online learning from limited feedback (bandit algs)
– Algorithms for large action spaces, sleeping experts
Stop 3: Powerful online LTF algorithms
– Winnow, Perceptron
Stop 4: Powerful tools for using these algorithms
– Kernels and Similarity functions
Stop 5: Something completely different
– Distributed machine learning

Powerful tools for learning: Kernels and Similarity Functions

2-minute version: Suppose we are given a set of images and want to learn a rule to distinguish men from women. Problem: the pixel representation is not so good. A powerful technique for such settings is to use a kernel: a special kind of pairwise function K(·,·).
– Can think about & analyze kernels in terms of implicit mappings, building on the margin analysis we just did for Perceptron (and similarly for SVMs).
– Can also analyze them directly as similarity functions, building on the analysis we just did for Winnow. [Balcan-B’06] [Balcan-B-Srebro’08]

Kernel functions and Learning
Back to our generic classification problem. E.g., given a set of images, labeled by gender, learn a rule to distinguish men from women. [Goal: do well on new data]
Problem: our best algorithms learn linear separators, but these might not be good for the data in its natural representation.
– Old approach: use a more complex class of functions.
– More recent approach: use a kernel.

What’s a kernel? A kernel K is a legal definition of dot-product: a function such that there exists an implicit mapping φ_K with K(x,y) = φ_K(x)·φ_K(y). E.g., K(x,y) = (x·y + 1)^d.
– φ_K: (n-dimensional space) → (n^d-dimensional space).
Point is: many learning algorithms can be written so they only interact with the data via dot-products.
– E.g., Perceptron: w = x^(1) + x^(2) – x^(5) + x^(9), so w·x = (x^(1) + x^(2) – x^(5) + x^(9))·x.
– If we replace x·y with K(x,y), the algorithm acts implicitly as if the data were in the higher-dimensional φ-space.
(The kernel should be positive semi-definite (PSD).)
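To make the “only interact with data via dot-products” point concrete, here is a minimal sketch (my own illustration, not from the slides) of a kernelized Perceptron. The weight vector is never built explicitly; it is kept implicitly as a signed sum of mistake examples, so every dot product w·x becomes a sum of kernel evaluations.

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    """K(x, y) = (x . y + 1)^d, the example kernel from the slide."""
    return (np.dot(x, y) + 1) ** d

def kernel_perceptron(X, y, K=polynomial_kernel, epochs=5):
    """Kernelized Perceptron: w is kept implicitly as a signed sum of the
    examples on which mistakes were made, so predictions only need K."""
    n = len(X)
    alpha = np.zeros(n)               # alpha[i] = number of mistakes made on example i
    for _ in range(epochs):
        for i in range(n):
            # w . phi(x_i) = sum_j alpha[j] * y[j] * K(x_j, x_i)
            score = sum(alpha[j] * y[j] * K(X[j], X[i]) for j in range(n) if alpha[j] != 0)
            if y[i] * score <= 0:     # mistake: implicitly add y_i * phi(x_i) to w
                alpha[i] += 1
    return alpha

def kernel_predict(alpha, X, y, x_new, K=polynomial_kernel):
    score = sum(alpha[j] * y[j] * K(X[j], x_new) for j in range(len(X)) if alpha[j] != 0)
    return 1 if score > 0 else -1
```

Swapping in a different K changes the implicit φ-space without touching the rest of the algorithm.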

Example: for the case of n=2, d=2, the kernel K(x,y) = (1 + x·y)^d corresponds to an explicit mapping. [Figure: O/X data that is not linearly separable in the original (x_1, x_2) coordinates becomes linearly separable in the mapped (z_1, z_2, z_3) coordinates.]

Moreover, we generalize well if there is a good margin. If the data is linearly separable by margin γ in φ-space (assume |φ(x)| ≤ 1), then a sample of size only Õ(1/γ²) is needed to get confidence in generalization. E.g., this follows directly from the mistake bound we proved for Perceptron. Kernels have been found useful in practice for dealing with many, many different kinds of data.

But there is a little bit of a disconnect... In practice, kernels are constructed by viewing them as a measure of similarity: K(x,y) ∈ [-1,1], with some extra requirements. But the theory talks about margins in an implicit high-dimensional φ-space, where K(x,y) = φ(x)·φ(y). Can we give an explanation for the desirable properties of a similarity function that doesn’t use implicit spaces? And even remove the PSD requirement?

Goal: a notion of “good similarity function” for a learning problem that…
1. Talks in terms of more intuitive properties (no implicit high-dimensional spaces, no requirement of positive-semidefiniteness, etc.)
2. If K satisfies these properties for our given problem, then this has implications for learning.
3. Includes the usual notion of a “good kernel” (one that induces a large-margin separator in φ-space).

Defn satisfying (1) and (2): Say we have a learning problem P (distribution D over examples labeled by unknown target f). A similarity function K: (x,y) → [-1,1] is (ε, γ)-good for P if at least a 1-ε fraction of examples x satisfy:
E_{y~D}[K(x,y) | l(y)=l(x)] ≥ E_{y~D}[K(x,y) | l(y)≠l(x)] + γ
i.e., the average similarity to points of the same label exceeds the average similarity to points of the opposite label by a gap γ: “most x are on average more similar to points y of their own type than to points y of the other type.”
Note: it’s possible to satisfy this and not be PSD.
How can we use it?

How to use it: At least a 1-ε probability mass of x satisfy E_{y~D}[K(x,y) | l(y)=l(x)] ≥ E_{y~D}[K(x,y) | l(y)≠l(x)] + γ. Draw sets S⁺, S⁻ of labeled positive and negative examples, and classify a new x according to whether its average similarity to S⁺ or to S⁻ is larger.
Proof:
– For any given “good x”, the probability of error over the draw of S⁺, S⁻ is at most δ².
– So, there is at most a δ chance that our draw is bad on more than a δ fraction of the “good x”.
– With probability ≥ 1-δ, the error rate is ≤ ε + δ.
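A minimal sketch of the resulting predictor, assuming a similarity function K, labeled reference samples S_pos / S_neg, and ±1 labels (all names are mine): classify x by whether its average similarity to the sampled positives exceeds its average similarity to the sampled negatives.

```python
import numpy as np

def avg_similarity_classifier(K, S_pos, S_neg):
    """Given a similarity function K and reference samples S_pos, S_neg,
    predict +1 iff average similarity to S_pos exceeds that to S_neg."""
    def predict(x):
        sim_pos = np.mean([K(x, y) for y in S_pos])
        sim_neg = np.mean([K(x, y) for y in S_neg])
        return 1 if sim_pos >= sim_neg else -1
    return predict
```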

But not broad enough: K(x,y) = x·y has a good separator but doesn’t satisfy the definition (half of the positives are more similar to the negatives than to a typical positive). [Figure: two groups of positives (+ +) and one group of negatives (–).] Average similarity to the negatives is ½, but to the positives it is only ½·1 + ½·(-½) = ¼.

But not broad enough — Idea: this would work if we didn’t pick the y’s from the top-left group. Broaden the definition to say: it is OK if there exists a large region R such that most x are on average more similar to y ∈ R of the same label than to y ∈ R of the other label (even if we don’t know R in advance).

Broader defn… Ask that there exists a set R of “reasonable” y (allowed to be probabilistic) such that almost all x satisfy
E_y[K(x,y) | l(y)=l(x), R(y)] ≥ E_y[K(x,y) | l(y)≠l(x), R(y)] + γ
Formally, say K is (ε, γ, τ)-good if the hinge-loss is ε, and Pr(R⁺), Pr(R⁻) ≥ τ.
Claim 1: this is a legitimate way to think about good (large-margin) kernels:
– If a γ-good kernel, then (ε, γ², τ)-good here.
– If γ-good here and PSD, then a γ-good kernel.
Claim 2: even if not PSD, K can still be used for learning.
– So, we don’t need an implicit-space interpretation to be useful for learning.
– But maybe not with SVM/Perceptron directly…

How to use such a sim fn? Ask that there exists a set R of “reasonable” y (allowed to be probabilistic) such that almost all x satisfy
E_y[K(x,y) | l(y)=l(x), R(y)] ≥ E_y[K(x,y) | l(y)≠l(x), R(y)] + γ
– Draw S = {y_1,…,y_n}, n ≈ 1/(γ²τ). (These could be unlabeled.)
– View them as “landmarks”, and use them to map new data: F(x) = [K(x,y_1), …, K(x,y_n)].
– Whp, there exists a separator of good L_1 margin in this space: e.g., w* = [0,0,1/n⁺,1/n⁺,0,0,0,-1/n⁻,0,0].
– So, take a new set of examples, project them to this space, and run a good L_1 algorithm (e.g., Winnow)!
If K is (ε, γ, τ)-good, then we can learn to error ε + ε′ with O((1/(ε′γ²)) log(n)) labeled examples.
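A sketch of the landmark construction under these assumptions: any similarity function K, a (possibly unlabeled) set of landmark points, and an off-the-shelf L1-regularized linear learner (scikit-learn's L1 logistic regression here, standing in for Winnow or an L1-SVM).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # L1-penalized stand-in for Winnow

def landmark_features(K, landmarks):
    """Map each x to F(x) = [K(x, y_1), ..., K(x, y_n)] over the landmark points y_i."""
    return lambda X: np.array([[K(x, y) for y in landmarks] for x in X])

# Usage sketch (hypothetical data variables):
# F = landmark_features(K, unlabeled_landmarks)
# clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
# clf.fit(F(X_train), y_train)
# preds = clf.predict(F(X_test))
```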

Learning with Multiple Similarity Functions: Let K_1, …, K_r be similarity functions such that some (unknown) convex combination of them is (ε, γ)-good.
Algorithm: Draw S = {y_1,…,y_n} as a set of landmarks. Concatenate the features:
F(x) = [K_1(x,y_1), …, K_r(x,y_1), …, K_1(x,y_n), …, K_r(x,y_n)].
Run the same L_1 optimization algorithm as before (or Winnow) in this new feature space.

Learning with Multiple Similarity Functions: Let K_1, …, K_r be similarity functions such that some (unknown) convex combination of them is (ε, γ)-good.
Guarantee: Whp the induced distribution F(P) in R^{nr} has a separator of error ≤ ε + δ at an L_1 margin that is essentially unchanged (same as for a single similarity function).
Sample complexity is roughly O((1/(ε′γ²)) log(nr)) — it only increases by a log(r) factor!
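For multiple similarity functions the only change is concatenation; a sketch with hypothetical names, reusing the landmark idea above:

```python
import numpy as np

def multi_landmark_features(Ks, landmarks):
    """F(x) = [K_1(x,y_1), ..., K_r(x,y_1), ..., K_1(x,y_n), ..., K_r(x,y_n)]:
    for each landmark y, append the value of every similarity function at (x, y)."""
    return lambda X: np.array([[K(x, y) for y in landmarks for K in Ks] for x in X])
```

The downstream L1 learner is exactly the same as in the single-similarity sketch; only the feature map changes.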

Learning with Multiple Similarity Functions — interesting fact: because the property is defined in terms of L_1, there is no change in margin.
– Only a log(r) penalty for concatenating feature spaces.
– With L_2, the margin would drop by a factor of r^{1/2}, giving an O(r) penalty in sample complexity.
The algorithm is also very simple (just concatenate).

Applications/extensions:
– Bellet, A.; Habrard, A.; Sebban, M., ICTAI 2011: the notion fits well with string edit similarities. If used directly this way rather than converting to a PSD kernel, comparable performance and much sparser models. (They use an L_1-normalized SVM.)
– Bellet, A.; Habrard, A.; Sebban, M., MLJ 2012 and ICML 2012: efficient algorithms for learning (ε, γ, τ)-good similarity functions in different contexts.

Summary: Kernels and similarity functions are powerful tools for learning.
– Can analyze kernels using the theory of L_2 margins, and plug in to Perceptron or SVM.
– Can also analyze more general similarity functions (not necessarily PSD) without implicit spaces, connecting with L_1 margins and Winnow / L_1-SVM.
– The second notion includes the first as well (modulo some loss in parameters).
– Potentially other interesting sufficient conditions too, e.g., [WangYangFeng07] motivated by boosting.

Itinerary
Stop 1: Minimizing regret and combining advice.
– Randomized Wtd Majority / Multiplicative Weights alg
– Connections to game theory
Stop 2: Extensions
– Online learning from limited feedback (bandit algs)
– Algorithms for large action spaces, sleeping experts
Stop 3: Powerful online LTF algorithms
– Winnow, Perceptron
Stop 4: Powerful tools for using these algorithms
– Kernels and Similarity functions
Stop 5: Something completely different
– Distributed machine learning

Distributed PAC Learning. Maria-Florina Balcan (Georgia Tech), Avrim Blum (CMU), Shai Fine (IBM), Yishay Mansour (Tel-Aviv). [In COLT 2012]

Distributed Learning
Many ML problems today involve massive amounts of data distributed across multiple locations: click data, customer data, scientific data, etc. Each location has only a piece of the overall data pie, so in order to learn over the combined distribution D, the data holders will need to communicate.
Classic ML question: how much data is needed to learn a given class of functions well?
These settings bring up a new question: how much communication? Plus issues like privacy, etc.

The distributed PAC learning model
Goal is to learn an unknown function f ∈ C given labeled data from some distribution D. However, D is arbitrarily partitioned among k entities (players) 1,2,…,k. [k=2 is already interesting.]
Players can sample (x, f(x)) from their own D_i, where D = (D_1 + D_2 + … + D_k)/k.
Goal: learn a good h over D, using as little communication as possible.

The distributed PAC learning model — an interesting special case to think about:
– k=2.
– One player has the positives and the other has the negatives.
– How much communication is needed to learn, e.g., a good linear separator?

The distributed PAC learning model — assume we are learning a class C of VC-dimension d. Some simple baselines (viewing k << d):
Baseline #1: based on the fact that any class of VC-dimension d can be learned to error ε from O((d/ε) log 1/ε) samples.
– Each player sends a 1/k fraction of that to player 1.
– Player 1 finds a consistent h ∈ C, which whp has error ≤ ε with respect to D, and sends h to the others.
– Total: 1 round, O((d/ε) log 1/ε) examples of communication.

The distributed PAC learning model — Baseline #2:
– Suppose C is learnable by an online algorithm A with mistake bound M.
– Player 1 runs A and broadcasts its current hypothesis.
– If any player has a counterexample, it sends it to player 1; player 1 updates and re-broadcasts.
– In fact, all players can maintain the same state of A, so only the counterexamples need to be broadcast.
– Total: at most M rounds, M counterexamples (and hypotheses) communicated.
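A sketch of Baseline #2 as a protocol loop. The interface is hypothetical: online_alg is any mistake-bounded online learner exposing predict/update, and each player can search its local data for a counterexample to the current hypothesis.

```python
def baseline2_protocol(online_alg, players, max_rounds=10_000):
    """Each round: broadcast the current hypothesis; if any player has a
    counterexample from its local data, feed it to the online learner.
    Total communication is bounded by the learner's mistake bound M."""
    for _ in range(max_rounds):
        counterexample = None
        for player in players:
            # players all track the same state of A, so only counterexamples travel
            ce = player.find_counterexample(online_alg.predict)
            if ce is not None:
                counterexample = ce
                break
        if counterexample is None:
            return online_alg                 # everyone is consistent: done
        x, label = counterexample
        online_alg.update(x, label)           # one mistake consumed from the bound M
    return online_alg
```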

Dependence on 1/ε: so far we had linear dependence on d and 1/ε, or on M with no dependence on 1/ε. Can you get O(d log 1/ε) examples of communication? Yes! Distributed boosting.

Recap of Adaboost: given a weak learning algorithm A, for t = 1,2,…,T: run A on a reweighted sample to get a weak hypothesis h_t, then increase the weight of the examples h_t classifies incorrectly; output a weighted vote over h_1,…,h_T. Key point: to achieve weak learning it suffices to use O(d) examples.
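As a reminder (the slide's details were largely graphical), here is a minimal sketch of the standard, centralized AdaBoost loop; it assumes ±1 labels stored as numpy arrays and a weak_learner(X, y, D) that returns a hypothesis doing slightly better than chance on distribution D — both names are placeholders.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """Standard AdaBoost: reweight examples each round so the next weak
    hypothesis focuses on the ones currently misclassified."""
    n = len(X)
    D = np.ones(n) / n                               # distribution over examples
    hyps, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)                    # weak learning needs only O(d) examples
        preds = np.array([h(x) for x in X])
        err = max(float(np.sum(D[preds != y])), 1e-12)   # weighted error (guard against 0)
        alpha = 0.5 * np.log((1 - err) / err)
        D *= np.exp(-alpha * y * preds)              # down-weight correct, up-weight mistakes
        D /= D.sum()
        hyps.append(h); alphas.append(alpha)
    return lambda x: np.sign(sum(a * h(x) for a, h in zip(alphas, hyps)))
```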

Distributed Adaboost: for t = 1,2,…,T, each player i holds its own sample S_i with local weights w_{i,t}; the weak hypothesis h_t is shared with all players (h_t may do better on some players' data than on others).

Distributed Adaboost — final result: O(d) examples of communication per round, plus O(k log d) extra bits to send weights & requests, plus 1 hypothesis sent per round, over O(log 1/ε) rounds of communication. So, O(d log 1/ε) examples of communication in total, plus low-order extra info.
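A rough sketch of one round of the distributed variant under the assumptions above. The player interface (total_weight, sample_by_weight, reweight) is hypothetical, and the exact protocol details are from my reading of the summary, not the paper itself; the point is that the center only ever sees O(d) examples per round plus a few weights, and only one hypothesis is broadcast.

```python
import numpy as np

def distributed_boosting_round(players, weak_learner, n_center):
    """One round: players report their total local weight, the center requests
    a small sample (O(d) examples overall) drawn in proportion to those weights,
    runs the weak learner, and broadcasts h_t; players then reweight locally."""
    totals = np.array([p.total_weight() for p in players])        # O(k log d) bits
    counts = np.random.multinomial(n_center, totals / totals.sum())
    X, y = [], []
    for p, c in zip(players, counts):
        xs, ys = p.sample_by_weight(c)                            # O(d) examples in total
        X.extend(xs); y.extend(ys)
    h_t = weak_learner(X, y)                                      # one hypothesis broadcast
    for p in players:
        p.reweight(h_t)                                           # local AdaBoost-style update
    return h_t
```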

Agnostic learning: a recent result of [Balcan-Hanneke] gives a robust halving algorithm that can be implemented in the distributed setting. It gets error 2·OPT(C) + ε using a total of only O(k log|C| log(1/ε)) examples. It is not computationally efficient in general, but it says that a log(1/ε) dependence is possible in principle.

Can we do better for specific classes of interest? E.g., conjunctions over {0,1}^d, such as f(x) = x_2 x_5 x_9 x_15. The generic methods give O(d) examples, or O(d²) bits total (again thinking of k << d). Can you do better?
Sure: each entity intersects (ANDs together) its own positives and sends the result to player 1; player 1 intersects those and broadcasts.
Only O(k) examples sent; O(kd) bits.
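A sketch of the conjunction protocol, assuming each player's positives are 0/1 vectors of length d (the representation and names are mine):

```python
def learn_conjunction_distributed(players_positive_examples, d):
    """Each player ANDs together its own positive examples (one d-bit summary),
    sends that single example-sized message, and player 1 ANDs the summaries.
    The variables still set to 1 form the learned conjunction."""
    summaries = []
    for positives in players_positive_examples:     # one d-bit vector per player: O(kd) bits
        summary = [1] * d
        for x in positives:
            summary = [a & b for a, b in zip(summary, x)]
        summaries.append(summary)
    conj = [1] * d
    for s in summaries:                             # player 1 intersects & broadcasts
        conj = [a & b for a, b in zip(conj, s)]
    relevant = [i for i, bit in enumerate(conj) if bit == 1]
    return lambda x: int(all(x[i] == 1 for i in relevant))
```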

General principle: any intersection-closed class (one with a well-defined “tightest wrapper” around the positives) can be learned this way.

Interesting class: parity functions. Examples x ∈ {0,1}^d, with f(x) = x·v_f mod 2 for an unknown v_f. Interesting already for k=2: the classic communication lower bound for determining whether two subspaces intersect implies an Ω(d²)-bit lower bound for proper learning. What if we allow hypotheses that aren’t parities?

Interesting class: parity functions. Examples x ∈ {0,1}^d, f(x) = x·v_f mod 2 for unknown v_f. Parity has the interesting property that:
(a) It can be properly PAC-learned. [Given a dataset S of size O(d/ε), just solve the linear system.]
(b) It can be non-properly learned in the reliable-useful model of Rivest-Sloan’88. [If x is in the subspace spanned by S, predict accordingly; else say “??”.]

Interesting class: parity functions. Examples x ∈ {0,1}^d, f(x) = x·v_f mod 2 for unknown v_f. Algorithm (k=2):
– Each player i properly PAC-learns over D_i to get a parity function g_i, and also improperly R-U learns to get a rule h_i. It sends g_i to the other player.
– Each player i then uses the rule: “if h_i predicts, use it; else use g_{3-i}.”
– Can one extend this to k=3??
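A sketch of the two GF(2) ingredients behind this (Gaussian elimination, which underlies the proper learner's linear-system solve, and a span-membership test for the reliable-useful rule); all helper names are mine.

```python
import numpy as np

def gf2_row_reduce(A):
    """Row-reduce a 0/1 integer matrix over GF(2); returns (reduced matrix, pivot columns)."""
    A = A.copy() % 2
    pivots, r = [], 0
    for c in range(A.shape[1]):
        rows = [i for i in range(r, A.shape[0]) if A[i, c] == 1]
        if not rows:
            continue
        A[[r, rows[0]]] = A[[rows[0], r]]      # move a pivot row into position r
        for i in range(A.shape[0]):
            if i != r and A[i, c] == 1:
                A[i] ^= A[r]                   # XOR = addition mod 2
        pivots.append(c); r += 1
    return A, pivots

def in_span_gf2(S, x):
    """Reliable-useful check: is x in the GF(2) span of the rows of S?
    If yes, the parity label of x is determined by the labeled sample; else answer '??'."""
    _, p1 = gf2_row_reduce(S)
    _, p2 = gf2_row_reduce(np.vstack([S, x]))
    return len(p1) == len(p2)                  # adding x did not increase the rank
```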

Linear Separators: consider linear separators over a near-uniform distribution D over the ball B^d. The VC bound, margin bound, and Perceptron mistake bound all give O(d) examples needed to learn, so O(d) examples of communication using the baselines (for constant k, ε). Can one do better?

Linear Separators — Thm: over any non-concentrated D [density bounded by c·uniform], we can achieve O((d log d)^{1/2}) vectors communicated rather than O(d) (for constant k, ε).
Algorithm: run a margin-version of Perceptron in round-robin.
– Player i receives h from the previous player.
– If err(h) ≥ ε on D_i, then update until f(x)(w·x) ≥ 1 for most x from D_i.
– Then pass h to the next player.
Proof idea: a non-concentrated D means examples are nearly orthogonal whp (|cos(x,x′)| = O((log(d)/d)^{1/2})), so updates by player j don’t hurt player i too much: after player i finishes, as long as fewer than (d/log d)^{1/2} updates are made by others, player i is still happy. This implies at most O((d log d)^{1/2}) rounds.
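A sketch of the round-robin protocol with a hypothetical player interface (local_error and violating_example stand in for estimating error on, and sampling from, the local D_i):

```python
import numpy as np

def round_robin_margin_perceptron(players, d, eps, max_passes=100):
    """Pass the current hypothesis w around the ring; each player runs
    margin-Perceptron updates on its own data until most of its local examples
    satisfy f(x)(w . x) >= 1, then forwards w. Only w and the updates travel."""
    w = np.zeros(d)
    for _ in range(max_passes):
        total_updates = 0
        for player in players:
            while player.local_error(w) >= eps:       # still erring on D_i
                ce = player.violating_example(w)      # some x with f(x)(w.x) < 1
                if ce is None:
                    break
                x, label = ce
                w = w + label * x                     # margin-Perceptron update
                total_updates += 1
            # w is passed to the next player (implicit in the loop)
        if total_updates == 0:
            return w                                  # every player is happy
    return w
```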

Conclusions and Open Questions: as we move to large distributed datasets, communication becomes increasingly crucial. Rather than only asking “how much data is needed to learn well?”, we ask “how much communication do we need?” Issues like privacy also become more central (not discussed here, but see the paper).
Open questions: linear separators of margin γ in general? Other classes? [Parity with k=3?] Incentives?