
1 Learning intersections and thresholds of halfspaces Adam Klivans (MIT/Harvard) Ryan O’Donnell (MIT) Rocco Servedio (Harvard)

2 Learning We consider the PAC model of [Valiant-84], in which learning a “concept class” C of boolean functions means: - a function f in C is selected, and also a probability distribution D over {+1,−1}^n - the learning algorithm gets access to random examples (x, f(x)), where the x’s are drawn from D - goal: efficiently output a hypothesis h such that w.h.p., Pr_{x←D}[f(x) ≠ h(x)] < ε.

3 Learning example Example: C is the class of all conjunctions of variables. Perhaps the concept selected is: x_1 AND x_2 AND x_4. One might then see a few labeled examples. What is a learning algorithm for this class?
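
One standard answer to the slide's question is the classic elimination algorithm for (monotone) conjunctions. The sketch below is illustrative, not from the talk: the function name and the toy data are made up, and the target x_1 AND x_2 AND x_4 is the slide's example. Start from the conjunction of all variables and delete any variable that is false in some positive example; negative examples are never misclassified because the hypothesis is always at least as restrictive as the target.

```python
def learn_conjunction(examples):
    """examples: list of (x, label) pairs, x a tuple over {+1,-1}, label in {+1,-1}."""
    n = len(examples[0][0])
    candidates = set(range(n))            # variable indices still in the hypothesis
    for x, label in examples:
        if label == +1:                   # positive examples eliminate variables
            candidates -= {i for i in candidates if x[i] == -1}
    def hypothesis(x):
        return +1 if all(x[i] == +1 for i in candidates) else -1
    return candidates, hypothesis

# Made-up examples consistent with the target x_1 AND x_2 AND x_4 (0-indexed 0, 1, 3).
data = [((+1, +1, -1, +1, -1), +1),
        ((+1, -1, +1, +1, +1), -1),
        ((+1, +1, +1, +1, -1), +1)]
kept, h = learn_conjunction(data)
print(sorted(kept))                       # [0, 1, 3] survives on this data
print(h((+1, +1, -1, +1, +1)))            # +1: bits 0, 1, 3 are all +1
```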

4 Halfspaces Let h be a hyperplane in R^n: h = {x : ∑_{i=1}^n w_i x_i = θ}. h naturally induces a boolean function f : {+1,−1}^n → {+1,−1}, f(x) = sgn(∑ w_i x_i − θ). We call such a function a boolean halfspace, or a weighted majority. The majority function itself is an example (w_i ≡ 1, θ = 0).

5 Learning halfspaces Learning halfspaces is a very old problem; it dates back to models of the brain from the ’50s: [Agmon-54, Rosenblatt-58, Block-62]. The concept class of halfspaces has long been known to be PAC learnable in polynomial time via linear programming [BEHW-89]. Indeed, this works over any distribution on R^n, including those supported on {+1,−1}^n.

6 Learning halfspaces Basic idea: given a bunch of examples, find a halfspace which classifies them all correctly. By some learning theory technology (“Occam’s Razor”), this is a good algorithm. Consider the coefficients of a hypothesis halfspace to be unknowns a_1, …, a_n, θ. Each example induces a linear constraint: e.g., a positive example (+1,+1,−1,+1,−1,−1) induces a_1+a_2−a_3+a_4−a_5−a_6 > θ. Solve the LP.
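
A minimal sketch of this "find a consistent halfspace by LP" idea, not the talk's code: treat a_1, …, a_n, θ as unknowns and ask an LP solver for any point satisfying y·(a·x − θ) ≥ 1 on every example. It assumes scipy is available; the margin 1 simply turns the slide's strict inequalities into closed ones, and the two toy examples are made up (the positive one matches the constraint above).

```python
import numpy as np
from scipy.optimize import linprog

def consistent_halfspace(X, y):
    """X: m x n array over {+1,-1}; y: length-m array of labels in {+1,-1}."""
    m, n = X.shape
    # Unknowns: [a_1, ..., a_n, theta].  Constraint: -y_i * (a.x_i - theta) <= -1.
    A_ub = np.hstack([-(y[:, None] * X), y[:, None]])
    b_ub = -np.ones(m)
    res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1))   # feasibility only: zero objective
    if not res.success:
        return None                                   # no consistent halfspace found
    return res.x[:n], res.x[n]                        # (weights, threshold)

# Toy usage: the positive example from the slide and one made-up negative example.
X = np.array([[+1, +1, -1, +1, -1, -1],
              [-1, -1, +1, -1, +1, +1]])
y = np.array([+1, -1])
w, theta = consistent_halfspace(X, y)
print(np.sign(X @ w - theta))                         # reproduces the labels [ 1. -1.]
```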

7 Learning intersections of halfspaces The next logical extension of this, and a very important one, is learning intersections of halfspaces. Intersections of halfspaces form a very rich concept class: all convex bodies, CNF formulas… Learning them is also an important problem in computer vision and in the study of perceptrons. But very little is known.

8 Prior work
- [Baum91]: poly time algorithm for an intersection of two halfspaces through the origin under symmetric distributions (those satisfying D(x) = D(−x)).
- [BlumKannan, Vempala97]: learn an intersection of O(1) halfspaces in poly time over near-uniform distributions on the Euclidean sphere (not relevant for boolean halfspaces).
- [KwekPitt98]: polynomial time algorithm, but requires membership queries (also not relevant for boolean halfspaces).

9 Our results Theorem 1: The concept class of arbitrary functions of k boolean halfspaces over {+1,−1}^n is learnable under the uniform distribution to accuracy 1−ε in time n^{O(k²/ε²)}. This is polynomial time if k = O(1), ε = Ω(1). (Prior to this, no algorithm could learn even an intersection of 2 arbitrary boolean halfspaces under the uniform distribution in subexponential time.)

10 Our results Theorem 2: The concept class of intersections of k boolean halfspaces with weight bound W is learnable under any probability distribution to accuracy 1−ε in time n^{O(k log k log W)}/ε. So if the weights are polynomially bounded, one can learn an intersection of log many halfspaces in quasipolynomial time.

11 More results
Function | Halfspaces | Distribution | Time
any fcn. of k | weight W | any | n^{O(k² log k log W)}/ε
weight-k threshold (e.g., intersection of k) | weight W | any | n^{O(k log k log W)}/ε
intersection of k | weight W | any | n^{O(√W log k)}/ε
read-once intersection of k | arbitrary | uniform | n^{O((log(k)/ε)²)}
read-once majority of k | arbitrary | uniform | n^{Õ((log(k)/ε)⁴)}

12 Sketch of techniques For arbitrary distribution results: show that functions of low weight halfspaces have low degree polynomial threshold representations. For uniform distribution results: show that functions of halfspaces have low noise sensitivity. Both conclusions imply learning results generically.

13 Talk outline Plan for the rest of the talk: 1. Prove the n^{O(k log k log W)} bound for learning intersections of k weight-W halfspaces under arbitrary distributions. (Sketch other arbitrary-distribution results.) 2. Prove the n^{O(k²/ε²)} bound for learning arbitrary functions of k halfspaces under the uniform distribution. (Sketch other uniform-distribution results.)

14 Polynomial threshold functions A (multilinear) polynomial p : R^n → R is a PTF for f if it sign-represents f: f(x) = sgn(p(x)) for all x ∈ {+1,−1}^n. - every boolean halfspace is a degree-1 PTF for itself - every boolean function has a degree-n PTF By linear programming [KS01]: if every function in a class C has a PTF of degree d, then C is learnable in time n^{O(d)}/ε.
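
To make the n^{O(d)} claim concrete, here is a sketch (my illustration, not the paper's code) of the natural LP-based learner behind it: map each example x to the vector of all multilinear monomials of degree ≤ d and run the same LP-feasibility trick used for plain halfspaces in that n^{O(d)}-dimensional feature space. The function names and the XOR toy data are hypothetical; it assumes numpy/scipy.

```python
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

def monomial_features(x, d):
    """All products x_S = prod_{i in S} x_i over subsets S with |S| <= d."""
    n = len(x)
    feats = []
    for size in range(d + 1):
        for S in combinations(range(n), size):
            feats.append(np.prod([x[i] for i in S]) if S else 1.0)
    return np.array(feats)

def learn_degree_d_ptf(X, y, d):
    Phi = np.array([monomial_features(x, d) for x in X])   # m x N, with N = n^O(d)
    m, N = Phi.shape
    A_ub = -(y[:, None] * Phi)                              # y_i * <c, phi(x_i)> >= 1
    res = linprog(c=np.zeros(N), A_ub=A_ub, b_ub=-np.ones(m),
                  bounds=[(None, None)] * N)
    if not res.success:
        return None                                          # no degree-d PTF consistent with data
    c = res.x
    def h(x):
        return +1 if monomial_features(x, d) @ c > 0 else -1
    return h

# Toy usage: XOR of two bits equals x1*x2, so degree 1 fails and degree 2 succeeds.
X = np.array([[+1, +1], [+1, -1], [-1, +1], [-1, -1]])
y = np.array([+1, -1, -1, +1])
print(learn_degree_d_ptf(X, y, 1) is None)   # True: parity has no degree-1 PTF
h = learn_degree_d_ptf(X, y, 2)
print([h(x) for x in X])                     # matches y
```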

15 PTFs for intersections of halfspaces Suppose f and g are the affine forms of two halfspaces, f(x) = ∑ w_i x_i − θ, g(x) = ∑ w_i' x_i − θ'. We would like a PTF for sgn(f) ∧ sgn(g). Failed attempt 1: try f(x)·g(x): it is >0 if f(x)>0 and g(x)>0, but it is also >0 if f(x)<0 and g(x)<0. ✗ Failed attempt 2: try f(x)+g(x): it is >0 if f(x)>0 and g(x)>0, but it can also be >0 if f(x)>0 and g(x)<0. ✗

16 PTFs for intersections of halfspaces The solution: apply a (polynomial?) function to f and g to make them look more like their sign. Assume ∑|w_i| < W (integer weights). Then for all x ∈ {+1,−1}^n, f(x), g(x) ∈ [−W,−1] ∪ [1,W]. Beigel et al. [BRS95] showed how to construct a univariate rational function which is an essentially optimal approximator of the sgn function on [−W,−1] ∪ [1,W].

17 BRS’s sgn-approximator Take p(x) = (x−1)(x−2)²(x−4)²(x−8)²(x−16)²(x−32)² and Q(x) = (p(−x) − p(x)) / (p(−x) + p(x)). Q is a rational function of degree O(log k log W) such that: Q(x) ∈ [1, 1+1/k] for x ∈ [1,W], and Q(x) ∈ [−1−1/k, −1] for x ∈ [−W,−1].

18 PTFs for intersections of halfspaces Now given weight-W halfspaces h_1, …, h_k, sgn(Q(h_1(x)) + … + Q(h_k(x)) − (k−½)) is a rational function which sign-represents the intersection. Once brought to a common denominator, it has degree O(k log k log W). It is easy to get a polynomial: sgn(p/q) = sgn(p·q). So we have a PTF for the intersection of k weight-W halfspaces of degree O(k log k log W). Hence a learning algorithm running in time n^{O(k log k log W)}.
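
A self-contained numerical sanity check of this construction (illustrative only): build Q from slide 17's simplified p(x), plug in three small weight-bounded halfspaces on {+1,−1}^6, and verify exhaustively that sgn(Q(h_1(x)) + Q(h_2(x)) + Q(h_3(x)) − (k−½)) agrees with the intersection on every input. The specific weights and thresholds are made-up choices (picked so that w·x − θ is always a nonzero odd integer), not anything from the talk.

```python
from itertools import product

def p(x):
    val = (x - 1)
    for a in (2, 4, 8, 16, 32):
        val *= (x - a) ** 2
    return val

def Q(x):
    # Rational sgn-approximator: close to +1 on [1, W], close to -1 on [-W, -1].
    return (p(-x) - p(x)) / (p(-x) + p(x))

# Three halfspaces with small integer weights; thresholds chosen so w.x - theta
# is always an odd (hence nonzero) integer on {+1,-1}^6.
halfspaces = [((1, 1, 1, 1, 1, 1), 1),
              ((2, 1, 1, -1, -1, 1), 0),
              ((1, -1, 2, 1, 1, -1), 0)]
k = len(halfspaces)

def margin(w, theta, x):
    return sum(wi * xi for wi, xi in zip(w, x)) - theta

ok = True
for x in product((+1, -1), repeat=6):
    inter = all(margin(w, t, x) > 0 for w, t in halfspaces)           # target AND
    ptf = sum(Q(margin(w, t, x)) for w, t in halfspaces) - (k - 0.5)  # construction above
    ok &= (ptf > 0) == inter
print("sign-representation correct on all 64 inputs:", ok)
```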

19 Talk outline Plan for the talk: 1. Prove the n^{O(k log k log W)} bound for learning intersections of k weight-W halfspaces under arbitrary distributions. ✓ 2. Prove the n^{O(k²/ε²)} bound for learning functions of k halfspaces under the uniform distribution.

20 Noise sensitivity Let f : {+1,−1}^n → {+1,−1} be a boolean function. Pick x ∈ {+1,−1}^n uniformly at random, and let y be an ε-corruption of x: flip each bit of x independently with probability ε. defn: The noise sensitivity of f is NS_ε(f) = Pr[f(x) ≠ f(y)].

21 Noise sensitivity examples Let f be a projection to one bit, f(x_1, …, x_n) = x_1. Then NS_ε(f) = ε. Suppose f depends on only k bits. Then NS_ε(f) ≤ kε. PARITY is the most noise-sensitive function: NS_ε(PARITY_n) = ½ − ½(1−2ε)^n.
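
A quick Monte Carlo sketch (my illustration, not part of the talk) makes the definition concrete: sample x uniformly, flip each bit with probability ε, and count disagreements. With the made-up parameters below it reproduces the dictator value ε and parity's closed form.

```python
import random

def noise_sensitivity(f, n, eps, trials=50_000):
    disagree = 0
    for _ in range(trials):
        x = [random.choice((+1, -1)) for _ in range(n)]
        y = [(-xi if random.random() < eps else xi) for xi in x]   # eps-corruption of x
        disagree += (f(x) != f(y))
    return disagree / trials

n, eps = 20, 0.1
dictator = lambda x: x[0]
parity   = lambda x: +1 if sum(xi == -1 for xi in x) % 2 == 0 else -1

print(noise_sensitivity(dictator, n, eps))   # ~ eps = 0.1
print(noise_sensitivity(parity, n, eps))     # ~ 1/2 - (1 - 2*eps)**n / 2 ≈ 0.494
```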

22 Noise sensitivity – study and applications
[Benjamini-Kalai-Schramm-98] – percolation, low-level circuit complexity
[Kahn-Kalai-Linial-88] – random walks on the hypercube
[Håstad-97] – probabilistically checkable proofs
[Bshouty-Jackson-Tamon-99] – learning theory under noise
[O-02] – Yao’s XOR Lemma, average-case hardness of NP
[Bourgain-02, Kindler-Safra-02, FKRSS-02] – study of juntas, Fourier analysis of boolean fcns.

23 Low noise sens. ⇒ fast learning We show that if the noise sensitivity of all f in C is uniformly bounded, NS_ε(f) ≤ α(ε), then C is learnable under the uniform distribution in time n^{O(1/α^{-1}(ε/3))}. Intuition: if f is not too noise sensitive, nearby points are highly correlated, so a net of examples works.

24 Proof of NS-learning connection Actually, the intuition is wrong. Here is the proper proof sketch: Low noise sensitivity ⇒ Fourier spectrum concentrated at low levels; this uses the formula NS_ε(f) = ½ − ½ ∑_S (1−2ε)^{|S|} f̂(S)² and a Markov-type inequality. Low-level Fourier concentration ⇒ efficient uniform-distribution learning; this is by the “Low degree” Fourier sampling learning algorithm of [Linial-Mansour-Nisan-93].
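
For concreteness, here is a compact sketch of the low-degree algorithm of [Linial-Mansour-Nisan-93] invoked in the second step: estimate every Fourier coefficient f̂(S) with |S| ≤ d as the empirical mean of f(x)·χ_S(x) over uniform samples, then predict with the sign of the resulting low-degree polynomial. The sample size, degree, and the majority-of-5 example are illustrative choices, not the paper's parameters.

```python
from itertools import combinations, product
import random

def chi(S, x):
    """Parity (character) chi_S(x) = prod_{i in S} x_i."""
    out = 1
    for i in S:
        out *= x[i]
    return out

def low_degree_learn(f, n, d, samples=20_000):
    data = []
    for _ in range(samples):
        x = tuple(random.choice((+1, -1)) for _ in range(n))
        data.append((x, f(x)))
    subsets = [S for size in range(d + 1) for S in combinations(range(n), size)]
    # Empirical Fourier coefficients: f_hat(S) ≈ average of f(x) * chi_S(x).
    coeffs = {S: sum(fx * chi(S, x) for x, fx in data) / samples for S in subsets}
    def h(x):
        val = sum(c * chi(S, x) for S, c in coeffs.items())
        return +1 if val > 0 else -1
    return h

# Toy usage: majority of 5 bits is well captured by its degree-3 spectrum.
maj5 = lambda x: +1 if sum(x) > 0 else -1
h = low_degree_learn(maj5, n=5, d=3)
errors = sum(h(x) != maj5(x) for x in product((+1, -1), repeat=5))
print("disagreements on all 32 inputs:", errors)   # typically 0
```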

25 Noise sensitivity of halfspaces
Function | NS_ε | Proof
one boolean halfspace | O(√ε) | Y. Peres, ’98
any function of k halfspaces | O(k√ε) | union bound
read-once intersection of k halfspaces | O(√ε log k) | difficult probabilistic analysis
read-once majority of k halfspaces | Õ((ε log k)^¼) |

26 Consequences Let C be the class of functions of k boolean halfspaces. Take α(ε) = O(k√ε), so all f ∈ C have NS_ε(f) ≤ α(ε). Then α^{-1}(ε/3) = O(ε²/k²). Hence we get Theorem 1: a uniform distribution learning algorithm running in time n^{O(k²/ε²)}.

27 Noise sensitivity of a halfspace We now sketch Peres’s beautiful proof that the noise sensitivity of a single halfspace is O(√ε). Suppose the halfspace is f = sgn(∑ w_i x_i − θ). Without (much) loss of generality, one can assume θ = 0. Recall that the x_i’s are selected uniformly at random from {+1,−1} and the sum is formed; then each x_i is flipped independently with probability ε. We want to show that the probability that the two sums (before and after flipping) land on opposite sides of 0 – call this event a “flop”, with probability P – is O(√ε).

28 Noise sensitivity of a halfspace With high probability, the number of flipped bits is about k := εn. Let’s assume we always flip exactly k random bits, and that k divides n. (Both assumptions are easily removed.) We now model the problem thus: Pick signs x_i at random. Randomly permute the weights. Divide the weights into n/k blocks of size k. Form the n/k block sums, X_1 = ∑_{i=1…k} w_i x_i, X_2 = ∑_{i=k+1…2k} w_i x_i, etc.

29 Noise sensitivity of a halfspace Write S = X_1 + … + X_{n/k} for the initial sum. Because of the permutation, we may assume that the random signs in the first block are the “flips”. Put S' = S − X_1, so the sum before flipping is S' + X_1, and the sum after flipping is S' − X_1. We are trying to bound the probability P that these two sums have opposite signs (a flop). Note that this happens iff |S'| < |X_1|.

30 Noise sensitivity of a halfspace sgn(X_1) and S' are independent, so: Pr[sgn(X_1) ≠ sgn(S')] = ½. sgn(X_1) and |X_1| are independent, so: Pr[sgn(X_1) ≠ sgn(S') | |S'| > |X_1|] = ½ ⇒ Pr[sgn(X_1) ≠ sgn(S) | |S'| > |X_1|] = ½ ⇒ Pr[sgn(X_1) ≠ sgn(S) & no flop] = ½(1−P) ⇒ Pr[sgn(X_1) ≠ sgn(S)] = ½(1−P) ⇒ P = 2 E[½ − I[sgn(X_1) ≠ sgn(S)]].

31 Noise sensitivity of a halfspace Of course, there was nothing special about block X_1 as opposed to any other block. So in fact, P = 2 E[½ − I[sgn(X_i) ≠ sgn(S)]] for all i = 1…n/k. Write τ = sgn(S), σ_i = sgn(X_i), and average: P = 2 E[½ − (k/n) ∑_i I[τ ≠ σ_i]].

32 Noise sensitivity of a halfspace P = 2 E[½ − (k/n) ∑_i I[τ ≠ σ_i]]. The quantity inside the expectation is some random variable, a number which is either ½ − (k/n) ∑_i I[1 ≠ σ_i] or ½ − (k/n) ∑_i I[−1 ≠ σ_i]. If I tell you a number is either a or b, then assuredly it’s at most |a| + |b|. Applying this to the expectation, pointwise: P ≤ 2 E[ |½ − (k/n) ∑_i I[σ_i = 1]| + |½ − (k/n) ∑_i I[σ_i = −1]| ].

33 Noise sensitivity of a halfspace P ≤ 2 E[ |½ − ε ∑_{i=1…1/ε} I[σ_i = 1]| + |½ − ε ∑_{i=1…1/ε} I[σ_i = −1]| ]. But the σ_i’s are simply independent, uniformly random signs. Hence both quantities in the expectation are merely the expected absolute deviation from the mean in 1/ε samples of an unbiased 0/1 random variable – i.e., O(√ε).
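
A quick numerical check of this last step (illustrative only; the ε values and trial counts are arbitrary choices of mine): for m = 1/ε fair coin flips, the expected absolute deviation of the empirical fraction from ½ scales like √ε, so the bound on P above is indeed O(√ε).

```python
import math
import random

def mean_abs_deviation(m, trials=20_000):
    """E | 1/2 - (fraction of heads in m fair flips) |, estimated by Monte Carlo."""
    total = 0.0
    for _ in range(trials):
        ones = sum(random.random() < 0.5 for _ in range(m))
        total += abs(0.5 - ones / m)
    return total / trials

for eps in (0.04, 0.01, 0.0025):
    m = round(1 / eps)
    dev = mean_abs_deviation(m)
    print(eps, dev, dev / math.sqrt(eps))   # last column stays roughly constant (~0.4)
```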

34 Extensions This concludes the proof that a single halfspace has noise sensitivity O(√ε), from which the uniform distribution learning algorithm for functions of k halfspaces follows. To get the extended learning algorithms, one must work harder at analyzing noise sensitivity. Key result: if a halfspace h is biased – say, the probability of +1 is p < ½ – then NS_ε(h) ≤ min{ 2p, C·p·(ε log(1/p))^½ }.

35 Talk outline Plan for the talk: 1. Prove the n^{O(k log k log W)} bound for learning intersections of k weight-W halfspaces under arbitrary distributions. ✓ 2. Prove the n^{O(k²/ε²)} bound for learning functions of k halfspaces under the uniform distribution. ✓

36 Open technical challenges Give an upper bound on the degree necessary for a PTF which represents the AND of two arbitrary halfspaces. (For a new lower bound, see my talk tomorrow!) Give a better analysis of the noise sensitivity of the intersection of k halfspaces on n bits. Is it O((ε log k)^½)?

37 The huge open problem It still remains open how to learn an intersection of two arbitrary boolean halfspaces under an arbitrary distribution in subexponential time!

