
1 Learning Juntas. Elchanan Mossel (UC Berkeley), Ryan O’Donnell (MIT), Rocco Servedio (Harvard)

2 What’s a junta? junta: – A council or committee for political or governmental purposes – A group of persons controlling a government – A junto. junta (in this talk): – A Boolean function on n variables that depends on only k of them, where k << n

3 Example: a 3-junta. f(x_1,...,x_10) = x_3 OR (x_6 AND x_7). (Slide shows a table of sample inputs x_1 ... x_10 with the output column f(x).)
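The 3-junta above can be written as a minimal Python sketch (the slide's indices are 1-based; the code uses 0-based lists):

```python
# The slide's example junta: reads all 10 inputs, but the value depends
# only on x3, x6, x7 (1-indexed as on the slide).

def f(x):
    """f(x1,...,x10) = x3 OR (x6 AND x7); x is a list of 10 bits."""
    x3, x6, x7 = x[2], x[5], x[6]
    return x3 | (x6 & x7)
```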

4 Learning juntas. The problem: you get data labeled according to some k-junta. What’s the junta? (Slide shows a table of labeled examples: columns x_1 ... x_10 and f(x).)

5 Outline of talk: Motivation; Warm-ups; Our results; How we do it; Future work

6 Why learn juntas? Natural, general problem (no assumptions on f ) Real-world learning problems often have lots of irrelevant information Important special case of notorious open questions in learning theory: learning DNF, learning decision trees...

7 Learning decision trees. Given data labeled according to some decision tree, what’s the tree? (Slide shows an example decision tree with internal nodes labeled x_5, x_3, x_1, x_2, x_4, x_6 and leaves labeled 0/1.)

8 Learning decision trees (cont). Any k-junta is expressible as a decision tree of size 2^k. So to learn poly(n)-size decision trees, must be able to learn log(n)-juntas. Big open question: are decision trees of size poly(n) learnable in poly(n) time? Similar situation for learning DNF.

9 Learning decision trees (cont). Conversely, a decision tree of size log(n) depends on at most log(n) variables, so if we can learn log(n)-juntas, we can learn decision trees of size log(n); even this would be a big step forward. So progress on juntas is necessary for progress on decision trees. It’s also sufficient! Again, similar situation for DNF.

10 The problem: PAC learn k-juntas under uniform. Setup: we get random examples (x^1, f(x^1)), (x^2, f(x^2)), ... where – each x^i is uniform from {0,1}^n – f is an unknown k-junta. Goal: output h such that whp Pr[h(x) ≠ f(x)] < ε.

11 The problem refined. Setup: we get random examples (x^1, f(x^1)), (x^2, f(x^2)), ... where – each x^i is uniform from {0,1}^n – f is an unknown k-junta. Goal: output h such that Pr[h(x) ≠ f(x)] < ε. Equivalent goal: output h = f. Equivalent goal: find the k relevant variables of f.

12 What’s known? Easy lower bound: need at least 2^k + k log n examples. Easy information-theoretic upper bound: 2^k + k log n examples are sufficient. Easy computational upper bound: there are (n choose k) possible sets of relevant variables, so can do exhaustive search in 2^O(k) · (n choose k) = O(n^k) time. Can we learn in time poly(n, 2^k)?
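The exhaustive-search upper bound can be sketched as follows; this is an illustrative baseline (the function name and consistency check are mine, not from the talk): try each of the (n choose k) candidate variable sets and keep one that never assigns two different labels to the same projected input.

```python
from itertools import combinations

def exhaustive_junta_search(examples, n, k):
    """examples: list of (bit-list x, label y). Returns a size-k tuple of
    variable indices consistent with all examples, or None."""
    for S in combinations(range(n), k):
        table = {}                        # projection onto S -> label seen
        consistent = True
        for x, y in examples:
            key = tuple(x[i] for i in S)
            if table.setdefault(key, y) != y:
                consistent = False        # same projection, different labels
                break
        if consistent:
            return S
    return None
```

With enough examples the only consistent sets are those containing the true relevant variables, which is what makes the O(n^k) search correct.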

13 Variant #1: membership queries. If the learner can make queries, can learn in poly(n, 2^k) time. – Draw random points. If all positive or all negative, done. Otherwise, “walk” from a positive point to a negative point to identify a relevant variable. – Recurse. (Slide shows a sequence of example query strings and their labels.)
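One way the "walk" step might look in code (a hedged sketch, not the authors' exact procedure): binary-search over the coordinates where a positive and a negative point differ, using membership queries to find a single flip that changes f's value.

```python
def find_relevant_variable(f, p, q):
    """p, q: bit-lists with f(p) != f(q); f answers membership queries.
    Returns the index of one relevant variable."""
    diff = [i for i in range(len(p)) if p[i] != q[i]]
    lo, hi = 0, len(diff)          # invariant: flipping diff[lo:hi] changes f
    cur = list(p)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        trial = list(cur)
        for i in diff[lo:mid]:     # flip the first half of the remaining coords
            trial[i] = q[i]
        if f(trial) != f(cur):     # the value changed within the first half
            hi = mid
        else:                      # so the change must occur in the second half
            cur = trial
            lo = mid
    return diff[lo]
```

Each call uses O(log n) queries, so repeating it (and recursing on restrictions, as the slide says) stays within poly(n, 2^k) time.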

14 Variant #2: monotone functions. If the junta is monotone, can learn in poly(n, 2^k) time. – If x_i is irrelevant, have Pr[f(x) = 1 | x_i = 1] = Pr[f(x) = 1 | x_i = 0]. – If x_i is relevant, have Pr[f(x) = 1 | x_i = 1] > Pr[f(x) = 1 | x_i = 0]. – Each probability is an integer multiple of 1/2^k. – So can test each variable in poly(2^k) time.
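A hedged sketch of this monotone test (the helper name and sample size are mine): estimate the gap Pr[f(x)=1 | x_i=1] − Pr[f(x)=1 | x_i=0] by sampling; for a monotone k-junta the gap is zero for irrelevant variables and at least 1/2^k for relevant ones, so enough samples separate the two cases.

```python
import random

def relevance_gap(f, n, i, samples=20000, rng=random):
    """Empirical estimate of Pr[f=1 | x_i=1] - Pr[f=1 | x_i=0]."""
    ones = zeros = one_hits = zero_hits = 0
    for _ in range(samples):
        x = [rng.randint(0, 1) for _ in range(n)]
        if x[i] == 1:
            ones += 1
            one_hits += f(x)
        else:
            zeros += 1
            zero_hits += f(x)
    return one_hits / max(ones, 1) - zero_hits / max(zeros, 1)
```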

15 Variant #3: random functions. If the junta is random, whp can learn in poly(n, 2^k) time. – If x_i is irrelevant, have Pr[f(x) = x_i] = 1/2 for sure. – If x_i is relevant, have Pr[ Pr[f(x) = x_i] = 1/2 ] ≈ 1/2^(k/2). – Each probability is an integer multiple of 1/2^k. – So whp can find the relevant variables this way.

16 Back to real problem. Lower bound: need at least 2^k + k log n examples. Upper bound: there are (n choose k) possible sets of relevant variables, so can do exhaustive search in 2^O(k) · (n choose k) time. Can we learn in time poly(n, 2^k)?

17 Previous work. [Blum & Langley, 1994] suggested the problem. Little progress until... [Kalai & Mansour, 2001] gave an algorithm that learns in time n^(k − √k).

18 Our result. We give an algorithm that learns in time ≈ n^(ωk/(ω+1)), where ω is the matrix multiplication exponent. So currently ≈ n^(0.704k).

19 The main idea Let g be the hidden k-bit function Look at two different representations for g: –Only weird functions are hard to learn under first representation –Only perverse functions are hard to learn under second representation –No function is both weird and perverse

20 First representation: real polynomial. View inputs and outputs as ±1-valued (here −1 plays the role of TRUE). Fact: every Boolean function g: {−1,1}^k → {−1,1} has a unique interpolating multilinear real polynomial g_R(x_1, x_2, ..., x_k). – The coefficients of g_R are the Fourier coefficients of g. – Examples: parity on x_1,...,x_k: polynomial is x_1 x_2 ··· x_k. x_1 AND x_2: polynomial is (1 + x_1 + x_2 − x_1 x_2)/2.
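The two example polynomials can be checked by brute force; a small sketch in the ±1 convention (with −1 as TRUE, the convention the AND polynomial above satisfies):

```python
def parity_poly(x):
    """Parity on x_1..x_k in +-1 notation: the single monomial x_1*x_2*...*x_k."""
    p = 1
    for xi in x:
        p *= xi
    return p

def and_poly(x1, x2):
    """x_1 AND x_2 in +-1 notation: (1 + x_1 + x_2 - x_1*x_2)/2."""
    return (1 + x1 + x2 - x1 * x2) / 2
```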

21 Real polynomials. Fourier coefficients measure the correlation of g with the corresponding parities: E[g(x) · x_T] = coefficient of x_T in g_R, where x_T denotes ∏_{i∈T} x_i. So given a set T of variables, can estimate the coefficient of x_T via sampling. – Nonzero only if every variable in T is relevant. – Problem: may have to test all sets of up to k variables to find a nonzero coefficient.
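Estimating a single Fourier coefficient from uniform examples might look like this (a sketch; the sample size and names are mine): average g(x) · ∏_{i∈T} x_i over random ±1 inputs.

```python
import random

def estimate_fourier_coefficient(f, n, T, samples=20000, rng=random):
    """Empirical estimate of E[f(x) * prod_{i in T} x_i] under uniform +-1 inputs."""
    total = 0
    for _ in range(samples):
        x = [rng.choice([-1, 1]) for _ in range(n)]
        chi = 1
        for i in T:
            chi *= x[i]                 # the parity (character) x_T
        total += f(x) * chi
    return total / samples
```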

22 First technical theorem: Let g be a Boolean function on k variables such that g_R has nonzero constant term: g_R(x) = c_0 + Σ_{|T| ≥ s} c_T x_T, where s = degree of the smallest nontrivial monomial. Then s ≤ 2k/3.

23 Second representation: GF(2) polynomial. View inputs and outputs as 0/1-valued. Fact: every Boolean function g: {0,1}^k → {0,1} has a unique interpolating GF(2) polynomial g_2(x_1, x_2, ..., x_k). Examples: parity on x_1,...,x_k: polynomial is x_1 + x_2 + ··· + x_k. x_1 AND ··· AND x_k: polynomial is x_1 x_2 ··· x_k.

24 Learning parities. Suppose g is some parity function, e.g. g(x) = parity(x_1, x_2, x_4). Can add labeled examples mod 2. (Slide shows example bit-strings and their labels being added coordinate-wise mod 2.)

25 Learning parities (cont). Given a set of labeled examples, can do Gaussian elimination to obtain the labeled example 100...0 ; b. Will have b = 1 iff x_1 is in the parity. Repeat for x_2,...,x_n to learn the parity.
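A hedged sketch of this parity-learning step (illustrative, not the authors' code): each labeled example is one linear equation over GF(2) in the unknown indicator vector c, where c_i = 1 iff x_i is in the parity; Gaussian elimination solves the system.

```python
def learn_parity(examples, n):
    """examples: list of (bit-list x, bit b) with b the parity of a hidden
    subset of coordinates of x. Returns a consistent indicator vector c."""
    rows = [x[:] + [b] for x, b in examples]    # augmented matrix over GF(2)
    pivot_cols = []
    r = 0
    for col in range(n):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] == 1), None)
        if pivot is None:
            continue                            # no pivot here: free variable
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):              # clear this column everywhere else
            if i != r and rows[i][col] == 1:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[r])]
        pivot_cols.append(col)
        r += 1
    c = [0] * n                                 # free variables set to 0
    for i, col in enumerate(pivot_cols):
        c[col] = rows[i][n]
    return c
```

With enough linearly independent examples the solution is unique and c recovers exactly the parity's variable set.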

26 Learning GF(2) polynomials. Given any g: {0,1}^k → {0,1}, can view g_2 as a parity over monomials (ANDs). If deg(g_2) = d, have ≈ k^d monomials. In the junta setting, have ≈ n^d monomials. – Problem: d could be as large as k.
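The reduction from low-degree GF(2) polynomials to parities is just a feature expansion; a minimal sketch (names mine): map each example to the values of all monomials of size ≤ d, then run the Gaussian-elimination parity learner in the expanded space.

```python
from itertools import combinations

def monomial_features(x, d):
    """All monomial (AND) values of degree <= d of bit-list x,
    starting with the constant-1 monomial."""
    feats = [1]
    for size in range(1, d + 1):
        for S in combinations(range(len(x)), size):
            prod = 1
            for i in S:
                prod &= x[i]               # AND of the coordinates in S
            feats.append(prod)
    return feats
```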

27 Second technical theorem: Let g be a Boolean function on k variables such that g_R has zero constant term: g_R(x) = Σ_{|T| ≥ s} c_T x_T. Then deg(g_2) ≤ k − s.

28 Algorithm to learn k-juntas. Sample to test whether f is constant. If not, sample to estimate the Fourier coefficients of all sets of up to αk variables. – Nonzero coefficient on a set of size m: recurse on all 2^m settings of those variables. – All small coefficients zero: run the parity-learning algorithm with monomials of size up to (1 − α)k.

29 Why does it work? If f is unbalanced, will find a nonzero coefficient of size at most 2k/3 ≤ αk. If f is balanced, the parity-learning algorithm is guaranteed to succeed. So either way, we make progress. Take α ≥ 2/3.

30 Running time. Checking sets of up to αk variables takes n^(αk) time. Running Gaussian elimination on monomials of size up to (1 − α)k takes time n^(ω(1−α)k), where ω = matrix multiplication exponent. So the best α is ω/(ω+1).

31 What else can we do? Restrictions: can look at f under “small” restrictions. (Slide shows an example restriction of x_1,...,x_10 with some coordinates fixed to constants and the rest left free.)

32 A question. Suppose g: {−1,1}^k → {−1,1} has g_R(x) = Σ_{|T| > 2k/3} ĝ_T x_T. Must there be some restriction ρ fixing at most 2k/3 variables such that g(ρ(x)) is a parity function? If yes, can learn k-juntas in time n^(2k/3).

33 Future work Faster algorithms? Non-binary input alphabets? –(non-binary outputs easy) Non-uniform distributions? –Product distributions? –General distributions?

