# Learning Juntas


Elchanan Mossel (UC Berkeley) · Ryan O’Donnell (MIT) · Rocco Servedio (Harvard)

## What's a junta?

junta (dictionary):
- A council or committee for political or governmental purposes
- A group of persons controlling a government
- A junto

junta (this talk):
- A Boolean function on n variables that depends on only k << n of them (a "k-junta")

## Example: a 3-junta

f(x_1, ..., x_10) = x_3 OR (x_6 AND x_7)

| x_1 | x_2 | x_3 | x_4 | x_5 | x_6 | x_7 | x_8 | x_9 | x_10 | f(x) |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|------|------|
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
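A tiny illustrative script (not from the talk) that draws uniform examples labeled by this 3-junta, producing rows like the table above:

```python
import random

def f(x):
    """The example 3-junta on 10 variables: f(x) = x_3 OR (x_6 AND x_7).
    Only the variables at (1-based) positions 3, 6, 7 matter; the other
    seven are irrelevant."""
    return x[2] | (x[5] & x[6])   # 0-based indexing into the input vector

# Draw a few uniform random examples, as in the table above.
random.seed(0)
for _ in range(5):
    x = [random.randint(0, 1) for _ in range(10)]
    print(*x, ';', f(x))
```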

## Learning juntas

The problem: you get data labeled according to some k-junta. What's the junta?

| x_1 | x_2 | x_3 | x_4 | x_5 | x_6 | x_7 | x_8 | x_9 | x_10 | f(x) |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|------|------|
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |

## Outline of talk

- Motivation
- Warm-ups
- Our results
- How we do it
- Future work

## Why learn juntas?

- It is a natural, general problem (no assumptions on f).
- Real-world learning problems often contain lots of irrelevant information.
- It is an important special case of notorious open questions in learning theory: learning DNF, learning decision trees, ...

## Learning decision trees

Given data labeled according to some decision tree, what's the tree?

[Figure: an example decision tree with internal nodes querying x_5, x_3, x_1, x_2, x_4, x_6 and leaves labeled 0 and 1.]

## Learning decision trees (cont.)

Any k-junta is expressible as a decision tree of size 2^k, so every log(n)-junta is a poly(n)-size tree. Hence to learn poly(n)-size decision trees, one must be able to learn log(n)-juntas.

Big open question: are decision trees of size poly(n) learnable in poly(n) time? The situation for learning DNF is similar.

## Learning decision trees (cont.)

Conversely, a decision tree of size log(n) depends on at most log(n) variables, so if we can learn log(n)-juntas, we can learn decision trees of size log(n); even this would be a big step forward. So progress on juntas is necessary for progress on decision trees, and it's also sufficient! Again, the situation for DNF is similar.

## The problem: PAC-learn k-juntas under the uniform distribution

Setup: we get random examples (x^1, f(x^1)), (x^2, f(x^2)), ..., where
- each x^i is uniform from {0,1}^n
- f is an unknown k-junta

Goal: output h such that, with very high probability, Pr[h(x) ≠ f(x)] < ε.

## The problem refined

Setup: we get random examples (x^1, f(x^1)), (x^2, f(x^2)), ..., where
- each x^i is uniform from {0,1}^n
- f is an unknown k-junta

Goal: output h such that Pr[h(x) ≠ f(x)] < ε.
Equivalent goal: output h = f exactly.
Equivalent goal: find the k relevant variables of f (once they are known, the 2^k-entry truth table of f can be estimated directly).

## What's known?

- Easy lower bound: need at least 2^k + k·log(n) examples.
- Easy information-theoretic upper bound: 2^k + k·log(n) examples are sufficient.
- Easy computational upper bound: there are (n choose k) possible sets of relevant variables, so exhaustive search runs in 2^O(k) · (n choose k) = O(n^k) time.

Can we learn in time poly(n, 2^k)?
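A minimal sketch of the exhaustive-search baseline (function and variable names are mine, not from the talk): try every size-k candidate set and check whether some function of those k bits is consistent with all the examples.

```python
from itertools import combinations

def exhaustive_junta_search(examples, n, k):
    """Brute-force baseline: for each size-k set of candidate relevant
    variables, check whether the labels are a function of those k bits
    alone. Runs in roughly O(n^k) time over (n choose k) candidate sets."""
    for cand in combinations(range(n), k):
        table = {}          # projection onto `cand` -> observed label
        consistent = True
        for x, y in examples:
            key = tuple(x[i] for i in cand)
            if table.setdefault(key, y) != y:
                consistent = False   # same projection, different labels
                break
        if consistent:
            return cand, table       # a consistent k-junta hypothesis
    return None
```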

## Variant #1: membership queries

If the learner can make queries, it can learn in poly(n, 2^k) time:
- Draw random points. If all are positive or all are negative, done.
- Otherwise, "walk" from a positive point to a negative point to identify a relevant variable (a sketch follows below):

```text
1 1 0 1 0 0 1 0 1 ; 1
0 1 0 1 1 0 0 1 1 ; 0
0 1 0 1 0 0 1 0 1 ; 1
0 1 0 1 1 0 1 0 1 ; 1
0 1 0 1 1 0 0 0 1 ; 0
```

- Recurse.
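A sketch of the walk as a binary search over the coordinates where the positive and negative points differ, assuming query access to f:

```python
def find_relevant_variable(f, pos, neg):
    """Membership-query 'walk': given points with f(pos) = 1 and
    f(neg) = 0, binary-search over the coordinates where they differ
    to locate one relevant variable, using O(log n) queries to f."""
    diff = [i for i in range(len(pos)) if pos[i] != neg[i]]
    while len(diff) > 1:
        half = diff[:len(diff) // 2]
        mid = list(pos)
        for i in half:
            mid[i] = neg[i]          # move halfway from pos toward neg
        if f(mid) == 1:
            # label did not flip yet: the change is in the other half
            pos, diff = mid, diff[len(diff) // 2:]
        else:
            neg, diff = mid, half
    return diff[0]                   # flipping this coordinate flips f
```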

## Variant #2: monotone functions

If the junta is monotone, we can learn in poly(n, 2^k) time:
- If x_i is irrelevant, then Pr[f(x) = 1 | x_i = 1] = Pr[f(x) = 1 | x_i = 0].
- If x_i is relevant, then Pr[f(x) = 1 | x_i = 1] > Pr[f(x) = 1 | x_i = 0].
- Each probability is an integer multiple of 1/2^k.
- So we can test each variable in poly(2^k) time, as in the sketch below.
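A sketch of that test (sample count and threshold are illustrative; for a real gap of 1/2^k one needs poly(2^k) samples per variable):

```python
import random

def monotone_relevant_vars(f, n, k, samples=200000):
    """Monotone-junta test: estimate Pr[f=1 | x_i=1] and Pr[f=1 | x_i=0]
    for each i. The gap is 0 for irrelevant variables and at least 1/2^k
    for relevant ones, so estimating to accuracy 1/2^{k+1} separates them."""
    ones = [[0, 0] for _ in range(n)]   # ones[i][b]: count of f(x)=1 when x_i=b
    tots = [[0, 0] for _ in range(n)]
    for _ in range(samples):
        x = [random.randint(0, 1) for _ in range(n)]
        y = f(x)
        for i, b in enumerate(x):
            tots[i][b] += 1
            ones[i][b] += y
    def gap(i):
        return (ones[i][1] / max(tots[i][1], 1)
                - ones[i][0] / max(tots[i][0], 1))
    return [i for i in range(n) if gap(i) > 1 / 2 ** (k + 1)]
```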

## Variant #3: random functions

If the junta is random, then whp we can learn in poly(n, 2^k) time:
- If x_i is irrelevant, then Pr[f(x) = x_i] = 1/2 for sure.
- If x_i is relevant, then Pr[ Pr[f(x) = x_i] = 1/2 ] ≈ 1/2^{k/2}; that is, only with probability about 2^{-k/2} (over the random choice of the junta) does a relevant variable look uncorrelated with f.
- Each probability is an integer multiple of 1/2^k.
- So whp we can find all the relevant variables this way (see the sketch after this list).
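A sketch of the corresponding correlation test (names and threshold are mine):

```python
def correlated_vars(examples, k):
    """Random-junta test: flag x_i as relevant when the empirical
    Pr[f(x) = x_i] deviates from 1/2 by more than half of the 1/2^k
    granularity. Misses the ~2^{-k/2} fraction of relevant variables
    whose correlation with f happens to be exactly 1/2."""
    n = len(examples[0][0])
    agree = [0] * n
    for x, y in examples:
        for i in range(n):
            agree[i] += (x[i] == y)
    m = len(examples)
    return [i for i in range(n)
            if abs(agree[i] / m - 0.5) > 1 / 2 ** (k + 1)]
```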

## Back to the real problem

- Lower bound: need at least 2^k + k·log(n) examples.
- Upper bound: there are (n choose k) possible sets of relevant variables, so exhaustive search runs in 2^O(k) · (n choose k) time.

Can we learn in time poly(n, 2^k)?

## Previous work

- [Blum & Langley, 1994] suggested the problem.
- Little progress until...
- [Kalai & Mansour, 2001] gave an algorithm that learns in time n^(k − √k).

## Our result

We give an algorithm that learns in time roughly n^((ω/(ω+1))·k), where ω ≈ 2.376 is the matrix multiplication exponent. So currently the running time is about n^(0.704·k).

## The main idea

Let g be the hidden function on k variables. Look at two different representations of g:
- Only "weird" functions are hard to learn under the first representation.
- Only "perverse" functions are hard to learn under the second representation.
- No function is both weird and perverse.

## First representation: real polynomial

View inputs and outputs as ±1-valued. Fact: every Boolean function g: {−1,1}^k → {−1,1} has a unique interpolating real polynomial g_R(x_1, x_2, ..., x_k).
- The coefficients of g_R are the Fourier coefficients of g.
- Examples:
  - parity on x_1, x_2, ..., x_k: the polynomial is x_1·x_2···x_k
  - x_1 AND x_2: the polynomial is (1 + x_1 + x_2 − x_1·x_2)/2

## Real polynomials

Fourier coefficients measure the correlation of g with the corresponding parities: E[g(x)·x_T] = coefficient of x_T in g_R, where x_T = ∏_{i∈T} x_i.

So given a set T of variables, we can estimate the coefficient of x_T by sampling (see the sketch below).
- The coefficient is nonzero only if every variable in T is relevant.
- Problem: we may have to test all sets of up to k variables before finding a nonzero coefficient.
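A minimal sampling sketch for one coefficient, assuming query access to g on ±1 inputs (the sample count is illustrative):

```python
import math
import random

def estimate_fourier_coefficient(f, n, T, samples=100000):
    """Estimate the coefficient of x_T in g_R by sampling, using the
    identity E[g(x) * x_T] = (coefficient of x_T), where x_T is the
    product of the +/-1 variables indexed by T."""
    total = 0
    for _ in range(samples):
        x = [random.choice([-1, 1]) for _ in range(n)]
        total += f(x) * math.prod(x[i] for i in T)
    return total / samples
```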

## First technical theorem

Let g be a Boolean function on k variables such that g_R has a nonzero constant term:

    g_R(x) = c_0 + Σ_{|T| ≥ s} c_T·x_T,   c_0 ≠ 0

(s = degree of the smallest nontrivial monomial). Then s ≤ 2k/3.

## Second representation: GF(2) polynomial

View inputs and outputs as 0/1-valued. Fact: every Boolean function g: {0,1}^k → {0,1} has a unique interpolating GF(2) polynomial g_2(x_1, x_2, ..., x_k).

Examples:
- parity on x_1, x_2, ..., x_k: the polynomial is x_1 + x_2 + ... + x_k
- x_1 AND ... AND x_k: the polynomial is x_1·x_2···x_k
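A small sketch recovering g_2 from a truth table via the subset-XOR (Möbius) transform; indexing inputs by bitmask is a convention of this sketch, not of the talk:

```python
def gf2_polynomial(truth_table, k):
    """Recover the unique GF(2) polynomial of g: {0,1}^k -> {0,1} from
    its truth table: the coefficient of the monomial prod_{i in S} x_i
    is the XOR of g over all inputs whose support lies inside S.
    `truth_table[T]` is g at the input whose bits are the bitmask T."""
    coeffs = {}
    for S in range(2 ** k):          # S encodes a monomial as a bitmask
        c = 0
        T = S
        while True:                  # enumerate all submasks T of S
            c ^= truth_table[T]
            if T == 0:
                break
            T = (T - 1) & S
        if c:
            coeffs[S] = 1            # e.g. parity yields the singleton masks
    return coeffs
```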

## Learning parities

Suppose g is some parity function, e.g. g(x) = parity(x_1, x_2, x_4). We can add labeled examples mod 2:

```text
  0 1 0 1 0 0 1 0 1 ; 0
+ 1 1 1 1 1 0 1 0 1 ; 1
-------------------------
  1 0 1 0 1 0 0 0 0 ; 1
```

## Learning parities (cont.)

Given a set of labeled examples, we can do Gaussian elimination (mod 2) to obtain

```text
1 0 0 0 0 0 0 0 0 ; b
```

We will have b = 1 iff x_1 is in the parity. Repeat for x_2, ..., x_n to learn the parity. A sketch follows below.
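A self-contained sketch of that elimination, treating each example (x, y) as the linear equation ⟨a, x⟩ = y (mod 2) for the unknown 0/1 indicator vector a of the parity (free variables are set to 0, which is correct once enough examples pin the solution down):

```python
def learn_parity(examples, n):
    """Gaussian elimination over GF(2). `examples` is a list of
    (bit-vector, label) pairs; returns the indices of the variables in
    a parity consistent with all examples."""
    rows = [(list(x), y) for x, y in examples]
    pivots = []
    r = 0
    for col in range(n):
        # find a row at position >= r with a 1 in this column
        piv = next((i for i in range(r, len(rows)) if rows[i][0][col]), None)
        if piv is None:
            continue                       # free column
        rows[r], rows[piv] = rows[piv], rows[r]
        for i in range(len(rows)):         # clear the column everywhere else
            if i != r and rows[i][0][col]:
                rows[i] = ([u ^ v for u, v in zip(rows[i][0], rows[r][0])],
                           rows[i][1] ^ rows[r][1])
        pivots.append((col, r))
        r += 1
    # after full reduction, pivot row's label is the value of its variable
    return [col for col, row in pivots if rows[row][1]]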

## Learning GF(2) polynomials

Given any g: {0,1}^k → {0,1}, we can view g_2 as a parity over monomials (ANDs of variables); see the expansion sketch below.
- If deg(g_2) = d, there are at most roughly k^d monomials.
- In the junta setting, there are roughly n^d candidate monomials.
- Problem: d could be as large as k.
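A small sketch of the monomial expansion; feeding the expanded examples to the `learn_parity` sketch above learns a degree-≤ d GF(2) polynomial as a parity over features:

```python
from itertools import combinations

def expand_monomials(x, d):
    """Map a 0/1 example x to the vector of all monomials (ANDs) of
    degree <= d over its coordinates. A degree-d GF(2) polynomial is a
    parity over these features."""
    feats = [1]                              # the constant monomial
    for r in range(1, d + 1):
        for S in combinations(range(len(x)), r):
            v = 1
            for i in S:
                v &= x[i]
            feats.append(v)
    return feats
```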

## Second technical theorem

Let g be a Boolean function on k variables such that g_R has zero constant term:

    g_R(x) = Σ_{|T| ≥ s} c_T·x_T

Then deg(g_2) ≤ k − s.

## Algorithm to learn k-juntas

1. Sample to test whether f is constant.
2. If not, sample to estimate the Fourier coefficients of all sets of up to αk variables.
   - Some nonzero coefficient of size m: recurse on all 2^m settings of those variables.
   - All small coefficients zero: run the parity-learning algorithm over monomials of size up to (1 − α)k.

(A toy sketch of step 2's search phase follows below.)
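A runnable toy version of the Fourier-search phase for small n and k, with simplifications that are mine: it assumes query access to f on {−1,1}^n (the real algorithm uses random examples only), uses an illustrative threshold, and leaves the GF(2) parity fallback and the recursion on restrictions as comments:

```python
import math
import random
from itertools import combinations

def junta_fourier_search(f, n, k, alpha=0.704, samples=50000):
    """Search for a nonzero Fourier coefficient on a set of size <= alpha*k.
    If one is found, its variables are relevant and the full algorithm
    recurses on all settings of them; if none is found, f is balanced and
    the GF(2) parity-learning phase over degree-(1-alpha)k monomials applies."""
    xs = [[random.choice([-1, 1]) for _ in range(n)] for _ in range(samples)]
    ys = [f(x) for x in xs]
    for size in range(1, int(alpha * k) + 1):
        for T in combinations(range(n), size):
            est = sum(y * math.prod(x[i] for i in T)
                      for x, y in zip(xs, ys)) / samples
            if abs(est) > 1 / 2 ** (k + 1):   # nonzero coefficient found
                return set(T)                 # recurse on 2^|T| restrictions
    return set()    # all small coefficients vanish: fall back to parity learning
```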

## Why does it work?

- If f is unbalanced (nonzero constant term), the first technical theorem guarantees a nonzero coefficient of size at most 2k/3 ≤ αk, so the search finds it.
- If f is balanced and all coefficients of size ≤ αk vanish, then s > αk, so by the second technical theorem deg(f_2) ≤ k − s < (1 − α)k, and the parity-learning algorithm is guaranteed to succeed.
- So either way we make progress. Take α ≥ 2/3.

## Running time

- Checking all sets of up to αk variables takes n^(αk) time.
- Running Gaussian elimination over monomials of size up to (1 − α)k takes n^(ω(1−α)k) time (ω = matrix multiplication exponent).
- So the best choice is α = ω/(ω+1), balancing the two phases (worked out below).
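Setting the two exponents equal gives the claimed running time:

```latex
% Balancing the two phases of the algorithm:
%   Fourier search over sets of size <= \alpha k costs n^{\alpha k};
%   GF(2) elimination over the n^{(1-\alpha)k} monomials of degree
%   <= (1-\alpha)k costs (n^{(1-\alpha)k})^{\omega} = n^{\omega(1-\alpha)k}.
\alpha k \;=\; \omega\,(1-\alpha)\,k
\quad\Longrightarrow\quad
\alpha \;=\; \frac{\omega}{\omega+1}
\;\approx\; \frac{2.376}{3.376} \;\approx\; 0.704 .
```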

## What else can we do?

Restrictions: we can look at f under "small" restrictions, fixing a few variables and leaving the rest free, e.g.

```text
(x_1, x_2, ..., x_10)  ->  (0, x_2, 1, 0, x_5, x_6, x_7, x_8, 0, x_10)
```

## A question

Suppose g: {−1,1}^k → {−1,1} has all its Fourier weight on large sets: g_R(x) = Σ_{|T| > 2k/3} ĝ_T·x_T. Must there be some restriction ρ fixing at most 2k/3 variables such that g(ρ(x)) is a parity function? If yes, we can learn k-juntas in time n^(2k/3).

## Future work

- Faster algorithms?
- Non-binary input alphabets? (Non-binary outputs are easy.)
- Non-uniform distributions?
  - Product distributions?
  - General distributions?
