Learning Juntas
Elchanan Mossel (UC Berkeley), Ryan O'Donnell (MIT), Rocco Servedio (Harvard)
What's a junta?
junta:
– A council or committee for political or governmental purposes
– A group of persons controlling a government
– A junto
junta:
– A Boolean function that depends on only k << n of its n Boolean variables
Example: a 3-junta
f(x_1, ..., x_10) = x_3 OR (x_6 AND x_7)
[Table: random labeled examples with columns x_1, ..., x_10 and f(x); only the x_3, x_6, x_7 columns affect the label.]
Learning juntas
The problem: you get data labeled according to some k-junta. What's the junta?
[Table: random labeled examples with columns x_1, ..., x_10 and f(x).]
Outline of talk
– Motivation
– Warm-ups
– Our results
– How we do it
– Future work
Why learn juntas?
– Natural, general problem (no assumptions on f)
– Real-world learning problems often have lots of irrelevant information
– Important special case of notorious open questions in learning theory: learning DNF, learning decision trees, ...
Learning decision trees
Given data labeled according to some decision tree, what's the tree?
[Figure: a decision tree with internal nodes labeled x_5, x_3, x_1, x_2, x_4, x_6 and leaves labeled 0/1.]
Learning decision trees (cont)
Any k-junta is expressible as a decision tree of size 2^k. So to learn poly(n)-size decision trees, we must be able to learn log(n)-juntas.
Big open question: are decision trees of size poly(n) learnable in poly(n) time? Similar situation for learning DNF.
Learning decision trees (cont)
If we can learn log(n)-juntas, we can learn decision trees of size log(n); even this would be a big step forward.
So progress on juntas is necessary for progress on decision trees. It's also sufficient!
Again, similar situation for DNF.
The problem: PAC learning k-juntas under the uniform distribution
Setup: we get random examples (x^1, f(x^1)), (x^2, f(x^2)), ..., where
– each x^i is uniform from {0,1}^n
– f is an unknown k-junta
Goal: output h such that, with very high probability, Pr[h(x) ≠ f(x)] < ε.
The problem refined
Setup: we get random examples (x^1, f(x^1)), (x^2, f(x^2)), ..., where
– each x^i is uniform from {0,1}^n
– f is an unknown k-junta
Goal: output h such that Pr[h(x) ≠ f(x)] < ε.
Equivalent goal: output h = f.
Equivalent goal: find the k relevant variables of f.
What's known?
– Easy lower bound: need at least 2^k + k log n examples.
– Easy information-theoretic upper bound: 2^k + k log n examples are sufficient.
– Easy computational upper bound: there are (n choose k) possible sets of relevant variables, so we can do exhaustive search in 2^{O(k)} · (n choose k) = O(n^k) time.
Can we learn in time poly(n, 2^k)?
Variant #1: membership queries
If the learner can make membership queries, it can learn in poly(n, 2^k) time:
– Draw random points. If all are positive or all are negative, done.
– Otherwise, "walk" from a positive point to a negative point, flipping one bit at a time; the step at which the label flips identifies a relevant variable.
– Restrict that variable and recurse.
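The walk in the bullets above can be sketched as follows. This is a minimal illustration, not the paper's code; the function name and the 3-junta oracle used to exercise it are made up (the junta mirrors the earlier example slide).

```python
def find_relevant_variable(f, pos, neg):
    """Walk from a positive point toward a negative point, flipping one
    differing bit at a time.  Since the labels of the endpoints differ,
    the label must flip at some step, and the bit flipped at that step
    is a relevant variable.  f is a membership-query oracle mapping a
    bit tuple to 0/1."""
    x = list(pos)
    label = f(pos)
    for i in range(len(x)):
        if x[i] != neg[i]:
            x[i] = neg[i]              # flip one bit toward neg
            new_label = f(tuple(x))
            if new_label != label:     # label flipped: x_i is relevant
                return i
            label = new_label
    raise ValueError("pos and neg must have different labels")

# Hypothetical 3-junta (0-indexed): f depends on x_3, x_6, x_7
f = lambda x: x[2] | (x[5] & x[6])
i = find_relevant_variable(f, (1,) * 10, (0,) * 10)   # finds index 5 (x_6)
```

With a binary search over the differing bits instead of a linear scan, the walk needs only O(log n) queries per relevant variable.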
Variant #2: monotone functions
If the junta is monotone, we can learn in poly(n, 2^k) time:
– If x_i is irrelevant, we have Pr[f(x) = 1 | x_i = 1] = Pr[f(x) = 1 | x_i = 0].
– If x_i is relevant, we have Pr[f(x) = 1 | x_i = 1] > Pr[f(x) = 1 | x_i = 0].
– Each probability is an integer multiple of 1/2^k.
– So we can test each variable in poly(2^k) time.
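A sketch of this relevance test. For simplicity the two conditional probabilities are computed exactly by enumerating {0,1}^n; in the actual learning setting they would be estimated from random samples to accuracy better than 1/2^(k+1), which suffices because both are multiples of 1/2^k. The monotone 3-junta f is a made-up example.

```python
from itertools import product

def relevant_variables_monotone(f, n):
    """For a monotone function, x_i is relevant iff
    Pr[f = 1 | x_i = 1] > Pr[f = 1 | x_i = 0]."""
    relevant = []
    for i in range(n):
        ones = [f(x) for x in product((0, 1), repeat=n) if x[i] == 1]
        zeros = [f(x) for x in product((0, 1), repeat=n) if x[i] == 0]
        if sum(ones) / len(ones) > sum(zeros) / len(zeros):
            relevant.append(i)
    return relevant

# Hypothetical monotone 3-junta on n = 6 variables
f = lambda x: x[0] | (x[1] & x[3])
rel = relevant_variables_monotone(f, 6)   # -> [0, 1, 3]
```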
Variant #3: random functions
If the junta is a random function, then whp we can learn in poly(n, 2^k) time:
– If x_i is irrelevant, we have Pr[f(x) = x_i] = 1/2 for sure.
– If x_i is relevant, we have Pr[ Pr[f(x) = x_i] = 1/2 ] ≈ 1/2^{k/2} (over the random choice of f).
– Each probability is an integer multiple of 1/2^k.
– So whp we can find the relevant variables this way.
Back to the real problem
– Lower bound: need at least 2^k + k log n examples.
– Upper bound: there are (n choose k) possible sets of relevant variables, so we can do exhaustive search in 2^{O(k)} · (n choose k) time.
Can we learn in time poly(n, 2^k)?
Previous work
[Blum & Langley, 1994] suggested the problem.
Little progress until...
[Kalai & Mansour, 2001] gave an algorithm that learns in time n^{k − k^{1/2}}.
Our result
We give an algorithm that learns in time ≈ n^{ωk/(ω+1)}, where ω ≈ 2.376 is the matrix multiplication exponent. So currently the running time is ≈ n^{0.704k}.
The main idea
Let g be the hidden function on k variables. Look at two different representations of g:
– Only "weird" functions are hard to learn under the first representation.
– Only "perverse" functions are hard to learn under the second representation.
– No function is both weird and perverse.
First representation: real polynomial
View inputs and outputs as ±1-valued. Fact: every Boolean function g: {−1,1}^k → {−1,1} has a unique interpolating real polynomial g_R(x_1, x_2, ..., x_k).
– The coefficients of g_R are the Fourier coefficients of g.
– Examples:
  parity of x_1, x_2, ..., x_k: polynomial is x_1 x_2 ··· x_k
  x_1 AND x_2: polynomial is (1 + x_1 + x_2 − x_1 x_2)/2 (under the convention that −1 encodes TRUE)
Real polynomials
Fourier coefficients measure the correlation of g with the corresponding parities:
E[g(x) · x_T] = coefficient of x_T in g_R, where x_T denotes ∏_{i ∈ T} x_i.
So given a set T of variables, we can estimate the coefficient of x_T by sampling.
– It is nonzero only if every variable in T is relevant.
– Problem: we may have to test all sets of up to k variables to find a nonzero coefficient.
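A sketch of computing these coefficients. Here the expectation is evaluated exactly over {−1,1}^k instead of being estimated from samples; it verifies the AND polynomial from the previous slide, reading that formula under the −1 = TRUE convention. The helper name is made up.

```python
from itertools import product

def fourier_coefficient(g, k, T):
    """Exact Fourier coefficient: E[g(x) * prod_{i in T} x_i] over
    uniform x in {-1,1}^k.  (The learning algorithm estimates this
    expectation from random examples rather than enumerating.)"""
    total = 0
    for x in product((-1, 1), repeat=k):
        chi = 1
        for i in T:
            chi *= x[i]            # the parity (character) x_T
        total += g(x) * chi
    return total / 2 ** k

# AND of two variables, with -1 encoding TRUE as in the slide's formula
AND = lambda x: -1 if x[0] == -1 and x[1] == -1 else 1
coeffs = {T: fourier_coefficient(AND, 2, T)
          for T in [(), (0,), (1,), (0, 1)]}
# matches (1 + x_1 + x_2 - x_1 x_2)/2: 0.5, 0.5, 0.5, -0.5
```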
First technical theorem
Let g be a Boolean function on k variables such that g_R has nonzero constant term:
g_R(x) = c_0 + Σ_{|T| ≥ s} c_T x_T,  c_0 ≠ 0
(s = degree of the smallest nontrivial monomial). Then s ≤ 2k/3.
Second representation: GF(2) polynomial
View inputs and outputs as 0/1-valued. Fact: every Boolean function g: {0,1}^k → {0,1} has a unique interpolating GF(2) polynomial g_2(x_1, x_2, ..., x_k).
Examples:
parity of x_1, x_2, ..., x_k: polynomial is x_1 + x_2 + ··· + x_k
x_1 AND ··· AND x_k: polynomial is x_1 x_2 ··· x_k
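A brute-force sketch of recovering this GF(2) polynomial (the algebraic normal form) from a truth table via the Möbius transform; the coefficient of ∏_{i∈S} x_i is the XOR of g over all inputs supported inside S. The helper name is made up; the checks match the slide's two examples.

```python
from itertools import product

def anf_coefficients(g, k):
    """GF(2) polynomial of g: {0,1}^k -> {0,1} via the Mobius
    transform.  Returns {S: 1} for each monomial present, where S is a
    0/1 indicator tuple of the variables in the monomial."""
    coeffs = {}
    for S in product((0, 1), repeat=k):
        c = 0
        for x in product((0, 1), repeat=k):
            if all(x[i] <= S[i] for i in range(k)):   # support(x) within S
                c ^= g(x)
        if c:
            coeffs[S] = 1
    return coeffs

parity = lambda x: x[0] ^ x[1] ^ x[2]   # ANF: x1 + x2 + x3
and3 = lambda x: x[0] & x[1] & x[2]     # ANF: x1 x2 x3
```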
Learning parities
Suppose g is some parity function, e.g. g(x) = parity(x_1, x_2, x_4).
We can add labeled examples mod 2, and the sum is again a correctly labeled example:
  0 1 0 1 0 0 1 0 1 ; 0
+ 1 1 1 1 1 0 1 0 1 ; 1
= 1 0 1 0 1 0 0 0 0 ; 1
Learning parities (cont)
Given a set of labeled examples, we can do Gaussian elimination to obtain the row
1 0 0 0 0 0 0 0 0 ; b
We will have b = 1 iff x_1 is in the parity. Repeat for x_2, ..., x_n to learn the parity.
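The elimination step can be sketched as a small Gauss-Jordan routine over GF(2) on the augmented system [X | y]. This is a minimal illustration with made-up names; it assumes the examples are consistent with some parity and span enough of GF(2)^n to determine it.

```python
from itertools import product

def learn_parity(examples, n):
    """Find a in {0,1}^n with <a, x> = y (mod 2) for every labeled
    example (x, y), by Gauss-Jordan elimination over GF(2)."""
    rows = [list(x) + [y] for (x, y) in examples]   # augmented rows
    pivot_cols, r = [], 0
    for c in range(n):
        pr = next((i for i in range(r, len(rows)) if rows[i][c]), None)
        if pr is None:
            continue                       # no pivot in this column
        rows[r], rows[pr] = rows[pr], rows[r]
        for i in range(len(rows)):         # clear column c everywhere else
            if i != r and rows[i][c]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[r])]
        pivot_cols.append(c)
        r += 1
    coeffs = [0] * n
    for i, c in enumerate(pivot_cols):     # pivot row i reads: a_c = label
        coeffs[c] = rows[i][n]
    return coeffs

# Recover parity(x_1, x_3) on n = 4 variables (0-indexed bits 0 and 2)
examples = [(x, x[0] ^ x[2]) for x in product((0, 1), repeat=4)]
coeffs = learn_parity(examples, 4)   # -> [1, 0, 1, 0]
```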
Learning GF(2) polynomials
Given any g: {0,1}^k → {0,1}, we can view g_2 as a parity over monomials (ANDs).
If deg(g_2) = d, we have ≈ k^d monomials; in the junta setting, ≈ n^d monomials.
– Problem: d could be as large as k.
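The reduction can be sketched as follows: expanding each example into the values of all monomials of degree ≤ d turns a degree-d GF(2) polynomial into a plain parity over the expanded features, so the parity-learning algorithm applies. The polynomial g and the helper names here are hypothetical.

```python
from itertools import combinations, product

def expand(x, d):
    """Values of all monomials (ANDs) of degree 1..d on input x."""
    feats = []
    for size in range(1, d + 1):
        for S in combinations(range(len(x)), size):
            feats.append(int(all(x[i] == 1 for i in S)))
    return feats

# Hypothetical degree-2 polynomial g(x) = x0*x1 + x2 over GF(2), k = 4
g = lambda x: (x[0] & x[1]) ^ x[2]

# In the expanded feature space, g is the parity of features x2 and x0*x1
monomials = [S for size in (1, 2) for S in combinations(range(4), size)]
support = {monomials.index((2,)), monomials.index((0, 1))}
ok = all(g(x) == sum(expand(x, 2)[j] for j in support) % 2
         for x in product((0, 1), repeat=4))
```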
Second technical theorem
Let g be a Boolean function on k variables such that g_R has zero constant term:
g_R(x) = Σ_{|T| ≥ s} c_T x_T.
Then deg(g_2) ≤ k − s.
Algorithm to learn k-juntas
For a parameter α < 1:
– Sample to test whether f is constant.
– If not, sample to estimate the Fourier coefficients of all sets of up to αk variables.
– Nonzero coefficient on a set of size m: recurse on all 2^m settings of those variables.
– All small coefficients zero: run the parity-learning algorithm with monomials of size up to (1 − α)k.
Why does it work?
– If f is unbalanced, we will find a nonzero coefficient of size at most 2k/3 ≤ αk.
– If f is balanced, the parity-learning algorithm is guaranteed to succeed.
– So either way, we make progress. Take α ≥ 2/3.
Running time
– Checking sets of up to αk variables takes n^{αk} time.
– Running Gaussian elimination on monomials of size up to (1 − α)k takes time n^{ω(1−α)k} (ω = matrix multiplication exponent).
– So the best choice is α = ω/(ω+1), giving time ≈ n^{ωk/(ω+1)} ≈ n^{0.704k}.
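A quick numeric check of this balancing, a sketch of the slide's arithmetic using the slide's value ω = 2.376: the two phases cost n^{αk} and n^{ω(1−α)k}, and the exponents coincide exactly when α = ω/(ω+1).

```python
omega = 2.376                 # matrix multiplication exponent (as on the slide)
alpha = omega / (omega + 1)   # balancing choice of the parameter alpha

# Exponents of n, per factor of k, for the two phases of the algorithm:
search_exponent = alpha               # checking Fourier coefficients
gauss_exponent = omega * (1 - alpha)  # Gaussian elimination on monomials
# Both equal alpha ~ 0.704, giving overall running time ~ n^(0.704 k)
```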
What else can we do?
Restrictions: we can look at f under "small" restrictions, e.g.
(x_1, ..., x_10) ↦ (0, x_2, 1, 0, x_5, x_6, x_7, x_8, 0, x_10)
which fixes x_1 = 0, x_3 = 1, x_4 = 0, x_9 = 0.
A question
Suppose g: {−1,1}^k → {−1,1} has all its Fourier weight at high levels:
g_R(x) = Σ_{|T| > 2k/3} ĝ_T x_T.
Must there be some restriction ρ fixing at most 2k/3 variables such that g(ρ(x)) is a parity function?
If yes, we can learn k-juntas in time ≈ n^{2k/3}.
Future work
– Faster algorithms?
– Non-binary input alphabets? (non-binary outputs are easy)
– Non-uniform distributions? Product distributions? General distributions?