Published by Tierra Ledyard. Modified over 3 years ago.

1
LEARNING UNDER THE UNIFORM DISTRIBUTION – Toward DNF – Ryan O'Donnell, Microsoft Research, January 2006

2
Re: How to make $1000! A Grand of George W.s: A Hundred Hamiltons: A Cool Cleveland:

3
The junta learning problem
f : {−1,+1}^n → {−1,+1} is an unknown Boolean function.
f depends on only k ≪ n bits.
May generate examples ⟨x, f(x)⟩, where x is generated uniformly at random.
Task: Identify the k relevant variables. Equivalently, identify f exactly. Equivalently, identify one relevant variable.
[Image: DNA]

4
Run time efficiency
Information theoretically: Need only ≈ 2^k log n examples.
Algorithmically: Seem to need n^Ω(k) time steps.
Naive algorithm: Time n^k.
Best known algorithm: Time = n^(.704 k) [Mossel-O-Servedio 04]

5
How to get the money
Learning log n-juntas in poly(n) time gets you $1000.
Learning log log n-juntas in poly(n) time gets you $1000.
Learning n (1)-juntas in poly(n) time gets you $200.
The case k = log n is a subproblem of the problem of learning polynomial-size DNF under the uniform distribution.
http://www.thesmokinggun.com/archive/bushbill1.html

6
Algorithmic attempts
For each x_i, measure empirical correlation with f(x): E[ f(x) x_i ]. Time: n.
Different from 0 ⇒ x_i must be relevant.
Converse false: x_i can be influential but uncorrelated. (e.g., k = 4, f = exactly 2 out of 4 bits are +1)
Try measuring f's correlation with pairs of variables: E[ f(x) x_i x_j ]. Time: n^2.
Different from 0 ⇒ both x_i and x_j must be relevant.
Still might not work. (e.g., k ≥ 3, f = parity on k bits)
So try measuring correlation with all triples of variables… Time: n^3.
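
The two counterexamples on this slide can be checked by brute force. The following is a minimal sketch, not code from the talk: it computes exact correlations by enumerating the whole cube instead of sampling, and the function names and the choice n = 5 (relevant bits plus at least one irrelevant bit) are illustrative assumptions.

```python
from itertools import combinations, product
from math import prod


def correlation(f, n, subset):
    """Exact E[f(x) * prod_{i in subset} x_i] for uniform x in {-1,+1}^n."""
    total = sum(f(x) * prod(x[i] for i in subset)
                for x in product((-1, 1), repeat=n))
    return total / 2 ** n


def correlated_sets(f, n, size):
    """All variable sets of the given size with nonzero correlation with f."""
    return [s for s in combinations(range(n), size)
            if correlation(f, n, s) != 0]


def exactly_two_of_four(x):
    # Slide example (k = 4): +1 iff exactly two of the first four bits are +1.
    return 1 if sum(b == 1 for b in x[:4]) == 2 else -1


def parity_of_three(x):
    # Slide example with k = 3: parity (product) of the first three bits.
    return x[0] * x[1] * x[2]


if __name__ == "__main__":
    n = 5  # hypothetical: the relevant bits plus irrelevant ones
    for f in (exactly_two_of_four, parity_of_three):
        for size in (1, 2, 3):
            print(f.__name__, size, correlated_sets(f, n, size))
```

For the first function no single variable is correlated but every pair of relevant variables is; for parity on three bits nothing shows up until all three variables are tested together.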

7
A result
In time n^d, you can check correlation with all d-bit functions.
What kind of Boolean functions on k bits could be uncorrelated with all functions on d or fewer bits? (Well, parities on > d bits, e.g.…)
[Mossel-O-Servedio 04]: Proves structure theorem about such functions. (They must be expressible as parities of ANDs of small size.)
Can apply a parity-learning algorithm in that case.
End result: An algorithm running in time n^(.704 k).
Uniform-distribution learning results often implied by structural results about Boolean functions.

8
PAC Learning
There is an unknown f : {−1,+1}^n → {−1,+1}.
Algorithm gets i.i.d. examples ⟨x, f(x)⟩, drawn from the uniform distribution (in place of the unknown distribution of general PAC learning).
Task: Learn. Given ε, find a hypothesis function h which is (w.h.p.) ε-close to f.
Goal: Running-time efficiency.
[Image: CIRCUITS OF THE MIND]

9
Running-time efficiency
The more complex f is, the more time it's fair to allow.
Fix some measure of complexity or size, s = s(f), e.g., size of smallest DNF formula.
Goal: run in time poly(n, 1/ε, s).
Often focus on fixing s = poly(n), learning in poly(n) time.

10
The junta problem
Fits into the formulation (slightly strangely):
ε is fixed to 0. (Equivalently, 2^(−k).)
Measure of size is 2^(# of relevant variables). s = 2^k.
[Mossel-O-Servedio 04] had running time essentially n^(.704 log_2 s).
Even under this extremely conservative notion of size, we don't know how to learn in poly(n) time for s = poly(n).

11
complexity measure s | fastest known algorithm
DNF size | n^O(log s) [V 90]
2^(# of relevant variables) | n^(.704 log_2 s) [MOS 04]
depth d circuit size | n^O(log^(d−1) s) [LMN 93, H 02]
Decision Tree size | n^O(log s) [EH 89]
Assuming factoring is hard, n^(log^Ω(d) s) time is necessary. Even with queries. [K 93]
Any algorithm that works in the Statistical Query model requires time n^k. [BF 02]

12
What to do?
1. Give Learner extra help:
Queries: Learner can ask for f(x) for any x.
⇒ Can learn DNF in time poly(n, s). [Jackson 94]
More structured data:
Examples are not i.i.d., are generated by a standard random walk.
Examples come in pairs, ⟨x, f(x)⟩, ⟨x', f(x')⟩, where x, x' share a > ½ fraction of coordinates.
⇒ Can learn DNF in time poly(n, s). [Bshouty-Mossel-O-Servedio 05]

13
What to do? (rest of the talk)
2. Give up on trying to learn all functions.
Rest of the talk: Focus on just learning monotone functions.
f is monotone ⇔ changing a −1 to a +1 in the input can only make f go from −1 to +1, not the reverse.
Long history in PAC learning [HM91, KLV94, KMP 94, B95, BT96, BCL98, V98, SM00, S04, JS05...]
f has DNF size s and is monotone ⇒ f has a size s monotone DNF.

14
Why does monotonicity help?
1. More structured.
2. You can identify relevant variables.
Fact: If f is monotone, then f depends on x_i iff it has correlation with x_i; i.e., E[ f(x) x_i ] ≠ 0.
Proof: If f is monotone, its variables have only nonnegative correlations: E[ f(x) x_i ] equals the influence of x_i, which is positive exactly when f depends on x_i.
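
The Fact can be sanity-checked on a small monotone function. This is a minimal sketch under assumptions of my own choosing: majority of three bits as the monotone example, exact enumeration in place of sampling, and illustrative function names.

```python
from itertools import product


def correlation(f, n, i):
    """Exact E[f(x) * x_i] for uniform x in {-1,+1}^n."""
    return sum(f(x) * x[i] for x in product((-1, 1), repeat=n)) / 2 ** n


def influence(f, n, i):
    """Exact Pr[f(x) != f(x with bit i flipped)] for uniform x."""
    flips = 0
    for x in product((-1, 1), repeat=n):
        y = x[:i] + (-x[i],) + x[i + 1:]
        flips += f(x) != f(y)
    return flips / 2 ** n


def majority_of_three(x):
    # Monotone example: majority vote of the first three bits.
    return 1 if x[0] + x[1] + x[2] > 0 else -1


def relevant_variables(f, n):
    """For monotone f: the variables with nonzero degree-1 correlation."""
    return [i for i in range(n) if correlation(f, n, i) != 0]
```

With n = 5, `relevant_variables(majority_of_three, 5)` picks out exactly the three bits the function reads, and `correlation` and `influence` agree on every coordinate, as the monotone case requires.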

15
complexity measure s | fastest known algorithm (monotone case, in place of any function)
DNF size | poly(n, s^(log s)) [Servedio 04]
2^(# of relevant variables) | poly(n, 2^k) = poly(n, s)
depth d circuit size | (no entry)
Decision Tree size | poly(n, s) [O-Servedio 06]

16
Learning Decision Trees
Non-monotone (general) case:
Structural result: Every size s decision tree (# of leaves = s) is ε-close to a decision tree with depth d := log_2(s/ε).
Proof: Truncate to depth d. Probability any input would use a given longer path is ≤ 2^(−d) = ε/s. There are at most s such paths. Use the union bound.
[Figure: example decision tree querying x_1, …, x_5, with ±1 leaves]
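
The truncation argument can be tried on a concrete tree. This is a minimal sketch, assuming a tuple encoding of trees, a hypothetical deep "chain" tree as the test case, and rounding d up to an integer; none of these choices come from the talk.

```python
from itertools import product
from math import ceil, log2

# A tree is either an int leaf (+1 or -1) or a tuple (var, if_minus, if_plus).


def evaluate(tree, x):
    while not isinstance(tree, int):
        var, if_minus, if_plus = tree
        tree = if_plus if x[var] == 1 else if_minus
    return tree


def size(tree):
    """Number of leaves."""
    return 1 if isinstance(tree, int) else size(tree[1]) + size(tree[2])


def depth(tree):
    """Length of the longest root-to-leaf path."""
    return 0 if isinstance(tree, int) else 1 + max(depth(tree[1]), depth(tree[2]))


def truncation_depth(s, eps):
    """d = log_2(s / eps), rounded up (the rounding is an assumption)."""
    return ceil(log2(s / eps))


def truncate(tree, d, default=1):
    """Replace every subtree rooted at depth d by a single default leaf."""
    if isinstance(tree, int):
        return tree
    if d == 0:
        return default
    var, if_minus, if_plus = tree
    return (var, truncate(if_minus, d - 1, default),
            truncate(if_plus, d - 1, default))


def distance(t1, t2, n):
    """Exact fraction of inputs in {-1,+1}^n on which the two trees differ."""
    diff = sum(evaluate(t1, x) != evaluate(t2, x)
               for x in product((-1, 1), repeat=n))
    return diff / 2 ** n


def chain_tree(n):
    """Hypothetical deep tree: query x_0, x_1, ... in order, exit on the first -1."""
    tree = -1
    for var in reversed(range(n)):
        off_path_leaf = 1 if var % 2 == 0 else -1
        tree = (var, off_path_leaf, tree)
    return tree
```

For `chain_tree(10)` (11 leaves, depth 10) and ε = 0.25, the truncation depth is 6, and only inputs that survive six queries can be affected, which is a 2^(−6) fraction of the cube.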

17
Learning Decision Trees
Structural result: Any depth d decision tree can be expressed as a degree d (multilinear) polynomial over R.
Proof: Given a path in the tree, e.g., x_1 = +1, x_3 = −1, x_6 = +1, output +1, there is a degree d expression in the variables which is: 0 if the path is not followed, path-output if the path is followed. For the example path, (+1) · (1 + x_1)/2 · (1 − x_3)/2 · (1 + x_6)/2 works. Now just add these.
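
The proof is constructive, so it can be run. This is a minimal sketch, assuming a path is given as a dict {variable index: required sign}, a polynomial as a dict from frozensets of variables to coefficients, and a hypothetical depth-2 example tree; the names are illustrative.

```python
from fractions import Fraction
from itertools import product


def path_polynomial(path, output):
    """Expand output * prod_i (1 + b_i x_i)/2 for a path {i: b_i}.

    Returns a dict mapping frozenset-of-variables -> coefficient.
    """
    poly = {frozenset(): Fraction(output)}
    for var, sign in path.items():
        expanded = {}
        for monomial, coeff in poly.items():
            # Multiply the monomial by (1 + sign * x_var) / 2.
            for mono, c in ((monomial, coeff / 2),
                            (monomial | {var}, sign * coeff / 2)):
                expanded[mono] = expanded.get(mono, 0) + c
        poly = expanded
    return poly


def tree_polynomial(paths):
    """Add the path polynomials of all root-to-leaf paths."""
    total = {}
    for path, output in paths:
        for mono, c in path_polynomial(path, output).items():
            total[mono] = total.get(mono, 0) + c
    return {m: c for m, c in total.items() if c != 0}


def evaluate(poly, x):
    result = 0
    for mono, c in poly.items():
        term = c
        for var in mono:
            term *= x[var]
        result += term
    return result


def agrees_everywhere(poly, f, n):
    """True if the polynomial equals f on every point of {-1,+1}^n."""
    return all(evaluate(poly, x) == f(x) for x in product((-1, 1), repeat=n))


# Hypothetical depth-2 tree: query x_0; if +1 output x_1, otherwise output x_2.
EXAMPLE_PATHS = [
    ({0: 1, 1: 1}, 1), ({0: 1, 1: -1}, -1),
    ({0: -1, 2: 1}, 1), ({0: -1, 2: -1}, -1),
]
```

`tree_polynomial(EXAMPLE_PATHS)` comes out as x_1/2 + x_0 x_1/2 + x_2/2 − x_0 x_2/2, which has degree 2, matching the depth of the tree.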

18
Learning Decision Trees
Cor: Every size s decision tree is ε-close to a degree log(s/ε) multilinear polynomial.
Least-squares polynomial regression (Low Degree Algorithm):
Draw a bunch of data.
Try to fit it to degree d multilinear polynomial over R.
Minimizing L_2 error is a linear least-squares problem over n^d many variables (the unknown coefficients).
⇒ learn size s DTs in time poly(n^d) = poly(n^(log s)).
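
A small version of the Low Degree Algorithm fits in a few lines. This is a sketch under a stated substitution: instead of calling a general least-squares solver, it estimates each coefficient by an empirical correlation, which approximates the least-squares solution because the monomials are orthonormal under the uniform distribution. The function names, sample size, and target function used with it are illustrative assumptions.

```python
import random
from itertools import combinations, product
from math import prod


def low_degree_fit(samples, n, d):
    """Estimate the coefficient of every monomial of degree <= d.

    samples: list of (x, y) with x a tuple in {-1,+1}^n and y = f(x).
    Each coefficient is the empirical average of y * prod_{i in S} x_i.
    """
    coeffs = {}
    for size in range(d + 1):
        for subset in combinations(range(n), size):
            total = sum(y * prod(x[i] for i in subset) for x, y in samples)
            coeffs[subset] = total / len(samples)
    return coeffs


def hypothesis(coeffs):
    """Return h(x) = sign of the fitted polynomial."""
    def h(x):
        value = sum(c * prod(x[i] for i in s) for s, c in coeffs.items())
        return 1 if value >= 0 else -1
    return h


def draw_examples(f, n, m, rng):
    """m uniform random labelled examples (x, f(x))."""
    xs = [tuple(rng.choice((-1, 1)) for _ in range(n)) for _ in range(m)]
    return [(x, f(x)) for x in xs]


def error_rate(h, f, n):
    """Exact fraction of {-1,+1}^n on which h and f disagree."""
    wrong = sum(h(x) != f(x) for x in product((-1, 1), repeat=n))
    return wrong / 2 ** n
```

A typical use is to draw a few thousand examples of a shallow tree with `draw_examples`, fit with d equal to the tree depth, and measure `error_rate` of `hypothesis(coeffs)` against the target. The number of coefficients grows like n^d, which is where the poly(n^d) running time comes from.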

19
Learning monotone Decision Trees
[O-Servedio 0?]:
1. Structural theorem on DTs: For any size s decision tree (not nec. monotone), the sum of the n degree 1 correlations is at most [formula missing].
2. Easy fact we've seen: For monotone functions, variable correlations = variable influence.
3. Theorem of [Friedgut 96]: If the total influence of f is at most t, then f essentially has at most 2^O(t) relevant variables.
4. Folklore Fourier analysis fact: If the total influence of f is at most t, then f is close to a degree-O(t) polynomial.

20
Learning monotone Decision Trees
Conclusion: If f is monotone and has a size s decision tree, then it has essentially only [formula missing] relevant variables and essentially only degree [formula missing].
Algorithm:
Identify the essentially relevant variables (by correlation estimation).
Run the Polynomial Regression algorithm up to degree [formula missing], but only using those relevant variables.
Total time: [formula missing]

21
Open problem
Learn monotone DNF under uniform in polynomial time!
A source of help: There is a poly-time algorithm for learning almost all randomly chosen monotone DNF of size up to n^3. [Servedio-Jackson 05]
Structured monotone DNF – monotone DTs – are efficiently learnable.
Typical-looking monotone DNF are efficiently learnable (at least up to size n^3).
So… all monotone DNF are efficiently learnable?
I think this problem is great because it is:
a) Possibly tractable.
b) Possibly true.
c) Interesting to complexity theory people.
d) Would close the book on learning monotone fcns under uniform!
