# LEARNING UNDER THE UNIFORM DISTRIBUTION – Toward DNF – Ryan O'Donnell, Microsoft Research, January 2006

Re: How to make \$1000! A Grand of George W.s: A Hundred Hamiltons: A Cool Cleveland:

## The junta learning problem

- $f : \{-1,+1\}^n \to \{-1,+1\}$ is an unknown Boolean function.
- $f$ depends on only $k \ll n$ bits.
- The learner may generate examples $\langle x, f(x) \rangle$, where $x$ is generated uniformly at random.

Task: identify the $k$ relevant variables (equivalently, identify $f$ exactly, or identify one relevant variable).

## Run time efficiency

- Information theoretically: need only $\approx 2^k \log n$ examples.
- Algorithmically: seem to need $n^{\Omega(k)}$ time steps.
  - Naive algorithm: time $n^k$.
  - Best known algorithm: time $n^{.704k}$ [Mossel-O-Servedio 04].

## How to get the money

- Learning $\log n$-juntas in $\mathrm{poly}(n)$ time gets you \$1000.
- Learning $\log \log n$-juntas in $\mathrm{poly}(n)$ time gets you \$1000.
- Learning $\omega(1)$-juntas in $\mathrm{poly}(n)$ time gets you \$200.

The case $k = \log n$ is a subproblem of the problem of learning polynomial-size DNF under the uniform distribution.

http://www.thesmokinggun.com/archive/bushbill1.html

## Algorithmic attempts

- For each $x_i$, measure the empirical correlation with $f(x)$: $\mathbf{E}[f(x)\,x_i]$. Time: $n$.
  - Different from 0 $\Rightarrow$ $x_i$ must be relevant.
  - Converse false: $x_i$ can be influential but uncorrelated. (E.g., $k = 4$, $f$ = "exactly 2 out of 4 bits are $+1$".)
- Try measuring $f$'s correlation with pairs of variables: $\mathbf{E}[f(x)\,x_i x_j]$. Time: $n^2$.
  - Different from 0 $\Rightarrow$ both $x_i$ and $x_j$ must be relevant.
  - Still might not work. (E.g., $k \geq 3$, $f$ = parity on $k$ bits.)
- So try measuring correlation with all triples of variables… Time: $n^3$.
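
The two counterexamples above can be checked by brute force. The sketch below computes the correlations exactly by enumerating all inputs instead of sampling; the function names (`correlation`, `nonzero_correlations`, and the two example targets) are illustrative choices, not from the talk.

```python
from itertools import combinations, product


def correlation(f, k, subset):
    """Exact E[f(x) * prod_{i in subset} x_i] for uniform x in {-1,+1}^k."""
    total = 0
    for x in product((-1, 1), repeat=k):
        term = f(x)
        for i in subset:
            term *= x[i]
        total += term
    return total / 2 ** k


def nonzero_correlations(f, k, degree):
    """All size-`degree` variable subsets whose correlation with f is nonzero."""
    found = {}
    for subset in combinations(range(k), degree):
        c = correlation(f, k, subset)
        if c != 0:
            found[subset] = c
    return found


def exactly_two_of_four(x):
    """+1 iff exactly two of the four input bits are +1."""
    return 1 if sum(1 for b in x if b == 1) == 2 else -1


def parity(x):
    """Product of all input bits."""
    p = 1
    for b in x:
        p *= b
    return p


# Every variable of exactly_two_of_four is relevant, yet no single variable
# is correlated with it; the pairwise test does detect the variables.
print(nonzero_correlations(exactly_two_of_four, 4, 1))
print(nonzero_correlations(exactly_two_of_four, 4, 2))
# Parity on 3 bits is invisible to both the single and the pairwise test.
print(nonzero_correlations(parity, 3, 1), nonzero_correlations(parity, 3, 2))
```

In a real learner these expectations would be estimated from random examples, so "different from 0" becomes "larger in magnitude than the sampling error".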

## A result

In time $n^d$, you can check correlation with all $d$-bit functions. What kind of Boolean functions on $k$ bits could be uncorrelated with all functions on $d$ or fewer bits? (Well, parities on $> d$ bits, e.g.…)

- [Mossel-O-Servedio 04] proves a structure theorem about such functions. (They must be expressible as parities of ANDs of small size.)
- Can apply a parity-learning algorithm in that case.
- End result: an algorithm running in time $n^{.704k}$.

Uniform-distribution learning results are often implied by structural results about Boolean functions.

## PAC Learning

- There is an unknown $f : \{-1,+1\}^n \to \{-1,+1\}$.
- Algorithm gets i.i.d. examples $\langle x, f(x) \rangle$, with $x$ drawn from the uniform distribution (rather than from an unknown distribution).
- Task: learn. Given $\epsilon$, find a hypothesis function $h$ which is (w.h.p.) $\epsilon$-close to $f$.
- Goal: running-time efficiency.

## Running-time efficiency

- The more complex $f$ is, the more time it's fair to allow.
- Fix some measure of complexity or size, $s = s(f)$ (e.g., the size of the smallest DNF formula).
- Goal: run in time $\mathrm{poly}(n, 1/\epsilon, s)$.
- Often focus on fixing $s = \mathrm{poly}(n)$ and learning in $\mathrm{poly}(n)$ time.

## The junta problem

Fits into the formulation (slightly strangely):

- $\epsilon$ is fixed to 0. (Equivalently, $2^{-k}$.)
- Measure of size is $2^{(\#\text{ of relevant variables})}$: $s = 2^k$.
- [Mossel-O-Servedio 04] had running time essentially $n^{.704 \log_2 s}$.
- Even under this extremely conservative notion of size, we don't know how to learn in $\mathrm{poly}(n)$ time for $s = \mathrm{poly}(n)$.

| complexity measure $s$ | fastest known algorithm |
|---|---|
| DNF size | $n^{O(\log s)}$ [V 90] |
| $2^{(\#\text{ of relevant variables})}$ | $n^{.704 \log_2 s}$ [MOS 04] |
| depth $d$ circuit size | $n^{O(\log^{d-1} s)}$ [LMN 93, H 02] |
| decision tree size | $n^{O(\log s)}$ [EH 89] |

- Assuming factoring is hard, $n^{\log^{\Omega(d)} s}$ time is necessary, even with queries. [K 93]
- Any algorithm that works in the Statistical Query model requires time $n^k$. [BF 02]

## What to do?

1. Give Learner extra help:

- Queries: Learner can ask for $f(x)$ for any $x$. $\Rightarrow$ Can learn DNF in time $\mathrm{poly}(n, s)$. [Jackson 94]
- More structured data:
  - Examples are not i.i.d., but are generated by a standard random walk.
  - Examples come in pairs, $\langle x, f(x) \rangle$, $\langle x', f(x') \rangle$, where $x$, $x'$ share a $> \tfrac{1}{2}$ fraction of coordinates.
  - $\Rightarrow$ Can learn DNF in time $\mathrm{poly}(n, s)$. [Bshouty-Mossel-O-Servedio 05]

## What to do? (rest of the talk)

2. Give up on trying to learn all functions.

Rest of the talk: focus on just learning monotone functions.

- $f$ is monotone $\Leftrightarrow$ changing a $-1$ to a $+1$ in the input can only make $f$ go from $-1$ to $+1$, not the reverse.
- Long history in PAC learning [HM91, KLV94, KMP 94, B95, BT96, BCL98, V98, SM00, S04, JS05...].
- $f$ has DNF size $s$ and is monotone $\Rightarrow$ $f$ has a size $s$ monotone DNF.

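## Why does monotonicity help?

1. More structured.
2. You can identify relevant variables.

Fact: If $f$ is monotone, then $f$ depends on $x_i$ iff it has correlation with $x_i$; i.e., $\mathbf{E}[f(x)\,x_i] \neq 0$.

Proof: If $f$ is monotone, its variables have only nonnegative correlations. Concretely, writing $x^{i \to b}$ for $x$ with coordinate $i$ set to $b$, the correlation $\mathbf{E}[f(x)\,x_i]$ is the average of $\tfrac{1}{2}\bigl(f(x^{i \to +1}) - f(x^{i \to -1})\bigr)$. Every such term is $\geq 0$, so the average is 0 only if changing $x_i$ never changes $f$.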
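
The fact can be illustrated on a tiny monotone function. The sketch below uses majority of three bits padded with one irrelevant fourth bit (a choice made here for illustration, as are all function names) and compares each variable's exact correlation with its influence, meaning the probability that flipping the variable changes $f$.

```python
from itertools import product


def correlation(f, n, i):
    """Exact E[f(x) * x_i] for uniform x in {-1,+1}^n."""
    return sum(f(x) * x[i] for x in product((-1, 1), repeat=n)) / 2 ** n


def influence(f, n, i):
    """Fraction of inputs x for which flipping coordinate i changes f(x)."""
    changed = 0
    for x in product((-1, 1), repeat=n):
        flipped = x[:i] + (-x[i],) + x[i + 1:]
        if f(x) != f(flipped):
            changed += 1
    return changed / 2 ** n


def majority3_padded(x):
    """Monotone: majority of x[0], x[1], x[2]; x[3] is ignored."""
    return 1 if x[0] + x[1] + x[2] > 0 else -1


for i in range(4):
    print(i, correlation(majority3_padded, 4, i), influence(majority3_padded, 4, i))
```

For this function the three relevant variables each show correlation and influence 0.5, and the ignored variable shows 0 for both, which is the behaviour the fact predicts.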

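Fastest known algorithms in the monotone case (compare the earlier table for any function):

| complexity measure $s$ | fastest known algorithm, monotone case |
|---|---|
| DNF size | $\mathrm{poly}(n, s^{\log s})$ [Servedio 04] |
| $2^{(\#\text{ of relevant variables})}$ | $\mathrm{poly}(n, 2^k) = \mathrm{poly}(n, s)$ |
| depth $d$ circuit size | |
| decision tree size | $\mathrm{poly}(n, s)$ [O-Servedio 06] |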

## Learning Decision Trees

Non-monotone (general) case.

Structural result: Every size $s$ decision tree (# of leaves = $s$) is $\epsilon$-close to a decision tree with depth $d := \log_2(s/\epsilon)$.

Proof: Truncate to depth $d$. The probability that a random input follows any particular path longer than $d$ is $\leq 2^{-d} = \epsilon/s$. There are at most $s$ such paths. Use the union bound.

[Figure: an example decision tree over the variables $x_1, \dots, x_5$ with $\pm 1$ leaves.]
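
Written out, the union bound in the proof reads as follows. The concrete numbers in the comment ($s = 64$, $\epsilon = 1/16$) are chosen here purely for illustration.

```latex
\[
\Pr_x[\text{truncated tree} \neq \text{original tree on } x]
\;\le\; \sum_{\text{paths } P \text{ longer than } d} \Pr_x[x \text{ follows } P \text{ down to depth } d]
\;\le\; s \cdot 2^{-d}
\;=\; s \cdot \frac{\epsilon}{s}
\;=\; \epsilon .
\]
% Illustration: s = 64 and \epsilon = 1/16 give d = \log_2(64 \cdot 16) = 10,
% and the error is at most 64 \cdot 2^{-10} = 1/16.
```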

## Learning Decision Trees

Structural result: Any depth $d$ decision tree can be expressed as a degree $d$ (multilinear) polynomial over $\mathbb{R}$.

Proof: Given a path in the tree, e.g., "$x_1 = +1$, $x_3 = -1$, $x_6 = +1$, output $+1$", there is a degree $d$ expression in the variables which is:

- 0 if the path is not followed,
- the path's output if the path is followed.

Now just add these.
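
For the example path in the proof, one such expression can be written explicitly. This is a sketch of the standard construction, assuming no variable is queried twice along a path (which is what keeps the product multilinear); the symbol $p_P$ is notation introduced here.

```latex
\[
p_P(x) \;=\; (+1)\cdot\frac{1 + x_1}{2}\cdot\frac{1 - x_3}{2}\cdot\frac{1 + x_6}{2},
\qquad
f(x) \;=\; \sum_{\text{paths } P} p_P(x).
\]
% Each factor equals 1 when its variable takes the value required by the path
% and 0 otherwise, so p_P(x) is the path's output on inputs that follow P and
% 0 on all other inputs. Its degree is the length of the path, at most d.
```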

## Learning Decision Trees

Cor: Every size $s$ decision tree is $\epsilon$-close to a degree $\log(s/\epsilon)$ multilinear polynomial.

Least-squares polynomial regression ("Low Degree Algorithm"):

- Draw a bunch of data.
- Try to fit it to a degree $d$ multilinear polynomial over $\mathbb{R}$.
- Minimizing $L_2$ error is a linear least-squares problem over $n^d$ many variables (the unknown coefficients).

$\Rightarrow$ learn size $s$ DTs in time $\mathrm{poly}(n^d) = \mathrm{poly}(n^{\log s})$.
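
A minimal sketch of the idea follows. Note the swap: instead of solving the least-squares problem, it estimates each low-degree coefficient by an empirical average of $f(x)\prod_{i \in S} x_i$. Under the uniform distribution the monomials are orthonormal, so these averages are what the least-squares fit converges to. The target tree, sample size, and function names are assumptions made for illustration.

```python
import random
from itertools import combinations, product


def chi(x, subset):
    """Monomial prod_{i in subset} x_i."""
    p = 1
    for i in subset:
        p *= x[i]
    return p


def low_degree_learn(examples, n, d):
    """Estimate every coefficient of degree <= d; return the sign of the fit."""
    m = len(examples)
    subsets = [S for r in range(d + 1) for S in combinations(range(n), r)]
    coeffs = {S: sum(y * chi(x, S) for x, y in examples) / m for S in subsets}

    def hypothesis(x):
        value = sum(c * chi(x, S) for S, c in coeffs.items())
        return 1 if value >= 0 else -1

    return hypothesis


def target(x):
    """A depth-2 decision tree: query x[0], then output x[1] or x[2]."""
    return x[1] if x[0] == 1 else x[2]


def draw_examples(f, n, m, seed):
    """m uniform random labelled examples (x, f(x)), reproducible via seed."""
    rng = random.Random(seed)
    examples = []
    for _ in range(m):
        x = tuple(rng.choice((-1, 1)) for _ in range(n))
        examples.append((x, f(x)))
    return examples


h = low_degree_learn(draw_examples(target, 6, 4000, 0), 6, 2)
agreement = sum(h(x) == target(x) for x in product((-1, 1), repeat=6)) / 2 ** 6
print(agreement)
```

The number of coefficients is the number of subsets of size at most $d$, which is where the $n^d$ in the running time comes from.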

## Learning monotone Decision Trees

[O-Servedio 0?]:

1. Structural theorem on DTs: For any size $s$ decision tree (not necessarily monotone), the sum of the $n$ degree-1 correlations is at most $\sqrt{\log s}$.
2. Easy fact we've seen: For monotone functions, variable correlations = variable influence.
3. Theorem of [Friedgut 96]: If the total influence of $f$ is at most $t$, then $f$ essentially has at most $2^{O(t)}$ relevant variables.
4. Folklore Fourier analysis fact: If the total influence of $f$ is at most $t$, then $f$ is close to a degree-$O(t)$ polynomial.

## Learning monotone Decision Trees

Conclusion: If $f$ is monotone and has a size $s$ decision tree, then it has essentially only $2^{O(\sqrt{\log s})}$ relevant variables and essentially only degree $O(\sqrt{\log s})$.

Algorithm:

- Identify the essentially relevant variables (by correlation estimation).
- Run the Polynomial Regression algorithm up to degree $O(\sqrt{\log s})$, but only using those relevant variables.

Total time: $\mathrm{poly}(n, s)$.
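
The first step of the algorithm, picking out variables by correlation estimation, might look like the sketch below. The threshold value, the sample size, the example tree, and the function names are all assumptions made for illustration; the talk does not specify them.

```python
import random


def estimate_relevant(examples, n, threshold):
    """Indices whose empirical correlation E[f(x) * x_i] exceeds threshold.

    For monotone f all correlations are nonnegative, so only the positive
    side needs to be tested.
    """
    m = len(examples)
    relevant = []
    for i in range(n):
        corr = sum(y * x[i] for x, y in examples) / m
        if corr > threshold:
            relevant.append(i)
    return relevant


def monotone_tree(x):
    """A small monotone decision tree on x[0], x[1], x[2]."""
    if x[0] == -1:
        return -1
    if x[1] == 1:
        return 1
    return x[2]


def draw_examples(f, n, m, seed):
    """m uniform random labelled examples (x, f(x)), reproducible via seed."""
    rng = random.Random(seed)
    examples = []
    for _ in range(m):
        x = tuple(rng.choice((-1, 1)) for _ in range(n))
        examples.append((x, f(x)))
    return examples


# 8 input variables, of which only the first three matter.
print(estimate_relevant(draw_examples(monotone_tree, 8, 4000, 1), 8, 0.1))
```

Here the true correlations are 0.75, 0.25, and 0.25 for the three relevant variables and 0 for the rest, so a threshold of 0.1 separates them comfortably at this sample size. The regression step would then be run over the selected variables only.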

## Open problem

Learn monotone DNF under uniform in polynomial time!

A source of help: there is a poly-time algorithm for learning almost all randomly chosen monotone DNF of size up to $n^3$. [Servedio-Jackson 05]

- Structured monotone DNF – monotone DTs – are efficiently learnable.
- Typical-looking monotone DNF are efficiently learnable (at least up to size $n^3$).
- So… all monotone DNF are efficiently learnable?

I think this problem is great because it is:

- a) Possibly tractable.
- b) Possibly true.
- c) Interesting to complexity theory people.
- d) Would close the book on learning monotone functions under uniform!
