# Ryan O’Donnell (Microsoft), Mike Saks (Rutgers), Oded Schramm (Microsoft), Rocco Servedio (Columbia)


Part I: Decision trees have large influences

[Figure: "Printer troubleshooter" decision tree. Internal nodes ask: Does anything print? Can print from Notepad? Right size paper? Printer mis-setup? File too complicated? Network printer? Driver OK? Leaves: "Solved" or "Call tech support".]

Decision tree complexity. $f : \{\mathrm{Attr}_1\} \times \{\mathrm{Attr}_2\} \times \cdots \times \{\mathrm{Attr}_n\} \to \{-1,1\}$. What’s the “best” DT for $f$, and how to find it? Depth = worst-case # of questions. Expected depth = avg. # of questions.

Building decision trees.

1. Identify the most ‘influential’/‘decisive’/‘relevant’ variable.
2. Put it at the root.
3. Recursively build DTs for its children.

Almost all real-world learning algs are based on this: CART, C4.5, … Almost no theoretical (PAC-style) learning algs are based on this: [Blum92, KM93, BBVKV97, PTF-folklore, OS04] no; [EH89, SJ03] sorta. Conj’d to be good for some problems (e.g., percolation [SS04]) but unprovable…

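Boolean DTs. $f : \{-1,1\}^n \to \{-1,1\}$. $D(f)$ = min depth of a DT for $f$. $0 \le D(f) \le n$. [Figure: a decision tree for $\mathrm{Maj}_3$ that queries $x_1$ at the root, then $x_2$, and $x_3$ only where needed; leaves are labelled $\pm 1$.]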
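For very small $n$, the definition of $D(f)$ can be evaluated by brute force. The sketch below is an illustration only, not a method from the talk: it uses the recursion "a constant function needs depth 0; otherwise query some variable and pay for the worse of the two restrictions". The names `min_depth` and `maj3` are hypothetical helpers.

```python
from functools import lru_cache
from itertools import product

def min_depth(f, n):
    """Brute-force D(f) for f on {-1,1}^n (only feasible for tiny n)."""

    @lru_cache(maxsize=None)
    def solve(fixed):
        # `fixed` has length n; entry 0 means "not yet queried".
        free = [i for i in range(n) if fixed[i] == 0]
        values = set()
        for bits in product((-1, 1), repeat=len(free)):
            x = list(fixed)
            for i, b in zip(free, bits):
                x[i] = b
            values.add(f(tuple(x)))
        if len(values) == 1:
            return 0  # the restricted function is constant: a leaf suffices
        # Query the best variable; the adversary picks the worse answer.
        return 1 + min(
            max(solve(fixed[:i] + (b,) + fixed[i + 1:]) for b in (-1, 1))
            for i in free
        )

    return solve((0,) * n)

def maj3(x):
    return 1 if sum(x) > 0 else -1

print(min_depth(maj3, 3))              # 3
print(min_depth(lambda x: x[0], 3))    # 1 (a dictator needs one query)
```

The search is exponential in $n$, so it is only meant for sanity checks on toy functions.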

Boolean DTs. $\{-1,1\}^n$ is viewed as a probability space, with the uniform probability distribution. A uniformly random path down a DT, plus a uniformly random setting of the unqueried variables, defines a uniformly random input. Expected depth: $\delta(f)$.

Influences. The influence of coordinate $j$ on $f$ is the probability that $x_j$ is relevant for $f$: $I_j(f) = \Pr[f(x) \ne f(x^{(\oplus j)})]$, where $x^{(\oplus j)}$ is $x$ with coordinate $j$ flipped. $0 \le I_j(f) \le 1$.
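The definition of influence can be checked directly on small functions by enumerating all inputs. The following is a minimal sketch; `influence` and `maj3` are hypothetical helper names, and coordinates are 0-indexed in the code.

```python
from itertools import product

def influence(f, n, j):
    """I_j(f) = Pr[f(x) != f(x with coordinate j flipped)], x uniform on {-1,1}^n."""
    flips = 0
    for x in product((-1, 1), repeat=n):
        y = x[:j] + (-x[j],) + x[j + 1:]  # x with coordinate j flipped
        if f(x) != f(y):
            flips += 1
    return flips / 2 ** n

def maj3(x):
    return 1 if sum(x) > 0 else -1

# Each coordinate of Maj_3 matters exactly when the other two disagree.
print([influence(maj3, 3, j) for j in range(3)])  # [0.5, 0.5, 0.5]
```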

Main question: If a function f has a “shallow” decision tree, does it have a variable with “significant” influence?

Main question: No. But for a silly reason: Suppose $f$ is highly biased; say $\Pr[f = 1] = p \ll 1$. Then for any $j$,

$$I_j(f) = \Pr[f(x) = 1,\ f(x^{(\oplus j)}) = -1] + \Pr[f(x) = -1,\ f(x^{(\oplus j)}) = 1] \le \Pr[f(x) = 1] + \Pr[f(x^{(\oplus j)}) = 1] \le p + p = 2p.$$

Variance. ⇒ Influences are always at most $2\min\{p,q\}$, where $q = 1 - p = \Pr[f = -1]$. Analytically nicer expression: $\mathrm{Var}[f]$.

$$\mathrm{Var}[f] = E[f^2] - E[f]^2 = 1 - (p - q)^2 = 1 - (2p - 1)^2 = 4p(1 - p) = 4pq.$$

$2\min\{p,q\} \le 4pq \le 4\min\{p,q\}$. It’s 1 for balanced functions. So $I_j(f) \le \mathrm{Var}[f]$, and it is fair to say $I_j(f)$ is “significant” if it’s a significant fraction of $\mathrm{Var}[f]$.
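The comparison between $2\min\{p,q\}$ and $4pq$ can be spelled out in a few lines. This is a sketch of the intermediate steps, assuming $q$ denotes $1 - p = \Pr[f = -1]$:

$$
\begin{aligned}
E[f] &= p \cdot 1 + q \cdot (-1) = p - q, \qquad E[f^2] = 1,\\
\mathrm{Var}[f] &= 1 - (p - q)^2 = (p + q)^2 - (p - q)^2 = 4pq,\\
4pq &= 4 \min\{p,q\} \cdot \max\{p,q\}, \qquad \tfrac{1}{2} \le \max\{p,q\} \le 1,\\
\Rightarrow\quad 2\min\{p,q\} &\le 4pq \le 4\min\{p,q\}.
\end{aligned}
$$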

Main question: If a function f has a “shallow” decision tree, does it have a variable with influence at least a “significant” fraction of Var[f]?

Notation. $\tau(d) = \min_{f \,:\, D(f) \le d}\ \max_j \{ I_j(f) / \mathrm{Var}[f] \}$.

Known lower bounds. Suppose $f : \{-1,1\}^n \to \{-1,1\}$. An elementary old inequality states $\mathrm{Var}[f] \le \sum_{j=1}^{n} I_j(f)$. Thus $f$ has a variable with influence at least $\mathrm{Var}[f]/n$. A deep inequality of [KKL88] shows there is always a coord. $j$ such that $I_j(f) \ge \mathrm{Var}[f] \cdot \Omega(\log n / n)$. If $D(f) = d$ then $f$ really has at most $2^d$ variables. Hence we get $\tau(d) \ge 1/2^d$ from the first, and $\tau(d) \ge \Omega(d/2^d)$ from KKL.
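The elementary inequality, and the consequence that some variable has influence at least $\mathrm{Var}[f]/n$, can be confirmed exhaustively for tiny $n$. The sketch below checks all 256 functions on $n = 3$ bits using exact rational arithmetic; the helper names are hypothetical and the choice $n = 3$ is only for illustration.

```python
from fractions import Fraction
from itertools import product

N = 3
POINTS = list(product((-1, 1), repeat=N))

def flip(x, j):
    return x[:j] + (-x[j],) + x[j + 1:]

def variance(table):
    mean = Fraction(sum(table[x] for x in POINTS), len(POINTS))
    return 1 - mean ** 2  # f takes values in {-1, 1}, so E[f^2] = 1

def influence(table, j):
    flips = sum(table[x] != table[flip(x, j)] for x in POINTS)
    return Fraction(flips, len(POINTS))

def check_all_functions():
    """Check Var[f] <= sum_j I_j(f) for every f; return (#functions, min of max_j I_j / Var)."""
    checked, min_ratio = 0, None
    for outputs in product((-1, 1), repeat=len(POINTS)):
        table = dict(zip(POINTS, outputs))
        var = variance(table)
        infl = [influence(table, j) for j in range(N)]
        assert var <= sum(infl)
        if var > 0:  # skip the two constant functions
            ratio = max(infl) / var
            min_ratio = ratio if min_ratio is None else min(min_ratio, ratio)
        checked += 1
    return checked, min_ratio

checked, min_ratio = check_all_functions()
print(checked, min_ratio)
```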

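Our result. $\tau(d) \ge 1/d$. This is tight: [Figure: the decision tree “SEL” on $x_1, x_2, x_3$; the root queries $x_1$, its two children query $x_2$ and $x_3$, and the leaves are labelled $\pm 1$.] Then $\mathrm{Var}[\mathrm{SEL}] = 1$, $d = 2$, and all three variables have infl. ½. (Forming the recursive version, SEL(SEL, SEL, SEL) etc., gives a Var-1 fcn with $d = 2^h$ and all influences $2^{-h}$, for any $h$.)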
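The claims about SEL and its recursive version can be verified by enumeration for $h = 1, 2$. This is a sketch under an assumption: the code takes $x_1 = -1$ to select $x_2$ and $x_1 = +1$ to select $x_3$ (the slide's figure fixes the actual convention, and the variance and influences are the same under either choice). `sel_h`, `variance`, and `influence` are hypothetical helper names.

```python
from itertools import product

def sel(x1, x2, x3):
    # Assumed convention: x1 = -1 selects x2, x1 = +1 selects x3.
    return x2 if x1 == -1 else x3

def sel_h(x):
    """Recursive SEL on a tuple of 3**h values in {-1, 1}."""
    if len(x) == 1:
        return x[0]
    k = len(x) // 3
    return sel(sel_h(x[:k]), sel_h(x[k:2 * k]), sel_h(x[2 * k:]))

def variance(f, n):
    mean = sum(f(x) for x in product((-1, 1), repeat=n)) / 2 ** n
    return 1 - mean ** 2

def influence(f, n, j):
    flips = 0
    for x in product((-1, 1), repeat=n):
        y = x[:j] + (-x[j],) + x[j + 1:]
        flips += f(x) != f(y)
    return flips / 2 ** n

for h in (1, 2):
    n = 3 ** h
    print(h, variance(sel_h, n), {influence(sel_h, n, j) for j in range(n)})
```

For $h = 1$ every influence is $1/2$ and for $h = 2$ every influence is $1/4$, with variance 1 in both cases.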

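Our actual main theorem. Given a decision tree $f$, let $\delta_j(f) = \Pr[\text{tree queries } x_j]$. Then

$$\mathrm{Var}[f] \le \sum_{j=1}^{n} \delta_j(f)\, I_j(f).$$

Cor: Fix the tree with smallest expected depth. Then $\sum_{j=1}^{n} \delta_j(f) = E[\text{depth of a path}] =: \delta(f) \le D(f)$.

$$\Rightarrow\ \mathrm{Var}[f] \le \max_j I_j \cdot \sum_{j=1}^{n} \delta_j = \max_j I_j \cdot \delta(f) \ \Rightarrow\ \max_j I_j \ge \mathrm{Var}[f] / \delta(f) \ge \mathrm{Var}[f] / D(f).$$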
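The theorem and its corollary can be illustrated on a concrete tree. The sketch below uses a hypothetical nested-tuple encoding of a depth-3 tree for $\mathrm{Maj}_3$ (an assumption for illustration, with 0-indexed variables) and assumes no variable is queried twice on a path, so that $\delta_j$ is simply the total probability of reaching nodes labelled $x_j$.

```python
from itertools import product

# A tree is a leaf (-1 or 1) or a tuple (j, subtree_if_xj_is_-1, subtree_if_xj_is_+1).
MAJ3_TREE = (0,
             (1, -1, (2, -1, 1)),
             (1, (2, -1, 1), 1))

def evaluate(tree, x):
    while isinstance(tree, tuple):
        j, lo, hi = tree
        tree = lo if x[j] == -1 else hi
    return tree

def query_probs(tree, n, reach=1.0, out=None):
    """delta_j = Pr[tree queries x_j] under a uniformly random input."""
    if out is None:
        out = [0.0] * n
    if isinstance(tree, tuple):
        j, lo, hi = tree
        out[j] += reach                      # this node is reached with prob. `reach`
        query_probs(lo, n, reach / 2, out)   # each answer has probability 1/2
        query_probs(hi, n, reach / 2, out)
    return out

def variance(f, n):
    mean = sum(f(x) for x in product((-1, 1), repeat=n)) / 2 ** n
    return 1 - mean ** 2

def influence(f, n, j):
    flips = 0
    for x in product((-1, 1), repeat=n):
        y = x[:j] + (-x[j],) + x[j + 1:]
        flips += f(x) != f(y)
    return flips / 2 ** n

n = 3
f = lambda x: evaluate(MAJ3_TREE, x)
delta = query_probs(MAJ3_TREE, n)             # [1.0, 1.0, 0.5]
infl = [influence(f, n, j) for j in range(n)]
rhs = sum(d * i for d, i in zip(delta, infl))
print(variance(f, n), "<=", rhs)              # 1.0 <= 1.25
print(max(infl), ">=", variance(f, n) / sum(delta))  # 0.5 >= 0.4
```

Here $\delta(f) = \sum_j \delta_j = 2.5$, so the corollary guarantees a variable of influence at least $0.4$, and every variable of $\mathrm{Maj}_3$ has influence $0.5$.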

Proof. Pick a random path in the tree. This gives some set of variables, $P = (x_{J_1}, \ldots, x_{J_T})$, along with an assignment to them, $\beta_P$. Call the remaining set of variables $\bar{P}$ and pick a random assignment $\beta_{\bar{P}}$ for them too. Let $X$ be the (uniformly random) string given by combining these two assignments, $(\beta_P, \beta_{\bar{P}})$. Also, define $J_{T+1}, \ldots, J_n = \perp$.

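Proof. Let $\beta'_P$ be an independent random asgn to the vbls in $P$. Let $Z = (\beta'_P, \beta_{\bar{P}})$. Note: $Z$ is also uniformly random. [Figure: a root-to-leaf path querying $x_{J_1}, x_{J_2}, x_{J_3}, \ldots, x_{J_T}$, with $J_{T+1} = \cdots = J_n = \perp$. The strings $X$ and $Z$ are shown below it: they may differ on the coordinates in $P$ and agree on the coordinates in $\bar{P}$.]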

Proof. Finally, for $t = 0 \ldots T$, let $Y_t$ be the same string as $X$, except that $Z$’s assignments ($\beta'_P$) for variables $x_{J_1}, \ldots, x_{J_t}$ are swapped in. Note: $Y_0 = X$, $Y_T = Z$.

$$
\begin{array}{rcl}
Y_0 = X &=& (-1,\ \ 1,\ -1,\ \ldots,\ \ 1,\ \ 1,\ -1,\ 1,\ -1)\\
Y_1 &=& (\ \ 1,\ \ 1,\ -1,\ \ldots,\ \ 1,\ \ 1,\ -1,\ 1,\ -1)\\
Y_2 &=& (\ \ 1,\ -1,\ -1,\ \ldots,\ \ 1,\ \ 1,\ -1,\ 1,\ -1)\\
&\vdots&\\
Y_T = Z &=& (\ \ 1,\ -1,\ -1,\ \ldots,\ -1,\ \ 1,\ -1,\ 1,\ -1)
\end{array}
$$

Also define $Y_{T+1} = \cdots = Y_n = Z$.

$$
\begin{aligned}
\mathrm{Var}[f] &= E[f^2] - E[f]^2\\
&= E[f(X)f(X)] - E[f(X)f(Z)]\\
&= E[f(X)f(Y_0) - f(X)f(Y_n)]\\
&= \sum_{t=1}^{n} E\big[f(X)\,(f(Y_{t-1}) - f(Y_t))\big]\\
&\le \sum_{t=1}^{n} E\big[\,|f(Y_{t-1}) - f(Y_t)|\,\big]\\
&= \sum_{t=1}^{n} 2\Pr[f(Y_{t-1}) \ne f(Y_t)]\\
&= \sum_{t=1}^{n} \sum_{j=1}^{n} \Pr[J_t = j] \cdot 2\Pr[f(Y_{t-1}) \ne f(Y_t) \mid J_t = j].
\end{aligned}
$$

Proof.

$$\ldots = \sum_{t=1}^{n} \sum_{j=1}^{n} \Pr[J_t = j] \cdot 2\Pr[f(Y_{t-1}) \ne f(Y_t) \mid J_t = j].$$

Utterly Crucial Observation: Conditioned on $J_t = j$, $(Y_{t-1}, Y_t)$ are jointly distributed exactly as $(W, W')$, where $W$ is uniformly random and $W'$ is $W$ with the $j$th bit rerandomized.

[Figure: the strings $Y_0 = X, Y_1, Y_2, \ldots, Y_T = Z$ from the earlier slide, shown next to the random path $x_{J_1}, \ldots, x_{J_T}$ and the strings $X$ and $Z$.]

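Proof.

$$
\begin{aligned}
\ldots &= \sum_{t=1}^{n} \sum_{j=1}^{n} \Pr[J_t = j] \cdot 2\Pr[f(Y_{t-1}) \ne f(Y_t) \mid J_t = j]\\
&= \sum_{t=1}^{n} \sum_{j=1}^{n} \Pr[J_t = j] \cdot 2\Pr[f(W) \ne f(W')]\\
&= \sum_{t=1}^{n} \sum_{j=1}^{n} \Pr[J_t = j] \cdot I_j(f)\\
&= \sum_{j=1}^{n} I_j \cdot \sum_{t=1}^{n} \Pr[J_t = j]\\
&= \sum_{j=1}^{n} I_j\, \delta_j.
\end{aligned}
$$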
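Two small steps in this chain are worth writing out. The following is a sketch, assuming (as is standard) that no root-to-leaf path queries the same variable twice. Rerandomizing bit $j$ leaves $W$ unchanged with probability ½ and flips bit $j$ with probability ½, so

$$\Pr[f(W) \ne f(W')] = \tfrac{1}{2} \cdot 0 + \tfrac{1}{2} \Pr[f(W) \ne f(W^{(\oplus j)})] = \tfrac{1}{2} I_j(f), \qquad\text{hence}\qquad 2\Pr[f(W) \ne f(W')] = I_j(f).$$

Likewise, the events $\{J_t = j\}$ for $t = 1, \ldots, n$ are disjoint and their union is the event that the path queries $x_j$, so

$$\sum_{t=1}^{n} \Pr[J_t = j] = \Pr[\text{tree queries } x_j] = \delta_j.$$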

Part II: Lower bounds for monotone graph properties

Monotone graph properties. Consider graphs on $v$ vertices; let $n = \binom{v}{2}$. “Nontrivial monotone graph property”:

- “nontrivial property”: a (nonempty, nonfull) subset of all $v$-vertex graphs
- “graph property”: closed under permutations of the vertices (⇒ no edge is ‘distinguished’)
- monotone: adding edges can only put you into the property, not take you out

e.g.: Contains-A-Triangle, Connected, Has-Hamiltonian-Path, Non-Planar, Has-at-least-n/2-edges, …

Aanderaa-Karp-Rosenberg conj. Every nontrivial monotone graph property has $D(f) = n$. [Rivest-Vuillemin-75]: $\ge v^2/16$. [Kleitman-Kwiatkowski-80]: $\ge v^2/9$. [Kahn-Saks-Sturtevant-84]: $\ge n/2$, and $= n$ if $v$ is a prime power. [Topology + group theory!] [Yao-88]: $= n$ in the bipartite case.

Randomized DTs. Have ‘coin flip’ nodes in the trees that cost nothing. Or, a probability distribution over deterministic DTs. Note: We want both 0-sided error and worst-case input. $R(f)$ = min, over randomized DTs that compute $f$ with 0 error, of the max over inputs $x$ of the expected # of queries. The expectation is only over the DT’s internal coins.

$\mathrm{Maj}_3$: $D(\mathrm{Maj}_3) = 3$. Pick two inputs at random, check if they’re the same. If not, check the 3rd. ⇒ $R(\mathrm{Maj}_3) \le 8/3$. Let $f$ = recursive-$\mathrm{Maj}_3$ [$\mathrm{Maj}_3(\mathrm{Maj}_3, \mathrm{Maj}_3, \mathrm{Maj}_3)$, etc.]. For the depth-$h$ version ($n = 3^h$), $D(f) = 3^h$. $R(f) \le (8/3)^h$. (Not best possible…!)
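The $8/3$ bound for $\mathrm{Maj}_3$ can be reproduced by averaging over the three equally likely choices of the first pair, for every input. This is a minimal sketch of that calculation; `run` and `expected_queries` are hypothetical helper names, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import combinations, product

PAIRS = list(combinations(range(3), 2))  # the three equally likely first pairs

def run(x, pair):
    """Return (output, number_of_queries) when `pair` is queried first."""
    a, b = pair
    if x[a] == x[b]:
        return x[a], 2       # two equal values already form a majority
    c = 3 - a - b            # index of the remaining input
    return x[c], 3           # the third value breaks the tie

def expected_queries(x):
    return Fraction(sum(run(x, pair)[1] for pair in PAIRS), len(PAIRS))

def maj3(x):
    return 1 if sum(x) > 0 else -1

inputs = list(product((-1, 1), repeat=3))
# Zero error: the procedure is correct for every input and every coin outcome.
assert all(run(x, pair)[0] == maj3(x) for x in inputs for pair in PAIRS)
worst = max(expected_queries(x) for x in inputs)
print(worst)  # 8/3
```

The worst case is any input with a 2-to-1 split: only one of the three pairs agrees, giving $\tfrac{1}{3} \cdot 2 + \tfrac{2}{3} \cdot 3 = 8/3$.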

Randomized AKR / Yao conj. Yao conjectured in ’77 that every nontrivial monotone graph property $f$ has $R(f) \ge \Omega(v^2)$.

| Lower bound $\Omega(\cdot)$ | Who |
| --- | --- |
| $v$ | [Yao-77] |
| $v \log^{1/12} v$ | [Yao-87] |
| $v^{5/4}$ | [King-88] |
| $v^{4/3}$ | [Hajnal-91] |
| $v^{4/3} \log^{1/3} v$ | [Chakrabarti-Khot-01] |
| $\min\{ v/p,\ v^2/\log v \}$ | [Fried.-Kahn-Wigd.-02] |
| $v^{4/3} / p^{1/3}$ | [us] |

Outline.

- Extend main inequality to the $p$-biased case. (Then LHS is 1.)
- Use Yao’s minmax principle: Show that under $p$-biased $\{-1,1\}^n$, $\delta = \sum_j \delta_j$ = avg # queries is large for any tree.
- Main inequality: max influence is small ⇒ $\delta$ is large.
- Graph property ⇒ all vbls have the same influence.
- Hence: sum of influences is small ⇒ $\delta$ is large.
- [OS04]: $f$ monotone ⇒ sum of influences $\le \sqrt{\delta}$.
- Hence: sum of influences is large ⇒ $\delta$ is large.
- So either way, $\delta$ is large.
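The "either way" step can be sketched as a short calculation. This is a simplified version under stated assumptions: constants and all dependence on $p$ are suppressed, $\mathrm{Var}[f]$ is treated as 1, and $I := \sum_j I_j$ denotes the sum of influences (notation introduced here, not in the slides). Since all $n$ variables have the same influence $I/n$, the main inequality gives

$$1 \approx \mathrm{Var}[f] \le \sum_{j=1}^{n} \delta_j I_j = \frac{I}{n} \sum_{j=1}^{n} \delta_j = \frac{I\,\delta}{n} \quad\Rightarrow\quad \delta \gtrsim \frac{n}{I},$$

while the [OS04] bound $I \le \sqrt{\delta}$ gives $\delta \ge I^2$. Combining the two,

$$\delta \gtrsim \max\{\, n/I,\ I^2 \,\} \ge n^{2/3},$$

because $n/I \ge n^{2/3}$ when $I \le n^{1/3}$ and $I^2 \ge n^{2/3}$ otherwise. With $n = \binom{v}{2}$ this is on the order of $v^{4/3}$.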

Generalizing the inequality.

$$\mathrm{Var}[f] \le \sum_{j=1}^{n} \delta_j(f)\, I_j(f).$$

Generalizations (which basically require no proof change):

- holds for randomized DTs
- holds for randomized “subcube partitions”
- holds for functions on any product probability space $f : \Omega_1 \times \cdots \times \Omega_n \to \{-1,1\}$ (with notion of “influence” suitably generalized)
- holds for real-valued functions with (necessary) loss of a factor, at most $\sqrt{\delta}$

Closing thought. It’s funny that our bound gets stuck roughly at the same level as Hajnal / Chakrabarti-Khot, $n^{2/3} = v^{4/3}$. Note that $n^{2/3}$ [I believe] cannot be improved by more than a log factor merely for monotone transitive functions, due to [BSW04]. Thus to get better than $v^{4/3}$ for monotone graph properties, you must use the fact that it’s a graph property. Chakrabarti-Khot does definitely use the fact that it’s a graph property (all sorts of graph packing lemmas). Or do they? Since they get stuck at essentially $v^{4/3}$, I wonder if there’s any chance their result doesn’t truly need the fact that it’s a graph property…
