Published by Marisol Gerold. Modified over 2 years ago.

1
Ryan O’Donnell (Microsoft), Mike Saks (Rutgers), Oded Schramm (Microsoft), Rocco Servedio (Columbia)

2
Part I: Decision trees have large influences

3
Printer troubleshooter
[Figure: an example decision tree. Internal nodes ask "Does anything print?", "Can print from Notepad?", "Right size paper?", "Printer mis-setup?", "File too complicated?", "Network printer?", "Driver OK?"; the leaves are "Solved" or "Call tech support".]

4
Decision tree complexity
$f : \{\mathrm{Attr}_1\} \times \{\mathrm{Attr}_2\} \times \cdots \times \{\mathrm{Attr}_n\} \to \{-1,1\}$.
What’s the “best” DT for f, and how to find it?
Depth = worst-case # of questions.
Expected depth = avg. # of questions.

5
Building decision trees
1. Identify the most ‘influential’/‘decisive’/‘relevant’ variable.
2. Put it at the root.
3. Recursively build DTs for its children.
Almost all real-world learning algs are based on this: CART, C4.5, …
Almost no theoretical (PAC-style) learning algs are based on this: [Blum92, KM93, BBVKV97, PTF-folklore, OS04]: no; [EH89, SJ03]: sorta.
Conj’d to be good for some problems (e.g., percolation [SS04]) but unprovable…

6
Boolean DTs
$f : \{-1,1\}^n \to \{-1,1\}$.
$D(f)$ = min depth of a DT for f. $0 \le D(f) \le n$.
[Figure: a decision tree for $\mathrm{Maj}_3$ that queries $x_1$, then $x_2$, then (where needed) $x_3$.]
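To make the definition of $D(f)$ concrete, here is a minimal brute-force sketch, feasible only for very small n. The function names (`dt_depth`, `maj3`) and the encoding of a partial assignment as a tuple with 0 for "not yet queried" are illustrative assumptions, not part of the talk.

```python
from functools import lru_cache
from itertools import product


def maj3(x):
    """Majority of three +/-1 bits."""
    return 1 if sum(x) > 0 else -1


def dt_depth(f, n):
    """Minimum decision-tree depth D(f) of f: {-1,1}^n -> {-1,1}, by exhaustive search."""

    @lru_cache(maxsize=None)
    def depth(restriction):
        # restriction[i] is -1 or 1 if x_i has been queried, 0 otherwise.
        free = [i for i, v in enumerate(restriction) if v == 0]
        values = set()
        for bits in product((-1, 1), repeat=len(free)):
            x = list(restriction)
            for i, b in zip(free, bits):
                x[i] = b
            values.add(f(tuple(x)))
        if len(values) == 1:
            return 0  # f is constant on this subcube, so this is a leaf
        best = n
        for i in free:
            # Query x_i here; the cost is 1 plus the worse of the two children.
            worst = 0
            for b in (-1, 1):
                child = restriction[:i] + (b,) + restriction[i + 1:]
                worst = max(worst, depth(child))
            best = min(best, 1 + worst)
        return best

    return depth((0,) * n)


print(dt_depth(maj3, 3))
```

For $\mathrm{Maj}_3$ the search confirms that no tree of depth 2 suffices, matching $D(\mathrm{Maj}_3) = 3$ stated later in the talk.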

7
Boolean DTs
$\{-1,1\}^n$ is viewed as a probability space, with the uniform probability distribution.
A uniformly random path down a DT, plus a uniformly random setting of the unqueried variables, defines a uniformly random input.
Expected depth: $\delta(f)$.

8
Influences
Influence of coordinate j on f = the probability that $x_j$ is relevant for f:
$I_j(f) = \Pr[f(x) \ne f(x^{(\oplus j)})]$, where $x^{(\oplus j)}$ is x with the jth coordinate flipped.
$0 \le I_j(f) \le 1$.
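The definition of $I_j(f)$ can be checked by direct enumeration. The sketch below is an illustration only: the helper name `influence` is made up here, and coordinates are 0-indexed (so `j = 0` is $x_1$).

```python
from fractions import Fraction
from itertools import product


def maj3(x):
    """Majority of three +/-1 bits."""
    return 1 if sum(x) > 0 else -1


def influence(f, n, j):
    """I_j(f): fraction of inputs x on which flipping coordinate j changes f(x)."""
    changed = 0
    for x in product((-1, 1), repeat=n):
        flipped = x[:j] + (-x[j],) + x[j + 1:]
        if f(x) != f(flipped):
            changed += 1
    return Fraction(changed, 2 ** n)


print([influence(maj3, 3, j) for j in range(3)])
```

Each coordinate of $\mathrm{Maj}_3$ matters exactly when the other two disagree, so every influence is 1/2.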

9
Main question: If a function f has a “shallow” decision tree, does it have a variable with “significant” influence?

10
Main question: No.
But for a silly reason: suppose f is highly biased; say $\Pr[f = 1] = p \ll 1$. Then for any j,
$I_j(f) = \Pr[f(x) = 1, f(x^{(\oplus j)}) = -1] + \Pr[f(x) = -1, f(x^{(\oplus j)}) = 1] \le \Pr[f(x) = 1] + \Pr[f(x^{(\oplus j)}) = 1] \le p + p = 2p$.

11
Variance
⇒ Influences are always at most $2\min\{p,q\}$, where $q = 1 - p$.
Analytically nicer expression: Var[f].
$\mathrm{Var}[f] = E[f^2] - E[f]^2 = 1 - (p - q)^2 = 1 - (2p - 1)^2 = 4p(1 - p) = 4pq$.
$2\min\{p,q\} \le 4pq \le 4\min\{p,q\}$. It’s 1 for balanced functions.
So $I_j(f) \le \mathrm{Var}[f]$, and it is fair to say $I_j(f)$ is “significant” if it’s a significant fraction of Var[f].
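As a small sanity check of the chain $I_j(f) \le 2\min\{p,q\} \le \mathrm{Var}[f] = 4pq$, the sketch below evaluates each quantity for a biased function. The choice of example (AND on three bits, with `and3` outputting 1 only on the all-ones input) is an assumption made here for illustration.

```python
from fractions import Fraction
from itertools import product

N = 3


def and3(x):
    """Biased example: 1 only when all three bits are 1, else -1."""
    return 1 if all(b == 1 for b in x) else -1


inputs = list(product((-1, 1), repeat=N))

# p = Pr[f = 1], q = 1 - p, and Var[f] = E[f^2] - E[f]^2 = 1 - E[f]^2.
p = Fraction(sum(1 for x in inputs if and3(x) == 1), len(inputs))
q = 1 - p
mean = Fraction(sum(and3(x) for x in inputs), len(inputs))
var = 1 - mean ** 2


def influence(j):
    """I_j(and3): fraction of inputs where flipping coordinate j changes the output."""
    changed = sum(
        1 for x in inputs if and3(x) != and3(x[:j] + (-x[j],) + x[j + 1:])
    )
    return Fraction(changed, len(inputs))


max_inf = max(influence(j) for j in range(N))
print(p, var, 4 * p * q, max_inf, 2 * min(p, q))
```

Here p = 1/8, so the bound $2\min\{p,q\} = 1/4$ is met with equality by every coordinate, while Var[f] = 7/16.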

12
Main question: If a function f has a “shallow” decision tree, does it have a variable with influence at least a “significant” fraction of Var[f]?

13
Notation
$\tau(d) = \min_{f \,:\, D(f) \le d} \; \max_{j} \left\{ I_j(f) / \mathrm{Var}[f] \right\}$.

14
Known lower bounds
Suppose $f : \{-1,1\}^n \to \{-1,1\}$.
An elementary old inequality states $\mathrm{Var}[f] \le \sum_{j=1}^{n} I_j(f)$. Thus f has a variable with influence at least Var[f]/n.
A deep inequality of [KKL88] shows there is always a coord. j such that $I_j(f) \ge \mathrm{Var}[f] \cdot \Omega(\log n / n)$.
If D(f) = d then f really has at most $2^d$ variables. Hence we get $\tau(d) \ge 1/2^d$ from the first, and $\tau(d) \ge \Omega(d/2^d)$ from KKL.

15
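Our result
$\tau(d) \ge 1/d$.
This is tight:
[Figure: decision tree for “SEL”. The root queries $x_1$, which selects whether $x_2$ or $x_3$ is queried next; the output is the value of the variable queried second.]
Then Var[SEL] = 1, d = 2, and all three variables have infl. ½. (Forming the recursive version, SEL(SEL, SEL, SEL) etc., gives a Var 1 fcn with $d = 2^h$ and all influences $2^{-h}$, for any h.)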
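The tightness claim can be checked numerically for small h. This is a sketch under one assumption: which value of $x_1$ selects $x_2$ versus $x_3$ is not recoverable from the slide, so the convention below (−1 selects the first candidate) is arbitrary; by symmetry it does not affect the variance or the influences. The names `sel`, `rec_sel`, and `influences` are illustrative.

```python
from fractions import Fraction
from itertools import product


def sel(selector, a, b):
    """SEL: the selector bit chooses which of the other two inputs is output."""
    return a if selector == -1 else b


def rec_sel(x, h):
    """Depth-h recursive SEL on 3**h inputs: SEL(SEL, SEL, SEL), etc."""
    if h == 0:
        return x[0]
    m = 3 ** (h - 1)
    return sel(
        rec_sel(x[:m], h - 1),
        rec_sel(x[m:2 * m], h - 1),
        rec_sel(x[2 * m:], h - 1),
    )


def influences(h):
    """Exact influence of every coordinate of the depth-h recursive SEL."""
    n = 3 ** h
    inputs = list(product((-1, 1), repeat=n))
    result = []
    for j in range(n):
        changed = sum(
            1
            for x in inputs
            if rec_sel(x, h) != rec_sel(x[:j] + (-x[j],) + x[j + 1:], h)
        )
        result.append(Fraction(changed, len(inputs)))
    return result


def variance(h):
    """Var[f] = 1 - E[f]^2 for the depth-h recursive SEL."""
    n = 3 ** h
    total = sum(rec_sel(x, h) for x in product((-1, 1), repeat=n))
    return 1 - Fraction(total, 2 ** n) ** 2


for h in (1, 2):
    print(h, variance(h), set(influences(h)))
```

For h = 1 and h = 2 the variance is 1 and every influence equals $2^{-h}$, in line with $\tau(d) \le 1/d$ at $d = 2^h$.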

16
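Our actual main theorem
Given a decision tree f, let $\delta_j(f) = \Pr[\text{tree queries } x_j]$. Then
$\mathrm{Var}[f] \le \sum_{j=1}^{n} \delta_j(f)\, I_j(f)$.
Cor: Fix the tree with smallest expected depth. Then $\sum_{j=1}^{n} \delta_j(f) = E[\text{depth of a path}] =: \delta(f) \le D(f)$.
$\Rightarrow \mathrm{Var}[f] \le \max_j I_j \cdot \sum_{j=1}^{n} \delta_j = \max_j I_j \cdot \delta(f)$
$\Rightarrow \max_j I_j \ge \mathrm{Var}[f] / \delta(f) \ge \mathrm{Var}[f] / D(f)$.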
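The inequality $\mathrm{Var}[f] \le \sum_j \delta_j(f) I_j(f)$ can be evaluated exactly on a small tree. The sketch below assumes a hypothetical encoding of a decision tree as nested tuples `(variable, subtree_if_-1, subtree_if_+1)` with integer leaves, and uses a natural depth-3 tree for $\mathrm{Maj}_3$; neither the encoding nor the helper names come from the talk.

```python
from fractions import Fraction
from itertools import product

N = 3
# Hypothetical encoding: (variable index, subtree if x = -1, subtree if x = +1).
# Query x_0 and x_1; only if they disagree, query x_2 and output it.
MAJ3_TREE = (0, (1, -1, (2, -1, 1)), (1, (2, -1, 1), 1))


def evaluate(tree, x):
    """Follow the path that input x takes and return the leaf label."""
    while not isinstance(tree, int):
        var, lo, hi = tree
        tree = lo if x[var] == -1 else hi
    return tree


def query_probs(tree, n):
    """delta_j = Pr[tree queries x_j] under a uniformly random input."""
    delta = [Fraction(0)] * n

    def walk(node, prob):
        if isinstance(node, int):
            return
        var, lo, hi = node
        delta[var] += prob  # this node is reached with probability prob
        walk(lo, prob / 2)
        walk(hi, prob / 2)

    walk(tree, Fraction(1))
    return delta


inputs = list(product((-1, 1), repeat=N))
f = lambda x: evaluate(MAJ3_TREE, x)

mean = Fraction(sum(f(x) for x in inputs), len(inputs))
var = 1 - mean ** 2
infl = [
    Fraction(sum(1 for x in inputs if f(x) != f(x[:j] + (-x[j],) + x[j + 1:])), len(inputs))
    for j in range(N)
]
delta = query_probs(MAJ3_TREE, N)
rhs = sum(d * i for d, i in zip(delta, infl))
expected_depth = sum(delta)

print(var, rhs, expected_depth)
```

For this tree $\delta = (1, 1, 1/2)$ and every influence is 1/2, so the right-hand side is 5/4 against Var[f] = 1, and the corollary gives $\max_j I_j = 1/2 \ge \mathrm{Var}[f]/\delta(f) = 2/5$.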

17
Proof
Pick a random path in the tree. This gives some set of variables, $P = (x_{J_1}, \ldots, x_{J_T})$, along with an assignment to them, $\beta_P$.
Call the remaining set of variables $\bar{P}$ and pick a random assignment $\beta_{\bar{P}}$ for them too.
Let X be the (uniformly random) string given by combining these two assignments, $(\beta_P, \beta_{\bar{P}})$.
Also, define $J_{T+1}, \ldots, J_n = \perp$.

18
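Proof
Let $\beta'_P$ be an independent random assignment to the variables in P. Let $Z = (\beta'_P, \beta_{\bar{P}})$. Note: Z is also uniformly random.
[Figure: a path through the tree querying $x_{J_1}, x_{J_2}, x_{J_3}, \ldots, x_{J_T}$, with $J_{T+1} = \cdots = J_n = \perp$. In the example, X = (−1, 1, −1, …, 1 | 1, −1, 1, −1) and Z = (1, −1, −1, …, −1 | 1, −1, 1, −1): the two strings share the $\bar{P}$ block, and the P block of Z is freshly drawn.]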

19
Proof
Finally, for t = 0…T, let $Y_t$ be the same string as X, except that Z’s assignments ($\beta'_P$) for variables $x_{J_1}, \ldots, x_{J_t}$ are swapped in. Note: $Y_0 = X$, $Y_T = Z$.
Y_0 = X = (−1,  1, −1, …,  1,  1, −1, 1, −1)
Y_1     = ( 1,  1, −1, …,  1,  1, −1, 1, −1)
Y_2     = ( 1, −1, −1, …,  1,  1, −1, 1, −1)
  ⋮
Y_T = Z = ( 1, −1, −1, …, −1,  1, −1, 1, −1)
Also define $Y_{T+1} = \cdots = Y_n = Z$.

20
$$
\begin{aligned}
\mathrm{Var}[f] &= E[f^2] - E[f]^2 \\
&= E[f(X)f(X)] - E[f(X)f(Z)] \\
&= E[f(X)f(Y_0) - f(X)f(Y_n)] \\
&= \sum_{t=1}^{n} E\big[f(X)\,(f(Y_{t-1}) - f(Y_t))\big] \\
&\le \sum_{t=1}^{n} E\big[\,|f(Y_{t-1}) - f(Y_t)|\,\big] \\
&= \sum_{t=1}^{n} 2\Pr[f(Y_{t-1}) \ne f(Y_t)] \\
&= \sum_{t=1}^{n} \sum_{j=1}^{n} \Pr[J_t = j] \cdot 2\Pr[f(Y_{t-1}) \ne f(Y_t) \mid J_t = j].
\end{aligned}
$$

21
Proof
$\ldots = \sum_{t=1}^{n} \sum_{j=1}^{n} \Pr[J_t = j] \cdot 2\Pr[f(Y_{t-1}) \ne f(Y_t) \mid J_t = j]$
Utterly Crucial Observation: Conditioned on $J_t = j$, $(Y_{t-1}, Y_t)$ are jointly distributed exactly as $(W, W')$, where W is uniformly random, and $W'$ is W with the jth bit rerandomized.

22
[Figure: the random path with the strings X and Z, shown next to the hybrid strings $Y_0 = X, Y_1, Y_2, \ldots, Y_T = Z$, repeated from the preceding slides to illustrate the observation.]

23
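Proof
$$
\begin{aligned}
\ldots &= \sum_{t=1}^{n} \sum_{j=1}^{n} \Pr[J_t = j] \cdot 2\Pr[f(Y_{t-1}) \ne f(Y_t) \mid J_t = j] \\
&= \sum_{t=1}^{n} \sum_{j=1}^{n} \Pr[J_t = j] \cdot 2\Pr[f(W) \ne f(W')] \\
&= \sum_{t=1}^{n} \sum_{j=1}^{n} \Pr[J_t = j] \cdot I_j(f) \\
&= \sum_{j=1}^{n} I_j \cdot \sum_{t=1}^{n} \Pr[J_t = j] \\
&= \sum_{j=1}^{n} I_j\, \delta_j.
\end{aligned}
$$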

24
Part II: Lower bounds for monotone graph properties

25
Monotone graph properties
Consider graphs on v vertices; let $n = \binom{v}{2}$.
“Nontrivial monotone graph property”:
“nontrivial property”: a (nonempty, nonfull) subset of all v-vertex graphs
“graph property”: closed under permutations of the vertices (no edge is ‘distinguished’)
“monotone”: adding edges can only put you into the property, not take you out
e.g.: Contains-A-Triangle, Connected, Has-Hamiltonian-Path, Non-Planar, Has-at-least-n/2-edges, …
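The three conditions in the definition can be verified exhaustively for a small case. The sketch below checks Contains-A-Triangle on v = 4 vertices (so n = 6 potential edges); representing a graph as a frozenset of sorted vertex pairs, and the helper names, are assumptions made for illustration.

```python
from itertools import combinations, permutations, product

V = 4
EDGES = list(combinations(range(V), 2))  # n = C(4, 2) = 6 potential edges


def has_triangle(graph):
    """graph is a frozenset of edges (i, j) with i < j."""
    return any(
        {(a, b), (a, c), (b, c)} <= graph for a, b, c in combinations(range(V), 3)
    )


def relabel(graph, perm):
    """Apply a vertex permutation to every edge of the graph."""
    return frozenset(tuple(sorted((perm[i], perm[j]))) for i, j in graph)


graphs = [
    frozenset(e for e, bit in zip(EDGES, bits) if bit)
    for bits in product((0, 1), repeat=len(EDGES))
]

# Nontrivial: some graphs have the property, some do not.
nontrivial = any(map(has_triangle, graphs)) and not all(map(has_triangle, graphs))

# Graph property: membership is unchanged by any relabelling of the vertices.
invariant = all(
    has_triangle(relabel(g, perm)) == has_triangle(g)
    for g in graphs
    for perm in permutations(range(V))
)

# Monotone: adding an edge never takes a graph out of the property.
monotone = all(
    has_triangle(g | {e}) for g in graphs if has_triangle(g) for e in EDGES
)

print(len(EDGES), nontrivial, invariant, monotone)
```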

26
Aanderaa-Karp-Rosenberg conj.
Every nontrivial monotone graph property has D(f) = n.
[Rivest-Vuillemin-75]: $\ge v^2/16$.
[Kleitman-Kwiatkowski-80]: $\ge v^2/9$.
[Kahn-Saks-Sturtevant-84]: $\ge n/2$; $= n$ if v is a prime power. [Topology + group theory!]
[Yao-88]: $= n$ in the bipartite case.

27
Randomized DTs
Have ‘coin flip’ nodes in the trees that cost nothing. Or, a probability distribution over deterministic DTs.
Note: We want both 0-sided error and worst-case input.
R(f) = min, over randomized DTs that compute f with 0-error, of max over inputs x, of expected # of queries. The expectation is only over the DT’s internal coins.

28
$\mathrm{Maj}_3$:
$D(\mathrm{Maj}_3) = 3$.
Pick two inputs at random, check if they’re the same. If not, check the 3rd. $R(\mathrm{Maj}_3) \le 8/3$.
Let f = recursive-$\mathrm{Maj}_3$ [$\mathrm{Maj}_3(\mathrm{Maj}_3, \mathrm{Maj}_3, \mathrm{Maj}_3)$, etc…]. For the depth-h version ($n = 3^h$), $D(f) = 3^h$ and $R(f) \le (8/3)^h$. (Not best possible…!)
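The 8/3 figure for the randomized algorithm can be reproduced by averaging over the three equally likely choices of the first pair. This is a minimal sketch; the function names are illustrative, and the algorithm is the one described on the slide (query a random pair, and query the third input only if the pair disagrees).

```python
from fractions import Fraction
from itertools import combinations, product

PAIRS = list(combinations(range(3), 2))  # the three equally likely first pairs


def run(x, pair):
    """Return (answer, number of queries) when the given pair is queried first."""
    i, j = pair
    if x[i] == x[j]:
        return x[i], 2  # two equal bits already decide the majority
    k = 3 - i - j  # index of the remaining input
    return x[k], 3  # the pair is split, so the third bit is the majority


def expected_queries(x):
    """Expected number of queries on input x, over the random choice of pair."""
    return Fraction(sum(run(x, pair)[1] for pair in PAIRS), len(PAIRS))


inputs = list(product((-1, 1), repeat=3))
worst_case = max(expected_queries(x) for x in inputs)
print(worst_case)
```

On an input with two equal bits and one different, only one of the three pairs is the agreeing one, giving $2 \cdot \tfrac{1}{3} + 3 \cdot \tfrac{2}{3} = 8/3$ expected queries; on the two constant inputs the algorithm always stops after 2 queries.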

29
Randomized AKR / Yao conj.
Yao conjectured in ’77 that every nontrivial monotone graph property f has $R(f) \ge \Omega(v^2)$.

Lower bound Ω(·)                 Who
$v$                              [Yao-77]
$v \log^{1/12} v$                [Yao-87]
$v^{5/4}$                        [King-88]
$v^{4/3}$                        [Hajnal-91]
$v^{4/3} \log^{1/3} v$           [Chakrabarti-Khot-01]
$\min\{ v/p,\ v^2/\log v \}$     [Fried.-Kahn-Wigd.-02]
$v^{4/3} / p^{1/3}$              [us]

30
Outline
Extend main inequality to the p-biased case. (Then LHS is 1.)
Use Yao’s minmax principle: Show that under p-biased $\{-1,1\}^n$, $\delta = \sum_j \delta_j$ = avg # queries is large for any tree.
Main inequality: max influence is small ⇒ δ is large.
Graph property ⇒ all vbls have the same influence.
Hence: sum of influences is small ⇒ δ is large.
[OS04]: f monotone ⇒ sum of influences ≤ $\sqrt{\delta}$.
Hence: sum of influences is large ⇒ δ is large.
So either way, δ is large.

31
Generalizing the inequality
$\mathrm{Var}[f] \le \sum_{j=1}^{n} \delta_j(f)\, I_j(f)$.
Generalizations (which basically require no proof change):
holds for randomized DTs
holds for randomized “subcube partitions”
holds for functions on any product probability space $f : \Omega_1 \times \cdots \times \Omega_n \to \{-1,1\}$ (with notion of “influence” suitably generalized)
holds for real-valued functions with (necessary) loss of a factor, at most $\sqrt{\delta}$

32
Closing thought
It’s funny that our bound gets stuck roughly at the same level as Hajnal / Chakrabarti-Khot, $n^{2/3} = v^{4/3}$.
Note that $n^{2/3}$ [I believe] cannot be improved by more than a log factor merely for monotone transitive functions, due to [BSW04]. Thus to get better than $v^{4/3}$ for monotone graph properties, you must use the fact that it’s a graph property.
Chakrabarti-Khot does definitely use the fact that it’s a graph property (all sorts of graph packing lemmas). Or do they? Since they get stuck at essentially $v^{4/3}$, I wonder if there’s any chance their result doesn’t truly need the fact that it’s a graph property…
