Published by Alexandra Cantrell. Modified over 4 years ago.

1
Agnostically Learning Decision Trees
Parikshit Gopalan (MSR-Silicon Valley, IITB00), Adam Tauman Kalai (MSR-New England), Adam R. Klivans (UT Austin)
[Figure: decision tree over X1, X2, X3 with 0/1 leaves]

2
Computational Learning

4
Learning: Predict f from examples (x, f(x)), where $f:\{0,1\}^n \to \{0,1\}$.

5
Valiant's Model
Examples: (x, f(x)), where $f:\{0,1\}^n \to \{0,1\}$.
Assumption: f comes from a nice concept class.
Halfspaces: [Figure: points labeled + and -]

6
Valiant's Model
Examples: (x, f(x)), where $f:\{0,1\}^n \to \{0,1\}$.
Assumption: f comes from a nice concept class.
Decision Trees: [Figure: decision tree over X1, X2, X3 with 0/1 leaves]

7
The Agnostic Model [Kearns-Schapire-Sellie94]
Examples: (x, f(x)), where $f:\{0,1\}^n \to \{0,1\}$.
No assumptions about f.
Learner should do as well as the best decision tree.
Decision Trees: [Figure: decision tree over X1, X2, X3 with 0/1 leaves]

8
The Agnostic Model [Kearns-Schapire-Sellie94]
Examples: (x, f(x)).
No assumptions about f.
Learner should do as well as the best decision tree.
Decision Trees: [Figure: decision tree over X1, X2, X3 with 0/1 leaves]

9
Agnostic Model = Noisy Learning
$f:\{0,1\}^n \to \{0,1\}$
[Figure: decision tree + noise = f]
Concept: Message. Truth table: Encoding. Function f: Received word.
Coding: Recover the message.
Learning: Predict f.

10
Uniform Distribution Learning for Decision Trees
Noiseless Setting:
– No queries: $n^{\log n}$ [Ehrenfeucht-Haussler89].
– With queries: poly(n) [Kushilevitz-Mansour91].
Agnostic Setting:
– Polynomial time, uses queries [G.-Kalai-Klivans08].
– Reconstruction for sparse real polynomials in the $\ell_1$ norm.

11
The Fourier Transform Method
Powerful tool for uniform distribution learning.
Introduced by Linial-Mansour-Nisan.
– Small depth circuits [Linial-Mansour-Nisan89]
– DNFs [Jackson95]
– Decision trees [Kushilevitz-Mansour94, O'Donnell-Servedio06, G.-Kalai-Klivans08]
– Halfspaces, Intersections [Klivans-O'Donnell-Servedio03, Kalai-Klivans-Mansour-Servedio05]
– Juntas [Mossel-O'Donnell-Servedio03]
– Parities [Feldman-G.-Khot-Ponnuswami06]

12
The Fourier Polynomial
Let $f:\{-1,1\}^n \to \{-1,1\}$.
Write f as a polynomial.
– AND: $\tfrac12 + \tfrac12 X_1 + \tfrac12 X_2 - \tfrac12 X_1 X_2$
– Parity: $X_1 X_2$
Parity of $\alpha \subseteq [n]$: $\chi_\alpha(x) = \prod_{i \in \alpha} X_i$
Write $f(x) = \sum_\alpha c(\alpha)\,\chi_\alpha(x)$
– $\sum_\alpha c(\alpha)^2 = 1$.
[Figure: the function f moved from the standard basis to the parities]

13
The Fourier Polynomial
Let $f:\{-1,1\}^n \to \{-1,1\}$.
Write f as a polynomial.
– AND: $\tfrac12 + \tfrac12 X_1 + \tfrac12 X_2 - \tfrac12 X_1 X_2$
– Parity: $X_1 X_2$
Parity of $\alpha \subseteq [n]$: $\chi_\alpha(x) = \prod_{i \in \alpha} X_i$
Write $f(x) = \sum_\alpha c(\alpha)\,\chi_\alpha(x)$
– $\sum_\alpha c(\alpha)^2 = 1$.
– $c(\alpha)^2$: Weight of $\alpha$.
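The expansion above can be checked by brute force for small n. The sketch below is illustrative only: the helper name `fourier_coefficients` and the encoding of "true" as -1 are assumptions chosen so that the result matches the AND polynomial on the slide. It computes each coefficient as the average of f(x) times the parity over all inputs.

```python
from itertools import product, combinations
from math import prod

def fourier_coefficients(f, n):
    """Return {subset: c(subset)}, with c(S) = E_x[f(x) * chi_S(x)] over uniform x in {-1,1}^n."""
    points = list(product([-1, 1], repeat=n))
    coeffs = {}
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            # chi_S(x) is the product of the coordinates indexed by S (empty product = 1).
            total = sum(f(x) * prod(x[i] for i in subset) for x in points)
            coeffs[subset] = total / len(points)
    return coeffs

def and_gate(x):
    # Assumed convention: -1 encodes "true", so AND outputs -1 only when both inputs are -1.
    return -1 if x[0] == -1 and x[1] == -1 else 1

coeffs = fourier_coefficients(and_gate, 2)
parseval = sum(c * c for c in coeffs.values())  # sum of squared coefficients
```

For this two-variable AND the squared coefficients sum to 1, as the slide states for any function with values in {-1,1}.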

14
Low Degree Functions
Low Degree Functions: Most of the weight lies on small subsets.
– Halfspaces, Small-depth circuits.
Low-degree algorithm [Linial-Mansour-Nisan]:
– Finds the low-degree Fourier coefficients.
Least Squares Regression: Find low-degree P minimizing $E_x[\,|P(x) - f(x)|^2\,]$.
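A minimal sketch of the low-degree idea, assuming uniform random examples: because the parities are orthonormal, the least-squares low-degree polynomial has the Fourier coefficients themselves as its coefficients, so each one can be estimated by an empirical average. The function name `low_degree_estimate`, the majority target, and the sample size are illustrative assumptions, not taken from the talk.

```python
import random
from itertools import combinations
from math import prod

def low_degree_estimate(sample, n, degree):
    """Estimate c(S) = E[f(x) * chi_S(x)] for every |S| <= degree from labeled examples (x, y)."""
    estimates = {}
    for k in range(degree + 1):
        for subset in combinations(range(n), k):
            total = sum(y * prod(x[i] for i in subset) for x, y in sample)
            estimates[subset] = total / len(sample)
    return estimates

def majority3(x):
    # Hypothetical target: majority vote of three +/-1 bits.
    return 1 if sum(x) > 0 else -1

rng = random.Random(0)  # fixed seed so the run is repeatable
sample = []
for _ in range(20000):
    x = tuple(rng.choice((-1, 1)) for _ in range(3))
    sample.append((x, majority3(x)))

estimates = low_degree_estimate(sample, 3, 1)
```

The exact degree-1 coefficients of this majority function are 1/2 each and the constant term is 0, so the estimates should land close to those values.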

15
Sparse Functions
Sparse Functions: Most of the weight lies on a few subsets.
– Decision trees: t leaves $\Rightarrow$ O(t) subsets.
Sparse Algorithm [Kushilevitz-Mansour91].
Sparse $\ell_2$ Regression: Find t-sparse P minimizing $E_x[\,|P(x) - f(x)|^2\,]$.

16
Sparse $\ell_2$ Regression
Sparse Functions: Most of the weight lies on a few subsets.
– Decision trees: t leaves $\Rightarrow$ O(t) subsets.
Sparse Algorithm [Kushilevitz-Mansour91].
Sparse $\ell_2$ Regression: Find t-sparse P minimizing $E_x[\,|P(x) - f(x)|^2\,]$.
Finding large coefficients: Hadamard decoding [Kushilevitz-Mansour91, Goldreich-Levin89].

17
Agnostic Learning via $\ell_2$ Regression?
[Figure: plot of $f:\{-1,1\}^n \to \{-1,1\}$]

18
Agnostic Learning via $\ell_2$ Regression?
[Figure: plot of f next to a decision tree over X1, X2, X3 with 0/1 leaves]

19
Agnostic Learning via $\ell_2$ Regression?
$\ell_2$ Regression: Loss $|P(x) - f(x)|^2$. Pay 1 for indecision. Pay 4 for a mistake.
$\ell_1$ Regression [KKMS05]: Loss $|P(x) - f(x)|$. Pay 1 for indecision. Pay 2 for a mistake.
[Figure: Target f, Best Tree]

20
Agnostic Learning via $\ell_1$ Regression?
$\ell_2$ Regression: Loss $|P(x) - f(x)|^2$. Pay 1 for indecision. Pay 4 for a mistake.
$\ell_1$ Regression [KKMS05]: Loss $|P(x) - f(x)|$. Pay 1 for indecision. Pay 2 for a mistake.
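The "pay 1 / pay 4 / pay 2" figures follow from plugging two predictions into each loss. In this small sketch, "indecision" is modeled as predicting 0 and "mistake" as predicting the opposite sign of a label of +1. Those encodings and the function names are assumptions for illustration.

```python
def l2_loss(prediction, label):
    return (prediction - label) ** 2

def l1_loss(prediction, label):
    return abs(prediction - label)

label = 1
indecision = 0       # predictor sits halfway between the two labels
mistake = -label     # predictor commits to the wrong label

costs = {
    "l2": (l2_loss(indecision, label), l2_loss(mistake, label)),
    "l1": (l1_loss(indecision, label), l1_loss(mistake, label)),
}
```

Under the squared loss a mistake costs four times as much as indecision, while under the absolute loss it costs only twice as much, so the $\ell_1$ loss tracks classification error more closely.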

21
Agnostic Learning via $\ell_1$ Regression
Thm [KKMS05]: $\ell_1$ Regression always gives a good predictor.
$\ell_1$ regression for low degree polynomials via Linear Programming.
[Figure: Target f, Best Tree]

22
Agnostically Learning Decision Trees
Sparse $\ell_1$ Regression: Find a t-sparse polynomial P minimizing $E_x[\,|P(x) - f(x)|\,]$.
Why is this harder:
– $\ell_2$ is basis independent, $\ell_1$ is not.
– Don't know the support of P.
[G.-Kalai-Klivans]: Polynomial time algorithm for Sparse $\ell_1$ Regression.

23
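The Gradient-Projection Method
Variables: the $c(\alpha)$'s.
Constraint: $\sum_\alpha |c(\alpha)| \le t$
Minimize: $E_x |P(x) - f(x)|$
$P(x) = \sum_\alpha c(\alpha)\,\chi_\alpha(x)$, $Q(x) = \sum_\alpha d(\alpha)\,\chi_\alpha(x)$
$L_1(P,Q) = \sum_\alpha |c(\alpha) - d(\alpha)|$
$L_2(P,Q) = \left[\sum_\alpha (c(\alpha) - d(\alpha))^2\right]^{1/2}$
[Figure: P(x), f(x), Q(x)]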

24
The Gradient-Projection Method
Variables: the $c(\alpha)$'s.
Constraint: $\sum_\alpha |c(\alpha)| \le t$
Minimize: $E_x |P(x) - f(x)|$
[Figure: Gradient step]

25
The Gradient-Projection Method
Variables: the $c(\alpha)$'s.
Constraint: $\sum_\alpha |c(\alpha)| \le t$
Minimize: $E_x |P(x) - f(x)|$
[Figure: Gradient step followed by Projection step]

27
The Gradient
$g(x) = \mathrm{sgn}[f(x) - P(x)]$
$P(x) := P(x) + \eta\, g(x)$.
Increase P(x) if low. Decrease P(x) if high.
[Figure: f(x) and P(x)]
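A minimal sketch of one gradient step on explicit truth tables. The step size `eta`, the helper names, and the tiny four-point example are illustrative assumptions. Each value of P moves a fixed amount toward f, and stays put where it already agrees.

```python
def sign(v):
    # Returns -1, 0, or +1.
    return (v > 0) - (v < 0)

def gradient_step(P, f, eta):
    """Apply P := P + eta * g pointwise, where g = sgn(f - P)."""
    g = [sign(fx - px) for fx, px in zip(f, P)]
    return [px + eta * gx for px, gx in zip(P, g)]

def l1_error(P, f):
    """Average absolute error E_x |P(x) - f(x)| over the listed points."""
    return sum(abs(px - fx) for px, fx in zip(P, f)) / len(f)

f = [1, -1, 1, 1]
P = [0.0, 0.0, 2.0, 1.0]
P_next = gradient_step(P, f, 0.5)

error_before = l1_error(P, f)
error_after = l1_error(P_next, f)
```

In this example the step halves the average absolute error, since every disagreeing point moves 0.5 closer to its target.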

30
Projection onto the $L_1$ ball
Currently: $\sum_\alpha |c(\alpha)| > t$
Want: $\sum_\alpha |c(\alpha)| \le t$.

32
Projection onto the $L_1$ ball
Below cutoff: Set to 0.
Above cutoff: Subtract.
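The cutoff rule can be sketched with the standard sort-based Euclidean projection onto an $L_1$ ball. This is a common textbook routine, used here as an assumption about what the slide depicts. The function name and example vector are hypothetical.

```python
def project_onto_l1_ball(c, t):
    """Euclidean projection of the vector c onto {v : sum |v_i| <= t}, for t > 0."""
    if sum(abs(v) for v in c) <= t:
        return list(c)  # already inside the ball, nothing to do
    mags = sorted((abs(v) for v in c), reverse=True)
    running, cutoff = 0.0, 0.0
    for k, m in enumerate(mags, start=1):
        running += m
        candidate = (running - t) / k
        # Keep the candidate from the largest k whose k-th magnitude still exceeds it.
        if m > candidate:
            cutoff = candidate
    # Below cutoff: set to 0. Above cutoff: subtract the cutoff from the magnitude.
    return [(1 if v > 0 else -1) * max(abs(v) - cutoff, 0.0) for v in c]

c = [3.0, -1.5, 0.5]
projected = project_onto_l1_ball(c, 2.0)
```

For this input the cutoff works out to 1.25: the smallest entry is zeroed and the other two shrink by 1.25, leaving total absolute weight exactly 2.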

34
Analysis of Gradient-Projection [Zinkevich03]
Progress measure: Squared $L_2$ distance from optimum $P^*$.
Key Equation: $|P_t - P^*|^2 - |P_{t+1} - P^*|^2 \ge 2\eta\,(L(P_t) - L(P^*)) - \eta^2$
– Left-hand side: Progress made in this step.
– $L(P_t) - L(P^*)$: How suboptimal the current solution is.
Within $\epsilon$ of optimal in $1/\epsilon^2$ iterations.
Good $L_2$ approximation to $P_t$ suffices.
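The key equation can be derived in three lines. This is a sketch under stated assumptions, not the talk's own proof. It assumes $\eta$ is the step size, $g = \mathrm{sgn}(f - P_t)$ so that $\|g\|_2 \le 1$ and $-g$ is a subgradient of $L(P) = E_x|P(x) - f(x)|$ at $P_t$, and that projecting onto a convex set containing $P^*$ does not increase the distance to $P^*$.

```latex
\begin{aligned}
\|P_{t+1} - P^*\|^2
  &\le \|P_t + \eta g - P^*\|^2 \\
  &= \|P_t - P^*\|^2 + 2\eta \langle g,\; P_t - P^* \rangle + \eta^2 \|g\|^2 \\
  &\le \|P_t - P^*\|^2 - 2\eta \bigl( L(P_t) - L(P^*) \bigr) + \eta^2 .
\end{aligned}
```

The last line uses convexity, $L(P^*) \ge L(P_t) + \langle -g,\; P^* - P_t \rangle$. Rearranging gives the key equation. Summing it over the iterations with $\eta$ proportional to $\epsilon$ gives an iteration count on the order of $1/\epsilon^2$ when the initial distance to $P^*$ is bounded.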

35
[Figure: f(x) and P(x), with the Gradient and Projection steps]
$g(x) = \mathrm{sgn}[f(x) - P(x)]$.

36
The Gradient
$g(x) = \mathrm{sgn}[f(x) - P(x)]$.
[Figure: f(x) and P(x)]
Compute sparse approximation $g' = \mathrm{KM}(g)$.
Is $g'$ a good $L_2$ approximation to g? No.
Initially g = f. $L_2(g, g')$ can be as large as 1.

37
Sparse $\ell_1$ Regression
Variables: the $c(\alpha)$'s.
Constraint: $\sum_\alpha |c(\alpha)| \le t$
Minimize: $E_x |P(x) - f(x)|$
[Figure: Approximate Gradient]

38
Sparse $\ell_1$ Regression
Variables: the $c(\alpha)$'s.
Constraint: $\sum_\alpha |c(\alpha)| \le t$
Minimize: $E_x |P(x) - f(x)|$
[Figure: Projection Compensates]

39
KM as $\ell_2$ Approximation
The KM Algorithm:
Input: $g:\{-1,1\}^n \to \{-1,1\}$, and t.
Output: A t-sparse polynomial $g'$ minimizing $E_x[\,|g(x) - g'(x)|^2\,]$.
Run Time: poly(n, t).

40
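KM as $L_\infty$ Approximation
The KM Algorithm:
Input: A Boolean function $g = \sum_\alpha c(\alpha)\,\chi_\alpha(x)$. An error bound $\epsilon$.
Output: Approximation $g' = \sum_\alpha c'(\alpha)\,\chi_\alpha(x)$ s.t. $|c(\alpha) - c'(\alpha)| \le \epsilon$ for all $\alpha \subseteq [n]$.
Run Time: poly(n, $1/\epsilon$).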

41
KM as $L_\infty$ Approximation
1) Identify coefficients larger than $\epsilon$. Only $1/\epsilon^2$ such coefficients.
2) Estimate via sampling, set rest to 0.
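The two steps can be illustrated with a brute-force stand-in. This sketch enumerates all $2^n$ coefficients exactly, which is not how KM works: the real algorithm uses queries and avoids the enumeration. It only shows the output guarantee. The function name and the majority example are hypothetical.

```python
from itertools import product, combinations
from math import prod

def large_coefficients(g, n, eps):
    """Keep every Fourier coefficient of g with |c| >= eps; all others are implicitly set to 0."""
    points = list(product([-1, 1], repeat=n))
    kept = {}
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            c = sum(g(x) * prod(x[i] for i in subset) for x in points) / len(points)
            if abs(c) >= eps:
                kept[subset] = c
    return kept

def majority3(x):
    # Hypothetical Boolean input function on three +/-1 bits.
    return 1 if sum(x) > 0 else -1

eps = 0.25
kept = large_coefficients(majority3, 3, eps)
```

Every dropped coefficient has magnitude below `eps`, so the coefficient-wise error is below `eps`. Because the squared coefficients of a Boolean function sum to 1, at most $1/\epsilon^2$ coefficients can survive the cut.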

43
Projection Preserves $L_\infty$ Distance
$L_\infty$ distance at most $2\epsilon$ after projection.
Both lines stop within $\epsilon$ of each other.

44
Projection Preserves $L_\infty$ Distance
$L_\infty$ distance at most $2\epsilon$ after projection.
Both lines stop within $\epsilon$ of each other.
Else, Blue dominates Red.

45
Projection Preserves $L_\infty$ Distance
$L_\infty$ distance at most $2\epsilon$ after projection.
Projecting onto the $L_1$ ball does not increase $L_\infty$ distance.

46
Projection Preserves $L_\infty$ Distance
$L_\infty$ distance at most $2\epsilon$ after projection.
Projecting onto the $L_1$ ball preserves $L_\infty$ distance.

47
Sparse $\ell_1$ Regression
Variables: the $c(\alpha)$'s.
Constraint: $\sum_\alpha |c(\alpha)| \le t$
Minimize: $E_x |P(x) - f(x)|$
$L_\infty(P, P') \le 2\epsilon$
$L_1(P, P') \le 2t$
$L_2(P, P')^2 \le 4\epsilon t$
[Figure: P and P']
Can take $\epsilon = 1/t^2$.
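The three bounds on this slide chain together in one line, reading the first as a bound on the largest coefficient difference ($L_\infty$) and the second as a bound on the sum of coefficient differences ($L_1$). The primed notation $P' = \sum_\alpha c'(\alpha)\chi_\alpha$ for the approximate iterate is an assumption made for this sketch.

```latex
L_2(P, P')^2
  = \sum_\alpha \bigl(c(\alpha) - c'(\alpha)\bigr)^2
  \le \max_\alpha \bigl|c(\alpha) - c'(\alpha)\bigr| \cdot \sum_\alpha \bigl|c(\alpha) - c'(\alpha)\bigr|
  = L_\infty(P, P') \cdot L_1(P, P')
  \le 2\epsilon \cdot 2t = 4\epsilon t .
```

The $L_1$ bound of $2t$ holds by the triangle inequality, because both polynomials lie in the $L_1$ ball of radius t.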

48
Agnostically Learning Decision Trees
Sparse $\ell_1$ Regression: Find a sparse polynomial P minimizing $E_x[\,|P(x) - f(x)|\,]$.
[G.-Kalai-Klivans08]: Can get within $\epsilon$ of optimum in poly(t, $1/\epsilon$) iterations.
Algorithm for Sparse $\ell_1$ Regression.
First polynomial time algorithm for Agnostically Learning Sparse Polynomials.
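A toy end-to-end sketch of the gradient-projection loop on three variables. It is not the authors' algorithm: exact Fourier transforms over all 8 inputs stand in for the KM approximation, and the step size, iteration count, and majority target are illustrative assumptions. It only shows how the gradient step and the $L_1$ projection fit together.

```python
from itertools import product, combinations
from math import prod

N = 3
POINTS = list(product([-1, 1], repeat=N))
SUBSETS = [s for k in range(N + 1) for s in combinations(range(N), k)]

def chi(subset, x):
    return prod(x[i] for i in subset)

def transform(values):
    """Exact Fourier coefficients of a truth table (stand-in for KM's query-based estimates)."""
    return {s: sum(v * chi(s, x) for v, x in zip(values, POINTS)) / len(POINTS) for s in SUBSETS}

def evaluate(coeffs):
    return [sum(c * chi(s, x) for s, c in coeffs.items()) for x in POINTS]

def project(coeffs, t):
    """Euclidean projection onto the L1 ball of radius t (sort-based cutoff)."""
    if sum(abs(c) for c in coeffs.values()) <= t:
        return dict(coeffs)
    mags = sorted((abs(c) for c in coeffs.values()), reverse=True)
    running, cutoff = 0.0, 0.0
    for k, m in enumerate(mags, start=1):
        running += m
        if m > (running - t) / k:
            cutoff = (running - t) / k
    return {s: (1 if c > 0 else -1) * max(abs(c) - cutoff, 0.0) for s, c in coeffs.items()}

def l1_error(P, f):
    return sum(abs(p - y) for p, y in zip(P, f)) / len(f)

def gradient_projection(f, t, eta, steps):
    coeffs = {s: 0.0 for s in SUBSETS}  # start from P = 0
    best = l1_error(evaluate(coeffs), f)
    for _ in range(steps):
        P = evaluate(coeffs)
        g = [(y > p) - (y < p) for p, y in zip(P, f)]  # g = sgn(f - P)
        g_hat = transform(g)
        coeffs = project({s: coeffs[s] + eta * g_hat[s] for s in SUBSETS}, t)
        best = min(best, l1_error(evaluate(coeffs), f))
    return coeffs, best

# Target: majority of three bits, whose coefficients have total absolute weight 2.
f = [1 if sum(x) > 0 else -1 for x in POINTS]
coeffs, best = gradient_projection(f, t=2.0, eta=0.1, steps=100)
```

Here the target itself lies in the ball, so the optimum error is 0 and the best iterate should come within roughly the step size of it.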

49
$\ell_1$ Regression from $\ell_2$ Regression
Function $f: D \to [-1,1]$, Orthonormal Basis B.
Sparse $\ell_2$ Regression: Find a t-sparse polynomial P minimizing $E_x[\,|P(x) - f(x)|^2\,]$.
Sparse $\ell_1$ Regression: Find a t-sparse polynomial P minimizing $E_x[\,|P(x) - f(x)|\,]$.
[G.-Kalai-Klivans08]: Given solution to $\ell_2$ Regression, can solve $\ell_1$ Regression.

50
Agnostically Learning DNFs?
Problem: Can we agnostically learn DNFs in polynomial time? (uniform dist. with queries)
Noiseless Setting: Jackson's Harmonic Sieve.
Implies weak learner for depth-3 circuits.
Beyond current Fourier techniques.
