
# Agnostically Learning Decision Trees. Parikshit Gopalan (MSR-Silicon Valley, IITB00), Adam Tauman Kalai (MSR-New England), Adam R. Klivans (UT Austin)



Computational Learning

Learning: Predict $f$ from examples $(x, f(x))$, where $f:\{0,1\}^n \to \{0,1\}$.

Valiant's Model: examples $(x, f(x))$ with $f:\{0,1\}^n \to \{0,1\}$. Assumption: $f$ comes from a nice concept class. Halfspaces: [Figure: points labeled + and - separated by a halfspace]

Valiant's Model: examples $(x, f(x))$ with $f:\{0,1\}^n \to \{0,1\}$. Assumption: $f$ comes from a nice concept class. Decision Trees: [Figure: decision tree querying $X_1$, $X_2$, $X_3$ with 0/1 leaves]

The Agnostic Model [Kearns-Schapire-Sellie94]: examples $(x, f(x))$ with $f:\{0,1\}^n \to \{0,1\}$. No assumptions about $f$. The learner should do as well as the best decision tree. Decision Trees: [Figure: decision tree querying $X_1$, $X_2$, $X_3$ with 0/1 leaves]

Agnostic Model = Noisy Learning. $f:\{0,1\}^n \to \{0,1\}$. [Figure: decision tree + noise = $f$] Concept: Message. Truth table: Encoding. Function $f$: Received word. Coding: Recover the message. Learning: Predict $f$.

Uniform Distribution Learning for Decision Trees
Noiseless Setting:
– No queries: $n^{\log n}$ [Ehrenfeucht-Haussler89].
– With queries: poly(n) [Kushilevitz-Mansour91].
Reconstruction for sparse real polynomials in the $\ell_1$ norm.
Agnostic Setting: Polynomial time, uses queries [G.-Kalai-Klivans08].

The Fourier Transform Method
Powerful tool for uniform distribution learning. Introduced by Linial-Mansour-Nisan.
– Small depth circuits [Linial-Mansour-Nisan89]
– DNFs [Jackson95]
– Decision trees [Kushilevitz-Mansour94, O'Donnell-Servedio06, G.-Kalai-Klivans08]
– Halfspaces, Intersections [Klivans-O'Donnell-Servedio03, Kalai-Klivans-Mansour-Servedio05]
– Juntas [Mossel-O'Donnell-Servedio03]
– Parities [Feldman-G.-Khot-Ponnuswami06]

The Fourier Polynomial
Let $f:\{-1,1\}^n \to \{-1,1\}$. Write $f$ as a polynomial.
– AND: $\tfrac12 + \tfrac12 X_1 + \tfrac12 X_2 - \tfrac12 X_1 X_2$
– Parity: $X_1 X_2$
Parity of $\alpha \subseteq [n]$: $\chi_\alpha(x) = \prod_{i \in \alpha} X_i$.
Write $f(x) = \sum_\alpha c(\alpha)\chi_\alpha(x)$.
– $\sum_\alpha c(\alpha)^2 = 1$.
[Figure: change of basis for the function $f$, from the standard basis to the parities]

The Fourier Polynomial: $c(\alpha)^2$ is the weight of $\alpha$ in the expansion $f(x) = \sum_\alpha c(\alpha)\chi_\alpha(x)$, where $\chi_\alpha(x) = \prod_{i \in \alpha} X_i$ is the parity of $\alpha \subseteq [n]$. The weights sum to $\sum_\alpha c(\alpha)^2 = 1$.
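As a quick check of the expansion on these slides, the sketch below computes every Fourier coefficient of a small function by brute force, using $c(\alpha) = \mathbb{E}_x[f(x)\chi_\alpha(x)]$. The helper names (`chi`, `fourier_coefficients`, `and_gate`) are hypothetical, and `and_gate` assumes the convention implied by the slide's polynomial: the output is -1 only when both inputs are -1.

```python
from itertools import product

def chi(alpha, x):
    """Parity chi_alpha(x): product of x[i] over the indices i in alpha."""
    result = 1
    for i in alpha:
        result *= x[i]
    return result

def fourier_coefficients(f, n):
    """Return {alpha: c(alpha)} with c(alpha) = E_x[f(x) * chi_alpha(x)], x uniform on {-1,1}^n."""
    points = list(product([-1, 1], repeat=n))
    coeffs = {}
    for mask in product([0, 1], repeat=n):
        alpha = tuple(i for i in range(n) if mask[i])
        coeffs[alpha] = sum(f(x) * chi(alpha, x) for x in points) / len(points)
    return coeffs

def and_gate(x):
    # Assumed convention: -1 only when both inputs are -1.
    return -1 if x[0] == -1 and x[1] == -1 else 1

coeffs = fourier_coefficients(and_gate, 2)
total_weight = sum(c ** 2 for c in coeffs.values())
print(coeffs, total_weight)
```

The coefficients come out as 1/2, 1/2, 1/2, -1/2 on the subsets {}, {1}, {2}, {1,2}, matching the AND polynomial, and the squared coefficients sum to 1.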

Low Degree Functions: Most of the weight lies on small subsets. Halfspaces, small-depth circuits. Low-degree algorithm [Linial-Mansour-Nisan]: finds the low-degree Fourier coefficients. Least Squares Regression: Find low-degree $P$ minimizing $\mathbb{E}_x[|P(x) - f(x)|^2]$.
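A minimal sketch of the low-degree idea, assuming random uniform examples $(x, f(x))$: each coefficient on a subset of size at most `d` is estimated by the empirical average of $f(x)\chi_\alpha(x)$. The function names, the sample size, and the choice of 3-bit majority as the target are illustrative assumptions, not taken from the talk.

```python
import random
from itertools import combinations

def low_degree_estimate(samples, n, d):
    """Estimate c(alpha) for every |alpha| <= d as the sample mean of f(x) * chi_alpha(x)."""
    coeffs = {}
    for k in range(d + 1):
        for alpha in combinations(range(n), k):
            total = 0.0
            for x, y in samples:
                parity = 1
                for i in alpha:
                    parity *= x[i]
                total += y * parity
            coeffs[alpha] = total / len(samples)
    return coeffs

def majority3(x):
    # Hypothetical target: majority vote of three +/-1 bits.
    return 1 if sum(x) > 0 else -1

rng = random.Random(0)
samples = []
for _ in range(4000):
    x = tuple(rng.choice([-1, 1]) for _ in range(3))
    samples.append((x, majority3(x)))

est = low_degree_estimate(samples, n=3, d=1)
print(est)
```

For 3-bit majority the true degree-1 coefficients are each 1/2 and the constant term is 0, so the estimates should land close to those values.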

Sparse Functions: Most of the weight lies on a few subsets. Decision trees: $t$ leaves $\Rightarrow$ $O(t)$ subsets. Sparse Algorithm [Kushilevitz-Mansour91]. Sparse $\ell_2$ Regression: Find $t$-sparse $P$ minimizing $\mathbb{E}_x[|P(x) - f(x)|^2]$. Finding large coefficients: Hadamard decoding [Kushilevitz-Mansour91, Goldreich-Levin89].
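Because the parities form an orthonormal basis, the squared error of a polynomial equals the sum of squared coefficient differences (Parseval), so the best $t$-sparse $\ell_2$ fit simply keeps the $t$ largest coefficients. The sketch below illustrates this with a made-up coefficient table; it assumes the coefficients are already known, which is exactly the part the KM algorithm handles in the talk.

```python
def best_t_sparse(coeffs, t):
    """Keep the t largest-magnitude coefficients and drop the rest."""
    top = sorted(coeffs, key=lambda a: abs(coeffs[a]), reverse=True)[:t]
    return {a: coeffs[a] for a in top}

def l2_error(coeffs, sparse):
    """By Parseval, E_x[|P(x) - f(x)|^2] is the sum of squared coefficient differences."""
    return sum((c - sparse.get(a, 0.0)) ** 2 for a, c in coeffs.items())

# Hypothetical coefficients of some real-valued function on two variables.
example = {(): 0.5, (0,): 0.5, (1,): 0.25, (0, 1): -0.125}
sparse = best_t_sparse(example, 2)
error = l2_error(example, sparse)
print(sparse, error)
```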

Agnostic Learning via $\ell_2$ Regression? $f:\{-1,1\}^n \to \{-1,1\}$. [Figure: plot of the target $f$ (axis label +1) and a decision tree on $X_1$, $X_2$, $X_3$ with 0/1 leaves]

Agnostic Learning via $\ell_2$ Regression? $\ell_2$ Regression: Loss $|P(x) - f(x)|^2$. Pay 1 for indecision. Pay 4 for a mistake. [Figure: target $f$ and best tree]

Agnostic Learning via $\ell_1$ Regression? $\ell_1$ Regression [KKMS05]: Loss $|P(x) - f(x)|$. Pay 1 for indecision. Pay 2 for a mistake.
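The "pay 1 / pay 4 / pay 2" figures follow directly from plugging predictions into the two losses when the true label is +1: indecision means predicting 0, a mistake means predicting -1. A small sketch (function names are illustrative):

```python
def l2_loss(p, y):
    """Squared loss |p - y|^2."""
    return (p - y) ** 2

def l1_loss(p, y):
    """Absolute loss |p - y|."""
    return abs(p - y)

label = 1
for name, prediction in [("indecision", 0), ("mistake", -1)]:
    print(name, l2_loss(prediction, label), l1_loss(prediction, label))
```

Under the squared loss a mistake costs four times as much as indecision, while under the absolute loss it costs only twice as much.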

Agnostic Learning via $\ell_1$ Regression. Thm [KKMS05]: $\ell_1$ regression always gives a good predictor. $\ell_1$ regression for low-degree polynomials via Linear Programming. [Figure: target $f$ and best tree]

Agnostically Learning Decision Trees. Sparse $\ell_1$ Regression: Find a $t$-sparse polynomial $P$ minimizing $\mathbb{E}_x[|P(x) - f(x)|]$. Why is this harder? $\ell_2$ is basis independent, $\ell_1$ is not. We don't know the support of $P$. [G.-Kalai-Klivans]: Polynomial time algorithm for Sparse $\ell_1$ Regression.

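The Gradient-Projection Method. Variables: the $c(\alpha)$'s. Constraint: $\sum_\alpha |c(\alpha)| \le t$. Minimize: $\mathbb{E}_x |P(x) - f(x)|$. For $P(x) = \sum_\alpha c(\alpha)\chi_\alpha(x)$ and $Q(x) = \sum_\alpha d(\alpha)\chi_\alpha(x)$: $L_1(P,Q) = \sum_\alpha |c(\alpha) - d(\alpha)|$ and $L_2(P,Q) = \left[\sum_\alpha (c(\alpha) - d(\alpha))^2\right]^{1/2}$.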

The Gradient-Projection Method. Variables: the $c(\alpha)$'s. Constraint: $\sum_\alpha |c(\alpha)| \le t$. Minimize: $\mathbb{E}_x |P(x) - f(x)|$. [Figure: gradient step]

The Gradient-Projection Method. Variables: the $c(\alpha)$'s. Constraint: $\sum_\alpha |c(\alpha)| \le t$. Minimize: $\mathbb{E}_x |P(x) - f(x)|$. [Figure: gradient step followed by projection]

The Gradient: $g(x) = \mathrm{sgn}[f(x) - P(x)]$. Update $P(x) := P(x) + \eta\, g(x)$ for a step size $\eta$. Increase $P(x)$ if low. Decrease $P(x)$ if high. [Figure: $f(x)$ and $P(x)$]
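A minimal sketch of one gradient step on a tiny truth table. It assumes the update carries a step size (called `eta` here, an assumption since the symbol is not legible in the transcript) and represents $P$ and $f$ as lists of values over all inputs.

```python
def sign(v):
    """Return +1, 0, or -1 according to the sign of v."""
    return (v > 0) - (v < 0)

def gradient_step(P, f, eta):
    """Move every P(x) by eta toward f(x): P(x) + eta * sgn[f(x) - P(x)]."""
    return [p + eta * sign(y - p) for p, y in zip(P, f)]

def mean_abs_error(P, f):
    """E_x |P(x) - f(x)| over the listed points."""
    return sum(abs(p - y) for p, y in zip(P, f)) / len(f)

f_vals = [1, -1, 1, 1]          # hypothetical target values
P_vals = [0.0, 0.0, 0.0, 0.0]   # start from the zero polynomial
stepped = gradient_step(P_vals, f_vals, eta=0.5)
print(stepped, mean_abs_error(P_vals, f_vals), mean_abs_error(stepped, f_vals))
```

In this example every value moves halfway toward its label, so the average absolute error drops from 1 to 0.5.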

The Gradient-Projection Method. Variables: the $c(\alpha)$'s. Constraint: $\sum_\alpha |c(\alpha)| \le t$. Minimize: $\mathbb{E}_x |P(x) - f(x)|$. [Figure: gradient step followed by projection]

Projection onto the $L_1$ ball. Currently: $\sum_\alpha |c(\alpha)| > t$. Want: $\sum_\alpha |c(\alpha)| \le t$.

Projection onto the $L_1$ ball. Coefficients with magnitude below the cutoff: set to 0. Coefficients with magnitude above the cutoff: subtract the cutoff from the magnitude.
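The cutoff rule can be sketched as follows. This uses the standard sort-based way of finding the cutoff for Euclidean projection onto an $L_1$ ball; the talk does not say how the cutoff is computed, so that choice and the function name are assumptions.

```python
def project_l1_ball(c, t):
    """Euclidean projection of the vector c onto {v : sum |v_i| <= t}."""
    if sum(abs(v) for v in c) <= t:
        return list(c)  # already inside the ball
    # Find the cutoff theta so that the shrunken magnitudes sum to exactly t.
    running = 0.0
    theta = 0.0
    for k, m in enumerate(sorted((abs(v) for v in c), reverse=True), start=1):
        running += m
        candidate = (running - t) / k
        if m - candidate > 0:
            theta = candidate
        else:
            break
    out = []
    for v in c:
        shrunk = abs(v) - theta
        if shrunk <= 0:
            out.append(0.0)                            # below cutoff: set to 0
        else:
            out.append(shrunk if v > 0 else -shrunk)   # above cutoff: subtract
    return out

projected = project_l1_ball([3.0, 1.0, -0.5], 2.0)
print(projected)
```

Here the cutoff works out to 1, so the entry 3.0 shrinks to 2.0 and the two smaller entries are zeroed, leaving a vector whose absolute values sum to exactly 2.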

Analysis of Gradient-Projection [Zinkevich03]. Progress measure: squared $L_2$ distance from the optimum $P^*$. Key Equation: $|P_t - P^*|^2 - |P_{t+1} - P^*|^2 \ge 2\eta\,(L(P_t) - L(P^*)) - \eta^2$, where $\eta$ is the step size. The left side is the progress made in this step; $L(P_t) - L(P^*)$ is how suboptimal the current solution is. Within $\varepsilon$ of optimal in $1/\varepsilon^2$ iterations. A good $L_2$ approximation to $P_t$ suffices.
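A short derivation of the iteration count, reading the key equation with a step size $\eta$ (an assumption about the symbols lost in the transcript) as $|P_t - P^*|^2 - |P_{t+1} - P^*|^2 \ge 2\eta\,(L(P_t) - L(P^*)) - \eta^2$. Summing over $t = 0, \dots, T-1$ telescopes the left side:

$$
|P_0 - P^*|^2 \;\ge\; |P_0 - P^*|^2 - |P_T - P^*|^2 \;\ge\; 2\eta \sum_{t=0}^{T-1} \bigl(L(P_t) - L(P^*)\bigr) - T\eta^2 ,
$$

$$
\min_{0 \le t < T} L(P_t) - L(P^*) \;\le\; \frac{|P_0 - P^*|^2}{2\eta T} + \frac{\eta}{2} .
$$

Taking $\eta = \varepsilon$ and $T = |P_0 - P^*|^2 / \varepsilon^2$ makes the right side equal to $\varepsilon$, which matches the $1/\varepsilon^2$ iteration count when the initial distance is treated as a constant.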

[Figure: $f(x)$ and $P(x)$, with the gradient and projection steps] $g(x) = \mathrm{sgn}[f(x) - P(x)]$.

The Gradient: $g(x) = \mathrm{sgn}[f(x) - P(x)]$. Compute the sparse approximation $g' = \mathrm{KM}(g)$. Is $g'$ a good $L_2$ approximation to $g$? No. Initially $g = f$, and $L_2(g, g')$ can be as large as 1. [Figure: $f(x)$ and $P(x)$]

Sparse $\ell_1$ Regression. Variables: the $c(\alpha)$'s. Constraint: $\sum_\alpha |c(\alpha)| \le t$. Minimize: $\mathbb{E}_x |P(x) - f(x)|$. Approximate gradient.

Sparse $\ell_1$ Regression. Variables: the $c(\alpha)$'s. Constraint: $\sum_\alpha |c(\alpha)| \le t$. Minimize: $\mathbb{E}_x |P(x) - f(x)|$. Projection compensates.

KM as $\ell_2$ Approximation. The KM Algorithm: Input: $g:\{-1,1\}^n \to \{-1,1\}$, and $t$. Output: A $t$-sparse polynomial $g'$ minimizing $\mathbb{E}_x[|g(x) - g'(x)|^2]$. Run Time: poly(n, t).

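KM as $L_\infty$ Approximation. The KM Algorithm: Input: A Boolean function $g = \sum_\alpha c(\alpha)\chi_\alpha(x)$ and an error bound $\varepsilon$. Output: Approximation $g' = \sum_\alpha c'(\alpha)\chi_\alpha(x)$ such that $|c(\alpha) - c'(\alpha)| \le \varepsilon$ for all $\alpha \subseteq [n]$. Run Time: poly(n, $1/\varepsilon$).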

KM as $L_\infty$ Approximation. 1) Identify coefficients larger than $\varepsilon$. Only $1/\varepsilon^2$ coefficients can be this large, since $\sum_\alpha c(\alpha)^2 = 1$. 2) Estimate via sampling, set rest to 0.
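The identification step can be sketched as a prefix search in the style of Kushilevitz-Mansour: a bucket holds all subsets whose indicator vector starts with a given prefix, and a bucket is split only while its total weight is at least the square of the threshold. This is a simplified illustration, not the algorithm as stated in the talk: bucket weights and final coefficients are computed by exact enumeration (feasible only for small $n$) where the real algorithm would estimate them by sampling with queries. All names are hypothetical, and `eps` stands for the error bound.

```python
from itertools import product

def chi(mask, x):
    """Parity of the coordinates of x selected by the 0/1 tuple mask."""
    r = 1
    for bit, xi in zip(mask, x):
        if bit:
            r *= xi
    return r

def bucket_weight(f, n, prefix):
    """Sum of c(alpha)^2 over all alpha whose indicator vector starts with prefix."""
    k = len(prefix)
    heads = list(product([-1, 1], repeat=k))
    tails = list(product([-1, 1], repeat=n - k))
    total = 0.0
    for tail in tails:
        # f_prefix(tail) = E_head[f(head, tail) * chi_prefix(head)]
        restricted = sum(f(head + tail) * chi(prefix, head) for head in heads) / len(heads)
        total += restricted ** 2
    return total / len(tails)

def heavy_coefficients(f, n, eps):
    """Return {indicator: c(alpha)} for every alpha with |c(alpha)| >= eps."""
    live = [()]
    for _ in range(n):
        # Extend each surviving prefix by one bit; keep only heavy buckets.
        live = [p + (b,) for p in live for b in (0, 1)
                if bucket_weight(f, n, p + (b,)) >= eps ** 2]
    points = list(product([-1, 1], repeat=n))
    return {m: sum(f(x) * chi(m, x) for x in points) / len(points) for m in live}

def majority3(x):
    # Hypothetical test function: majority vote of three +/-1 bits.
    return 1 if sum(x) > 0 else -1

heavy = heavy_coefficients(majority3, 3, eps=0.4)
print(heavy)
```

For 3-bit majority this returns the three singleton coefficients (each 1/2) and the coefficient on all three variables (-1/2); every other bucket is pruned as soon as its weight falls below the threshold.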

Projection Preserves $L_\infty$ Distance. $L_\infty$ distance at most $2\varepsilon$ after projection. Both lines stop within $\varepsilon$ of each other. Else, Blue dominates Red. [Figure: blue and red coefficient profiles with their cutoff lines]

Projection Preserves $L_\infty$ Distance. $L_\infty$ distance at most $2\varepsilon$ after projection. Projecting onto the $L_1$ ball preserves $L_\infty$ distance up to a factor of 2.

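Sparse $\ell_1$ Regression. Variables: the $c(\alpha)$'s. Constraint: $\sum_\alpha |c(\alpha)| \le t$. Minimize: $\mathbb{E}_x |P(x) - f(x)|$. For the exact iterate $P$ and the approximate iterate $P'$: $L_\infty(P, P') \le 2\varepsilon$, $L_1(P, P') \le 2t$, and therefore $L_2(P, P')^2 \le 4\varepsilon t$. Can take $\varepsilon = 1/t^2$.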
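The step from the two coefficient-wise bounds to the $L_2$ bound can be spelled out as follows, assuming the first bound is on the largest coefficient difference ($L_\infty$) and the second on the sum of differences ($L_1$), with $c(\alpha)$ and $c'(\alpha)$ denoting the coefficients of $P$ and $P'$:

$$
L_2(P, P')^2 \;=\; \sum_\alpha \bigl(c(\alpha) - c'(\alpha)\bigr)^2
\;\le\; \max_\alpha \bigl|c(\alpha) - c'(\alpha)\bigr| \cdot \sum_\alpha \bigl|c(\alpha) - c'(\alpha)\bigr|
\;=\; L_\infty(P, P') \cdot L_1(P, P')
\;\le\; 2\varepsilon \cdot 2t \;=\; 4\varepsilon t .
$$

The $L_1$ bound of $2t$ holds because both polynomials lie in the $L_1$ ball of radius $t$. With $\varepsilon = 1/t^2$ the squared $L_2$ distance is at most $4/t$.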

Agnostically Learning Decision Trees. Sparse $\ell_1$ Regression: Find a sparse polynomial $P$ minimizing $\mathbb{E}_x[|P(x) - f(x)|]$. [G.-Kalai-Klivans08]: Can get within $\varepsilon$ of optimum in poly(t, $1/\varepsilon$) iterations. Algorithm for Sparse $\ell_1$ Regression. First polynomial time algorithm for Agnostically Learning Sparse Polynomials.
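Putting the pieces together, the sketch below runs gradient-projection end to end on a 3-variable example. It is a toy illustration under stated assumptions, not the authors' implementation: the gradient's coefficients are computed by exact enumeration instead of the KM algorithm, the step size `eta`, iteration count, and noisy target are made up, and all function names are hypothetical.

```python
from itertools import product

def parity(mask, x):
    """chi_alpha(x), with alpha given as a 0/1 indicator tuple."""
    r = 1
    for bit, xi in zip(mask, x):
        if bit:
            r *= xi
    return r

def to_coeffs(values, points, masks):
    """c(alpha) = E_x[values(x) * chi_alpha(x)] under the uniform distribution."""
    return [sum(v * parity(m, x) for v, x in zip(values, points)) / len(points)
            for m in masks]

def to_values(coeffs, points, masks):
    """P(x) = sum_alpha c(alpha) * chi_alpha(x) at every point x."""
    return [sum(c * parity(m, x) for c, m in zip(coeffs, masks)) for x in points]

def project_l1_ball(c, t):
    """Euclidean projection onto {v : sum |v_i| <= t} by subtracting a cutoff."""
    if sum(abs(v) for v in c) <= t:
        return list(c)
    theta, running = 0.0, 0.0
    for k, m in enumerate(sorted((abs(v) for v in c), reverse=True), start=1):
        running += m
        if m - (running - t) / k > 0:
            theta = (running - t) / k
        else:
            break
    return [max(abs(v) - theta, 0.0) * (1 if v > 0 else -1) for v in c]

def l1_loss(P, f):
    """E_x |P(x) - f(x)| over all points."""
    return sum(abs(p - y) for p, y in zip(P, f)) / len(f)

def sparse_l1_regression(f, n, t, eta, steps):
    """Gradient-projection over the coefficients; returns the best loss seen and its coefficients."""
    points = list(product([-1, 1], repeat=n))
    masks = list(product([0, 1], repeat=n))
    f_vals = [f(x) for x in points]
    coeffs = [0.0] * len(masks)
    best_loss, best_coeffs = l1_loss(to_values(coeffs, points, masks), f_vals), coeffs
    for _ in range(steps):
        P = to_values(coeffs, points, masks)
        g = [(y > p) - (y < p) for p, y in zip(P, f_vals)]   # sgn[f(x) - P(x)]
        g_hat = to_coeffs(g, points, masks)                  # exact transform, stands in for KM
        coeffs = project_l1_ball([c + eta * gh for c, gh in zip(coeffs, g_hat)], t)
        loss = l1_loss(to_values(coeffs, points, masks), f_vals)
        if loss < best_loss:
            best_loss, best_coeffs = loss, coeffs
    return best_loss, best_coeffs

def noisy_majority(x):
    # Majority of three bits with one label flipped (the "noise").
    if x == (1, 1, 1):
        return -1
    return 1 if sum(x) > 0 else -1

best_loss, best_coeffs = sparse_l1_regression(noisy_majority, n=3, t=2.0, eta=0.1, steps=400)
print(best_loss, sum(abs(c) for c in best_coeffs))
```

Plain majority has coefficient $L_1$ norm 2 and $\ell_1$ loss 0.25 against this noisy target, so the best loss found should come out at or near that value while the coefficients stay inside the ball of radius `t`.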

$\ell_1$ Regression from $\ell_2$ Regression. Function $f: D \to [-1,1]$, Orthonormal Basis $B$. Sparse $\ell_2$ Regression: Find a $t$-sparse polynomial $P$ minimizing $\mathbb{E}_x[|P(x) - f(x)|^2]$. Sparse $\ell_1$ Regression: Find a $t$-sparse polynomial $P$ minimizing $\mathbb{E}_x[|P(x) - f(x)|]$. [G.-Kalai-Klivans08]: Given a solution to $\ell_2$ Regression, can solve $\ell_1$ Regression.

Agnostically Learning DNFs? Problem: Can we agnostically learn DNFs in polynomial time (uniform distribution, with queries)? Noiseless Setting: Jackson's Harmonic Sieve. Implies weak learner for depth-3 circuits. Beyond current Fourier techniques.
