Agnostically Learning Decision Trees
Parikshit Gopalan (MSR Silicon Valley), Adam Tauman Kalai (MSR New England), Adam R. Klivans (UT Austin)

Presentation transcript:

Agnostically Learning Decision Trees
Parikshit Gopalan (MSR Silicon Valley), Adam Tauman Kalai (MSR New England), Adam R. Klivans (UT Austin)
[Figure: a decision tree querying X1, X2, X3, with 0/1 leaves.]

Computational Learning

Learning: predict f from examples (x, f(x)), where f: {0,1}^n → {0,1}.

Valiant's Model: examples (x, f(x)), f: {0,1}^n → {0,1}. Assumption: f comes from a nice concept class, e.g. halfspaces.

Valiant's Model: examples (x, f(x)), f: {0,1}^n → {0,1}. Assumption: f comes from a nice concept class, e.g. decision trees. [Figure: a decision tree on X1, X2, X3.]

The Agnostic Model [Kearns-Schapire-Sellie '94]: examples (x, f(x)), f: {0,1}^n → {0,1}. No assumptions about f. The learner should do as well as the best decision tree. [Figure: a decision tree on X1, X2, X3.]

Agnostic Model = Noisy Learning. Think of f: {0,1}^n → {0,1} as a concept plus noise. Coding analogy: concept = message, truth table = encoding, f = received word. Coding: recover the message. Learning: predict f. [Figure: a decision tree on X1, X2, X3.]

Uniform Distribution Learning of Decision Trees
Noiseless setting:
– No queries: n^{log n} time [Ehrenfeucht-Haussler '89].
– With queries: poly(n) time [Kushilevitz-Mansour '91].
Agnostic setting: polynomial time, uses queries [G.-Kalai-Klivans '08]. Key primitive: reconstruction of sparse real polynomials in the ℓ1 norm.

The Fourier Transform Method
Powerful tool for uniform distribution learning, introduced by Linial-Mansour-Nisan.
– Small-depth circuits [Linial-Mansour-Nisan '89]
– DNFs [Jackson '95]
– Decision trees [Kushilevitz-Mansour '94, O'Donnell-Servedio '06, G.-Kalai-Klivans '08]
– Halfspaces, intersections [Klivans-O'Donnell-Servedio '03, Kalai-Klivans-Mansour-Servedio '05]
– Juntas [Mossel-O'Donnell-Servedio '03]
– Parities [Feldman-G.-Khot-Ponnuswami '06]

The Fourier Polynomial
Let f: {-1,1}^n → {-1,1}. Write f as a polynomial:
– AND: ½ + ½x1 + ½x2 − ½x1x2
– Parity: x1x2
Parity χ_α for α ⊆ [n]: χ_α(x) = ∏_{i ∈ α} x_i.
Write f(x) = Σ_α c(α) χ_α(x), with Σ_α c(α)² = 1. Call c(α) the weight of α.
[Figure: f in the standard basis vs. the basis of parities.]
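
To make the expansion concrete, here is a small brute-force sketch (not from the talk; the function name fourier_coefficients is illustrative) that computes c(α) = E_x[f(x)·χ_α(x)] for every α ⊆ [n] when n is tiny, and checks Parseval's identity Σ_α c(α)² = 1.

# Brute-force Fourier expansion of a Boolean function f: {-1,1}^n -> {-1,1}.
# Illustrative only: enumerates all 2^n inputs and all 2^n subsets alpha.
from itertools import product, combinations

def fourier_coefficients(f, n):
    points = list(product([-1, 1], repeat=n))
    coeffs = {}
    for size in range(n + 1):
        for alpha in combinations(range(n), size):
            # c(alpha) = E_x[ f(x) * chi_alpha(x) ], chi_alpha(x) = prod_{i in alpha} x_i
            total = 0
            for x in points:
                chi = 1
                for i in alpha:
                    chi *= x[i]
                total += f(x) * chi
            coeffs[alpha] = total / len(points)
    return coeffs

if __name__ == "__main__":
    maj3 = lambda x: 1 if sum(x) > 0 else -1            # majority of 3 bits
    c = fourier_coefficients(maj3, 3)
    print({a: v for a, v in c.items() if v != 0})       # nonzero coefficients
    print("Parseval:", sum(v * v for v in c.values()))  # should print 1.0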

Low-Degree Functions
Low-degree functions: most of the Fourier weight lies on small subsets. Examples: halfspaces, small-depth circuits.
Low-degree algorithm [Linial-Mansour-Nisan]: finds the low-degree Fourier coefficients.
Least-squares regression: find a low-degree polynomial P minimizing E_x[|P(x) − f(x)|²].
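
A minimal sketch of the low-degree approach (an illustration, not the talk's code; it assumes only uniform random examples): estimate every coefficient of degree at most d by sampling, then predict with the sign of the resulting polynomial.

# Low-degree algorithm sketch: estimate all Fourier coefficients c(alpha) with
# |alpha| <= d from uniform random examples (x, f(x)), then predict with the sign
# of the resulting low-degree polynomial.
from itertools import combinations

def estimate_low_degree_coeffs(examples, n, d):
    """examples: list of (x, y) with x in {-1,1}^n and y in {-1,1}."""
    coeffs = {}
    for size in range(d + 1):
        for alpha in combinations(range(n), size):
            est = 0.0
            for x, y in examples:
                chi = 1
                for i in alpha:
                    chi *= x[i]
                est += y * chi
            coeffs[alpha] = est / len(examples)   # estimate of E_x[f(x) chi_alpha(x)]
    return coeffs

def predict(coeffs, x):
    total = 0.0
    for alpha, c in coeffs.items():
        chi = 1
        for i in alpha:
            chi *= x[i]
        total += c * chi
    return 1 if total >= 0 else -1

The sample complexity is governed by the roughly n^d coefficients being estimated.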

Sparse ℓ2 Regression
Sparse functions: most of the Fourier weight lies on a few subsets. Decision trees: t leaves ⇒ O(t) subsets.
Sparse algorithm [Kushilevitz-Mansour '91].
Sparse ℓ2 regression: find a t-sparse P minimizing E_x[|P(x) − f(x)|²].
Finding large coefficients: Hadamard decoding [Kushilevitz-Mansour '91, Goldreich-Levin '89].
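
By Parseval, the best t-sparse ℓ2 approximation simply keeps the t coefficients of largest magnitude; KM finds them with poly(n, t) membership queries. A toy stand-in (not from the talk) that assumes the full coefficient table is available, e.g. from the brute-force sketch above:

# Toy sparse l2 regression: given all Fourier coefficients of f, keep the t of
# largest magnitude. By Parseval this minimizes E_x[|P(x) - f(x)|^2] over t-sparse P.
# (KM achieves the same with membership queries, never writing down all 2^n coefficients.)
def best_t_sparse(coeffs, t):
    top = sorted(coeffs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:t]
    return dict(top)

# Example usage: keep the 4 heaviest coefficients of a coefficient dict `coeffs`.
# sparse_p = best_t_sparse(coeffs, 4)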

Agnostic Learning via ℓ2 Regression?
f: {-1,1}^n → {-1,1}.
[Figure: the values of f, labelled +1 / −1.]

Agnostic Learning via ℓ2 Regression?
ℓ2 regression: loss |P(x) − f(x)|². Pay 1 for indecision (P(x) = 0), pay 4 for a mistake.
ℓ1 regression [KKMS05]: loss |P(x) − f(x)|. Pay 1 for indecision, pay 2 for a mistake.
[Figure: target f vs. the best tree.]

Agnostic Learning via ℓ1 Regression
Thm [KKMS05]: ℓ1 regression always gives a good predictor.
ℓ1 regression for low-degree polynomials via linear programming.
[Figure: target f vs. the best tree.]
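
One concrete way to carry out the ℓ1-regression step is as a linear program: minimize Σ_i z_i over coefficients c and slacks z with z_i ≥ |P(x_i) − y_i|. A hedged sketch (not the authors' code; it uses scipy, and the matrix Phi of parity-basis values on the sample is an assumed input):

# l1 polynomial regression as a linear program (sketch).
# Variables: coefficients c (free) and slacks z >= 0 with z_i >= |(Phi c)_i - y_i|.
import numpy as np
from scipy.optimize import linprog

def l1_regression(Phi, y):
    """Phi: (m, k) basis-function values on the sample; y: (m,) labels in {-1,+1}."""
    m, k = Phi.shape
    cost = np.concatenate([np.zeros(k), np.ones(m)])   # minimize sum of slacks
    A_ub = np.block([[Phi, -np.eye(m)],                #  Phi c - z <=  y
                     [-Phi, -np.eye(m)]])              # -Phi c - z <= -y
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * k + [(0, None)] * m
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:k]                                   # learned coefficients

For degree-d regression, Phi has one column per parity χ_α with |α| ≤ d, evaluated on each sample point.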

Agnostically Learning Decision Trees
Sparse ℓ1 regression: find a t-sparse polynomial P minimizing E_x[|P(x) − f(x)|].
Why this is harder: ℓ2 is basis independent, ℓ1 is not; we don't know the support of P.
[G.-Kalai-Klivans]: a polynomial-time algorithm for sparse ℓ1 regression.

The Gradient-Projection Method
Variables: the coefficients c(α). Constraint: Σ_α |c(α)| ≤ t. Minimize: E_x|P(x) − f(x)|, where P(x) = Σ_α c(α) χ_α(x).
For Q(x) = Σ_α d(α) χ_α(x), define
L1(P, Q) = Σ_α |c(α) − d(α)|,  L2(P, Q) = [Σ_α (c(α) − d(α))²]^{1/2}.

The Gradient-Projection Method
Variables: the coefficients c(α). Constraint: Σ_α |c(α)| ≤ t. Minimize: E_x|P(x) − f(x)|.
[Figure: a gradient step.]

The Gradient-Projection Method
Variables: the coefficients c(α). Constraint: Σ_α |c(α)| ≤ t. Minimize: E_x|P(x) − f(x)|.
[Figure: a gradient step, then projection back onto the L1 ball.]

The Gradient
g(x) = sgn[f(x) − P(x)]. Update P(x) := P(x) + η·g(x) for a small step size η: increase P(x) where it is too low, decrease P(x) where it is too high.
[Figure: f and the current P.]
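
In coefficient space, adding η·g(x) to P(x) means adding η·ĝ(α) to every coefficient c(α), where ĝ(α) = E_x[g(x) χ_α(x)]. A sketch of one such step (illustrative only; it estimates ĝ(α) from fresh random samples over a given candidate support, whereas the talk's algorithm obtains a sparse approximation of g via KM):

# One gradient step for the loss E_x|P(x) - f(x)|: with g(x) = sgn(f(x) - P(x)),
# the step P := P + eta * g adds eta * ghat(alpha) to each coefficient c(alpha).
import random

def chi(alpha, x):
    v = 1
    for i in alpha:
        v *= x[i]
    return v

def eval_poly(coeffs, x):
    return sum(c * chi(alpha, x) for alpha, c in coeffs.items())

def gradient_step(coeffs, f, n, support, eta, num_samples=2000):
    """coeffs: dict alpha -> c(alpha); support: candidate subsets alpha to update."""
    xs = [tuple(random.choice([-1, 1]) for _ in range(n)) for _ in range(num_samples)]
    gvals = [1 if f(x) - eval_poly(coeffs, x) >= 0 else -1 for x in xs]
    new = dict(coeffs)
    for alpha in support:
        ghat = sum(g * chi(alpha, x) for g, x in zip(gvals, xs)) / num_samples
        new[alpha] = new.get(alpha, 0.0) + eta * ghat
    return new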

Projection onto the L1 Ball
Currently Σ_α |c(α)| > t; want Σ_α |c(α)| ≤ t.
[Figure: the coefficient profile before projection.]

Projection onto the L1 Ball
Pick a cutoff. Below the cutoff: set the coefficient to 0. Above the cutoff: subtract the cutoff.
[Figure: soft-thresholding of the coefficient profile.]
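
This is the standard Euclidean projection onto the L1 ball of radius t, computed by soft thresholding: choose the cutoff so that the thresholded vector has L1 norm exactly t. A minimal sketch of the standard water-filling procedure (illustrative; the name project_onto_l1_ball is not from the talk):

# Euclidean projection of a coefficient vector v onto the l1 ball of radius t:
# choose a cutoff theta so soft-thresholding by theta yields l1 norm exactly t,
# zero out entries with |v_i| <= theta, and shrink the rest toward 0 by theta.
import numpy as np

def project_onto_l1_ball(v, t):
    if np.abs(v).sum() <= t:
        return v.copy()                       # already inside the ball
    u = np.sort(np.abs(v))[::-1]              # sorted magnitudes, descending
    css = np.cumsum(u)
    ks = np.arange(1, len(u) + 1)
    k = ks[u > (css - t) / ks].max()          # largest k with u[k-1] above its cutoff
    theta = (css[k - 1] - t) / k              # the cutoff
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

When the input already satisfies Σ_α |c(α)| ≤ t it is returned unchanged; otherwise the output lies on the boundary of the ball.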

Analysis of Gradient-Projection [Zinkevich '03]
Progress measure: squared L2 distance from the optimum P*.
Key equation: |P_t − P*|² − |P_{t+1} − P*|² ≥ 2η·(L(P_t) − L(P*)) − η².
(The left-hand side is the progress made in this step; L(P_t) − L(P*) measures how suboptimal the current solution is.)
Within ε of optimal in 1/ε² iterations. A good L2 approximation to P_t suffices.

[Figure: the gradient step g(x) = sgn[f(x) − P(x)] followed by projection.]

The Gradient
g(x) = sgn[f(x) − P(x)].
Compute a sparse approximation g̃ = KM(g). Is g̃ a good L2 approximation to g? No: initially P = 0, so g = f, and L2(g, g̃) can be as large as 1.
[Figure: f and the current P.]

Sparse ℓ1 Regression
Variables: the coefficients c(α). Constraint: Σ_α |c(α)| ≤ t. Minimize: E_x|P(x) − f(x)|.
Use the approximate gradient g̃ = KM(g) in place of g.

Sparse ℓ1 Regression
Variables: the coefficients c(α). Constraint: Σ_α |c(α)| ≤ t. Minimize: E_x|P(x) − f(x)|.
The projection step compensates for the error in the approximate gradient.

KM as ℓ2 Approximation
The KM algorithm. Input: g: {-1,1}^n → {-1,1} and t. Output: a t-sparse polynomial g̃ minimizing E_x[|g̃(x) − g(x)|²]. Run time: poly(n, t).

KM as ℓ∞ Approximation
The KM algorithm. Input: a Boolean function g = Σ_α c(α) χ_α(x) and an error bound ε. Output: an approximation g̃ = Σ_α c̃(α) χ_α(x) such that |c̃(α) − c(α)| ≤ ε for all α ⊆ [n]. Run time: poly(n, 1/ε).

KM as ℓ∞ Approximation
1) Identify the coefficients larger than ε (there are at most 1/ε² of them).
2) Estimate them via sampling; set the rest to 0.
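
The estimation step only needs about O(log(1/δ)/ε²) uniform random samples per identified coefficient, by a Chernoff bound. A sketch of that step alone (illustrative; the identification step is the KM / Goldreich-Levin query procedure and is not reproduced here):

# Estimate the Fourier coefficient c(alpha) = E_x[ g(x) * chi_alpha(x) ] of a
# Boolean function g to additive accuracy eps, with failure probability delta,
# using O(log(1/delta)/eps^2) uniform random samples.
import math, random

def estimate_coefficient(g, n, alpha, eps, delta=0.01):
    m = int(math.ceil(8 * math.log(2 / delta) / eps ** 2))
    total = 0
    for _ in range(m):
        x = tuple(random.choice([-1, 1]) for _ in range(n))
        chi = 1
        for i in alpha:
            chi *= x[i]
        total += g(x) * chi
    return total / m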

Projection Preserves L1 Distance
L1 distance at most 2εt after projection.
[Figure: two coefficient profiles and their cutoffs; the two cutoff lines stop within ε of each other, since otherwise the blue profile would dominate the red one.]

Projection Preserves L1 Distance
Projecting onto the L1 ball does not increase the L1 distance, so the L1 distance remains at most 2εt after projection.

Sparse ℓ1 Regression
Variables: the coefficients c(α). Constraint: Σ_α |c(α)| ≤ t. Minimize: E_x|P(x) − f(x)|.
For the approximate update P̃ versus the true update P: L∞(P, P̃) ≤ 2ε and L1(P, P̃) ≤ 2εt, hence L2(P, P̃)² ≤ 4ε²t. Can take ε = 1/t².

Agnostically Learning Decision Trees
Sparse ℓ1 regression: find a sparse polynomial P minimizing E_x[|P(x) − f(x)|].
[G.-Kalai-Klivans '08]: can get within ε of the optimum in poly(t, 1/ε) iterations.
This gives an algorithm for sparse ℓ1 regression, and the first polynomial-time algorithm for agnostically learning sparse polynomials.
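
Putting the pieces together, a toy end-to-end sketch of the projected-gradient loop for very small n (illustrative only: brute-force Fourier coefficients of the gradient stand in for the KM query algorithm, and project_onto_l1_ball is the projection sketch given earlier):

# Toy sparse l1 regression by projected gradient descent (exhaustive over {-1,1}^n,
# so only sensible for very small n). Each iteration: g = sgn(f - P); compute g's
# Fourier coefficients (brute force here, KM with queries in the real algorithm);
# take a gradient step; project the coefficient vector back onto the l1 ball.
import numpy as np
from itertools import product, combinations

def sparse_l1_regression(f, n, t, eta=0.05, iters=500):
    points = np.array(list(product([-1, 1], repeat=n)), dtype=float)
    fvals = np.array([f(tuple(int(v) for v in x)) for x in points], dtype=float)
    alphas = [a for k in range(n + 1) for a in combinations(range(n), k)]
    chis = np.array([[x[list(a)].prod() for a in alphas] for x in points])  # chi_alpha(x)
    c = np.zeros(len(alphas))
    for _ in range(iters):
        P = chis @ c
        g = np.where(fvals - P >= 0, 1.0, -1.0)        # gradient direction sgn(f - P)
        ghat = chis.T @ g / len(points)                # Fourier coefficients of g
        c = project_onto_l1_ball(c + eta * ghat, t)    # see projection sketch above
    return dict(zip(alphas, c))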

ℓ1 Regression from ℓ2 Regression
Function f: D → [-1,1], orthonormal basis B.
Sparse ℓ2 regression: find a t-sparse polynomial P minimizing E_x[|P(x) − f(x)|²].
Sparse ℓ1 regression: find a t-sparse polynomial P minimizing E_x[|P(x) − f(x)|].
[G.-Kalai-Klivans '08]: given a solution to sparse ℓ2 regression, one can solve sparse ℓ1 regression.

Agnostically Learning DNFs?
Problem: can we agnostically learn DNFs in polynomial time (uniform distribution, with queries)?
Noiseless setting: Jackson's Harmonic Sieve.
Agnostically learning DNFs implies a weak learner for depth-3 circuits, which seems beyond current Fourier techniques.