Harmonic Analysis in Learning Theory Jeff Jackson Duquesne University

Themes Harmonic analysis is central to learning-theoretic results in a wide variety of models –Results are generally the strongest known for learning with respect to the uniform distribution. Work on learning problems has led to some new harmonic results –Spectral properties of Boolean function classes –Algorithms for approximating Boolean functions

Uniform Learning Model Boolean Function Class F (e.g., DNF); Example Oracle EX(f); Target function f : {0,1}^n → {0,1}; Uniform Random Examples; Learning Algorithm A; Accuracy ε > 0; Hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

Circuit Classes Constant-depth AND/OR circuits (AC^0 without the polynomial-size restriction; call this CDC): d levels of AND/OR gates over the inputs v_1, v_2, v_3, ..., v_n, with negations allowed. DNF: depth-2 circuit with OR at the root.

Decision Trees [Figure: a small decision tree over variables v_1, v_2, v_3, v_4 with v_3 at the root, evaluated on the input x = 11001: the root reads x_3 = 0, the next node reads x_1 = 1, and the leaf reached gives f(x) = 1.]

Function Size Each function representation has a natural size measure: –CDC, DNF: # of gates –DT: # of leaves. The size s_F(f) of f with respect to class F is the size of the smallest representation of f within F –For all Boolean f, s_CDC(f) ≤ s_DNF(f) ≤ s_DT(f)

Efficient Uniform Learning Model Boolean Function Class F (e.g., DNF); Example Oracle EX(f); Target function f : {0,1}^n → {0,1}; Uniform Random Examples; Learning Algorithm A; Accuracy ε > 0; Time poly(n, s_F, 1/ε); Hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

Harmonic-Based Uniform Learning [LMN]: constant-depth circuits are quasi-efficiently (n^polylog(s/ε)-time) uniform learnable. [BT]: monotone Boolean functions are uniform learnable in time roughly 2^(√n · log n) –Monotone: for all x, i: f(x|_{x_i=0}) ≤ f(x|_{x_i=1}) –Also exponential in 1/ε (so assumes ε constant) –But independent of any size measure

Notation Assume f : {0,1}^n → {-1,1}. For all a in {0,1}^n, χ_a(x) ≡ (-1)^(a · x). For all a in {0,1}^n, the Fourier coefficient f̂(a) of f at a is f̂(a) ≡ E_{x~U}[f(x) χ_a(x)]. Sometimes write, e.g., f̂({1}) for f̂(10…0)
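
For concreteness, here is a small Python sketch (not from the talk) that computes these Fourier coefficients exactly by enumeration; the toy function majority3 and the choice n = 3 are illustrative assumptions.

# Exact Fourier coefficients f^(a) = E_x[f(x) chi_a(x)] of a small Boolean
# function, computed by direct enumeration of the cube.  Toy example only.
from itertools import product

def chi(a, x):
    """Parity character chi_a(x) = (-1)^(a . x) for 0/1 tuples a and x."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def fourier_coefficient(f, a, n):
    """Exact f^(a) = E_{x uniform}[f(x) chi_a(x)] for f mapping bits to {-1,+1}."""
    return sum(f(x) * chi(a, x) for x in product((0, 1), repeat=n)) / 2 ** n

def majority3(x):
    """MAJ on 3 bits, with the output encoded in {-1,+1}."""
    return 1 if sum(x) >= 2 else -1

n = 3
for a in product((0, 1), repeat=n):
    print(a, fourier_coefficient(majority3, a, n))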

Fourier Properties of Classes In each case below, almost all of the Fourier weight lies on the coefficients indexed by S, i.e., Σ_{a∉S} f̂²(a) < ε. [LMN]: f is a constant-depth circuit of depth d and S = { a : |a| < log^d(s/ε) } (|a| ≡ # of 1’s in a). [BT]: f is a monotone Boolean function and S = { a : |a| < √n/ε }

Spectral Properties

Proof Techniques [LMN]: Håstad’s Switching Lemma + harmonic analysis. [BT]: Based on [KKL] –Define AS(f) ≡ n · Pr_{x,i}[f(x|_{x_i=0}) ≠ f(x|_{x_i=1})] –If S = {a : |a| < AS(f)/ε} then Σ_{a∉S} f̂²(a) < ε –For monotone f, harmonic analysis + Cauchy-Schwarz shows AS(f) ≤ √n –Note: this is tight for MAJ
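
As a quick illustration of the average-sensitivity definition above (and of the remark that the √n bound is tight for MAJ), here is a Monte Carlo sketch; the choice of n and the number of trials are arbitrary, and the printed value is only an estimate.

# Estimate AS(f) = n * Pr_{x,i}[f(x|_{x_i=0}) != f(x|_{x_i=1})] for majority
# and compare it with the sqrt(n) bound for monotone functions.  Toy sketch.
import math
import random

n, trials = 101, 20000

def maj(x):
    return 1 if sum(x) > n // 2 else -1

flips = 0
for _ in range(trials):
    x = [random.randint(0, 1) for _ in range(n)]
    i = random.randrange(n)
    x0, x1 = list(x), list(x)
    x0[i], x1[i] = 0, 1
    flips += maj(x0) != maj(x1)

print("estimated AS(MAJ_n):", n * flips / trials, "   sqrt(n):", math.sqrt(n))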

Function Approximation For all Boolean f, f = Σ_a f̂(a) χ_a. For S ⊆ {0,1}^n, define f_S ≡ Σ_{a∈S} f̂(a) χ_a. [LMN]: Pr_x[f(x) ≠ sign(f_S(x))] ≤ E[(f - f_S)²] = Σ_{a∉S} f̂²(a)

“The” Fourier Learning Algorithm Given: ε (and perhaps s, d) Determine k such that for S = {a : |a| < k}, Σ_{a∉S} f̂²(a) < ε Draw sufficiently large sample of examples to closely estimate f̂(a) for all a ∈ S –Chernoff bounds: ~n^k/ε sample size sufficient Output h ≡ sign(Σ_{a∈S} f̃(a) χ_a) Run time ~n^(2k)/ε
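
A hedged Python sketch of this low-degree algorithm follows. The target function, the degree cutoff k, and the sample sizes are toy assumptions; the learner itself uses only the labeled examples.

# Low-degree ("LMN-style") learning sketch: estimate every Fourier coefficient
# of degree < k from uniform examples, then predict with the sign of the
# truncated Fourier expansion.
import random
from itertools import combinations

n, k, m = 8, 3, 20000                 # dimension, degree cutoff, sample size

def target(x):                        # stands in for the unknown f (toy choice)
    return 1 if sum(x) > n // 2 else -1

def chi(a, x):                        # chi_a(x) = (-1)^{a.x}; a is an index set
    return -1 if sum(x[i] for i in a) % 2 else 1

# The EX(f) oracle: uniform random labeled examples.
examples = [(x, target(x)) for x in
            (tuple(random.randint(0, 1) for _ in range(n)) for _ in range(m))]

# Estimate f^(a) = E[f(x) chi_a(x)] for every |a| < k.
low_degree = [a for d in range(k) for a in combinations(range(n), d)]
est = {a: sum(y * chi(a, x) for x, y in examples) / m for a in low_degree}

def h(x):                             # hypothesis: sign of the truncated sum
    return 1 if sum(c * chi(a, x) for a, c in est.items()) >= 0 else -1

test = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(2000)]
print("estimated error:", sum(h(x) != target(x) for x in test) / len(test))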

Halfspaces [KOS]: Halfspaces are efficiently uniform learnable (given ε is constant) –Halfspace: ∃ w ∈ R^(n+1) s.t. f(x) = sign(w · (x ∘ 1)) –If S = {a : |a| < (21/ε)²} then Σ_{a∉S} f̂²(a) < ε –Apply the LMN algorithm. A similar result applies for an arbitrary function applied to a constant number of halfspaces –Intersection of halfspaces is a key learning problem

Halfspace Techniques [O] (cf. [BKS], [BJTa]): –The noise sensitivity of f at γ is the probability that corrupting each bit of x independently with probability γ changes f(x) –NS_γ(f) = ½(1 - Σ_a (1-2γ)^|a| f̂²(a)) [KOS]: –If S = {a : |a| < 1/γ} then Σ_{a∉S} f̂²(a) < 3·NS_γ(f) –If f is a halfspace then NS_ε(f) < 9√ε
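
The following sketch (illustrative, not part of the original slides) checks the noise-sensitivity formula above exactly on a 5-bit majority; the choice of function and of γ is arbitrary.

# Exact check of NS_gamma(f) = 1/2 * (1 - sum_a (1-2*gamma)^{|a|} f^(a)^2)
# on MAJ_5: compute both sides directly over the whole cube.
from itertools import product

n, gamma = 5, 0.1
cube = list(product((0, 1), repeat=n))

def f(x):                                    # MAJ_5 with outputs in {-1,+1}
    return 1 if sum(x) > n // 2 else -1

def chi(a, x):
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

# Left side: exact noise sensitivity, averaging over x and the noise vector r.
ns = 0.0
for x in cube:
    for r in cube:
        w = gamma ** sum(r) * (1 - gamma) ** (n - sum(r))
        y = tuple(xi ^ ri for xi, ri in zip(x, r))
        ns += w * (f(x) != f(y))
ns /= len(cube)

# Right side: the Fourier formula.
coeff = {a: sum(f(x) * chi(a, x) for x in cube) / len(cube) for a in cube}
rhs = 0.5 * (1 - sum((1 - 2 * gamma) ** sum(a) * c * c for a, c in coeff.items()))

print(ns, rhs)                               # the two values agree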

Monotone DT [OS]: Monotone functions are efficiently learnable given: –ε is constant –s_DT(f) is used as the size measure. Techniques: –Harmonic analysis: for monotone f, AS(f) ≤ √(log s_DT(f)) –[BT]: If S = {a : |a| < AS(f)/ε} then Σ_{a∉S} f̂²(a) < ε –Friedgut: ∃ a set T of variables with |T| ≤ 2^(AS(f)/ε) s.t. Σ_{A⊄T} f̂²(A) < ε

Weak Approximators KKL also show that if f is monotone, there is an i such that -f̂({i}) ≥ log²n / n. Therefore Pr[f(x) = -χ_{i}(x)] ≥ ½ + log²n / 2n. In general, h s.t. Pr[f = h] ≥ ½ + 1/poly(n,s) is called a weak approximator to f. If A outputs a weak approximator for every f in F, then F is weakly learnable

Weak Uniform Learning Model Boolean Function Class F (e.g., DNF); Example Oracle EX(f); Target function f : {0,1}^n → {0,1}; Uniform Random Examples; Learning Algorithm A; Hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ½ - 1/p(n,s)

Efficient Weak Learning Algorithm for Monotone Boolean Functions Draw a set of ~n² examples For i = 1 to n –Estimate f̂({i}) Output h ≡ -χ_{i*} for the index i* maximizing -f̂({i})
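
A minimal sketch of this weak learner, under the assumption that the target is the toy monotone function below; the dimension and sample sizes are illustrative, not part of the slide.

# Weak learning a monotone function: estimate the n degree-1 coefficients from
# uniform examples and predict with the single character -chi_{i*} whose
# estimated coefficient is most negative.
import random

n, m = 15, 4000

def target(x):                        # a monotone toy target in {-1,+1}
    return 1 if (x[0] and x[1]) or x[2] else -1

examples = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(m)]
labeled = [(x, target(x)) for x in examples]

# Estimate f^({i}) = E[f(x) * (-1)^{x_i}] for every i.
est = [sum(y * (-1 if x[i] else 1) for x, y in labeled) / m for i in range(n)]

# For monotone f these coefficients are non-positive; take the most negative.
i_star = min(range(n), key=lambda i: est[i])

def h(x):                             # h = -chi_{i*}: predicts 1 exactly when x_{i*} = 1
    return 1 if x[i_star] else -1

test = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(4000)]
print("advantage over random guessing:",
      sum(h(x) == target(x) for x in test) / len(test) - 0.5)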

Weak Approximation for MAJ of Constant-Depth Circuits Note that adding a single MAJ gate to a CDC destroys the LMN spectral property. [JKS]: MAJ of CDCs is quasi-efficiently quasi-weakly uniform learnable –If f is a MAJ of CDCs of depth d, and if the number of gates in f is s, then there is a set A ∈ {0,1}^n such that |A| < log^d s ≡ k and Pr[f(x) = χ_A(x)] ≥ ½ + 1/(4sn^k)

Weak Learning Algorithm Compute k = log^d s Draw ~s·n^k examples Repeat for |A| < k –Estimate f̂(A) Until an A is found s.t. f̂(A) > 1/(2sn^k) Output h ≡ χ_A Run time ~n^polylog(s)

Weak Approximator Proof Techniques “Discriminator Lemma” (HMPST) –Implies that one of the CDCs is a weak approximator to f. LMN spectral characterization of CDCs. Harmonic analysis. A result of Beigel is used to extend the weak learning to CDCs with polylog MAJ gates

Boosting In many (not all) cases, uniform weak learning algorithms can be converted to uniform (strong) learning algorithms using a boosting technique ([S], [F], …) –Need to learn weakly with respect to near-uniform distributions: for a near-uniform distribution D, find a weak h_j s.t. Pr_{x~D}[h_j = f] > ½ + 1/poly(n,s) –Final h is typically a MAJ of the weak approximators

Strong Learning for MAJ of Constant-Depth Circuits [JKS]: MAJ of CDC is quasi-efficiently uniform learnable –Show that for near-uniform distributions, some parity function is a weak approximator –The Beigel result again extends this to CDCs with polylog MAJ gates [KP] + boosting: there are distributions for which no parity is a weak approximator

Uniform Learning from a Membership Oracle Boolean Function Class F (e.g., DNF); Membership Oracle MEM(f); Target function f : {0,1}^n → {0,1}; Learning Algorithm A (sends queries x, receives f(x)); Accuracy ε > 0; Hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

Uniform Membership Learning of Decision Trees [KM] –L_1(f) ≡ Σ_a |f̂(a)| ≤ s_DT(f) –If S = {a : |f̂(a)| ≥ ε/L_1(f)} then Σ_{a∉S} f̂²(a) < ε –[GL]: an algorithm (using a membership oracle) for finding {a : |f̂(a)| ≥ θ} in time ~n/θ^6 –So DT can be efficiently uniform membership learned –Output h has the same form as in LMN: h ≡ sign(Σ_{a∈S} f̃(a) χ_a)
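
To make the coefficient search concrete, here is a hedged, exact small-n illustration of the recursion behind [KM]/[GL]: for a prefix b of length j, the bucket weight W(b) = E_z[(E_x[f(x ∘ z) χ_b(x)])²] equals the sum of f̂(b ∘ c)² over all suffixes c, so a branch is explored only while W(b) ≥ θ². The real algorithm estimates these weights by sampling with membership queries; this sketch computes them exactly for a toy target with a single heavy coefficient.

# KM-style search for the Fourier coefficients with |f^(a)| >= theta, using
# only black-box ("membership") access to f.  Exact small-n illustration.
from itertools import product

n, theta = 6, 0.3

def f(x):                                 # membership oracle for a toy target
    return 1 if x[0] ^ x[1] else -1       # = -chi_a for a = 110000: one heavy coefficient

def chi(b, x):
    return -1 if sum(bi & xi for bi, xi in zip(b, x)) % 2 else 1

def weight(b):
    """Exact W(b) = E_z[(E_x[f(x.z) chi_b(x)])^2] for a prefix b of length j."""
    j = len(b)
    total = 0.0
    for z in product((0, 1), repeat=n - j):
        inner = sum(f(x + z) * chi(b, x) for x in product((0, 1), repeat=j))
        total += (inner / 2 ** j) ** 2
    return total / 2 ** (n - j)

def heavy(b=()):
    """All indices a with |f^(a)| >= theta, found by pruning light prefixes."""
    if weight(b) < theta ** 2:
        return []
    if len(b) == n:
        return [b]
    return heavy(b + (0,)) + heavy(b + (1,))

print(heavy())                            # -> [(1, 1, 0, 0, 0, 0)]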

Uniform Membership Learning of DNF [J] –∀ distributions D, ∃ χ_a s.t. Pr_{x~D}[f(x) = χ_a(x)] ≥ ½ + 1/(6·s_DNF) –A modified [GL] can efficiently locate such a χ_a given an oracle for the near-uniform D; boosters can provide such an oracle when uniform learning –Boosting then provides strong learning. [BJTb] (see also [KS]): –A modified Levin algorithm finds χ_a in time ~n·s²

Uniform Learning from a Classification Noise Oracle Boolean Function Class F (e.g., DNF); Classification Noise Oracle EX_η(f); Target function f : {0,1}^n → {0,1}; examples have uniform random x, labeled correctly with probability 1-η and with a flipped label with probability η; Error rate η > 0; Learning Algorithm A; Accuracy ε > 0; Hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

Uniform Learning from a Statistical Query Oracle Boolean Function Class F (e.g., DNF); Statistical Query Oracle SQ(f); Target function f : {0,1}^n → {0,1}; Learning Algorithm A sends a query (q(·,·), τ) and receives E_U[q(x, f(x))] ± τ; Accuracy ε > 0; Hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

SQ and Classification Noise Learning [K] –If F is uniform SQ learnable in time poly(n, s_F, 1/ε, 1/τ) then F is uniform CN learnable in time poly(n, s_F, 1/ε, 1/τ, 1/(1-2η)) –Empirically, it is almost always true that if F is efficiently uniform learnable then F is efficiently uniform SQ learnable (i.e., 1/τ is poly in the other parameters). Exception: F = PAR_n ≡ {χ_a : a ∈ {0,1}^n, |a| ≤ n}

Uniform SQ Hardness for PAR [BFJKMR] –Harmonic analysis shows that for any q and χ_a: E_U[q(x, χ_a(x))] = q̂(0^(n+1)) + q̂(a ∘ 1) –Thus the adversarial SQ response to (q, τ) is q̂(0^(n+1)) whenever |q̂(a ∘ 1)| < τ –Parseval: |q̂(b ∘ 1)| < τ for all but 1/τ² Fourier coefficients –So a ‘bad’ query eliminates only polynomially many coefficients –Even PAR_(log n) is not efficiently SQ learnable
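
A short derivation of the identity in the first bullet, written in one standard convention (split the query into parts that are even and odd in the label); the slide's indexing q̂(0^(n+1)), q̂(a ∘ 1) corresponds to the two terms obtained here.

\[ q_e(x) = \tfrac{1}{2}\big(q(x,1)+q(x,-1)\big), \quad q_o(x) = \tfrac{1}{2}\big(q(x,1)-q(x,-1)\big), \quad q(x,\ell) = q_e(x) + \ell\, q_o(x), \]
\[ \mathbf{E}_{x\sim U}\big[q(x,\chi_a(x))\big] = \mathbf{E}_x[q_e(x)] + \mathbf{E}_x\big[\chi_a(x)\, q_o(x)\big] = \widehat{q_e}(0^n) + \widehat{q_o}(a). \]

By Parseval, Σ_a q̂_o²(a) ≤ 1 for a bounded query, so at most 1/τ² of the parities can have |q̂_o(a)| ≥ τ; this is the counting step behind the remaining bullets.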

Uniform Learning from an Attribute Noise Oracle Boolean Function Class F (e.g., DNF); Attribute Noise Oracle EX_{D_N}(f); Target function f : {0,1}^n → {0,1}; examples have uniform random x and attribute noise r ~ D_N (noise model D_N); Learning Algorithm A; Accuracy ε > 0; Hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

Uniform Learning with Independent Attribute Noise [BJTa]: –The LMN algorithm produces estimates of f̂(a) · E_{r~D_N}[χ_a(r)]. Example application –Assume the noise process D_N is a product distribution: D_N(x) = ∏_i (p_i·x_i + (1-p_i)(1-x_i)) –Assume p_i < 1/polylog n and 1/ε at most quasi-poly(n) (mild restrictions) –Then a modified LMN uniform learns attribute-noisy AC^0 in quasi-polynomial time
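
The first bullet can be checked in one line, assuming the noisy example has the form (x ⊕ r, f(x)) with x uniform and r ~ D_N (the usual independent attribute noise model):

\[ \mathbf{E}_{x,r}\big[f(x)\,\chi_a(x \oplus r)\big] = \mathbf{E}_{x,r}\big[f(x)\,\chi_a(x)\,\chi_a(r)\big] = \hat{f}(a)\cdot\mathbf{E}_{r\sim D_N}\big[\chi_a(r)\big]. \]

For a product noise distribution, E_{r~D_N}[χ_a(r)] = ∏_{i : a_i = 1} (1 - 2p_i), so when the p_i are known (or separately estimated) each LMN estimate can in principle be rescaled by this attenuation factor.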

Agnostic Learning Model Arbitrary Boolean target function f : {0,1}^n → {0,1}; Example Oracle EX(f); Uniform Random Examples; Learning Algorithm A; Hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] is minimized

Near-Agnostic Learning via LMN [KKM]: –Let f be an arbitrary Boolean function –Fix any set S of Fourier indices and fix ε –Let g be any function s.t. Σ_{a∉S} ĝ²(a) < ε and Pr[f ≠ g] is minimized (call this η) –Then for the h learned by LMN by estimating the coefficients of f over S: Pr[f ≠ h] < 4η + ε

Average Case Uniform Learning Model Boolean Function Class F (e.g., DNF); Example Oracle EX(f); D-random target function f : {0,1}^n → {0,1}; Uniform Random Examples; Learning Algorithm A; Accuracy ε > 0; Hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

Average Case Learning of DT [JSa]: –D: uniform over complete, non-redundant log-depth DTs –DTs are efficiently uniform learnable on average –Output is a DT (proper learning)

Average Case Learning of DT Technique –[KM]: all Fourier coefficients of a DT with minimum depth d are rational with denominator 2^d –In an average-case tree, the coefficient f̂({i}) of at least one variable v_i has an odd numerator, so log(denominator) gives the minimum depth of the tree –Try all variables at the root and find the depth of the child trees, choosing the root with the shallowest children –Recurse on the child trees to choose their roots

Average Case Learning of DNF [JSb]: –D: s terms, each term chosen uniformly from the terms of length log s –Monotone DNF with < n² terms and DNF with < n^1.5 terms are properly and efficiently uniform learnable on average. Harmonic property –In an average-case DNF, the sign of f̂({i,j}) (usually) indicates whether v_i and v_j appear in a common term

Summary Most uniform-learning results depend on harmonic analysis Learning theory provides motivation for new harmonic observations Even very “weak” harmonic results can be useful in learning-theory algorithms

Some Open Problems Efficient uniform learning of monotone DNF –Best to date for small s_DNF is [S], time ~n·s^(log s) (based on [BT], [M], [LMN]). Non-uniform learning –Relatively easy to extend many results to product distributions, e.g., [FJS] extends [LMN] –A key issue in real-world applicability

Open Problems (cont’d) Weaker dependence on ε –Several algorithms are fully exponential (or worse) in 1/ε. Additional proper learning results –Allows for interpretation of the learned hypothesis