Fourier Analysis and Boolean Function Learning


1 Fourier Analysis and Boolean Function Learning
Jeff Jackson Duquesne University

2 Themes
Fourier analysis is central to learning-theoretic results in a wide variety of models
Results generally are the strongest known for learning Boolean function classes with respect to the uniform distribution
Work on learning problems has led to some new harmonic results:
Spectral properties of Boolean function classes
Algorithms for approximating Boolean functions

3 Uniform Learning Model
Target function f : {0,1}^n → {0,1} from Boolean function class F (e.g., DNF)
Example oracle EX(f) supplies uniform random examples <x, f(x)> to learning algorithm A
Given accuracy ε > 0, A outputs a hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

4 Circuit Classes
Constant-depth AND/OR circuits (AC0 without the polynomial-size restriction; call this CDC)
DNF: depth-2 circuit with OR at root
[Figure: a circuit of d alternating levels of ∧ and ∨ gates over inputs v1, v2, …, vn; negations allowed]

5–8 Decision Trees
[Figure, repeated across four slides: a decision tree with internal nodes labelled by variables v1–v4 and leaves labelled 0/1; the slides step through its evaluation on the example x = 11001, branching on x3 = 0, then x1 = 1, and reaching a leaf with f(x) = 1.]

9 Function Size
Each function representation has a natural size measure:
CDC, DNF: # of gates
DT: # of leaves
Size s_F(f) of f with respect to class F is the size of the smallest representation of f within F
For all Boolean f, s_CDC(f) ≤ s_DNF(f) ≤ s_DT(f)

10 Efficient Uniform Learning Model
Target function f : {0,1}^n → {0,1} from Boolean function class F (e.g., DNF)
Example oracle EX(f) supplies uniform random examples <x, f(x)>
Given accuracy ε > 0, A runs in time poly(n, s_F, 1/ε) and outputs a hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

11 Harmonic-Based Uniform Learning
[LMN]: constant-depth circuits are quasi-efficiently (n^polylog(s/ε)-time) uniform learnable
[BT]: monotone Boolean functions are uniform learnable in time roughly 2^(√n·log n)
Monotone: for all x, i: f(x|x_i=0) ≤ f(x|x_i=1)
Also exponential in 1/ε (so assumes ε constant), but independent of any size measure

12 Notation
Assume f : {0,1}^n → {-1,1}
For all a in {0,1}^n, χ_a(x) ≡ (-1)^(a·x)
For all a in {0,1}^n, the Fourier coefficient f̂(a) of f at a is f̂(a) ≡ E_{x~U}[f(x)·χ_a(x)]
Sometimes write, e.g., f̂({1}) for f̂(10…0)
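To make the notation concrete, here is a small Python sketch (not from the talk) that evaluates χ_a and computes f̂(a) exactly by enumerating all 2^n inputs; the example function and all names are illustrative only.

```python
import itertools

def chi(a, x):
    """Parity character chi_a(x) = (-1)^(a . x) for 0/1 vectors a and x."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def fourier_coefficient(f, a, n):
    """Exact f-hat(a) = E_{x~U}[f(x) * chi_a(x)], by brute force over all 2^n inputs."""
    total = sum(f(x) * chi(a, x) for x in itertools.product((0, 1), repeat=n))
    return total / 2 ** n

# Illustration: f(x) = (-1)^(x1 XOR x2) on 3 bits has a single nonzero
# coefficient, f-hat(110) = 1; every other coefficient is 0.
f = lambda x: -1 if x[0] ^ x[1] else 1
print(fourier_coefficient(f, (1, 1, 0), 3))  # -> 1.0
print(fourier_coefficient(f, (1, 0, 0), 3))  # -> 0.0
```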

13 Fourier Properties of Classes
[LMN]: if f is a constant-depth circuit of depth d and S = { a : |a| < log^d(s/ε) }, then Σ_{a∉S} f̂²(a) < ε   ( |a| ≡ # of 1’s in a )
[BT]: if f is a monotone Boolean function and S = { a : |a| < √n / ε }, then Σ_{a∉S} f̂²(a) < ε

14 Spectral Properties

15 Proof Techniques
[LMN]: Håstad’s Switching Lemma + harmonic analysis
[BT]: based on [KKL]
Define AS(f) ≡ n · Pr_{x,i}[f(x|x_i=0) ≠ f(x|x_i=1)]
If S = {a : |a| < AS(f)/ε} then Σ_{a∉S} f̂²(a) < ε
For monotone f, harmonic analysis + Cauchy-Schwarz shows AS(f) ≤ √n
Note: this is tight for MAJ
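A small Monte Carlo sketch (illustrative, not from the talk) of estimating AS(f) directly from its definition; the sample size here is an arbitrary choice, not a derived bound.

```python
import random

def estimate_AS(f, n, samples=100_000, rng=random.Random(0)):
    """Estimate AS(f) = n * Pr_{x,i}[ f(x with x_i=0) != f(x with x_i=1) ]
    by sampling a uniform x and a uniform coordinate i."""
    flips = 0
    for _ in range(samples):
        x = [rng.randint(0, 1) for _ in range(n)]
        i = rng.randrange(n)
        x0, x1 = x[:], x[:]
        x0[i], x1[i] = 0, 1
        flips += f(tuple(x0)) != f(tuple(x1))
    return n * flips / samples

# MAJ is the tight case noted above: for n = 9 this prints roughly 2.46,
# comfortably below the sqrt(n) = 3 bound.
maj = lambda x: 1 if sum(x) > len(x) // 2 else -1
print(estimate_AS(maj, 9))
```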

16 Function Approximation
For S ⊆ {0,1}^n, define f_S ≡ Σ_{a∈S} f̂(a)·χ_a
For all Boolean f, E_x[(f(x) − f_S(x))²] = Σ_{a∉S} f̂²(a)   (Parseval)
[LMN]: Pr_{x~U}[f(x) ≠ sign(f_S(x))] ≤ Σ_{a∉S} f̂²(a)

17 “The” Fourier Learning Algorithm
Given: ε (and perhaps s, d, ...)
Determine k such that for S = {a : |a| < k}, Σ_{a∉S} f̂²(a) < ε
Draw a sufficiently large sample of examples <x, f(x)> to closely estimate f̂(a) for all a∈S
Chernoff bounds: ~n^k/ε sample size sufficient
Output h ≡ sign(Σ_{a∈S} f̃(a)·χ_a), where f̃ denotes the estimated coefficients
Run time ~n^(2k)/ε
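A minimal Python sketch of this low-degree algorithm (reusing the chi helper from the slide-12 snippet); the caller supplies the sample and the degree cutoff k per the bounds above, and all names are illustrative.

```python
import itertools

def low_degree_learn(sample, n, k):
    """LMN-style low-degree algorithm sketch: empirically estimate every Fourier
    coefficient of degree < k from labelled examples (x, f(x)) with f(x) in
    {-1, +1}, then return the sign of the estimated low-degree polynomial."""
    m = len(sample)
    coeffs = {}
    for d in range(k):
        for support in itertools.combinations(range(n), d):
            a = tuple(1 if i in support else 0 for i in range(n))
            coeffs[a] = sum(y * chi(a, x) for x, y in sample) / m  # ~ f-hat(a)
    return lambda x: 1 if sum(c * chi(a, x) for a, c in coeffs.items()) >= 0 else -1
```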

18 Halfspaces
[KOS]: halfspaces are efficiently uniform learnable (given ε is constant)
Halfspace: ∃ w∈R^(n+1) s.t. f(x) = sign(w · (x∘1))
If S = {a : |a| < (21/ε)²} then Σ_{a∉S} f̂²(a) < ε
Apply the LMN algorithm
Similar result applies for an arbitrary function applied to a constant number of halfspaces
Intersection of halfspaces is a key learning problem

19 Halfspace Techniques
Noise sensitivity of f at γ [O] (cf. [BKS], [BJTa]): the probability that corrupting each bit of x independently with probability γ changes f(x)
NS_γ(f) ≡ ½(1 − Σ_a (1−2γ)^|a| f̂²(a))
[KOS]: If S = {a : |a| < 1/γ} then Σ_{a∉S} f̂²(a) < 3·NS_γ(f)
[KOS]: If f is a halfspace then NS_γ(f) < 9√γ
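Noise sensitivity is also easy to estimate empirically; this is an illustrative sketch with an arbitrary sample size, not the [KOS] analysis.

```python
import random

def estimate_NS(f, n, gamma, samples=100_000, rng=random.Random(0)):
    """Estimate NS_gamma(f): draw a uniform x, flip each bit independently
    with probability gamma, and record how often the function value changes."""
    changed = 0
    for _ in range(samples):
        x = [rng.randint(0, 1) for _ in range(n)]
        y = [xi ^ (rng.random() < gamma) for xi in x]
        changed += f(tuple(x)) != f(tuple(y))
    return changed / samples
```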

20 Monotone DT
[OS]: monotone functions are efficiently learnable given:
ε is constant
s_DT(f) is used as the size measure
Techniques:
Harmonic analysis: for monotone f, AS(f) ≤ √(log s_DT(f))
[BT]: If S = {a : |a| < AS(f)/ε} then Σ_{a∉S} f̂²(a) < ε
Friedgut: ∃ T with |T| ≤ 2^(AS(f)/ε) s.t. Σ_{A⊄T} f̂²(A) < ε

21 Weak Approximators
[KKL] also show that if f is monotone, there is an i such that −f̂({i}) ≥ log²n / n
Therefore Pr[f(x) = −χ_{i}(x)] ≥ ½ + log²n / 2n
In general, h s.t. Pr[f = h] ≥ ½ + 1/poly(n,s) is called a weak approximator to f
If A outputs a weak approximator for every f in F, then F is weakly learnable

22 Uniform Learning Model
Target function f : {0,1}^n → {0,1} from Boolean function class F (e.g., DNF)
Example oracle EX(f) supplies uniform random examples <x, f(x)>
Given accuracy ε > 0, learning algorithm A outputs a hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

23 Weak Uniform Learning Model
Target function f : {0,1}^n → {0,1} from Boolean function class F (e.g., DNF)
Example oracle EX(f) supplies uniform random examples <x, f(x)>
Learning algorithm A outputs a hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ½ − 1/p(n,s)

24 Efficient Weak Learning Algorithm for Monotone Boolean Functions
Draw a set of ~n² examples <x, f(x)>
For i = 1 to n: estimate f̂({i})
Output h ≡ −χ_{i*}, where i* maximizes −f̂({i})
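A sketch of this weak learner in Python (illustrative; the sign convention follows slide 21, and the sample is assumed already drawn).

```python
def weak_learn_monotone(sample, n):
    """Estimate all n degree-1 coefficients f-hat({i}) from the sample and
    output h = -chi_{i*} for the index i* where -f-hat({i}) is largest, which,
    for monotone f and accurate estimates, is the weak approximator of slide 21."""
    m = len(sample)
    est = [sum(y * (-1 if x[i] else 1) for x, y in sample) / m  # ~ f-hat({i})
           for i in range(n)]
    i_star = min(range(n), key=lambda i: est[i])  # maximizes -f-hat({i})
    return lambda x: 1 if x[i_star] else -1       # -chi_{i*}(x) = -(-1)^{x_i*}
```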

25 Weak Approximation for MAJ of Constant-Depth Circuits
Note that adding a single MAJ gate to a CDC destroys the LMN spectral property
[JKS]: MAJ of CDCs is quasi-efficiently quasi-weakly uniform learnable
If f is a MAJ of CDCs of depth d, and the number of gates in f is s, then there is an A ∈ {0,1}^n such that |A| < log^d s ≡ k and Pr[f(x) = χ_A(x)] ≥ ½ + 1/(4s·n^k)

26 Weak Learning Algorithm
Compute k = log^d s
Draw ~s·n^k examples <x, f(x)>
Repeat over A with |A| < k: estimate f̂(A), until an A is found with f̂(A) > 1/(2s·n^k)
Output h ≡ χ_A
Run time ~n^polylog(s)
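A sketch of this coefficient search (illustrative; it reuses the chi helper from the slide-12 snippet and takes the threshold 1/(2s·n^k) as a parameter).

```python
import itertools

def find_heavy_low_degree_parity(sample, n, k, threshold):
    """Scan all parities chi_A with |A| < k and return the first A whose
    empirically estimated Fourier coefficient exceeds the threshold, or None."""
    m = len(sample)
    for d in range(k):
        for support in itertools.combinations(range(n), d):
            a = tuple(1 if i in support else 0 for i in range(n))
            if sum(y * chi(a, x) for x, y in sample) / m > threshold:
                return a
    return None
```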

27 Weak Approximator Proof Techniques
“Discriminator Lemma” [HMPST] implies one of the CDCs is a weak approximator to f
LMN spectral characterization of CDC
Harmonic analysis
Beigel result used to extend weak learning to CDC with polylog MAJ gates

28 Boosting
In many (not all) cases, uniform weak learning algorithms can be converted to uniform (strong) learning algorithms using a boosting technique ([S], [FS], …)
Need to learn weakly with respect to near-uniform distributions
For near-uniform distribution D, find weak h_j s.t. Pr_{x~D}[h_j = f] > ½ + 1/poly(n,s)
Final h typically a MAJ of the weak approximators
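As a generic illustration of the boosting pattern only: the sketch below is plain AdaBoost over a fixed sample, not the specific boosters cited above; the boosters used in these uniform-distribution results additionally keep the intermediate distributions near-uniform, which this toy loop does not enforce.

```python
import math

def adaboost(sample, weak_learner, rounds):
    """Generic boosting loop: reweight the sample, call the weak learner against
    the current distribution, and output a weighted majority of the weak
    hypotheses. Labels are in {-1, +1}; weak_learner(sample, weights) must
    return an h with weighted error below 1/2."""
    m = len(sample)
    w = [1.0 / m] * m
    voters = []
    for _ in range(rounds):
        h = weak_learner(sample, w)
        err = sum(wi for wi, (x, y) in zip(w, sample) if h(x) != y)
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))
        voters.append((alpha, h))
        w = [wi * math.exp(-alpha * y * h(x)) for wi, (x, y) in zip(w, sample)]
        total = sum(w)
        w = [wi / total for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in voters) >= 0 else -1
```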

29 Strong Learning for MAJ of Constant-Depth Circuits
[JKS]: MAJ of CDC is quasi-efficiently uniform learnable
Show that for near-uniform distributions, some parity function is a weak approximator
Beigel result again extends to CDC with polylog MAJ gates
[KP] + boosting: there are distributions for which no parity is a weak approximator

30 Uniform Learning from a Membership Oracle
Target function f : {0,1}^n → {0,1} from Boolean function class F (e.g., DNF)
Membership oracle MEM(f) returns f(x) on any query point x chosen by the learning algorithm A
Given accuracy ε > 0, A outputs a hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

31 Uniform Membership Learning of Decision Trees
[KM]: L1(f) ≡ Σ_a |f̂(a)| ≤ s_DT(f)
If S = {a : |f̂(a)| ≥ ε/L1(f)} then Σ_{a∉S} f̂²(a) < ε
[GL]: algorithm (membership oracle) for finding {a : |f̂(a)| ≥ θ} in time ~n/θ^6
So can efficiently uniform membership learn DT
Output h has the same form as LMN: h ≡ sign(Σ_{a∈S} f̃(a)·χ_a)
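A simplified sketch of the [KM]-style search with a membership oracle; the sample sizes and the pruning threshold are crude placeholders rather than the bounds from the analysis, and the names are illustrative.

```python
import random

def prefix_weight(mem, n, alpha, samples=2000, rng=random.Random(0)):
    """Estimate W(alpha) = sum over suffixes beta of f-hat(alpha.beta)^2 via the
    identity W(alpha) = E_{x1,x2,y}[ f(x1 y) f(x2 y) chi_alpha(x1 XOR x2) ],
    where x1, x2 range over the first |alpha| bits, y over the rest, and mem is
    the membership oracle returning f-values in {-1, +1}."""
    k = len(alpha)
    total = 0.0
    for _ in range(samples):
        x1 = [rng.randint(0, 1) for _ in range(k)]
        x2 = [rng.randint(0, 1) for _ in range(k)]
        y = [rng.randint(0, 1) for _ in range(n - k)]
        parity = sum(ai & (b1 ^ b2) for ai, b1, b2 in zip(alpha, x1, x2)) % 2
        total += mem(tuple(x1 + y)) * mem(tuple(x2 + y)) * (-1 if parity else 1)
    return total / samples

def km_heavy_coefficients(mem, n, theta):
    """Grow coefficient prefixes one bit at a time, pruning any prefix whose
    estimated Fourier weight falls below ~theta^2, so that every a with
    |f-hat(a)| >= theta survives to full length."""
    prefixes = [[]]
    for _ in range(n):
        prefixes = [p + [b] for p in prefixes for b in (0, 1)
                    if prefix_weight(mem, n, p + [b]) >= theta ** 2 / 2]
    return [tuple(p) for p in prefixes]
```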

32 Uniform Membership Learning of DNF
[J] "(distributions D) $ χa s.t Prx~D[f(x) = χa(x)] ≥ ½ + 1/6sDNF Modified [GL] can efficiently locate such χa given oracle for near-uniform D Boosters can provide such an oracle when uniform learning Boosting provides strong learning [BJTb], [KS], [F] For near-uniform D, can find χa in time ~ns2

33 Uniform Learning from a Random Walk Oracle
Target function f : {0,1}^n → {0,1} from Boolean function class F (e.g., DNF)
Random walk oracle RW(f) supplies random walk examples <x, f(x)>
Given accuracy ε > 0, learning algorithm A outputs h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

34 Random Walk DNF Learning
[BMOS]: noise sensitivity and related values can be accurately estimated using a random walk oracle
NS_γ(f) ≡ ½(1 − Σ_a (1−2γ)^|a| f̂²(a))
T_b(f) ≡ Σ_{a⊇b} |a|·f̂²(a)
Estimate of T_b(f) is efficient if |b| is logarithmic
Only logarithmic |b| is needed to learn DNF [BF]
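An illustrative simulation of an updating random-walk oracle and the kind of disagreement statistic one can read off it; names and sample sizes are placeholders, and this is not the [BMOS] estimator itself.

```python
import random

def random_walk_examples(f, n, steps, rng=random.Random(0)):
    """Updating random walk: start at a uniform x; at each step replace one
    uniformly chosen coordinate by a fresh uniform bit; emit <x, f(x)>."""
    x = [rng.randint(0, 1) for _ in range(n)]
    for _ in range(steps):
        yield tuple(x), f(tuple(x))
        x[rng.randrange(n)] = rng.randint(0, 1)

def walk_disagreement_rate(f, n, delta, steps=200_000):
    """Fraction of walk positions whose label differs delta steps later; for
    delta ~ gamma*n this behaves like a noise-sensitivity-type statistic."""
    labels = [y for _, y in random_walk_examples(f, n, steps)]
    return sum(labels[t] != labels[t + delta]
               for t in range(steps - delta)) / (steps - delta)
```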

35 Random Walk Parity Learning
[JW] (unpublished)
Effectively, [BMOS] is limited to finding “heavy” Fourier coefficients f̂(a) for logarithmic |a|
Using a “breadth-first” variation of KM, can locate any a with |f̂(a)| > θ in time O(n^(log 1/θ))
A “heavy” coefficient corresponds to a parity function that weakly approximates f

36 Uniform Learning from a Classification Noise Oracle
Target function f : {0,1}^n → {0,1} from Boolean function class F (e.g., DNF)
Classification noise oracle EX_η(f), with error rate η > 0, draws uniform random x and returns <x, f(x)> with probability 1−η and <x, -f(x)> with probability η
Given accuracy ε > 0, learning algorithm A outputs h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

37 Uniform Learning from a Statistical Query Oracle
Target function f : {0,1}^n → {0,1} from Boolean function class F (e.g., DNF)
Statistical query oracle SQ(f): given a query (q(·,·), τ), returns E_U[q(x, f(x))] ± τ
Given accuracy ε > 0, learning algorithm A outputs h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε
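For intuition, a statistical query can be answered from the example oracle by empirical averaging; a rough sketch (the Hoeffding-style sample size is a crude choice, not a tight bound, and the names are illustrative).

```python
def simulate_SQ(example_oracle, q, tolerance):
    """Answer the statistical query (q, tau) by averaging q(x, f(x)) over enough
    uniform examples that the empirical mean is within tau of E_U[q(x, f(x))]
    with high probability; q maps ({0,1}^n, {-1,+1}) into [-1, +1]."""
    m = int(4 / tolerance ** 2) + 1
    return sum(q(*example_oracle()) for _ in range(m)) / m
```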

38 SQ and Classification Noise Learning
[K]: if F is uniform SQ learnable in time poly(n, s_F, 1/ε, 1/τ) then F is uniform CN learnable in time poly(n, s_F, 1/ε, 1/τ, 1/(1−2η))
Empirically, it is almost always true that if F is efficiently uniform learnable then F is efficiently uniform SQ learnable (i.e., 1/τ poly in the other parameters)
Exception: F = PAR_n ≡ {χ_a : a ∈ {0,1}^n, |a| ≤ n}

39 Uniform SQ Hardness for PAR
[BFJKMR]: harmonic analysis shows that for any q, χ_a: E_U[q(x, χ_a(x))] = q̂(0^(n+1)) + q̂(a∘1)
Thus the adversarial SQ response to (q, τ) is q̂(0^(n+1)) whenever |q̂(a∘1)| < τ
Parseval: |q̂(b∘1)| < τ for all but 1/τ² Fourier coefficients
So a ‘bad’ query eliminates only poly many coefficients
Even PAR_log n is not efficiently SQ learnable

40 Uniform Learning from an Attribute Noise Oracle
Target function f : {0,1}^n → {0,1} from Boolean function class F (e.g., DNF)
Attribute noise oracle EX_DN(f), for noise model D_N, draws uniform random x and r ~ D_N and returns <x⊕r, f(x)>
Given accuracy ε > 0, learning algorithm A outputs h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

41 Uniform Learning with Independent Attribute Noise
[BJTa]: the LMN algorithm then produces estimates of f̂(a) · E_{r~D_N}[χ_a(r)]
Example application:
Assume the noise process D_N is a product distribution: D_N(x) = ∏_i (p_i·x_i + (1−p_i)(1−x_i))
Assume p_i < 1/polylog n and 1/ε at most quasi-poly(n) (mild restrictions)
Then a modified LMN uniform learns attribute-noisy AC0 in quasi-polynomial time
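An illustrative sketch of the attenuation-and-correction idea, assuming the noise is an independent product distribution with known flip probabilities p_i; it reuses the chi helper from the slide-12 snippet, and the sample size is arbitrary.

```python
import random

def denoised_coefficient(f, n, a, flip_probs, samples=100_000, rng=random.Random(0)):
    """From attribute-noisy examples <x XOR r, f(x)>, with r_i = 1 w.p. p_i
    independently, the empirical average of f(x)*chi_a(x XOR r) converges to
    f-hat(a) * prod_{i: a_i=1} (1 - 2 p_i); dividing out that factor (when it
    is known and not too small) recovers an estimate of f-hat(a)."""
    total = 0
    for _ in range(samples):
        x = tuple(rng.randint(0, 1) for _ in range(n))
        r = tuple(1 if rng.random() < p else 0 for p in flip_probs)
        noisy = tuple(xi ^ ri for xi, ri in zip(x, r))
        total += f(x) * chi(a, noisy)
    attenuation = 1.0
    for ai, p in zip(a, flip_probs):
        if ai:
            attenuation *= 1 - 2 * p
    return (total / samples) / attenuation
```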

42 Agnostic Learning Model
Target function f : {0,1}^n → {0,1} is an arbitrary Boolean function
Example oracle EX(f) supplies uniform random examples <x, f(x)>
Given accuracy ε > 0, learning algorithm A outputs a hypothesis h in H s.t. Pr_{x~U}[f(x) ≠ h(x)] ≤ opt_H + ε

43 Agnostic Learning of Halfspaces
[KKMS]: agnostic learning algorithm for H the set of halfspaces
The algorithm is not Fourier-based (it uses L1 regression)
However, a somewhat weaker result can be obtained by simple Fourier analysis

44 Near-Agnostic Learning via LMN
[KKMS]: Let f be an arbitrary Boolean function
Fix any set S ⊆ {0,1}^n and fix ε
Let g be any function s.t. Σ_{a∉S} ĝ²(a) < ε and Pr[f ≠ g] (call this η) is minimized over all such g
Then for the h learned by LMN by estimating the coefficients of f over S: Pr[f ≠ h] < 4η + ε

45 Summary
Most uniform-learning results for Boolean function classes depend on harmonic analysis
Learning theory provides motivation for new harmonic observations
Even very “weak” harmonic results can be useful in learning-theory algorithms

46 Some Open Problems
Efficient uniform learning of monotone DNF
Best to date for small s_DNF is [Ser], time ~n·s^(log s) (based on [BT], [M], [LMN])
Non-uniform learning
Relatively easy to extend many results to product distributions, e.g. [FJS] extends [LMN]
Key issue in real-world applicability

47 Open Problems (cont’d)
Weaker dependence on ε
Several algorithms are fully exponential (or worse) in 1/ε
Additional proper learning results
Allows for interpretation of the learned hypothesis

48 References
Beigel. When Do Extra Majority Gates Help? ...
[BFJKMR] Blum, Furst, Jackson, Kearns, Mansour, Rudich. Weakly Learning DNF...
[BJTa] Bshouty, Jackson, Tamon. Uniform-Distribution Attribute Noise Learnability.
[BJTb] Bshouty, Jackson, Tamon. More Efficient PAC-learning of DNF...
[BKS] Benjamini, Kalai, Schramm. Noise Sensitivity of Boolean Functions...
[BMOS] Bshouty, Mossel, O’Donnell, Servedio. Learning DNF from Random Walks.
[BT] Bshouty, Tamon. On the Fourier Spectrum of Monotone Functions.
[F] Feldman. Attribute Efficient and Non-adaptive Learning of Parities...
[FJS] Furst, Jackson, Smith. Improved Learning of AC0 Functions.
[FS] Freund, Schapire. A Decision-theoretic Generalization of On-line Learning...
Friedgut. Boolean Functions with Low Average Sensitivity Depend on Few Coordinates.
[HMPST] Hajnal, Maass, Pudlak, Szegedy, Turan. Threshold Circuits of Bounded Depth.
[J] Jackson. An Efficient Membership-Query Algorithm for Learning DNF...
[JKS] Jackson, Klivans, Servedio. Learnability Beyond AC0.
[JW] Jackson, Wimmer. In preparation.
[KKL] Kahn, Kalai, Linial. The Influence of Variables on Boolean Functions.
[KKMS] Kalai, Klivans, Mansour, Servedio. On Agnostic Boosting and Parity Learning.
[K] Kearns. Efficient Noise-tolerant Learning from Statistical Queries.
[KM] Kushilevitz, Mansour. Learning Decision Trees using the Fourier Spectrum.
[KOS] Klivans, O’Donnell, Servedio. Learning Intersections and Thresholds of Halfspaces.
[KP] Krause, Pudlak. On Computing Boolean Functions by Sparse Real Polynomials.
[KS] Klivans, Servedio. Boosting and Hard-core Sets.
[LMN] Linial, Mansour, Nisan. Constant-depth Circuits, Fourier Transform, and Learnability.
[M] Mansour. An O(n^(log log n)) Learning Algorithm for DNF...
[O] O’Donnell. Hardness Amplification within NP.
[OS] O’Donnell, Servedio. Learning Monotone Functions from Random Examples in Polynomial Time.
[S] Schapire. The Strength of Weak Learnability.
[Ser] Servedio. On Learning Monotone DNF under Product Distributions.

