1
Deep Learning Theory: The F-Principle
Zhiqin Xu (许志钦), Shanghai Jiao Tong University. Joint work with Yaoyu Zhang, Tao Luo, Zheng Ma, Yanyang Xiao, Wei Cai. 12.05
2
A single hidden layer can fit any function
1989: universal approximation, i.e., a network with a single hidden layer can approximate any continuous function on a compact set (Cybenko, 1989; Hornik et al., 1989).
3
Fitting is not enough!
4
Generalization error vs. model complexity: as complexity grows, training error keeps decreasing while the generalization gap (test error minus training error) widens; this is the classical picture of overfitting.
5
Mystery of generalization in deep learning
"With four parameters you can fit an elephant to a curve; with five you can make him wiggle his trunk.” -- John von Neumann Mayer et al., 2010
6
Model: millions of parameters; data: five points.
Overfitting??
7
DNN says: NO!
8
Flat output between the training points, even though the number of parameters (~1600 × number of layers) is far greater than the 5 training points. (Lei Wu et al., 2017)
9
Large complexity? But good generalization!
Puzzle: DNNs generalize well even when the number of parameters far exceeds the number of training samples, yet the same architectures can also fit randomly shuffled labels (experiments on CIFAR10: 32x32 colour images in 10 classes). Zhang et al., 2016.
10
Dynamics is key to “preference”/“implicit regularization” of DNN.
Trajectory in parameter space: $t \mapsto \Theta(t)$. Trajectory in function space: $t \mapsto h(x,t) := h(x;\Theta(t))$. Our approach: (1) discover and characterize the dynamical behavior of $h(x,t)$ during training on synthetic problems; (2) examine its empirical universality using real datasets and deep networks; (3) establish theories for a mechanistic understanding; (4) understand DNNs in applications; (5) better use DNNs to solve problems.
11
F-Principle: DNNs prefer low frequencies (1-d, 2-d, ..., real data) [1,2]
Roadmap: Experiments → Theory → Understanding → Application.
Experiments: empirical study, quantified; 1-d, 2-d, ..., real data [1,2].
Theory: two-layer tanh [1]; linear FP for two-layer ReLU, a theoretical tool with accurate predictions [5]; general DNNs [6].
Understanding: early stopping [2]; poor performance on the parity function [1]; compression phase [2]; error analysis [4,5].
Application: DNN + traditional methods, fast at low frequency and low dimension [1]; MscaleDNN for high-dimensional, high-frequency PDEs [7].
References:
[1] Xu*, Zhang, Luo, Xiao, Ma, Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019.
[2] Xu*, Zhang, Xiao, Training Behavior of Deep Neural Network in Frequency Domain, 2018.
[3] Xu*, Understanding Training and Generalization in Deep Learning by Fourier Analysis, 2018.
[4] Zhang, Xu*, Luo, Ma, Explicitizing an Implicit Bias of the Frequency Principle in Two-layer Neural Networks, 2019.
[5] Zhang, Xu*, Luo, Ma, A Type of Generalization Error Induced by Initialization in Deep Neural Networks, 2019.
[6] Luo, Ma, Xu, Zhang, Theory on Frequency Principle in General Deep Neural Networks, 2019.
[7] Cai, Xu*, Multi-scale Deep Neural Networks for Solving High Dimensional PDEs, 2019.
12
Roadmap (repeated): Experiments → Theory → Understanding → Application; DNNs prefer low frequencies. References as listed above.
13
Example: Frequency Principle
Red: target function; blue: DNN fit. (Zhiqin Xu et al., Training Behavior of Deep Neural Network in Frequency Domain, 2018)
14
Example: Frequency Principle
Red: target function; blue: DNN fit. (Zhiqin Xu et al., Training Behavior of Deep Neural Network in Frequency Domain, 2018)
15
Example: Frequency Principle
Red: target function; blue: DNN fit. Red: FFT of the target; blue: FFT of the DNN output. Each frame spans several training steps. (Zhiqin Xu et al., Training Behavior of Deep Neural Network in Frequency Domain, 2018)
16
Example: Frequency Principle
(i) The DNN first captures the low-frequency components while keeping the high-frequency ones small; (ii) it then gradually captures the high-frequency components. Red: target function; blue: DNN fit; also shown: FFTs of both (amplitude vs. frequency). Each frame is one training step. (Xu, Zhang, Xiao, Training Behavior of Deep Neural Network in Frequency Domain, 2018)
17
Target with equal amplitudes across its frequencies. (Xu et al., Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019)
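A minimal numerical sketch of this kind of experiment: train a small two-layer tanh network on a 1-d target whose frequencies have equal amplitudes, and watch the per-frequency relative error of the output's FFT. The target sin(x)+sin(3x)+sin(5x), the width, the learning rate, and the step counts are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lr = 256, 400, 0.1
x = np.linspace(-np.pi, np.pi, n, endpoint=False)[:, None]
y = np.sin(x) + np.sin(3 * x) + np.sin(5 * x)     # equal-amplitude target

W = rng.normal(0, 2, (1, m))                      # first-layer weights
b = rng.normal(0, 1, (1, m))
a = rng.normal(0, 1, (m, 1)) / np.sqrt(m)         # readout weights

y_fft = np.fft.rfft(y[:, 0])
peaks = np.sort(np.argsort(np.abs(y_fft))[-3:])   # bins of the 3 target freqs

for step in range(1, 20001):
    z = np.tanh(x @ W + b)                        # (n, m) hidden activations
    h = z @ a                                     # (n, 1) network output
    e = h - y                                     # residual
    g = (e @ a.T) * (1 - z ** 2)                  # backprop through tanh
    a -= lr * z.T @ e / n                         # (constants absorbed in lr)
    W -= lr * x.T @ g / n
    b -= lr * g.sum(0, keepdims=True) / n
    if step % 5000 == 0:
        h_fft = np.fft.rfft(h[:, 0])
        rel = np.abs(h_fft[peaks] - y_fft[peaks]) / np.abs(y_fft[peaks])
        print(step, rel)  # the low-frequency bins converge first
```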
18
How does a DNN fit a 2-d image? Target: the image $I(\mathbf{x}): \mathbb{R}^2 \to \mathbb{R}$,
where $\mathbf{x}$ is the location of a pixel and $I(\mathbf{x})$ its grayscale value.
19
High-dimensional real data?
Does the F-Principle hold for high-dimensional real data? (Xu, Zhang, Luo, Xiao, Ma, Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019)
20
Two notions of frequency: image frequency (not used here) and response frequency.
Image frequency is the rate of change of intensity across neighbouring pixels: a region of uniform color is zero frequency, a sharp edge is high frequency. Response frequency is the frequency of a general input-output mapping $f$: high frequency means a small change in the intensity of the $i$-th pixel can induce a large change of the output, as in adversarial examples (Goodfellow et al.).
21
Examining the F-Principle for high-dimensional real problems
Nonuniform discrete Fourier transform (NUDFT) of the training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$:
$\hat y_{\mathbf{k}} = \frac{1}{n}\sum_{i=1}^{n} y_i\, \mathrm{e}^{-\mathrm{i}2\pi\mathbf{k}\cdot\mathbf{x}_i}, \qquad \hat h_{\mathbf{k}}(t) = \frac{1}{n}\sum_{i=1}^{n} h(\mathbf{x}_i, t)\, \mathrm{e}^{-\mathrm{i}2\pi\mathbf{k}\cdot\mathbf{x}_i}.$
Difficulty: the curse of dimensionality, i.e., the number of frequencies $\mathbf{k}$ grows exponentially with the problem dimension $d$. Our approaches: (1) projection, i.e., choose $\mathbf{k} = k\,\mathbf{p}_1$ along a fixed direction $\mathbf{p}_1$; (2) filtering.
22
Projection approach. Relative error at frequency $k$: $\Delta_F(k) = |\hat h_k - \hat y_k| / |\hat y_k|$.
[Figures: $\Delta_F(k)$ during training on MNIST and CIFAR10.]
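A sketch of the projection approach, assuming the direction $\mathbf{p}_1$ is taken as the first principal component of the inputs (any fixed informative direction would do); the helper names are hypothetical.

```python
import numpy as np

def nudft(xs, vals, ks):
    # Nonuniform discrete Fourier transform over the training set:
    # F[k] = (1/n) * sum_i vals_i * exp(-i 2*pi * k . x_i)
    phase = np.exp(-2j * np.pi * (ks @ xs.T))       # (n_k, n)
    return phase @ vals / len(vals)

def projected_relative_error(X, y, h, k_grid):
    # X: (n, d) inputs; y, h: (n,) labels and network outputs;
    # k_grid: (n_k,) scalar frequencies along the projection direction.
    Xc = X - X.mean(0)
    p1 = np.linalg.svd(Xc, full_matrices=False)[2][0]   # unit direction
    ks = k_grid[:, None] * p1[None, :]                  # k = k * p1
    y_k, h_k = nudft(Xc, y, ks), nudft(Xc, h, ks)
    return np.abs(h_k - y_k) / (np.abs(y_k) + 1e-12)    # Delta_F(k)
```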
23
Decompose frequency domain by filtering
[Figure: the spectrum of $y$ is split at a cutoff frequency $k_0$ into a low-frequency part and a high-frequency part.]
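A sketch of the filtering idea, using real-space kernel smoothing with width $\delta \sim 1/k_0$ as a stand-in for a Gaussian low-pass filter; the kernel choice and $\delta$ are assumptions, not the paper's exact construction.

```python
import numpy as np

def split_low_high(X, y, delta):
    # Low/high-frequency split of a dataset by Gaussian kernel smoothing.
    # delta ~ 1/k0 sets the effective frequency cutoff.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    G = np.exp(-d2 / (2 * delta ** 2))
    y_low = (G @ y) / G.sum(1)       # smoothed, low-frequency part
    return y_low, y - y_low          # high-frequency residual

# During training one tracks e_low = ||h_low - y_low|| / ||y_low|| and the
# analogous e_high: the F-Principle predicts e_low shrinks first.
```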
24
F-Principle in high-dim space
[Figures: relative errors of the low- and high-frequency parts during training on MNIST and CIFAR10.]
25
Roadmap (repeated): Experiments → Theory → Understanding → Application; DNNs prefer low frequencies. References as listed above.
26
Theory in an idealized setting
Consider a one-hidden-layer tanh DNN fitting a 1-d function $f$:
$h(x) = \sum_{j=1}^{m} a_j \sigma(w_j x + b_j), \qquad \hat h(k) \approx \sum_{j=1}^{m} a_j \exp(\mathrm{i} b_j k / w_j)\, \exp\!\big(-\pi |k| / (2|w_j|)\big).$
Define the loss at frequency $k$: $L(k) = |\hat h(k) - \hat f(k)|^2$. By Parseval's theorem, $L = \int L(k)\,\mathrm{d}k = \int |f(x) - h(x)|^2\,\mathrm{d}x$. Compute the gradient from the loss in the Fourier domain: $\theta \leftarrow \theta - \eta\, \partial L(k)/\partial\theta$.
27
Analysis:
$\frac{\partial L(k)}{\partial \theta} \approx A(k)\, \exp\!\big(-\pi|k|/(2|w_j|)\big)\, G(\theta, k),$ where $A(k) = |\hat h(k) - \hat f(k)|$; the exponential factor suppresses high frequencies.
If $A(k) > 0$: when $w_j$ is small, $\exp(-\pi|k|/(2|w_j|))$ dominates, so low frequencies dominate the gradient. For $w_j \in B_\delta$, a ball centered at 0 with radius $\delta$, if $\delta$ is small the contribution of the high-frequency loss is negligible.
If $A(k) \approx 0$: the contribution from $L(k)$ is small.
Insight: the smoothness/regularity of the activation function $\sigma(\cdot)$ is converted into the F-Principle through gradient-based training.
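A quick numeric check of the exponential factor above, with illustrative values of $w$ and $k$: a few steps up in frequency already suppress the gradient by many orders of magnitude.

```python
import numpy as np

# Factor exp(-pi*|k| / (2*|w|)) from the slide, for a small weight w = 0.5.
w = 0.5
for k in (1.0, 5.0, 10.0):
    print(k, np.exp(-np.pi * abs(k) / (2 * abs(w))))
# k=1  -> ~4.3e-02;  k=5 -> ~1.5e-07;  k=10 -> ~2.2e-14
```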
28
Low-frequency gradients dominate high-frequency gradients almost surely as $\delta \to 0$.
Xu, Zhang, Luo, Xiao, Ma, Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019
29
Theory of F-Principle for general DNNs
The result covers general (i) network architectures, (ii) activation functions, and (iii) loss functions. (Luo, Ma, Xu, Zhang, Theory on Frequency Principle in General Deep Neural Networks, 2019)
30
F-Principle: DNNs prefer low frequencies.
31
Roadmap (repeated): Experiments → Theory → Understanding → Application; DNNs prefer low frequencies. References as listed above.
32
Effect of early stopping
Fit a noisy function: the test error worsens after some training step, because the training and test data agree only in the low-frequency part while the noise dominates the high frequencies, which the DNN fits last. Early stopping therefore acts as an implicit regularization.
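A sketch of this effect, using fixed random tanh features trained by gradient descent as a stand-in for a DNN; the low-frequency target, noise level, and all hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
xtr = np.sort(rng.uniform(-1, 1, 40))
xte = np.linspace(-1, 1, 200)
target = lambda x: np.sin(np.pi * x)                  # low-frequency signal
ytr = target(xtr) + 0.3 * rng.normal(size=xtr.size)   # broadband label noise

Wf = rng.normal(0, 20, 300)                           # fixed random features,
bf = rng.uniform(-3, 3, 300)                          # fine enough to fit noise
feat = lambda x: np.tanh(np.outer(x, Wf) + bf)
Ptr, Pte = feat(xtr), feat(xte)

a = np.zeros(300)
for step in range(1, 30001):
    a -= 3e-3 * Ptr.T @ (Ptr @ a - ytr) / len(xtr)    # gradient descent
    if step % 5000 == 0:
        print(step,
              np.mean((Ptr @ a - ytr) ** 2),           # train error keeps falling
              np.mean((Pte @ a - target(xte)) ** 2))   # test error: down, then up
```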
33
Generalization difference
F-Principle: DNNs prefer low frequencies. For $x \in \{-1,1\}^n$, the parity function $f(x) = \prod_{j=1}^{n} x_j$ equals 1 for an even number of '-1' entries and -1 for an odd number, a purely high-frequency target. Test accuracy: 96.3% >> 10% and 72% >> 10% on low-frequency-dominant image benchmarks with 10 classes, but ~50% on parity, i.e., random guessing.
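Why parity is the worst case for a low-frequency-biased learner, in one computation: on $\{-1,1\}^n$, parity is the single highest-order Walsh (Fourier) basis function, so all of its spectral mass sits at the top "frequency".

```python
import numpy as np
from itertools import product

n = 8
X = np.array(list(product([-1, 1], repeat=n)))   # all 2^n points of {-1,1}^n
f = X.prod(axis=1)                               # parity labels

# Fourier-Walsh coefficient for a subset S: mean of f(x) * prod_{j in S} x_j.
top = (f * X.prod(axis=1)).mean()                # S = {1,...,n}, full degree
const = f.mean()                                 # S = {} (the "DC" term)
single = (f * X[:, 0]).mean()                    # S = {1}, a low-order term
print(top, const, single)                         # -> 1.0, 0.0, 0.0
```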
34
Roadmap (repeated): Experiments → Theory → Understanding → Application; DNNs prefer low frequencies. References as listed above.
35
Framework application
2019 Turing award: the spectral-bias paper from Yoshua Bengio's group (Rahaman et al.) writes: "In a work done independently from ours, and made available online almost at the same time, Xu et al. (2018) make the same observation that lower frequencies are learned first. The subsequent work by Xu (2018) proposes a theoretical analysis of the phenomenon in the case of 2-layer networks with sigmoid activation, based on the spectrum of the sigmoid function."
36
F-Principle as a tool. Jagtap, A. D. & Karniadakis, G. E. (2019), 'Adaptive activation functions accelerate convergence in deep and physics-informed neural networks': "In this work, we shall use the F-principle to observe the performance of adaptive activation function." [Figure: convergence with adaptive $a$ vs. fixed $a$.]
37
Slides from Prof. Weinan E at the 12th National Annual Conference on Computational Mathematics (第十二届全国计算数学年会)
38
Slides from Prof. Weinan E at the 12th National Annual Conference on Computational Mathematics
39
DNN methods for PDEs suffer from the high-frequency curse.
Slides from Prof. Weinan E at the 12th National Annual Conference on Computational Mathematics
40
DNN-Jacobi method and PhaseDNN (both still suffer from the curse of dimensionality)
DNN-Jacobi: green stars show the DNN solution; the dashed line is the Jacobi iteration initialized by the DNN (Xu et al., Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019). PhaseDNN partitions the frequency space (Wei Cai et al., 2019).
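A sketch of why the hybrid is natural: Jacobi iteration on a 1-d Poisson problem damps high-frequency error quickly and low-frequency error slowly, exactly complementary to the F-Principle. Here an initial guess that is already correct at low frequency stands in for the DNN output; the grid size and target are illustrative.

```python
import numpy as np

# Solve -u'' = f on (0,1) with zero Dirichlet boundaries.
N = 127
h = 1.0 / (N + 1)
x = np.linspace(h, 1 - h, N)
u_true = np.sin(np.pi * x) + 0.1 * np.sin(40 * np.pi * x)
f = np.pi**2 * np.sin(np.pi * x) + 0.1 * (40 * np.pi)**2 * np.sin(40 * np.pi * x)

u = np.sin(np.pi * x)          # "DNN output": low frequency already captured
for it in range(200):          # Jacobi sweeps fix the high-frequency part
    u_pad = np.pad(u, 1)       # zero boundary values
    u = (u_pad[:-2] + u_pad[2:] + h * h * f) / 2.0
    if (it + 1) % 50 == 0:
        print(it + 1, np.abs(u - u_true).max())   # error drops rapidly
```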
41
MscaleDNN. Goal: facilitate the learning of many frequencies while having the potential to overcome the curse of dimensionality. (Cai, Xu*, 2019)
42
MscaleDNN. Goal: facilitate the learning of many frequencies while having the potential to overcome the curse of dimensionality. Example with $A = 3$: the input $x_1$ feeds the first hidden layer at scales 1, 2, 3, i.e., pre-activations $w_{11}x_1 + b_1$, $w_{21}x_1 + b_2$, $2w_{31}x_1 + b_3$, $2w_{41}x_1 + b_4$, $3w_{51}x_1 + b_5$, $3w_{61}x_1 + b_6$, followed by the other hidden layers and the output layer. (Cai, Xu*, MscaleDNN, 2019)
43
Compact supported activation function.
To give the MscaleDNN scale-separation and identification capability, we take a hint from the theory of compactly supported mother scaling functions in wavelet theory; a sketch follows.
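A sketch of the two ingredients together: scale-separated input channels plus a compactly supported activation. sReLU(x) = ReLU(x)·ReLU(1-x) is used here as one such activation, and the wiring details are assumptions rather than the exact MscaleDNN architecture.

```python
import numpy as np

def srelu(z):
    # Compactly supported activation: nonzero only on (0, 1).
    return np.maximum(z, 0.0) * np.maximum(1.0 - z, 0.0)

def mscale_first_layer(x, W, b, A=3):
    # W, b: (m,) parameters of the first hidden layer, m divisible by A.
    # The input is fed in parallel at scales 1, 2, ..., A, so high
    # frequencies appear to the network as rescaled low ones.
    m = W.size
    scales = np.repeat(np.arange(1, A + 1), m // A)   # 1,1,..,2,2,..,3,3,..
    return srelu(scales * W * x + b)

rng = np.random.default_rng(0)
print(mscale_first_layer(0.4, rng.normal(size=6), rng.normal(size=6), A=3))
```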
44
Solving high-dimensional PDEs
Example
45
Roadmap (repeated): Experiments → Theory → Understanding → Application; DNNs prefer low frequencies. References as listed above.
46
Linear approximation for wide NN
$h(x, \theta(t)) \approx h(x, \theta_0) + \nabla_\theta h(x, \theta_0)\,\big(\theta(t) - \theta_0\big)$ for any $t > 0$ (wide-network, "lazy" regime).
$L(\theta) = \|h(X,\theta) - Y\|_2^2, \qquad \frac{\mathrm{d}\theta(t)}{\mathrm{d}t} = -\nabla_\theta h(X, \theta_0)^{T}\big(h(X, \theta(t)) - Y\big),$
where $X = \{x_i\}_{i=1}^{M}$, $Y = \{y_i\}_{i=1}^{M}$. [Figure: CIFAR10, CNN.] (Lee et al., 2019)
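A toy computation of what linearization buys: if $h$ stays linear in $\theta$, the training residual evolves as $u(t) = \mathrm{e}^{-Kt}u(0)$ with $K = \nabla_\theta h\, \nabla_\theta h^{T}$ the tangent kernel at initialization, so each kernel eigenmode decays at its own rate. The random Jacobian below is an illustrative stand-in for a real network's.

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.normal(size=(10, 500)) / np.sqrt(500)   # Jacobian dh/dtheta at theta_0
K = J @ J.T                                     # empirical tangent kernel
u0 = rng.normal(size=10)                        # initial residual h(X,theta_0) - Y

lam, V = np.linalg.eigh(K)                      # eigen-decomposition of K
for t in (0.0, 1.0, 10.0):
    u_t = V @ (np.exp(-lam * t) * (V.T @ u0))   # solution of du/dt = -K u
    print(t, np.linalg.norm(u_t))               # residual norm shrinks with t
```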
47
Linear F-Principle dynamics
Start from the gradient flow $\frac{\mathrm{d}\theta(t)}{\mathrm{d}t} = -\nabla_\theta h(X, \theta_0)^{T}\big(h(X,\theta(t)) - Y\big)$ for the two-layer network $h(\cdot,\theta) = \sum_{i=1}^{N} w_i\, \sigma\big(r_i(\cdot + l_i)\big)$ on a 1-d function. For sufficiently large $N$ (up to absolute constants):
$\partial_t \hat h(\xi, t) = \Big( \frac{\langle r|w|^2 \rangle}{\xi^2} + \frac{\langle r^2 w^2 \rangle}{\pi^2 \xi^4} \Big)\,\big( \widehat{f_p}(\xi) - \widehat{h_p}(\xi, t) \big),$
where $\langle\cdot\rangle = \frac{1}{N}\sum_{i=1}^{N}(\cdot)_i$ averages over neurons; $f$ is the target function; $(\cdot)_p = (\cdot)\,p$ with $p(x) = \frac{1}{M}\sum_{i=1}^{M}\delta(x - x_i)$; $\hat{\ }$ denotes the Fourier transform; $\xi$ is the frequency.
For simplicity, with $u(x,t) = h(x,t) - f(x)$:
$\partial_t \hat u(\xi, t) = -\Big( \frac{\langle r|w|^2 \rangle}{\xi^2} + \frac{\langle r^2 w^2 \rangle}{\pi^2 \xi^4} \Big)\, \widehat{u_p}(\xi, t).$
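A diagonal caricature of the LFP dynamics (an assumption: the convolution with the sampling measure $p$ is ignored so frequencies decouple), showing how strongly the decay rate favors low $\xi$.

```python
import numpy as np

# Each frequency decays independently at rate gamma(xi) = c2/xi^2 + c4/xi^4.
# c2, c4 are placeholders for the initialization-dependent averages above.
c2, c4 = 1.0, 1.0
xi = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
gamma = c2 / xi**2 + c4 / xi**4
for t in (1.0, 10.0, 100.0):
    print(t, np.exp(-gamma * t))   # remaining residual fraction per frequency
```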
48
Preference induced by LFP dynamics
Writing $\gamma(\xi) = \frac{\langle r|w|^2\rangle}{\xi^2} + \frac{\langle r^2 w^2\rangle}{\pi^2\xi^4}$ (constants suppressed as above), the long-time limit of the LFP dynamics solves
$\min_{h \in F_\gamma} \int \gamma(\xi)^{-1}\, |\hat h(\xi)|^2\, \mathrm{d}\xi \quad \text{s.t.}\ h(x_i) = y_i,\ i = 1,\dots,n,$
an explicit low-frequency preference.
Case 1: the $\xi^{-4}$ term dominates: $\min \int \xi^4 |\hat h(\xi)|^2\, \mathrm{d}\xi \sim \min \int |h''(x)|^2\, \mathrm{d}x$, since $\widehat{h''}(\xi) = -4\pi^2\xi^2\hat h(\xi)$; the minimizer is a cubic spline.
Case 2: the $\xi^{-2}$ term dominates: $\min \int \xi^2 |\hat h(\xi)|^2\, \mathrm{d}\xi \sim \min \int |h'(x)|^2\, \mathrm{d}x$; the minimizer is a linear spline.
49
Two-layer ReLU NN: LFP dynamics $\partial_t \hat u(\xi,t) = -\big(\langle r|w|^2\rangle/\xi^2 + \langle r^2 w^2\rangle/(\pi^2\xi^4)\big)\,\widehat{u_p}(\xi,t)$; $w$ and $r$ are initialized from uniform distributions on $[-0.1, 0.1]$ and $[-0.25, 0.25]$.
50
Two-layer ReLU NN: LFP dynamics $\partial_t \hat u(\xi,t) = -\big(\langle r|w|^2\rangle/\xi^2 + \langle r^2 w^2\rangle/(\pi^2\xi^4)\big)\,\widehat{u_p}(\xi,t)$; $w$ and $r$ are initialized from uniform distributions on $[-2, 2]$.
51
High-dimensional case: $\partial_t \hat u(\xi, t) = -\big( \langle r|w|^2\rangle/\xi^{d+1} + \langle r^2 w^2\rangle/(\pi^2 \xi^{d+3}) \big)\, \widehat{u_p}(\xi, t)$ (up to absolute constants; the decay exponents grow with the input dimension $d$).
52
An a priori generalization bound
Cf. A Priori Estimates for Two-layer Neural Networks, Weinan E, Chao Ma, Lei Wu (2019); here with/without a regularity term in the loss function.
53
A NN cannot learn well a target function dominated by high frequencies.
54
An a priori generalization error bound
[Figure: target $\sin(vx)$ as the frequency $v$ increases.]
55
Roadmap (repeated): Experiments → Theory → Understanding → Application; DNNs prefer low frequencies. References as listed above.
56
Conclusion: DNNs prefer low frequencies!
Joint work with Yaoyu Zhang, Tao Luo, Zheng Ma, Yanyang Xiao, Wei Cai (SMU). Acknowledgements: Weinan E (Princeton), David W. McLaughlin (CIMS). References: as listed on the roadmap slide above. A summary note on the F-Principle can be found at my page xuzhiqin.top.
57
Fitting high-d functions
Linear: Non-linear: