1 Deep learning Theory: F-Principle
Zhiqin Xu (许志钦), Shanghai Jiao Tong University. Joint work with Yaoyu Zhang, Tao Luo, Zheng Ma, Yanyang Xiao, Wei Cai. 12.05

2 Single hidden layer can fit any function
Universal approximation (1989): a single hidden layer suffices to approximate any continuous function.

3 Fitting is not enough!

4 Generalization error and model complexity
[Figure: classical test-error vs. model-complexity curve; the gap between training and test error is the generalization gap.]

5 Mystery of generalization in deep learning
"With four parameters you can fit an elephant to a curve; with five you can make him wiggle his trunk.” -- John von Neumann Mayer et al., 2010

6 Model: millions of parameters; data: five points
Overfitting?

7 DNN says: NO!

8 Flat output between data points: # of parameters ≈ 1600 × (number of layers) ≫ 5
Lei Wu et al., 2017

9 Large complexity, but good generalization!
Puzzle: DNNs generalize well even when # of parameters ≫ # of training data, yet the same networks can also fit randomly shuffled labels (CIFAR-10: 32x32 colour images in 10 classes). Zhang et al., 2016

10 Dynamics is key to the “preference”/“implicit regularization” of DNNs
Trajectory in parameter space: $t \mapsto \Theta(t)$. Trajectory in function space: $t \mapsto h(x,t) := h(x; \Theta(t))$. Our approach: discover and characterize the dynamical behavior of $h(x,t)$ during training on synthetic problems; examine its empirical universality on real datasets and deep networks; establish theories for a mechanistic understanding; understand DNNs in applications; use DNNs better to solve problems.

11 F-Principle: roadmap
Experiments: F-Principle in 1d, 2d, ..., and real data [1,2] — DNNs prefer low frequencies; quantitative empirical study [2].
Theory: 2-layer tanh [1]; theoretical tools for general DNNs [6]; Linear FP [5] — accurate prediction for 2-layer ReLU.
Understanding: early stopping [2]; poor performance on the parity function [1]; compression phase [2]; error analysis [4,5].
Application: DNN + traditional methods — fast for low-dimensional problems [1]; MscaleDNN — high-dimensional and high-frequency PDEs [7].
References:
[1] Xu*, Zhang, Luo, Xiao, Ma, Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019.
[2] Xu*, Zhang, Xiao, Training Behavior of Deep Neural Network in Frequency Domain, 2018.
[3] Xu*, Understanding Training and Generalization in Deep Learning by Fourier Analysis, 2018.
[4] Zhang, Xu*, Luo, Ma, Explicitizing an Implicit Bias of the Frequency Principle in Two-layer Neural Networks, 2019.
[5] Zhang, Xu*, Luo, Ma, A Type of Generalization Error Induced by Initialization in Deep Neural Networks, 2019.
[6] Luo, Ma, Xu, Zhang, Theory on Frequency Principle in General Deep Neural Networks, 2019.
[7] Cai, Xu*, Multi-scale Deep Neural Networks for Solving High Dimensional PDEs, 2019.

13 Example: Frequency Principle
Red: target function Blue: DNN fitting Zhiqin Xu et al., Training behavior of deep neural network in frequency domain, 2018

15 Example: Frequency Principle
Red: target function; blue: DNN output. Right panel: FFT of the target (red) and of the DNN output (blue). Each frame is several training steps. Zhiqin Xu et al., Training behavior of deep neural network in frequency domain, 2018

16 Example: Frequency Principle
i) The DNN first captures low-frequency components while keeping high-frequency ones small; ii) it then gradually captures the high-frequency components. Red: target function; blue: DNN output. Right panel: FFT of the target (red) and of the DNN output (blue); axes: amplitude vs. frequency. Each frame is one training step. Xu, Zhang, Xiao, Training behavior of deep neural network in frequency domain, 2018
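A minimal sketch that reproduces this behavior (the width, optimizer, learning rate, and the three target frequencies below are illustrative choices, not the paper's exact setup): train a small tanh network on a sum of three sinusoids and track the relative error at each target frequency via the FFT.

```python
# Sketch: fit f(x) = sin(x) + sin(3x) + sin(5x) and watch the per-frequency
# relative error; the bin-1 error falls first, bin-5 last (F-Principle).
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
x_np = np.linspace(-np.pi, np.pi, 200, endpoint=False)   # periodic grid
x = torch.tensor(x_np, dtype=torch.float32).reshape(-1, 1)
y = torch.sin(x) + torch.sin(3 * x) + torch.sin(5 * x)

net = nn.Sequential(nn.Linear(1, 200), nn.Tanh(), nn.Linear(200, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
y_fft = np.fft.rfft(y.numpy().ravel())                   # peaks at bins 1, 3, 5

for step in range(1, 5001):
    opt.zero_grad()
    ((net(x) - y) ** 2).mean().backward()
    opt.step()
    if step % 500 == 0:
        h_fft = np.fft.rfft(net(x).detach().numpy().ravel())
        rel = [abs(h_fft[k] - y_fft[k]) / abs(y_fft[k]) for k in (1, 3, 5)]
        print(step, ["%.2f" % e for e in rel])           # low freq converges first
```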

17 Target with equal amplitudes across frequencies. Xu et al., Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019

18 How does a DNN fit a 2-d image? Target: image $I(\mathbf{x}): \mathbb{R}^2 \to \mathbb{R}$
$\mathbf{x}$: location of a pixel; $I(\mathbf{x})$: grayscale pixel value
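Concretely, the regression dataset is just (pixel coordinate, intensity) pairs; a sketch with a synthetic stand-in image (any grayscale array works):

```python
# Build (x, I(x)) pairs from a grayscale image for a DNN regression task.
import numpy as np

img = np.linspace(0, 1, 64 * 64).reshape(64, 64)     # stand-in grayscale image
rows, cols = np.meshgrid(np.arange(64), np.arange(64), indexing="ij")
X = np.stack([rows.ravel(), cols.ravel()], axis=1) / 63.0  # pixel locations in [0,1]^2
Y = img.ravel()                                            # grayscale values
# Per the F-Principle, a DNN h(x) trained on (X, Y) recovers the coarse
# (low-frequency) structure first and sharp edges (high frequencies) later.
```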

19 High-dimensional real data?
Does the F-Principle hold for high-dimensional real data? Xu, Zhang, Luo, Xiao, Ma, Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019

20 Frequency: image frequency (not used here) vs. response frequency
Image frequency is the rate of change of intensity across neighbouring pixels: a region of uniform color is low frequency, a sharp edge is high frequency. Response frequency is the frequency of the general input-output mapping $f$: a high response frequency means a small change in the intensity of one pixel can induce a large change of the output, as in adversarial examples (Goodfellow et al.).

21 Examining the F-Principle for high-dimensional real problems
Nonuniform discrete Fourier transform (NUDFT) of the training dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$: $\hat{y}_\mathbf{k} = \frac{1}{n}\sum_{i=1}^n y_i e^{-i 2\pi \mathbf{k}\cdot \mathbf{x}_i}$, $\hat{h}_\mathbf{k}(t) = \frac{1}{n}\sum_{i=1}^n h(\mathbf{x}_i, t)\, e^{-i 2\pi \mathbf{k}\cdot \mathbf{x}_i}$. Difficulty: curse of dimensionality, i.e., the number of $\mathbf{k}$'s grows exponentially with the dimension $d$. Our approaches: (1) projection, i.e., choose $\mathbf{k} = k\,\mathbf{p}_1$; (2) filtering.
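A NumPy sketch of the NUDFT and the projection trick (the random data and the direction $\mathbf{p}_1$ are placeholders for a real dataset and a chosen direction); it also computes the relative error $\Delta_F(k)$ used on the next slide:

```python
import numpy as np

def nudft(points, values, ks):
    """Nonuniform DFT: one coefficient per frequency row of ks."""
    phase = np.exp(-2j * np.pi * points @ ks.T)      # (n, num_k)
    return values @ phase / len(values)              # (num_k,)

def projected_spectrum(points, values, p1, k_grid):
    """Restrict to frequencies k = k * p1 along one direction p1."""
    return nudft(points, values, np.outer(k_grid, p1))

rng = np.random.default_rng(0)
X = rng.random((100, 10))                            # n=100 points in d=10
y = rng.standard_normal(100)                         # stand-in labels
h = y + 0.1 * rng.standard_normal(100)               # stand-in DNN outputs h(x_i, t)
p1 = np.ones(10) / np.sqrt(10)
k_grid = np.arange(1, 21)
y_hat = projected_spectrum(X, y, p1, k_grid)
h_hat = projected_spectrum(X, h, p1, k_grid)
delta_F = np.abs(h_hat - y_hat) / np.abs(y_hat)      # relative error per frequency
```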

22 Projection approach
Relative error: $\Delta_F(k) = |\hat{h}_k - \hat{y}_k| / |\hat{y}_k|$. [Figures: MNIST, CIFAR10]

23 Decompose frequency domain by filtering
Split $\hat{y}$ at a cutoff frequency $k_0$ into a low-frequency part and a high-frequency part.
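A hedged sketch of the filtering idea (the Gaussian kernel and its width `delta` stand in for the paper's exact filter; smoothing over the training set acts as a low-pass filter with cutoff ~ $k_0$):

```python
import numpy as np

def lowpass(points, values, delta):
    """Low-frequency part of the labels via Gaussian smoothing over the dataset."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    G = np.exp(-d2 / (2 * delta ** 2))               # kernel width ~ 1/k0
    return (G @ values) / G.sum(axis=1)

rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = rng.standard_normal(200)
y_low = lowpass(X, y, delta=0.3)                     # low-frequency part of y
y_high = y - y_low                                   # high-frequency remainder
# If the F-Principle holds, the DNN output's error against y_low falls early
# in training, and against y_high only later.
```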

24 F-Principle in high-dim space
[Figures: the low-frequency part converges first on MNIST and CIFAR10]

25 F-Principle roadmap (recap; see slide 11). Next: theory.

26 Theory in an idealized setting
Consider a one-hidden-layer tanh DNN fitting a 1-d function $f$: $h(x) = \sum_{j=1}^m a_j \sigma(w_j x + b_j)$, whose Fourier transform satisfies $\hat{h}(k) \approx \sum_{j=1}^m a_j \exp\left(i b_j k / w_j\right) \exp\left(-\left|\pi k / (2 w_j)\right|\right)$. Define the loss at frequency $k$: $L(k) = |\hat{h}(k) - \hat{f}(k)|^2$. By Parseval's theorem, $L = \int L(k)\, dk = \int |f(x) - h(x)|^2\, dx$. Compute the gradient from the loss in the Fourier domain: $\theta \leftarrow \theta - \eta\, \partial L(k) / \partial \theta$.
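As a quick numerical sanity check (not from the paper; the sizes and the weight scale 0.5 are arbitrary), one can sample such a network and watch its spectrum decay with frequency, faster when the $|w_j|$ are small:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
a = rng.standard_normal(m)
w = 0.5 * rng.standard_normal(m)        # small |w_j|: fast spectral decay
b = rng.standard_normal(m)

x = np.linspace(-np.pi, np.pi, 1024, endpoint=False)
h = np.tanh(np.outer(x, w) + b) @ a     # h(x) = sum_j a_j tanh(w_j x + b_j)
amp = np.abs(np.fft.rfft(h)) / len(x)
print(np.round(amp[:8], 4))             # amplitudes drop rapidly with k
```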

27 Analysis: $\frac{\partial L(k)}{\partial \theta} \approx A(k) \exp\left(-\left|\pi k / (2 w_j)\right|\right) G(\theta, k)$, where $A(k) = |\hat{h}(k) - \hat{f}(k)|$. When $A(k) > 0$: if $w_j$ is small, $\exp(-|\pi k / (2 w_j)|)$ dominates, so low frequencies dominate the gradient; for $w_j \in B_\delta$, the ball centered at 0 with radius $\delta$, if $\delta$ is small, the contribution of the high-frequency loss is negligible. When $A(k) \approx 0$: $L(k)$ contributes little. Insight: the smoothness/regularity of the activation function $\sigma(\cdot)$ is converted into the F-Principle through gradient-based training.

28 Low-frequency gradients dominate high-frequency gradients almost surely as $\delta \to 0$.
Xu, Zhang, Luo, Xiao, Ma, Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019

29 Theory of F-Principle for general DNNs
The theory covers general (i) network architectures, (ii) activation functions, and (iii) loss functions. Luo, Ma, Xu, Zhang, Theory on Frequency Principle in General Deep Neural Networks, 2019.

30 F-Principle: DNNs prefer low frequencies

31 F-Principle roadmap (recap; see slide 11). Next: understanding — early stopping and the parity function.

32 Effect of early stopping
Fitting a noisy function: the test error worsens after some training step, because the training and test data agree only in the low-frequency part. Early stopping therefore keeps the signal-dominated low frequencies and avoids fitting the high-frequency noise.
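A hedged sketch of this experiment (the target, noise level, width, and step counts are invented for illustration): track the test error against the clean signal and keep the best step.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-1, 1, 80).reshape(-1, 1)
y = torch.sin(np.pi * x) + 0.3 * torch.randn(80, 1)   # low-freq signal + noise
x_test = torch.linspace(-1, 1, 200).reshape(-1, 1)
y_test = torch.sin(np.pi * x_test)                    # clean test target

net = nn.Sequential(nn.Linear(1, 256), nn.Tanh(), nn.Linear(256, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
best = (float("inf"), -1)
for step in range(5000):
    opt.zero_grad()
    ((net(x) - y) ** 2).mean().backward()
    opt.step()
    with torch.no_grad():
        te = ((net(x_test) - y_test) ** 2).mean().item()
    best = min(best, (te, step))
# The minimum typically occurs early: after it, the net starts fitting the
# high-frequency noise and the test error rises.
print("best test mse %.4f at step %d" % best)
```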

33 Generalization difference
F-Principle: DNNs prefer low frequencies. Parity function: for $x \in \{-1,1\}^n$, $f(x) = \prod_{j=1}^n x_j$; an even number of $-1$s maps to $1$, an odd number to $-1$ — a purely high-frequency target. Test accuracy: 96.3% ≫ 10% (MNIST); 72% ≫ 10% (CIFAR10); ~50% on parity, i.e., random guessing.
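For concreteness, a small sketch that builds the parity dataset (the dimension $n = 10$ and sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
X = rng.choice([-1.0, 1.0], size=(1024, n))   # inputs on the hypercube corners
y = np.prod(X, axis=1)                        # +1 for an even number of -1s, else -1
# All Fourier mass of this target sits at the single highest-frequency mode,
# so a low-frequency-biased learner does no better than chance on held-out points.
```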

34 F-Principle roadmap (recap; see slide 11). Next: applications.

35 Framework application
2019 Turing award (Yoshua Bengio). From a paper co-authored by Bengio: "In a work done independently from ours, and made available online almost at the same time, Xu et al. (2018) make the same observation that lower frequencies are learned first. The subsequent work by Xu (2018) proposes a theoretical analysis of the phenomenon in the case of 2-layer networks with sigmoid activation, based on the spectrum of the sigmoid function."

36 F-Principle as a tool. Jagtap, A. D. & Karniadakis, G. E. (2019), 'Adaptive activation functions accelerate convergence in deep and physics-informed neural networks': "In this work, we shall use the F-principle to observe the performance of adaptive activation function." [Figures: adaptive $a$ vs. fixed $a$]

37 Slides from Prof. Weinan E at the 12th National Annual Conference on Computational Mathematics (第十二届全国计算数学年会)

38 Slides from Prof. Weinan E at the 12th National Annual Conference on Computational Mathematics (第十二届全国计算数学年会)

39 DNN-based solvers suffer from the high-frequency curse
Slides from Prof. Weinan E at the 12th National Annual Conference on Computational Mathematics (第十二届全国计算数学年会)

40 Remedies: DNN-Jacobi method and PhaseDNN — but both suffer from the high-dimension curse
Green stars: DNN. Dashed line: Jacobi iteration initialized by the DNN output. PhaseDNN (Wei Cai et al., 2019) partitions the frequency space. Xu et al., Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019

41 MscaleDNN Goal: facilitate the learning of many frequencies while having the potential to overcome the curse of dimensionality. Cai, Xu*, 2019

42 MscaleDNN. Goal: facilitate the learning of many frequencies while having the potential to overcome the curse of dimensionality. Example with $A = 3$: the first hidden layer is split into groups whose inputs are scaled by 1, 2, and 3, i.e., pre-activations $w_{11}x_1 + b_1$, $w_{21}x_1 + b_2$, $w_{31}(2x_1) + b_3$, $w_{41}(2x_1) + b_4$, $w_{51}(3x_1) + b_5$, $w_{61}(3x_1) + b_6$; the remaining hidden layers and the output layer are standard. Cai, Xu*, MscaleDNN, 2019
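A minimal PyTorch sketch of such a first layer, under my reading of the slide (group the first-layer neurons and feed group $s$ the input scaled by $s$; the tanh activation and widths are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MscaleFirstLayer(nn.Module):
    """First hidden layer with per-group input scales 1..A (a sketch)."""
    def __init__(self, d_in, width, A=3):
        super().__init__()
        assert width % A == 0
        self.lin = nn.Linear(d_in, width)
        # neuron groups s = 1..A compute sigma(w . (s*x) + b), as on the slide
        scales = torch.repeat_interleave(
            torch.arange(1, A + 1, dtype=torch.float32), width // A)
        self.register_buffer("scales", scales)

    def forward(self, x):
        z = F.linear(x, self.lin.weight)        # w . x; bias added after scaling
        return torch.tanh(self.scales * z + self.lin.bias)

net = nn.Sequential(MscaleFirstLayer(1, 120, A=3),
                    nn.Linear(120, 120), nn.Tanh(),
                    nn.Linear(120, 1))
out = net(torch.randn(8, 1))                    # sanity check: shape (8, 1)
```

Variants of MscaleDNN also use geometric scales such as $2^{s-1}$ in place of $1, \dots, A$.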

43 Compactly supported activation functions
To give the MscaleDNN scale-separation and identification capability, we take a hint from the compactly supported mother scaling functions of wavelet theory.
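For example, one compactly supported activation proposed in this line of work (to my recollection, so treat the exact form as an assumption) is sReLU, the product of two ReLUs:

```python
import torch

def srelu(x):
    # sReLU(x) = ReLU(x) * ReLU(1 - x): nonzero only on (0, 1)
    return torch.relu(x) * torch.relu(1.0 - x)
```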

44 Solving high-dimensional PDEs
Example

45 F-Principle roadmap (recap; see slide 11). Next: Linear FP and error analysis.

46 Linear approximation for wide NN
$h(x; \theta(t)) = h(x; \theta_0) + \nabla_\theta h(x; \theta_0) \cdot (\theta(t) - \theta_0)$ for any $t > 0$. Loss: $L(\theta) = \|h(X, \theta) - Y\|_2^2$. Gradient flow: $\frac{d\theta(t)}{dt} = -\nabla_\theta h(X, \theta_0)^T \left(h(X, \theta(t)) - Y\right)$, with $X = \{x_i\}_{i=1}^M$, $Y = \{y_i\}_{i=1}^M$. [Figure: CIFAR10, CNN; Lee et al., 2019]

47 Linear F-Principle dynamics
Start from the gradient flow $\frac{d\theta(t)}{dt} = -\nabla_\theta h(X, \theta_0)^T \left(h(X, \theta(t)) - Y\right)$ for the two-layer network $h(\cdot, \theta) = \sum_{i=1}^N w_i\, \sigma(r_i(\cdot + l_i))$ fitting a 1-d function, with $N$ sufficiently large. Then $\partial_t \hat{h}(\xi, t) = \left[\frac{\pi^2}{\xi^2}\frac{1}{N}\sum_{i=1}^N r_i^2 w_i^2 + \frac{1}{\xi^4}\frac{1}{N}\sum_{i=1}^N r_i w_i^2\right]\left(\hat{f}_p(\xi, t) - \hat{h}_p(\xi, t)\right)$, where $f$ is the target function, $(\cdot)_p = (\cdot)\,p$ with $p(x) = \frac{1}{M}\sum_{i=1}^M \delta(x - x_i)$, $\hat{(\cdot)}$ denotes the Fourier transform, and $\xi$ is the frequency. For simplicity, writing $u(x,t) = h(x,t) - f(x)$: $\partial_t \hat{u}(\xi, t) = -\left[\frac{\pi^2 \langle r^2 w^2\rangle}{\xi^2} + \frac{\langle r w^2\rangle}{\xi^4}\right]\hat{u}_p(\xi, t)$.
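To see the induced frequency preference concretely, here is a hedged numerical sketch: integrate the simplified dynamics mode-by-mode, ignoring the convolution with the sampling measure $p$ (i.e., pretending the data are dense) and plugging in made-up values for the moments $\langle r^2 w^2 \rangle$ and $\langle r w^2 \rangle$.

```python
import numpy as np

m_r2w2, m_rw2 = 1.0, 1.0                            # illustrative moment values
xi = np.array([1.0, 3.0, 5.0, 7.0])                 # a few frequencies
gamma = np.pi**2 * m_r2w2 / xi**2 + m_rw2 / xi**4   # per-mode decay rate
u0 = np.ones_like(xi)                               # initial Fourier-domain error
for t in (0.0, 1.0, 5.0, 25.0):
    # u(xi, t) = u(xi, 0) * exp(-Gamma(xi) * t): low xi decays fastest
    print(t, np.round(u0 * np.exp(-gamma * t), 4))
```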

48 Preference induced by LFP dynamics
$\partial_t \hat{h}(\xi, t) = -\left[\frac{\pi^2 \langle r^2 w^2\rangle}{\xi^2} + \frac{\langle r w^2\rangle}{\xi^4}\right]\left(\hat{h}_p(\xi, t) - \hat{f}_p(\xi, t)\right)$. The long-time solution solves $\min_{h \in F_\gamma} \int \left[\frac{\pi^2 \langle r^2 w^2\rangle}{\xi^2} + \frac{\langle r w^2\rangle}{\xi^4}\right]^{-1} |\hat{h}(\xi)|^2\, d\xi$ s.t. $h(x_i) = y_i$ for $i = 1, \cdots, n$ — a low-frequency preference. Case 1: the $\xi^{-4}$ term dominant: $\min \int \xi^4 |\hat{h}(\xi)|^2\, d\xi \sim \min \int |h''(x)|^2\, dx$ → cubic spline. Case 2: the $\xi^{-2}$ term dominant: $\min \int \xi^2 |\hat{h}(\xi)|^2\, d\xi \sim \min \int |h'(x)|^2\, dx$ → linear spline.

49 Two-layer ReLU NN: $\partial_t \hat{u}(\xi, t) = -\left[\frac{\pi^2 \langle r^2 w^2\rangle}{\xi^2} + \frac{\langle r w^2\rangle}{\xi^4}\right]\hat{u}_p(\xi, t)$, with $w$ and $r$ initialized from uniform distributions on $[-0.1, 0.1]$ and $[-0.25, 0.25]$.

50 Two-layer ReLU NN: same LFP dynamics, with $w$ and $r$ initialized from a uniform distribution on $[-2, 2]$.

51 High-dimensional case: $\partial_t \hat{u}(\xi, t) = -\left[\frac{\pi^2 \langle r^2 w^2\rangle}{\|\xi\|^{d+1}} + \frac{\langle r w^2\rangle}{\|\xi\|^{d+3}}\right]\hat{u}_p(\xi, t)$.

52 An a priori generalization bound
Cf. A Priori Estimates for Two-layer Neural Networks, Weinan E, Chao Ma, Lei Wu (2019). With/without a regularization term in the loss function.

53 An NN cannot learn well a target function dominated by high frequencies

54 An a priori generalization error bound
Target: $\sin(vx)$.

55 F-Principle roadmap (recap; see slide 11).

56 Conclusion DNNs prefer low frequencies!
DNNs prefer low frequencies! References: see slide 11. Joint work with Yaoyu Zhang, Tao Luo, Zheng Ma, Yanyang Xiao, Wei Cai (SMU). Acknowledgements: Weinan E (Princeton), David W. McLaughlin (CIMS). A summary note on the F-Principle can be found at my page, xuzhiqin.top.

57 Fitting high-d functions
Linear: Non-linear:

