1 Deep learning Theory: F-Principle
Zhiqin Xu (许志钦), Shanghai Jiao Tong University. Joint work with Yaoyu Zhang, Tao Luo, Zheng Ma, Yanyang Xiao, Wei Cai. 12.05

2 Single hidden layer can fit any function
Universal approximation (1989): a single hidden layer suffices to approximate any continuous function.

3 Fitting is not enough!

4 Generalization error and model complexity
[Figure: classical test-error vs. model-complexity curve; the gap between training and test error is the generalization gap.]

5 Mystery of generalization in deep learning
"With four parameters you can fit an elephant to a curve; with five you can make him wiggle his trunk.” -- John von Neumann Mayer et al., 2010

6 Model: millions of parameters; data: five points
Overfitting?

7 DNN says: NO!

8 Flat output between data points: # of parameters ≈ 1600 × (number of layers) ≫ 5
Lei Wu et al., 2017

9 Large complexity, but good generalization!
Puzzle: DNNs generalize well even when # of parameters ≫ # of training data, yet the same networks can also fit randomly shuffled labels (CIFAR-10: 32x32 colour images in 10 classes). Zhang et al., 2016

10 Dynamics is key to the “preference”/“implicit regularization” of DNNs
Trajectory in parameter space: $t \mapsto \Theta(t)$. Trajectory in function space: $t \mapsto h(x,t) := h(x; \Theta(t))$. Our approach: discover and characterize the dynamical behavior of $h(x,t)$ during training on synthetic problems; examine its empirical universality on real datasets and deep networks; establish theories for a mechanistic understanding; understand DNNs in applications; use DNNs better to solve problems.

11 F-Principle: roadmap
Experiments: F-Principle in 1d, 2d, ..., and real data [1,2] — DNNs prefer low frequencies; quantitative empirical study [2].
Theory: 2-layer tanh [1]; theoretical tools for general DNNs [6]; Linear FP [5] — accurate prediction for 2-layer ReLU.
Understanding: early stopping [2]; poor performance on the parity function [1]; compression phase [2]; error analysis [4,5].
Application: DNN + traditional methods — fast for low-dimensional problems [1]; MscaleDNN — high-dimensional and high-frequency PDEs [7].
References:
[1] Xu*, Zhang, Luo, Xiao, Ma, Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019.
[2] Xu*, Zhang, Xiao, Training Behavior of Deep Neural Network in Frequency Domain, 2018.
[3] Xu*, Understanding Training and Generalization in Deep Learning by Fourier Analysis, 2018.
[4] Zhang, Xu*, Luo, Ma, Explicitizing an Implicit Bias of the Frequency Principle in Two-layer Neural Networks, 2019.
[5] Zhang, Xu*, Luo, Ma, A Type of Generalization Error Induced by Initialization in Deep Neural Networks, 2019.
[6] Luo, Ma, Xu, Zhang, Theory on Frequency Principle in General Deep Neural Networks, 2019.
[7] Cai, Xu*, Multi-scale Deep Neural Networks for Solving High Dimensional PDEs, 2019.

13 Example: Frequency Principle
Red: target function Blue: DNN fitting Zhiqin Xu et al., Training behavior of deep neural network in frequency domain, 2018

15 Example: Frequency Principle
Red: target function; blue: DNN output. Right panel: FFT of the target (red) and of the DNN output (blue). Each frame is several training steps. Zhiqin Xu et al., Training behavior of deep neural network in frequency domain, 2018

16 Example: Frequency Principle
i) The DNN first captures low-frequency components while keeping high-frequency ones small; ii) it then gradually captures the high-frequency components. Red: target function; blue: DNN output. Right panel: FFT of the target (red) and of the DNN output (blue); axes: amplitude vs. frequency. Each frame is one training step. Xu, Zhang, Xiao, Training behavior of deep neural network in frequency domain, 2018
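A minimal sketch that reproduces this behavior (the width, optimizer, learning rate, and the three target frequencies below are illustrative choices, not the paper's exact setup): train a small tanh network on a sum of three sinusoids and track the relative error at each target frequency via the FFT.

```python
# Sketch: fit f(x) = sin(x) + sin(3x) + sin(5x) and watch the per-frequency
# relative error; the bin-1 error falls first, bin-5 last (F-Principle).
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
x_np = np.linspace(-np.pi, np.pi, 200, endpoint=False)   # periodic grid
x = torch.tensor(x_np, dtype=torch.float32).reshape(-1, 1)
y = torch.sin(x) + torch.sin(3 * x) + torch.sin(5 * x)

net = nn.Sequential(nn.Linear(1, 200), nn.Tanh(), nn.Linear(200, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
y_fft = np.fft.rfft(y.numpy().ravel())                   # peaks at bins 1, 3, 5

for step in range(1, 5001):
    opt.zero_grad()
    ((net(x) - y) ** 2).mean().backward()
    opt.step()
    if step % 500 == 0:
        h_fft = np.fft.rfft(net(x).detach().numpy().ravel())
        rel = [abs(h_fft[k] - y_fft[k]) / abs(y_fft[k]) for k in (1, 3, 5)]
        print(step, ["%.2f" % e for e in rel])           # low freq converges first
```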

17 Target with equal amplitudes across frequencies. Xu et al., Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019

18 How does a DNN fit a 2-d image? Target: image $I(\mathbf{x}): \mathbb{R}^2 \to \mathbb{R}$
$\mathbf{x}$: location of a pixel; $I(\mathbf{x})$: grayscale pixel value
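Concretely, the regression dataset is just (pixel coordinate, intensity) pairs; a sketch with a synthetic stand-in image (any grayscale array works):

```python
# Build (x, I(x)) pairs from a grayscale image for a DNN regression task.
import numpy as np

img = np.linspace(0, 1, 64 * 64).reshape(64, 64)     # stand-in grayscale image
rows, cols = np.meshgrid(np.arange(64), np.arange(64), indexing="ij")
X = np.stack([rows.ravel(), cols.ravel()], axis=1) / 63.0  # pixel locations in [0,1]^2
Y = img.ravel()                                            # grayscale values
# Per the F-Principle, a DNN h(x) trained on (X, Y) recovers the coarse
# (low-frequency) structure first and sharp edges (high frequencies) later.
```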

19 High-dimensional real data?
Does the F-Principle hold for high-dimensional real data? Xu, Zhang, Luo, Xiao, Ma, Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019

20 Frequency: image frequency (not used here) vs. response frequency
Image frequency is the rate of change of intensity across neighbouring pixels: a region of uniform color is low frequency, a sharp edge is high frequency. Response frequency is the frequency of the general input-output mapping $f$: a high response frequency means a small change in the intensity of one pixel can induce a large change of the output, as in adversarial examples (Goodfellow et al.).

21 Examining the F-Principle for high-dimensional real problems
Nonuniform discrete Fourier transform (NUDFT) of the training dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$: $\hat{y}_\mathbf{k} = \frac{1}{n}\sum_{i=1}^n y_i e^{-i 2\pi \mathbf{k}\cdot \mathbf{x}_i}$, $\hat{h}_\mathbf{k}(t) = \frac{1}{n}\sum_{i=1}^n h(\mathbf{x}_i, t)\, e^{-i 2\pi \mathbf{k}\cdot \mathbf{x}_i}$. Difficulty: curse of dimensionality, i.e., the number of $\mathbf{k}$'s grows exponentially with the dimension $d$. Our approaches: (1) projection, i.e., choose $\mathbf{k} = k\,\mathbf{p}_1$; (2) filtering.
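A NumPy sketch of the NUDFT and the projection trick (the random data and the direction $\mathbf{p}_1$ are placeholders for a real dataset and a chosen direction); it also computes the relative error $\Delta_F(k)$ used on the next slide:

```python
import numpy as np

def nudft(points, values, ks):
    """Nonuniform DFT: one coefficient per frequency row of ks."""
    phase = np.exp(-2j * np.pi * points @ ks.T)      # (n, num_k)
    return values @ phase / len(values)              # (num_k,)

def projected_spectrum(points, values, p1, k_grid):
    """Restrict to frequencies k = k * p1 along one direction p1."""
    return nudft(points, values, np.outer(k_grid, p1))

rng = np.random.default_rng(0)
X = rng.random((100, 10))                            # n=100 points in d=10
y = rng.standard_normal(100)                         # stand-in labels
h = y + 0.1 * rng.standard_normal(100)               # stand-in DNN outputs h(x_i, t)
p1 = np.ones(10) / np.sqrt(10)
k_grid = np.arange(1, 21)
y_hat = projected_spectrum(X, y, p1, k_grid)
h_hat = projected_spectrum(X, h, p1, k_grid)
delta_F = np.abs(h_hat - y_hat) / np.abs(y_hat)      # relative error per frequency
```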

22 Projection approach
Relative error: $\Delta_F(k) = |\hat{h}_k - \hat{y}_k| / |\hat{y}_k|$. [Figures: MNIST, CIFAR10]

23 Decompose frequency domain by filtering
Split $\hat{y}$ at a cutoff frequency $k_0$ into a low-frequency part and a high-frequency part.
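A hedged sketch of the filtering idea (the Gaussian kernel and its width `delta` stand in for the paper's exact filter; smoothing over the training set acts as a low-pass filter with cutoff ~ $k_0$):

```python
import numpy as np

def lowpass(points, values, delta):
    """Low-frequency part of the labels via Gaussian smoothing over the dataset."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    G = np.exp(-d2 / (2 * delta ** 2))               # kernel width ~ 1/k0
    return (G @ values) / G.sum(axis=1)

rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = rng.standard_normal(200)
y_low = lowpass(X, y, delta=0.3)                     # low-frequency part of y
y_high = y - y_low                                   # high-frequency remainder
# If the F-Principle holds, the DNN output's error against y_low falls early
# in training, and against y_high only later.
```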

24 F-Principle in high-dim space
[Figures: the low-frequency part converges first on MNIST and CIFAR10]

25 F-Principle roadmap (recap; see slide 11). Next: theory.

26 Theory in an idealized setting
Consider a one-hidden-layer tanh DNN fitting a 1-d function $f$: $h(x) = \sum_{j=1}^m a_j \sigma(w_j x + b_j)$, whose Fourier transform satisfies $\hat{h}(k) \approx \sum_{j=1}^m a_j \exp\left(i b_j k / w_j\right) \exp\left(-\left|\pi k / (2 w_j)\right|\right)$. Define the loss at frequency $k$: $L(k) = |\hat{h}(k) - \hat{f}(k)|^2$. By Parseval's theorem, $L = \int L(k)\, dk = \int |f(x) - h(x)|^2\, dx$. Compute the gradient from the loss in the Fourier domain: $\theta \leftarrow \theta - \eta\, \partial L(k) / \partial \theta$.
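As a quick numerical sanity check (not from the paper; the sizes and the weight scale 0.5 are arbitrary), one can sample such a network and watch its spectrum decay with frequency, faster when the $|w_j|$ are small:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
a = rng.standard_normal(m)
w = 0.5 * rng.standard_normal(m)        # small |w_j|: fast spectral decay
b = rng.standard_normal(m)

x = np.linspace(-np.pi, np.pi, 1024, endpoint=False)
h = np.tanh(np.outer(x, w) + b) @ a     # h(x) = sum_j a_j tanh(w_j x + b_j)
amp = np.abs(np.fft.rfft(h)) / len(x)
print(np.round(amp[:8], 4))             # amplitudes drop rapidly with k
```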

27 Analysis: $\frac{\partial L(k)}{\partial \theta} \approx A(k) \exp\left(-\left|\pi k / (2 w_j)\right|\right) G(\theta, k)$, where $A(k) = |\hat{h}(k) - \hat{f}(k)|$. When $A(k) > 0$: if $w_j$ is small, $\exp(-|\pi k / (2 w_j)|)$ dominates, so low frequencies dominate the gradient; for $w_j \in B_\delta$, the ball centered at 0 with radius $\delta$, if $\delta$ is small, the contribution of the high-frequency loss is negligible. When $A(k) \approx 0$: $L(k)$ contributes little. Insight: the smoothness/regularity of the activation function $\sigma(\cdot)$ is converted into the F-Principle through gradient-based training.

28 Low-frequency gradients dominate high-frequency gradients almost surely as $\delta \to 0$.
Xu, Zhang, Luo, Xiao, Ma, Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019

29 Theory of F-Principle for general DNNs
The theory covers general (i) network architectures, (ii) activation functions, and (iii) loss functions. Luo, Ma, Xu, Zhang, Theory on Frequency Principle in General Deep Neural Networks, 2019.

30 F-Principle: DNNs prefer low frequencies

31 F-Principle roadmap (recap; see slide 11). Next: understanding — early stopping and the parity function.

32 Effect of early stopping
Fitting a noisy function: the test error worsens after some training step, because the training and test data agree only in the low-frequency part. Early stopping therefore keeps the signal-dominated low frequencies and avoids fitting the high-frequency noise.
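A hedged sketch of this experiment (the target, noise level, width, and step counts are invented for illustration): track the test error against the clean signal and keep the best step.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-1, 1, 80).reshape(-1, 1)
y = torch.sin(np.pi * x) + 0.3 * torch.randn(80, 1)   # low-freq signal + noise
x_test = torch.linspace(-1, 1, 200).reshape(-1, 1)
y_test = torch.sin(np.pi * x_test)                    # clean test target

net = nn.Sequential(nn.Linear(1, 256), nn.Tanh(), nn.Linear(256, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
best = (float("inf"), -1)
for step in range(5000):
    opt.zero_grad()
    ((net(x) - y) ** 2).mean().backward()
    opt.step()
    with torch.no_grad():
        te = ((net(x_test) - y_test) ** 2).mean().item()
    best = min(best, (te, step))
# The minimum typically occurs early: after it, the net starts fitting the
# high-frequency noise and the test error rises.
print("best test mse %.4f at step %d" % best)
```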

33 Generalization difference
F-Principle: DNNs prefer low frequencies. Parity function: for $x \in \{-1,1\}^n$, $f(x) = \prod_{j=1}^n x_j$; an even number of $-1$s maps to $1$, an odd number to $-1$ — a purely high-frequency target. Test accuracy: 96.3% ≫ 10% (MNIST); 72% ≫ 10% (CIFAR10); ~50% on parity, i.e., random guessing.
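For concreteness, a small sketch that builds the parity dataset (the dimension $n = 10$ and sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
X = rng.choice([-1.0, 1.0], size=(1024, n))   # inputs on the hypercube corners
y = np.prod(X, axis=1)                        # +1 for an even number of -1s, else -1
# All Fourier mass of this target sits at the single highest-frequency mode,
# so a low-frequency-biased learner does no better than chance on held-out points.
```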

34 F-Principle roadmap (recap; see slide 11). Next: applications.

35 Framework application
2019 Turing award (Yoshua Bengio). From a paper co-authored by Bengio: "In a work done independently from ours, and made available online almost at the same time, Xu et al. (2018) make the same observation that lower frequencies are learned first. The subsequent work by Xu (2018) proposes a theoretical analysis of the phenomenon in the case of 2-layer networks with sigmoid activation, based on the spectrum of the sigmoid function."

36 F-Principle as a tool. Jagtap, A. D. & Karniadakis, G. E. (2019), 'Adaptive activation functions accelerate convergence in deep and physics-informed neural networks': "In this work, we shall use the F-principle to observe the performance of adaptive activation function." [Figures: adaptive $a$ vs. fixed $a$]

37 Slides from Prof. Weinan E at the 12th National Annual Conference on Computational Mathematics (第十二届全国计算数学年会)

38 Slides from Prof. Weinan E at the 12th National Annual Conference on Computational Mathematics (第十二届全国计算数学年会)

39 DNN-based solvers suffer from the high-frequency curse
Slides from Prof. Weinan E at the 12th National Annual Conference on Computational Mathematics (第十二届全国计算数学年会)

40 Remedies: DNN-Jacobi method and PhaseDNN — but both suffer from the high-dimension curse
Green stars: DNN. Dashed line: Jacobi iteration initialized by the DNN output. PhaseDNN (Wei Cai et al., 2019) partitions the frequency space. Xu et al., Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019

41 MscaleDNN Goal: facilitate the learning of many frequencies while having the potential to overcome the curse of dimensionality. Cai, Xu*, 2019

42 MscaleDNN. Goal: facilitate the learning of many frequencies while having the potential to overcome the curse of dimensionality. Example with $A = 3$: the first hidden layer is split into groups whose inputs are scaled by 1, 2, and 3, i.e., pre-activations $w_{11}x_1 + b_1$, $w_{21}x_1 + b_2$, $w_{31}(2x_1) + b_3$, $w_{41}(2x_1) + b_4$, $w_{51}(3x_1) + b_5$, $w_{61}(3x_1) + b_6$; the remaining hidden layers and the output layer are standard. Cai, Xu*, MscaleDNN, 2019
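A minimal PyTorch sketch of such a first layer, under my reading of the slide (group the first-layer neurons and feed group $s$ the input scaled by $s$; the tanh activation and widths are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MscaleFirstLayer(nn.Module):
    """First hidden layer with per-group input scales 1..A (a sketch)."""
    def __init__(self, d_in, width, A=3):
        super().__init__()
        assert width % A == 0
        self.lin = nn.Linear(d_in, width)
        # neuron groups s = 1..A compute sigma(w . (s*x) + b), as on the slide
        scales = torch.repeat_interleave(
            torch.arange(1, A + 1, dtype=torch.float32), width // A)
        self.register_buffer("scales", scales)

    def forward(self, x):
        z = F.linear(x, self.lin.weight)        # w . x; bias added after scaling
        return torch.tanh(self.scales * z + self.lin.bias)

net = nn.Sequential(MscaleFirstLayer(1, 120, A=3),
                    nn.Linear(120, 120), nn.Tanh(),
                    nn.Linear(120, 1))
out = net(torch.randn(8, 1))                    # sanity check: shape (8, 1)
```

Variants of MscaleDNN also use geometric scales such as $2^{s-1}$ in place of $1, \dots, A$.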

43 Compactly supported activation functions
To give the MscaleDNN scale-separation and identification capability, we take a hint from the compactly supported mother scaling functions of wavelet theory.
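For example, one compactly supported activation proposed in this line of work (to my recollection, so treat the exact form as an assumption) is sReLU, the product of two ReLUs:

```python
import torch

def srelu(x):
    # sReLU(x) = ReLU(x) * ReLU(1 - x): nonzero only on (0, 1)
    return torch.relu(x) * torch.relu(1.0 - x)
```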

44 Solving high-dimensional PDEs
Example

45 F-Principle roadmap (recap; see slide 11). Next: Linear FP and error analysis.

46 Linear approximation for wide NN
$h(x; \theta(t)) = h(x; \theta_0) + \nabla_\theta h(x; \theta_0) \cdot (\theta(t) - \theta_0)$ for any $t > 0$. Loss: $L(\theta) = \|h(X, \theta) - Y\|_2^2$. Gradient flow: $\frac{d\theta(t)}{dt} = -\nabla_\theta h(X, \theta_0)^T \left(h(X, \theta(t)) - Y\right)$, with $X = \{x_i\}_{i=1}^M$, $Y = \{y_i\}_{i=1}^M$. [Figure: CIFAR10, CNN; Lee et al., 2019]

47 Linear F-Principle dynamics
Start from the gradient flow $\frac{d\theta(t)}{dt} = -\nabla_\theta h(X, \theta_0)^T \left(h(X, \theta(t)) - Y\right)$ for the two-layer network $h(\cdot, \theta) = \sum_{i=1}^N w_i\, \sigma(r_i(\cdot + l_i))$ fitting a 1-d function, with $N$ sufficiently large. Then $\partial_t \hat{h}(\xi, t) = \left[\frac{\pi^2}{\xi^2}\frac{1}{N}\sum_{i=1}^N r_i^2 w_i^2 + \frac{1}{\xi^4}\frac{1}{N}\sum_{i=1}^N r_i w_i^2\right]\left(\hat{f}_p(\xi, t) - \hat{h}_p(\xi, t)\right)$, where $f$ is the target function, $(\cdot)_p = (\cdot)\,p$ with $p(x) = \frac{1}{M}\sum_{i=1}^M \delta(x - x_i)$, $\hat{(\cdot)}$ denotes the Fourier transform, and $\xi$ is the frequency. For simplicity, writing $u(x,t) = h(x,t) - f(x)$: $\partial_t \hat{u}(\xi, t) = -\left[\frac{\pi^2 \langle r^2 w^2\rangle}{\xi^2} + \frac{\langle r w^2\rangle}{\xi^4}\right]\hat{u}_p(\xi, t)$.
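To see the induced frequency preference concretely, here is a hedged numerical sketch: integrate the simplified dynamics mode-by-mode, ignoring the convolution with the sampling measure $p$ (i.e., pretending the data are dense) and plugging in made-up values for the moments $\langle r^2 w^2 \rangle$ and $\langle r w^2 \rangle$.

```python
import numpy as np

m_r2w2, m_rw2 = 1.0, 1.0                            # illustrative moment values
xi = np.array([1.0, 3.0, 5.0, 7.0])                 # a few frequencies
gamma = np.pi**2 * m_r2w2 / xi**2 + m_rw2 / xi**4   # per-mode decay rate
u0 = np.ones_like(xi)                               # initial Fourier-domain error
for t in (0.0, 1.0, 5.0, 25.0):
    # u(xi, t) = u(xi, 0) * exp(-Gamma(xi) * t): low xi decays fastest
    print(t, np.round(u0 * np.exp(-gamma * t), 4))
```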

48 Preference induced by LFP dynamics
$\partial_t \hat{h}(\xi, t) = -\left[\frac{\pi^2 \langle r^2 w^2\rangle}{\xi^2} + \frac{\langle r w^2\rangle}{\xi^4}\right]\left(\hat{h}_p(\xi, t) - \hat{f}_p(\xi, t)\right)$. The long-time solution solves $\min_{h \in F_\gamma} \int \left[\frac{\pi^2 \langle r^2 w^2\rangle}{\xi^2} + \frac{\langle r w^2\rangle}{\xi^4}\right]^{-1} |\hat{h}(\xi)|^2\, d\xi$ s.t. $h(x_i) = y_i$ for $i = 1, \cdots, n$ — a low-frequency preference. Case 1: the $\xi^{-4}$ term dominant: $\min \int \xi^4 |\hat{h}(\xi)|^2\, d\xi \sim \min \int |h''(x)|^2\, dx$ → cubic spline. Case 2: the $\xi^{-2}$ term dominant: $\min \int \xi^2 |\hat{h}(\xi)|^2\, d\xi \sim \min \int |h'(x)|^2\, dx$ → linear spline.

49 Two-layer ReLU NN: $\partial_t \hat{u}(\xi, t) = -\left[\frac{\pi^2 \langle r^2 w^2\rangle}{\xi^2} + \frac{\langle r w^2\rangle}{\xi^4}\right]\hat{u}_p(\xi, t)$, with $w$ and $r$ initialized from uniform distributions on $[-0.1, 0.1]$ and $[-0.25, 0.25]$.

50 Two-layer ReLU NN: same LFP dynamics, with $w$ and $r$ initialized from a uniform distribution on $[-2, 2]$.

51 High-dimensional case: $\partial_t \hat{u}(\xi, t) = -\left[\frac{\pi^2 \langle r^2 w^2\rangle}{\|\xi\|^{d+1}} + \frac{\langle r w^2\rangle}{\|\xi\|^{d+3}}\right]\hat{u}_p(\xi, t)$.

52 An a priori generalization bound
Cf. A Priori Estimates for Two-layer Neural Networks, Weinan E, Chao Ma, Lei Wu (2019). With/without a regularization term in the loss function.

53 An NN cannot learn well a target function dominated by high frequencies

54 An a priori generalization error bound
Target: $\sin(vx)$.

55 F-Principle roadmap (recap; see slide 11).

56 Conclusion DNNs prefer low frequencies!
DNNs prefer low frequencies! References: see slide 11. Joint work with Yaoyu Zhang, Tao Luo, Zheng Ma, Yanyang Xiao, Wei Cai (SMU). Acknowledgements: Weinan E (Princeton), David W. McLaughlin (CIMS). A summary note on the F-Principle can be found at my page, xuzhiqin.top.

57 Fitting high-d functions
Linear: Non-linear:

