Unsupervised Learning: Principal Component Analysis



Presentation transcript:

1 Unsupervised Learning: Principal Component Analysis

2 Unsupervised Learning
Dimension reduction ("turning the complex into the simple"): we only have the function's input, the data x, and learn a function that maps it to a compact representation. Generation ("creating something from nothing"): we only have the function's output, the data, and learn a function that maps random numbers to it.

3 Dimension Reduction
A function maps vector x (high dim) to vector z (low dim). Data that looks 3-D may actually lie on a 2-D surface.

4 Dimension Reduction
In MNIST, a digit image has 28 × 28 dimensions, but most 28 × 28 dim vectors are not digits. The example used here: images of the digit "3" rotated by −20°, −10°, 0°, 10°, 20° differ only in one underlying factor, the rotation angle.

5 Clustering: K-means
K-means clusters $X = \{x^1, \cdots, x^n, \cdots, x^N\}$ into K clusters:
Initialize cluster centers $c^i$, $i = 1, 2, \dots, K$ (K random $x^n$ from $X$).
Repeat:
For all $x^n$ in $X$: set $b^n_i = 1$ if $x^n$ is closest to $c^i$, otherwise $b^n_i = 0$.
Update all $c^i$: $c^i = \sum_n b^n_i x^n \,/\, \sum_n b^n_i$.
Open question: how many clusters do we need? (A sketch of this loop follows below.)
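A minimal NumPy sketch of this K-means loop (the function name, iteration count, and random seed are illustrative choices, not from the slides):

```python
import numpy as np

def kmeans(X, K, n_iters=20, seed=0):
    # X: (N, D) data matrix. Returns centers c (K, D) and hard assignments (N,).
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # K random x^n as initial centers
    for _ in range(n_iters):
        # b^n_i = 1 for the closest center; stored here as one index per point.
        d = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=-1)   # (N, K) distances
        assign = d.argmin(axis=1)
        # Update each center to the mean of the points assigned to it.
        for i in range(K):
            if np.any(assign == i):
                c[i] = X[assign == i].mean(axis=0)
    return c, assign
```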

6 Clustering: Hierarchical Agglomerative Clustering (HAC)
Step 1: build a tree by repeatedly merging the closest clusters until everything joins at the root. Step 2: pick a threshold at which to cut the tree; the subtrees below the cut are the clusters. (See the sketch below.)
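A small sketch of HAC using SciPy (the "average" linkage and the 0.5 distance threshold are arbitrary illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                              # toy 2-D data
tree = linkage(X, method="average")                    # step 1: build the tree of pairwise merges
labels = fcluster(tree, t=0.5, criterion="distance")   # step 2: cut the tree at a distance threshold
```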

7 Distributed Representation
Clustering forces each object to belong to exactly one cluster. A distributed representation instead describes an object with a vector of degrees of membership. Example (Hunter×Hunter Nen categories): rather than saying "Gon is an Enhancer", represent him as Enhancement 0.70, Emission 0.25, Transmutation 0.05, Manipulation 0.00, Conjuration 0.00, Specialization 0.00. (Compare the five Divergent factions: Abnegation, Amity, Dauntless, Candor, Erudite.)

8 Distributed Representation
A function maps vector x (high dim) to vector z (low dim). Two linear approaches: feature selection, e.g. keep $x_2$ and drop $x_1$ when only $x_2$ is informative; and principal component analysis (PCA), $z = Wx$ [Bishop, Chapter 12].

9 PCA
$z = Wx$. Reduce to 1-D: $z_1 = w^1 \cdot x$. Project all the data points x onto $w^1$ to obtain a set of $z_1$. We want the variance of $z_1$ to be as large as possible: maximize $Var(z_1) = \sum_{z_1} (z_1 - \bar{z}_1)^2$ subject to $\|w^1\|_2 = 1$. A poor direction gives small variance; the principal direction gives large variance.

10 PCA
$z = Wx$. Reduce to 1-D: $z_1 = w^1 \cdot x$; project all data points x onto $w^1$ and choose $w^1$ so that $Var(z_1) = \sum_{z_1} (z_1 - \bar{z}_1)^2$ is as large as possible, with $\|w^1\|_2 = 1$. Then take $z_2 = w^2 \cdot x$ and choose $w^2$ to maximize $Var(z_2) = \sum_{z_2} (z_2 - \bar{z}_2)^2$, with $\|w^2\|_2 = 1$ and $w^1 \cdot w^2 = 0$. Stacking the rows gives $W = \begin{bmatrix} (w^1)^T \\ (w^2)^T \\ \vdots \end{bmatrix}$, an orthogonal matrix.

11 Warning of Math

12 PCA
$z_1 = w^1 \cdot x$
$\bar{z}_1 = \frac{1}{N}\sum z_1 = \frac{1}{N}\sum w^1 \cdot x = w^1 \cdot \frac{1}{N}\sum x = w^1 \cdot \bar{x}$
$Var(z_1) = \frac{1}{N}\sum_{z_1} (z_1 - \bar{z}_1)^2 = \frac{1}{N}\sum_x \left( w^1 \cdot x - w^1 \cdot \bar{x} \right)^2 = \frac{1}{N}\sum \left( w^1 \cdot (x - \bar{x}) \right)^2$
Using $(a \cdot b)^2 = (a^T b)^2 = a^T b \, a^T b = a^T b \,(a^T b)^T = a^T b \, b^T a$:
$Var(z_1) = \frac{1}{N}\sum (w^1)^T (x - \bar{x})(x - \bar{x})^T w^1 = (w^1)^T \left[ \frac{1}{N}\sum (x - \bar{x})(x - \bar{x})^T \right] w^1 = (w^1)^T Cov(x)\, w^1$
Find $w^1$ maximizing $(w^1)^T S w^1$ subject to $\|w^1\|_2 = (w^1)^T w^1 = 1$, where $S = Cov(x)$.

13 $w^1$: the eigenvector corresponding to the largest eigenvalue $\lambda_1$
Find $w^1$ maximizing $(w^1)^T S w^1$ subject to $(w^1)^T w^1 = 1$, where $S = Cov(x)$ is symmetric and positive semidefinite (non-negative eigenvalues): a symmetric real matrix $M$ is positive semidefinite if $z^T M z \ge 0$ for every real column vector $z$.
Using a Lagrange multiplier [Bishop, Appendix E]: $g(w^1) = (w^1)^T S w^1 - \alpha \left( (w^1)^T w^1 - 1 \right)$.
Setting $\partial g(w^1)/\partial w^1_1 = 0$, $\partial g(w^1)/\partial w^1_2 = 0$, … gives $S w^1 - \alpha w^1 = 0$, i.e. $S w^1 = \alpha w^1$, so $w^1$ is an eigenvector of S.
Then $(w^1)^T S w^1 = \alpha (w^1)^T w^1 = \alpha$; to maximize the variance, choose the maximum $\alpha$. Hence $w^1$ is the eigenvector of the covariance matrix S corresponding to the largest eigenvalue $\lambda_1$.
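A minimal NumPy sketch of this result, taking $w^1, \dots, w^K$ as the top eigenvectors of the covariance matrix (function and variable names are my own; note that np.cov uses 1/(N−1) rather than the slides' 1/N, which rescales the eigenvalues but not the eigenvectors):

```python
import numpy as np

def pca_top_components(X, K):
    # X: (N, D) data matrix. Returns W with rows w^1..w^K and their eigenvalues.
    S = np.cov(X, rowvar=False)              # S = Cov(x), symmetric positive semidefinite
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:K]    # indices of the K largest eigenvalues
    W = eigvecs[:, order].T                  # rows are w^1, ..., w^K
    return W, eigvals[order]
```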

14 $w^2$: the eigenvector corresponding to the 2nd largest eigenvalue $\lambda_2$
Find $w^2$ maximizing $(w^2)^T S w^2$ subject to $(w^2)^T w^2 = 1$ and $(w^2)^T w^1 = 0$. With Lagrange multipliers:
$g(w^2) = (w^2)^T S w^2 - \alpha \left( (w^2)^T w^2 - 1 \right) - \beta \left( (w^2)^T w^1 - 0 \right)$
Setting $\partial g(w^2)/\partial w^2_1 = 0$, $\partial g(w^2)/\partial w^2_2 = 0$, … gives $S w^2 - \alpha w^2 - \beta w^1 = 0$.
Left-multiplying by $(w^1)^T$: $(w^1)^T S w^2 - \alpha (w^1)^T w^2 - \beta (w^1)^T w^1 = 0$. The first term $(w^1)^T S w^2 = \left( (w^1)^T S w^2 \right)^T = (w^2)^T S^T w^1 = (w^2)^T S w^1 = \lambda_1 (w^2)^T w^1 = 0$ (using $S w^1 = \lambda_1 w^1$), and the second term is also 0, so $\beta = 0$.
Therefore $S w^2 - \alpha w^2 = 0$, i.e. $S w^2 = \alpha w^2$: $w^2$ is an eigenvector of the covariance matrix S, and to maximize the variance it corresponds to the 2nd largest eigenvalue $\lambda_2$.

15 PCA - decorrelation
$z = Wx$, so $Cov(z) = \frac{1}{N}\sum (z - \bar{z})(z - \bar{z})^T = W S W^T$ with $S = Cov(x)$. This is a diagonal matrix D:
$W S W^T = W S [w^1 \cdots w^K] = W [S w^1 \cdots S w^K] = W [\lambda_1 w^1 \cdots \lambda_K w^K] = [\lambda_1 W w^1 \cdots \lambda_K W w^K] = [\lambda_1 e_1 \cdots \lambda_K e_K] = D$
(where $e_k$ is the k-th standard basis vector, since the rows of W are the orthonormal $w^i$, so $W w^k = e_k$).
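Continuing the sketch above (reusing the hypothetical X and the W returned by pca_top_components), a quick numerical check that the projected data is decorrelated:

```python
# Project the centered data and inspect Cov(z) = W S W^T: it should be diagonal.
Z = (X - X.mean(axis=0)) @ W.T
cov_z = np.cov(Z, rowvar=False)
off_diag = cov_z - np.diag(np.diag(cov_z))
print(np.abs(off_diag).max())   # ~0 up to floating-point error
```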

16 End of Warning
So many good explanations!

17 PCA – Another Point of View
Basic components: a digit image can be written as a combination of basic parts $u^1, u^2, u^3, u^4, u^5, \dots$ (strokes). The example image on the slide is $1 \times u^1 + 1 \times u^3 + 1 \times u^5$. In general, $x \approx c_1 u^1 + c_2 u^2 + \cdots + c_K u^K + \bar{x}$, where x is the pixel vector of a digit image; the coefficient vector $(c_1, c_2, \dots, c_K)^T$ then represents the image instead of the raw pixels.

18 PCA – Another Point of View
Write $x - \bar{x} \approx c_1 u^1 + c_2 u^2 + \cdots + c_K u^K = \hat{x}$. The reconstruction error is $\|(x - \bar{x}) - \hat{x}\|_2$. Find $u^1, \dots, u^K$ minimizing the total error:
$L = \min_{u^1, \dots, u^K} \sum \left\| (x - \bar{x}) - \sum_{k=1}^{K} c_k u^k \right\|_2$
In PCA, $z = Wx$:
$\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_K \end{bmatrix} = \begin{bmatrix} (w^1)^T \\ (w^2)^T \\ \vdots \\ (w^K)^T \end{bmatrix} x$
The components $w^1, w^2, \dots, w^K$ found by PCA are exactly the $u^1, u^2, \dots, u^K$ minimizing L. Proof in [Bishop, Chapter 12].

19 PCA – Another Point of View (matrix factorization)
$x - \bar{x} \approx c_1 u^1 + c_2 u^2 + \cdots + c_K u^K = \hat{x}$; reconstruction error $\|(x - \bar{x}) - \hat{x}\|_2$; find $u^1, \dots, u^K$ minimizing the error. Writing this for every example:
$x^1 - \bar{x} \approx c^1_1 u^1 + c^1_2 u^2 + \cdots$
$x^2 - \bar{x} \approx c^2_1 u^1 + c^2_2 u^2 + \cdots$
$x^3 - \bar{x} \approx c^3_1 u^1 + c^3_2 u^2 + \cdots$
……
That is, the data matrix X (columns $x^n - \bar{x}$) is approximated by the product of a component matrix (columns $u^1, u^2, \dots$) and a coefficient matrix (entries $c^n_k$), and we minimize the reconstruction error. This links PCA with MDS [1].
[1] T. Cox and M. Cox. Multidimensional Scaling. Chapman & Hall, Boca Raton, 2nd edition, 2001.

20 SVD: this is the solution of PCA
Minimizing the error of the matrix factorization above is solved by SVD: approximate the M × N matrix X (columns $x^n - \bar{x}$) by U Σ V, where U is M × K, Σ is K × K, and V is K × N. The K columns of U are a set of orthonormal eigenvectors of $XX^T$ corresponding to the K largest eigenvalues, which is exactly the solution of PCA.
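A short sketch of the same solution via SVD in NumPy (again reusing the hypothetical X and K from the earlier sketch; individual components may differ from the eigendecomposition by a sign flip):

```python
# Stack the centered examples x^n - x_bar as columns of X_c (features x examples).
X_c = (X - X.mean(axis=0)).T
U, s, Vt = np.linalg.svd(X_c, full_matrices=False)
# Columns of U are orthonormal eigenvectors of X_c X_c^T, ordered by singular value,
# so the first K of them are the PCA components w^1..w^K.
W_svd = U[:, :K].T
```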

21 Autoencoder: if $w^1, w^2, \dots, w^K$ are the components $u^1, u^2, \dots, u^K$
Then $\hat{x} = \sum_{k=1}^{K} c_k w^k$, and to minimize the reconstruction error $\|(x - \bar{x}) - \hat{x}\|$ the coefficients are $c_k = (x - \bar{x}) \cdot w^k$. PCA therefore looks like a neural network with one hidden layer (linear activation function): an autoencoder. For K = 2: the input $x - \bar{x}$ is connected to the hidden unit $c_1$ with weights $w^1_1, w^1_2, w^1_3$ (the entries of $w^1$).

22 Autoencoder (continued)
Likewise, the second hidden unit $c_2$ is computed from $x - \bar{x}$ with weights $w^2_1, w^2_2, w^2_3$ (the entries of $w^2$).

23 Autoencoder (continued)
With K = 2, the encoder maps the input $x_1, x_2, x_3$ (i.e. $x - \bar{x}$) to the two hidden units $c_1, c_2$ through the weight matrix whose rows are $w^1$ and $w^2$.

24 Autoencoder (continued)
The decoder maps $c_1, c_2$ back to a reconstruction $\hat{x}$ using the same weights $w^1, w^2$, and the network is trained to minimize the error between $x - \bar{x}$ and $\hat{x}$. Could the weights instead be learned by gradient descent? The payoff of the network view is that it can be made deep: a Deep Autoencoder. (A rough gradient-descent sketch follows below.)
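A rough NumPy sketch of training such a linear autoencoder with gradient descent, using tied encoder/decoder weights (learning rate, step count, and initialization are arbitrary choices; without an orthogonality constraint the learned rows span the PCA subspace but need not equal the individual eigenvectors):

```python
import numpy as np

def linear_autoencoder(X, K, lr=0.01, n_steps=2000, seed=0):
    # Encode: c = W (x - x_bar); decode: x_hat = W^T c (tied weights).
    # Gradient descent on (1/2N) * squared Frobenius reconstruction error.
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                       # (N, D) centered data
    N = len(Xc)
    W = 0.01 * rng.standard_normal((K, Xc.shape[1]))
    for _ in range(n_steps):
        C = Xc @ W.T                              # codes, (N, K)
        E = C @ W - Xc                            # reconstruction error, (N, D)
        grad = W @ (E.T @ Xc + Xc.T @ E) / N      # gradient of the loss w.r.t. W
        W -= lr * grad                            # lr assumes roughly unit-scale features
    return W
```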

25 Using 4 components is good enough
PCA - Pokémon. Inspired from: pal-component-analysis-of-pokemon-data. 800 Pokémon, 6 features each (HP, Atk, Def, Sp Atk, Sp Def, Speed). How many principal components should we keep? Look at the ratio $\lambda_i / (\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 + \lambda_5 + \lambda_6)$ for each eigenvalue:
ratio: $\lambda_1$ = 0.45, $\lambda_2$ = 0.18, $\lambda_3$ = 0.13, $\lambda_4$ = 0.12, $\lambda_5$ = 0.07, $\lambda_6$ = 0.04
The last two components explain little variance, so using 4 components is good enough.
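A sketch of how these ratios would be computed (X_pokemon is a placeholder name for the 800 × 6 stats matrix, which is not included here):

```python
import numpy as np

# X_pokemon: (800, 6) matrix of (HP, Atk, Def, Sp Atk, Sp Def, Speed).
S = np.cov(X_pokemon, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]
ratios = eigvals / eigvals.sum()   # slide reports roughly 0.45, 0.18, 0.13, 0.12, 0.07, 0.04
print(ratios)
```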

26 PCA - Pokémon
Loadings of PC1–PC4 on the six features (HP, Atk, Def, Sp Atk, Sp Def, Speed). Interpretation: PC1 can be read as overall strength; PC2 as defense (at the cost of speed).

27 PCA - Pokémon
Continuing the interpretation: PC3 can be read as special defense (at the cost of attack and HP), and PC4 as strong vitality (HP). The extreme examples shown on the slide are Blissey (幸福蛋), Regice (雷吉艾斯), and Shuckle (壺壺).

28 PCA - Pokémon
Interactive visualization: http://140.112.21.35:2880/~tlkagk/pokemon/pca.html (the code is modified from an existing example).

29 PCA - MNIST
Each digit image $\approx a_1 w^1 + a_2 w^2 + \cdots$, a weighted combination of the principal components of the digit images. The first 30 components, viewed as images, are the eigen-digits.
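A sketch of reconstructing one image from the eigen-digits, including the mean term from the earlier slides (W, x_bar, and x are hypothetical names for the 30 × 784 component matrix, the mean image, and one flattened digit):

```python
a = W @ (x - x_bar)          # the coefficients a_1, ..., a_30
x_recon = x_bar + W.T @ a    # approximation of x from the eigen-digits
```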

30 Weakness of PCA
PCA is unsupervised, so it may project distinct classes onto each other (LDA uses labels instead). PCA is also linear.

31 Weakness of PCA
Comparison on MNIST: Pixels (28×28) → t-SNE (2) versus PCA (32) → t-SNE (2). Non-linear dimension reduction will be covered in the following lectures.

32 Acknowledgement
Thanks to 彭冲 for spotting an error in the cited material, and to Hsiang-Chih Cheng for spotting an error on the slides.

33 Appendix

