Kernels in Multivariate Analysis
Stat 541 (Multivariate Analysis), Oleg Melnikov, Rice University, Fall 2013
Today's Agenda
- Inner products
- Kernels
- Hilbert space
- Reproducing kernel map
- RKHS: Reproducing Kernel Hilbert Space
- Mercer theorem (without proof)
- Kernel trick
- Kernel PCA
Inner products
- Def: an inner product on a vector space $V$ is a real- (or complex-) valued function of two variables $\langle\cdot,\cdot\rangle : V\times V \to \mathbb{R}$ (or $\mathbb{C}$), satisfying three axioms for all $x,y,z \in V$ and $c \in \mathbb{R}$:
  - Linearity in the first argument: $\langle cx+y, z\rangle = c\langle x,z\rangle + \langle y,z\rangle$; together with symmetry this implies bilinearity
  - Symmetry (conjugate symmetry if over $\mathbb{C}$): $\langle x,z\rangle = \langle z,x\rangle$
  - Positive definiteness: $\langle x,x\rangle \ge 0$, with equality only for $x = 0$
Inner products
- Vector spaces cover a huge family of sets and yet have a very rigid structure (8 axioms, a binary operator, etc.). Examples: $\mathbb{R}^n$ with $n \le \infty$, $\mathbb{C}^\infty$, lines through the origin, planes through the origin, function spaces containing the zero function, etc.
- Likewise, inner products are very general functions. $f : \mathbb{R}^n \to \mathbb{R}$ is pretty general, right? Yet it is a "double-special" case of an inner product: fix $V = \mathbb{R}^n$ and the second argument, then $f(\cdot) = \langle\cdot, x_0\rangle$ is just an inner product at a fixed point on a fixed domain $\mathbb{R}^n$.
- In fact, positive definite inner products alone, i.e. $\langle x,x\rangle \ge 0$, give rise to many of the methods we see and use in statistics.
Concept check: find the imposters
Which of the following are not inner products?
1. Canonical dot product in $\mathbb{R}^n$: $\langle\mathbf{x},\mathbf{y}\rangle = \mathbf{x}\cdot\mathbf{y}$
2. $\langle x,y\rangle = (\|x+y\|^2 - \|x-y\|^2)/4$
3. $\langle A,B\rangle = \mathrm{trace}(B^T A)$
4. $\langle x,y\rangle = x^T A y$, with $A$ nonsingular
5. $\langle f,g\rangle = \int_0^1 fg$
6. $\langle f,g\rangle = \int fg$ (no integration limits specified)
7. $\langle\mathbf{x},\mathbf{y}\rangle = \mathbf{x}+\mathbf{y}$
8. $\langle f,g\rangle = \int_0^1 f'g$
9. $\langle A,B\rangle = \mathrm{trace}(A+B)$
Recall the axioms: $\langle\cdot,\cdot\rangle$ is bilinear, symmetric, positive definite.
Answer: 6-9 are not inner products (6 need not be well-defined since the integral can diverge, 7 is not scalar-valued, 8 is not symmetric, 9 is neither bilinear nor positive definite). Note also that 4 is an inner product only when $A$ is additionally symmetric and positive definite. A numerical spot-check of the axioms follows.
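To make the concept check concrete, here is a minimal NumPy sketch (my own illustration, not from the slides; function names are assumptions) that spot-checks the three axioms on random inputs for two of the candidates: $\langle A,B\rangle = \mathrm{trace}(B^T A)$, a genuine inner product, and the imposter $\langle A,B\rangle = \mathrm{trace}(A+B)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def ip_trace(A, B):
    # Candidate 3: <A, B> = trace(B^T A)  (the Frobenius inner product)
    return np.trace(B.T @ A)

def ip_sum_trace(A, B):
    # Candidate 9: <A, B> = trace(A + B)  (an imposter)
    return np.trace(A + B)

def spot_check(ip, trials=100):
    """Numerically test symmetry, linearity in the 1st argument, and positivity."""
    ok_sym = ok_lin = ok_pos = True
    for _ in range(trials):
        A, B, C = rng.normal(size=(3, 4, 4))
        c = rng.normal()
        ok_sym &= np.isclose(ip(A, B), ip(B, A))
        ok_lin &= np.isclose(ip(c * A + B, C), c * ip(A, C) + ip(B, C))
        ok_pos &= ip(A, A) >= 0
    return ok_sym, ok_lin, ok_pos

print("trace(B^T A):", spot_check(ip_trace))      # (True, True, True)
print("trace(A + B):", spot_check(ip_sum_trace))  # linearity and positivity fail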
Kernels (or kernel functions)
- Def: a kernel function on an input space is an inner product in a feature space:
  - $k : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$, with $k(x,y) = \langle\varphi(x),\varphi(y)\rangle_V$
  - It is a similarity measure in the feature space
  - It is a weighting function (discrete or continuous)
- Input space: $\mathcal{X} = \mathbb{R}^p$; feature space: $(V, \langle\cdot,\cdot\rangle)$; feature map: $\varphi : \mathcal{X} \to V$
- In the input space, $\langle x,y\rangle = \langle x,y\rangle_{\mathbb{R}^p} = x\cdot y$; in the feature space, we abbreviate $\langle\varphi(x),\varphi(y)\rangle_V = \langle\varphi(x),\varphi(y)\rangle$
Examples of kernels
- Identity (simplest; the feature map is the identity): $k(x,y) = \langle x,y\rangle = x\cdot y$
- Polynomial of degree $d$: $k(x,y) = \langle x,y\rangle^d$; input space $\mathcal{X} = \mathbb{R}^p$, feature space $V$ = space of degree-$d$ monomials
  - Example ($p = 2$, $d = 2$): $\langle\mathbf{x},\mathbf{y}\rangle^2 = (\mathbf{x}\cdot\mathbf{y})^2 = (x_1y_1 + x_2y_2)^2 = x_1^2y_1^2 + 2x_1y_1x_2y_2 + x_2^2y_2^2$
  - Feature extraction: $\varphi : \mathbb{R}^2 \to \mathbb{R}^4$ by $\varphi(\mathbf{x}) = (x_1^2,\, x_2^2,\, x_1x_2,\, x_2x_1)$
- Inhomogeneous polynomial: $k(x,y) = (\langle x,y\rangle + c)^d$
- Gaussian (aka radial): $k(x,y) = \exp(-\|x-y\|^2/(2\sigma^2))$. Looks familiar? Note: $k(\cdot,\mu) \propto N(\cdot \mid \mu, \sigma^2)$; any pdf can serve as a (smoothing) kernel with one argument fixed!
(A short numerical check follows.)
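A quick NumPy sketch (my own, not course code) confirming the degree-2 example above: the polynomial kernel $\langle\mathbf{x},\mathbf{y}\rangle^2$ equals the ordinary dot product of the explicit monomial features $\varphi(\mathbf{x}) = (x_1^2, x_2^2, x_1x_2, x_2x_1)$; a Gaussian kernel is included for comparison.

```python
import numpy as np

def poly2_kernel(x, y):
    # Homogeneous polynomial kernel of degree 2: k(x, y) = (x . y)^2
    return float(np.dot(x, y)) ** 2

def phi2(x):
    # Explicit feature map for the degree-2 kernel on R^2: R^2 -> R^4
    return np.array([x[0]**2, x[1]**2, x[0]*x[1], x[1]*x[0]])

def gauss_kernel(x, y, sigma=1.0):
    # Gaussian (radial) kernel: k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(poly2_kernel(x, y))           # 1.0, since x . y = 1
print(phi2(x) @ phi2(y))            # 1.0, the same value via explicit features
print(gauss_kernel(x, y, sigma=2))  # similarity decays with distance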
Positive (semi-)definite kernels
- Def: the Gram matrix (or kernel matrix) of $k$ w.r.t. $x_1,\dots,x_n$ is $K \triangleq [K_{ij}]_{n\times n} = [k(x_i,x_j)]_{n\times n}$
- Def: $K$ is positive semidefinite (psd) if $x'Kx \ge 0$ for all $x$
- Def: $k(\cdot,\cdot)$ is psd (often just called "positive definite" in the kernel literature) if every Gram matrix it defines is psd
- In the literature, also known as: covariance function, reproducing kernel, admissible kernel, support vector kernel, nonnegative definite kernel, ...
- Cauchy-Schwarz inequality for psd kernels: $k(x,y)^2 \le k(x,x)\,k(y,y)$
- Here we only consider symmetric psd kernels! (A numerical check follows.)
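As a sanity check on the definitions, the sketch below (mine, not from the slides) builds the Gram matrix of a Gaussian kernel on random points and verifies that it is psd via its smallest eigenvalue and the quadratic form, and that Cauchy-Schwarz holds.

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

# Gram (kernel) matrix K_ij = k(x_i, x_j) for n random points in R^3
X = rng.normal(size=(20, 3))
K = np.array([[gauss_kernel(xi, xj) for xj in X] for xi in X])

# psd check 1: all eigenvalues of the symmetric matrix K are >= 0 (up to round-off)
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())

# psd check 2: the quadratic form z'Kz is non-negative for a random z
z = rng.normal(size=20)
print("z' K z =", z @ K @ z, ">= 0")

# Cauchy-Schwarz for psd kernels: k(x, y)^2 <= k(x, x) * k(y, y)
print(K[0, 1] ** 2 <= K[0, 0] * K[1, 1])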
A space exploration
- Vector space $V$: a space of vectors
- Inner product space (or pre-Hilbert space) $(V, \langle\cdot,\cdot\rangle)$
  - Inner product: a real (or complex) function of two variables $\langle\cdot,\cdot\rangle : V\times V \to \mathbb{R}$ that is
    - symmetric: $\langle v,u\rangle = \langle u,v\rangle$
    - linear in the 1st argument: $\langle au+bv,\cdot\rangle = a\langle u,\cdot\rangle + b\langle v,\cdot\rangle$
    - positive definite with itself: $\langle v,v\rangle \ge 0$
- Complete vector space: one in which all Cauchy sequences converge
- Hilbert space: a complete inner product space, $\mathcal{H} \triangleq (V, \langle\cdot,\cdot\rangle)$, complete
- Metric space $S$: a set with a metric (distance function)
  - Metric: a nonnegative distance function $d(\cdot,\cdot) : S\times S \to \mathbb{R}_{\ge 0}$
Why use a (David) Hilbert space $\mathcal{H}$?
- Projections (onto closed subspaces) always exist in $\mathcal{H}$
- $\mathcal{H}$ can be infinite-dimensional
- Elements can be functions
- Convergent (Cauchy) sequences converge inside the space (and not elsewhere!), so the solutions (limits) to many problems exist and are "reachable"
- Because Hilbert spaces and weird hats were in fashion circa 1910
Reproducing kernel map (RK map), $\varphi$
- A feature map maps points to features (vectors in a feature space), i.e. $\varphi : \mathcal{X} \to (V, \langle\cdot,\cdot\rangle)$
- Def: the reproducing kernel map is a feature map $\varphi$ that maps points to kernel sections (as vectors in the RKHS, a Hilbert space of functions defined by the fixed kernel $k$): $\varphi : \mathcal{X} \to \mathcal{H}_k \subseteq \{f : \mathcal{X}\to\mathbb{R}\}$ by $x \mapsto k(\cdot,x)$, i.e. $\varphi(x)(\cdot) = k(\cdot,x)$
- Since $\varphi(x)$ is a function, we can evaluate it at another point: $\varphi(x)(y) = k(\cdot,x)(y) = k(y,x) = k(x,y) = \langle\varphi(x),\varphi(y)\rangle = \langle k(\cdot,x), k(\cdot,y)\rangle$
- Reproducing property: $k(x,y) = \langle k(\cdot,x), k(\cdot,y)\rangle$
- Correspondence: $\varphi \leftrightarrow k \leftrightarrow \mathcal{H}_k$ (see the small sketch below)
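A small sketch (my own, assuming a Gaussian kernel) of the RK map as "points mapped to functions": $\varphi(x) = k(\cdot,x)$ is represented as a closure, and evaluating it at another point recovers $k(x,y)$.

```python
import numpy as np

def gauss_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * sigma ** 2))

def rk_map(x, k=gauss_kernel):
    # phi(x) = k(., x): the image of x is itself a function X -> R
    return lambda t: k(t, x)

x, y = np.array([0.5]), np.array([2.0])
phi_x = rk_map(x)            # a function, i.e. an element of H_k
print(phi_x(y))              # phi(x)(y) = k(y, x)
print(gauss_kernel(x, y))    # equals k(x, y) by symmetry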
Example: Gaussian RK map
- Consider the RK map $\varphi$ for the Gaussian kernel $k$: $\varphi(x) = k(\cdot,x) = \exp(-\|x-\cdot\|^2/(2\sigma^2)) \propto N(\cdot \mid x, \sigma^2)$
- $\varphi$ maps points to normal densities centered at those points
- As we continuously slide $x$ along the input space, the Gaussian density (the image of $\varphi$ in the feature space $\mathcal{H}_k$) slides along with it
- Source: Learning with Kernels (Schölkopf B. and Smola A.J., 2001)
Reproducing Kernel Hilbert Spaces, $\mathcal{H}_k$
- Def: the RKHS $\mathcal{H}_k$ is a Hilbert space (of real functions on $\mathcal{X}$) with a spanning, reproducing kernel $k$. That is, for an input set $\mathcal{X}$ and a (fixed) kernel $k : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$, $\mathcal{H}_k \subset \mathcal{H} = (\{f : \mathcal{X}\to\mathbb{R}\}, \langle\cdot,\cdot\rangle)$ such that:
  1. $k$ has the reproducing property: $k(x,y) = \langle k(\cdot,x), k(\cdot,y)\rangle$, or equivalently $f(x) = \langle f, k(x,\cdot)\rangle$ for all $f \in \mathcal{H}_k$
  2. $k$ spans $\mathcal{H}_k$: $\mathcal{H}_k = \overline{\mathrm{span}}\{k(x,\cdot) \mid x \in \mathcal{X}\}$, where the bar denotes closure (completion), i.e. all limit points are included
(A numerical illustration of the reproducing property follows.)
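To illustrate the reproducing property $f(x) = \langle f, k(x,\cdot)\rangle$, the sketch below (mine, not from the slides) takes $f = \sum_i \alpha_i k(\cdot,x_i)$ in the span of kernel sections; its RKHS inner product with $k(\cdot,x)$ reduces to $\sum_i \alpha_i k(x_i,x)$, which is exactly the pointwise evaluation $f(x)$.

```python
import numpy as np

rng = np.random.default_rng(2)

def gauss_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def cross_gram(A, B):
    # Cross-Gram matrix [k(a_i, b_j)]
    return np.array([[gauss_kernel(a, b) for b in B] for a in A])

# f = sum_i alpha_i k(., x_i), an element of span{k(., x_i)} inside H_k
X = rng.normal(size=(5, 2))
alpha = rng.normal(size=5)

def f(x):
    # Pointwise evaluation of f at x
    return float(alpha @ cross_gram(X, [x]))

def rkhs_ip(a, A, b, B):
    # RKHS inner product of span elements: <sum a_i k(., x_i), sum b_j k(., y_j)> = a' K_AB b
    return float(a @ cross_gram(A, B) @ b)

x = rng.normal(size=2)
# Reproducing property: <f, k(., x)> = f(x)   (take the second element to be k(., x))
print(rkhs_ip(alpha, X, np.array([1.0]), [x]), "=", f(x))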
Dimension reduction
- Input space $\mathbb{R}^p$ with linear, correlated $x_i$ $\to$ feature space $\mathbb{R}^q$, $q<p$, a linear, uncorrelated image: PCA, CCA, LDA, ...
- Input space $\mathbb{R}^p$ with non-linear, correlated $x_i$ $\to$ feature space RKHS $\mathcal{H}_k$, then a linear, uncorrelated image in fewer dimensions than $\mathcal{H}_k$: kernel methods kPCA, kCCA, kLDA, ...
- PCA, LDA, CCA perform feature extraction and dimension reduction
- Note: the dimension reduction happens in the feature space, not the input space, so the model uses fewer features, not fewer observed variables
Kernel trick
- Trick: (implicitly) map data from $\mathcal{X}$ to a "nice" high-dimensional space $\mathcal{H}_k$, equipped with an inner product, to solve classification and other problems
- We don't need to know the feature map $\varphi$ or the images $\varphi(x)$; we only work with inner products, expressed as kernels
- We pick a kernel $k : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ satisfying the Mercer condition; the Mercer theorem then allows implicit use of the map $\varphi : \mathcal{X} \to \mathcal{H}_k$ via kernel evaluations
- Diagram: input space $\mathcal{X}$ $\to$ feature space $(\mathcal{H}_k, \langle\cdot,\cdot\rangle)$; the map $\varphi$ may not be known explicitly, yet $k(x,y) = \langle\varphi(x),\varphi(y)\rangle_{\mathcal{H}_k}$
Kernel trick (continued)
- Embed the data into the feature space $\mathcal{H}_k$: $x \mapsto \varphi(x)$
- Look for linear relations among the data images $\{\varphi(x_i)\}_{i=1}^n$
- Use only pairwise inner products $\langle\varphi(x_i),\varphi(x_j)\rangle$, not the image points themselves
- Compute the pairwise inner products from the original data using the kernel function $k(\cdot,\cdot)$, e.g. to answer questions such as: is $\varphi(x_i) \perp \varphi(x_j)$? (see the sketch below)
- Source: Kernel Methods for Pattern Analysis, Shawe-Taylor J. and Cristianini N., 2004
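The sketch below (my own illustration) shows the trick in action: squared distances and angles between feature images $\varphi(x_i)$ are computed purely from kernel evaluations, without ever forming $\varphi$, using $\|\varphi(x)-\varphi(y)\|^2 = k(x,x) - 2k(x,y) + k(y,y)$ and $\cos\angle(\varphi(x),\varphi(y)) = k(x,y)/\sqrt{k(x,x)k(y,y)}$.

```python
import numpy as np

def poly2_kernel(x, y):
    # Inhomogeneous polynomial kernel of degree 2 (an arbitrary choice for the demo)
    return (float(np.dot(x, y)) + 1.0) ** 2

def feature_distance_sq(x, y, k):
    # ||phi(x) - phi(y)||^2 = k(x,x) - 2 k(x,y) + k(y,y), no phi needed
    return k(x, x) - 2 * k(x, y) + k(y, y)

def feature_cosine(x, y, k):
    # cosine of the angle between phi(x) and phi(y); 0 means phi(x) is orthogonal to phi(y)
    return k(x, y) / np.sqrt(k(x, x) * k(y, y))

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
print(feature_distance_sq(x, y, poly2_kernel))  # distance in feature space
print(feature_cosine(x, y, poly2_kernel))       # angle in feature space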
Mercer condition
- Def: $k$ satisfies the Mercer condition if $\iint k(x,y)\,g(x)\,g(y)\,dx\,dy \ge 0$ for every $g \in L^2$
- Such a $k$ is a psd kernel (symmetric, nonnegative definite, bounded)
- In general, replace $L^2(\mathbb{R}^n, dx)$ with $L^2(\mathcal{X},\mu)$ for any positive finite measure $\mu$ on an input space $\mathcal{X}$
Mercer theorem
- Assume the Mercer condition and define the integral operator $T_k$: $(T_k f)(x) = \int_{\mathcal{X}} k(u,x)\,f(u)\,du$
- Mercer theorem: there exists a continuous orthonormal basis (ONB) of eigenfunctions $\{v_i\}$ of $T_k$ with ordered eigenvalues $\lambda_i > 0$ such that $\boldsymbol{\lambda} \in \ell^1$ and $k(x,y) = \sum_i \lambda_i\, v_i(x)\, v_i(y)$
- The theorem guarantees a Hilbert (feature) space in which the kernel is the dot product of the eigenvalue-scaled eigenfunctions, $\varphi(x) = (\sqrt{\lambda_i}\,v_i(x))_i$
- We can compute this eigen-expansion of the kernel by solving an eigenvalue problem (see the sketch below)
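The finite-sample analogue of this expansion can be seen numerically: eigendecomposing the Gram matrix stands in for the integral operator's eigenvalue problem, and summing $\lambda_i v_i(x_j) v_i(x_l)$ over the sample reconstructs $k$. A minimal sketch (mine, with an assumed Gaussian kernel) follows.

```python
import numpy as np

rng = np.random.default_rng(3)

def gauss_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

# Sample points and Gram matrix: the empirical stand-in for the operator T_k
X = rng.uniform(-2, 2, size=(50, 1))
K = np.array([[gauss_kernel(a, b) for b in X] for a in X])

# Eigendecomposition of the symmetric matrix K (eigenvalues returned ascending)
lam, V = np.linalg.eigh(K)

# Finite-sample Mercer-style reconstruction: K_ij = sum_m lam_m V_im V_jm
K_rebuilt = (V * lam) @ V.T
print("full reconstruction error:", np.abs(K - K_rebuilt).max())

# Truncating to the top eigenpairs gives a low-rank approximation (the eigenvalues decay quickly)
top = lam.argsort()[::-1][:10]
K_trunc = (V[:, top] * lam[top]) @ V[:, top].T
print("rank-10 reconstruction error:", np.abs(K - K_trunc).max())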
PCA vs kPCA
- Note that the feature map (here $\Phi$) defines the relationship with the RKHS (here $H$), but it never needs to be determined explicitly!
- Source: Learning with Kernels: Support Vector Machines... (Schölkopf B. and Smola A.J., 2001)
kPCA example
- Fig 1: circular groups (PCA fails here)
- Pick the kernel as a similarity measure (a measure of closeness)
- Fig 2: polynomial kernel $k(x,y) = (x^T y + 1)^2$
- Fig 3: Gaussian kernel $k(x,y) = \exp(-\|x-y\|^2/(2\sigma^2))$
- Source: Wikipedia, author: Petter Strandmark
PCA: review
- Def: PCA seeks a few linear combinations that summarize the data with minimal information loss; it reduces dimensionality by discarding the least-explanatory PCs (features)
- Def: the PCs are the (linearly uncorrelated) images of an (orthogonal) map $\Gamma$ of the original variables $x_1,\dots,x_p$: $Y \triangleq \Gamma'(X-\mu)$, where $X \sim (\mu,\Sigma)$ with $\Sigma > 0$, and $Y$ holds the PCs
- $\Gamma$ is the matrix of eigenvectors of $\Sigma$ corresponding to its ordered eigenvalues
- $X_{n\times p} = [x_{ij}]_{n\times p}$, with rows $x_1',\dots,x_n'$ (observations) and $p$ columns (variables), is our data matrix; the columns of $Y$ are the PCs (a minimal NumPy sketch follows)
- Question: is $\Gamma$ an orthonormal basis (ONB) or just an orthogonal basis?
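For reference before the kernel version, here is a minimal NumPy PCA along the lines of the definition above: center the data, eigendecompose the sample covariance $\Sigma$, and project onto the leading eigenvectors $\Gamma$. (My own sketch; variable names are assumptions, not from the slides.)

```python
import numpy as np

def pca(X, q):
    """Linear PCA: return the first q principal components Y = (X - mu) Gamma."""
    mu = X.mean(axis=0)                      # estimate of mu
    Xc = X - mu                              # centered data
    Sigma = np.cov(Xc, rowvar=False)         # p x p sample covariance
    eigval, Gamma = np.linalg.eigh(Sigma)    # eigenvalues returned ascending
    order = eigval.argsort()[::-1]           # reorder: largest eigenvalues first
    Gamma = Gamma[:, order[:q]]              # orthonormal columns (an ONB of the PC subspace)
    return Xc @ Gamma, eigval[order]

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
Y, eigvals = pca(X, q=2)
print(Y.shape)                                # (200, 2): two PCs per observation
print(np.round(np.cov(Y, rowvar=False), 6))   # approximately diagonal: PCs are uncorrelated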
Kernel PCA (kPCA)
- A non-linear generalization of PCA using kernel methods: it performs standard (linear) PCA in the feature (RKHS) space $\mathcal{H}_k$
- Find the kernel matrix $K = [k(x_i,x_j)]_{n\times n}$ for a chosen kernel
- Hard (primal) eigenvalue problem: solve $\lambda\mathbf{v} = C\mathbf{v}$ for $\mathbf{v} \in \mathcal{H}_k$, where $C$ is the feature-space covariance operator
- Easier dual problem: solve $n\lambda\boldsymbol{\alpha} = K\boldsymbol{\alpha}$, where $\boldsymbol{\alpha}$ holds the coefficients of $\mathbf{v}$, since $\mathbf{v} \in \mathrm{span}\{\varphi(x_i)\}$, i.e. $\mathbf{v} = \sum_i \alpha_i\varphi(x_i)$
- Normalize the eigenvectors so that $\langle\mathbf{v},\mathbf{v}\rangle = 1$, equivalently $n\lambda\,\langle\boldsymbol{\alpha},\boldsymbol{\alpha}\rangle = 1$
- Feature extraction for a point $x$: compute its principal components as projections onto the eigenvectors, $\langle\mathbf{v},\varphi(x)\rangle = \sum_i \alpha_i k(x_i,x)$; over the training points these projections form the vector $K\boldsymbol{\alpha}$ (a runnable sketch follows below)
- Source: Wikipedia, author: Petter Strandmark
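Below is a minimal kernel PCA sketch (my own, with an assumed Gaussian kernel) that follows the steps above: build $K$, solve the dual eigenproblem, rescale $\boldsymbol{\alpha}$ so that $\langle\mathbf{v},\mathbf{v}\rangle = 1$, and extract components as $K\boldsymbol{\alpha}$. One practical detail the slide omits: $K$ is first centered in feature space.

```python
import numpy as np

def gauss_kernel_matrix(X, sigma=1.0):
    # K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_pca(X, q=2, sigma=1.0):
    n = X.shape[0]
    K = gauss_kernel_matrix(X, sigma)

    # Center K in feature space (a step the slides skip for simplicity)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J

    # Dual eigenproblem: Kc alpha = (n lambda) alpha; eigh returns ascending eigenvalues
    eigval, A = np.linalg.eigh(Kc)
    order = eigval.argsort()[::-1][:q]
    eigval, A = eigval[order], A[:, order]

    # Normalize so that <v, v> = alpha' K alpha = 1, i.e. divide each alpha by sqrt(its eigenvalue of K)
    A = A / np.sqrt(eigval)

    # Feature extraction for the training points: projections <v, phi(x_j)> = (K alpha)_j
    return Kc @ A

# Toy data: two concentric circles, where linear PCA cannot separate the groups
rng = np.random.default_rng(5)
t = rng.uniform(0, 2 * np.pi, 200)
r = np.r_[np.full(100, 1.0), np.full(100, 3.0)] + 0.05 * rng.normal(size=200)
X = np.c_[r * np.cos(t), r * np.sin(t)]

Z = kernel_pca(X, q=2, sigma=1.0)
print(Z.shape)   # (200, 2): non-linear principal components of the circle data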