1
CIS 5590: Advanced Topics in Large-Scale Machine Learning
Week II: Kernels, Dimension Reduction, and Graphs Instructor: Kai Zhang Temple University, Spring 2018
2
Symmetric Matrices
Two important examples: the kernel (Gram) matrix, K_ij = k(x_i, x_j), and the graph adjacency matrix.
Kernel methods: support vector machines; kernel PCA, LDA, CCA; Gaussian process regression.
Graph-based algorithms: manifold learning and dimension reduction; clustering and semi-supervised learning; random walks and graph propagation.
Manipulating an n×n matrix takes O(n^3) time and O(n^2) space!
3
Kernel Methods
A kernel K: R^d × R^d → R is a similarity measure that can be written as an inner product, K(x, y) = ⟨φ(x), φ(y)⟩, for some (implicit) feature map φ.
Example: the polynomial kernel K(x, y) = (x'y + c)^p, whose (implicit) feature map consists of all monomials of the input coordinates up to degree p.
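As a concrete check of the implicit feature map, here is a minimal NumPy sketch (the constant c = 1, degree p = 2, and the particular ordering of the monomials are illustrative choices, not from the slides) showing that the degree-2 polynomial kernel on 2-D inputs equals the inner product of an explicit 6-dimensional feature map.

```python
import numpy as np

def poly_kernel(x, y, c=1.0, p=2):
    # (x'y + c)^p, computed without ever forming the feature map
    return (x @ y + c) ** p

def phi(x, c=1.0):
    # explicit degree-2 feature map for 2-D input (one possible ordering)
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2,
                     c])

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(poly_kernel(x, y), phi(x) @ phi(y))   # the two values agree
```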
4
Kernels
Choice of kernels (closed-form), e.g., linear, polynomial, and Gaussian (RBF) kernels.
5
Kernel-based Learning
A linear algorithm (SVM, MPM, PCA, CCA, FDA, ...) is applied after embedding the data: data x_1, ..., x_n → kernel design → kernel matrix K → kernel algorithm.
6
Histogram intersection kernel
The histogram intersection kernel between histograms a and b is K(a, b) = Σ_i min(a_i, b_i). K small → a and b are different; K large → a and b are similar. Introduced by Swain and Ballard (1991) to compare color histograms; Odone et al. (2005) proved positive definiteness, so it can be used directly as a kernel for an SVM.
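A minimal Python sketch of the histogram intersection kernel and its Gram matrix; the random 8-bin histograms are synthetic placeholders, and the final eigenvalue check is only an empirical illustration of the positive definiteness result cited above.

```python
import numpy as np

def hik(a, b):
    # histogram intersection kernel: sum of bin-wise minima
    return np.minimum(a, b).sum()

def hik_gram(X):
    # Gram matrix for the rows of X (each row is a histogram)
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = hik(X[i], X[j])
    return K

# toy example: three normalized 8-bin histograms (synthetic data)
rng = np.random.default_rng(0)
X = rng.random((3, 8))
X /= X.sum(axis=1, keepdims=True)
K = hik_gram(X)
print(K)
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # empirically PSD, as Odone et al. proved
```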
7
Graph Kernels: Motivation
Task: predict whether molecules (represented as graphs) are toxic, given a set of known toxic and non-toxic examples; the remaining molecules are unknown.
8
Similarity between Graphs
Fundamental problem: how do we measure the similarity between two graphs? The question dates back to the graph isomorphism problem; Babai (2017) showed that graph isomorphism can be decided in quasi-polynomial time, which is the best result known so far.
9
Graph Kernel Intuition: two graphs are similar if they exhibit similar patterns when performing random walks. In the illustration, two graphs whose random walks both concentrate heavily on a small set of vertices are judged similar, while a graph whose random-walk visits are evenly distributed is judged not similar to them.
10
Kernel Eigen-system
Eigenvalues and eigenfunctions of a positive definite kernel k(·,·) are defined through its integral operator: ∫ k(x, y) φ_i(y) dy = λ_i φ_i(x).
11
Mercer’s Theorem
Mercer’s theorem is the fundamental theorem underlying reproducing kernel Hilbert spaces: a continuous positive definite kernel admits the expansion k(x, y) = Σ_i λ_i φ_i(x) φ_i(y) in terms of its eigenvalues and eigenfunctions. It can be viewed as an infinite-sample extension of the kernel eigenvalue decomposition.
12
Univariate Gaussian case
Eigenfunctions and eigenvectors of the 1-d Gaussian kernel.
13
Kernel Eigenvectors
Kernel eigenvectors provide an empirical kernel map.
Suppose the eigenvalue decomposition K = U Σ U'. Then Φ = U Σ^{1/2} is an explicit solution whose rows satisfy K_ij = ⟨φ(x_i), φ(x_j)⟩.
This is already a type of nonlinear manifold learning.
Example: USPS data embedding with 16-by-16 images of digit 3 (200 samples) and digit 8 (200 samples).
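A hedged NumPy sketch of the empirical kernel map: it uses a Gaussian kernel on synthetic data as a stand-in for the USPS images (bandwidth chosen by a median heuristic, an assumption not stated on the slide) and verifies K_ij = ⟨φ(x_i), φ(x_j)⟩ after the eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))           # synthetic stand-in for the USPS images

# Gaussian (RBF) kernel matrix
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * np.median(sq)))

# empirical kernel map: K = U S U'  =>  Phi = U S^{1/2}, rows are phi(x_i)
S, U = np.linalg.eigh(K)
S = np.clip(S, 0, None)                      # guard against tiny negative eigenvalues
Phi = U * np.sqrt(S)

print(np.allclose(Phi @ Phi.T, K, atol=1e-8))     # K_ij = <phi(x_i), phi(x_j)>
embedding_2d = Phi[:, np.argsort(S)[::-1][:2]]    # top-2 coordinates give a 2-D embedding
```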
14
Manifold Learning
Manifold learning (or non-linear dimensionality reduction) embeds data that originally lies in a high-dimensional space into a lower-dimensional space, while preserving characteristic properties. A manifold is a topological space that locally resembles Euclidean space near each point.
15
Mapmaking Problem
Flattening the Earth (a sphere) onto a planar map; the same points A and B are marked in both views.
16
Image Examples
Objective: find a small number of features that represent a large number of observed dimensions. Each image has 64×64 = 4096 pixels (observed dimensions). Assumption: high-dimensional data often lies on or near a much lower-dimensional, curved manifold.
17
Multi-dimensional Scaling
Given pairwise dissimilarities, reconstruct a map that preserves distances.
Given an n-by-n distance matrix D, find n points y_1, y_2, ..., y_n whose pairwise distances resemble those in D.
Objective function: minimize Σ_{i,j=1}^{n} (D_ij^X − D_ij^Y)^2, where D_ij^X = ||x_i − x_j|| and D_ij^Y = ||y_i − y_j||.
Solution by eigenvalue decomposition: obtain the inner-product matrix K, factorize K = U Σ U', and set Y = U Σ^{1/2}.
Given a distance matrix D, one can transform it into an inner-product matrix by K = −HDH/2, where H = I − ee'/n is the so-called double-centering matrix; if D is Euclidean, K is a PSD matrix.
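A minimal NumPy sketch of classical MDS as described above; note that the double-centering formula is applied to squared distances here, and the synthetic 3-D point cloud is just a placeholder for real data.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS: D is an n-by-n matrix of Euclidean distances."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # double-centering matrix H = I - ee'/n
    K = -0.5 * H @ (D ** 2) @ H              # inner-product (Gram) matrix
    S, U = np.linalg.eigh(K)
    idx = np.argsort(S)[::-1][:k]            # keep the k largest eigenvalues
    return U[:, idx] * np.sqrt(np.clip(S[idx], 0, None))   # Y = U_k S_k^{1/2}

# sanity check on synthetic 3-D points embedded into 2-D
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, k=2)
```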
18
From Distance to Gram Matrix
Assume the distance matrix D is calculated using squared Euclidean distances: D_ij = ||x_i − x_j||^2 = ||x_i||^2 − 2 x_i'x_j + ||x_j||^2, which is equivalent to D = Z + Z' − 2 X'X, where Z = v e' and v_i = ||x_i||^2.
Centering matrix: H = I − ee'/n; XH subtracts the mean from each x_i.
Since H v e' H = H v e'(I − ee'/n) = 0, the Z terms vanish after double centering.
Recovering the centered inner-product matrix: −HDH/2 = −0.5 · H (Z − 2 X'X + Z') H = H X'X H = (XH)'(XH), the Gram matrix of the centered data matrix XH.
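A short NumPy check of the identity derived above, −HDH/2 = (XH)'(XH), with the data stored as one point per column (the dimensions and sample size are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 20
X = rng.standard_normal((d, n))              # data matrix, one point per column

# squared Euclidean distance matrix D_ij = ||x_i - x_j||^2
G = X.T @ X
v = np.diag(G)
D = v[:, None] + v[None, :] - 2 * G          # D = Z + Z' - 2 X'X with Z = v e'

H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
lhs = -0.5 * H @ D @ H
rhs = (X @ H).T @ (X @ H)                    # Gram matrix of the centered data
print(np.allclose(lhs, rhs))                 # True
```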
19
ISOMAP: From Euclidean Distance to Manifold Distance
Neighboring points: use the input-space (Euclidean) distance.
Faraway points: approximate the geodesic distance by a sequence of "short hops" between neighboring points.
Method: find shortest paths in a graph whose edges connect neighboring data points.
Unlike the geodesic distance, the Euclidean distance cannot reflect the geometric structure of the data points.
20
ISOMAP Algorithm
Step 1, O(D N^2): construct the neighborhood graph G and compute the matrix D_X = {d_X(i, j)}, where d_X(i, j) is the Euclidean distance between neighbors.
Step 2: compute shortest paths between all pairs, giving the matrix D_G = {d_G(i, j)}, where d_G(i, j) is the length of a sequence of short hops and approximates the geodesic distance.
Step 3, O(d N^2): construct k-dimensional coordinate vectors by applying MDS to D_G instead of D_X.
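A from-scratch Isomap sketch following the three steps above, assuming scikit-learn and SciPy are available; the neighborhood size is a placeholder, and a disconnected k-NN graph would leave infinite geodesic distances that this sketch does not handle. (scikit-learn also ships a ready-made sklearn.manifold.Isomap.)

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, k=2):
    # Step 1: neighborhood graph with Euclidean edge lengths
    G = kneighbors_graph(X, n_neighbors, mode='distance')
    # Step 2: all-pairs shortest paths approximate the geodesic distances
    D_G = shortest_path(G, directed=False)
    # Step 3: classical MDS on the geodesic distance matrix
    n = D_G.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K = -0.5 * H @ (D_G ** 2) @ H
    S, U = np.linalg.eigh(K)
    idx = np.argsort(S)[::-1][:k]
    return U[:, idx] * np.sqrt(np.clip(S[idx], 0, None))

# e.g. Y = isomap(swiss_roll_points, n_neighbors=10, k=2)
```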
21
Swiss Roll
The 2-D embedding recovered by Isomap.
22
Stochastic Neighborhood Embedding
A probabilistic version of local MDS: it is more important to get local distances right than non-local ones, and it has a probabilistic way of deciding whether a pairwise distance is "local".
In the high-dimensional space, p_{j|i} is the probability of picking j given that you start at i, defined with a Gaussian centered at x_i: p_{j|i} = exp(−||x_i − x_j||^2 / 2σ_i^2) / Σ_{k≠i} exp(−||x_i − x_k||^2 / 2σ_i^2).
23
Stochastic Neighborhood Embedding
Give each data point a location in the low-dimensional space. Evaluate this representation by seeing how well the low-dimensional probabilities q_{j|i} (the probability of picking j given that you start at i, computed in the low-D space) model the high-dimensional probabilities p_{j|i}.
24
The Cost Function
The cost is the mismatch between the two sets of probabilities, C = Σ_i KL(P_i || Q_i) = Σ_i Σ_j p_{j|i} log(p_{j|i} / q_{j|i}).
For points where p is large and q is small, we lose a lot: nearby points in high-D really want to be nearby in low-D.
For points where q is large and p is small, we lose only a little, because we waste some of the probability mass in the Q_i distribution: widely separated points in high-D have only a mild preference for being widely separated in low-D.
25
Gradient Updates
Points are pulled towards each other if the p's are bigger than the q's, and repelled if the q's are bigger than the p's.
26
t-SNE
t-SNE uses a Gaussian to compute the high-dimensional similarities P_ij and a heavy-tailed Student-t distribution (with one degree of freedom, which is the same as a Cauchy distribution) to compute the low-dimensional similarities Q_ij. The heavy tails allow dissimilar objects to be modeled far apart in the map.
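A minimal NumPy sketch of the t-SNE gradient, dC/dy_i = 4 Σ_j (p_ij − q_ij)(y_i − y_j)(1 + ||y_i − y_j||^2)^{-1}; the random P matrix, the learning rate, and plain gradient descent (no momentum or early exaggeration) are simplifications for illustration only.

```python
import numpy as np

def tsne_grad(Y, P):
    """One t-SNE gradient: 4 * sum_j (p_ij - q_ij)(y_i - y_j) / (1 + ||y_i - y_j||^2)."""
    diff = Y[:, None, :] - Y[None, :, :]                 # pairwise differences y_i - y_j
    dist2 = (diff ** 2).sum(-1)
    W = 1.0 / (1.0 + dist2)                              # Student-t (Cauchy) weights
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                                      # low-D joint probabilities q_ij
    PQ = (P - Q) * W
    return 4.0 * (PQ[:, :, None] * diff).sum(axis=1)

# toy usage: random P (in practice P comes from Gaussians calibrated by perplexity)
rng = np.random.default_rng(0)
n = 100
P = rng.random((n, n)); P = P + P.T; np.fill_diagonal(P, 0); P /= P.sum()
Y = 1e-4 * rng.standard_normal((n, 2))
for _ in range(200):
    Y -= 100.0 * tsne_grad(Y, P)                         # plain gradient descent (no momentum)
```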
27
Visualization of 6,000 digits from the MNIST dataset produced by t-SNE.
28
The COIL20 dataset Each object is rotated about a vertical axis to produce a closed one-dimensional manifold of images.
29
Visualization of COIL20 produced by t-SNE.
30
Visualization of COIL20 produced by ISOMAP
31
Time Series Embedding Example
BOLD (blood-oxygen-level dependent) signal: the signal of a single brain region, and the signals of all 90 brain regions together.
32
Impact of Perplexity
Perplexity controls the effective size of the neighborhood used when computing the probability distributions p_{j|i}: low perplexity emphasizes very local structure, while high perplexity brings in more global structure.
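A hedged usage sketch with scikit-learn's TSNE, sweeping a few perplexity values on the small digits dataset; the dataset and the specific perplexities are illustrative choices, not the ones used for the figures above.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, perplexity in zip(axes, [5, 30, 100]):
    # larger perplexity -> larger effective neighborhood
    Y = TSNE(n_components=2, perplexity=perplexity,
             init='pca', random_state=0).fit_transform(X)
    ax.scatter(Y[:, 0], Y[:, 1], c=y, s=3, cmap='tab10')
    ax.set_title(f'perplexity = {perplexity}')
plt.show()
```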
33
Graph Laplacian (Zhu,Ghahramani,Lafferty ’03)
Interpolation on graphs: interpolate the values of a function at all vertices from given values at a few vertices. Minimize the smoothness energy Σ_{i,j} W_ij (f(x_i) − f(x_j))^2 subject to the given values. (The example figure shows a small protein-interaction network with nodes such as CDC20, CDC27, ANAPC2, ANAPC5, ANAPC10, and UBE2C, a few of which carry known values.)
34
The Laplacian Matrix of a Graph
Example: a small graph on 6 vertices (labeled 1-6). Its Laplacian L = D − W is symmetric, has non-positive off-diagonal entries, and is diagonally dominant.
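A small NumPy sketch that builds the Laplacian L = D − W of a toy 6-vertex graph (the edge set is arbitrary, not the one in the figure) and checks the three properties listed above.

```python
import numpy as np

# adjacency matrix of a small undirected graph (6 vertices, arbitrary toy edges)
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 1, 0, 0],
              [1, 1, 0, 0, 1, 0],
              [0, 1, 0, 0, 1, 1],
              [0, 0, 1, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

D = np.diag(W.sum(axis=1))        # degree matrix
L = D - W                         # graph Laplacian

off = L - np.diag(np.diag(L))
print(np.allclose(L, L.T))                                  # symmetric
print(np.all(off <= 0))                                     # non-positive off-diagonals
print(np.all(np.diag(L) >= np.abs(off).sum(axis=1)))        # diagonally dominant
print(np.min(np.linalg.eigvalsh(L)) >= -1e-10)              # hence positive semi-definite
```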
35
Signal Propagation on Graphs
Given a graph (adjacency matrix W ∈ R^{n×n}) and function values on a few nodes (Y ∈ R^{n×1}), how do we obtain a full labeling f that is smooth and consistent with Y?
Iterative approach: start from F(0); repeatedly update F by mixing values propagated from neighbors with the initial labels (e.g., F(t+1) = α S F(t) + (1 − α) Y for a normalized adjacency S); stop when converged.
The first term lets each point receive information from its neighbors (each f(x_j) weighted by W_ij); the second term retains the initial information.
36
Regularization Perspective
It can be shown that the limit F* of the iteration is equivalent to minimizing a cost function of the form Σ_{i,j} W_ij || f_i/√(D_ii) − f_j/√(D_jj) ||^2 + μ Σ_i || f_i − y_i ||^2.
The first term is the smoothing constraint: nearby points are likely to have the same label. The second term is the fitting constraint: the classification result should not change too much from the initial assignment.
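A hedged NumPy sketch of the propagation scheme in the style of the update written above: the iteration F ← αSF + (1 − α)Y is run on a toy two-cluster graph and compared with the closed-form minimizer (1 − α)(I − αS)^{-1}Y; the graph, labels, and α are illustrative choices.

```python
import numpy as np

def propagate(W, Y, alpha=0.9, iters=1000):
    """Iterate F <- alpha * S * F + (1 - alpha) * Y with S the normalized adjacency."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))          # S = D^{-1/2} W D^{-1/2}
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * Y
    return F

# toy graph: two triangles joined by one edge, one labeled node per class
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
Y = np.zeros((6, 2)); Y[0, 0] = 1; Y[5, 1] = 1   # node 0 -> class 0, node 5 -> class 1

F_iter = propagate(W, Y)
d = W.sum(axis=1)
S = W / np.sqrt(np.outer(d, d))
F_closed = (1 - 0.9) * np.linalg.solve(np.eye(6) - 0.9 * S, Y)  # (1-a)(I - aS)^{-1} Y
print(np.allclose(F_iter, F_closed))   # iteration converges to the closed-form minimizer
print(F_iter.argmax(axis=1))           # expected: first triangle class 0, second class 1
```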
37
Computer Graphics Applications
Rendering on meshed surfaces
38
Spring networks View edges as rubber bands or ideal linear springs
Nail down some vertices and let the rest settle. When a spring is stretched to length ℓ, its potential energy is ½ ℓ² times the spring constant (here, the edge weight).
39
Spring networks Nail down some vertices, let rest settle
Physics: the equilibrium position minimizes the total potential energy subject to the boundary constraints (the nails).
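A minimal NumPy sketch of a Tutte-style spring drawing: nail down some vertices, then the equilibrium of the free vertices comes from solving a Laplacian linear system. The toy graph (a square with one interior vertex) and the nailed coordinates are illustrative choices, not the "socnetbad10" example from the slides.

```python
import numpy as np

def spring_layout(W, fixed, pos_fixed):
    """Place free vertices at the spring equilibrium given nailed ('fixed') vertices.

    Minimizing the total potential energy 0.5 * sum_ij W_ij ||p_i - p_j||^2
    over the free positions reduces to a Laplacian linear system.
    """
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    free = [i for i in range(n) if i not in fixed]
    # block form of the stationarity condition: L_ff p_free = -L_fb p_fixed
    p_free = np.linalg.solve(L[np.ix_(free, free)],
                             -L[np.ix_(free, fixed)] @ pos_fixed)
    pos = np.zeros((n, 2))
    pos[fixed] = pos_fixed
    pos[free] = p_free
    return pos

# toy planar graph: outer square (nailed) with one inner vertex joined to all corners
W = np.zeros((5, 5))
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (4, 0), (4, 1), (4, 2), (4, 3)]
for i, j in edges:
    W[i, j] = W[j, i] = 1.0
corners = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
pos = spring_layout(W, fixed=[0, 1, 2, 3], pos_fixed=corners)
print(pos[4])   # the free vertex settles at the average of its neighbors: [0.5, 0.5]
```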
40
Drawing by Spring Networks
(Tutte ’63) If the graph is planar, then the spring drawing has no crossing edges!
51
Recent Advances: Auto-encoders and Word Embeddings
Auto-encoder: an artificial neural network used for unsupervised learning of efficient codings. The goal is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction.
Word embedding: a set of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from the vocabulary are mapped to vectors of real numbers.
52
Unsupervised feature learning with a neural network
The network is trained to output its input (i.e., to learn the identity function), so that x̂ ≈ x. This has a trivial solution unless we constrain the number of units in Layer 2 (learning a compressed representation) or constrain Layer 2 to be sparse.
Encoding: h = sigm(W_1 x + b). Decoding: x̂ = sigm(W_2 h + c).
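A hedged PyTorch sketch of the autoencoder above (sigmoid encoder and decoder around a small bottleneck); the layer sizes, optimizer, and the random batch standing in for real data are placeholder choices.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, d_in=64, d_hidden=8):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(d_in, d_hidden), nn.Sigmoid())  # h = sigm(W1 x + b)
        self.decode = nn.Sequential(nn.Linear(d_hidden, d_in), nn.Sigmoid())  # x_hat = sigm(W2 h + c)

    def forward(self, x):
        return self.decode(self.encode(x))

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(256, 64)                       # placeholder batch; replace with real data in [0, 1]
for step in range(500):
    x_hat = model(x)
    loss = nn.functional.mse_loss(x_hat, x)   # train the network to reproduce its input
    opt.zero_grad()
    loss.backward()
    opt.step()
```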
53
Linear Auto-encoder and PCA
Constraints: the encoding and decoding layers share parameters (tied weights) and use a linear transfer function. The result is equivalent to PCA: the map from x to h is a projection onto the k highest eigen-directions, and the map from h to x̂ reconstructs a truncated x (keeping only the k highest eigen-directions).
54
Deep Auto-encoders: stacking more and more layers; layer-wise training.
55
Word embedding
Goal: associate a low-dimensional, dense vector v_w with each word w ∈ V so that similar words (in a distributional sense) share a similar vector representation.
Implications: a word ought to be able to predict its context (word2vec Skip-Gram); a context ought to be able to predict its missing word (word2vec CBOW).
"You shall know a word by the company it keeps." (J. R. Firth, 1957)
Tomas Mikolov et al., "Distributed Representations of Words and Phrases and their Compositionality," Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
56
Word embedding
Each unique word is mapped to a point in a real, continuous m-dimensional space. Typically |V| > 10^6 and 100 < m < 500.
57
Word embedding
58
Skip-Gram Model
Given a sequence of words w_1, ..., w_T, maximize the average log probability (1/T) Σ_{t=1}^{T} Σ_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t).
Associate with each word u ∈ V an "input vector" u ∈ R^d and an "output vector" v ∈ R^d, and model the context probabilities with a softmax: p(o | c) = exp(v_o'u_c) / Σ_{w ∈ V} exp(v_w'u_c).
The gradient with respect to v can then be computed in closed form.
59
Negative Sampling
The complexity of computing the exact gradient is O(|V|), because the softmax operator involves the entire vocabulary. Approximating it with K "negative samples" drawn from a noise distribution (negative sampling) reduces the cost per update to O(K).
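A minimal NumPy sketch of the negative-sampling objective and its gradients for a single (center, context) pair; the vocabulary size, vector dimension, uniform negative sampling, and learning rate are toy placeholders (word2vec actually samples negatives from a smoothed unigram distribution).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_loss_and_grads(u_c, v_out, pos, negs):
    """Negative-sampling loss for one (center, context) pair.

    loss = -log sigma(v_pos . u_c) - sum_k log sigma(-v_neg_k . u_c)
    Only u_c, the positive vector, and the K sampled negatives get gradients,
    so the update costs O(K) instead of O(|V|).
    """
    grad_v = np.zeros_like(v_out)
    s_pos = sigmoid(v_out[pos] @ u_c)
    loss = -np.log(s_pos)
    grad_u = (s_pos - 1.0) * v_out[pos]
    grad_v[pos] += (s_pos - 1.0) * u_c
    for k in negs:
        s_neg = sigmoid(v_out[k] @ u_c)
        loss -= np.log(1.0 - s_neg)            # log sigma(-z) = log(1 - sigma(z))
        grad_u += s_neg * v_out[k]
        grad_v[k] += s_neg * u_c
    return loss, grad_u, grad_v

# toy usage: vocabulary of 1000 words, 50-d vectors, K = 5 negatives per pair
rng = np.random.default_rng(0)
V, d, K = 1000, 50, 5
U = 0.01 * rng.standard_normal((V, d))        # input vectors
Vout = 0.01 * rng.standard_normal((V, d))     # output vectors
center, context = 3, 17
negs = rng.integers(0, V, size=K)             # uniform negatives here, for simplicity
loss, g_u, g_v = sgns_loss_and_grads(U[center], Vout, context, negs)
U[center] -= 0.05 * g_u                       # SGD step on the touched vectors
Vout -= 0.05 * g_v
```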