Slide 1: LING 696B: MDS and non-linear methods of dimension reduction

Slide 2: Big picture so far
- Blob/pizza/pancake-shaped data --> Gaussian distributions
- Clustering with blobs
- Linear dimension reduction
- What if the data are not blob-shaped? Can we still reduce the dimension? Can we still perform clustering?

Slide 3: Dimension reduction with PCA
- Eigendecomposition of the covariance matrix
- If only the first few eigenvalues are significant, we can ignore the rest, e.g. keep only 2-D coordinates of X
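As an illustration of the decomposition this slide refers to, here is a minimal numpy sketch of PCA via the covariance matrix (the function and variable names are illustrative, not from the course materials):

```python
import numpy as np

def pca(X, d=2):
    """Project N x D data X onto its top-d principal components."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = Xc.T @ Xc / (len(Xc) - 1)         # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]       # eigenvalues in descending order
    W = eigvecs[:, order[:d]]               # keep the top-d eigenvectors
    return Xc @ W                           # low-dimensional coordinates
```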

Slide 4: Success of reduction = blob-likeness of data
- [Figure: pancake-shaped data in 3D, with principal axes a1 and a2]

Slide 5: Example: articulatory data
- Story and Titze: extracting composite articulatory control parameters from area functions using PCA
- PCA can be a “preprocessor” for K-means

Slide 6: Can neural nets do dimension reduction?
- Yes, but most architectures can be seen as implementations of some variant of linear projection
- [Figure: network diagram with input X, a hidden layer with weights W, output X, and a context/time-delay layer]
- Can an Elman-style network discover segments?
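To make the "linear projection" point concrete, here is a minimal numpy sketch of the simplest such network, a linear bottleneck trained to reconstruct its input (this is not the Elman architecture in the diagram; the toy data, learning rate, and iteration count are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
scales = np.array([3.0, 2.0, 1.0, 0.5, 0.5, 0.5, 0.3, 0.3, 0.2, 0.1])
X = rng.standard_normal((500, 10)) * scales    # anisotropic toy data
X -= X.mean(axis=0)                            # center the data

d, lr = 2, 0.01
W_enc = 0.1 * rng.standard_normal((10, d))     # input  -> hidden (bottleneck)
W_dec = 0.1 * rng.standard_normal((d, 10))     # hidden -> output

for _ in range(5000):
    H = X @ W_enc                              # d-dimensional hidden code
    err = H @ W_dec - X                        # reconstruction error
    W_dec -= lr * H.T @ err / len(X)           # gradient of squared error
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)

# With linear units, the learned hidden code spans (approximately)
# the same subspace as the top-d principal components.
print(np.mean(err ** 2))
```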

Slide 7: Metric multidimensional scaling
- Input data: “distances” between stimuli
- Goal: recover some psychological space for the stimuli
- Dimension reduction is also achieved through an appropriate matrix decomposition

Slide 8: Calculating metric MDS
- Data: distance matrix D with entries D_ij = ||x_i - x_j||^2
- Need to calculate X from D
- Gram matrix: G = X X^T (N x N), with entries G_ij = <x_i, x_j> = x_i x_j^T (unknown)
- Main point: if the distance is Euclidean and X is centered, the Gram matrix can be computed from the distance matrix: double-center D by subtracting the column mean, then the row mean, and scale by -1/2 (homework)
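A minimal numpy sketch of this double-centering step (assuming D already holds squared Euclidean distances; the function name is made up for the example):

```python
import numpy as np

def gram_from_distances(D):
    """Recover the Gram matrix G = X X^T from squared Euclidean distances,
    assuming the (unknown) configuration X is centered at the origin."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N   # centering matrix
    return -0.5 * J @ D @ J               # subtract row and column means
```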

Slide 9: Calculating metric MDS
- Get X from the Gram matrix: eigendecompose G
- Dimension reduction: only a few of the eigenvalues d_i are significant, the rest are small (similar to PCA), e.g. d = 2 (see the sketch below)
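A sketch of the standard construction, keeping the top-d eigenvalues and scaling the corresponding eigenvectors (this is the usual classical-MDS recipe, not code from the course):

```python
import numpy as np

def coordinates_from_gram(G, d=2):
    """Embed the points in d dimensions from a Gram matrix G = X X^T."""
    eigvals, eigvecs = np.linalg.eigh(G)     # G is symmetric
    order = np.argsort(eigvals)[::-1][:d]    # the d largest eigenvalues
    lam = np.maximum(eigvals[order], 0.0)    # clip tiny negative values
    return eigvecs[:, order] * np.sqrt(lam)  # X = U * sqrt(Lambda)
```

Note that the result is only determined up to rotation, which is the point of the next slide.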

Slide 10: Calculating metric MDS
- Now we don't want a rotation, but X itself (different from PCA). There are infinitely many solutions:
- For any rotation matrix R, (XR)(XR)^T = X R R^T X^T = X X^T
- Same problem as with factor analysis: x = Cz + v = (CR)(R^T z) + v
- The recovered X has to be the psychologically meaningful one

Slide 11: MDS and PCA
- Both are linear dimension reduction
- Euclidean distance --> identical solutions for dimension reduction: X^T X (covariance matrix) and X X^T (Gram matrix) have the same nonzero eigenvalues (homework) (see summary)
- MDS can be applied even if we don't know whether the distance is Euclidean (non-metric MDS)
- MDS needs to diagonalize large matrices when N is large
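A quick numerical check of the eigenvalue claim (toy data, for illustration only):

```python
import numpy as np

X = np.random.randn(100, 5)
X -= X.mean(axis=0)                                # center the data

cov_eigs = np.linalg.eigvalsh(X.T @ X)[::-1]       # 5 x 5 matrix
gram_eigs = np.linalg.eigvalsh(X @ X.T)[::-1]      # 100 x 100 Gram matrix

# The 5 nonzero eigenvalues agree; the Gram matrix's other 95 are ~0.
print(np.allclose(cov_eigs, gram_eigs[:5]))        # True
```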

Slide 12: Going beyond the (linear + blob) combination
- Looking for a non-Gaussian image under linear projections (last week): Linear Discriminant Analysis, Independent Component Analysis
- Looking for non-linear projections that may find blobs (today): Isomap, spectral clustering

Slide 13: Why non-linear dimension reduction?
- Linear methods are all based on the Gaussian assumption
- Gaussians are closed under linear transformations
- Yet lots of data do not look like blobs
- In high dimensions, geometric intuition breaks down; it is hard to see what a distribution “looks like”

Slide 14: Non-linear dimension reduction
- Data sampled from a “manifold” structure
- Manifold: a “surface” that locally looks Euclidean
- [Figure: each small piece looks Euclidean (pictures from L. Saul)]
- No rotation or linear projection can produce this “interesting” structure

Slide 15: The generic dimension reduction problem
- Dimension reduction = finding a lower-dimensional embedding of the manifold
- Sensory data = an embedding in an ambient measurement space (d large)
- Goal: an embedding in a lower-dimensional space (visual: d < 4)
- Ideally, d = intrinsic dimension (~ cognition?)

Slide 16: The need for non-linear transformations
- Why won't directly applying MDS work?
- A twisted structure may change the ordering (see demo)

Slide 17: Embedding needs to preserve global structure
- Cutting the data into blobs?

Slide 18: Embedding needs to preserve global structure
- Cutting the data into blobs? No concept of global structure; can't tell the intrinsic dimension

Slide 19: What does it mean to preserve global structure?
- This is hard to quantify, but we can at least look for an embedding that preserves some properties of the global structure, e.g. one that preserves distance
- Example: the distance between two points on earth; the actual calculation depends on what we think the shape of the earth is

Slide 20: Global structure through distance
- Geodesic distance: the distance between two points along the manifold
- d(A, B) = min { length(curve(A --> B)) }, where curve(A --> B) lies on the manifold
- No shortcuts! A “global distance”

Slide 21: Global structure through undirected graphs
- In practice there is no manifold; we only work with data points
- But enough data points can always approximate the surface when they are “dense”
- Think of the data as connected by “rigid bars”
- Desired embedding: “stretch” the dataset as far as allowed by the bars, like making a map

Slide 22: Isomap (Tenenbaum et al.)
- Idea: approximate the geodesic distance by making small, local connections
- Dynamic programming through the neighborhood graph

Slide 23: Isomap (Tenenbaum et al.)
The algorithm (see the sketch below):
- Compute the neighborhood graph (by K-nearest neighbors)
- Calculate pairwise distances d(i, j) by the shortest path between points i and j (and also cut out outliers)
- Run metric MDS on the distance matrix D and extract the leading eigenvectors
- Key: maintain the geodesic distance rather than the ambient distance
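Putting the three steps together, a minimal sketch (using scikit-learn's k-nearest-neighbor graph and scipy's shortest-path routine; it assumes the neighborhood graph is connected and skips the outlier-trimming step):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, d=2):
    """Isomap: k-NN graph -> geodesic distances -> classical MDS."""
    # 1. Neighborhood graph, edges weighted by Euclidean distance
    graph = kneighbors_graph(X, n_neighbors=n_neighbors, mode='distance')
    # 2. Approximate geodesic distances by shortest paths in the graph
    geo = shortest_path(graph, directed=False)
    D = geo ** 2                                   # squared distances for MDS
    # 3. Metric MDS on the geodesic distance matrix
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    G = -0.5 * J @ D @ J                           # double centering
    eigvals, eigvecs = np.linalg.eigh(G)
    order = np.argsort(eigvals)[::-1][:d]
    lam = np.maximum(eigvals[order], 0.0)
    return eigvecs[:, order] * np.sqrt(lam)
```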

Slide 24: The effect of neighborhood size in Isomap
- What happens if K is small? What happens if K is large? What happens if K = N?
- Should K be fixed?
- We have assumed a uniform distribution on the manifold (see demo)

Slide 25: Nice properties of Isomap
- Implicitly defines a non-linear projection of the original data (through geodesic distance + MDS) so that the Euclidean distance in the new space equals the geodesic distance in the old space
- Compare to kernel methods (later)
- No local maxima: another eigenvalue problem
- Theoretical guarantees (footnotes 18, 19)
- Only needs the choice of the neighborhood size K

Slide 26: Problems with Isomap
- What if the data have holes? Things with holes cannot be massaged into a convex set
- When the data consist of disjoint parts, we don't want to maintain the distance between the different parts; we need to solve a clustering problem
- Does it make sense to keep this distance? How can we stretch two parts at a time? How do we stretch a circle?

Slide 27: Spectral clustering
- K-means/Gaussian mixture/PCA clustering only works for blobs
- Clustering non-blob data: image segmentation in computer vision (example from Kobus Barnard)

Slide 28: Spectral = graph structure
- Rather than working directly with the data points, work with a graph constructed from them
- Isomap: distance calculated from the neighborhood graph
- Spectral clustering: find a layout of the graph that separates the clusters

Slide 29: Undirected graphs
- Backbone of the graph: a set of nodes V = {1, …, N} and a set of edges E = {e_ij}
- Algebraic graphs: each pair of nodes is either connected or not connected
- Weighted graphs: the edges carry weights
- A lot of problems can be formulated as graph problems, e.g. Google, OT

Slide 30: Seeing the graph structure through matrices
- Fix an ordering of the nodes (1, …, N)
- Let the edge from j to k correspond to a matrix entry A(j, k) or W(j, k)
- A(j, k) = 0/1 for an unweighted graph; W(j, k) = the edge weight for a weighted graph
- The Laplacian (D - A, where D is the diagonal matrix of node degrees) is another useful matrix
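A small illustration of these matrices for a toy graph (the graph itself is made up for the example):

```python
import numpy as np

# Toy undirected graph on 4 nodes with edges (0,1), (1,2), (2,3), (3,0), (0,2)
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]

N = 4
A = np.zeros((N, N))                 # adjacency matrix, 0/1 entries
for j, k in edges:
    A[j, k] = A[k, j] = 1            # undirected graph: symmetric matrix

D = np.diag(A.sum(axis=1))           # degree matrix
L = D - A                            # graph Laplacian

print(L)
```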

Slide 31: The spectrum of a graph
- A lot of questions about graphs can be answered through their matrices. Examples:
- The chance of a random walk going through a particular node (Google)
- The time needed for a random walk to reach equilibrium (Manhattan Project)
- Approximate solutions to intractable problems, e.g. a layout of the graph that separates less-connected parts (clustering)

Slide 32: Clustering as a graph-partitioning problem
- Normalized-cut problem: split the graph into two parts A and B so that
- each part is not too small, and
- the edges being cut do not carry too much weight
- The criterion compares the weight on edges going from A to B against the weight on edges within A (and within B)
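The standard way to write this criterion down is Shi and Malik's normalized cut (the slide's figure presumably showed the same ratio of "weight from A to B" over "weight involving A"):

```latex
\[
  \mathrm{cut}(A,B)   = \sum_{i \in A,\ j \in B} w_{ij}, \qquad
  \mathrm{assoc}(A,V) = \sum_{i \in A,\ j \in V} w_{ij},
\]
\[
  \mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(A,V)}
                     + \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(B,V)}.
\]
```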

Slide 33: Normalized cut through spectral embedding
- The exact solution of normalized cut is NP-hard (explodes for a large graph)
- A “soft” version is solvable: look for coordinates x_1, …, x_N for the nodes that minimize sum_ij w_ij (x_i - x_j)^2, so that strongly connected nodes stay nearby and weakly connected nodes stay far away
- Such coordinates are provided by eigenvectors of the adjacency/Laplacian matrix (recall MDS) -- the spectral embedding
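The connection to the Laplacian comes from a standard identity (written here for one-dimensional coordinates x_i; this is textbook spectral graph theory rather than something spelled out on the slide):

```latex
\[
  \sum_{i,j} w_{ij}\,(x_i - x_j)^2
  \;=\; 2\sum_i d_i x_i^2 \;-\; 2\sum_{i,j} w_{ij}\, x_i x_j
  \;=\; 2\, x^{\top}(D - W)\, x
  \;=\; 2\, x^{\top} L\, x,
\]
```

where d_i = sum_j w_ij, so minimizing the objective under suitable normalization constraints picks out the bottom eigenvectors of L.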

Slide 34: Belkin and Niyogi, and others
Spectral clustering algorithm (see the sketch below):
- Construct a graph by connecting each data point with its neighbors
- Compute the Laplacian matrix L
- Use the spectral embedding (the bottom eigenvectors of L) to represent the data, and run K-means
- What is the free parameter here?
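A minimal sketch of this recipe, assuming an unnormalized Laplacian and a K-nearest-neighbor graph (the slides do not pin down these details, and real implementations often use a normalized Laplacian instead):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import KMeans

def spectral_clustering(X, n_clusters=2, n_neighbors=10):
    """Graph Laplacian -> bottom eigenvectors -> K-means."""
    # 1. Neighborhood graph (0/1 adjacency), symmetrized
    A = kneighbors_graph(X, n_neighbors=n_neighbors,
                         mode='connectivity').toarray()
    A = np.maximum(A, A.T)
    # 2. Graph Laplacian L = D - A
    L = np.diag(A.sum(axis=1)) - A
    # 3. Bottom eigenvectors of L (skipping the trivial constant one)
    eigvals, eigvecs = np.linalg.eigh(L)
    embedding = eigvecs[:, 1:n_clusters + 1]
    # 4. K-means on the embedded points
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)
```

The free parameter the slide asks about is presumably the neighborhood size (n_neighbors here), which is the topic of the next slide.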

Slide 35: The effect of neighborhood size in constructing a graph
- This can be specified with a radius, or with a neighborhood size K; same problem as in Isomap
- Don't want to connect everyone: then the graph is complete -- little structure
- Don't want to connect too few: then the graph is too sparse -- not robust to holes/shortcuts/outliers
- This is a delicate matter (see demo)

Slide 36: Distributional clustering of words in Belkin and Niyogi
- Feature vector: word counts from the previous and following 300 words

Slide 37: Speech clustering in Belkin and Niyogi
- Feature vector: spectrogram (256)

Slide 38: Summary of graph-based methods
- When the geometry of the data is unknown, it seems reasonable to work with a graph derived from the data
- Dimension reduction: find a low-dimensional representation of the graph
- Clustering: use a spectral embedding of the graph to separate components
- Constructing the graph requires heuristic parameters for the neighborhood size (choice of K)

Slide 39: Computation of linear and non-linear reduction
- All involve the diagonalization of matrices
- PCA: covariance matrix (dense)
- MDS: Gram matrix derived from Euclidean distance (dense)
- Isomap: Gram matrix derived from geodesic distance (dense)
- Spectral clustering: weight matrix derived from the data (sparse)
- Many variants do not have this nice property

Slide 40: Questions
- How often do manifolds arise in perception/cognition?
- What is the right metric for calculating local distance in the ambient space?
- Do people utilize manifold structure in different perceptual domains? (And what does this tell us about K?)
- Vowel manifolds? (crazy experiment)

Slide 41: Last word
- I'm not sure this experiment will work. But can people just learn arbitrary manifold structures? Are there constraints on the structures that people can learn?
