
1 A Survey on Distance Metric Learning (Part 1) Gerry Tesauro, IBM T.J. Watson Research Center

2 Acknowledgement Lecture material shamelessly adapted/stolen from the following sources: – Kilian Weinberger: “Survey on Distance Metric Learning” slides; IBM summer intern talk slides (Aug. 2006) – Sam Roweis: slides (NIPS 2006 workshop on “Learning to Compare Examples”) – Yann LeCun: talk slides (NIPS 2006 workshop on “Learning to Compare Examples”)

3 Outline – Motivation and Basic Concepts – ML tasks where it’s useful to learn a distance metric – Overview of Dimensionality Reduction – Mahalanobis Metric Learning for Clustering with Side Info (Xing et al.) – Pseudo-metric online learning (Shalev-Shwartz et al.) – Neighbourhood Components Analysis (Goldberger et al.), Metric Learning by Collapsing Classes (Globerson & Roweis) – Metric Learning for Kernel Regression (Weinberger & Tesauro) – Metric learning for RL basis function construction (Keller et al.) – Similarity learning for image processing (LeCun et al.) (topics split across Part 1 and Part 2)

4 Motivation Many ML algorithms and tasks require a distance metric (equivalently, a “dissimilarity” metric): – Clustering (e.g. k-means) – Classification & regression: kernel methods, nearest neighbor methods – Document/text retrieval: find the most similar fingerprints in a DB to a given sample; find the most similar web pages to a document/keywords – Nonlinear dimensionality reduction methods: Isomap, Maximum Variance Unfolding, Laplacian Eigenmaps, etc.

5 Motivation (2) Many problems may lack a well-defined, relevant distance metric: – Incommensurate features → Euclidean distance not meaningful – Side information → Euclidean distance not relevant – Learning distance metrics may thus be desirable. A sensible similarity/distance metric may be highly task-dependent or semantic-dependent: – What do these data points “mean”? – What are we using the data for?

6 Which images are most similar?

7 It depends... (centered / left / right)

8 It depends... (male / female)

9 ... what you are looking for (student / professor)

10 ... what you are looking for (nature background / plain background)

11 Key DML Concept: Mahalanobis distance metric The simplest mapping is a linear transformation x → A x, so that D(x_i, x_j) = ‖A x_i − A x_j‖.

12 Mahalanobis distance metric The simplest mapping is a linear transformation x → A x, equivalently D²(x_i, x_j) = (x_i − x_j)ᵀ M (x_i − x_j) with M = AᵀA. Algorithms can learn either matrix; M must be PSD (positive semi-definite).
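
A minimal sketch (not from the slides) of the two equivalent views: a learned linear map A, and the induced PSD Mahalanobis matrix M = AᵀA. The variable names and random data are illustrative assumptions.

```python
import numpy as np

def mahalanobis_sq(xi, xj, M):
    """Squared Mahalanobis distance D^2 = (xi - xj)^T M (xi - xj)."""
    d = xi - xj
    return float(d @ M @ d)

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 5))      # hypothetical learned linear map R^5 -> R^2
M = A.T @ A                      # induced Mahalanobis matrix, PSD by construction

x, y = rng.normal(size=5), rng.normal(size=5)
# Equivalent views: distance under M equals Euclidean distance after mapping by A.
print(mahalanobis_sq(x, y, M))
print(np.sum((A @ x - A @ y) ** 2))
```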

13 A “>5 Minutes” Introduction to Dimensionality Reduction

14 How can the dimensionality be reduced? Eliminate redundant features; eliminate irrelevant features; extract low-dimensional structure.

15 Notation Input: x_1, …, x_n ∈ R^d. Output: y_1, …, y_n ∈ R^r, with r ≪ d. Embedding principle: nearby points remain nearby, distant points remain distant (‖y_i − y_j‖ ≈ ‖x_i − x_j‖). Estimate r.

16 Two classes of DR algorithms: Linear and Non-Linear

17 Linear dimensionality reduction

18 Principal Component Analysis (Jolliffe 1986) Project data into subspace of maximum variance.

19 Optimization Find the direction w of maximum projected variance: max over ‖w‖ = 1 of wᵀ C w.

20 Covariance matrix C = (1/n) Σ_i (x_i − μ)(x_i − μ)ᵀ. Eigenvalue solution: C w = λ w; the top eigenvectors span the maximum-variance subspace.

21 Facts about PCA Uses the eigenvectors of the covariance matrix C. Minimizes sum-of-squares reconstruction error. Dimensionality r can be estimated from the eigenvalues of C. PCA requires meaningful scaling of the input features.
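
A minimal PCA sketch via eigendecomposition of the covariance matrix, matching the facts above (the function name and test data are illustrative assumptions).

```python
import numpy as np

def pca(X, r):
    """Project rows of X onto the top-r principal components."""
    Xc = X - X.mean(axis=0)                    # center the data
    C = Xc.T @ Xc / len(X)                     # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # sort descending
    W = eigvecs[:, order[:r]]                  # top-r eigenvectors of C
    return Xc @ W, eigvals[order]

X = np.random.default_rng(1).normal(size=(100, 10))
Y, spectrum = pca(X, r=2)
# Inspecting the eigenvalue spectrum gives an estimate of a reasonable r.
```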

22 Multidimensional Scaling (MDS) Pairwise distances (miles):

            NY     LA   Phoenix  Chicago
  NY         0   2790      2450      810
  LA      2790      0       390     2050
  Phoenix 2450    390         0     1740
  Chicago  810   2050      1740        0

23 Multidimensional Scaling (MDS)

24 Inner-product matrix: obtained from the squared pairwise distances by double centering, G = −½ H D² H with H = I − (1/n) 1 1ᵀ.

25 Multidimensional Scaling (MDS) Equivalent to PCA; uses the eigenvectors of the inner-product matrix; requires only pairwise distances.
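
A minimal classical-MDS sketch, assuming only a matrix of pairwise distances is available (here the city mileage table from the earlier slide); the helper name is mine.

```python
import numpy as np

def classical_mds(D, r=2):
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    G = -0.5 * H @ (D ** 2) @ H                # double-centered inner-product matrix
    eigvals, eigvecs = np.linalg.eigh(G)
    order = np.argsort(eigvals)[::-1][:r]
    L = np.sqrt(np.maximum(eigvals[order], 0)) # clip tiny negative eigenvalues
    return eigvecs[:, order] * L               # embedding coordinates

cities = ["NY", "LA", "Phoenix", "Chicago"]
D = np.array([[0, 2790, 2450, 810],
              [2790, 0, 390, 2050],
              [2450, 390, 0, 1740],
              [810, 2050, 1740, 0]], dtype=float)
coords = classical_mds(D)   # 2-D city layout, recovered up to rotation/reflection
```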

26 Non-linear dimensionality reduction

27

28 From subspace to submanifold We assume the data is sampled from a manifold with fewer degrees of freedom. How can we find a faithful embedding?

29 Approximate manifold with neighborhood graph

30

31 Isomap (Tenenbaum et al., 2000) Compute shortest paths between all inputs; create the geodesic distance matrix; perform MDS with the geodesic distances.
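
A minimal Isomap sketch following the three steps on the slide; the neighborhood size and the inlined MDS step are my assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def isomap(X, n_neighbors=10, r=2):
    # 1. Approximate the manifold with a k-nearest-neighbor graph.
    G = kneighbors_graph(X, n_neighbors, mode="distance")
    # 2. Geodesic distances = shortest paths through the graph.
    D_geo = shortest_path(G, method="D", directed=False)
    # 3. Classical MDS on the geodesic distance matrix.
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n
    Gram = -0.5 * H @ (D_geo ** 2) @ H
    eigvals, eigvecs = np.linalg.eigh(Gram)
    order = np.argsort(eigvals)[::-1][:r]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))
```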

32 Locally Linear Embedding (LLE) (Roweis and Saul, 2000) Represent each input as a linear combination of its neighbors; preserve those local reconstruction weights in the low-dimensional embedding.

33 Maximum Variance Unfolding (MVU) (Weinberger and Saul, 2004) Maximize pairwise distances while preserving local distances and angles; “unfolding” by semidefinite programming.

34 Maximum Variance Unfolding (MVU) Weinberger and Saul 2004

35 Optimization problem Unfold the data by maximizing pairwise distances; preserve local distances.

36 Optimization problem Center the output (translation invariance).

37 Optimization problem Problem: the optimization over the output coordinates is non-convex, with multiple local minima.

38 Optimization problem Solution: a change of variables to the output Gram matrix K = YᵀY turns it into a semidefinite program with a single global minimum.
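
A minimal MVU sketch with cvxpy using the Gram-matrix change of variables above; the neighbor set `edges`, the squared-distance input, and the solver defaults are assumptions of this sketch.

```python
import numpy as np
import cvxpy as cp

def mvu(D_sq, edges):
    """D_sq: squared input distances; edges: list of neighbor pairs (i, j)."""
    n = D_sq.shape[0]
    K = cp.Variable((n, n), PSD=True)                  # Gram matrix of the outputs
    constraints = [cp.sum(K) == 0]                     # center outputs (translation invariance)
    for i, j in edges:                                 # preserve local distances
        constraints.append(K[i, i] - 2 * K[i, j] + K[j, j] == D_sq[i, j])
    prob = cp.Problem(cp.Maximize(cp.trace(K)), constraints)  # maximize total variance
    prob.solve()                                       # requires an SDP-capable solver
    # Recover the embedding by eigendecomposition of the optimal K.
    eigvals, eigvecs = np.linalg.eigh(K.value)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))
```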

39 Unfolding the swiss-roll

40 Mahalanobis Metric Learning for Clustering with Side Information (Xing et al. 2003) Exemplars {x_i, i = 1, …, N} plus two types of side info: – “Similar” set S = {(x_i, x_j)} s.t. x_i and x_j are “similar” (e.g. same class) – “Dissimilar” set D = {(x_i, x_j)} s.t. x_i and x_j are “dissimilar” Learn the optimal Mahalanobis matrix M: D²_ij = (x_i − x_j)ᵀ M (x_i − x_j) (a global distance function). Goal: keep all pairs of “similar” points close, while separating all “dissimilar” pairs. Formulate as a constrained convex programming problem (written out below): – minimize the distances between the data pairs in S – subject to the data pairs in D being well separated
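
Written out as a math sketch consistent with the slide's description (the notation is mine, not copied from the slide):

```latex
\min_{M \succeq 0} \;\; \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_M^2
\qquad \text{s.t.} \qquad
\sum_{(x_i, x_j) \in D} \|x_i - x_j\|_M \;\ge\; 1,
\qquad \|x\|_M^2 \equiv x^\top M x .
```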

41 MMC-SI (Cont’d) Objective of learning: M is positive semi-definite – ensures non-negativity and the triangle inequality of the metric. The number of parameters is quadratic in the number of features – difficult to scale to a large number of features – significant danger of overfitting small datasets.

42 Mahalanobis Metric for Clustering (MMC-SI) Xing et al., NIPS 2002

43 Move similarly labeled inputs together MMC-SI

44 Move differently labeled inputs apart MMC-SI

45 Convex optimization problem

46 target: Mahalanobis matrix

47 Convex optimization problem pushing differently labeled inputs apart

48 Convex optimization problem pulling similar points together

49 Convex optimization problem ensuring positive semi-definiteness

50 Convex optimization problem CONVEX

51 Two convex sets Set of all matrices that satisfy constraint 1 (a half-space in matrix space); the cone of PSD matrices {M : M ⪰ 0}.

52 Convex optimization problem convex objective convex constraints

53 Gradient Alternating Projection

54 Take step along gradient.

55 Gradient Alternating Projection Take step along gradient. Project onto constraint-satisfying subspace.

56 Gradient Alternating Projection Take step along gradient. Project onto constraint-satisfying subspace. Project onto PSD cone.

57 Gradient Alternating Projection Take step along gradient. Project onto constraint-satisfying subspace. Project onto PSD cone. REPEAT. The algorithm is guaranteed to converge to the optimal solution.
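
A minimal sketch of this gradient + alternating-projection loop. The step size, iteration count, and helper names (`project_psd`, `mmc_si`) are my assumptions; this is an illustration of the scheme, not the authors' exact implementation.

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2)
    return eigvecs @ np.diag(np.maximum(eigvals, 0)) @ eigvecs.T

def mmc_si(X, similar, dissimilar, steps=200, lr=0.1):
    d = X.shape[1]
    M = np.eye(d)
    # Constraint matrix for the "similar" set: <M, C_S> = sum of squared distances over S.
    C_S = sum(np.outer(X[i] - X[j], X[i] - X[j]) for i, j in similar)
    for _ in range(steps):
        # Gradient step: push differently labeled (dissimilar) pairs apart.
        grad = np.zeros((d, d))
        for i, j in dissimilar:
            diff = X[i] - X[j]
            dist = np.sqrt(max(diff @ M @ diff, 1e-12))
            grad += np.outer(diff, diff) / (2.0 * dist)
        M = M + lr * grad
        # Projection 1: onto the half-space <M, C_S> <= 1 (keep similar pairs close).
        val = np.sum(M * C_S)
        if val > 1:
            M = M - (val - 1) / np.sum(C_S * C_S) * C_S
        # Projection 2: back onto the PSD cone.
        M = project_psd(M)
    return M
```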

58 Mahalanobis Metric Learning: Example I (a) Distribution of the original dataset (b) Data scaled by the learned global metric. Keep all data points within the same class close; separate all data points from different classes.

59 Mahalanobis Metric Learning: Example II A diagonal distance metric M can simplify computation, but can lead to disastrous results. (a) Original data (b) Rescaling by the learned full M (c) Rescaling by the learned diagonal M

60 Summary of Xing et al. 2002 Learns a Mahalanobis metric. Well suited for clustering. Can be kernelized. Optimization problem is convex. Algorithm is guaranteed to converge. Assumes data to be unimodal.

61 POLA (Pseudo-metric online learning algorithm) Shalev-Shwartz et al, ICML 2004

62 This time the inputs are accessed two at a time. POLA (Pseudo-metric online learning algorithm)

63 Differently labeled inputs are separated. POLA (Pseudo-metric online learning algorithm)

64 POLA (Pseudo-metric online learning algorithm)

65 Similarly labeled inputs are moved closer. POLA (Pseudo-metric online learning algorithm)

66 Margin

67 Convex optimization At each time t, we get two inputs x_t, x'_t with label y_t ∈ {+1, −1} (similar / dissimilar). Constraint 1: y_t (b − D²_M(x_t, x'_t)) ≥ 1 (margin on the threshold b). Constraint 2: M ⪰ 0. Both are convex!

68 Alternating Projection Initialize inside the PSD cone. Project onto the constraint-satisfying hyperplane and back.

69 Alternating Projection Initialize inside the PSD cone. Project onto the constraint-satisfying hyperplane and back. Repeat with new constraints.

70 Alternating Projection Initialize inside the PSD cone. Project onto the constraint-satisfying hyperplane and back. Repeat with new constraints. If a solution exists, the algorithm converges inside the intersection.
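
A minimal POLA-style online update in the spirit of these slides: project (M, b) onto the half-space where the margin constraint holds, then project M back onto the PSD cone. The margin target, step formula, and full eigenvalue-clipping projection are my simplifications, not the paper's exact algorithm.

```python
import numpy as np

def pola_update(M, b, x1, x2, y):
    """One online step. y = +1 if (x1, x2) are similarly labeled, -1 otherwise."""
    u = x1 - x2
    V = np.outer(u, u)
    margin = y * (b - u @ M @ u)              # want margin >= 1
    loss = max(0.0, 1.0 - margin)
    if loss > 0:
        # Project (M, b) onto the half-space satisfying the margin constraint.
        alpha = loss / (np.sum(V * V) + 1.0)
        M = M - alpha * y * V
        b = b + alpha * y
    # Project M back onto the PSD cone.
    eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2)
    M = eigvecs @ np.diag(np.maximum(eigvals, 0)) @ eigvecs.T
    return M, b
```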

71 Theoretical Guarantees Provided a global solution exists: the batch version converges after a finite number of passes over the data; the online version has an upper bound on the accumulated violation of the threshold.

72 Summary of POLA Learns Mahalanobis metric Online algorithm Can also be kernelized Introduces a margin Algorithm converges if solution exists Assumes data to be unimodal

73 Neighbourhood Components Analysis (Goldberger et al., 2004) A distance metric for visualization and kNN.

