Published by Steven Graves. Modified over 3 years ago.

1
Latent Space Domain Transfer between High Dimensional Overlapping Distributions

Sihong Xie, Wei Fan, Jing Peng*, Olivier Verscheure, Jiangtao Ren
Sun Yat-Sen University; IBM T. J. Watson Research Center; *Montclair State University

Main Challenge:
1. Transfer learning
2. High dimensional (4000 features)
3. Overlapping (<80% of features are the same)
4. Solution with performance bounds

2
Standard Supervised Learning
- Training (labeled): New York Times
- Test (unlabeled): New York Times
- Classifier: 85.5%

3
In Reality……
- Training (labeled): Reuters, because labeled New York Times data is not available
- Test (unlabeled): New York Times
- Classifier: 64.1%

4
Domain Difference: Performance Drop
- Ideal setting: train on NYT, test on NYT (New York Times). Classifier: 85.5%
- Realistic setting: train on Reuters, test on NYT (New York Times). Classifier: 64.1%

5
High Dimensional Data Transfer
High dimensional data:
- Text categorization
- Image classification
- The number of features in our experiment is more than 4000
Challenges:
- High dimensionality: there are more features than training examples, and Euclidean distances between points become similar
- Are the feature sets completely overlapping? No. In some tasks, fewer than 80% of the features are the same.
- Are the marginal distributions not closely related? Then it is harder to find transferable structures.
- A proper similarity definition is needed.
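The remark that Euclidean distances become similar in high dimensions can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: points are drawn uniformly at random, and `relative_contrast` is a hypothetical helper, not something from the slides. It compares how much the nearest and farthest distances differ in 2 and in 4000 dimensions.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Return (max - min) / min over distances from one query point to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

low = relative_contrast(dim=2)
high = relative_contrast(dim=4000)
print(f"relative contrast in 2 dimensions:    {low:.2f}")
print(f"relative contrast in 4000 dimensions: {high:.2f}")
```

A small contrast means the nearest and farthest neighbors are almost equally far away, which is why a nearest-neighbor style similarity becomes less informative as the number of features grows.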

6
Transfer between high dimensional overlapping distributions
Overlapping Distribution
Data from two domains may not lie in exactly the same space, but at most in an overlapping one.

      x      y      z      label
A     ?      1      0.2    +1
B     0.09   ?      0.1    +1
C     0.01   ?      0.3

7
Transfer between high dimensional overlapping distributions
Problems with overlapping distribution
Using only the overlapping features may discard predictive information, which makes it hard to predict correctly.

      f1     f2     f3     label
A     ?      1      0.2    +1
B     0.09   ?      0.1    +1
C     0.01   ?      0.3

8
Transfer between high dimensional overlapping distributions
Overlapping Distribution
Use the union of all features and fill in the missing values with zeros?

      f1     f2     f3     label
A     0      1      0.2    +1
B     0.09   0      0.1    +1
C     0.01   0      0.3

Does it help?
D^2(A, B) = 0.0181 > D^2(A, C) = 0.0101
A is misclassified as the same class as C, instead of B.
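The zero-filling comparison on this slide can be recomputed as a short sketch. It assumes the flattened table reads A = (?, 1, 0.2), B = (0.09, ?, 0.1), C = (0.01, ?, 0.3) over (f1, f2, f3); the helper names are hypothetical. Under that reading, the slide's figures 0.0181 and 0.0101 match the f1 and f3 contributions, while f2 adds the same amount to both distances and so does not change the ordering.

```python
# Toy instances over (f1, f2, f3); None marks a missing value (assumed reading of the slide).
A = (None, 1.0, 0.2)
B = (0.09, None, 0.1)
C = (0.01, None, 0.3)

def zero_fill(x):
    """Replace every missing value with 0."""
    return tuple(0.0 if v is None else v for v in x)

def sq_dist(u, v, features=(0, 1, 2)):
    """Squared Euclidean distance restricted to the given feature indices."""
    return sum((u[i] - v[i]) ** 2 for i in features)

a, b, c = map(zero_fill, (A, B, C))

d_ab = sq_dist(a, b)              # all three features
d_ac = sq_dist(a, c)
d_ab_f1f3 = sq_dist(a, b, (0, 2)) # f1 and f3 only
d_ac_f1f3 = sq_dist(a, c, (0, 2))

print(f"f1 and f3 only: D2(A,B) = {d_ab_f1f3:.4f}, D2(A,C) = {d_ac_f1f3:.4f}")
print(f"all features:   D2(A,B) = {d_ab:.4f}, D2(A,C) = {d_ac:.4f}")
print("nearest to A:", "B" if d_ab < d_ac else "C")
```

Either way A ends up nearer to C than to B, which is the misclassification the slide describes.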

9
Transfer between high dimensional overlapping distributions
When one uses the union of the overlapping and non-overlapping features and leaves the missing values as zero, the distance between the two marginal distributions p(x) can become asymptotically very large as a function of the non-overlapping features, which then become a dominant factor in the similarity measure.

10
Transfer between high dimensional overlapping distributions
High dimensionality can undermine important features
The blues are closer to the greens than to the reds

11
LatentMap: two-step correction
Missing value regression
- Brings the marginal distributions closer
Latent space dimensionality reduction
- Further brings the marginal distributions closer
- Ignores unimportant, noisy, and error-introduced features
- Identifies transferable substructures across the two domains

12
Missing Value Regression
Filling up missing values (recall the previous example)
1. Project to the overlapped features
2. Map from z to x; the relationship is found by a regression model
D{img(A), B} = 0.0109 < D{img(A), C} = 0.0125
A is correctly classified as the same class as B
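The idea of mapping from an overlapping feature z to a missing feature x can be sketched as follows. This is not the authors' model: the slides mention a regression model (support vector regression in the analysis), while this sketch swaps in ordinary least squares with a single predictor to stay dependency-free. All data values and the `fit_line` helper are hypothetical.

```python
def fit_line(zs, xs):
    """Ordinary least squares fit of x = slope * z + intercept."""
    n = len(zs)
    mean_z = sum(zs) / n
    mean_x = sum(xs) / n
    var_z = sum((z - mean_z) ** 2 for z in zs)
    cov_zx = sum((z - mean_z) * (x - mean_x) for z, x in zip(zs, xs))
    slope = cov_zx / var_z
    return slope, mean_x - slope * mean_z

# Hypothetical domain where both z (overlapping) and x (non-overlapping) are observed.
observed_z = [0.1, 0.2, 0.3, 0.4]
observed_x = [0.05, 0.10, 0.15, 0.20]

slope, intercept = fit_line(observed_z, observed_x)

# Hypothetical instances from the other domain: z is known, x is missing.
other_z = [0.2, 0.3]
filled_x = [slope * z + intercept for z in other_z]
print(filled_x)
```

The filled values replace the zeros used in the naive approach, so distances across domains are no longer dominated by features that one domain simply never observed.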

13
Dimensionality Reduction
[Figure: word vector matrix made up of the overlapping features and the missing values, shown with the missing values filled]

14
Dimensionality Reduction
Project the word vector matrix onto the most important and inherent subspace
[Figure: low-dimensional representation]
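Projecting a matrix onto its most important subspace can be illustrated for the simplest case, a one-dimensional latent space. The sketch below is an assumption-laden stand-in, not the authors' code: instead of a full SVD it uses power iteration on X^T X to find only the leading right singular vector, and the matrix `X` and the helper name are hypothetical.

```python
import math

def top_right_singular_vector(X, iters=200):
    """Power iteration on X^T X; returns the leading right singular vector of X."""
    n_rows, n_cols = len(X), len(X[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iters):
        Xv = [sum(row[j] * v[j] for j in range(n_cols)) for row in X]
        w = [sum(X[i][j] * Xv[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    return v

# Hypothetical word vector matrix: rows are documents, columns are word features.
X = [
    [2.0, 0.0, 0.0],
    [4.0, 0.0, 0.1],
    [0.0, 1.0, 0.0],
]

v1 = top_right_singular_vector(X)
# One-dimensional latent representation: each document projected onto v1.
latent = [sum(row[j] * v1[j] for j in range(len(v1))) for row in X]
print([round(t, 4) for t in v1])
print([round(t, 4) for t in latent])
```

In this toy matrix the first feature carries most of the variation, so the leading direction is almost entirely along it and the third document, which does not use that feature, projects to nearly zero. A k-dimensional latent space would keep the top k singular vectors instead of one.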

15
Solution (high dimensionality)
- Recall the previous example: the blues are closer to the greens than to the reds
- With the solution applied: the blues are closer to the reds than to the greens

16
Properties
It can bring the marginal distributions of the two domains close.
- The marginal distributions are brought close in the high-dimensional space (Section 3.2)
- The distance between the two marginal distributions is further reduced in the low-dimensional space (Theorem 3.2)
It brings the two domains' conditional distributions close.
- Nearby instances from the two domains have similar conditional distributions (Section 3.3)
It can reduce the domain transfer risk.
- The risk of the nearest neighbor classifier can be bounded in transfer learning settings (Theorem 3.3)

17
Experiment (I)
Data sets
- 20 Newsgroups: 20000 newsgroup articles
- SRAA (simulated real auto aviation): 73128 articles from 4 discussion groups
- Reuters: 21758 Reuters news articles
Baseline methods
- Naïve Bayes, logistic regression, SVM
- knnReg: missing values filled, but without SVD
- pLatentMap: SVD, but missing values left as 0
- The last two baselines are meant to justify the two steps in our framework
Our framework: first fill up the gap, then use a kNN classifier to do the classification
[Figure: 20 Newsgroups hierarchy (comp: comp.sys, comp.graphics; rec: rec.sport, rec.auto) split into in-domain and out-domain]
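The final classification step mentioned above, a kNN classifier applied after the gap is filled, can be sketched as a plain majority vote. The points and labels below are hypothetical latent-space coordinates, and `knn_predict` is an illustrative helper rather than the authors' implementation.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical out-domain (labeled) points in a 2-D latent space.
train_X = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (1.0, 1.1), (1.1, 0.9), (0.9, 1.0)]
train_y = ["comp", "comp", "comp", "rec", "rec", "rec"]

# Hypothetical in-domain (unlabeled) points to classify.
pred_a = knn_predict(train_X, train_y, (0.1, 0.1))
pred_b = knn_predict(train_X, train_y, (1.0, 1.0))
print(pred_a, pred_b)
```

Because the vote depends only on distances, the quality of the prediction rests on the two earlier steps having placed same-class points from both domains near each other.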

18
Learning Tasks

19
Experiment (II)
Overall performance: 10 wins, 1 loss

20
Experiment (III)
- knnReg: missing values filled, but without SVD. Compared with knnReg: 8 wins, 3 losses
- pLatentMap: SVD, but without filling missing values. Compared with pLatentMap: 8 wins, 3 losses

21
Conclusion
Problem: high dimensional overlapping domain transfer (text and image categorization)
Step 1: Filling up missing values
- Brings the two domains' marginal distributions closer
Step 2: SVD dimension reduction
- Further brings the two marginal distributions closer (Theorem 3.2)
- Clusters points from the two domains, making the conditional distribution transferable (Theorem 3.3)
Code and data are available from the authors' webpage

24
Solution (high dimensionality)
Illustration of SVD
The most important and inherent information is in the eigenvectors corresponding to the top k eigenvalues.
[Figure: top k singular values and top k singular vectors]
So we can …

25
Analysis (I)
SVR (support vector regression) minimizes the distance between the two domains' marginal distributions
- The quantity minimized by SVR is an upper bound on the distance between the two domains' points on the overlapping features
- This brings the marginal distributions close in the original space

26
Analysis (II)
SVD also clusters the data such that nearby data have similar concepts
- The quantity minimized is the objective function of k-means
- SVD achieves the optimum solution
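The slide's own formula did not survive in the transcript. For reference, the standard k-means objective, which is presumably what "objective function of k-means" refers to (an assumption), can be written as follows, with clusters C_1, ..., C_k and centroids \mu_j as the usual textbook notation rather than symbols taken from the slides:

```latex
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert_2^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

Each point contributes its squared distance to the centroid of its own cluster, so a low objective value means that nearby points share a cluster.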

27
Analysis (III)
SVD (singular value decomposition) bounds the distance between the two marginal distributions (Theorem 3.2)
V_k = XT, ||T||_2 = [expression lost in extraction], where [symbol lost in extraction] > 1
So the two marginal distributions are brought closer

28
Analysis (IV)
Bound the risk (R) of the nearest neighbor classifier under transfer learning settings (Theorem 3.3)
R [relation symbol lost in extraction] -cov(r1, r2), where r_i is related to the conditional distribution
- The larger the distance between the two conditional distributions, the higher the bound will be
- Cluster the data such that nearest neighbors have similar conditional distributions
- This justifies why we use SVD

29
Experiment (IV)
Parameter sensitivity
- Number of neighbors to retrieve
- Number of dimensions of the latent space

30
Thank you!
