
1 Fast Nonparametric Machine Learning Algorithms for High-dimensional Massive Data and Applications. Ting Liu, Carnegie Mellon University. Ph.D. Thesis Proposal, February 2005.

2 Ting Liu, CMU. Thesis Committee: Andrew Moore (Chair), Martial Hebert, Jeff Schneider, Trevor Darrell (MIT).

3 Thesis Proposal. Goal: make nonparametric methods tractable for high-dimensional, massive datasets. Nonparametric methods include K-nearest-neighbor (K-NN), kernel density estimation, the SVM evaluation phase, and more.

4 Why K-NN for high-dimensional, massive data?
– It is simple: it goes back as early as [Fix-Hodges 1951], and [Cover-Hart 1967] justifies k-NN theoretically.
– It is easy to implement: a sanity check for other (more complicated) algorithms, with similar insights carrying over to other nonparametric algorithms.
– It is useful: many applications in text categorization, drug activity detection, multimedia, computer vision, and more.

5 Application: Video Segmentation. Task: shot transition detection, distinguishing cuts from gradual transitions (fades, dissolves, …).

6 Technically [Qi-Hauptmann-Liu 2003]: video frames → color histograms → pair-wise similarity features → classification (normal: 0, cut: 1, gradual: 2). On 4 hours of MPEG-1 video (420,970 frames), k-NN gives good performance but is very slow. We want a fast k-NN classification method.

7 Application: Near-duplicate Detection and Sub-image Retrieval, over a copyrighted image database.

8 Algorithm Overview [Yan-Rahul 2004]. Train: each of the 12,100 copyrighted images yields 1,000 patches (12,100,000 patches total); each patch is transformed with DoG + PCA-SIFT into a 36-dim descriptor and stored. Query: a query image also yields 1,000 patches, so answering it requires 1,000 k-NN searches. We want a fast k-NN search method.

9 K-NN Methods: a taxonomy.
– Exact K-NN search: naïve (slow); spatial trees: SR-tree, Kd-tree, Metric-tree.
– Exact K-NN classification (my work): KNS2 (2-class), KNS3 (2-class), IOC (multi-class).
– Approximate K-NN search: random sample, PCA, LSH, Spill-tree (my work).

10 (Roadmap: the K-NN methods taxonomy from slide 9.)

11 Problems with Exact K-NN Search: Efficiency. Exact search is slow on huge, high-dimensional datasets.
– Naïve (linear scan): O(dN) per query.
– Advanced: O(d log N) to O(dN) per query, using a spatial data structure to avoid searching all points: SR-tree [Katayama-Satoh 1997], Kd-tree [Friedman-Bentley-Finkel 1977], Metric-tree (ball-tree) [Uhlmann 1991, Omohundro 1991].
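The O(dN) naïve baseline on this slide is easy to state precisely. A minimal Python sketch (`naive_knn` is an illustrative name, not from the talk):

```python
import math

def naive_knn(data, q, k):
    """Naive linear scan: compute all N distances, O(dN) per query.
    This is the baseline that the tree-based methods try to beat."""
    dists = sorted((math.dist(p, q), i) for i, p in enumerate(data))
    return [i for _, i in dists[:k]]
```

It returns the indices of the k nearest points; every advanced method on the following slides tries to avoid touching all N points.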

12 Metric-tree: an example on a set of points in R².

13 Build a metric-tree [Uhlmann 1991, Omohundro 1991]: pick two pivot points p_1 and p_2 and split the points by the decision boundary L between them.

14 Metric-tree Data Structure [Uhlmann 1991, Omohundro 1991]: the internal data structure of a metric-tree, with each node bounding its points by a ball and its children split by the pivots p_1 and p_2.
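The split rule behind the metric-tree can be sketched as follows. This is a simplified illustration under assumptions of my own (the pivot-selection heuristic and the dict-based node layout are not the thesis code; real implementations differ):

```python
import math

def build_metric_tree(points, leaf_size=2):
    """Recursively split: pick two far-apart pivots p1 and p2, send each
    point to its nearer pivot, and record every node's bounding ball
    (center + radius) for later triangle-inequality pruning."""
    center = [sum(c) / len(points) for c in zip(*points)]
    radius = max(math.dist(center, p) for p in points)
    if len(points) <= leaf_size:
        return {"center": center, "radius": radius, "points": points}
    p1 = max(points, key=lambda p: math.dist(points[0], p))   # far from an arbitrary point
    p2 = max(points, key=lambda p: math.dist(p1, p))          # far from p1
    left = [p for p in points if math.dist(p, p1) <= math.dist(p, p2)]
    right = [p for p in points if math.dist(p, p1) > math.dist(p, p2)]
    if not left or not right:   # degenerate split: stop as a leaf
        return {"center": center, "radius": radius, "points": points}
    return {"center": center, "radius": radius,
            "left": build_metric_tree(left, leaf_size),
            "right": build_metric_tree(right, leaf_size)}
```

Each node's ball (center, radius) is exactly what the triangle-inequality slide that follows relies on.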

15 Metric-tree: the Triangle Inequality. Let q be any query point and x any point inside ball B. The triangle inequality bounds the distance using only B's center and radius: |d(q, center(B)) - r(B)| ≤ d(q, x) ≤ d(q, center(B)) + r(B).

16 Metric-tree Based K-NN Search: depth-first search with pruning via the triangle inequality. Significant speed-up when d is small (about O(d log N) per query), but little speed-up when d is large (about O(dN)): the "curse of dimensionality".
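The depth-first search with triangle-inequality pruning can be sketched like this. The node layout (dicts with "center", "radius", and, at leaves, "points") is an assumed toy representation for illustration, not the thesis implementation:

```python
import math

def tree_knn(node, q, k, best=None):
    """Depth-first k-NN search on a metric-tree.  `best` is kept as a
    sorted list of at most k (distance, point) pairs; a subtree is pruned
    when even the closest point its ball could contain (triangle
    inequality: d(q, center) - radius) beats nothing in `best`."""
    if best is None:
        best = []
    if "points" in node:                       # leaf: scan its points
        for p in node["points"]:
            d = math.dist(q, p)
            if len(best) < k or d < best[-1][0]:
                best.append((d, p))
                best.sort()
                del best[k:]
        return best
    # Visit the nearer child first, then prune the farther one if possible.
    for child in sorted([node["left"], node["right"]],
                        key=lambda c: math.dist(q, c["center"])):
        lower = math.dist(q, child["center"]) - child["radius"]
        if len(best) < k or lower < best[-1][0]:
            best = tree_knn(child, q, k, best)
    return best
```

Visiting the nearer child first makes the current k-th distance shrink quickly, which is what gives the pruning its power in low dimensions.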

17 (Roadmap: the K-NN methods taxonomy from slide 9.)

18 My Work (part 1): Fast K-NN Classification Based on Metric-trees. Idea: do classification without finding the k-NNs.
– KNS2: fast k-NN classification for skewed 2-class problems.
– KNS3: fast k-NN classification for general 2-class problems.
– IOC: fast k-NN classification for multi-class problems.

19 KNS2: Fast K-NN Classification for Skewed 2-class Problems. Assumptions: (1) two classes, positive and negative; (2) the positive class is much less frequent than the negative class. Example: video segmentation (~10,000 shot transitions vs. ~400,000 normal frames). Question: how many of the k-NN are from the positive class?

20 How many of the k-NN are from the positive class? Step 1: find the k closest positive points. Let d_i be the distance of the i-th closest positive point to q (example: k = 3, giving d_1, d_2, d_3). Since there are few positive points, this is easy to compute.

21 Step 2: count negatives. Let c_i be the number of negative points within distance d_i of q. Example (k = 3): c_1 = 1, c_2 = 5, c_3 = 8.

22 Step 2, refined: lower-bound the negatives. Idea: instead of computing each c_i exactly, lower-bound it. Example (k = 3): is c_1 ≥ 3? is c_2 ≥ 2? is c_3 ≥ 1?

23 (Recap of slide 15: the metric-tree triangle inequality, which supplies these lower bounds.)

24 Step 2: estimate the negative counts by traversing the negative metric-tree. Visiting the root node A (20 points) gives only trivial bounds: c_1 ≥ 0, c_2 ≥ 0, c_3 ≥ 0.

25 Expanding A into nodes B (12 points) and C (8 points) tightens the bounds: c_1 ≥ 0, c_2 ≥ 0, c_3 ≥ 12.


27 Expanding further into nodes D (5 points) and E (7 points): c_1 ≥ 0, c_2 ≥ 5, c_3 ≥ 12.


29 A final expansion (nodes E and F) gives c_1 ≥ 4, c_2 ≥ 5, c_3 ≥ 12. Every target is now met (c_1 ≥ 3, c_2 ≥ 2, c_3 ≥ 1), so none of the k-NN can be positive. We are done: return 0.

30 KNS2: the Algorithm.
– Build two metric-trees, Pos_tree and Neg_tree.
– Search Pos_tree to find the k positive NNs and their distances d_1, …, d_k.
– Search Neg_tree: repeat { pick a node from Neg_tree; refine the bounds C = {c_1, c_2, …, c_k}; if c_i ≥ k - i + 1, remove c_i from C }.
– Let k' = |C| after the search; return k' (the number of k-NN from the positive class).
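The quantities on this slide can be illustrated without the trees. The sketch below computes the d_i and c_i exactly by brute force (the real KNS2 only lower-bounds the c_i via Neg_tree) and then applies the same c_i ≥ k - i + 1 elimination test; `kns2_count` is an illustrative name:

```python
import math

def kns2_count(pos, neg, q, k):
    """How many of q's k nearest neighbors are positive?
    d[i-1] is the distance to the i-th closest positive point;
    c[i-1] counts negatives within that distance.  The i-th positive is
    among the k-NN iff fewer than k - i + 1 negatives are closer than it
    (i - 1 positives plus c_i points beat it, and that total must be < k)."""
    d = sorted(math.dist(q, p) for p in pos)[:k]
    c = [sum(math.dist(q, p) < di for p in neg) for di in d]
    return sum(1 for i, ci in enumerate(c, start=1) if ci < k - i + 1)
```

Because KNS2 only needs to decide each inequality c_i ≥ k - i + 1, a lower bound from a metric-tree node is often enough, which is where the speed-up comes from.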

31 Experimental Results (KNS2): datasets.
Dataset     Dimension (d)   Data Size (N)
ds1         10              26,733
Letter      16              20,000
Video       45              420,970
J_Lee       100             181,395
Blanc_Mel   100             186,414
ds2         1.1 × 10^6      88,358

32 CPU Time Speedup over Naïve K-NN (k = 9): KNS2 achieves a 3x to 60x speed-up over the naïve method.

33 (Roadmap: the K-NN methods taxonomy from slide 9.)

34 My Work (Part 2): a New Metric-tree Based Approximate NN Search, built from the "I'm Feeling Lucky" search and the spill-tree.

35 Why is Metric-tree search slow? Empirically, finding the NN takes 10% of the time; backtracking to verify it takes 90% of the time.

36 "I'm Feeling Lucky" Search.
– Algorithm: simple; descend the metric-tree without backtracking and return the first point hit in a leaf node.
– Complexity: super fast, O(log N) per query.
– Accuracy: quite low; liable to make mistakes when q is near a decision boundary.
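The no-backtracking descent is only a few lines. A minimal sketch, using an assumed toy node representation (dicts with "center" at internal nodes and "points" at leaves, not the thesis code):

```python
import math

def lucky_search(node, q):
    """"I'm Feeling Lucky" descent: at every internal node go to the child
    whose center is closer to q, never backtrack, and return the closest
    point in the leaf reached.  O(log N) per query, but it can miss the
    true NN when q falls near a split boundary."""
    while "points" not in node:
        node = min((node["left"], node["right"]),
                   key=lambda c: math.dist(q, c["center"]))
    return min(node["points"], key=lambda p: math.dist(q, p))
```

The whole point of the spill-tree that follows is to make this greedy descent much more likely to be right.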

37 Spill-tree: adding redundancy to help the "I'm Feeling Lucky" search.

38 Spill-tree: a variant of the metric-tree in which the children of a node can "spill over" onto each other and contain shared data-points.

39 The Spill-tree Data Structure. Two planes LL and LR flank the decision boundary L between the pivots p_1 and p_2, forming an overlapping buffer. In a metric-tree each child owns only the points on its side of L; in a spill-tree both children own the points between LL and LR.
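The overlapping split can be sketched directly. This is a simplified illustration (projecting onto the p1-to-p2 axis and using the midpoint plane as L are assumptions of the sketch, and `spill_split` is an illustrative name):

```python
import math

def spill_split(points, p1, p2, tau):
    """Spill-tree split between pivots p1 and p2: project each point onto
    the p1 -> p2 axis.  The midpoint plane plays the role of L, and every
    point within tau of it (the overlapping buffer between LL and LR) is
    given to BOTH children."""
    axis = [b - a for a, b in zip(p1, p2)]
    norm = math.hypot(*axis)
    u = [c / norm for c in axis]                       # unit vector p1 -> p2
    mid = sum((a + b) / 2 * c for a, b, c in zip(p1, p2, u))
    def offset(p):                                     # signed distance from L
        return sum(c * d for c, d in zip(p, u)) - mid
    left = [p for p in points if offset(p) <= tau]     # p1 side, out to LR
    right = [p for p in points if offset(p) >= -tau]   # p2 side, out to LL
    return left, right
```

Setting tau = 0 recovers the ordinary non-overlapping metric-tree split.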

40 Advantage of the spill-tree: higher accuracy; the descent makes a mistake only when the true NN is far away.

41 Problem with the spill-tree: uncontrolled depth. The depth stays O(log N) only when the overlapping buffer size is small enough; empirically, the buffer size is set relative to the expected distance of a point to its NN.

42 Hybrid Spill-tree Search. Balance threshold ρ = 70% (set empirically): if either child of a node v would contain more than ρ of the total points, split v in the conventional way instead. At search time, overlapping nodes use the "I'm Feeling Lucky" search; non-overlapping nodes use backtracking search.
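The hybrid rule on this slide is a one-line decision. A trivial sketch under the slide's own threshold (`hybrid_split_kind` is an illustrative name; the child counts come from a tentative spill split):

```python
def hybrid_split_kind(n_left, n_right, n_total, rho=0.70):
    """Hybrid spill-tree rule, balance threshold rho = 70%: if either
    child of a tentative spill split would hold more than rho of the
    node's points, fall back to a conventional non-overlapping split
    (searched with backtracking); otherwise keep the overlapping split
    (searched with the no-backtracking "I'm Feeling Lucky" descent)."""
    if max(n_left, n_right) > rho * n_total:
        return "conventional (backtracking)"
    return "overlapping (no backtracking)"
```

This is what bounds the tree depth: a split that dumps almost everything into one child is never allowed to overlap.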

43 Further Efficiency Improvement via Random Projection. Intuition: random projection approximately preserves distances.
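The intuition can be made concrete with a Gaussian random matrix (the Johnson-Lindenstrauss flavor of this idea). A minimal sketch, assuming a simple dense projection; `random_project` is an illustrative name:

```python
import math
import random

def random_project(points, d_out, seed=0):
    """Project points to d_out dimensions with a Gaussian random matrix
    scaled by 1/sqrt(d_out), so that squared distances are preserved in
    expectation: pairwise distances change only a little w.h.p."""
    rng = random.Random(seed)
    d_in = len(points[0])
    R = [[rng.gauss(0.0, 1.0) / math.sqrt(d_out) for _ in range(d_in)]
         for _ in range(d_out)]
    return [[sum(r[j] * p[j] for j in range(d_in)) for r in R] for p in points]
```

Building the spill-tree on the projected d'-dimensional points instead of the original d-dimensional ones is what slide 52 calls the random-projection dimension d'.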

44 Experiments for Spill-tree: datasets.
Dataset      Num. Data (N)   Num. Dim (d)
Aerial       275,465         60
Corel_hist   20,000          64
Corel_uci    68,040          64
Disk         40,000          1024
Galaxy       40,000          3838

45 Comparison Methods: naïve k-NN, metric-tree, locality-sensitive hashing (LSH), and spill-tree.

46 Spill-tree vs. Metric-tree: measured by CPU time (s), the spill-tree achieves a 3.3x to 706x speed-up over the metric-tree.

47 Spill-tree vs. LSH: measured by CPU time (s), the spill-tree achieves a 2.5x to 31x speed-up over LSH.

48 (Roadmap: the K-NN methods taxonomy from slide 9.)

49 My Contributions.
– T. Liu, A. W. Moore, A. Gray. Efficient Exact k-NN and Nonparametric Classification in High Dimensions. NIPS 2003.
– Y. Qi, A. Hauptmann, T. Liu. Supervised Classification for Video Shot Segmentation. ICME 2003.
– T. Liu, K. Yang, A. W. Moore. The IOC Algorithm: Efficient Many-Class Non-parametric Classification for High-Dimensional Data. KDD 2004.
– T. Liu, A. W. Moore, A. Gray, K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms. NIPS 2004.

50 Related Work.
– [Uhlmann 1991, Omohundro 1991]: propose the metric-tree (ball-tree).
– [Omachi-Aso 1997]: an idea similar to KNS2 for NN classification.
– [Gionis-Indyk-Motwani 1999]: a practical approximate NN method, LSH.
– [Arya-Fu 2003]: expected-case complexity of approximate NN searching.
– [Yan-Rahul 2004]: near-duplicate detection and sub-image retrieval.
– [Indyk 1998]: approximate NN under the L∞ norm.

51 Future Work.
– Improve my previous work: self-tuning spill-tree; theoretical analysis of the spill-tree.
– Explore a new related area: dual-tree search.
– Applications in the real world.

52 Future Work (1): Self-Tuning Spill-tree. Two key parameters of the spill-tree: the random-projection dimension d' and the overlapping buffer size.

53 Benefits of automatic parameter tuning: avoid tedious hand-tuning and gain more insight into approximate NN search.

54 Future Work (2): Theoretical Analysis. The spill-tree with "I'm Feeling Lucky" search performs well in practice but so far has no theoretical guarantee.

55 Idea: when the number of points is large enough, the "I'm Feeling Lucky" search finds the true NN with high probability.

56 Idea: the overlapping buffer increases the probability of successfully finding the true NN.

57 Future Work (3): Dual-Tree Search. N-body problems [Gray-Moore 2001] (NN classification, kernel density estimation, outlier detection, two-point correlation) require pair-wise comparison of all N points: the naïve solution is O(N²); advanced solutions are based on metric-trees. Single-tree: build a tree only on the training data. Dual-tree: build trees on both the training and the query data.

58 Metric-tree: the Triangle Inequality, node-to-node version. Let q be a point inside query node Q and x a point inside training node B; the triangle inequality bounds d(q, x) using the two balls' centers and radii.

59 Pruning Opportunity [Gray-Moore 2001]. A and B are nodes from the training set; Q is a node from the test set, with centers O_A, O_B, O_Q. Prune A when D_min(Q, A) > D_max(Q, B): even the closest possible point of A is farther from every query in Q than the farthest possible point of B. In the illustrated case A cannot be pruned, but this ball-to-ball test is too pessimistic!
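The ball-to-ball bounds on this slide follow from the triangle inequality applied twice. A minimal sketch (the tuple-based node representation and the function names are assumptions for illustration):

```python
import math

def node_bounds(oq, rq, ob, rb):
    """Bounds on d(q, x) over all q in ball Q = (oq, rq) and all x in
    ball B = (ob, rb):
        D_min = max(d(O_Q, O_B) - r_Q - r_B, 0)
        D_max = d(O_Q, O_B) + r_Q + r_B"""
    d = math.dist(oq, ob)
    return max(d - rq - rb, 0.0), d + rq + rb

def can_prune(Q, A, B):
    """[Gray-Moore 2001] pruning test: training node A is irrelevant for
    every query in Q when even its closest possible point is farther than
    B's farthest possible point, i.e. D_min(Q, A) > D_max(Q, B)."""
    return node_bounds(*Q, *A)[0] > node_bounds(*Q, *B)[1]
```

One such test can eliminate node A for an entire node's worth of queries at once, which is the whole appeal of dual-tree search.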

60 More pruning opportunity. Prune A when the query node Q lies entirely on B's side of the hyperbola H determined by O_A, O_B, and r_A + r_B; A can be pruned in the illustrated case. Challenge: computing this test efficiently.

61 Future Work (4): Applications. Multimedia: video segmentation, both shot-based and story-based. Image retrieval: near-duplicate detection. Computer vision: object recognition.

62 Timeline.
– Now – Apr. 2005: dual-tree design and implementation; testing on real-world datasets.
– May – Aug. 2005: improving the spill-tree algorithm; theoretical analysis.
– Sept. – Dec. 2005: applications of the new k-NN algorithms.
– Jan. – Mar. 2006: write up the final thesis.

63 Thank you! Questions?

