CMU SCS 15-826: Multimedia Databases and Data Mining Lecture #19: SVD - part II (case studies) C. Faloutsos.

Slides:



Advertisements
Similar presentations
CMU SCS : Multimedia Databases and Data Mining Lecture #17: Text - part IV (LSI) C. Faloutsos.
Advertisements

CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - IV Grid files, dim. curse C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #4: Multi-key and Spatial Access Methods - I C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #20: SVD - part III (more case studies) C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #11: Fractals: M-trees and dim. curse (case studies – Part II) C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - Metric trees C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #10: Fractals - case studies - I C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture#5: Multi-key and Spatial Access Methods - II C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture#2: Primary key indexing – B-trees Christos Faloutsos - CMU
Dimensionality Reduction PCA -- SVD
CMU SCS : Multimedia Databases and Data Mining Lecture #11: Fractals: M-trees and dim. curse (case studies – Part II) C. Faloutsos.
15-826: Multimedia Databases and Data Mining
CMU SCS : Multimedia Databases and Data Mining Lecture #16: Text - part III: Vector space model and clustering C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #11: Fractals - case studies Part III (regions, quadtrees, knn queries) C. Faloutsos.
What is missing? Reasons that ideal effectiveness hard to achieve: 1. Users’ inability to describe queries precisely. 2. Document representation loses.
10-603/15-826A: Multimedia Databases and Data Mining SVD - part II (more case studies) C. Faloutsos.
Principal Component Analysis
Text Databases. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Data Mining Multimedia Databases Text databases Image and video.
Hongtao Cheng 1 Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences Author:Flip Korn H. V. Jagadish Christos Faloutsos From ACM.
Dimensionality Reduction
Prénom Nom Document Analysis: Data Analysis and Clustering Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.
Multimedia Databases SVD II. SVD - Detailed outline Motivation Definition - properties Interpretation Complexity Case studies SVD properties More case.
IR Models: Latent Semantic Analysis. IR Model Taxonomy Non-Overlapping Lists Proximal Nodes Structured Models U s e r T a s k Set Theoretic Fuzzy Extended.
The Terms that You Have to Know! Basis, Linear independent, Orthogonal Column space, Row space, Rank Linear combination Linear transformation Inner product.
SLIDE 1IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00.
10-603/15-826A: Multimedia Databases and Data Mining SVD - part I (definitions) C. Faloutsos.
Multimedia Databases Text II. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Text databases Image and video.
Multimedia Databases LSI and SVD. Text - Detailed outline text problem full text scanning inversion signature files clustering information filtering and.
Singular Value Decomposition and Data Management
E.G.M. PetrakisDimensionality Reduction1  Given N vectors in n dims, find the k most important axes to project them  k is user defined (k < n)  Applications:
DATA MINING LECTURE 7 Dimensionality Reduction PCA – SVD
Dimensionality Reduction
10-603/15-826A: Multimedia Databases and Data Mining Text - part II C. Faloutsos.
Dimensionality Reduction. Multimedia DBs Many multimedia applications require efficient indexing in high-dimensions (time-series, images and videos, etc)
CMU SCS : Multimedia Databases and Data Mining Lecture #30: Conclusions C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #20: SVD - part III (more case studies) C. Faloutsos.
Chapter 2 Dimensionality Reduction. Linear Methods
CMU SCS : Multimedia Databases and Data Mining Lecture #26: Compression - JPEG, MPEG, fractal C. Faloutsos.
Feature extraction 1.Introduction 2.T-test 3.Signal Noise Ratio (SNR) 4.Linear Correlation Coefficient (LCC) 5.Principle component analysis (PCA) 6.Linear.
COMMON EVALUATION FINAL PROJECT Vira Oleksyuk ECE 8110: Introduction to machine Learning and Pattern Recognition.
EMIS 8381 – Spring Netflix and Your Next Movie Night Nonlinear Programming Ron Andrews EMIS 8381.
Carnegie Mellon Powerful Tools for Data Mining Fractals, power laws, SVD C. Faloutsos Carnegie Mellon University.
Shape Analysis and Retrieval Statistical Shape Descriptors Notes courtesy of Funk et al., SIGGRAPH 2004.
CpSc 881: Information Retrieval. 2 Recall: Term-document matrix This matrix is the basis for computing the similarity between documents and queries. Today:
CMU SCS : Multimedia Databases and Data Mining Lecture #12: Fractals - case studies Part III (quadtrees, knn queries) C. Faloutsos.
June 5, 2006University of Trento1 Latent Semantic Indexing for the Routing Problem Doctorate course “Web Information Retrieval” PhD Student Irina Veredina.
SINGULAR VALUE DECOMPOSITION (SVD)
CpSc 881: Machine Learning PCA and MDS. 2 Copy Right Notice Most slides in this presentation are adopted from slides of text book and various sources.
Web Search and Text Mining Lecture 5. Outline Review of VSM More on LSI through SVD Term relatedness Probabilistic LSI.
CMU SCS : Multimedia Databases and Data Mining Lecture #30: Data Mining - assoc. rules C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #18: SVD - part I (definitions) C. Faloutsos.
Principal Component Analysis and Linear Discriminant Analysis for Feature Reduction Jieping Ye Department of Computer Science and Engineering Arizona State.
Algebraic Techniques for Analysis of Large Discrete-Valued Datasets 
CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - Metric trees C. Faloutsos.
15-826: Multimedia Databases and Data Mining
Lecture 8:Eigenfaces and Shared Features
Principal Component Analysis (PCA)
LSI, SVD and Data Management
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
Jimeng Sun · Charalampos (Babis) E
Next Generation Data Mining Tools: SVD and Fractals
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
Latent Semantic Analysis
Presentation transcript:

CMU SCS : Multimedia Databases and Data Mining Lecture #19: SVD - part II (case studies) C. Faloutsos

CMU SCS Copyright: C. Faloutsos (2012)2 Must-read Material MM Textbook Appendix DMM Textbook

CMU SCS Copyright: C. Faloutsos (2012)3 Outline Goal: ‘Find similar / interesting things’ Intro to DB Indexing - similarity search Data Mining

CMU SCS Copyright: C. Faloutsos (2012)4 Indexing - Detailed outline primary key indexing secondary key / multi-key indexing spatial access methods fractals text Singular Value Decomposition (SVD) multimedia...

CMU SCS Copyright: C. Faloutsos (2012)5 SVD - Detailed outline Motivation Definition - properties Interpretation Complexity Case studies SVD properties Conclusions

CMU SCS Copyright: C. Faloutsos (2012)6 SVD - Case studies multi-lingual IR; LSI queries compression PCA - ‘ratio rules’ Karhunen-Lowe transform query feedbacks google/Kleinberg algorithms

CMU SCS Copyright: C. Faloutsos (2012)7 Case study - LSI Q1: How to do queries with LSI? Q2: multi-lingual IR (english query, on spanish text?)

CMU SCS Copyright: C. Faloutsos (2012)8 Case study - LSI Q1: How to do queries with LSI? Problem: Eg., find documents with ‘data’ data inf. retrieval brain lung = CS MD xx

CMU SCS Copyright: C. Faloutsos (2012)9 Case study - LSI Q1: How to do queries with LSI? A: map query vectors into ‘concept space’ – how? data inf. retrieval brain lung = CS MD xx

CMU SCS Copyright: C. Faloutsos (2012)10 Case study - LSI Q1: How to do queries with LSI? A: map query vectors into ‘concept space’ – how? data inf. retrieval brain lung q=q= term1 term2 v1 q v2

CMU SCS Copyright: C. Faloutsos (2012)11 Case study - LSI Q1: How to do queries with LSI? A: map query vectors into ‘concept space’ – how? data inf. retrieval brain lung q=q= term1 term2 v1 q v2 A: inner product (cosine similarity) with each ‘concept’ vector v i

CMU SCS Copyright: C. Faloutsos (2012)12 Case study - LSI Q1: How to do queries with LSI? A: map query vectors into ‘concept space’ – how? data inf. retrieval brain lung q=q= term1 term2 v1 q v2 A: inner product (cosine similarity) with each ‘concept’ vector v i q o v1

CMU SCS Copyright: C. Faloutsos (2012)13 Case study - LSI compactly, we have: q V= q concept Eg: data inf. retrieval brain lung q=q= term-to-concept similarities = CS-concept

CMU SCS Copyright: C. Faloutsos (2012)14 Case study - LSI Drill: how would the document (‘information’, ‘retrieval’) be handled by LSI?

CMU SCS Copyright: C. Faloutsos (2012)15 Case study - LSI Drill: how would the document (‘information’, ‘retrieval’) be handled by LSI? A: SAME: d concept = d V Eg: data inf. retrieval brain lung d= term-to-concept similarities = CS-concept

CMU SCS Copyright: C. Faloutsos (2012)16 Case study - LSI Observation: document (‘information’, ‘retrieval’) will be retrieved by query (‘data’), although it does not contain ‘data’!! data inf. retrieval brain lung d= CS-concept q=q=

CMU SCS Copyright: C. Faloutsos (2012)17 Case study - LSI Q1: How to do queries with LSI? Q2: multi-lingual IR (english query, on spanish text?)

CMU SCS Copyright: C. Faloutsos (2012)18 Case study - LSI Problem: –given many documents, translated to both languages (eg., English and Spanish) –answer queries across languages

CMU SCS Copyright: C. Faloutsos (2012)19 Case study - LSI Solution: ~ LSI data inf. retrieval brain lung CS MD datos informacion

CMU SCS Copyright: C. Faloutsos (2012)20 SVD - Case studies multi-lingual IR; LSI queries compression PCA - ‘ratio rules’ Karhunen-Lowe transform query feedbacks google/Kleinberg algorithms

CMU SCS Copyright: C. Faloutsos (2012)21 Case study: compression [Korn+97] Problem: given a matrix compress it, but maintain ‘random access’ (surprisingly, its solution leads to data mining and visualization...)

CMU SCS Copyright: C. Faloutsos (2012)22 Problem - specs ~10**6 rows; ~10**3 columns; no updates; random access to any cell(s) ; small error: OK

CMU SCS Copyright: C. Faloutsos (2012)23 Idea

CMU SCS Copyright: C. Faloutsos (2012)24 SVD - reminder space savings: 2:1 minimum RMS error first singular vector

CMU SCS Copyright: C. Faloutsos (2012)25 Case study: compression outliers? A: treat separately (SVD with ‘Deltas’) first singular vector

CMU SCS Copyright: C. Faloutsos (2012)26 Compression - Performance 3 pass algo (-> scalability) (HOW?) random cell(s) reconstruction 10:1 compression with < 2% error

CMU SCS Copyright: C. Faloutsos (2012)27 Performance - scaleup space error

CMU SCS Copyright: C. Faloutsos (2012)28 Compression - Visualization no Gaussian clusters; Zipf-like distribution

CMU SCS Copyright: C. Faloutsos (2012)29 SVD - Case studies multi-lingual IR; LSI queries compression PCA - ‘ratio rules’ Karhunen-Lowe transform query feedbacks google/Kleinberg algorithms

CMU SCS Copyright: C. Faloutsos (2012)30 PCA - ‘Ratio Rules’ [Korn+00] Typically: ‘Association Rules’ (eg., {bread, milk} -> {butter} But: –which set of rules is ‘better’? –how to reconstruct missing/corrupted values? –need binary/bucketized values

CMU SCS Copyright: C. Faloutsos (2012)31 PCA - ‘Ratio Rules’ Idea: try to find ‘concepts’: singular vectors dictate rules about ratios: bread:milk:butter = 2:4:3 $ on bread $ on milk $ on butter 2 3 4

CMU SCS Copyright: C. Faloutsos (2012)32 PCA - ‘Ratio Rules’ Identical to PCA = Principal Components Analysis –Q1: which set of rules is ‘better’? –Q2: how to reconstruct missing/corrupted values? –Q3: is there need for binary/bucketized values? –Q4: how to interpret the rules (= ‘principal components’)?

CMU SCS Copyright: C. Faloutsos (2012)33 PCA - ‘Ratio Rules’ Q2: how to reconstruct missing/corrupted values? Eg: rule: bread:milk = 3:4 a customer spent $6 on bread - how about milk?

CMU SCS Copyright: C. Faloutsos (2012)34 PCA - ‘Ratio Rules’ pictorially:

CMU SCS Copyright: C. Faloutsos (2012)35 PCA - ‘Ratio Rules’ harder cases: overspecified/underspecified over-specified: milk:bread:butter = 1:2:3 a customer got $2 bread and $4 milk how much milk? Answer: minimize distance between ‘feasible’ and ‘expected’ values (using SVD...)

CMU SCS Copyright: C. Faloutsos (2012)36 PCA - ‘Ratio Rules’ harder cases: underspecified

CMU SCS Copyright: C. Faloutsos (2012)37 PCA - ‘Ratio Rules’ bottom line: we can reconstruct any count of missing values This is very useful: can spot outliers (how?) can measure the ‘goodness’ of a set of rules (how?)

CMU SCS Copyright: C. Faloutsos (2012)38 PCA - ‘Ratio Rules’ Identical to PCA = Principal Components Analysis –Q1: which set of rules is ‘better’? –Q2: how to reconstruct missing/corrupted values? –Q3: is there need for binary/bucketized values? –Q4: how to interpret the rules (= ‘principal components’)?

CMU SCS Copyright: C. Faloutsos (2012)39 PCA - ‘Ratio Rules’ Q1: which set of rules is ‘better’? A: the ones that needs the fewest outliers: –pretend we don’t know a value (eg., $ of ‘Smith’ on ‘bread’) –reconstruct it –and sum up the squared errors, for all our entries (other answers are also reasonable)

CMU SCS Copyright: C. Faloutsos (2012)40 PCA - ‘Ratio Rules’ Identical to PCA = Principal Components Analysis –Q1: which set of rules is ‘better’? –Q2: how to reconstruct missing/corrupted values? –Q3: is there need for binary/bucketized values? –Q4: how to interpret the rules (= ‘principal components’)?

CMU SCS Copyright: C. Faloutsos (2012)41 PCA - ‘Ratio Rules’ Identical to PCA = Principal Components Analysis –Q1: which set of rules is ‘better’? –Q2: how to reconstruct missing/corrupted values? –Q3: is there need for binary/bucketized values? –Q4: how to interpret the rules (= ‘principal components’)? NO

CMU SCS Copyright: C. Faloutsos (2012)42 PCA - Ratio Rules NBA dataset ~500 players; ~30 attributes

CMU SCS Copyright: C. Faloutsos (2012)43 PCA - Ratio Rules PCA: get singular vectors v1, v2,... ignore entries with small abs. value try to interpret the rest

CMU SCS Copyright: C. Faloutsos (2012)44 PCA - Ratio Rules NBA dataset - V matrix (term to ‘concept’ similarities) v1

CMU SCS Copyright: C. Faloutsos (2012)45 Ratio Rules - example RR1: minutes:points = 2:1 corresponding concept? v1

CMU SCS Copyright: C. Faloutsos (2012)46 Ratio Rules - example RR1: minutes:points = 2:1 corresponding concept? v1

CMU SCS Copyright: C. Faloutsos (2012)47 Ratio Rules - example RR1: minutes:points = 2:1 corresponding concept? v1

CMU SCS Copyright: C. Faloutsos (2012)48 Ratio Rules - example RR1: minutes:points = 2:1 corresponding concept? A: ‘goodness’ of player

CMU SCS Copyright: C. Faloutsos (2012)49 Ratio Rules - example RR2: points: rebounds negatively correlated(!)

CMU SCS Copyright: C. Faloutsos (2012)50 Ratio Rules - example RR2: points: rebounds negatively correlated(!) - concept? v2

CMU SCS Copyright: C. Faloutsos (2012)51 Ratio Rules - example RR2: points: rebounds negatively correlated(!) - concept? A: position: offensive/defensive

CMU SCS Copyright: C. Faloutsos (2012)52 SVD - Case studies multi-lingual IR; LSI queries compression PCA - ‘ratio rules’ Karhunen-Lowe transform query feedbacks google/Kleinberg algorithms

CMU SCS Copyright: C. Faloutsos (2012)53 K-L transform [Duda & Hart]; [Fukunaga] A subtle point: SVD will give vectors that go through the origin v1

CMU SCS Copyright: C. Faloutsos (2012)54 K-L transform A subtle point: SVD will give vectors that go through the origin Q: how to find v 1 ’ ? v1 v1’v1’

CMU SCS Copyright: C. Faloutsos (2012)55 K-L transform A subtle point: SVD will give vectors that go through the origin Q: how to find v 1 ’ ? A: ‘centered’ PCA, ie., move the origin to center of gravity v1 v1’v1’

CMU SCS Copyright: C. Faloutsos (2012)56 K-L transform A subtle point: SVD will give vectors that go through the origin Q: how to find v 1 ’ ? A: ‘centered’ PCA, ie., move the origin to center of gravity and THEN do SVD v1’v1’

CMU SCS Copyright: C. Faloutsos (2012)57 K-L transform How to ‘center’ a set of vectors (= data matrix)? What is the covariance matrix? A: see textbook (‘whitening transformation’)

CMU SCS Copyright: C. Faloutsos (2012)58 Conclusions SVD: popular for dimensionality reduction / compression SVD is the ‘engine under the hood’ for PCA (principal component analysis)... as well as the Karhunen-Lowe transform (and there is more to come...)

CMU SCS Copyright: C. Faloutsos (2012)59 References Duda, R. O. and P. E. Hart (1973). Pattern Classification and Scene Analysis. New York, Wiley. Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition, Academic Press. Jolliffe, I. T. (1986). Principal Component Analysis, Springer Verlag.

CMU SCS Copyright: C. Faloutsos (2012)60 References Korn, F., H. V. Jagadish, et al. (May 13-15, 1997). Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences. ACM SIGMOD, Tucson, AZ. Korn, F., A. Labrinidis, et al. (1998). Ratio Rules: A New Paradigm for Fast, Quantifiable Data Mining. VLDB, New York, NY.

CMU SCS Copyright: C. Faloutsos (2012)61 References Korn, F., A. Labrinidis, et al. (2000). "Quantifiable Data Mining Using Ratio Rules." VLDB Journal 8(3-4): Press, W. H., S. A. Teukolsky, et al. (1992). Numerical Recipes in C, Cambridge University Press.