10-603/15-826A: Multimedia Databases and Data Mining SVD - part II (more case studies) C. Faloutsos.


1 10-603/15-826A: Multimedia Databases and Data Mining SVD - part II (more case studies) C. Faloutsos

2 Multi DB and D.M. Copyright: C. Faloutsos (2001). Outline. Goal: ‘Find similar / interesting things’. Intro to DB; Indexing - similarity search; Data Mining.

3 Indexing - Detailed outline: primary key indexing; secondary key / multi-key indexing; spatial access methods; fractals; text; Singular Value Decomposition (SVD); multimedia; ...

4 SVD - Detailed outline: Motivation; Definition - properties; Interpretation; Complexity; Case studies; SVD properties; More case studies; Conclusions.

5 SVD - detailed outline: ...; Case studies; SVD properties; more case studies (google/Kleinberg algorithms; query feedbacks); Conclusions.

6 SVD - Other properties - summary: can produce an orthogonal basis (obvious) (who cares?); can solve over- and under-determined linear problems (see property C(1)); can compute ‘fixed points’ (= ‘steady state prob. in Markov chains’) (see property C(4)).

7 SVD - outline of properties: (A): obvious; (B): less obvious; (C): least obvious (and most powerful!).

8 Properties - by definition:
A(0): $A_{[n \times m]} = U_{[n \times r]} \, \Lambda_{[r \times r]} \, V^T_{[r \times m]}$
A(1): $U^T_{[r \times n]} \, U_{[n \times r]} = I_{[r \times r]}$ (identity matrix)
A(2): $V^T_{[r \times m]} \, V_{[m \times r]} = I_{[r \times r]}$
A(3): $\Lambda^k = \mathrm{diag}(\lambda_1^k, \lambda_2^k, \ldots, \lambda_r^k)$ (k: ANY real number)
A(4): $A^T = V \Lambda U^T$

9 Less obvious properties:
A(0): $A_{[n \times m]} = U_{[n \times r]} \, \Lambda_{[r \times r]} \, V^T_{[r \times m]}$
B(1): $A_{[n \times m]} \, (A^T)_{[m \times n]}$ = ??

10 Less obvious properties:
A(0): $A_{[n \times m]} = U_{[n \times r]} \, \Lambda_{[r \times r]} \, V^T_{[r \times m]}$
B(1): $A_{[n \times m]} \, (A^T)_{[m \times n]} = U \Lambda^2 U^T$; symmetric. Intuition?

11 Less obvious properties:
A(0): $A_{[n \times m]} = U_{[n \times r]} \, \Lambda_{[r \times r]} \, V^T_{[r \times m]}$
B(1): $A_{[n \times m]} \, (A^T)_{[m \times n]} = U \Lambda^2 U^T$; symmetric. Intuition? ‘document-to-document’ similarity matrix
B(2): symmetrically, for ‘V’: $(A^T)_{[m \times n]} \, A_{[n \times m]} = V \Lambda^2 V^T$. Intuition?

12 Less obvious properties:
Answer (intuition for B(2)): ‘term-to-term’ similarity matrix
B(3): $\left( (A^T)_{[m \times n]} \, A_{[n \times m]} \right)^k = V \Lambda^{2k} V^T$
and B(4): $(A^T A)^k \approx v_1 \lambda_1^{2k} v_1^T$ for $k \gg 1$, where $v_1$: [m x 1] first column (eigenvector) of V; $\lambda_1$: strongest eigenvalue
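
The B properties follow from the definitional properties A(0)-A(4) in a line each. A short derivation sketch, writing $\Lambda$ for the diagonal matrix of A(0) and $v_i$ for the i-th column of V, and assuming that k is a positive integer and that $\lambda_1$ is strictly larger than the other $\lambda_i$ (this is needed only for the final approximation):

```latex
\begin{align*}
A A^T &= (U \Lambda V^T)(V \Lambda U^T) = U \Lambda (V^T V) \Lambda U^T = U \Lambda^2 U^T
  && \text{B(1), using A(4) and A(2)} \\
A^T A &= (V \Lambda U^T)(U \Lambda V^T) = V \Lambda (U^T U) \Lambda V^T = V \Lambda^2 V^T
  && \text{B(2), using A(4) and A(1)} \\
(A^T A)^k &= (V \Lambda^2 V^T)(V \Lambda^2 V^T) \cdots (V \Lambda^2 V^T) = V \Lambda^{2k} V^T
  && \text{B(3), using A(2) and A(3)} \\
V \Lambda^{2k} V^T &= \sum_{i=1}^{r} \lambda_i^{2k} \, v_i v_i^T \approx \lambda_1^{2k} \, v_1 v_1^T
  \quad (k \gg 1)
  && \text{B(4)}
\end{align*}
```

The last step holds because $(\lambda_i / \lambda_1)^{2k}$ shrinks towards 0 for every $i > 1$ as k grows.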

13 Less obvious properties:
B(4): $(A^T A)^k \approx v_1 \lambda_1^{2k} v_1^T$ for $k \gg 1$
B(5): $(A^T A)^k v' \approx (\text{constant}) \, v_1$
i.e., for (almost) any $v'$, it converges to a vector parallel to $v_1$. Thus, useful to compute the first eigenvector/value (as well as the next ones, too...)

14 Less obvious properties - repeated:
A(0): $A_{[n \times m]} = U_{[n \times r]} \, \Lambda_{[r \times r]} \, V^T_{[r \times m]}$
B(1): $A_{[n \times m]} \, (A^T)_{[m \times n]} = U \Lambda^2 U^T$
B(2): $(A^T)_{[m \times n]} \, A_{[n \times m]} = V \Lambda^2 V^T$
B(3): $\left( (A^T)_{[m \times n]} \, A_{[n \times m]} \right)^k = V \Lambda^{2k} V^T$
B(4): $(A^T A)^k \approx v_1 \lambda_1^{2k} v_1^T$
B(5): $(A^T A)^k v' \approx (\text{constant}) \, v_1$
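
Property B(5) is what is usually called power iteration. A minimal sketch in plain Python, assuming a small hypothetical 3 x 2 matrix `A` and a hypothetical starting vector; the helper names are illustrative and not from the slides. The vector is renormalized after every step, which only changes the ‘(constant)’ in B(5):

```python
import math

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def normalize(x):
    length = math.sqrt(sum(v * v for v in x))
    return [v / length for v in x]

def power_iteration(A, v, steps=100):
    """Apply (A^T A) to v repeatedly; by B(5) the result lines up with v_1."""
    At = transpose(A)
    for _ in range(steps):
        v = normalize(matvec(At, matvec(A, v)))
    return v

# Hypothetical example matrix (not from the slides); here A^T A = [[5, 1], [1, 2]].
A = [[2.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
v1 = power_iteration(A, [1.0, 0.0])

# Estimate of lambda_1^2: since v1 has unit length, it is v1 . (A^T A v1).
AtA_v1 = matvec(transpose(A), matvec(A, v1))
lambda1_squared = sum(a * b for a, b in zip(AtA_v1, v1))
```

The starting vector must not be exactly orthogonal to $v_1$, which is the ‘(almost) any’ caveat in B(5).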

15 Least obvious properties:
A(0): $A_{[n \times m]} = U_{[n \times r]} \, \Lambda_{[r \times r]} \, V^T_{[r \times m]}$
C(1): $A_{[n \times m]} \, x_{[m \times 1]} = b_{[n \times 1]}$; let $x_0 = V \Lambda^{-1} U^T b$
if under-specified, $x_0$ gives the ‘shortest’ solution
if over-specified, it gives the ‘solution’ with the smallest least-squares error (see Num. Recipes, p. 62)

16 Least obvious properties. Illustration: under-specified, e.g., $[1 \;\; 2] \, [w \;\; z]^T = 4$ (i.e., $1 w + 2 z = 4$)
[Figure: w-z plane with the line of all possible solutions; $x_0$ is the shortest-length solution]

17 Least obvious properties. Illustration: over-specified, e.g., $[3 \;\; 2]^T \, [w] = [1 \;\; 2]^T$ (i.e., $3 w = 1$; $2 w = 2$)
[Figure: the reachable points (3w, 2w) and the desirable point b]
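
The two illustrations can be worked through numerically. A minimal sketch in plain Python, relying on the fact that a matrix with a single row or a single column has a closed-form SVD (one singular value, equal to the length of that row or column); the function names are hypothetical. Both functions evaluate $x_0 = V \Lambda^{-1} U^T b$ from C(1), where $\Lambda$ is the diagonal matrix of A(0):

```python
import math

def length(v):
    return math.sqrt(sum(x * x for x in v))

def shortest_solution(a, b):
    """Under-specified case: one equation a . x = b, several unknowns.
    SVD of the single-row matrix [a]: U = [1], Lambda = [|a|], V = a / |a|."""
    lam = length(a)
    v = [x / lam for x in a]
    # x0 = V * Lambda^-1 * U^T * b, with U^T b = b because U = [1]
    return [vi * (1.0 / lam) * b for vi in v]

def least_squares_solution(c, b):
    """Over-specified case: several equations c[i] * w = b[i], one unknown.
    SVD of the single-column matrix c: U = c / |c|, Lambda = [|c|], V = [1]."""
    lam = length(c)
    u = [x / lam for x in c]
    ut_b = sum(ui * bi for ui, bi in zip(u, b))
    return (1.0 / lam) * ut_b

x0 = shortest_solution([1.0, 2.0], 4.0)                  # 1 w + 2 z = 4
w0 = least_squares_solution([3.0, 2.0], [1.0, 2.0])      # 3 w = 1; 2 w = 2
```

For the first system this gives (w, z) = (0.8, 1.6), which satisfies the equation and is shorter than, say, the solution (4, 0). For the second it gives w = 7/13, the value for which (3w, 2w) is closest to b = (1, 2).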

18 Least obvious properties - cont’d:
A(0): $A_{[n \times m]} = U_{[n \times r]} \, \Lambda_{[r \times r]} \, V^T_{[r \times m]}$
C(2): $A_{[n \times m]} \, (v_1)_{[m \times 1]} = \lambda_1 \, (u_1)_{[n \times 1]}$, where $v_1$, $u_1$ are the first (column) vectors of V, U ($v_1$ == right-eigenvector)
C(3): symmetrically: $u_1^T A = \lambda_1 v_1^T$ ($u_1$ == left-eigenvector)
Therefore:

19 Least obvious properties - cont’d:
A(0): $A_{[n \times m]} = U_{[n \times r]} \, \Lambda_{[r \times r]} \, V^T_{[r \times m]}$
C(4): $A^T A \, v_1 = \lambda_1^2 \, v_1$ (fixed point - the usual definition of an eigenvector for a symmetric matrix)

20 Least obvious properties - altogether:
A(0): $A_{[n \times m]} = U_{[n \times r]} \, \Lambda_{[r \times r]} \, V^T_{[r \times m]}$
C(1): $A_{[n \times m]} \, x_{[m \times 1]} = b_{[n \times 1]}$; then, $x_0 = V \Lambda^{-1} U^T b$: shortest, actual or least-squares solution
C(2): $A_{[n \times m]} \, (v_1)_{[m \times 1]} = \lambda_1 \, (u_1)_{[n \times 1]}$
C(3): $u_1^T A = \lambda_1 v_1^T$
C(4): $A^T A \, v_1 = \lambda_1^2 \, v_1$

21 Properties - conclusions:
A(0): $A_{[n \times m]} = U_{[n \times r]} \, \Lambda_{[r \times r]} \, V^T_{[r \times m]}$
B(5): $(A^T A)^k v' \approx (\text{constant}) \, v_1$
C(1): $A_{[n \times m]} \, x_{[m \times 1]} = b_{[n \times 1]}$; then, $x_0 = V \Lambda^{-1} U^T b$: shortest, actual or least-squares solution
C(4): $A^T A \, v_1 = \lambda_1^2 \, v_1$

22 SVD - detailed outline: ...; Case studies; SVD properties; more case studies (Kleinberg/google algorithms; query feedbacks); Conclusions.

23 Kleinberg’s algorithm. Problem definition: given the web and a query, find the most ‘authoritative’ web pages for this query. Step 0: find all pages containing the query terms. Step 1: expand by one move forward and backward.

24 Kleinberg’s algorithm. Step 1: expand by one move forward and backward.

25 Kleinberg’s algorithm. On the resulting graph, give a high score (= ‘authorities’) to nodes that many important nodes point to; give a high importance score (‘hubs’) to nodes that point to good ‘authorities’. [Figure labels: hubs, authorities]

26 Kleinberg’s algorithm - observations: recursive definition! Each node (say, the i-th node) has both an authoritativeness score $a_i$ and a hubness score $h_i$.

27 Kleinberg’s algorithm. Let E be the set of edges and A be the adjacency matrix: the (i,j) entry is 1 if the edge from i to j exists. Let h and a be [n x 1] vectors with the ‘hubness’ and ‘authoritativeness’ scores. Then:

28 Kleinberg’s algorithm. Then: $a_i = h_k + h_l + h_m$, that is, $a_i = \sum_j h_j$ over all j such that the edge (j,i) exists, or $a = A^T h$. [Figure: nodes k, l, m and node i]

29 Kleinberg’s algorithm. Symmetrically, for the ‘hubness’: $h_i = a_n + a_p + a_q$, that is, $h_i = \sum_j a_j$ over all j such that the edge (i,j) exists, or $h = A a$. [Figure: node i and nodes n, p, q]

30 Kleinberg’s algorithm. In conclusion, we want vectors h and a such that: $h = A a$ and $a = A^T h$. Recall properties:
C(2): $A_{[n \times m]} \, (v_1)_{[m \times 1]} = \lambda_1 \, (u_1)_{[n \times 1]}$
C(3): $u_1^T A = \lambda_1 v_1^T$

31 Kleinberg’s algorithm. In short, the solutions to $h = A a$, $a = A^T h$ are the left- and right-eigenvectors of the adjacency matrix A. Starting from a random a’ and iterating, we’ll eventually converge (Q: to which of all the eigenvectors? why?)

32 Kleinberg’s algorithm. (Q: to which of all the eigenvectors? why?) A: to the ones of the strongest eigenvalue, because of property B(5): $(A^T A)^k v' \approx (\text{constant}) \, v_1$
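
The iteration described on these slides can be sketched in a few lines of plain Python. This is an illustrative sketch, not Kleinberg’s published pseudocode: the function name `hits`, the toy edge list, and the choice to rescale both vectors to unit length after each step are assumptions made here (the rescaling only absorbs the constant factor in C(2) and C(3)).

```python
import math

def normalize(x):
    length = math.sqrt(sum(v * v for v in x))
    return [v / length for v in x]

def hits(edges, n, steps=50):
    """Alternate a = A^T h and h = A a on a graph given as (source, target) pairs."""
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(steps):
        # a_i = sum of h_j over all edges (j, i), i.e. a = A^T h
        new_auths = [0.0] * n
        for src, dst in edges:
            new_auths[dst] += hubs[src]
        auths = normalize(new_auths)
        # h_i = sum of a_j over all edges (i, j), i.e. h = A a
        new_hubs = [0.0] * n
        for src, dst in edges:
            new_hubs[src] += auths[dst]
        hubs = normalize(new_hubs)
    return hubs, auths

# Hypothetical toy graph: pages 0, 1, 2 link to page 3; page 2 also links to page 4.
edges = [(0, 3), (1, 3), (2, 3), (2, 4)]
hubs, auths = hits(edges, 5)
```

On this toy graph page 3 ends up with the highest authority score (three pages point to it) and page 2 with the highest hub score (it points to both authorities).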

33 Kleinberg’s algorithm - results. E.g., for the query ‘java’: 0.328 www.gamelan.com; 0.251 java.sun.com; 0.190 www.digitalfocus.com (“the java developer”)

34 Kleinberg’s algorithm - discussion: the ‘authority’ score can be used to find ‘similar pages’ (how?); closely related to ‘citation analysis’, social networks / ‘small world’ phenomena.

35 google/page-rank algorithm. Closely related: imagine a particle randomly moving along the edges (*); compute its steady-state probabilities. (*) with occasional random jumps

36 google/page-rank algorithm. ~identical problem: given a Markov Chain, compute the steady-state probabilities p1 ... p5. [Figure: graph with nodes 1-5]

37 google/page-rank algorithm. Let A be the transition matrix (= adjacency matrix, row-normalized: sum of each row = 1). [Figure: graph with nodes 1-5 and its transition matrix]

38 google/page-rank algorithm. $A^T p = p$ (equivalently, $p^T A = p^T$, since A is row-normalized). [Figure: graph with nodes 1-5 and the corresponding matrix equation]

39 google/page-rank algorithm. $A^T p = p$; thus, p is the eigenvector of $A^T$ that corresponds to the highest eigenvalue (= 1, since the matrix is row-normalized)
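
The steady state can be approximated by repeatedly pushing a probability vector through the transition matrix. A minimal sketch in plain Python, not the production PageRank algorithm: the function names, the 3-node example graph and the `jump` parameter (the probability of an ‘occasional random jump’ to a uniformly chosen node) are assumptions made for illustration. It also assumes every node has at least one outgoing edge.

```python
def transition_matrix(adj):
    """Row-normalize an adjacency matrix so that each row sums to 1.
    Assumes every row has at least one non-zero entry."""
    return [[x / sum(row) for x in row] for row in adj]

def steady_state(adj, jump=0.0, steps=200):
    """Iterate p <- (1 - jump) * A^T p + jump / n, starting from the uniform vector."""
    A = transition_matrix(adj)
    n = len(A)
    p = [1.0 / n] * n
    for _ in range(steps):
        # walk[j] = sum_i p[i] * A[i][j]: probability of arriving at j by following an edge
        walk = [sum(p[i] * A[i][j] for i in range(n)) for j in range(n)]
        p = [(1.0 - jump) * w + jump / n for w in walk]
    return p

# Hypothetical 3-node graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
adj = [[0, 1, 1],
       [0, 0, 1],
       [1, 0, 0]]
p = steady_state(adj)
p_with_jumps = steady_state(adj, jump=0.15)
```

Without jumps, this graph has steady-state probabilities (0.4, 0.2, 0.4): node 1 is reached only half of the time the particle leaves node 0, while nodes 0 and 2 feed each other.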

40 Kleinberg/google - conclusions. SVD helps in graph analysis: hub/authority scores: strongest left- and right-eigenvectors of the adjacency matrix; random walk on a graph: steady-state probabilities are given by the strongest (left) eigenvector of the transition matrix.

41 SVD - detailed outline: ...; Case studies; SVD properties; more case studies (google/Kleinberg algorithms; query feedbacks); Conclusions.

42 Query feedbacks [Chen & Roussopoulos, SIGMOD 94]. Sample problem: estimate selectivities (e.g., ‘how many movies were made between 1940 and 1945?’) for query optimization, LEARNING from the query results so far!!

43 Query feedbacks. Idea #1: consider a function for the CDF (cumulative distribution function), e.g., a 6th-degree polynomial (or splines, or whatever). [Figure: count so far vs. year]

44 Query feedbacks. GREAT Idea #2: adapt your model, as you see the actual counts of the actual queries. [Figure: count so far vs. year; curves: actual, original estimate]

45 Query feedbacks. [Figure: count so far vs. year; curves: actual, original estimate]

46 Query feedbacks. [Figure: count so far vs. year; curves: actual, original estimate; a query is marked]

47 Query feedbacks. [Figure: count so far vs. year; curves: actual, original estimate, new estimate]

48 Query feedbacks. Eventually, the problem becomes: estimate the parameters $a_1, \ldots, a_7$ of the model, to minimize the least-squares error from the real answers so far. Formally:

49 Query feedbacks. Formally, with n queries and 6th-degree polynomials: [Equation shown as a figure]

50 Query feedbacks. where $x_{i,j}$ are such that $\sum_i x_{i,j} \, a_i$ = our estimate for the # of movies, and $b_j$: the actual. [Equation shown as a figure]

51 Query feedbacks. In matrix form: $X a = b$ (rows run from the 1st query to the n-th query)

52 Query feedbacks. In matrix form: $X a = b$, and the least-squares estimate for a is $a = V \Lambda^{-1} U^T b$, according to property C(1) (let $X = U \Lambda V^T$)

53 Query feedbacks - enhancements. The solution $a = V \Lambda^{-1} U^T b$ works, but needs an expensive SVD each time a new query arrives. GREAT Idea #3: use ‘Recursive Least Squares’ to adapt a incrementally. Details: in paper - intuition:

54 Query feedbacks - enhancements. Intuition: [Figure: b vs. x; least-squares fit $a_1 x + a_2$]

55 Query feedbacks - enhancements. Intuition: [Figure: b vs. x; least-squares fit $a_1 x + a_2$; a new query arrives]

56 Query feedbacks - enhancements. Intuition: [Figure: b vs. x; least-squares fit $a_1 x + a_2$; new query; updated fit $a'_1 x + a'_2$]

57 Query feedbacks - enhancements. The new coefficients can be quickly computed from the old ones, plus statistics in a (7x7) matrix (no need to know the details, although RLS is a brilliant method).

58 Query feedbacks - enhancements. GREAT idea #4: ‘forgetting’ factor - we can even down-play the weight of older queries, since the data distribution might have changed (comes for ‘free’ with RLS...).
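
For readers who do want the details, the textbook form of Recursive Least Squares is short. The sketch below uses the standard RLS update with a forgetting factor, applied to the two-parameter line $a_1 x + a_2$ from the intuition slides rather than to the 7-parameter polynomial; it is not taken from the Chen & Roussopoulos paper. The function names, the initial value `delta` and the synthetic data are assumptions. The matrix `P` plays the role of the small ‘statistics’ matrix (2x2 here, 7x7 for the polynomial).

```python
def rls_update(theta, P, phi, y, forget=1.0):
    """One RLS step: fold observation (phi, y) into coefficients theta.
    P is kept symmetric; forget < 1 down-weights older observations."""
    d = len(theta)
    P_phi = [sum(P[i][j] * phi[j] for j in range(d)) for i in range(d)]
    denom = forget + sum(phi[i] * P_phi[i] for i in range(d))
    gain = [v / denom for v in P_phi]
    error = y - sum(phi[i] * theta[i] for i in range(d))
    theta = [theta[i] + gain[i] * error for i in range(d)]
    P = [[(P[i][j] - gain[i] * P_phi[j]) / forget for j in range(d)]
         for i in range(d)]
    return theta, P

def fit_stream(points, forget=1.0, delta=1e6):
    """Fit b ~ a1 * x + a2 to a stream of (x, b) pairs, one pair at a time."""
    theta = [0.0, 0.0]                    # [a1, a2], initial guess
    P = [[delta, 0.0], [0.0, delta]]      # large delta = weak prior on theta
    for x, b in points:
        theta, P = rls_update(theta, P, [x, 1.0], b, forget)
    return theta

# Synthetic stream: the data follow b = 2x + 1 at first, then switch to b = -x + 5.
old = [(float(i % 10), 2.0 * (i % 10) + 1.0) for i in range(20)]
new = [(float(i % 10), -1.0 * (i % 10) + 5.0) for i in range(20)]

fit_old = fit_stream(old[:10])                      # close to [2, 1]
fit_no_forget = fit_stream(old + new)               # a compromise between both lines
fit_forget = fit_stream(old + new, forget=0.5)      # close to [-1, 5]
```

Each update costs a handful of operations on a small matrix, independent of how many queries have been seen, which is the advantage over recomputing the SVD. With `forget=0.5` the weight of an observation halves with every newer one, so the fit tracks the new line quickly.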

59 Query feedbacks - conclusions. SVD helps find the Least Squares solution, to adapt to query feedbacks (RLS = Recursive Least Squares is a great method to incrementally update least-squares fits).

60 SVD - detailed outline: ...; Case studies; SVD properties; more case studies (google/Kleinberg algorithms; query feedbacks); Conclusions.

61 Conclusions. SVD: a valuable tool. Given a document-term matrix, it finds ‘concepts’ (LSI) ... and can reduce dimensionality (KL) ... and can find rules (PCA; RatioRules)

62 Conclusions cont’d. ... and can find fixed points or steady-state probabilities (google / Kleinberg / Markov Chains) ... and can optimally solve over- and under-constrained linear systems (least squares / query feedbacks)

63 References. Brin, S. and L. Page (1998). Anatomy of a Large-Scale Hypertextual Web Search Engine. 7th Intl World Wide Web Conf. Chen, C. M. and N. Roussopoulos (May 1994). Adaptive Selectivity Estimation Using Query Feedback. Proc. of the ACM-SIGMOD, Minneapolis, MN.

64 References cont’d. Kleinberg, J. (1998). Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms. Press, W. H., S. A. Teukolsky, et al. (1992). Numerical Recipes in C, Cambridge University Press.

