Presentation is loading. Please wait.

Presentation is loading. Please wait.

10-603/15-826A: Multimedia Databases and Data Mining Text - part II C. Faloutsos.

Similar presentations


Presentation on theme: "10-603/15-826A: Multimedia Databases and Data Mining Text - part II C. Faloutsos."— Presentation transcript:

1 10-603/15-826A: Multimedia Databases and Data Mining Text - part II C. Faloutsos

2 Multi DB and D.M.Copyright: C. Faloutsos (2001)2 Outline Goal: ‘Find similar / interesting things’ Intro to DB Indexing - similarity search Data Mining

3 Multi DB and D.M.Copyright: C. Faloutsos (2001)3 Indexing - Detailed outline primary key indexing secondary key / multi-key indexing spatial access methods fractals text multimedia...

4 Multi DB and D.M.Copyright: C. Faloutsos (2001)4 Text - Detailed outline text –problem –full text scanning –inversion –signature files –clustering –information filtering and LSI

5 Multi DB and D.M.Copyright: C. Faloutsos (2001)5 Vector Space Model and Clustering keyword queries (vs Boolean) each document: -> vector (HOW?) each query: -> vector search for ‘similar’ vectors

6 Multi DB and D.M.Copyright: C. Faloutsos (2001)6 Vector Space Model and Clustering main idea: document...data... aaron zoo data V (= vocabulary size) ‘indexing’

7 Multi DB and D.M.Copyright: C. Faloutsos (2001)7 Vector Space Model and Clustering Then, group nearby vectors together Q1: cluster search? Q2: cluster generation? Two significant contributions ranked output relevance feedback

8 Multi DB and D.M.Copyright: C. Faloutsos (2001)8 Vector Space Model and Clustering cluster search: visit the (k) closest superclusters; continue recursively CS TRs MD TRs

9 Multi DB and D.M.Copyright: C. Faloutsos (2001)9 Vector Space Model and Clustering ranked output: easy! CS TRs MD TRs

10 Multi DB and D.M.Copyright: C. Faloutsos (2001)10 Vector Space Model and Clustering relevance feedback (brilliant idea) [Roccio’73] CS TRs MD TRs

11 Multi DB and D.M.Copyright: C. Faloutsos (2001)11 Vector Space Model and Clustering relevance feedback (brilliant idea) [Roccio’73] How? CS TRs MD TRs

12 Multi DB and D.M.Copyright: C. Faloutsos (2001)12 Vector Space Model and Clustering How? A: by adding the ‘good’ vectors and subtracting the ‘bad’ ones CS TRs MD TRs

13 Multi DB and D.M.Copyright: C. Faloutsos (2001)13 Outline - detailed main idea cluster search cluster generation evaluation

14 Multi DB and D.M.Copyright: C. Faloutsos (2001)14 Cluster generation Problem: – given N points in V dimensions, –group them

15 Multi DB and D.M.Copyright: C. Faloutsos (2001)15 Cluster generation Problem: – given N points in V dimensions, –group them

16 Multi DB and D.M.Copyright: C. Faloutsos (2001)16 Cluster generation We need Q1: document-to-document similarity Q2: document-to-cluster similarity

17 Multi DB and D.M.Copyright: C. Faloutsos (2001)17 Cluster generation Q1: document-to-document similarity (recall: ‘bag of words’ representation) D1: {‘data’, ‘retrieval’, ‘system’} D2: {‘lung’, ‘pulmonary’, ‘system’} distance/similarity functions?

18 Multi DB and D.M.Copyright: C. Faloutsos (2001)18 Cluster generation A1: # of words in common A2:........ normalized by the vocabulary sizes A3:.... etc About the same performance - prevailing one: cosine similarity

19 Multi DB and D.M.Copyright: C. Faloutsos (2001)19 Cluster generation cosine similarity: similarity(D1, D2) = cos( θ ) = sum(v 1,i * v 2,i ) / len(v 1 )/ len(v 2 ) θ D1 D2

20 Multi DB and D.M.Copyright: C. Faloutsos (2001)20 Cluster generation cosine similarity - observations: related to the Euclidean distance weights v i,j : according to tf/idf θ D1 D2

21 Multi DB and D.M.Copyright: C. Faloutsos (2001)21 Cluster generation tf (‘term frequency’) high, if the term appears very often in this document. idf (‘inverse document frequency’) penalizes ‘common’ words, that appear in almost every document

22 Multi DB and D.M.Copyright: C. Faloutsos (2001)22 Cluster generation We need Q1: document-to-document similarity Q2: document-to-cluster similarity ?

23 Multi DB and D.M.Copyright: C. Faloutsos (2001)23 Cluster generation A1: min distance (‘single-link’) A2: max distance (‘all-link’) A3: avg distance A4: distance to centroid ?

24 Multi DB and D.M.Copyright: C. Faloutsos (2001)24 Cluster generation A1: min distance (‘single-link’) –leads to elongated clusters A2: max distance (‘all-link’) –many, small, tight clusters A3: avg distance –in between the above A4: distance to centroid –fast to compute

25 Multi DB and D.M.Copyright: C. Faloutsos (2001)25 Cluster generation We have document-to-document similarity document-to-cluster similarity Q: How to group documents into ‘natural’ clusters

26 Multi DB and D.M.Copyright: C. Faloutsos (2001)26 Cluster generation A: *many-many* algorithms - in two groups [VanRijsbergen]: theoretically sound (O(N^2)) –independent of the insertion order iterative (O(N), O(N log(N))

27 Multi DB and D.M.Copyright: C. Faloutsos (2001)27 Cluster generation - ‘sound’ methods Approach#1: dendrograms - create a hierarchy (bottom up or top-down) - choose a cut-off (how?) and cut cat tigerhorsecow 0.1 0.3 0.8

28 Multi DB and D.M.Copyright: C. Faloutsos (2001)28 Cluster generation - ‘sound’ methods Approach#2: min. some statistical criterion (eg., sum of squares from cluster centers) –like ‘k-means’ –but how to decide ‘k’?

29 Multi DB and D.M.Copyright: C. Faloutsos (2001)29 Cluster generation - ‘sound’ methods Approach#3: Graph theoretic [Zahn]: –build MST; –delete edges longer than 2.5* std of the local average

30 Multi DB and D.M.Copyright: C. Faloutsos (2001)30 Cluster generation - ‘sound’ methods Result: why ‘3’? variations Complexity?

31 Multi DB and D.M.Copyright: C. Faloutsos (2001)31 Cluster generation - ‘iterative’ methods general outline: Choose ‘seeds’ (how?) assign each vector to its closest seed (possibly adjusting cluster centroid) possibly, re-assign some vectors to improve clusters Fast and practical, but ‘unpredictable’

32 Multi DB and D.M.Copyright: C. Faloutsos (2001)32 Cluster generation - ‘iterative’ methods general outline: Choose ‘seeds’ (how?) assign each vector to its closest seed (possibly adjusting cluster centroid) possibly, re-assign some vectors to improve clusters Fast and practical, but ‘unpredictable’

33 Multi DB and D.M.Copyright: C. Faloutsos (2001)33 Cluster generation one way to estimate # of clusters k: the ‘cover coefficient’ [Can+] ~ SVD

34 Multi DB and D.M.Copyright: C. Faloutsos (2001)34 Outline - detailed main idea cluster search cluster generation evaluation

35 Multi DB and D.M.Copyright: C. Faloutsos (2001)35 Evaluation Q: how to measure ‘goodness’ of one distance function vs another? A: ground truth (by humans) and –‘precision’ and ‘recall’

36 Multi DB and D.M.Copyright: C. Faloutsos (2001)36 Evaluation precision = (retrieved & relevant) / retrieved –100% precision -> no false alarms recall = (retrieved & relevant)/ relevant –100% recall -> no false dismissals

37 Multi DB and D.M.Copyright: C. Faloutsos (2001)37 Text - Detailed outline text –problem –full text scanning –inversion –signature files –clustering –information filtering and LSI

38 Multi DB and D.M.Copyright: C. Faloutsos (2001)38 LSI - Detailed outline LSI –problem definition –main idea –experiments

39 Multi DB and D.M.Copyright: C. Faloutsos (2001)39 Information Filtering + LSI [Foltz+,’92] Goal: – users specify interests (= keywords) –system alerts them, on suitable news-documents Major contribution: LSI = Latent Semantic Indexing –latent (‘hidden’) concepts

40 Multi DB and D.M.Copyright: C. Faloutsos (2001)40 Information Filtering + LSI Main idea map each document into some ‘concepts’ map each term into some ‘concepts’ ‘Concept’:~ a set of terms, with weights, e.g. –“data” (0.8), “system” (0.5), “retrieval” (0.6) -> DBMS_concept

41 Multi DB and D.M.Copyright: C. Faloutsos (2001)41 Information Filtering + LSI Pictorially: term-document matrix (BEFORE)

42 Multi DB and D.M.Copyright: C. Faloutsos (2001)42 Information Filtering + LSI Pictorially: concept-document matrix and...

43 Multi DB and D.M.Copyright: C. Faloutsos (2001)43 Information Filtering + LSI... and concept-term matrix

44 Multi DB and D.M.Copyright: C. Faloutsos (2001)44 Information Filtering + LSI Q: How to search, eg., for ‘system’?

45 Multi DB and D.M.Copyright: C. Faloutsos (2001)45 Information Filtering + LSI A: find the corresponding concept(s); and the corresponding documents

46 Multi DB and D.M.Copyright: C. Faloutsos (2001)46 Information Filtering + LSI A: find the corresponding concept(s); and the corresponding documents

47 Multi DB and D.M.Copyright: C. Faloutsos (2001)47 Information Filtering + LSI Thus it works like an (automatically constructed) thesaurus: we may retrieve documents that DON’T have the term ‘system’, but they contain almost everything else (‘data’, ‘retrieval’)

48 Multi DB and D.M.Copyright: C. Faloutsos (2001)48 LSI - Detailed outline LSI –problem definition –main idea –experiments

49 Multi DB and D.M.Copyright: C. Faloutsos (2001)49 LSI - Experiments 150 Tech Memos (TM) / month 34 users submitted ‘profiles’ (6-66 words per profile) 100-300 concepts

50 Multi DB and D.M.Copyright: C. Faloutsos (2001)50 LSI - Experiments four methods, cross-product of: –vector-space or LSI, for similarity scoring –keywords or document-sample, for profile specification measured: precision/recall

51 Multi DB and D.M.Copyright: C. Faloutsos (2001)51 LSI - Experiments LSI, with document-based profiles, were better precision recall (0.25,0.65) (0.50,0.45) (0.75,0.30)

52 Multi DB and D.M.Copyright: C. Faloutsos (2001)52 LSI - Discussion - Conclusions Great idea, –to derive ‘concepts’ from documents –to build a ‘statistical thesaurus’ automatically –to reduce dimensionality Often leads to better precision/recall but: –Needs ‘training’ set of documents –‘concept’ vectors are not sparse anymore

53 Multi DB and D.M.Copyright: C. Faloutsos (2001)53 LSI - Discussion - Conclusions Observations Bellcore (-> Telcordia) has a patent used for multi-lingual retrieval How exactly SVD works?

54 Multi DB and D.M.Copyright: C. Faloutsos (2001)54 Indexing - Detailed outline primary key indexing secondary key / multi-key indexing spatial access methods fractals text SVD: a powerful tool multimedia...

55 Multi DB and D.M.Copyright: C. Faloutsos (2001)55 References Foltz, P. W. and S. T. Dumais (Dec. 1992). "Personalized Information Delivery: An Analysis of Information Filtering Methods." Comm. of ACM (CACM) 35(12): 51- 60.

56 Multi DB and D.M.Copyright: C. Faloutsos (2001)56 References Can, F. and E. A. Ozkarahan (Dec. 1990). "Concepts and Effectiveness of the Cover-Coefficient-Based Clustering Methodology for Text Databases." ACM TODS 15(4): 483-517. Noreault, T., M. McGill, et al. (1983). A Performance Evaluation of Similarity Measures, Document Term Weighting Schemes and Representation in a Boolean Environment. Information Retrieval Research, Butterworths. Rocchio, J. J. (1971). Relevance Feedback in Information Retrieval. The SMART Retrieval System - Experiments in Automatic Document Processing. G. Salton. Englewood Cliffs, New Jersey, Prentice-Hall Inc.

57 Multi DB and D.M.Copyright: C. Faloutsos (2001)57 References - cont’d Salton, G. (1971). The SMART Retrieval System - Experiments in Automatic Document Processing. Englewood Cliffs, New Jersey, Prentice-Hall Inc. Salton, G. and M. J. McGill (1983). Introduction to Modern Information Retrieval, McGraw-Hill. Van-Rijsbergen, C. J. (1979). Information Retrieval. London, England, Butterworths. Zahn, C. T. (Jan. 1971). "Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters." IEEE Trans. on Computers C-20(1): 68-86.


Download ppt "10-603/15-826A: Multimedia Databases and Data Mining Text - part II C. Faloutsos."

Similar presentations


Ads by Google