
1 Text Databases

2 Outline Spatial Databases Temporal Databases Spatio-temporal Databases Data Mining Multimedia Databases Text databases Image and video databases Time Series databases

3 Text - Detailed outline Text databases problem full text scanning inversion signature files (a.k.a. Bloom Filters) Vector model and clustering information filtering and LSI

4 Vector Space Model and Clustering Keyword (free-text) queries (vs. Boolean); each document -> a vector (HOW?); each query -> a vector; search for 'similar' vectors

5 Vector Space Model and Clustering main idea: each document is a vector of size d, where d is the number of different terms in the database (= vocabulary size) [figure: 'indexing' maps a document containing '...data...' into a d-dimensional vector with one slot per vocabulary term, from 'aaron' to 'zoo']

6 Document Vectors Documents are represented as "bags of words", and as vectors when used computationally. A vector is like an array of floating-point numbers: it has direction and magnitude. Each vector holds a place for every term in the collection; therefore, most vectors are sparse.

7 Document Vectors One location for each word.

       nova  galaxy  heat  h'wood  film  role  diet  fur
  A     10     5      3
  B      5    10
  C                          10      8     7
  D                           9     10     5
  E                                 10    10
  F                           9     10
  G      5     7      9
  H            6     10              2     8
  I                           7      5           1    3

"Nova" occurs 10 times in text A; "Galaxy" occurs 5 times in text A; "Heat" occurs 3 times in text A. (Blank means 0 occurrences.)

8 Document Vectors One location for each word. (Same term-frequency table as the previous slide.) "Hollywood" occurs 7 times in text I; "Film" occurs 5 times in text I; "Diet" occurs 1 time in text I; "Fur" occurs 3 times in text I.

9 Document Vectors (Same table; the row labels A-I are the document ids.)

10 We Can Plot the Vectors [figure: documents plotted in a 2-D space with axes 'Star' and 'Diet'; a doc about astronomy, a doc about movie stars, and a doc about mammal behavior fall in different regions]

11 Vector Space Model and Clustering Then, group nearby vectors together Q1: cluster search? Q2: cluster generation? Two significant contributions ranked output relevance feedback

12 Vector Space Model and Clustering cluster search: visit the (k) closest superclusters; continue recursively [figure: nested clusters of CS technical reports (CS TRs) and MD technical reports (MD TRs)]

13 Vector Space Model and Clustering ranked output: easy! [same cluster figure]

14 Vector Space Model and Clustering relevance feedback (brilliant idea) [Rocchio, 1971] [same cluster figure]

15 Vector Space Model and Clustering relevance feedback (brilliant idea) [Rocchio, 1971] How? [same cluster figure]

16 Vector Space Model and Clustering How? A: by adding the 'good' vectors and subtracting the 'bad' ones [same cluster figure]
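A minimal sketch of this update in Python/numpy (the constants alpha, beta, gamma and the clipping of negative weights follow the common Rocchio formulation; the exact values here are illustrative, not from the slides):

```python
import numpy as np

def rocchio_update(query, relevant, nonrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of the 'good' (relevant) vectors
    and away from the centroid of the 'bad' (non-relevant) ones."""
    q_new = alpha * np.asarray(query, dtype=float)
    if len(relevant) > 0:
        q_new += beta * np.mean(relevant, axis=0)
    if len(nonrelevant) > 0:
        q_new -= gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(q_new, 0.0)  # negative term weights are usually dropped
```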

17 Cluster generation Problem: given N points in V dimensions, group them

18 Cluster generation Problem: given N points in V dimensions, group them (typically k-means or AGNES is used; a bare-bones k-means sketch follows)
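A bare-bones k-means (Lloyd's algorithm) sketch in Python/numpy, purely illustrative; a real system would use a library implementation and smarter initialization:

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Group N points in V dimensions into k clusters."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points
        new_centers = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```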

19 Assigning Weights to Terms Binary Weights Raw term frequency tf x idf Recall the Zipf distribution Want to weight terms highly if they are frequent in relevant documents … BUT infrequent in the collection as a whole

20 Binary Weights Only the presence (1) or absence (0) of a term is included in the vector

21 Raw Term Weights The frequency of occurrence for the term in each document is included in the vector

22 Assigning Weights tf x idf measure: term frequency (tf) inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution Goal: assign a tf * idf weight to each term in each document

23 tf x idf
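The formula itself was a figure in the original deck; the standard tf x idf weight (our reconstruction, not verbatim from the slide) is

    w_ij = tf_ij * log(N / df_j)

where tf_ij is the frequency of term j in document i, N is the number of documents in the collection, and df_j is the number of documents containing term j.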

24 Inverse Document Frequency IDF provides high values for rare words and low values for common words. For a collection of 10,000 documents:
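The slide's numeric table was a figure; as an illustration with our own numbers (log base 10, N = 10,000):

    idf(term in all 10,000 docs) = log(10000/10000) = 0
    idf(term in 5,000 docs)      = log(10000/5000)  ≈ 0.30
    idf(term in 20 docs)         = log(10000/20)    ≈ 2.70
    idf(term in 1 doc)           = log(10000/1)     = 4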

25 Similarity Measures for document vectors Simple matching (coordination level match) Dice’s Coefficient Jaccard’s Coefficient Cosine Coefficient Overlap Coefficient
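A rough sketch of these measures in Python/numpy (the set-based coefficients are shown in their binary 0/1-vector form; the code is ours, not from the deck):

```python
import numpy as np

def cosine(x, y):
    """Cosine coefficient: cos of the angle between weighted vectors."""
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def dice(x, y):
    """Dice's coefficient on binary 0/1 vectors."""
    return 2 * np.sum(x & y) / (np.sum(x) + np.sum(y))

def jaccard(x, y):
    """Jaccard's coefficient on binary 0/1 vectors."""
    return np.sum(x & y) / np.sum(x | y)

def overlap(x, y):
    """Overlap coefficient on binary 0/1 vectors."""
    return np.sum(x & y) / min(np.sum(x), np.sum(y))
```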

26 tf x idf normalization Normalize the term weights (so longer documents are not unfairly given more weight). To normalize usually means to force all values into a fixed range, typically between 0 and 1 inclusive.
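A common choice (a sketch of the usual cosine normalization; the slide's exact formula was not preserved in the transcript) divides each weight by the document's vector length:

    w_ij = (tf_ij * idf_j) / sqrt( sum_k (tf_ik * idf_k)^2 )

so every document vector ends up with unit length.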

27 Vector space similarity (use the weights to compare the documents)

28 Computing Similarity Scores [figure: a 2-D plot of a query and two document vectors; the numbers appear on the next slide]

29 Vector Space with Term Weights and Cosine Matching [figure: Q, D1 and D2 plotted against axes Term A and Term B] D_i = (d_i1, w_di1; d_i2, w_di2; ...; d_it, w_dit); Q = (q_i1, w_qi1; q_i2, w_qi2; ...; q_it, w_qit); Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)
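Working the cosine match through on these numbers (the arithmetic is ours; the original slide showed it graphically):

    cos(Q, D1) = (0.4*0.8 + 0.8*0.3) / (sqrt(0.4^2 + 0.8^2) * sqrt(0.8^2 + 0.3^2))
               = 0.56 / (0.894 * 0.854) ≈ 0.73
    cos(Q, D2) = (0.4*0.2 + 0.8*0.7) / (sqrt(0.4^2 + 0.8^2) * sqrt(0.2^2 + 0.7^2))
               = 0.64 / (0.894 * 0.728) ≈ 0.98

so D2 is ranked above D1: it points in almost the same direction as the query.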

30 Text - Detailed outline Text databases problem full text scanning inversion signature files (a.k.a. Bloom Filters) Vector model and clustering information filtering and LSI

31 Information Filtering + LSI [Foltz+,'92] Goal: users specify interests (= keywords); the system alerts them on suitable news documents. Major contribution: LSI = Latent Semantic Indexing latent ('hidden') concepts

32 Information Filtering + LSI Main idea: map each document into some 'concepts'; map each term into some 'concepts'. 'Concept': roughly, a set of terms with weights, e.g., "data" (0.8), "system" (0.5), "retrieval" (0.6) -> DBMS_concept

33 Information Filtering + LSI Pictorially: term-document matrix (BEFORE)

34 Information Filtering + LSI Pictorially: concept-document matrix and...

35 Information Filtering + LSI... and concept-term matrix

36 Information Filtering + LSI Q: How to search, e.g., for 'system'?

37 Information Filtering + LSI A: find the corresponding concept(s); and the corresponding documents

38 Information Filtering + LSI A: find the corresponding concept(s); and the corresponding documents

39 Information Filtering + LSI Thus it works like an (automatically constructed) thesaurus: we may retrieve documents that DON’T have the term ‘system’, but they contain almost everything else (‘data’, ‘retrieval’)
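A hedged numpy sketch of this effect (the tiny term-document matrix, the choice of k, and the query-folding step q V_k Λ_k^(-1) are illustrative; the folding recipe is the standard LSI one):

```python
import numpy as np

# rows = documents, columns = terms: data, system, retrieval, brain, lung;
# doc d2 talks about data/retrieval but never uses the word 'system'
A = np.array([[1., 1., 1., 0., 0.],    # d1: data, system, retrieval
              [2., 0., 2., 0., 0.],    # d2: data, retrieval (no 'system'!)
              [0., 0., 0., 1., 1.],    # d3: brain, lung
              [0., 0., 0., 2., 2.]])   # d4: brain, lung

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # keep the two strongest concepts
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

q = np.array([0., 1., 0., 0., 0.])      # query: the single term 'system'
q_hat = (q @ Vtk.T) / sk                # fold the query into concept space

# cosine similarity between the query and each document, in concept space
scores = (Uk @ q_hat) / (np.linalg.norm(Uk, axis=1) * np.linalg.norm(q_hat))
print(scores)   # d2 scores high although it never contains 'system'
```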

40 SVD - Detailed outline Motivation Definition - properties Interpretation Complexity Case studies Additional properties

41 SVD - Motivation problem #1: text - LSI: find ‘concepts’ problem #2: compression / dim. reduction

42 SVD - Motivation problem #1: text - LSI: find ‘concepts’

43 SVD - Motivation problem #2: compress / reduce dimensionality

44 Problem - specs ~10^6 rows; ~10^3 columns; no updates; random access to any cell(s); small error: OK

45 SVD - Motivation


47 SVD - Detailed outline Motivation Definition - properties Interpretation Complexity Case studies Additional properties

48 SVD - Definition A [n x m] = U [n x r] Λ [r x r] (V [m x r])^T A: n x m matrix (e.g., n documents, m terms) U: n x r matrix (n documents, r concepts) Λ: r x r diagonal matrix ('strength' of each concept; r: rank of the matrix) V: m x r matrix (m terms, r concepts)

49 SVD - Properties THEOREM [Press+92]: it is always possible to decompose a matrix A into A = U Λ V^T, where U, Λ, V: unique (*) U, V: column-orthonormal (i.e., columns are unit vectors, orthogonal to each other: U^T U = I; V^T V = I, with I the identity matrix) Λ: eigenvalues are positive, and sorted in decreasing order
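These properties are easy to check numerically; a quick numpy sketch on random data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))                     # e.g., 6 documents x 4 terms

U, s, Vt = np.linalg.svd(A, full_matrices=False)

assert np.allclose(U @ np.diag(s) @ Vt, A)         # product gives back A
assert np.allclose(U.T @ U, np.eye(len(s)))        # U column-orthonormal
assert np.allclose(Vt @ Vt.T, np.eye(len(s)))      # ditto for V
assert np.all(s >= 0) and np.all(s[:-1] >= s[1:])  # non-negative, decreasing
```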

50 SVD - Example A = U Λ V^T - example: [figure: a term-document matrix over the terms data, inf., retrieval, brain, lung, with CS documents and MD documents as rows, written as the product of three matrices]

51 SVD - Example A = U Λ V^T - example: [same figure, with the CS-concept and the MD-concept marked]

52 SVD - Example A = U Λ V^T - example: [same figure; U is the doc-to-concept similarity matrix]

53 SVD - Example A = U Λ V^T - example: [same figure; the corresponding diagonal entry of Λ is the 'strength' of the CS-concept]

54 SVD - Example A = U Λ V^T - example: [same figure; V^T is the term-to-concept similarity matrix, with the CS-concept marked]

55 SVD - Example A = U Λ V^T - example: [same figure; term-to-concept similarity matrix, CS-concept marked]

56 SVD - Detailed outline Motivation Definition - properties Interpretation Complexity Case studies Additional properties

57 SVD - Interpretation #1 'documents', 'terms' and 'concepts': U: document-to-concept similarity matrix V: term-to-concept similarity matrix Λ: its diagonal elements give the 'strength' of each concept

58 SVD - Interpretation #2 best axis to project on: (‘best’ = min sum of squares of projection errors)

59 SVD - Motivation

60 SVD - Interpretation #2 minimum RMS error: SVD gives the best axis to project on [figure: scatter plot with the first axis v1]

61 SVD - Interpretation #2

62 A = U Λ V^T - example: [figure: the example decomposition, with v1 marked]

63 SVD - Interpretation #2 A = U Λ V^T - example: [same figure] variance ('spread') on the v1 axis

64 SVD - Interpretation #2 A = U Λ V^T - example: U Λ gives the coordinates of the points in the projection axis [same figure]

65 SVD - Interpretation #2 More details Q: how exactly is dim. reduction done? [same figure]

66 SVD - Interpretation #2 More details Q: how exactly is dim. reduction done? A: set the smallest eigenvalues to zero: [same figure, with the smallest eigenvalue in Λ zeroed]

67 SVD - Interpretation #2 [figure: the product with the smallest eigenvalue zeroed]

68 [figure: the corresponding column of U and row of V^T can then be dropped]

69 [figure: the truncated product U' Λ' (V')^T]

70 [figure: the reconstructed matrix, approximately equal to the original A]

71 Equivalent: 'spectral decomposition' of the matrix: [figure: A = U Λ V^T as a matrix product]

72 SVD - Interpretation #2 Equivalent: 'spectral decomposition' of the matrix: [same figure, with u1, u2 (columns of U), λ1, λ2 (diagonal of Λ) and v1, v2 (rows of V^T) marked]

73 SVD - Interpretation #2 Equivalent: 'spectral decomposition' of the (n x m) matrix: A = λ1 u1 v1^T + λ2 u2 v2^T + ...

74 SVD - Interpretation #2 'spectral decomposition' of the (n x m) matrix: A = λ1 u1 v1^T + λ2 u2 v2^T + ... (r terms; each u_i is an n x 1 column vector and each v_i^T is a 1 x m row vector)

75 SVD - Interpretation #2 approximation / dim. reduction: by keeping the first few terms of A = λ1 u1 v1^T + λ2 u2 v2^T + ... (Q: how many?) assume: λ1 >= λ2 >= ...

76 SVD - Interpretation #2 A (heuristic - [Fukunaga]): keep 80-90% of the 'energy' (= sum of squares of the λ_i's) in A ≈ λ1 u1 v1^T + λ2 u2 v2^T + ..., assuming λ1 >= λ2 >= ...
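A short sketch of this heuristic in Python/numpy (the 80-90% threshold is the slide's; the code is ours):

```python
import numpy as np

def rank_for_energy(s, target=0.90):
    """Smallest k such that the first k eigenvalues carry `target` of the
    total energy (sum of squares), assuming s is sorted in decreasing order."""
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(energy, target)) + 1

# usage: s = np.linalg.svd(A, compute_uv=False); k = rank_for_energy(s)
```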

77 SVD - Interpretation #3 finds non-zero 'blobs' in a data matrix [figure: the example matrix and its factors]

78 SVD - Interpretation #3 finds non-zero 'blobs' in a data matrix [same figure]

79 SVD - Interpretation #3 Drill: find the SVD, 'by inspection'! Q: rank = ?? [figure: a small block-structured matrix and its (unknown) factors]

80 SVD - Interpretation #3 A: rank = 2 (2 linearly independent rows/cols) [same figure]

81 SVD - Interpretation #3 A: rank = 2 (2 linearly independent rows/cols) [same figure; are the column vectors orthogonal?]

82 SVD - Interpretation #3 column vectors: are orthogonal - but not unit vectors: [figure: the factor matrices, mostly zeros outside the two blocks]

83 SVD - Interpretation #3 ...and the eigenvalues are: [figure: the resulting diagonal of Λ]

84 SVD - Interpretation #3 A: SVD properties: the matrix product should give back matrix A; matrix U should be column-orthonormal, i.e., columns should be unit vectors, orthogonal to each other; ditto for matrix V; matrix Λ should be diagonal, with positive values

85 SVD - Detailed outline Motivation Definition - properties Interpretation Complexity Case studies Additional properties

86 SVD - Complexity O(n * m * m) or O(n * n * m), whichever is less. Less work if we just want the eigenvalues, or if we want only the first k eigenvectors, or if the matrix is sparse [Berry]. Implemented in any linear algebra package (LINPACK, Matlab, S-plus, Mathematica, ...)
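For the sparse / first-k-eigenvectors cases, current packages expose truncated SVD directly; for instance, a hedged scipy sketch (matrix size and k are illustrative):

```python
import scipy.sparse as sp
from scipy.sparse.linalg import svds

A = sp.random(10**4, 10**3, density=0.001, format='csr', random_state=0)
U, s, Vt = svds(A, k=10)   # only the 10 strongest singular triplets
# note: svds returns the singular values in ascending order
```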

87 SVD - Complexity Faster algorithms for approximate eigenvector computations exist: Alan Frieze, Ravi Kannan, Santosh Vempala: Fast Monte-Carlo Algorithms for finding low-rank approximations, Proceedings of the 39th FOCS, p.370, November 08-11, 1998 Sudipto Guha, Dimitrios Gunopulos, Nick Koudas: Correlating synchronous and asynchronous data streams. KDD 2003: 529-534

88 SVD - conclusions so far SVD: A = U Λ V^T: unique (*) U: document-to-concept similarities V: term-to-concept similarities Λ: strength of each concept dim. reduction: keep the first few strongest eigenvalues (80-90% of the 'energy') SVD: picks up linear correlations SVD: picks up non-zero 'blobs'

89 References Berry, Michael: http://www.cs.utk.edu/~lsi/ Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition, Academic Press. Press, W. H., S. A. Teukolsky, et al. (1992). Numerical Recipes in C, Cambridge University Press.

