
Text Databases

Outline
- Spatial Databases
- Temporal Databases
- Spatio-temporal Databases
- Data Mining
- Multimedia Databases
- Text databases
- Image and video databases
- Time Series databases

Text - Detailed outline
Text databases:
- problem
- full text scanning
- inversion
- signature files (a.k.a. Bloom filters)
- vector model and clustering
- information filtering and LSI

Vector Space Model and Clustering Keyword (free-text) queries (vs. Boolean): each document -> a vector (HOW?); each query -> a vector; search for 'similar' vectors.

Vector Space Model and Clustering main idea: each document is a vector of size d, where d is the number of different terms in the database (= vocabulary size). (Figure: a document containing '...data...' maps to a vector with one slot per term, from 'aaron' to 'zoo'; this mapping is the 'indexing' step.)

Document Vectors
- Documents are represented as "bags of words"
- Represented as vectors when used computationally
- A vector is like an array of floating-point numbers: it has direction and magnitude
- Each vector holds a place for every term in the collection
- Therefore, most vectors are sparse

Document Vectors One location for each word. (Figure: term-document matrix; terms nova, galaxy, heat, h'wood, film, role, diet, fur; documents A through I.) "Nova" occurs 10 times in text A, "Galaxy" occurs 5 times in text A, "Heat" occurs 3 times in text A. (Blank means 0 occurrences.)

Document Vectors One location for each word. (Same term-document matrix figure.) "Hollywood" occurs 7 times in text I, "Film" occurs 5 times in text I, "Diet" occurs 1 time in text I, "Fur" occurs 3 times in text I.

Document Vectors (Figure: the complete term-document matrix; the rows A through I are document ids, the columns are the terms nova, galaxy, heat, h'wood, film, role, diet, fur.)
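To make the representation concrete, here is a minimal Python/numpy sketch; the counts for documents A and I are the ones quoted above, and everything else is zero by construction:

```python
import numpy as np

# One vector slot per term in the collection
terms = ["nova", "galaxy", "heat", "h'wood", "film", "role", "diet", "fur"]

# Raw term counts from the example (blank in the figure means 0)
doc_A = np.array([10, 5, 3, 0, 0, 0, 0, 0], dtype=float)
doc_I = np.array([0, 0, 0, 7, 5, 0, 1, 3], dtype=float)

# Most entries are zero, i.e., the vectors are sparse
print(dict(zip(terms, doc_A)))
```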

We Can Plot the Vectors (Figure: a 2-D plot with axes 'Star' and 'Diet', showing a doc about astronomy, a doc about movie stars, and a doc about mammal behavior as points.)

Vector Space Model and Clustering Then, group nearby vectors together. Q1: cluster search? Q2: cluster generation? Two significant contributions: ranked output and relevance feedback.

Vector Space Model and Clustering cluster search: visit the (k) closest superclusters; continue recursively. (Figure: two clusters of technical reports, CS TRs and MD TRs.)

Vector Space Model and Clustering ranked output: easy!

Vector Space Model and Clustering relevance feedback (brilliant idea) [Rocchio'73]

Vector Space Model and Clustering relevance feedback (brilliant idea) [Rocchio'73] How?

Vector Space Model and Clustering How? A: by adding the 'good' vectors and subtracting the 'bad' ones.
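A minimal sketch of this update (the standard Rocchio formulation; the alpha/beta/gamma weights and the clamping of negative components are common conventions, not taken from these slides):

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of the 'good' document vectors
    and away from the centroid of the 'bad' ones."""
    q_new = alpha * query
    if len(relevant):
        q_new = q_new + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q_new = q_new - gamma * np.mean(nonrelevant, axis=0)
    return np.clip(q_new, 0.0, None)  # negative term weights are usually dropped
```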

Cluster generation Problem: given N points in V dimensions, group them

Cluster generation Problem: given N points in V dimensions, group them (typically k-means or AGNES is used; a sketch follows).
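For reference, a bare-bones k-means (Lloyd's algorithm) in Python/numpy; AGNES, the agglomerative alternative, would instead repeatedly merge the two closest clusters:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal k-means sketch: X is an (N, V) array of points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign every point to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # move every center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if (labels == j).any() else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```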

Assigning Weights to Terms
- Binary weights
- Raw term frequency
- tf x idf
Recall the Zipf distribution: we want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole.

Binary Weights Only the presence (1) or absence (0) of a term is included in the vector

Raw Term Weights The frequency of occurrence for the term in each document is included in the vector

Assigning Weights tf x idf measure: term frequency (tf) and inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution. Goal: assign a tf x idf weight to each term in each document.

tf x idf The weight of term k in document i is w_ik = tf_ik * log(N / n_k), where tf_ik is the frequency of term k in document i, N is the number of documents in the collection, and n_k is the number of documents that contain term k.

Inverse Document Frequency IDF provides high values for rare words and low values for common words. For a collection of N documents: idf_k = log(N / n_k).
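A small numpy sketch of the weighting just defined (the toy counts matrix is invented for illustration):

```python
import numpy as np

# Term-document matrix of raw term frequencies (rows = terms, cols = docs)
counts = np.array([
    [10, 0, 2],   # a rarer term: appears in 2 of 3 documents
    [ 3, 4, 5],   # a common term: appears in every document
], dtype=float)

N = counts.shape[1]             # number of documents in the collection
n_k = (counts > 0).sum(axis=1)  # number of documents containing each term
idf = np.log(N / n_k)           # high for rare terms, 0 for ubiquitous ones
tfidf = counts * idf[:, None]   # w_ik = tf_ik * idf_k
print(idf, tfidf, sep="\n")
```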

Similarity Measures for document vectors
- Simple matching (coordination level match)
- Dice's Coefficient
- Jaccard's Coefficient
- Cosine Coefficient
- Overlap Coefficient
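The slide only names the measures; for binary document vectors their standard set-based definitions can be sketched as follows (X and Y are the sets of terms present in each document):

```python
import numpy as np

def sim_measures(x, y):
    """Classical similarity coefficients for binary term vectors x, y."""
    X = set(np.flatnonzero(x))   # terms present in the first document
    Y = set(np.flatnonzero(y))   # terms present in the second document
    inter = len(X & Y)
    return {
        "simple matching": inter,                     # |X and Y|
        "Dice":    2 * inter / (len(X) + len(Y)),     # 2|X&Y| / (|X|+|Y|)
        "Jaccard": inter / len(X | Y),                # |X&Y| / |X or Y|
        "cosine":  inter / np.sqrt(len(X) * len(Y)),  # |X&Y| / sqrt(|X||Y|)
        "overlap": inter / min(len(X), len(Y)),       # |X&Y| / min(|X|,|Y|)
    }

print(sim_measures(np.array([1, 1, 0, 1]), np.array([1, 0, 1, 1])))
```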

tf x idf normalization Normalize the term weights, so longer documents are not unfairly given more weight. To 'normalize' usually means to force all values to fall within a certain range, usually between 0 and 1 inclusive.

Vector space similarity (use the weights to compare the documents)

Computing Similarity Scores sim(D_i, Q) = (sum_k w_ik * w_qk) / (sqrt(sum_k w_ik^2) * sqrt(sum_k w_qk^2)), i.e., the cosine of the angle between the document vector and the query vector.

Vector Space with Term Weights and Cosine Matching (Figure: query Q and documents D1, D2 plotted against axes Term A and Term B.) D_i = (d_i1, w_di1; d_i2, w_di2; ...; d_it, w_dit), Q = (q_i1, w_qi1; q_i2, w_qi2; ...; q_it, w_qit). Example: Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7).
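Plugging the example numbers into the cosine formula above confirms the ranking; a quick check (computed here, not on the original slide):

```python
import numpy as np

Q  = np.array([0.4, 0.8])
D1 = np.array([0.8, 0.3])
D2 = np.array([0.2, 0.7])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(Q, D1))  # ~0.73
print(cosine(Q, D2))  # ~0.98 -> D2 is the closer match for Q
```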

Text - Detailed outline
Text databases:
- problem
- full text scanning
- inversion
- signature files (a.k.a. Bloom filters)
- vector model and clustering
- information filtering and LSI

Information Filtering + LSI [Foltz+,'92] Goal: users specify their interests (= keywords); the system alerts them on suitable news documents. Major contribution: LSI = Latent Semantic Indexing (latent = 'hidden' concepts).

Information Filtering + LSI Main idea: map each document into some 'concepts'; map each term into some 'concepts'. A 'concept' is roughly a set of terms with weights, e.g., "data" (0.8), "system" (0.5), "retrieval" (0.6) -> DBMS_concept.

Information Filtering + LSI Pictorially: term-document matrix (BEFORE)

Information Filtering + LSI Pictorially: concept-document matrix and...

Information Filtering + LSI... and concept-term matrix

Information Filtering + LSI Q: How to search, e.g., for 'system'?

Information Filtering + LSI A: find the corresponding concept(s); and the corresponding documents

Information Filtering + LSI A: find the corresponding concept(s); and the corresponding documents

Information Filtering + LSI Thus it works like an (automatically constructed) thesaurus: we may retrieve documents that DON'T contain the term 'system', but do contain closely related terms ('data', 'retrieval').
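A small numpy sketch of this thesaurus effect (the toy term-document matrix is invented; scoring a query against the rank-k reconstruction is one common way to do LSI retrieval):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents
terms = ["data", "system", "retrieval", "brain", "lung"]
A = np.array([
    [1, 1, 1, 0, 0],   # data
    [1, 1, 0, 0, 0],   # system (absent from document 3!)
    [1, 1, 1, 0, 0],   # retrieval
    [0, 0, 0, 1, 1],   # brain
    [0, 0, 0, 1, 1],   # lung
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                      # keep the 2 strongest concepts

q = np.array([0, 1, 0, 0, 0.0])            # one-term query: 'system'
scores = (q @ U[:, :k]) * s[:k] @ Vt[:k]   # score against the rank-k matrix
print(scores)  # document 3 scores > 0 although it never mentions 'system'
```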

SVD - Detailed outline: Motivation; Definition - properties; Interpretation; Complexity; Case studies; Additional properties.

SVD - Motivation problem #1: text / LSI: find 'concepts'; problem #2: compression / dimensionality reduction.

SVD - Motivation problem #1: text - LSI: find ‘concepts’

SVD - Motivation problem #2: compress / reduce dimensionality

Problem - specs ~10**6 rows; ~10**3 columns; no updates; random access to any cell(s); small error is OK.

SVD - Motivation

SVD - Detailed outline: Motivation; Definition - properties; Interpretation; Complexity; Case studies; Additional properties.

SVD - Definition A [n x m] = U [n x r] Σ [r x r] (V [m x r])^T
- A: n x m matrix (e.g., n documents, m terms)
- U: n x r matrix (n documents, r concepts)
- Σ: r x r diagonal matrix (strength of each 'concept'); r is the rank of the matrix
- V: m x r matrix (m terms, r concepts)

SVD - Properties THEOREM [Press+92]: it is always possible to decompose a matrix A into A = U Σ V^T, where U, Σ, V are unique (*); U and V are column-orthonormal (i.e., their columns are unit vectors, orthogonal to each other: U^T U = I and V^T V = I, with I the identity matrix); and the diagonal entries of Σ (the singular values) are positive and sorted in decreasing order.
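These properties are easy to verify numerically; a quick numpy sketch (note that np.linalg.svd returns V already transposed):

```python
import numpy as np

A = np.random.rand(5, 3)                    # any n x m matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.allclose(U.T @ U, np.eye(3)))      # True: U is column-orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(3)))    # True: so is V
print(np.all(np.diff(s) <= 0) and np.all(s >= 0))  # sorted, non-negative
print(np.allclose(A, U @ np.diag(s) @ Vt))  # True: exact reconstruction
```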

SVD - Example A = U Σ V^T - example: (Figure: a term-document matrix with rows data, inf., retrieval, brain, lung and columns for CS and MD documents, written as the product of three matrices U x Σ x V^T.)

SVD - Example A = U Σ V^T - example: (Same figure, with the two columns of U labeled CS-concept and MD-concept.)

SVD - Example A = U Σ V^T - example: (Same figure; U is annotated as the doc-to-concept similarity matrix, with CS-concept and MD-concept columns.)

SVD - Example A = U Σ V^T - example: (Same figure; the first diagonal entry of Σ is annotated as the 'strength' of the CS-concept.)

SVD - Example A = U Σ V^T - example: (Same figure; V^T is annotated as the term-to-concept similarity matrix, with a CS-concept row.)

SVD - Example A = U Σ V^T - example: (Same figure; V^T is annotated as the term-to-concept similarity matrix, with a CS-concept row.)

SVD - Detailed outline: Motivation; Definition - properties; Interpretation; Complexity; Case studies; Additional properties.

SVD - Interpretation #1 'documents', 'terms' and 'concepts': U: document-to-concept similarity matrix; V: term-to-concept similarity matrix; Σ: its diagonal elements give the 'strength' of each concept.

SVD - Interpretation #2 best axis to project on: (‘best’ = min sum of squares of projection errors)

SVD - Motivation (Figure: a cloud of 2-D points with the best projection axis drawn through it.)

SVD - Interpretation #2 minimum RMS error: SVD gives the best axis (v1) to project on.

SVD - Interpretation #2

A = U Σ V^T - example: (Figure: the three-matrix product, with v1, the first right singular vector, drawn as the best projection axis through the points.)

SVD - Interpretation #2 A = U Σ V^T - example: (Figure: the three-matrix product; the first singular value measures the variance ('spread') along the v1 axis.)

SVD - Interpretation #2 A = U Σ V^T - example: U Σ gives the coordinates of the points along the projection axes.

SVD - Interpretation #2 More details Q: how exactly is dimensionality reduction done?

SVD - Interpretation #2 More details Q: how exactly is dimensionality reduction done? A: set the smallest singular values to zero.
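In code this is a one-liner; a numpy sketch of the truncation (and the equivalent "keep only the first k columns" form):

```python
import numpy as np

A = np.random.rand(6, 4)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
s_trunc = s.copy()
s_trunc[k:] = 0.0                        # zero out the smallest singular values
A_k = U @ np.diag(s_trunc) @ Vt          # rank-k approximation of A

# Equivalent: drop the zeroed columns/rows of the factors entirely
A_k2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
print(np.allclose(A_k, A_k2))            # True
```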

SVD - Interpretation #2 (Figure: the truncated product; A is now only approximately equal to U Σ V^T, and the zeroed columns of U, the zeroed entries of Σ, and the corresponding rows of V^T can be dropped from the factors entirely.)

Equivalent: 'spectral decomposition' of the matrix. (Figure: the three-matrix product.)

SVD - Interpretation #2 Equivalent: 'spectral decomposition' of the matrix: A = [u1 u2 ...] diag(σ1, σ2, ...) [v1 v2 ...]^T, with columns u_i of U, singular values σ_i, and columns v_i of V.

SVD - Interpretation #2 Equivalent: 'spectral decomposition' of the (n x m) matrix: A = σ1 u1 v1^T + σ2 u2 v2^T + ...

SVD - Interpretation #2 'spectral decomposition' of the (n x m) matrix: A = σ1 u1 v1^T + σ2 u2 v2^T + ..., a sum of r terms, each the product of an (n x 1) column vector and a (1 x m) row vector.

SVD - Interpretation #2 approximation / dim. reduction: keep only the first few terms of A = σ1 u1 v1^T + σ2 u2 v2^T + ... (Q: how many?); assume σ1 >= σ2 >= ...

SVD - Interpretation #2 A (heuristic [Fukunaga]): keep 80-90% of the 'energy' (= the sum of squares of the σ_i's); assume σ1 >= σ2 >= ...
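The energy rule translates directly into code; a small sketch (the random matrix and the 0.9 threshold are just for demonstration):

```python
import numpy as np

def choose_k(s, energy=0.9):
    """Smallest k whose leading singular values retain the given
    fraction of the total energy (sum of squared singular values)."""
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cum, energy)) + 1

A = np.random.rand(100, 30)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = choose_k(s, 0.9)                     # keep ~90% of the energy
A_k = (U[:, :k] * s[:k]) @ Vt[:k]        # the corresponding approximation
print(k, np.linalg.norm(A - A_k) / np.linalg.norm(A))
```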

SVD - Interpretation #3 finds non-zero 'blobs' in a data matrix. (Figure: a block-structured 0/1 matrix written as the product U x Σ x V^T.)

SVD - Interpretation #3 finds non-zero 'blobs' in a data matrix. (Same figure.)

SVD - Interpretation #3 Drill: find the SVD 'by inspection'! Q: rank = ?? (Figure: a block-structured matrix equated to a product of three unknown matrices.)

SVD - Interpretation #3 A: rank = 2 (2 linearly independent rows/columns).

SVD - Interpretation #3 A: rank = 2 (2 linearly independent rows/columns). Are the factor columns orthogonal?

SVD - Interpretation #3 The column vectors are orthogonal, but not unit vectors; normalize each by its length.

SVD - Interpretation #3 ...and the singular values are the normalization factors, collected on the diagonal of Σ.

SVD - Interpretation #3 A: SVD properties:
- the matrix product should give back the matrix A
- matrix U should be column-orthonormal, i.e., its columns should be unit vectors, orthogonal to each other
- ditto for matrix V
- matrix Σ should be diagonal, with positive values

SVD - Detailed outline: Motivation; Definition - properties; Interpretation; Complexity; Case studies; Additional properties.

SVD - Complexity O(n * m * m) or O(n * n * m) (whichever is less). Less work if we only want the singular values, or only the first k singular vectors, or if the matrix is sparse [Berry]. Implemented in any linear algebra package (LINPACK, Matlab, S-Plus, Mathematica, ...).
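For the sparse case, modern packages expose truncated solvers directly; e.g., a scipy sketch (scipy itself is an assumption here, the slide predates it):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Truncated SVD of a large sparse matrix: only the top-k singular
# triplets are computed, far cheaper than a full decomposition.
A = sparse_random(10_000, 1_000, density=0.001, random_state=0)
U, s, Vt = svds(A, k=10)         # top 10 singular triplets
print(s[::-1])                   # svds returns values in ascending order
```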

SVD - Complexity Faster algorithms for approximate low-rank decompositions exist:
- Alan Frieze, Ravi Kannan, Santosh Vempala: Fast Monte-Carlo Algorithms for Finding Low-Rank Approximations. Proceedings of the 39th FOCS, p. 370, November 8-11, 1998.
- Sudipto Guha, Dimitrios Gunopulos, Nick Koudas: Correlating synchronous and asynchronous data streams. KDD 2003.

SVD - conclusions so far
- SVD: A = U Σ V^T: unique (*)
- U: document-to-concept similarities
- V: term-to-concept similarities
- Σ: strength of each concept
- dim. reduction: keep the first few strongest singular values (80-90% of the 'energy')
- SVD picks up linear correlations
- SVD picks up non-zero 'blobs'

References
- Berry, Michael:
- Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition. Academic Press.
- Press, W. H., S. A. Teukolsky, et al. (1992). Numerical Recipes in C. Cambridge University Press.