Multimedia Databases Text II. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Text databases Image and video.

Slides:



Advertisements
Similar presentations
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 7: Scoring and results assembly.
Advertisements

CMU SCS : Multimedia Databases and Data Mining Lecture #17: Text - part IV (LSI) C. Faloutsos.
CMU SCS : Multimedia Databases and Data Mining Lecture #19: SVD - part II (case studies) C. Faloutsos.
Chapter 5: Introduction to Information Retrieval
Introduction to Information Retrieval
Multimedia Database Systems
Temple University – CIS Dept. CIS616– Principles of Data Management V. Megalooikonomou Text Databases (some slides are based on notes by C. Faloutsos)
Lecture 11 Search, Corpora Characteristics, & Lucene Introduction.
15-826: Multimedia Databases and Data Mining
CMU SCS : Multimedia Databases and Data Mining Lecture #16: Text - part III: Vector space model and clustering C. Faloutsos.
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model.
What is missing? Reasons that ideal effectiveness hard to achieve: 1. Users’ inability to describe queries precisely. 2. Document representation loses.
Hinrich Schütze and Christina Lioma
DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING Jessica Lin and Dimitrios Gunopulos Ângelo Cardoso IST/UTL December
Search and Retrieval: More on Term Weighting and Document Ranking Prof. Marti Hearst SIMS 202, Lecture 22.
ISP 433/533 Week 2 IR Models.
Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.
Text Databases. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Data Mining Multimedia Databases Text databases Image and video.
Database Management Systems, R. Ramakrishnan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides.
Multimedia and Text Indexing. Multimedia Data Management The need to query and analyze vast amounts of multimedia data (i.e., images, sound tracks, video.
Chapter 5: Query Operations Baeza-Yates, 1999 Modern Information Retrieval.
Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.
Modeling Modern Information Retrieval
IR Models: Latent Semantic Analysis. IR Model Taxonomy Non-Overlapping Lists Proximal Nodes Structured Models U s e r T a s k Set Theoretic Fuzzy Extended.
Text and Web Search. Text Databases and IR Text databases (document databases) Large collections of documents from various sources: news articles, research.
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Vector Methods 1.
Vector Space Model CS 652 Information Extraction and Integration.
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.
Multimedia Databases LSI and SVD. Text - Detailed outline text problem full text scanning inversion signature files clustering information filtering and.
E.G.M. PetrakisDimensionality Reduction1  Given N vectors in n dims, find the k most important axes to project them  k is user defined (k < n)  Applications:
IR Models: Review Vector Model and Probabilistic.
10-603/15-826A: Multimedia Databases and Data Mining Text - part II C. Faloutsos.
Presented by Arun Qamra
Query Operations: Automatic Global Analysis. Motivation Methods of local analysis extract information from local set of documents retrieved to expand.
Chapter 5: Information Retrieval and Web Search
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
1 Vector Space Model Rong Jin. 2 Basic Issues in A Retrieval Model How to represent text objects What similarity function should be used? How to refine.
Modern Information Retrieval: A Brief Overview By Amit Singhal Ranjan Dash.
Query Operations J. H. Wang Mar. 26, The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text.
Feature selection LING 572 Fei Xia Week 4: 1/29/08 1.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Latent Semantic Analysis Hongning Wang Recap: vector space model Represent both doc and query by concept vectors – Each concept defines one dimension.
IR Models J. H. Wang Mar. 11, The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text quer y.
CpSc 881: Information Retrieval. 2 Recall: Term-document matrix This matrix is the basis for computing the similarity between documents and queries. Today:
Chapter 6: Information Retrieval and Web Search
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
University of Malta CSA3080: Lecture 6 © Chris Staff 1 of 20 CSA3080: Adaptive Hypertext Systems I Dr. Christopher Staff Department.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
Vector Space Models.
1 Latent Concepts and the Number Orthogonal Factors in Latent Semantic Analysis Georges Dupret
Web Search and Text Mining Lecture 5. Outline Review of VSM More on LSI through SVD Term relatedness Probabilistic LSI.
CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides.
Natural Language Processing Topics in Information Retrieval August, 2002.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
Information Retrieval and Web Search IR models: Vector Space Model Term Weighting Approaches Instructor: Rada Mihalcea.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
Automated Information Retrieval
Multimedia and Text Indexing
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
15-826: Multimedia Databases and Data Mining
Basic Information Retrieval
CS 430: Information Discovery
Chapter 5: Information Retrieval and Web Search
Large scale multilingual and multimodal integration
Boolean and Vector Space Retrieval Models
15-826: Multimedia Databases and Data Mining
Latent Semantic Analysis
Presentation transcript:

Multimedia Databases Text II

Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Text databases Image and video databases Time Series databases Data Mining

Text - Detailed outline Text databases problem full text scanning inversion signature files clustering information filtering and LSI

Vector Space Model and Clustering keyword queries (vs Boolean) each document: -> vector (HOW?) each query: -> vector search for ‘similar’ vectors

Vector Space Model and Clustering main idea: document...data... aaron zoo data V (= vocabulary size) ‘indexing’

Vector Space Model and Clustering Then, group nearby vectors together Q1: cluster search? Q2: cluster generation? Two significant contributions ranked output relevance feedback

Vector Space Model and Clustering cluster search: visit the (k) closest superclusters; continue recursively CS TRs MD TRs

Vector Space Model and Clustering ranked output: easy! CS TRs MD TRs

Vector Space Model and Clustering relevance feedback (brilliant idea) [Roccio’73] CS TRs MD TRs

Vector Space Model and Clustering relevance feedback (brilliant idea) [Roccio’73] How? CS TRs MD TRs

Vector Space Model and Clustering How? A: by adding the ‘good’ vectors and subtracting the ‘bad’ ones CS TRs MD TRs

Outline - detailed main idea cluster search cluster generation evaluation

Cluster generation Problem: given N points in V dimensions, group them

Cluster generation Problem: given N points in V dimensions, group them

Cluster generation We need Q1: document-to-document similarity Q2: document-to-cluster similarity

Cluster generation Q1: document-to-document similarity (recall: ‘bag of words’ representation) D1: {‘data’, ‘retrieval’, ‘system’} D2: {‘lung’, ‘pulmonary’, ‘system’} distance/similarity functions?

Cluster generation A1: # of words in common A2: normalized by the vocabulary sizes A3:.... etc About the same performance - prevailing one: cosine similarity

Cluster generation cosine similarity: sim(D1, D2) = cos( θ ) = sum(v 1,i * v 2,i ) / len(v 1 )/ len(v 2 ) θ D1 D2

Cluster generation cosine similarity - observations: related to the Euclidean distance weights v i,j : according to tf/idf θ D1 D2

Cluster generation tf (‘term frequency’) high, if the term appears very often in this document. idf (‘inverse document frequency’) penalizes ‘common’ words, that appear in almost every document

Cluster generation We need Q1: document-to-document similarity Q2: document-to-cluster similarity ?

Cluster generation A1: min distance (‘single-link’) A2: max distance (‘all-link’) A3: avg distance A4: distance to centroid ?

Cluster generation A1: min distance (‘single-link’) leads to elongated clusters A2: max distance (‘all-link’) many, small, tight clusters A3: avg distance in between the above A4: distance to centroid fast to compute

Cluster generation We have document-to-document similarity document-to-cluster similarity Q: How to group documents into ‘natural’ clusters

Cluster generation A: *many-many* algorithms - in two groups [VanRijsbergen]: theoretically sound (O(N^2)) independent of the insertion order iterative (O(N), O(N log(N))

Outline - detailed main idea cluster search cluster generation evaluation

Evaluation Q: how to measure ‘goodness’ of one distance function vs another? A: ground truth (by humans) and ‘precision’ and ‘recall’

Evaluation precision = (retrieved and relevant) / retrieved 100% precision -> no false alarms recall = (retrieved and relevant)/ relevant 100% recall -> no false dismissals

Text - Detailed outline text problem full text scanning inversion signature files clustering information filtering and LSI

LSI - Detailed outline LSI problem definition main idea experiments

Information Filtering + LSI [Foltz+,’92] Goal: users specify interests (= keywords) system alerts them, on suitable news- documents Major contribution: LSI = Latent Semantic Indexing latent (‘hidden’) concepts

Information Filtering + LSI Main idea map each document into some ‘concepts’ map each term into some ‘concepts’ ‘Concept’:~ a set of terms, with weights, e.g. “data” (0.8), “system” (0.5), “retrieval” (0.6) -> DBMS_concept

Information Filtering + LSI Pictorially: term-document matrix (BEFORE)

Information Filtering + LSI Pictorially: concept-document matrix and...

Information Filtering + LSI... and concept-term matrix

Information Filtering + LSI Q: How to search, eg., for ‘system’?

Information Filtering + LSI A: find the corresponding concept(s); and the corresponding documents

Information Filtering + LSI A: find the corresponding concept(s); and the corresponding documents

Information Filtering + LSI Thus it works like an (automatically constructed) thesaurus: we may retrieve documents that DON’T have the term ‘system’, but they contain almost everything else (‘data’, ‘retrieval’)

LSI - Discussion - Conclusions Great idea, to derive ‘concepts’ from documents to build a ‘statistical thesaurus’ automatically to reduce dimensionality Often leads to better precision/recall but: Needs ‘training’ set of documents ‘concept’ vectors are not sparse anymore

LSI - Discussion - Conclusions Observations Bellcore (-> Telcordia) has a patent used for multi-lingual retrieval How exactly SVD works?

Indexing - Detailed outline primary key indexing secondary key / multi-key indexing spatial access methods fractals text SVD: a powerful tool multimedia...

References Foltz, P. W. and S. T. Dumais (Dec. 1992). "Personalized Information Delivery: An Analysis of Information Filtering Methods." Comm. of ACM (CACM) 35(12): Can, F. and E. A. Ozkarahan (Dec. 1990). "Concepts and Effectiveness of the Cover-Coefficient-Based Clustering Methodology for Text Databases." ACM TODS 15(4): Rocchio, J. J. (1971). Relevance Feedback in Information Retrieval. The SMART Retrieval System - Experiments in Automatic Document Processing. G. Salton. Englewood Cliffs, New Jersey, Prentice-Hall Inc.

References - cont’d Salton, G. (1971). The SMART Retrieval System - Experiments in Automatic Document Processing. Englewood Cliffs, New Jersey, Prentice-Hall Inc. Salton, G. and M. J. McGill (1983). Introduction to Modern Information Retrieval, McGraw-Hill. Van-Rijsbergen, C. J. (1979). Information Retrieval. London, England, Butterworths. Zahn, C. T. (Jan. 1971). "Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters." IEEE Trans. on Computers C-20(1):