Multimedia Databases Text I. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Data Mining Multimedia Databases Text databases Image.

Multimedia Databases Text I

Outline
- Spatial Databases
- Temporal Databases
- Spatio-temporal Databases
- Data Mining
- Multimedia Databases
  - Text databases
  - Image and video databases
  - Time Series databases

Multimedia Data Management
- The need to query and analyze vast amounts of multimedia data (e.g., images, sound tracks, video tracks) has increased in recent years.
- Joint research from Database Management, Computer Vision, Signal Processing and Pattern Recognition aims to solve problems related to multimedia data management.

Multimedia Data
- There are four major types of multimedia data: images, video sequences, sound tracks, and text.
- Of these, the easiest type to manage is text, since we can order, index, and search text using string management techniques.
- Management of simple sounds is also possible, by representing audio as signal sequences over different channels.
- Image retrieval has received a lot of attention in the last decade (CV and DBs). The main techniques can also be extended and applied to video retrieval.

Content-based Image Retrieval
- Images were traditionally managed by first annotating their contents and then using text-retrieval techniques to index them.
- However, as the amount of information in digital image format grew, some drawbacks of this technique were revealed:
  - Manual annotation requires a vast amount of labor.
  - Different people may perceive the contents of an image differently; thus no objective keywords for search are defined.
- A new research field was born in the 90’s: Content-based Image Retrieval aims at indexing and retrieving images based on their visual contents.

Feature Extraction
- The basis of Content-based Image Retrieval is to extract and index some visual features of the images.
- There are general features (e.g., color, texture, shape, etc.) and domain-specific features (e.g., objects contained in the image).
- Domain-specific feature extraction can vary with the application domain and is based on pattern recognition.
- On the other hand, general features can be used independently of the image domain.

Color Features
- To represent the color of an image compactly, a color histogram is used.
- Colors are partitioned into k groups according to their similarity, and the percentage of each group in the image is measured.
- Images are thus transformed to k-dimensional points, and a distance metric (e.g., Euclidean distance) is used to measure the similarity between them.
[Figure: a histogram with k bins mapped to a point in k-dimensional space]
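The histogram idea above can be sketched as follows. This is a minimal illustration, not the method of any particular system: it assumes grayscale pixel values in 0..255 and equal-width bins as a stand-in for "groups of similar colors", and the names color_histogram and euclidean are hypothetical.

```python
import math

def color_histogram(pixels, k=4):
    """Map a non-empty list of gray levels (0..255) to k bin fractions.

    Assumption: equal-width bins stand in for 'groups of similar colors'.
    """
    counts = [0] * k
    for p in pixels:
        counts[min(p * k // 256, k - 1)] += 1
    total = len(pixels)
    return [c / total for c in counts]

def euclidean(h1, h2):
    """Euclidean distance between two k-dimensional histogram points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

# Two tiny made-up 'images', each reduced to a 4-dimensional point.
dark = color_histogram([0, 10, 20, 200], k=4)
bright = color_histogram([250, 240, 230, 30], k=4)
distance = euclidean(dark, bright)
```

Two images with the same distribution of gray levels get distance 0 regardless of where the pixels are located, which is the price of the compact representation.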

Using Transformations to Reduce Dimensionality
- In many cases the embedded dimensionality of a search problem is much lower than the actual dimensionality.
- Some methods apply transformations to the data and approximate them with low-dimensional vectors.
- The aim is to reduce dimensionality and at the same time maintain the data characteristics.
- If d(a,b) is the distance between two objects a, b in the real (high-dimensional) space and d’(a’,b’) is their distance in the transformed low-dimensional space, we want d’(a’,b’) ≈ d(a,b).

Text - Detailed outline
- Text databases
  - problem
  - full text scanning
  - inversion
  - signature files
  - clustering
  - information filtering and LSI

Problem - Motivation
- Given a database of documents, find documents containing “data”, “retrieval”
- Applications:
  - Web
  - law + patent offices
  - digital libraries
  - information filtering

Problem - Motivation
- Types of queries:
  - boolean (‘data’ AND ‘retrieval’ AND NOT ...)
  - additional features (‘data’ ADJACENT ‘retrieval’)
  - keyword queries (‘data’, ‘retrieval’)
- How to search a large collection of documents?

Full-text scanning
- for single term: (naive: O(N*M), for a text of length N and a pattern of length M)
- text: ABRACADABRA
- pattern: CAB
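A minimal sketch of the naive scan, to make the O(N*M) bound concrete: the pattern is tried at every text position, and each attempt may compare up to M characters. The function name naive_search is a hypothetical choice.

```python
def naive_search(text, pattern):
    """Return all start positions where pattern occurs in text.

    Up to N - M + 1 alignments, each costing up to M comparisons.
    """
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            hits.append(i)
    return hits

hits_abra = naive_search("ABRACADABRA", "ABRA")
hits_cab = naive_search("ABRACADABRA", "CAB")
```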

Full-text scanning
- for single term: (naive: O(N*M))
  - Knuth, Morris and Pratt (‘77): build a small FSA; visit every text letter only once, by carefully shifting more than one step
- text: ABRACADABRA
- pattern: CAB
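A sketch of the Knuth-Morris-Pratt idea in its common failure-table formulation, where the "small FSA" is encoded as an array of fallback positions. The function names are hypothetical, and this is one textbook way to write it, not necessarily the form used in the original lecture.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Return all start positions of pattern in text.

    The text index never moves backwards; on a mismatch only the
    pattern state k falls back via the failure table.
    """
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits
```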

Full-text scanning
[Figure: the pattern CAB being shifted along the text ABRACADABRA]

Full-text scanning
- for single term: (naive: O(N*M))
  - Knuth, Morris and Pratt (‘77)
  - Boyer and Moore (‘77): preprocess the pattern; compare from right to left & skip!
- text: ABRACADABRA
- pattern: CAB
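The right-to-left comparison and skipping can be sketched with Horspool's simplification of Boyer-Moore, which keeps only a bad-character shift table. Note that this is a swapped-in variant chosen for brevity, not the full Boyer-Moore algorithm; the function name is hypothetical.

```python
def horspool_search(text, pattern):
    """Return all start positions of pattern in text (Horspool variant)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # For each character of the pattern except the last one: distance from
    # its rightmost occurrence to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    hits = []
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:  # compare right to left
            j -= 1
        if j < 0:
            hits.append(i)
        # Skip according to the text character aligned with the pattern's
        # last position; characters not in the table allow a full shift of m.
        i += shift.get(text[i + m - 1], m)
    return hits
```

For the text ABRACADABRA and pattern CAB, the window advances by 3, 1, 3, 3 positions, so only four alignments are tried instead of nine.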

Full-text scanning
- Approximate matching - string editing distance:
  d(‘survey’, ‘surgery’) = 2 = min # of insertions, deletions, substitutions to transform the first string into the second
- SURVEY -> SURGERY

Full-text scanning
- string editing distance - how to compute?
- A: dynamic programming
- Let s and t be two strings. Then, start from the end and try to find the minimum number of operations:
  cost(i, j) = cost to match the prefix of length i of the first string s with the prefix of length j of the second string t

Full-text scanning
  if s[i] = t[j] then
      cost(i, j) = cost(i-1, j-1)
  else
      cost(i, j) = min( 1 + cost(i, j-1),    // insertion (of t[j])
                        1 + cost(i-1, j-1),  // substitution
                        1 + cost(i-1, j) )   // deletion (of s[i])
  base cases: cost(i, 0) = i and cost(0, j) = j
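The recurrence can be sketched as a bottom-up table fill. This is one common way to evaluate it (the slides describe it as starting from the end; filling the table from the empty prefixes upward computes the same values). The function name edit_distance is a hypothetical choice.

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions and substitutions
    needed to transform s into t."""
    m, n = len(s), len(t)
    # cost[i][j] = distance between s[:i] and t[:j]
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        cost[i][0] = i  # delete all i characters of s[:i]
    for j in range(n + 1):
        cost[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]
            else:
                cost[i][j] = 1 + min(
                    cost[i][j - 1],      # insertion
                    cost[i - 1][j - 1],  # substitution
                    cost[i - 1][j],      # deletion
                )
    return cost[m][n]

d = edit_distance("survey", "surgery")
```

The table has (M+1)*(N+1) cells and each cell takes constant time to fill.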

Full-text scanning
- Complexity of the dynamic program: O(M*N)
- More in your algorithms book...
- Conclusions: full text scanning needs no space overhead, but is slow for large datasets

Text - Detailed outline
- text
  - problem
  - full text scanning
  - inversion
  - signature files
  - clustering
  - information filtering and LSI

Text – Inverted Files

Text – Inverted Files
- Q: space overhead?
- A: mainly, the postings lists
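To make the structure concrete, here is a minimal sketch of an inverted file: a dictionary from each word to its postings list (the ids of the documents containing it). It assumes whitespace tokenization, lowercasing, no stemming, and a hash-based dictionary; the function names and sample documents are made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for word in sorted(set(text.lower().split())):
            index[word].append(doc_id)
    return index

def and_query(index, *terms):
    """Ids of documents containing all the given terms."""
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = ["data retrieval systems", "data mining", "image retrieval"]
index = build_inverted_index(docs)
answer = and_query(index, "data", "retrieval")
```

The dictionary holds one entry per distinct word, while the postings lists hold one entry per (word, document) pair, which is why the lists dominate the space.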

Text – Inverted Files
- how to organize the dictionary?
- stemming – Y/N?
  - Keep only the root of each word, e.g., inverted, inversion -> invert
- insertions?

Text – Inverted Files
- how to organize the dictionary?
  - B-tree, hashing, TRIEs, PATRICIA trees, ...
- stemming – Y/N?
- insertions?

Text – Inverted Files
- postings lists: more
- Zipf distribution, e.g., the rank-frequency plot of the ‘Bible’: log(freq) vs. log(rank)
- freq ~ 1 / (rank * ln(1.78 V)), where V is the vocabulary size

Text – Inverted Files
- postings lists
  - Cutting+Pedersen (keep first 4 in B-tree leaves)
  - how to allocate space: [Faloutsos+92] geometric progression
  - compression (Elias codes) [Zobel+]: down to 2% overhead!
- Conclusions: needs space overhead (2%-300%), but it is the fastest
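As an illustration of how Elias codes can shrink a postings list, the sketch below uses the Elias gamma code. Storing the gaps between consecutive document ids instead of the ids themselves is an assumption made here, as is numbering documents from 1; the function names are hypothetical and the output is a bit string for readability, not packed bits.

```python
def elias_gamma(n):
    """Elias gamma code of a positive integer, as a string of bits:
    (length of binary form - 1) zeros, followed by the binary form."""
    if n < 1:
        raise ValueError("Elias gamma is defined for positive integers only")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def encode_postings(doc_ids):
    """Gamma-encode the gaps of a strictly increasing list of ids >= 1."""
    bits = []
    prev = 0
    for d in doc_ids:
        bits.append(elias_gamma(d - prev))
        prev = d
    return "".join(bits)

encoded = encode_postings([3, 5, 9])  # gaps: 3, 2, 4
```

Small gaps get short codes, so frequent words, whose postings are dense, compress well.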

Text - Detailed outline
- text
  - problem
  - full text scanning
  - inversion
  - signature files
  - Information Retrieval Vector Model and clustering
  - information filtering and LSI

Signature files
- idea: ‘quick & dirty’ filter

Signature files
- then, do a sequential scan on the signature file and discard ‘false alarms’
- Adv.: easy insertions; faster than a sequential scan of the text
- Disadv.: O(N) search (with a small constant)
- Q: how to extract signatures?

Signature files
- A: superimposed coding!! [Mooers49], ...
- Each word sets m bits (here m = 4 bits/word) in a signature of F bits (here F = 12 bits signature size); the document signature is the bitwise OR of its word signatures.
- Query ‘data’: actual match
- Query ‘retrieval’: actual dismissal
- Query ‘nucleotic’: false alarm (‘false drop’)
- ‘YES’ is ‘MAYBE’; ‘NO’ is ‘NO’
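A minimal sketch of superimposed coding with the parameters from the slides (F = 12, m = 4). The way a word is mapped to bit positions is an assumption: here positions are derived from SHA-256 digests purely for reproducibility, and any hash function could play that role. All function names are hypothetical.

```python
import hashlib

F = 12  # signature size in bits (from the slides)
M = 4   # bits set per word (from the slides)

def word_signature(word, f=F, m=M):
    """Set m distinct bit positions out of f for a word (requires m <= f).

    Assumption: positions come from SHA-256 of 'word:counter'.
    """
    sig = 0
    bits_set = 0
    counter = 0
    while bits_set < m:
        digest = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        pos = int.from_bytes(digest[:4], "big") % f
        if not sig & (1 << pos):
            sig |= 1 << pos
            bits_set += 1
        counter += 1
    return sig

def doc_signature(words):
    """Superimpose (bitwise OR) the signatures of all words."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def maybe_contains(doc_sig, word):
    """False means definitely absent; True means 'maybe' (possible false drop)."""
    w = word_signature(word)
    return doc_sig & w == w

doc = doc_signature(["data", "base"])
```

A query word that is in the document always passes the test; a word that is not in the document passes only if all of its bits happen to be set by other words, which is the false-drop case.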

Signature files
- Q1: How to choose F and m?
- Q2: Why is it called ‘false drop’?
- Q3: other apps of signature files?

Signature files
- Q1: How to choose F and m?
- A: so that the document signature is 50% full
- m (= 4 bits/word), F (= 12 bits signature size)

Signature files
- Q2: Why is it called ‘false drop’?
- Old, but fascinating story: how to find qualifying books (by title word, and/or author, and/or keyword) in O(1) time, without computers?

Signature files
- Solution: edge-notched cards...
- Each title word is mapped to m numbers (how?) and the corresponding holes are cut out, e.g., ‘data’ -> #1, #39.
- Search, e.g., for ‘data’: activate needles #1 and #39, and shake the stack of cards!...
- Cards notched at all the activated positions drop out of the stack; a card that drops without actually containing the word is a ‘false drop’.
[Figure: an edge-notched card with numbered hole positions (labels 1, 2, 40), notched at #1 and #39 for ‘data’]

Signature files
- Q3: other apps of signature files?
- A: anything that has to do with ‘membership testing’: does ‘data’ belong to the set of words of the document?

Signature files
- Another name: Bloom filters
- Bloom-joins in System R* [Mackert+] and ‘active disks’ [Riedel99]
- differential files [Severance+Lohman]
- Many other applications in networks and databases [ http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/BloomFilterSurvey.pdf ]

Information Retrieval
- What is the goal of IR?
- Build a system that retrieves documents that users are likely to find relevant to their queries.
- This set of assumptions underlies the field of Information Retrieval.

Some IR History
- Roots in the scientific “Information Explosion” following WWII
- Interest in computer-based IR from the mid 1950’s
- Probabilistic models at Rand (Maron & Kuhns) (1960)
- Boolean system development at Lockheed (‘60s)
- Vector Space Model (Salton at Cornell, 1965)
- Statistical weighting methods and theoretical advances (‘70s)
- User interfaces, large-scale testing and application (‘90s)

Query Languages
- A way to express the question (information need)
- Types:
  - Boolean
  - Natural Language
  - Stylized Natural Language
  - Form-Based (GUI)

Simple query language: Boolean
- Terms + Connectors (or operators)
- terms
  - words
  - normalized (stemmed) words
  - phrases
  - thesaurus terms
- connectors
  - AND
  - OR
  - NOT

Boolean Logic
[Figure: Venn diagram of two sets, A and B]
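The Boolean connectors map directly onto set operations over the sets of documents matching each term. A minimal sketch, with made-up postings and hypothetical function names:

```python
# Hypothetical data: term -> set of ids of documents containing the term.
postings = {
    "data": {1, 2, 4},
    "retrieval": {2, 3, 4},
    "image": {4, 5},
}
all_docs = {1, 2, 3, 4, 5}

def bool_and(a, b):
    return a & b              # intersection

def bool_or(a, b):
    return a | b              # union

def bool_not(a, universe=all_docs):
    return universe - a       # complement within the collection

# 'data' AND 'retrieval' AND NOT 'image'
result = bool_and(
    bool_and(postings["data"], postings["retrieval"]),
    bool_not(postings["image"]),
)
```

Note that NOT needs the whole collection as its universe, which is why it is usually combined with AND rather than evaluated on its own.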

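[Figure: IR system pipeline. Labels: User, Query, Query text input, Parse, Pre-process, Index, Rank, Collections of documents, files, etc.]

[Figure: IR system pipeline with feedback. Labels: Information need, Query, Query text input, Parse, Pre-process, Index, Rank, Collections, Reformulated Query, Re-Rank]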
