
Multimedia Databases Text I. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Data Mining Multimedia Databases Text databases Image.


1 Multimedia Databases Text I

2 Outline: Spatial Databases; Temporal Databases; Spatio-temporal Databases; Data Mining; Multimedia Databases; Text databases; Image and video databases; Time Series databases

3 Multimedia Data Management The need to query and analyze vast amounts of multimedia data (e.g., images, sound tracks, video tracks) has increased in recent years. Joint research from Database Management, Computer Vision, Signal Processing and Pattern Recognition aims to solve problems related to multimedia data management.

4 Multimedia Data There are four major types of multimedia data: images, video sequences, sound tracks, and text. Of these, the easiest type to manage is text, since we can order, index, and search text using string management techniques. Management of simple sounds is also possible by representing audio as signal sequences over different channels. Image retrieval has received a lot of attention in the last decade (CV and DBs). The main techniques can also be extended and applied to video retrieval.

5 Content-based Image Retrieval Images were traditionally managed by first annotating their contents and then using text-retrieval techniques to index them. However, as the volume of information in digital image format grew, some drawbacks of this technique were revealed:
- Manual annotation requires a vast amount of labor.
- Different people may perceive the contents of an image differently; thus no objective keywords for search are defined.
A new research field was born in the 1990s: Content-based Image Retrieval aims at indexing and retrieving images based on their visual contents.

6 Feature Extraction The basis of Content-based Image Retrieval is to extract and index some visual features of the images. There are general features (e.g., color, texture, shape, etc.) and domain-specific features (e.g., objects contained in the image). Domain-specific feature extraction varies with the application domain and is based on pattern recognition. General features, on the other hand, can be used independently of the image domain.

7 Color Features To represent the color of an image compactly, a color histogram is used. Colors are partitioned into k groups (bins) according to their similarity, and the percentage of the image falling in each group is measured. Each image is thus transformed into a k-dimensional point, and a distance metric (e.g., Euclidean distance) is used to measure the similarity between images. (Figure: a histogram with k bins corresponds to a point in k-dimensional space.)
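
The histogram idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: pixels are (r, g, b) tuples in 0-255, and colors are grouped by uniform quantization of each channel, which is a simplification of grouping "according to their similarity". The names `color_histogram`, `euclidean`, and `bins_per_channel` are hypothetical.

```python
import math

def color_histogram(pixels, bins_per_channel=2):
    """Map (r, g, b) pixels to a normalized histogram with
    k = bins_per_channel ** 3 bins (uniform quantization: an assumption)."""
    if not pixels:
        raise ValueError("image has no pixels")
    k = bins_per_channel ** 3
    counts = [0] * k
    step = 256 / bins_per_channel
    for r, g, b in pixels:
        ri, gi, bi = int(r // step), int(g // step), int(b // step)
        counts[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1
    # Fractions per bin, so images of different sizes are comparable.
    return [c / len(pixels) for c in counts]

def euclidean(p, q):
    """Euclidean distance between two k-dimensional points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Three tiny made-up "images".
mostly_red = [(250, 10, 10)] * 3 + [(10, 10, 250)]
all_red = [(255, 0, 0)] * 4
all_blue = [(0, 0, 255)] * 4

h_mixed = color_histogram(mostly_red)
h_red = color_histogram(all_red)
h_blue = color_histogram(all_blue)
```

With these inputs the mostly-red image lies closer to the all-red image than to the all-blue one, which is the behavior the distance metric is meant to capture.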

8 Using Transformations to Reduce Dimensionality In many cases the embedded dimensionality of a search problem is much lower than the actual dimensionality. Some methods apply transformations to the data and approximate them with low-dimensional vectors. The aim is to reduce dimensionality while maintaining the data characteristics. If d(a,b) is the distance between two objects a, b in the real (high-dimensional) space and d’(a’,b’) is their distance in the transformed low-dimensional space, we want d’(a’,b’) ≈ d(a,b).
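
As a rough illustration of comparing d and d’, the sketch below halves the dimensionality by combining neighbouring coordinates. The slides do not name a specific transformation, so the pairwise-average transform (and the names `pairwise_average`, `d_high`, `d_low`) is an assumption chosen only because it is short.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimensionality."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def pairwise_average(v):
    """Halve the dimensionality by combining (v[0], v[1]), (v[2], v[3]), ...
    The 1/sqrt(2) scaling makes this an orthonormal projection."""
    if len(v) % 2:
        raise ValueError("expected an even number of dimensions")
    return [(v[i] + v[i + 1]) / math.sqrt(2) for i in range(0, len(v), 2)]

# Two made-up 8-dimensional objects whose neighbouring values are similar.
a = [1.0, 1.2, 2.0, 2.1, 3.0, 3.1, 4.0, 4.2]
b = [2.0, 2.1, 2.0, 1.9, 5.0, 5.2, 5.0, 4.9]

d_high = euclidean(a, b)                                      # d(a, b)
d_low = euclidean(pairwise_average(a), pairwise_average(b))   # d'(a', b')
```

Because the transform is an orthonormal projection, d’ never exceeds d; it stays close to d when neighbouring coordinates carry similar information, as in this example.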

9 Text - Detailed outline Text databases: problem; full text scanning; inversion; signature files; clustering; information filtering and LSI

10 Problem - Motivation Given a database of documents, find documents containing “data”, “retrieval”. Applications: Web; law + patent offices; digital libraries; information filtering

11 Problem - Motivation Types of queries: boolean (‘data’ AND ‘retrieval’ AND NOT...); additional features (‘data’ ADJACENT ‘retrieval’); keyword queries (‘data’, ‘retrieval’). How to search a large collection of documents?

12 Full-text scanning for single term: (naive: O(N*M)) text: ABRACADABRA; pattern: CAB
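
A minimal sketch of the naive scan, assuming N is the text length and M the pattern length; the function name `naive_search` is hypothetical. Every alignment of the pattern is tried, and each alignment may compare up to M characters, hence O(N*M).

```python
def naive_search(text, pattern):
    """Return all start positions where pattern occurs in text."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):        # try every alignment
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1                        # extend the match left to right
        if j == m:                        # whole pattern matched
            hits.append(start)
    return hits

no_hits = naive_search("ABRACADABRA", "CAB")
abra_hits = naive_search("ABRACADABRA", "ABRA")
```

Note that the slide's pattern CAB does not actually occur in ABRACADABRA (the text contains CAD), so the scan reports no match for it.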

13 Full-text scanning for single term: (naive: O(N*M)) Knuth, Morris and Pratt (‘77): build a small FSA; visit every text letter once only, by carefully shifting more than one step. text: ABRACADABRA; pattern: CAB
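
The Knuth-Morris-Pratt idea can be sketched as follows. This is the common failure-table formulation rather than an explicit automaton, and the names `failure_table` and `kmp_search` are hypothetical. The text index only moves forward, so each text letter is visited once.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Return all start positions of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits = []
    k = 0                                  # pattern characters matched so far
    for i, ch in enumerate(text):          # the text index never moves back
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]                # shift the pattern, reuse the matched prefix
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits
```

Preprocessing takes time proportional to the pattern length and the scan takes time proportional to the text length.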

14 Full-text scanning text: ABRACADABRA; pattern: CAB (figure: the pattern CAB shifted along the text)

15 Full-text scanning for single term: (naive: O(N*M)) Knuth, Morris and Pratt (‘77); Boyer and Moore (‘77): preprocess pattern; start from right to left & skip! text: ABRACADABRA; pattern: CAB
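
The right-to-left comparison and skipping can be illustrated with the Horspool simplification of Boyer-Moore, which keeps only a bad-character shift table. This is a swapped-in simplification, not the full 1977 algorithm, and the name `horspool_search` is hypothetical.

```python
def horspool_search(text, pattern):
    """Boyer-Moore-Horspool: compare right to left, then skip ahead based
    on the text character aligned with the pattern's last position."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # For each character of the pattern except the last: distance from its
    # rightmost occurrence to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1                         # compare from right to left
        if j < 0:
            hits.append(pos)
        # Characters absent from the table allow a full skip of m positions.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```

On the slide's example, the pattern CAB is ruled out after examining only a handful of alignments, because R and D do not occur in the pattern and trigger full skips.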

16 Full-text scanning Approximate matching - string editing distance: d(‘survey’, ‘surgery’) = 2 = min # of insertions, deletions, substitutions to transform the first string into the second. SURVEY -> SURGERY: substitute V with G, insert R.

17 Full-text scanning string editing distance - how to compute? A: dynamic programming. Let s and t be two strings. Then, start from the end and try to find the minimum number of operations: cost(i, j) = cost to match the prefix of length i of the first string s with the prefix of length j of the second string t

18 Full-text scanning
cost(i, 0) = i and cost(0, j) = j (matching against an empty prefix)
if s[i] = t[j] then cost(i, j) = cost(i-1, j-1)
else cost(i, j) = min (
    1 + cost(i, j-1)    // insertion of t[j]
    1 + cost(i-1, j-1)  // substitution
    1 + cost(i-1, j)    // deletion of s[i]
)
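
The recurrence can be filled in bottom-up as a table. A minimal sketch, assuming unit cost for every operation; the function name `edit_distance` is hypothetical.

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions and substitutions
    needed to transform s into t."""
    m, n = len(s), len(t)
    # cost[i][j] = distance between the first i chars of s and first j of t
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        cost[i][0] = i                     # delete all i characters of s
    for j in range(n + 1):
        cost[0][j] = j                     # insert all j characters of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:       # strings are 0-indexed in Python
                cost[i][j] = cost[i - 1][j - 1]
            else:
                cost[i][j] = 1 + min(cost[i][j - 1],      # insert t[j]
                                     cost[i - 1][j - 1],  # substitute
                                     cost[i - 1][j])      # delete s[i]
    return cost[m][n]

d = edit_distance("survey", "surgery")
```

For the slide's example, `d` is 2. The table has (M+1)*(N+1) cells and each is filled in constant time.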

19 Full-text scanning Complexity: O(M*N). More in your algorithms book... Conclusions: full-text scanning needs no space overhead, but is slow for large datasets.

20 Text - Detailed outline text: problem; full text scanning; inversion; signature files; clustering; information filtering and LSI

21 Text – Inverted Files

22 Text – Inverted Files Q: space overhead? A: mainly, the postings lists

23 Text – Inverted Files how to organize dictionary? stemming – Y/N? Keep only the root of each word, e.g., inverted, inversion -> invert. insertions?

24 Text – Inverted Files how to organize dictionary? B-tree, hashing, TRIEs, PATRICIA trees,... stemming – Y/N? insertions?
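
A minimal inverted-file sketch, assuming whitespace tokenization, no stemming, and a Python dict standing in for the dictionary structure (the slides list B-trees, hashing and tries as the real options). The names `build_inverted_index` and `intersect` and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: sorted postings list of doc_ids}."""
    index = defaultdict(list)
    for doc_id in sorted(docs):                       # keeps postings sorted
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)

def intersect(p, q):
    """Merge two sorted postings lists (an AND query)."""
    i = j = 0
    out = []
    while i < len(p) and j < len(q):
        if p[i] == q[j]:
            out.append(p[i])
            i += 1
            j += 1
        elif p[i] < q[j]:
            i += 1
        else:
            j += 1
    return out

docs = {1: "data retrieval systems", 2: "data mining", 3: "image retrieval"}
index = build_inverted_index(docs)
both = intersect(index["data"], index["retrieval"])
```

Here `both` contains only document 1. The postings lists, not the dictionary, account for most of the space.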

25 Text – Inverted Files postings list – more Zipf distr.: e.g., rank-frequency plot of ‘Bible’ (axes: log(rank) vs. log(freq)); freq ~ 1 / (rank * ln(1.78 V)), where V is the vocabulary size
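
The Zipf formula can be checked numerically. The sketch below reads the slide's expression as freq = 1 / (rank * ln(1.78 V)) and assumes V is the vocabulary size; `zipf_fraction` is a hypothetical name and V = 10,000 is a made-up value.

```python
import math

def zipf_fraction(rank, vocab_size):
    """Fraction of all word occurrences contributed by the word of the
    given rank, under Zipf's law with vocabulary size vocab_size."""
    return 1.0 / (rank * math.log(1.78 * vocab_size))

V = 10_000
top = zipf_fraction(1, V)
total = sum(zipf_fraction(r, V) for r in range(1, V + 1))
```

The factor ln(1.78 V) acts as the normalizing constant: summed over all V ranks, the fractions add up to approximately 1. A few top-ranked words therefore have very long postings lists while most words have very short ones.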

26 Text – Inverted Files postings lists: Cutting+Pedersen (keep first 4 in B-tree leaves); how to allocate space: [Faloutsos+92] geometric progression; compression (Elias codes) [Zobel+] – down to 2% overhead! Conclusions: needs space overhead (2%-300%), but it is the fastest
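
A sketch of postings compression with the Elias gamma code. The slide only says "Elias codes"; storing gaps between consecutive document ids, and using the gamma variant, are assumptions made here for illustration. Bit strings are shown as text for readability, and all function names are hypothetical.

```python
def elias_gamma(n):
    """Elias gamma code of a positive integer: (len - 1) zeros,
    followed by the binary representation of n."""
    if n < 1:
        raise ValueError("gamma codes are defined for n >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def encode_postings(postings):
    """Encode a strictly increasing list of positive doc ids as gamma-coded gaps."""
    if not postings:
        return ""
    gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
    return "".join(elias_gamma(g) for g in gaps)

def decode_postings(bits):
    """Inverse of encode_postings."""
    out, pos, prev = [], 0, 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":            # unary length prefix
            zeros += 1
            pos += 1
        gap = int(bits[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        prev += gap                        # undo the gap encoding
        out.append(prev)
    return out

encoded = encode_postings([3, 4, 9, 10])
```

Small gaps, which dominate the long postings lists of frequent words, get the shortest codes.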

27 Text - Detailed outline text: problem; full text scanning; inversion; signature files; Information Retrieval; Vector Model and clustering; information filtering and LSI

28 Signature files idea: ‘quick & dirty’ filter

29 Signature files then, do seq. scan on sign. file and discard ‘false alarms’. Adv.: easy insertions; faster than seq. scan. Disadv.: O(N) search (with small constant). Q: how to extract signatures?

30 Signature files A: superimposed coding!! [Mooers49],... m (=4 bits/word), F (=12 bits sign. size)

31 Signature files A: superimposed coding!! [Mooers49],... query ‘data’: actual match

32 Signature files A: superimposed coding!! [Mooers49],... query ‘retrieval’: actual dismissal

33 Signature files A: superimposed coding!! [Mooers49],... query ‘nucleotic’: false alarm (‘false drop’)

34 Signature files A: superimposed coding!! [Mooers49],... ‘YES’ is ‘MAYBE’; ‘NO’ is ‘NO’
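
Superimposed coding can be sketched as follows, using the slide's F = 12 and m = 4. How words are mapped to bit positions is not given in the slides; hashing with SHA-256 is an assumption made here only to get deterministic positions, and all function names are hypothetical.

```python
import hashlib

F = 12   # signature size in bits
M = 4    # bits set per word (fewer if positions collide)

def word_signature(word, f=F, m=M):
    """Set up to m bit positions, chosen by hashing, in an f-bit signature."""
    sig = 0
    for i in range(m):
        digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % f)
    return sig

def doc_signature(words):
    """Superimpose (bitwise OR) the signatures of all words of a document."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def maybe_contains(doc_sig, word):
    """False means the word is certainly absent; True means 'maybe',
    so the document itself must be checked to discard false alarms."""
    ws = word_signature(word)
    return (doc_sig & ws) == ws

sig = doc_signature(["data", "base"])
```

A word that is in the document always passes the test, since its bits were ORed into the signature. A word that is absent usually fails, but may pass by coincidence: that is the false alarm.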

35 Signature files Q1: How to choose F and m? Q2: Why is it called ‘false drop’? Q3: other apps of signature files?

36 Signature files Q1: How to choose F and m? A: so that doc. signature is 50% full. m (=4 bits/word), F (=12 bits sign. size)
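
The 50% rule can be turned into a sizing estimate. The sketch assumes each of the D distinct words in a document sets m bit positions chosen uniformly and independently, so a given bit stays 0 with probability (1 - 1/F)^(m*D); setting that to 1/2 gives roughly F = m*D / ln 2. The symbol D and both function names are hypothetical.

```python
import math

def expected_fill(F, m, D):
    """Expected fraction of 1 bits in an F-bit signature after
    superimposing D words with m bits each (uniform, independent positions)."""
    return 1 - (1 - 1 / F) ** (m * D)

def signature_size_for_half_full(m, D):
    """Smallest integer F at or above m * D / ln 2, the size at which
    the signature is about 50% full."""
    return math.ceil(m * D / math.log(2))

fill = expected_fill(12, 4, 2)
size = signature_size_for_half_full(4, 2)
```

Under this model, the slide's F = 12 and m = 4 correspond to a document of about two distinct words, for which the signature is close to half full.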

37 Signature files Q1: How to choose F and m? Q2: Why is it called ‘false drop’? Q3: other apps of signature files?

38 Signature files Q2: Why is it called ‘false drop’? Old, but fascinating story: how to find qualifying books (by title word, and/or author, and/or keyword) in O(1) time, without computers...

39 Signature files Solution: edge-notched cards (holes numbered 1, 2, ..., 40). Each title word is mapped to m numbers (how?) and the corresponding holes are cut out:

40 Signature files Solution: edge-notched cards (holes numbered 1, 2, ..., 40). ‘data’ -> #1, #39

41 Signature files Search, e.g., for ‘data’: activate needle #1, #39, and shake the stack of cards! ‘data’ -> #1, #39

42 Signature files Q3: other apps of signature files? A: anything that has to do with ‘membership testing’: does ‘data’ belong to the set of words of the document?

43 Signature files Another name: Bloom Filters. Bloom-joins in System R* [Mackert+] and ‘active disks’ [Riedel99]; differential files [Severance+Lohman]; many other applications in networks and databases [ http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/BloomFilterSurvey.pdf ]

44 Information Retrieval What is the goal of IR? Build a system that retrieves documents that users are likely to find relevant to their queries. This set of assumptions underlies the field of Information Retrieval.

45 Some IR History Roots in the scientific “Information Explosion” following WWII; interest in computer-based IR from mid 1950’s; probabilistic models at Rand (Maron & Kuhns) (1960); Boolean system development at Lockheed (‘60s); Vector Space Model (Salton at Cornell 1965); statistical weighting methods and theoretical advances (‘70s); user interfaces, large-scale testing and application (‘90s)

46 Query Languages A way to express the question (information need). Types: Boolean; Natural Language; Stylized Natural Language; Form-Based (GUI)

47 Simple query language: Boolean Terms + Connectors (or operators). Terms: words; normalized (stemmed) words; phrases; thesaurus terms. Connectors: AND, OR, NOT
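
Terms and connectors map directly onto set operations over the documents that contain each term. A minimal sketch with made-up postings; the dictionary `postings`, the set `all_docs`, and the helper names are hypothetical.

```python
# Hypothetical collection of four documents: term -> set of doc ids.
postings = {
    "data": {1, 2, 4},
    "retrieval": {1, 3, 4},
    "image": {3, 4},
}
all_docs = {1, 2, 3, 4}

def term(t):
    """Documents containing term t (empty set for unknown terms)."""
    return postings.get(t, set())

def q_and(a, b):
    return a & b              # in both result sets

def q_or(a, b):
    return a | b              # in either result set

def q_not(a):
    return all_docs - a       # complement within the collection

# 'data' AND 'retrieval' AND NOT 'image'
result = q_and(q_and(term("data"), term("retrieval")), q_not(term("image")))
either = q_or(term("data"), term("image"))
```

Here `result` is the single document 1, and `either` is the whole collection. NOT needs the full set of document ids, which is why it is usually combined with AND rather than evaluated on its own.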

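48 Boolean Logic (figure: two sets, A and B)

49 (Diagram of an IR system; labels: User, Query, Index, Pre-process, Parse, Collections of documents, files, etc., Rank, Query text input)

50 (Diagram of an IR system with feedback; labels: Information need, Index, Pre-process, Parse, Collections, Rank, Query, text input, Reformulated Query, Re-Rank)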

