15-826: Multimedia Databases and Data Mining


15-826: Multimedia Databases and Data Mining Lecture #14: Text – Part I C. Faloutsos

Must-read Material
MM Textbook, Chapter 6
15-826 Copyright: C. Faloutsos (2017)

Outline
Goal: ‘Find similar / interesting things’
- Intro to DB
- Indexing - similarity search
- Data Mining

Indexing - Detailed outline
- primary key indexing
- secondary key / multi-key indexing
- spatial access methods
- fractals
- text
- multimedia
- ...

Text - Detailed outline
- text
  - problem
  - full text scanning
  - inversion
  - signature files
  - clustering
  - information filtering and LSI


Problem - Motivation
E.g., find documents containing “data”, “retrieval”
Applications:
- Web
- law + patent offices
- digital libraries
- information filtering


Problem - Motivation
Types of queries:
- boolean (‘data’ AND ‘retrieval’ AND NOT ...)
- additional features (‘data’ ADJACENT ‘retrieval’)
- keyword queries (‘data’, ‘retrieval’)
How to search a large collection of documents?

Full-text scanning
Build an FSA (finite state automaton) for the pattern; scan the text with it.
[Figure: FSA with transitions labeled a, c, t]
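The "build an FSA; scan" idea can be sketched as follows. This is only an illustrative sketch, not the construction used in the lecture: the names `build_fsa` and `fsa_scan` are made up here, the pattern is assumed to be non-empty, and the table is built the slow, obvious way to keep the code short.

```python
def build_fsa(pattern, alphabet):
    """Return delta, where delta[state][ch] is the next state.

    State k means: the last k characters read equal the first k
    characters of the pattern. (Hypothetical helper, non-empty pattern.)
    """
    m = len(pattern)
    delta = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for ch in alphabet:
            # Longest prefix of the pattern that is a suffix of what
            # we have matched so far plus the new character.
            candidate = pattern[:state] + ch
            k = min(m, state + 1)
            while k > 0 and not candidate.endswith(pattern[:k]):
                k -= 1
            delta[state][ch] = k
    return delta


def fsa_scan(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    delta = build_fsa(pattern, set(text) | set(pattern))
    matches = []
    state = 0
    for i, ch in enumerate(text):
        state = delta[state][ch]      # exactly one table lookup per text letter
        if state == len(pattern):     # accepting state: full pattern just seen
            matches.append(i - len(pattern) + 1)
    return matches
```

Once the table exists, the scan touches each text letter exactly once; for example, `fsa_scan("concatenate cat", "cat")` reports the two positions where "cat" starts.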


Full-text scanning
for single term (N: text length, M: pattern length):
- (naive: O(N*M))
- Knuth, Morris and Pratt (‘77): build a small FSA; visit every text letter once only, by carefully shifting more than one step
Example: text ABRACADABRA, pattern CAB
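As a baseline, the naive O(N*M) scan can be sketched like this. The function name `naive_scan` is an assumption made for illustration; the code simply tries every alignment of the pattern against the text.

```python
def naive_scan(text, pattern):
    """Try every alignment; worst case O(N*M) character comparisons."""
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):
        j = 0
        # Compare left to right until a mismatch or the pattern ends.
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(start)
    return matches
```

On the slide's example, `naive_scan("ABRACADABRA", "CAB")` finds nothing, since the text contains CAD but not CAB.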

Full-text scanning
[Figure: text ABRACADABRA with pattern CAB shown at successive alignments as it is shifted along the text]
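A minimal sketch of Knuth-Morris-Pratt, in the usual failure-table formulation, is given below. The names `failure_table` and `kmp_search` are illustrative, not from the slides. The table plays the role of the small FSA: on a mismatch the text pointer never moves backwards, only the pattern position falls back.

```python
def failure_table(pattern):
    """fail[j] = length of the longest proper prefix of pattern[:j+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    return fail


def kmp_search(text, pattern):
    """Return all start indices of pattern in text; each text letter is read once."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]          # shift the pattern, keep the text position
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]          # allow overlapping matches
    return matches
```

The text loop advances `i` monotonically, which is the "visit every text letter once only" property from the slide.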

Full-text scanning
for single term:
- (naive: O(N*M))
- Knuth, Morris and Pratt (‘77)
- Boyer and Moore (‘77): preprocess the pattern; compare from right to left & skip!
Example: text ABRACADABRA, pattern CAB

Full-text scanning
[Figure: text ABRACADABRA with pattern CAB shown at three successive alignments]
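The right-to-left-and-skip idea can be illustrated with Horspool's simplification of Boyer-Moore, which keeps only a bad-character shift table. This is a swapped-in, simpler variant, not the full Boyer-Moore algorithm, and the name `horspool_search` is made up for this sketch.

```python
def horspool_search(text, pattern):
    """Boyer-Moore-Horspool: compare right to left, skip by a shift table."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Preprocess the pattern: for each character except the last one,
    # record its distance from the end (later occurrences overwrite earlier).
    shift = {ch: m - 1 - idx for idx, ch in enumerate(pattern[:-1])}
    matches = []
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1                    # compare from the right end of the window
        if j < 0:
            matches.append(pos)
        # Skip based on the text character under the window's last position;
        # characters absent from the pattern allow a full jump of m.
        pos += shift.get(text[pos + m - 1], m)
    return matches
```

When the character under the window's last position does not occur in the pattern, the window jumps by the whole pattern length, which is why long patterns over large alphabets tend to be scanned quickly.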

Full-text scanning
Example: text ABRACADABRA, pattern OMINOUS
- Boyer+Moore: fastest, in practice
- Sunday (‘90): some improvements
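One of Sunday's variants, often called Quick Search, can be sketched as below. The name `sunday_search` is an assumption, and this is only the simplest form: the shift is decided by the text character just past the current window.

```python
def sunday_search(text, pattern):
    """Quick Search sketch: shift by the character just past the window."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # For each pattern character, distance from its last occurrence
    # to one position past the end of the pattern.
    shift = {ch: m - idx for idx, ch in enumerate(pattern)}
    matches = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            matches.append(pos)
        if pos + m >= n:
            break                     # no character past the window to inspect
        # A character that is not in the pattern permits a jump of m + 1.
        pos += shift.get(text[pos + m], m + 1)
    return matches
```

With text ABRACADABRA and pattern OMINOUS, the first window fails, the next text character (A) is not in the pattern, and the window jumps past the end of the text after a single alignment.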

Full-text scanning
- For multiple terms (w/o “don’t care” characters): Aho+Corasick (‘75): again, build a simplified FSA in O(M) time
- Probabilistic algorithms: ‘fingerprints’ (Karp + Rabin ‘87)
- approximate match: ‘agrep’ [Wu+Manber, Baeza-Yates+, ‘92]; available on unix/linux/win/mac
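The fingerprint idea can be sketched with a rolling hash in the style of Karp and Rabin. The function name, the `base` and the `modulus` values are arbitrary choices for illustration; this version verifies each fingerprint hit, so it never reports a false match.

```python
def karp_rabin_search(text, pattern, base=256, modulus=1_000_003):
    """Compare fingerprints of each text window against the pattern's."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, modulus)  # weight of the window's leading character
    p_hash = w_hash = 0
    for k in range(m):
        p_hash = (p_hash * base + ord(pattern[k])) % modulus
        w_hash = (w_hash * base + ord(text[k])) % modulus
    matches = []
    for pos in range(n - m + 1):
        # Equal fingerprints only suggest a match; verify to rule out collisions.
        if w_hash == p_hash and text[pos:pos + m] == pattern:
            matches.append(pos)
        if pos + m < n:
            # Roll the window: drop text[pos], append text[pos + m].
            w_hash = ((w_hash - ord(text[pos]) * high) * base
                      + ord(text[pos + m])) % modulus
    return matches
```

Dropping the verification step would make the method probabilistic in the sense of the slide: fast, with a small chance of reporting a window whose fingerprint collides with the pattern's.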

Full-text scanning
Approximate matching - string editing distance:
d(‘survey’, ‘surgery’) = 2
= min # of insertions, deletions, substitutions to transform the first string into the second
SURVEY
SURGERY


Full-text scanning
string editing distance - how to compute?
A: dynamic programming
cost(i, j) = cost to match prefix of length i of first string s with prefix of length j of second string t

Full-text scanning
base cases: cost(i, 0) = i, cost(0, j) = j
if s[i] = t[j] then
    cost(i, j) = cost(i-1, j-1)
else
    cost(i, j) = min(
        1 + cost(i, j-1),     // deletion
        1 + cost(i-1, j-1),   // substitution
        1 + cost(i-1, j)      // insertion
    )
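The recurrence above can be turned into a bottom-up table fill. This is a sketch under two assumptions: the function name `edit_distance` is invented here, and the base cases (an empty prefix costs as many edits as the other prefix is long) are the standard ones. Python strings are 0-indexed, so s[i] on the slide is `s[i - 1]` in the code.

```python
def edit_distance(s, t):
    """String editing distance via dynamic programming over prefixes."""
    m, n = len(s), len(t)
    # cost[i][j] = distance between the first i letters of s
    # and the first j letters of t.
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        cost[i][0] = i            # base case: second prefix is empty
    for j in range(n + 1):
        cost[0][j] = j            # base case: first prefix is empty
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]
            else:
                cost[i][j] = 1 + min(
                    cost[i][j - 1],       # first alternative on the slide
                    cost[i - 1][j - 1],   # substitution
                    cost[i - 1][j],       # third alternative on the slide
                )
    return cost[m][n]
```

For the slide's example, `edit_distance("survey", "surgery")` evaluates to 2.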

String editing distance
[Figure: dynamic-programming matrix for SURVEY vs. SURGERY, filled in cell by cell over successive slides; the final slide marks one substitution ("subst.") and one deletion ("del.")]

Full-text scanning
Complexity: O(M*N) for strings of lengths M and N (when using a matrix to ‘memoize’ partial results)
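To make the memoization point concrete, here is a top-down sketch that caches each cost(i, j) and also reports how many distinct cells were computed. The name `edit_distance_memo` is hypothetical, and `functools.lru_cache` stands in for the matrix. The recursion is only suitable for short strings because of Python's recursion limit.

```python
from functools import lru_cache


def edit_distance_memo(s, t):
    """Return (distance, number of distinct (i, j) cells evaluated)."""

    @lru_cache(maxsize=None)
    def cost(i, j):
        if i == 0:
            return j              # first prefix empty
        if j == 0:
            return i              # second prefix empty
        if s[i - 1] == t[j - 1]:
            return cost(i - 1, j - 1)
        return 1 + min(cost(i, j - 1), cost(i - 1, j - 1), cost(i - 1, j))

    distance = cost(len(s), len(t))
    # Each (i, j) pair is computed at most once, so the number of cached
    # cells is bounded by (M + 1) * (N + 1).
    return distance, cost.cache_info().currsize
```

Without the cache the same sub-problems would be recomputed many times; with it, the work is bounded by the size of the matrix.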

Full-text scanning
Conclusions: Full text scanning needs no space overhead, but is slow for large datasets