
1 Indexing Mixed Types for Approximate Retrieval
Liang Jin* (UC Irvine), Nick Koudas (University of Toronto), Chen Li* (UC Irvine), Anthony K.H. Tung (National University of Singapore)
VLDB 2005
* Liang Jin and Chen Li: supported by NSF CAREER Award IIS-0238586

2 Queries with Mixed-Type Predicates
Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama
… | … | … | …
SELECT * FROM Movies WHERE star SIMILARTO 'Schwarrzenger' AND |year - 1980| <= 5;
SIMILARTO:
– a domain-specific function
– returns a similarity value between two strings
Example: edit distance ed(Tom Hanks, Ton Hank) = 2
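The edit-distance example above can be checked with a short sketch. This is the textbook Levenshtein dynamic program, not code from the talk; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2: substitute m->n, delete s
```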

3 Why fuzzy predicates?
Errors in queries
– User doesn't remember a string exactly
– User types a wrong string
Errors in databases:
– Data is not clean
– Especially true in data integration and cleansing
[Figure: the Star columns of Relation R and Relation S, which hold variants of the same names, such as "Samuel Jackson" and "Samuel L. Jackson", alongside "Schwarzenegger" and "Keanu Reeves"]

4 Problem Formulation
SELECT * FROM Movies WHERE star SIMILARTO 'Schwarrzenger' AND |year - 1980| <= 5;
Given: a query on a single relation, with fuzzy predicates on strings and range predicates on numeric attributes
Goal: answer the query efficiently

5 Rest of the talk
Motivation: supporting queries with mixed-type predicates
Our approach: MAT tree
Construction and maintenance of MAT tree
Experiments

6 Assumptions
SELECT * FROM Movies WHERE star SIMILARTO 'Schwarrzenger' AND |year - 1980| <= 5;
One fuzzy string predicate (edit distance)
One numeric predicate
Query: (Qs, δs, Qn, δn) = ('Schwarrzenger', 2, 1980, 5)
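As a rough illustration of the query form (Qs, δs, Qn, δn), the sketch below evaluates such a query by sequential scan, the simplest baseline. The names `Query`, `MOVIES`, and `sequential_scan` are hypothetical, the records are the (star, year) pairs from the sample table, and the query string used here is an illustrative misspelling chosen for this sketch.

```python
from typing import NamedTuple


class Query(NamedTuple):
    qs: str        # query string Qs
    delta_s: int   # edit-distance threshold δs
    qn: int        # query number Qn
    delta_n: int   # numeric threshold δn


def edit_distance(a: str, b: str) -> int:
    """Textbook Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


# (star, year) pairs taken from the sample Movies table.
MOVIES = [("Keanu Reeves", 1999), ("Samuel Jackson", 2005),
          ("Schwarzenegger", 1984), ("Samuel Jackson", 1990)]


def sequential_scan(records, q: Query):
    """Check both predicates on every record; no index is used."""
    return [(s, n) for s, n in records
            if abs(n - q.qn) <= q.delta_n
            and edit_distance(q.qs, s) <= q.delta_s]


# Illustrative query: one missing 'g', years 1975..1985.
print(sequential_scan(MOVIES, Query("Schwarzeneger", 2, 1980, 5)))
```

Every record is touched regardless of how selective the predicates are, which is the cost an index is meant to avoid.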

7 Intuition of MAT (Mixed-attribute-type) Tree
"2 > 1 + 1"
– One integrated indexing structure is better than two independent indexing structures on two attributes
Indexing numeric attributes: B-tree or R-tree
Indexing strings as a tree to support fuzzy predicates?
MAT tree

8 Answering a query (Qs, δs, Qn, δn)
Traverse the MAT-tree top-down
At each node, prune by checking:
– whether [Qn - δn, Qn + δn] overlaps with the node's numeric range
– whether minEditDistance(Qs, Tn) <= δs
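The two pruning checks can be sketched as a recursive traversal. This is a simplified stand-in, not the MAT-tree itself: each node here stores the plain set of strings below it instead of a compressed trie Tn, and `min_edit_distance` is computed by brute force. The names `Node`, `make_leaf`, `make_inner`, and `search` are hypothetical.

```python
from dataclasses import dataclass, field


def edit_distance(a: str, b: str) -> int:
    """Textbook Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


@dataclass
class Node:
    lo: int                 # smallest number in the subtree
    hi: int                 # largest number in the subtree
    strings: frozenset      # assumption: plain string set instead of a compressed trie
    children: list = field(default_factory=list)
    records: list = field(default_factory=list)  # (string, number) pairs, leaves only


def make_leaf(records):
    nums = [n for _, n in records]
    return Node(min(nums), max(nums), frozenset(s for s, _ in records),
                records=list(records))


def make_inner(children):
    return Node(min(c.lo for c in children), max(c.hi for c in children),
                frozenset().union(*(c.strings for c in children)),
                children=list(children))


def min_edit_distance(qs, strings):
    return min(edit_distance(qs, s) for s in strings)


def search(node, qs, ds, qn, dn):
    # Check 1: the query interval must overlap the node's numeric range.
    if node.hi < qn - dn or node.lo > qn + dn:
        return []
    # Check 2: some string below the node must be within distance ds.
    if min_edit_distance(qs, node.strings) > ds:
        return []
    if not node.children:  # leaf: verify each record exactly
        return [(s, n) for s, n in node.records
                if abs(n - qn) <= dn and edit_distance(qs, s) <= ds]
    found = []
    for child in node.children:
        found.extend(search(child, qs, ds, qn, dn))
    return found


leaf1 = make_leaf([("Keanu Reeves", 1999), ("Samuel Jackson", 2005)])
leaf2 = make_leaf([("Schwarzenegger", 1984), ("Samuel Jackson", 1990)])
root = make_inner([leaf1, leaf2])
print(search(root, "Schwarzeneger", 2, 1980, 5))
```

In this toy tree the first leaf is pruned by the numeric check alone, so none of its strings are compared against the query.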

9 Challenge
How to represent strings so that they fit into a limited space and still support fuzzy-predicate pruning
Limited space (disk based)

10 Existing Approaches to Indexing Strings as Trees
M-tree:
– Edit distance: metric space
Q-tree:
– Utilizes the q-gram property of strings
– See our paper for details
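The slide does not spell out which q-gram property is meant. One well-known candidate is count filtering: a single edit operation changes at most q q-grams, so two strings within edit distance k must share at least max(|s1|, |s2|) - q + 1 - k·q q-grams. The sketch below checks that necessary condition; it is an assumption about the kind of property involved, not a description of the Q-tree, and the function names are hypothetical.

```python
from collections import Counter


def qgrams(s: str, q: int) -> Counter:
    """Multiset of the overlapping substrings of length q (no padding)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def passes_count_filter(s1: str, s2: str, k: int, q: int = 2) -> bool:
    """Necessary (not sufficient) condition for ed(s1, s2) <= k."""
    shared = sum((qgrams(s1, q) & qgrams(s2, q)).values())
    required = max(len(s1), len(s2)) - q + 1 - k * q
    return shared >= required


print(passes_count_filter("Tom Hanks", "Ton Hank", k=2))  # True: cannot be ruled out
print(passes_count_filter("abcdef", "uvwxyz", k=1))       # False: safely pruned
```

A pair that passes the filter still needs an exact edit-distance check; a pair that fails can be discarded without one.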

11 Representing strings as a trie
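For readers unfamiliar with tries, a minimal sketch follows. Strings that share a prefix share a path from the root, which is what makes the structure compact for names with common beginnings. The class and function names are hypothetical and not taken from the talk.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.terminal = False  # True if a stored string ends at this node


def insert(root: TrieNode, s: str) -> None:
    node = root
    for ch in s:
        node = node.children.setdefault(ch, TrieNode())
    node.terminal = True


def contains(root: TrieNode, s: str) -> bool:
    node = root
    for ch in s:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.terminal


def count_nodes(node: TrieNode) -> int:
    return 1 + sum(count_nodes(c) for c in node.children.values())


root = TrieNode()
for name in ["Samuel Jackson", "Samuel L. Jackson", "Keanu Reeves"]:
    insert(root, name)

print(contains(root, "Samuel Jackson"))  # True
print(contains(root, "Samuel"))          # False: only a prefix
```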

12 Compressing a trie
Select k representative nodes (centers).
Each center is in the format of.
A compressed trie represents more strings than the original trie.
[Figure: a trie before and after compression]

13 Minimum edit distance between a string and a trie
minEditDistance(Qs, Tn)?
– Convert the trie to an automaton.
– Compute the minimum distance between a string and an automaton [Myers and Miller, 1989]
– Early termination is possible
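The sketch below computes the minimum edit distance between a query string and the strings stored in an uncompressed trie. It is not the automaton-based algorithm of Myers and Miller cited above; it swaps in a simpler technique, carrying one dynamic-programming row down each trie path. Early termination relies on the fact that the minimum of a row never decreases along a path. All names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.terminal = False


def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True
    return root


def min_edit_distance(query: str, root: TrieNode) -> float:
    """Smallest edit distance from query to any string in the trie
    (float('inf') if the trie stores no strings)."""
    best = float("inf")

    def visit(node, row):
        # row[j] = edit distance between the current path and query[:j]
        nonlocal best
        if node.terminal:
            best = min(best, row[-1])
        # Early termination: no descendant can do better than min(row).
        if min(row) >= best:
            return
        for ch, child in node.children.items():
            nxt = [row[0] + 1]
            for j, qc in enumerate(query, start=1):
                nxt.append(min(row[j] + 1,                # path char unmatched
                               nxt[j - 1] + 1,            # query char unmatched
                               row[j - 1] + (qc != ch)))  # substitute or match
            visit(child, nxt)

    visit(root, list(range(len(query) + 1)))
    return best


trie = build_trie(["Tom Hanks", "Keanu Reeves"])
print(min_edit_distance("Ton Hank", trie))  # 2
```

Shared prefixes are processed once, so the work grows with the number of trie nodes visited and not with the total length of all stored strings.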

14 Compressed trie → Automaton
Each node is a state.
Each edge becomes a transition between two states.
For a compressed node, expand it to L levels.
At each level, all characters in Σ become single states and are connected to a common tail ε.
[Figure: converting a compressed node into automaton nodes]

15 Outline
Motivation: supporting queries with mixed-type predicates
Our approach: MAT tree
Construction and maintenance of MAT tree
Experiments

16 Constructing MAT-tree
Option 1: insert records one by one.
Option 2:
– bulk-load records
– construct the MAT-tree bottom-up

17 Compressing a trie
Important:
– Accurately represent strings in a limited space.
– Minimize "information loss".
– Maintain the pruning power during a traversal.
Three methods:
– (1) Reducing # of accepted strings
– (2) Keeping accepted strings "clustered"
– (3) Combining (1) and (2)

18 Method (1): Reducing # of accepted strings
Intuition:
– reducing this # makes the compressed trie more accurate
Goodness function: # of accepted strings
Algorithm: "Randomized"
– Randomly select k initial centers
– Randomly select one of the centers
– Randomly select an unselected node
– Swap them if the swap improves the goodness function
– Repeat for a certain # of iterations
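The "Randomized" steps amount to a swap-based local search, sketched below under stated assumptions. The goodness function is passed in as a callable where lower is better (fewer accepted strings); how the number of accepted strings is computed for a given set of centers is not covered here, so the demo uses a toy goodness that simply sums hypothetical per-node costs. The function name, parameters, and the fixed seed are illustrative choices.

```python
import random


def randomized_centers(nodes, k, goodness, iterations=1000, seed=0):
    """Swap-based local search over which k nodes act as centers.

    goodness(centers) returns a number; lower is treated as better.
    """
    rng = random.Random(seed)
    centers = rng.sample(nodes, k)                    # k random initial centers
    rest = [n for n in nodes if n not in centers]     # unselected nodes
    score = goodness(centers)
    for _ in range(iterations):
        if not rest:
            break
        i = rng.randrange(len(centers))               # a random center
        j = rng.randrange(len(rest))                  # a random unselected node
        centers[i], rest[j] = rest[j], centers[i]     # tentative swap
        trial = goodness(centers)
        if trial < score:
            score = trial                             # keep the improvement
        else:
            centers[i], rest[j] = rest[j], centers[i] # undo the swap
    return centers, score


# Toy setup: node ids 0..9, with a made-up cost equal to the id itself.
nodes = list(range(10))
centers, score = randomized_centers(nodes, k=3, goodness=sum)
print(sorted(centers), score)
```

Because a swap is kept only when it improves the score, the result is a local optimum with respect to single swaps; more iterations or restarts trade running time for quality.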

19 Method (2): Keeping accepted strings clustered
Intuition:
– keep the accepted strings similar to the original ones by letting them share a common prefix
– place the k centers as close to the root as possible
Algorithm: "BreadthFirst"
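Placing centers as close to the root as possible suggests a breadth-first walk that stops after k nodes. The sketch below does exactly that on a trie stored as nested dictionaries and identifies each node by the prefix leading to it. Whether the root itself counts as a center, and the alphabetical tie-breaking among siblings, are assumptions made here.

```python
from collections import deque


def build_trie(words):
    """Nested-dict trie: each key is a character, each value a sub-trie.
    End-of-string markers are omitted because center selection ignores them."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
    return root


def breadth_first_centers(trie, k):
    """Return the prefixes of the first k nodes in breadth-first order."""
    centers = []
    queue = deque([("", trie)])
    while queue and len(centers) < k:
        prefix, node = queue.popleft()
        centers.append(prefix)
        for ch in sorted(node):  # assumption: alphabetical order among siblings
            queue.append((prefix + ch, node[ch]))
    return centers


trie = build_trie(["ab", "ac", "b"])
print(breadth_first_centers(trie, 3))  # ['', 'a', 'b']
```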

20 Method (3): Combining (1) and (2)
Intuition:
– minimize the number of accepted strings and, at the same time, maintain their similarity to the originals
Algorithm: "Bottomup"
– Keep shrinking the trie bottom-up until we have k nodes
– At each step, compress the node that minimizes the # of additional strings

21 Dynamic maintenance
Insertion (s, n)
Search the index for (s, n). If it's not in the index, identify the correct leaf node.
If no overflow:
– update the "MBR" of the leaf node and, recursively, of its ancestors if necessary
If overflow:
– split the leaf node
– construct two compressed tries
– cascade the split to the ancestors if necessary
Deletion and update are handled similarly

22 Outline
Motivation: supporting queries with mixed-type predicates
Our approach: MAT tree
Construction and maintenance of MAT tree
Experiments

23 Setting
Data
– IMDB: 100K movie star records (Name and YOB)
– Customers: 50K records (Name and YOB)
Test bed
– PC: 2.4 GHz P4, 1.2 GB memory, Windows XP
– Visual C++ compiler
Results on the two data sets were similar; the results reported here are for IMDB.

24 Implemented approaches
B-tree
Q-tree
B-tree & Q-tree
BQ-tree
BM-tree
Sequential scan
"BBQ-tree"?

25 "2 > 1 + 1"
An integrated indexing structure is better than two separate indexing structures
δs = 3, δn = 4

26 Scalability

27 Effect of numeric threshold δn

28 Effect of string threshold δs

29 Dynamic maintenance: time

30 Dynamic maintenance: MAT quality

31 Number of centers
Increasing the number of centers (clusters) may not reduce the running time: pruning power versus computational cost
For BottomUp and BreadthFirst (compared to Randomized):
– Centers are close to the root, so early termination is more likely

32 Conclusion
MAT-tree: an efficient indexing structure for queries with mixed-type predicates
Can be efficiently constructed and maintained
Future work: develop a uniform framework to support different kinds of similarity functions
Q&A?
The Flamingo Project: http://www.ics.uci.edu/~flamingo/

