Liang Jin * UC Irvine Nick Koudas University of Toronto Chen Li * UC Irvine Anthony K.H. Tung National University of Singapore * Liang Jin and Chen Li:

1 Indexing Mixed Types for Approximate Retrieval
Liang Jin* (UC Irvine), Nick Koudas (University of Toronto), Chen Li* (UC Irvine), Anthony K.H. Tung (National University of Singapore)
* Liang Jin and Chen Li: supported by NSF CAREER Award IIS-0238586

2 Queries with Mixed-Type Predicates
Star            Title                                         Year  Genre
Keanu Reeves    The Matrix                                    1999  Sci-Fi
Samuel Jackson  Star Wars: Episode III - Revenge of the Sith  2005  Sci-Fi
Schwarzenegger  The Terminator                                1984  Sci-Fi
Samuel Jackson  Goodfellas                                    1990  Drama
…               …                                             …     …
SELECT * FROM Movies WHERE star SIMILARTO 'Schwarrzenger' AND |year - 1980| <= 5;
SIMILARTO:
– a domain-specific function
– returns a similarity value between two strings
Example: edit distance ed(Tom Hanks, Ton Hank) = 2
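The edit-distance example on this slide can be checked with a short sketch. It assumes the standard Levenshtein definition (unit-cost insertions, deletions, and substitutions); the function name edit_distance is chosen here for illustration and does not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

The two edits are the substitution m → n and the deletion of the trailing s.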

3 Why fuzzy predicates?
Errors in queries:
– User doesn't remember a string exactly
– User types a wrong string
Errors in databases:
– Data is not clean
– Especially true in data integration and cleansing
Relation R (Star): Keanu Reeves, Samuel Jackson, Schwarzenegger, Samuel Jackson, …
Relation S (Star): Keanu Reeves, Samuel L. Jackson, Schwarzenegger, Samuel L. Jackson, …

4 Problem Formulation
SELECT * FROM Movies WHERE star SIMILARTO 'Schwarrzenger' AND |year - 1980| <= 5;
Given: a query with fuzzy predicates on strings and range predicates on numeric attributes, over a single relation
Goal: answer the query efficiently

5 Rest of the talk
– Motivation: supporting queries with mixed-type predicates
– Our approach: MAT tree
– Construction and maintenance of MAT tree
– Experiments

6 Assumptions
SELECT * FROM Movies WHERE star SIMILARTO 'Schwarrzenger' AND |year - 1980| <= 5;
– One fuzzy string predicate (edit distance)
– One numeric predicate
Query: (Qs, δs, Qn, δn) = ('Schwarrzenger', 2, 1980, 5)

7 Intuition of MAT (Mixed-Attribute-Type) Tree
“2 > 1 + 1”: one integrated indexing structure is better than two independent indexing structures on two attributes
Indexing numeric attributes: B-tree or R-tree
Indexing strings as a tree to support fuzzy predicates?
MAT tree

8 Answering a query (Qs, δs, Qn, δn)
Traverse the MAT-tree top-down. At each node, prune by checking:
– whether [Qn - δn, Qn + δn] overlaps the node's numeric range;
– whether minEditDistance(Qs, Tn) <= δs, where Tn is the trie stored at the node.
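The traversal above can be sketched as follows. This is a simplified illustration, not the authors' implementation: each node keeps a plain list of strings as a stand-in for its compressed trie Tn, and minEditDistance is computed by brute force over that list. The Node class, its field names, and the sample records (taken from the call-center backup slide) are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


@dataclass
class Node:
    lo: int                    # smallest numeric value under this node
    hi: int                    # largest numeric value under this node
    strings: List[str]         # stand-in for the node's compressed trie Tn
    children: List["Node"] = field(default_factory=list)
    records: List[Tuple[str, int]] = field(default_factory=list)  # leaves only


def search(node: Node, qs: str, ds: int, qn: int, dn: int):
    """Answer (Qs, δs, Qn, δn) by top-down traversal with pruning."""
    # Numeric pruning: skip the subtree if the ranges do not overlap.
    if node.hi < qn - dn or node.lo > qn + dn:
        return []
    # String pruning: skip the subtree if even the closest string is too far.
    if min(edit_distance(qs, s) for s in node.strings) > ds:
        return []
    if not node.children:  # leaf: verify each record against both predicates
        return [(s, n) for s, n in node.records
                if abs(n - qn) <= dn and edit_distance(qs, s) <= ds]
    results = []
    for child in node.children:
        results.extend(search(child, qs, ds, qn, dn))
    return results


leaf_a = Node(1956, 1962, ["Tom Hanks", "Harrison Ford"],
              records=[("Tom Hanks", 1956), ("Harrison Ford", 1962)])
leaf_b = Node(1870, 1978, ["Jack Lemmon", "Tim Legler"],
              records=[("Jack Lemmon", 1978), ("Tim Legler", 1870)])
root = Node(1870, 1978, leaf_a.strings + leaf_b.strings,
            children=[leaf_a, leaf_b])

print(search(root, "Ton Hank", 2, 1958, 5))  # [('Tom Hanks', 1956)]
```

In this run leaf_b passes the numeric check but is pruned by the string check, so its records are never examined.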

9 Challenge
How to represent strings so that they fit into a limited space (disk based) and still support fuzzy-predicate pruning

10 Existing Approaches to Indexing Strings as Trees
M-tree:
– Edit distance: metric space
Q-tree:
– Utilizes the q-gram property of strings.
– See our paper for details

11 Representing strings as a trie

12 Compressing a trie
Select k representative nodes (centers).
Each center is in the format of.
A compressed trie represents more strings than the original trie.

13 Minimum edit distance between a string and a trie
minEditDistance(Qs, Tn)?
– Convert the trie to an automaton.
– Compute the minimum distance between a string and an automaton [Myers and Miller, 1989].
– Early termination is possible.
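As a rough illustration of the idea, the sketch below computes the minimum edit distance between a query string and an uncompressed trie. It does not use the automaton construction cited on the slide; it swaps in a simpler, well-known technique that carries one dynamic-programming row per trie node and stops descending once every entry of the row exceeds the threshold (one form of early termination). TrieNode, build_trie, and min_edit_distance are names assumed for this example.

```python
from typing import Dict, Iterable, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_end = False  # True if a stored string ends at this node


def build_trie(words: Iterable[str]) -> TrieNode:
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root


def min_edit_distance(query: str, root: TrieNode,
                      threshold: int) -> Optional[int]:
    """Smallest edit distance between query and any string in the trie,
    or None if every stored string is farther away than threshold."""
    best: Optional[int] = None
    # row[j] = edit distance between the current trie prefix and query[:j].
    stack = [(root, list(range(len(query) + 1)))]
    while stack:
        node, row = stack.pop()
        if node.is_end and row[-1] <= threshold:
            best = row[-1] if best is None else min(best, row[-1])
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # insertion
                                   row[j] + 1,           # deletion
                                   row[j - 1] + cost))   # match/substitute
            # Early termination: the row minimum never decreases further
            # down the trie, so this subtree cannot contain a match.
            if min(new_row) <= threshold:
                stack.append((child, new_row))
    return best


trie = build_trie(["Tom Hanks", "Tim Legler", "Harrison Ford"])
print(min_edit_distance("Ton Hank", trie, 2))  # 2
```

A pruning test in a tree traversal only needs to know whether the result is at most δs, so returning None for "farther than the threshold" is sufficient there.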

14 Compressed trie → Automaton
Each node is a state.
Each edge becomes a transition between two states.
For a compressed node, expand it to L levels. At each level, all characters in Σ become single states and are connected to a common tail ε.
Convert a compressed node into automaton nodes.

15 Outline
– Motivation: supporting queries with mixed-type predicates
– Our approach: MAT tree
– Construction and maintenance of MAT tree
– Experiments

16 Constructing MAT-tree
Option 1: insert records one by one.
Option 2:
– bulk-load records
– construct the MAT-tree bottom-up

17 Compressing a trie
Important:
– Accurately represent strings in a limited space.
– Minimize “information loss”.
– Maintain the pruning power during a traversal.
Three methods:
– (1) Reducing # of accepted strings
– (2) Keeping accepted strings “clustered”
– (3) Combining (1) and (2)

18 Method (1): Reducing # of accepted strings
Intuition:
– Reducing this # makes the compressed trie more accurate.
Goodness function: # of accepted strings
Algorithm: “Randomized”
– Randomly select k initial centers.
– Randomly select one of the centers.
– Randomly select an unselected node.
– Swap them if doing so improves the goodness function.
– Repeat for a certain # of iterations.
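The “Randomized” steps can be sketched as a swap-based local search. The slides do not give the center format or how the number of accepted strings is computed, so the goodness function is passed in as a caller-supplied cost (lower is better). The function name randomized_centers, its parameters, and the toy cost used in the usage lines are all assumptions for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def randomized_centers(nodes: Sequence[T], k: int,
                       cost: Callable[[List[T]], float],
                       iterations: int, seed: int = 0
                       ) -> Tuple[List[T], float]:
    """Pick k centers by random swaps that lower cost(centers).

    cost stands in for the slide's goodness function
    (# of accepted strings); it is supplied by the caller.
    """
    rng = random.Random(seed)
    centers = rng.sample(list(nodes), k)      # k random initial centers
    best = cost(centers)
    for _ in range(iterations):
        unselected = [n for n in nodes if n not in centers]
        if not unselected:                    # every node is already a center
            break
        i = rng.randrange(k)                  # random current center
        candidate = rng.choice(unselected)    # random unselected node
        trial = centers[:i] + [candidate] + centers[i + 1:]
        trial_cost = cost(trial)
        if trial_cost < best:                 # keep the swap only if it helps
            centers, best = trial, trial_cost
    return centers, best


# Toy usage: nodes are integers and the hypothetical cost is their sum.
chosen, value = randomized_centers(range(10), k=3, cost=sum, iterations=200)
print(sorted(chosen), value)
```

Because a swap is kept only when it lowers the cost, the cost of the selected centers never increases over the iterations.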

19 Method (2): Keeping accepted strings clustered
Intuition:
– Keep the accepted strings similar to the original ones by letting them share a common prefix.
– Place the k centers as close to the root as possible.
Algorithm: “BreadthFirst”
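One plausible reading of “BreadthFirst” is to take the first k trie nodes in breadth-first order, which places the centers as close to the root as possible. The sketch below follows that reading; it is an interpretation, not a procedure confirmed by the slides. The nested-dict trie, the identification of a node by its prefix, and the choice to count the root as a candidate are all assumptions.

```python
from collections import deque
from typing import Dict, Iterable, List


def build_trie(words: Iterable[str]) -> Dict:
    """Trie as nested dicts: each key is a character, each value a subtrie."""
    root: Dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def breadth_first_centers(root: Dict, k: int) -> List[str]:
    """Return the prefixes of the first k nodes in breadth-first order."""
    centers: List[str] = []
    queue = deque([("", root)])          # (prefix identifying the node, node)
    while queue and len(centers) < k:
        prefix, node = queue.popleft()
        centers.append(prefix)
        for ch in sorted(node):          # sorted for a deterministic order
            queue.append((prefix + ch, node[ch]))
    return centers


trie = build_trie(["tom", "tim", "ann"])
print(breadth_first_centers(trie, 4))  # ['', 'a', 't', 'an']
```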

20 Method (3): Combining (1) and (2)
Intuition:
– Minimize the number of accepted strings and, at the same time, maintain their similarity to the originals.
Algorithm: “BottomUp”
– Keep shrinking the trie bottom-up until we have k nodes.
– At each step, compress the node that minimizes the # of additional strings.

21 Dynamic maintenance
Insertion of (s, n):
– Search the index for (s, n). If it is not in the index, identify the correct leaf node.
– If no overflow: update the “MBR” of the leaf node and of its ancestors recursively, if necessary.
– If overflow: split the leaf node, construct two compressed tries, and cascade the split to the ancestors if necessary.
Deletion and update are handled similarly.

22 Outline
– Motivation: supporting queries with mixed-type predicates
– Our approach: MAT tree
– Construction and maintenance of MAT tree
– Experiments

23 Setting
Data:
– IMDB: 100K movie star records (name and year of birth, YOB)
– Customers: 50K records (name and YOB)
Test bed:
– PC: 2.4 GHz P4, 1.2 GB memory, Windows XP
– Visual C++ compiler
Results on the two data sets were similar; we report those for IMDB.

24 Implemented approaches
– B-tree
– Q-tree
– B-tree & Q-tree
– BQ-tree
– BM-tree
– Sequential scan
“BBQ-tree”?

25 “2 > 1 + 1”
An integrated indexing structure is better than two separate indexing structures
δs = 3, δn = 4

26 Scalability

27 Effect of numeric threshold δn

28 Effect of string threshold δs

29 Dynamic maintenance: time

30 Dynamic maintenance: MAT quality

31 Number of centers
Increasing the number of centers may not reduce the running time: there is a trade-off between pruning power and computational cost.
For BottomUp and BreadthFirst (compared to Randomized): centers are close to the root, so early termination is more likely.

32 Conclusion
MAT-tree: an efficient indexing structure for queries with mixed-type predicates
Can be efficiently constructed and maintained
Future work: develop a uniform framework to support different kinds of similarity functions
Q&A?
The Flamingo Project: http://www.ics.uci.edu/~flamingo/

33 Backup Slides

34 Constructing MAT-tree
Option 1: insert records one by one.
Option 2: bulk-load the data records and construct the MAT-tree bottom-up.
– Records are sorted on one attribute.
– Pages are filled with records until full.
– The numeric range and the compressed trie are calculated for each leaf node.
– Leaf nodes are merged into internal nodes recursively, according to the desired fanout, until a single root is formed.
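The bulk-loading steps can be sketched as follows, under several assumptions: records are sorted on the numeric attribute (the slide does not say which attribute is used), each node keeps a plain set of strings as a stand-in for its compressed trie, and page_size and fanout are illustrative parameters. The Node class and bulk_load function are names made up for this example.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Record = Tuple[str, int]  # (string attribute, numeric attribute)


@dataclass
class Node:
    lo: int                      # numeric range covered by this node
    hi: int
    strings: Set[str]            # stand-in for the node's compressed trie
    children: List["Node"] = field(default_factory=list)
    records: List[Record] = field(default_factory=list)


def bulk_load(records: List[Record], page_size: int, fanout: int) -> Node:
    """Build the tree bottom-up from a list of records."""
    if not records or page_size < 1 or fanout < 2:
        raise ValueError("need records, page_size >= 1 and fanout >= 2")
    # Step 1: sort on one attribute (assumed here: the numeric one).
    ordered = sorted(records, key=lambda r: r[1])
    # Steps 2 and 3: fill pages, then summarize each leaf.
    level = []
    for i in range(0, len(ordered), page_size):
        page = ordered[i:i + page_size]
        level.append(Node(lo=page[0][1], hi=page[-1][1],
                          strings={s for s, _ in page}, records=page))
    # Step 4: merge groups of `fanout` nodes until one root remains.
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            parents.append(Node(lo=min(c.lo for c in group),
                                hi=max(c.hi for c in group),
                                strings=set().union(*(c.strings for c in group)),
                                children=group))
        level = parents
    return level[0]


data = [("Jack Lemmon", 1978), ("Harrison Ford", 1962),
        ("Tom Hanks", 1956), ("Tim Legler", 1870)]
root = bulk_load(data, page_size=2, fanout=2)
print(root.lo, root.hi, len(root.children))  # 1870 1978 2
```

In an actual MAT-tree each node would store a compressed trie built with one of the three compression methods rather than the raw string set.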

35 Example – Customer Service Call Center
Name           SSN           YOB
Jack Lemmon    430-871-8294  1978
Harrison Ford  292-918-2913  1962
Tom Hanks      234-762-1234  1956
Tim Legler     125-457-8654  1870
…              …             …
– Customer calls in
– Issue a fuzzy query: Name LIKE “Tom Hanks” AND YOB CLOSE to 1958
– Return result
– Serve the customer
In this example, the underlying system should be able to support fuzzy queries on both the string and numeric attributes!

36 Scalability test (IO)

