Supporting Location-based Approximate-Keyword Queries
ACM International Conference on Geographical Information Systems 2010
S. Alsubaiee, A. Behm, C. Li (University of California, Irvine)
Presenter: Raghav Karumur
Date: 3/30/2011
Course: [CSCI 8735] Advanced Database Systems
Department of Computer Science and Engineering, University of Minnesota, Twin Cities, Spring 2011

Lunch Time!
I'll go for Chinese food! What was the restaurant's name??? Uh… Ch-o-chi???

Let me Find It!

Errr… Just one typo!

Outline
- Overview
- Problem Formulation and Preliminaries
- Contributions
- Algorithms
- Index Construction
- Experiments and Analysis
- Conclusion
- References

Overview
Terminology:
- Location-based keyword search consists of a set of keywords plus a spatial location.
- Goal: find objects with these keywords close to the location.
Example: a user is looking for a restaurant named Chaochi close to San Jose. Consider the query
Q1: (Chaochi) near (San Jose)
The website returns listings close to San Jose that have the keyword Chaochi.
Problem: inconsistencies can exist in the user queries, in the data, or in both, because users and uploaders may enter a wrong spelling:
Q1: (Chochi) near (San Jose)
With exact matching, Q1 may not be able to find the restaurant, because its name is mistyped. Hence, support for approximate keyword search is necessary!

Overview
Approach used so far: build a collection of keywords similar to the mistyped keyword, and either suggest another query or find objects with these keywords.
Drawback of this approach: no support for handling spatial and textual information simultaneously.

Problem Formulation
Object collection: chaochi restaurant, starbucks, apple store, sams club, …
Query: find objects in San Jose with keywords similar to chochi & resturant

Problem Formulation
Location Based Keyword Search: Given a collection of strings, find those that are similar to the given query string.
Consider a collection of spatial objects o_1, …, o_n, each having a set of keywords and a location.
A spatial approximate-keyword query Q = <Q_s, Q_t> consists of two conditions:
- a spatial condition Q_s, such as a rectangle or a circle, and
- an approximate keyword condition Q_t, having a set of k <keyword, threshold> pairs, each representing a keyword w_i with an associated similarity threshold δ_i.
Goal: find all objects in the collection within Q_s that satisfy Q_t.
An object satisfies Q_t if, for each keyword w_i in Q_t, the object has a keyword in its description whose similarity to w_i is within the corresponding threshold δ_i.
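The query semantics above can be sketched as a brute-force check, without any index. This is only an illustration under stated assumptions: the spatial condition is taken to be an axis-aligned rectangle, the similarity function is edit distance, and the names `SpatialObject`, `satisfies`, and the sample coordinates are hypothetical, not from the paper.

```python
from dataclasses import dataclass


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


@dataclass
class SpatialObject:  # hypothetical container for one object o_i
    name: str
    x: float
    y: float
    keywords: frozenset


def satisfies(obj, rect, keyword_conditions):
    """True if obj lies in rect (Q_s) and matches every (w_i, delta_i) pair (Q_t)."""
    x_min, y_min, x_max, y_max = rect
    if not (x_min <= obj.x <= x_max and y_min <= obj.y <= y_max):
        return False
    return all(
        any(edit_distance(w, kw) <= delta for kw in obj.keywords)
        for w, delta in keyword_conditions
    )


# Made-up sample data; the rectangle stands in for "San Jose".
objects = [
    SpatialObject("Chaochi Restaurant", 1.0, 1.0, frozenset({"chaochi", "restaurant"})),
    SpatialObject("Starbucks", 2.0, 2.0, frozenset({"starbucks", "coffee"})),
]
query_rect = (0.0, 0.0, 5.0, 5.0)
query_keywords = [("chochi", 1), ("resturant", 1)]
answers = [o.name for o in objects if satisfies(o, query_rect, query_keywords)]
```

A linear scan like this is what the LBAK-tree is meant to avoid; it only pins down which objects count as answers.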

Problem Formulation
Approaches:
- Approximate string indexing: trie-based method or inverted-index method.
- Combine the spatial index and the approximate index.
- Search the resultant index, called the LBAK-tree, to find answers.

Preliminaries: Location-Based Keyword Search
- Find objects within a given spatial region that have a given set of keywords.
- Augment a hierarchical spatial index with textual information.

Preliminaries: Approximate String Search
Collection of strings s: …, chaochi, chucho, church
Query q: chochi
Search output: strings s that are within the threshold δ of q (for edit distance, Sim(q, s) ≤ δ).
Sim functions: edit distance, Jaccard, cosine, etc.

Preliminaries: Approximate String Search
Gram-based inverted index:
- A sliding window over chaochi yields the 2-grams {ch, ha, ao, oc, ch, hi}.
- Intuition: similar strings share a certain number of grams.
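A minimal sketch of a gram-based inverted index follows. The candidate step uses the standard count filter for edit distance (a string within edit distance k of the query must share at least (|query| - q + 1) - k*q grams with it). That filter is an assumption of this sketch, not something stated on the slide, and the function names are hypothetical. Candidates still need to be verified with the real similarity function.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """Slide a window of width q over s and return the grams in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_gram_index(strings, q=2):
    """Map each gram to the set of ids of the strings that contain it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].add(sid)
    return index


def candidates(query, strings, index, q=2, k=1):
    """Strings sharing enough grams with the query to possibly be within
    edit distance k. This is a filter only: false positives are possible."""
    threshold = (len(query) - q + 1) - k * q
    if threshold <= 0:
        return set(strings)  # the filter cannot prune anything
    counts = defaultdict(int)
    for g in qgrams(query, q):
        for sid in index.get(g, ()):
            counts[sid] += 1
    return {strings[sid] for sid, c in counts.items() if c >= threshold}


strings = ["chaochi", "chucho", "church"]
index = build_gram_index(strings)
found = candidates("chochi", strings, index, q=2, k=1)
```

Here `church` is filtered out by gram counts alone, while `chucho` survives as a false positive that a later edit-distance check would reject.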

Solution
Tree-based spatial index + keyword search capability + approximate string search capability = LBAK-Tree

Contributions
How to combine those indexes: three algorithms
1) Simple fixed-level solution
2) Utilizing local spatial distribution of objects
3) Exploiting frequency distribution of keywords

What is used:
- Queries with a spatial condition are typically supported by a tree-based index such as an R*-tree, KD-tree, or Quad-tree. The R*-tree is used in this paper.
- Most trie-based indexes are specific to edit distance and its variants, and do not support other similarity measures such as Jaccard. Inverted indexes, however, usually support a family of similarity metrics such as edit distance and Jaccard. An inverted index is therefore used in this paper.
- In this paper, the LBAK-tree is used: a tree-based spatial index augmented with capabilities for approximate keyword search. A gram-based inverted index is used to perform approximate string search.

The LBAK tree
LBAK nodes may be classified into three categories:
S-Nodes:
- Do not store any textual information.
- Used only for pruning based on the spatial condition.
SA-Nodes:
- Store the union of the keywords of their subtree.
- Store an approximate index on these keywords.
- Used for finding similar keywords.
- Used for pruning based on the spatial and approximate conditions.
SK-Nodes:
- Store the union of the keywords of their subtree.
- Used for pruning with the spatial condition and keywords.
- The relevant similar keywords must already have been identified by the time we reach this node.

Alg 1: Simple Fixed Level Solution
Query: objects in San Jose with keywords similar to chochi & resturant
- Based on an edit distance of 1
- Expressed as Q = <San Jose, {<chochi, 1>, <resturant, 1>}>
The query clearly has typos. Assume nodes A, B, C, D satisfy the spatial condition San Jose. Throughout the traversal of the tree we always check the spatial condition. At the S-Node A, we rely only on the spatial condition for pruning.

Alg 1: Simple Fixed Level Solution
When we reach SA-node B, we search its approximate index to find keywords similar to chochi and resturant according to the edit-distance threshold of 1. We find two keywords similar to chochi (namely, chaochi and choochi) and one keyword similar to resturant (namely, restaurant).

Alg 1: Simple Fixed Level Solution
Once we visit the SK-nodes C and D, we intersect their stored keywords with {chaochi, choochi} and {restaurant}, respectively. Clearly, node C can be pruned as it does not have the keyword restaurant.

Alg 1: Simple Fixed Level Solution
Since node D has the keywords chaochi and restaurant, we traverse its children. We repeat the process until we find the answers.
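The walk-through over nodes A, B, C, and D can be sketched as a recursive traversal. This is an illustrative sketch, not the paper's implementation: the SA-node's approximate index is replaced by a plain scan with edit distance, objects are points stored directly in the SK-nodes, and the `Node` layout, coordinates, and object names are made up. It assumes every SK-node and every object sits below some SA-node.

```python
from dataclasses import dataclass, field


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def intersects(a, b):
    """Overlap test for rectangles given as (x_min, y_min, x_max, y_max)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Node:  # hypothetical node layout
    kind: str                                     # "S", "SA", or "SK"
    mbr: tuple                                    # bounding rectangle
    keywords: set = field(default_factory=set)    # union of subtree keywords
    children: list = field(default_factory=list)
    objects: list = field(default_factory=list)   # (name, x, y, keywords)


def search(node, rect, conditions, similar=None):
    """Return names of objects inside rect that match every (keyword, threshold)."""
    if not intersects(node.mbr, rect):            # spatial pruning at every node
        return []
    if node.kind == "SA":
        # Stand-in for the approximate index lookup: one set per query keyword.
        similar = [{kw for kw in node.keywords if edit_distance(w, kw) <= d}
                   for w, d in conditions]
    elif node.kind == "SK":
        # Keep only the similar keywords that occur in this subtree.
        similar = [s & node.keywords for s in similar]
    if similar is not None and any(not s for s in similar):
        return []                                 # some query keyword has no match
    results = [name for name, x, y, kws in node.objects
               if rect[0] <= x <= rect[2] and rect[1] <= y <= rect[3]
               and all(kws & s for s in similar)]
    for child in node.children:
        results.extend(search(child, rect, conditions, similar))
    return results


C = Node("SK", (0, 0, 2, 2), {"choochi", "cafe"},
         objects=[("Choochi Cafe", 1, 1, {"choochi", "cafe"})])
D = Node("SK", (2, 2, 4, 4), {"chaochi", "restaurant", "starbucks", "coffee"},
         objects=[("Chaochi Restaurant", 3, 3, {"chaochi", "restaurant"}),
                  ("Starbucks", 3.5, 3.5, {"starbucks", "coffee"})])
B = Node("SA", (0, 0, 4, 4), C.keywords | D.keywords, children=[C, D])
A = Node("S", (0, 0, 10, 10), children=[B])

result = search(A, (0, 0, 5, 5), [("chochi", 1), ("resturant", 1)])
```

As in the slides, C is pruned because nothing in its subtree is similar to resturant, and only D's children are examined.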

How to Choose Level L?
- Trade-off between space and time: until some level, both increase.
- Usually, about 90% of query time is spent in approx. index lookups.
- Therefore, choosing an optimal level L for the placement of approx. indexes can greatly improve avg. query time.

Observations
- Query time and index size are sensitive to the approximate-index locations.
- The fixed-level solution ignores the local spatial distribution of objects.
- If a node is sparse, we might consider placing the index at its descendants (prefer to build approximate indexes at the children).
- If a node is dense, we build the index at the node itself, because a query region is likely to overlap with many of its children (prefer to build the approximate index at the parent).

Algorithm 2: Placing Approximate Indexes at Variable Levels
[Figure: tree with S-Nodes (Spatial Nodes), SA-Nodes (Spatial-Approximate Nodes), and SK-Nodes (Spatial-Keyword Nodes)]

Selecting Nodes for Approximate Indexes
Goal: find the optimal set of nodes that should have approximate indexes.
Optimization problem: given an R*-tree and a space budget, choose nodes from the tree to store approximate indexes, such that the average query time of a given workload is minimized. This is an NP-hard problem!

Greedy Algorithm: Selecting Nodes for Approximate Indexes
[Figure: example tree with nodes N1 to N15]
A greedy algorithm, SelectSANodes, is developed that traverses the tree top-down and tries to push approx. indexes down the most promising paths.

Selecting Nodes for Approximate Indexes
- The algorithm maintains a priority queue of nodes to be traversed.
- The priority of a node n is defined as the benefit of storing multiple approximate indexes at its children, as compared to building a single index at n.
- For each visited node n, if the benefit of building multiple approximate indexes at n's children is negative, then the algorithm selects n to be an SA-Node and does not traverse its children.
- If the algorithm reaches a leaf node, it immediately selects the leaf to be an SA-Node.
- The algorithm terminates when the space budget is exhausted or there is no more benefit in pushing approximate indexes down the tree.
The benefit is expressed in terms of four quantities: pTime, the average query time of probing the approx. index at the parent; cTime, this time if the indexes were built at the children; and pSpace and cSpace, the corresponding space costs of the indexes. [Formula not preserved]

Selecting Nodes for Approximate Indexes
- W_n denotes the set of stored keywords at node n.
- If r is the root, the benefit of storing the approximate indexes at r's children is computed by b(n). [Formulas for b(n) not preserved]
- The algorithm starts traversing the tree by popping the pair with the highest benefit.
- The space cost of an approx. index on a keyword set W, used to cost building multiple approx. indexes at n's children, is computed by
  s(W) = |W| * (average keyword length - q + 1) * (size of each inverted-list element)
  where q is the gram length, W is the set of keywords, and the average keyword length is that of the particular data set.
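The control flow of the greedy selection can be sketched as below. Several parts are assumptions: the benefit formula is not available here, so `benefit` is a caller-supplied function. The budget handling (keep the index at a node when pushing it down would exceed the budget) is one possible reading of "terminates when the space budget is exhausted". `TreeNode`, the default parameter values, and the sample benefits are hypothetical.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass
class TreeNode:  # hypothetical stand-in for an R*-tree node
    name: str
    keywords: set
    children: list = field(default_factory=list)


def space_cost(keywords, avg_len=8, q=2, elem_size=4):
    """s(W) = |W| * (average keyword length - q + 1) * element size.
    The default values are made up for illustration."""
    return len(keywords) * (avg_len - q + 1) * elem_size


def select_sa_nodes(root, budget, benefit):
    """Greedy top-down choice of the nodes that receive an approximate index."""
    def priority(node):
        # Leaves cannot be pushed down further, so their priority is arbitrary.
        return benefit(node) if node.children else 0.0

    sa_nodes = []
    used = space_cost(root.keywords)          # start with one index at the root
    tie = itertools.count()                   # tie-breaker so nodes are never compared
    heap = [(-priority(root), next(tie), root)]
    while heap:
        neg_b, _, node = heapq.heappop(heap)  # highest benefit first
        if not node.children or -neg_b < 0:
            sa_nodes.append(node)             # leaf, or pushing down does not pay off
            continue
        extra = (sum(space_cost(c.keywords) for c in node.children)
                 - space_cost(node.keywords))
        if used + extra > budget:
            sa_nodes.append(node)             # not enough space to push down
            continue
        used += extra
        for child in node.children:
            heapq.heappush(heap, (-priority(child), next(tie), child))
    return sa_nodes


x = TreeNode("X", {"a", "b"}, [TreeNode("X1", {"a"}), TreeNode("X2", {"b"})])
y = TreeNode("Y", {"a", "c", "d"})
r = TreeNode("R", {"a", "b", "c", "d"}, [x, y])
benefits = {"R": 2.0, "X": -1.0}              # made-up benefit values

chosen = select_sa_nodes(r, budget=150, benefit=lambda n: benefits[n.name])
chosen_tight = select_sa_nodes(r, budget=120, benefit=lambda n: benefits[n.name])
```

With the larger budget the index moves from R down to X and Y; X keeps its index because its benefit is negative. With the tighter budget the index stays at R.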

Cost/Benefit Estimation
Effects of pushing the index down:
- Increases space cost
- Increases or decreases average query time
Typically:
- Higher levels: good to push the index down
- Intermediate levels: unclear whether to push it down

Lookup time of an approx. index
- Clearly depends on the size of the index.
- Experimentally determined to be linear in the index size.
- Thus the avg. lookup time of an approximate index on a keyword set W is estimated to be
  t(W) = slope * |W| + intercept
  where the slope and intercept are implementation dependent and can be experimentally determined.

Size      Time      Slope
1         0.02      -
10000     0.207     0.000019
1M        22.253    0.000022
10M       210.152   0.000021
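The slope column can be rechecked from the size and time columns as the change in time divided by the change in size between consecutive rows. The sketch below does that and then applies the linear estimate. Using the time measured at size 1 as the intercept is an assumption made here for illustration, and the function names are hypothetical.

```python
# (index size in keywords, measured lookup time) pairs from the table above
measurements = [
    (1, 0.02),
    (10_000, 0.207),
    (1_000_000, 22.253),
    (10_000_000, 210.152),
]


def slopes(points):
    """Slope between each pair of consecutive measurements."""
    return [(t1 - t0) / (n1 - n0)
            for (n0, t0), (n1, t1) in zip(points, points[1:])]


def estimate_lookup_time(num_keywords, slope, intercept):
    """t(W) = slope * |W| + intercept."""
    return slope * num_keywords + intercept


observed = slopes(measurements)
# Assumption: take the last slope and the size-1 time as rough parameters.
estimate = estimate_lookup_time(500_000, slope=observed[-1], intercept=0.02)
```

The recomputed slopes agree with the table to six decimal places, which supports the linear model over the measured range.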

Algorithm 3: Exploiting Frequency Distribution of Keywords
- The frequency distribution of keywords is in general skewed. Example: in a business listings dataset, a keyword such as restaurant occurs more frequently than consulate.
- In order to reduce the number of keywords in the approx. indexes, we remove frequent keywords from sibling nodes and place them in their common parent instead.
- As a result, approx. indexes now appear even in the S-nodes. Thus, S-Nodes now contain approx. indexes for frequent words, whereas SA-Nodes contain approx. indexes for infrequent words.

Index Construction
- A keyword is said to be frequent at a node n if the fraction of n's children having that keyword is greater than a certain threshold value.
- A small threshold decreases the space cost of the approx. indexes. On the other hand, avg. query time may increase because we could visit false-positive nodes, since not all of n's children actually contain the frequent keywords. Those false positives will be pruned at SK-nodes.
- Updated benefit of a node: [formula not preserved]

Index Construction
Updated SelectSANodes algorithm:
- To discover frequent keywords in the tree, two sets of keywords are maintained for each node n: a set of infrequent keywords W_n and a set of frequent keywords F_n. A node's frequent and infrequent keywords are identified by examining its children.
- It is also ensured that popular keywords appear only at the root of a subtree, i.e., if a keyword w is frequent at node n, then w is removed from the approx. keyword sets in all of n's children.
- The propagation of frequent and infrequent keywords is performed bottom-up until the keyword sets of all nodes have been filled.
- The next step is to choose the nodes to build approx. indexes on. We use the updated benefit of a node instead of the benefit of a node. P(n) denotes the probability of n satisfying the spatial condition of any query in a workload.
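One step of the bottom-up propagation can be sketched as follows: given the keyword sets of a node's children, split the keywords into frequent and infrequent ones and remove the frequent ones from the children. The function name `split_keywords` and the sample keyword sets are hypothetical, and the sketch ignores how the sets are stored in the actual tree.

```python
from collections import Counter


def split_keywords(child_keyword_sets, threshold):
    """Return (F_n, W_n, pruned child sets) for one parent node.

    A keyword is frequent if the fraction of children containing it is
    strictly greater than the threshold."""
    n_children = len(child_keyword_sets)
    counts = Counter(kw for kws in child_keyword_sets for kw in set(kws))
    frequent = {kw for kw, c in counts.items() if c / n_children > threshold}
    infrequent = set(counts) - frequent
    # Frequent keywords live only at the parent, so drop them from the children.
    pruned_children = [set(kws) - frequent for kws in child_keyword_sets]
    return frequent, infrequent, pruned_children


children = [
    {"restaurant", "chaochi"},
    {"restaurant", "sushi"},
    {"restaurant", "consulate"},
    {"cafe", "chaochi"},
]
frequent, infrequent, pruned = split_keywords(children, threshold=0.5)
```

In this example only restaurant (present in three of four children) crosses the 0.5 threshold, so it moves to the parent and disappears from the children's sets.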

Incremental Maintenance of Indexes
If the insertion causes a split in the R*-tree:
- For the two new nodes generated by the split, recompute the stored sets of keywords (frequent and infrequent) by examining their children.
- Propagate all the new keywords up to the root, retraverse the tree, and rebuild the approx. indexes at the places where a split has occurred (identified by a split marker).
Else:
- First insert the object into the leaf according to the standard R*-tree procedure.
- Then the keywords of the new object are propagated bottom-up.
- At an SK-Node, we add the new keywords to its stored set of keywords.
- At an SA-Node, we add the keywords to its approx. index.
- At an S-Node, we check its children for new frequent keywords and add them to its approx. index.

Experiments and Analysis
Datasets used:
- CoPhIR Test Collection (Flickr)
- Business listings data (Florida International University)
Packages used: Flamingo
Approaches evaluated:
- Fixed-level approach (FL)
- Variable-level approach (VL)
Setup:
- Processed the dataset to extract photos taken in the US, based on their latitude and longitude values.
- Used the keywords in the title, description, and tags of a photo as its textual attribute.
- Compared with the MHR-tree (contemporary paper).
- Used edit distance with threshold 2 for both approaches.
- Since the MHR-tree is probabilistic, it could miss answers, whereas the LBAK-tree does not. However, the MHR-tree has a comparatively small index size, which the LBAK-tree does not.

Experiments and Analysis
- Recall of the MHR-tree: consistently below 50%.
- Fig. (b): the signature size was increased to achieve higher recall. Query time also increased as the number of edit-distance computations increased, because the approx. keyword condition is validated at level.
- Comparing VLF with the MHR-tree: MHR has a smaller index size, but VLF has a smaller query time.

Experiments and Analysis
Size of index components for various construction algorithms:
- As the approx. indexes are pushed down the tree, the space requirement increases because of redundant keywords in adjacent nodes.
- Query time decreases because a few smaller indexes are searched instead of one big index.

Experiments and Analysis
Effect of the index construction methods on query performance:
- The VL and VLF curves are smoother because they are more flexible than FL!
- They intersect at some point because of redundant keywords.
- At points of intersection, obviously VLF performs better!
How frequent are keywords? Decided by the frequency threshold:
- threshold = 0: every keyword is frequent
- threshold > 1: no keyword is frequent
- The whole range of values in [0, 1] is plotted.
- Clear space-time trade-off with the keyword frequency threshold!
- An increase in the threshold pushes more keywords to lower levels, which causes space overhead due to infrequent keywords being duplicated at multiple nodes.

Conclusion
- Spatial index + approximate index = LBAK-tree
- Three algorithms: simple fixed-level solution; utilizing local spatial distribution of objects; exploiting frequency distribution of keywords
- Developed a cost-based model with reduced index size and query times.
- Conducted experiments and compared with contemporary techniques.
- Could be improved further by minimizing the index size.

References
[1] http://ir.iit.edu/~dagr/cs529/files/ir_book/CHAP%204%20Inverted%20Index.PDF
[2] http://en.wikipedia.org/wiki/N-gram
[3] http://en.wikipedia.org/wiki/R*-tree
[4] www.cs.fsu.edu/~lifeifei/papers/icde10_sas.pdf
[5] http://flamingo.ics.uci.edu/releases/4.0/

Thank You! Questions?
