Efficient Processing of Top-k Spatial Keyword Queries. João B. Rocha-Junior, Orestis Gkorgkas, Simon Jonassen, and Kjetil Nørvåg. SSTD 2011.

1 Efficient Processing of Top-k Spatial Keyword Queries
João B. Rocha-Junior, Orestis Gkorgkas, Simon Jonassen, and Kjetil Nørvåg
SSTD 2011 - Minneapolis, Minnesota, USA
www.ntnu.no

2 Outline
– Top-k spatial keyword queries
– Current approaches
– Spatial inverted index
– Single-keyword queries
– Multiple-keyword queries
– Experimental evaluation
– Conclusion

3 Motivation
More and more documents on the Internet are associated with a spatial location
– E.g., tweets, images (Flickr), Wikipedia pages, OpenStreetMap objects, …
Most of these geotagged objects are also associated with text (a description)

4 Top-k spatial keyword queries
Query
– Spatial location
– Query keywords (e.g., "Italian food")
Returns the k best spatio-textual objects ranked in terms of both
– Spatial distance to the query location
– Textual relevance to the query keywords

5 Another example…
Query
– Spatial location
– Query keywords
Returns the k best spatio-textual objects ranked in terms of both
– Spatial distance to the query location
– Textual relevance to the query keywords
[Figure: objects around the query location q, with the distance from q to an object marked]

6 Ranking objects
The score of an object p for a query q combines two measures, balanced by a preference parameter
– The spatial proximity (δ) is the normalized Euclidean distance between p and q
– The textual relevance (θ) is the cosine similarity between the description of p and the query keywords
– The query preference parameter (α) defines the importance of one measure over the other
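The score formula itself did not survive in this transcript. A minimal sketch of how such a score could be computed, assuming a weighted sum in which α weights the spatial proximity and (1 − α) the textual relevance, and assuming proximity is 1 minus the Euclidean distance normalized by a maximum distance dMax. The class, method, and parameter names are illustrative and do not come from the slides.

```java
/** Sketch of a combined spatio-textual score; the weighted-sum form is an assumption. */
final class RankingScore {

    /** Spatial proximity in [0, 1]: 1 at the query location, 0 at distance dMax or more. */
    static double spatialProximity(double px, double py, double qx, double qy, double dMax) {
        double dist = Math.hypot(px - qx, py - qy);
        return 1.0 - Math.min(dist, dMax) / dMax;
    }

    /** Combined score: alpha weights proximity (delta), 1 - alpha weights relevance (theta). */
    static double score(double alpha, double delta, double theta) {
        return alpha * delta + (1.0 - alpha) * theta;
    }

    public static void main(String[] args) {
        double delta = spatialProximity(3, 4, 0, 0, 10); // object at distance 5, dMax = 10
        double theta = 0.8;                              // hypothetical cosine similarity
        System.out.println(score(0.5, delta, theta));
    }
}
```

With α close to 1 the ranking is driven by distance, and with α close to 0 by textual relevance, which is the trade-off the preference parameter is meant to expose.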

7 Current approaches
Employ a modified R-tree [1,2]
– Each node keeps an abstract document representing all documents in the node's sub-tree
Abstract document
– Pairs (term, weight), one pair per term
– The weight permits computing an upper-bound score for the objects in the node's sub-tree
[1] Cong, G., Jensen, C.S., Wu, D.: "Efficient retrieval of the top-k most relevant spatial web objects", VLDB, 2009.
[2] Li, Z., Lee, K.C., Zheng, B., Lee, W., Lee, D., Wang, X.: "IR-tree: an efficient index for geographic document search", TKDE, 2010.

8 Example
[Figure: a modified R-tree over the objects p1 to p7 and the query point q, with a root and nodes e1, e2, e3; each node is annotated with its abstract document, a list of term:weight pairs over the terms bar, pop, pub, rock, and samba]
For simplicity, we assume that the impact of a term is defined by its frequency
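A minimal sketch of how a node's abstract document could be derived from its children, assuming the weight kept per term is the maximum over the sub-tree (the choice that makes it usable as an upper bound). The class name and the sample frequencies are hypothetical and are not the values from the figure.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: an abstract document keeps, per term, the largest weight in the sub-tree. */
final class AbstractDocument {

    /** Merges child documents (term -> weight) by taking the maximum weight per term. */
    static Map<String, Integer> merge(List<Map<String, Integer>> children) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> child : children) {
            for (Map.Entry<String, Integer> e : child.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Integer::max);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical term frequencies for two objects in the same node.
        Map<String, Integer> p1 = Map.of("bar", 1, "pub", 1);
        Map<String, Integer> p2 = Map.of("pub", 2, "rock", 1);
        System.out.println(merge(List.of(p1, p2)));
    }
}
```

Because every term that occurs anywhere below a node appears in its abstract document, the documents near the root grow with the vocabulary, which is one way to read the "text dimensionality" problem raised on the next slide.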

9 Current approaches
There are several variations
– Incorporating document similarity
– Clustering the nodes
Main problems
– Frequent and infrequent terms are stored in the same way (and have the same cost)
– Several nodes are accessed because of the dimensionality of the text
– Complex management of inverted files and/or vectors, one per node

10 Spatial inverted index (S2I)
Like an inverted index, S2I maps terms to the objects that contain the term
– The most frequent terms are stored in aggregated R-trees (aR-trees)
– The less frequent terms are stored in blocks in a file
The aR-tree permits accessing the objects in decreasing order of term relevance
The blocks permit storing the less frequent terms efficiently
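A minimal sketch of the storage decision behind S2I, under stated assumptions: a term stays in a block while its postings fit a capacity limit and would move to an aR-tree once they do not. The capacity parameter, the class and method names, and the use of in-memory lists in place of real blocks and trees are all illustrative; the slides do not give the actual threshold.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of an S2I-like term-to-postings map; lists stand in for blocks and aR-trees. */
final class S2ISketch {

    enum Storage { BLOCK, TREE }

    /** One posting: the object, its location, and the impact of the term in it. */
    static final class Posting {
        final int objectId;
        final double x, y, impact;

        Posting(int objectId, double x, double y, double impact) {
            this.objectId = objectId;
            this.x = x;
            this.y = y;
            this.impact = impact;
        }
    }

    private final int blockCapacity; // hypothetical limit of postings per block
    private final Map<String, List<Posting>> postings = new HashMap<>();

    S2ISketch(int blockCapacity) {
        this.blockCapacity = blockCapacity;
    }

    /** Adds one posting per distinct term of the object. */
    void insert(int objectId, double x, double y, Map<String, Double> termImpacts) {
        for (Map.Entry<String, Double> e : termImpacts.entrySet()) {
            postings.computeIfAbsent(e.getKey(), t -> new ArrayList<>())
                    .add(new Posting(objectId, x, y, e.getValue()));
        }
    }

    /** Where the term would live: a block while it fits, an aR-tree otherwise. */
    Storage storageOf(String term) {
        List<Posting> list = postings.get(term);
        if (list == null) {
            throw new IllegalArgumentException("unknown term: " + term);
        }
        return list.size() > blockCapacity ? Storage.TREE : Storage.BLOCK;
    }

    public static void main(String[] args) {
        S2ISketch index = new S2ISketch(2);
        index.insert(1, 0.0, 0.0, Map.of("pub", 1.0, "rock", 1.0));
        index.insert(2, 1.0, 1.0, Map.of("pub", 2.0));
        index.insert(3, 2.0, 2.0, Map.of("pub", 1.0));
        System.out.println("pub -> " + index.storageOf("pub"));   // prints "pub -> TREE"
        System.out.println("rock -> " + index.storageOf("rock")); // prints "rock -> BLOCK"
    }
}
```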

11 Distribution of terms
The distribution of terms is very skewed
A few hundred terms account for 50% of the text
[Figure: term frequency plotted against terms]

12 Example
[Figure: S2I for the terms bar*, pop, pub*, rock, samba]

13 Aggregated R-tree (max) for frequent terms (e.g., pub)
Only relevant objects are evaluated
The objects are accessed in decreasing order of score
[Figure: aR-tree for the term pub and the query point q. The root e0 holds e1 (max=1) and e2 (max=2); e1 holds p1(1) and p2(1); e2 holds p5(2), p6(2), and p7(1). The number in parentheses is the term impact for an object and the max value for a node]

14 Single-keyword queries
Only a single block or tree is accessed
Block
– All the objects are read and the k best are reported
Tree
– The nodes are accessed in decreasing order of score
– The algorithm terminates when the score of the k-th object is higher than the score of any unvisited node
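The tree case can be sketched as a best-first traversal driven by a max-heap. This is an illustration under stated assumptions, not the authors' code: a node's upper bound is computed from the minimum distance between q and the node's rectangle and from the node's max impact, the score is the weighted sum α·proximity + (1 − α)·impact, impacts are normalized to [0, 1], and the coordinates in `exampleTree` are invented (the slides give only the impacts).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of a best-first top-k search over an aR-tree-like structure. */
final class SingleKeywordSearch {

    /** An entry is an inner node (has children) or an object (no children). */
    static final class Entry {
        final String id;
        final double minX, minY, maxX, maxY; // bounding rectangle; a point for objects
        final double maxImpact;              // largest term impact in the sub-tree
        final List<Entry> children;

        private Entry(String id, double minX, double minY, double maxX, double maxY,
                      double maxImpact, List<Entry> children) {
            this.id = id;
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
            this.maxImpact = maxImpact;
            this.children = children;
        }

        static Entry object(String id, double x, double y, double impact) {
            return new Entry(id, x, y, x, y, impact, List.of());
        }

        /** Builds an inner node whose rectangle and max impact cover all children. */
        static Entry node(String id, List<Entry> children) {
            if (children.isEmpty()) throw new IllegalArgumentException("node needs children");
            double x1 = Double.POSITIVE_INFINITY, y1 = Double.POSITIVE_INFINITY;
            double x2 = Double.NEGATIVE_INFINITY, y2 = Double.NEGATIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (Entry c : children) {
                x1 = Math.min(x1, c.minX); y1 = Math.min(y1, c.minY);
                x2 = Math.max(x2, c.maxX); y2 = Math.max(y2, c.maxY);
                max = Math.max(max, c.maxImpact);
            }
            return new Entry(id, x1, y1, x2, y2, max, children);
        }
    }

    /** Upper bound for a node; the exact score for an object. */
    static double upperBound(Entry e, double qx, double qy, double alpha, double dMax) {
        double dx = Math.max(0, Math.max(e.minX - qx, qx - e.maxX));
        double dy = Math.max(0, Math.max(e.minY - qy, qy - e.maxY));
        double proximity = 1.0 - Math.min(Math.hypot(dx, dy), dMax) / dMax;
        return alpha * proximity + (1.0 - alpha) * e.maxImpact;
    }

    /** Pops entries in decreasing bound order; a popped object is the next best result. */
    static List<String> topK(Entry root, double qx, double qy,
                             double alpha, double dMax, int k) {
        PriorityQueue<Entry> heap = new PriorityQueue<>(Comparator.comparingDouble(
                (Entry e) -> upperBound(e, qx, qy, alpha, dMax)).reversed());
        heap.add(root);
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty() && result.size() < k) {
            Entry best = heap.poll();
            if (best.children.isEmpty()) result.add(best.id);
            else heap.addAll(best.children);
        }
        return result;
    }

    /** Hypothetical layout: e1 = {p1, p2} near q, e2 = {p5, p6, p7} far from q. */
    static Entry exampleTree() {
        Entry e1 = Entry.node("e1", List.of(
                Entry.object("p1", 1, 1, 0.5), Entry.object("p2", 2, 2, 0.5)));
        Entry e2 = Entry.node("e2", List.of(
                Entry.object("p5", 8, 8, 1.0), Entry.object("p6", 9, 9, 1.0),
                Entry.object("p7", 7, 9, 0.5)));
        return Entry.node("e0", List.of(e1, e2));
    }

    public static void main(String[] args) {
        System.out.println(topK(exampleTree(), 0, 0, 0.3, 20, 2)); // text-heavy preference
        System.out.println(topK(exampleTree(), 0, 0, 0.9, 20, 2)); // distance-heavy preference
    }
}
```

Stopping as soon as k objects have been popped corresponds to the termination rule on the slide: a popped object scores at least as high as the bound of every entry still in the heap, so no unvisited node can contain a better one.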

15 Example, processing top-1
[Figure: step-by-step search on the aR-tree for pub (root e0; e1 with max=1 holding p1(1), p2(1); e2 with max=2 holding p5(2), p6(2), p7(1)), showing the contents of the max-heap, the minimum distance from q to a node, and the resulting top-1 object]

16 Multiple-keyword queries
Requires aggregating the partial scores of the objects for each term t of the query keywords
Similar to Fagin's No Random Access (NRA) algorithm
– Different bounds
[Formula: the score expressed in terms of the partial scores]

17 Multiple-keyword algorithm
For each term t in q, access the objects p in S2I in decreasing order of partial score
– The objects are retrieved from a tree or a block
Update the lower-bound score of p
– Sum of the known partial scores plus the lowest possible partial score (using the spatial distance)
Update the upper-bound score of the visited objects
Return the objects whose lower-bound score cannot be exceeded by the remaining objects
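The bound bookkeeping can be sketched in the generic NRA style. The slides say the actual bounds differ from NRA and do not show them, so the following is only an assumption-laden illustration: unknown partial scores are marked with NaN, the lowest possible partial score is passed in by the caller (for example, one derived from the object's spatial distance alone), and the "frontier" is the last partial score retrieved for each term. All names are hypothetical.

```java
/** Sketch of NRA-style lower and upper bounds; not the exact bounds used by S2I. */
final class NraBounds {

    /**
     * Lower bound: known partial scores plus the lowest possible partial score
     * for every term not retrieved yet. known[i] is NaN when term i is unknown.
     */
    static double lowerBound(double[] known, double minPartial) {
        double sum = 0;
        for (double s : known) {
            sum += Double.isNaN(s) ? minPartial : s;
        }
        return sum;
    }

    /**
     * Upper bound: known partial scores plus, for every unknown term, the last
     * partial score retrieved for that term (later ones cannot be higher).
     */
    static double upperBound(double[] known, double[] frontier) {
        double sum = 0;
        for (int i = 0; i < known.length; i++) {
            sum += Double.isNaN(known[i]) ? frontier[i] : known[i];
        }
        return sum;
    }

    /** An object can be returned once nothing else can exceed its lower bound. */
    static boolean canReport(double lowerBound, double bestOtherUpperBound,
                             double unseenUpperBound) {
        return lowerBound >= Math.max(bestOtherUpperBound, unseenUpperBound);
    }

    public static void main(String[] args) {
        double[] known = {0.4, Double.NaN};  // term 0 retrieved, term 1 still unknown
        double[] frontier = {0.4, 0.3};      // last partial score read per term
        System.out.println(lowerBound(known, 0.1));
        System.out.println(upperBound(known, frontier));
    }
}
```

The gap between the two bounds shrinks as more partial scores are read, which is what eventually lets the algorithm return an object without visiting the remaining ones.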

18 Experimental evaluation
We compare our approach (S2I) with the DIR-tree proposed by Cong et al. [1]
Both approaches are implemented in Java
Measures: response time, I/O, update time, and index size
Size of tree nodes and blocks: 4KB
[1] Cong, G., Jensen, C.S., Wu, D.: "Efficient retrieval of the top-k most relevant spatial web objects", VLDB, 2009.

19 Datasets
Dataset | Total no. of objects | Avg. no. of unique terms per object | Total no. of terms
Twitter1 | 1M | 11.94 | 12.5M
Twitter2 | 2M | 12.00 | 25M
Twitter3 | 3M | 12.26 | 38.6M
Twitter4 | 4M | 12.27 | 51.6M
Data1 | 0.1M | 131.70 | 32.6M
Wikipedia | 0.4M | 163.65 | 169.4M
Flickr | 1.4M | 14.49 | 25.4M
OpenStreetMap | 3M | 8.76 | 31.5M

20 Variables studied
Number of results (k)
– 10, 20, 30, 40, 50
Number of query keywords
– 1, 2, 3, 4, 5
Query preference rate (α)
– 0.1, 0.3, 0.5, 0.7, 0.9
Scalability (Twitter dataset)
– 1M, 2M, 3M, 4M

21 Number of results (k)
The response time of S2I is one order of magnitude better because it needs fewer disk accesses
– DIR-tree reads several nodes before finding the top-k due to text dimensionality

22 Number of query keywords
S2I is one order of magnitude better in I/O and response time

23 Insertion time and index size
S2I requires neither updating inverted files (and vectors) nor computing document similarity
S2I requires more space

24 Conclusions
Top-k spatial keyword queries are intuitive and have several applications
We propose a new index
– Terms with different frequencies are stored differently
We propose algorithms for single- and multiple-keyword queries
The efficiency of our approach is verified through experiments on synthetic and real datasets

25 More information…
João B. Rocha-Junior
joao@idi.ntnu.no
http://www.idi.ntnu.no/~joao
Thanks!

26 Scalability
The improvement of S2I over DIR-tree increases with the cardinality of the dataset

27 Different datasets
The advantage of S2I over DIR-tree is higher for datasets with few terms per document

28 Terms removal
Terms with length = 1
Terms that have no letter character
– !Character.isLetter(token.charAt(i))
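The two removal rules can be sketched as a small filter built around the `Character.isLetter(token.charAt(i))` test from the slide. The class and method names are illustrative, and reading "no letter character" as "the test fails at every position i" is an assumption.

```java
/** Sketch of the term filter: drop one-character terms and terms without any letter. */
final class TermFilter {

    /** Returns true if at least one character of the token is a letter. */
    static boolean hasLetter(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (Character.isLetter(token.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    /** A term is kept if it is longer than one character and contains a letter. */
    static boolean keep(String token) {
        return token.length() > 1 && hasLetter(token);
    }

    public static void main(String[] args) {
        for (String token : new String[] {"pub", "a", "2011", "!!", "r2"}) {
            System.out.println(token + " -> " + (keep(token) ? "kept" : "removed"));
        }
    }
}
```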

