Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.

Presentation transcript:

Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB 2006

2 Outline
Introduction
Robust Index
Compute Robust Index
–Exact Solution
–Approximate Solution
–Multiple Indices
Performance Study
Discussion and Conclusions

3 Introduction
[Table: Sample Database R, columns Tid, A1, A2, tuples t1 to t8]
Ranked Query: Select top 3 * from R order by A1+A2 asc
Linear Ranking Functions
[Table: Query Results, columns Tid, A1, A2, A1+A2, three tuples]
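
As a rough illustration of the ranked query on this slide, the following Python sketch evaluates a linear ranking function over every tuple and keeps the k smallest scores. The attribute values are hypothetical (the slide's table values are not available), and the helper name `top_k_linear` is an assumption, not something from the paper.

```python
import heapq

def top_k_linear(rows, weights, k):
    """Return the ids of the k tuples with the smallest weighted sum.

    rows: dict mapping tuple id -> attribute values (hypothetical layout).
    weights: one non-negative weight per attribute.
    """
    def score(tid):
        return sum(w * v for w, v in zip(weights, rows[tid]))
    # nsmallest scans every tuple once, mirroring a full table scan.
    return heapq.nsmallest(k, rows, key=score)

# Hypothetical sample database R with attributes (A1, A2).
R = {"t1": (1, 9), "t2": (2, 5), "t3": (4, 4), "t4": (6, 3),
     "t5": (9, 1), "t6": (5, 8), "t7": (5, 6), "t8": (8, 7)}

# Equivalent in spirit to: select top 3 * from R order by A1+A2 asc
print(top_k_linear(R, (1, 1), 3))
```

Every tuple is scored, which is the cost the indexing methods on the following slides try to avoid.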

4 Efficient Processing of Ranked Queries
Naïve Solution: scan the whole database and evaluate all tuples
Using indices or materialized views
–Distributed Indexing
  Sort each attribute individually and merge attributes by a threshold algorithm (TA) [Fagin et al, PODS ’96, ’99, ’01]
–Spatial Indexing
  Organize tuples into R-tree and determine a threshold to prune the search space [Goldstein et al, PODS ’97]
  Organize tuples into R-tree and retrieve data progressively [Papadias et al, SIGMOD ’03]
–Sequential Indexing
  Organize tuples into convex hulls [Chang et al, SIGMOD ’00]
  Materialize ranked views according to the preference functions [Hristidis et al, SIGMOD ’01]
–And More …
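
The distributed-indexing idea (one sorted list per attribute, merged with a threshold) can be sketched as below. This is a simplified, in-memory reading of the threshold algorithm, assuming ascending scores (smaller is better) and non-negative weights; the data and the function name `threshold_algorithm` are hypothetical.

```python
import heapq

def threshold_algorithm(rows, weights, k):
    """Simplified TA for minimising sum(w_i * A_i) with non-negative weights.

    Returns (ids of the top-k tuples, number of sorted-access rounds used).
    """
    d = len(weights)
    # Sorted access: one list of ids per attribute, ascending by that attribute.
    sorted_lists = [sorted(rows, key=lambda tid, i=i: rows[tid][i]) for i in range(d)]

    def score(tid):
        return sum(w * v for w, v in zip(weights, rows[tid]))

    seen = {}
    for depth in range(len(rows)):
        frontier = []
        for i in range(d):
            tid = sorted_lists[i][depth]
            frontier.append(rows[tid][i])
            if tid not in seen:
                seen[tid] = score(tid)  # random access to the full tuple
        # No unseen tuple can score below the threshold built from the frontier.
        threshold = sum(w * v for w, v in zip(weights, frontier))
        best = heapq.nsmallest(k, seen.items(), key=lambda kv: kv[1])
        if len(best) == k and best[-1][1] <= threshold:
            return [tid for tid, _ in best], depth + 1
    best = heapq.nsmallest(k, seen.items(), key=lambda kv: kv[1])
    return [tid for tid, _ in best], len(rows)

# Hypothetical data with attributes (A1, A2).
R = {"t1": (1, 9), "t2": (2, 5), "t3": (4, 4), "t4": (6, 3),
     "t5": (9, 1), "t6": (5, 8), "t7": (5, 6), "t8": (8, 7)}
ids, rounds = threshold_algorithm(R, (1, 1), 3)
print(ids, rounds)
```

The early stop is what saves work: the scan ends as soon as the k-th best seen score is no worse than the threshold.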

5 Sequential Indexing
Sequential Index (ranked view)
–Linearly sort tuples
–No sophisticated data structures
–Sequential data access (good for database I/O)
Representative work
–Onion [Chang et al, SIGMOD ’00]
–PREFER [Hristidis et al, SIGMOD ’01]
Our proposal: Robust Index

6 Review: Onion Technique
[Table: Sample Database R, columns Tid, A1, A2, tuples t1 to t8]
Ranked Query: Select top 3 * from R order by aA1+bA2 asc
Index by Convex hull
–Retrieve data layer by layer until the top-k results are found
–In the worst case, retrieve top-k layers of tuples
Index by Convex Shell
–Applies if a and b are non-negative (a, b are the weighting parameters in the linear ranking function)
–Expect fewer tuples in each layer
[Figure: two plots of t1 to t8 over axes A1 and A2, marking the first and second layers]
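
A minimal sketch of the layer-by-layer idea, assuming full convex hull layers in two dimensions (not the convex shell variant) and hypothetical points. The hull routine is the standard monotone chain algorithm; the names `onion_layers` and `onion_top_k` are illustrative, not taken from the Onion paper.

```python
def cross(o, a, b):
    """Cross product of vectors OA and OB; positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Monotone chain convex hull; returns hull vertices only."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def onion_layers(points):
    """Peel convex hulls repeatedly: layer 1 is the outer hull, and so on."""
    remaining = set(points)
    layers = []
    while remaining:
        hull = convex_hull(list(remaining))
        layers.append(hull)
        remaining -= set(hull)
    return layers

def onion_top_k(layers, weights, k):
    """Read the first k layers, score every tuple in them, keep the k best."""
    a, b = weights
    candidates = [p for layer in layers[:k] for p in layer]
    return sorted(candidates, key=lambda p: a * p[0] + b * p[1])[:k]

# Hypothetical points (A1, A2).
points = [(1, 9), (2, 5), (4, 3), (7, 2), (9, 1), (5, 8), (5, 6), (8, 7)]
layers = onion_layers(points)
print(onion_top_k(layers, (1, 2), 3))
```

The query step never looks past layer k, which is the worst case stated on the slide.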

7 Review: PREFER System
[Table: Sample Database R, columns Tid, A1, A2, tuples t1 to t8]
Ranked Query: Select top 3 * from R order by w1A1+w2A2 asc
Index by the ranking function: A1+A2
Given query ranking function: A1+2A2
–Map query ranking function to index ranking function
–Will retrieve t1, t2, t3, t4, t6, t7
[Figure: t1 to t8 over axes A1 and A2, with labels "Index on preference ranking function", "Query ranking function", and "Map from query to preference"]

8 Comments on Sequential Indexing
PREFER
–Works extremely well when query functions are close to the index function
–Sensitive to query weights
Onion
–Less sensitive to query weights
–Can we do better?
Both methods
–Require considerable online computation
Motivation for robust indexing
–Not sensitive to query weights
–Push most computation to index building phase
[Chart: Average #tuples retrieved for 10 random queries asking for top-50 answers; query weights are randomly selected from 1, 2, 3, 4]


10 Robust Indexing: Motivating Example
Index by Convex hull (shell)
–Organize data layer by layer
–In order to keep the convexity, each layer is built conservatively
Robust Index
–Organize data layer by layer
–Exploit dominating properties between data and push a tuple as deep as possible
–t7: dominated by t2 and t4 (for any query, at least one of t2 and t4 ranks before t7)
–t7: dominated by t3 and t5
[Figure: two plots of t1 to t8 over axes A1 and A2; the convex hull plot marks the first and second layers, the robust index plot marks the first layer, Layer 3 and Layer 4]
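
The "dominated by a pair" property can be checked directly. The sketch below assumes non-negative weights and ascending scores, and treats ties as "ranks before". Under those assumptions, a pair (p, q) covers t for every query exactly when some point on the segment between p and q is component-wise no larger than t. The function name `pair_dominates` and the sample points are hypothetical.

```python
def pair_dominates(p, q, t, eps=1e-12):
    """True if, for every non-negative weight vector, p or q scores <= t.

    Looks for lam in [0, 1] with lam*p + (1-lam)*q <= t in every coordinate.
    """
    lo, hi = 0.0, 1.0
    for pi, qi, ti in zip(p, q, t):
        # Constraint for this coordinate: qi + lam * (pi - qi) <= ti
        a, b = pi - qi, ti - qi
        if abs(a) <= eps:
            if b < -eps:
                return False      # constraint fails for every lam
        elif a > 0:
            hi = min(hi, b / a)   # lam <= b / a
        else:
            lo = max(lo, b / a)   # lam >= b / a (division by a negative flips)
    return lo <= hi + eps

# Hypothetical points: neither p nor q dominates t alone, but the pair does.
p, q = (1.0, 5.0), (5.0, 1.0)
print(pair_dominates(p, q, (3.5, 3.5)))
print(pair_dominates(p, q, (2.5, 2.5)))
```

A tuple covered this way can never be the best answer, so it can be pushed below the first layer even though a convex-hull construction might keep it higher.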

11 Robust Indexing: Formal Definition
How does it work?
–Offline phase
  Put each tuple in its deepest layer: the minimal (best) rank over all possible linear queries
–Online phase
  Retrieve tuples in top-k layers
  Evaluate all of them, and report top-k
What is expected?
–Correctness
–Fewer tuples in each layer than convex hull
  If a tuple does not belong to top-k for any query, it will not be retrieved

12 Robust Indexing: Appealing Properties
Database Friendly
–No online algorithm required
–Simply use the following SQL statement
  Select top k * from R where layer <=k order by F rank
Space efficient
–Suppose the upper bound of the value k is given (e.g. k<=100)
–Only need to index those tuples in top 100 layers
–Robust indexing uses the minimal space compared with other alternatives
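
The online phase can be sketched with Python's built-in `sqlite3` module. This is an assumption-laden illustration: SQLite has no `top k` clause, so `LIMIT` is used instead; the table layout, the tuple ids, and the precomputed `layer` values are hypothetical; and the helper `robust_top_k` is not from the paper.

```python
import sqlite3

def robust_top_k(conn, weights, k):
    """Fetch tuples in the first k layers and rank them by w1*A1 + w2*A2."""
    w1, w2 = weights
    sql = ("SELECT tid FROM R WHERE layer <= ? "
           "ORDER BY ? * A1 + ? * A2 ASC LIMIT ?")
    return [row[0] for row in conn.execute(sql, (k, w1, w2, k))]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (tid TEXT, A1 REAL, A2 REAL, layer INTEGER)")
# Hypothetical tuples; layer is assumed to hold the precomputed minimal rank.
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?, ?)",
    [("a", 1, 4, 1), ("b", 4, 1, 1), ("c", 2, 2, 1), ("d", 3, 3, 3), ("e", 5, 5, 5)],
)
print(robust_top_k(conn, (0.6, 0.4), 2))
```

An ordinary index on the `layer` column would let the database answer the `layer <= k` filter without scanning deeper tuples.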


14 Robust Indexing: Algorithm Highlights
Exact Solution
–Compute the deepest layer for each tuple
–Complexity: [formula missing] (n: number of tuples, d: number of dimensions)
Approximate Solution
–Compute the lower bound layer for each tuple
–Complexity: [formula missing]
Multiple Indices
–Transform R to different subspaces by linear transformation
–Build an index in each subspace

15 Exact Solution
Task: to compute the minimal rank over all possible linear queries for tuple t
Given a query Q, with ranking function F=w1A1+w2A2, 0<=w1,w2<=1, and w1+w2=1
–Q is one-to-one mapped to a line L, e.g. A1+2A2 maps to L1
Naïve Proposal: Enumerate all possible combinations of (w1,w2)
–Not feasible since the enumerating space is infinite
Alternative Solution: Only enumerate (w1,w2) whose corresponding line passes through t and another tuple, e.g., L1, …, L4
–Do not consider t3 and t6 because the corresponding weights do not satisfy 0<=w1,w2<=1
[Figure: t and t1 to t6 over axes A1 and A2, with lines L1 to L4 through t]

16 Exact Solution, cont.
Task: to compute the minimal rank over all possible linear queries for tuple t
Given a query Q, with ranking function F=w1A1+w2A2, 0<=w1,w2<=1, and w1+w2=1
LV=>L1: minimal rank is 4 (after t1, t2, t3)
L1=>L2: minimal rank is 3 (after t2, t3)
L2=>L3: minimal rank is 4 (after t2, t3, t4)
L3=>L4: minimal rank is 3 (after t3, t4)
L4=>LH: minimal rank is 4 (after t3, t4, t5)
Minimal rank (the deepest layer) of t is 3
Complexity: to sort all lines takes O(n log n); to compute minimal rank for all t, [formula missing]; in general, [formula missing]
[Figure: t and t1 to t6 over axes A1 and A2, with lines LV, L1 to L4, and LH through t]
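
The candidate-weight idea from the two Exact Solution slides can be sketched in two dimensions as follows. This is a brute-force illustration rather than the sorted sweep the slides describe: it collects the weights at which some other tuple ties with t, then evaluates the rank once inside each interval between consecutive candidates. It assumes w1+w2=1 with both weights in [0,1], ascending scores, and data in general position; `minimal_rank` and the sample points are hypothetical.

```python
def minimal_rank(t, others):
    """Smallest rank of t over all F = w1*A1 + (1-w1)*A2 with 0 <= w1 <= 1."""
    cuts = {0.0, 1.0}
    for s in others:
        dx, dy = s[0] - t[0], s[1] - t[1]
        if dx != dy:
            # F(s) == F(t)  <=>  w1*dx + (1-w1)*dy == 0  <=>  w1 = dy / (dy - dx)
            w1 = dy / (dy - dx)
            if 0.0 <= w1 <= 1.0:
                cuts.add(w1)
    cuts = sorted(cuts)
    best = len(others) + 1
    # The rank is constant inside each interval, so one probe per interval suffices.
    for lo, hi in zip(cuts, cuts[1:]):
        w1 = (lo + hi) / 2
        ft = w1 * t[0] + (1 - w1) * t[1]
        rank = 1 + sum(1 for s in others if w1 * s[0] + (1 - w1) * s[1] < ft)
        best = min(best, rank)
    return best

# Hypothetical tuples (A1, A2).
data = {"a": (1, 4), "b": (4, 1), "c": (2, 2), "d": (3, 3), "e": (5, 5)}
for tid, t in data.items():
    rest = [p for other, p in data.items() if other != tid]
    print(tid, minimal_rank(t, rest))
```

Each call costs O(n^2) here because the rank is recounted per interval; sorting the lines and updating the count incrementally, as the slide suggests, avoids the recount.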

17 Approximate Solution
Task: to compute the lower bound of the minimal rank of tuple t
Four regions
–II: dominating region, data ranked before t
–IV: dominated region, data ranked after t
–I and III?
Step 1: Partition regions I and III
Step 2: Count cardinalities of region II and sub-regions I1,…,I4, III1,…,III4
Step 3: Match the cardinalities of the sub-regions and compute the lower bound
Lower Bounding Theorem
[Minimal ranking of t] >= card(II) + min( card(I3+I2+I1), card(I2+I1+III1), card(I1+III1+III2), card(III1+III2+III3) )
[Figure: regions I, II, III, IV around t over axes A1 and A2, with sub-regions I1 to I4 and III1 to III4]

18 Approximate Solution, Cont.
Step 2: Count cardinalities of region II and sub-regions I1,…,I4, III1,…,III4
Count the cardinality of region II?
1. All tuples in region II dominate t
2. A reversed version of skyline problem
3. Standard divide and conquer solution (details in the paper)
Count the cardinality of region I1?
–Suppose t: (a1,a2)
–Line L: A1+0.25A2 = a1+0.25a2
–Tuples in region I1 satisfy -A1 <= -a1 and A1+0.25A2 <= a1+0.25a2
–Transform the data: A1' = -A1, A2' = A1+0.25A2
[Tables: original and transformed tuples, columns Tid, A1, A2]
[Figure: regions I to IV around t over axes A1 and A2, with sub-regions I1 to I4, III1 to III4, and line L]
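
The transformation trick for region I1 can be illustrated with a short sketch: after mapping each tuple to (-A1, A1+0.25A2), the tuples in region I1 are exactly those that are component-wise no larger than the transformed t, so the same dominance-counting routine used for region II applies. The sketch counts by linear scan instead of divide and conquer; the points and the name `count_region_i1` are hypothetical.

```python
def count_region_i1(tuples, t, slope=0.25):
    """Count tuples with -A1 <= -a1 and A1 + slope*A2 <= a1 + slope*a2."""
    def transform(p):
        # A1' = -A1, A2' = A1 + slope*A2 (slope = 0.25 as on the slide)
        return (-p[0], p[0] + slope * p[1])
    t_prime = transform(t)
    count = 0
    for p in tuples:
        if p == t:
            continue
        # In the transformed space, membership is plain dominance over t'.
        if all(x <= y for x, y in zip(transform(p), t_prime)):
            count += 1
    return count

# Hypothetical tuples (A1, A2); t = (2, 4) gives the line A1 + 0.25*A2 = 3.
t = (2.0, 4.0)
tuples = [t, (2.5, 1.0), (3.0, 2.0), (1.0, 1.0)]
print(count_region_i1(tuples, t))
```

Only (2.5, 1.0) lies to the right of t and on or below the line, so it is the single tuple counted.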

19 Quality of the Approximate Solution
Complexity: [formula missing]
–B: number of partitions in each subspace
–n: number of tuples
–d: number of dimensions
Approximate quality: [formula missing]
–Assume data forms a uniform distribution
–Each subspace is partitioned evenly
–Partitioning according to the data distribution is an important and interesting future topic

20 Multiple Indices
Why?
–To relax the constraint
–To decompose and strengthen the constraints
[Diagram: "Ranking function: F=w1A1+w2A2 where 0<=w1,w2<=1" and "Ranking function: F=w1A1+w2A2 where 0<=w1<=w2<=1, or 0<=w2<=w1<=1", linked by arrows labeled Relax and Strengthen]
How? (e.g., for w1<=w2)
–Linearly transform R to R’, and build index on R’: (A1,A2) => (A1+A2, A2)
–Rewrite query weights: (w1,w2) => (w1,w2-w1)
Data are projected to a smaller subspace (e.g., A1’ >= A2’ in the transformed subspace)
Tuples can be pushed deeper since more domination can be found
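
A quick check of the rewrite on this slide, for the case w1<=w2: the transformed data and the rewritten weights give the same score, since w1(A1+A2) + (w2-w1)A2 = w1A1 + w2A2, and the rewritten weights stay non-negative. The helper names and sample values below are hypothetical.

```python
def transform_tuple(a):
    """(A1, A2) => (A1 + A2, A2), the data transformation for w1 <= w2."""
    return (a[0] + a[1], a[1])

def rewrite_weights(w):
    """(w1, w2) => (w1, w2 - w1); non-negative whenever w1 <= w2."""
    return (w[0], w[1] - w[0])

def score(w, a):
    return w[0] * a[0] + w[1] * a[1]

# Hypothetical tuple and query weights with w1 <= w2.
a, w = (3, 5), (1, 4)
print(score(w, a), score(rewrite_weights(w), transform_tuple(a)))
```

Because scores are preserved, an index built on the transformed data answers the original query, and it only has to be robust against the narrower weight range.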

21 Multiple Indices, Cont.
[Table: Number of tuples in top-k layers, columns Top-k, Convex Shell, Robust Indexing]
Synthetic Data: 10K tuples
Using the same index space, robust indexing can build 8 indices (if the value of k is upper bounded by 100)


23 Performance Study
Data
–Synthetic data
–Real dataset (abalone3D, cover3D)
Measure
–Number of tuples retrieved
–Execution time not reported, but the robust indexing is expected to be even better
Approaches for comparison
–Onion (convex shell)
–PREFER
–Approximate Robust Indexing (AppRI), #partition=10

24 Index Construction Time
Convex Shell, Convex Hull and AppRI are implemented in C++
Construction time for PREFER is not included since it is implemented in Java
Using the system default parameters, PREFER takes more than 1200 seconds on the 50k data set

25 Query Performance
[Chart: Average number of tuples retrieved on synthetic data]
[Chart: Average number of tuples retrieved on Cover3D data set]

26 Multiple Indices (Views)
Synthetic Data, 3 dimensions
Build 3 robust indices by decomposing the weighting parameters: w1=max(w1,w2,w3) w2=max(w1,w2,w3)

27 Discussion and Conclusions
Strength
–Easy to integrate with current DBMS
–Good query performance
–Practical construction complexity
Limitation
–Online index maintenance is expensive (some weaker maintenance strategies are available)
–Indexing high dimensional data remains a challenging problem