Supporting Ranking and Clustering as Generalized Order-By and Group-By Chengkai Li (UIUC) joint work with Min Wang Lipyeow Lim Haixun Wang (IBM) Kevin.

Slides:



Advertisements
Similar presentations
CS 245Notes 71 CS 245: Database System Principles Notes 7: Query Optimization Hector Garcia-Molina.
Advertisements

Efficient IR-Style Keyword Search over Relational Databases Vagelis Hristidis University of California, San Diego Luis Gravano Columbia University Yannis.
RankSQL: Supporting Ranking Queries in RDBMS Chengkai Li (UIUC) Mohamed A. Soliman (Univ. of Waterloo) Kevin Chen-Chuan Chang (UIUC) Ihab F. Ilyas (Univ.
1 RankSQL: Query Algebra and Optimization for Relational Top-k Queries Chengkai Li (UIUC) joint work with Kevin Chen-Chuan Chang (UIUC) Ihab F. Ilyas (U.
Effective Keyword Based Selection of Relational Databases Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H Tung.
PREFER: A System for the Efficient Execution of Multi-parametric Ranked Queries Vagelis Hristidis University of California, San Diego Nick Koudas AT&T.
SEARCHING QUESTION AND ANSWER ARCHIVES Dr. Jiwoon Jeon Presented by CHARANYA VENKATESH KUMAR.
Querying for Information Integration: How to go from an Imprecise Intent to a Precise Query? Aditya Telang Sharma Chakravarthy, Chengkai Li.
Advanced Databases: Lecture 2 Query Optimization (I) 1 Query Optimization (introduction to query processing) Advanced Databases By Dr. Akhtar Ali.
Supporting Ad-Hoc Ranking Aggregates Chengkai Li (UIUC) joint work with Kevin Chang (UIUC) Ihab Ilyas (Waterloo)
Presented by Russell Myers Paper by Ming-Chuan Wu and Alejandro P. Buchmann.
Analyzing Minerva1 AUTORI: Antonello Ercoli Alessandro Pezzullo CORSO: Seminari di Ingegneria del SW DOCENTE: Prof. Giuseppe De Giacomo.
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Classic Information Retrieval (IR)
1 Ranked Queries over sources with Boolean Query Interfaces without Ranking Support Vagelis Hristidis, Florida International University Yuheng Hu, Arizona.
Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.
Database Systems: Design, Implementation, and Management Eighth Edition Chapter 11 Database Performance Tuning and Query Optimization.
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
Minimal Probing: Supporting Expensive Predicates for Top-k Queries Kevin C. Chang Seung-won Hwang Univ. of Illinois at Urbana-Champaign.
Combining Keyword Search and Forms for Ad Hoc Querying of Databases Eric Chu, Akanksha Baid, Xiaoyong Chai, AnHai Doan, Jeffrey Naughton University of.
Database Systems: Design, Implementation, and Management Tenth Edition Chapter 11 Database Performance Tuning and Query Optimization.
Database Systems: Design, Implementation, and Management Eighth Edition Chapter 10 Database Performance Tuning and Query Optimization.
NiagaraCQ A Scalable Continuous Query System for Internet Databases Jianjun Chen, David J DeWitt, Feng Tian, Yuan Wang University of Wisconsin – Madison.
The CompleteSearch Engine: Interactive, Efficient, and Towards IR&DB integration Holger Bast Max-Planck-Institut für Informatik Saarbrücken, Germany joint.
DBease: Making Databases User-Friendly and Easily Accessible Guoliang Li, Ju Fan, Hao Wu, Jiannan Wang, Jianhua Feng Database Group, Department of Computer.
Mehdi Kargar Aijun An York University, Toronto, Canada Keyword Search in Graphs: Finding r-cliques.
Probabilistic Ranking of Database Query Results Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International.
Physical Database Design Chapter 6. Physical Design and implementation 1.Translate global logical data model for target DBMS  1.1Design base relations.
Querying Structured Text in an XML Database By Xuemei Luo.
Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.
Advanced Databases: Lecture 6 Query Optimization (I) 1 Introduction to query processing + Implementing Relational Algebra Advanced Databases By Dr. Akhtar.
Keyword Searching and Browsing in Databases using BANKS Seoyoung Ahn Mar 3, 2005 The University of Texas at Arlington.
Q2Semantic: A Lightweight Keyword Interface to Semantic Search Haofen Wang 1, Kang Zhang 1, Qiaoling Liu 1, Thanh Tran 2, and Yong Yu 1 1 Apex Lab, Shanghai.
Mehdi Kargar Aijun An York University, Toronto, Canada Keyword Search in Graphs: Finding r-cliques.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
Searching the web Enormous amount of information –In 1994, 100 thousand pages indexed –In 1997, 100 million pages indexed –In June, 2000, 500 million pages.
1 Chapter 10 Joins and Subqueries. 2 Joins & Subqueries Joins – Methods to combine data from multiple tables – Optimizer information can be limited based.
Templated Search over Relational Databases Date: 2015/01/15 Author: Anastasios Zouzias, Michail Vlachos, Vagelis Hristidis Source: ACM CIKM’14 Advisor:
Ranking objects based on relationships Computing Top-K over Aggregation Sigmod 2006 Kaushik Chakrabarti et al.
All right reserved by Xuehua Shen 1 Optimal Aggregation Algorithms for Middleware Ronald Fagin, Amnon Lotem, Moni Naor (PODS01)
Indexes and Views Unit 7.
CS4432: Database Systems II Query Processing- Part 2.
1 Information Retrieval LECTURE 1 : Introduction.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Date: 2012/08/21 Source: Zhong Zeng, Zhifeng Bao, Tok Wang Ling, Mong Li Lee (KEYS’12) Speaker: Er-Gang Liu Advisor: Dr. Jia-ling Koh 1.
CSE 6392 – Data Exploration and Analysis in Relational Databases April 20, 2006.
Ranking of Database Query Results Nitesh Maan, Arujn Saraswat, Nishant Kapoor.
Date: 2013/4/1 Author: Jaime I. Lopez-Veyna, Victor J. Sosa-Sosa, Ivan Lopez-Arevalo Source: KEYS’12 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang KESOSD.
Automatic Labeling of Multinomial Topic Models
Indexing OLAP Data Sunita Sarawagi Monowar Hossain York University.
Boolean + Ranking: Querying a Database by K-Constrained Optimization Joint work with: Seung-won Hwang, Kevin C. Chang, Min Wang, Christian A. Lang, Yuan-chi.
Automatic Categorization of Query Results Kaushik Chakrabarti, Surajit Chaudhuri, Seung-won Hwang Sushruth Puttaswamy.
CLUSTERING GRID-BASED METHODS Elsayed Hemayed Data Mining Course.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
Database Systems, 8 th Edition SQL Performance Tuning Evaluated from client perspective –Most current relational DBMSs perform automatic query optimization.
1 VLDB, Background What is important for the user.
1 Chengkai Li Kevin-Chen-Chuan Chang Ihab Ilyas Sumin Song Presented by: Mariam John CSE /20/2006 RankSQL: Query Algebra and Optimization for Relational.
Supporting Ranking and Clustering as Generalized Order-By and Group-By
Boolean + Ranking: Querying a Database by K-Constrained Optimization
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
RankSQL: Query Algebra and Optimization for Relational Top-k Queries
Parallel Databases.
Supporting Ad-Hoc Ranking Aggregates
RankSQL: Query Algebra and Optimization for Relational Top-k Queries
Ishan Sharma Abhishek Mittal Vivek Raj
Database Performance Tuning and Query Optimization
The Relational Model Textbook /7/2018.
Overview of Query Evaluation
Chapter 11 Database Performance Tuning and Query Optimization
Information Retrieval and Web Design
Presentation transcript:

Supporting Ranking and Clustering as Generalized Order-By and Group-By Chengkai Li (UIUC) joint work with Min Wang Lipyeow Lim Haixun Wang (IBM) Kevin Chang (UIUC)

Boolean database queries 2 ( Relational Algebra and System R papers)

Data Retrieval 3

4 Example: What Boolean queries provide SELECT * FROM Houses H WHERE 200K<price<400K AND #bedroom = 4 query semanticsresults organization Boolean query hard constraints (True or False)  a flat table  too many (few) answers

5 Example: What may be desirable < 500K is more acceptable but willing to pay more for big house close to the lake is a plus avoid locations near airport … query semanticsresults organization Boolean query hard constraints (True or False)  a flat table  too many (few) answers fuzzy retrieval “soft” constraints (preference,similarity, relevance,…)  a ranked list  a grouping of results  etc.

“Retrieval” of DATA: From Boolean query to fuzzy retrieval 6 text data retrieval engine search engine

Retrieval mechanisms: Learning from Web search 7 Navigation Map Categorization Facets … Clustering Ranking

Generalizing SQL constructs for data retrieval 8 Group-By Order-By

9 From crispy ordering to fuzzy ranking Fuzzy ranking Order By Houses.size − 4*Houses.price Limit 5  by ranking function  combine matching criteria  order => desirability : top-k Crispy ordering Order By Houses.size, Houses.price  by attribute values  equality of values  order ≠ desirability

10 From crispy grouping to fuzzy clustering Fuzzy clustering Group By k-means(H.size, H.price) Into 5  by distance function  proximity of values  number of clusters Crispy grouping Group By Houses.size, Houses.price  by attribute values  equality partition  no limit on output size

11 Need for combining ranking with clustering Clustering-only  A group can be big “too many answers” problem persists  How to compare things within each group? Ranking-only  Lack of global view top-k results may come from same underlying group (e.g., cheap and big houses come from a less nice area.)  Different groups may not be comparable

Contributions Concepts  generalize Group-By to fuzzy clustering, parallel to the generalization from Order-By to ranking  integrate ranking with clustering in database queries Efficient processing framework  summary-based approach 12

Related works Clustering  not on dynamic query results  use summary (grid with buckets) (e.g., STING [WangYM97] WaveCluster [SheikholeslamiCZ98] ) Ranking (top-k) in DB: many instances (e.g., [LiCIS05])  use summary (histogram) in top-k to range query translation [ ChaudhuriG99] Categorization of query results [ChakrabartiCH04]  different from clustering  no integration with ranking  focus on reducing navigation overhead, not processing Web search and IR (e.g., [ZamirE99] [LeuskiA00]) 13

14 Integrate the two generalizations Boolean conditions Ranking Clustering ? semantics evaluation

15 Query semantics: ClusterRank queries SELECT * FROM Houses WHERE area=“Chicago” GROUP BY longitude, latitude INTO 5 ORDER BY size – 4*price LIMIT 3 ranking clustering Boolean Semantics: order-within-groups Return the top k tuples within each group (cluster).

Several notes Non-deterministic semantics  clustering is non-deterministic by nature  sacrificing the crispiness of SQL queries  worthy for exploring the fuzziness in data retrieval? Language syntax isn’t our focus  current SQL semantics: order-among-groups Select… From… Where…Group By… Order By…(RankAgg[LiCI06])  OLAP function Clustering function  algorithm, distance measure hidden behind Other semantics  e.g., cluster the global top k 16

17 Query evaluation: Straightforward Materialize-Cluster-Sort approach SELECT * FROM Houses WHERE area=“Chicago” GROUP BYlongitude, latitude INTO 5 ORDER BY size – 4*price LIMIT 3 sorting clustering Boolean

18 Query evaluation: Straightforward Materialize-Cluster-Sort approach sorting clustering Boolean Overkill: cluster and rank all, only top 10 in each cluster are requested Inefficient:  fully generate Boolean results  clustering large amount of results is expensive  sorting big group is costly

19 Query evaluation: Summary-driven approach use summary to cluster use summary for pruning in ranking use bitmap-index  to construct query-dependant summary  to bring together Boolean, clustering, and ranking

20 Summary for clustering latitude K-means on original tuples  weighted K-means on virtual tuples longitude latitude longitude

21 Summary for ranking size ORDER BY size – 4 *price LIMIT 3 [ -40,10] - 4*price 0 size (20,40] 1 (10,30] 2 1 (0,20] *price -40

22 Construct summary by bitmap-index & … -4*price[-40,-30] … size(10,20] … TID t1 t2 t3 t4 t5 … size t2 t5 t1 t4 t3 The advantages of using bitmap index: Small Bit operations (&, |, ~, count) are fast Easily integrate Boolean, clustering, and ranking - 4*price -40

23 Integrating Boolean, clustering, and ranking Vec(cluster1) … Vec(cluster2) … longitude latitude *price size & Vec(area=“chicago”) … longitude latitude longitude latitude & - 4*price size - 4*price size &

Experiments ClusterRank (summary-driven approach) vs. StraightFwd (materialize-cluster-rank)  Processing efficiency: ClusterRank >> StrightFwd  Clustering Quality: ClusterRank ≈ StrightFwd synthetic data various configuration parameters (#tuples, #clusters, #clustering attr, #ranking attr, #paritions per attr, k) 24

Efficiency 25 t: #clusters 4M tuples, 5 clustering attr, 3 ranking attr, top 5 s: #tuples 10 clusters, 5 clustering attr, 3 ranking attr, top 5

Clustering quality 26 t: #clusters 4M tuples, 8 clustering attr s: #tuples 10 clusters, 3 clustering attr close(res_SF, res_CR): closeness of results from StraightFwd and ClusterRank close(res_SF, res_SF): closeness of results from different runs of StraightFwd

27 Conclusions Borrow innovative mechanisms from other areas to support data retrieval applications Ranking and clustering as generalized Order-By and Group-By, integrated in database queries Query semantics: ClusterRank queries Query evaluation: summary-driven approch vs. materialize- cluster-sort  evaluation efficiency: ClusterRank >> StraightFwd  clustering quality: ClusterRank ≈ StraightFwd

Acknowledgement Rishi Rakesh Sinha: source code of bitmap index Jiawei Han: discussions regarding presentation 28

29

30 Alternative semantics? global clustering / local ranking (focus of this paper) clustering: Boolean results ranking: local top k in each cluster local clustering / global ranking clustering: global top k ranking: Boolean results global clustering / global ranking clustering: Boolean results ranking: in each cluster, return those belonging to global top k rank the clusters? (by average of local top k?)

31 Join queries Star-schema fact table, dimension tables, key and foreign key Bitmap join-index index the fact table by the attributes in dimension tables