Probabilistic Ranking of Database Query Results Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International.

Presentation transcript:

1 Probabilistic Ranking of Database Query Results
Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik
Presented by Raghunath Ravi Sivaramakrishnan Subramani

Roadmap Motivation Key Problems System Architecture Construction of Ranking Function Implementation Experiments Conclusion and open problems 2

3 Motivation
◦ Many-answers problem
◦ Two alternative solutions: query reformulation and automatic ranking
◦ Apply the probabilistic model from IR to DB tuple ranking

4 Example – Realtor Database
◦ House attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year
◦ Query: City = 'Seattle' AND Waterfront = TRUE
◦ Too many results!
◦ Intuitively, houses with a lower Price, more Bedrooms, or a BoatDock are generally preferable

5 Rank According to Unspecified Attributes
Score of a result tuple t depends on:
Global Score: global importance of unspecified attribute values [CIDR2003]
◦ E.g., newer houses are generally preferred
Conditional Score: correlations between specified and unspecified attribute values
◦ E.g., Waterfront → BoatDock, Many Bedrooms → Good School District

Roadmap Motivation Key Problems System Architecture Construction of Ranking Function Implementation Experiments Conclusion and open problems 6

7 Key Problems
◦ Given a query Q, how to combine the global and conditional scores into a ranking function: use Probabilistic Information Retrieval (PIR).
◦ How to calculate the global and conditional scores: use the query workload and the data.

Roadmap Motivation Key Problems System Architecture Construction of Ranking Function Implementation Experiments Conclusion and open problems 8

9 System Architecture

Roadmap Motivation Key Problems System Architecture Construction of Ranking Function Implementation Experiments Conclusion and open problems 10

11 PIR Review
◦ Bayes' Rule
◦ Product Rule
◦ Document (tuple) t, query Q
◦ R: relevant documents
◦ R̄ = D − R: irrelevant documents
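
The formulas on this slide did not survive in the transcript. As a reminder, Bayes' rule and the product rule in their standard form, followed by the usual PIR step of ranking by the odds of relevance, can be written as below. The name Score and the generic events a, b, c are notation assumed here, not taken from the slide.

```latex
\begin{align*}
P(a \mid b) &= \frac{P(b \mid a)\,P(a)}{P(b)} && \text{(Bayes' rule)} \\
P(a, b \mid c) &= P(a \mid c)\,P(b \mid a, c) && \text{(product rule)} \\
\mathrm{Score}(t) &\propto \frac{P(R \mid t)}{P(\bar{R} \mid t)}
  = \frac{P(t \mid R)\,P(R)/P(t)}{P(t \mid \bar{R})\,P(\bar{R})/P(t)}
  \propto \frac{P(t \mid R)}{P(t \mid \bar{R})}
\end{align*}
```

The last step drops P(R)/P(R̄) because it is the same for every tuple and so does not affect the ranking.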

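12 Adaptation of PIR to DB
◦ Tuple t is treated as a document
◦ Partition t into t(X) and t(Y), the values of the attributes specified (X) and not specified (Y) in the query
◦ t(X) and t(Y) are written as X and Y for short
◦ Starting from the initial scoring function, derive step by step until the final ranking function is obtained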

13 Preliminary Derivation

14 Limited Independence Assumptions
◦ Given a query Q and a tuple t, the X values are assumed to be independent among themselves, and likewise the Y values
◦ Dependencies between the X and Y values are allowed

15 Continuing Derivation
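
The derivation slides survive only as titles, so the intermediate steps cannot be recovered from the transcript. As a hedged sketch: the limited independence assumption allows factorisations of the first kind below (C stands for any conditioning event, a symbol assumed here), and a final score built from the four atomic quantities named later in the deck, P(y | W), P(y | D), P(x | y, W) and P(x | y, D), would plausibly take the product-of-ratios form in the second line.

```latex
\begin{align*}
P(X \mid C) &= \prod_{x \in X} P(x \mid C), \qquad
P(Y \mid C) = \prod_{y \in Y} P(y \mid C) \\
\mathrm{Score}(t) &\propto
  \underbrace{\prod_{y \in Y} \frac{P(y \mid W)}{P(y \mid D)}}_{\text{global part}}
  \;\cdot\;
  \underbrace{\prod_{y \in Y} \prod_{x \in X} \frac{P(x \mid y, W)}{P(x \mid y, D)}}_{\text{conditional part}}
\end{align*}
```

Read this way, the first product rewards unspecified values that are popular in the workload relative to the data, and the second rewards unspecified values that co-occur with the specified ones.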

16 Pre-computing Atomic Probabilities in Ranking Function
◦ Use Workload: relative frequency in W; (# of tuples in W that contain x, y) / total # of tuples in W
◦ Use Data: relative frequency in D; (# of tuples in D that contain x, y) / total # of tuples in D
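
A minimal sketch of how such relative frequencies could be pre-computed from either the workload W or the data D. The function name `atomic_probabilities` and the list-of-dicts input format are assumptions for illustration. Note also that the conditional frequency here is normalised by the number of rows containing y (the textbook definition of a conditional relative frequency), whereas the slide states the total number of tuples as the denominator; treat that choice as an assumption.

```python
from collections import Counter
from itertools import permutations


def atomic_probabilities(rows):
    """Estimate P(v) and P(x | y) by relative frequency.

    rows: list of dicts mapping attribute name -> value
          (rows of the table D, or queries of the workload W).
    Returns (p_value, p_cond) where keys are (attribute, value) pairs
    and p_cond[(x, y)] approximates P(x | y).
    """
    n = len(rows)
    single = Counter()  # rows containing a given (attribute, value)
    pair = Counter()    # rows containing both x and y (ordered pairs)
    for row in rows:
        items = list(row.items())
        single.update(items)
        pair.update(permutations(items, 2))
    p_value = {v: c / n for v, c in single.items()}
    # Hypothetical choice: divide by count(y), not by n as on the slide.
    p_cond = {(x, y): c / single[y] for (x, y), c in pair.items()}
    return p_value, p_cond
```

Running the same function once over the workload and once over the table would yield the W-based and D-based quantities respectively.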

Roadmap Motivation Key Problems System Architecture Construction of Ranking Function Implementation Experiments Conclusion and open problems 17

18 Architecture of Ranking Systems

19 Scan Algorithm
Preprocessing (Atomic Probabilities Module)
◦ Computes and indexes the quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for all distinct values x and y
Execution
◦ Select tuples that satisfy the query
◦ Scan and compute the score for each result tuple
◦ Return the top-K tuples
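
The execution steps above can be sketched as follows. This is an illustration under stated assumptions, not the presenters' implementation: the score is taken to be a product of workload-to-data ratios over the four indexed quantities, the probability tables are plain dicts keyed by (attribute, value) pairs, and missing entries fall back to a small constant `eps`. All names are hypothetical.

```python
import heapq


def score(row, specified, p_w, p_d, pc_w, pc_d, eps=1e-6):
    """Assumed score: product of P(y|W)/P(y|D) and P(x|y,W)/P(x|y,D)."""
    xs = [(a, v) for a, v in row.items() if a in specified]
    ys = [(a, v) for a, v in row.items() if a not in specified]
    s = 1.0
    for y in ys:
        s *= p_w.get(y, eps) / p_d.get(y, eps)
        for x in xs:
            s *= pc_w.get((x, y), eps) / pc_d.get((x, y), eps)
    return s


def scan_top_k(table, query, k, p_w, p_d, pc_w, pc_d):
    """Select matching tuples, score each one, return the k best."""
    matches = [t for t in table
               if all(t.get(a) == v for a, v in query.items())]
    return heapq.nlargest(
        k, matches,
        key=lambda t: score(t, query.keys(), p_w, p_d, pc_w, pc_d))
```

Every matching tuple is scored, which is why the cost grows with the size of the answer set.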

20 Beyond Scan Algorithm
Scan algorithm is inefficient
◦ Many tuples in the answer set
Another extreme
◦ Pre-compute top-K tuples for all possible queries
◦ Still infeasible in practice
Trade-off solution
◦ Pre-compute ranked lists of tuples for all possible atomic queries
◦ At query time, merge ranked lists to get top-K tuples

21 Output from Index Module
CondList C_x {AttName, AttVal, TID, CondScore}
◦ B+ tree index on (AttName, AttVal, CondScore)
GlobList G_x {AttName, AttVal, TID, GlobScore}
◦ B+ tree index on (AttName, AttVal, GlobScore)

Index Module 22

23 Preprocessing Component
Preprocessing
◦ For each distinct value x in the database, calculate and store the conditional (C_x) and global (G_x) lists as follows:
◦ For each tuple t containing x, calculate its conditional and global scores and add them to C_x and G_x respectively
◦ Sort C_x and G_x by decreasing score
Execution
◦ Query Q: X_1 = x_1 AND … AND X_s = x_s
◦ Execute the Threshold Algorithm [Fag01] on the following lists: C_x1, …, C_xs, and G_xb, where G_xb is the shortest list among G_x1, …, G_xs
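
A minimal sketch of a threshold-style merge over pre-sorted lists, in the spirit of the Threshold Algorithm cited above. The assumptions are loud ones: each list is an in-memory Python list of (tid, score) pairs sorted by decreasing score, the overall score of a tuple is the product of its per-list scores (monotone when scores are non-negative), and a tuple missing from any list is treated as not satisfying the query. The function name is hypothetical, and at least one list must be supplied.

```python
import heapq
from math import prod


def threshold_top_k(lists, k):
    """Merge ranked lists and return the k best (tid, score) pairs."""
    lookup = [dict(lst) for lst in lists]  # random access by tid
    best = []                              # min-heap of (score, tid)
    seen = set()
    # Reading the shortest list to its end means every candidate was seen.
    for depth in range(min(len(lst) for lst in lists)):
        for lst in lists:                  # sorted access, round-robin
            tid = lst[depth][0]
            if tid in seen:
                continue
            seen.add(tid)
            if all(tid in d for d in lookup):
                total = prod(d[tid] for d in lookup)
                heapq.heappush(best, (total, tid))
                if len(best) > k:
                    heapq.heappop(best)    # keep only the k best so far
        # No unseen tuple can score above the product of the last-read scores.
        threshold = prod(lst[depth][1] for lst in lists)
        if len(best) == k and best[0][0] >= threshold:
            break
    return [(tid, s) for s, tid in sorted(best, reverse=True)]
```

The early stop is what makes merging cheaper than scanning: once k tuples score at least as high as the threshold, the remaining list entries need not be read.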

List Merge Algorithm 24

Roadmap Motivation Key Problems System Architecture Construction of Ranking Function Implementation Experiments Conclusion and open problems 25

26 Experimental Setup
Datasets:
◦ MSR HomeAdvisor Seattle
◦ Internet Movie Database
Software and Hardware:
◦ Microsoft SQL Server 2000 RDBMS
◦ P4 2.8-GHz PC, 1 GB RAM
◦ C#, connected to RDBMS through DAO

27 Quality Experiments
◦ Conducted on the Seattle Homes and Movies tables
◦ Collect a workload from users
◦ Compare the Conditional ranking method in the paper with the Global method [CIDR03]

28 Quality Experiment – Average Precision
◦ For each query Q_i, generate a set H_i of 30 tuples likely to contain a good mix of relevant and irrelevant tuples
◦ Let each user mark 10 tuples in H_i as most relevant to Q_i
◦ Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
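
The slide does not define how "how closely" is measured. One plausible reading, assumed here purely for illustration, is the fraction of the algorithm's returned tuples that the user also marked as relevant; the function name and the use of tuple IDs are hypothetical.

```python
def overlap_precision(user_marked, algorithm_top):
    """Fraction of returned tuple IDs that the user also marked relevant."""
    returned = set(algorithm_top)
    return len(returned & set(user_marked)) / len(returned)
```

Averaging this value over all users and queries would give one number per algorithm to compare.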

29 Quality Experiment – Fraction of Users Preferring Each Algorithm
◦ 5 new queries
◦ Users were given the top-5 results

30 Performance Experiments
◦ Datasets
◦ Compare 2 algorithms: Scan algorithm and List Merge algorithm

31 Performance Experiments – Pre-computation Time

32 Performance Experiments – Execution Time

Roadmap Motivation Key Problems System Architecture Construction of Ranking Function Implementation Experiments Conclusion and open problems 33

34 Conclusions – Future Work
Conclusions
◦ Completely automated approach for the Many-Answers Problem, which leverages data and workload statistics and correlations
◦ Based on PIR
Drawbacks
◦ Multiple-table queries
◦ Non-categorical attributes
Future Work
◦ Empty-Answer Problem
◦ Handle plain text attributes

35 Questions?