Probabilistic Ranking of Database Query Results


1 Probabilistic Ranking of Database Query Results
Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik
Presented by Weimin He

2 Outline
Motivation
Problem Definition
System Architecture
Construction of Ranking Function
Implementation
Experiments
Conclusion and open problems
11/21/2018 Weimin He

3 Motivating example
Realtor DB: Table D = (TID, Price, City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock)
SQL query: SELECT * FROM D WHERE City = 'Seattle' AND View = 'Waterfront'

4 Motivation
Many-answers problem
Two alternative solutions: query reformulation, automatic ranking
Apply the probabilistic model from information retrieval (IR) to DB tuple ranking

5 Problem Definition
Given a database table D with n tuples {t1, …, tn} over a set of m categorical attributes A = {A1, …, Am}, and a query Q: SELECT * FROM D WHERE X1=x1 AND … AND Xs=xs, where each Xi is an attribute from A and xi is a value in its domain.
The set X = {X1, …, Xs} is the set of attributes specified by the query; the set Y = A – X is the set of unspecified attributes.
Let S ⊆ {t1, …, tn} be the answer set of Q.
How should the tuples in S be ranked so that the top-k tuples can be returned to the user?

6 System Architecture

7 Intuition for Ranking Function
SELECT * FROM D WHERE City = 'Seattle' AND View = 'Waterfront'
The score of a result tuple t depends on:
Global score: global importance of unspecified attribute values. E.g., homes with good school districts are globally desirable.
Conditional score: correlations between specified and unspecified attribute values. E.g., Waterfront → BoatDock.

8 Probabilistic Model in IR
Bayes’ Rule
Product Rule
Document t, Query Q
R: relevant document set
R̄ = D − R: irrelevant document set
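The two identities named on this slide are standard. They are written out here together with the usual PIR odds-ratio score; the score formula is an assumption about what the slide showed, not a quotation from it.

```latex
\begin{align*}
\text{Bayes' rule:}\quad & p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)} \\
\text{Product rule:}\quad & p(a, b \mid c) = p(a \mid c)\, p(b \mid a, c) \\
\text{Score}(t) &\propto \frac{p(R \mid t)}{p(\bar{R} \mid t)}
  = \frac{p(t \mid R)\, p(R) / p(t)}{p(t \mid \bar{R})\, p(\bar{R}) / p(t)}
  \propto \frac{p(t \mid R)}{p(t \mid \bar{R})}
\end{align*}
```

The last step drops p(R)/p(R̄), which is the same for every document and so does not affect the ranking.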

9 Adaptation of PIR to DB
Tuple t is considered as a document
Partition t into t(X) and t(Y)
t(X) and t(Y) are written as X and Y
Derive from the initial scoring function until the final ranking function is obtained

10 Preliminary Derivation
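One way to carry out this first step, written as a sketch under two assumptions that the transcript does not state: the relevant set R is small enough that R̄ can be approximated by D, and every relevant tuple satisfies the query conditions, so that p(X | Y, R) = 1. With t split into X and Y as on slide 9:

```latex
\begin{align*}
\text{Score}(t) &\propto \frac{p(t \mid R)}{p(t \mid \bar{R})}
  \approx \frac{p(t \mid R)}{p(t \mid D)}
  = \frac{p(X, Y \mid R)}{p(X, Y \mid D)} \\
 &= \frac{p(Y \mid R)\, p(X \mid Y, R)}{p(Y \mid D)\, p(X \mid Y, D)}
  = \frac{p(Y \mid R)}{p(Y \mid D)} \cdot \frac{1}{p(X \mid Y, D)}
\end{align*}
```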

11 Limited Independence Assumptions
Given a query Q and a tuple t, the X values are assumed to be independent among themselves, and likewise the Y values, though dependencies between the X and Y values are allowed.

12 Continuing Derivation
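A sketch of how the limited independence assumptions of slide 11 can be applied. It assumes the preliminary derivation ends at Score(t) ∝ p(Y | R)/p(Y | D) · 1/p(X | Y, D); that starting point is itself an assumption. Factors that depend only on the query, and are therefore identical for every answer tuple, are dropped in the last line.

```latex
\begin{align*}
\frac{p(Y \mid R)}{p(Y \mid D)} &= \prod_{y \in Y} \frac{p(y \mid R)}{p(y \mid D)} \\
p(X \mid Y, D) &= \prod_{x \in X} p(x \mid Y, D) \\
p(x \mid Y, D) &= \frac{p(x \mid D) \prod_{y \in Y} p(y \mid x, D)}{\prod_{y \in Y} p(y \mid D)}
  = p(x \mid D)^{1 - |Y|} \prod_{y \in Y} p(x \mid y, D) \\
\text{Score}(t) &\propto \prod_{y \in Y} \frac{p(y \mid R)}{p(y \mid D)}
  \cdot \prod_{y \in Y} \prod_{x \in X} \frac{1}{p(x \mid y, D)}
\end{align*}
```

The factor p(x | D)^{1 - |Y|} is constant across the answer set because x is fixed by the query and |Y| is the number of unspecified attributes.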

13 Workload-based Estimation of
Assume a collection of “past” queries exists in the system.
Workload W is represented as a set of “tuples”.
Given query Q and specified attribute set X, approximate R as all query “tuples” in W that also request X.
All properties of the relevant tuple set R can then be obtained by examining only the subset of the workload containing queries that also request X.
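The workload idea above can be illustrated with a small sketch. The workload contents, the dictionary representation of a past query, and the function names `workload_subset` and `estimate_p_y_given_R` are all hypothetical; the sketch only shows the filtering step, assuming "request X" means requesting the same attribute values as the current query.

```python
# Hypothetical workload: each past query is a dict of attribute -> requested value.
workload = [
    {"City": "Seattle", "View": "Waterfront", "BoatDock": "Yes"},
    {"City": "Seattle", "View": "Waterfront", "SchoolDistrict": "Excellent"},
    {"City": "Seattle", "SchoolDistrict": "Excellent"},
    {"City": "Bellevue", "View": "Waterfront", "BoatDock": "Yes"},
]


def workload_subset(workload, specified):
    """Return the past queries that request every (attribute, value) in `specified`."""
    return [
        q for q in workload
        if all(q.get(attr) == val for attr, val in specified.items())
    ]


def estimate_p_y_given_R(workload, specified, attr, value):
    """Approximate p(y | R) by the relative frequency of y among past queries that request X."""
    subset = workload_subset(workload, specified)
    if not subset:
        return 0.0  # no evidence in the workload for this X
    hits = sum(1 for q in subset if q.get(attr) == value)
    return hits / len(subset)


X = {"City": "Seattle", "View": "Waterfront"}
p = estimate_p_y_given_R(workload, X, "BoatDock", "Yes")
```

Only the first two workload entries request both City = Seattle and View = Waterfront, so the estimate for BoatDock = Yes is taken over those two queries.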

14 Final Ranking Function
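The four atomic quantities listed on slide 18 suggest a final function of the following shape. This is a reconstruction, assuming p(y | R) is approximated by p(y | X, W) as described on slide 13 and then expanded with Bayes' rule and the limited independence assumptions; the query-only constant p(X | W) is dropped.

```latex
\begin{align*}
p(y \mid R) &\approx p(y \mid X, W)
  = \frac{p(X \mid y, W)\, p(y \mid W)}{p(X \mid W)}
  \propto p(y \mid W) \prod_{x \in X} p(x \mid y, W) \\
\text{Score}(t) &\propto
  \underbrace{\prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}}_{\text{global part}}
  \cdot
  \underbrace{\prod_{y \in Y} \prod_{x \in X} \frac{p(x \mid y, W)}{p(x \mid y, D)}}_{\text{conditional part}}
\end{align*}
```

The two products line up with the global score and the conditional score introduced on slide 7.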

15 Pre-computing Atomic Probabilities in Ranking Function
p(y | W): relative frequency of y in W
p(y | D): relative frequency of y in D
p(x | y, W): (# of tuples in W that contain x and y) / (total # of tuples in W)
p(x | y, D): (# of tuples in D that contain x and y) / (total # of tuples in D)

16 Example for Computing Atomic Probabilities
SELECT * FROM D WHERE City = 'Seattle' AND View = 'Waterfront'
Y = {SchoolDistrict, BoatDock, …}
|D| = 10,000, |W| = 1000
W{excellent} = 10, W{waterfront & yes} = 5
p(excellent | W) = 10/1000 = 0.01
p(excellent | D) = 10/10,000 = 0.001
p(waterfront | yes, W) = 5/1000 = 0.005
p(waterfront | yes, D) = 5/10,000 = 0.0005
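A minimal sketch of computing such relative frequencies from a table of rows. The function name `atomic_probabilities` and the toy rows are hypothetical. The slide normalises pair counts by the total number of tuples; a conditional probability in the strict sense would instead divide by the number of tuples containing y. The sketch computes both so the difference is visible; which one the authors' system stores is not settled by the transcript.

```python
from collections import Counter
from itertools import permutations


def atomic_probabilities(rows):
    """rows: list of dicts mapping attribute -> value.

    Returns three dicts:
      p_single[(attr, val)]   = count(val) / n
      p_pair_total[(x, y)]    = count(x and y) / n         (normalisation used on the slide)
      p_cond[(x, y)]          = count(x and y) / count(y)  (strict conditional p(x | y))
    where x and y are (attr, val) pairs.
    """
    n = len(rows)
    single = Counter()
    pair = Counter()
    for row in rows:
        items = list(row.items())
        single.update(items)
        pair.update(permutations(items, 2))  # ordered pairs (x, y) from the same row
    p_single = {k: c / n for k, c in single.items()}
    p_pair_total = {k: c / n for k, c in pair.items()}
    p_cond = {(x, y): c / single[y] for (x, y), c in pair.items()}
    return p_single, p_pair_total, p_cond


# Toy data, invented for illustration only.
rows = [
    {"View": "Waterfront", "BoatDock": "Yes"},
    {"View": "Waterfront", "BoatDock": "No"},
    {"View": "Greenbelt", "BoatDock": "No"},
    {"View": "Waterfront", "BoatDock": "Yes"},
]
p_single, p_pair_total, p_cond = atomic_probabilities(rows)
```

The same function can be run once over the workload W and once over the data table D to obtain both sets of quantities.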

17 Indexing Atomic Probabilities
P(y | W): {AttName, AttVal, Prob}, B+ tree index on (AttName, AttVal)
P(y | D): {AttName, AttVal, Prob}, B+ tree index on (AttName, AttVal)
P(x | y, W): {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}, B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
P(x | y, D): {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}, B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
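The table layouts above can be tried out with Python's built-in `sqlite3` module. This is only an illustrative sketch: the presentation's system used Microsoft SQL Server, the table names `GlobalW`, `GlobalD`, `CondW`, `CondD` are hypothetical, and SQLite's B-tree indexes stand in for the B+ tree indexes named on the slide.

```python
import sqlite3


def build_atomic_tables(conn):
    """Create two single-value tables and two pairwise tables, each with a composite index."""
    cur = conn.cursor()
    for name in ("GlobalW", "GlobalD"):
        cur.execute(f"CREATE TABLE {name} (AttName TEXT, AttVal TEXT, Prob REAL)")
        cur.execute(f"CREATE INDEX idx_{name} ON {name} (AttName, AttVal)")
    for name in ("CondW", "CondD"):
        cur.execute(
            f"CREATE TABLE {name} (AttNameLeft TEXT, AttValLeft TEXT, "
            f"AttNameRight TEXT, AttValRight TEXT, Prob REAL)"
        )
        cur.execute(
            f"CREATE INDEX idx_{name} ON {name} "
            f"(AttNameLeft, AttValLeft, AttNameRight, AttValRight)"
        )
    conn.commit()


def lookup_global(conn, table, att_name, att_val):
    """Fetch one stored probability, or None if the value was never seen."""
    if table not in ("GlobalW", "GlobalD"):
        raise ValueError("unknown table")
    row = conn.execute(
        f"SELECT Prob FROM {table} WHERE AttName = ? AND AttVal = ?",
        (att_name, att_val),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
build_atomic_tables(conn)
conn.execute("INSERT INTO GlobalW VALUES (?, ?, ?)", ("SchoolDistrict", "excellent", 0.01))
```

Looking up by (AttName, AttVal) matches the index key, so each atomic probability can be fetched without scanning the table.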

18 Scan Algorithm
Preprocessing (Atomic Probabilities Module): computes and indexes the quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for all distinct values x and y
Execution:
Select tuples that satisfy the query
Scan and compute the score for each result tuple
Return top-K tuples
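The execution steps above can be sketched as follows. The sketch assumes the score of a tuple is the product of the ratios P(y | W)/P(y | D) and P(x | y, W)/P(x | y, D) over its unspecified values y and the query's specified values x; that form is an assumption built from the four quantities named on the slide. The function names, the in-memory dictionaries, and the `eps` fallback for unseen values are all hypothetical.

```python
import heapq
import math


def score(row, specified, p_y_W, p_y_D, p_xy_W, p_xy_D, eps=1e-9):
    """Log-score of one result tuple. Unseen probabilities fall back to eps (an assumption)."""
    s = 0.0
    for attr, val in row.items():
        if attr == "TID" or attr in specified:
            continue  # only unspecified attribute values contribute
        y = (attr, val)
        s += math.log(p_y_W.get(y, eps)) - math.log(p_y_D.get(y, eps))
        for x in specified.items():
            s += math.log(p_xy_W.get((x, y), eps)) - math.log(p_xy_D.get((x, y), eps))
    return s


def scan_top_k(table, specified, k, probs):
    """Select tuples satisfying the query, score each one, return the k best."""
    answers = [
        row for row in table
        if all(row.get(a) == v for a, v in specified.items())
    ]
    return heapq.nlargest(k, answers, key=lambda row: score(row, specified, *probs))


# Toy data, invented for illustration only.
table = [
    {"TID": 1, "City": "Seattle", "View": "Waterfront", "BoatDock": "Yes"},
    {"TID": 2, "City": "Seattle", "View": "Waterfront", "BoatDock": "No"},
    {"TID": 3, "City": "Bellevue", "View": "Waterfront", "BoatDock": "Yes"},
]
p_y_W = {("BoatDock", "Yes"): 0.5, ("BoatDock", "No"): 0.1}
p_y_D = {("BoatDock", "Yes"): 0.2, ("BoatDock", "No"): 0.8}
probs = (p_y_W, p_y_D, {}, {})
query = {"City": "Seattle", "View": "Waterfront"}
top = scan_top_k(table, query, 2, probs)
```

Working in log space avoids underflow when many small ratios are multiplied, and it does not change the ranking.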

19 Beyond Scan Algorithm
Scan algorithm is inefficient: many tuples in the answer set
Another extreme: pre-compute top-K tuples for all possible queries; still infeasible in practice
Trade-off solution: pre-compute ranked lists of tuples for all possible atomic queries; at query time, merge ranked lists to get top-K tuples

20 Two kinds of Ranked List
CondList Cx: {AttName, AttVal, TID, CondScore}, B+ tree index on (AttName, AttVal, CondScore)
GlobList Gx: {AttName, AttVal, TID, GlobScore}, B+ tree index on (AttName, AttVal, GlobScore)

21 Index Module

22 List Merge Algorithm
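One common way to merge pre-computed ranked lists into a top-K answer is a threshold-style merge in the spirit of Fagin's Threshold Algorithm. The sketch below shows that generic technique and is not necessarily the exact procedure on this slide. It assumes each list holds (TID, score) pairs sorted by descending positive score, that a tuple's total score is the product of its per-list scores, and that a tuple missing from any list is not in the answer set. The function name `list_merge_top_k` is hypothetical.

```python
import heapq


def list_merge_top_k(ranked_lists, k):
    """Threshold-style merge of lists of (tid, score), each sorted by score descending."""
    if k <= 0 or not ranked_lists:
        return []
    lookup = [dict(lst) for lst in ranked_lists]  # random access: tid -> score per list
    seen = set()
    top = []  # min-heap of (total_score, tid), holding at most k entries
    max_len = max(len(lst) for lst in ranked_lists)
    for depth in range(max_len):
        threshold = 1.0  # upper bound on the total score of any unseen tuple
        for lst in ranked_lists:
            if depth >= len(lst):
                threshold = 0.0  # list exhausted: an unseen tuple cannot be in it
                continue
            tid, s = lst[depth]
            threshold *= s
            if tid in seen:
                continue
            seen.add(tid)
            if all(tid in d for d in lookup):
                total = 1.0
                for d in lookup:
                    total *= d[tid]
                heapq.heappush(top, (total, tid))
                if len(top) > k:
                    heapq.heappop(top)  # drop the current worst candidate
        if len(top) == k and top[0][0] >= threshold:
            break  # no unseen tuple can beat the k-th best
    return [tid for total, tid in sorted(top, reverse=True)]


# Toy lists, invented for illustration only.
cond_list = [(1, 0.9), (2, 0.5), (3, 0.4), (4, 0.1)]
glob_list = [(3, 0.8), (1, 0.6), (2, 0.3)]
result = list_merge_top_k([cond_list, glob_list], 2)
```

On the toy lists the merge stops after reading two entries from each list, because the threshold has dropped below the second-best total score by then.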

23 Experimental Setup
Datasets: MSR HomeAdvisor Seattle; Internet Movie Database
Software and hardware: Microsoft SQL Server 2000 RDBMS; P4 2.8-GHz PC, 1 GB RAM; C#, connected to RDBMS through DAO

24 Quality Experiments
Conducted on Seattle Homes and Movies tables
Collect a workload from users
Compare the Conditional ranking method of the paper with the Global method [CIDR03]

25 Quality Experiment-Average Precision
For each query Qi, generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples.
Let each user mark 10 tuples in Hi as most relevant to Qi.
Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm.
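The transcript does not give the exact formula behind "how closely" the two sets match. A minimal sketch, assuming precision is simply the fraction of the algorithm's returned tuples that the user also marked (the function name `overlap_precision` and the TIDs are hypothetical):

```python
def overlap_precision(user_marked, algorithm_top):
    """Fraction of the algorithm's returned tuples that the user marked as relevant."""
    user = set(user_marked)
    returned = list(algorithm_top)
    if not returned:
        return 0.0
    return sum(1 for tid in returned if tid in user) / len(returned)


# Invented TIDs: the user marks tuples 1..10, the algorithm returns five of them.
user_marked = range(1, 11)
algorithm_top = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]
precision = overlap_precision(user_marked, algorithm_top)
```

Averaging this value over all users and queries gives one number per algorithm, which is one way to compare the two ranking methods.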

26 Quality Experiment- Fraction of Users Preferring Each Algorithm
5 new queries
Users were given the top-5 results

27 Performance Experiments
Datasets
Compare 2 algorithms: Scan algorithm, List Merge algorithm

28 Performance Experiments – Pre-computation Time

29 Performance Experiments – Execution Time

30 Performance Experiments – Execution Time

31 Performance Experiments – Execution Time

32 Conclusion and Open Problems
Conclusion: automatic ranking for the many-answers problem; adaptation of PIR to DB
Open problems: multiple-table queries; non-categorical attributes

