
1 Probabilistic Ranking of Database Query Results
Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik
Presented by Raghunath Ravi and Sivaramakrishnan Subramani, CSE@UTA

2 Roadmap
- Motivation
- Key Problems
- System Architecture
- Construction of Ranking Function
- Implementation
- Experiments
- Conclusion and open problems

3 Motivation
- The many-answers problem
- Two alternative solutions:
  ◦ Query reformulation
  ◦ Automatic ranking
- Apply a probabilistic model from IR to DB tuple ranking

4 Example – Realtor Database
- House attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year
- Query: City = 'Seattle' AND Waterfront = TRUE
- Too many results!
- Intuitively, houses with a lower Price, more Bedrooms, or a BoatDock are generally preferable

5 Rank According to Unspecified Attributes
The score of a result tuple t depends on:
- Global score: global importance of unspecified attribute values [CIDR2003]
  ◦ E.g., newer houses are generally preferred
- Conditional score: correlations between specified and unspecified attribute values
  ◦ E.g., Waterfront ⇒ BoatDock; many Bedrooms ⇒ good SchoolDistrict


7 Key Problems
Given a query Q:
- How to combine the global and conditional scores into a ranking function? Use probabilistic information retrieval (PIR).
- How to calculate the global and conditional scores? Use the query workload and the data.


9 System Architecture (diagram in the original slide)


11 PIR Review
- Bayes' rule; product rule
- Document (tuple) t, query Q
- R: relevant documents
- R̄ = D - R: irrelevant documents
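The equations on this slide were images in the original; the standard PIR scoring setup that the Bayes' rule and product rule yield can be sketched as:

```latex
\mathrm{Score}(t) \;\propto\; \frac{p(R \mid t)}{p(\bar{R} \mid t)}
  \;=\; \frac{p(t \mid R)\, p(R)}{p(t \mid \bar{R})\, p(\bar{R})}
  \;\propto\; \frac{p(t \mid R)}{p(t \mid \bar{R})}
```

The priors p(R) and p(R̄) are constant across tuples, so they drop out of the ranking.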

12 Adaptation of PIR to DB
- Tuple t is treated as a document
- Partition t into t(X) (specified attributes) and t(Y) (unspecified attributes)
- t(X) and t(Y) are written as X and Y for short
- Derive from the initial scoring function until the final ranking function is obtained

13 Preliminary Derivation (equations in the original slide)

14 Limited Independence Assumptions
Given a query Q and a tuple t, the X (and Y) values are assumed to be independent within themselves, though dependencies between the X and Y values are allowed.

15 Continuing Derivation (equations in the original slide)
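The derivation on this slide was an image in the original. Under the limited-independence assumptions of slide 14, and approximating the relevant set R by the workload W and the irrelevant set R̄ by the full database D, the final ranking function of the paper takes roughly this form (reconstructed as a sketch; consult the paper for the exact equation):

```latex
\mathrm{Score}(t) \;\propto\;
  \prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}
  \;\times\;
  \prod_{y \in Y} \prod_{x \in X} \frac{p(x \mid y, W)}{p(x \mid y, D)}
```

The first product is the global part and the second the conditional part, built from exactly the four atomic quantities listed on the next slide.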

16 Pre-computing Atomic Probabilities in the Ranking Function
- Relative frequency in W: (# of tuples in W that contain x, y) / (total # of tuples in W); uses the workload
- Relative frequency in D: (# of tuples in D that contain x, y) / (total # of tuples in D); uses the data
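A minimal sketch of the relative-frequency estimates above, assuming each tuple is a dict of attribute-value pairs (function and variable names are illustrative, not from the paper's implementation):

```python
from collections import Counter

def atomic_probabilities(tuples):
    """Estimate p(v) and p(x | y) as relative frequencies over a set of
    tuples (each a dict attr -> value).  Run once over the workload W
    and once over the data D to obtain the four atomic quantities."""
    n = len(tuples)
    single = Counter()   # count of each (attr, value) pair v
    pair = Counter()     # co-occurrence count of (x, y) value pairs
    for t in tuples:
        vals = list(t.items())
        for v in vals:
            single[v] += 1
        for x in vals:
            for y in vals:
                if x != y:
                    pair[(x, y)] += 1
    p = {v: c / n for v, c in single.items()}                        # p(v)
    p_cond = {(x, y): c / single[y] for (x, y), c in pair.items()}   # p(x | y)
    return p, p_cond
```

Applied to the workload table this yields P(y|W) and P(x|y,W); applied to the database it yields P(y|D) and P(x|y,D).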


18 Architecture of Ranking Systems (diagram in the original slide)

19 Scan Algorithm
Preprocessing (Atomic Probabilities Module):
- Computes and indexes the quantities P(y|W), P(y|D), P(x|y,W), and P(x|y,D) for all distinct values x and y
Execution:
- Select the tuples that satisfy the query
- Scan and compute the score for each result tuple
- Return the top-k tuples
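The execution phase can be sketched as follows, assuming the four probability tables are already in memory as dicts (all names here are illustrative assumptions, not the paper's code):

```python
import heapq

def scan_rank(result_tuples, specified, p_w, p_d, pc_w, pc_d, k):
    """Score each tuple that satisfied the query and return the top-k.
    p_w/p_d map an (attr, value) pair y to P(y|W)/P(y|D); pc_w/pc_d map
    a pair (x, y) to P(x|y,W)/P(x|y,D).  Unseen pairs get a tiny default
    so the product never hits zero."""
    eps = 1e-9
    scored = []
    for t in result_tuples:
        x_vals = [(a, v) for a, v in t.items() if a in specified]
        y_vals = [(a, v) for a, v in t.items() if a not in specified]
        score = 1.0
        for y in y_vals:
            score *= p_w.get(y, eps) / p_d.get(y, eps)   # global part
            for x in x_vals:
                score *= pc_w.get((x, y), eps) / pc_d.get((x, y), eps)  # conditional part
        scored.append((score, t))
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])
```

The full scan is what makes this algorithm expensive when the answer set is large, motivating the pre-computed-list approach on the next slides.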

20 Beyond the Scan Algorithm
- The Scan algorithm is inefficient when there are many tuples in the answer set
- The other extreme, pre-computing the top-k tuples for all possible queries, is still infeasible in practice
- Trade-off solution:
  ◦ Pre-compute ranked lists of tuples for all possible atomic queries
  ◦ At query time, merge the ranked lists to get the top-k tuples

21 Output from the Index Module
- CondList C_x: {AttName, AttVal, TID, CondScore}, with a B+-tree index on (AttName, AttVal, CondScore)
- GlobList G_x: {AttName, AttVal, TID, GlobScore}, with a B+-tree index on (AttName, AttVal, GlobScore)

22 Index Module (diagram in the original slide)

23 Preprocessing Component
Preprocessing:
- For each distinct value x in the database, calculate and store the conditional (C_x) and global (G_x) lists as follows:
  ◦ For each tuple t containing x, calculate its scores and add them to C_x and G_x respectively
- Sort C_x and G_x by decreasing score
Execution:
- Query Q: X_1 = x_1 AND ... AND X_s = x_s
- Execute the Threshold Algorithm [Fag01] on the lists C_x1, ..., C_xs, and G_xb, where G_xb is the shortest list among G_x1, ..., G_xs

24 List Merge Algorithm (figure in the original slide)
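A hedged sketch of the merge step: a Fagin-style Threshold Algorithm over score-sorted lists, here assuming the overall score is the product of per-list scores and simulating random access with dict lookups (the paper's exact aggregation and bookkeeping may differ):

```python
import heapq

def threshold_merge(lists, k):
    """Threshold Algorithm sketch.  Each list is [(tid, score), ...]
    sorted by descending score.  Stops once the k-th best total score
    is at least the threshold (product of current sorted-access scores),
    so no unseen tuple can still enter the top-k."""
    lookup = [dict(lst) for lst in lists]   # random access by TID
    seen = set()
    top = []                                # min-heap of (score, tid), size <= k
    for depth in range(max(len(lst) for lst in lists)):
        threshold = 1.0
        for lst in lists:
            tid, s = lst[min(depth, len(lst) - 1)]   # sorted access
            threshold *= s                           # best total still possible
            if tid not in seen:
                seen.add(tid)
                total = 1.0
                for d in lookup:
                    total *= d.get(tid, 1e-9)        # tiny score if absent
                heapq.heappush(top, (total, tid))
                if len(top) > k:
                    heapq.heappop(top)
        if len(top) == k and top[0][0] >= threshold:
            break                                    # k-th best beats the threshold
    return sorted(top, reverse=True)
```

Early termination is the point: with skewed scores the algorithm often stops after reading only a short prefix of each list, which is why it beats the Scan algorithm.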


26 Experimental Setup
Datasets:
◦ MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
◦ Internet Movie Database (http://www.imdb.com)
Software and hardware:
◦ Microsoft SQL Server 2000 RDBMS
◦ P4 2.8-GHz PC, 1 GB RAM
◦ C#, connected to the RDBMS through DAO

27 Quality Experiments
- Conducted on the Seattle Homes and Movies tables
- Collect a workload from users
- Compare the Conditional ranking method of the paper with the Global method [CIDR03]

28 Quality Experiment: Average Precision
- For each query Q_i, generate a set H_i of 30 tuples likely to contain a good mix of relevant and irrelevant tuples
- Let each user mark 10 tuples in H_i as most relevant to Q_i
- Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
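The match measure above reduces to set overlap between the user's marked tuples and the algorithm's top-k; a minimal sketch (the paper may normalize differently):

```python
def precision_overlap(user_marked, algo_topk):
    """Fraction of the user's relevant tuples that also appear in the
    algorithm's returned top-k list."""
    return len(set(user_marked) & set(algo_topk)) / len(user_marked)
```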

29 Quality Experiment: Fraction of Users Preferring Each Algorithm
- 5 new queries
- Users were given the top-5 results

30 Performance Experiments
- Datasets
- Compare two algorithms:
  ◦ Scan algorithm
  ◦ List Merge algorithm

31 Performance Experiments – Pre-computation Time (chart in the original slide)

32 Performance Experiments – Execution Time (chart in the original slide)


34 Conclusions – Future Work
Conclusions:
- A completely automated approach for the many-answers problem that leverages data and workload statistics and correlations, based on PIR
Drawbacks:
- Multiple-table queries
- Non-categorical attributes
Future work:
- The empty-answer problem
- Handling plain-text attributes

35 Questions?

