Ranking in DB. Laks V.S. Lakshmanan, Dept. of CS, UBC.


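Why ranking in query answering? (1/3) [10/16/2015]
Multimedia data, fuzzy querying: e.g., "find top 2 red objects with a soft texture".
Each predicate returns its own ranked list of (Obj, Score) pairs:
  List 1: D 0.85, B 0.80, A 0.75, E 0.65, C 0.60
  List 2: A 0.9, D 0.8, C 0.4, B 0.3, E 0.1
Combine the per-list scores into an overall score and rank objects by it.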

Why ranking? (2/3)
IR: "find top 5 documents relevant to 'computational', 'neuroscience' and 'brain theory'".
  - IR systems maintain full-text indexes: inverted lists of docs w.r.t. each keyword.
  - Same Q/A paradigm as before.
Buying a home: several criteria (price, location, area, #BRs, school district). ORDER BY query in SQL.
Finding hotels while traveling.

Why ranking? (3/3)
Data stream, e.g., of network flow data: "find 10 users with the max. BW consumption and max. #packets communicated".
  - The score may be a complex aggregation of these two measures.
In a social net, find 5 items tagged as most relevant to "lawn mowing" and belonging to users socially close to the seeker.
And now, find top-k recommendations (recommender systems), etc.
Fagin et al.: pioneering papers in PODS '96, PODS '01 and JCSS. This has since burgeoned into a field.
Focus on the middleware algorithm which, given a score combination function, computes the top-k answers by probing different subsystems (or ranked lists).

Computational model
Naïve method: compute the overall score of every object and sort. How to compute top-K efficiently?
Access methods:
  - Sorted access (sequential access) [SA].
  - Random access [RA].
Different optimization metrics:
  - Overall running time of the algorithm.
  - cost(SA) < cost(RA): minimize RAs.
  - RA not possible (typical in IR systems): avoid RAs.
  - Combined optimization.
This has led to a variety of algorithms.
Memory vs. disk model.
For the most part, assume the score aggregation is a monotone function; SUM is used in the examples.
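As a baseline for the algorithms that follow, here is a minimal sketch of the naïve method, assuming SUM as the aggregation function and using the two ranked lists from the "top 2 red objects" slide. The function name `naive_top_k` and the variable names are illustrative, not from the slides.

```python
import heapq

def naive_top_k(lists, k, agg=sum):
    """Read every list in full, score every object, keep the k best."""
    # One dict per list stands in for a full scan of that list.
    tables = [dict(lst) for lst in lists]
    objects = set().union(*tables)
    # Assumption of this sketch: an object missing from a list scores 0.0 there.
    scored = [(agg(t.get(obj, 0.0) for t in tables), obj) for obj in objects]
    return heapq.nlargest(k, scored)

# The two ranked lists from the fuzzy-query slide.
list_1 = [("D", 0.85), ("B", 0.80), ("A", 0.75), ("E", 0.65), ("C", 0.60)]
list_2 = [("A", 0.9), ("D", 0.8), ("C", 0.4), ("B", 0.3), ("E", 0.1)]

top2 = naive_top_k([list_1, list_2], k=2)
print(top2)  # A and D, each with an overall score of about 1.65
```

The cost is a full pass over every list regardless of K, which is what FA, TA and NRA try to avoid.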

Fagin's Algorithm (FA)
m lists sorted by descending scores.
Access (SA) all lists in parallel.
  - For each new object x seen, fetch its scores from the other lists by RA; overall score t(x) = t(x1, …, xm). Store (obj, score) in set Y.
  - Remember each object that has been seen (under SA) in all m lists in set H.
Repeat until |H| >= K.
Sort Y in descending order of scores, breaking ties arbitrarily, and output the top K.
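The steps of FA can be sketched as follows. This is a minimal in-memory illustration, assuming SUM as t() and that every object appears in every list; the dict lookups stand in for random access, and the names (`fagin_algorithm`, `LISTS`) are illustrative. The data are the four sorted lists of the deck's running example.

```python
LISTS = [
    [("H", 0.95), ("B", 0.90), ("E", 0.85), ("C", 0.80), ("G", 0.75),
     ("I", 0.70), ("D", 0.65), ("A", 0.60), ("J", 0.55), ("F", 0.50)],
    [("J", 1.00), ("C", 0.95), ("G", 0.85), ("H", 0.80), ("E", 0.75),
     ("B", 0.75), ("F", 0.60), ("A", 0.50), ("D", 0.40), ("I", 0.30)],
    [("C", 0.95), ("J", 0.80), ("D", 0.70), ("H", 0.65), ("G", 0.60),
     ("B", 0.55), ("I", 0.50), ("E", 0.45), ("F", 0.40), ("A", 0.30)],
    [("E", 1.00), ("G", 0.95), ("H", 0.90), ("B", 0.85), ("D", 0.80),
     ("C", 0.70), ("A", 0.65), ("I", 0.55), ("F", 0.45), ("J", 0.30)],
]

def fagin_algorithm(lists, k, agg=sum):
    m = len(lists)
    tables = [dict(lst) for lst in lists]  # stand-in for random access
    seen_in = {}   # object -> indexes of the lists where SA has reached it
    y = {}         # Y: object -> overall score
    h = set()      # H: objects seen under SA in all m lists
    depth = 0
    max_depth = max(len(lst) for lst in lists)
    while len(h) < k and depth < max_depth:
        for i, lst in enumerate(lists):
            if depth >= len(lst):
                continue
            obj, _ = lst[depth]                       # sorted access
            if obj not in y:
                y[obj] = agg(t[obj] for t in tables)  # random accesses
            seen_in.setdefault(obj, set()).add(i)
            if len(seen_in[obj]) == m:
                h.add(obj)
        depth += 1
    ranked = sorted(y.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k], depth

result, depth = fagin_algorithm(LISTS, k=4)
print(result, depth)
```

Note that the output is taken from Y, not from H: H only decides when to stop.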

Example of FA (K = 4)

Rank  L1       L2       L3       L4
1     H(0.95)  J(1.00)  C(0.95)  E(1.00)
2     B(0.90)  C(0.95)  J(0.80)  G(0.95)
3     E(0.85)  G(0.85)  D(0.70)  H(0.90)
4     C(0.80)  H(0.80)  H(0.65)  B(0.85)
5     G(0.75)  E(0.75)  G(0.60)  D(0.80)
6     I(0.70)  B(0.75)  B(0.55)  C(0.70)
7     D(0.65)  F(0.60)  I(0.50)  A(0.65)
8     A(0.60)  A(0.50)  E(0.45)  I(0.55)
9     J(0.55)  D(0.40)  F(0.40)  F(0.45)
10    F(0.50)  I(0.30)  A(0.30)  J(0.30)

Y: answers seen in >= 1 list (unsorted), each with its overall score, e.g., H = 3.30, J = 2.65, I = 2.05.
H: answers seen (under SA) in all 4 lists.
  - After 4 rounds of SA: H = {H}.
  - After 5 rounds of SA: H = {H, G}.
  - After 6 rounds of SA: H = {H, G, B, C}, so |H| = 4 and FA stops.

FA Example concluded
A, F: not seen in any list. Yet, we are sure they can't make it to the top 4. Why?
Based on where the cursors are now, what's the max. possible score for A, F?
What assumptions are being made about t()?
FA is shown to be optimal with very high probability [Fagin: PODS 1996], but it can be beaten by other algorithms on specific inputs.
What about buffer size?
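The bound asked about above can be checked numerically. The sketch below assumes SUM as t() (any monotone t() would give the same argument) and reads the cursor scores off the running example after 6 rounds of sorted access; the variable names are illustrative.

```python
# Scores under the cursors after 6 rounds of sorted access on L1..L4
# (I in L1, B in L2, B in L3, C in L4 in the running example).
cursor_scores = [0.70, 0.75, 0.55, 0.70]

# With a monotone t() (SUM here), an unseen object scores at most the
# cursor score in every list, so this sum bounds its overall score.
max_unseen = sum(cursor_scores)

# Overall scores of the four objects in H when FA stops.
scores_in_h = {"C": 3.40, "H": 3.30, "G": 3.15, "B": 3.05}
kth_best = min(scores_in_h.values())

# A and F cannot reach the top 4: their best case is below the 4th score.
print(round(max_unseen, 2), kth_best)
```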

Threshold Algorithm (TA)
Do parallel SA on all m lists.
For each object x seen under SA in a list, fetch its scores from the other lists by RA and compute its overall score.
  - If |Buffer| < K, add x to Buffer.
  - Else if score(x) <= K-th score in Buffer, toss x.
  - Else replace the bottom of Buffer with (x, score(x)) and re-sort.
Threshold := t(worst score seen on L1, …, worst score seen on Lm), i.e., t() applied to the last score seen under SA on each list.
Stop when threshold <= K-th score in Buffer.
Output the top-K objects and scores (in Buffer).
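A minimal sketch of TA on the running example, assuming SUM as t(), equal-length lists, and dict lookups in place of random access. A min-heap plays the role of the sorted buffer; the names are illustrative, and the returned threshold trace is an addition for inspection only.

```python
import heapq

LISTS = [
    [("H", 0.95), ("B", 0.90), ("E", 0.85), ("C", 0.80), ("G", 0.75),
     ("I", 0.70), ("D", 0.65), ("A", 0.60), ("J", 0.55), ("F", 0.50)],
    [("J", 1.00), ("C", 0.95), ("G", 0.85), ("H", 0.80), ("E", 0.75),
     ("B", 0.75), ("F", 0.60), ("A", 0.50), ("D", 0.40), ("I", 0.30)],
    [("C", 0.95), ("J", 0.80), ("D", 0.70), ("H", 0.65), ("G", 0.60),
     ("B", 0.55), ("I", 0.50), ("E", 0.45), ("F", 0.40), ("A", 0.30)],
    [("E", 1.00), ("G", 0.95), ("H", 0.90), ("B", 0.85), ("D", 0.80),
     ("C", 0.70), ("A", 0.65), ("I", 0.55), ("F", 0.45), ("J", 0.30)],
]

def threshold_algorithm(lists, k, agg=sum):
    tables = [dict(lst) for lst in lists]  # stand-in for random access
    heap = []        # min-heap of (score, obj); heap[0] is the K-th best
    members = set()  # objects currently in the buffer
    thresholds = []
    for depth in range(max(len(lst) for lst in lists)):
        row = [lst[depth] for lst in lists]  # assumes equal-length lists
        for obj, _ in row:
            if obj in members:
                continue
            score = agg(t[obj] for t in tables)
            if len(heap) < k:
                heapq.heappush(heap, (score, obj))
                members.add(obj)
            elif score > heap[0][0]:
                _, evicted = heapq.heapreplace(heap, (score, obj))
                members.discard(evicted)
                members.add(obj)
        # Threshold: t() over the last score seen on each list.
        threshold = agg(score for _, score in row)
        thresholds.append(threshold)
        if len(heap) == k and heap[0][0] >= threshold:
            break
    return sorted(heap, reverse=True), thresholds

top, thresholds = threshold_algorithm(LISTS, k=4)
print(top)
print([round(t, 2) for t in thresholds])
```

Unlike FA, the buffer never holds more than K entries, at the price of recomputing the score of an object that was tossed earlier and shows up again.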

TA Example (K = 4)
Same four lists L1 to L4 as in the FA example.
Threshold bar: T = t(x1, x2, x3, x4), where xi is the score at the current cursor position in Li.
  - Depth 1: T = 3.90.
  - Depth 2: T = 3.60.
  - Depth 3: T = 3.30.
  - Depth 4: T = 3.10.
  - Depth 5: T = 2.90 ==> can stop!
Overall scores computed along the way include 3.30 (H), 3.15 (G) and 3.05 (E, B). At depth 5 the K-th score in the buffer is 3.05 >= T.

TA Remarks

TA is Instance Optimal

TA IO Proof (contd.)

Proof (contd.)

Proof (contd.)

Proof (contd.)

Proof (concluded)

No Random Access Algorithm (NRA)
What if cost(RA) is much higher than cost(SA), or RA isn't allowed at all?
Do SA on all lists in parallel. At depth d:
  - Maintain the worst (bottom) scores seen so far, one per list: b1, …, bm.
  - For any object x seen in lists {1, …, i} with scores x1, …, xi:
    Best(x) = t(x1, …, xi, b(i+1), …, bm). Worst(x) = t(x1, …, xi, 0, …, 0).
  - TopK contains the K objects with the max Worst scores at depth d. Break ties using Best. M = K-th Worst score in TopK.
  - Object y is viable if Best(y) > M.
Stop when TopK contains >= K distinct objects and no object outside TopK is viable.
Return TopK.
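The bookkeeping of NRA can be sketched as below, assuming SUM as t(), equal-length lists and scores in [0, 1]. Unseen objects are treated as candidates whose Best is t() of the bottom scores, which is how "no object outside TopK is viable" is read here. The function and variable names are illustrative.

```python
LISTS = [
    [("H", 0.95), ("B", 0.90), ("E", 0.85), ("C", 0.80), ("G", 0.75),
     ("I", 0.70), ("D", 0.65), ("A", 0.60), ("J", 0.55), ("F", 0.50)],
    [("J", 1.00), ("C", 0.95), ("G", 0.85), ("H", 0.80), ("E", 0.75),
     ("B", 0.75), ("F", 0.60), ("A", 0.50), ("D", 0.40), ("I", 0.30)],
    [("C", 0.95), ("J", 0.80), ("D", 0.70), ("H", 0.65), ("G", 0.60),
     ("B", 0.55), ("I", 0.50), ("E", 0.45), ("F", 0.40), ("A", 0.30)],
    [("E", 1.00), ("G", 0.95), ("H", 0.90), ("B", 0.85), ("D", 0.80),
     ("C", 0.70), ("A", 0.65), ("I", 0.55), ("F", 0.45), ("J", 0.30)],
]

def bounds(scores, bottoms, agg):
    """scores: {list index: score seen under SA}. Returns (Worst, Best)."""
    m = len(bottoms)
    worst = agg(scores.get(i, 0.0) for i in range(m))
    best = agg(scores.get(i, bottoms[i]) for i in range(m))
    return worst, best

def nra(lists, k, agg=sum):
    known = {}  # object -> {list index: score}, filled by sorted access only
    for depth in range(max(len(lst) for lst in lists)):
        bottoms = [lst[depth][1] for lst in lists]  # equal-length lists
        for i, lst in enumerate(lists):
            obj, score = lst[depth]
            known.setdefault(obj, {})[i] = score
        b = {obj: bounds(s, bottoms, agg) for obj, s in known.items()}
        # Sort by Worst, breaking ties by Best (tuples compare in that order).
        ranked = sorted(b, key=b.get, reverse=True)
        top, rest = ranked[:k], ranked[k:]
        if len(top) == k:
            m_k = b[top[-1]][0]  # M: K-th Worst score in TopK
            viable = [obj for obj in rest if b[obj][1] > m_k]
            if not viable and agg(bottoms) <= m_k:
                break
    return [(obj, *b[obj]) for obj in top], depth + 1

result, depth = nra(LISTS, k=4)
print(result, depth)
```

NRA returns the top-K set with score bounds; the exact scores of the answers are known only if all their list entries happened to be reached under SA.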

NRA Example
Same four lists L1 to L4 as in the FA example. [Worst, Best] bounds of each object seen so far, by depth:

Depth  H             J             C             E             B             G             D             I
1      [0.95, 3.90]  [1.00, 3.90]  [0.95, 3.90]  [1.00, 3.90]
2      [0.95, 3.65]  [1.80, 3.65]  [1.90, 3.75]  [1.00, 3.65]  [0.90, 3.60]  [0.95, 3.60]
3      [1.85, 3.40]  [1.80, 3.55]  [1.90, 3.65]  [1.85, 3.40]  [0.90, 3.35]  [1.80, 3.35]  [0.70, 3.30]
4      [3.30, 3.30]  [1.80, 3.45]  [2.70, 3.55]  [1.85, 3.30]  [1.75, 3.20]  [1.80, 3.25]  [0.70, 3.15]
5      [3.30, 3.30]  [1.80, 3.35]  [2.70, 3.50]  [2.60, 3.20]  [1.75, 3.10]  [3.15, 3.15]  [1.50, 3.00]
6      [3.30, 3.30]  [1.80, 3.20]  [3.40, 3.40]  [2.60, 3.15]  [3.05, 3.05]  [3.15, 3.15]  [1.50, 2.95]  [0.70, 2.70]

NRA Features
What sort of t() do we need to assume for NRA to work correctly?
How large can the buffers get?
How does the amount of bookkeeping compare with TA?
NRA is instance optimal over algorithms that make no RA (and, of course, no wild guesses).

Combined optimization
What if we are told that cost(RA) is a known multiple of cost(SA)?
Can we find algorithms better than NRA and TA in this case?
Combined algorithm = CA. (See Fagin et al.'s paper for details.)

Worrying about I/O cost
Based on Bast et al., VLDB.
Inverted lists of (itemID, score) entries in descending score order, as usual, but on disk.
Within each block, entries are sorted by itemID; across blocks, scores are still in descending order.
==> Inverted Block Index (IBI) Algorithm.
What is an IBI?

A Motivating Example

List 1        List 2        List 3
Doc17 : 0.8   Doc25 : 0.7   Doc83 : 0.9
Doc78 : 0.2   Doc38 : 0.5   Doc17 : 0.7
·             Doc14 : 0.5   Doc61 : 0.3
·             Doc83 : 0.5   ·
·             ·             ·
·             Doc17 : 0.2   ·

Round 1 (SA on 1,2,3): Doc17 : [0.8, 2.4], Doc25 : [0.7, 2.4], Doc83 : [0.9, 2.4], unseen: ≤ 2.4
Round 2 (SA on 1,2,3): Doc17 : [1.5, 2.0], Doc25 : [0.7, 1.6], Doc83 : [0.9, 1.6], unseen: ≤ 1.4
Round 3 (SA on 2,2,3!): Doc17 : [1.5, 2.0], Doc83 : [1.4, 1.6], unseen: ≤ 1.0
Round 4 (RA for Doc17): Doc17 : 1.7, all others < 1.7, done!
Note the deviation from round-robin.

IBI Algorithm
Same setting as NRA/CA, except use IBI.
Maintain two lists: the Top-K items (T = d1, …, dk) and the StillHaveAShot (SHASH) items (S = dk+1, …, dk+q).
pos_i = current cursor position on list Li.
high_i = score in Li at the current cursor position (upper-bounds the score of unseen items).
For items d in S:
  - Which attribute scores are known: E(d).
  - Which attribute scores are unknown: E~(d).
  - Worst(d) = total score from E(d).
  - Best(d) = Worst(d) + Σ {high_i | i ∈ E~(d)}. (Exactly as in Fagin.)
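The Worst/Best definitions above can be checked against Round 2 of the motivating example. This is a small sketch assuming SUM aggregation; `ibi_bounds` and the dict representation of E(d) are illustrative choices, not part of the IBI framework itself.

```python
def ibi_bounds(known, high):
    """known: {list index: score}, i.e. E(d); high: high_i per list.

    Returns (Worst(d), Best(d)) under SUM aggregation.
    """
    worst = sum(known.values())
    # Lists whose score for d is still unknown contribute at most high_i.
    best = worst + sum(h for i, h in enumerate(high) if i not in known)
    return worst, best

# Cursor scores after Round 2 of the motivating example (Lists 1, 2, 3).
high = [0.2, 0.5, 0.7]

doc17 = ibi_bounds({0: 0.8, 2: 0.7}, high)  # seen in Lists 1 and 3
doc25 = ibi_bounds({1: 0.7}, high)          # seen in List 2 only
doc83 = ibi_bounds({2: 0.9}, high)          # seen in List 3 only
unseen = ibi_bounds({}, high)               # no score known yet

print(doc17, doc25, doc83, unseen)
```

The bounds agree, up to floating-point rounding, with the Round 2 intervals in the example: [1.5, 2.0], [0.7, 1.6], [0.9, 1.6] and unseen ≤ 1.4.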

IBI Algorithm (contd.)
In each round, compute:
  - min-k = min{Worst(d) | d ∈ T}.
  - bestscore that any unseen doc can have = sum of all high_i's.
  - For dj ∈ S: def_j = min-k – Worst(dj). [Denotes the deficit below the qualification level for top-k.]
T is sorted in descending Worst(); S is sorted in descending Best(). [Sorting on (score, itemID) for fast processing.]
Invariant: min-k >= max{Worst(d) | d ∈ S}.
Termination: when min-k >= max{Best(d) | d ∈ S}.
Can remove an object from S whenever its Best <= min-k ==> stop when S = {}.
Early termination AND minimal bookkeeping are BOTH important for performance.
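One round of this bookkeeping can be sketched as follows, using Rounds 3 and 4 of the motivating example with K = 1. The function `ibi_round` and its return shape are illustrative; the sketch also requires the unseen-document bound to be at most min-k before declaring termination, and the Round 3 interval for Doc25 is derived here from the cursor scores rather than read from the slide.

```python
def ibi_round(top, candidates, high):
    """top, candidates: {doc: (Worst, Best)} for T and S; high: high_i's.

    Returns (min_k, surviving candidates with their deficits, done).
    """
    min_k = min(worst for worst, _ in top.values())
    unseen_best = sum(high)  # best score any unseen doc can still reach
    survivors = {}
    for doc, (worst, best) in candidates.items():
        if best > min_k:  # still has a shot at the top-k
            survivors[doc] = {"worst": worst, "best": best,
                              "deficit": min_k - worst}
        # otherwise Best <= min-k and the doc is dropped from S
    done = not survivors and unseen_best <= min_k
    return min_k, survivors, done

# Round 3: Doc17 holds the top-1 slot; Doc25 and Doc83 are in S.
round3 = ibi_round(
    top={"Doc17": (1.5, 2.0)},
    candidates={"Doc25": (0.7, 1.2), "Doc83": (1.4, 1.6)},
    high=[0.2, 0.5, 0.3],
)

# Round 4: a random access resolves Doc17 to its exact score 1.7.
round4 = ibi_round(
    top={"Doc17": (1.7, 1.7)},
    candidates={"Doc83": (1.4, 1.6)},
    high=[0.2, 0.5, 0.3],
)

print(round3)
print(round4)
```

In Round 3, Doc25 is pruned while Doc83 survives with a small deficit, so the algorithm cannot stop; after the random access in Round 4, S empties and the run terminates.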

More on IBI Framework
Instead of scheduling SAs round-robin (RR), use a differential approach for different lists, based on the expected score reductions at future cursor positions (Knapsack).
Do SA* RA*: a series of SAs followed by a series of RAs.
Order the RAs based on the estimated Prob[dj can get into the top-k answers].