Laks V.S. Lakshmanan, Dept. of CS, UBC

Presentation transcript:

Ranking in DB

Why ranking in query answering? 1/3
Multimedia data, fuzzy querying: e.g., "find top 2 red objects with a soft texture".
Two ranked lists of (Obj, Score), one per sub-query:
  First list:  D 0.85, B 0.80, A 0.75, E 0.65, C 0.60
  Second list: A 0.9, D 0.8, C 0.4, B 0.3, E 0.1
Combine scores to obtain an overall score per object.
11/29/2018

Why ranking? 2/3
IR: "find top 5 documents relevant to 'computational', 'neuroscience' and 'brain theory'".
IR systems maintain full-text indexes: inverted lists of docs w.r.t. each keyword.
Same Q/A paradigm as before.

Why ranking? 3/3
Data streams, e.g., network flow data: "find 10 users with the max. BW consumption and max. #packets communicated".
Social networks: find 5 items tagged as most relevant to "lawn mowing" by the user's friends.
Etc.
Fagin et al.: pioneering papers in PODS'96, PODS'01, TODS 2003. This has since burgeoned into a field.
Focus: middleware algorithms which, given a score combination function, compute the top-K answers by probing different subsystems (or ranked lists).

Computational model
Naïve method. How to compute top-K efficiently?
Access methods:
- Sorted access (sequential access) [SA].
- Random access [RA].
Different optimization metrics:
- Overall running time of the algorithm.
- cost(SA) < cost(RA): minimize RAs.
- RA not possible (typical in IR systems): avoid RAs.
- Combined optimization.
This has led to a variety of algorithms.
Memory vs. disk model.
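One plausible reading of the naïve method is to read every entry of every list, score every object, and keep the best K. The sketch below assumes that reading. The function name `naive_top_k`, the choice t = min (a common fuzzy AND), and the assumption that every object appears in every list are illustrative, not taken from the slides. The data are the two ranked lists from the multimedia example.

```python
import heapq

# Two ranked lists from the multimedia example, as (object, score) pairs.
LIST_1 = [("D", 0.85), ("B", 0.80), ("A", 0.75), ("E", 0.65), ("C", 0.60)]
LIST_2 = [("A", 0.9), ("D", 0.8), ("C", 0.4), ("B", 0.3), ("E", 0.1)]


def naive_top_k(lists, k, t=min):
    """Scan all lists completely, score every object, return the best k.

    t is the score combination function (min is an assumption here).
    Every object is assumed to appear in every list.
    """
    per_object = {}
    for lst in lists:
        for obj, score in lst:
            per_object.setdefault(obj, []).append(score)
    overall = {obj: t(scores) for obj, scores in per_object.items()}
    sorted_accesses = sum(len(lst) for lst in lists)  # every entry is read
    top = heapq.nlargest(k, overall.items(), key=lambda pair: pair[1])
    return top, sorted_accesses


top, accesses = naive_top_k([LIST_1, LIST_2], k=2)
print(top)       # [('D', 0.8), ('A', 0.75)]
print(accesses)  # 10
```

The cost is linear in the total list length regardless of K, which is what the algorithms on the following slides try to avoid.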

Fagin's Algorithm (FA)
Input: m lists sorted by descending scores.
1. Access (SA) all lists in parallel.
2. Remember each object seen (under SA) in all m lists in set H. Repeat the sorted accesses until |H| >= K.
3. For each object x seen, do RA on the other lists as needed to find its "missing" scores.
4. Compute the overall score of x as t(x) = t(x1, …, xm) and store (obj, score) in set Y.
5. Sort Y in descending order of scores, breaking ties arbitrarily, and output the top K.
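The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the name `fagin_algorithm`, the dictionaries that stand in for random access, the choice t = sum, and the rounding to two decimals are assumptions. The data are the four lists L1 to L4 of the deck's running example, and every object is assumed to appear in every list.

```python
LISTS = [
    [("H", 0.95), ("B", 0.90), ("E", 0.85), ("C", 0.80), ("G", 0.75),
     ("I", 0.70), ("D", 0.65), ("A", 0.60), ("J", 0.55), ("F", 0.50)],
    [("J", 1.00), ("C", 0.95), ("G", 0.85), ("H", 0.80), ("E", 0.75),
     ("B", 0.75), ("F", 0.60), ("A", 0.50), ("D", 0.40), ("I", 0.30)],
    [("C", 0.95), ("J", 0.80), ("D", 0.70), ("H", 0.65), ("G", 0.60),
     ("B", 0.55), ("I", 0.50), ("E", 0.45), ("F", 0.40), ("A", 0.30)],
    [("E", 1.00), ("G", 0.95), ("H", 0.90), ("B", 0.85), ("D", 0.80),
     ("C", 0.70), ("A", 0.65), ("I", 0.55), ("F", 0.45), ("J", 0.30)],
]


def fagin_algorithm(lists, k, t=sum):
    """Return (top-k list of (object, score), depth reached under SA)."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]   # simulates random access
    seen_in = {}                            # object -> set of lists where seen
    fully_seen = 0                          # |H|
    depth = 0
    max_depth = max(len(lst) for lst in lists)
    # Phase 1: sorted access in parallel until K objects were seen in all lists.
    while fully_seen < k and depth < max_depth:
        for i, lst in enumerate(lists):
            if depth < len(lst):
                obj, _ = lst[depth]
                where = seen_in.setdefault(obj, set())
                where.add(i)
                if len(where) == m:
                    fully_seen += 1
        depth += 1
    # Phase 2: random access to complete the score of every seen object.
    y = {obj: round(t(lookup[i][obj] for i in range(m)), 2) for obj in seen_in}
    ranked = sorted(y.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k], depth


top, depth = fagin_algorithm(LISTS, k=4)
print(depth)  # 6
print(top)    # C 3.4, H 3.3, G 3.15, then one of the two objects scoring 3.05
```

Note that Y holds every object seen under sorted access, so its size is not bounded by K.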

Example of FA (K = 4)
Objects: A B C D E F G H I J. Lists sorted by descending score, one row per depth:

depth  L1       L2       L3       L4
1      H(0.95)  J(1.00)  C(0.95)  E(1.00)
2      B(0.90)  C(0.95)  J(0.80)  G(0.95)
3      E(0.85)  G(0.85)  D(0.70)  H(0.90)
4      C(0.80)  H(0.80)  H(0.65)  B(0.85)
5      G(0.75)  E(0.75)  G(0.60)  D(0.80)
6      I(0.70)  B(0.75)  B(0.55)  C(0.70)
7      D(0.65)  F(0.60)  I(0.50)  A(0.65)
8      A(0.60)  A(0.50)  E(0.45)  I(0.55)
9      J(0.55)  D(0.40)  F(0.40)  F(0.45)
10     F(0.50)  I(0.30)  A(0.30)  J(0.30)

Y = answers seen in >= 1 list (unsorted); H = answers seen (under SA) in all 4 lists. The overall scores shown are the sums of the four list scores.
Depth 1: Y gains H 3.30, J 2.65, C 3.40, E 3.05.
Depth 2: Y gains B 3.05, G 3.15.
Depth 3: Y gains D 2.55.
Depth 4: H = {H}.
Depth 5: H = {H, G}.
Depth 6: Y gains I 2.05; H = {H, G, B, C}, so |H| = 4.

FA Example concluded
A, F: not seen in any list. Yet we are sure they can't make it into the top-4. Why?
Based on where the cursors are now, what is the max. possible score for A, F?
What assumptions are being made about t()?
FA is shown to be optimal with very high probability [Fagin: PODS 1996], but it can be beaten by other algorithms on specific inputs.
What about buffer size?
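One way to work out the bound asked for above, as a sketch that assumes t is the sum of the four list scores (which matches the overall scores in the example) and, more generally, that t is monotone. At depth 6 the last scores read under SA are 0.70 (L1), 0.75 (L2), 0.55 (L3) and 0.70 (L4), and an unseen object can score no higher than these in any list:

```latex
\[
t(A),\; t(F) \;\le\; 0.70 + 0.75 + 0.55 + 0.70 \;=\; 2.70 \;<\; 3.05
\]
```

Here 3.05 is the 4th-best overall score among the objects already seen (C 3.40, H 3.30, G 3.15, then B and E at 3.05).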

Threshold Algorithm (TA)
1. Do parallel SA on all m lists.
2. For each new object x, fetch its scores from the other lists (by RA) and compute its overall score.
   - If |Buffer| < K, add x to Buffer.
   - Else if score(x) <= k-th score in Buffer, toss x.
   - Else replace the bottom of Buffer with (x, score(x)).
3. Threshold := t(worst score seen on L1, …, worst score seen on Lm).
4. Stop when Threshold <= k-th score in Buffer.
5. Output the top-K objects and scores (in Buffer).
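A minimal Python sketch of the steps above, under stated assumptions: the name `threshold_algorithm`, the dictionaries that simulate random access, t = sum, equal-length lists, and rounding to two decimals are all illustrative choices. The `seen` set is a convenience so that an object is scored only once. The data are the four lists L1 to L4 of the running example.

```python
import heapq

LISTS = [
    [("H", 0.95), ("B", 0.90), ("E", 0.85), ("C", 0.80), ("G", 0.75),
     ("I", 0.70), ("D", 0.65), ("A", 0.60), ("J", 0.55), ("F", 0.50)],
    [("J", 1.00), ("C", 0.95), ("G", 0.85), ("H", 0.80), ("E", 0.75),
     ("B", 0.75), ("F", 0.60), ("A", 0.50), ("D", 0.40), ("I", 0.30)],
    [("C", 0.95), ("J", 0.80), ("D", 0.70), ("H", 0.65), ("G", 0.60),
     ("B", 0.55), ("I", 0.50), ("E", 0.45), ("F", 0.40), ("A", 0.30)],
    [("E", 1.00), ("G", 0.95), ("H", 0.90), ("B", 0.85), ("D", 0.80),
     ("C", 0.70), ("A", 0.65), ("I", 0.55), ("F", 0.45), ("J", 0.30)],
]


def threshold_algorithm(lists, k, t=sum):
    """Return (buffer sorted best-first, depth reached, final threshold)."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]   # simulates random access
    buffer = []                             # min-heap of (score, object), size <= k
    seen = set()
    depth, threshold = 0, None
    while depth < len(lists[0]):
        row = [lst[depth] for lst in lists]  # one sorted access per list
        depth += 1
        for obj, _ in row:
            if obj in seen:
                continue
            seen.add(obj)
            score = round(t(lookup[i][obj] for i in range(m)), 2)
            if len(buffer) < k:
                heapq.heappush(buffer, (score, obj))
            elif score > buffer[0][0]:
                heapq.heapreplace(buffer, (score, obj))  # evict current bottom
            # otherwise: toss the object
        threshold = round(t(score for _, score in row), 2)
        if len(buffer) == k and threshold <= buffer[0][0]:
            break
    return sorted(buffer, reverse=True), depth, threshold


result, depth, threshold = threshold_algorithm(LISTS, k=4)
print(depth, threshold)  # 5 2.9
print(result)            # C 3.4, H 3.3, G 3.15, then an object scoring 3.05
```

Unlike the set Y in FA, the buffer never holds more than K entries.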

TA Example (K = 4; same lists L1 to L4 as in the FA example)
x1, …, x4 are the worst scores seen so far on L1, …, L4 (the threshold bar); T = t(x1, x2, x3, x4).

depth  x1    x2    x3    x4    T
1      0.95  1.00  0.95  1.00  3.90
2      0.90  0.95  0.80  0.95  3.60
3      0.85  0.85  0.70  0.90  3.30
4      0.80  0.80  0.65  0.85  3.10
5      0.75  0.75  0.60  0.80  2.90  ==> can stop!

Buffer contents:
Depth 1: scores 3.40, 3.30, 3.05, 2.65 enter the buffer.
Depth 2: 3.05 and 3.15 arrive; 2.65 and one of the 3.05 entries are evicted (X).
Depth 3: 2.55 arrives and is tossed (X).
Depths 4, 5: no new objects. At depth 5, T = 2.90 <= 3.05 (the 4th score in the buffer).
Rows 6 to 10 of the lists are never read.

TA Remarks
What properties do we require of t() for TA to be correct?
How large does the buffer ever get with TA? What happened with FA?
Performance guarantee of TA (instance optimality): let $\mathbf{D}$ be a class of DBs and $\mathbf{A}$ a class of algorithms. $A \in \mathbf{A}$ is instance optimal provided $\forall B \in \mathbf{A},\ \forall D \in \mathbf{D}$: $\mathrm{cost}(A, D) \le c \cdot \mathrm{cost}(B, D) + c'$, for some fixed constants $c, c'$. The constant $c$ is the optimality ratio.
TA is instance optimal over algorithms that do not make wild guesses.

No Random Access Algorithm (NRA)
What if cost(RA) > cost(SA), or RA wasn't allowed?
Do SA on all lists in parallel. At depth d:
- Maintain the worst (last-read) scores $\underline{x}_1, \dots, \underline{x}_m$, one per list.
- For any object x seen so far, say in lists {1, …, i} with scores $x_1, \dots, x_i$:
  $\mathrm{Best}(x) = t(x_1, \dots, x_i, \underline{x}_{i+1}, \dots, \underline{x}_m)$,
  $\mathrm{Worst}(x) = t(x_1, \dots, x_i, 0, \dots, 0)$.
- TopK contains the K objects with the max. Worst scores at depth d. Break ties using Best.
- M = k-th Worst score in TopK.
- Object y is viable if Best(y) > M.
Stop when TopK contains >= K distinct objects and no object outside TopK is viable. Return TopK.
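The bookkeeping above can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: the names `bounds` and `nra`, t = sum, equal-length lists, rounding to two decimals, and recomputing all bounds at every depth (simple but not efficient) are my choices. The stopping test also requires that an entirely unseen object is not viable, i.e. t of the last-read scores is <= M. The data are the four lists L1 to L4 of the running example.

```python
LISTS = [
    [("H", 0.95), ("B", 0.90), ("E", 0.85), ("C", 0.80), ("G", 0.75),
     ("I", 0.70), ("D", 0.65), ("A", 0.60), ("J", 0.55), ("F", 0.50)],
    [("J", 1.00), ("C", 0.95), ("G", 0.85), ("H", 0.80), ("E", 0.75),
     ("B", 0.75), ("F", 0.60), ("A", 0.50), ("D", 0.40), ("I", 0.30)],
    [("C", 0.95), ("J", 0.80), ("D", 0.70), ("H", 0.65), ("G", 0.60),
     ("B", 0.55), ("I", 0.50), ("E", 0.45), ("F", 0.40), ("A", 0.30)],
    [("E", 1.00), ("G", 0.95), ("H", 0.90), ("B", 0.85), ("D", 0.80),
     ("C", 0.70), ("A", 0.65), ("I", 0.55), ("F", 0.45), ("J", 0.30)],
]


def bounds(seen_scores, bottom, t):
    """(Worst, Best) for one object: unknown scores become 0 or the list bottom."""
    m = len(bottom)
    worst = round(t(seen_scores.get(i, 0.0) for i in range(m)), 2)
    best = round(t(seen_scores.get(i, bottom[i]) for i in range(m)), 2)
    return worst, best


def nra(lists, k, t=sum):
    """Return (TopK as (object, (Worst, Best)) pairs, depth reached)."""
    known = {}            # object -> {list index: score seen under SA}
    wb, top = {}, []
    for depth in range(len(lists[0])):
        bottom = []       # last score read on each list at this depth
        for i, lst in enumerate(lists):
            obj, score = lst[depth]
            known.setdefault(obj, {})[i] = score
            bottom.append(score)
        wb = {obj: bounds(seen, bottom, t) for obj, seen in known.items()}
        # Order by Worst, breaking ties by Best.
        ranked = sorted(known, key=lambda obj: wb[obj], reverse=True)
        top, rest = ranked[:k], ranked[k:]
        if len(top) < k:
            continue
        m_k = wb[top[-1]][0]                  # M = k-th Worst score
        unseen_best = round(t(bottom), 2)     # bound for objects never seen
        if unseen_best <= m_k and all(wb[obj][1] <= m_k for obj in rest):
            return [(obj, wb[obj]) for obj in top], depth + 1
    return [(obj, wb[obj]) for obj in top], len(lists[0])


top, depth = nra(LISTS, k=4)
print(depth)  # 8
print(top)    # C, H, G and one of the objects scoring 3.05, with exact bounds
```

On this input NRA reads deeper (depth 8) than FA (6) or TA (5), because it must tighten the Best bounds of E and J using sorted access alone.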

NRA Example (same lists L1 to L4 as in the FA example)
Each seen object carries an interval [Worst, Best].
Depth 1 (rows read: H(0.95), J(1.00), C(0.95), E(1.00)):
  C [0.95, 3.90]   E [1.00, 3.90]   H [0.95, 3.90]   J [1.00, 3.90]
Depth 2 (rows read: B(0.90), C(0.95), J(0.80), G(0.95)):
  B [0.90, 3.60]   C [1.90, 3.75]   E [1.00, 3.65]
  G [0.95, 3.60]   H [0.95, 3.65]   J [1.80, 3.65]

NRA Features
What sort of t() do we need to assume for NRA to work correctly?
How large can the buffers get?
How does the amount of bookkeeping compare with TA?
NRA is instance optimal over algorithms that make no random accesses.

Combined optimization
What if we are told that cost(RA) is a known constant multiple of cost(SA)?
Can we find algorithms better than NRA and TA in this case?
Combined algorithm = CA. (See Fagin et al.'s paper for details.)

Worrying about I/O cost
Based on Bast et al., VLDB 2006.
Inverted lists of (itemID, score) entries in desc. score order, as usual, but on disk.
Within each block, entries are sorted by itemID; across blocks, the list is still in desc. score order.
=> Inverted Block Index (IBI) Algorithm.
What is an IBI?

A Motivating Example
List 1: Doc17 : 0.8, Doc78 : 0.2, …
List 2: Doc25 : 0.7, Doc38 : 0.5, Doc14 : 0.5, Doc83 : 0.5, …, Doc17 : 0.2, …
List 3: Doc83 : 0.9, Doc17 : 0.7, Doc61 : 0.3, …

Round 1 (SA on 1,2,3): Doc17 : [0.8 , 2.4]   Doc25 : [0.7 , 2.4]   Doc83 : [0.9 , 2.4]   unseen: ≤ 2.4
Round 2 (SA on 1,2,3): Doc17 : [1.5 , 2.0]   Doc25 : [0.7 , 1.6]   Doc83 : [0.9 , 1.6]   unseen: ≤ 1.4
Round 3 (SA on 2,2,3!): Doc17 : [1.5 , 2.0]   Doc83 : [1.4 , 1.6]   unseen: ≤ 1.0
  Note the deviation from round-robin.
Round 4 (RA for Doc17): Doc17 : 1.7; all others < 1.7; done!

IBI Algorithm
Same setting as NRA/CA, except use IBI.
Maintain two lists: Top-K items (T = d1, …, dk) and StillHaveAShot (SHASH) items (S = dk+1, …, dk+q).
pos_i = current cursor position on list Li.
high_i = score in Li at the current cursor position (upper-bounds the score of unseen items).
For items d in S:
- $E(d)$: the attributes whose scores are known.
- $\tilde{E}(d)$: the attributes whose scores are unknown.
- Worst(d) = total score from $E(d)$.
- $\mathrm{Best}(d) = \mathrm{Worst}(d) + \sum \{ high_i \mid i \in \tilde{E}(d) \}$.
(Exactly as in Fagin.)

IBI Algorithm (contd.)
In each round, compute:
- $\text{min-k} = \min\{\mathrm{Worst}(d) \mid d \in T\}$.
- bestscore that any unseen doc can have = sum of all high_i's.
- For $d_j \in S$: $\text{def}_j = \text{min-k} - \mathrm{Worst}(d_j)$ [the deficit below the qualification level for top-k].
T is sorted in desc. Worst(); S is sorted in desc. Best(). [Sorting on (score, ItemID) for fast processing.]
Invariant: $\text{min-k} \ge \max\{\mathrm{Worst}(d) \mid d \in S\}$.
Termination: when $\text{min-k} \ge \max\{\mathrm{Best}(d) \mid d \in S\}$. An object can be removed from S whenever its Best <= min-k, so stop when S = {}.
Early termination AND minimal bookkeeping are BOTH important for performance.
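The per-round bookkeeping can be sketched in Python as below. This is not the IBI implementation of Bast et al.; the function name `round_state`, the dictionary layout, and the rounding to two decimals are assumptions for illustration. The `done` flag additionally checks that unseen documents cannot qualify, which is my reading of why bestscore is computed. The input is the state after Round 2 of the motivating example (K = 1), including the two documents that the slide no longer lists.

```python
def round_state(known, high, k):
    """Compute T, S, min-k and the deficits for one round.

    known: doc -> {list index: score already seen}
    high:  score at the current cursor position of each list
    """
    m = len(high)
    worst = {d: round(sum(s.values()), 2) for d, s in known.items()}
    best = {
        d: round(worst[d] + sum(high[i] for i in range(m) if i not in s), 2)
        for d, s in known.items()
    }
    by_worst = sorted(known, key=lambda d: (worst[d], d), reverse=True)
    top = by_worst[:k]                                   # T, desc. Worst
    min_k = worst[top[-1]] if len(top) == k else 0.0
    # S keeps only documents that still have a shot: Best > min-k.
    shash = sorted(
        (d for d in by_worst[k:] if best[d] > min_k),
        key=lambda d: (best[d], d),
        reverse=True,
    )
    deficit = {d: round(min_k - worst[d], 2) for d in shash}
    unseen_best = round(sum(high), 2)
    done = not shash and unseen_best <= min_k
    return top, shash, min_k, deficit, done


# State after Round 2 of the motivating example (lists are indexed 0, 1, 2).
known = {
    "Doc17": {0: 0.8, 2: 0.7},
    "Doc25": {1: 0.7},
    "Doc83": {2: 0.9},
    "Doc78": {0: 0.2},
    "Doc38": {1: 0.5},
}
high = [0.2, 0.5, 0.7]

top, shash, min_k, deficit, done = round_state(known, high, k=1)
print(top, min_k)  # ['Doc17'] 1.5
print(shash)       # ['Doc83', 'Doc25']
print(deficit)     # {'Doc83': 0.6, 'Doc25': 0.8}
print(done)        # False
```

Doc78 and Doc38 drop out because their Best score (1.4) is already below min-k = 1.5, which matches their absence from Round 2 on the slide.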

More on IBI Framework
Instead of scheduling SAs round-robin (RR), treat the lists differently, based on the expected score reductions at future cursor positions (Knapsack).
Do SA*RA*: a run of SAs followed by a run of RAs.
Order RAs by the estimated Prob[dj can get into the top-k answers].