An Access Cost-Aware Approach for Object Retrieval over Multiple Sources Benjamin Arai, UC Riverside Gautam Das, UT Arlington Dimitrios Gunopulos, U of.

Slides:



Advertisements
Similar presentations
You have been given a mission and a code. Use the code to complete the mission and you will save the world from obliteration…
Advertisements

Chapter 13: Query Processing
1 Senn, Information Technology, 3 rd Edition © 2004 Pearson Prentice Hall James A. Senns Information Technology, 3 rd Edition Chapter 7 Enterprise Databases.
Advanced Piloting Cruise Plot.
Optimal Adaptive Execution of Portfolio Transactions
Chapter 1 The Study of Body Function Image PowerPoint
1 Copyright © 2010, Elsevier Inc. All rights Reserved Fig 2.1 Chapter 2.
Effective Change Detection Using Sampling Junghoo John Cho Alexandros Ntoulas UCLA.
Dynamic Programming Introduction Prof. Muhammad Saeed.
Designing Services for Grid-based Knowledge Discovery A. Congiusta, A. Pugliese, Domenico Talia, P. Trunfio DEIS University of Calabria ITALY
Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13
Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13
Title Subtitle.
My Alphabet Book abcdefghijklm nopqrstuvwxyz.
Multiplying binomials You will have 20 seconds to answer each of the following multiplication problems. If you get hung up, go to the next problem when.
DIVIDING INTEGERS 1. IF THE SIGNS ARE THE SAME THE ANSWER IS POSITIVE 2. IF THE SIGNS ARE DIFFERENT THE ANSWER IS NEGATIVE.
Addition Facts
Year 6 mental test 5 second questions
Query optimisation.
1 Term 2, 2004, Lecture 9, Distributed DatabasesMarian Ursu, Department of Computing, Goldsmiths College Distributed databases 3.
Overview of Lecture Partitioning Evaluating the Null Hypothesis ANOVA
SADC Course in Statistics Objectives and analysis Module B2, Session 14.
Xia Zhou*, Stratis Ioannidis ♯, and Laurent Massoulié + * University of California, Santa Barbara ♯ Technicolor Research Lab, Palo Alto + Technicolor Research.
Evaluating Window Joins over Unbounded Streams Author: Jaewoo Kang, Jeffrey F. Naughton, Stratis D. Viglas University of Wisconsin-Madison CS Dept. Presenter:
Comp 122, Spring 2004 Order Statistics. order - 2 Lin / Devi Comp 122 Order Statistic i th order statistic: i th smallest element of a set of n elements.
Google News Personalization: Scalable Online Collaborative Filtering
CMU SCS : Multimedia Databases and Data Mining Lecture #17: Text - part IV (LSI) C. Faloutsos.
Configuration management
Information Systems Today: Managing in the Digital World
Chapter 18 Methodology – Monitoring and Tuning the Operational System Transparencies © Pearson Education Limited 1995, 2005.
Eftychia Baikousi Panos Vassiliadis
On Comparing Classifiers : Pitfalls to Avoid and Recommended Approach
David Luebke 1 6/7/2014 ITCS 6114 Skip Lists Hashing.
ABC Technology Project
Center for E-Business Technology Seoul National University Seoul, Korea WebTables: Exploring the Power of Tables on the Web Michael J. Cafarella, Alon.
Megastore: Providing Scalable, Highly Available Storage for Interactive Services. Presented by: Hanan Hamdan Supervised by: Dr. Amer Badarneh 1.
Copyright © Cengage Learning. All rights reserved.
Cache and Virtual Memory Replacement Algorithms
Making Time-stepped Applications Tick in the Cloud Tao Zou, Guozhang Wang, Marcos Vaz Salles*, David Bindel, Alan Demers, Johannes Gehrke, Walker White.
1 Breadth First Search s s Undiscovered Discovered Finished Queue: s Top of queue 2 1 Shortest path from s.
Phase II/III Design: Case Study
1 Evaluations in information retrieval. 2 Evaluations in information retrieval: summary The following gives an overview of approaches that are applied.
Squares and Square Root WALK. Solve each problem REVIEW:
Traditional IR models Jian-Yun Nie.
The x- and y-Intercepts
Chapter 5 Test Review Sections 5-1 through 5-4.
GG Consulting, LLC I-SUITE. Source: TEA SHARS Frequently asked questions 2.
1 General Iteration Algorithms by Luyang Fu, Ph. D., State Auto Insurance Company Cheng-sheng Peter Wu, FCAS, ASA, MAAA, Deloitte Consulting LLP 2007 CAS.
Addition 1’s to 20.
25 seconds left…...
Equal or Not. Equal or Not
Slippery Slope
Week 1.
Part 25: Qualitative Data 25-1/21 Statistics and Data Analysis Professor William Greene Stern School of Business IOMS Department Department of Economics.
We will resume in: 25 Minutes.
©Brooks/Cole, 2001 Chapter 12 Derived Types-- Enumerated, Structure and Union.
Chapter 8 Estimation Understandable Statistics Ninth Edition
A SMALL TRUTH TO MAKE LIFE 100%
Excel Lesson 13 Using Powerful Excel Functions Microsoft Office 2010 Advanced Cable / Morrison 1.
Introduction Distance-based Adaptable Similarity Search
INFORMATION SOLUTIONS Citation Analysis Reports. Copyright 2005 Thomson Scientific 2 INFORMATION SOLUTIONS Provide highly customized datasets based on.
all-pairs shortest paths in undirected graphs
Probabilistic Reasoning over Time
Davide Mottin, Senjuti Basu Roy, Alice Marascu, Yannis Velegrakis, Themis Palpanas, Gautam Das A Probabilistic Optimization Framework for the Empty-Answer.
Overcoming Limitations of Sampling for Agrregation Queries Surajit ChaudhuriMicrosoft Research Gautam DasMicrosoft Research Mayur DatarStanford University.
1 Ranked Queries over sources with Boolean Query Interfaces without Ranking Support Vagelis Hristidis, Florida International University Yuheng Hu, Arizona.
Probabilistic Ranking of Database Query Results Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International.
Answering Top-k Queries Using Views Gautam Das (Univ. of Texas), Dimitrios Gunopulos (Univ. of California Riverside), Nick Koudas (Univ. of Toronto), Dimitris.
Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International University Gerhard Weikum, MPI Informatik.
Probabilistic Ranking of Database Query Results
Presentation transcript:

An Access Cost-Aware Approach for Object Retrieval over Multiple Sources Benjamin Arai, UC Riverside Gautam Das, UT Arlington Dimitrios Gunopulos, U of Athens Vagelis Hristidis, Florida International U Nick Koudas, U of Toronto

Motivating Example – Single-source Optimal Access Strategy PubMed Dataset – 17 million publications User searches papers that mention cancer and also cite a paper that contains the phrase genetic propensity in its title. Not supported by advanced search interface. User looking for 5 satisfying papers, given that cancer returns more than two million papers. What is best retrieval strategy? Access Cost-Aware Object Retrieval VLDB'10 2

Motivating Example – Multi-source Optimal Access Strategy Commercial legal research sources (e.g., LexisNexis, Westlaw) offer pay-per-search data services. Each source has distinct access cost (e.g., latency, per-object monetary cost) Their public interfaces allow conditions like filing date range, but not keyword. User looking for 5 cases including civil and riverside. What is best retrieval strategy? Access Cost-Aware Object Retrieval VLDB'10 3

Challenges Source selection: which source(s) to retrieve objects from. Object selection. how many objects to retrieve from each selected source. Cost may be in the form of money, bandwidth, etc. Looking for any-k satisfying objects; dont care about ranking. Access Cost-Aware Object Retrieval VLDB'10 4

Source Access Cost Model Cost Function for PubMed: Average cost (seconds) to retrieve 1 to 1000 publication ids from the PubMed online database for query cancer. Use PubMed ESearch online access system. a = sec, b = sec Access Cost-Aware Object Retrieval VLDB'10 5 We assume each source has a cost function, monotonic on # items l retrieved. Well accepted cost model: – fixed per-access overhead a – Per-item cost b – Cost i (l) = a i + b i · l Other cost models also supported

Assumptions Cost of retrieving l objects from source S i is Cost i (l), monotone (increasing) on l. – E.g., Costi(l) = a i + b i · l, a i : access overhead cost, b i : per-object access cost. GetNext i (l): request l new objects from source S i Query satisfying probability p i (Q) of an object from S i. Statistics about objects stored in a source. Leverage existing techniques to sample query-independently during pre-processing. – p i (Q) unavailable in some problem variants. Access Cost-Aware Object Retrieval VLDB'10 6

Problem Variants Single Source, Query Satisfaction Probability (SSQSP): Given any-k query Q, data source S i with access cost Cost i (l), and query satisfaction probability p i (Q), find best access strategy. Single Source, No Distribution Data (SSNDD): Given Q, Cost i (l), and no query satisfaction information, find best access strategy. Multi Source, Query Satisfaction Probability (MSQSP): Given Q, S 1,..., S q with Cost i (l), p i (Q), find best access strategy. Multi Source, No Distribution Data (MSNDD): Given Q, S 1,..., S q with Cost i (l), and no query satisfaction information, find best access strategy. Access Cost-Aware Object Retrieval VLDB'10 7

Example – 3 Sources, Any-10 Query Sample source overheads, per-object costs and probabilities for sources S1, S2, and S3. 1 st Iteration: – Expected # retrieved objects for any-10 satisfying for S1,S3: 1000 – Cost 1 (1000) = × 1000 = 1300 – Cost 3 (1000) = 100+2×1000 = nd iteration (assume we retrieved 9 satisfying objects in 1 st ) – Expected # retrieved objects for any-1 satisfying for S1,S3: 100 – Cost 1 (100) = × 100 = 400 – Cost3(100) = × 100 = 300 Access Cost-Aware Object Retrieval VLDB'10 8 S1S2S3 Access Overhead (a) Per-object Overhead (b)112 Query Satisfaction Probability (p)1%0%1%

P-SSQSP: Probabilistic Algorithm for SSQSP Estimate # objects to be retrieved to complete query with given probability (e.g., 95%) Probability that exactly s satisfying objects contained in next l objects computed using binomial distribution Probability that Q satisfied by retrieving l objects given by cumulative binomial distribution Pick minimum l such that P(Q completed) > x%. Drawback: how can x be estimated? Access Cost-Aware Object Retrieval VLDB'10 9

DP-SSQSP: Dynamic Programming Algorithm for SSQSP Computes optimal access strategy, i.e. how many objects to retrieve at each step. Optimality intuitively because it takes into account the cost incurred when next step does not complete the query (does not retrieve all any-k objects). Access Cost-Aware Object Retrieval VLDB'10 10

DP-SSQSP (contd) – key formulas C(n, l): optimal expected cost of completing Q given that n satisfying objects have been retrieved so far, and l objects will be retrieved in next step C(n): optimal expected cost to complete Q given that n satisfying objects have been retrieved so far. Note recursion! P(s sat in l retr): probability that exactly s satisfying objects contained in l retrieved objects Access Cost-Aware Object Retrieval VLDB'10 11

DP-SSQSP (contd) – handle recursion Dynamic Programming Table: Access Cost-Aware Object Retrieval VLDB'10 12 Find the l N value for which the following equation produces the minimum C(n). Can use analytical (derivative) or numerical methods.

DP-SSQSP (contd) – Lookup table For performance, we precompute a table specifying optimal # retrieved objects given # remaining requested satisfying objects. E.g., Lookup table for p = 0.01, k = 3, Cost(l) = 10+1 · l Access Cost-Aware Object Retrieval VLDB'10 13 k-nl

Multiple Sources Variants P-MSQSP: Probabilistic Algorithm for MSQSP At every step compute # l i of objects to retrieve from each source to complete the query with probability x%. Pick source S i with minimum access cost Cost i (l i ) to do the next retrieval of l i objects. Access Cost-Aware Object Retrieval VLDB'10 14

DP-MSQSP: Dynamic Programming Algorithm for MSQSP Lookup table: for each n value, the lookup table stores the source id v, in addition to the number l of objects to retrieve Access Cost-Aware Object Retrieval VLDB'10 15 Additional variable in dynamic programming: source id v Dynamic Programming table

DP-MSQSP (contd) Find l to minimize where C(n) is the expected cost of completing Q given that n satisfying objects have been retrieved so far, and the next access will be at source S v and will retrieve l objects. Access Cost-Aware Object Retrieval VLDB'10 16

Experiments - Datasets IMDB (Internet Movie Database) – 236,627 titles – Extracted a list of keywords describing each movie – 25 queries, e.g., based-on-novel, revenge, with frequency 1% to 5% – Retrieve movies in alphabetical order Westlaw legal cases – 60,000 case filings for the state of California over a two- month period – 25 queries, e.g., copyright, trustee with frequency 0.1% to 1.0% – retrieve objects in random order Access Cost-Aware Object Retrieval VLDB'10 17

Single-Source Experiments – Baseline algorithms BAR (Base-AdaptStep-a/b): retrieve a/b objects from source in first round, doubling the sample size each round BAK(Base-AdaptStep-k): retrieve k objects from source in first round, doubling the retrieval size each round BFR (Base-FixedStep-a/b): retrieve a/b objects from the source as a fixed-step process Optimal: Assume an oracle says how far the k-th satisfying result is. Access Cost-Aware Object Retrieval VLDB'10 18

Experiments -Cost k=25, IMDB k=100, IMDB Access Cost-Aware Object Retrieval VLDB'10 19

Experiments – Relative Error Relative error in cost= (Cost Optimal) / Optimal Access Cost-Aware Object Retrieval VLDB'10 20 k=25, IMDB k=100, IMDB

Experiments – Compute Lookup tables Time (seconds) to generate dynamic programming tables for single and multiple source data sets. Access Cost-Aware Object Retrieval VLDB'10 21

Related work GIOSS, CORI create collection selection index. Each source ranked per query based on an object ranking algorithm. Mihaila et al present a framework to discover and combine Internet sources to answer complex queries. Do not address problem of determining the optimal number of objects to retrieve from a selected source or subsequent accesses. Top-k and probabilistic top-k algorithms mostly focus on vertically partitioned sources. In our setting it is horizontally partitioned. Fuhr 1999 require precision-recall graph of each source to compute optimal number of retrieved tuples from each source. Probing, sampling sources to predict utility. Access Cost-Aware Object Retrieval VLDB'10 22

Conclusions Presented cost-aware approach to source and object selection Dynamic programming algorithms Exploit cost access characteristics of sources Single or multiple sources Access Cost-Aware Object Retrieval VLDB'10 23

Thank you! ACKNOWLEDGMENTS: Gautam Das was supported by NSF grants , and , a grant from Dept of Education, and unrestricted gifts from Microsoft Research and Nokia Research. Dimitrios Gunopulos was supported by NSF, and by the SemsorGrid4Env, Health-e-Child, and MODAP projects funded by the European Commission. Vagelis Hristidis was supported by NSF grants IIS , IIS , and DHS grant 2009-ST Access Cost-Aware Object Retrieval VLDB'10 24

Variants where no distribution data is known For the problems SSNDD (Single-Source, No Distribution Data) and MSNND (Multi-Source, No Distribution Data), where no information on the distribution of query satisfying objects is available, we modify our dynamic programming algorithms to learn the distribution from the objects retrieved during query execution. Access Cost-Aware Object Retrieval VLDB'10 25

DP-SSQSP - complexity To complete the Lookup table Space: O(k) Time: O(k·l max ·(k n)) l max : number of points computed in order to find the minimum Access Cost-Aware Object Retrieval VLDB'10 26