
1 Online Ordering of Overlapping Data Sources
Mariam Salloum (YP.com), Xin Luna Dong (Google), Divesh Srivastava (AT&T Research), Vassilis J. Tsotras (UC Riverside)
VLDB 2014 - Hangzhou, China

2 Motivation for Source Ordering
• For online query answering, users want results as soon as possible.
• For some domains, there are hundreds to thousands of relevant data sources (Dalvi et al., VLDB 2012).
• The sources cannot all be queried in parallel because of bandwidth limitations, etc.
• Hence, we must consider the order in which sources are queried.
Example query: "Would like all listings in Pasadena, California?"

3 Source Ordering
Consider this example of 5 sources:
• Each area in the Venn diagram represents the number of answers given by the set of sources that it covers.
• 5 out of 120 possible orderings are shown.
• Orderings are compared by the area-under-the-curve measure.
[Figure: 5 Source Venn Diagram (sources A, B, C, D, E)]
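The area-under-the-curve comparison can be sketched as follows. This is a minimal illustration, not the measure as defined in the talk: it assumes the curve plots the cumulative number of distinct answers against the number of sources probed (the talk may plot against cost or time instead), and the answer sets are made-up toy data rather than the values from the Venn diagram. The names `coverage_curve` and `area_under_curve` are hypothetical.

```python
def coverage_curve(order, answers):
    """Cumulative number of distinct answers after probing each source in `order`."""
    seen = set()
    curve = []
    for name in order:
        seen |= answers[name]      # add this source's answers, ignoring overlap
        curve.append(len(seen))
    return curve


def area_under_curve(curve):
    """Area of the step curve, assuming each probe has unit width (an assumption)."""
    return sum(curve)


# Toy answer sets (hypothetical, not taken from the slide).
answers = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {6},
}

good = coverage_curve(["A", "B", "C"], answers)   # [4, 5, 6]
bad = coverage_curve(["C", "B", "A"], answers)    # [1, 4, 6]
print(area_under_curve(good), area_under_curve(bad))  # 15 11
```

Both orderings end with the same 6 answers, but the first returns them sooner and therefore has the larger area.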

4 Challenges
• Source ordering needs to consider three factors:
  • Coverage: number of answers provided by a source.
  • Overlap: percentage of overlapping answers between sources.
  • Cost: monetary or latency cost incurred when connecting to or retrieving answers from a source.
• Challenges
  • Gathering all coverage and overlap statistics is infeasible.
    • 20 sources => 1 million overlap statistics
    • 30 sources => 1 billion overlap statistics
  • Such statistics are typically stale.
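The "1 million" and "1 billion" figures follow from counting subsets of sources. A short check, assuming one statistic per subset of sources (the function names are hypothetical):

```python
def num_source_subsets(n):
    """Number of subsets of n sources, including the empty set and singletons."""
    return 2 ** n


def num_overlap_statistics(n):
    """Subsets with at least two sources, i.e. proper overlap statistics."""
    return 2 ** n - n - 1


for n in (20, 30):
    print(n, num_source_subsets(n), num_overlap_statistics(n))
```

For 20 sources this gives 1,048,576 subsets (about 1 million) and for 30 sources 1,073,741,824 (about 1 billion), whether or not the single-source coverage statistics are counted.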

5 OASIS: Online Query Answering System
We consider 3 problems:
• Overlap Estimation: given a partial set of overlap statistics, how to estimate the overlap statistics that are not known.
• Source Ordering: how to order sources to maximize the area-under-the-curve.
• Statistics Enrichment: how to select additional 'unknown' statistics to improve the accuracy of Overlap Estimation and, in turn, improve Source Ordering.
[Figure: OASIS architecture. Labels: Planning, Source Ordering, Statistics Enrichment, Source Querying, Statistics Server, Overlap Estimation, Statistics Repository, Data Sources]

6 Basic Overlap Estimation Solution
• Given coverage and a partial set of overlap statistics, formulate constraints:
  Ex. P(A ∩ B) = 0.30
  P(A ∩ B) = ABC'D'E' + ABC'D'E + ABC'DE' + ABC'DE + ABCD'E' + ABCD'E + ABCDE' + ABCDE = 0.30
  Ex. P(A ∩ B ∩ C ∩ D) = 0.03
  P(A ∩ B ∩ C ∩ D) = ABCDE' + ABCDE = 0.03
• Find the MaxEnt solution under the given constraints.
  • Provides the highest likelihood under the given constraints with no additional assumptions.
  • Changes smoothly with the addition of, or a change in, statistics.
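A minimal sketch of the basic formulation, assuming the number of sources is small enough to enumerate all 2^n variables. The solver here is iterative proportional fitting, a standard way to find a maximum-entropy distribution under this kind of linear constraint; it is swapped in for illustration and is not necessarily the solver used in the talk. It also assumes every target lies strictly between 0 and 1 and that the constraints are consistent. The function names and the 3-source statistics are hypothetical (only P(A ∩ B) = 0.30 echoes the slide).

```python
from itertools import product


def maxent_overlaps(n_sources, stats, iterations=500):
    """Estimate the joint distribution over all 2^n membership patterns.

    stats maps a frozenset of source indices to the given probability that
    an answer is in all of those sources. Starts from the uniform
    distribution and rescales it to satisfy one constraint at a time.
    """
    worlds = list(product([0, 1], repeat=n_sources))
    p = {w: 1.0 / len(worlds) for w in worlds}
    for _ in range(iterations):
        for sources, target in stats.items():
            mass = sum(p[w] for w in worlds if all(w[i] for i in sources))
            for w in worlds:
                if all(w[i] for i in sources):
                    p[w] *= target / mass              # patterns inside the overlap
                else:
                    p[w] *= (1 - target) / (1 - mass)  # all other patterns
    return p


def overlap(p, sources):
    """Estimated probability that an answer is in all of the given sources."""
    return sum(v for w, v in p.items() if all(w[i] for i in sources))


# Sources A, B, C are indices 0, 1, 2; P(A ∩ C) and P(B ∩ C) are unknown.
stats = {
    frozenset([0]): 0.5,     # P(A)
    frozenset([1]): 0.4,     # P(B)
    frozenset([2]): 0.3,     # P(C)
    frozenset([0, 1]): 0.3,  # P(A ∩ B)
}
p = maxent_overlaps(3, stats)
print(round(overlap(p, [0, 2]), 4))  # estimated P(A ∩ C)
```

Because no given statistic links C to A or B, the maximum-entropy estimate treats C as independent of them, so the estimated P(A ∩ C) comes out at P(A) · P(C) = 0.15.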

7 Overlap Estimation (Cont.)
• Challenges
  • Formulating the problem exactly requires defining 2^n variables, where n is the number of data sources.
  • Ex. 30 sources = 1 billion variables.
• Observation
  • The number of non-zero variables should not exceed the number of answers, which is usually much smaller than 2^n.

8 Scalable Overlap Estimation Solution
Given statistics:
• Coverage: P(A), P(B), P(C), P(D), P(E)
• Overlap: P(A ∩ B), P(A ∩ D), P(A ∩ B ∩ C ∩ D)
1) Define constraints using a subset of variables with high cardinality.
   V = {AB'C'D'E', A'BC'D'E', A'B'CD'E', A'B'C'DE', A'B'C'D'E, ABC'D'E', AB'C'DE', ABCDE', A'B'C'D'E'}
2) Solve the MaxEnt problem.

9 Scalable Overlap Estimation Solution
3) Include additional variables that are expected to have high cardinality, and remove variables whose value is close to zero.
4) Repeat the procedure until no new variables are added.
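The starting variable set in step 1 can be sketched as below. This is one reading of the example V on slide 8, not a rule stated in the talk: V appears to hold one variable per given statistic (exactly those sources, and no others), plus the variable for answers in no source. The helper names are hypothetical, and the selection heuristics of steps 3 and 4 are not modeled.

```python
def variable_for(positive, sources):
    """Name the variable where exactly the sources in `positive` hold the answer.

    A trailing apostrophe marks a source that does NOT hold the answer.
    """
    return "".join(s if s in positive else s + "'" for s in sources)


def initial_variables(sources, overlap_stats):
    """One variable per coverage statistic, per overlap statistic, plus 'none'."""
    variables = [variable_for({s}, sources) for s in sources]
    variables += [variable_for(set(stat), sources) for stat in overlap_stats]
    variables.append(variable_for(set(), sources))
    return variables


# Statistics from slide 8: coverage of A..E, and overlaps AB, AD, ABCD.
V = initial_variables("ABCDE", ["AB", "AD", "ABCD"])
print(len(V), V)
```

With the slide's statistics this yields 9 variables instead of the 2^5 = 32 needed by the exact formulation.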

10 Source Ordering
• An optimal ordering of sources returns answers as fast as possible, measured by the area-under-the-curve.
• Since finding an optimal ordering is NP-hard, we propose a greedy algorithm that orders sources by the highest ratio of residual coverage to cost.
• We propose two source ordering strategies:
  • STATIC Ordering
  • DYNAMIC Ordering
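The greedy rule can be sketched as follows. This is a simplified illustration that assumes each source's answer set is fully known, which corresponds more closely to the FullKnowledge setting than to the estimated statistics used in the talk. Residual coverage is taken to be the number of answers a source would add beyond those already returned; the function name, toy data, and alphabetical tie-breaking are all assumptions.

```python
def greedy_order(answers, cost):
    """Order sources by highest residual coverage over cost, one pick at a time."""
    remaining = set(answers)
    seen = set()      # answers already returned by earlier sources
    order = []
    while remaining:
        # Highest ratio first; ties are broken alphabetically by source name.
        best = min(remaining, key=lambda s: (-len(answers[s] - seen) / cost[s], s))
        order.append(best)
        seen |= answers[best]
        remaining.remove(best)
    return order


# Toy data (hypothetical): A is the largest source but also the most costly.
answers = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {6, 7}}
cost = {"A": 2.0, "B": 1.0, "C": 1.0}
print(greedy_order(answers, cost))  # ['B', 'C', 'A']
```

B is chosen first (3 answers per unit cost). After that, A contributes only 2 new answers at cost 2, so C (2 new answers at cost 1) is chosen before A.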

11 STATIC Ordering
[Flowchart]
• Solve the MaxEnt problem.
• Select the next source with the highest residual coverage over cost ratio.
• Probe the selected source.
• Iterate until the threshold is reached.

12 DYNAMIC Ordering
[Flowchart]
• Solve the MaxEnt problem.
• Select the next source to probe.
• Probe the selected source.
• Compute additional statistics.
• Iterate until the threshold is reached.

13 Statistics Enrichment
• The Statistics Enrichment component chooses additional 'unknown' statistics with the goal of improving source ordering.
• Incorporating additional statistics into STATIC and DYNAMIC ordering:
  • STATIC+ Ordering
  • DYNAMIC+ Ordering
[Table: STATIC, STATIC+, DYNAMIC, and DYNAMIC+ compared on "Requests Additional Statistics?" and "Adaptable?"]

14 Experimental Evaluation
• Data Set
  • Snapshot of Computer Science book listings from AbeBooks.com
  • 1,028 bookstores (sources)
  • 1,256 unique books / 25,347 book records in total
  • Cost: fixed 356 ms source-connection cost and 0.3 ms per-tuple cost (based on empirical tests)
• Ordering Strategies
  • STATIC / STATIC+
  • DYNAMIC / DYNAMIC+
  • Random: randomly choose an order of the sources
  • Coverage: order the sources in decreasing order of their coverage
  • Baseline: naïve usage of the given coverage and overlap statistics
  • FullKnowledge: greedy algorithm with an accurate and complete set of coverage and overlap statistics
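The cost model above can be written as a one-line function. This sketch assumes the two components simply add up per source, which is the natural reading of the slide but is not spelled out there. The constant and function names are hypothetical.

```python
CONNECTION_COST_MS = 356.0  # fixed source-connection cost, from the slide
PER_TUPLE_COST_MS = 0.3     # cost per retrieved tuple, from the slide


def source_cost_ms(num_tuples):
    """Estimated cost of probing one source, assuming the components are additive."""
    return CONNECTION_COST_MS + PER_TUPLE_COST_MS * num_tuples


print(source_cost_ms(100))
```

A source returning 100 tuples would cost about 386 ms under this model, so the fixed connection cost dominates unless a source returns well over a thousand tuples.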

15 Evaluation of Algorithms
• DYNAMIC yields a larger area-under-the-curve than STATIC, and probes fewer sources to reach 90% coverage.
• DYNAMIC+ and STATIC+ perform better than their DYNAMIC and STATIC counterparts.

16 Conclusions
• The proposed Overlap Estimation method generates good overlap estimates for the purpose of source ordering.
• An adaptive ordering strategy (DYNAMIC ordering) generates a better source ordering than a static ordering strategy.
• Incorporating new statistics (whether accurate, approximate, or stale) can improve source ordering (DYNAMIC+).
• As long as the statistic selection procedure is fast, incorporating new statistics on the fly can improve source ordering.

17 Thank You Questions?

