Online Ordering of Overlapping Data Sources
Mariam Salloum (YP.com), Xin Luna Dong (Google), Divesh Srivastava (AT&T Research), Vassilis J. Tsotras (UC Riverside)

Presentation transcript:

Online Ordering of Overlapping Data Sources
Mariam Salloum (YP.com), Xin Luna Dong (Google), Divesh Srivastava (AT&T Research), Vassilis J. Tsotras (UC Riverside)
VLDB Hangzhou, China

Motivation for Source Ordering
• For online query answering, users want results as soon as possible.
• For some domains, there are hundreds to thousands of relevant data sources (Dalvi et al., VLDB).
• Not all sources can be queried in parallel, because of bandwidth limitations and similar constraints.
• Hence, we must consider the order in which sources are queried.
Example user request: "Would like all listings in Pasadena, California?"

Source Ordering
Consider this example of 5 sources (A, B, C, D, E).
• Each area in the Venn diagram represents the number of answers given by the set of sources that it covers.
• 5 out of 120 possible orderings are shown.
• Orderings are compared by the area-under-the-curve measure.
[Figure: 5-source Venn diagram of sources A, B, C, D, E]
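The area-under-the-curve comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the answer sets are made up, every source is assumed to cost one unit to probe, and the curve is taken to be the cumulative number of distinct answers after each probe. The name `area_under_curve` is hypothetical and does not come from the slides.

```python
from itertools import permutations

def area_under_curve(ordering, answers):
    """Sum of cumulative distinct answers after each probe (unit cost per source)."""
    seen = set()
    area = 0
    for source in ordering:
        seen |= answers[source]   # answers already returned are not counted twice
        area += len(seen)         # height of the step curve after this probe
    return area

# Illustrative answer sets (assumed, not from the slides).
answers = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {6}}

# Brute force over all orderings is only feasible for a handful of sources.
best = max(permutations(answers), key=lambda o: area_under_curve(o, answers))
print(best, area_under_curve(best, answers))
```

An ordering that returns many new answers early has a taller curve from the start, which is why it scores a larger area.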

Challenges
• Source ordering needs to consider three factors:
  ◦ Coverage: number of answers provided by a source.
  ◦ Overlap: percentage of overlapping answers between sources.
  ◦ Cost: monetary or latency cost incurred when connecting to or retrieving answers from a source.
• Challenges
  ◦ Gathering all coverage and overlap statistics is infeasible.
    20 sources => 1 million overlap statistics
    30 sources => 1 billion overlap statistics
  ◦ Such statistics are typically stale.
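The growth in the number of statistics can be checked with a small calculation. The sketch below assumes that an "overlap statistic" exists for every subset of two or more sources, so n sources give 2^n - n - 1 of them; the slides only state the rounded figures, and the function name is hypothetical.

```python
def overlap_statistic_count(n):
    """Number of source subsets of size >= 2 (assumed definition of an overlap statistic)."""
    return 2 ** n - n - 1   # all subsets, minus the n singletons and the empty set

for n in (5, 20, 30):
    print(n, overlap_statistic_count(n))
```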

OASIS: Online Query Answering System
[Figure: OASIS architecture. Components: Source Querying, Statistics Enrichment Planning, Source Ordering, Statistics Server, Overlap Estimation, Statistics Repository, Data Sources]
We consider 3 problems:
• Overlap Estimation: given a partial set of overlap statistics, how to estimate the overlap statistics that are not known.
• Source Ordering: how to order sources to maximize the area-under-the-curve.
• Statistics Enrichment: how to select additional 'unknown' statistics to improve the accuracy of Overlap Estimation and, in turn, improve Source Ordering.

Basic Overlap Estimation Solution
• Given coverage and a partial set of overlap statistics, formulate constraints:
  Ex. P(A ∩ B) = 0.30
      P(A ∩ B) = ABC'D'E' + ABC'D'E + ABC'DE' + ABC'DE + ABCD'E' + ABCD'E + ABCDE' + ABCDE = 0.30
  Ex. P(A ∩ B ∩ C ∩ D) = 0.03
      P(A ∩ B ∩ C ∩ D) = ABCDE' + ABCDE = 0.03
• Find the MaxEnt solution under the given constraints.
  ◦ Provides highest likelihood under given constraints with no additional assumptions.
  ◦ Changes smoothly with addition/change in statistics.
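One way to see what a MaxEnt solution looks like is the small sketch below. It uses iterative proportional fitting, a standard technique for maximum-entropy problems with linear constraints; the slides do not say which solver the authors use, so this is a swapped-in method, and the function names and the numbers in the example are assumptions made for illustration.

```python
from itertools import product

def maxent_ipf(sources, stats, sweeps=200):
    """Maximum-entropy distribution over Venn regions via iterative proportional fitting.

    stats maps a frozenset of sources to the target probability that an
    answer is returned by all sources in that set.
    """
    # One variable per Venn region: the exact set of sources that return an answer.
    regions = [frozenset(s for s, bit in zip(sources, bits) if bit)
               for bits in product([0, 1], repeat=len(sources))]
    p = {r: 1.0 / len(regions) for r in regions}   # uniform start = maximum entropy
    for _ in range(sweeps):
        for subset, target in stats.items():
            inside = sum(v for r, v in p.items() if subset <= r)
            scale_in = target / inside if inside > 0 else 0.0
            scale_out = (1.0 - target) / (1.0 - inside) if inside < 1 else 0.0
            for r in regions:
                # Rescale both sides of the constraint so the total stays 1.
                p[r] *= scale_in if subset <= r else scale_out
    return p

def overlap(p, subset):
    """Estimated probability that an answer is in every source of `subset`."""
    return sum(v for r, v in p.items() if frozenset(subset) <= r)

# Only coverage is known: MaxEnt falls back to independence (0.5 * 0.4).
p = maxent_ipf(["A", "B"], {frozenset("A"): 0.5, frozenset("B"): 0.4})
print(overlap(p, {"A", "B"}))
```

With no overlap statistic given, the estimate for A ∩ B is the product of the two coverages, which matches the "no additional assumptions" property. Adding a known overlap as a further constraint pulls the estimate to that value.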

Overlap Estimation (Cont.)
• Challenges
  ◦ Formulating the problem exactly requires the definition of 2^n variables, where n is the number of data sources.
  ◦ Ex. 30 sources = 1 billion variables.
• Observation
  ◦ The number of non-zero variables should not exceed the number of answers, which is usually much smaller than 2^n.
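The observation can be illustrated with a short sketch: each answer falls into exactly one Venn region (the exact set of sources that return it), so the number of non-empty regions can never exceed the number of answers. The data and the function name below are made up for illustration.

```python
from collections import Counter

def nonzero_variables(answer_sources):
    """Count answers per Venn region; answer_sources maps answer -> set of sources returning it."""
    return Counter(frozenset(sources) for sources in answer_sources.values())

# Illustrative data (assumed): 4 answers over 3 sources, so 2**3 = 8 possible regions.
answer_sources = {
    "book1": {"A", "B"},
    "book2": {"A"},
    "book3": {"A", "B"},
    "book4": {"C"},
}
regions = nonzero_variables(answer_sources)
print(len(regions), "non-zero variables out of", 2 ** 3)
```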

Scalable Overlap Estimation Solution
Given Statistics
  Coverage: P(A), P(B), P(C), P(D), P(E)
  Overlap:  P(A ∩ B), P(A ∩ D), P(A ∩ B ∩ C ∩ D)
V = { AB'C'D'E', A'BC'D'E', A'B'CD'E', A'B'C'DE', A'B'C'D'E, ABC'D'E', AB'C'DE', ABCDE', A'B'C'D'E' }
1) Define constraints using a subset of variables with high cardinality.
2) Solve MaxEnt problem.

Scalable Overlap Estimation Solution (Cont.)
3) Include additional variables that are expected to have high cardinality, and remove variables whose value is close to zero.
4) Repeat procedure until no new variables are added.

Source Ordering
• An optimal ordering of sources returns answers as fast as possible, measured by the area-under-the-curve.
• Since finding an optimal ordering is NP-hard, we propose a greedy algorithm which orders sources by the highest ratio of residual coverage to cost.
• We propose two source ordering strategies:
  ◦ STATIC Ordering
  ◦ DYNAMIC Ordering
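The greedy rule above can be sketched as follows. For simplicity the sketch assumes that the full answer set of every source is known in advance, which is closer to the FullKnowledge setting than to the estimated statistics the system actually works with; the data, the costs, and the name `greedy_order` are illustrative assumptions.

```python
def greedy_order(answers, cost):
    """Repeatedly pick the source with the highest residual-coverage / cost ratio."""
    remaining = set(answers)
    seen = set()
    order = []
    while remaining:
        # Residual coverage = answers of the source that have not been returned yet.
        best = max(sorted(remaining),
                   key=lambda s: len(answers[s] - seen) / cost[s])
        order.append(best)
        seen |= answers[best]
        remaining.remove(best)
    return order

# Illustrative answer sets and costs (assumed).
answers = {"A": {1, 2, 3, 4, 5, 6}, "B": {5, 6, 7, 8}, "C": {9, 10}}
cost = {"A": 3.0, "B": 1.0, "C": 1.0}
print(greedy_order(answers, cost))
```

Source A has the largest coverage, but its cost is high enough that the two cheap sources are probed first.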

STATIC Ordering
1) Solve MaxEnt problem.
2) Select next source with the highest ratio of residual coverage to cost.
3) Probe selected source.
4) Iterate until threshold is reached.

DYNAMIC Ordering
1) Solve MaxEnt problem.
2) Select next source to probe.
3) Probe selected source.
4) Compute additional statistics.
5) Iterate until threshold is reached.
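The adaptive loop can be sketched as below. This is a strong simplification and not the authors' procedure: stale statistics are modelled as estimated answer sets instead of MaxEnt probabilities, and "computing additional statistics" is reduced to scoring the remaining sources against the answers actually retrieved so far. All names and data are hypothetical.

```python
def dynamic_order(estimated, probe, cost, threshold):
    """Select, probe, update, and repeat until `threshold` distinct answers are retrieved.

    estimated: source -> set of answers the (possibly stale) statistics predict.
    probe:     callable returning the answers a source really provides.
    """
    remaining = set(estimated)
    seen = set()          # answers actually retrieved so far
    order = []
    while remaining and len(seen) < threshold:
        # Score each source by predicted new answers relative to what was really seen.
        best = max(sorted(remaining),
                   key=lambda s: len(estimated[s] - seen) / cost[s])
        seen |= probe(best)       # the probe result replaces the stale estimate
        order.append(best)
        remaining.remove(best)
    return order, seen

# Illustrative data (assumed): the estimate for A is stale, A no longer returns 3 and 4.
estimated = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {1, 2, 6}}
actual = {"A": {1, 2}, "B": {3, 4, 5}, "C": {1, 2, 6}}
cost = {"A": 1.0, "B": 1.0, "C": 1.0}
order, seen = dynamic_order(estimated, actual.get, cost, threshold=5)
print(order, sorted(seen))
```

A plan built from the estimates alone would expect B to add only one new answer after A. Scoring against the retrieved answers shows that B still adds three, so it is probed next.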

Statistics Enrichment
• The Statistics Enrichment component chooses additional 'unknown' statistics with the goal of improving source ordering.
• Incorporating additional statistics into Static and Dynamic ordering:
  ◦ STATIC+ Ordering
  ◦ DYNAMIC+ Ordering

Strategy    Requests Additional Statistics?    Adaptable?
STATIC      No                                 No
STATIC+     Yes                                No
DYNAMIC     No                                 Yes
DYNAMIC+    Yes                                Yes

Experimental Evaluation
• Data Set
  ◦ Snapshot of Computer Science book listings from AbeBooks.com
  ◦ 1,028 bookstores (sources)
  ◦ 1,256 unique books / 25,347 book records in total
  ◦ Cost: fixed 356 ms source-connection cost and 0.3 ms per-tuple cost (based on empirical tests)
• Ordering Strategies
  ◦ STATIC / STATIC+
  ◦ DYNAMIC / DYNAMIC+
  ◦ Random: randomly choose an order of the sources
  ◦ Coverage: order the sources in decreasing order of their coverage
  ◦ Baseline: naïve usage of given coverage and overlap statistics
  ◦ FullKnowledge: greedy algorithm with an accurate and complete set of coverage and overlap statistics
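The cost model used in the experiments can be written as a one-line function. The two constants come from the slide; the function name and the assumption that the per-tuple cost scales linearly with the number of records retrieved are illustrative.

```python
CONNECTION_COST_MS = 356.0   # fixed source-connection cost from the slide
PER_TUPLE_COST_MS = 0.3      # per-tuple retrieval cost from the slide

def source_cost_ms(num_tuples):
    """Estimated latency of probing one source that returns `num_tuples` records."""
    return CONNECTION_COST_MS + PER_TUPLE_COST_MS * num_tuples

# Average source in the data set: 25,347 records spread over 1,028 bookstores.
print(source_cost_ms(25347 / 1028))
```

Under this model the connection cost dominates for a source of average size, so the number of sources probed matters more than the number of records each one returns.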

Evaluation of Algorithms
• DYNAMIC yields a larger area-under-the-curve, and probes fewer sources to get 90% coverage, than STATIC.
• DYNAMIC+ / STATIC+ perform better than their DYNAMIC / STATIC counterparts.

Conclusions
• The proposed Overlap Estimation method generates good overlap estimates for the purpose of source ordering.
• An adaptive ordering strategy (DYNAMIC ordering) generates a better source ordering than a static ordering strategy.
• Incorporating new statistics (whether accurate, approximate, or stale) can improve source ordering (DYNAMIC+).
• As long as the statistic selection procedure is fast, incorporating new statistics on the fly can improve source ordering.

Thank You Questions?