1
Optimization of Multi-Domain Queries on the Web
Daniele Braga, Stefano Ceri, Florian Daniel, Davide Martinenghi
Dipartimento di Elettronica e Informazione – Politecnico di Milano
VLDB 2008
Presented by Babar Tareen, IDS Lab., Seoul National University, 2009. 02. 19.
Based on the conference presentation
Center for E-Business Technology, Seoul National University, Seoul, Korea
2
Copyright 2008 by CEBT
Multi-Domain Queries
Queries that can be answered by combining knowledge from two or more domains
Examples
– Where can I attend an interesting database workshop close to a sunny beach?
– Who are the strongest experts on service computing, based upon their recent publication record and accepted European projects?
– Can I spend an April weekend in a city that is served by a low-cost direct flight from Milano and offers a Mahler symphony?
3
Intro
General-purpose search engines (e.g. Yahoo, Google)
– Very large search space, yet
– Not able to index deep Web data
Domain-specific search engines (e.g. an airline's flight search form, Amazon's book search facility)
– Typically of high quality, but
– Limited to restricted domains
We lack the ability to answer multi-domain queries
4
Scenario: a multi-domain query
Reference query:
– "Find all database conferences in the next six months in locations where the average temperature is at least 28°C and for which a cheap travel solution including a luxury accommodation exists."
Answering this query requires:
– Finding interesting conferences in the desired timeframe via online services by the scientific community;
– Understanding whether the conference location is served by low-cost flights;
– Finding luxury hotels close to the conference location with available rooms; and
– Checking the expected average temperature of the location
In general: "Given a query over a set of services, find the query plan that minimizes the expected execution cost according to a given metric in order to obtain the best k answers."
5
Overall Picture
[Figure: overall picture]
6
Preliminaries – (1)
Characteristics of information sources (services)
– Search services: return answers in ranking order
– Exact services: return indistinguishable tuples (no ranking)
Services have access patterns
– Combinations of input and output parameters corresponding to different ways of invocation
7
Preliminaries – (2)
Characteristics of information sources (services)
Expected result size per invocation (ERSPI):
– proliferative services (ERSPI > 1)
– selective services (0 ≤ ERSPI ≤ 1)
Chunking/paging of result sets: bulk vs. chunked services
Joins
– Can be considered system services
– ERSPI of a join: the product of the ERSPI values of the joined services multiplied by the selectivity of the join condition
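The ERSPI rule for joins can be illustrated with a few lines of Python. This is only a sketch: the function names `join_erspi` and `classify` and all numeric values are hypothetical, chosen for the example and not taken from the presentation.

```python
import math


def join_erspi(service_erspis, join_selectivity):
    """ERSPI of a join: product of the services' ERSPIs times the join selectivity."""
    return math.prod(service_erspis) * join_selectivity


def classify(erspi):
    """Proliferative if ERSPI > 1, selective if 0 <= ERSPI <= 1."""
    if erspi < 0:
        raise ValueError("ERSPI cannot be negative")
    return "proliferative" if erspi > 1 else "selective"


# Hypothetical values, chosen only to keep the arithmetic easy to follow:
# two services with ERSPI 4 and 2, joined with selectivity 0.25.
joined = join_erspi([4, 2], 0.25)
print(joined, classify(joined))
```

A join of two proliferative services can itself be selective when the join condition filters strongly enough, which is why the selectivity factor matters.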
8
Preliminaries – (3)
Query plan: indicates the invocations of services and their conjunctive composition through joins
Represented as directed acyclic graphs (DAGs)
– Nodes = atoms in the conjunctive query (service, join)
– Arcs = precedence constraints + data flows
– Joins: join strategy + number of fetches per service
[Figure: directed acyclic graph]
9
Preliminaries – (4)
Cost metrics: associate a cost to a plan
– Sum cost metric = sum of the costs of each operator
– Execution time metric = expected time from query input to result output
– Request-response cost metric = special case of the sum cost metric where each invocation has a cost of 1
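A minimal sketch of the three metrics on a toy plan follows. The `Node` fields, the service names, and all numbers are hypothetical. The execution time is approximated here as the longest path through the plan, assuming that calls to the same service run one after another and that a node starts only when all its predecessors have finished; the presentation does not spell out this formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    calls: int            # expected number of invocations (hypothetical)
    cost_per_call: float  # cost of one invocation (hypothetical)
    time_per_call: float  # response time of one invocation (hypothetical)


nodes = {
    "conference": Node(calls=1, cost_per_call=2.0, time_per_call=1.0),
    "weather": Node(calls=10, cost_per_call=0.5, time_per_call=0.5),
    "flight": Node(calls=4, cost_per_call=1.0, time_per_call=2.0),
    "hotel": Node(calls=4, cost_per_call=1.0, time_per_call=1.5),
}
# Predecessors of each node: flight and hotel both follow weather.
preds = {
    "conference": [],
    "weather": ["conference"],
    "flight": ["weather"],
    "hotel": ["weather"],
}


def sum_cost(nodes):
    """Sum cost metric: add up the cost of every invocation."""
    return sum(n.calls * n.cost_per_call for n in nodes.values())


def request_response_cost(nodes):
    """Special case of the sum cost metric: every invocation costs 1."""
    return sum(n.calls for n in nodes.values())


def execution_time(nodes, preds):
    """Assumed approximation: finish time of the slowest path in the DAG."""
    finish = {}

    def finish_time(name):
        if name not in finish:
            start = max((finish_time(p) for p in preds[name]), default=0.0)
            node = nodes[name]
            finish[name] = start + node.calls * node.time_per_call
        return finish[name]

    return max(finish_time(name) for name in nodes)


print(sum_cost(nodes), request_response_cost(nodes), execution_time(nodes, preds))
```

With these numbers the flight branch, not the hotel branch, determines the execution time, while both branches contribute to the sum cost.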
10
Optimization Approach
Exploring a highly combinatorial solution space
– 1st phase: selection of a query rewriting such that every service is called with one of its available access patterns
– 2nd phase: selection of the query plan
– 3rd phase: assignment of the exact number of fetches to be performed over chunked services
11
Services, access patterns, queries
Web services and access patterns: [figure]
The example query (in Datalog-like syntax): [figure]
Services with alternative access patterns
12
Query plans
Representation as DAGs
– Placing a node = invoking the respective service/join
– Two nodes connected by an arc = sequential execution
– Two nodes without connection = parallel execution
Graphical notation (note the parallel vs. pipe join): [figure]
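The sequential/parallel reading of a plan DAG can be sketched with Python's standard `graphlib` module (Python 3.9+). The plan below is a hypothetical arrangement of the services from the reference query, not the plan shown on the slide: nodes that become ready at the same time have no arc between them and could run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each node maps to the set of nodes that must run before it.
plan = {
    "conference": set(),
    "weather": {"conference"},
    "flight": {"weather"},
    "hotel": {"weather"},
    "join": {"flight", "hotel"},
}


def execution_stages(plan):
    """Group nodes into stages; nodes in one stage have no arcs between them."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not acyclic
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


for stage in execution_stages(plan):
    print(stage)
```

Here `flight` and `hotel` land in the same stage, which corresponds to a parallel join, while `conference` and `weather` are executed in sequence.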
13
Join strategies for parallel joins
– Nested loop: one service "dominates" the other
– Merge-scan: no a-priori distinction of services
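One way to picture the difference is the order in which pairs of result chunks from the two services are combined. The sketch below is an interpretation, not the authors' definition: it assumes that in a nested loop each chunk of the dominating service is paired with every chunk of the other service before moving on, and that merge-scan advances over both services evenly, diagonal by diagonal. The function names are hypothetical.

```python
def nested_loop(n_dominant, n_other):
    """Assumed order: each chunk of the dominating service meets all chunks of the other."""
    for i in range(n_dominant):
        for j in range(n_other):
            yield (i, j)


def merge_scan(n_x, n_y):
    """Assumed order: visit chunk pairs diagonal by diagonal, favouring neither service."""
    for diagonal in range(n_x + n_y - 1):
        first = max(0, diagonal - n_y + 1)
        last = min(n_x, diagonal + 1)
        for i in range(first, last):
            yield (i, diagonal - i)


# Pairs are (chunk index of the first service, chunk index of the second service).
print(list(nested_loop(2, 3)))
print(list(merge_scan(2, 3)))
```

Both orders cover the same chunk pairs; they differ in which pairs are seen early, which matters when only the best k answers are needed.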
14
Annotated query plans
To estimate the number of tuples in output, we further need to know:
– The number of tuples in output of each service
– The number of fetches for each chunked service
– The join strategy for each parallel join
The final annotation is the output of the optimization
15
Instrumented branch and bound
Possible service combinations: [figure]
– Not feasible: City would need to be an input parameter to the query!
– α1 has more input fields than α2
Access pattern selection
– Heuristic: "Bound is better" = the more input fields in the access pattern, the better
Query plan selection
– Heuristic: "Selective and parallel are better" = selective services in series (with increasing ERSPI) and proliferative services in parallel
Chunked service selection
– Heuristic: "Greedy and square are better" = either we increment the number of fetches to chunked services individually (greedy) or together (square)
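The three heuristics can be sketched as small Python functions. Everything here is illustrative: the function names, the dictionary layout, the service names, and the numbers are hypothetical. The formula for the expected number of output tuples (join selectivity times the product of fetches and chunk sizes) and the greedy choice of the cheapest relative increment are assumptions made for this sketch, not the procedure from the presentation.

```python
import math


def pick_access_pattern(patterns, bound_fields):
    """'Bound is better': among feasible patterns, prefer the one with most inputs."""
    feasible = [p for p in patterns if p["inputs"] <= bound_fields]
    return max(feasible, key=lambda p: len(p["inputs"]), default=None)


def plan_shape(erspi):
    """'Selective and parallel are better': selective services in series by
    increasing ERSPI, proliferative services side by side."""
    in_series = sorted((s for s in erspi if erspi[s] <= 1), key=erspi.get)
    in_parallel = sorted(s for s in erspi if erspi[s] > 1)
    return in_series, in_parallel


def expected_tuples(fetches, chunk_size, selectivity):
    # Assumed estimate of the join output size.
    return selectivity * math.prod(fetches[s] * chunk_size[s] for s in fetches)


def square_fetches(chunk_size, selectivity, k):
    """'Square': increment the fetches of all chunked services together."""
    if selectivity <= 0:
        raise ValueError("selectivity must be positive")
    fetches = dict.fromkeys(chunk_size, 1)
    while expected_tuples(fetches, chunk_size, selectivity) < k:
        fetches = {s: f + 1 for s, f in fetches.items()}
    return fetches


def greedy_fetches(chunk_size, selectivity, k, fetch_cost):
    """'Greedy': increment one service at a time (assumed rule: the service
    whose next fetch is cheapest relative to the growth it brings)."""
    if selectivity <= 0:
        raise ValueError("selectivity must be positive")
    fetches = dict.fromkeys(chunk_size, 1)
    while expected_tuples(fetches, chunk_size, selectivity) < k:
        best = min(fetches, key=lambda s: (fetches[s] * fetch_cost[s], s))
        fetches[best] += 1
    return fetches


# Hypothetical access patterns for one service.
patterns = [
    {"name": "a1", "inputs": frozenset({"city", "date"})},
    {"name": "a2", "inputs": frozenset({"city"})},
]
print(pick_access_pattern(patterns, {"city", "date"})["name"])
print(plan_shape({"weather": 0.3, "hotel": 0.9, "conference": 20, "flight": 5}))

chunk_size = {"flight": 10, "hotel": 10}
print(square_fetches(chunk_size, 0.25, 150))
print(greedy_fetches(chunk_size, 0.25, 150, {"flight": 2.0, "hotel": 1.0}))
```

In this toy setting the greedy variant reaches the target of 150 expected tuples with five fetches in total, while the square variant needs six; with other costs or chunk sizes the comparison can go the other way.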
16
Final annotation of query plan
[Figure: annotated query plan, showing the execution time cost metric, the service characterization, and the fetching factors]
17
Query execution
Execution environment
– Service registration: signature, patterns, ERSPI, response times, chunk sizes, indication of join strategy, ...
– Service orchestration: query execution
– Multi-threading: to leverage parallelism
Logical caching (speed + elimination of duplicates)
– No cache = each call individually repeated
– One-call cache = caching of the last call to each service
– Optimal cache = all calls to all services are cached
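The effect of the three cache settings on the number of remote calls can be sketched as follows. The function `count_calls`, the mode names, and the request sequence are hypothetical; each request is modelled as a (service, input) pair.

```python
def count_calls(requests, mode):
    """Count the remote calls actually issued under a given cache setting."""
    if mode not in ("none", "one-call", "optimal"):
        raise ValueError(f"unknown cache mode: {mode}")
    last_input = {}  # one-call cache: last input seen per service
    seen = set()     # optimal cache: every (service, input) seen so far
    calls = 0
    for service, value in requests:
        if mode == "one-call" and service in last_input and last_input[service] == value:
            continue  # same as the previous call to this service
        if mode == "optimal" and (service, value) in seen:
            continue  # this exact call was already answered once
        calls += 1
        last_input[service] = value
        seen.add((service, value))
    return calls


# Hypothetical request sequence produced while executing a plan.
requests = [
    ("weather", "Rome"),
    ("weather", "Rome"),
    ("weather", "Paris"),
    ("weather", "Rome"),
    ("hotel", "Rome"),
]
for mode in ("none", "one-call", "optimal"):
    print(mode, count_calls(requests, mode))
```

The one-call cache only helps when identical inputs arrive back to back, whereas the optimal cache also removes repeats that are separated by other calls.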
18
Number of calls under varying cache settings
[Figure: number of calls under varying cache settings]
19
Results of the optimal plan
[Figure: screenshot of the prototype query engine]
20
Conclusion
In this work, we have
– defined a formal model for the optimization of multi-domain queries over web services (conjunctive queries)
– defined query plans similar to relational physical access plans
– derived an optimization technique based on a classical branch and bound technique
– given experimental evidence that the proposed model fits real-world settings (existing web services and wrapped ones)
Next
– Generic query engine + declarative representation of query plans
– User interface for the mashup of services/queries
21
Discussion
Very simple experimental setup
No details about the semi-automatically generated wrappers
How to decide which service to select for a specific domain?
How to map input and output parameters between different services?
If we have to pre-program the system for new domains, it is like developing a special-purpose application
How effective is the system for answering multi-domain queries?