
Center for E-Business Technology Seoul National University Seoul, Korea Optimization of Multi-Domain Queries on the Web Daniele Braga, Stefano Ceri, Florian.


1 Center for E-Business Technology, Seoul National University, Seoul, Korea
Optimization of Multi-Domain Queries on the Web
Daniele Braga, Stefano Ceri, Florian Daniel, Davide Martinenghi
Dipartimento di Elettronica e Informazione – Politecnico di Milano
VLDB 2008
2009. 02. 19.
Presented by Babar Tareen, IDS Lab., Seoul National University
Based on Conference Presentation

2 Multi-Domain Queries
Copyright © 2008 by CEBT
• Queries that can be answered by combining knowledge from two or more domains
• Examples
  – Where can I attend an interesting database workshop close to a sunny beach?
  – Who are the strongest experts on service computing based upon their recent publication record and accepted European projects?
  – Can I spend an April weekend in a city served by a low-cost direct flight from Milano offering a Mahler's symphony?

3 Intro
• General-purpose search engines (e.g. Yahoo, Google)
  – Very large search space, yet
  – Not able to index deep Web data
• Domain-specific search engines (e.g. an airline’s flight search form, Amazon’s book search facility)
  – Typically of high quality, but
  – Limited to restricted domains
• We lack the ability to answer multi-domain queries

4 Scenario: a multi-domain query
• In general: “Given a query over a set of services, find the query plan that minimizes the expected execution cost according to a given metric in order to obtain the best k answers.”
• Reference query:
  – “Find all database conferences in the next six months in locations where the average temperature is at least 28°C and for which a cheap travel solution including a luxury accommodation exists.”
• Answering this query requires:
  – Finding interesting conferences in the desired timeframe via online services by the scientific community;
  – Understanding whether the conference location is served by low-cost flights;
  – Finding luxury hotels close to the conference location with available rooms; and
  – Checking the expected average temperature of the location

5 Overall Picture

6 Preliminaries – (1)
• Characteristics of information sources (services)
  – Search services: return answers in ranking order
  – Exact services: indistinguishable tuples (no ranking)
  – Services have access patterns
    – Combinations of input and output parameters corresponding to different ways of invocation

7 Preliminaries – (2)
• Characteristics of information sources (services)
  – Expected result size per invocation (ERSPI):
    – proliferative services (ERSPI > 1)
    – selective services (0 ≤ ERSPI ≤ 1)
  – Chunking/paging of result sets: bulk vs. chunked services
• Joins
  – Can be considered system services
  – ERSPI: depends on the selectivity of the join condition and the ERSPIs of the services
    – Product of the ERSPI values of the services multiplied by the selectivity of the join condition
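The join rule on this slide can be illustrated with a small calculation. The following Python sketch is not part of the presentation; the function names `join_erspi` and `classify` and all numeric values are hypothetical and only restate the definitions given above.

```python
from math import prod


def join_erspi(service_erspis, selectivity):
    """ERSPI of a join: product of the services' ERSPIs times the join selectivity."""
    return prod(service_erspis) * selectivity


def classify(erspi):
    """Proliferative if ERSPI > 1, selective if 0 <= ERSPI <= 1."""
    if erspi < 0:
        raise ValueError("ERSPI cannot be negative")
    return "proliferative" if erspi > 1 else "selective"


# Hypothetical numbers: services returning 20 and 5 tuples per call on average,
# joined under a condition that keeps half of the candidate pairs.
combined = join_erspi([20, 5], 0.5)
print(combined, classify(combined))
```

Under these assumed numbers the join itself behaves like a proliferative service, which matters later when deciding where to place it in a plan.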

8 Preliminaries – (3)
• Query plan: indicates the invocations of services and their conjunctive composition through joins
  – Represented as directed acyclic graphs (DAGs)
  – Nodes = atoms in the conjunctive query (service, join)
  – Arcs = precedence constraints + data flows
  – Joins: join strategy + number of fetches per service
[Figure: Directed Acyclic Graph]

9 Preliminaries – (4)
• Cost metrics: associate a cost to a plan
  – Sum cost metric = sum of the costs of each operator
  – Execution time metric = expected time from query input to result output
  – Request-response cost metric = special case of the sum cost metric where each invocation has a cost of 1
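The three metrics can be compared on a toy plan. This Python sketch is an illustration under stated assumptions, not the authors' cost model: the plan layout (a dict of operators with `calls`, `cost_per_call`, and `time_per_call`), the service names, and the numbers are all hypothetical, and the execution time is modeled as the longest path through the plan with parallel branches overlapping.

```python
def sum_cost(plan):
    """Sum cost metric: add up the cost of every operator in the plan."""
    return sum(op["calls"] * op["cost_per_call"] for op in plan.values())


def request_response_cost(plan):
    """Special case of the sum cost metric: every invocation costs 1."""
    return sum(op["calls"] for op in plan.values())


def execution_time(plan, arcs):
    """Assumed model: an operator starts when all its predecessors have finished."""
    preds = {name: [] for name in plan}
    for src, dst in arcs:
        preds[dst].append(src)
    finish = {}

    def finish_time(name):
        if name not in finish:
            start = max((finish_time(p) for p in preds[name]), default=0.0)
            finish[name] = start + plan[name]["calls"] * plan[name]["time_per_call"]
        return finish[name]

    return max(finish_time(name) for name in plan)


# Hypothetical plan: one conference lookup feeding two parallel branches.
plan = {
    "conference": {"calls": 1, "cost_per_call": 2.0, "time_per_call": 1.0},
    "flight": {"calls": 3, "cost_per_call": 1.0, "time_per_call": 2.0},
    "hotel": {"calls": 2, "cost_per_call": 0.5, "time_per_call": 4.0},
}
arcs = [("conference", "flight"), ("conference", "hotel")]
print(sum_cost(plan), request_response_cost(plan), execution_time(plan, arcs))
```

The same plan can rank differently under each metric, which is why the metric is an explicit input to the optimization problem.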

10 Optimization Approach
• Exploring a highly combinatorial solution space
  – 1st phase: selection of a query rewriting such that every service is called with one of the available access patterns
  – 2nd phase: selection of a query plan
  – 3rd phase: assignment of the exact number of fetches to be performed over chunked services

11 Services, access patterns, queries
• Web services and access patterns:
• The example query (in Datalog-like syntax):
• Services with alternative access patterns

12 Query plans
• Representation as DAGs
  – Placing a node = invoking the respective service/join
  – Two nodes connected by an arc = sequential execution
  – Two nodes without connection = parallel execution
• Graphical notation (note the parallel vs. pipe join):
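The sequential/parallel reading of a plan can be made concrete with a short Python sketch. It is a simplified illustration, not the prototype's code: the helper `execution_stages`, the service names, and the arcs are hypothetical. Nodes placed in the same stage have no arc between them and could therefore run in parallel.

```python
from collections import defaultdict


def execution_stages(nodes, arcs):
    """Group plan nodes into stages; a node is ready once all its predecessors ran."""
    preds = defaultdict(set)
    for src, dst in arcs:
        preds[dst].add(src)
    done, stages = set(), []
    remaining = list(nodes)
    while remaining:
        ready = [n for n in remaining if preds[n] <= done]
        if not ready:
            raise ValueError("plan contains a cycle, so it is not a DAG")
        stages.append(ready)
        done.update(ready)
        remaining = [n for n in remaining if n not in done]
    return stages


# Hypothetical plan: conference and weather in series (pipe),
# flight and hotel in parallel, then a join of the two branches.
nodes = ["conference", "weather", "flight", "hotel", "join"]
arcs = [
    ("conference", "weather"),
    ("weather", "flight"),
    ("weather", "hotel"),
    ("flight", "join"),
    ("hotel", "join"),
]
for stage in execution_stages(nodes, arcs):
    print(stage)
```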

13 Join strategies for parallel joins
• Nested loop: one service “dominates” the other
• Merge-scan: no a-priori distinction of services
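The transcript does not include the slide's figures, so the following Python sketch shows only one plausible reading of the two strategies, expressed as the order in which pairs of result chunks are combined. The function names and the exact traversal orders are assumptions: the nested loop exhausts the second service's chunks for each chunk of the dominating service, while the merge-scan advances over both services evenly, diagonal by diagonal.

```python
def nested_loop_order(outer_chunks, inner_chunks):
    """Assumed nested loop: the outer (dominating) service drives the iteration."""
    return [(i, j) for i in range(outer_chunks) for j in range(inner_chunks)]


def merge_scan_order(chunks_a, chunks_b):
    """Assumed merge-scan: visit chunk pairs by increasing i + j, favoring neither service."""
    order = []
    for diagonal in range(chunks_a + chunks_b - 1):
        for i in range(diagonal + 1):
            j = diagonal - i
            if i < chunks_a and j < chunks_b:
                order.append((i, j))
    return order


# Two chunks from the first service, three from the second.
print(nested_loop_order(2, 3))
print(merge_scan_order(2, 3))
```

Both orders cover the same chunk pairs; they differ in which pairs are seen early, which matters when only the best k answers are needed.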

14 Annotated query plans
• In order to estimate the number of tuples in output, we further need to know:
  – The number of tuples in output of each service
  – The number of fetches for each chunked service
  – The join strategy for each parallel join
• The final annotation is the output of the optimization

15 Instrumented branch and bound
Possible service combinations:
Not feasible: City would need to be an input parameter to the query!
α1 has more input fields than α2
• Access pattern selection
  – Heuristic: “Bound is better” = the more input fields in the access pattern, the better
• Query plan selection
  – Heuristic: “Selective and parallel are better” = selective services in series (with increasing ERSPI) and proliferative services in parallel
• Chunked service selection
  – Heuristic: “Greedy and square are better” = either we increment the number of fetches to chunked services individually (greedy) or together (square)
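The three heuristics can be sketched in a few lines of Python. This is not the branch-and-bound search itself, only a hedged illustration of the ordering rules it is instrumented with; the function names, the dictionary layout, the service names, and the ERSPI values are all hypothetical.

```python
def order_access_patterns(patterns):
    """'Bound is better': try access patterns with more input fields first."""
    return sorted(patterns, key=lambda p: len(p["inputs"]), reverse=True)


def arrange_services(services):
    """'Selective and parallel are better': selective services in series by
    increasing ERSPI, proliferative services placed in parallel."""
    selective = sorted((s for s in services if s["erspi"] <= 1), key=lambda s: s["erspi"])
    proliferative = [s for s in services if s["erspi"] > 1]
    return [s["name"] for s in selective], [s["name"] for s in proliferative]


def increment_fetches(fetches, strategy, pick=None):
    """'Greedy' adds one fetch to a single chunked service; 'square' adds one to all."""
    if strategy == "square":
        return {name: n + 1 for name, n in fetches.items()}
    if strategy == "greedy":
        return {name: n + (1 if name == pick else 0) for name, n in fetches.items()}
    raise ValueError("strategy must be 'greedy' or 'square'")


# Hypothetical inputs for each heuristic.
patterns = [
    {"name": "alpha2", "inputs": ["city"]},
    {"name": "alpha1", "inputs": ["city", "date"]},
]
services = [
    {"name": "conference", "erspi": 20},
    {"name": "flight", "erspi": 0.5},
    {"name": "weather", "erspi": 0.05},
    {"name": "hotel", "erspi": 5},
]
fetches = {"flight": 1, "hotel": 1}

print([p["name"] for p in order_access_patterns(patterns)])
print(arrange_services(services))
print(increment_fetches(fetches, "greedy", pick="hotel"))
print(increment_fetches(fetches, "square"))
```

In a full optimizer these orderings would decide which branch is explored first, so that a good plan is found early and can serve as a bound for pruning the rest.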

16 Final annotation of query plan
• Execution time cost metric:
• Service characterization:
• Fetching factors:
• Annotated query plan

17 Query execution
• Execution environment
  – Service registration: signature, patterns, ERSPI, response times, chunk sizes, indication of join strategy, ...
  – Service orchestration: query execution
  – Multi-threading: to leverage parallelism
• Logical caching (speed + elimination of duplicates)
  – No cache = each call individually repeated
  – One-call cache = caching of the last call to each service
  – Optimal cache = all calls to all services are cached
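The effect of the three cache settings on the number of remote calls can be simulated with a short Python sketch. It is a simplified model, not the prototype's implementation: the function `count_remote_calls`, the mode names, and the request sequence are hypothetical.

```python
def count_remote_calls(requests, cache_mode):
    """Count how many requests actually reach a service under a given cache setting.

    requests is a list of (service, input) pairs in invocation order.
    """
    calls = 0
    last_input = {}  # one-call cache: last input seen per service
    seen = set()     # optimal cache: every (service, input) pair seen so far
    for service, arg in requests:
        if cache_mode == "none":
            calls += 1
        elif cache_mode == "one-call":
            if service not in last_input or last_input[service] != arg:
                calls += 1
                last_input[service] = arg
        elif cache_mode == "optimal":
            if (service, arg) not in seen:
                calls += 1
                seen.add((service, arg))
        else:
            raise ValueError("unknown cache mode: " + cache_mode)
    return calls


# Hypothetical sequence with repeated inputs to the same service.
requests = [
    ("weather", "Milano"),
    ("weather", "Milano"),
    ("weather", "Roma"),
    ("weather", "Milano"),
    ("hotel", "Roma"),
]
for mode in ("none", "one-call", "optimal"):
    print(mode, count_remote_calls(requests, mode))
```

The one-call cache only helps when identical inputs arrive back to back, whereas the optimal cache also absorbs repeats separated by other calls.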

18 # of calls under varying cache settings

19 Results of the optimal plan
• Screenshot of the prototype query engine

20 Conclusion
• In this work, we have
  – defined a formal model for the optimization of multi-domain queries over web services (conjunctive queries)
  – defined query plans similar to relational physical access plans
  – derived an optimization technique based on a classical branch and bound technique
  – given experimental evidence that the proposed model fits real-world settings (existing web services and wrapped ones)
• Next
  – Generic query engine + declarative rep. of query plans
  – User interface for the mashup of services/queries

21 Discussion
• Very simple experimental setup
• No details about semi-automatically generated wrappers
• How to decide which service to select for a specific domain?
• How to map input/output parameters between different services?
• If we have to pre-program the system for new domains, it is like developing a special-purpose application
• How effective is the system for answering multi-domain queries?

