Adaptive Query Processing: Progress and Challenges Alon Halevy University of Washington [Gore] Joint work with Zack Ives, Dan Weld (later: Nimble Technology)

1 Adaptive Query Processing: Progress and Challenges Alon Halevy University of Washington [Gore] Joint work with Zack Ives, Dan Weld (later: Nimble Technology)

2 Data Integration Systems Uniform query capability across autonomous, heterogeneous data sources on LAN, WAN, or Internet: in enterprises, WWW, big science.

3 Recent Trends in Data Integration Research Issues such as: architectures, query reformulation, wrapper construction are reasonably well understood (but still good work going on). Query execution and optimization raise significant challenges. Problems for traditional query processing model: –Few statistics (autonomous sources) –Unanticipated delays and failures (network-bound sources). Conclusion (ours): cannot afford to separate optimization from execution. Need to be adaptive. See IEEE Data Engineering Bulletin, June, 2000.

4 Outline Tukwila (version 1): –Interleaving optimization and execution at the core. The unsolved problem: when to switch? The complicating new challenges: –XML, want first tuples fast. Tukwila (version 2): – completely pipelined XML query processing. Some experiences from Nimble

5 Tukwila: Version 1 Key idea: build adaptive features into the core of the system. Interleave planning and execution (replan when you know more about your data) –Rule-based mechanism for changing behavior. Adaptive query operators: –Revival of the double-pipelined join. –Collectors (a.k.a. “smart union”). See details in SIGMOD-99.

6 Tukwila Data Integration System Novel components: –Event handler –Optimization-execution loop

7 Handling Execution Events Adaptive execution via event-condition-action rules During execution, events generated Timeout, n tuples read, operator opens/closes, memory overflows, execution fragment completes, … Events trigger rules: –Test conditions Memory free, tuples read, operator state, operator active, … –Execution actions Re-optimize, reduce memory, activate/deactivate operator, …
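
The event-condition-action mechanism on this slide can be sketched in a few lines. This is a minimal illustration, not Tukwila's actual rule engine: the names `Rule`, `EventHandler`, `raise_event`, and the state keys `memory_free_mb` and `join_memory_mb` are all hypothetical, chosen only to mirror the slide's vocabulary (events, conditions, actions).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Rule:
    event: str                         # event name that triggers the rule
    condition: Callable[[dict], bool]  # test over the (hypothetical) engine state
    action: Callable[[dict], None]     # execution action applied to that state


@dataclass
class EventHandler:
    state: dict = field(default_factory=dict)
    rules: Dict[str, List[Rule]] = field(default_factory=dict)

    def register(self, rule: Rule) -> None:
        self.rules.setdefault(rule.event, []).append(rule)

    def raise_event(self, event: str) -> int:
        """Run every rule registered for `event` whose condition holds.

        Returns the number of rules that fired.
        """
        fired = 0
        for rule in self.rules.get(event, []):
            if rule.condition(self.state):
                rule.action(self.state)
                fired += 1
        return fired


# Example rule: on memory overflow, if little memory is free, halve the join's budget.
handler = EventHandler(state={"memory_free_mb": 8, "join_memory_mb": 64})
handler.register(Rule(
    event="memory_overflow",
    condition=lambda s: s["memory_free_mb"] < 16,
    action=lambda s: s.update(join_memory_mb=s["join_memory_mb"] // 2),
))
handler.raise_event("memory_overflow")
```

Keeping conditions and actions as plain callables over shared state is one simple way to let an optimizer emit rules alongside a plan; a real system would need rule priorities and conflict handling, which this sketch omits.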

8 Interleaving Planning and Execution Re-optimize if at unexpected state: –Evaluate at key points, re-optimize unexecuted portion of plan [Kabra/DeWitt SIGMOD98] –Plan has pipelined units, fragments –Send back statistics to optimizer. –Maintain optimizer state for later reuse. WHEN end_of_fragment(0) IF card(result) > 100,000 THEN re-optimize
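
The rule at the end of this slide (re-optimize when a fragment's result cardinality exceeds 100,000) suggests a simple control loop. The sketch below is an assumption-laden illustration: `run_with_reoptimization`, the `Fragment` type, and the `reoptimize` callback are hypothetical names, and fragments are modeled as plain functions that return their materialized result.

```python
from typing import Callable, List, Sequence

Fragment = Callable[[], List[tuple]]  # a pipelined unit that returns its result


def run_with_reoptimization(
    fragments: Sequence[Fragment],
    reoptimize: Callable[[int, int], Sequence[Fragment]],
    threshold: int = 100_000,
) -> List[str]:
    """Execute fragments in order; after each one, re-plan the unexecuted
    remainder if the observed cardinality exceeds `threshold`."""
    log: List[str] = []
    pending = list(fragments)
    index = 0
    while pending:
        fragment = pending.pop(0)
        result = fragment()
        log.append(f"fragment {index}: {len(result)} tuples")
        # Only the unexecuted portion of the plan is replaced.
        if len(result) > threshold and pending:
            pending = list(reoptimize(index, len(result)))
            log.append(f"re-optimized after fragment {index}")
        index += 1
    return log
```

The observed cardinality is passed back to `reoptimize`, which stands in for sending statistics to the optimizer; how the optimizer reuses its saved state is outside the scope of this sketch.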

9 Adaptive Operators: Double Pipelined Join Hybrid Hash Join: –Partially pipelined: no output until inner read –Asymmetric (inner vs. outer): optimization requires source behavior knowledge. Double Pipelined Hash Join: –Enhancement to [Wilschut PDIS91]: uses multithreading, handles overflow –Outputs data immediately –Symmetric: requires less source knowledge to optimize
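
The core idea of the double pipelined (symmetric) hash join can be sketched as follows: keep a hash table per input, and for each arriving tuple, insert it into its own table and probe the other. This is a single-threaded illustration under simplifying assumptions; it does not include the multithreading or overflow handling the slide mentions, and the function name and the `("L" | "R", row)` arrival encoding are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, Iterable, Iterator, Tuple


def symmetric_hash_join(
    arrivals: Iterable[Tuple[str, tuple]],
    left_key: Callable[[tuple], Any],
    right_key: Callable[[tuple], Any],
) -> Iterator[Tuple[tuple, tuple]]:
    """Join two inputs whose tuples arrive interleaved as ("L" | "R", row).

    A result pair is emitted as soon as both of its tuples have arrived,
    regardless of which input delivered first.
    """
    left_table = defaultdict(list)
    right_table = defaultdict(list)
    for side, row in arrivals:
        if side == "L":
            key = left_key(row)
            left_table[key].append(row)            # build on own side
            for match in right_table.get(key, []):  # probe the other side
                yield (row, match)
        else:
            key = right_key(row)
            right_table[key].append(row)
            for match in left_table.get(key, []):
                yield (match, row)


arrivals = [("L", (1, "a")), ("R", (1, "x")), ("R", (2, "y")), ("L", (2, "b"))]
for pair in symmetric_hash_join(arrivals, lambda r: r[0], lambda r: r[0]):
    print(pair)
```

Because neither input is designated as the build side, a slow source delays only its own matches, which is why the operator needs less knowledge of source behavior than a hybrid hash join. The cost is that both hash tables are held in memory.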

10 Adaptive Operators: Collector Utilize mirrors and overlapping sources to produce results quickly –Dynamically adjust to source speed & availability –Scale to many sources without exceeding net bandwidth –Based on policy expressed via rules WHEN timeout(CustReviews) DO activate(NYTimes), activate(alt.books) WHEN timeout(NYTimes) DO activate(alt.books)
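
The timeout policy on this slide can be approximated with a small fallback loop. The sketch below makes several assumptions that are not in the slides: sources are contacted sequentially rather than concurrently, a source signals a timeout by raising `TimeoutError`, and duplicates from overlapping sources are dropped. The names `collect`, `on_timeout`, and `initially_active` are hypothetical.

```python
from collections import deque
from typing import Callable, Dict, List, Sequence


def collect(
    sources: Dict[str, Callable[[], List[tuple]]],
    initially_active: Sequence[str],
    on_timeout: Dict[str, List[str]],
) -> List[tuple]:
    """Union rows from active sources; when a source times out, activate
    the fallback sources named by the policy for it."""
    results: List[tuple] = []
    queue = deque(initially_active)
    activated = set(initially_active)
    while queue:
        name = queue.popleft()
        try:
            rows = sources[name]()
        except TimeoutError:
            # WHEN timeout(name) DO activate(each fallback not yet active)
            for fallback in on_timeout.get(name, []):
                if fallback not in activated:
                    activated.add(fallback)
                    queue.append(fallback)
            continue
        for row in rows:
            if row not in results:  # assumption: overlapping sources may repeat rows
                results.append(row)
    return results


def timed_out() -> List[tuple]:
    raise TimeoutError("source did not respond")


policy = {"CustReviews": ["NYTimes", "alt.books"], "NYTimes": ["alt.books"]}
sources = {
    "CustReviews": timed_out,
    "NYTimes": timed_out,
    "alt.books": lambda: [("Readings in Database Systems", "review text")],
}
print(collect(sources, ["CustReviews"], policy))
```

Expressing the policy as a mapping from a timed-out source to its fallbacks mirrors the two WHEN/DO rules on the slide; a concurrent version would also need to deactivate slow sources once a mirror has answered.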

11 Highlights from Version 1 It worked well (graphs to prove it)! Unified architecture that encompassed previous techniques: –Choose nodes (Cole & Graefe) –Mid-stream re-optimization (Kabra & DeWitt) –Query scrambling (Urhan, Franklin, Amsaleg) Optimizer can have global view of different factors affecting adaptive behavior.

12 The Unsolved Problem Find interleaving points? When to switch from optimization to execution? Some straightforward solutions worked reasonably, but student who was supposed to solve the problem graduated prematurely. Some work on this problem: –Rick Cole (Informix) –Benninghoff & Maier (OGI). One solution being explored: execute first and break pipeline later as needed. Another solution: change operator ordering in mid-flight (Eddies, Avnur & Hellerstein).

13 More Urgent Problems Users want answers immediately: –Optimize time to first tuple –Give approximate results earlier. XML emerges as a preferred platform for data integration: –But all existing XML query processors are based on first loading XML into a repository.

14 Tukwila Version 2 Able to transform, integrate and query arbitrary XML documents. Support for output of query results as early as possible: –Streaming model of XML query execution. Efficient execution over remote sources that are subject to frequent updates. Philosophy: how can we adapt relational and object-relational execution engines to work with XML?

15 Tukwila V2 Highlights The X-scan operator that maps XML data into tuples of subtrees. Support for efficient memory representation of subtrees (use references to minimize replication). Special operators for combining and structuring bindings as XML output.

16 Tukwila V2 Architecture

17 Example XML File Readings in Database Systems Stonebraker Hellerstein 123-456-X Morgan Kaufmann San Mateo CA

18 XML Data Graph

19 Example Query WHERE $t ELEMENT_AS $b IN "books.xml", $p $pr IN "amazon.xml", $pr < 49.95 CONSTRUCT $t $p

20 Query Execution Plan

21 X-Scan The operator at the leaves of the plan. Given an XML stream and a set of regular expressions – produces a set of bindings. Supports both trees and graph data. Uses a set of state machines to traverse and match the patterns. Maintains a list of unseen element IDs, and resolves them upon arrival.
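
The streaming, state-machine flavor of X-scan can be illustrated with Python's standard `xml.etree.ElementTree.iterparse`. This is a heavily simplified sketch, not the Tukwila operator: it handles only a single linear path of tag names (no wildcards or Kleene star), only tree-shaped data (no ID/IDREF resolution), and the function name `x_scan` and the sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def x_scan(stream: BinaryIO, path: str) -> Iterator[ET.Element]:
    """Yield each element whose root-to-element tag path equals `path`,
    as soon as its closing tag has been read from the stream."""
    steps = path.strip("/").split("/")
    # One state per open element: number of path steps matched so far,
    # or -1 for a dead state that can never reach a match.
    state_stack = [0]
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            state = state_stack[-1]
            if 0 <= state < len(steps) and elem.tag == steps[state]:
                state_stack.append(state + 1)
            else:
                state_stack.append(-1)
        else:  # "end": the element's subtree is now complete
            state = state_stack.pop()
            if state == len(steps):
                yield elem  # a binding: the matched subtree


doc = (b"<db><book><title>Readings in Database Systems</title></book>"
       b"<article><title>Other</title></article></db>")
for binding in x_scan(io.BytesIO(doc), "db/book/title"):
    print(binding.text)
```

Bindings are produced while the document is still being read, which is the property that lets downstream operators start before the whole source has arrived. Supporting graph data would additionally require tracking unresolved element IDs, as the slide describes.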

22 X-scan Data Structures

23 State Machines for X-scan

24 Other Features of Tukwila V.2 X-scan: –Can also be made to preserve XML order. –Careful handling of cycles in the XML graph. –Can apply certain selections to the bindings. Uses much of the code of Tukwila I. No modifications to traditional operators. XML output producing operators. Nest operator.

25 In the “Pipeline” Partial answers: no blocking. Produce approximate answers as data is streaming. Policies for recovering from memory overflow [More Zack]. Efficient updating of XML documents (and an XML update language) [w/Tatarinov] Dan Suciu: a modular/composable toolset for manipulating XML. Automatic generation of data source descriptions (Doan & Domingos)

26 First 5 Results

27 Completion Time

28 Intermediate Conclusions First scalable XML query processor for networked data. Work done in relational query processing is very relevant to XML query processing. We want to avoid decomposing XML data into relational structures.

29 Some Observations from Nimble What is Nimble? –Founded in June, 1999 with Dan Weld. –Data integration engine built on an XML platform. –Query language is XML-QL. –Mostly geared to enterprise integration, some advanced web applications. –70+ person company (and hiring!) –Ships in trucks (first customer is Paccar).

30 System Architecture [Diagram; labels: XML Query, User Applications, Lens™, FileInfoBrowser™, Software Developers Kit, NIMBLE™ APIs, Front-End, XML Lens Builder™, Management Tools, Integration Builder, Security Tools, Data Administrator, Concordance Developer, Integration Layer, Nimble Integration Engine™ (Compiler, Executor), Metadata Server, Cache, Relational, Data Warehouse/Mart, Legacy, Flat File, Web Pages, Common XML View]

31 The Current State of Enterprise Information Explosion of intranet and extranet information. 80% of corporate information is unmanaged. By 2004, 30X more enterprise data than in 1999. The average company: –maintains 49 distinct enterprise applications –spends 35% of total IT budget on integration-related efforts. Source: Gartner, 1999

32 Design Issues Query language for XML: tracking the W3C committee. The algebra: –Needs to handle XML, relational, hierarchical and support it all efficiently! –Need to distinguish physical from logical algebra. Concordance tables need to be an integral part of the system. Need to think of data cleaning. Need to deal with down times of data sources (or refusal times). Need to provide range of options between on-demand querying and pre-materialization.

33 Non-Technical Issues SQL not really a standard. Legacy systems are not necessarily old. IT managers skeptical of truths. People are very confused out there. Need a huge organization to support the effort.

