
1 Continuously Adaptive Processing of Data and Query Streams. Michael Franklin, UC Berkeley, April 2002. Joint work with Joe Hellerstein and the Berkeley DB Group.

2 Overview. Autonomic Computing is a technique for making computer systems self-managing, but the real issue is supporting autonomic organizations. They need information in a timely and tailored manner: workflows, business flows, news, business intelligence, sensor input, ... The information infrastructure is the "nervous system" of the organization, so adaptive, responsive data management systems are essential.

3 Opportunities for Autonomic Data Management. The functions currently expected of the DBA must be automated (see next talk): data placement, index selection, caching, pre-fetching, "freshening", ... We are pursuing an approach based on user profiles. The query processing engine itself must also be made "autonomic"; this is the focus of today's talk. The two issues are related: profiles are a type of standing query.

4 Example 1: "Real-Time Business". Event-driven processing in B2B and enterprise apps: supply chain, CRM, trade reconciliation, order processing, etc. There is a (quasi) real-time flow of events and data. We must manage these flows to drive business processes and mine them to create and adjust business rules; we can also "tap into" the flows for on-line analysis.

5 Example 2: Information Dissemination. [Diagram: Data Sources, Filtered Data, Users, User Profiles.] Document creation or a crawler initiates the flow of data toward users; users and the system initiate the flow of profiles back toward the data.

6 Example 3: Sensor Nets. Tiny devices measure the environment: Berkeley "motes", Smart Dust, Smart Tags, ... This is a major research thrust at Berkeley, with applications in transportation, seismic monitoring, energy, ... The devices form dynamic ad hoc networks, and aggregate and communicate streams of values. Sensors are the stimulus receptors of the autonomic organization.

7 Common Features. Centrality of dataflow: the architecture is focused on data movement, moving streams of data through code in a network, which requires intelligent, low-overhead routing. Volatility of the environment: dynamic resources and topology, partial failures, long-running (never-ending?) tasks, potential for user interaction during the flow, and large scale in users, data, and resources. This volatility requires adaptivity.

8 Review: Distributed Query Processing. SELECT eid, ename, title FROM Emp, Proj, Assign WHERE Emp.eid = Assign.eid AND Proj.pid = Assign.pid AND Emp.loc <> Proj.loc. The system handles query plan generation and optimization and ensures correct execution. Originally conceived for corporate networks. (Figure ©1998 Ozsu and Valduriez)

9 Wide-Area + Wrapped Sources → Unpredictability. Sources may be unreachable or slow to respond. Data statistics and cost estimates may be unavailable or unreliable. There is no view into internal processing and structure. Traditional, static query processing approaches cannot cope with such problems at run-time.

10 User-in-the-Loop → Unpredictability. Batch processing is inappropriate for many apps; feedback must reach the user as quickly as possible. Data access becomes a cooperative, iterative process: the user may correct, refine, redirect, or change the query.

11 Mobility & Data Streams → Unpredictability. Mobility: location-centric queries; moving endpoints change data staging needs. Data streams/sensors: varying data arrival rates, adapting resolutions, push vs. pull. Queries can be long-lived and streams may be infinite, so a static plan is doomed to failure.

12 Adaptive Query Processing. Existing systems are adaptive, but at a slow rate: collect stats, compile and optimize the query, eventually collect stats again or change the schema, then re-compile and re-optimize if necessary. [Survey: Hellerstein, Franklin, et al., DE Bulletin 2000] [Adaptivity spectrum figure: static plans, where current DBMSs sit.]

13 Adaptive Query Processing. Goal: allow adjustments for runtime conditions. Basic idea: leave "options" in the compiled plan; at runtime, bind the options based on observed state (available memory, load, cache contents, etc.). Once bound, the plan is followed for the entire query. [HP88, GW89, IN+92, GC94, AC+96, AZ96, LP97] [Adaptivity spectrum: static plans (current DBMS) → late binding (dynamic, parametric, competitive plans).]
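As a concrete, entirely hypothetical illustration of late binding (not Telegraph code or any particular system's API), the sketch below keeps two join strategies as an "option" in the plan and binds one of them once, at query start-up, based on observed memory:

```python
# Hypothetical sketch of late binding: the optimizer leaves an "option" in the
# compiled plan; the executor binds it once at start-up, then follows it.

def hash_join(outer, inner, key):
    # In-memory hash join: preferred when the inner relation fits in memory.
    table = {}
    for row in inner:
        table.setdefault(row[key], []).append(row)
    return [{**o, **i} for o in outer for i in table.get(o[key], [])]

def nested_loop_join(outer, inner, key):
    # Fallback that needs almost no working memory.
    return [{**o, **i} for o in outer for i in inner if o[key] == i[key]]

def bind_join_option(available_memory_bytes, est_inner_bytes):
    # The "option": bound once from observed state, then used for the whole query.
    return hash_join if est_inner_bytes <= available_memory_bytes else nested_loop_join

# Illustrative data; bind at runtime, then run the chosen operator.
emp = [{"eid": 1, "loc": "SF"}, {"eid": 2, "loc": "LA"}]
assign = [{"eid": 1, "pid": 7}, {"eid": 2, "pid": 9}]
join = bind_join_option(available_memory_bytes=64_000_000, est_inner_bytes=1_000)
result = join(emp, assign, "eid")
```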

14 Adaptive Query Processing. Start with a compiled plan and observe performance between blocking (or blocked) operators. Re-optimize the remainder of the plan if execution diverges from the expected data delivery rate [AF+96, UFA98] or data statistics [KD98]. [Adaptivity spectrum: static plans (current DBMS) → late binding (dynamic, parametric, competitive) → inter-operator (Query Scrambling, Mid-Query Re-optimization).]

15 Adaptive Query Processing. Join operators themselves can be made adaptive: to user needs (Ripple Joins [HH99]), to memory limitations (DPHJ [IF+99]), and to memory limitations and delays (XJoin [UF00]). Plan re-optimization can also be done in mid-operation (Convergent QP [IHW02]). [Adaptivity spectrum: static plans → late binding → inter-operator (Query Scrambling, Mid-Query Re-opt) → intra-operator (Ripple Join, XJoin, DPHJ, Convergent QP).]

16 Adaptive Query Processing. This is the region we are exploring in the Telegraph project at Berkeley. [Adaptivity spectrum: static plans → late binding → inter-operator → intra-operator → per-tuple (Eddies, CACQ, PSoup), with the per-tuple end still largely open ("???").]

17 Telegraph Overview. An adaptive system for large-scale dataflow processing; a convergence of data management and networking approaches. It is based on an extensible set of operators: 1) ingress (data access) operators, such as the screen scraper, Napster/Gnutella readers, file readers, and sensor proxies; 2) non-blocking data processing operators, such as selections (filters) and XJoins; 3) adaptive routing operators, such as Eddies, SteMs, and FLuX.

18 Routing Operators: Eddies [Avnur & Hellerstein, SIGMOD 00]. How should operators be ordered, and reordered over time? Traditionally this is driven by performance and economic/administrative feedback, which won't work for never-ending queries over volatile streams. Instead, use adaptive record routing: re-optimization becomes a change in routing policy. [Diagram: a static dataflow through A, B, C, D versus an eddy routing tuples among A, B, C, D.]

19 Eddy: Per-Tuple Adaptivity. The eddy adjusts flow adaptively: tuples are routed through operators in different orders, but each tuple must visit every operator once before output, and state is maintained on a per-tuple basis. There are two complementary routing mechanisms: back pressure (each operator has a queue; don't route to operators with full queues, which avoids expensive operators) and lottery scheduling (give preference to operators that are more selective), as in the sketch below. Initial results showed the eddy could "learn" good plans and adjust to changes in stream contents over time. We are currently exploring the inherent tradeoffs of such fine-grained adaptivity.
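A minimal sketch of the lottery-scheduling idea, as a simplification rather than the Telegraph implementation: an operator gains a ticket when a tuple is routed to it and gives one back when the tuple survives the predicate, so selective operators accumulate tickets and win the routing lottery more often for later tuples.

```python
import random

def eddy(tuples, operators):
    """Route each tuple through every operator exactly once, in an adaptively
    chosen order; drop it as soon as any operator rejects it.
    `operators` is a list of predicates: row -> bool."""
    tickets = [1] * len(operators)                 # everyone starts with one ticket
    results = []
    for t in tuples:
        remaining = list(range(len(operators)))    # per-tuple "not yet visited" set
        alive = True
        while remaining and alive:
            # Lottery: pick among unvisited operators, weighted by their tickets.
            weights = [tickets[i] for i in remaining]
            op = random.choices(remaining, weights=weights)[0]
            remaining.remove(op)
            tickets[op] += 1                        # ticket charged on input
            if operators[op](t):
                tickets[op] -= 1                    # ticket refunded when tuple survives
            else:
                alive = False                       # selective operators keep their tickets
        if alive:
            results.append(t)
    return results
```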

20 Non-Blocking Operators: Join. Traditional hash joins block when one input stalls. [Diagram: Source A builds Hash Table A; Source B probes it.]

21 Non-Blocking Operators: Join. The symmetric hash join [WA91] blocks only if both inputs stall. It processes tuples as they arrive from the sources, and produces all answer tuples with no spurious duplicates. [Diagram: Source A and Source B each build their own hash table (Hash Table A, Hash Table B) and probe the other's.]
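A minimal Python sketch of the symmetric hash join idea, assuming a single interleaved stream of tagged tuples and a shared join key name; this is an illustration of the technique, not the [WA91] code.

```python
from collections import defaultdict

def symmetric_hash_join(stream, key):
    """`stream` yields ("A", row) or ("B", row) in arrival order.
    Each arriving tuple is built into its side's hash table and immediately
    probed against the other side's table, so a result is emitted as soon as
    both matching tuples have arrived, with no duplicates."""
    table_a = defaultdict(list)
    table_b = defaultdict(list)
    for side, row in stream:
        k = row[key]
        if side == "A":
            table_a[k].append(row)                # build own side
            for other in table_b.get(k, []):      # probe the opposite side
                yield (row, other)
        else:
            table_b[k].append(row)
            for other in table_a.get(k, []):
                yield (other, row)
```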

22 SteMs: "State Modules" [Raman 01]. A generalization of the symmetric join: split the join into modules that contain the current state of the operators. A SteM is basically a store with one or more indexes, and it allows sharing of state across multiple queries. It supports XJoin, continuous queries, competitive processing, ...
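As a toy illustration (my own sketch, not the Telegraph SteM implementation), a SteM can be pictured as a store with a hash index and two operations, build and probe, that an eddy routes tuples to; because the state lives outside any one query's join operator, several queries can share it.

```python
from collections import defaultdict

class SteM:
    """Toy "State Module": a tuple store with one hash index, supporting
    build (insert) and probe (lookup). Real SteMs may carry several indexes."""
    def __init__(self, key_attr):
        self.key_attr = key_attr
        self.index = defaultdict(list)

    def build(self, row):
        self.index[row[self.key_attr]].append(row)

    def probe(self, value):
        return self.index.get(value, [])

# A symmetric join of streams A and B on attribute "x" then becomes:
# build each arriving tuple into its own SteM, probe the other side's SteM.
stem_a, stem_b = SteM("x"), SteM("x")
stem_a.build({"x": 1, "payload": "a1"})
stem_b.build({"x": 1, "payload": "b1"})
matches = stem_a.probe(1)   # -> [{"x": 1, "payload": "a1"}]
```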

23 Using SteMs for Joins. [Diagram: SteMs Hash A, Hash B, Hash C, Hash D for a four-way join of A, B, C, D.] SteMs maintain intermediate state for multiple joins. Use an eddy to route tuples through the necessary modules, being sure to enforce "build then probe" processing. Note that the results of individual joins can also be maintained in SteMs.

24 CACQ [Madden et al., SIGMOD 02]. Continuously Adaptive Continuous Queries: we expect hundreds to thousands of queries over the same sources, so operators from many queries are combined to improve efficiency (i.e., to share work). [Diagram: Query 1 (σ Stocks.symbol = 'MSFT') and Query 2 (σ Stocks.symbol = 'APPL') over the same Stock Quotes stream.]

25 Combining Queries in CACQ. Q1 = SELECT * FROM S WHERE S.a = s1 AND S.b = s4; Q2 = SELECT * FROM S WHERE S.a = s2 AND S.b = s5; Q3 = SELECT * FROM S WHERE S.a = s3 AND S.b = s6. SteMs can also be used to store query specifications, and a good predicate index is needed when there are many queries (a sketch of that idea follows). The eddy now routes tuples for multiple queries simultaneously. CACQ leverages eddies for adaptive CQ processing, and results show advantages over static approaches.
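A minimal sketch of the shared-work idea for equality predicates like those above (a generic grouped-filter illustration with made-up constants, not the CACQ code): the constants from many queries are kept in one hash-based predicate index per attribute, so an arriving tuple does one lookup per attribute instead of one test per query.

```python
from collections import defaultdict

class GroupedEqualityFilter:
    """Shared evaluation of many queries' equality predicates on one attribute."""
    def __init__(self, attr):
        self.attr = attr
        self.queries_for_value = defaultdict(set)   # constant -> {query ids}

    def add_query(self, query_id, constant):
        self.queries_for_value[constant].add(query_id)

    def matching_queries(self, row):
        return self.queries_for_value.get(row[self.attr], set())

# Q1: S.a = 1 AND S.b = 4;  Q2: S.a = 2 AND S.b = 5;  Q3: S.a = 3 AND S.b = 6
f_a, f_b = GroupedEqualityFilter("a"), GroupedEqualityFilter("b")
for qid, (ca, cb) in {"Q1": (1, 4), "Q2": (2, 5), "Q3": (3, 6)}.items():
    f_a.add_query(qid, ca)
    f_b.add_query(qid, cb)

row = {"a": 2, "b": 5}
# A query is satisfied only if the tuple matched its predicate on every attribute.
satisfied = f_a.matching_queries(row) & f_b.matching_queries(row)   # -> {"Q2"}
```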

26 In the Beginning. [Diagram: a Query probes an Index over stored Data to produce a Result.]

27 Pub-Sub / CQ / Filtering. [Diagram: Queries are indexed; arriving Data probes the query index to produce Results.] These systems can be viewed as database systems turned upside down.

28 PSoup [Chandrasekaran 02]. The logical (?) conclusion: treat queries and data as duals. CACQ and similar systems are "filters": they keep no historical data, and answers are continuously returned as new data streams into the system. Such continuous answers are often infeasible (intermittent connectivity, "Data Recharging" profiles) or inefficient (often the user does not want to be continuously interrupted).

29 PSoup (cont.). Query processing is treated as an n-way symmetric join between data tuples and query specifications. The PSoup model: repeated invocation of standing queries over windows of streaming data.

30 PSoup: Query & Data Duality. [Diagram: queries are kept in an indexed query store and data in an indexed data store; both feed the Result.]

31 PSoup: Query & Data Duality. [Diagram (cont.): an arriving query probes the data index, just as arriving data probes the query index.]

32 PSoup (continued). Queries specify a "window" (e.g., the last 10 minutes). Queries are registered with the system and entered into a "Query SteM"; data tuples stream into the system and are entered into a "Data SteM". Matching is done on the fly, in the background, effectively keeping shared materialized views. When a query is invoked, the query window is applied to the materialized answer and the results are returned to the user. A toy sketch of this machinery follows.
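A toy sketch of the data/query duality for selection queries (my simplification with hypothetical names, not the PSoup implementation): new queries are built into a query store and probed against stored data, new tuples are built into a data store and probed against stored queries, and invocation merely applies the time window to the accumulated matches.

```python
import time

class PSoupSelections:
    """Toy sketch: selection queries over one stream, matched symmetrically."""
    def __init__(self):
        self.queries = {}      # query id -> predicate(row) -> bool
        self.data = []         # list of (arrival_time, row)
        self.matches = set()   # (query id, data index) pairs that satisfy the predicate

    def add_query(self, qid, predicate):
        # BUILD into the query store, then PROBE against all stored data.
        self.queries[qid] = predicate
        for i, (_, row) in enumerate(self.data):
            if predicate(row):
                self.matches.add((qid, i))

    def add_tuple(self, row):
        # BUILD into the data store, then PROBE against all stored queries.
        i = len(self.data)
        self.data.append((time.time(), row))
        for qid, predicate in self.queries.items():
            if predicate(row):
                self.matches.add((qid, i))

    def invoke(self, qid, window_seconds):
        # Invocation only applies the time window to precomputed matches.
        cutoff = time.time() - window_seconds
        return [self.data[i][1] for (q, i) in self.matches
                if q == qid and self.data[i][0] >= cutoff]
```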

33 PSoup: New Select Query Arrival. [Walk-through figure: a new query, SELECT * FROM R WHERE R.a <= 4 AND R.b >= 3 with window BEGIN (NOW - 600) END (NOW), arrives and is BUILT into the Query Store as query 24, alongside the existing predicates of queries 20-23; the Data Store holds recent tuples of R with attributes R.a and R.b.]

34 New Selection Spec (continued). [Walk-through figure: the new query 24 is PROBEd against the tuples already in the Data Store, and a row of true/false match results for query 24 is added to the results structure.]

35 PSoup: Selection, New Data. [Walk-through figure: a new tuple (id 53) arrives and is BUILT into the Data Store; the Query Store still holds the predicates of queries 20-24.]

36 Selection, New Data (cont.). [Walk-through figure: the new tuple 53 is PROBEd against all predicates in the Query Store, and its true/false match results are added to the results structure.]

37 PSoup: Query Invocation. PSoup continuously maintains materialized views over streaming data and queries. Data is not returned to the user until the query is invoked; invocation only requires applying the "window" to precomputed results. This adaptive approach allows the system to continuously absorb new data and new queries without recompilation. There are many issues to study: query indexing, spilling to disk, bulk processing.
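Continuing the toy class sketched under slide 32 (values here are purely illustrative), an invocation just reads out the already-materialized matches that fall inside the window:

```python
ps = PSoupSelections()
ps.add_query(24, lambda r: r["a"] <= 4 and r["b"] >= 3)   # the new query of slide 33
ps.add_tuple({"a": 6, "b": 3})                             # matched (and rejected) on arrival
ps.add_tuple({"a": 3, "b": 4})                             # matched (and accepted) on arrival
print(ps.invoke(24, window_seconds=600))                   # -> [{'a': 3, 'b': 4}]
```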

38 Conclusions. The goal is to support autonomic organizations; the information infrastructure is the "nervous system" of such organizations. Existing query processing technology is simply too brittle. The Telegraph project at Berkeley is "pushing the envelope" on adaptive query processing. [Adaptivity spectrum: static plans → late binding → inter-operator → intra-operator → per-tuple (Eddies, CACQ, PSoup), with the per-tuple end still largely open ("???").]

