Adaptive Dataflow: A Database/Networking Cosmic Convergence Joe Hellerstein UC Berkeley.

Presentation transcript:

Adaptive Dataflow: A Database/Networking Cosmic Convergence Joe Hellerstein UC Berkeley

Road Map How I got started on this CONTROL project Eddies Tie-ins to Networking Research Telegraph & ongoing adaptive dataflow research New arenas: Sensor networks P2P networks

Background: CONTROL project Online/Interactive query processing Online aggregation Scalable spreadsheets & refining visualizations Online data cleaning (Potter’s Wheel) Pipelining operators (ripple joins, online reordering) over streaming samples

Example: Online Aggregation

Online Data Visualization CLOUDS

Potter’s Wheel

Goals for Online Processing Performance metric: Statistical (e.g. conf. intervals) User-driven (e.g. weighted by widgets) New “greedy” performance regime Maximize 1st derivative of the “mirth index” Mirth defined on-the-fly Therefore need FEEDBACK and CONTROL [Figure: plot against Time, up to 100%, comparing Online and Traditional]
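A minimal sketch of the statistical metric mentioned above, assuming rows arrive in random order: a running average with a normal-approximation confidence interval that tightens as more rows are seen. The function name `online_average` and the choice of Welford's update are illustrative assumptions, not taken from the CONTROL project.

```python
import math
import random


def online_average(values, z=1.96):
    """Yield (rows_seen, running_mean, half_width) after each sampled row.

    half_width is z * standard error (normal approximation), so the
    interval is running_mean +/- half_width. Assumes random arrival order.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            half = z * math.sqrt(m2 / (n - 1) / n)
        else:
            half = float("inf")  # no variance estimate from a single row
        yield n, mean, half


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(100.0, 15.0) for _ in range(10_000)]
    for n, mean, half in online_average(data):
        if n in (10, 100, 1_000, 10_000):
            print(f"{n:>6} rows: {mean:8.3f} +/- {half:.3f}")
```

The user can stop the query as soon as the interval is tight enough, which is the "greedy" regime: early answers matter more than time to completion.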

CONTROL → Volatility Goals and data may change over time User feedback, sample variance Goals and data may be different in different “regions” Group-by, scrollbar position [An aside: dependencies in selectivity estimation] Q: Query optimization in this world? Or in any pipelining, volatile environment?? Where else do we see volatility?

Continuous Adaptivity: Eddies A little more state per tuple Ready/done bits (extensible a la Volcano/Starburst) Query processing = dataflow routing!! We'll come back to this! [Figure: Eddy]

Eddies: Two Key Observations Break the set-oriented boundary Usual DB model: algebra expressions: (R ⋈ S) ⋈ T Usual DB implementation: pipelining operators! Subexpressions never materialized Typical implementation is more flexible than algebra We can reorder in-flight operators Other gains possible by breaking the set-oriented boundary… Don’t rewrite graph. Impose a router Graph edge = absence of routing constraint Observe operator consumption/production rates Consumption: cost Production: cost*selectivity
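The routing idea can be sketched as follows, under simplifying assumptions: every operator is a filter, each tuple carries a set of "done bits", and the router picks the next operator by a ticket lottery. An operator earns a ticket when it consumes a tuple and gives one back when it returns the tuple, so operators that drop many tuples tend to be tried first. The names `Op` and `eddy` and the exact ticket rule are illustrative assumptions, not the Telegraph implementation.

```python
import random


class Op:
    """A filter operator plus the counters the router observes."""

    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate
        self.tickets = 1   # start at one so every operator can be drawn
        self.consumed = 0  # tuples routed to this operator


def eddy(tuples, ops, rng):
    """Route each tuple through all ops in a per-tuple, adaptive order."""
    out = []
    for t in tuples:
        done = set()  # per-tuple "done bits": indices of ops already passed
        alive = True
        while alive and len(done) < len(ops):
            ready = [i for i in range(len(ops)) if i not in done]
            weights = [ops[i].tickets for i in ready]
            i = rng.choices(ready, weights=weights, k=1)[0]
            op = ops[i]
            op.consumed += 1
            op.tickets += 1  # credit on consumption
            if op.predicate(t):
                op.tickets = max(1, op.tickets - 1)  # debit on production
                done.add(i)
            else:
                alive = False  # tuple dropped; the credit is kept
        if alive:
            out.append(t)
    return out


if __name__ == "__main__":
    ops = [Op("even", lambda x: x % 2 == 0), Op("small", lambda x: x < 100)]
    result = eddy(range(1000), ops, random.Random(1))
    print(len(result), {op.name: op.consumed for op in ops})
```

Whatever order the lottery picks, the output is the same set of tuples; only the amount of work per operator changes, which is the point of routing instead of rewriting the plan.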

Road Map How I got started on this CONTROL project Eddies Tie-ins to Networking Research Telegraph & ongoing adaptive dataflow research New arenas: Sensor networks P2P networks

Coincidence: Eddie Comes to Berkeley CLICK: a NW router is a query plan! “The Click Modular Router”, Robert Morris, Eddie Kohler, John Jannotti, and M. Frans Kaashoek, SOSP ‘99

Figure 3: Example Router Graph Also Scout: Paths are the key to a comm-centric OS “Making Paths Explicit in the Scout Operating System”, David Mosberger and Larry L. Peterson. OSDI ‘96.

More Interaction: CS262 Experiment w/ Eric Brewer Merge OS & DBMS grad class, over a year Eric/Joe, point/counterpoint Some tie-ins were obvious: memory mgmt, storage, scheduling, concurrency Surprising: QP and networks go well side by side E.g. eddies and TCP Congestion Control Both use back-pressure and simple Control Theory to “learn” in an unpredictable dataflow environment Eddies close to the n-armed bandit problem
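For readers from the database side, the TCP half of that comparison can be sketched as additive-increase/multiplicative-decrease (AIMD): the sender probes for bandwidth, and a loss signal acts as back-pressure. This is a toy model under stated assumptions (a fixed `capacity`, loss exactly when the window exceeds it); the function name and parameters are hypothetical.

```python
def aimd(capacity, rounds, increase=1.0, decrease=0.5, start=1.0):
    """Return the congestion window used in each round.

    Assumption: a loss is signalled whenever the window exceeds `capacity`.
    """
    window = start
    history = []
    for _ in range(rounds):
        history.append(window)
        if window > capacity:
            # Loss is the back-pressure signal: cut the window sharply.
            window = max(1.0, window * decrease)
        else:
            # No loss: probe for more bandwidth, one step at a time.
            window += increase
    return history


if __name__ == "__main__":
    print(aimd(capacity=4, rounds=8))
```

The sender never learns the capacity directly; it only reacts to feedback, much as an eddy reacts to observed operator rates.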

Networking Overview for DB People Like Me Core function of protocols: data xfer Data Manipulation (buffer, checksum, encryption, xfer to/fr app space, presentation) Transfer Control (flow/congestion ctl, detecting xmission probs, acks, muxing, timestamps, framing) -- Clark & Tennenhouse, “Architectural Considerations for a New Generation of Protocols”, SIGCOMM ‘90 Basic Internet assumption: “a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations” (Van Jacobson)

Exchange! Data Modeling! Query Opt! Thesis: nets are good at xfer control, not so good at data manipulation Some C&T wacky ideas for better data manipulation Xfer semantic units, not packets (ALF) Auto-rewrite layers to flatten them (ILP) Minimize cross-layer ordering constraints Control delivery in parallel via packet content C & T’s Wacky Ideas

Wacky New Ideas in QP What if… We had unbounded data producers and consumers (“streams” … “continuous queries”) We couldn’t know our producers’ behavior or contents?? (“federation” … “mediators”) We couldn’t predict user behavior? (“control”) We couldn’t predict behavior of components in the dataflow? (“networked services”) We had partial failure as a given? (oops, have we ignored this?) Yes … networking people have been here! Remember Van Jacobson’s quote?

The Cosmic Convergence NETWORKING RESEARCH Content-Based Routing Router Toolkits Content Addressable Networks Directed Diffusion Adaptivity, Federated Control, GeoScalability DATABASE RESEARCH Adaptive Query Processing Continuous Queries Approximate/ Interactive QP Sensor Databases Data Models, Query Opt, DataScalability

The Cosmic Convergence Adaptivity, Federated Control, GeoScalability NETWORKING RESEARCH Content-Based Routing Router Toolkits Content Addressable Networks Directed Diffusion DATABASE RESEARCH Adaptive Query Processing Continuous Queries Approximate/ Interactive QP Sensor Databases Data Models, Query Opt, DataScalability Telegraph

Road Map How I got started on this CONTROL project Eddies Tie-ins to Networking Research Telegraph & ongoing adaptive dataflow research New arenas: Sensor networks P2P networks

What’s in the Sweet Spot? Scenarios with: Structured Content Volatility Rich Queries Clearly: Long-running data analysis a la CONTROL Continuous queries Queries over Internet sources and services Two emerging scenarios: Sensor networks P2P query processing

Telegraph: Engineering the Sweet Spot An adaptive dataflow system Dataflow programming model A la Volcano, CLICK: push and pull. “Fjords”, ICDE02 Extensible set of pipelining operators, including relational ops, grouped filters (e.g. XFilter) SQL parser for convenience (looking at XQuery) Adaptivity operators Eddies + Extensible rules for routing constraints, Competition SteMs (state modules) FLuX (Fault-tolerant Load-balancing eXchange) Bounded and continuous: Data sources Queries

State Modules (SteMs) Goal: Further adaptivity through competition Multiple mirrored sources Handle rate changes, failures, parallelism Multiple alternate operators Join = Routing + State SteM operator manages tradeoffs State Module, unifies caches, rendezvous buffers, join state Competitive sources/operators share building/probing SteMs Join algorithm hybridization! Vijayshankar Raman [Figure: static dataflow vs. eddy + stems]
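"Join = Routing + State" can be illustrated with a small sketch, assuming an equi-join of two streams R and S on one key: each input gets its own state module, and the router builds every arriving row into its own module and probes the other. Routed this way the behavior is that of a symmetric hash join. The class `SteM` and the function `join_stream` are hypothetical names for illustration only.

```python
from collections import defaultdict


class SteM:
    """State module: a hash index over the rows of one input."""

    def __init__(self, key):
        self.key = key
        self.index = defaultdict(list)

    def build(self, row):
        self.index[row[self.key]].append(row)

    def probe(self, value):
        return list(self.index.get(value, []))


def join_stream(arrivals, key):
    """arrivals: iterable of (source, row) with source in {"R", "S"}.

    Yields (r_row, s_row) pairs as soon as both sides have arrived.
    """
    stems = {"R": SteM(key), "S": SteM(key)}
    for source, row in arrivals:
        other = "S" if source == "R" else "R"
        stems[source].build(row)                      # routing step 1: build
        for match in stems[other].probe(row[key]):    # routing step 2: probe
            yield (row, match) if source == "R" else (match, row)


if __name__ == "__main__":
    arrivals = [
        ("R", {"k": 1, "a": "r1"}),
        ("S", {"k": 1, "b": "s1"}),
        ("R", {"k": 1, "a": "r2"}),
    ]
    for pair in join_stream(arrivals, "k"):
        print(pair)
```

Because the state lives in the modules rather than inside a join operator, a different routing policy over the same modules would yield a different join algorithm.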

FLuX: Routing Across Cluster Fault Tolerance, Load Balancing Continuous/long-running flows need high availability Big flows need parallelism Adaptive Load-Balancing req’d FLuX operator: Exchange plus… Adaptive flow partitioning (River) Transient state replication & migration RAID for SteMs Needs to be extensible to different ops: Content-sensitivity History-sensitivity Dataflow semantics Optimize based on edge semantics Networking tie-in again: At-least-once delivery? Exactly-once delivery? In/Out of order? Migration policy: the ski rental analogy Mehul Shah
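A rough sketch of adaptive flow partitioning with state migration, under strong simplifying assumptions: keys hash into a fixed number of buckets, each bucket and its accumulated state are owned by one node, and rebalancing moves one bucket from the most loaded node to the least loaded one. Replication for fault tolerance is left out entirely. The class `Flux` and its methods are hypothetical, not the FLuX operator's actual interface.

```python
class Flux:
    """Exchange-style partitioner whose bucket-to-node map can change."""

    def __init__(self, nodes, buckets=8):
        self.nodes = list(nodes)
        self.buckets = buckets
        self.owner = {b: self.nodes[b % len(self.nodes)] for b in range(buckets)}
        self.state = {b: [] for b in range(buckets)}  # per-bucket operator state

    def route(self, key, row):
        """Store the row in its bucket and return the node that owns it."""
        b = hash(key) % self.buckets
        self.state[b].append(row)
        return self.owner[b]

    def load(self):
        load = {n: 0 for n in self.nodes}
        for b, rows in self.state.items():
            load[self.owner[b]] += len(rows)
        return load

    def rebalance(self):
        """Move one bucket (state included) from the hottest to the coldest node.

        Only buckets smaller than the load gap are moved, so a move never
        makes the imbalance worse. Returns the moved bucket id or None.
        """
        load = self.load()
        hot = max(load, key=load.get)
        cold = min(load, key=load.get)
        gap = load[hot] - load[cold]
        movable = [b for b in self.owner
                   if self.owner[b] == hot and 0 < len(self.state[b]) < gap]
        if not movable:
            return None
        bucket = max(movable, key=lambda b: len(self.state[b]))
        self.owner[bucket] = cold  # state follows the bucket
        return bucket


if __name__ == "__main__":
    flux = Flux(["a", "b"], buckets=4)
    for key in [0] * 6 + [2] * 4 + [1]:
        flux.route(key, {"key": key})
    print(flux.load(), flux.rebalance(), flux.load())
```

When to migrate is a separate policy question (the ski rental analogy on the slide); this sketch only shows the mechanism.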

Continuously Adaptive Continuous Queries (CACQ) Continuous Queries clearly need all this stuff! Address adaptivity 1st. 4 Ideas in CACQ: Use eddies to allow reordering of ops. But one eddy will serve for all queries Explicit tuple lineage Mark each tuple with per-op ready/done bits Mark each tuple with per-query completed bits Queries are data: join with Grouped Filter Much like XFilter, but for relational queries Joins via SteMs, shared across all queries Note: mixed-lineage tuples in a SteM. I.e. shared state is not shared algebraic expressions! Delete a tuple from flow only if it matches no query Next: F.T. CACQ via FLuXen Sam Madden, Mehul Shah, Vijayshankar Raman
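Two of these ideas, "queries are data" and per-query lineage, can be sketched together under narrow assumptions: every predicate has the form `attr > constant`, a grouped filter indexes the constants of all queries on one attribute, and each tuple carries the set of queries it can still satisfy. The names `GroupedFilter` and `run` are illustrative, and the real system tracks lineage with bit vectors rather than Python sets.

```python
import bisect


class GroupedFilter:
    """Index many 'attr > constant' predicates over a single attribute."""

    def __init__(self, attr):
        self.attr = attr
        self.constants = []  # sorted thresholds
        self.query_ids = []  # query_ids[i] owns constants[i]

    def add(self, query_id, constant):
        i = bisect.bisect_left(self.constants, constant)
        self.constants.insert(i, constant)
        self.query_ids.insert(i, query_id)

    def matching(self, row):
        """Queries whose predicate on this attribute the row satisfies."""
        # Every threshold strictly below the row's value is satisfied.
        i = bisect.bisect_left(self.constants, row[self.attr])
        return set(self.query_ids[:i])


def run(rows, filters, all_queries):
    """Yield (row, live_queries); drop a row only if it matches no query."""
    for row in rows:
        live = set(all_queries)  # per-tuple lineage: queries still possible
        for f in filters:
            failed = set(f.query_ids) - f.matching(row)
            live -= failed
            if not live:
                break
        if live:
            yield row, live


if __name__ == "__main__":
    temp, light = GroupedFilter("temp"), GroupedFilter("light")
    temp.add("q1", 20)
    temp.add("q2", 30)
    light.add("q2", 5)
    light.add("q3", 50)
    rows = [{"temp": 25, "light": 10}, {"temp": 10, "light": 1}]
    for row, live in run(rows, [temp, light], {"q1", "q2", "q3"}):
        print(row, sorted(live))
```

One pass over the filters serves every registered query, and adding a query is just an insert into the filter's index.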

Road Map How I got started on this CONTROL project Eddies Tie-ins to Networking Research Telegraph & ongoing adaptive dataflow research New arenas: Sensor networks P2P networks

Sensor Nets “Smart Dust” + TinyOS Thousands of “motes” Expensive communication Power constraints Query workload: Aggregation & approximation Queries and Continuous Queries Challenges: Push the processing into the network Deal with volatility & failure CONTROL issues: data variance, user desires Joint work with Ramesh Govindan, Sam Madden, Wei Hong and David Culler (Intel Berkeley Lab) Simple example: Aggregation query
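A minimal sketch of pushing an aggregation query into the network, assuming the motes form a routing tree rooted at a base station: each mote combines its own reading with its children's partial states and sends one (sum, count) pair upward, so the number of messages equals the number of tree edges. The function `aggregate_avg` and the dictionary-based tree are hypothetical; radio loss and timing are ignored.

```python
def aggregate_avg(tree, readings, root):
    """Compute AVG over all motes by merging partial states up a tree.

    tree: dict mapping a node to the list of its children.
    readings: dict mapping every node to its sensor reading.
    Returns (average, messages_sent).
    """
    messages = 0

    def partial(node):
        nonlocal messages
        total, count = readings[node], 1
        for child in tree.get(node, []):
            child_total, child_count = partial(child)
            messages += 1  # one small message per tree edge
            total += child_total
            count += child_count
        return total, count

    total, count = partial(root)
    return total / count, messages


if __name__ == "__main__":
    tree = {0: [1, 2], 1: [3, 4]}
    readings = {0: 10, 1: 20, 2: 30, 3: 40, 4: 50}
    print(aggregate_avg(tree, readings, root=0))
```

Sending raw readings to the root would cost one message per hop per reading; merging partial states keeps traffic at one message per mote, which matters when communication is the expensive resource.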

P2P QP Starting point: P2P as grassroots phenomenon Outrageous filesharing volume (1.8Gfiles in October 2001) No business case to date Challenge: scale DDBMS QP ideas to P2P Motivate why Pick the right parts of DBMS research to focus on Storage: no! QP: yes. Make it work: Scalability well beyond our usual target Admin constraints Unknown data distributions, load Heterogeneous comm/processing Partial failure Joint work with Scott Shenker, Ion Stoica, Matt Harren, Ryan Huebsch, Nick Lanham, Boon Thau Loo

A Grassroots Example: TeleNap

Themes Throughout Adaptivity Requires clever system design The Exchange model: encapsulate in ops? Interesting adaptive policy problems E.g. eddy routing, flux migration Control Theory, Machine Learning Encompasses another CS goal? “No-knobs”, “Autonomic”, etc. New performance regimes Decent performance in the common case Mean/Variance more important than MAX Interactive Metrics Time to completion often unimportant/irrelevant

More Themes Set-valued thinking as albatross? E.g. eddies vs. Kabra/DeWitt or Tukwila E.g. SteMs vs. Materialized Views E.g. CACQ vs. NiagaraCQ Some clean theory here would be nice Current routing correctness proofs are inelegant Extensibility Model/language of choice is not clear SEQ? Relational? XQuery? Extensible operators, edge semantics [A whine about VLDB’s absurd “Specificity Factor”]

Conclusions? Too early for technical conclusions Of this I’m sure: The CS262 experiment is a success Our students are getting a bigger picture than before I’m learning, finding new connections May morph to OS/Nets, Nets/DB Eventually rethink the systems software curriculum at the undergraduate level too Nets folks are coming our way Doing relevant work, eager to collaborate DB community needs to branch out Outbound: Better proselytizing in CS Inbound: Need new ideas

Conclusions, cont. Sabbatical is a good invention Hasn’t even started, I’m already grateful!