Telegraph: Continuously Adaptive Dataflow
Joe Hellerstein

Scenarios
Ubiquitous computing: more than clients
  – sensors and their data feeds are key
      smart dust, biomedical (MEMS sensors)
      each consumer good records (mis)use
  – disposable computing
      video from surveillance cameras, broadcasts, etc.
Global Data Federation
  – all the data is online – what are we waiting for?
  – The plumbing is coming
      XML/HTTP, etc. give LCD communication
      but how do you flow, summarize, query and analyze data robustly over many sources in the wide area?

Dataflow in Volatile Environments
Federated query processors a reality
  – Cohera, IBM DataJoiner
  – No control over stats, performance, administration
Large Cluster Systems “Scaling Out”
  – No control over “system balance”
User “CONTROL” of running dataflows
  – Long-running dataflow apps are interactive
  – No control over user interaction
Sensor Nets: the next killer app
  – E.g. “Smart Dust”
  – No control over anything!
Telegraph
  – Dataflow Engine for these environments

Data Flood: Main Features
What does it look like?
  – Never ends: interactivity required
      Online, controllable algorithms for all tasks!
  – Big: data reduction/aggregation is key
  – Volatile: this scale of devices and nets will not behave nicely

The Telegraph Dataflow Engine
Key technologies
  – Interactive Control
      interactivity with early answers and examples
      online aggregation for data reduction
  – Dataflow programming via paths/iterators
      Elevate query processing frameworks out of DBMSs
      Long tradition of static optimization here
        – Suggestive, but not sufficient for volatile environments
  – Continuously adaptive flow optimization
      massively parallel, adaptive dataflow via Rivers and Eddies

CONTROL
Continuous Output and Navigation Technology with Refinement On Line
Data-intensive jobs are long-running. How to give early answers and interactivity?
  – online interactivity over feeds
      pipelining “online” operators, data “juggle”
  – online data correlation algs: ripple joins, online mining and aggregation
  – statistical estimators, and their performance implications
      Deliver data to satisfy statistical goals
Appreciate interplay of massive data processing, stats, and HCI
“Of all men's miseries, the bitterest is this: to know so much and have control over nothing” –Herodotus
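
The idea of early answers with statistical estimators can be illustrated with a minimal sketch of online aggregation: a running mean that reports a confidence interval after every row, so a consumer can stop as soon as the estimate is tight enough. This is an illustration under stated assumptions, not the CONTROL implementation: the function names `online_mean` and `estimate_until`, the `min_rows` guard, and the normal-approximation interval are all hypothetical choices, and the interval is only meaningful if rows arrive in random order.

```python
import math
import random


def online_mean(stream, z=1.96):
    """Yield (rows_seen, running_mean, half_width) after each value.

    half_width is z * s / sqrt(n), a normal-approximation interval.
    Assumption of this sketch: values arrive in random order.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's update)
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            std = math.sqrt(m2 / (n - 1))
            half_width = z * std / math.sqrt(n)
        else:
            half_width = float("inf")  # no spread estimate from one row
        yield n, mean, half_width


def estimate_until(stream, target_half_width, min_rows=30):
    """Consume the stream only until the interval meets the caller's goal."""
    result = None
    for result in online_mean(stream):
        n, _, half_width = result
        if n >= min_rows and half_width <= target_half_width:
            break
    return result


if __name__ == "__main__":
    rng = random.Random(0)
    feed = (rng.gauss(50.0, 10.0) for _ in range(1_000_000))
    n, mean, hw = estimate_until(feed, target_half_width=0.5)
    print(f"after {n} rows: mean ~ {mean:.2f} +/- {hw:.2f}")
```

Because `online_mean` is a generator, the caller sees a refined estimate after every row and decides when to stop, which is the interaction pattern the slide asks for; a blocking aggregate would return nothing until the whole feed had been read.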

Performance Regime for CONTROL
New “Greedy” Performance Regime
  – Maximize 1st derivative of the user-happiness function
[Figure: curves labeled “CONTROL” and “Traditional”; axis labels “Time” and “100%”]

CONTROL Continuous Output and Navigation Technology with Refinement On Line

Potter’s Wheel Anomaly Detection

River
We built the world’s fastest sorting machine
  – On the “NOW”: 100 Sun workstations + SAN
  – But it only beat the record under ideal conditions!
River: performance adaptivity for data flows on clusters
  – simplifies management and programming
  – perfect for sensor-based streams

Declarative Dataflow: NOT new
Database Systems have been doing this for years
  – Xlate declarative queries into an efficient dataflow plan
  – “query optimization” considers:
      Alternate data sources (“access methods”)
      Alternate implementations of operators
      Multiple orders of operators
      A space of alternatives defined by transformation rules
      Estimate costs and “data rates”, then search space
But in a very static way!
  – Gather statistics once a week
  – Optimize query at submission time
  – Run a fixed plan for the life of the query
And these ideas are ripe to elevate out of DBMSs
  – And outside of DBMSs, the world is very volatile
  – There are surely going to be lessons “outside the box”
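
The "estimate costs, then search the space" step and its static weakness can be sketched with a toy optimizer that orders commutative filters. Everything here is an assumption made for illustration: the names `plan_cost` and `optimize`, the cost model (per-tuple cost weighted by the fraction of tuples still alive), and the statistics themselves are hypothetical, and real optimizers search a far larger space than operator orderings.

```python
from itertools import permutations


def plan_cost(order, stats):
    """Expected work per input tuple when filters run in the given order.

    stats maps filter name -> (per_tuple_cost, selectivity), where
    selectivity is the estimated fraction of tuples that pass.
    """
    cost = 0.0
    surviving = 1.0  # fraction of tuples that reach the next filter
    for name in order:
        per_tuple_cost, selectivity = stats[name]
        cost += surviving * per_tuple_cost
        surviving *= selectivity
    return cost


def optimize(stats):
    """Exhaustively search all orderings and keep the cheapest (toy scale only)."""
    return min(permutations(sorted(stats)), key=lambda order: plan_cost(order, stats))


if __name__ == "__main__":
    # Statistics gathered once, at "submission time" (hypothetical numbers).
    gathered = {"a": (1.0, 0.9), "b": (1.0, 0.1)}
    fixed_plan = optimize(gathered)

    # Later the data drifts, but the fixed plan keeps running.
    drifted = {"a": (1.0, 0.1), "b": (1.0, 0.9)}
    print("fixed plan:", fixed_plan)
    print("cost of fixed plan after drift:", plan_cost(fixed_plan, drifted))
    print("cost of re-optimized plan:", plan_cost(optimize(drifted), drifted))
```

The plan chosen from the gathered statistics is the best one only while those statistics hold; once selectivities drift, the same fixed order does more work per tuple than a re-optimized one, which is the gap the slide attributes to static optimization.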

Static Query Plans
Volatile environments like sensors need to adapt at a much finer grain

Continuous Adaptivity: Eddies
How to order and reorder operators over time
  – based on performance, economic/admin feedback
Vs. River:
  – River optimizes each operator “horizontally”
  – Eddies optimize a pipeline “vertically”
[Figure: Eddy]
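
Reordering operators over time can be sketched as a router that decides, tuple by tuple, which remaining operator to try next, using feedback observed so far. This is a toy under stated assumptions, not the Telegraph code: the class name `Eddy`, the `explore` parameter, and the routing policy (prefer the operator with the lowest observed pass rate, with occasional random exploration) are hypothetical choices for illustration, and the operators are restricted to simple filters.

```python
import random


class Eddy:
    """Toy router: each tuple visits every filter, in an order chosen per tuple."""

    def __init__(self, operators, explore=0.1, seed=0):
        self.operators = operators  # name -> predicate(tuple) -> bool
        # Smoothed counters so pass rates are defined before any feedback.
        self.seen = {name: 1 for name in operators}
        self.passed = {name: 1 for name in operators}
        self.explore = explore  # chance of ignoring the current estimates
        self.rng = random.Random(seed)
        self.calls = 0  # total predicate evaluations, a proxy for work done

    def _pick(self, remaining):
        # Occasionally explore so stale estimates get corrected.
        if self.rng.random() < self.explore:
            return self.rng.choice(remaining)
        # Otherwise try the filter that has been dropping the most tuples.
        return min(remaining, key=lambda n: self.passed[n] / self.seen[n])

    def process(self, tup):
        remaining = sorted(self.operators)
        while remaining:
            name = self._pick(remaining)
            remaining.remove(name)
            self.seen[name] += 1
            self.calls += 1
            if not self.operators[name](tup):
                return False  # dropped: later filters are never evaluated
            self.passed[name] += 1
        return True

    def run(self, stream):
        return [t for t in stream if self.process(t)]


if __name__ == "__main__":
    eddy = Eddy({"even": lambda x: x % 2 == 0, "small": lambda x: x < 10})
    print(eddy.run(range(1000)), "with", eddy.calls, "predicate calls")
```

The output is the same whatever order is chosen, since the filters commute; only the amount of work changes. There is no plan fixed at submission time: if the selective filter stops being selective partway through the stream, the observed pass rates shift and the routing follows.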

Competitive Eddies
[Figure: an Eddy with inputs R1, R2, R3 and S1, S2, S3, and operators labeled hash, block, index1, index2]

Telegraph: Putting it Together
Scalable, adaptive dataflow infrastructure. Apps include…
  – sensor nets
  – massively parallel and wide-area query engines
  – net appliances: chaining xform8n/aggreg8n/compression/etc. in proxies
  – any volatile dataflow scenario
Technology: a marriage of…
  – CONTROL, Rivers & Eddies
      Many research questions here
      E.g. how to combine River and Eddy adaptivity
      E.g. how to tune Eddies for statistical performance goals
  – Combinations of browse/query/mine at UI
  – Storage management to handle new hardware realities
Look for a live service this summer!

Integration with Endeavour
Give
  – Be data-intensive backbone to diverse clients
  – Be replication/delivery dataflow engine for OceanStore
  – Telegraph Storage Manager provides storage (xactional/otherwise) for OceanStore
  – Provide platform for data-intensive “tacit info mining”
Take
  – Leverage OceanStore to manage distributed metadata, security
  – Leverage protocols out of TinyOS for sensors

Connectivity & Heterogeneity
Lots of folks working on data format translation, parsing
  – we will borrow, not build
  – currently using JDBC & Cohera Net Query
      commercial tool, donated by Cohera Corp.
      gateways XML/HTML (via HTTP) to ODBC/JDBC
  – we may write “Teletalk” gateways from sensors
Heterogeneity
  – never a simple problem
  – CONTROL project developed interactive, online data transformation tool: ABC

More Info
Collaborators:
  – Mike Franklin, Eric Brewer, Christos Papadimitriou
  – Sirish Chandrasekaran, Amol Deshpande, Kris Hildrum, Sam Madden, Vijayshankar Raman, Mehul Shah
Me:
Web:
  –
  –

Extra slides for backup