Telegraph Endeavour Retreat 2000 Joe Hellerstein.


Roadmap
Motivation & Goals
Application Scenarios
Quickie core technology overview
– Adaptive dataflow
– Event-based storage manager
– Come hear more about these tonight/tomorrow!
Status and Plans
– Dataflow infrastructure & apps
– Storage manager?

Motivations
Global Data Federation
– All the data is online: what are we waiting for?
– The plumbing is coming: XML/HTTP, XML/WAP, etc. give lowest-common-denominator communication, but how do you flow, summarize, query, and analyze data robustly over many sources in the wide area?
Ubiquitous computing: more than clients
– Sensors and their data feeds are key: smart dust, biomedical (MEMS) sensors; each consumer good records its (mis)use
– Disposable computing: video from surveillance cameras, broadcasts, etc.
A huge data flood is a'comin'!
– Will it capsize the good ship Endeavour?

Initial Telegraph Goals
Unify data access & dataflow apps
– Commercial wrappers exist for most infosources
– Most info-centric apps can be cast as dataflow
– The data flood needs a big dataflow manager!
– Goal: a robust, adaptive dataflow engine
Unify storage
– Currently lots of disparate data stores: databases, files, servers (and HTTP access on top of these)
– Goal: a single, clean storage manager that can serve DB records & semantics, files, and "semantics" (folders, calendars, etc.)

Challenge for Dataflow: Volatility!
Federated query processors
– A la Cohera, IBM DataJoiner
– No control over stats, performance, administration
Large cluster systems "scaling out"
– No control over "system balance"
User "CONTROL" of running dataflows
– Long-running dataflow apps are interactive
– No control over user interaction
Sensor nets
– No control over anything!
Telegraph: a dataflow engine for these environments

The Data Flood: Main Features
What does it look like?
– Never ends: interactivity is required, so we need online, controllable algorithms for all tasks!
– Big: data reduction/aggregation is key
– Volatile: this scale of devices and nets will not behave nicely

The Telegraph Dataflow Engine
Key technologies
– Interactive control: interactivity with early answers and examples; online aggregation for data reduction
– Dataflow programming via paths/iterators: elevate query processing frameworks out of DBMSs. The long tradition of static optimization here is suggestive, but not sufficient for volatile environments.
– Continuously adaptive flow optimization: massively parallel, adaptive dataflow (Rivers and Eddies)
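The "online aggregation" idea above can be sketched with a toy running estimator (illustrative Python, not Telegraph's Java implementation): a mean computed over a stream, reported after every tuple together with a CLT-style confidence interval that tightens as more data arrives.

```python
import math
import random

def online_avg(stream, z=1.96):
    """Running mean over a stream with a CLT-style 95% confidence
    interval; yields (estimate, half_width) after every tuple."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in stream:
        n += 1
        total += x
        total_sq += x * x
        mean = total / n
        if n > 1:
            var = max(total_sq - n * mean * mean, 0.0) / (n - 1)
            half_width = z * math.sqrt(var / n)
        else:
            half_width = float("inf")
        yield mean, half_width

random.seed(0)
stream = (random.gauss(50, 10) for _ in range(10000))
estimates = list(online_avg(stream))
```

The point is the interaction model: a user watching the interval shrink can stop the query early, which a batch aggregate cannot offer.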

Static Query Plans
Volatile environments like sensors need to adapt at a much finer grain.

Continuous Adaptivity: Eddies
How to order and reorder operators over time, based on performance and economic/admin feedback.
Vs. River:
– River optimizes each operator "horizontally"
– Eddies optimize a pipeline "vertically"
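The per-tuple routing idea can be illustrated with a toy eddy over selection operators (illustrative Python; the operator names and the simple reward rule are inventions here, loosely modeled on the lottery-style policy of the SIGMOD 2000 paper): an operator earns tickets whenever it drops a tuple, so highly selective operators drift to the front of the ordering as tuples flow.

```python
import random

def is_even(t):
    return t % 2 == 0

def is_small(t):
    return t < 10

def eddy(tuples, operators):
    """Route each tuple through selection operators in an adaptive order.
    An operator earns a 'ticket' whenever it rejects a tuple, so the
    ordering shifts toward the observed most-selective operators."""
    tickets = {op.__name__: 1 for op in operators}
    survivors = []
    for t in tuples:
        pending = list(operators)
        alive = True
        while pending and alive:
            # lottery scheduling: pick among pending ops by ticket count
            weights = [tickets[op.__name__] for op in pending]
            op = random.choices(pending, weights=weights)[0]
            pending.remove(op)
            if not op(t):          # tuple rejected: reward this operator
                tickets[op.__name__] += 1
                alive = False
        if alive:
            survivors.append(t)
    return survivors

random.seed(1)
result = eddy(range(100), [is_even, is_small])
```

Because the filters are conjunctive, the output is the same for any routing order; only the work done per tuple changes, which is exactly the knob an eddy tunes.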

Unifying Storage
Storage management is buried inside specific systems. Elevate and expose the core services & semantic options:
– Layout/indexing
– Concurrent access/modification
– Recovery
Design for clustered environments:
– Replicate for reliability (tie-ins with Ninja)
– Cluster options: your RAM vs. my disk
– Events & state machines for scalability
Unify eventflow and dataflow? Share optimization lessons?

Status: Adaptive Dataflow
Initial Eddy results are promising and well received (SIGMOD 2000)
Finishing Telegraph v0 in Java/Jaguar
– Prototype now running
Demo service to go live on the web this summer
– Analysis queries over web sites; we've picked a provocative app to go live with (stay tuned!)
– Incorporates the Ninja "path" project for caching
Goal: Telegraph is to "facts and figures" as search engines are to "documents"
Longer-term goals:
– Formalize & optimize Eddy/River scheduling policies
– Study HCI/systems/stats issues for interaction
– Crawl the "dark matter" on the web
– Attack streams from sensors: sequence queries and mining, data reduction, browsing, etc.

Status: Unified Storage Manager
Prototype implementation in Java/Jaguar
– ACID transactions + (non-ACID) Java file access
– Robust enough to get TPC-W numbers
– Events/states vs. threads: echoes the Gribble/Welsh results (better than threaded under load, but Java complicates detailed measurement)
Time to re-evaluate the importance of this part
– Interest? More mindshare is in dataflow infrastructure.
– Vs. tuning an off-the-shelf solution (e.g. Berkeley DB)?
– Goal? Unified lessons about dataflow/eventflow optimization on clusters.
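The "events/states vs. threads" contrast can be sketched as a single-threaded staged event loop (a minimal illustration in Python; the handler names and two-stage request shape are invented): each request advances through handler stages via a queue instead of blocking a dedicated thread per request.

```python
from collections import deque

def run_event_loop(events, handlers):
    """A minimal staged event loop: one loop pulls events off a queue
    and dispatches each to a small state-machine handler, which may
    enqueue follow-up events for the request's next stage."""
    queue = deque(events)
    log = []
    while queue:
        kind, payload = queue.popleft()
        queue.extend(handlers[kind](payload, log))
    return log

def on_read(payload, log):
    log.append(("read", payload))
    return [("commit", payload)]   # advance this request to its next stage

def on_commit(payload, log):
    log.append(("commit", payload))
    return []                      # request finished

log = run_event_loop([("read", 1), ("read", 2)],
                     {"read": on_read, "commit": on_commit})
```

Under heavy load the queue simply grows, rather than the system paying a context switch and a stack per in-flight request, which is the effect the Gribble/Welsh comparison measures.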

Integration with the Rest of Endeavour
Give
– Be the dataflow backbone for diverse "clients": our own Telegraph apps (federated dataflow, sensors); a replication/delivery dataflow engine for OceanStore; scalable infrastructure for tacit-info mining algorithms?; pipes for the next version of Iceberg?
– Telegraph Storage Manager provides storage (transactional or otherwise) for OceanStore? Ninja?
Take
– OceanStore to manage distributed metadata, security
– Leverage protocols out of TinyOS for sensors
– Partner with Ninja to manage local metadata?
– Work with GUIR on interacting with streams?

More Info
People:
– Joe Hellerstein, Mike Franklin, Eric Brewer, Christos Papadimitriou
– Sirish Chandrasekaran, Amol Deshpande, Kris Hildrum, Sam Madden, Vijayshankar Raman, Mehul Shah
Software (coming soon):
– ABC interactive data analysis/cleansing at
Papers:
– See

Extra slides for backup

Connectivity & Heterogeneity
Lots of folks are working on data format translation and parsing
– We will borrow, not build
– Currently using JDBC & Cohera Net Query, a commercial tool donated by Cohera Corp. that gateways XML/HTML (via HTTP) to ODBC/JDBC
– We may write "Teletalk" gateways from sensors
Heterogeneity
– Never a simple problem
– The Control project developed an interactive, online data transformation tool: ABC

CONTROL: Continuous Output and Navigation Technology with Refinement On Line
Data-intensive jobs are long-running. How to give early answers and interactivity?
– Online interactivity over feeds: pipelining "online" operators, the data "juggle"
– Online data correlation algorithms: ripple joins, online mining and aggregation
– Statistical estimators and their performance implications: deliver data to satisfy statistical goals
– Appreciate the interplay of massive data processing, stats, and HCI
"Of all men's miseries, the bitterest is this: to know so much and to have control over nothing." – Herodotus
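The ripple join mentioned above can be sketched as follows (a simplified square ripple in Python; the real algorithm also adapts its aspect ratio and drives statistical estimators, which this sketch omits): both inputs are swept incrementally, each new tuple joining against everything seen so far on the other side, so join output appears early instead of after a full scan of either input.

```python
def ripple_join(r, s, pred):
    """Square ripple join over two in-memory lists: expand both
    'seen' frontiers one tuple at a time, emitting each matching
    pair exactly once as soon as both of its tuples have arrived."""
    seen_r, seen_s = [], []
    for i in range(max(len(r), len(s))):
        if i < len(r):
            x = r[i]
            for y in seen_s:       # new r-tuple vs. all seen s-tuples
                if pred(x, y):
                    yield (x, y)
            seen_r.append(x)
        if i < len(s):
            y = s[i]
            for x in seen_r:       # new s-tuple vs. all seen r-tuples
                if pred(x, y):
                    yield (x, y)
            seen_s.append(y)

pairs = list(ripple_join([1, 2, 3], [2, 3, 4], lambda x, y: x == y))
```

A running aggregate layered over this stream of pairs is what lets CONTROL report an estimate of the join result long before either input is exhausted.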

Performance Regime for CONTROL
New "greedy" performance regime:
– Maximize the 1st derivative of the user-happiness function
(Chart: percent complete vs. time, contrasting CONTROL's steep early progress with a traditional batch plan that delivers 100% only at the end)

CONTROL: Continuous Output and Navigation Technology with Refinement On Line

River
We built the world's fastest sorting machine
– On the "NOW": 100 Sun workstations + a SAN
– But it only beat the record under ideal conditions!
River: performance adaptivity for dataflows on clusters
– Simplifies management and programming
– Perfect for sensor-based streams

Declarative Dataflow: NOT New
Database systems have been doing this for years
– Translate declarative queries into an efficient dataflow plan
– "Query optimization" considers: alternate data sources ("access methods"), alternate implementations of operators, multiple orders of operators
– A space of alternatives is defined by transformation rules; estimate costs and "data rates", then search the space
But in a very static way!
– Gather statistics once a week
– Optimize the query at submission time
– Run a fixed plan for the life of the query
And these ideas are ripe to elevate out of DBMSs
– Outside of DBMSs, the world is very volatile
– There are surely going to be lessons "outside the box"
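The static cost-based search described above can be miniaturized (a toy optimizer in Python for a pipeline of independent filters; the selectivity and cost numbers are invented): enumerate orderings, cost each one by the fraction of tuples that reach each operator, and commit to the cheapest plan once, before execution begins, which is exactly the rigidity Eddies attack.

```python
from itertools import permutations

def best_filter_order(selectivities, costs):
    """Enumerate every ordering of independent filters and return the
    cheapest: each filter pays its per-tuple cost on the fraction of
    tuples that survived the filters placed before it."""
    def pipeline_cost(order):
        surviving, total = 1.0, 0.0
        for op in order:
            total += surviving * costs[op]
            surviving *= selectivities[op]
        return total
    return min(permutations(selectivities), key=pipeline_cost)

# the more selective filter "a" (passes 10%) should be scheduled first
order = best_filter_order({"a": 0.1, "b": 0.5}, {"a": 1.0, "b": 1.0})
```

If the true selectivities drift after optimization, this plan is never revisited; a runtime reordering scheme keeps making this choice tuple by tuple instead.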

Static Query Plans
Volatile environments like sensors need to adapt at a much finer grain.

Continuous Adaptivity: Eddies
How to order and reorder operators over time, based on performance and economic/admin feedback.
Vs. River:
– River optimizes each operator "horizontally"
– Eddies optimize a pipeline "vertically"

Competitive Eddies
(Diagram: an Eddy routing tuples from relations R1, R2, R3 and S1, S2, S3 through competing join implementations: hash, block, index1, index2)

Potter’s Wheel Anomaly Detection

The Data Flood is Real
Source: J. Porter, Disk/Trend, Inc.

Disk Appetite, cont.
Greg Papadopoulos, CTO of Sun:
– Disk sales are doubling every 9 months
– Note: that only counts the data we're saving!
Translation:
– The time to process all your data doubles every 18 months
– Moore's Law, inverted! (And Moore's Law may run out in the next couple of decades?)
A big challenge (opportunity?) for software systems research
– Traditional scalability research won't help: "ideal" linear scaleup is NOT NEARLY ENOUGH!
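A quick sanity check of that arithmetic (Python; the 9- and 18-month doubling rates come from the slide): over a horizon of t months, stored data grows as 2^(t/9) while compute per dollar grows as 2^(t/18), so the time to scan everything grows as their ratio, 2^(t/18), i.e. it doubles every 18 months.

```python
def scan_time_multiplier(months, data_double=9, compute_double=18):
    """How much longer a full scan of 'all your data' takes after the
    given number of months, if data doubles every 9 months but compute
    per dollar doubles only every 18."""
    data_growth = 2 ** (months / data_double)
    compute_growth = 2 ** (months / compute_double)
    return data_growth / compute_growth
```

Linear scaleup only cancels the compute term; the residual 2^(t/18) growth in scan time is why the slide calls for data reduction rather than more hardware.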

Data Volume: Prognostications
Today
– SwipeStream: e.g., Wal-Mart's 24 TB data warehouse
– ClickStream
– Web: the Internet Archive (?? TB)
– Replicated OS/apps
Tomorrow
– Sensors galore: DARPA/Berkeley "Smart Dust" (temperature, light, humidity, pressure, accelerometer, magnetics)
Note: the privacy issues only get more complex, both technically and ethically!

Explaining Disk Appetite
Areal density increases 60%/yr, yet MB/$ rises much faster!
Source: J. Porter, Disk/Trend, Inc.