IncApprox: The marriage of incremental and approximate computing. Pramod Bhatotia, Dhanya Krishnan, Do Le Quoc, Christof Fetzer, Rodrigo Rodrigues* (TU Dresden & *IST Lisbon)

Presentation transcript:

IncApprox: The marriage of incremental and approximate computing. Pramod Bhatotia, Dhanya Krishnan, Do Le Quoc, Christof Fetzer, Rodrigo Rodrigues* (TU Dresden & *IST Lisbon)

2 Raw data → Data analytics systems → Information

3 Big data systems: massive scale, low latency, high throughput

4 Tension: low latency vs. high throughput. To strike a balance: “novel” computing paradigms

5 Observation: Compute over a sub-set of data items instead of the entire data-set! This takes less time and fewer resources for computation. How do these computing paradigms make this trade-off?

6 Two such computing paradigms: Inc (incremental computing) and Approx (approximate computing)

7 Incremental computation. Common workflow: Rerun the same application over evolving input. Application: small changed input → incrementally updated output. Incremental updates: Reuse memoized parts of the computation that are unaffected by the changed input.
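
The reuse of memoized results can be illustrated with a minimal Python sketch. It is not IncApprox's implementation: the function name `incremental_map_sum`, the `memo` dictionary, and the squaring workload are all hypothetical, chosen only to show that a rerun over a slightly changed input evaluates just the new item.

```python
def incremental_map_sum(items, fn, memo):
    """Sum fn(item) over items, reusing results already stored in memo.

    Returns (total, number_of_fresh_evaluations). Hypothetical helper,
    used only to illustrate memoization-based incremental updates.
    """
    total = 0
    fresh = 0
    for item in items:
        if item not in memo:
            # Only items not seen in earlier runs are actually computed.
            memo[item] = fn(item)
            fresh += 1
        total += memo[item]
    return total, fresh


def square(x):
    return x * x


memo = {}
# First run: nothing is memoized, so all four items are computed.
first_total, first_fresh = incremental_map_sum([1, 2, 3, 4], square, memo)
# Second run over a slightly changed input: only the new item 5 is computed.
second_total, second_fresh = incremental_map_sum([2, 3, 4, 5], square, memo)
print(first_total, first_fresh, second_total, second_fresh)
```

In the second run three of the four results come from the memo, which is the saving the slide refers to.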

8 Approximate computation. Common use-case: Approximate output is good enough! Application: input → approximate output. Approximate output: Compute only parts of the input selected by representative sampling.
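
As a rough illustration of computing over a sample instead of the full input, the Python sketch below estimates a sum from a uniform random sample. The function `approximate_sum` and the synthetic data are assumptions made for this example; the slides do not specify how IncApprox scales sampled results.

```python
import random


def approximate_sum(data, sample_size, rng):
    """Estimate sum(data) from a uniform sample drawn without replacement.

    Hypothetical helper: the sample sum is scaled by the inverse of the
    sampling fraction to estimate the total.
    """
    sample = rng.sample(data, sample_size)
    return sum(sample) * len(data) / sample_size


data = list(range(1000))
rng = random.Random(42)  # fixed seed so that the run is repeatable
estimate = approximate_sum(data, 100, rng)  # touches only 10% of the input
exact = sum(data)
print(f"estimate={estimate:.0f} exact={exact}")
```

With a sample size equal to the input size the estimate equals the exact sum; smaller samples trade accuracy for less work.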

9 Basic idea: Both paradigms compute over a sub-set of data items! Incremental computation: the sub-set affected by the changed input. Approximate computation: the sub-set selected by the input sampling. IncApprox uses biased sampling: Select input items for which we already have memoized results from previous runs.

10 Outline: Motivation, Design, Evaluation

11 Overview of IncApprox. Inputs: input data stream, streaming query, query budget (latency or resource constraints). IncApprox = incremental computing + approximate computing. Output: approximate output. The query budget provides an adaptive execution interface to systematically tune between latency and throughput!

12 Computation model: “Batched stream processing”. For each sliding window over the input data stream, run a data-parallel job: input → map tasks (M) → reduce tasks (R) → output.
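
The batched model can be sketched in a few lines of Python. This is a single-process toy, not Spark Streaming code: `sliding_windows`, `word_count_job`, and the sample batches are hypothetical names and data, and the "map" and "reduce" phases are plain loops standing in for distributed tasks.

```python
from collections import Counter


def sliding_windows(batches, window_len, slide):
    """Yield successive windows, each a list of window_len consecutive batches."""
    for start in range(0, len(batches) - window_len + 1, slide):
        yield batches[start:start + window_len]


def word_count_job(window):
    """Toy data-parallel job run once per window."""
    # "Map" phase: each batch produces its own partial count.
    partial_counts = [Counter(batch) for batch in window]
    # "Reduce" phase: merge the partial counts into one result.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


batches = [["a", "b"], ["b"], ["c", "a"], ["a"]]
results = [
    dict(word_count_job(window))
    for window in sliding_windows(batches, window_len=2, slide=1)
]
print(results)
```

With a window of two batches sliding by one batch, consecutive windows share one batch, which is the overlap that later slides exploit for reuse.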

13 High-level approach. Computation window (input) → Step #1: Stratified sampling → Step #2: Biased sampling → Step #3: Run job incrementally → Approximate output

14 #1: Stratified sampling

15 #1: Why stratified sampling? A stream aggregator (Kafka) combines sub-streams S1, S2, …, Sn into the input stream of the stream processing system. Sub-streams: disparate events with different distributions and different arrival rates. Need proportional allocation of data-items for all sub-streams.

16 #1: Stratified sampling in IncApprox. Sub-streams S1, S2, …, Sn pass through the stream aggregator (Kafka) into IncApprox. Given the query budget and the resulting sample size, IncApprox applies stratified reservoir sampling to the computation window of the input stream (see the paper for details).
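
The slides defer the actual algorithm to the paper, so the following Python sketch shows only a generic combination of the two ingredients named here: proportional allocation across sub-streams and classic reservoir sampling (Algorithm R) within each one. The function names, the `(sub_stream, value)` tuple layout, and the simplification of grouping a finite window before sampling are all assumptions of this example.

```python
import random
from collections import defaultdict


def reservoir_sample(items, k, rng):
    """Algorithm R: uniform sample of up to k items in a single pass."""
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            reservoir.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def stratified_sample(window, sample_size, rng):
    """Sample each sub-stream in proportion to its share of the window.

    `window` is a list of (sub_stream, value) pairs. Hypothetical sketch:
    the window is grouped first so that per-stratum counts are known.
    """
    strata = defaultdict(list)
    for sub_stream, value in window:
        strata[sub_stream].append(value)
    sample = {}
    for sub_stream, values in strata.items():
        # Proportional allocation, with at least one item per sub-stream.
        share = max(1, round(sample_size * len(values) / len(window)))
        sample[sub_stream] = reservoir_sample(values, share, rng)
    return sample


# S1 arrives four times as fast as S2 within this window.
window = [("S1", v) for v in range(80)] + [("S2", v) for v in range(1000, 1020)]
sample = stratified_sample(window, sample_size=10, rng=random.Random(7))
print({name: len(values) for name, values in sample.items()})
```

Here the slower sub-stream S2 still contributes two of the ten sampled items, so it is not drowned out by S1.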

17 #2: Biased sampling

18 #2: Why biased sampling? The windows over the input data stream at T1 and T2 overlap. Successive overlapping computation windows provide an opportunity to reuse results.

19 #2: Biased sampling in IncApprox. Overlapping windows (T1, T2) with fluctuating arrival rates; “adaptive” budget / sample size. Biased sampling (see the paper for details).
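
Since the details are left to the paper, the Python sketch below only illustrates the stated idea of preferring items that already have memoized results. The function `biased_sample`, its arguments, and the example windows are hypothetical, and the sketch does not address how the bias is accounted for in the error estimate.

```python
import random


def biased_sample(window_items, memoized, sample_size, rng):
    """Prefer items with memoized results, then fill up with fresh items.

    Hypothetical sketch: `memoized` is the set of items sampled (and
    computed) in the previous window.
    """
    reusable = [item for item in window_items if item in memoized]
    fresh = [item for item in window_items if item not in memoized]
    # Take as many reusable items as the sample size allows.
    chosen = reusable[:sample_size]
    remaining = sample_size - len(chosen)
    if remaining > 0:
        # Fill the rest of the sample uniformly from items never computed.
        chosen += rng.sample(fresh, min(remaining, len(fresh)))
    return chosen


memoized_at_t1 = {3, 4, 5, 6}       # items sampled in the window at T1
window_at_t2 = list(range(4, 12))   # window at T2; item 3 has slid out
chosen = biased_sample(window_at_t2, memoized_at_t1, 5, random.Random(1))
print(chosen)
```

Three of the five sampled items (4, 5, 6) already have results from T1, so only two items need fresh computation.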

20 #3: Run job incrementally

21 #3: Why incremental run? The computation window contains both old and new data-items. To reuse results: Design and implement “dynamic algorithms”. Need for an automatic and efficient mechanism to incrementally update the output.

22 #3: Incremental run in IncApprox. Self-adjusting computation (see the paper for details): a dependence graph over the window's map (M) and reduce (R) tasks; on a change in a data item, change propagation follows the graph to the affected M and R tasks.
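
Change propagation over a dependence graph can be sketched as a simple reachability walk. The Python below is a toy, not the self-adjusting computation machinery from the paper: the graph, the node names (`x1`, `M1`, `R1`, `out`), and the function `propagate_change` are hypothetical.

```python
def propagate_change(changed_inputs, edges):
    """Return the tasks that must be re-executed after some inputs change.

    `edges` maps each node to the nodes that depend on it. Every task
    reachable from a changed input is marked dirty; all other tasks can
    reuse their memoized results.
    """
    dirty = set()
    stack = list(changed_inputs)
    while stack:
        node = stack.pop()
        for dependent in edges.get(node, ()):
            if dependent not in dirty:
                dirty.add(dependent)
                stack.append(dependent)
    return dirty


# Data items feed map tasks (M), which feed reduce tasks (R), which feed the output.
edges = {
    "x1": ["M1"], "x2": ["M2"], "x3": ["M3"],
    "M1": ["R1"], "M2": ["R1"], "M3": ["R2"],
    "R1": ["out"], "R2": ["out"],
}
all_tasks = {"M1", "M2", "M3", "R1", "R2", "out"}
dirty = propagate_change(["x3"], edges)
reused = all_tasks - dirty
print(sorted(dirty), sorted(reused))
```

A change to `x3` dirties only `M3`, `R2`, and `out`; the results of `M1`, `M2`, and `R1` are reused.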

23 Outline: Motivation, Design, Evaluation

24 Evaluation. Performance gains of IncApprox: 1. Twitter stream analytics 2. Network monitoring. Implementation: Apache Spark Streaming. Platform: 24-node distributed computing cluster. See the paper for more results!

25 Performance gains (higher is better): 2X over native Spark Streaming; 1.4X over individual Inc & Approx modules

26 Summary: IncApprox. A data analytics system for incremental approximate computing. Transparent: Targets existing applications w/o any code changes. Practical: Supports adaptive execution based on the query budget. Efficient: Employs a mix of Inc & Approx computing paradigms.

27 IncApprox: Transparent + Practical + Efficient. IncApprox also provides error estimation: approximate output = output ± error-estimate. See the paper for details! Thank you!
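
One way to picture "output ± error-estimate" is the textbook estimator for a total under stratified sampling, sketched below in Python. This is not taken from the IncApprox paper: the function `stratified_sum_estimate`, the normal-approximation multiplier `z=1.96`, and the sample values are assumptions of this example.

```python
import math
from statistics import mean, variance


def stratified_sum_estimate(samples, sizes, z=1.96):
    """Estimate a total and its error bound from per-stratum samples.

    `samples` maps stratum -> sampled values; `sizes` maps stratum -> number
    of items in the full window. Uses the standard stratified estimator with
    a finite population correction; z=1.96 gives roughly a 95% interval
    under a normal approximation (an assumption of this sketch).
    """
    estimate = 0.0
    var = 0.0
    for stratum, values in samples.items():
        population = sizes[stratum]
        n = len(values)
        estimate += population * mean(values)
        if n > 1:
            # No sampling error remains once the whole stratum is sampled.
            correction = 1 - n / population
            var += population ** 2 * correction * variance(values) / n
    return estimate, z * math.sqrt(var)


samples = {"S1": [10, 12, 14], "S2": [100, 100]}
sizes = {"S1": 30, "S2": 2}
output, error = stratified_sum_estimate(samples, sizes)
print(f"approximate output = {output:.0f} ± {error:.1f}")
```

Sub-stream S2 is fully sampled here, so it contributes nothing to the error term; all of the uncertainty comes from S1.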