Supporting Fault-Tolerance in Streaming Grid Applications

Presentation transcript:

Supporting Fault-Tolerance in Streaming Grid Applications
  Qian Zhu, Liang Chen, Gagan Agrawal
  Department of Computer Science and Engineering, The Ohio State University
  IPDPS 2008 Conference, April 15th, 2008, Miami, Florida

Data Streaming Applications
  Computational Steering
    Interactively control scientific simulations
  Computer Vision Based Surveillance
    Track people and monitor critical infrastructure
    Images captured by multiple cameras
  Online Network Intrusion Detection
    Analyze connection request logs
    Identify unusual patterns

Fault-Tolerance
  Definition
    The ability of a system to respond gracefully to an unexpected hardware or software failure
  Fault-Tolerance in Grid Applications
    Redundancy-based fault-tolerance
    Checkpointing-based fault-tolerance

Fault-Tolerance in Data Streaming Applications
  Fault-Tolerance is Important for Data Stream Processing
    Distributed data sources
    Pipelined real-time processing and long-running nature
    Frequent and large-volume data transfers
    Dynamic and unpredictable resource availability

Overview of GATES Middleware
  Distributed Data Stream Processing
  Automatic Resource Discovery
  To Achieve the Best Accuracy While Maintaining the Real-Time Constraints (Self-Adaptation Algorithm)
  Easy-To-Use (Java, XML, Web Services)
  Our previous work: HPDC04, SC06, IPDPS06

Outline
  Motivation and Introduction
  Overall Design for Fault-Tolerance
  Experimental Evaluation
  Related Work
  Conclusion

Overall Design for Fault-Tolerance
  Design Alternatives
    Redundancy-based
    Checkpointing-based
  Drawbacks
    Resource requirements
    Synchronization of states for all replicas
    Platform dependent
    Large-volume checkpoints

Our Proposed Approach
  Light-Weight Summary Structure (LSS)
    Locally updated each processing round
    Transferred to remote nodes
  Heartbeat-based Fault Detection
  Failure Recovery using LSS
  Other Issues and Enhancements
    Data Backup Buffer
    Efficient Resource Allocation Algorithm
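The slides name heartbeat-based fault detection without giving its details. A minimal sketch of the general idea follows, assuming each node reports periodically and is suspected once no report has arrived within a timeout. The class `HeartbeatMonitor`, its methods, and the timeout policy are hypothetical illustrations, not the GATES implementation.

```python
import time


class HeartbeatMonitor:
    """Hypothetical helper: remembers when each node last reported in."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s      # assumed policy: silence longer than this means failure
        self.clock = clock              # injectable clock so the sketch can be tested
        self.last_seen = {}             # node name -> time of last heartbeat

    def beat(self, node):
        """Record a heartbeat received from `node`."""
        self.last_seen[node] = self.clock()

    def failed_nodes(self):
        """Return the nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


# Usage with a fake clock: node "b" stops reporting after time 0.
fake_time = [0.0]
monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: fake_time[0])
monitor.beat("a")
monitor.beat("b")
fake_time[0] = 2.0
monitor.beat("a")
fake_time[0] = 4.0
suspected = monitor.failed_nodes()
```

In a real deployment the timeout would have to be chosen against network jitter; a value that is too small reports healthy nodes as failed.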

Definition of Light-weight Summary Structure (LSS)
  Data Stream Processing Structure
  Summary Information Accumulated Each Processing Loop Iteration
  A Small Memory Size

    ...
    while(true) {
        read_data_from_streams();
        process_data();
        accumulate_intermediate_results();
        reset_auxiliary_structures();
    }

LSS: An Example
  Application: Counting Samples
  [Diagram: Data Source; Computing the m most frequent numbers; Computing the 10 most frequent numbers; labels S, M, F]
  counting-lss:
    int: value of m
    int array: the m most frequent numbers
    int array: corresponding frequencies
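The three fields of counting-lss listed above can be sketched as a small record. This is an illustration only: the names `CountingLSS` and `summarize` are hypothetical, and the exact counting done here with `collections.Counter` stands in for whatever approximate counting the real count-samps application performs.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CountingLSS:
    """Hypothetical mirror of counting-lss: m, the m numbers, their frequencies."""
    m: int
    numbers: list = field(default_factory=list)
    frequencies: list = field(default_factory=list)


def summarize(counts, m):
    """Build the summary from full counts, keeping only the m most frequent numbers."""
    top = counts.most_common(m)                 # list of (number, frequency) pairs
    return CountingLSS(
        m=m,
        numbers=[number for number, _ in top],
        frequencies=[freq for _, freq in top],
    )


# Usage: summarize a tiny stream with m = 2.
lss = summarize(Counter([1, 1, 1, 2, 2, 3]), m=2)
```

The point of the structure is its size: it holds 2m + 1 integers regardless of how much data the application has seen.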

Using LSS for Fault-Tolerance
  Much Smaller Memory Size Than That of the Application
  Auxiliary Structures are reset at the end of each iteration
  Approximate Processing on Data Streams

Using LSS for Fault-Tolerance (cont'd)
  Compare LSS-based Fault-Tolerance to checkpointing in grid environments
    Much smaller memory size than that of the application
    A small amount of data is lost during failure recovery
    LSS is independent of platforms

GATES Implementation for Fault-Tolerance
  Application:
    // Initialize auxiliary structures
    initialize_auxiliary_structures();
    // Get an LSS instance from GATES
    counting-lss lss = GATES.getLSS("counting-lss");
    // Process streaming data
    while true
        // Check if input buffer is invalid
        if inBuffer.getInputBufferStatus() == INVALID
            // Stop processing
            then break;
        read_data_from_streams();
        process_data();
        accumulate_intermediate_results_to_LSS(lss);
        update_local_LSS(lss);
  GATES:
    // Monitor service
    if local LSS updated
        then send_LSS_to_Candidates(lss);
    // Replication service
    remote_store_LSS(lss);
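The pseudocode above can be approximated by a runnable sketch. Everything here is an assumption made for illustration: `CandidateNode` stands in for a remote node, a `None` batch stands in for the INVALID input buffer status, and trimming the summary to m entries is one simple way to keep it small at the cost of exactness. None of these names come from the GATES API.

```python
from collections import Counter


class CandidateNode:
    """Hypothetical stand-in for a remote node that keeps the latest LSS copy."""

    def __init__(self):
        self.stored_lss = None

    def store(self, lss):
        # Copy the summary, as a network transfer would.
        self.stored_lss = dict(lss)


def process_stream(batches, candidates, m):
    """Run the processing loop, replicating the LSS after every round."""
    lss = Counter()
    for batch in batches:
        if batch is None:                # assumed marker for an invalid input buffer
            break                        # stop processing
        aux = Counter(batch)             # auxiliary structure, rebuilt every round
        lss.update(aux)                  # accumulate intermediate results into the LSS
        lss = Counter(dict(lss.most_common(m)))  # keep only the m largest entries
        for node in candidates:          # send the updated LSS to candidate nodes
            node.store(lss)
    return lss


# Usage: two rounds are processed, then the buffer becomes invalid.
backup = CandidateNode()
final_lss = process_stream([[1, 1, 2], [2, 2, 3], None, [9, 9, 9, 9]], [backup], m=2)
```

After a failure, a replacement node could start from `backup.stored_lss` instead of from an empty state, which is why only the data since the last replication round is at risk.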

Failure Recovery Procedure

Other Issues and Enhancements
  Data Backup Buffer
    Data is stored in the backup buffer until acknowledgment is received
    Obsolete data in the backup buffer will be replaced
  Efficient Resource Allocation Algorithm
    Candidate nodes
    Dijkstra's shortest path algorithm
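The backup-buffer behaviour described above can be sketched as follows. The slides do not say how "obsolete" data is identified, so this sketch assumes the oldest unacknowledged item is the one replaced when the buffer is full. The class `BackupBuffer` and its method names are hypothetical.

```python
from collections import OrderedDict


class BackupBuffer:
    """Hypothetical sender-side buffer holding data until it is acknowledged."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = OrderedDict()     # sequence number -> data, oldest first

    def send(self, seq, data):
        """Keep a copy of outgoing data; evict the oldest entry if the buffer is full."""
        if len(self.pending) >= self.capacity:
            self.pending.popitem(last=False)   # assumed policy: oldest entry is obsolete
        self.pending[seq] = data

    def ack(self, seq):
        """Drop the copy once the receiver has acknowledged it."""
        self.pending.pop(seq, None)

    def unacknowledged(self):
        """Data that could be resent to a replacement node after a failure."""
        return list(self.pending.items())


# Usage: item 1 is acknowledged, item 2 is evicted when the buffer overflows.
buf = BackupBuffer(capacity=2)
buf.send(1, "a")
buf.send(2, "b")
buf.ack(1)
buf.send(3, "c")
buf.send(4, "d")
remaining = buf.unacknowledged()
```

The capacity bounds how much data can be replayed after a failure, which matches the slides' statement that a small amount of data may still be lost.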

Streaming Applications
  Counting Samples (count-samps)
    To determine the n most frequent numbers
    LSS: m most frequent numbers
  Clustering Evolving Streams (clustream)
    To group data into n clusters
    LSS: m micro-clusters
  Distributed Frequency Counting (dist-freq-counting)
    To find the most frequent itemset with threshold
    LSS: most frequent itemset with threshold

Goals for the Experiments
  Show that LSS Uses a Small Amount of Memory
  Evaluate the Overhead of LSS for Fault-Tolerance
  Show the Impact on Accuracy

Experiment Setup and Datasets
  64-Node Computing Cluster
  Simulate Different Inter-node Bandwidths
  Datasets
    count-samps: data generated by a simulator
    clustream: KDD-CUP'99 Network Intrusion Detection dataset
    dist-freq-counting: IBM synthetic data generator

Memory Usage of LSS

  Value of m   Size of LSS (KB)   Size of count-samps (KB)
  20           6                  954
  80           (missing)          1149
  160          36                 1432
  200          48                 1662

  LSS only occupied approximately 0.6%, 1.7%, 2.5% and 2.9%, respectively, of the memory used by the entire application
  LSS consumed 0.9% of the memory of the clustream application and 1.1% of that of the dist-freq-counting application
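The percentages quoted for count-samps can be checked from the sizes on the slide. The short calculation below uses only the three rows for which both sizes survive in the transcript (m = 20, 160 and 200); the function name `lss_share` is a hypothetical helper.

```python
def lss_share(lss_kb, app_kb):
    """Percentage of the application's memory taken up by the LSS."""
    return 100.0 * lss_kb / app_kb


# (value of m, size of LSS in KB, size of count-samps in KB), as listed on the slide
rows = [(20, 6, 954), (160, 36, 1432), (200, 48, 1662)]

shares = {m: round(lss_share(lss_kb, app_kb), 1) for m, lss_kb, app_kb in rows}
for m, share in shares.items():
    print(f"m = {m}: LSS is {share}% of the application's memory")
```

The results round to 0.6%, 2.5% and 2.9%, which agrees with the figures stated on the slide for those values of m.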

Using LSS for Fault-Tolerance: Performance
  [Figure: Execution Time of count-samps; annotated values: 4%, 7%, 10%]

Using LSS for Fault-Tolerance: Performance
  [Figure: Execution Time of clustream; annotated value: 2.5%]

Using LSS for Fault-Tolerance: Performance
  [Figure: Execution Time of dist-freq-counting; annotated value: 3.5%]

Using LSS for Fault-Tolerance: Accuracy
  [Figure: Accuracy of count-samps; annotated values: 1%, 6%]

Using LSS for Fault-Tolerance: Accuracy
  [Figure: Accuracy of clustream]

Related Work
  Application-Level Checkpointing
    Bronevetsky et al. (PPoPP03, ASPLOS04, SC06)
  Replication-based Fault Tolerance
    Abawajy et al. (IPDPS04), Murty et al. (HotDep06)
    Hwang et al. (ICDE05), Zheng et al. (Cluster04)
  Fault Tolerance in Distributed Data Stream Processing
    Balazinska et al. (SIGMOD05, ICDE05)

Conclusion
  Use of LSS to Enable Efficient Failure-Recovery
  Use of Additional Buffers to Control Data Loss
  Efficient Resource Allocation Algorithm
  Modest Overhead Associated with Fault-Detection and Failure-Recovery
  Small Loss of Accuracy

Thank you!