A Fully Automated Fault-tolerant System for Distributed Video Processing and Off-site Replication
George Kola, Tevfik Kosar and Miron Livny, University of Wisconsin-Madison

A Fully Automated Fault-tolerant System for Distributed Video Processing and Off-site Replication
George Kola, Tevfik Kosar and Miron Livny
University of Wisconsin-Madison, June 2004

2/20 What is the talk about?
You heard about streaming videos and switching video quality to mitigate congestion. This talk is about getting these videos:
- Encoding/processing videos using commodity cluster/grid resources
- Replicating videos over the wide-area network
- Insights into issues in the above process

3/20 Motivation
- Education research, bio-medical engineering, … have large amounts of video
- Digital libraries: videos need to be processed (e.g. WCER 'Transana')
- Collaboration => videos need to be shared
- Conventional approach: mail tapes
  - Work to load each tape into the collaborator's digital library
  - High turn-around time
- Desire for a fully electronic solution

4/20 Hardware Encoding Issues
- Products do not support tape robots, and users need multiple formats
  - Would need a lot of operators!
- Hardware miniDV-to-MPEG-4 encoders are not available
- Lack of flexibility: no video processing support
  - e.g. night video where the white balance needs to be changed
- Some hardware encoders are essentially PCs, but cost a lot more!

5/20 Our Goals
- Fully electronic solution
  - Shorter turn-around time
- Full automation
  - No need for operators; more reliable
- Flexible software-based solution
- Use idle CPUs in commodity clusters/grid resources
  - Cost effective

6/20 Issues
- 1 hour of DV video is ~13 GB
  - A typical educational research video uses 3 cameras => 39 GB for 1 hour
- Transferring these videos over the network
  - An intermittent network outage forces a re-transfer of the whole file => the transfer may never complete
- Need for fault tolerance and failure handling
  - Software/machine crashes
  - Downtime for upgrades (we do not control the machines!)
- Problems with existing distributed scheduling systems

7/20 Problems with Existing Systems
- They couple data movement and computation: failure of either results in a redo of both
- They do not schedule data movement
  - 100+ nodes, each picking up a different 13 GB file => server thrashing; some transfers may never complete
  - 100+ nodes, each writing a 13 GB file to the server => effectively a distributed denial of service; server crash

8/20 Our Approach
- Decouple data placement and computation => fault isolation (a sketch of this approach follows below)
- Make data placement a full-fledged job
  - Improved failure handling: alternate-task failure recovery / protocol switching
- Schedule data placement
  - Prevents thrashing and crashes due to overload
  - Can optimize the schedule using storage-server and end-host characteristics
  - Can tune TCP buffers at run time
  - Can optimize for full-system throughput
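To make these ideas concrete, below is a minimal Python sketch, not the actual Stork implementation: data placement runs as its own retryable job, the protocol can be switched between attempts, and a concurrency cap keeps many nodes from overwhelming a single storage server. The transfer tool, protocol names, paths, and limits are all illustrative placeholders.

```python
# Minimal sketch of scheduled, decoupled data placement (not the Stork code).
# "transfer-tool", the protocols, and the concurrency limit are placeholders.
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_TRANSFERS = 5           # scheduling: bound the load on the storage server
PROTOCOLS = ["gsiftp", "http"]         # alternate protocols to switch between on failure

def transfer(protocol, src, dest):
    """One transfer attempt via an external tool; raises on failure."""
    subprocess.run(["transfer-tool", f"{protocol}://{src}", dest], check=True)

def placement_job(src, dest, retries=10, backoff=60):
    """A data placement job: retried until success, switching protocols between attempts."""
    for attempt in range(retries):
        protocol = PROTOCOLS[attempt % len(PROTOCOLS)]
        try:
            transfer(protocol, src, dest)
            return
        except (subprocess.CalledProcessError, OSError):
            time.sleep(backoff)        # ride out short network outages before retrying
    raise RuntimeError(f"placement {src} -> {dest} failed after {retries} attempts")

# Decoupling: placement jobs are queued and scheduled separately from compute
# jobs, with a concurrency cap so 100+ nodes cannot thrash one server.
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_TRANSFERS) as pool:
    for i in range(100):
        pool.submit(placement_job, f"storage.example.org/video{i}.dv",
                    f"/scratch/video{i}.dv")
```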

9/20 Fault Tolerance
- Short network outages were the most common failures
  - The data placement scheduler was made fault-aware and retries until success
- Can tolerate system upgrades during processing
  - Software had to be upgraded during operation
- Avoids bad compute nodes
- Persistent logging allows resuming after a whole-system crash (sketched below)
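A minimal sketch of the persistent-logging idea, assuming a simple append-only journal of finished job ids; the log file name and job representation are invented for illustration. On restart after a whole-system crash, only unfinished work is redone.

```python
# Sketch: journal completed jobs to disk so a crash only costs in-flight work.
import os

LOG_FILE = "placement.done.log"        # hypothetical journal location

def load_completed():
    """Read the set of job ids already journaled as finished."""
    if not os.path.exists(LOG_FILE):
        return set()
    with open(LOG_FILE) as f:
        return {line.strip() for line in f if line.strip()}

def mark_completed(job_id):
    """Append a finished job id and force it to disk so it survives a machine crash."""
    with open(LOG_FILE, "a") as f:
        f.write(job_id + "\n")
        f.flush()
        os.fsync(f.fileno())

def run_all(jobs, run_job):
    """Run (job_id, work) pairs, skipping anything finished before a restart."""
    done = load_completed()
    for job_id, work in jobs:
        if job_id in done:
            continue                   # finished before the crash; do not redo
        run_job(work)
        mark_completed(job_id)
```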

10/20 Three Designs
- Some clusters have a stage area
  - Design 1: stage to the cluster stage area
- Some clusters have a stage area and allow running computation there
  - Design 2: run a hierarchical buffer server in the stage area
- No cluster stage area
  - Design 3: direct staging to the compute node

11/20 Design 1 & 2: Using a Stage Area
[Diagram: the source site sends data over the wide area to the cluster stage area, which feeds the compute nodes]

12/20 Design 1 versus Design 2
- Design 1 uses the default transfer protocol to move data from the stage area to the compute node
  - Not scheduled: problems arise as the number of concurrent transfers increases
- Design 2 uses a hierarchical buffer server at the stage node; a client runs at the compute node to pick up the data
  - Scheduled, but hierarchical buffer server crashes need to be handled
  - 25%-30% performance improvement in our current setting

13/20 Design 3: Direct Staging
[Diagram: the source site stages data over the wide area directly to the compute nodes]

14/20 Design 3
- Applicable when there is no stage area
- Most flexible
- CPU is wasted during data transfer / needs additional features
- Optimization is possible if transfer and compute times can be estimated (see the sketch below)
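The kind of estimate this design enables can be illustrated with a small back-of-the-envelope function; the transfer rate and times below are assumed numbers for illustration, not measurements from the talk.

```python
# Rough illustration: if transfer and compute times can be predicted, the next
# video can be prefetched while the current one is processed, and the CPU-idle
# fraction can be estimated up front. All numbers are made-up examples.
def cpu_idle_fraction(transfer_s, compute_s, prefetch=True):
    """Estimated fraction of wall-clock time a compute node sits idle per video."""
    if prefetch:
        # Prefetching overlaps transfer with computation, so the node only
        # waits when the transfer outlasts the encode.
        idle = max(transfer_s - compute_s, 0.0)
        total = max(transfer_s, compute_s)
    else:
        # Without overlap, every second of transfer is idle CPU time.
        idle = float(transfer_s)
        total = float(transfer_s + compute_s)
    return idle / total

# Example: a 13 GB DV file at an assumed ~10 MB/s (about 22 minutes) feeding
# a 2-hour MPEG-1 encode.
print(cpu_idle_fraction(22 * 60, 2 * 3600, prefetch=False))  # ~0.15 of time wasted
print(cpu_idle_fraction(22 * 60, 2 * 3600, prefetch=True))   # 0.0 wasted
```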

15/20 WCER Video Pipeline
[Diagram: input data flows from WCER through staging and file splitting to Condor for processing; output data flows back to WCER and to SRB for replication]

16/20 WCER Video Pipeline
- Data transfer protocols had a 2 GB file-size limit
  - Split the files and rejoin them (sketched below)
- File-size limits in Linux video decoders
  - Picked up a new decoder from CVS
- File system performance issues
- Flaky network connectivity
  - Got the network administrators to fix it
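A minimal sketch of the split-and-rejoin workaround for the 2 GB protocol limit, assuming a simple byte-range chunking scheme; the chunk size, buffer size, and part-naming convention are illustrative, and the real pipeline's splitter may differ.

```python
# Sketch: split a large video into parts under 2 GB for transfer, then rejoin.
import os

CHUNK = 2 * 1000 ** 3 - 1              # stay just under the 2 GB limit
BUF = 64 * 1024 * 1024                 # copy in 64 MB pieces to bound memory use

def split(path, chunk=CHUNK, buf=BUF):
    """Split path into path.part0, path.part1, ... each at most `chunk` bytes."""
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            part = f"{path}.part{index}"
            written = 0
            with open(part, "wb") as out:
                while written < chunk:
                    data = src.read(min(buf, chunk - written))
                    if not data:
                        break
                    out.write(data)
                    written += len(data)
            if written == 0:           # nothing left; drop the empty part
                os.remove(part)
                break
            parts.append(part)
            index += 1
    return parts

def rejoin(parts, dest, buf=BUF):
    """Concatenate the transferred parts back into the original file."""
    with open(dest, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                while data := src.read(buf):
                    out.write(data)
```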

17/20 WCER Video Pipeline

Encoding | Resolution       | File Size | Average Time
MPEG-1   | Half (320 x 240) | 600 MB    | 2 hours
MPEG-2   | Full (720 x 480) | 2 GB      | 8 hours
MPEG-4   | Half (320 x 240) | 250 MB    | 4 hours

- Started processing in January 2004; DV video is encoded to MPEG-1, MPEG-2 and MPEG-4
- Has been a good test for data-intensive distributed computing
- Fault-tolerance issues were the most important
- In a real system, downtime for software upgrades should be taken into account

18/20 How can I use this?
- Stork data placement scheduler
- Dependency manager (DAGMan), enhanced with data placement (DaP) support
- Condor/Condor-G distributed scheduler
- Flexible DAG generator (a toy version is sketched below)
- Pick up our tools and you can perform data-intensive computing on commodity cluster/grid resources
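As a rough illustration of how such a pipeline can be described, here is a toy DAG generator that emits one stage-in -> encode -> stage-out chain per video using plain Condor DAGMan JOB / PARENT ... CHILD syntax. The submit-file names are hypothetical, and the authors' actual generator (and the DaP-enhanced DAGMan keywords for Stork data placement nodes) may differ from this sketch.

```python
# Toy stand-in for the "flexible DAG generator": writes a DAGMan file with one
# stage-in -> encode -> stage-out chain per video. Submit files are hypothetical.
def generate_dag(videos, dag_path="pipeline.dag"):
    lines = []
    for i, video in enumerate(videos):
        lines.append(f"# pipeline for {video}")
        lines.append(f"JOB stagein{i}  stagein_{i}.submit")
        lines.append(f"JOB encode{i}   encode_{i}.submit")
        lines.append(f"JOB stageout{i} stageout_{i}.submit")
        lines.append(f"PARENT stagein{i} CHILD encode{i}")
        lines.append(f"PARENT encode{i} CHILD stageout{i}")
    with open(dag_path, "w") as f:
        f.write("\n".join(lines) + "\n")

# One chain per camera of a recording session.
generate_dag(["camera1.dv", "camera2.dv", "camera3.dv"])
```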

19/20 Conclusion & Future Work
- Successfully processed and replicated several terabytes of video
- Working on extending design 3
  - Building a client-centric, data-aware distributed scheduler
  - Deploying the new scheduler inside existing schedulers (an idea borrowed from virtual machines)

20/20 Questions
Contact: George Kola