
1 John Bent Computer Sciences Department University of Wisconsin-Madison johnbent@cs.wisc.edu http://www.cs.wisc.edu/condor Explicit Control in a Batch-aware Distributed File System

2 www.cs.wisc.edu/condor Focus of work › Harnessing, managing remote storage › Batch-pipelined I/O intensive workloads › Scientific workloads › Wide-area grid computing

3 www.cs.wisc.edu/condor Batch-pipelined workloads › General properties  Large number of processes  Process and data dependencies  I/O intensive › Different types of I/O  Endpoint  Batch  Pipeline

4 www.cs.wisc.edu/condor Batch-pipelined workloads [Figure: a batch-pipelined workload; pipelines share a common batch dataset, pass pipeline data between their own stages, and read/write endpoint data at the home site]
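
To make the three I/O types concrete, here is a minimal sketch (in Python, with hypothetical names; this is not BAD-FS code) of how a workload description might tag each data volume, using the AMANDA sizes quoted on slide 15:

    # Sketch: tagging workload data by I/O type (hypothetical names, not BAD-FS code).
    from dataclasses import dataclass

    @dataclass
    class DataVolume:
        name: str
        size_mb: int
        io_type: str  # "endpoint", "batch", or "pipeline"

    # AMANDA example: only the 5 MB endpoint must return to the home store;
    # batch data is shared across pipelines, pipeline data stays within one.
    amanda = [
        DataVolume("calibration", 500, "batch"),
        DataVolume("intermediate", 200, "pipeline"),
        DataVolume("results", 5, "endpoint"),
    ]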

5 www.cs.wisc.edu/condor Wide-area grid computing [Figure: jobs dispatched from home storage across the Internet to remote clusters]

6 www.cs.wisc.edu/condor Cluster-to-cluster (c2c) › Not quite p2p  More organized  Less hostile  More homogeneity  Correlated failures › Each cluster is autonomous  Run and managed by different entities › An obvious bottleneck is the wide-area Internet › Key question: how to manage the flow of data into, within, and out of these clusters?

7 www.cs.wisc.edu/condor Current approaches › Remote I/O  Condor standard universe  Very easy  Consistency through serialization › Prestaging  Condor vanilla universe  Manually intensive  Good performance through knowledge › Distributed file systems (AFS, NFS)  Easy to use, uniform name space  Impractical in this environment

8 www.cs.wisc.edu/condor Pros and cons

                Practical   Easy to use   Leverages workload info
    Remote I/O      √            √                  X
    Pre-staging     √            X                  √
    Trad. DFS       X            √                  X

9 www.cs.wisc.edu/condor BAD-FS › Solution: Batch-Aware Distributed File System › Leverages workload info with storage control  Detailed information about the workload is known  Storage layer allows external control  External scheduler makes informed storage decisions › Combining information and control results in  Improved performance  More robust failure handling  Simplified implementation

                Practical   Easy to use   Leverages workload info
    BAD-FS          √            √                  √

10 www.cs.wisc.edu/condor Practical and deployable › User-level; requires no privilege › Packaged as a modified Condor system  A Condor system which includes BAD-FS › General; glide-in works everywhere [Figure: BAD-FS servers glided in over the Internet onto remote clusters, including an SGE cluster, all coordinated with the home store]

11 www.cs.wisc.edu/condor BAD-FS == Condor ++ › 1) NeST storage management › 2) Batch-Aware Distributed File System › 3) Expanded Condor submit language (Condor DAGMan ++) › 4) BAD-FS scheduler [Figure: home storage with a DAGMan-fed job queue dispatching to compute nodes, each running a Condor startd; NeST and BAD-FS components highlighted]

12 www.cs.wisc.edu/condor BAD-FS knowledge › Remote cluster knowledge  Storage availability  Failure rates › Workload knowledge  Data type (batch, pipeline, or endpoint)  Data quantity  Job dependencies

13 www.cs.wisc.edu/condor Control through lots › Abstraction that allows external storage control › Guaranteed storage allocations  Containers for job I/O  e.g. “I need 2 GB of space for at least 24 hours” › Scheduler  Creates lots to cache input data; subsequent jobs can reuse this data  Creates lots to buffer output data; destroys pipeline data, copies endpoint data home  Configures workload to access lots
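
As a rough illustration of a lot request like the one quoted above, consider this sketch (the interface is invented for illustration; the real NeST/BAD-FS protocol differs):

    # Sketch of a guaranteed storage allocation ("lot"); the interface is invented.
    class Lot:
        def __init__(self, size_gb: float, duration_hours: int):
            self.size_gb = size_gb                 # guaranteed capacity
            self.duration_hours = duration_hours   # guaranteed lifetime
            self.files = {}                        # path -> size in GB

        def write(self, path: str, size_gb: float) -> None:
            if sum(self.files.values()) + size_gb > self.size_gb:
                raise IOError("lot full: the allocation is guaranteed but bounded")
            self.files[path] = size_gb

    # "I need 2 GB of space for at least 24 hours"
    lot = Lot(size_gb=2.0, duration_hours=24)
    lot.write("/data/input", 1.0)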

14 www.cs.wisc.edu/condor Knowledge plus control › Enhanced performance  I/O scoping  Capacity-aware scheduling › Improved failure handling  Cost-benefit replication › Simplified implementation  No cache consistency protocol

15 www.cs.wisc.edu/condor I/O scoping › Technique to minimize wide-area traffic › Allocate lots to cache batch data › Allocate lots for pipeline and endpoint data › Extract endpoint data › Cleanup › Example: AMANDA has a 200 MB pipeline, a 500 MB batch dataset, and a 5 MB endpoint; in steady state, only 5 of the 705 MB traverse the wide area
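
A back-of-the-envelope check of the AMANDA numbers above (assuming batch data is cached once per cluster and pipeline data never leaves local disk):

    # Sketch: steady-state wide-area traffic per job, with and without I/O scoping.
    batch_mb, pipeline_mb, endpoint_mb = 500, 200, 5

    without_scoping = batch_mb + pipeline_mb + endpoint_mb  # 705 MB cross the wide area
    with_scoping = endpoint_mb                              # only the 5 MB endpoint does

    print(f"{with_scoping} of {without_scoping} MB traverse the wide area")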

16 www.cs.wisc.edu/condor Capacity-aware scheduling › Technique to avoid over-allocations › Scheduler has knowledge of  Storage availability  Storage usage within the workload › Scheduler runs as many jobs as fit › Avoids wasted utilization › Improves job throughput
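
A minimal greedy sketch of the idea (illustrative only; among other things, the real scheduler also accounts for batch data shared across jobs):

    # Sketch: admit jobs only while their storage demand fits the available space.
    def schedule(jobs, capacity_mb):
        """jobs: list of (job_id, demand_mb); returns the jobs to start now."""
        running, used = [], 0
        for job_id, demand_mb in jobs:
            if used + demand_mb <= capacity_mb:
                running.append(job_id)
                used += demand_mb
        return running  # remaining jobs wait, so no lot is over-allocated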

17 www.cs.wisc.edu/condor Improved failure handling › Scheduler understands data semantics  Data is not just a collection of bytes  Losing data is not catastrophic; output can be regenerated by rerunning jobs › Cost-benefit replication  Replicates only data whose replication cost is cheaper than the cost to rerun the job › Can improve throughput in a lossy environment
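
The decision rule itself is small; here is a sketch under an assumed cost model (replication cost = time to copy the data home; rerun cost = compute time already invested, weighted by the failure probability):

    # Sketch of cost-benefit replication; this particular cost model is an assumption.
    def should_replicate(data_size_mb, bandwidth_mbps, cpu_seconds_invested, p_failure):
        replication_cost = data_size_mb * 8 / bandwidth_mbps     # seconds to copy home
        expected_rerun_cost = cpu_seconds_invested * p_failure   # expected seconds lost
        return replication_cost < expected_rerun_cost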

18 www.cs.wisc.edu/condor Simplified implementation › Data dependencies known › Scheduler ensures proper ordering › Build a distributed file system  With cooperative caching  But without a cache consistency protocol
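
Because the scheduler already knows the dependency DAG, ordering jobs correctly guarantees that no reader ever observes a file a writer is still producing, which is why no consistency protocol is needed. A sketch of the ordering rule:

    # Sketch: topological ordering stands in for a cache consistency protocol.
    def runnable(dag, done):
        """dag: {job: set of parent jobs}; a job may run once all parents finished."""
        return [j for j, parents in dag.items() if j not in done and parents <= done]

    dag = {"A": set(), "B": {"A"}, "C": set(), "D": {"C"}}
    print(runnable(dag, done={"A"}))  # ['B', 'C'] -- B can safely read A's output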

19 www.cs.wisc.edu/condor Real workloads › AMANDA  Astrophysics study of cosmic events such as gamma-ray bursts › BLAST  Biology search for proteins within a genome › CMS  Physics simulation of large particle colliders › HF  Chemistry study of non-relativistic interactions between atomic nuclei and electrons › IBIS  Ecology global-scale simulation of earth’s climate used to study effects of human activity (e.g. global warming)

20 www.cs.wisc.edu/condor Real workload experience › Setup  16 jobs  16 compute nodes  Emulated wide-area › Configuration  Remote I/O  AFS-like with /tmp  BAD-FS › Result is an order-of-magnitude improvement

21 www.cs.wisc.edu/condor BAD Conclusions › Schedulers can obtain workload knowledge › Schedulers need storage control  Caching  Consistency  Replication › Combining this control with knowledge  Enhanced performance  Improved failure handling  Simplified implementation

22 www.cs.wisc.edu/condor For more information › http://www.cs.wisc.edu/condor/publications.html › Questions? “Pipeline and Batch Sharing in Grid Workloads,” Douglas Thain, John Bent, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny. HPDC 12, 2003. “Explicit Control in a Batch-Aware Distributed File System,” John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny. NSDI ’04, 2004.

23 www.cs.wisc.edu/condor Why not BAD-scheduler and traditional DFS? › Practical reasons  Deployment  Interoperability › Technical reasons  Cooperative caching  Data sharing: a traditional DFS assumes sharing is the exception and provisions for arbitrary, unplanned sharing; in batch workloads, sharing is the rule and sharing behavior is completely known  Data committal: a traditional DFS must guess when to commit (AFS uses close, NFS uses 30 seconds); batch workloads precisely define when

24 www.cs.wisc.edu/condor Is capacity awareness important in the real world? 1. Heterogeneity of remote resources 2. Shared disk 3. Workloads are changing; some are very, very large and still growing

25 www.cs.wisc.edu/condor User burden › Additional info needed in the declarative language › User probably already knows this info  Or can readily obtain it › Typically, this info already exists  Scattered across a collection of scripts, Makefiles, etc.  BAD-FS improves the current situation by collecting this info into one central location

26 www.cs.wisc.edu/condor In the wild

27 www.cs.wisc.edu/condor Capacity-aware scheduling evaluation › Workload  64 synthetic pipelines  Varied pipe size › Environment  16 compute nodes › Configuration  Breadth-first  Depth-first  BAD-FS › Result: failures directly correlate to workload throughput

28 www.cs.wisc.edu/condor I/O scoping evaluation › Workload  64 synthetic pipelines  100 MB of I/O each  Varied data mix › Environment  32 compute nodes  Emulated wide-area › Configuration  Remote I/O  Cache volumes  Scratch volumes  BAD-FS › Result: wide-area traffic directly correlates to workload throughput

29 www.cs.wisc.edu/condor Cost-benefit replication evaluation › Workload  Synthetic pipelines of depth 3  Runtime 60 seconds › Environment  Artificially injected failures › Configuration  Always-copy  Never-copy  BAD-FS › Result: trade off overhead in an environment without failures to gain throughput in an environment with failures

30 www.cs.wisc.edu/condor Real workloads › Workload  Real workloads  64 pipelines › Environment  16 compute nodes  Emulated wide-area › Cold and warm  First 16 are cold  Subsequent 48 warm › Configuration  Remote I/O  AFS-like  BAD-FS

31 www.cs.wisc.edu/condor Example workflow language: Condor DAGMan › Keyword job names a file with execute instructions › Keywords parent, child express relations › … no declaration of data

    job A “instructions.A”
    job B “instructions.B”
    job C “instructions.C”
    job D “instructions.D”
    parent A child B
    parent C child D

[Figure: the resulting DAG, A → B and C → D]

32 www.cs.wisc.edu/condor Adding data primitives to a workflow language › New keywords for container operations  volume: create a container  scratch: specify container type  mount: how the application addresses the container  extract: the desired endpoint output › User must provide complete, exact I/O information to the scheduler  Specify which processes use which data  Specify the size of data read and written

33 www.cs.wisc.edu/condor Extended workflow language

    job A “instructions.A”
    job B “instructions.B”
    job C “instructions.C”
    job D “instructions.D”
    parent A child B
    parent C child D
    volume B1 ftp://home/data 1 GB
    volume P1 scratch 500 MB
    volume P2 scratch 500 MB
    A mount B1 /data
    C mount B1 /data
    A mount P1 /tmp
    B mount P1 /tmp
    C mount P2 /tmp
    D mount P2 /tmp
    extract P1/out ftp://home/out.1
    extract P2/out ftp://home/out.2

[Figure: the DAG annotated with batch volume B1 (ftp://home/data) shared by A and C, pipeline volumes P1 and P2 private to each pipeline, and endpoint outputs out.1 and out.2 extracted to the home store]

