The Difficulties of Distributed Data
Douglas Thain, thain@cs.wisc.edu
Condor Project, University of Wisconsin
http://www.cs.wisc.edu/condor

The Condor Project
Established in 1985. Software for high-throughput cluster computing on sites ranging from 10 to 1000s of nodes.
Example installations:
- 643 CPUs at UW-Madison in the CS building, running computer architecture simulations
- 264 CPUs at INFN sites all across Italy, running CMS simulations
Serves two communities: production software and computer science research.

No Repository Here!
No master source of anyone's data at UW-CS Condor! But there is a large amount of buffer space: 128 * 10 GB + 64 * 30 GB.
The ultimate store is at other sites:
- NCSA mass store
- CERN LHC repositories
We concentrate on software for loading, buffering, caching, and producing output efficiently.

The Challenges of Large-Scale Data Access are…
1 - Correctness!
Single stage: crashed machines, lost connections, missing libraries, wrong permissions, expired proxies…
End-to-end: a job is not "complete" until the output has been verified and written to disk.
2 - Heterogeneity
By design: aggregated clusters.
By situation: disk layout, buffer capacity, network load.

Your Comments:
- Jobs need scripts that check the readiness of the system before execution. (Tim Smith) (See the sketch below.)
- Single-node failures are not worth investigating: reboot, reimage, replace. (Steve DuChene)
- "A cluster is a large error amplifier." (Chuck Boeheim)

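A minimal sketch of such a readiness check, assuming hypothetical input files, a scratch directory, and a free-space threshold (none of these specifics appear in the talk):

```python
#!/usr/bin/env python3
# Pre-execution readiness check: verify inputs, permissions, and buffer space
# before the job starts. Paths and thresholds below are illustrative only.
import os
import shutil
import sys

REQUIRED_INPUTS = ["input.dat"]       # hypothetical input files
SCRATCH_DIRS = ["/scratch"]           # hypothetical scratch area
MIN_FREE_BYTES = 10 * 1024**3         # e.g. require 10 GB of buffer space

def ready() -> bool:
    for path in REQUIRED_INPUTS:
        if not os.access(path, os.R_OK):
            print(f"missing or unreadable input: {path}", file=sys.stderr)
            return False
    for d in SCRATCH_DIRS:
        if not os.access(d, os.W_OK):
            print(f"scratch directory not writable: {d}", file=sys.stderr)
            return False
        if shutil.disk_usage(d).free < MIN_FREE_BYTES:
            print(f"not enough free space in {d}", file=sys.stderr)
            return False
    return True

if __name__ == "__main__":
    # Exit nonzero so the scheduler, not the job, sees the failure.
    sys.exit(0 if ready() else 1)
```
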
Data Management in Condor (production -> research)
- Remote I/O
- DAGMan
- Kangaroo
Common denominators:
- Hide errors from jobs -- they cannot deal with "connection refused" or "network down."
- Propagate failures first to the scheduler, and perhaps later to the user.

Remote I/O
Relink the job with the Condor C library. I/O is performed along a TCP connection to the submit site, either as fine-grained RPCs or as whole-file staging.
(Diagram: many execution sites, each running a job, connected back to a single submit site.)
Some failures: NFS down, DNS down, node rebooting, missing input.
On any failure:
1 - Kill -9 the job
2 - Log the event
3 - Email the user?
4 - Reschedule

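The failure policy can be sketched as follows. This is not the Condor C library; it is an illustrative Python sketch of the whole-file-staging case with the kill/log/reschedule loop, and the staging helpers and paths are assumptions:

```python
# Illustrative sketch (not the Condor C library): whole-file staging with the
# kill/log/reschedule policy above. Helper names and paths are assumed.
import logging
import shutil
import subprocess

log = logging.getLogger("remote_io")

def stage_input(src="/submit/input.dat", dst="input.dat"):
    shutil.copy(src, dst)    # stand-in for a transfer from the submit site

def stage_output(src="output.dat", dst="/submit/output.dat"):
    shutil.copy(src, dst)    # stand-in for a transfer back to the submit site

def run_with_remote_io(job_cmd, attempts=3):
    """Stage input, run the job, stage output; the job never sees an I/O error."""
    for attempt in range(1, attempts + 1):
        try:
            stage_input()
            subprocess.run(job_cmd, check=True)    # the job itself
            stage_output()
            return True      # only now is the job considered complete
        except (OSError, subprocess.CalledProcessError) as err:
            log.error("attempt %d failed: %s", attempt, err)   # log the event
            # optionally email the user, then reschedule (here: just retry)
    return False
```
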
DAGMan (Directed Acyclic Graph Manager)
A persistent 'make' for distributed computing. Handles dependencies and failures in multi-job tasks, including CPU and data movement.
(Diagram: Begin DAG -> Stage Input -> Run Remote Job -> Stage Output -> Check Output -> DAG Complete.)
If results are bogus… retry up to 10 times.
If a transfer fails… retry up to 5 times.

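DAGMan itself is driven by a DAG description file; the Python sketch below only illustrates the idea of running nodes in dependency order with per-node retry limits. The node names and retry counts mirror the slide, but the commands are hypothetical:

```python
# Toy DAG runner in the spirit of DAGMan: run nodes whose parents are done,
# retrying each node up to its limit. Commands below are hypothetical.
import subprocess

# node -> (command, parents, max_tries)
DAG = {
    "stage_input":  (["cp", "/repo/input.dat", "."],    [],               5),
    "run_job":      (["./simulate", "input.dat"],       ["stage_input"],  1),
    "stage_output": (["cp", "output.dat", "/repo/"],    ["run_job"],      5),
    "check_output": (["./verify", "/repo/output.dat"],  ["stage_output"], 10),
}

def run_dag(dag):
    done = set()
    while len(done) < len(dag):
        ready = [n for n, (_, parents, _) in dag.items()
                 if n not in done and all(p in done for p in parents)]
        if not ready:
            raise RuntimeError("cycle or unsatisfiable dependency")
        for node in ready:
            cmd, _, max_tries = dag[node]
            for _ in range(max_tries):
                if subprocess.run(cmd).returncode == 0:
                    done.add(node)
                    break
            else:
                raise RuntimeError(f"{node} failed after {max_tries} tries")
    return done
```
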
Kangaroo
Simple idea: use all available network, memory, and disk to buffer data, and "hop" it to the destination.
A background process, not the job, is responsible for handling both faults and variations.
Allows overlap of CPU and I/O.
(Diagram: a chain of "K" data movers hopping data from the application's disk at the execution site to the storage site.)

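A minimal sketch of the idea, assuming a single hop: the job hands its output to a local spool and keeps computing, while a background mover retries the transfer until it succeeds. Helper names, paths, and the retry interval are assumptions:

```python
# Sketch of Kangaroo-style output: the job enqueues files and keeps computing;
# a background mover "hops" them to storage, retrying on failure.
import queue
import shutil
import threading
import time

spool = queue.Queue()

def app_write(local_path, dest_path):
    """Called by the job: hand the file to the mover and return immediately."""
    spool.put((local_path, dest_path))

def mover():
    """Background mover: responsible for faults, so the job never is."""
    while True:
        local_path, dest_path = spool.get()
        while True:
            try:
                shutil.copy(local_path, dest_path)   # one "hop" toward storage
                break
            except OSError:
                time.sleep(30)                       # network down: wait and retry

threading.Thread(target=mover, daemon=True).start()

# Example use by the job: keep computing while output moves in the background.
# app_write("output.dat", "/storage/output.dat")
```
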
I/O Models
(Diagram: timelines comparing "Stage Output" -- input, CPU, then output strictly in sequence -- with "Kangaroo Output," where output is pushed in the background and overlaps the next CPU burst.)

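With made-up numbers (the timings below are illustrative, not measurements from the talk), a quick calculation shows why overlapping output with computation raises throughput:

```python
# Back-of-the-envelope comparison of the two output models, assuming one CPU
# and one output link, with illustrative numbers.
cpu_per_job = 60.0   # minutes of computation per job (assumed)
out_per_job = 20.0   # minutes to move its output (assumed)
jobs = 10

# Staged output: each job's output blocks the CPU.
staged = jobs * (cpu_per_job + out_per_job)

# Overlapped (Kangaroo-style) output: each transfer hides behind the next
# CPU burst; with these compute-bound numbers only the last one is exposed.
overlapped = jobs * max(cpu_per_job, out_per_job) + min(cpu_per_job, out_per_job)

print(f"staged output:  {staged:.0f} min")      # 800 min
print(f"overlapped I/O: {overlapped:.0f} min")  # 620 min
```
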
In Summary…
Correctness is a major obstacle to high-throughput cluster computing. Jobs must be protected from all of the possible errors in data access.
Handle failures in two ways:
- Abort, and inform the scheduler (not the user).
- Fall back to an alternate resource.
Pleasant side effect: higher throughput!