Operating System Support for Space Allocation in Grid Storage Systems. Douglas Thain, University of Notre Dame. IEEE Grid Computing, Sep 2006.


1 Operating System Support for Space Allocation in Grid Storage Systems. Douglas Thain, University of Notre Dame. IEEE Grid Computing, Sep 2006

2 Bad News: Many large distributed systems fall to pieces under heavy load!

3 Example: Grid3 (OSG)
Robert Gardner, et al. (102 authors), "The Grid3 Production Grid: Principles and Practice," IEEE HPDC 2004.
"The Grid2003 Project has deployed a multi-virtual organization, application-driven grid laboratory that has sustained for several months the production-level services required by... ATLAS, CMS, SDSS, LIGO..."

4 Grid2003: The Details
The good news:
– 27 sites with 2800 CPUs
– 40985 CPU-days provided over 6 months
– 10 applications with 1300 simultaneous jobs
The bad news on ATLAS jobs:
– 40-70 percent utilization
– 30 percent of jobs would fail
– 90 percent of failures were site problems
– Most site failures were due to disk space!

5 A Thought Experiment
[Diagram: many CPUs each run a job (in, task, out) against one shared disk, times 1,000,000 tasks.]
1 – Only a problem when load > capacity.
2 – Grids are employed by users with infinite needs!

6 Need Space Allocation
Grid storage managers:
– SRB – Storage Resource Broker at SDSC.
– SRM – Storage Resource Manager at LBNL.
– NeST – Networked Storage at UW-Madison.
– IBP – Internet Backplane Protocol at UTK.
But they do not have any help from the OS:
– A runaway logfile can invalidate the careful accounting of the grid storage manager.

7 Outline
Grids Need OS Support for Allocation
A Model of Space Allocation
Three Implementations
– User-Level Library
– Loopback Devices
– AllocFS: Kernel Filesystem
Application to a Cluster

8 A Model of Space Allocation
[Diagram: a directory tree (root; jobs, home; j1, j2; alice, betty; data, core), each directory annotated with an allocation size and current usage, e.g. root: size 1000 GB, used 700 GB; jobs: size 100 GB, used 10 GB; data: size 10 GB, used 5 GB.]
Three commands:
mkalloc (dir) (size)
lsalloc (dir)
rm -rf (dir)
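The accounting behind these three commands can be sketched as a toy model (pure illustration; the class and method names below are invented, not the paper's implementation):

```python
class Alloc:
    """A directory with a size limit; creating a child charges the parent."""
    def __init__(self, size, parent=None):
        self.size, self.used, self.parent = size, 0, parent
        if parent is not None:
            parent._charge(size)   # mkalloc: the child's full size counts as parent usage

    def _charge(self, amount):
        if self.used + amount > self.size:
            raise ValueError("allocation does not fit")
        self.used += amount

    def write(self, nbytes):
        self._charge(nbytes)       # ordinary writes draw on the local allocation

    def remove(self):              # rm -rf: return the space to the parent
        if self.parent is not None:
            self.parent.used -= self.size

root = Alloc(1000)
jobs = Alloc(100, parent=root)     # mkalloc /jobs 100
j1 = Alloc(10, parent=jobs)        # mkalloc /jobs/j1 10
j1.write(5)
print(root.used, jobs.used, j1.used)   # prints: 100 10 5
```

Note that in this model a parent is charged for a child's full size up front, which is what makes the guarantee hard: space promised to j1 can never be stolen by a sibling.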

9 No Built-In Allocation Policy
In order to make an allocation:
– Must have permission to mkdir.
– New allocation must fit in available space.
Need something more complex?
– Check remote database re global quota?
– Delete allocation after a certain time?
– Send email when allocation is full?
Use a storage manager at a higher level:
– SRB, SRM, NeST, IBP, etc.

10 No Built-In Allocation Policy
[Diagram: a grid storage manager receives "need 10 GB" from a client; it checks a database, charges a credit card, consults a human, then issues mkalloc /jobs/j5 10GB and setacl /jobs/j5 alice write, and replies "ok, use jobs/j5"; task1 and task2 then use ordinary file access inside the writeable allocation (jobs: size 100 GB; j5: size 10 GB).]
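The division of labor in this exchange (policy in the grid storage manager, mechanism in the OS) might be outlined as follows; the quota database and expiry policy are invented examples, and mkalloc is passed in as a callable standing in for the real command:

```python
import time

def managed_alloc(mkalloc, path, size_gb, user, quota_db, max_age_days=30):
    """Policy layer above the OS mechanism: the filesystem only checks
    mkdir permission and fit, so quotas and leases live up here.
    quota_db maps user -> remaining GB (an invented policy example)."""
    if quota_db.get(user, 0) < size_gb:
        raise PermissionError(f"{user} is over the global quota")
    mkalloc(path, size_gb)                 # fall through to the OS mechanism
    quota_db[user] -= size_gb
    return {"path": path, "expires": time.time() + max_age_days * 86400}

calls = []
quota = {"alice": 50}
lease = managed_alloc(lambda p, s: calls.append((p, s)),
                      "/jobs/j5", 10, "alice", quota)
```

The point of the layering is that arbitrarily rich policy (databases, billing, humans) composes with a deliberately small OS interface.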

11 Outline
Grids Need OS Support for Allocation
A Model of Space Allocation
Three Implementations
– User-Level Library
– Loopback Devices
– AllocFS: Kernel Filesystem
Application to a Cluster

12 User Level Library
[Diagram: an application linked against LibAlloc writes a file in the tree (root: 1000 GB; jobs: 100 GB; j1: 10 GB); each write goes through per-directory allocation state files in three steps: 1 - lock/read, 2 - stat/write, 3 - write/unlock.]

13 User Level Library
Some details about locking: see paper.
Applicability
– Must modify apps or servers to employ.
– Fails if non-enabled apps interfere.
– But, can employ anywhere without privileges.
Performance
– Optimization: cache locks until idle 2 sec.
– At best, writes double in latency.
– At worst, shared directories ping-pong locks.
Recovery
– fixalloc: traverses the directory structure and recomputes current allocations.
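A minimal sketch of such a write path, assuming a per-directory state file guarded by an advisory lock (the real library's locking and on-disk format are described in the paper; the file names here are invented):

```python
import fcntl, json, os, tempfile

def tracked_write(dirpath, filename, data):
    """Append data to a file, charging it against the directory's allocation.
    Protocol per the slide: 1 - lock/read state, 2 - write, 3 - update/unlock."""
    state_path = os.path.join(dirpath, ".alloc")   # invented state-file name
    with open(state_path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)              # 1 - lock and read the state
        state = json.load(f)
        if state["used"] + len(data) > state["size"]:
            raise OSError("allocation exhausted")
        with open(os.path.join(dirpath, filename), "ab") as out:
            out.write(data)                        # 2 - perform the write
        state["used"] += len(data)
        f.seek(0); json.dump(state, f); f.truncate()
        fcntl.flock(f, fcntl.LOCK_UN)              # 3 - update state and unlock

d = tempfile.mkdtemp()
with open(os.path.join(d, ".alloc"), "w") as f:
    json.dump({"size": 10, "used": 0}, f)          # a tiny 10-byte allocation
tracked_write(d, "out.txt", b"hello")
```

The extra lock/read/update round trip on every write is exactly where the "at best, writes double in latency" figure comes from, and a non-enabled application that bypasses tracked_write silently breaks the accounting.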

14 Loopback Filesystems
[Diagram: inside root (1000 GB), the jobs allocation (100 GB) is a loopback file mounted as a filesystem, holding j1 (10 GB) and j2.]
dd if=/dev/zero of=/jobs.fs 100GB
losetup /dev/loopN /jobs.fs
mke2fs /dev/loopN
mount /dev/loopN /jobs

15 Loopback Filesystems
Applicability
– Works with any standard application.
– Must be root to deploy and manage allocations.
– Limited to approx 10-100 allocations.
Performance
– Ordinary reads and writes: no overhead.
– Allocations: must touch every block to reserve!
– Massively increases I/O traffic to disk.
Recovery
– Must scan hierarchy, fsck and mount every allocation.
– Disastrous for large file systems!

16 AllocFS: Kernel-Level Filesystem
[Diagram: the directory tree (root, jobs, j1, j2, file) beside its inode table; each inode carries #, uid, size, used, and parent fields, e.g. inode 2 (root): uid 0, size 1000 GB, used 700 GB, parent 2; inode 3 (jobs): uid 0, size 100 GB, used 99 GB, parent 2; inode 4 (j1): uid 34, size 10 GB, used 5 GB, parent 3; plain files point to their enclosing allocation.]
1 – To update allocation state, update fields in the in-core inode.
2 – To create/delete an allocation, update the parent's allocation state, which is already cached for other reasons.
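The constant-time operations described here can be illustrated with a toy in-memory inode table (an invented structure; AllocFS itself keeps these fields inside EXT2 inodes):

```python
# Toy inode table mirroring the slide's fields: uid, size, used, parent.
# Inode 2 is the root allocation, its own parent, as in the slide.
inodes = {2: {"uid": 0, "size": 1000, "used": 700, "parent": 2}}
next_ino = 3

def mkalloc(parent_ino, uid, size):
    """Create an allocation: touch exactly two inodes (parent + new child)."""
    global next_ino
    parent = inodes[parent_ino]
    if parent["used"] + size > parent["size"]:
        raise OSError("allocation does not fit")
    parent["used"] += size
    ino, next_ino = next_ino, next_ino + 1
    inodes[ino] = {"uid": uid, "size": size, "used": 0, "parent": parent_ino}
    return ino

def rmalloc(ino):
    """Delete an allocation: credit the parent, drop the child inode."""
    child = inodes.pop(ino)
    inodes[child["parent"]]["used"] -= child["size"]

jobs = mkalloc(2, 0, 100)   # root (inode 2) is charged 100 more
j1 = mkalloc(jobs, 34, 10)
```

Because the parent pointer lives in the inode, both operations are a couple of in-core field updates, which is why allocation cost is microseconds regardless of size.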

17 AllocFS: Kernel-Level Filesystem
Applicability
– Works with any ordinary application.
– Must load module and be root to install.
– Binary compatible with existing EXT2 filesystem.
– Once loaded, ordinary users may employ.
Performance
– No measurable overhead on I/O.
– Creating an allocation: touch two inodes.
– Deleting an allocation: same as deleting a directory.
Recovery
– fixalloc: traverses the directory structure and recomputes current allocations.

18 Library Adds Latency

19 Allocation Performance
Loopback Filesystem
– 1 second per 25 MB of allocation (40 sec/GB).
– Must touch every single block.
– Big increase in unnecessary I/O traffic!
Allocation Library
– 227 usec regardless of size.
– Several synchronous disk ops.
Kernel-Level Filesystem
– 32 usec regardless of size.
– Touch one inode.

20 Comparison
          Priv. Reqd.           Guarantee?  Max #     Write Perf.  Alloc Perf.   Recovery
Library   any user              no          no limit  2x latency   usec          fixalloc once
Loopback  root to install, use  yes         10-100    no change    secs to mins  fsck and mount each alloc
Kernel    root to install       yes         no limit  no change    usec          fixalloc once

21 Outline
Grids Need OS Support for Allocation
A Model of Space Allocation
Three Implementations
– User-Level Library
– Loopback Devices
– AllocFS: Kernel Filesystem
Application to a Cluster

22 A Physical Experiment
[Diagram: multiple CPUs each run a job (in, task, out) against one shared disk.]
Four configurations:
1 – No allocations.
2 – Backoff when failures detected.
3 – Heuristic: don't start job unless space > threshold.
4 – Allocate space for each job; only space for 10.
Vary load: # of simultaneous jobs.

23 Allocations Improve Robustness

24 Summary
Grids require space allocations in order to become robust under heavy loads.
Explicit operating system support for allocations is needed in order to make them manageable and efficient.
User-level approximations are possible, but have overheads in performance and management.
AllocFS provides allocations compatible with EXT2 with no measurable overhead.

25 Library Implementation
http://www.cctools.org/chirp
Solaris, Linux, Mac, Windows
Start server with -Q 100GB

26 Kernel Implementation
http://www.cctools.org/allocfs
Works with Linux 2.4.21.
Install over existing EXT2 FS.
– (And, uninstall without loss.)
% mkalloc /mnt/alloctest/adir 25M
mkalloc: /mnt/alloctest/adir allocated 25600 blocks.
% lsalloc -r /mnt/alloctest
USED    TOTAL   PCT  PATH
25.01M  87.14M  28%  /mnt/alloctest
10.00M  25.00M  39%  /mnt/alloctest/adir

27 A Final Thought
"[Some think] traditional OS issues are either solved problems or minor problems. We believe that building such vast distributed systems upon the fragile infrastructure provided by today's operating systems is analogous to building castles on sand."
The Persistent Relevance of the Local Operating System to Global Applications.
Jay Lepreau, Bryan Ford, and Mike Hibler. SIGOPS European Workshop, September 1996.

28 For More Information
Cooperative Computing Lab:
– http://www.cse.nd.edu/~ccl
Douglas Thain
– dthain@cse.nd.edu
Related Talks:
– "Grid Deployment of Bioinformatics Apps..." Session 4A Friday
– "Cacheable Decentralized Groups..." Session 5B Friday

29 Extra Slides

30 Existing Tools Not Suitable for the Grid
User and Group Quotas
– Don't always correspond to allocation needs! A user might want one alloc per job, or many users may want to share an alloc.
Disk Partitions
– Very expensive to create, change, manage.
– Not hierarchical: only root can manage.
ZFS Allocations
– Cheap to create, change, manage.
– Not hierarchical: only root can manage.

31 Library Suffers on Small Writes

32 Recovery Linear wrt # of Files

