
1 Copyright 2006, Jeffrey K. Hollingsworth
Grid Computing
Jeffrey K. Hollingsworth
hollings@cs.umd.edu
Department of Computer Science
University of Maryland, College Park, MD 20742

2 The Need for GRIDS
Many Computation Bound Jobs
–Simulations: Financial, Electronic Design, Science
–Data Mining
Large-scale Collaboration
–Sharing of large data sets
–Coupled communication simulation codes

3 Available Resources - Desktops
Networks of Workstations
–Workstations have high processing power
–Connected via high-speed network (100Mbps+)
–Long idle time (50-60%) and low resource usage
Goal: Run CPU-intensive programs using idle periods
–While the owner is away: send a guest job and run it
–When the owner returns: stop and migrate the guest job away
–Examples: Condor (University of Wisconsin)
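The harvesting policy can be sketched as a simple control loop (an illustration of the idea only, not Condor's actual daemons; the node and job-control methods used here are placeholders): start a guest job once the machine has been idle long enough, and checkpoint and migrate it away as soon as the owner returns.

```python
# Toy cycle-harvesting loop: run a guest job only while the owner is away.
# `node.is_owner_active`, `node.idle_seconds`, and the guest_job methods are
# hypothetical placeholders for whatever the harvesting system provides.
import time

IDLE_THRESHOLD = 15 * 60      # owner away for 15 minutes -> machine is "idle"

def harvest_loop(node, guest_job):
    while True:
        if node.is_owner_active():
            if guest_job.running:
                guest_job.checkpoint()      # save state so work is not lost
                guest_job.migrate_away()    # continue on another idle node
        elif node.idle_seconds() >= IDLE_THRESHOLD and not guest_job.running:
            guest_job.start(on=node)        # use the idle cycles
        time.sleep(30)                      # poll owner activity periodically
```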

4 Computational Grids
Environment
–Collection of semi-autonomous computers
–Geographically distributed
–Goal: Use these systems as a coordinated resource
–Heterogeneous: processors, networks, OS
Target Applications
–Large-scale programs: running for 100s to 1,000s of seconds
–Significant need to access long-term storage
Needs
–Coordinated access (scheduling)
–Specific time requests (reservations)
–Scalable system software (1,000s of nodes)

5 Two Models of Grid Nodes
Harvested Nodes (Desktop)
–Computers on desktops
–Have a primary user who has priority
–Participate in the grid when resources are free
Dedicated Nodes (Data Center)
–Dedicated to computation-bound jobs
–Various policies: may participate in the grid 24/7, or only when load is low

6 Available Processing Power
–Memory is available: 30MB available 70% of time
–CPU usage is low: 10% or less for 75% of time

7 OS Support for Harvested Grid Computing
Need to Manage Resources Differently
–Scheduler: normally designed to be fair; need strict priority
–Virtual memory: need priority for local jobs
–File systems
Virtual Machines Make Things Easier
–Provide isolation
–Manage resources

8 Starvation Level CPU Scheduling
Original Linux CPU Scheduler
–Run-time scheduling priority: nice value & remaining time quanta
  T_i = 20 - nice_level + 1/2 * T_(i-1)
–Possible to schedule niced processes
Modified Linux CPU Scheduler
–If runnable host processes exist: schedule the host process with highest priority
–Only when no host process is runnable: schedule a guest process
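The scheduling change can be illustrated with a small sketch (not the actual kernel patch; the process fields and demo processes below are invented): a guest process is considered only when no host process is runnable, and within a class the Linux 2.x-style counter T_i = 20 - nice_level + 1/2 * T_(i-1) decides who runs next.

```python
# Sketch of starvation-level scheduling: host processes always win; a guest
# process runs only when no host process is runnable. Names are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Proc:
    name: str
    is_host: bool        # True for a process owned by the local (host) user
    nice_level: int      # Linux-style nice value
    quanta: float = 0.0  # remaining time quanta T_i
    runnable: bool = True

def recompute_quanta(p: Proc) -> float:
    # Linux 2.x-style counter refresh: T_i = 20 - nice_level + 1/2 * T_(i-1)
    p.quanta = 20 - p.nice_level + 0.5 * p.quanta
    return p.quanta

def pick_next(procs: list) -> Optional[Proc]:
    runnable = [p for p in procs if p.runnable]
    if not runnable:
        return None
    hosts = [p for p in runnable if p.is_host]
    # Strict priority: guests are considered only when no host process is
    # runnable, so guest jobs can never slow the owner's processes down.
    candidates = hosts if hosts else runnable
    return max(candidates, key=lambda p: p.quanta)

if __name__ == "__main__":
    procs = [Proc("editor", True, 0, 5.0), Proc("guest-sim", False, 19, 40.0)]
    for p in procs:
        recompute_quanta(p)
    print(pick_next(procs).name)  # -> editor: host runs whenever it is runnable
```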

9 Prioritized Page Replacement
New page replacement algorithm
Adaptive page-out speed
–When a host job steals a guest's page, page out multiple guest pages faster
[Diagram: main memory pages with a High Limit and a Low Limit; above the High Limit the host job has priority, below the Low Limit the guest job has priority, and between the limits replacement is based only on LRU]
–No limit on taking free pages
–High Limit: maximum pages the guest can hold
–Low Limit: minimum pages the guest can hold
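A minimal sketch of the two-threshold policy follows (the data structures, threshold values, and function names are assumptions for illustration; the real mechanism lives in the kernel's page-reclaim path): free pages are taken without limit, the guest is paged out aggressively above the High Limit, ordinary LRU applies between the limits, and guest pages below the Low Limit are protected.

```python
# Sketch of prioritized page replacement with High/Low limits on guest pages.
from collections import OrderedDict

HIGH_LIMIT = 70 * 256   # max pages the guest may hold (70MB at 4KB pages)
LOW_LIMIT = 50 * 256    # min pages the guest is allowed to keep (50MB)

class GuestPages:
    def __init__(self):
        self.lru = OrderedDict()           # page id -> True, oldest first

    def touch(self, page):
        self.lru.pop(page, None)
        self.lru[page] = True              # move page to most-recently-used end

    def evict(self, n):
        for _ in range(min(n, len(self.lru))):
            self.lru.popitem(last=False)   # drop the least-recently-used page

def reclaim_for_host(guest: GuestPages, host_needs_pages: bool):
    """Invoked on memory pressure, e.g. when a host job steals a guest page."""
    held = len(guest.lru)
    if held > HIGH_LIMIT:
        # Above the High Limit: host has priority; page out several guest
        # pages at once (adaptive page-out speed).
        guest.evict(max(64, held - HIGH_LIMIT))
    elif held > LOW_LIMIT and host_needs_pages:
        # Between the limits: behave like ordinary global LRU replacement.
        guest.evict(1)
    else:
        # At or below the Low Limit: guest pages are protected; the host must
        # reclaim elsewhere (e.g. from free pages or its own working set).
        pass
```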

10 Micro Test: Prioritized Memory Page Replacement
–Total available memory: 179MB
–Memory thresholds: High Limit (70MB), Low Limit (50MB)
–Guest job starts at time 20, acquiring 128MB
–Host job starts at time 38, touching 150MB
–Host job becomes I/O-intensive at time 90
–Host job finishes at time 130

11 Application Evaluation - Setup
Experiment Environment
–Linux PC cluster: 8 Pentium II PCs, Linux 2.0.32, connected by a 1.2Gbps Myrinet
Local Workload for Host Jobs
–Emulate an interactive local user: MUSBUS interactive workload benchmark (typical programming environment)
Guest Jobs
–Run DSM parallel applications (CVM): SOR, Water, and FFT
Metrics
–Guest job performance, host workload slowdown

12 Application Evaluation - Host Slowdown
Run DSM Parallel Applications
–3 host workloads: 7%, 13%, 24% (CPU usage)
–Host workload slowdown
–For equal priority: significant slowdown; slowdown increases with load
–No slowdown with Linger priority

13 Application Evaluation - Guest Performance
Run DSM Parallel Applications
–Guest job slowdown
–Slowdown proportional to MUSBUS usage
–Running the guest at the same priority as the host provides little benefit to the guest job

14 Unique Grid Infrastructure
Applies to both harvested and dedicated nodes
Resource Monitoring
–Finding available resources
–Need both CPUs and bandwidth
Scheduling
–Policies for sharing resources among organizations
Security
–Protect nodes from guest jobs
–Protect jobs on foreign nodes

15 Security
Goals
–Don't require explicit accounts on each computer
–Provide controlled access: define policies on what jobs run where; authenticate access
Techniques
–Certificates
–Single account on the system for all grid jobs

16 Resource Monitoring
Need to find available resources
–CPU cycles: with appropriate OS/system software, with sufficient memory & temporary disk
–Network bandwidth: between nodes running a parallel job, and to the remote file system
Issues
–Time-varying availability
–Passive vs. active monitoring
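As a toy example of passive monitoring on a single Linux node, the sketch below samples the load average and free memory from /proc; the threshold values are arbitrary, and a real grid monitor (e.g. Ganglia, next slide) would also aggregate such samples across nodes and add bandwidth probes.

```python
# Passive resource probe: read load average and free memory from /proc (Linux).
# Threshold values are illustration only.

def read_loadavg(path="/proc/loadavg") -> float:
    with open(path) as f:
        return float(f.read().split()[0])          # 1-minute load average

def read_free_mem_mb(path="/proc/meminfo") -> int:
    with open(path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) // 1024  # kB -> MB
    return 0

def node_is_available(min_free_mb=512, max_load=0.5) -> bool:
    # A node is "available" if it has spare memory and its CPU is mostly idle.
    return read_free_mem_mb() >= min_free_mb and read_loadavg() <= max_load

if __name__ == "__main__":
    print("available" if node_is_available() else "busy")
```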

17 Ganglia Toolkit
Courtesy of NPACI, SDSC, and UC Berkeley

18 NetLogger
Courtesy of Brian Tierney, LBL

19 Scheduling
Need to allocate resources on the grid
Each site might:
–Accept jobs from remote sites
–Send jobs to other sites
Need to accommodate co-scheduling
–A single job that spans multiple sites
Need for reservations
–Time-certain allocation of resources

20 Scheduling Parallel Jobs
Scheduling Constraints
–Different jobs use different numbers of nodes
–Jobs provide an estimate of runtime
–Jobs run from a few minutes to a few weeks
Typical Approach
–One parallel job per node (called space-sharing)
–Batch-style scheduling used: even a single user often has more processes than can run at once; need to have many nodes at once for a job

21 Typical Parallel Scheduler
Packs jobs into a schedule by
–Required number of nodes
–Estimated runtime
Backfills with smaller jobs when
–Holes develop due to early job termination
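The packing and backfilling idea can be sketched as follows (a simplified EASY-style backfill over a single free-node counter, not the exact scheduler discussed in the talk; times are measured from "now"): the job at the head of the queue gets a reservation for when enough nodes will be free, and later jobs are moved forward only if they fit now and finish before that reservation.

```python
# Simplified EASY-style backfilling over one cluster; illustrative only.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int        # required number of nodes
    runtime: float    # user-estimated runtime

def backfill_step(queue, free_nodes, running):
    """Pick jobs to start now. `running` is a list of (end_time, nodes)."""
    started = []
    if not queue:
        return started
    head = queue[0]
    if head.nodes <= free_nodes:
        # Head of queue fits: start it and keep trying with the new head.
        started.append(queue.pop(0))
        return started + backfill_step(queue, free_nodes - head.nodes, running)
    # Head does not fit: find its reservation time (when enough nodes free up).
    avail, reservation = free_nodes, None
    for end, nodes in sorted(running):
        avail += nodes
        if avail >= head.nodes:
            reservation = end
            break
    # Backfill: start later jobs that fit now and finish before the reservation.
    for job in list(queue[1:]):
        fits_now = job.nodes <= free_nodes
        respects_head = reservation is None or job.runtime <= reservation
        if fits_now and respects_head:
            queue.remove(job)
            free_nodes -= job.nodes
            started.append(job)
    return started

if __name__ == "__main__":
    queue = [Job("big", 8, 10.0), Job("small", 2, 1.0)]
    running = [(5.0, 4)]   # a 4-node job already running ends at t=5
    print([j.name for j in backfill_step(queue, free_nodes=4, running=running)])
    # -> ['small']: it fits now and finishes before big's reservation at t=5
```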

22 Imprecise Calendars
Data structure to manage scheduling for grids
–Permits allocations of time to applications
–Uses a hierarchical representation: each level maintains a calendar for the nodes it manages
–Allows multiple temporal resolutions
Key Features:
–Allows reservations
–Supports co-scheduling of semi-autonomous sites: a site can refuse an individual remote job; small jobs don't need inter-site coordination

23 Multiple Time/Space Resolutions
[Diagram: a calendar refined in space (splitting the node set) and refined in time (splitting slots)]
Parameters
–Number and sizes of slots
–Packing density
Have multiple time-scales at once
–Near events at the finest temporal resolution
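A rough sketch of one calendar in this scheme is shown below (the data layout and method names are invented for illustration, and the hierarchy of per-site calendars is omitted): a site keeps a list of time slots with free-node counts, coarse slots can be refined into finer ones near the present, and a reservation subtracts nodes from every slot it overlaps or is refused.

```python
# Sketch of an imprecise-calendar node: time slots with free-node counts.
# Structure and names are illustrative, not the published data structure.
from dataclasses import dataclass, field

@dataclass
class CalendarSlot:
    start: float
    length: float
    free_nodes: int

@dataclass
class Calendar:
    slots: list = field(default_factory=list)

    def refine(self, index: int, pieces: int):
        """Split one coarse slot into `pieces` finer slots (near-term events
        are kept at the finest temporal resolution)."""
        s = self.slots[index]
        fine = [CalendarSlot(s.start + i * s.length / pieces,
                             s.length / pieces, s.free_nodes)
                for i in range(pieces)]
        self.slots[index:index + 1] = fine

    def reserve(self, start: float, length: float, nodes: int) -> bool:
        """Reserve `nodes` on every slot overlapping [start, start+length)."""
        hit = [s for s in self.slots
               if s.start < start + length and start < s.start + s.length]
        if not hit or any(s.free_nodes < nodes for s in hit):
            return False          # a site may refuse a request it cannot honor
        for s in hit:
            s.free_nodes -= nodes
        return True

if __name__ == "__main__":
    cal = Calendar([CalendarSlot(0, 60, 16), CalendarSlot(60, 60, 16)])
    cal.refine(0, 4)                 # finer resolution for near-term events
    print(cal.reserve(0, 30, 8))     # True: fits in the first two fine slots
```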

24 Evaluation
Approach
–Use traces of job submission to real clusters
–Simulate different scheduling policies: imprecise calendars vs. traditional back-filling schedulers
Metrics for Comparison
–Job completion time (aggregate and by job size)
–Node utilization
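For concreteness, the two comparison metrics can be computed from a simulated schedule roughly as below (the record format is an assumption for this sketch).

```python
# Compute the comparison metrics from a simulated schedule.
# Each record: (submit_time, start_time, end_time, nodes); format assumed here.

def completion_times(schedule):
    # Job completion time = end - submit (queue wait plus run time).
    return [end - submit for submit, start, end, nodes in schedule]

def node_utilization(schedule, total_nodes, horizon):
    # Fraction of node-time actually used over the simulated horizon.
    busy = sum((end - start) * nodes for submit, start, end, nodes in schedule)
    return busy / (total_nodes * horizon)

if __name__ == "__main__":
    sched = [(0, 0, 10, 4), (0, 10, 15, 8)]
    print(sum(completion_times(sched)) / len(sched))            # mean completion time
    print(node_utilization(sched, total_nodes=8, horizon=20))   # 0.5
```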

25 Comparison with Partitioned Cluster
Based on job data from LANL
Treat each cluster as a trading partner

26 Balance of Trade
Jobs are allowed to split across partitions
Significant shift in work from the 128-node partition

27 Large Cluster of Clusters
Each cluster has 336 nodes
–Jobs using < 1/3 of the nodes and < 12 node-hours are scheduled locally
–Jobs were not split between nodes
Data is one month of jobs per node
Workload from CTC SP-2

28 Balance of Trade: Large Clusters

29 Social, Political, and Corporate Barriers
"It's my computer"
–Even if the employer purchased it
Tragedy of the commons
–Who will buy resources?
Chargeback concerns
–HW purchased for one project used by another
Data security concerns
–You want to run our critical jobs where?

30 Globus Toolkit
Collection of Tools
–Security
–Scheduling
–Grid-aware parallel programming
Designed for
–Confederation of dedicated clusters
–Support for parallel programs

31 Condor
Core of tightly coupled tools
–Monitoring of nodes
–Scheduling (including batch queues)
–Checkpointing of jobs
Designed for
–Harvested resources (dedicated nodes too)
–Parameter sweeps using many serial program runs

32 Layout of the Condor Pool
[Diagram: the Central Manager runs a master, collector, and negotiator; cluster nodes run a master and startd; desktops run a master, startd, and schedd]
Courtesy of the Condor Group, University of Wisconsin

33 Conclusion
What the Grid is
–An approach to improve computation utilization
–Support for data migration for large-scale computation
–Several families of tools
–Tools to enable collaboration
What the Grid is not
–Free cycles from heaven

34 Grid Resources
Books
–The Grid 2: Blueprint for a New Computing Infrastructure, Foster & Kesselman, ed.
–Grid Computing: Making the Global Infrastructure a Reality, Berman, Fox & Hey, ed.
Software Distributions
–Condor: www.cs.wisc.edu/condor
–Globus: www.globus.org

