
1 Copyright 2006, Jeffrey K. Hollingsworth
Grid Computing
Jeffrey K. Hollingsworth
hollings@cs.umd.edu
Department of Computer Science
University of Maryland, College Park, MD 20742

2 The Need for GRIDS
Many Computation Bound Jobs
–Simulations: Financial, Electronic Design, Science
–Data Mining
Large-scale Collaboration
–Sharing of large data sets
–Coupled communication simulation codes

3 Available Resources - Desktops
Networks of Workstations
–Workstations have high processing power
–Connected via high-speed network (100Mbps+)
–Long idle time (50-60%) and low resource usage
Goal: Run CPU-intensive programs using idle periods
–While the owner is away: send a guest job and run it
–When the owner returns: stop and migrate the guest job away
–Examples: Condor (University of Wisconsin)
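The harvesting policy can be sketched as a simple control loop (an illustration of the idea only, not Condor's actual daemons; the node and job-control methods used here are placeholders): start a guest job once the machine has been idle long enough, and checkpoint and migrate it away as soon as the owner returns.

```python
# Toy cycle-harvesting loop: run a guest job only while the owner is away.
# `node.is_owner_active`, `node.idle_seconds`, and the guest_job methods are
# hypothetical placeholders for whatever the harvesting system provides.
import time

IDLE_THRESHOLD = 15 * 60      # owner away for 15 minutes -> machine is "idle"

def harvest_loop(node, guest_job):
    while True:
        if node.is_owner_active():
            if guest_job.running:
                guest_job.checkpoint()      # save state so work is not lost
                guest_job.migrate_away()    # continue on another idle node
        elif node.idle_seconds() >= IDLE_THRESHOLD and not guest_job.running:
            guest_job.start(on=node)        # use the idle cycles
        time.sleep(30)                      # poll owner activity periodically
```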

4 Computational Grids
Environment
–Collection of semi-autonomous computers
–Geographically distributed
–Goal: Use these systems as a coordinated resource
–Heterogeneous: processors, networks, OS
Target Applications
–Large-scale programs: running for 100s to 1,000s of seconds
–Significant need to access long-term storage
Needs
–Coordinated access (scheduling)
–Specific time requests (reservations)
–Scalable system software (1,000s of nodes)

5 Two Models of Grid Nodes
Harvested Nodes (Desktop)
–Computers on desktops
–Have a primary user who has priority
–Participate in the grid when resources are free
Dedicated Nodes (Data Center)
–Dedicated to computation-bound jobs
–Various policies: may participate in the grid 24/7, or only when load is low

6 Available Processing Power
–Memory is available: 30MB available 70% of time
–CPU usage is low: 10% or less for 75% of time

7 OS Support for Harvested Grid Computing
Need to Manage Resources Differently
–Scheduler: normally designed to be fair; need strict priority
–Virtual memory: need priority for local jobs
–File systems
Virtual Machines Make Things Easier
–Provide isolation
–Manage resources

8 Starvation Level CPU Scheduling
Original Linux CPU Scheduler
–Run-time scheduling priority: nice value & remaining time quanta
  T_i = 20 - nice_level + 1/2 * T_(i-1)
–Possible to schedule niced processes
Modified Linux CPU Scheduler
–If runnable host processes exist: schedule the host process with highest priority
–Only when no host process is runnable: schedule a guest process
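The scheduling change can be illustrated with a small sketch (not the actual kernel patch; the process fields and demo processes below are invented): a guest process is considered only when no host process is runnable, and within a class the Linux 2.x-style counter T_i = 20 - nice_level + 1/2 * T_(i-1) decides who runs next.

```python
# Sketch of starvation-level scheduling: host processes always win; a guest
# process runs only when no host process is runnable. Names are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Proc:
    name: str
    is_host: bool        # True for a process owned by the local (host) user
    nice_level: int      # Linux-style nice value
    quanta: float = 0.0  # remaining time quanta T_i
    runnable: bool = True

def recompute_quanta(p: Proc) -> float:
    # Linux 2.x-style counter refresh: T_i = 20 - nice_level + 1/2 * T_(i-1)
    p.quanta = 20 - p.nice_level + 0.5 * p.quanta
    return p.quanta

def pick_next(procs: list) -> Optional[Proc]:
    runnable = [p for p in procs if p.runnable]
    if not runnable:
        return None
    hosts = [p for p in runnable if p.is_host]
    # Strict priority: guests are considered only when no host process is
    # runnable, so guest jobs can never slow the owner's processes down.
    candidates = hosts if hosts else runnable
    return max(candidates, key=lambda p: p.quanta)

if __name__ == "__main__":
    procs = [Proc("editor", True, 0, 5.0), Proc("guest-sim", False, 19, 40.0)]
    for p in procs:
        recompute_quanta(p)
    print(pick_next(procs).name)  # -> editor: host runs whenever it is runnable
```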

9 Prioritized Page Replacement
New page replacement algorithm
Adaptive page-out speed
–When a host job steals a guest's page, page out multiple guest pages faster
[Diagram: main memory pages with a High Limit and a Low Limit; above the High Limit the host job has priority, below the Low Limit the guest job has priority, and between the limits replacement is based only on LRU]
–No limit on taking free pages
–High Limit: maximum pages the guest can hold
–Low Limit: minimum pages the guest can hold
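A minimal sketch of the two-threshold policy follows (the data structures, threshold values, and function names are assumptions for illustration; the real mechanism lives in the kernel's page-reclaim path): free pages are taken without limit, the guest is paged out aggressively above the High Limit, ordinary LRU applies between the limits, and guest pages below the Low Limit are protected.

```python
# Sketch of prioritized page replacement with High/Low limits on guest pages.
from collections import OrderedDict

HIGH_LIMIT = 70 * 256   # max pages the guest may hold (70MB at 4KB pages)
LOW_LIMIT = 50 * 256    # min pages the guest is allowed to keep (50MB)

class GuestPages:
    def __init__(self):
        self.lru = OrderedDict()           # page id -> True, oldest first

    def touch(self, page):
        self.lru.pop(page, None)
        self.lru[page] = True              # move page to most-recently-used end

    def evict(self, n):
        for _ in range(min(n, len(self.lru))):
            self.lru.popitem(last=False)   # drop the least-recently-used page

def reclaim_for_host(guest: GuestPages, host_needs_pages: bool):
    """Invoked on memory pressure, e.g. when a host job steals a guest page."""
    held = len(guest.lru)
    if held > HIGH_LIMIT:
        # Above the High Limit: host has priority; page out several guest
        # pages at once (adaptive page-out speed).
        guest.evict(max(64, held - HIGH_LIMIT))
    elif held > LOW_LIMIT and host_needs_pages:
        # Between the limits: behave like ordinary global LRU replacement.
        guest.evict(1)
    else:
        # At or below the Low Limit: guest pages are protected; the host must
        # reclaim elsewhere (e.g. from free pages or its own working set).
        pass
```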

10 Micro Test: Prioritized Memory Page Replacement
–Total available memory: 179MB
–Memory thresholds: High Limit (70MB), Low Limit (50MB)
–Guest job starts at time 20, acquiring 128MB
–Host job starts at time 38, touching 150MB
–Host job becomes I/O-intensive at time 90
–Host job finishes at time 130

11 Application Evaluation - Setup
Experiment Environment
–Linux PC cluster: 8 Pentium II PCs, Linux 2.0.32, connected by a 1.2Gbps Myrinet
Local Workload for Host Jobs
–Emulate an interactive local user: MUSBUS interactive workload benchmark (typical programming environment)
Guest Jobs
–Run DSM parallel applications (CVM): SOR, Water, and FFT
Metrics
–Guest job performance, host workload slowdown

12 Application Evaluation - Host Slowdown
Run DSM Parallel Applications
–3 host workloads: 7%, 13%, 24% (CPU usage)
–Host workload slowdown
–For equal priority: significant slowdown; slowdown increases with load
–No slowdown with Linger priority

13 Application Evaluation - Guest Performance
Run DSM Parallel Applications
–Guest job slowdown
–Slowdown proportional to MUSBUS usage
–Running the guest at the same priority as the host provides little benefit to the guest job

14 Unique Grid Infrastructure
Applies to both harvested and dedicated nodes
Resource Monitoring
–Finding available resources
–Need both CPUs and bandwidth
Scheduling
–Policies for sharing resources among organizations
Security
–Protect nodes from guest jobs
–Protect jobs on foreign nodes

15 Security
Goals
–Don't require explicit accounts on each computer
–Provide controlled access: define policies on what jobs run where; authenticate access
Techniques
–Certificates
–Single account on the system for all grid jobs

16 Resource Monitoring
Need to find available resources
–CPU cycles: with appropriate OS/system software, with sufficient memory & temporary disk
–Network bandwidth: between nodes running a parallel job, and to the remote file system
Issues
–Time-varying availability
–Passive vs. active monitoring
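As a toy example of passive monitoring on a single Linux node, the sketch below samples the load average and free memory from /proc; the threshold values are arbitrary, and a real grid monitor (e.g. Ganglia, next slide) would also aggregate such samples across nodes and add bandwidth probes.

```python
# Passive resource probe: read load average and free memory from /proc (Linux).
# Threshold values are illustration only.

def read_loadavg(path="/proc/loadavg") -> float:
    with open(path) as f:
        return float(f.read().split()[0])          # 1-minute load average

def read_free_mem_mb(path="/proc/meminfo") -> int:
    with open(path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) // 1024  # kB -> MB
    return 0

def node_is_available(min_free_mb=512, max_load=0.5) -> bool:
    # A node is "available" if it has spare memory and its CPU is mostly idle.
    return read_free_mem_mb() >= min_free_mb and read_loadavg() <= max_load

if __name__ == "__main__":
    print("available" if node_is_available() else "busy")
```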

17 Ganglia Toolkit
Courtesy of NPACI, SDSC, and UC Berkeley

18 NetLogger
Courtesy of Brian Tierney, LBL

19 Scheduling
Need to allocate resources on the grid
Each site might:
–Accept jobs from remote sites
–Send jobs to other sites
Need to accommodate co-scheduling
–A single job that spans multiple sites
Need for reservations
–Time-certain allocation of resources

20 Scheduling Parallel Jobs
Scheduling Constraints
–Different jobs use different numbers of nodes
–Jobs provide an estimate of runtime
–Jobs run from a few minutes to a few weeks
Typical Approach
–One parallel job per node (called space-sharing)
–Batch-style scheduling used: even a single user often has more processes than can run at once; need to have many nodes at once for a job

21 Typical Parallel Scheduler
Packs jobs into a schedule by
–Required number of nodes
–Estimated runtime
Backfills with smaller jobs when
–Holes develop due to early job termination
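The packing and backfilling idea can be sketched as follows (a simplified EASY-style backfill over a single free-node counter, not the exact scheduler discussed in the talk; times are measured from "now"): the job at the head of the queue gets a reservation for when enough nodes will be free, and later jobs are moved forward only if they fit now and finish before that reservation.

```python
# Simplified EASY-style backfilling over one cluster; illustrative only.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int        # required number of nodes
    runtime: float    # user-estimated runtime

def backfill_step(queue, free_nodes, running):
    """Pick jobs to start now. `running` is a list of (end_time, nodes)."""
    started = []
    if not queue:
        return started
    head = queue[0]
    if head.nodes <= free_nodes:
        # Head of queue fits: start it and keep trying with the new head.
        started.append(queue.pop(0))
        return started + backfill_step(queue, free_nodes - head.nodes, running)
    # Head does not fit: find its reservation time (when enough nodes free up).
    avail, reservation = free_nodes, None
    for end, nodes in sorted(running):
        avail += nodes
        if avail >= head.nodes:
            reservation = end
            break
    # Backfill: start later jobs that fit now and finish before the reservation.
    for job in list(queue[1:]):
        fits_now = job.nodes <= free_nodes
        respects_head = reservation is None or job.runtime <= reservation
        if fits_now and respects_head:
            queue.remove(job)
            free_nodes -= job.nodes
            started.append(job)
    return started

if __name__ == "__main__":
    queue = [Job("big", 8, 10.0), Job("small", 2, 1.0)]
    running = [(5.0, 4)]   # a 4-node job already running ends at t=5
    print([j.name for j in backfill_step(queue, free_nodes=4, running=running)])
    # -> ['small']: it fits now and finishes before big's reservation at t=5
```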

22 Imprecise Calendars
Data structure to manage scheduling for grids
–Permits allocations of time to applications
–Uses a hierarchical representation: each level maintains a calendar for the nodes it manages
–Allows multiple temporal resolutions
Key Features:
–Allows reservations
–Supports co-scheduling of semi-autonomous sites: a site can refuse an individual remote job; small jobs don't need inter-site coordination

23 Multiple Time/Space Resolutions
[Diagram: a calendar refined in space (splitting the node set) and refined in time (splitting slots)]
Parameters
–Number and sizes of slots
–Packing density
Have multiple time-scales at once
–Near events at the finest temporal resolution
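A rough sketch of one calendar in this scheme is shown below (the data layout and method names are invented for illustration, and the hierarchy of per-site calendars is omitted): a site keeps a list of time slots with free-node counts, coarse slots can be refined into finer ones near the present, and a reservation subtracts nodes from every slot it overlaps or is refused.

```python
# Sketch of an imprecise-calendar node: time slots with free-node counts.
# Structure and names are illustrative, not the published data structure.
from dataclasses import dataclass, field

@dataclass
class CalendarSlot:
    start: float
    length: float
    free_nodes: int

@dataclass
class Calendar:
    slots: list = field(default_factory=list)

    def refine(self, index: int, pieces: int):
        """Split one coarse slot into `pieces` finer slots (near-term events
        are kept at the finest temporal resolution)."""
        s = self.slots[index]
        fine = [CalendarSlot(s.start + i * s.length / pieces,
                             s.length / pieces, s.free_nodes)
                for i in range(pieces)]
        self.slots[index:index + 1] = fine

    def reserve(self, start: float, length: float, nodes: int) -> bool:
        """Reserve `nodes` on every slot overlapping [start, start+length)."""
        hit = [s for s in self.slots
               if s.start < start + length and start < s.start + s.length]
        if not hit or any(s.free_nodes < nodes for s in hit):
            return False          # a site may refuse a request it cannot honor
        for s in hit:
            s.free_nodes -= nodes
        return True

if __name__ == "__main__":
    cal = Calendar([CalendarSlot(0, 60, 16), CalendarSlot(60, 60, 16)])
    cal.refine(0, 4)                 # finer resolution for near-term events
    print(cal.reserve(0, 30, 8))     # True: fits in the first two fine slots
```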

24 Evaluation
Approach
–Use traces of job submission to real clusters
–Simulate different scheduling policies: imprecise calendars vs. traditional back-filling schedulers
Metrics for Comparison
–Job completion time (aggregate and by job size)
–Node utilization
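For concreteness, the two comparison metrics can be computed from a simulated schedule roughly as below (the record format is an assumption for this sketch).

```python
# Compute the comparison metrics from a simulated schedule.
# Each record: (submit_time, start_time, end_time, nodes); format assumed here.

def completion_times(schedule):
    # Job completion time = end - submit (queue wait plus run time).
    return [end - submit for submit, start, end, nodes in schedule]

def node_utilization(schedule, total_nodes, horizon):
    # Fraction of node-time actually used over the simulated horizon.
    busy = sum((end - start) * nodes for submit, start, end, nodes in schedule)
    return busy / (total_nodes * horizon)

if __name__ == "__main__":
    sched = [(0, 0, 10, 4), (0, 10, 15, 8)]
    print(sum(completion_times(sched)) / len(sched))            # mean completion time
    print(node_utilization(sched, total_nodes=8, horizon=20))   # 0.5
```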

25 Comparison with Partitioned Cluster
Based on job data from LANL
Treat each cluster as a trading partner

26 Balance of Trade
Jobs are allowed to split across partitions
Significant shift in work from the 128-node partition

27 Large Cluster of Clusters
Each cluster has 336 nodes
–Jobs using < 1/3 of the nodes and < 12 node-hours are scheduled locally
–Jobs were not split between nodes
Data is one month of jobs per node
Workload from CTC SP-2

28 Balance of Trade: Large Clusters

29 Social, Political, and Corporate Barriers
"It's my computer"
–Even if the employer purchased it
Tragedy of the commons
–Who will buy resources?
Chargeback concerns
–HW purchased for one project used by another
Data security concerns
–You want to run our critical jobs where?

30 Globus Toolkit
Collection of Tools
–Security
–Scheduling
–Grid-aware parallel programming
Designed for
–Confederation of dedicated clusters
–Support for parallel programs

31 Condor
Core of tightly coupled tools
–Monitoring of nodes
–Scheduling (including batch queues)
–Checkpointing of jobs
Designed for
–Harvested resources (dedicated nodes too)
–Parameter sweeps using many serial program runs

32 Layout of the Condor Pool
[Diagram: the Central Manager runs a master, collector, and negotiator; cluster nodes run a master and startd; desktops run a master, startd, and schedd]
Courtesy of the Condor Group, University of Wisconsin

33 Conclusion
What the Grid is
–An approach to improve computation utilization
–Support for data migration for large-scale computation
–Several families of tools
–Tools to enable collaboration
What the Grid is not
–Free cycles from heaven

34 Grid Resources
Books
–The Grid 2: Blueprint for a New Computing Infrastructure, Foster & Kesselman, ed.
–Grid Computing: Making the Global Infrastructure a Reality, Berman, Fox & Hey, ed.
Software Distributions
–Condor: www.cs.wisc.edu/condor
–Globus: www.globus.org

