Adaptive Two-level Thread Management for MPI Execution on Multiprogrammed Shared Memory Machines
Kai Shen, Hong Tang, and Tao Yang
Department of Computer Science, University of California, Santa Barbara
http://www.cs.ucsb.edu/research/tmpi
SuperComputing'99
MPI-Based Parallel Computation on Shared Memory Machines
- Shared memory machines (SMMs) and SMM clusters have become popular for high-end computing.
- MPI is a portable, high-performance parallel programming model.
- MPI on SMMs: threads are easier to program, but people still use MPI on SMMs because of
  - better portability to other platforms (e.g. SMM clusters);
  - good data locality due to data partitioning.
Scheduling for Parallel Jobs in Multiprogrammed SMMs
- Gang scheduling: good for parallel programs that synchronize frequently, but resource utilization is low (processor fragmentation; not enough parallelism).
- Space/time sharing: time sharing on dynamically partitioned machines; short response times and high throughput.
- Impact on MPI program execution:
  - not all MPI nodes are scheduled simultaneously;
  - the number of processors available to each application may change dynamically.
- Optimization is needed for fast MPI execution on multiprogrammed SMMs.
Techniques Studied
- Thread-based MPI execution [PPoPP'99]:
  - compile-time transformation for thread-safe MPI execution;
  - fast context switch and synchronization;
  - fast communication through address sharing.
- Two-level thread management for multiprogrammed environments:
  - even faster context switch/synchronization;
  - use of scheduling information to guide synchronization.
- Our prototype system: TMPI.
Related Work
- MPI-related work:
  - MPICH, a portable MPI implementation [Gropp, Lusk, et al.];
  - SGI MPI, highly optimized on SGI platforms;
  - MPI-2, multithreading within a single MPI node.
- Scheduling and synchronization:
  - Process Control [Tucker/Gupta] and Scheduler Activations [Anderson et al.]: focus on OS research;
  - scheduler-conscious synchronization [Kontothanassis et al.]: focus on primitives such as barriers and locks;
  - Hood/Cilk threads [Arora et al.] and loop-level scheduling [Yue/Lilja]: focus on fine-grain parallelism.
Outline
- Motivation & Related Work
- Adaptive Two-level Thread Management
- Scheduler-conscious Event Waiting
- Experimental Studies
Context Switch/Synchronization in Multiprogrammed Environments
- In multiprogrammed environments, synchronization more often leads to a context switch, so context switch/synchronization has a large performance impact.
- Conventional MPI implementations map each MPI node to an OS process.
- Our earlier work maps each MPI node to a kernel thread.
- Two-level thread management maps each MPI node to a user-level thread (see the sketch below):
  - faster context switch and synchronization among user-level threads;
  - very few kernel-level context switches.
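As a concrete illustration, here is a minimal, hypothetical C sketch of the two-level mapping; the type and field names are invented for this example and do not come from TMPI's sources. Each MPI node is a user-level thread with its own stack; the runtime keeps one kernel thread per MPI node but lets only roughly as many of them run as there are allocated processors.

    /* Hypothetical sketch of the two-level mapping; none of these type or
     * field names come from TMPI itself. */
    #include <pthread.h>

    struct msg_queue;             /* shared-memory message queue (opaque here) */

    typedef struct {
        int    rank;              /* MPI rank of this node                     */
        void  *stack;             /* private stack of the user-level thread    */
        struct msg_queue *inbox;  /* incoming point-to-point messages          */
    } mpi_node_t;                 /* one per MPI node (user-level thread)      */

    typedef struct {
        pthread_t   tid;          /* underlying kernel thread                  */
        mpi_node_t *running;      /* MPI node currently mapped onto it, if any */
        int         active;       /* 0 once suspended by the resource manager  */
    } kernel_thread_t;            /* as many as MPI nodes, but only about      */
                                  /* #allocated-processors of them are active  */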
System Architecture
[Diagram: several MPI applications, each running on its own TMPI runtime with user-level threads, all on top of a system-wide resource manager.]
- Targeted at multiprogrammed environments.
- Two-level thread management.
Adaptive Two-level Thread Management
- System-wide resource manager (OS kernel or user-level central monitor):
  - collects information about active MPI applications;
  - partitions processors among them (a sketch follows this list).
- Application-wide user-level thread management:
  - maps each MPI node to a user-level thread;
  - schedules user-level threads on a pool of kernel threads;
  - keeps the number of active kernel threads close to the number of allocated processors.
- Big picture (across the whole system): #active kernel threads ≈ #processors, minimizing kernel-level context switches.
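The slides do not show the resource manager's interface, so the following is only a sketch under assumed names: a central monitor evenly partitions the machine's processors among the currently active MPI applications, capping each application at its number of MPI nodes. Each application then adjusts its number of active kernel threads toward its allocation.

    /* Hypothetical central monitor: repartition processors among active
     * MPI applications.  Names and policy are assumptions for illustration. */
    #include <stdio.h>

    #define NPROCS  8            /* processors in the machine (assumed)       */
    #define MAXAPPS 4            /* maximum concurrently active applications  */

    typedef struct {
        int mpi_nodes;           /* user-level threads (MPI nodes) in the app */
        int allocated;           /* processors granted by the central monitor */
    } app_t;

    /* Evenly partition the machine, never granting an application more
     * processors than it has MPI nodes. */
    static void repartition(app_t apps[], int napps)
    {
        int share    = napps > 0 ? NPROCS / napps : 0;
        int leftover = napps > 0 ? NPROCS % napps : 0;
        for (int i = 0; i < napps; i++) {
            int grant = share + (i < leftover ? 1 : 0);
            apps[i].allocated = grant < apps[i].mpi_nodes ? grant : apps[i].mpi_nodes;
        }
    }

    int main(void)
    {
        app_t apps[MAXAPPS] = { { .mpi_nodes = 8 }, { .mpi_nodes = 4 }, { .mpi_nodes = 2 } };
        repartition(apps, 3);
        for (int i = 0; i < 3; i++)
            printf("app %d: %d MPI nodes, %d processors\n",
                   i, apps[i].mpi_nodes, apps[i].allocated);
        return 0;
    }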
User-level Thread Scheduling
- Every kernel thread is either:
  - active: executing an MPI node (user-level thread); or
  - suspended.
- Execution invariants for each application:
  - #active kernel threads ≈ #allocated processors (minimize kernel-level context switches);
  - #kernel threads = #MPI nodes (avoid dynamic thread creation).
- Every active kernel thread polls the system resource manager, which leads to one of (sketched below):
  - deactivation: suspending itself;
  - activation: waking up suspended kernel threads;
  - no action.
- When to poll?
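A minimal sketch of that poll, assuming `allocated` is updated by the resource manager and using a pthread condition variable to stand in for TMPI's own suspension mechanism (which the slides do not detail):

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;
    static int allocated = 4;    /* processors granted to this application */
    static int active    = 4;    /* kernel threads currently running       */

    /* Called by an active kernel thread during a user-level context switch. */
    static void poll_resource_manager(void)
    {
        pthread_mutex_lock(&lock);
        if (active > allocated) {
            /* Deactivation: too many runners, so suspend this kernel thread. */
            active--;
            while (active >= allocated)
                pthread_cond_wait(&wake, &lock);
            active++;                       /* reactivated later              */
        } else if (active < allocated) {
            /* Activation: processors are idle, wake a suspended peer.        */
            pthread_cond_signal(&wake);
        }
        /* Otherwise: no action, keep running user-level threads.             */
        pthread_mutex_unlock(&lock);
    }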
Polling in User-Level Context Switch
- A context switch is the result of synchronization (e.g. an MPI node waits for a message).
- The underlying kernel thread polls the system resource manager during the context switch:
  - two stack switches if it must deactivate (it suspends itself on a dummy stack);
  - one stack switch otherwise.
- After optimization, about 2 µs on average on the SGI Power Challenge.
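The toy program below illustrates the dummy-stack idea with the POSIX ucontext API; TMPI's actual context-switch code on SGI hardware is not shown in the slides, so this is only an analogy. The blocked MPI node's stack is vacated before the kernel thread parks itself, which is why deactivation costs two stack switches.

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, node_ctx, dummy_ctx;
    static char node_stack[64 * 1024], dummy_stack[16 * 1024];

    static void park(void)
    {
        /* In TMPI the kernel thread would block here until reactivated;
         * in this toy demo it just returns to the main context. */
        puts("parked on dummy stack (second stack switch happens on wake-up)");
        swapcontext(&dummy_ctx, &main_ctx);
    }

    static void mpi_node(void)
    {
        puts("MPI node runs, then blocks waiting for a message");
        /* Deactivation path: two stack switches (node -> dummy, dummy -> next). */
        swapcontext(&node_ctx, &dummy_ctx);
    }

    int main(void)
    {
        getcontext(&dummy_ctx);
        dummy_ctx.uc_stack.ss_sp   = dummy_stack;
        dummy_ctx.uc_stack.ss_size = sizeof dummy_stack;
        dummy_ctx.uc_link          = &main_ctx;
        makecontext(&dummy_ctx, park, 0);

        getcontext(&node_ctx);
        node_ctx.uc_stack.ss_sp    = node_stack;
        node_ctx.uc_stack.ss_size  = sizeof node_stack;
        node_ctx.uc_link           = &main_ctx;
        makecontext(&node_ctx, mpi_node, 0);

        swapcontext(&main_ctx, &node_ctx);   /* run the MPI node */
        puts("back in the scheduler");
        return 0;
    }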
Outline
- Motivation & Related Work
- Adaptive Two-level Thread Management
- Scheduler-conscious Event Waiting
- Experimental Studies
Event Waiting Synchronization
- All MPI synchronization is built on waitEvent.
- Waiter side: waitEvent(*pflag == value); waiting can be spinning or yielding/blocking.
- Caller side: sets *pflag = value and wakes up the waiter.
- A spin-then-block sketch follows.
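A minimal spin-then-block sketch of this primitive, using pthreads to stand in for TMPI's user-level blocking (in TMPI the waiter is a user-level thread that yields to the scheduler rather than blocking a kernel thread):

    #include <pthread.h>

    typedef struct {
        volatile int    flag;     /* the location the waiter watches (*pflag) */
        pthread_mutex_t lock;
        pthread_cond_t  cond;
    } event_t;

    enum { SPIN_LIMIT = 1000 };   /* spin iterations before blocking (tunable) */

    /* Waiter: waitEvent(*pflag == value) */
    void waitEvent(event_t *ev, int value)
    {
        for (int i = 0; i < SPIN_LIMIT; i++)       /* spin phase  */
            if (ev->flag == value)
                return;
        pthread_mutex_lock(&ev->lock);             /* block phase */
        while (ev->flag != value)
            pthread_cond_wait(&ev->cond, &ev->lock);
        pthread_mutex_unlock(&ev->lock);
    }

    /* Caller: *pflag = value, then wakeup */
    void setEvent(event_t *ev, int value)
    {
        pthread_mutex_lock(&ev->lock);
        ev->flag = value;
        pthread_cond_broadcast(&ev->cond);
        pthread_mutex_unlock(&ev->lock);
    }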
Tradeoff Between Spinning and Blocking
- Basic rules for spin-then-block waiting:
  - spinning wastes CPU cycles;
  - blocking introduces context-switch overhead, and always blocking is not good for dedicated environments.
- Previous work focuses on choosing the best spin time.
- Our optimization focus and findings:
  - fast context switch has a substantial performance impact;
  - scheduling information can guide the spin/block decision:
    - spinning is futile when the caller is not currently scheduled;
    - most of the blocking cost comes from the cache-flushing penalty (the actual cost varies, up to several ms).
Scheduler-conscious Event Waiting
- The user-level scheduler provides:
  - scheduling information (is the caller currently scheduled?);
  - affinity information.
- A decision sketch follows.
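A sketch of the resulting decision, with hypothetical query hooks standing in for the scheduling information exported by the user-level scheduler; the affinity information would additionally inform how costly blocking is, but is omitted here.

    /* Hypothetical hooks; they are not part of any API shown in the slides. */
    extern int  caller_is_scheduled(int caller_node);   /* is the setter running now? */
    extern void spin_then_block(volatile int *pflag, int value);
    extern void block_immediately(volatile int *pflag, int value);

    /* Scheduler-conscious waitEvent: only spin when spinning can possibly pay
     * off, i.e. when the MPI node expected to set the flag is currently
     * scheduled on a processor; otherwise give up the processor right away. */
    void waitEvent_sc(volatile int *pflag, int value, int caller_node)
    {
        if (*pflag == value)
            return;                                 /* already signalled        */
        if (caller_is_scheduled(caller_node))
            spin_then_block(pflag, value);          /* spinning may succeed     */
        else
            block_immediately(pflag, value);        /* spinning would be futile */
    }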
Experimental Settings
- Machines:
  - SGI Origin 2000 with 32 195 MHz MIPS R10000 processors and 2 GB of memory;
  - SGI Power Challenge with 4 200 MHz MIPS R4400 processors and 256 MB of memory.
- Systems compared:
  - TMPI-2: TMPI with two-level thread management;
  - SGI MPI: SGI's native MPI implementation;
  - TMPI: the original TMPI without two-level thread management.
Testing Benchmarks
- Synchronization frequency is measured by running each benchmark with 4 MPI nodes on the 4-processor Power Challenge.
- The higher the multiprogramming degree, the more often synchronization leads to a context switch.
- The sparse LU benchmarks synchronize much more frequently than the others.
Performance Evaluation on a Multiprogrammed Workload
- Workload: a sequence of six jobs launched at fixed intervals.
- Metric: job turnaround time on the Power Challenge.
Workloads with Certain Multiprogramming Degrees
- Goal: identify the performance impact of the multiprogramming degree.
- Experimental setting:
  - each workload contains one benchmark program;
  - run n MPI nodes on p processors (n ≥ p); the multiprogramming degree is n/p.
- Metric: megaflop rate or speedup of the kernel part of each application.
Performance Impact of Multiprogramming Degree (SGI Power Challenge)
Performance Impact of Multiprogramming Degree (SGI Origin 2000)
[Charts: performance ratios of TMPI-2 over TMPI; performance ratios of TMPI-2 over SGI MPI.]
Benefits of Scheduler-conscious Event Waiting
[Charts: improvement over simple spin-then-block on the Power Challenge and on the Origin 2000.]
Conclusions
- Contributions for optimizing MPI execution:
  - adaptive two-level thread management;
  - scheduler-conscious event waiting;
  - large performance improvements: up to an order of magnitude, depending on the application and load;
  - in multiprogrammed environments, fast context switch/synchronization is important even for MPI programs that communicate infrequently.
- Current and future work: support threaded MPI on SMP clusters.
- http://www.cs.ucsb.edu/research/tmpi