1 Adaptive Two-level Thread Management for MPI Execution on Multiprogrammed Shared Memory Machines
Kai Shen, Hong Tang, and Tao Yang
Department of Computer Science, University of California, Santa Barbara

2 MPI-Based Parallel Computation on Shared Memory Machines
Shared memory machines (SMMs) and SMM clusters have become popular for high-end computing, and MPI is a portable, high-performance parallel programming model.
MPI on SMMs: threads are easier to program, but MPI is still used on SMMs for:
- better portability to other platforms (e.g., SMM clusters);
- good data locality due to data partitioning.

3 Scheduling for Parallel Jobs in Multiprogrammed SMMs
Gang scheduling:
- Good for parallel programs that synchronize frequently;
- Hurts resource utilization (processor fragmentation; not enough parallelism to use the allocated resources).
Space/time sharing:
- Time sharing combined with dynamic partitioning;
- High throughput; popular in current OSes (e.g., IRIX 6.5).
Impact on MPI program execution:
- Not all MPI nodes are scheduled simultaneously;
- The number of available processors for each application may change dynamically.
Optimization is needed for fast MPI execution on SMMs.

4 Techniques Studied
Thread-based MPI execution [PPoPP'99]:
- Compile-time transformation for thread-safe MPI execution;
- Fast context switch and synchronization;
- Fast communication through address sharing.
Two-level thread management for multiprogrammed environments:
- Even faster context switch/synchronization;
- Use scheduling information to guide synchronization.
Our prototype system: TMPI.

5 Impact of Synchronization on Coarse-Grain Parallel Programs
Running a communication-infrequent MPI program (SWEEP3D) on 8 SGI Origin 2000 processors with multiprogramming degree 3, synchronization costs 43%-84% of the total time.
Chart: execution time breakdown for TMPI and SGI MPI.

6 Related Work
MPI-related work:
- MPICH, a portable MPI implementation [Gropp/Lusk et al.];
- SGI MPI, highly optimized on SGI platforms;
- MPI-2, multithreading within a single MPI node.
Scheduling and synchronization:
- Process Control [Tucker/Gupta] and Scheduler Activations [Anderson et al.]: focus on OS research;
- Scheduler-conscious synchronization [Kontothanassis et al.]: focuses on primitives such as barriers and locks;
- Hood/Cilk threads [Arora et al.] and loop-level scheduling [Yue/Lilja]: focus on fine-grain parallelism.

7 Outline
- Motivations & Related Work
- Adaptive Two-level Thread Management
- Scheduler-conscious Event Waiting
- Experimental Studies

8 Context Switch/Synchronization in Multiprogrammed Environments
In multiprogrammed environments, synchronization leads to more context switches, which has a large performance impact.
- A conventional MPI implementation maps each MPI node to an OS process.
- Our earlier work maps each MPI node to a kernel thread.
- Two-level thread management maps each MPI node to a user-level thread:
  - faster context switch and synchronization among user-level threads;
  - very few kernel-level context switches.

9 System Architecture
Architecture diagram: multiple MPI applications, each running on its own TMPI runtime with user-level threads, on top of a system-wide resource manager.
- Targeted at multiprogrammed environments
- Two-level thread management

10 Adaptive Two-level Thread Management
System-wide resource manager (OS kernel or user-level central monitor):
- collects information about active MPI applications;
- partitions processors among them.
Application-wide user-level thread management:
- maps each MPI node to a user-level thread;
- schedules user-level threads on a pool of kernel threads;
- keeps the number of active kernel threads close to the number of allocated processors.
Big picture (in the whole system): #active kernel threads ≈ #processors, minimizing kernel-level context switches.
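A minimal sketch of the partitioning step, assuming a simple even split of processors among active applications; all names (app_t, monitor_partition, ...) are hypothetical illustrations, not TMPI's actual interface:

    #define MAX_APPS 64

    typedef struct {
        int registered;       /* slot in use?                          */
        int num_mpi_nodes;    /* parallelism requested by the app      */
        int allocated_procs;  /* processors granted this round         */
    } app_t;

    static app_t apps[MAX_APPS];

    /* Divide total_procs among registered applications and publish each
     * allocation so the application's user-level scheduler can adjust its
     * number of active kernel threads. */
    void monitor_partition(int total_procs)
    {
        int active = 0;
        for (int i = 0; i < MAX_APPS; i++)
            if (apps[i].registered) active++;
        if (active == 0) return;

        int share = total_procs / active;
        int extra = total_procs % active;

        for (int i = 0; i < MAX_APPS; i++) {
            if (!apps[i].registered) continue;
            int alloc = share + (extra > 0 ? 1 : 0);
            if (extra > 0) extra--;
            /* never allocate more processors than the app has MPI nodes */
            if (alloc > apps[i].num_mpi_nodes) alloc = apps[i].num_mpi_nodes;
            apps[i].allocated_procs = alloc;   /* read later by the app's poll */
        }
    }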

11 User-level Thread Scheduling
Every kernel thread is either:
- active: executing an MPI node (user-level thread), or
- suspended.
Execution invariants for each application:
- #active kernel threads ≈ #allocated processors (minimize kernel-level context switches);
- #kernel threads = #MPI nodes (avoid dynamic thread creation).
Every active kernel thread polls the system resource manager, which leads to one of:
- deactivation: suspending itself;
- activation: waking up some suspended kernel threads;
- no action.
When to poll?
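A hedged sketch of the poll decision an active kernel thread makes; the counters and semaphore are illustrative bookkeeping kept by the application-wide scheduler, and allocated_procs stands for whatever the system-wide resource manager publishes:

    #include <semaphore.h>
    #include <stdatomic.h>

    static _Atomic int active_kthreads;   /* kernel threads currently running MPI nodes */
    static _Atomic int allocated_procs;   /* processors granted by the resource manager */
    static sem_t suspend_sem;             /* suspended kernel threads sleep here */

    typedef enum { POLL_NO_ACTION, POLL_DEACTIVATE, POLL_ACTIVATE } poll_action_t;

    poll_action_t poll_resource_manager(void)
    {
        int active    = atomic_load(&active_kthreads);
        int allocated = atomic_load(&allocated_procs);

        if (active > allocated) {
            /* Too many runnable kernel threads: this one should suspend itself. */
            atomic_fetch_sub(&active_kthreads, 1);
            return POLL_DEACTIVATE;        /* caller then blocks on suspend_sem */
        }
        if (active < allocated) {
            /* Processors became available: wake one suspended kernel thread. */
            sem_post(&suspend_sem);
            return POLL_ACTIVATE;          /* the woken thread re-increments the count */
        }
        return POLL_NO_ACTION;
    }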

12 Polling in User-Level Context Switch
A context switch is a result of synchronization (e.g., an MPI node waits for a message). The underlying kernel thread polls the system resource manager during the context switch:
- two stack switches if deactivation (suspend on a dummy stack);
- one stack switch otherwise.
After optimization, a context switch takes about 2 μs on average on the SGI Power Challenge.
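An illustrative sketch (not TMPI's code) of how the poll result could be folded into the user-level context switch; ucontext(3) stands in for the hand-tuned stack switch, dummy_ctx is assumed to have been prepared with makecontext() to run the suspend routine on a small dummy stack, and pick_ready_mpi_node() is a placeholder for the user-level run queue:

    #include <ucontext.h>

    typedef enum { POLL_NO_ACTION, POLL_DEACTIVATE, POLL_ACTIVATE } poll_action_t;
    poll_action_t poll_resource_manager(void);   /* see the previous sketch */
    ucontext_t   *pick_ready_mpi_node(void);     /* next runnable MPI node  */

    static ucontext_t dummy_ctx;   /* set up elsewhere with makecontext() */

    void context_switch_from(ucontext_t *blocked_node)
    {
        if (poll_resource_manager() == POLL_DEACTIVATE) {
            /* Deactivation: two stack switches. Save the blocked MPI node and
             * hop onto the dummy stack, where the kernel thread suspends
             * itself; a second switch happens after reactivation. */
            swapcontext(blocked_node, &dummy_ctx);
        } else {
            /* Otherwise: one stack switch, directly to another ready node. */
            swapcontext(blocked_node, pick_ready_mpi_node());
        }
    }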

13 Outline
- Motivations & Related Work
- Adaptive Two-level Thread Management
- Scheduler-conscious Event Waiting
- Experimental Studies

14 Event Waiting Synchronization
All MPI synchronization is based on waitEvent.
- The waiter calls waitEvent(*pflag == value); waiting can be spinning or yielding/blocking.
- The caller sets *pflag = value, which wakes up the waiter.
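A minimal sketch of an event-wait primitive in the spirit of the slide's waitEvent(*pflag == value): the waiter spins on the flag and falls back to yielding, and the caller's store acts as the wakeup. The names and the spin bound are illustrative only:

    #include <sched.h>
    #include <stdatomic.h>

    #define SPIN_LIMIT 1000               /* arbitrary bound for the sketch */

    /* Waiter side: returns once *pflag == value. */
    void wait_event(const _Atomic int *pflag, int value)
    {
        int spins = 0;
        while (atomic_load(pflag) != value) {
            if (++spins >= SPIN_LIMIT) {  /* stop burning the processor */
                sched_yield();
                spins = 0;
            }
        }
    }

    /* Caller side: the store of the value is the "wakeup". */
    void set_event(_Atomic int *pflag, int value)
    {
        atomic_store(pflag, value);
    }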

15 Tradeoff between spin and block
Basic rules for waiting using spin-then-block:
- Spinning wastes CPU cycles.
- Blocking introduces context switch overhead; always blocking is not good for dedicated environments.
- Previous work focuses on choosing the best spin time.
Our optimization focus and findings:
- Fast context switch has a substantial performance impact.
- Use scheduling information to guide the spin/block decision:
  - spinning is futile when the caller is not currently scheduled;
  - most blocking cost comes from the cache-flushing penalty (the actual cost varies, up to several ms).

16 Scheduler-conscious Event Waiting
The user-level scheduler provides scheduling info and affinity info to guide the waiting decision.
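A hedged sketch of scheduler-conscious waiting: if the scheduler reports that the caller (the node that will set the flag) is not currently running on any processor, spinning cannot help, so the waiter blocks right away; otherwise it spins briefly before blocking. is_scheduled(), spin_bound(), and block_current_thread() are hypothetical hooks into the user-level scheduler's scheduling and affinity information, not TMPI's real API:

    #include <stdatomic.h>
    #include <stdbool.h>

    bool is_scheduled(int caller_node);     /* scheduling info               */
    int  spin_bound(int caller_node);       /* tuned with affinity/cost info */
    void block_current_thread(void);        /* fast user-level block         */

    void wait_event_sc(const _Atomic int *pflag, int value, int caller_node)
    {
        while (atomic_load(pflag) != value) {
            if (!is_scheduled(caller_node)) {
                block_current_thread();      /* spinning would be futile */
                continue;
            }
            /* Caller is running somewhere: spin for a bounded time first. */
            int limit = spin_bound(caller_node);
            for (int i = 0; i < limit && atomic_load(pflag) != value; i++)
                ;                            /* bounded busy wait */
            if (atomic_load(pflag) != value)
                block_current_thread();      /* caller running but still slow */
        }
    }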

17 Experimental Settings
Machines:
- SGI Origin 2000 system with MIPS R10000 processors and 2GB memory;
- SGI Power Challenge with 4 200MHz MIPS R4400s and 256MB memory.
Compared systems:
- TMPI-2: TMPI with two-level thread management;
- SGI MPI: SGI's native MPI implementation;
- TMPI: original TMPI without two-level thread management.

18 Testing Benchmarks
- Synchronization frequency is obtained by running each benchmark with 4 MPI nodes on the 4-processor Power Challenge.
- The higher the multiprogramming degree, the more spin-blocks (context switches) during each synchronization.
- The sparse LU benchmarks have much more frequent synchronization than the others.

19 Performance Evaluation on a Multiprogrammed Workload
- The workload contains a sequence of six jobs launched at a fixed interval.
- We compare job turnaround time on the Power Challenge.

20 Workload with Certain Multiprogramming Degrees
Goal: identify the performance impact of the multiprogramming degree.
Experimental setting:
- Each workload has one benchmark program.
- Run n MPI nodes on p processors (n ≥ p); the multiprogramming degree is n/p.
- Compare megaflop rates or speedups of the kernel part of each application.

21 Performance Impact of Multiprogramming Degree (SGI Power Challenge)

22 Performance Impact of Multiprogramming Degree (SGI Origin 2000)
Charts: performance ratios of TMPI-2 over TMPI, and of TMPI-2 over SGI MPI.

23 Benefits of Scheduler-conscious Event Waiting
Charts: improvement over simple spin-then-block on the Power Challenge and on the Origin 2000.

24 Conclusions
Contributions for optimizing MPI execution:
- Adaptive two-level thread management;
- Scheduler-conscious event waiting;
- Large performance improvement: up to an order of magnitude, depending on the application and load;
- In multiprogrammed environments, fast context switch/synchronization is important even for communication-infrequent MPI programs.
Current and future work:
- Support threaded MPI on SMP clusters.

