Adaptive Two-level Thread Management for MPI Execution on Multiprogrammed Shared Memory Machines
Kai Shen, Hong Tang, and Tao Yang
Department of Computer Science, University of California, Santa Barbara
http://www.cs.ucsb.edu/research/tmpi

MPI-Based Parallel Computation on Shared Memory Machines
- Shared memory machines (SMMs) and SMM clusters have become popular for high-end computing.
- MPI is a portable, high-performance parallel programming model.
- MPI on SMMs: threads are easier to program, but MPI is still widely used on SMMs because of
  - better portability to other platforms (e.g., SMM clusters);
  - good data locality due to explicit data partitioning.

Scheduling for Parallel Jobs on Multiprogrammed SMMs
- Gang scheduling
  - Good for parallel programs that synchronize frequently;
  - Hurts resource utilization (processor fragmentation; not enough parallelism to use the allocated resources).
- Space/time sharing
  - Time sharing combined with dynamic partitioning;
  - High throughput; popular in current operating systems (e.g., IRIX 6.5).
- Impact on MPI program execution
  - Not all MPI nodes are scheduled simultaneously;
  - The number of processors available to each application may change dynamically.
- Optimization is therefore needed for fast MPI execution on SMMs.

Techniques Studied
- Thread-based MPI execution [PPoPP'99]
  - Compile-time transformation for thread-safe MPI execution (illustrated below);
  - Fast context switch and synchronization;
  - Fast communication through address sharing.
- Two-level thread management for multiprogrammed environments
  - Even faster context switch/synchronization;
  - Scheduling information used to guide synchronization.
- Our prototype system: TMPI
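As a hedged illustration of what the compile-time transformation has to accomplish (this is not the actual TMPI transformation, and MAX_NODES and my_node_id() are hypothetical helpers): when every MPI node becomes a thread in a single address space, each per-process global variable in the original program must be privatized into a per-node copy.

    /* Illustrative only: a per-process global becomes a per-node copy once MPI
     * nodes share one address space.  MAX_NODES and my_node_id() are
     * hypothetical helpers, not part of TMPI's actual interface. */
    #define MAX_NODES 64

    static int my_node_id(void) { return 0; }   /* stub; the runtime would return the calling node's rank */

    /* Original MPI program (one process per node):
     *     int iteration_count = 0;
     */

    /* After privatization: one copy per MPI node, selected at every reference. */
    static int iteration_count_priv[MAX_NODES];
    #define iteration_count (iteration_count_priv[my_node_id()])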

Impact of Synchronization on Coarse-Grain Parallel Programs
- Running a communication-infrequent MPI program (SWEEP3D) on 8 SGI Origin 2000 processors with multiprogramming degree 3.
- Synchronization costs 43%-84% of total time.
- [Chart: execution time breakdown for TMPI and SGI MPI.]

Related Work
- MPI-related work
  - MPICH, a portable MPI implementation [Gropp/Lusk et al.];
  - SGI MPI, highly optimized on SGI platforms;
  - MPI-2, multithreading within a single MPI node.
- Scheduling and synchronization
  - Process Control [Tucker/Gupta] and Scheduler Activations [Anderson et al.]: focus on OS research;
  - Scheduler-conscious synchronization [Kontothanassis et al.]: focus on primitives such as barriers and locks;
  - Hood/Cilk threads [Arora et al.] and loop-level scheduling [Yue/Lilja]: focus on fine-grain parallelism.

Outline
- Motivations & Related Work
- Adaptive Two-level Thread Management
- Scheduler-conscious Event Waiting
- Experimental Studies

Context Switch/Synchronization in Multiprogrammed Environments
- In multiprogrammed environments, synchronization leads to more context switches, and hence a large performance impact.
- Conventional MPI implementations map each MPI node to an OS process.
- Our earlier work maps each MPI node to a kernel thread.
- Two-level thread management maps each MPI node to a user-level thread:
  - faster context switch and synchronization among user-level threads;
  - very few kernel-level context switches.

System Architecture
- [Diagram: multiple MPI applications, each running on its own TMPI runtime with its own pool of user-level threads, on top of a system-wide resource manager.]
- Targeted at multiprogrammed environments.
- Two-level thread management.

Adaptive Two-level Thread Management
- System-wide resource manager (OS kernel or a user-level central monitor)
  - collects information about active MPI applications;
  - partitions processors among them.
- Application-wide user-level thread management
  - maps each MPI node to a user-level thread;
  - schedules user-level threads on a pool of kernel threads;
  - keeps the number of active kernel threads close to the number of allocated processors.
- Big picture (across the whole system): #active kernel threads ≈ #processors, which minimizes kernel-level context switches.
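A minimal sketch of this division of labor, assuming a user-level central monitor; the names (resource_manager_t, repartition, my_allocation) are illustrative, not TMPI's API. The monitor splits the machine's processors evenly among active applications, and each application reads its share to decide how many kernel threads to keep active.

    /* Hypothetical sketch (not the TMPI source): a central monitor that splits
     * the machine's processors among active MPI applications, plus the query
     * each application uses to cap its number of active kernel threads. */
    #include <pthread.h>

    #define MAX_APPS 64

    typedef struct {
        int num_procs;               /* processors in the machine          */
        int num_apps;                /* currently active MPI applications  */
        int alloc[MAX_APPS];         /* processors allocated to each app   */
        pthread_mutex_t lock;        /* initialized with pthread_mutex_init */
    } resource_manager_t;

    /* Recompute the partition whenever an application starts or exits. */
    static void repartition(resource_manager_t *rm)
    {
        pthread_mutex_lock(&rm->lock);
        int base  = rm->num_procs / rm->num_apps;
        int extra = rm->num_procs % rm->num_apps;
        for (int i = 0; i < rm->num_apps; i++)
            rm->alloc[i] = base + (i < extra ? 1 : 0);
        pthread_mutex_unlock(&rm->lock);
    }

    /* Each application polls its allocation and keeps
     * #active kernel threads ~= #allocated processors. */
    static int my_allocation(resource_manager_t *rm, int app_id)
    {
        pthread_mutex_lock(&rm->lock);
        int n = rm->alloc[app_id];
        pthread_mutex_unlock(&rm->lock);
        return n;
    }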

User-level Thread Scheduling
- Every kernel thread is either
  - active: executing an MPI node (a user-level thread); or
  - suspended.
- Execution invariants for each application:
  - #active kernel threads ≈ #allocated processors (minimize kernel-level context switches);
  - #kernel threads = #MPI nodes (avoid dynamic thread creation).
- Every active kernel thread polls the system resource manager, which leads to one of the following (see the sketch below):
  - deactivation: suspending itself;
  - activation: waking up some suspended kernel threads;
  - no action.
- When to poll?
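The following hedged sketch (illustrative names, not the TMPI code) shows the decision an active kernel thread could make when it polls: deactivate itself if the application holds more active kernel threads than allocated processors, wake a suspended peer if it holds fewer, and otherwise do nothing. In TMPI the suspended thread parks on a dummy stack and later picks up a ready user-level thread; here it simply blocks in place to keep the sketch short.

    /* Hypothetical poll performed during a user-level context switch. */
    #include <pthread.h>

    typedef struct {
        int active;                    /* kernel threads currently running MPI nodes */
        int allocated;                 /* processors granted by the resource manager */
        pthread_mutex_t lock;
        pthread_cond_t  wakeup;        /* suspended kernel threads wait here         */
    } app_sched_t;

    static void poll_resource_manager(app_sched_t *s)
    {
        pthread_mutex_lock(&s->lock);
        if (s->active > s->allocated) {          /* lost a processor: deactivate    */
            s->active--;
            while (s->active >= s->allocated)    /* stay suspended until a processor frees up */
                pthread_cond_wait(&s->wakeup, &s->lock);
            s->active++;                         /* reactivated                     */
        } else if (s->active < s->allocated) {   /* gained a processor: wake a peer */
            pthread_cond_signal(&s->wakeup);
        }                                        /* otherwise: no action            */
        pthread_mutex_unlock(&s->lock);
    }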

Polling in User-Level Context Switch
- A context switch is the result of synchronization (e.g., an MPI node waits for a message).
- The underlying kernel thread polls the system resource manager during each context switch:
  - two stack switches if it must deactivate (it suspends on a dummy stack);
  - one stack switch otherwise.
- After optimization, about 2 µs on average on the SGI Power Challenge.

Outline
- Motivations & Related Work
- Adaptive Two-level Thread Management
- Scheduler-conscious Event Waiting
- Experimental Studies

Event Waiting Synchronization
- All MPI synchronization is built on a waitEvent primitive.
- Waiter side: waitEvent(*pflag == value); waiting can be spinning or yielding/blocking.
- Caller side: *pflag = value; then wake up the waiter.
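To make the primitive concrete, here is a minimal spin-then-block sketch of a waitEvent-style wait and its matching caller-side update. In TMPI, "blocking" means a user-level context switch to another MPI node rather than a pthread condition wait, so this pthread-based version and the SPIN_LIMIT constant are illustrative assumptions, not the actual implementation.

    /* Minimal spin-then-block sketch of a waitEvent-style primitive. */
    #include <pthread.h>

    typedef struct {
        volatile int   *pflag;         /* flag word written by the signalling node */
        pthread_mutex_t lock;
        pthread_cond_t  cond;
    } event_t;

    enum { SPIN_LIMIT = 1000 };        /* spin budget before blocking (illustrative) */

    static void wait_event(event_t *ev, int value)
    {
        for (int i = 0; i < SPIN_LIMIT; i++)       /* spin phase */
            if (*ev->pflag == value)
                return;

        pthread_mutex_lock(&ev->lock);             /* block phase */
        while (*ev->pflag != value)
            pthread_cond_wait(&ev->cond, &ev->lock);
        pthread_mutex_unlock(&ev->lock);
    }

    static void set_event(event_t *ev, int value)  /* caller side: *pflag = value; wakeup */
    {
        pthread_mutex_lock(&ev->lock);
        *ev->pflag = value;
        pthread_cond_broadcast(&ev->cond);
        pthread_mutex_unlock(&ev->lock);
    }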

Tradeoff Between Spin and Block
- Basic rules for spin-then-block waiting:
  - Spinning wastes CPU cycles;
  - Blocking introduces context-switch overhead, and always blocking is bad in dedicated environments.
- Previous work focuses on choosing the best spin time.
- Our optimization focus and findings:
  - Fast context switch has a substantial performance impact;
  - Scheduling information can guide the spin/block decision:
    - spinning is futile when the caller is not currently scheduled;
    - most of the blocking cost comes from the cache-flushing penalty (the actual cost varies, up to several ms).

Scheduler-conscious Event Waiting
- The user-level scheduler provides:
  - scheduling information (is the caller currently running on a processor?);
  - affinity information (does the waiter still have a warm cache on its processor?).
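Below is a hedged sketch of how that information might steer the spin/block decision. caller_is_scheduled, had_recent_affinity, and userlevel_yield stand in for queries and a user-level thread switch provided by the scheduler; they are stubbed here only so the sketch compiles, and the real TMPI policy may differ in detail.

    /* Illustrative scheduler-conscious waiting: spin only while the writer can
     * make progress and the waiter still has useful cache state; otherwise
     * give up the processor to another user-level thread. */
    #include <sched.h>

    static int  caller_is_scheduled(int caller_node) { (void)caller_node; return 1; }  /* stub */
    static int  had_recent_affinity(int my_node)     { (void)my_node;     return 1; }  /* stub */
    static void userlevel_yield(void)                { sched_yield(); }   /* stand-in for a user-level switch */

    static void scheduler_conscious_wait(volatile int *pflag, int value,
                                         int caller_node, int my_node)
    {
        while (*pflag != value) {
            if (!caller_is_scheduled(caller_node) || !had_recent_affinity(my_node))
                userlevel_yield();     /* spinning is futile or cache is cold: block/yield */
            /* otherwise keep spinning briefly and re-check the flag */
        }
    }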

Experimental Settings
- Machines:
  - SGI Origin 2000 with 32 195 MHz MIPS R10000 processors and 2 GB of memory;
  - SGI Power Challenge with 4 200 MHz MIPS R4400 processors and 256 MB of memory.
- Systems compared:
  - TMPI-2: TMPI with two-level thread management;
  - SGI MPI: SGI's native MPI implementation;
  - TMPI: the original TMPI without two-level thread management.

Testing Benchmarks
- Synchronization frequency is measured by running each benchmark with 4 MPI nodes on the 4-processor Power Challenge.
- The higher the multiprogramming degree, the more spin-blocks (context switches) during each synchronization.
- The sparse LU benchmarks synchronize much more frequently than the others.

Performance Evaluation on a Multiprogrammed Workload
- Workload: a sequence of six jobs launched at a fixed interval.
- We compare job turnaround times on the Power Challenge.

Workloads with Fixed Multiprogramming Degrees
- Goal: identify the performance impact of the multiprogramming degree.
- Experimental setting:
  - Each workload contains one benchmark program;
  - n MPI nodes run on p processors (n ≥ p), so the multiprogramming degree is n/p.
- We compare megaflop rates or speedups of the kernel part of each application.

Performance Impact of Multiprogramming Degree (SGI Power Challenge)

Performance Impact of Multiprogramming Degree (SGI Origin 2000)
- [Charts: performance ratios of TMPI-2 over TMPI, and of TMPI-2 over SGI MPI.]

Benefits of Scheduler-conscious Event Waiting
- [Charts: improvement over simple spin-then-block waiting on the Power Challenge and on the Origin 2000.]

Conclusions
- Contributions for optimizing MPI execution:
  - adaptive two-level thread management;
  - scheduler-conscious event waiting;
  - large performance improvements: up to an order of magnitude, depending on the application and load.
- In multiprogrammed environments, fast context switch/synchronization is important even for communication-infrequent MPI programs.
- Current and future work: support threaded MPI on SMP clusters.
- http://www.cs.ucsb.edu/research/tmpi