Adaptive Two-level Thread Management for MPI Execution on Multiprogrammed Shared Memory Machines
Kai Shen, Hong Tang, and Tao Yang
Department of Computer Science, University of California, Santa Barbara
http://www.cs.ucsb.edu/research/tmpi
SuperComputing'99
MPI-Based Parallel Computation on Shared Memory Machines
- Shared memory machines (SMMs) and SMM clusters have become popular for high-end computing.
- MPI is a portable, high-performance parallel programming model.
- MPI on SMMs: threads are easier to program, but people still use MPI on SMMs because of
  - better portability to other platforms (e.g. SMM clusters);
  - good data locality due to data partitioning.
Scheduling for Parallel Jobs in Multiprogrammed SMMs
- Gang scheduling: good for parallel programs that synchronize frequently, but resource utilization is low (processor fragmentation; not enough parallelism).
- Space/time sharing: time sharing on dynamically partitioned machines; short response times and high throughput.
- Impact on MPI program execution:
  - not all MPI nodes are scheduled simultaneously;
  - the number of processors available to each application may change dynamically.
- Optimization is needed for fast MPI execution on multiprogrammed SMMs.
Techniques Studied
- Thread-based MPI execution [PPoPP'99]:
  - compile-time transformation for thread-safe MPI execution;
  - fast context switch and synchronization;
  - fast communication through address sharing.
- Two-level thread management for multiprogrammed environments:
  - even faster context switch/synchronization;
  - use of scheduling information to guide synchronization.
- Our prototype system: TMPI.
Related Work
- MPI-related work:
  - MPICH, a portable MPI implementation [Gropp, Lusk, et al.];
  - SGI MPI, highly optimized on SGI platforms;
  - MPI-2, multithreading within a single MPI node.
- Scheduling and synchronization:
  - Process Control [Tucker/Gupta] and Scheduler Activations [Anderson et al.]: focus on OS research;
  - scheduler-conscious synchronization [Kontothanassis et al.]: focus on primitives such as barriers and locks;
  - Hood/Cilk threads [Arora et al.] and loop-level scheduling [Yue/Lilja]: focus on fine-grain parallelism.
Outline
- Motivation & Related Work
- Adaptive Two-level Thread Management
- Scheduler-conscious Event Waiting
- Experimental Studies
Context Switch/Synchronization in Multiprogrammed Environments
- In multiprogrammed environments, synchronization more often leads to a context switch, so context switch/synchronization has a large performance impact.
- Conventional MPI implementations map each MPI node to an OS process.
- Our earlier work maps each MPI node to a kernel thread.
- Two-level thread management maps each MPI node to a user-level thread (see the sketch below):
  - faster context switch and synchronization among user-level threads;
  - very few kernel-level context switches.
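As a concrete illustration, here is a minimal, hypothetical C sketch of the two-level mapping; the type and field names are invented for this example and do not come from TMPI's sources. Each MPI node is a user-level thread with its own stack; the runtime keeps one kernel thread per MPI node but lets only roughly as many of them run as there are allocated processors.

    /* Hypothetical sketch of the two-level mapping; none of these type or
     * field names come from TMPI itself. */
    #include <pthread.h>

    struct msg_queue;             /* shared-memory message queue (opaque here) */

    typedef struct {
        int    rank;              /* MPI rank of this node                     */
        void  *stack;             /* private stack of the user-level thread    */
        struct msg_queue *inbox;  /* incoming point-to-point messages          */
    } mpi_node_t;                 /* one per MPI node (user-level thread)      */

    typedef struct {
        pthread_t   tid;          /* underlying kernel thread                  */
        mpi_node_t *running;      /* MPI node currently mapped onto it, if any */
        int         active;       /* 0 once suspended by the resource manager  */
    } kernel_thread_t;            /* as many as MPI nodes, but only about      */
                                  /* #allocated-processors of them are active  */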
System Architecture
[Diagram: several MPI applications, each running on its own TMPI runtime with user-level threads, all on top of a system-wide resource manager.]
- Targeted at multiprogrammed environments.
- Two-level thread management.
Adaptive Two-level Thread Management
- System-wide resource manager (OS kernel or user-level central monitor):
  - collects information about active MPI applications;
  - partitions processors among them (a sketch follows this list).
- Application-wide user-level thread management:
  - maps each MPI node to a user-level thread;
  - schedules user-level threads on a pool of kernel threads;
  - keeps the number of active kernel threads close to the number of allocated processors.
- Big picture (across the whole system): #active kernel threads ≈ #processors, minimizing kernel-level context switches.
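The slides do not show the resource manager's interface, so the following is only a sketch under assumed names: a central monitor evenly partitions the machine's processors among the currently active MPI applications, capping each application at its number of MPI nodes. Each application then adjusts its number of active kernel threads toward its allocation.

    /* Hypothetical central monitor: repartition processors among active
     * MPI applications.  Names and policy are assumptions for illustration. */
    #include <stdio.h>

    #define NPROCS  8            /* processors in the machine (assumed)       */
    #define MAXAPPS 4            /* maximum concurrently active applications  */

    typedef struct {
        int mpi_nodes;           /* user-level threads (MPI nodes) in the app */
        int allocated;           /* processors granted by the central monitor */
    } app_t;

    /* Evenly partition the machine, never granting an application more
     * processors than it has MPI nodes. */
    static void repartition(app_t apps[], int napps)
    {
        int share    = napps > 0 ? NPROCS / napps : 0;
        int leftover = napps > 0 ? NPROCS % napps : 0;
        for (int i = 0; i < napps; i++) {
            int grant = share + (i < leftover ? 1 : 0);
            apps[i].allocated = grant < apps[i].mpi_nodes ? grant : apps[i].mpi_nodes;
        }
    }

    int main(void)
    {
        app_t apps[MAXAPPS] = { { .mpi_nodes = 8 }, { .mpi_nodes = 4 }, { .mpi_nodes = 2 } };
        repartition(apps, 3);
        for (int i = 0; i < 3; i++)
            printf("app %d: %d MPI nodes, %d processors\n",
                   i, apps[i].mpi_nodes, apps[i].allocated);
        return 0;
    }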
User-level Thread Scheduling
- Every kernel thread is either:
  - active: executing an MPI node (user-level thread); or
  - suspended.
- Execution invariants for each application:
  - #active kernel threads ≈ #allocated processors (minimize kernel-level context switches);
  - #kernel threads = #MPI nodes (avoid dynamic thread creation).
- Every active kernel thread polls the system resource manager, which leads to one of (sketched below):
  - deactivation: suspending itself;
  - activation: waking up suspended kernel threads;
  - no action.
- When to poll?
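A minimal sketch of that poll, assuming `allocated` is updated by the resource manager and using a pthread condition variable to stand in for TMPI's own suspension mechanism (which the slides do not detail):

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;
    static int allocated = 4;    /* processors granted to this application */
    static int active    = 4;    /* kernel threads currently running       */

    /* Called by an active kernel thread during a user-level context switch. */
    static void poll_resource_manager(void)
    {
        pthread_mutex_lock(&lock);
        if (active > allocated) {
            /* Deactivation: too many runners, so suspend this kernel thread. */
            active--;
            while (active >= allocated)
                pthread_cond_wait(&wake, &lock);
            active++;                       /* reactivated later              */
        } else if (active < allocated) {
            /* Activation: processors are idle, wake a suspended peer.        */
            pthread_cond_signal(&wake);
        }
        /* Otherwise: no action, keep running user-level threads.             */
        pthread_mutex_unlock(&lock);
    }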
Polling in User-Level Context Switch
- A context switch is the result of synchronization (e.g. an MPI node waits for a message).
- The underlying kernel thread polls the system resource manager during the context switch:
  - two stack switches if it must deactivate (it suspends itself on a dummy stack);
  - one stack switch otherwise.
- After optimization, about 2 µs on average on the SGI Power Challenge.
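The toy program below illustrates the dummy-stack idea with the POSIX ucontext API; TMPI's actual context-switch code on SGI hardware is not shown in the slides, so this is only an analogy. The blocked MPI node's stack is vacated before the kernel thread parks itself, which is why deactivation costs two stack switches.

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, node_ctx, dummy_ctx;
    static char node_stack[64 * 1024], dummy_stack[16 * 1024];

    static void park(void)
    {
        /* In TMPI the kernel thread would block here until reactivated;
         * in this toy demo it just returns to the main context. */
        puts("parked on dummy stack (second stack switch happens on wake-up)");
        swapcontext(&dummy_ctx, &main_ctx);
    }

    static void mpi_node(void)
    {
        puts("MPI node runs, then blocks waiting for a message");
        /* Deactivation path: two stack switches (node -> dummy, dummy -> next). */
        swapcontext(&node_ctx, &dummy_ctx);
    }

    int main(void)
    {
        getcontext(&dummy_ctx);
        dummy_ctx.uc_stack.ss_sp   = dummy_stack;
        dummy_ctx.uc_stack.ss_size = sizeof dummy_stack;
        dummy_ctx.uc_link          = &main_ctx;
        makecontext(&dummy_ctx, park, 0);

        getcontext(&node_ctx);
        node_ctx.uc_stack.ss_sp    = node_stack;
        node_ctx.uc_stack.ss_size  = sizeof node_stack;
        node_ctx.uc_link           = &main_ctx;
        makecontext(&node_ctx, mpi_node, 0);

        swapcontext(&main_ctx, &node_ctx);   /* run the MPI node */
        puts("back in the scheduler");
        return 0;
    }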
Outline
- Motivation & Related Work
- Adaptive Two-level Thread Management
- Scheduler-conscious Event Waiting
- Experimental Studies
Event Waiting Synchronization
- All MPI synchronization is built on waitEvent.
- Waiter side: waitEvent(*pflag == value); waiting can be spinning or yielding/blocking.
- Caller side: sets *pflag = value and wakes up the waiter.
- A spin-then-block sketch follows.
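A minimal spin-then-block sketch of this primitive, using pthreads to stand in for TMPI's user-level blocking (in TMPI the waiter is a user-level thread that yields to the scheduler rather than blocking a kernel thread):

    #include <pthread.h>

    typedef struct {
        volatile int    flag;     /* the location the waiter watches (*pflag) */
        pthread_mutex_t lock;
        pthread_cond_t  cond;
    } event_t;

    enum { SPIN_LIMIT = 1000 };   /* spin iterations before blocking (tunable) */

    /* Waiter: waitEvent(*pflag == value) */
    void waitEvent(event_t *ev, int value)
    {
        for (int i = 0; i < SPIN_LIMIT; i++)       /* spin phase  */
            if (ev->flag == value)
                return;
        pthread_mutex_lock(&ev->lock);             /* block phase */
        while (ev->flag != value)
            pthread_cond_wait(&ev->cond, &ev->lock);
        pthread_mutex_unlock(&ev->lock);
    }

    /* Caller: *pflag = value, then wakeup */
    void setEvent(event_t *ev, int value)
    {
        pthread_mutex_lock(&ev->lock);
        ev->flag = value;
        pthread_cond_broadcast(&ev->cond);
        pthread_mutex_unlock(&ev->lock);
    }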
Tradeoff Between Spinning and Blocking
- Basic rules for spin-then-block waiting:
  - spinning wastes CPU cycles;
  - blocking introduces context-switch overhead, and always blocking is not good for dedicated environments.
- Previous work focuses on choosing the best spin time.
- Our optimization focus and findings:
  - fast context switch has a substantial performance impact;
  - scheduling information can guide the spin/block decision:
    - spinning is futile when the caller is not currently scheduled;
    - most of the blocking cost comes from the cache-flushing penalty (the actual cost varies, up to several ms).
Scheduler-conscious Event Waiting
- The user-level scheduler provides:
  - scheduling information (is the caller currently scheduled?);
  - affinity information.
- A decision sketch follows.
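A sketch of the resulting decision, with hypothetical query hooks standing in for the scheduling information exported by the user-level scheduler; the affinity information would additionally inform how costly blocking is, but is omitted here.

    /* Hypothetical hooks; they are not part of any API shown in the slides. */
    extern int  caller_is_scheduled(int caller_node);   /* is the setter running now? */
    extern void spin_then_block(volatile int *pflag, int value);
    extern void block_immediately(volatile int *pflag, int value);

    /* Scheduler-conscious waitEvent: only spin when spinning can possibly pay
     * off, i.e. when the MPI node expected to set the flag is currently
     * scheduled on a processor; otherwise give up the processor right away. */
    void waitEvent_sc(volatile int *pflag, int value, int caller_node)
    {
        if (*pflag == value)
            return;                                 /* already signalled        */
        if (caller_is_scheduled(caller_node))
            spin_then_block(pflag, value);          /* spinning may succeed     */
        else
            block_immediately(pflag, value);        /* spinning would be futile */
    }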
Experimental Settings
- Machines:
  - SGI Origin 2000 with 32 195 MHz MIPS R10000 processors and 2 GB of memory;
  - SGI Power Challenge with 4 200 MHz MIPS R4400 processors and 256 MB of memory.
- Systems compared:
  - TMPI-2: TMPI with two-level thread management;
  - SGI MPI: SGI's native MPI implementation;
  - TMPI: the original TMPI without two-level thread management.
Testing Benchmarks
- Synchronization frequency is measured by running each benchmark with 4 MPI nodes on the 4-processor Power Challenge.
- The higher the multiprogramming degree, the more often synchronization leads to a context switch.
- The sparse LU benchmarks synchronize much more frequently than the others.
Performance Evaluation on a Multiprogrammed Workload
- Workload: a sequence of six jobs launched at fixed intervals.
- Metric: job turnaround time on the Power Challenge.
Workloads with Certain Multiprogramming Degrees
- Goal: identify the performance impact of the multiprogramming degree.
- Experimental setting:
  - each workload contains one benchmark program;
  - run n MPI nodes on p processors (n ≥ p); the multiprogramming degree is n/p.
- Metric: megaflop rate or speedup of the kernel part of each application.
Performance Impact of Multiprogramming Degree (SGI Power Challenge)
Performance Impact of Multiprogramming Degree (SGI Origin 2000)
[Charts: performance ratios of TMPI-2 over TMPI; performance ratios of TMPI-2 over SGI MPI.]
Benefits of Scheduler-conscious Event Waiting
[Charts: improvement over simple spin-then-block on the Power Challenge and on the Origin 2000.]
Conclusions
- Contributions for optimizing MPI execution:
  - adaptive two-level thread management;
  - scheduler-conscious event waiting;
  - large performance improvements: up to an order of magnitude, depending on the application and load;
  - in multiprogrammed environments, fast context switch/synchronization is important even for MPI programs that communicate infrequently.
- Current and future work: support threaded MPI on SMP clusters.
- http://www.cs.ucsb.edu/research/tmpi