Computer Architecture: Parallel Task Assignment

Slides:



Advertisements
Similar presentations
Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures Pree Thiengburanathum Advanced computer architecture Oct 24,
Advertisements

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee Margaret Martonosi.
Data Marshaling for Multi-Core Architectures M. Aater Suleman Onur Mutlu Jose A. Joao Khubaib Yale N. Patt.
Exploiting Unbalanced Thread Scheduling for Energy and Performance on a CMP of SMT Processors Matt DeVuyst Rakesh Kumar Dean Tullsen.
WORK STEALING SCHEDULER 6/16/2010 Work Stealing Scheduler 1.
An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu
11 1 Hierarchical Coarse-grained Stream Compilation for Software Defined Radio Yuan Lin, Manjunath Kudlur, Scott Mahlke, Trevor Mudge Advanced Computer.
Tao Yang, UCSB CS 240B’03 Unix Scheduling Multilevel feedback queues –128 priority queues (value: 0-127) –Round Robin per priority queue Every scheduling.
High Performance Computing 1 Parallelization Strategies and Load Balancing Some material borrowed from lectures of J. Demmel, UC Berkeley.
1 Introduction to Load Balancing: l Definition of Distributed systems. Collection of independent loosely coupled computing resources. l Load Balancing.
Multiscalar processors
Strategies for Implementing Dynamic Load Sharing.
Multiprocessor Real- Time Scheduling Aaron Harris CSE 666 Prof. Ganesan.
Load Balancing Dan Priece. What is Load Balancing? Distributed computing with multiple resources Need some way to distribute workload Discreet from the.
Chapter 3 Memory Management: Virtual Memory
Multi-core Programming Thread Profiler. 2 Tuning Threaded Code: Intel® Thread Profiler for Explicit Threads Topics Look at Intel® Thread Profiler features.
 What is an operating system? What is an operating system?  Where does the OS fit in? Where does the OS fit in?  Services provided by an OS Services.
Architectural Support for Fine-Grained Parallelism on Multi-core Architectures Sanjeev Kumar, Corporate Technology Group, Intel Corporation Christopher.
Meta Scheduling Sathish Vadhiyar Sources/Credits/Taken from: Papers listed in “References” slide.
Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee and Margaret Martonosi.
Chapter 101 Multiprocessor and Real- Time Scheduling Chapter 10.
1 Multiprocessor and Real-Time Scheduling Chapter 10 Real-Time scheduling will be covered in SYSC3303.
1 Multiprocessor Scheduling Module 3.1 For a good summary on multiprocessor and real-time scheduling, visit:
ICOM Noack Scheduling For Distributed Systems Classification – degree of coupling Classification – granularity Local vs centralized scheduling Methods.
Super computers Parallel Processing By Lecturer: Aisha Dawood.
CS5222 Advanced Computer Architecture Part 3: VLIW Architecture
Computer Network Lab. Korea University Computer Networks Labs Se-Hee Whang.
Lecture 3 : Performance of Parallel Programs Courtesy : MIT Prof. Amarasinghe and Dr. Rabbah’s course note.
Computer Architecture: Multithreading (I) Prof. Onur Mutlu Carnegie Mellon University.
Data Structures and Algorithms in Parallel Computing Lecture 7.
Advanced Computer Architecture pg 1 Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8) Henk Corporaal
1 of 14 Lab 2: Formal verification with UPPAAL. 2 of 14 2 The gossiping persons There are n persons. All have one secret to tell, which is not known to.
Dynamic Load Balancing Tree and Structured Computations.
1 of 14 Lab 2: Design-Space Exploration with MPARM.
15-740/ Computer Architecture Lecture 12: Issues in OoO Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011.
Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University.
Fall 2012 Parallel Computer Architecture Lecture 1: Introduction Prof. Onur Mutlu Carnegie Mellon University 9/5/2012.
740: Computer Architecture Memory Consistency Prof. Onur Mutlu Carnegie Mellon University.
15-740/ Computer Architecture Lecture 19: Caching II Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/31/2011.
Fall 2012 Parallel Computer Architecture Lecture 4: Multi-Core Processors Prof. Onur Mutlu Carnegie Mellon University 9/14/2012.
Computer Architecture: Branch Prediction (II) and Predicated Execution
Introduction to Load Balancing:
For Massively Parallel Computation The Chaotic State of the Art
Prof. Onur Mutlu Carnegie Mellon University
Prof. Onur Mutlu Carnegie Mellon University
Parallel Programming By J. H. Wang May 2, 2017.
15-740/ Computer Architecture Lecture 7: Pipelining
Parallel Algorithm Design
Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)
/ Computer Architecture and Design
Threads and Cooperation
Computer Architecture: Multithreading (I)
Improved schedulability on the ρVEX polymorphic VLIW processor
Department of Computer Science University of California, Santa Barbara
15-740/ Computer Architecture Lecture 24: Control Flow
15-740/ Computer Architecture Lecture 5: Precise Exceptions
15-740/ Computer Architecture Lecture 26: Predication and DAE
15-740/ Computer Architecture Lecture 16: Prefetching Wrap-up
Multiprocessor and Real-Time Scheduling
/ Computer Architecture and Design
Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)
The Vector-Thread Architecture
CS703 - Advanced Operating Systems
Design of Digital Circuits Lecture 19a: VLIW
Department of Computer Science University of California, Santa Barbara
EdgeWise: A Better Stream Processing Engine for the Edge
Prof. Onur Mutlu Carnegie Mellon University
Prof. Onur Mutlu Carnegie Mellon University
Presentation transcript:

Computer Architecture: Parallel Task Assignment Prof. Onur Mutlu Carnegie Mellon University

Static versus Dynamic Scheduling Static: Done at compile time or parallel task creation time Schedule does not change based on runtime information Dynamic: Done at run time (e.g., after tasks are created) Schedule changes based on runtime information Example: Instruction scheduling Why would you like to do dynamic scheduling? What pieces of information are not available to the static scheduler?

Parallel Task Assignment: Tradeoffs Problem: N tasks, P processors, N>P. Do we assign tasks to processors statically (fixed) or dynamically (adaptive)? Static assignment + Simpler: No movement of tasks. - Inefficient: Underutilizes resources when load is not balanced When can load not be balanced? Dynamic assignment + Efficient: Better utilizes processors when load is not balanced - More complex: Need to move tasks to balance processor load - Higher overhead: Task movement takes time, can disrupt locality

Parallel Task Assignment: Example Compute histogram of a large set of values Parallelization: Divide the values across T tasks Each task computes a local histogram for its value set Local histograms merged with global histograms in the end

Parallel Task Assignment: Example (II) How to schedule tasks updating local histograms? Static: Assign equal number of tasks to each processor Dynamic: Assign tasks to a processor that is available When does static work as well as dynamic? Implementation of Dynamic Assignment with Task Queues

Software Task Queues What are the advantages and disadvantages of each? Centralized Distributed Hierarchical

Task Stealing Idea: When a processor’s task queue is empty it steals a task from another processor’s task queue Whom to steal from? (Randomized stealing works well) How many tasks to steal? + Dynamic balancing of computation load - Additional communication/synchronization overhead between processors - Need to stop stealing if no tasks to steal

Parallel Task Assignment: Tradeoffs Who does the assignment? Hardware versus software? Software + Better scope - More time overhead - Slow to adapt to dynamic events (e.g., a processor becoming idle) Hardware + Low time overhead + Can adjust to dynamic events faster - Requires hardware changes (area and possibly energy overhead)

How Can the Hardware Help? Managing task queues in software has overhead Especially high when task sizes are small An idea: Hardware Task Queues Each processor has a dedicated task queue Software fills the task queues (on demand) Hardware manages movement of tasks from queue to queue There can be a global task queue as well  hierarchical tasking in hardware Kumar et al., “Carbon: Architectural Support for Fine-Grained Parallelism on Chip Multiprocessors,” ISCA 2007. Optional reading

Dynamic Task Generation Does static task assignment work in this case? Problem: Searching the exit of a maze

Computer Architecture: Parallel Task Assignment Prof. Onur Mutlu Carnegie Mellon University

Backup slides

Referenced Readings Kumar et al., “Carbon: Architectural Support for Fine-Grained Parallelism on Chip Multiprocessors,” ISCA 2007.