Computer Networks Lab, Korea University. Se-Hee Whang
Overview
Today's trend: chip multiprocessors (CMPs); large-scale CMPs need finer-grain parallelism
Existing fine-grain schedulers:
Software-only implementations: flexible, but slow and not scalable
Hardware-only implementations: fast, but inflexible and costly
This paper's approach: hardware-aided user-level schedulers
H/W: Asynchronous Direct Messages (ADM)
S/W: scalable message-passing schedulers
ADM schedulers are as fast as H/W schedulers and as flexible as S/W schedulers
Contents Introduction Asynchronous Direct Messages (ADM) ADM schedulers Evaluation
Fine-grain parallelism: divide the work into many small tasks and distribute them across cores
Advantages: exposes more parallelism, avoids load imbalance
Disadvantages: large scheduling overhead, poor locality
Work-stealing scheduler
One task queue per thread; threads enqueue and dequeue tasks from their own queue
If a thread runs out of work, it tries to steal tasks from another thread's queue
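As a rough illustration of the scheme above, here is a minimal single-threaded simulation of work stealing (names like `Worker` and `get_task` are illustrative, not from the paper):

```python
import random
from collections import deque

class Worker:
    """One task queue per thread; steal from others when out of work."""
    def __init__(self, wid):
        self.wid = wid
        self.queue = deque()

    def get_task(self, workers):
        if self.queue:
            return self.queue.pop()          # LIFO pop from own queue (locality)
        victims = [w for w in workers if w is not self and w.queue]
        if victims:
            victim = random.choice(victims)
            return victim.queue.popleft()    # steal from the other end
        return None                          # out of work everywhere

# Simulate: 4 workers, all tasks initially on worker 0
workers = [Worker(i) for i in range(4)]
workers[0].queue.extend(range(8))
done = []
while True:
    progressed = False
    for w in workers:
        t = w.get_task(workers)
        if t is not None:
            done.append(t)
            progressed = True
    if not progressed:
        break
print(sorted(done))  # all 8 tasks complete despite the initial imbalance
```

Owners pop LIFO for cache locality while thieves steal FIFO from the opposite end, the classic work-stealing design the slides assume.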
Components of work-stealing
Queues
Policies: effective load-rebalancing methods
Communication
Comparing the two schemes
The cost of queues and policies is cheap; the cost of communication through shared memory is increasingly expensive!
Carbon [ISCA '07]
One hardware LIFO task queue per core; special instructions to enqueue/dequeue tasks
Components:
Global Task Unit (GTU): centralized queues for fast stealing
Local Task Units: one small task buffer per core to hide GTU latency
A H/W scheduler is useless if its fixed policy does not match the application!
Approaches to fine-grain scheduling
S/W only (OpenMP, TBB, Cilk, X10, …): S/W queues & policies, S/W communication → high overhead, flexible, no extra H/W
H/W only (Carbon, GPUs, …): H/W queues & policies, H/W communication → low overhead, inflexible, special-purpose H/W
H/W aided (Asynchronous Direct Messages, ADM): S/W queues & policies, H/W communication → low overhead, flexible, general-purpose H/W
Contents Introduction Asynchronous Direct Messages (ADM) ADM schedulers Evaluation
Asynchronous Direct Messages (ADM)
Messaging between threads, tailored to scheduling and control
Requirements:
Short messages with low overhead → send from / receive into registers, independent of the coherence protocol
Simultaneous communication and computation → asynchronous messages with user-level interrupts
General-purpose H/W → a generic interface allows reuse
ADM Microarchitecture
One ADM unit per core
Receive buffer holds messages until they are dequeued by the thread
Send buffer holds sent messages pending acknowledgement
Thread ID Translation Buffer translates TID → core ID on sends
Small structures (16-32 entries) → scalable
ADM Instruction Set Architecture
Send & receive are atomic (single instruction)
Send completes when the message is copied to the send buffer
Receive blocks if the buffer is empty; peek does not block, enabling polling
A user-level interrupt is generated when a message is received
No stack switching; handler code saves only part of the context (the registers it uses) → fast
Interrupts can be disabled to preserve atomicity w.r.t. message reception

Instruction | Description
adm_send r1, r2 | Sends a message of (r1) words (0-6) to the thread with ID (r2)
adm_peek r1, r2 | Returns the source and message length at the head of the rx buffer
adm_rx r1, r2 | Dequeues the message at the head of the rx buffer
adm_ei / adm_di | Enable / disable receive interrupts
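Since adm_send/adm_peek/adm_rx are hardware instructions, they cannot be run directly; the following is a toy software model of the receive-side semantics only (class and method names are mine, and the 16-entry buffer size is taken from the microarchitecture slide):

```python
from collections import deque

BUF_ENTRIES = 16   # small receive buffer, per the slides (16-32 entries)

class ADMUnit:
    """Toy model of one core's ADM receive-buffer semantics."""
    def __init__(self):
        self.rx = deque()

    def deliver(self, src, words):
        # Real hardware would back-pressure the sender's send buffer
        # when full; here we just report whether delivery succeeded.
        if len(self.rx) >= BUF_ENTRIES:
            return False
        assert 0 <= len(words) <= 6       # adm_send payloads are 0-6 words
        self.rx.append((src, words))
        return True

    def adm_peek(self):
        # Non-blocking: returns (source, length) without dequeuing
        if not self.rx:
            return None
        src, words = self.rx[0]
        return (src, len(words))

    def adm_rx(self):
        # Dequeues the message at the head of the rx buffer
        return self.rx.popleft() if self.rx else None

unit = ADMUnit()
unit.deliver(src=3, words=[42, 7])
print(unit.adm_peek())   # (3, 2): source thread 3, 2-word message
print(unit.adm_rx())     # (3, [42, 7])
```

The peek/rx split mirrors the ISA: peek supports polling without committing to a dequeue, while rx consumes the head message.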
Contents Introduction Asynchronous Direct Messages (ADM) ADM schedulers Evaluation
ADM Schedulers
Message-passing schedulers that replace the scheduler of other parallel runtimes, so scheduling is not the application programmer's responsibility
Threads can perform two roles:
Worker: executes the parallel phase, enqueues & dequeues tasks
Manager: controls task stealing & parallel-phase termination
Centralized scheduler: one manager controls all workers
Centralized Scheduler: Updates
The manager keeps approximate task counts for each worker
Workers notify the manager only when their task count crosses an exponential threshold (1, 2, 4, 8, …), keeping update traffic low
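The update filtering above can be sketched as follows (a minimal sketch assuming power-of-two thresholds; the helper name `crossed_threshold` is illustrative):

```python
def crossed_threshold(old, new):
    """Notify the manager only when the task count crosses a
    power-of-two boundary (1, 2, 4, 8, ...), so updates stay rare."""
    def level(n):
        return n.bit_length()   # 0->0, 1->1, 2..3->2, 4..7->3, ...
    return level(old) != level(new)

# A worker enqueuing 20 tasks sends only a handful of updates:
updates = []
count = 0
for _ in range(20):
    old, count = count, count + 1
    if crossed_threshold(old, count):
        updates.append(count)
print(updates)  # [1, 2, 4, 8, 16] -> 5 update messages instead of 20
```

The manager's view is therefore approximate but always within a factor of two of the true count, which is enough for load balancing.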
Centralized Scheduler: Steals
For load balancing, the manager directs steals at the worker with the most tasks
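A minimal sketch of the manager's victim choice, using the approximate per-worker counts it maintains (the dictionary and `pick_victim` are illustrative names, not from the paper):

```python
# Manager-side victim selection over approximate task counts
approx_tasks = {0: 1, 1: 9, 2: 0, 3: 4}   # thread id -> approx task count

def pick_victim(counts, thief):
    """Direct the steal at the most-loaded worker other than the thief."""
    candidates = {t: n for t, n in counts.items() if t != thief and n > 0}
    if not candidates:
        return None                        # nothing worth stealing
    return max(candidates, key=candidates.get)

print(pick_victim(approx_tasks, thief=2))  # 1: the most-loaded worker
```

Because the manager sees all counts, it can direct steals instead of the random probing a pure S/W work-stealer must do.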
Hierarchical Scheduler
The centralized scheduler does all communication through messages, enabling directed stealing and task prefetching, but it does not scale beyond about 16 threads
Hierarchical scheduler: workers and managers form a tree
Hierarchical Scheduler: Steals
Steals can span multiple levels of the tree
Scales to large thread counts (hundreds of threads)
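A toy sketch of a multi-level steal, assuming a 2-level tree with fan-out 4 (the layout and `steal_victim` helper are my illustration of the idea, not the paper's exact protocol):

```python
# A steal request first looks inside the requester's own group;
# only if the whole group is empty does it escalate to the parent.
FANOUT = 4
counts = [0, 0, 0, 0,  3, 0, 5, 0]   # 8 workers, two groups of 4

def steal_victim(thief):
    group = thief // FANOUT
    local = range(group * FANOUT, (group + 1) * FANOUT)
    best = max((w for w in local if w != thief), key=lambda w: counts[w])
    if counts[best] > 0:
        return best                   # resolved at the leaf manager
    # escalate: the parent manager searches the other groups
    others = [w for w in range(len(counts)) if w // FANOUT != group]
    best = max(others, key=lambda w: counts[w])
    return best if counts[best] > 0 else None

print(steal_victim(0))  # worker 6: local group empty, parent redirects
```

Keeping most steals within a group bounds the traffic any single manager sees, which is what lets the tree scale.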
Contents Introduction Asynchronous Direct Messages (ADM) ADM schedulers Evaluation
Evaluation: Setup
Simulated machine: tiled CMP
32, 64, or 128 in-order dual-thread SPARC cores (64 to 256 threads)
3-level cache hierarchy, directory coherence
Benchmarks:
Loop-parallel: canneal, cg, gtfold
Task-parallel: maxflow, mergesort, ced, hashjoin
Evaluation: Results
Carbon and ADM both have small overheads that scale
ADM matches Carbon → no need for a dedicated H/W scheduler
Takeaway: scheduling policies should not be fixed in H/W