Lecture 19 Beyond Low-level Parallelism

2 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Outline
Models for exploiting large-grained parallelism
Issues in parallelization: communication, synchronization, Amdahl's Law
Basic Single-Bus Shared Memory Machines

3 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Dimension 1: Granularity of Concurrency
Intra-Instruction (pipelining)
Parallel Instructions (wide issue)
Loop-level Parallelism (blocks, loops)
Algorithmic Thread Parallelism (loops, plus more)
Concurrent programs (multiple programs)

4 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Dimension 2: Flynn's Classification
Single Instruction (stream), Single Data (stream) (SISD) – simple CPU (1-wide pipeline)
Single Instruction, Multiple Data (SIMD) – vector computer, multimedia extensions
Multiple Instruction, Single Data (MISD) – ???
Multiple Instruction, Multiple Data (MIMD) – multiprocessor (now also, multithreaded CPU)

5 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Dimension 3: Parallel Programming Models
Data Parallel Model:
– The same computation is applied to a large data set; the applications of that computation to the different data elements can be done in parallel.
– E.g. matrix multiplication: all elements of the product array can be generated in parallel.
Function Parallel Model:
– A program is partitioned into parallel modules.
– E.g. compiler: the phases of a compiler can be made into parallel modules (parser, semantic analyzer, optimizer 1, optimizer 2, and code generator) and organized into a high-level pipeline.

6 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Dimension 4: Memory Models
How processors view each other's memory space:
[Figure: CPU/memory nodes connected by an interconnect, drawn for each model]
Shared Memory Model – amenable to a single address space
Message Passing Model – amenable to multiple address spaces

7 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Dimension 4: Memory Models
Shared memory
– All processors have common access to a shared memory space. When a program accesses data, it just gives the memory system an address for the data.
– DO 10 I = 1, N
    DO 20 J = 1, N
      A[I,J] = 0
      DO 30 K = 1, N
        A[I,J] = A[I,J] + B[I,K]*C[K,J]
      END DO
    END DO
  END DO
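To make the shared-memory view concrete, here is a minimal sketch (not from the slides) of the same matrix product written with C++ threads; the function names, data layout, and row-blocking are illustrative assumptions. Every thread simply indexes the shared arrays directly, and no explicit communication is needed because each thread writes a disjoint block of rows of A.

#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Each worker computes a disjoint block of rows of A = B * C,
// reading and writing the shared matrices directly.
void matmul_rows(Matrix& A, const Matrix& B, const Matrix& C,
                 int i_begin, int i_end, int N) {
    for (int i = i_begin; i < i_end; ++i)
        for (int j = 0; j < N; ++j) {
            A[i][j] = 0.0;
            for (int k = 0; k < N; ++k)
                A[i][j] += B[i][k] * C[k][j];
        }
}

void parallel_matmul(Matrix& A, const Matrix& B, const Matrix& C,
                     int N, int nthreads) {
    std::vector<std::thread> workers;
    int chunk = (N + nthreads - 1) / nthreads;          // rows per thread
    for (int t = 0; t < nthreads; ++t) {
        int lo = t * chunk, hi = std::min(N, lo + chunk);
        workers.emplace_back(matmul_rows, std::ref(A), std::cref(B),
                             std::cref(C), lo, hi, N);
    }
    for (auto& w : workers) w.join();                   // wait for all rows to finish
}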

8 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Memory Architecture (Cont.)
Message passing (or Distributed Memory)
– Each processor has a local memory. For a processor to access a piece of data not present in its own local memory, it has to exchange messages with the processor whose local memory contains the data. For each data access, the programmer must determine whether the accessing thread is executed by the processor whose local memory contains the data.
– In general, the programmer has to partition the data and assign it explicitly to the processors' local memories.
– The programmer then parallelizes the program and assigns the threads to processors. Using these assignment decisions, each access is then determined to be either a local memory access or a message exchange.

9 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois A Simple Problem
Program to find the minimum of a large set of input data:
min = D[0];
for (i = 0; i < n; i++) {
  if (D[i] < min) min = D[i];
}
cout << min;

10 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Simple Problem, Data Parallelized
pstart = p# * (n/pMax);
min[p#] = D[0];
for (i = pstart; i < pstart + (n/pMax); i++) {
  if (D[i] < min[p#]) min[p#] = D[i];
}
barrier();
if (p# == 0) {
  real_min = min[0];
  for (i = 1; i < pMax; i++) {
    if (min[i] < real_min) real_min = min[i];
  }
  cout << real_min;
}
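For contrast with the shared-memory version above, the same computation in a message-passing style is sketched below (not from the slides; the data distribution and variable names are illustrative). Each rank keeps only its local slice of the data, and the combining step that the shared min[] array and barrier() perform above becomes an explicit MPI_Reduce.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Each rank holds only its own slice of the data in its local memory.
    // (Illustrative: a real program would distribute the actual input here.)
    const int local_n = 1000;
    double local_D[1000];
    for (int i = 0; i < local_n; i++) local_D[i] = rank * local_n + i;

    // Local minimum: no communication needed for this part.
    double local_min = local_D[0];
    for (int i = 1; i < local_n; i++)
        if (local_D[i] < local_min) local_min = local_D[i];

    // Explicit communication: combine the per-rank minima onto rank 0.
    double global_min;
    MPI_Reduce(&local_min, &global_min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("min = %g\n", global_min);
    MPI_Finalize();
    return 0;
}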

11 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Loop Level Parallelization
DOALL: There is no dependence between loop iterations. All iterations can be executed in parallel.
DOALL 30 J=1,J1
  X(II1+J) = X(II1+J) * SC1
  Y(II1+J) = Y(II1+J) * SC1
  Z(II1+J) = Z(II1+J) * SC1
END DO
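In a modern shared-memory setting the same DOALL idea is usually expressed with an OpenMP parallel-for directive. A minimal sketch (array names carried over from the slide; declarations and sizes are assumed) might look like:

// The pragma asserts that iterations are independent, just as DOALL does,
// so the runtime is free to execute them in parallel.
void scale(double* X, double* Y, double* Z, int II1, int J1, double SC1) {
    #pragma omp parallel for
    for (int J = 1; J <= J1; J++) {
        X[II1 + J] *= SC1;
        Y[II1 + J] *= SC1;
        Z[II1 + J] *= SC1;
    }
}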

12 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Loop Level Parallelization
DOACROSS: There are dependences between loop iterations. The dependences are enforced by two synchronization constructs:
– advance(synch_pt): signals that the current iteration has passed the synchronization point identified by synch_pt.
– await(synch_pt, depend_distance): forces the current iteration to wait for a previous iteration to pass the synchronization point identified by synch_pt. The iteration waited for is the current iteration number minus depend_distance.

13 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois DOACROSS Example
DOACROSS 40 I=4,IL
  AWAIT(1, 3)
  X(I) = Y(I) + X(I-3)
  ADVANCE(1)
END DO
The current iteration depends on synch point 1 of the iteration three iterations ago.
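One way to picture what AWAIT/ADVANCE must do at run time is sketched below. This is an illustrative implementation (not from the slides, and not how any particular compiler does it): each (synch point, iteration) pair gets a flag, ADVANCE sets the flag for the current iteration, and AWAIT spins until the flag of the iteration depend_distance earlier has been set.

#include <atomic>

// Illustrative sizes: one flag per (synch point, iteration).
constexpr int NUM_SYNCH_POINTS = 4;
constexpr int MAX_ITER = 1 << 16;
std::atomic<int> passed[NUM_SYNCH_POINTS][MAX_ITER];   // zero-initialized globals

// ADVANCE(sp) executed by iteration i: mark synch point sp as passed.
void advance(int sp, int i) {
    passed[sp][i].store(1, std::memory_order_release);
}

// AWAIT(sp, dist) executed by iteration i: spin until iteration i - dist
// has passed synch point sp (the first iterations have nothing to wait for).
void await(int sp, int dist, int i) {
    if (i - dist < 0) return;
    while (passed[sp][i - dist].load(std::memory_order_acquire) == 0) {
        // busy-wait
    }
}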

14 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Issue 1: Communication
The objective is to write parallel programs that can use as many processors as are available.
Information needs to be exchanged between processors.
– The min[] values need to be communicated to processor 0 in the example above.
Parallelization is limited by communication:
– in terms of latency
– in terms of the amount of communication
– At some point, the communication overhead will outweigh the benefit of parallelism.

15 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Issue 2: Synchronization
Often, parallel tasks need to coordinate and wait for each other.
Example: the barrier in the parallel min algorithm.
The art of parallel programming involves minimizing synchronization.

16 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Mutual Exclusion
Atomic update to data structures so that the actions from different processors are serialized.
– This is especially important for data structures that need to be updated in multiple steps. Each processor should be able to lock out other processors, update the data structure, leave the data structure in a consistent state, and unlock.
– This is usually realized by using atomic Read-Modify-Write instructions (see the sketch below).
Parking space example, file editing example.
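As an illustration of the Read-Modify-Write idea, a simple spinlock can be built from compare-and-exchange; the sketch below uses C++ atomics and is illustrative rather than taken from the slides. The lock word is updated atomically, so only one processor at a time can move it from free to held, and everything done between acquire() and release() is mutually exclusive.

#include <atomic>

std::atomic<int> lock_word{0};          // 0 = free, 1 = held

void acquire() {
    int expected = 0;
    // Atomically move the lock from free to held; the compare-and-exchange
    // is the atomic Read-Modify-Write step. Retry while another CPU holds it.
    while (!lock_word.compare_exchange_weak(expected, 1,
                                            std::memory_order_acquire)) {
        expected = 0;                   // failure overwrote expected with the current value
    }
}

void release() {
    lock_word.store(0, std::memory_order_release);
}

// Usage: acquire(); /* update the shared structure in multiple steps */ release();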

17 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Issue 3: Serial Component
Also, no matter how much you parallelize, the performance is ultimately limited by the serial component. Better known as Amdahl's Law.
Jet service to Chicago:
– Jets and propeller planes have very different flying times but similar taxi times – the taxi time is the serial section here.
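Stated as a formula (the standard form of Amdahl's Law, added here for reference): if a fraction f of the execution can be spread over P processors and the remaining 1 - f is serial, then
Speedup(P) = 1 / ((1 - f) + f / P)
which can never exceed 1 / (1 - f) no matter how many processors are added; the serial part plays the role of the taxi time in the example above.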

18 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Example Speedup Graph
[Graph: speedup over a single processor vs. number of processors, showing an ideal speedup curve and an observed speedup curve]

19 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Message Passing Machines
Network characteristics are of primary importance: latency, bisection bandwidth, node bandwidth, occupancy of communication.
Latency = sender overhead + transport latency + receiver overhead

20 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Synchronization support for shared memory machines
Example: two processors trying to increment a shared variable.
Requires an atomic memory update instruction
– examples: test-and-set, or compare-and-exchange (cmpxchg)
Using cmpxchg, implement barrier(); one possible sketch follows below.
Scalability issues with traditional synchronization.
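One possible answer, sketched with C++ atomics under the assumption of a fixed, known number of participating threads (illustrative, not from the slides): a sense-reversing barrier built from an atomic counter. fetch_add is used for the counter update; it is an atomic Read-Modify-Write and can itself be implemented with a cmpxchg loop.

#include <atomic>

std::atomic<int> arrived{0};            // how many threads have reached the barrier
std::atomic<int> global_sense{0};       // flips each time the barrier opens
thread_local int local_sense = 0;       // per-thread copy of the expected sense

// Sense-reversing barrier for a fixed number of participating threads.
void barrier(int nthreads) {
    local_sense ^= 1;                                   // sense for this barrier episode
    if (arrived.fetch_add(1) + 1 == nthreads) {         // last thread to arrive
        arrived.store(0);                               // reset for the next episode
        global_sense.store(local_sense);                // release the waiting threads
    } else {
        while (global_sense.load() != local_sense) {
            // spin until the last thread flips the sense
        }
    }
}

Spinning on one shared location like this is exactly where the scalability issues mentioned above appear: every waiting processor keeps re-reading the same cache line, and the resulting coherence traffic grows with the processor count.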

21 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Coherence
[Figure: CPUs sharing a memory, first without caches, then with a cache per CPU]
No caches: easy hardware, bad performance.
With caches: better performance, but… what about keeping the cached data valid?

22 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Memory Coherence, continued
[Figure: CPU 0 and CPU 1 each cache location X with value A; memory also holds A at X]
Essential question: What happens when CPU 0 writes value B to memory location X?

23 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois One solution: snooping caches
[Figure: each CPU's cache keeps a duplicate set of snoop tags watching the shared bus]
Snoop logic watches bus transactions and either invalidates or updates matching cache lines.
Requires an extra set of cache tags.
Generally, write-back caches are used. Why?
What about the L1 cache?

24 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Cache Coherence Protocol
Basic premise: writes must update other copies or invalidate other copies.
How do we make this work while not choking the bus with transactions OR flooding memory OR bringing each CPU to a crawl?
One solution: the Illinois Protocol.

25 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois The Illinois Protocol
[Figure: a cache tag extended with MESI state bits]
The tag for each cache block now contains 2 bits that specify whether the block is Modified, Exclusively owned, Shared, or Invalid (MESI).

26 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Illinois Protocol from the CPU's perspective
[State-transition diagram over the states Invalid (or no tag match), Exclusive, Shared, and Modified]
From Invalid: a read whose data is supplied by memory moves the line to Exclusive; a read whose data is supplied by another CPU moves it to Shared; a write (which invalidates other copies) moves it to Modified.
From Exclusive: a read stays Exclusive; a write moves the line to Modified.
From Shared: a read stays Shared; a write (which invalidates other copies) moves the line to Modified.
From Modified: a read or write stays Modified.
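The same processor-side transitions can be written as a small state machine. The sketch below is an illustrative rendering of the standard MESI rules; the bus-transaction helpers are stubs standing in for real hardware actions, not part of the original material.

enum class State { Invalid, Exclusive, Shared, Modified };

// Hypothetical stand-ins for real bus transactions (stubs for illustration).
bool bus_read() { return false; }            // true if another cache supplied the data
void bus_read_exclusive() {}                 // read with intent to modify; invalidates other copies
void bus_invalidate() {}                     // invalidate other cached copies of a Shared line

// Processor-side MESI transition for one cache line on a CPU read or write.
State on_cpu_access(State s, bool is_write) {
    switch (s) {
    case State::Invalid:                       // miss
        if (is_write) { bus_read_exclusive(); return State::Modified; }
        return bus_read() ? State::Shared      // data supplied by another CPU
                          : State::Exclusive;  // data supplied by memory, no other copies
    case State::Exclusive:                     // hit: only copy, clean
        return is_write ? State::Modified      // can write without any bus traffic
                        : State::Exclusive;
    case State::Shared:                        // hit: other copies may exist
        if (is_write) { bus_invalidate(); return State::Modified; }
        return State::Shared;
    case State::Modified:                      // hit: only copy, dirty
        return State::Modified;                // reads and writes both stay Modified
    }
    return s;
}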

27 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Illinois Protocol from the bus's perspective
[State-transition diagram over the same states, with transitions triggered by snooped bus transactions]
From Exclusive: a snooped bus read moves the line to Shared; a snooped bus write (miss) moves it to Invalid.
From Shared: a snooped bus read leaves the line Shared; a snooped bus write or invalidation moves it to Invalid.
From Modified: on a snooped bus read, this cache supplies the data and the line moves to Shared; on a snooped bus write (miss), it first supplies the data and then the line moves to Invalid.

28 © Wen-mei Hwu and S. J. Patel, 2002 ECE 412, University of Illinois Issues with a single bus
Keep adding CPUs and eventually the bus saturates.
The coherence protocol adds extra traffic.
Buses are slower than point-to-point interconnects.