
Quiz 3: solutions QUESTION #2
Consider a multiprocessor system with two processors (P1 and P2), each with its own cache. Initially there is no copy of variable X in either cache and X = 10 in memory. For each of the following protocols, show the state of variable X in the caches and in memory after each of the following statements is executed.

(a) Two-state write-through write-invalidate protocol
(R = Read, W = Write, Z = Replace; i = local processor, j = other processor)

    Operation                    P1 state   X in P1's cache   P2 state   X in P2's cache   X in memory
    1. P1 reads X                V          10                I          -                 10
    2. P2 reads X                V          10                V          10                10
    3. P2 performs X = X + 2     I          10 (stale)        V          12                12
    4. P1 performs X = X * 2     V          24                I          12 (stale)        24
    5. P2 reads X                V          24                V          24                24

Quiz 3: solutions QUESTION #2
(b) Basic MSI write-back invalidation protocol

    Operation                    P1 state   X in P1's cache   P2 state   X in P2's cache   X in memory
    1. P1 reads X                RO         10                INV        -                 10
    2. P2 reads X                RO         10                RO         10                10
    3. P2 performs X = X + 2     INV        10 (stale)        RW         12                10
    4. P1 performs X = X * 2     RW         24                INV        12 (stale)        12
    5. P2 reads X                RO         24                RO         24                24

(RW = modified, RO = shared read-only, INV = invalid; because the protocol is write-back, memory is updated only when a modified block is flushed, e.g. in steps 4 and 5.)

Bus/memory transactions when a processor reads a block that the other cache holds modified (as in step 5 above):
1. P2 issues a read for X.
2. P1 writes back its modified copy X' and downgrades to RO.
3. P2 reads X' from memory; both caches now hold X' in the RO state.
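To make these RO/RW/INV transitions concrete, here is a minimal C sketch (not from the original slides) that replays the five accesses of part (b) through a tiny two-processor MSI model; the type, function and variable names are my own. The same structure also models the two-state protocol of part (a) if the states are reduced to V/I and memory is updated on every write (write-through).

    #include <stdio.h>

    /* Minimal two-processor MSI (RO/RW/INV) write-back simulator.
     * Illustrative sketch only; it models just the accesses used in
     * Question #2(b).  An INV entry's value is whatever stale content
     * was left behind (0 before any copy ever existed). */
    typedef enum { INV, RO, RW } State;
    typedef struct { State st; int val; } Cache;

    static Cache c[2];          /* c[0] = P1, c[1] = P2 */
    static int mem = 10;        /* X = 10 initially     */

    static void read_x(int p) {
        int q = 1 - p;                 /* the other processor           */
        if (c[p].st == INV) {          /* read miss                     */
            if (c[q].st == RW) {       /* other cache owns a dirty copy */
                mem = c[q].val;        /* owner writes back ...         */
                c[q].st = RO;          /* ... and downgrades to RO      */
            }
            c[p].val = mem;            /* load the block from memory    */
            c[p].st = RO;
        }
    }

    static void write_x(int p, int newval) {
        int q = 1 - p;
        if (c[p].st == INV) read_x(p); /* fetch the block on a write miss      */
        c[q].st = INV;                 /* invalidate the other copy            */
        c[p].st = RW;                  /* modified; memory updated on write-back */
        c[p].val = newval;
    }

    static void show(const char *op) {
        const char *n[] = { "INV", "RO", "RW" };
        printf("%-22s P1:%-3s %2d   P2:%-3s %2d   mem:%2d\n",
               op, n[c[0].st], c[0].val, n[c[1].st], c[1].val, mem);
    }

    int main(void) {
        read_x(0);                              show("1. P1 reads X");
        read_x(1);                              show("2. P2 reads X");
        read_x(1);  write_x(1, c[1].val + 2);   show("3. P2: X = X + 2");
        read_x(0);  write_x(0, c[0].val * 2);   show("4. P1: X = X * 2");
        read_x(1);                              show("5. P2 reads X");
        return 0;
    }

Running it prints one line per access whose states and values match the table in part (b).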

Quiz 3: solutions QUESTION #3(a)
The following MPI program is given. What is the order of printing? Why?

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char** argv) {
        int my_PE_num;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_PE_num);
        printf("Hello from %d.\n", my_PE_num);
        MPI_Finalize();
        return 0;
    }

MPI_Init → initiate the computation
MPI_Comm_rank → determine the integer identifier (rank) assigned to the current process (processes in a process group are identified with unique, contiguous integers numbered from 0)
MPI_COMM_WORLD → default communicator that identifies all processes involved in the computation
MPI_Finalize → terminate the computation

→ There is no defined order of printing: MPI_Comm_rank only assigns ranks, it does not define the order in which the processes reach the printf call, so any interleaving is possible, for example:

    Hello from 3.
    Hello from 1.
    Hello from 0.
    Hello from 2.
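As a side note (not part of the quiz), if a deterministic order is wanted, one common pattern is to funnel all greetings through rank 0 with point-to-point messages; the sketch below does that, and the buffer size and tag value are my own choices.

    #include <stdio.h>
    #include <string.h>
    #include "mpi.h"

    /* Variant of the quiz program with a deterministic output order:
     * every non-zero rank sends its greeting to rank 0, and rank 0
     * prints the messages in increasing rank order. */
    int main(int argc, char** argv) {
        int rank, size;
        char msg[64];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        snprintf(msg, sizeof msg, "Hello from %d.", rank);

        if (rank != 0) {
            MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        } else {
            printf("%s\n", msg);                    /* rank 0 prints first   */
            for (int src = 1; src < size; src++) {  /* then ranks 1..size-1  */
                MPI_Recv(msg, sizeof msg, MPI_CHAR, src, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("%s\n", msg);
            }
        }

        MPI_Finalize();
        return 0;
    }

With this variant the greetings always appear in rank order 0, 1, 2, ... regardless of how the processes are scheduled.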

Quiz 4: QUESTION #1
4. Explain how scheduling of in-forest / out-forest task graphs works:
First, determine the level of each node, i.e. the maximum number of nodes (including itself) on any path from the given node to a terminal node → the level of each node is used as that node's priority.
Then, whenever a processor becomes available, assign it the unexecuted ready task with the highest priority (a small code sketch of this procedure follows below).
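The sketch below illustrates this level-based list scheduling in C. Only the algorithm (level = maximum number of nodes on a path to a terminal node, highest level scheduled first when a processor is free) comes from the slide; the example in-forest, the node numbering and the unit execution times are my own assumptions.

    #include <stdio.h>

    /* Level-based list scheduling for an in-forest task graph
     * (every node has at most one successor).  Illustrative sketch. */
    #define N 7
    #define PROCS 2

    /* succ[i] = the single successor of task i, or -1 for a terminal node.
     * Example in-forest:  0->2, 1->2, 2->6, 3->5, 4->5, 5->6, 6 terminal. */
    static const int succ[N] = { 2, 2, 6, 5, 5, 6, -1 };

    static int level(int i) {            /* #nodes on the path i -> terminal */
        return succ[i] < 0 ? 1 : 1 + level(succ[i]);
    }

    int main(void) {
        int npred[N] = {0}, done[N] = {0}, scheduled = 0, t = 0;

        for (int i = 0; i < N; i++)
            if (succ[i] >= 0) npred[succ[i]]++;     /* count predecessors */

        while (scheduled < N) {
            int chosen[PROCS], nchosen = 0;
            /* each free processor takes the ready task with the highest level */
            for (int p = 0; p < PROCS; p++) {
                int best = -1;
                for (int i = 0; i < N; i++) {
                    int picked = 0;
                    for (int k = 0; k < nchosen; k++) if (chosen[k] == i) picked = 1;
                    if (!done[i] && !picked && npred[i] == 0 &&
                        (best < 0 || level(i) > level(best)))
                        best = i;
                }
                if (best >= 0) chosen[nchosen++] = best;
            }
            printf("t=%d:", t);
            for (int k = 0; k < nchosen; k++) {
                printf("  P%d runs task %d (level %d)",
                       k + 1, chosen[k], level(chosen[k]));
                done[chosen[k]] = 1;
                scheduled++;
                if (succ[chosen[k]] >= 0) npred[succ[chosen[k]]]--;
            }
            printf("\n");
            t++;   /* unit execution time assumed for every task */
        }
        return 0;
    }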

Quiz 4: QUESTION #2
The task graph is given on the slide together with the execution and communication times (task graph: node a with two successor tasks b and c; communication times on the arcs: (a,b) y = 5, (a,c) x = 10; the per-task execution times appear only in the slide's table).

a. Draw the Gantt chart with communication when this program is executed on two processors. Schedule the program on these processors so that the overall time is minimized. What is the total time needed?

Gantt chart: P1 executes a and then b; P2 is idle until a's result arrives over arc (a,c) and then executes c, finishing at time 30.
→ total time is 30

Quiz 4: QUESTION #2
b. Which technique will help eliminate the communication time? What is the total time needed?

NODE DUPLICATION → task a is duplicated on both processors, so c no longer waits for the (a,c) message: P1 executes a and then b, while P2 executes its own copy of a and then c, finishing at time 25.
→ total time is 25
(For comparison, the schedule from part a, without duplication, needed 30; a small sketch that reproduces both totals under assumed execution times follows below.)
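The arithmetic behind the two answers can be checked with the short sketch below. The per-task execution times (a = 10, b = 15, c = 10) are my own assumption, chosen only because they are consistent with the totals of 30 and 25 on the slides; the communication times y = 5 and x = 10 come from the question.

    #include <stdio.h>

    /* Schedule-length check for the fork graph a -> {b, c} on two processors.
     * ASSUMED execution times (not stated in the transcript): a=10, b=15, c=10.
     * Communication times from the question: (a,b) y=5, (a,c) x=10. */
    int main(void) {
        int ta = 10, tb = 15, tc = 10;   /* assumed task execution times  */
        int y = 5, x = 10;               /* communication (a,b) and (a,c) */
        (void)y;                         /* y never matters: a and b share P1 */

        /* (a) No duplication: P1 runs a then b; P2 waits for the (a,c) message. */
        int p1_finish = ta + tb;             /* b starts right after a on P1  */
        int p2_finish = ta + x + tc;         /* c waits for a's result        */
        int total_no_dup = p1_finish > p2_finish ? p1_finish : p2_finish;

        /* (b) Node duplication: both processors execute their own copy of a,
         *     so no message is needed and c starts as soon as a finishes. */
        int total_dup = (ta + tb) > (ta + tc) ? (ta + tb) : (ta + tc);

        printf("without duplication: %d\n", total_no_dup);  /* 30 */
        printf("with    duplication: %d\n", total_dup);     /* 25 */
        return 0;
    }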

Quiz 4: QUESTION #1
1. Which of the following statements is false?
a) Node duplication reduces the overall number of computational operations in the system
b) Node duplication reduces communication delays
c) Node duplication is used to reduce the idle time

Statement (a) is false: duplicating a node means the same task is executed more than once, so the number of computational operations increases; what duplication reduces is communication delay and processor idle time, as in Question #2(b) above.

Vector Processing: architectures that provide high-level operations that work on linear arrays of numbers, or "vectors".
Some typical vector-based instructions (used in the examples below): LV (load vector), SV (store vector), ADDV (vector add), MULV (vector multiply), and vector-scalar forms such as MULVS.

Convoy → a set of vector instructions that could potentially begin execution together in one clock period; the instructions in a convoy must not contain structural or data hazards among themselves.

Enhancing vector performance: Chaining → allows a dependent vector operation to start as soon as the individual elements of its vector source operand become available, instead of waiting for the producing instruction to finish the entire vector.

Quiz 4: QUESTION #1
3. If we compare a program that deals with arrays written for a vector processor and for a scalar processor, we can see that the vector program has a smaller number of instructions and also executes a smaller number of operations. Why?
The number of instructions is reduced because whole loops can be replaced with one (or a few) vector instructions. The number of operations is reduced as well because the bookkeeping operations needed to manage the loop, such as incrementing indices and branching, no longer have to be executed in software (compare the scalar loop sketched below).
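For illustration (not part of the quiz), here is the scalar C loop for the Y = X*Z + Y computation used in Question #3 below; the vector length of 64 matches the quiz, and the comment indicates the rough vector equivalent.

    #include <stdio.h>

    #define N 64

    /* Scalar version of Y = X*Z + Y.  Each iteration executes the useful
     * multiply/add plus loop-overhead work (index increment, compare, branch).
     * A vector processor replaces the whole loop with a handful of vector
     * instructions (roughly: LV, LV, MULV, LV, ADDV, SV), so both the
     * instruction count and the loop-overhead operations disappear. */
    int main(void) {
        double x[N], y[N], z[N];

        for (int i = 0; i < N; i++) {        /* set up some sample data */
            x[i] = i;  z[i] = 2.0;  y[i] = 1.0;
        }

        for (int i = 0; i < N; i++)          /* 64 iterations of overhead */
            y[i] = x[i] * z[i] + y[i];       /* the only "useful" work    */

        printf("y[63] = %g\n", y[63]);       /* 63*2 + 1 = 127 */
        return 0;
    }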

Quiz 4: QUESTION #3 (a, b, 17 points each, total 34 points)
Consider the vector program given below for Y = X*Z + Y. All vectors have a length of 64. Suppose the hardware has 2 load/store units capable of performing 2 loads, or 2 stores, or 1 load and 1 store vector operation at the same time, one pipelined vector multiplier and one pipelined vector adder. Suppose that chaining is not allowed and that the start-up times are 12 for LV and SV, 7 for MULV and 6 for ADDV.
a. How many convoys do we have?
b. What is the total execution time?

    LV   V5,Rz       ; load vector Z
    LV   V1,Rx       ; load vector X
    MULV V2,V1,V5    ; vector multiply
    LV   V3,Ry       ; load vector Y
    ADDV V4,V2,V3    ; vector add
    SV   Ry,V4       ; store the result

4 convoys:
1. LV, LV
2. MULV, LV
3. ADDV
4. SV

Each convoy takes the start-up time of its slowest instruction plus 64 element cycles, so the total execution time is
→ (12 + 64) + (12 + 64) + (6 + 64) + (12 + 64) = 4 × 64 + 42 = 298 clock cycles
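The convoy arithmetic can be packaged in a few lines of C; the timing model (per-convoy time = largest start-up in the convoy + vector length) is the one used in the answer above, and the variable names are my own.

    #include <stdio.h>

    /* Unchained execution-time model from Quiz 4, Question #3:
     * each convoy costs (largest start-up time in the convoy) + VL cycles. */
    #define VL 64   /* vector length */

    int main(void) {
        /* largest start-up time of each convoy:
         * 1: LV,LV -> 12   2: MULV,LV -> 12   3: ADDV -> 6   4: SV -> 12 */
        int convoy_startup[] = { 12, 12, 6, 12 };
        int nconvoys = sizeof convoy_startup / sizeof convoy_startup[0];

        int total = 0;
        for (int i = 0; i < nconvoys; i++)
            total += convoy_startup[i] + VL;

        printf("%d convoys, total = %d cycles\n", nconvoys, total);  /* 4, 298 */
        return 0;
    }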

Final: QUESTION #5.1-2
Consider the following code, implemented on a vector processor, which computes Y = a × X for 64-element vectors:

    L.D     F0,a       ; load scalar a
    LV      V1,Rx      ; load vector X
    MULVS.D V2,V1,F0   ; vector-scalar multiply
    SV      Ry,V2      ; store the result

Start-up delay: load/store unit 12 clock cycles, multiply unit 7 clock cycles.
Compute the total execution time of the vector instructions if the instructions are chained. Assume that:
a) There is only 1 load/store unit.
Timing: LV occupies the single load/store unit for 12 + 64 = 76 cycles; MULVS.D is chained to LV (its first result is ready well before cycle 76), but SV cannot begin until the load/store unit becomes free at cycle 76. SV then needs 12 + 64 more cycles.
→ total = 76 + 12 + 64 = 152 clock cycles

Final: QUESTION #5.1-2 (continued)
Same code and start-up delays as above; compute the total execution time with chaining, assuming that:
b) There are one load unit and one store unit.
With a separate store unit there is no structural hazard on SV, so SV is chained to MULVS.D: the first element is stored after the LV, MULVS.D and SV start-ups, and the remaining 63 elements follow one per cycle.
→ total = 12 + 7 + 12 + 64 = 95 clock cycles
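Both chained results can be reproduced with the small timing sketch below; the model (a chained instruction starts after its producer's start-up, and a structural hazard delays SV until the single load/store unit is free) follows the reasoning above, and the function and variable names are mine.

    #include <stdio.h>

    /* Chained timing model for L.D / LV / MULVS.D / SV on a 64-element vector.
     * Start-ups: load/store = 12, multiply = 7 (from the question). */
    #define VL 64

    static int total_time(int separate_store_unit) {
        int lv_busy_until = 12 + VL;      /* LV holds the load/store unit until cycle 76 */
        int mulvs_start   = 12;           /* chained to LV after its start-up            */

        /* SV is chained to MULVS.D, but with a single load/store unit it must
         * also wait until LV has released that unit (structural hazard). */
        int sv_start = separate_store_unit
                     ? mulvs_start + 7                /* data ready: cycle 19 */
                     : lv_busy_until;                 /* unit free:  cycle 76 */

        return sv_start + 12 + VL;        /* SV start-up + 64 stored elements */
    }

    int main(void) {
        printf("one load/store unit      : %d cycles\n", total_time(0));  /* 152 */
        printf("separate load+store units: %d cycles\n", total_time(1));  /* 95  */
        return 0;
    }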