Tutorial for QUIZ 1: Interconnects, shared memory, and synchronization


Tutorial for QUIZ 1: Interconnects, shared memory, and synchronization
Daniel Shapiro
Objective: explain partitioning and synchronization for shared-memory systems.

Question #1
QUESTION: Can there be questions on the quiz that are not in the tutorial?
ANSWER: YES.

INTERCONNECTS (QUIZ #2) QUESTION #1

1. The following network is asymmetric:
   [a] Hypercube  [b] 2D Mesh  [c] Ring
2. If the main concern in the design of the interconnection network is configurability (the ability to easily add more nodes), then which multistage network should be used:
   [a] Bus  [b] Omega network  [c] Crossbar network
3. The number of permutations in an 8x8 crossbar network is:
   [a] 256  [b] 40320  [c] 16777216
4. In a single-bus system that uses split transactions:
   [a] Both masters and slaves have to compete for the bus by arbitration
   [b] Only the masters have to compete for the bus by arbitration
   [c] No arbitration is necessary

Note on (3): the crossbar is non-blocking, so the number of permutations is 8P8 = 8!/(8-8)! = 8!/0! = 8!/1 = 8! = 40320.
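A quick check of the crossbar count in C (a minimal sketch; only the factorial arithmetic comes from the slide, the program itself is illustrative):

    #include <stdio.h>

    /* An N x N crossbar is non-blocking, so it can realize any of the
     * N! one-to-one input/output mappings. */
    static unsigned long factorial(unsigned n) {
        unsigned long f = 1;
        for (unsigned i = 2; i <= n; i++)
            f *= i;
        return f;
    }

    int main(void) {
        printf("8x8 crossbar permutations: %lu\n", factorial(8));  /* 40320 */
        return 0;
    }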

INTERCONNECTS (QUIZ #2) QUESTION #2

a. Compute the diameter and bisection width for a 3D mesh with p = 64 processors.
b. Compute the diameter and bisection width for a 2D torus with p = 64 processors (assume that routing is bidirectional).
c. Construct an 8-input Omega network using 2x2 switch modules in multiple stages. Show the routing of the message from input 010 to output 110.

(See the Static Interconnection Networks slides, comparison of topologies, slide 11.)

Answers:
a. 4x4x4 mesh: diameter D = 3+3+3 = 9 (k-1 = 3 hops per dimension); bisection width = 4x4 = 16 links cut along the middle.
b. 8x8 torus: bisection width = 8x2 = 16 links (the wraparound doubles the cut); diameter D = 2*floor(sqrt(64)/2) = 2*floor(4) = 2*4 = 8.
c. With destination-tag routing, each stage examines one bit of the destination address (MSB first): 0 selects the upper switch output, 1 the lower. For destination 110 the switches along the path are set lower, lower, upper.
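The same arithmetic in a minimal C sketch, using the hop-counting conventions above (illustrative, not course code):

    #include <stdio.h>

    /* Diameter and bisection width for the two 64-node topologies above,
     * counting bidirectional links (hops). */
    int main(void) {
        int k3 = 4;                      /* 3D mesh, 4 x 4 x 4 = 64 nodes    */
        int mesh_diam = 3 * (k3 - 1);    /* corner to corner: 3+3+3 = 9      */
        int mesh_bw   = k3 * k3;         /* cut along the middle: 16 links   */

        int k2 = 8;                      /* 2D torus, 8 x 8 = 64 nodes       */
        int torus_diam = 2 * (k2 / 2);   /* wraparound halves each dimension */
        int torus_bw   = 2 * k2;         /* a bisection cuts 2k links        */

        printf("3D mesh:  diameter = %d, bisection = %d\n", mesh_diam, mesh_bw);
        printf("2D torus: diameter = %d, bisection = %d\n", torus_diam, torus_bw);
        return 0;
    }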

2. Consider the following 16x16 Omega network.
a. Number of stages.
b. Number of 2x2 switches needed to implement the network.
c. Draw a 16-input Omega network using 2x2 switches as building blocks. Show the switch settings for routing a message from node 1101 to node 0101 and from node 0111 to node 1001 simultaneously. Does blocking exist in this case?

Answers:
a. log2(16) = 4 stages.
b. 16/2 = 8 switches per stage x 4 stages = 32 switches.
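The stage and switch counts generalize to any power-of-two N; a minimal sketch (illustrative, not course code):

    #include <stdio.h>

    /* An N-input Omega network built from 2x2 switches has log2(N)
     * stages of N/2 switches each. */
    int main(void) {
        int N = 16;
        int stages = 0;
        for (int n = N; n > 1; n >>= 1)
            stages++;                         /* log2(16) = 4         */
        int per_stage = N / 2;                /* 8 switches per stage */
        printf("stages = %d, switches = %d\n", stages, stages * per_stage);
        return 0;                             /* prints 4 and 32      */
    }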

You may be asked to…
- Analyze a program
- Modify a program
- Write a program

Multiprocessor Software Functions

INITIALIZE – assigns a number (proc_num) to each processor in the system and sets the total number of processors (num_procs).
LOCK(data) – allows a processor to “check out” a certain piece of shared data. While one processor holds the lock, no other processor can obtain it. The lock is blocking: once a LOCK is encountered, execution of the program cannot proceed until the lock is obtained.
UNLOCK(data) – releases a lock so that other processors can obtain it.
BARRIER(n_procs) – when a BARRIER is encountered, a processor waits there until n_procs processors have reached the BARRIER; only then can execution proceed.
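These primitives are pseudocode. One plausible rendering on a POSIX system maps LOCK/UNLOCK to a pthread mutex, BARRIER to a pthread barrier, and INITIALIZE to thread creation with an explicit proc_num. A minimal sketch of that mapping (build with cc -pthread; the mapping is an assumption, not part of the course materials):

    #include <stdio.h>
    #include <pthread.h>

    #define NUM_PROCS 4

    static pthread_mutex_t   lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_barrier_t barrier;
    static int shared_count = 0;

    static void *worker(void *arg) {
        int proc_num = (int)(long)arg;     /* INITIALIZE assigns proc_num  */

        pthread_mutex_lock(&lock);         /* LOCK(shared_count): blocking */
        shared_count += 1;
        pthread_mutex_unlock(&lock);       /* UNLOCK(shared_count)         */

        pthread_barrier_wait(&barrier);    /* BARRIER(NUM_PROCS)           */

        if (proc_num == 0)                 /* safe: all increments done    */
            printf("all %d processors checked in\n", shared_count);
        return NULL;
    }

    int main(void) {
        pthread_t t[NUM_PROCS];
        pthread_barrier_init(&barrier, NULL, NUM_PROCS);
        for (long i = 0; i < NUM_PROCS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NUM_PROCS; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }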

Mutex is not the boss!!
It is important to note that the mutex core does not physically protect resources in the system from being accessed at the same time by multiple processors. The software running on the processors is responsible for abiding by the rules: it must be designed to always acquire the mutex before accessing the associated shared resource.
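A two-function sketch of why that discipline matters (illustrative names, nothing here is course code): the mutex serializes only the code that actually takes it.

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static int shared_data = 0;

    /* Disciplined access: updates are serialized by the mutex. */
    void well_behaved(void) {
        pthread_mutex_lock(&m);
        shared_data++;
        pthread_mutex_unlock(&m);
    }

    /* Nothing physically prevents this: it compiles and runs, but it
     * races with well_behaved() because it never takes the mutex. */
    void rogue(void) {
        shared_data++;    /* data race: bypasses the lock entirely */
    }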

SHARED MEMORY

1. a) Consider a shared-memory system with two processors. What value will processor P1 print? If the variable flag is used only in this portion of the code of processors P1 and P2, do we need to use a lock (mutex)? Why?

    P1:                        P2:
    …                          …
    A = 0;                     A = 1;
    while (flag == 0)          flag = 1;
        ; // do nothing
    print A;

b) System deadlock refers to a situation in a multiprocessor where concurrent processes hold resources and prevent each other from completing their execution. Is it possible that the software executing on P1 and P2 causes a deadlock?

c) Why does NIOS II use a hardware mutex? What is the advantage of a hardware mutex in comparison with a software one?

Answer notes:
a) Busy-waiting on flag means P1 hits the bus every cycle and slows down P2; bus access locks up the system. Use a mutex to solve the problem: the mutex is an atomic test-and-set, saving a cycle and ensuring there is no conflicting assignment.
b) No.
c) The mutex is a core on the bus, not inside the NIOS II. Altera: “A mutex allows cooperating processors to agree that one of them should be allowed mutually exclusive access to a hardware resource in the system. This is useful for the purpose of protecting resources from data corruption that can occur if more than one processor attempts to use the resource at the same time.”
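For part a), here is the P1/P2 interaction written as two threads in C11 (a sketch under an assumption the slide does not make: flag is declared atomic, because on a real compiler and CPU a plain int would allow the load to be hoisted out of the loop or the stores to be reordered):

    #include <stdio.h>
    #include <pthread.h>
    #include <stdatomic.h>

    static int A = 0;
    static atomic_int flag = 0;

    static void *p2(void *unused) {
        A = 1;
        atomic_store(&flag, 1);          /* published only after A = 1   */
        return NULL;
    }

    int main(void) {                     /* plays the role of P1         */
        pthread_t t;
        A = 0;
        pthread_create(&t, NULL, p2, NULL);
        while (atomic_load(&flag) == 0)
            ;                            /* busy-wait: hammers the bus   */
        printf("A = %d\n", A);           /* prints 1: A = 0 ran first,
                                            and flag = 1 follows A = 1   */
        pthread_join(t, NULL);
        return 0;
    }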

SHARED MEMORY QUESTION #4 (MIDTERM)

Consider a machine with 2 processors that share the same memory. A multiply-and-accumulate (MAC) operation is performed:

    global_MAC = X[0]*Y[0] + X[1]*Y[1] + … + X[N-1]*Y[N-1]

The MAC subroutine is implemented on both processors and is shown below. Modify the program to make it suitable for execution on a four-processor machine. If processor P1 starts executing the MAC subroutine before processor P0, will the final result be different? Why?

    1.  id = mypid();            // id = 0 for processor 0, id = 1 for processor 1
    2.  read_array(X, Y, N);     // read arrays X and Y of size N
    3.  if (id == 0)             // initialize the MAC
    4.  {
    5.      LOCK(global_MAC);
    6.      global_MAC = 0;
    7.      UNLOCK(global_MAC);
    8.  }
    9.  BARRIER(2);              // wait for all processors to reach this point
    10. local_MAC = 0;
    11. for (i = id*N/2; i < (id+1)*N/2; i++)
    12.     local_MAC += X[i]*Y[i];
    13. LOCK(global_MAC);
    14. global_MAC += local_MAC;
    15. UNLOCK(global_MAC);
    16. BARRIER(2);              // wait for all processors to reach this point
    17. END;

Answer (four processors): change BARRIER(2) to BARRIER(4) on lines 9 and 16, and change line 11 to for (i = id*N/4; i < (id+1)*N/4; i++).
Answer (start order): no, the BARRIER on line 9 guarantees that global_MAC is initialized before any processor accumulates into it, so the start order does not change the result.
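A rendering of the four-processor answer with pthreads (a sketch: NUM_PROCS, the array size, and the test data are made up for demonstration):

    #include <stdio.h>
    #include <pthread.h>

    #define NUM_PROCS 4
    #define N 16

    static double X[N], Y[N], global_MAC;
    static pthread_mutex_t   mac_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_barrier_t barrier;

    static void *mac(void *arg) {
        int id = (int)(long)arg;             /* id = 0 .. NUM_PROCS-1         */
        if (id == 0) {                       /* initialize the MAC            */
            pthread_mutex_lock(&mac_lock);
            global_MAC = 0;
            pthread_mutex_unlock(&mac_lock);
        }
        pthread_barrier_wait(&barrier);      /* BARRIER(NUM_PROCS): init done */

        double local_MAC = 0;
        for (int i = id * N / NUM_PROCS; i < (id + 1) * N / NUM_PROCS; i++)
            local_MAC += X[i] * Y[i];

        pthread_mutex_lock(&mac_lock);
        global_MAC += local_MAC;
        pthread_mutex_unlock(&mac_lock);
        pthread_barrier_wait(&barrier);
        return NULL;
    }

    int main(void) {
        pthread_t t[NUM_PROCS];
        for (int i = 0; i < N; i++) { X[i] = i; Y[i] = 1; }
        pthread_barrier_init(&barrier, NULL, NUM_PROCS);
        for (long i = 0; i < NUM_PROCS; i++)
            pthread_create(&t[i], NULL, mac, (void *)i);
        for (int i = 0; i < NUM_PROCS; i++)
            pthread_join(t[i], NULL);
        printf("global_MAC = %g\n", global_MAC);   /* 0+1+...+15 = 120 */
        return 0;
    }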

SHARED MEMORY

Sum all the elements of an array Z of size n.

    INITIALIZE;                        // assign proc_num and num_procs
    read_array(Z, size);               // read the array and array size from file
    if (proc_num == 0)                 // initialize the sum
    {
        LOCK(global_sum);
        global_sum = 0;
        UNLOCK(global_sum);
    }
    BARRIER(num_procs);                // ensure the sum is initialized before any
                                       // processor accumulates into it
    local_sum = 0;
    size_to_sum = size / num_procs;
    lower_ind = size_to_sum * proc_num;
    upper_ind = size_to_sum * (proc_num + 1);
    for (i = lower_ind; i < upper_ind; i++)
        local_sum += Z[i];             // if size = 100 and num_procs = 4, processor 0
                                       // sums 0 to 24, processor 1 sums 25 to 49, etc.
    LOCK(global_sum);                  // lock the sum so only this processor can change it
    global_sum += local_sum;
    UNLOCK(global_sum);                // release the sum so other processors can add to it
    BARRIER(num_procs);                // wait for num_procs processors to reach this point
    if (proc_num == 0)
        printf("sum is %d", global_sum);
    END;

SHARED MEMORY

Cumulative sum of the elements of an array Z of size n, i.e. C(i) = Z(1) + Z(2) + … + Z(i).

    INITIALIZE;                 // assign proc_num and M, where M is the number of processors
    if (proc_num == 0)          // initialize the sums
        C = {0};
    BARRIER(M);
    read_array(Z, n);           // read the array and the array size n from file
    size_to_sum = n / M;
    lower_ind = size_to_sum * proc_num;
    upper_ind = size_to_sum * (proc_num + 1);
    for (i = lower_ind; i < upper_ind; i++)
        C[i] = C[i-1] + Z[i];   // when i = 0, i-1 = -1 is out of bounds: change the
                                // index to (i-1) > 0 ? (i-1) : 0
                                // and what happens when C[i-1] has not yet been
                                // computed by another processor? BAD NEWS
    BARRIER(M);                 // wait for M processors to reach this point
    for (j = M-1; j >= 1; j--)
    {
        if (proc_num >= j)
            C[i] = C[i] + C[size_to_sum * j];   // note: i is stale here (i == upper_ind),
                                                // another bug in this version
    }
    if (proc_num == 0)
        for (i = 0; i <= n; i++)                // off by one: should be i < n
            printf("C[i] = %d", C[i]);
    END;

Trace with M = 2 and n = 4, so size_to_sum = 2:

    For P0:                    For P1:
    lower_ind = 2*0 = 0        lower_ind = 2*1 = 2
    upper_ind = 2*1 = 2        upper_ind = 2*2 = 4
    C[0] = C[0] + Z[0];        C[2] = C[1] + Z[2];
    C[1] = C[0] + Z[1];        C[3] = C[2] + Z[3];

P1 reads C[1] while P0 may still be computing it: that is the BAD NEWS above.
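One way to repair the algorithm is a two-phase scan: each processor first computes a local prefix sum of its own block (no cross-block reads), and only after a barrier adds the totals of all preceding blocks. A runnable sketch in C with pthreads (my reconstruction, not course code; the test array is made up):

    #include <stdio.h>
    #include <pthread.h>

    #define M 2          /* number of processors */
    #define N 4          /* array size           */

    static int Z[N] = {1, 2, 3, 4};
    static int C[N];
    static int block_sum[M];             /* total of each block after phase 1 */
    static pthread_barrier_t barrier;

    static void *scan(void *arg) {
        int id = (int)(long)arg;
        int lo = id * (N / M), hi = (id + 1) * (N / M);

        /* Phase 1: local prefix sum, no dependence on other blocks. */
        int run = 0;
        for (int i = lo; i < hi; i++) {
            run += Z[i];
            C[i] = run;
        }
        block_sum[id] = run;
        pthread_barrier_wait(&barrier);  /* all local scans are complete */

        /* Phase 2: add the totals of all preceding blocks. */
        int offset = 0;
        for (int j = 0; j < id; j++)
            offset += block_sum[j];
        for (int i = lo; i < hi; i++)
            C[i] += offset;
        return NULL;
    }

    int main(void) {
        pthread_t t[M];
        pthread_barrier_init(&barrier, NULL, M);
        for (long i = 0; i < M; i++)
            pthread_create(&t[i], NULL, scan, (void *)i);
        for (int i = 0; i < M; i++)
            pthread_join(t[i], NULL);
        for (int i = 0; i < N; i++)
            printf("C[%d] = %d\n", i, C[i]);   /* 1 3 6 10 */
        return 0;
    }

The barrier is what removes the BAD NEWS: no processor reads another block's results until every local scan has finished.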

Let us do an example … see the files in this folder.

Questions?