Classic Model of Parallel Processing

Presentation transcript:

Classic Model of Parallel Processing
- Multiple processors are available (4 in this example)
- A process can be divided into serial and parallel portions; the parallel parts are executed concurrently
- Example parallel process of 10 time units:
  - S - serial (non-parallel) portion
  - A - all A parts can be executed concurrently
  - B - all B parts can be executed concurrently
  - All A parts must be completed before any B part can start
- Executed on a single processor (S, then the A parts, then the B parts): serial time = 10 time units
- Executed in parallel on 4 processors: parallel time = 4 time units
- The speedup for this example is worked out below
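As a quick check of the numbers on this slide, assume the breakdown 2 units of serial work (S), four A parts of 1 unit each, and four B parts of 1 unit each (an assumed split, chosen only because it is consistent with the stated totals):

    Serial time        = 2 + 4 + 4           = 10 time units
    Parallel time (4)  = 2 + 4/4 + 4/4       = 4 time units
    Speedup            = 10 / 4              = 2.5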

Amdahl’s Law (Analytical Model)
- Analytical model of parallel speedup from the 1960s
- The parallel fraction of the work (α) is run over n processors, taking α/n time
- The part that must be executed serially (1 - α) gets no speedup
- Overall performance is limited by the fraction of the work that cannot be done in parallel (1 - α): diminishing returns with increasing processors (n)
- The resulting speedup formula is shown below
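A minimal statement of the model described above, with α the parallel fraction and n the number of processors (this is the standard form of Amdahl's Law; the notation is chosen here, not copied from the slide):

    \mathrm{Speedup}(n) = \frac{1}{(1 - \alpha) + \alpha / n}

As n grows, the speedup approaches 1 / (1 - α), which is why the serial fraction dominates for large processor counts.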

Pipelined Processing
- A single processor enhanced with discrete stages; instructions "flow" through the pipeline stages
- Parallel speedup comes from multiple instructions being executed (by parts) simultaneously
- Realized speedup is partly determined by the number of stages: with 5 stages, at most 5 times faster
- Five-stage example, one instruction advancing per cycle (cycles 1-5):
  - F  - Instruction Fetch
  - D  - Instruction Decode
  - OF - Operand Fetch
  - EX - Execute
  - WB - Write Back (result store)
- The processor clock/cycle is divided into sub-cycles; each stage takes one sub-cycle

Pipeline Performance
- Speedup is serial time (n instructions × S stages = nS sub-cycles) over parallel (pipelined) time
- Performance is limited by the number of pipeline flushes caused by jumps; speculative execution and branch prediction can minimize pipeline flushes
- Performance is also reduced by pipeline stalls (s) due to bus-access conflicts, data-not-ready delays, and other sources
- An idealized speedup formula, ignoring flushes and stalls, is sketched below
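For n instructions flowing through an S-stage pipeline, a standard textbook baseline (a sketch in my own notation, not the exact expression from the slide) is:

    \mathrm{Speedup} = \frac{nS}{S + (n - 1)}

which approaches S as n grows. Each flush or stall adds sub-cycles to the denominator and pulls the realized speedup below S.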

Super-Scalar: Multiple Pipelines
- Concurrent execution of multiple sets of instructions
- Example: simultaneous execution of instructions through an integer pipeline while processing other instructions through a floating-point pipeline
- Compiler: identifies and specifies separate instruction sequences for concurrent execution through the different pipes

Algorithm/Thread-Level Parallelism
- Example: algorithms to compute the Fast Fourier Transform (FFT), used in Digital Signal Processing (DSP)
- Many separate computations run in parallel (high degree of parallelism)
- Large exchange of data - much communication between processors (fine-grained parallelism)
- Communication time (latency) may be a consideration when multiple processors are combined on a board or motherboard
- A large communication load (fine-grained parallelism) can force the algorithm to become bandwidth-bound rather than computation-bound
- A minimal thread-level decomposition is sketched after this slide
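As a concrete (if simplified) illustration of thread-level parallelism, here is a minimal sketch using POSIX threads; the array-sum workload, the sizes, and the names (worker, partial, NUM_THREADS) are illustrative assumptions rather than anything from the original slides:

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4
    #define N 1024

    static double data[N];
    static double partial[NUM_THREADS];

    /* Each worker sums one contiguous chunk of the array; the chunks are
       independent, so all workers can run concurrently. */
    static void *worker(void *arg) {
        long id = (long)arg;
        long chunk = N / NUM_THREADS;
        double sum = 0.0;
        for (long i = id * chunk; i < (id + 1) * chunk; i++)
            sum += data[i];
        partial[id] = sum;
        return NULL;
    }

    int main(void) {
        pthread_t tid[NUM_THREADS];

        for (long i = 0; i < N; i++)             /* serial setup */
            data[i] = 1.0;

        for (long t = 0; t < NUM_THREADS; t++)   /* parallel portion: fork workers */
            pthread_create(&tid[t], NULL, worker, (void *)t);

        double total = 0.0;
        for (long t = 0; t < NUM_THREADS; t++) { /* serial combine */
            pthread_join(tid[t], NULL);
            total += partial[t];
        }
        printf("total = %.1f\n", total);         /* expect 1024.0 */
        return 0;
    }

Compile with -pthread. Each worker corresponds to one of the concurrently executable "A" parts in the classic model; the setup loop and the final combine are the serial fraction that Amdahl's Law charges against the overall speedup.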

Simple Algorithm/Thread Parallelism Model
- Parallel "threads of execution": each could be a separate process, or a thread within a multi-threaded process
- Each thread of execution obeys Amdahl's parallel speedup model
- Multiple concurrently executing processes result in multiple serial components executing concurrently - another level of parallelism
- Observe that the serial parts of Program 1 and Program 2 now run in parallel with each other
- Each program would take 6 time units on a uniprocessor, for a total serial workload of 12; run concurrently, the combined workload finishes in 4 time units
- Each program has a speedup of 1.5; the total speedup is 12/4 = 3, which is also the sum of the per-program speedups

Multiprocess Speedup
- Concurrent execution of multiple processes
- Each individual process is limited by Amdahl's parallel speedup
- Multiple concurrently executing processes result in multiple serial components executing concurrently - another level of parallelism
- This avoids Degree of Parallelism (DOP) speedup limitations
- Linear scaling up to the machine's limits on processors and memory: roughly N × the single-process speedup
- Example (two processes, 4 processors):
  - No speedup (uniprocessor): 12 time units
  - One process parallelized at a time: 8 time units, speedup = 1.5
  - Both processes run concurrently: 4 time units, speedup = 3

Algorithm/Thread Parallelism - Analytical Model
- Multi-process/thread speedup, where:
  - α = fraction of the work that can be done in parallel
  - n = number of processors
  - N = number of concurrent (assumed similar) processes or threads
- The same model can be stated for dissimilar processes or threads, each with its own parallel fraction
- One formulation of the speedup for similar processes is given below
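A formulation consistent with the two-program example above (N similar processes, each with parallel fraction α, sharing n processors; this is a reconstruction in my notation rather than a quotation of the slide's formula):

    \mathrm{Speedup}(N, n) = \frac{N}{(1 - \alpha) + N\alpha / n}

With the earlier numbers (α = 2/3, N = 2, n = 4) this gives 2 / (1/3 + 1/3) = 3, matching the multiprocess example.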

(Simple) Unified Model with Scaled Speedup
- Adds a scaling factor on the parallel work while holding the serial work constant
- k1 = scaling factor on the parallel portion
- α = fraction of the work that can be done in parallel
- n = number of processors
- N = number of concurrent (assumed dissimilar) processes or threads
- A sketch of the scaled-speedup formula follows
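A plausible form, obtained by scaling the parallel term of the previous formula by k1 while holding the serial work constant (this is an extension sketched here under that assumption, not a quotation of the original model):

    \mathrm{Speedup}(N, n, k_1) = \frac{N\,[(1 - \alpha) + k_1\alpha]}{(1 - \alpha) + N k_1 \alpha / n}

With k1 = 1 this reduces to the unscaled multi-process formula above.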

Capturing Multiple Levels of Parallelism
- Most parallelism suffers from diminishing returns, resulting in limited scalability
- Allocating hardware resources to capture multiple levels of parallelism keeps each level operating at the efficient end of its speedup curve
- Manufacturers of microcontrollers are integrating multiple levels of parallelism on a single chip

Trend in Microprocessor Architectures
Three levels of parallelism on one chip:
1. Intra-instruction parallelism: pipelines
2. Instruction-level parallelism: super-scalar, multiple pipelines
3. Algorithm/thread parallelism: multiple processing elements
Architectural variations:
- DSP and microcontroller cores on the same chip (integrated DSP with microcontroller)
- Enhanced microcontroller that also does DSP
- Enhanced DSP processor that also functions as a microcontroller
- Multiprocessor
Each variation captures some speedup from all three levels, with varying amounts of speedup from each level. Each parallel level operates more efficiently than if all hardware resources were allocated to a single parallel level.

More Levels of Parallelism Outside the Chip
- Multiple processors in a box: on a motherboard, or on a back-plane with daughter-boards
- Shared-memory multiprocessors: communication is through shared memory
- Clustered multiprocessors: another hierarchical level; processors are grouped into clusters; intra-cluster and inter-cluster connections can each be a bus or a network
- Distributed multicomputers: multiple computers loosely coupled through a network
- n-tiered architectures: modern client/server architectures

Speedup of Client-Server, 2-Tier Systems
- β = workload balance, the fraction of the workload that runs on the clients
  - β = 1 (100%): completely distributed
  - β = 0 (0%): completely centralized
- n clients, m servers
- Topology: n CLIENTS - LAN - INTERNET - LAN - m SERVERS
- A simple speedup sketch for this configuration follows
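One simple way to model this (an assumption-laden sketch, not the slide's own formula, and ignoring network latency): each of the n clients handles the fraction β of its own workload locally and in parallel, and the remaining (1 - β) of all n client workloads is spread evenly across the m servers. Relative to one machine executing all n workloads serially:

    \mathrm{Speedup}(n, m, \beta) \approx \frac{n}{\beta + n(1 - \beta)/m}

At β = 1 this gives n (fully distributed across the clients); at β = 0 it gives m (fully centralized on the servers).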

Speedup of Client-Server, n-Tier Systems
- m1 level-1 machines (clients); m2 machines at server tier 2, m3 at tier 3, m4 at tier 4, etc.
- β1 = workload balance, the fraction of the workload on the clients
- β2 = fraction of the workload on tier-2 servers, β3 = fraction on tier-3 servers, etc.
- Topology: m1 CLIENTS - LAN - INTERNET - LAN - SAN - m2, m3, m4 SERVERS

Presentation Summary
- Architects and chip manufacturers are integrating additional levels of parallelism
- Multiple levels of speedup achieve higher speedups and greater efficiencies than adding hardware at a single parallel level
- A balanced approach allocates hardware resources so that each level of parallelism delivers roughly the same efficiency in speedup per unit of hardware cost
- Numerous architectural approaches are possible, each with different trade-offs and performance returns
- Current technology integrates DSP processing with microcontroller functionality, achieving up to three levels of parallelism