
CS8625 High Performance and Parallel Computing
Dr. Ken Hoganson
Intro: Parallel Architectures
Copyright © 2001, 2004, 2005, 2006, 2008, Dr. Ken Hoganson

Server Hardware
Mission-critical:
– High reliability
– Redundancy
Massive storage (disk):
– RAID for redundancy
High performance through replication of components:
– Multiple processors
– Multiple buses
– Multiple hard drives
– Multiple network interfaces

Computing Paradigms
"Old" computing paradigm: mainframe/terminal, with centralized processing and storage.
Failed 1st client/server computing paradigm: decentralized processing and storage across PCs and servers.
Successful 2nd client/server computing paradigm: strong centralized processing and storage on servers, with PCs as clients.
[Figure: three diagrams, one per paradigm: terminals attached to a mainframe, PCs paired with servers, and PCs attached to a strong central server.]

Evolving Computing Paradigm
[Figure: processing and storage locality over time, swinging between centralized (clusters, servers) and decentralized (distributed, grid); the next swing is marked "?".]

Mainframe: the Ultimate Server?
Client/server architecture was originally predicted to bring about the demise of the mainframe.
Critical corporate data must reside on a highly reliable, high-performance machine.
Early PC networks did not have the needed performance or reliability:
– NOW (Network Of Workstations)
– LAN (Local Area Network)
Some firms, after experience with client/server problems, returned to the mainframe for critical corporate data and functions.
The modern computing paradigm combines:
– powerful servers (including mainframes when needed), where critical corporate data and information reside
– decentralized processing and non-critical storage on PCs
– interconnection with a network

Multiprocessor Servers
Multiprocessor servers offer high performance at much lower cost than a traditional mainframe:
– use inexpensive, "off-the-shelf" components
– combine multiple PCs or workstations in one box
– processors cooperate to complete the work
– processors share resources and memory
One of the implementations of parallel processing.
Blade cluster in process of development:
– 10 blades
– each blade has 2 CPUs, memory, and disk

5 Parallel Levels
Five levels of parallelism have been identified. Each level has both a software-level parallelism and a hardware implementation that accommodates or implements the software parallelism.

Level  Software                  Hardware Implementation
1      Intra-Instruction         Pipeline
2      Inter-Instruction         Super-Scalar, multiple pipelines
3      Algorithm/Thread/Object   Multiprocessor
4      Multi-Process             Clustered Multiprocessor
5      Distributed/N-Tier C/S    Multicomputer/Internet/Web

Sources:
– The Unified Parallel Speedup Model and Simulator, K. Hoganson, SE-ACM 2001, March 2001.
– Alternative Mechanisms to Achieve Parallel Speedup, K. Hoganson, First IEEE Online Symposium for Electronics Engineers, IEEE Society, November 2000.
– Workload Execution Strategies and Parallel Speedup on Clustered Computers, K. Hoganson, IEEE Transactions on Computers, Vol. 48, No. 11, November 1999.

Terminology
Thread: a lightweight process, easy (efficient) to multitask between.
Multiprocessor: a computer system with multiple processors combined in a single system (in a single box or frame). The processors usually share memory and other resources.
Multicomputer: multiple discrete computers, each with its own memory, interconnected with a network.
Clustered computer: a multiprocessor OR multicomputer that builds two levels of interconnection between processors:
– intra-cluster connection (within a cluster)
– inter-cluster connection (between clusters)
Distributed computer: a loosely coupled multicomputer; an n-tiered client/server computing system is an example of distributed computing.
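The thread definition is easiest to see in running code. Below is a minimal POSIX-threads sketch (the `worker` function and the count of four threads are illustrative, not from the slides): the threads share the process's address space, which is what makes them lightweight relative to full processes.

```c
#include <pthread.h>
#include <stdio.h>

/* Each thread runs this function; all threads share the
   process's globals and heap, unlike separate processes. */
static void *worker(void *arg) {
    int id = *(int *)arg;
    printf("thread %d running\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[4];
    int ids[4];
    for (int i = 0; i < 4; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);  /* wait for each thread */
    return 0;
}
```

Compile with `gcc -pthread`.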

Clustered Multiprocessor
[Figure: several CPU/cache pairs connected over a shared interconnect to common memory (MEM) and I/O.]

Multi-Computer
[Figure: three complete computers, each with its own CPU, memory (MEM), I/O, and network interface (NIC), joined by a network.]

Level 5: N-Tier Client-Server
[Figure 2. N-Tier Architectures: a client tier of workstations (C) on a LAN, connected through gateways (G) and the Internet to server tiers of data servers (S) and web host servers (W).]
Legend: C = client workstation, S = data server, G = gateway, W = web host server.

Flynn's Classification
An old idea, but still useful. It examines parallelism from the point of view of the parallel scope of an instruction:
SISD – Single Instruction, Single Data: each instruction operates on a single data item.
SIMD – Single Instruction, Multiple Data: each instruction operates on multiple data items simultaneously (classic supercomputing).
MIMD – Multiple Instruction, Multiple Data: separate instruction/data streams; super-scalar processors, multiprocessors, multicomputers.
MISD – Multiple Instruction, Single Data: no known examples.
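A loop makes the SISD/SIMD distinction concrete. In the sketch below (function names are illustrative), the first version executes one add per instruction, while the OpenMP `simd` pragma in the second asks the compiler to emit vector instructions that add several elements at once:

```c
#include <stddef.h>

/* SISD view: one instruction stream, one data item per operation. */
void add_sisd(float *c, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];        /* one add per loop iteration */
}

/* SIMD view: one instruction operates on multiple data items. */
void add_simd(float *c, const float *a, const float *b, size_t n) {
    #pragma omp simd               /* vectorize: e.g. 4-8 adds at once */
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Compile with `-fopenmp-simd` (GCC/Clang) to honor the pragma.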

Symmetric Multiprocessing
Asymmetric multiprocessing:
– multiple unique processors, each dedicated to a special function
– the PC is an example
Symmetric multiprocessing:
– multiple identical processors able to work together on parallel problems
Homogeneous system: a symmetric multiprocessor.
Heterogeneous system: different makes or models of processors combined in a system. Example: a distributed system built from different types of PCs with different processors.
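On a symmetric multiprocessor the operating system simply sees n identical processors. A quick way to observe the count (a sketch for POSIX-like systems; `_SC_NPROCESSORS_ONLN` is a widely supported extension rather than strict POSIX):

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Number of identical processors currently online. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online processors: %ld\n", n);
    return 0;
}
```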

Classic Model: Parallel Processing
Multiple processors are available (4 in this example).
A process can be divided into serial and parallel portions; the parallel parts are executed concurrently.
S – serial (non-parallel) portion
A – all A parts can be executed concurrently
B – all B parts can be executed concurrently
All A parts must be completed prior to executing the B parts.
An example parallel process of time 10, executed on a single processor:
S A A A A B B B B S (serial time: 10 time units)
Executed in parallel on 4 processors, the four A parts share one time step and the four B parts another:
S [AAAA] [BBBB] S (parallel time: 4 time units)
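The S/A/B schedule maps directly onto a fork/join structure. A minimal OpenMP sketch (the `do_A`/`do_B` work functions are placeholders): the implicit barrier at the end of the first parallel loop enforces the rule that every A completes before any B starts.

```c
#include <stdio.h>

void do_S(const char *tag) { printf("%s\n", tag); }  /* serial work */
void do_A(int i) { printf("A%d\n", i); }
void do_B(int i) { printf("B%d\n", i); }

int main(void) {
    do_S("S");                    /* serial portion */

    #pragma omp parallel for      /* all A parts run concurrently */
    for (int i = 0; i < 4; i++)
        do_A(i);
    /* implicit barrier: all As finish before any B starts */

    #pragma omp parallel for      /* all B parts run concurrently */
    for (int i = 0; i < 4; i++)
        do_B(i);

    do_S("S");                    /* final serial portion */
    return 0;
}
```

Compile with `-fopenmp` and run with `OMP_NUM_THREADS=4` to mirror the 4-processor example.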

Amdahl's Law (Analytical Model)
An analytical model of parallel speedup from the 1960s.
The parallel fraction (f) of the work runs over n processors, taking f/n time.
The part that must be executed serially (1 - f) gets no speedup.
Overall performance is limited by the fraction of the work that cannot be done in parallel (1 - f): diminishing returns with increasing processors (n).
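In symbols (the standard statement of Amdahl's law, with f the parallel fraction defined above):

```latex
% Amdahl's law: speedup of a workload whose fraction f is
% parallelizable across n processors.
\[
  S(f, n) = \frac{1}{(1 - f) + \dfrac{f}{n}}
\]
% Check against the classic-model example: f = 0.8, n = 4 gives
% S = 1 / (0.2 + 0.2) = 2.5, i.e. 10 time units drop to 4.
```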

Pipelined Processing
A single processor enhanced with discrete stages; instructions "flow" through the pipeline stages.
Parallel speedup comes from multiple instructions being executed (by parts) simultaneously.
Realized speedup is partly determined by the number of stages: 5 stages = at most 5 times faster.
The five stages, in order:
F – Instruction Fetch
D – Instruction Decode
OF – Operand Fetch
EX – Execute
WB – Write Back (result store)
The processor clock/cycle is divided into sub-cycles; each stage takes one sub-cycle.
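The "at most 5 times faster" claim can be made precise. In a standard formulation (an addition here, not the slide's own equation), a pipeline with S stages needs S sub-cycles to produce its first result and then retires one instruction per sub-cycle:

```latex
% Ideal speedup for n instructions through an S-stage pipeline:
% serial time is nS sub-cycles; pipelined time is S + (n - 1).
\[
  \text{Speedup}(n, S) = \frac{nS}{S + (n - 1)},
  \qquad
  \lim_{n \to \infty} \text{Speedup}(n, S) = S
\]
% e.g. S = 5, n = 100: 500 / 104 is about 4.8, approaching the 5x bound.
```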

Pipeline Performance
Speedup is serial time (nS) over parallel time.
Performance is limited by the number of pipeline flushes (n) due to jumps; speculative execution and branch prediction can minimize pipeline flushes.
Performance is also reduced by pipeline stalls (s), due to conflicts over bus access, data-not-ready delays, and other sources.
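One way to fold flushes and stalls into the ideal formula. The form below is an assumption, not the slide's own equation: writing I for the instruction count, n for the flush count (each flush refilling the pipe at a cost of about S - 1 sub-cycles), and s for total stall sub-cycles:

```latex
% I instructions, S stages, n pipeline flushes, s stall sub-cycles.
\[
  \text{Speedup} \approx \frac{IS}{\,I + (S - 1) + n(S - 1) + s\,}
\]
% With n = s = 0 this reduces to the ideal pipeline speedup
% from the previous slide.
```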

Super-Scalar: Multiple Pipelines
Concurrent execution of multiple sets of instructions.
Example: simultaneous execution of instructions through an integer pipeline while processing instructions through a floating-point pipeline.
The compiler identifies and specifies separate instruction sets for concurrent execution through different pipes.

Algorithm/Thread Parallelism
Parallel "threads of execution":
– could be separate processes, OR
– could be a multi-threaded process
Each thread of execution obeys Amdahl's parallel speedup model.
Multiple concurrently executing processes result in multiple serial components executing concurrently: another level of parallelism.
Example: two programs P1 and P2, each with profile S A A B B S. The serial parts of Program 1 and Program 2 now run in parallel with each other. Each program would take 6 time units on a uniprocessor, for a total workload serial time of 12. Each program alone has a speedup of 1.5; the total speedup is 12/4 = 3, which is also the sum of the program speedups.

Multiprocess Speedup
Concurrent execution of multiple unrelated processes.
Each process is limited by Amdahl's parallel speedup, but multiple concurrently executing processes give multiple serial components executing concurrently: another level of parallelism.
This avoids Degree of Parallelism (DOP) speedup limitations, with linear scaling up to the machine's limits on processors and memory: n × the single-process speedup.
Example timings for two S A A B B S programs:
– No speedup (uniprocessor): 12 t
– Single process parallelized at a time: 8 t, speedup = 1.5
– Multi-process (both parallelized concurrently): 4 t, speedup = 3

Analytical Multi-Process/Thread Speedup
Multi-process/thread speedup (similar processes):
– f = fraction of work that can be done in parallel
– n = number of processors
– N = number of concurrent (assumed similar) processes or threads
Multi-process/thread speedup (dissimilar processes):
– f = fraction of work that can be done in parallel
– n = number of processors in the system
– n_i = number of processors used by process i
– N = number of concurrent (assumed dissimilar) processes or threads
A reconstruction of the similar-process formula is sketched below.
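The similar-process speedup formula, reconstructed so that it reproduces the worked example two slides back (the exact form is an inference from that example, hedged accordingly): the N serial components overlap with one another, while the combined parallel work of all N processes shares the n processors.

```latex
% N similar processes, each with parallel fraction f, on n processors.
% The N serial parts run concurrently; the combined parallel work
% N*f is spread across all n processors.
\[
  \text{Speedup}(N, f, n) = \frac{N}{(1 - f) + \dfrac{Nf}{n}}
\]
% Check: N = 2, f = 2/3, n = 4 gives 2 / (1/3 + 1/3) = 3,
% matching the multi-process speedup of 3; N = 1 recovers Amdahl.
```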

Realizing Multiple Levels of Parallelism
Most parallelism suffers from diminishing returns, resulting in limited scalability.
Allocating hardware resources to capture multiple levels of parallelism keeps each level operating at the efficient end of its speedup curve.
Manufacturers of microcontrollers are integrating multiple levels of parallelism on a single chip.

End of Today's Lecture.
