
CS 213: Parallel Processing Architectures Laxmi Narayan Bhuyan


1 CS 213: Parallel Processing Architectures Laxmi Narayan Bhuyan http://www.cs.ucr.edu/~bhuyan

2 PARALLEL PROCESSING ARCHITECTURES CS 213 SYLLABUS, Winter 2008
INSTRUCTOR: L.N. Bhuyan (http://www.engr.ucr.edu/~bhuyan/)
PHONE: (951) 827-2347
E-mail: bhuyan@cs.ucr.edu
LECTURE TIME: TR 12:40pm-2pm
PLACE: HMNSS 1502
OFFICE HOURS: W 2:00-4:00 or by appointment

3 References: John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann. Research papers to be made available in class.
COURSE OUTLINE:
– Introduction to Parallel Processing: Flynn's classification, SIMD and MIMD operations, shared memory vs. message passing multiprocessors, distributed shared memory
– Shared Memory Multiprocessors: SMP and CC-NUMA architectures, cache coherence protocols, consistency protocols, data pre-fetching, CC-NUMA memory management, SGI Altix 4700 multiprocessor, chip multiprocessors, network processors (IXP and Cavium)
– Interconnection Networks: static and dynamic networks, switching techniques, Internet techniques
– Message Passing Architectures: message passing paradigms, grid architecture, workstation clusters, user-level software
– Multiprocessor Scheduling: scheduling and mapping, Internet web servers, P2P, content-aware load balancing
PREREQUISITE: CS 203A
GRADING: Project I – 20 points, Project II – 30 points, Test 1 – 20 points, Test 2 – 30 points

4 Possible Projects
– Experiments with SGI Altix 4700 Supercomputer – algorithm design and FPGA offloading
– I/O Scheduling on SGI Chip Multiprocessor (CMP) – design, analysis and simulation
– P2P – using PlanetLab
Note: 2 students/group – expect submission of a paper to a conference

5 Useful Web Addresses
– SGI Altix 4700: http://www.sgi.com/products/servers/altix/4000/ and http://www.sgi.com/products/rasc/
– Wisconsin Computer Architecture Page – simulators: http://www.cs.wisc.edu/~arch/www/tools.html
– SimpleScalar – www.simplescalar.com – look for multiprocessor extensions
– NepSim: http://www.cs.ucr.edu/~yluo/nepsim/
– Working in a cluster environment: Beowulf Cluster – www.beowulf.org; MPI – www-unix.mcs.anl.gov/mpi
– Application benchmarks: http://www-flash.stanford.edu/apps/SPLASH/

6 Parallel Computers
Definition: "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast." – Almasi and Gottlieb, Highly Parallel Computing, 1989
Questions about parallel computers:
– How large a collection?
– How powerful are the processing elements?
– How do they cooperate and communicate?
– How are data transmitted?
– What type of interconnection?
– What are the HW and SW primitives for the programmer?
– Does it translate into performance?

7 Parallel Processors "Myth"
The dream of computer architects since the 1950s: replicate processors to add performance rather than design a faster processor.
– Led to innovative organizations tied to particular programming models, since "uniprocessors can't keep going" (e.g., uniprocessors must stop getting faster due to the speed-of-light limit)
– Has it happened? Killer micros! Parallelism moved to the instruction level, and microprocessor performance doubles every 1.5 years!
– In the 1990s, parallel-computer companies went out of business: Thinking Machines, Kendall Square, ...

8 What Level of Parallelism?
– Bit-level parallelism: 1970 to ~1985 – 4-bit, 8-bit, 16-bit, 32-bit microprocessors
– Instruction-level parallelism (ILP): ~1985 through today – pipelining, superscalar, VLIW, out-of-order execution; limits to the benefits of ILP?
– Process-level or thread-level parallelism: mainstream for general-purpose computing? Servers are parallel; high-end desktop dual-processor PCs soon? (Or just sell the socket?)

9 Why Multiprocessors?
1. Microprocessors are the fastest CPUs – collecting several is much easier than redesigning one
2. Complexity of current microprocessors – do we have enough ideas to sustain 2X performance every 1.5 years? Can we deliver such complexity on schedule?
3. Slow (but steady) improvement in parallel software (scientific apps, databases, OS)
4. Emergence of embedded and server markets driving microprocessors in addition to desktops – embedded functional parallelism, network processors exploiting packet-level parallelism, SMP servers and clusters of workstations for multiple users – less demand for parallel computing

10 Amdahl's Law and Parallel Computers
Amdahl's Law (f = fraction of the original program that is sequential):
Speedup = 1 / [f + (1 - f)/n] = n / [1 + (n - 1)f], where n = number of processors
A sequential portion f limits parallel speedup: Speedup <= 1/f
Example: what sequential fraction allows an 80X speedup from 100 processors? Assume each part runs on either 1 processor or all 100 fully used.
80 = 1 / [f + (1 - f)/100] => f = 0.0025
Only 0.25% sequential! The program must be highly parallel.
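To make the arithmetic concrete, here is a minimal C sketch (not part of the original slides) that evaluates Amdahl's Law and solves for the sequential fraction in the 80X example:

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / (f + (1 - f)/n), where f is the
     * sequential fraction and n the number of processors.
     * Illustrative sketch only; numbers match the example above. */
    static double speedup(double f, int n)
    {
        return 1.0 / (f + (1.0 - f) / n);
    }

    int main(void)
    {
        int n = 100;
        double target = 80.0;
        /* Solve target = 1 / (f + (1 - f)/n) for f */
        double f = (1.0 / target - 1.0 / n) / (1.0 - 1.0 / n);
        printf("required sequential fraction f = %.4f\n", f);          /* ~0.0025 */
        printf("speedup with that f on %d CPUs = %.1f\n", n, speedup(f, n));
        return 0;
    }

Running it prints f of roughly 0.0025 and a speedup of about 80, matching the slide.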


12 Popular Flynn Categories
– SISD (Single Instruction, Single Data): uniprocessors
– MISD (Multiple Instruction, Single Data): ???; multiple processors on a single data stream
– SIMD (Single Instruction, Multiple Data): examples: Illiac-IV, CM-2. Simple programming model, low overhead, flexibility, all custom integrated circuits. (The phrase was reused by Intel marketing for media instructions, ~vector.)
– MIMD (Multiple Instruction, Multiple Data): examples: Sun Enterprise 5000, Cray T3D, SGI Origin. Flexible, uses off-the-shelf micros.
MIMD is the current winner: the major design emphasis is on <= 128-processor MIMD machines.

13 Classification of Parallel Processors
SIMD – ex: Illiac IV and MasPar
MIMD – true multiprocessors:
1. Message Passing Multiprocessor – interprocessor communication through explicit message passing via "send" and "receive" operations. Ex: IBM SP2, Cray XD1, and clusters
2. Shared Memory Multiprocessor – all processors share the same address space; interprocessor communication through load/store operations to a shared memory. Ex: SMP servers, SGI Origin, HP V-Class, Cray T3E
What are their advantages and disadvantages?
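As a concrete illustration of the first style, here is a minimal message-passing sketch in C using MPI (my sketch, not from the slides; it assumes an MPI implementation such as MPICH or Open MPI is installed): rank 0 communicates with rank 1 only through explicit send and receive calls.

    #include <mpi.h>
    #include <stdio.h>

    /* Rank 0 sends an integer, rank 1 receives it.
     * Build and run, e.g.: mpicc send_recv.c && mpirun -np 2 ./a.out */
    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }

In the shared-memory style, the same exchange would just be a store by one processor and a load by another (see the threads sketch later in the deck).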

14 More Message-Passing Computers
– Cluster: computers connected over a high-bandwidth local area network (Ethernet or Myrinet), used as a parallel computer
– Network of Workstations (NOW): a homogeneous cluster – same type of computers
– Grid: computers connected over a wide area network

15 Another Classification for MIMD Computers
– Centralized Memory: shared memory located at a centralized location – may consist of several interleaved modules – same distance from any processor – Symmetric Multiprocessor (SMP) – Uniform Memory Access (UMA)
– Distributed Memory: memory is distributed to each processor – improves scalability
(a) Message-passing architectures – no processor can directly access another processor's memory
(b) Hardware Distributed Shared Memory (DSM) multiprocessor – memory is distributed, but the address space is shared – Non-Uniform Memory Access (NUMA)
(c) Software DSM – a layer of OS-level software built on top of a message-passing multiprocessor gives the programmer a shared-memory view


17 Data Parallel Model
– Operations can be performed in parallel on each element of a large, regular data structure, such as an array
– One Control Processor (CP) broadcasts to many PEs: the CP reads an instruction from the control memory, decodes it, and broadcasts control signals to all PEs
– A condition flag per PE lets individual PEs skip an operation
– Data are distributed across the PE memories
– Early-1980s VLSI led to a SIMD rebirth: 32 one-bit PEs plus memory fit on a single chip
– Data-parallel programming languages lay out data across the processors
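A rough sequential sketch in C (illustrative only, not from the slides) of the lockstep SIMD idea: think of each array index as one PE, the loop body as the single instruction broadcast by the control processor, and the active[] array as the per-PE condition flag.

    #include <stdio.h>

    #define N 8   /* each index stands in for one processing element (PE) */

    int main(void)
    {
        double a[N]      = {1, 2, 3, 4, 5, 6, 7, 8};
        double b[N]      = {8, 7, 6, 5, 4, 3, 2, 1};
        int    active[N] = {1, 1, 0, 1, 1, 0, 1, 1};   /* per-PE condition flag */

        /* The control processor "broadcasts" one instruction: a[i] += b[i].
         * PEs whose flag is off sit this step out (masked execution). */
        for (int i = 0; i < N; i++)
            if (active[i])
                a[i] += b[i];

        for (int i = 0; i < N; i++)
            printf("%.0f ", a[i]);
        printf("\n");
        return 0;
    }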

18 Data Parallel Model (continued)
– Vector processors have similar ISAs, but no data placement restriction
– SIMD led to data-parallel programming languages
– Advancing VLSI led to single-chip FPUs and whole fast microprocessors, making SIMD less attractive
– The SIMD programming model led to the Single Program Multiple Data (SPMD) model: all processors execute an identical program
– Data-parallel programming languages are still useful; they do communication all at once: "bulk synchronous" phases in which all processors communicate after a global barrier
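A hedged SPMD sketch in C with MPI (my illustration; the partial-result computation is arbitrary): every rank runs the identical program, computes locally, and then all ranks communicate in one bulk-synchronous step after a global barrier.

    #include <mpi.h>
    #include <stdio.h>

    /* Every process executes this identical program (SPMD).
     * Phase 1: purely local computation.
     * Phase 2: after a global barrier, everyone communicates at once. */
    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Local compute phase: each rank produces a partial result. */
        long local = (long)rank * rank;

        /* Bulk-synchronous communication phase. */
        MPI_Barrier(MPI_COMM_WORLD);
        long total;
        MPI_Allreduce(&local, &total, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of rank^2 over %d ranks = %ld\n", nprocs, total);

        MPI_Finalize();
        return 0;
    }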

19 SIMD Programming – High-Performance Fortran (HPF)
– Single Program Multiple Data (SPMD)
– FORALL construct, similar to a fork:
FORALL (I = 1:N)
A(I) = B(I) + C(I)
END FORALL
– Data mapping in HPF: 1. to reduce interprocessor communication, 2. to balance the load among processors
– http://www.npac.syr.edu/hpfa/
– http://www.crpc.rice.edu/HPFF/
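For comparison, a rough C/OpenMP counterpart of the FORALL above (my sketch, assuming a compiler with OpenMP support, e.g. gcc -fopenmp; the array names mirror the HPF example):

    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        static double A[N], B[N], C[N];

        for (int i = 0; i < N; i++) { B[i] = i; C[i] = 2 * i; }

        /* Counterpart of FORALL (I = 1:N) A(I) = B(I) + C(I):
         * every iteration is independent, so they may run in parallel. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            A[i] = B[i] + C[i];

        printf("A[N-1] = %.0f\n", A[N - 1]);
        return 0;
    }

Unlike HPF, this says nothing about data mapping; OpenMP leaves data placement to the shared-memory hardware.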

20 Major MIMD Styles
1. Centralized shared memory ("Uniform Memory Access" time, or "Shared Memory Processor")
2. Decentralized memory (a memory module with each CPU)
– Advantages: scalability, more memory bandwidth, lower local memory latency
– Drawbacks: longer remote communication latency, more complex software model
– Two types: shared memory and message passing

21 Symmetric Multiprocessor (SMP)
– Memory: centralized, with uniform access time ("UMA") and bus interconnect
– Examples: Sun Enterprise 5000, SGI Challenge, Intel SystemPro

22 Decentralized Memory Versions
1. Shared memory with "Non-Uniform Memory Access" time (NUMA)
2. Message-passing "multicomputer" with a separate address space per processor
– Can invoke software with Remote Procedure Call (RPC)
– Often via a library, such as MPI (Message Passing Interface)
– Also called "synchronous communication", since the communication causes synchronization between the two processes

23 Distributed Directory MPs

24 Communication Models
Shared memory:
– Processors communicate through a shared address space
– Easy on small-scale machines
– Advantages: model of choice for uniprocessors and small-scale MPs; ease of programming; lower latency; easier to use hardware-controlled caching
Message passing:
– Processors have private memories and communicate via messages
– Advantages: less hardware, easier to design; good scalability; focuses attention on costly non-local operations
Virtual Shared Memory (VSM)

25 Shared Address/Memory Multiprocessor Model
– Communicate via load and store – the oldest and most popular model
– Based on timesharing: processes on multiple processors vs. sharing a single processor
– Process: a virtual address space and ~1 thread of control
– Multiple processes can overlap (share), but ALL threads within a process share that process's address space
– Writes to the shared address space by one thread are visible to reads by other threads
– Usual model: shared code, private stack, some shared heap, some private heap
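A minimal shared-memory sketch in C with POSIX threads (my illustration, not from the slides): two threads in one process communicate through an ordinary variable using plain loads and stores, since they share a single address space while keeping private stacks.

    #include <pthread.h>
    #include <stdio.h>

    /* 'data' lives in the shared address space; each thread also gets its
     * own private stack. The join provides the ordering that makes the
     * producer's store visible to the later load in main. */
    static int data;

    static void *producer(void *arg)
    {
        (void)arg;
        data = 123;    /* communicate with an ordinary store ... */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        pthread_join(t, NULL);
        printf("main thread loads data = %d\n", data);   /* ... read with a load */
        return 0;
    }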


