
1 15-740/18-740 Computer Architecture
Lecture 3: SIMD, MIMD, and ISA Principles
Prof. Onur Mutlu, Carnegie Mellon University
Fall 2011, 9/14/2011

2 Review of Last Lecture
- Vector processors
  - Advantages and disadvantages
  - Amortization of control overhead over multiple data elements
  - Vector vs. scalar code execution time
- SIMD vs. MIMD
- Concept of on-chip networks

3 Today
- Wrap up intro to on-chip networks
- ISA-level tradeoffs

4 Review: SIMD vs. MIMD
- SIMD: Concurrency arises from performing the same operation on different pieces of data (see the loop sketch below)
- MIMD: Concurrency arises from performing different operations on different pieces of data
  - Control/thread parallelism: execute different threads of control in parallel → multithreading, multiprocessing
  - Idea: Use multiple processors to solve a problem
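To make the SIMD case concrete, here is a minimal C sketch (a hypothetical kernel, not from the lecture). Every iteration performs the same operation on different data, so a SIMD/vector machine can execute many iterations under a single instruction, amortizing fetch/decode control overhead; an MIMD machine would instead split the iteration space across threads.

    #include <stddef.h>

    /* Same operation, different data: ideal for SIMD. A vector ISA of
     * width W retires W of these adds per instruction; a scalar (SISD)
     * machine pays fetch/decode overhead for every single add. */
    void vec_add(float *c, const float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }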

5 Review: Flynn's Taxonomy of Computers
- Mike Flynn, "Very High-Speed Computing Systems," Proc. of IEEE, 1966
- SISD: Single instruction operates on a single data element
- SIMD: Single instruction operates on multiple data elements
  - Array processor
  - Vector processor
- MISD: Multiple instructions operate on a single data element
  - Closest form: systolic array processor, streaming processor
- MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
  - Multiprocessor
  - Multithreaded processor

6 Review: On-Chip Network Based Multi-Core Systems
- A scalable multi-core is a distributed system on a chip
- [Figure: a grid of routers (R), each attached to a processing element (PE): cores, L2 banks, memory controllers, accelerators, etc. Router detail: input ports with virtual-channel buffers (VC 0, VC 1, VC 2), control logic with a routing unit (RC), VC allocator (VA), and switch allocator (SA), and a crossbar connecting ports to/from East, West, North, South, and the local PE.]

7 Review: Idea of On-Chip Networks
- Problem: Connecting many cores with a single bus is not scalable
  - Single point of connection limits communication bandwidth
    - What if multiple core pairs want to communicate with each other at the same time?
  - Electrical loading on the single bus limits bus frequency
- Idea: Use a network to connect cores
  - Connect neighboring cores via short links
  - Communicate between cores by routing packets over the network
- Dally and Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," DAC 2001.

8 Review: Bus
+ Simple
+ Cost effective for a small number of nodes
+ Easy to implement coherence (snooping)
- Not scalable to a large number of nodes (limited bandwidth, electrical loading → reduced frequency)
- High contention

9 Review: Crossbar
- Every node connected to every other
- Good for small number of nodes
+ Least contention in the network: high bandwidth
- Expensive
- Not scalable due to quadratic cost
- Used in core-to-cache-bank networks in IBM POWER5 and Sun Niagara I/II
- [Figure: crossbar connecting nodes 0-7]

10 Review: Mesh
- O(N) cost
- Average latency: O(sqrt(N)) (see the hop-count sketch below)
- Easy to lay out on-chip: regular and equal-length links
- Path diversity: many ways to get from one node to another
- Used in the 100-core Tilera processor and many on-chip network prototypes
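To see where the O(sqrt(N)) average latency comes from, here is a minimal sketch (assuming dimension-order/XY routing on a k × k mesh; an illustration, not code from the lecture): the hop count between two nodes is their Manhattan distance, and the average distance over random node pairs grows linearly in k = sqrt(N).

    #include <stdlib.h>

    /* Hop count under XY (dimension-order) routing on a k x k mesh:
     * route fully in X, then in Y. The average over random pairs is
     * ~2k/3 hops, i.e., O(sqrt(N)) for N = k*k nodes. */
    int mesh_hops(int src, int dst, int k) {
        int sx = src % k, sy = src / k;
        int dx = dst % k, dy = dst / k;
        return abs(sx - dx) + abs(sy - dy);
    }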

11 Review: Torus
- Mesh is not symmetric on edges: performance is very sensitive to placement of a task on an edge vs. in the middle
- Torus avoids this problem
+ Higher path diversity than mesh
- Higher cost
- Harder to lay out on-chip
- Unequal link lengths

12 Review: Torus, continued
- Weave nodes to make inter-node latencies ~constant

13 Review: Example NoC: 100-core Tilera Processor
- Wentzlaff et al., "On-Chip Interconnection Architecture of the Tile Processor," IEEE Micro 2007.

14 The Need for QoS in the On-Chip Network
- One can create malicious applications that continuously access the same resource → deny service to less aggressive applications

15 The Need for QoS in the On-Chip Network
- Need to provide packet scheduling mechanisms that ensure applications' service requirements (bandwidth/latency) are satisfied
- Grot et al., "Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip," MICRO 2009.

16 On-Chip Networks: Some Questions
- Is mesh/torus the best topology?
- How do you design the router? Energy efficient, low latency
- What is the routing algorithm? Is it adaptive or deterministic?
- How does the router prioritize between different threads'/applications' packets?
- How does the OS/application communicate the importance of applications to the routers?
- How does the router provide bandwidth/latency guarantees to applications that need them?
- Where do you place different resources? (e.g., memory controllers)
- How do you maintain cache coherence?
- How does the OS scheduler place tasks?
- How is data placed in distributed caches?

17 What is Computer Architecture?
- The science and art of designing, selecting, and interconnecting hardware components and designing the hardware/software interface to create a computing system that meets functional, performance, energy consumption, cost, and other specific goals.
- We will soon distinguish between the terms architecture, microarchitecture, and implementation.

18 Why Study Computer Architecture?

19 Moore's Law
- Moore, "Cramming more components onto integrated circuits," Electronics Magazine, 1965.

20 Why Study Computer Architecture?
- Make computers faster, cheaper, smaller, more reliable
  - By exploiting advances and changes in underlying technology/circuits
- Enable new applications
  - Life-like 3D visualization 20 years ago? Virtual reality? Personal genomics?
- Adapt the computing stack to technology trends
  - Innovation in software is built on trends and changes in computer architecture
  - > 50% performance improvement per year
- Understand why computers work the way they do

21 An Example: Multi-Core Systems
- [Die photo: AMD Barcelona. One chip with four cores (CORE 0-3), private L2 caches (L2 CACHE 0-3), a shared L3 cache, a DRAM interface, and a DRAM memory controller connecting to the DRAM banks.]

22 Unexpected Slowdowns in Multi-Core
- What kind of performance do we expect when we run two applications on a multi-core system?
- Experiment: run two applications together on different cores of a dual-core system (matlab on Core 0, gcc on Core 1) and measure each one's slowdown compared to running alone on the same system
- Result: a large disparity in slowdowns. Matlab, the "memory performance hog," barely slows down, while gcc slows down dramatically
- It is not the priorities: giving high priority to gcc and low priority to matlab does not change the slowdowns at all; neither the software nor the hardware enforced the priorities
- It is not disk contention: both applications fit in physical memory and have no disk accesses in steady state
- It is not other applications or the OS interfering with gcc and stealing its time quanta
- Moscibroda and Mutlu, "Memory performance attacks: Denial of memory service in multi-core systems," USENIX Security 2007.

23 Why the Disparity in Slowdowns?
- [Figure: a multi-core chip with CORE 1 (matlab) and CORE 2 (gcc), each with an L2 cache, connected through the interconnect to the shared DRAM memory system: the DRAM memory controller and DRAM banks 0-3. The controller's unfairness is highlighted.]
- Almost all systems today contain multi-core chips: multiple on-chip cores and caches that share the DRAM memory system
- The shared DRAM memory system consists of:
  - DRAM banks that store data (multiple banks to allow parallel accesses)
  - A DRAM memory controller that mediates between the cores and DRAM, scheduling the memory operations the cores generate
- When matlab runs on one core and gcc on another, both generate memory requests to the DRAM banks; the controller favors matlab's requests over gcc's
- Matlab therefore makes quick progress (as if it were running alone) and keeps generating requests that are again favored, while gcc starves waiting for its requests to be serviced
- Why does this happen? The algorithms employed by the DRAM controller are unfair. To understand why they unfairly prioritize matlab's accesses, we need to understand how a DRAM bank operates
- This talk is about exploiting these unfair algorithms in memory controllers to perform denial of service against running threads

24 DRAM Bank Operation
- [Figure/animation: a DRAM bank as rows × columns of cells, with a row decoder, a row buffer, and a column mux feeding data out. Access sequence starting from (Row 0, Column 0) with an empty row buffer: Row 0 must first be loaded into the row buffer; subsequent accesses to other columns of Row 0 (e.g., column 1, column 85) HIT in the row buffer; an access to Row 1 causes a CONFLICT! The open row must be closed and Row 1 loaded before the access can be served.]

25 DRAM Controllers
- A row-conflict memory access takes significantly longer than a row-hit access
- Current controllers take advantage of the row buffer
- Commonly used scheduling policy (FR-FCFS) [Rixner 2000]* (sketched in code below):
  (1) Row-hit first: service row-hit memory accesses first
  (2) Oldest-first: then service older accesses first
- This scheduling policy aims to maximize DRAM throughput
*Rixner et al., "Memory Access Scheduling," ISCA 2000.
*Zuravleff and Robinson, "Controller for a synchronous DRAM ...," US Patent 5,630,096, May 1997.
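A minimal sketch of the two FR-FCFS rules (the data structures are invented here for illustration; real controllers are per-bank and pipelined):

    /* Pick the next request for a bank under FR-FCFS:
     * (1) prefer requests that hit the currently open row,
     * (2) among equals, prefer the oldest.
     * Returns the index of the chosen request, or -1 if the queue is empty. */
    typedef struct {
        int  row;      /* DRAM row this request targets */
        long arrival;  /* smaller = older */
    } Request;

    int frfcfs_pick(const Request *q, int n, int open_row) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (best < 0) { best = i; continue; }
            int hit_i = (q[i].row == open_row);
            int hit_b = (q[best].row == open_row);
            if (hit_i != hit_b ? hit_i                        /* rule 1: row-hit first */
                               : q[i].arrival < q[best].arrival) /* rule 2: oldest first */
                best = i;
        }
        return best;
    }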

26 The Problem
- Multiple threads share the DRAM controller
- DRAM controllers are designed to maximize DRAM throughput
- DRAM scheduling policies are thread-unfair
  - Row-hit first: unfairly prioritizes threads with high row buffer locality (threads that keep accessing the same row)
  - Oldest-first: unfairly prioritizes memory-intensive threads
- The DRAM controller is therefore vulnerable to denial of service attacks
  - One can write programs to exploit this unfairness (see the sketch below)
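A minimal sketch of such a program, in the spirit of Moscibroda and Mutlu (USENIX Security 2007) but written here for illustration: streaming sequentially through a large array produces almost nothing but row-buffer hits, so a row-hit-first (FR-FCFS) controller keeps prioritizing this thread's requests over other threads' row-conflict requests.

    #include <stdlib.h>

    /* Illustrative "memory performance hog": sequential streaming over a
     * buffer far larger than the caches. Consecutive cache-line accesses
     * fall in the same DRAM row, so nearly every request is a row hit,
     * and a row-hit-first controller keeps servicing this thread first. */
    int main(void) {
        size_t n = (size_t)1 << 28;             /* 256 MB: misses in all caches */
        char *a = malloc(n);
        if (!a) return 1;
        for (;;)                                /* hog the memory system forever */
            for (size_t i = 0; i < n; i += 64)  /* one store per 64-byte line */
                a[i] = (char)i;
    }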

27 Fundamental Concepts

28 What is Computer Architecture?
- The science and art of designing, selecting, and interconnecting hardware components and designing the hardware/software interface to create a computing system that meets functional, performance, energy consumption, cost, and other specific goals.
- Traditional definition: "The term architecture is used here to describe the attributes of a system as seen by the programmer, i.e., the conceptual structure and functional behavior as distinct from the organization of the data flow and controls, the logic design, and the physical implementation." Gene Amdahl, IBM Journal of R&D, April 1964

29 Levels of Transformation
- User
- Problem
- Algorithm
- Program
- Runtime System (VM, OS, MM)
- ISA
- Microarchitecture
- Circuits/Technology
- Electrons
- The ISA is the interface between hardware and software: a contract that the hardware promises to satisfy.

30 Levels of Transformation
- ISA
  - Agreed-upon interface between software and hardware (SW/compiler assumes, HW promises)
  - What the software writer needs to know to write system/user programs
- Microarchitecture
  - Specific implementation of an ISA
  - Not visible to the software
- Microprocessor
  - ISA, uarch, circuits
  - "Architecture" = ISA + microarchitecture
- The ISA is the builder/user interface: a contract that the hardware promises to satisfy.

31 ISA vs. Microarchitecture
- What is part of the ISA vs. the uarch?
  - Gas pedal: interface for "acceleration"
  - Internals of the engine: implement "acceleration"
  - Add instruction vs. adder implementation
- The implementation (uarch) can vary as long as it satisfies the specification (ISA)
  - Bit-serial, ripple-carry, carry-lookahead adders
  - The x86 ISA has many implementations: 286, 386, 486, Pentium, Pentium Pro, ...
- The uarch usually changes faster than the ISA
  - Few ISAs (x86, SPARC, MIPS, Alpha) but many uarchs
  - Why?

32 ISA
- Instructions
  - Opcodes, addressing modes, data types
  - Instruction types and formats
  - Registers, condition codes
- Memory
  - Address space, addressability, alignment
  - Virtual memory management
- Call, interrupt/exception handling
- Access control, priority/privilege
- I/O
- Task management
- Power and thermal management
- Multi-threading support, multiprocessor support

33 Microarchitecture
- Implementation of the ISA under specific design constraints and goals
- Anything done in hardware without exposure to software:
  - Pipelining
  - In-order versus out-of-order instruction execution
  - Memory access scheduling policy
  - Speculative execution
  - Superscalar processing (multiple instruction issue?)
  - Clock gating
  - Caching? Levels, size, associativity, replacement policy
  - Prefetching?
  - Voltage/frequency scaling?
  - Error correction?

34 Design Point
- A set of design considerations and their importance → leads to tradeoffs in both ISA and uarch
- Considerations:
  - Cost
  - Performance
  - Maximum power consumption
  - Energy consumption (battery life)
  - Availability
  - Reliability and correctness (or is it?)
  - Time to market
- The design point is determined by the "Problem" space (application space)

35 Tradeoffs: Soul of Computer Architecture
- ISA-level tradeoffs
- Uarch-level tradeoffs
- System and task-level tradeoffs
  - How to divide the labor between hardware and software

36 ISA-level Tradeoffs: Semantic Gap
- Where to place the ISA? Semantic gap
  - Closer to high-level language (HLL) or closer to hardware control signals? → Complex vs. simple instructions
  - RISC vs. CISC vs. HLL machines
    - FFT, QUICKSORT, POLY, FP instructions?
    - VAX INDEX instruction (array access with bounds checking)
- Tradeoffs:
  - Simple compiler, complex hardware vs. complex compiler, simple hardware
    - Caveat: translation (indirection) can change the tradeoff!
  - Burden of backward compatibility
  - Performance?
    - Optimization opportunity: example of the VAX INDEX instruction: who (compiler vs. hardware) puts more effort into optimization?
  - Instruction size, code size

37 x86: Small Semantic Gap: String Operations
- REP MOVS DEST SRC
- How many instructions does this take in Alpha? (see the sketch below)
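As a rough answer sketch (an illustration, not code from the lecture): REP MOVSB copies a whole block of ECX/RCX bytes as a single x86 instruction, while a load/store ISA like Alpha must spell the copy out as a loop, several instructions per element.

    #include <stddef.h>

    /* What one x86 "REP MOVSB" expresses: copy n bytes from src to dest.
     * On a load/store ISA (e.g., Alpha) this becomes an explicit loop:
     * roughly a load, a store, two pointer increments, a counter
     * decrement, and a conditional branch per iteration. */
    void rep_movsb(unsigned char *dest, const unsigned char *src, size_t n) {
        while (n--)
            *dest++ = *src++;
    }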

38 Small Semantic Gap Examples in VAX
- FIND FIRST: find the first set bit in a bit field (helps OS resource allocation operations)
- SAVE CONTEXT, LOAD CONTEXT: special context-switching instructions
- INSQUEUE, REMQUEUE: operations on a doubly linked list
- INDEX: array access with bounds checking
- STRING operations: compare strings, find substrings, ...
- Cyclic redundancy check instruction
- EDITPC: implements editing functions to display fixed-format output
- Digital Equipment Corp., "VAX Architecture Handbook"

39 Small versus Large Semantic Gap
- CISC vs. RISC
  - Complex instruction set computer → complex instructions
    - Initially motivated by "not good enough" code generation
  - Reduced instruction set computer → simple instructions
    - John Cocke, mid 1970s, IBM 801
    - Goal: enable better compiler control and optimization
- RISC motivated by:
  - Memory stalls (no work done in a complex instruction when there is a memory stall?)
    - When is this correct?
  - Simplifying the hardware → lower cost, higher frequency
  - Enabling the compiler to optimize the code better
    - Find fine-grained parallelism to reduce stalls

40 Small versus Large Semantic Gap
- John Cocke's RISC (large semantic gap) concept: the compiler generates control signals; open microcode
- Advantages of a small semantic gap (complex instructions):
  + Denser encoding → smaller code size → saves off-chip bandwidth, better cache hit rate (better packing of instructions)
  + Simpler compiler
- Disadvantages:
  - Larger chunks of work → the compiler has less opportunity to optimize
  - More complex hardware → translation to control signals and optimization need to be done by hardware
- Read: Colwell et al., "Instruction Sets and Beyond: Computers, Complexity, and Controversy," IEEE Computer 1985.

41 ISA-level Tradeoffs: Instruction Length
- Fixed length: all instructions are the same length
  + Easier to decode a single instruction in hardware
  + Easier to decode multiple instructions concurrently
  -- Wasted bits in instructions (Why is this bad?)
  -- Harder to extend the ISA (how to add new instructions?)
- Variable length: instruction lengths differ (determined by opcode and sub-opcode)
  + Compact encoding (Why is this good?)
    - Intel 432: Huffman encoding (sort of); 6- to 321-bit instructions. How?
  -- More logic to decode a single instruction
  -- Harder to decode multiple instructions concurrently (see the decode sketch below)
- Tradeoffs:
  - Code size (memory space, bandwidth, latency) vs. hardware complexity
  - ISA extensibility and expressiveness
  - Performance? Smaller code vs. imperfect decode
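A minimal sketch of why variable length complicates decode (a made-up two-length encoding, not any real ISA): the length of instruction i must be determined before instruction i+1 can even be located, so decoding several instructions per cycle requires speculative or brute-force length computation.

    #include <stdint.h>
    #include <stddef.h>

    /* Made-up encoding: the top bit of the first byte selects a 2-byte
     * or 4-byte instruction. Walking the stream is inherently serial:
     * instruction i+1's start depends on instruction i's length.
     * With fixed 4-byte instructions, pc, pc+4, pc+8, ... are known
     * up front and can be decoded in parallel. */
    size_t count_instructions(const uint8_t *code, size_t len) {
        size_t pc = 0, count = 0;
        while (pc < len) {
            pc += (code[pc] & 0x80) ? 4 : 2;  /* length depends on the opcode */
            count++;
        }
        return count;
    }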

42 ISA-level Tradeoffs: Uniform Decode
- Uniform decode: the same bits in each instruction correspond to the same meaning
  - The opcode is always in the same location; ditto operand specifiers, immediate values, ...
  - Many "RISC" ISAs: Alpha, MIPS, SPARC
  + Easier decode, simpler hardware
  + Enables parallelism: generate the target address before knowing the instruction is a branch (see the extraction sketch below)
  -- Restricts the instruction format (fewer instructions?) or wastes space
- Non-uniform decode
  - E.g., the opcode can be in the 1st through 7th byte in x86
  + More compact and powerful instruction format
  -- More complex decode logic (e.g., more logic to speculatively generate the branch target)
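A minimal sketch of what uniform decode buys (MIPS-style 32-bit field positions assumed for illustration): because every field sits at a fixed bit position, hardware can extract all of them, and start reading the register file or computing a branch target, before it knows what the opcode is.

    #include <stdint.h>

    /* MIPS-style 32-bit instruction: fields are at fixed positions,
     * so all can be extracted in parallel, independent of the opcode. */
    typedef struct { uint32_t op, rs, rt, rd, imm16; } Fields;

    Fields decode(uint32_t insn) {
        Fields f;
        f.op    = insn >> 26;          /* bits 31..26: opcode */
        f.rs    = (insn >> 21) & 0x1F; /* bits 25..21: source register */
        f.rt    = (insn >> 16) & 0x1F; /* bits 20..16: source/dest register */
        f.rd    = (insn >> 11) & 0x1F; /* bits 15..11: dest register (R-type) */
        f.imm16 = insn & 0xFFFF;       /* bits 15..0: immediate (I-type) */
        return f;  /* register-file reads can begin before op is interpreted */
    }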

43 x86 vs. Alpha Instruction Formats

44 ISA-level Tradeoffs: Number of Registers
- Affects:
  - Number of bits used for encoding register addresses
  - Number of values kept in fast storage (register file)
  - (uarch) Size, access time, power consumption of the register file
- Large number of registers:
  + Enables better register allocation (and optimizations) by the compiler → fewer saves/restores
  -- Larger instruction size
  -- Larger register file size
  -- (Superscalar processors) More complex dependency-check logic

45 ISA-level Tradeoffs: Addressing Modes
- An addressing mode specifies how to obtain an operand of an instruction
  - Register
  - Immediate
  - Memory (displacement, register indirect, indexed, absolute, memory indirect, autoincrement, autodecrement, ...)
- More modes:
  + Help better support programming constructs (arrays, pointer-based accesses)
  -- Make it harder for the architect to design
  -- Too many choices for the compiler?
    - Many ways to do the same thing complicates compiler design
    - Read: Wulf, "Compilers and Computer Architecture"

46 x86 vs. Alpha Instruction Formats

47 x86
- [Figure: x86 instruction examples using register indirect, absolute, and register + displacement addressing modes]

48 x86
- [Figure: x86 instruction examples using indexed (base + index) and scaled (base + index*4) addressing modes; a concrete example follows below]
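To make these modes concrete, a small illustration (the assembly in the comments is plausible compiler output, shown for explanation; AT&T syntax for x86): an array access maps directly onto x86's scaled-index mode, while an ISA with only register-indirect addressing needs explicit address arithmetic first.

    #include <stdint.h>

    /* a[i] with 4-byte elements fits x86's scaled-index mode in one load:
     *     movl (%rdi,%rsi,4), %eax    # eax = *(a + i*4): base + index*4
     * A register-indirect-only ISA must compute the address first
     * (generic RISC-like pseudo-assembly):
     *     sll t0, i, 2                # t0 = i * 4
     *     add t0, t0, a               # t0 = &a[i]
     *     ldw r1, 0(t0)               # load */
    int32_t load_elem(const int32_t *a, long i) {
        return a[i];
    }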

49 Other ISA-level Tradeoffs
- Load/store vs. memory/memory architectures
- Condition codes vs. condition registers vs. compare&test
- Hardware interlocks vs. software-guaranteed interlocking
- VLIW vs. single instruction vs. SIMD
- 0-, 1-, 2-, or 3-address machines (stack, accumulator, 2- or 3-operand)
- Precise vs. imprecise exceptions
- Virtual memory vs. not
- Aligned vs. unaligned access
- Supported data types
- Software- vs. hardware-managed page fault handling
- Granularity of atomicity
- Cache coherence (hardware vs. software)

50 Programmer vs. (Micro)architect
- Many ISA features are designed to aid programmers, but they complicate the hardware designer's job
- Virtual memory vs. overlay programming
  - Should the programmer be concerned about the size of code blocks?
- Unaligned memory access
  - The compiler/programmer needs to align data
- Transactional memory?
- VLIW vs. SIMD? Superscalar execution vs. SIMD?

51 Transactional Memory
- THREAD 1 and THREAD 2 both execute:

      enqueue (Q, v) {
          Node_t *node = malloc(…);
          node->val  = v;
          node->next = NULL;
          acquire(lock);
          if (Q->tail) Q->tail->next = node;
          else Q->head = node;
          Q->tail = node;
          release(lock);
      }

- A subtly different version that releases the lock before updating Q->tail is broken: another thread can acquire the lock and enqueue while the tail pointer is still stale. Getting lock placement right is exactly what makes lock-based code hard.
- With transactions, the locks disappear:

      begin-transaction
      enqueue (Q, v);   // no locks
      end-transaction

52 Transactional Memory
- A transaction is executed atomically: ALL or NONE
- If there is a data conflict between two transactions, only one of them completes; the other is rolled back
  - Both write to the same location, or
  - One reads from a location another writes

53 ISA-level Tradeoff: Supporting TM
- Still under research
- Pros:
  - Could make programming with threads easier
  - Could improve parallel program performance vs. locks. Why?
- Cons:
  - What if it does not pan out? All future microarchitectures might have to support the new instructions (for backward compatibility reasons)
  - Complexity?
- How does the architect decide whether or not to support TM in the ISA? (How to evaluate the whole stack?)

54 ISA-level Tradeoffs: Instruction Pointer
- Do we need an instruction pointer in the ISA?
  - Yes: control-driven, sequential execution
    - An instruction is executed when the IP points to it
    - The IP changes sequentially (except for control-flow instructions)
    - (a minimal interpreter sketch follows below)
  - No: data-driven, parallel execution
    - An instruction is executed when all its operand values are available (data flow)
- Tradeoffs: MANY high-level ones
  - Ease of programming (for average programmers)?
  - Ease of compilation?
  - Performance: extraction of parallelism?
  - Hardware complexity?
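A minimal sketch of the control-driven model (a toy accumulator machine invented here for illustration, not an ISA from the lecture): what executes next is defined entirely by the instruction pointer, which advances sequentially unless a control-flow instruction redirects it.

    #include <stdint.h>

    /* Toy control-driven machine: the IP alone decides what runs next. */
    enum { HALT, LOADI, ADDM, STORE, JNZ };   /* invented opcodes */
    typedef struct { uint8_t op, arg; } Insn;

    void run(const Insn *prog, uint8_t *mem) {
        uint8_t acc = 0;
        int ip = 0;                        /* the instruction pointer */
        for (;;) {
            Insn i = prog[ip++];           /* IP advances sequentially... */
            switch (i.op) {
            case LOADI: acc  = i.arg;          break;
            case ADDM:  acc += mem[i.arg];     break;
            case STORE: mem[i.arg] = acc;      break;
            case JNZ:   if (acc) ip = i.arg;   break;  /* ...except control flow */
            case HALT:  return;
            }
        }
    }

In a data-flow ISA, by contrast, there is no such pointer: an instruction fires whenever its operand values become available.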

55 The Von-Neumann Model
- [Figure: MEMORY (with Memory Address Register and Memory Data Register), PROCESSING UNIT (ALU, TEMP), CONTROL UNIT (IP, Instruction Register), INPUT, OUTPUT]

56 The Von-Neumann Model
- Stored-program computer (instructions in memory)
- One instruction at a time
- Sequential execution
- Unified memory
  - The interpretation of a stored value depends on the control signals
- All major ISAs today use this model
- Underneath (at the uarch level), the execution model is very different:
  - Multiple instructions at a time
  - Out-of-order execution
  - Separate instruction and data caches

57 Fundamentals of Uarch Performance Tradeoffs
- Instruction supply (ideally): zero-cycle latency (no cache miss), no branch mispredicts, no fetch breaks
- Data path / functional units (ideally): perfect data flow (reg/memory dependencies), zero-cycle interconnect (operand communication), enough functional units, zero-latency compute?
- Data supply (ideally): zero-cycle latency, infinite capacity, zero cost
- We will examine all of these throughout the course (especially data supply)

58 How to Evaluate Performance Tradeoffs
- Execution time = time / program
  = (# instructions / program) × (# cycles / instruction) × (time / cycle)
- # instructions/program: affected by the algorithm, program, compiler, and ISA
- # cycles/instruction: affected by the ISA and microarchitecture
- time/cycle: affected by the microarchitecture, logic design, circuit implementation, and technology
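A quick worked instance of this equation (made-up numbers, not from the lecture): a program executing 2 × 10^9 instructions at an average of 1.5 cycles/instruction on a 2 GHz clock (0.5 ns/cycle) runs in 2 × 10^9 × 1.5 × 0.5 ns = 1.5 seconds. Halving the CPI or doubling the clock frequency would each cut this to 0.75 seconds, which is why the next slides attack each factor separately.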

59 Improving Performance
- Reducing instructions/program
- Reducing cycles/instruction (CPI)
- Reducing time/cycle (clock period)

60 Improving Performance (Reducing Exec Time)
- Reducing instructions/program
  - More efficient algorithms and programs
  - Better ISA?
- Reducing cycles/instruction (CPI)
  - Better microarchitecture design
    - Execute multiple instructions at the same time
    - Reduce latency of instructions (1-cycle vs. 100-cycle memory access)
- Reducing time/cycle (clock period)
  - Technology scaling
  - Pipelining

61 Improving Performance: Semantic Gap
- Reducing instructions/program
  - Complex instructions: small code size (+)
  - Simple instructions: large code size (--)
- Reducing cycles/instruction (CPI)
  - Complex instructions: (can) take more cycles to execute (--)
    - REP MOVS
    - How about ADD with condition code setting?
  - Simple instructions: (can) take fewer cycles to execute (+)
- Reducing time/cycle (clock period)
  - Does instruction complexity affect this? It depends.

