
1 1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 1 Fundamentals of Quantitative Design and Analysis Computer Architecture A Quantitative Approach, Fifth Edition

2 2 Copyright © 2012, Elsevier Inc. All rights reserved. Computer Technology Performance improvements: Improvements in semiconductor technology Feature size, clock speed Improvements in computer architectures Enabled by HLL compilers, UNIX Lead to RISC architectures Together have enabled: Lightweight computers Productivity-based managed/interpreted programming languages SaaS, Virtualization, Cloud Applications evolution: Speech, sound, images, video, “augmented/extended reality”, “big data” Introduction

3 3 Copyright © 2012, Elsevier Inc. All rights reserved. Single Processor Performance Introduction RISC Move to multi-processor

4 4 Copyright © 2012, Elsevier Inc. All rights reserved. Current Trends in Architecture Cannot continue to leverage Instruction-Level parallelism (ILP) Single processor performance improvement ended in 2003 New models for performance: Data-level parallelism (DLP) Thread-level parallelism (TLP) Request-level parallelism (RLP) These require explicit restructuring of the application Introduction

5 5 Copyright © 2012, Elsevier Inc. All rights reserved. Classes of Computers Personal Mobile Device (PMD) e.g. smart phones, tablet computers (1.8 billion sold 2010) Emphasis on energy efficiency and real-time Desktop Computing Emphasis on price-performance (0.35 billion) Servers Emphasis on availability (very costly downtime!), scalability, throughput (20 million) Clusters / Warehouse Scale Computers Used for “Software as a Service (SaaS)”, PaaS, IaaS, etc. Emphasis on availability ($6M/hour-downtime at Amazon.com!) and price-performance (power=80% of TCO!) Sub-class: Supercomputers, emphasis: floating-point performance and fast internal networks, and big data analytics Embedded Computers (19 billion in 2010) Emphasis: price Classes of Computers

6 6 Copyright © 2012, Elsevier Inc. All rights reserved. Parallelism Classes of parallelism in applications: Data-Level Parallelism (DLP) Task-Level Parallelism (TLP) Classes of architectural parallelism: Instruction-Level Parallelism (ILP) Vector architectures/Graphic Processor Units (GPUs) Thread-Level Parallelism Request-Level Parallelism Classes of Computers

7 7 Copyright © 2012, Elsevier Inc. All rights reserved. Flynn’s Taxonomy Single instruction stream, single data stream (SISD) Single instruction stream, multiple data streams (SIMD) Vector architectures Multimedia extensions Graphics processor units Multiple instruction streams, single data stream (MISD) No commercial implementation Multiple instruction streams, multiple data streams (MIMD) Tightly-coupled MIMD Loosely-coupled MIMD Classes of Computers

8 8 Copyright © 2012, Elsevier Inc. All rights reserved. Defining Computer Architecture “Old” view of computer architecture: Instruction Set Architecture (ISA) design, i.e. decisions regarding registers, memory addressing, addressing modes, instruction operands, available operations, control flow instructions, instruction encoding “Real” computer architecture: Specific requirements of the target machine Design to maximize performance within constraints: cost, power, and availability Includes ISA, microarchitecture, hardware Defining Computer Architecture

9 9 Copyright © 2012, Elsevier Inc. All rights reserved. Trends in Technology Integrated circuit technology Transistor density: 35%/year Die size: 10-20%/year Integration overall: 40-55%/year DRAM capacity: 25-40%/year (slowing) Flash capacity: 50-60%/year 15-20X cheaper/bit than DRAM Magnetic disk technology: 40%/year 15-25X cheaper/bit than Flash 300-500X cheaper/bit than DRAM Trends in Technology

10 10 Copyright © 2012, Elsevier Inc. All rights reserved. Bandwidth and Latency Bandwidth or throughput Total work done in a given time 10,000-25,000X improvement for processors over the 1st milestone 300-1200X improvement for memory and disks over the 1st milestone Latency or response time Time between start and completion of an event 30-80X improvement for processors over the 1st milestone 6-8X improvement for memory and disks over the 1st milestone Trends in Technology

11 11 Copyright © 2012, Elsevier Inc. All rights reserved. Bandwidth and Latency Log-log plot of bandwidth and latency milestones Trends in Technology

12 12 Copyright © 2012, Elsevier Inc. All rights reserved. Transistors and Wires Feature size Minimum size of transistor or wire in x or y dimension 10 microns in 1971 to 0.032 microns in 2011 Transistor performance scales linearly Wire delay does not improve with feature size! Integration density scales quadratically Linear performance and quadratic density growth present a challenge and opportunity, creating the need for the computer architect! Trends in Technology

13 13 Copyright © 2012, Elsevier Inc. All rights reserved. Power and Energy Problem: Get power in, get power out Thermal Design Power (TDP) Characterizes sustained power consumption Used as target for power supply and cooling system Lower than peak power, higher than average power consumption Clock rate can be reduced dynamically to limit power consumption Energy per task is often a better measurement Trends in Power and Energy

14 14 Copyright © 2012, Elsevier Inc. All rights reserved. Dynamic Energy and Power Dynamic energy: transistor switch from 0 -> 1 or 1 -> 0 Energy_dynamic = ½ × Capacitive load × Voltage² Dynamic power: Power_dynamic = ½ × Capacitive load × Voltage² × Frequency switched Reducing clock rate reduces power, not energy Trends in Power and Energy
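
A minimal numeric sketch of these two relations (the capacitance, voltage, and frequency values below are illustrative, not from the slide):

```python
# Sketch of the dynamic energy/power relations above; C, V, f are assumed values.
def dynamic_energy(cap_load, voltage):
    """Energy per 0->1 or 1->0 transition: 1/2 * C * V^2 (joules)."""
    return 0.5 * cap_load * voltage ** 2

def dynamic_power(cap_load, voltage, freq_switched):
    """Average switching power: 1/2 * C * V^2 * f (watts)."""
    return dynamic_energy(cap_load, voltage) * freq_switched

C, V, f = 1e-9, 1.0, 2e9             # 1 nF of switched capacitance, 1 V, 2 GHz
print(dynamic_power(C, V, f))        # 1.0 W
print(dynamic_power(C, V, f / 2))    # 0.5 W: halving f halves power...
print(dynamic_energy(C, V))          # ...but energy per transition is unchanged
```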

15 15 Copyright © 2012, Elsevier Inc. All rights reserved. Power Intel 80386 consumed ~ 2 W 3.3 GHz Intel Core i7 consumes 130 W Heat must be dissipated from 1.5 x 1.5 cm chip This is the limit of what can be cooled by air Trends in Power and Energy

16 16 Copyright © 2012, Elsevier Inc. All rights reserved. Reducing Power Techniques for reducing power: Do nothing well Dynamic Voltage-Frequency Scaling Low power state for DRAM, disks Overclocking, turning off cores Trends in Power and Energy

17 17 Copyright © 2012, Elsevier Inc. All rights reserved. Static Power Static power consumption: Power_static = Current_static × Voltage Scales with number of transistors To reduce: power gating Race-to-halt The new primary evaluation for design innovation: Tasks per joule, Performance per watt Trends in Power and Energy

18 18 Copyright © 2012, Elsevier Inc. All rights reserved. Trends in Cost Cost driven down by learning curve Yield DRAM: price closely tracks cost Microprocessors: price depends on volume 10% less for each doubling of volume Trends in Cost

19 19 Copyright © 2012, Elsevier Inc. All rights reserved. Integrated Circuit Cost Integrated circuit Bose-Einstein formula: Defects per unit area = 0.016-0.057 defects per square cm (2010) N = process-complexity factor = 11.5-15.5 (40 nm, 2010) The manufacturing process dictates the wafer cost, wafer yield and defects per unit area The architect’s design affects the die area, which in turn affects the defects and cost per die Trends in Cost
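
The formulas themselves appeared as figures on the original slide; the sketch below follows the textbook's die-cost model, with an assumed wafer cost, wafer diameter, die area, and defect density (the defect density and N are picked from the ranges quoted above):

```python
import math

# Sketch of the integrated-circuit cost model referenced above (textbook form).
# Wafer cost, wafer diameter, die area, and defect density are illustrative.
def dies_per_wafer(wafer_diam_cm, die_area_cm2):
    return (math.pi * (wafer_diam_cm / 2) ** 2) / die_area_cm2 \
           - (math.pi * wafer_diam_cm) / math.sqrt(2 * die_area_cm2)

def die_yield(wafer_yield, defects_per_cm2, die_area_cm2, N):
    """Bose-Einstein yield model: wafer_yield / (1 + defects * area)^N."""
    return wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** N

def cost_per_die(wafer_cost, wafer_diam_cm, die_area_cm2,
                 wafer_yield=1.0, defects_per_cm2=0.03, N=13.5):
    good_dies = dies_per_wafer(wafer_diam_cm, die_area_cm2) \
                * die_yield(wafer_yield, defects_per_cm2, die_area_cm2, N)
    return wafer_cost / good_dies

# 30 cm wafer at an assumed $5000/wafer, 1.0 cm^2 die
print(round(cost_per_die(5000, 30, 1.0), 2))
```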

20 20 Copyright © 2012, Elsevier Inc. All rights reserved. Dependability Systems alternate between two states of service with respect to SLA/SLO: 1. Service accomplishment, where service is delivered as specified by SLA 2. Service interruption, where the delivered service is different from the SLA Module reliability: “failure(F)=transition from 1 to 2” and “repair(R)=transition from 2 to 1” Mean time to failure (MTTF) Mean time to repair (MTTR) Mean time between failures (MTBF) = MTTF + MTTR Availability = MTTF / MTBF Dependability
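
A small sketch of the availability relation above (the MTTF and MTTR figures are made up for illustration):

```python
# Availability = MTTF / (MTTF + MTTR); the numbers below are illustrative only.
mttf_hours = 1_000_000        # mean time to failure
mttr_hours = 24               # mean time to repair
mtbf_hours = mttf_hours + mttr_hours
availability = mttf_hours / mtbf_hours
print(f"{availability:.6f}")  # ~0.999976, i.e. better than "four nines"
```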

21 21 Copyright © 2012, Elsevier Inc. All rights reserved. Measuring Performance Typical performance metrics: Response time Throughput Speedup of X relative to Y Execution time Y / Execution time X Execution time Wall clock time: includes all system overheads CPU time: only computation time Benchmarks Kernels (e.g. matrix multiply) Toy programs (e.g. sorting) Synthetic benchmarks (e.g. Dhrystone) Benchmark suites (e.g. SPEC06fp, TPC-C) Measuring Performance

22 22 Copyright © 2012, Elsevier Inc. All rights reserved. Principles of Computer Design Take Advantage of Parallelism e.g. multiple processors, disks, memory banks, pipelining, multiple functional units Principle of Locality Reuse of data and instructions Focus on the Common Case Amdahl’s Law Principles

23 23 Copyright © 2012, Elsevier Inc. All rights reserved. Principles of Computer Design The Processor Performance Equation Principles
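
The equation itself was a figure on the slide; in the textbook's form it is CPU time = Instruction count × CPI × Clock cycle time. A minimal sketch with assumed operand values:

```python
# Processor performance equation; the instruction count, CPI, and clock are assumed values.
def cpu_time(instruction_count, cpi, clock_cycle_time_s):
    """CPU time = Instruction count * Cycles per instruction * Clock cycle time."""
    return instruction_count * cpi * clock_cycle_time_s

ic = 2_000_000_000            # 2 billion instructions (assumed)
cpi = 1.5                     # average cycles per instruction (assumed)
cycle = 1 / 2.5e9             # 2.5 GHz clock -> 0.4 ns cycle time
print(cpu_time(ic, cpi, cycle))   # ~1.2 seconds
```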

24 24 Copyright © 2012, Elsevier Inc. All rights reserved. Principles of Computer Design Principles Different instruction types having different CPIs
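
The accompanying equation was also a figure; with different instruction classes, the overall CPI is the frequency-weighted sum of the per-class CPIs. A sketch with an assumed instruction mix:

```python
# Overall CPI as a frequency-weighted sum; the instruction mix below is illustrative.
mix = {                       # class: (fraction of instructions, CPI) -- assumed values
    "ALU":    (0.50, 1.0),
    "load":   (0.20, 2.0),
    "store":  (0.10, 2.0),
    "branch": (0.20, 1.5),
}
overall_cpi = sum(frac * cpi for frac, cpi in mix.values())
print(round(overall_cpi, 2))  # 0.5*1 + 0.2*2 + 0.1*2 + 0.2*1.5 = 1.4
```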

25 25 Chapter 1 Review & Examples Copyright © 2012, Elsevier Inc. All rights reserved.

26 ISA CSCE430/830 Instruction Set Architecture (ISA) Serves as an interface between software and hardware. Provides a mechanism by which the software tells the hardware what should be done. Layering shown on the slide: high-level language code (C, C++, Java, Fortran) -> [compiler] -> assembly language code (architecture-specific statements) -> [assembler] -> machine language code (architecture-specific bit patterns); the instruction set sits at the boundary between software and hardware.

27 ISA CSCE430/830 Instruction Set Design Issues Instruction set design issues include:
–Where are operands stored? »registers, memory, stack, accumulator
–How many explicit operands are there? »0, 1, 2, or 3
–How is the operand location specified? »register, immediate, indirect, ...
–What type & size of operands are supported? »byte, int, float, double, string, vector, ...
–What operations are supported? »add, sub, mul, move, compare, ...

28 ISA CSCE430/830 Classifying ISAs
Accumulator (before 1960, e.g. 68HC11): 1-address: add A (acc <- acc + mem[A])
Stack (1960s to 1970s): 0-address: add (tos <- tos + next)
Memory-Memory (1970s to 1980s): 2-address: add A, B (mem[A] <- mem[A] + mem[B]); 3-address: add A, B, C (mem[A] <- mem[B] + mem[C])
Register-Memory (1970s to present, e.g. 80x86): 2-address: add R1, A (R1 <- R1 + mem[A]); load R1, A (R1 <- mem[A])
Register-Register (Load/Store, RISC) (1960s to present, e.g. MIPS): 3-address: add R1, R2, R3 (R1 <- R2 + R3); load R1, R2 (R1 <- mem[R2]); store R1, R2 (mem[R1] <- R2)

29 ISA CSCE430/830 Operand Locations in Four ISA Classes GPR

30 ISA CSCE430/830 Code Sequence C = A + B for Four Instruction Sets
Stack: Push A; Push B; Add; Pop C
Accumulator: Load A; Add B; Store C
Register (register-memory): Load R1, A; Add R1, B; Store C, R1
Register (load-store): Load R1, A; Load R2, B; Add R3, R1, R2; Store C, R3

31 ISA CSCE430/830 Types of Addressing Modes (VAX) Addressing Mode, Example, Action:
1. Register direct: Add R4, R3 (R4 <- R4 + R3)
2. Immediate: Add R4, #3 (R4 <- R4 + 3)
3. Displacement: Add R4, 100(R1) (R4 <- R4 + M[100 + R1])
4. Register indirect: Add R4, (R1) (R4 <- R4 + M[R1])
5. Indexed: Add R4, (R1 + R2) (R4 <- R4 + M[R1 + R2])
6. Direct: Add R4, (1000) (R4 <- R4 + M[1000])
7. Memory indirect: Add R4, @(R3) (R4 <- R4 + M[M[R3]])
8. Autoincrement: Add R4, (R2)+ (R4 <- R4 + M[R2]; R2 <- R2 + d)
9. Autodecrement: Add R4, -(R2) (R2 <- R2 - d; R4 <- R4 + M[R2])
10. Scaled: Add R4, 100(R2)[R3] (R4 <- R4 + M[100 + R2 + R3*d])
Studies by [Clark and Emer] indicate that modes 1-4 account for 93% of all operands on the VAX.

32 ISA CSCE430/830 Types of Operations
Arithmetic and Logic: AND, ADD
Data Transfer: MOVE, LOAD, STORE
Control: BRANCH, JUMP, CALL
System: OS CALL, VM
Floating Point: ADDF, MULF, DIVF
Decimal: ADDD, CONVERT
String: MOVE, COMPARE
Graphics: (DE)COMPRESS

33 ISA-2 CSCE430/830 MIPS Instructions All instructions exactly 32 bits wide Different formats for different purposes Similarities in formats ease implementation
R-Format: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
I-Format: op (6 bits) | rs (5 bits) | rt (5 bits) | offset (16 bits)
J-Format: op (6 bits) | address (26 bits)
(Bit 31 is on the left and bit 0 on the right of each format.)

34 ISA-2 CSCE430/830 MIPS Instruction Types
Arithmetic & Logical - manipulate data in registers: add $s1, $s2, $s3 ($s1 = $s2 + $s3); or $s3, $s4, $s5 ($s3 = $s4 OR $s5)
Data Transfer - move register data to/from memory (load & store): lw $s1, 100($s2) ($s1 = Memory[$s2 + 100]); sw $s1, 100($s2) (Memory[$s2 + 100] = $s1)
Branch - alter program flow: beq $s1, $s2, 25 (if ($s1 == $s2) PC = PC + 4 + 4*25, else PC = PC + 4)

35 ISA-2 CSCE430/830 MIPS Arithmetic & Logical Instructions Instruction usage (assembly):
add dest, src1, src2 (dest = src1 + src2)
sub dest, src1, src2 (dest = src1 - src2)
and dest, src1, src2 (dest = src1 AND src2)
Instruction characteristics: Always 3 operands: destination + 2 sources; Operand order is fixed; Operands are always general purpose registers
Design Principles: Design Principle 1: Simplicity favors regularity; Design Principle 2: Smaller is faster

36 ISA-2 CSCE430/830 Arithmetic & Logical Instructions - Binary Representation Used for arithmetic, logical, shift instructions
–op: basic operation of the instruction (opcode)
–rs: first register source operand
–rt: second register source operand
–rd: register destination operand
–shamt: shift amount (more about this later)
–funct: function - specific type of operation
Also called “R-Format” or “R-Type” Instructions. Layout (bit 31 down to bit 0): op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)

37 ISA-2 CSCE430/830 Arithmetic & Logical Instructions - Binary Representation Example Machine language for add $8, $17, $18 (see reference card for op, funct values):
Field:   op      rs     rt     rd     shamt  funct
Decimal: 0       17     18     8      0      32
Binary:  000000  10001  10010  01000  00000  100000
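
As a cross-check, a short sketch that packs those field values into the 32-bit R-format word (the helper name and the hex output are mine, not from the slide):

```python
# Pack MIPS R-format fields (op 6 | rs 5 | rt 5 | rd 5 | shamt 5 | funct 6)
# into one 32-bit word; field values are those for add $8, $17, $18.
def encode_r_format(op, rs, rt, rd, shamt, funct):
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

word = encode_r_format(op=0, rs=17, rt=18, rd=8, shamt=0, funct=32)
print(f"{word:032b}")   # 00000010001100100100000000100000
print(f"{word:#010x}")  # 0x02324020
```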

38 ISA-2 CSCE430/830 MIPS Data Transfer Instructions Transfer data between registers and memory Instruction format (assembly):
lw $dest, offset($addr) (load word)
sw $src, offset($addr) (store word)
Uses: Accessing a variable in main memory; Accessing an array element

39 ISA-2 CSCE430/830 Review: Chapter 1 Classes of Computers and Classes of Parallelism Technology Trend Dependability Performance Measurements and Benchmarks Principles

40 ISA-2 CSCE430/830 5 Classes of Computers Personal Mobile Devices –Cost is its primary concern –Energy, media performance, and responsiveness Desktop Computing –Price-Performance is its primary concern Servers –Availability, Scalability, and Throughput Clusters/warehouse-scale computers –Price-Performance, Energy Embedded Computer –Price

41 ISA-2 CSCE430/830 Classes of Parallelism & Architectures Data-Level Parallelism –Data items can be operated on at the same time Task-Level Parallelism –Tasks can operate independently and largely in parallel Instruction-Level Parallelism: exploits data-level parallelism –Pipelining, speculative execution Vector Architectures & GPUs: exploit data-level parallelism –A single instruction operates on a collection of data in parallel Thread-Level Parallelism: exploits either data-level or task-level parallelism –Exploits parallelism via parallel threads Request-Level Parallelism: exploits task-level parallelism –Exploits parallelism via decoupled tasks

42 ISA-2 CSCE430/830 4 ways for hardware to support parallelism Single Instruction stream, Single Data stream –SISD Single Instruction stream, Multiple Data streams –SIMD, e.g., GPU, targets data-level parallelism Multiple Instruction streams, Single Data stream –MISD, no commercial multiprocessor of this type Multiple Instruction streams, Multiple Data streams –MIMD, e.g., multi-core processors, targets task-level parallelism

43 ISA-2 CSCE430/830 Trend in Technology Integrated Circuit (IC) logic technology –Moore’s Law: a growth rate in transistor count on a chip of about 40%-55% per year, or doubling every 18 or 24 months. Semiconductor DRAM –In 2011, a growth rate in capacity: 25%-40% per year Flash –A growth rate in capacity: 50%-60% per year Magnetic Disk –Since 2004, it has dropped back to 40% per year.
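
As a quick sanity check on the doubling-time claim, a small sketch (the helper name is mine):

```python
import math

# Years for capacity to double at a compound annual growth rate r:
# (1 + r)^t = 2  =>  t = ln(2) / ln(1 + r)
def doubling_time_years(annual_growth):
    return math.log(2) / math.log(1 + annual_growth)

print(round(doubling_time_years(0.40) * 12))  # ~25 months at 40%/year
print(round(doubling_time_years(0.55) * 12))  # ~19 months at 55%/year
```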

44 ISA-2 CSCE430/830 Trend in Performance Bandwidth vs. Latency –The improvement in bandwidth has been much more significant than the improvement in latency.

45 ISA-2 CSCE430/830 Growth in Processor Performance RISC Move to multi-processor Parallelism: via Pipelining Locality: using Cache Hurdle: Power Wall Lack: Instruction-level Parallelism

46 ISA-2 CSCE430/830 An example of Intel 486 CPU released in 1992, 66 MHz, w/ L2 Cache, 4.9-6.3 W http://www.cpu-world.com/CPUs/80486/Intel-A80486DX2-66.html

47 ISA-2 CSCE430/830 A CPU fan for Intel 486 CPU http://www.cnaweb.com/486-ball-bearing-cpu-fan.aspx

48 ISA-2 CSCE430/830 An example of Intel Pentium 4 CPU released in 2002, 2.8 GHz, w/ 512KB Cache, 68.4 W http://www.pcplanetsystems.com/abc/product_details.php?item_id=146&category_id=61

49 ISA-2 CSCE430/830 A typical CPU fan for Intel Pentium 4 http://www.dansdata.com/p4coc.htm

50 ISA-2 CSCE430/830 A special CPU fan for gaming/multimedia users http://www.pcper.com/reviews/Cases-and-Cooling/Asus-Star-Ice-CPU-Cooler-Review

51 ISA-2 CSCE430/830 Trend in Power and Energy in IC Energy_dynamic = ½ × Capacitive Load × Voltage² Power_dynamic = ½ × Capacitive Load × Voltage² × Freq. switched Example –Intel 486, 66 MHz, Voltage: 5 V –Intel Pentium 4, 2.8 GHz, Voltage: 1.5 V –Intel Core 990x, 3.4 GHz, Voltage: 0.8-1.375 V Improving Energy Efficiency –Do nothing well; Dynamic Voltage-Frequency Scaling (DVFS); Design for typical case; Overclocking Power_static = Current_static × Voltage
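
A minimal sketch of why lowering voltage together with frequency (DVFS) saves much more than lowering frequency alone; the scaling ratios below are illustrative, not measured values for these CPUs:

```python
# Dynamic power scales as C * V^2 * f (the 1/2 constant cancels in ratios).
# The voltage and frequency ratios below are assumed, not real DVFS steps.
def relative_dynamic_power(v_ratio, f_ratio):
    """Power of the scaled operating point relative to the nominal one."""
    return v_ratio ** 2 * f_ratio

print(round(relative_dynamic_power(1.0, 0.5), 2))  # 0.5: halving only frequency halves power
print(round(relative_dynamic_power(0.8, 0.5), 2))  # 0.32: also dropping voltage 20% saves far more
```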

52 ISA-2 CSCE430/830 Dependability Service Accomplishment & Service Interruption Transitions between 2 states: Failure & Restoration Measurements –Reliability: a measure of the continuous service accomplishment from a reference initial instant. »MTTF: Mean time to failure »FIT: failures per billion hours = (1/MTTF) × 10^9 »MTTR: Mean time to repair »MTBF: Mean time between failures = MTTF + MTTR –Availability: a measure of the service accomplishment with respect to the alternation between the two states. »MTTF/(MTTF+MTTR) »Upper bound: 100%

53 ISA-2 CSCE430/830 Performance Measurements and Benchmarks Metrics –Throughput: a total amount of work done in a given time –Response time (Execution time): the time between the start and the completion of an event Speedup of X relative to Y –Execution time Y / Execution time X Execution time –Wall clock time: a latency to complete a task –CPU time: only computation time Benchmarks –Kernels, Toy programs, Synthetic benchmarks –Benchmark suites: SPEC [CPU] & TPC [Transaction Processing] –SpecRatio = Execution Time reference / Execution Time target
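
For example, a small sketch of combining SpecRatios across a suite; SPEC summarizes a suite with the geometric mean of the per-benchmark ratios, and the ratio values below are made up:

```python
import math

# SpecRatio = reference execution time / measured execution time, one per benchmark.
spec_ratios = [12.0, 8.5, 15.2, 9.8]              # illustrative values, not real results
geo_mean = math.prod(spec_ratios) ** (1 / len(spec_ratios))
print(round(geo_mean, 2))
```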

54 ISA-2 CSCE430/830 Design Principles Take Advantage of Parallelism Principle of Locality Focus on the Common Case –Amdahl’s Law –Upper bound of the speedup: ?

55 ISA-2 CSCE430/830 Example: Laundry Room Dirty Laundry -> Washing Machine (30 minutes washing) -> Drying Machine (90 minutes drying) -> Clean Laundry Total Execution Time: 30+90 = 120 minutes Washing Portion: 30/120 = ¼ Drying Portion: 90/120 = ¾

56 ISA-2 CSCE430/830 If we can have two drying machines Dirty Laundry -> Washing Machine (30 minutes washing) -> 2 Drying Machines (90/2 = 45 minutes drying) -> Clean Laundry

57 ISA-2 CSCE430/830 Speedup: (30+90)/(30+45) = 1.6 Dirty Laundry -> Washing Machine (30 minutes washing) -> 2 Drying Machines (90/2 = 45 minutes drying) -> Clean Laundry

58 ISA-2 CSCE430/830 If we can have unlimited drying machines Dirty Laundry -> Washing Machine (30 minutes washing) -> ∞ Drying Machines (? minutes drying) -> Clean Laundry

59 ISA-2 CSCE430/830 Speedup: (30+90)/(30+0) = 4 Dirty Laundry -> Washing Machine (30 minutes washing) -> ∞ Drying Machines (90/∞ ≈ 0 minutes drying) -> Clean Laundry

60 ISA-2 CSCE430/830 Design Principles Take Advantage of Parallelism Principle of Locality Focus on the Common Case –Amdahl’s Law –Upper bound of the speedup: » 1 / (1 - Fraction_enhanced)

61 ISA-2 CSCE430/830 Exercise 1 If the new processor is 10 times faster than the original processor, and we assume that the original processor is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement? Fraction_enhanced = 0.4, Speedup_enhanced = 10 Speedup_overall = 1/(0.6 + 0.4/10) = 1.56 What is the upper bound of the overall speedup? Upper bound = 1/0.6 = 1.67
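
A short sketch that reproduces this calculation (the function name is mine):

```python
# Amdahl's Law: overall speedup when a fraction of the work is sped up.
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(round(amdahl_speedup(0.4, 10), 2))  # 1.56
print(round(1 / (1 - 0.4), 2))            # 1.67, the upper bound as speedup_enhanced -> infinity
```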

62 ISA-2 CSCE430/830 Exercise 2 In a disk subsystem: –10 disks, each rated at 1,000,000-hour MTTF –1 ATA controller, 500,000-hour MTTF –1 power supply, 200,000-hour MTTF –1 fan, 200,000-hour MTTF –1 ATA cable, 1,000,000-hour MTTF Assuming the lifetimes are exponentially distributed and that failures are independent, compute the MTTF of the system as a whole

63 ISA-2 CSCE430/830 Exercise 2 Because the overall failure rate of the collection is the sum of the failure rates of the modules, the failure rate of the system –= 10*(1/1,000,000) + 1/500,000 + 1/200,000 + 1/200,000 + 1/1,000,000 –= 23/1,000,000, or 23,000 FIT Because MTTF is the inverse of the failure rate –MTTF_system = 1/(23/1,000,000) ≈ 43,500 hours
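
And a corresponding sketch of that failure-rate sum (the variable names are mine):

```python
# System failure rate = sum of component failure rates (independent, exponential lifetimes).
component_mttfs = [1_000_000] * 10 + [500_000, 200_000, 200_000, 1_000_000]  # hours
failure_rate = sum(1 / m for m in component_mttfs)   # failures per hour
print(round(failure_rate * 1e9))   # 23000 FIT (failures per billion hours)
print(round(1 / failure_rate))     # ~43478 hours MTTF for the whole system
```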

