CptS 561 / EE 524 COMPUTER ARCHITECTURE

CptS 561 / EE 524 COMPUTER ARCHITECTURE School of Electrical Engineering and Computer Science WASHINGTON STATE UNIVERSITY FALL 2012

Course Objectives / Project Students in this course will be able to: Understand how modern computer systems work. Perform quantitative analysis of computer systems. Analyze, at the system level, the impact of changes to a computer system. Estimate the performance of a computer system. Recognize the need for further learning in this field (life-long learning). Project: A study of a multi-core processor. Students will work in teams of 2 members.

Grade

Conventional Wisdom in Comp. Architecture Old Conventional Wisdom: Power is free, transistors are expensive New Conventional Wisdom: “Power wall” Power is expensive, transistors are free (we can put more on a chip than we can afford to turn on) Old CW: We can sufficiently increase Instruction-Level Parallelism via compilers and innovation (out-of-order, speculation, VLIW, …) New CW: “ILP wall” Law of diminishing returns on more HW for ILP

Conventional Wisdom in Comp. Architecture Old CW: Multiplies are slow, memory access is fast New CW: “Memory wall” Memory is slow, multiplies are fast (200 clock cycles to DRAM memory, 4 clocks for a multiply) Old CW: Uniprocessor performance 2X / 1.5 yrs New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall Uniprocessor performance now 2X / 5(?) yrs → Sea change in chip design: multiple “cores” (2X processors per chip / ~2 years) More, simpler processors that are more power efficient

Single Processor Performance (figure: single-processor performance over time, annotated “RISC” and “Move to multi-processor”)

Current Trends in Architecture Cannot continue to leverage Instruction-Level Parallelism (ILP) Single-processor performance improvement ended in 2003 New models for performance: Data-Level Parallelism (DLP) Thread-Level Parallelism (TLP) Request-Level Parallelism (RLP), e.g., data centers These require explicit restructuring of the application

Classes of Computers Personal Mobile Device (PMD), e.g. smart phones, tablet computers Emphasis on energy efficiency and real-time Desktop Computing Emphasis on price-performance Servers Emphasis on availability, scalability, throughput Clusters / Warehouse-Scale Computers Used for “Software as a Service (SaaS)” Emphasis on availability and price-performance Sub-class: Supercomputers, emphasis: floating-point performance and fast internal networks Embedded Computers Emphasis: price

Parallelism Classes of parallelism in applications: Data-Level Parallelism (DLP) Task-Level Parallelism (TLP) Classes of architectural parallelism: Instruction-Level Parallelism (ILP) Vector architectures / Graphics Processor Units (GPUs) Thread-Level Parallelism

Flynn’s Taxonomy Single instruction stream, single data stream (SISD) Single instruction stream, multiple data streams (SIMD) Vector architectures Multimedia extensions Graphics processor units Multiple instruction streams, single data stream (MISD) No commercial implementation Multiple instruction streams, multiple data streams (MIMD) Tightly-coupled MIMD Loosely-coupled MIMD

Defining Computer Architecture “Old” view of computer architecture: Instruction Set Architecture (ISA) design i.e. decisions regarding: registers, memory addressing, addressing modes, instruction operands, available operations, control flow instructions, instruction encoding “Real” computer architecture: Specific requirements of the target machine Design to maximize performance within constraints: cost, power, and availability Includes ISA, microarchitecture, hardware

Tracking Technology Performance Trends Drill down into 4 technologies: Disks Memory Network Processors Compare ~1980 Archaic (Nostalgic) vs. ~2000 Modern (Newfangled) Performance Milestones in each technology Compare Bandwidth vs. Latency improvements over time Bandwidth: number of events per unit time E.g., Mbits/second over a network, Mbytes/second from a disk Latency: elapsed time for a single event E.g., one-way network delay in microseconds, average disk access time in milliseconds

Disks: Archaic(Nostalgic) v. Modern(Newfangled) Seagate 373453, 2003 15000 RPM (4X) 73.4 GBytes (2500X) Tracks/Inch: 64000 (80X) Bits/Inch: 533,000 (60X) Four 2.5” platters (in 3.5” form factor) Bandwidth: 86 MBytes/sec (140X) Latency: 5.7 ms (8X) Cache: 8 MBytes CDC Wren I, 1983 3600 RPM 0.03 GBytes capacity Tracks/Inch: 800 Bits/Inch: 9550 Three 5.25” platters Bandwidth: 0.6 MBytes/sec Latency: 48.3 ms Cache: none
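
The parenthesized improvement factors are just the modern value divided by the archaic one (for latency, the other way around). A minimal Python sketch of that arithmetic, using the bandwidth and latency numbers from this slide (rounding explains the quoted 140X and 8X):

    # Disk improvement factors, values taken from the slide above.
    wren_bw, seagate_bw = 0.6, 86.0      # MBytes/sec (CDC Wren I vs. Seagate 373453)
    wren_lat, seagate_lat = 48.3, 5.7    # ms

    bw_gain = seagate_bw / wren_bw       # ~143X bandwidth improvement
    lat_gain = wren_lat / seagate_lat    # ~8.5X latency improvement
    print(f"BW: {bw_gain:.0f}X, latency: {lat_gain:.1f}X")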

Latency Lags Bandwidth (for last ~20 years) Performance Milestones Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x) Disk: 33X BW, 6X latency (latency = simple operation w/o contention; BW = best-case)

Memory: Archaic (Nostalgic) v. Modern (Newfangled) 1980 DRAM (asynchronous) 0.06 Mbits/chip 64,000 xtors, 35 mm2 16-bit data bus per module, 16 pins/chip 13 Mbytes/sec Latency: 225 ns (no block transfer) 2000 Double Data Rate Synchr. (clocked) DRAM 256 Mbits/chip (4000X) 256,000,000 xtors, 204 mm2 64-bit data bus per DIMM, 66 pins/chip (4X) 1600 Mbytes/sec (120X) Latency: 52 ns (4X) Block transfers (page mode)

Latency Lags Bandwidth (last ~20 years) Performance Milestones Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x) Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x) DRAM: 120X BW, 4X latency The jog between the 3rd and 4th points reflects the long gap between 32b fast page mode and 64b memory (latency = simple operation w/o contention; BW = best-case)

LANs: Archaic (Nostalgic) v. Modern (Newfangled) Ethernet 802.3 Year of Standard: 1978 10 Mbits/s link speed Latency: 3000 µsec Shared media Coaxial cable Ethernet 802.3ae Year of Standard: 2003 10,000 Mbits/s (1000X) link speed Latency: 190 µsec (15X) Switched media Category 5 copper wire Copper, 1 mm thick, twisted to avoid antenna effect Twisted Pair: "Cat 5" is 4 twisted pairs in a bundle Coaxial Cable: plastic covering, braided outer conductor, insulator, copper core

Latency Lags Bandwidth (last ~20 years) Performance Milestones Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x, 1000x) Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x) Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x) Network: 1000X BW, 13X Latency (latency = simple operation w/o contention; BW = best-case)

CPUs: Archaic (Nostalgic) v. Modern (Newfangled) 1982 Intel 80286 12.5 MHz 2 MIPS (peak) Latency 320 ns 134,000 xtors, 47 mm2 16-bit data bus, 68 pins Microcode interpreter, separate FPU chip (no caches) 2001 Intel Pentium 4 1500 MHz (120X) 4500 MIPS (peak) (2250X) Latency 15 ns (20X) 42,000,000 xtors, 217 mm2 64-bit data bus, 423 pins 3-way superscalar, Dynamic translate to RISC, Superpipelined (22 stage), Out-of-Order execution On-chip 8KB Data caches, 96KB Instr. Trace cache, 256KB L2 cache

Latency Lags Bandwidth (last ~20 years) Performance Milestones Processor: ‘286, ‘386, ‘486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x) Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x, 1000x) Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x) Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x) CPU high, Memory low (“Memory Wall”) Processor: 2250X BW, 22X Latency

Trends in Technology Integrated circuit technology Transistor density: 35%/year Die size: 10-20%/year Integration overall: 40-55%/year DRAM capacity: 25-40%/year (slowing) Flash capacity: 50-60%/year 15-20X cheaper/bit than DRAM Magnetic disk technology: 40%/year 15-25X cheaper/bit than Flash 300-500X cheaper/bit than DRAM
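
These annual rates compound, so a convenient way to read them is as doubling times. A minimal Python sketch (the rates are taken from this slide; the helper function is just illustrative):

    import math

    def doubling_time(annual_rate):
        """Years for a quantity to double at a compound annual growth rate."""
        return math.log(2) / math.log(1 + annual_rate)

    print(doubling_time(0.35))   # transistor density at 35%/year: ~2.3 years
    print(doubling_time(0.40))   # disk density at 40%/year: ~2.1 years
    print(doubling_time(0.25))   # DRAM capacity at the low end (25%/year): ~3.1 years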

Bandwidth and Latency Bandwidth or throughput Total work done in a given time 10,000-25,000X improvement for processors 300-1200X improvement for memory and disks Latency or response time Time between start and completion of an event 30-80X improvement for processors 6-8X improvement for memory and disks

Log-log plot of bandwidth and latency milestones

Rule of Thumb for Latency Lagging BW In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4 (and capacity improves faster than bandwidth) Stated alternatively: Bandwidth improves by more than the square of the improvement in Latency
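
To see that the two statements agree, suppose latency improves by the largest allowed factor, 1.4, every time bandwidth doubles. After n such steps latency has improved by 1.4^n while bandwidth has improved by 2^n ≈ (1.4²)^n = (1.4^n)², so the bandwidth improvement is roughly the square of the latency improvement, and in practice more.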

Six Reasons Latency Lags Bandwidth 1. Moore’s Law helps BW more than latency Faster transistors, more transistors, more pins help Bandwidth MPU Transistors: 0.130 vs. 42 M xtors (300X) DRAM Transistors: 0.064 vs. 256 M xtors (4000X) MPU Pins: 68 vs. 423 pins (6X) DRAM Pins: 16 vs. 66 pins (4X) Smaller, faster transistors, but they communicate over (relatively) longer lines: limits latency Feature size: 1.5 to 3 vs. 0.18 micron (8X, 17X) MPU Die Size: 47 vs. 217 mm2 (ratio sqrt → 2X) DRAM Die Size: 35 vs. 204 mm2 (ratio sqrt → 2X)

6 Reasons Latency Lags Bandwidth (cont’d) 2. Distance limits latency Size of DRAM block → long bit and word lines → most of DRAM access time Speed of light and computers on a network 1. & 2. explain linear latency vs. square BW? 3. Bandwidth easier to sell (“bigger = better”) E.g., 10 Gbits/s Ethernet (“10 Gig”) vs. 10 µsec latency Ethernet 4400 MB/s DIMM (“PC4400”) vs. 50 ns latency Even if just marketing, customers are now trained Since bandwidth sells, more resources are thrown at bandwidth, which further tips the balance

6 Reasons Latency Lags Bandwidth (cont’d) 4. Latency helps BW, but not vice versa Spinning a disk faster improves both bandwidth and rotational latency 3600 RPM → 15000 RPM = 4.2X Average rotational latency: 8.3 ms → 2.0 ms Other things being equal, also helps BW by 4.2X Lower DRAM latency → more accesses/second (higher bandwidth) Higher linear density helps disk BW (and capacity), but not disk latency 9,550 BPI → 533,000 BPI → 60X in BW
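
The rotational-latency numbers follow directly from the RPM figures: on average the head waits half a revolution. A minimal Python sketch of that calculation (the RPM values come from the slide):

    def avg_rotational_latency_ms(rpm):
        """Average rotational latency = time for half a revolution, in ms."""
        seconds_per_revolution = 60.0 / rpm
        return 0.5 * seconds_per_revolution * 1000.0

    print(avg_rotational_latency_ms(3600))    # ~8.3 ms
    print(avg_rotational_latency_ms(15000))   # ~2.0 ms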

6 Reasons Latency Lags Bandwidth (cont’d) 5. Bandwidth hurts latency Queues help Bandwidth, hurt Latency (Queuing Theory) Adding chips to widen a memory module increases Bandwidth, but higher fan-out on address lines may increase Latency 6. Operating System overhead hurts latency more than Bandwidth Long messages amortize overhead; overhead is a bigger part of short messages
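
One standard queuing-theory illustration of point 5 (not from the slide itself): in an M/M/1 queue with arrival rate lambda and service rate mu, throughput approaches mu as utilization rho = lambda/mu approaches 1, but the average time in the system is T = 1 / (mu - lambda), which grows without bound. Keeping the queue full is good for bandwidth and bad for latency.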

Transistors and Wires Feature size Minimum size of a transistor or wire in the x or y dimension 10 microns in 1971 to 0.032 microns in 2011 (0.022 micron FinFET in 2012) Transistor performance scales linearly Wire delay does not improve with feature size! Integration density scales quadratically
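
Because integration density scales quadratically with the feature-size shrink, the slide’s endpoints imply roughly (10 / 0.032)² ≈ 98,000X more devices per unit area from 1971 to 2011 (a back-of-the-envelope estimate that ignores wiring and design-rule overheads).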

Power and Energy Problem: Get power in, get power out Thermal Design Power (TDP) Characterizes sustained power consumption Used as target for power supply and cooling system Lower than peak power, higher than average power consumption Clock rate can be reduced dynamically to limit power consumption Energy per task is often a better measurement

Dynamic Energy and Power Dynamic energy Energy per transistor switch from 0 -> 1 or 1 -> 0: ½ x Capacitive load x Voltage² Dynamic power ½ x Capacitive load x Voltage² x Frequency switched Reducing clock rate reduces power, not energy
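
A minimal Python sketch of these two formulas, and of why voltage matters more than frequency. The capacitance, voltage, and frequency values are made-up illustrative numbers, not from the slide:

    def dynamic_energy(cap_load, voltage):
        """Energy per 0->1 or 1->0 transition: 1/2 * C * V^2."""
        return 0.5 * cap_load * voltage ** 2

    def dynamic_power(cap_load, voltage, freq_switched):
        """Dynamic power: 1/2 * C * V^2 * f."""
        return dynamic_energy(cap_load, voltage) * freq_switched

    # Hypothetical values: 1 nF of switched capacitance, 1.0 V, 3 GHz.
    p0 = dynamic_power(1e-9, 1.0, 3e9)
    # Scaling voltage and frequency down by 15% cuts power by ~0.85^3 ≈ 0.61,
    # whereas lowering frequency alone cuts power but not energy per task.
    p1 = dynamic_power(1e-9, 0.85, 0.85 * 3e9)
    print(p0, p1, p1 / p0)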

Power Intel 80386 consumed ~2 W 3.3 GHz Intel Core i7 consumes 130 W Heat must be dissipated from a 1.5 x 1.5 cm chip This is the limit of what can be cooled by air
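
Dividing the slide’s numbers gives the power density: 130 W / (1.5 cm x 1.5 cm) = 130 / 2.25 ≈ 58 W/cm², which is roughly the limit of what forced air can carry away.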

Reducing Power Techniques for reducing power: Do nothing well (dark silicon) Dynamic Voltage-Frequency Scaling Low power states for DRAM, disks Overclocking, turning off cores Static power Static power consumption = Current_static x Voltage Scales with number of transistors To reduce: power gating

Trends in Cost Cost driven down by the learning curve (yield) DRAM: price closely tracks cost Microprocessors: price depends on volume 10% less for each doubling of volume

Measuring Performance Typical performance metrics: Response time Throughput Speedup of X relative to Y = Execution time_Y / Execution time_X Execution time Wall clock time: includes all system overheads CPU time: only computation time Benchmarks Kernels (e.g. matrix multiply) Toy programs (e.g. sorting) Synthetic benchmarks (e.g. Dhrystone) Benchmark suites (e.g. SPEC06fp, TPC-C)
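
A minimal Python sketch of the speedup metric (the two execution times below are illustrative placeholders, not measured values):

    def speedup(exec_time_y, exec_time_x):
        """Speedup of X relative to Y = ExecutionTime_Y / ExecutionTime_X."""
        return exec_time_y / exec_time_x

    # Hypothetical measurements: machine Y takes 12 s, machine X takes 8 s.
    print(speedup(12.0, 8.0))   # X is 1.5x faster than Y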

Principles of Computer Design Take Advantage of Parallelism e.g. multiple processors, disks, memory banks, pipelining, multiple functional units Principle of Locality Reuse of data and instructions Focus on the Common Case Amdahl’s Law

Amdahl's Law Speedup due to enhancement E: Speedup(E) = (ExTime w/o E) / (ExTime w/ E) = (Performance w/ E) / (Performance w/o E) Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then: ExTime_new = ExTime_old x ((1 - F) + F/S) Speedup_overall = 1 / ((1 - F) + F/S)
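
A minimal Python sketch of Amdahl’s Law, applied to the floating-point example on the next slide (the 10% fraction and 2X factor come from that slide):

    def amdahl_speedup(fraction_enhanced, speedup_enhanced):
        """Overall speedup when a fraction F of the task is sped up by factor S."""
        return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

    # FP instructions run 2X faster, but only 10% of the work is affected.
    print(amdahl_speedup(0.10, 2.0))   # ~1.053 overall speedup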

Amdahl’s Law Floating point instructions improved to run 2X; but only 10% of actual instructions are FP ExTime_new = ExTime_old x (0.9 + 0.1/2) = 0.95 x ExTime_old Speedup_overall = 1 / 0.95 ≈ 1.053

Principles of Computer Design The Processor Performance Equation
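
The equation itself appears as a figure on the original slide; in the standard form from the text it is: CPU time = Instruction count x Cycles per instruction (CPI) x Clock cycle time, or equivalently CPU time = (Instructions / Program) x (Clock cycles / Instruction) x (Seconds / Clock cycle) = Seconds / Program.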

Principles of Computer Design Different instruction types have different CPIs
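
With per-type CPIs, the standard forms from the text are: CPU clock cycles = Σ_i (IC_i x CPI_i), and overall CPI = Σ_i (IC_i / Instruction count) x CPI_i. A minimal Python sketch with a made-up instruction mix (the counts and per-type CPIs below are illustrative, not from the slide):

    # Hypothetical instruction mix: instruction count (millions) and CPI per type.
    mix = {"ALU": (500, 1.0), "load/store": (300, 2.0), "branch": (200, 1.5)}

    total_instructions = sum(count for count, _ in mix.values())
    total_cycles = sum(count * cpi for count, cpi in mix.values())
    overall_cpi = total_cycles / total_instructions
    print(overall_cpi)   # 1.4 for this made-up mix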