
1 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering
Computer Architecture ECE 668, Part 1: Introduction
Csaba Andras Moritz
Today is review; future lectures will move faster.

2 Coping with ECE 668
Students with varied backgrounds
Prerequisites: Basic Computer Architecture, VLSI
2 projects to choose from, some flexibility beyond that
You need software and/or Verilog/HSPICE skills to complete them
2 exams: midterm and final
Class participation; attend office hours
About the instructor
First lectures: review of Performance and Pipelining (Chapter 1 + Appendix A)
Many lectures will use the whiteboard as well as slides
Lectures related to the textbook and beyond; many lectures are outside the textbook
Web:

3 What you should know
Basic machine structure: processor (data path, control, arithmetic), memory, I/O
Read and write in an assembly language and C/C++; MIPS/ARM ISA preferred
Understand the concepts of pipelining and virtual memory
Basic VLSI: HSPICE and/or Verilog

4 Textbook and references
Textbook: D.A. Patterson and J.L. Hennessy, Computer Architecture: A Quantitative Approach, 4th edition (or later), Morgan Kaufmann.
Recommended reading:
J.P. Shen and M.H. Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors, McGraw-Hill, 2005.
Chandrakasan et al., Design of High-Performance Microprocessor Circuits.
NASIC research papers and Nanoelectronics textbook chapter; SKYBRIDGE, N3ASIC, CMOL, FPNI, SPWF papers if interested.
Other research papers we bring up in class.

5 Course Outline
I. Introduction (Ch. 1)
II. Pipeline Design (App. A)
III. Instruction-level Parallelism, Pipelining (App. A, Ch. 2)
IV. Memory Design: Memory Hierarchy, Cache Memory, Secondary Memory (Ch. 4)
V. Multiprocessors (Ch. 3)
VI. Deep Submicron Implementation: Process Variation, Power-aware Architectures, the Compiler's Role
VII. Nanoscale Architectures

6 Administrative Details
Instructor: Prof. Csaba Andras Moritz, KEB 309H
Office Hours: 2:30-3:30 pm Tue. and 2:30-3:00 pm Thu.
TA: pending
Course web page: details available at:

7 Grading
Midterm I - 35%
Project - 30% (two projects to choose from)
Class Participation - 5%
Final Exam - 30%
Homework - exam questions

8 What is "Computer Architecture"?
Instruction Set Architecture + Machine Organization (e.g., pipelining, memory hierarchy, storage systems, etc.)
Or unconventional organization
IBM 360 (minicomputer, mainframe, supercomputer)
Intel x86 vs. ARM vs. nanoprocessors

9 Computer Architecture Topics - Processors
Instruction Set Architecture
Instruction-level parallelism: pipelining, hazard resolution, superscalar, reordering, branch prediction, VLIW, vector
Memory hierarchy: addressing, L1 cache, L2 cache, DRAM, interleaving, bandwidth, latency
Input/Output and storage: disks, tape, RAID (performance, reliability), bus protocols
VLSI

10 Advanced CMOS multi-cores & nanoprocessors? (2013)

11

12 Scaling Figure courtesy of Intel; Copyright – Baskaran Ganesan, Intel Higher Education Program

13 Shrinking geometry Figure courtesy of Intel; Copyright – Baskaran Ganesan, Intel Higher Education Program

14 Die

15 Wafer

16 CPUs: Archaic (Nostalgic) v. Semi-Modern v. Modern?
1982 Intel 80286: 12.5 MHz; 2 MIPS (peak); latency 320 ns; 134,000 xtors, 47 mm2; 16-bit data bus, 68 pins; microcode interpreter, separate FPU chip (no caches)
2001 Intel Pentium 4: 1500 MHz (120X); 4500 MIPS peak (2250X); latency 15 ns (20X); 42,000,000 xtors, 217 mm2; 64-bit data bus, 423 pins; 3-way superscalar, dynamic translation to RISC, superpipelined (22 stages), out-of-order execution; on-chip 8KB data cache, 96KB instruction trace cache, 256KB L2 cache
2015? (2015 homework?)

17 Multi-core = Network on a chip
Everything you learn as CSE students is applied and integrated in a single chip!

18 Intel Polaris with 80 cores
Figure courtesy of Intel; Copyright – Baskaran Ganesan, Intel Higher Education Program

19 Tilera processor with 64 cores
MIT startup from the Raw project (I used to be involved in this)

20 What is next: Nanoprocessors?
Molecular memory, NASIC processors, 3D?
Crossed NW devices: courtesy of Prof. Chui's group at UCLA
NASIC ALU: copyright NASIC group, UMASS

21 From Nanodevices to Nanocomputing
Crossed Nanowire Array: Physical Layer
Array-based Circuits with Built-in Fault-tolerance (NASICs)
Evaluation/Cascading: Streaming Control with Surrounding Microwires
Nanoprocessor Architectures
We're working on building up from nanowires, to transistors, to microprocessor architectures built on what we refer to as nanoscale fabrics, or computing structures. We know that we can build transistors from nanowires. We are building on that foundation to create processor architectures which can replace traditional scaled CMOS-based architectures as we move deeper into sub-micron territory towards true nanoscale territory. Here you can see semiconductor nanowires laid out in 3D. At each point where two wires cross, a field-effect transistor can be created; in this case, the upper three semiconductor nanowires gate channels on the lower horizontal NW. From these semiconductor nanowire arrays, by selectively constructing these FETs, we can build array-based circuits, and you can see here some FETs on an array, or a fabric. This 2-D array of FETs is the physical architecture that we're working on. All of our work is based on these arrays. We are studying the implications of this structure on all levels, from devices to circuits to architectures, up to microprocessors.

22 NASICs Fabric Based Architectures
Cellular Architecture: special purpose for image and signal processing; massively parallel array of identical interacting simple functional cells; fully programmable from external template signals; 22X denser than in 16nm scaled CMOS
Wire Streaming Processor: general purpose stream processor; 5-stage pipeline with minimal feedback; built-in fault tolerance, up to 10% device-level defect rates; 33X density advantage vs. 16nm scaled CMOS; simpler manufacturing; ~9X improved power-per-performance efficiency (rough estimate)
Layout of ALU

23 N3ASIC- 3D Nanowire Technology

24 N3P – Hybrid Spin-Charge Platform

25 Skybridge 3D Circuits - Vertically Integrated
3D circuit concept and 1-bit full adder, designed in my group
FETs are gate-all-around on vertical nanowires

26 Example ISAs in Processors (Instruction Set Architectures)
ARM (32-, 64-bit, v8) 1985
Digital Alpha (v1, v3) 1992
HP PA-RISC (v1.1, v2.0) 1986
Sun SPARC (v8, v9) 1987
MIPS (MIPS I, II, III, IV, V) 1986
Intel (8086, 80286, 80386, ..., Pentium, MMX, ...)
RISC vs. CISC

27 Basics Let us review some basics

28 RISC ISA Encoding Example
Fmt replaced with S or D

29 Virtualized ISAs BlueRISC TrustGUARD
ISA is randomly created internally Fluid - more than one ISA possible

30 Characteristics of RISC
Only load/store instructions access memory
A relatively large number of registers
Goals of new computer designs:
Higher performance
More functionality (e.g., MMX)
Other design objectives? (examples)

31 How to measure performance?
Time to run the task: execution time, response time, latency
Performance may be defined as 1 / Execution_time
Throughput, bandwidth

32 Speedup
performance(X) = 1 / execution_time(X)
"Y is n times faster than X" means:
n = speedup = Execution_time(X, old) / Execution_time(Y, new)
Speedup must be greater than 1: Tx/Ty = 3/2 = 1.5, but not Ty/Tx = 2/3 = 0.67

33 MIPS and MFLOPS
MIPS (Million Instructions Per Second): can we compare two different CPUs using MIPS?
MFLOPS (Million Floating-point Operations Per Second)
Application dependent (e.g., compiler)
Still useful for benchmarks
Benchmarks: e.g., SPEC CPU 2000: 26 applications (with inputs)
SPECint2000: twelve integer, e.g., gcc, gzip, perl
SPECfp2000: fourteen floating-point intensive, e.g., equake

34 SPEC CPU 2000 (www.specbench.org/cpu200)
SPECfp2000:
Benchmark / Language / Category
168.wupwise Fortran77 Quantum Chromodynamics
171.swim Fortran77 Shallow Water Modeling
172.mgrid Fortran77 Multi-grid Solver
173.applu Fortran77 Partial Differential Equations
177.mesa C 3-D Graphics Library
178.galgel Fortran90 Fluid Dynamics
179.art C Image Recognition / Neural Nets
183.equake C Seismic Wave Propagation
187.facerec Fortran90 Face Recognition
188.ammp C Computational Chemistry
189.lucas Fortran90 Primality Testing
191.fma3d Fortran90 Finite-element Crash Simulation
200.sixtrack Fortran77 Nuclear Physics Accelerator Design
301.apsi Fortran77 Meteorology: Pollutant Distribution
SPECint2000:
Benchmark / Language / Category
164.gzip C Compression
175.vpr C FPGA Circuit Place & Route
176.gcc C C Compiler
181.mcf C Combinatorial Optimization
186.crafty C Game Playing: Chess
197.parser C Word Processing
252.eon C++ Computer Visualization
253.perlbmk C PERL Programming Language
254.gap C Group Theory, Interpreter
255.vortex C Object-oriented Database
256.bzip2 C Compression
300.twolf C Place and Route Simulator

35 Spec2006 (still current)

36 Other Benchmarks
Workload Category: Example Benchmark Suite
CPU (Uniprocessor): SPEC CPU 2006; Java Grande Forum Benchmarks; SciMark; ASCI
CPU (Parallel Processor): SPLASH; NASPAR
Multimedia: MediaBench
Embedded: EEMBC benchmarks
Digital Signal Processing: BDTI benchmarks
Java (Client side): SPECjvm98; CaffeineMark
Java (Server side): SPECjBB2000; VolanoMark
Java (Scientific): Java Grande Forum Benchmarks; SciMark
Transaction Processing (On-Line): TPC-C; TPC-W
Decision Support Systems: TPC-H; TPC-R
Web Server: SPECweb99; TPC-W; VolanoMark
Electronic Commerce: TPC-W; SPECjBB2000
Mail Server: SPECmail2000
Network File System: SPEC SFS 2.0
Personal Computer: SYSMARK; WinBench; 3DMarkMAX99
Handheld devices: SPEC committee

37 Synthetic Benchmarks: Whetstone and Dhrystone
Whetstone ranking by MWIPS (per-test MFLOP ratings, Vl=1024, and N1/N8 timings omitted):
1. Pentium 4/3066 (ifc); 2. HP Superdome Itanium2; 3. HP RX5670 Itanium2/1500; 4. Pentium 4/2666 (ifc); 5. IBM pSeries 690 Turbo; 6. Compaq Alpha ES45; 7. HP RX4640 Itanium2; 8. IBM Regatta-HPC; 9. IBM pSeries 690 Turbo; 10. AMD Opteron 848
Dhrystone ratings (DMIPS, DMIPS/MHz, inline DMIPS at core frequency; numbers omitted) for MIPS cores: 4Kc, 4KEc, 5Kc, 5Kf, 20Kc

38 How do we design faster CPUs?
Faster technology - used to be the main approach, but: (a) getting more expensive, (b) reliability & yield, (c) speed of light (3x10^8 m/sec)
Larger dies (SOC - System On a Chip): fewer wires between ICs, but low yield (next slide)
Parallel processing - use n independent processors: limited success
n-issue superscalar microprocessor (currently n=4): can we expect a speedup of n?
Pipelining
Multi-threading

39 Power consumption
Dynamic power: P_dyn = α x C_L x Vdd^2 x f
Leakage: mainly from subthreshold conduction (the FETs leak current); significant at small feature sizes (lower Ion/Ioff)
Power-aware architectures: the objective is often to minimize activity
Role of compilers - control
Circuit-level optimizations - make the same work more efficient
CAD tools - e.g., clock gating - make it easy to add
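The dynamic (switching) power term alpha x C_L x Vdd^2 x f can be sketched in Python; the component values below are illustrative assumptions, not slide data:

```python
def dynamic_power(alpha: float, c_load: float, vdd: float, freq: float) -> float:
    """Dynamic (switching) power: alpha * C_L * Vdd^2 * f, in watts."""
    return alpha * c_load * vdd ** 2 * freq

# Hypothetical chip: 20% activity factor, 1 nF effective switched
# capacitance, 1.0 V supply, 2 GHz clock.
p_full = dynamic_power(0.2, 1e-9, 1.0, 2e9)   # about 0.4 W
p_half = dynamic_power(0.2, 1e-9, 0.5, 2e9)   # about 0.1 W
print(p_full, p_half)
```

The quadratic Vdd term is why voltage scaling dominates power-aware design: halving the supply voltage cuts dynamic power by 4x at the same frequency.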

40 Define and quantify power
Leakage current increases in processors with smaller transistor sizes
Increasing the number of transistors increases power even if they are turned off
Leakage is dominant below 90 nm
Very low power systems even gate voltage to inactive modules to control loss due to leakage

41 Define and quantify dependability (2/3)
Module reliability = measure of continuous service accomplishment (or time to failure). 2 metrics:
Mean Time To Failure (MTTF) measures reliability
Failures In Time (FIT) = the failure rate 1/MTTF, traditionally reported as failures per billion hours of operation
Mean Time To Repair (MTTR) measures service interruption
Mean Time Between Failures (MTBF) = MTTF + MTTR
Module availability (MA) measures service as it alternates between the two states of accomplishment and interruption (a number between 0 and 1, e.g., 0.9)
Module availability MA = MTTF / (MTTF + MTTR)

42 Example calculating reliability
If modules have exponentially distributed lifetimes (age of module does not affect probability of failure), overall failure rate is the sum of failure rates of the modules Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk), 1 disk controller (0.5M hour MTTF), and 1 power supply (0.2M hour MTTF):
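With exponential lifetimes, rates add, so the slide's example reduces to a sum of reciprocals. A Python sketch using the slide's own component MTTFs:

```python
def system_mttf(mttfs_hours):
    """MTTF of a system whose component failure rates simply add."""
    rate = sum(1.0 / m for m in mttfs_hours)  # failures per hour
    return 1.0 / rate

# 10 disks (1M-hour MTTF each), 1 controller (0.5M), 1 power supply (0.2M).
components = [1_000_000] * 10 + [500_000, 200_000]
rate = sum(1.0 / m for m in components)
fit = rate * 1e9                      # failures per billion hours
print(round(fit))                     # 17000 FIT
print(round(system_mttf(components)))  # 58824 hours
```

Note that the whole system's MTTF (about 59,000 hours, under 7 years) is far below the 1M-hour MTTF of any single disk.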

43 Example calculating reliability (solution)
Failure rate = 10 x (1/1,000,000) + 1/500,000 + 1/200,000 = (10 + 2 + 5)/1,000,000 failures per hour
FIT = 17,000 (failures per billion hours)
MTTF = 1,000,000,000 / 17,000 ≈ 59,000 hours (about 6.7 years)

44 Integrated Circuits Yield
Die yield = Wafer_yield x (1 + Defect_density x Die_area / α)^(-α)
(α models defect clustering; α ≈ 4 is a common value)
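The die-yield model, commonly written as yield = wafer_yield x (1 + defect_density x die_area / alpha)^(-alpha) with alpha around 4, can be sketched as follows (the defect density and die area below are illustrative assumptions):

```python
def die_yield(wafer_yield: float, defect_density: float,
              die_area: float, alpha: float = 4.0) -> float:
    """Die yield = wafer_yield * (1 + D*A/alpha)^(-alpha)."""
    return wafer_yield * (1 + defect_density * die_area / alpha) ** -alpha

# 0.4 defects/cm^2, 1 cm^2 die, perfect wafer processing:
print(round(die_yield(1.0, 0.4, 1.0), 3))  # 0.683
```

Because die area appears inside a power law, yield drops quickly for large dies, which is what makes big chips disproportionately expensive.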

45 Integrated Circuits Costs
IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield
Die cost = Wafer cost / (Dies per wafer x Die yield)
Dies per wafer = π x (Wafer_diam/2)^2 / Die_area - π x Wafer_diam / sqrt(2 x Die_area) - Test dies
Die cost goes up roughly with (Die_area)^2
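The dies-per-wafer and die-cost formulas above can be sketched in Python; the wafer price, diameter, and yield below are illustrative assumptions, not slide data:

```python
import math

def dies_per_wafer(wafer_diam: float, die_area: float, test_dies: int = 0) -> float:
    """Gross dies: wafer area over die area, minus an edge-loss term."""
    return (math.pi * (wafer_diam / 2) ** 2 / die_area
            - math.pi * wafer_diam / math.sqrt(2 * die_area)
            - test_dies)

def die_cost(wafer_cost: float, wafer_diam: float,
             die_area: float, die_yield: float) -> float:
    return wafer_cost / (dies_per_wafer(wafer_diam, die_area) * die_yield)

# Hypothetical 30 cm wafer, 1 cm^2 dies, $5000 wafer, 68% die yield:
print(round(dies_per_wafer(30, 1.0)))           # ~640 gross dies
print(round(die_cost(5000, 30, 1.0, 0.68), 2))  # ~$11.49 per good die
```

Doubling the die area both halves the gross die count and cuts yield, which is where the rough (Die_area)^2 cost scaling comes from.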

46 Amdahl's Law - Basics
Example: executing a program on n independent processors
Fraction_enhanced = parallelizable part of the program
Speedup_enhanced = n
ExTime_new = ExTime_old x ((1 - Fraction_enhanced) + Fraction_enhanced / n)
lim (n -> infinity) Speedup_overall = 1 / (1 - Fraction_enhanced)

47 Amdahl's Law - Graph
Law of Diminishing Returns: as the enhancement grows, overall speedup flattens toward the asymptote 1 / (1 - f_enh), because the unenhanced fraction (1 - f_enh) eventually dominates execution time.

48 Amdahl's Law - Extension
Example: improving part of a processor (e.g., multiplier, floating-point unit)
Fraction_enhanced = part of the program to be enhanced
Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
Speedup_overall < 1 / (1 - Fraction_enhanced)
A given signal-processing application consists of 40% multiplications. An enhanced multiplier will execute 5 times faster:
Speedup_overall = 1 / (0.6 + 0.4/5) = 1.47 < 1/0.6 = 1.67
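Amdahl's Law with the slide's multiplier example, as a small Python sketch:

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# 40% of the work is multiplications; the multiplier becomes 5x faster:
print(round(amdahl_speedup(0.4, 5), 2))              # 1.47
# Upper bound even with an infinitely fast multiplier:
print(round(amdahl_speedup(0.4, float("inf")), 2))   # 1.67
```

The second call shows the law of diminishing returns: no multiplier, however fast, can push this application past a 1.67x overall speedup.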

49 Amdahl's Law - Another Example
Floating-point instructions improved to run 2X faster, but only 10% of the actual run time is spent in FP instructions:
ExTime_new = ExTime_old x (0.9 + 0.1/2) = 0.95 x ExTime_old
Speedup_overall = 1 / 0.95 = 1.053

50 Instruction execution
Components of average execution time (the CPI Law):
CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
Average CPU time per program = Instruction_count x CPI / clock_rate
The "End to End Argument" is what RISC was ultimately about: it is the performance of the complete system that matters, not individual components!

51 Cycles Per Instruction - Another Performance Metric
"Average cycles per instruction": CPI = Total_no_of_cycles / Instruction_count
"CPI of individual instructions": CPI_j = CPI for instruction j (j = 1, ..., n); I_j = number of times instruction j is executed
"Instruction frequency": F_j = I_j / Instruction_count
CPI = sum over j = 1..n of (CPI_j x F_j)

52 Example: Calculating CPI
Base Machine (Reg/Reg)
Op / Freq / Cycles / CPI_j x F_j / (% Time)
ALU 50% 1 0.5 (33%)
Load 20% 2 0.4 (27%)
Store 10% 2 0.2 (13%)
Branch 20% 2 0.4 (27%)
Total CPI = 1.5
Typical mix of instruction types in a program
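The weighted-sum CPI computation as a Python sketch. The per-class cycle counts (1-cycle ALU, 2-cycle load/store/branch) are an assumption chosen because they reproduce the %-time figures printed on the slide:

```python
# Instruction mix: op -> (frequency, cycles per instruction of that class).
mix = {
    "ALU":    (0.50, 1),
    "Load":   (0.20, 2),
    "Store":  (0.10, 2),
    "Branch": (0.20, 2),
}

# CPI is the frequency-weighted sum of the per-class CPIs.
cpi = sum(f * c for f, c in mix.values())
print(round(cpi, 2))  # 1.5

# Each class's share of total execution time.
for op, (f, c) in mix.items():
    print(op, round(100 * f * c / cpi))  # 33, 27, 13, 27 (percent)
```

Note that loads are 20% of the instructions but 27% of the time, which is why memory operations get so much attention in pipeline design.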

53 Pipelining - Basics
4 consecutive operations Z = F(X,Y) = SqRoot(X^2 + Y^2)
Stage 1: X^2; Stage 2: + Y^2; Stage 3: SqRoot
If each step takes 1T, then one calculation takes 3T and four take 12T (unpipelined)
Assuming ideally that each stage takes 1T:
What will be the latency (time to produce the first result)?
What will be the throughput (pipeline rate in the steady state)?

54 Pipelining - Timing
Pipelined, the four operations take a total of 6T; Speedup = 12T/6T = 2
For n operations: time = 3T + (n-1)T (latency for the first result, then one result per T)
Speedup = 3nT / (3T + (n-1)T) = 3n / (n+2) -> 3 = # of stages, as n grows
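The pipeline speedup formula above, as a Python sketch for the 3-stage example:

```python
def pipeline_speedup(n_ops: int, n_stages: int = 3) -> float:
    """Unpipelined time: n*k*T.  Pipelined: (k + n - 1)*T.
    Speedup = n*k / (k + n - 1)."""
    return (n_ops * n_stages) / (n_stages + n_ops - 1)

print(pipeline_speedup(4))               # 12T / 6T = 2.0
print(round(pipeline_speedup(1000), 3))  # 2.994, approaching 3 (= stage count)
```

The speedup only approaches the stage count asymptotically because the first result still needs the full 3T fill latency.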

55 Pipelining - Non-ideal
Non-ideal situation:
1. Steps take T1, T2, T3; Rate = 1 / max(Ti): the slowest unit determines the throughput
2. To allow independent operation we must add latches: t = max(Ti) + t_latch
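The clock-rate limit set by the slowest stage plus latch overhead can be sketched as follows; the stage and latch delays below are made-up illustrative values:

```python
def pipeline_rate(stage_delays, t_latch: float) -> float:
    """Steady-state throughput = 1 / (max stage delay + latch delay)."""
    return 1.0 / (max(stage_delays) + t_latch)

# Hypothetical stages of 2, 3, and 5 ns behind 1 ns latches:
rate = pipeline_rate([2e-9, 3e-9, 5e-9], 1e-9)
print(rate)  # about 1.67e8 results/s, i.e. a 6 ns cycle
```

Note that the 2 ns and 3 ns stages are simply idle part of each cycle; balancing stage delays is as important as adding stages.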

56 Rule of Thumb for Latency Lagging BW
In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4 (and capacity improves faster than bandwidth) Stated alternatively: Bandwidth improves by more than the square of the improvement in Latency

57 Latency Lags Bandwidth (last ~20 years)
Performance Milestones (latency gain, bandwidth gain):
Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
CPU high, memory low ("Memory Wall")

58 Summary of Technology Trends
For disk, LAN, memory, and microprocessor, bandwidth improves by square of latency improvement In the time that bandwidth doubles, latency improves by no more than 1.2X to 1.4X Lag probably even larger in real systems, as bandwidth gains multiplied by replicated components Multiple processors in a cluster or even in a chip Multiple disks in a disk array Multiple memory modules in a large memory Simultaneous communication in switched LAN HW and SW developers should innovate assuming Latency Lags Bandwidth If everything improves at the same rate, then nothing really changes When rates vary, require real innovation

59 Summary of Architecture Trends
CMOS microprocessors focus on computing bandwidth with multiple cores
Accelerators for specialized support
Software to take advantage - Von Neumann design
As nanoscale technologies emerge, new architectural areas are created:
Unconventional architectures - not programmed; these would operate more like the brain, through learning and inference
As well as new opportunities for microprocessor design

60 Backup slides for students

61 6 Reasons Latency Lags Bandwidth
1. Moore's Law helps BW more than latency
Faster transistors, more transistors, and more pins help bandwidth:
MPU transistors: 0.134M vs. 42M xtors (300X)
DRAM transistors: 0.064M vs. 256M xtors (4000X)
MPU pins: 68 vs. 423 pins (6X)
DRAM pins: 16 vs. 66 pins (4X)
Smaller, faster transistors, but they communicate over (relatively) longer lines, which limits latency:
Feature size: 1.5 to 3 vs. 0.18 micron (8X, 17X)
MPU die size: 35 vs. 204 mm2 (ratio sqrt -> 2X)
DRAM die size: 47 vs. 217 mm2 (ratio sqrt -> 2X)

62 6 Reasons Latency Lags Bandwidth (cont'd)
2. Distance limits latency
Size of DRAM block -> long bit and word lines -> most of DRAM access time
Speed of light between computers on a network
3. Bandwidth is easier to sell ("bigger = better")
E.g., 10 Gbits/s Ethernet ("10 Gig") vs. its msec-scale latency
4400 MB/s DIMM ("PC4400") vs. its 50 ns latency
Even if it is just marketing, customers are now trained
Since bandwidth sells, more resources are thrown at bandwidth, which further tips the balance

63 6 Reasons Latency Lags Bandwidth (cont'd)
4. Latency helps BW, but not vice versa
Spinning the disk faster improves both bandwidth and rotational latency: 3600 RPM -> 15000 RPM = 4.2X; average rotational latency 8.3 ms -> 2.0 ms; other things being equal, this also helps BW by 4.2X
Lower DRAM latency -> more accesses/second (higher bandwidth)
Higher linear density helps disk BW (and capacity), but not disk latency: 9,550 BPI -> 533,000 BPI -> ~60X in BW

64 6 Reasons Latency Lags Bandwidth (cont'd)
5. Bandwidth hurts latency
Queues help bandwidth but hurt latency (queuing theory)
Adding chips to widen a memory module increases bandwidth, but higher fan-out on address lines may increase latency
6. Operating-system overhead hurts latency more than bandwidth
Long messages amortize overhead; overhead is a bigger part of short messages

