CS455/CpE 442 Intro. to Computer Architecture

CS455/CpE 442 Intro. to Computer Architecture: Review for Term Exam

The Role of Performance (Text, 3rd Edition, Chapter 4)
Main focus topics:
- Compare the performance of different architectures, or architectural variations, in executing a given application
- Determine the CPI for an executable application on a given architecture
HW1 solutions: 2.11, 2.12, 2.13
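Both topics above reduce to the basic performance equation (standard Chapter 4 material, restated here for reference):

  CPU time = Instruction Count x CPI x Clock cycle time
           = (IC x CPI) / Clock rate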

Q2.13 [10] <§§2.2-2.3> Consider two different implementations, M1 and M2, of the same instruction set. There are three classes of instructions (A, B, and C) in the instruction set. M1 has a clock rate of 400 MHz, and M2 has a clock rate of 200 MHz. The average number of cycles per instruction (CPI) for each class on M1 and M2, and the instruction mixes produced by compilers C1, C2, and C3, are given in the following table:

Class   CPI on M1   CPI on M2   Mix for C1   Mix for C2   Mix for C3
A       4           2           30%          30%          50%
B       6           4           50%          20%          30%
C       8           3           20%          50%          20%

i.   Using C1 on both M1 and M2, how much faster can the makers of M1 claim that M1 is compared with M2?
ii.  Using C2 on both M1 and M2, how much faster can the makers of M2 claim that M2 is compared with M1?
iii. If you purchase M1, which of the three compilers would you choose?
iv.  If you purchase M2, which of the three compilers would you choose?

Sol.
Using compiler C1:
  M1: CPI = 0.3*4 + 0.5*6 + 0.2*8 = 5.8
      CPU time per instruction = CPI / Clock rate = 5.8 / (400*10^6) = 0.0145*10^-6 s
  M2: CPI = 0.3*2 + 0.5*4 + 0.2*3 = 3.2
      CPU time per instruction = 3.2 / (200*10^6) = 0.016*10^-6 s
  Thus, M1 is 0.016 / 0.0145 = 1.10 times as fast as M2.

Using compiler C2 (same method):
  M1: CPI = 6.4, CPU time per instruction = 6.4 / (400*10^6) = 0.016*10^-6 s
  M2: CPI = 2.9, CPU time per instruction = 2.9 / (200*10^6) = 0.0145*10^-6 s
  Thus, M2 is 0.016 / 0.0145 = 1.10 times as fast as M1.

Using the third-party compiler C3:
  M1: CPI = 5.4, CPU time per instruction = 5.4 / (400*10^6) = 0.0135*10^-6 s
  M2: CPI = 2.8, CPU time per instruction = 2.8 / (200*10^6) = 0.014*10^-6 s
  Thus, M1 is 0.014 / 0.0135 = 1.04 times as fast as M2.

The third-party compiler C3 is the superior product regardless of which machine is purchased, and M1 is the machine to purchase when using it.
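As a cross-check, the whole exercise is a weighted-sum calculation and is easy to script. Below is a minimal C sketch (not part of the original solution; the machine and compiler data are transcribed from the table above):

#include <stdio.h>

int main(void) {
    double cpi[2][3] = {{4, 6, 8}, {2, 4, 3}};       /* CPI per class: [machine][A,B,C] */
    double mix[3][3] = {{.3, .5, .2},                /* instruction mix: [compiler][A,B,C] */
                        {.3, .2, .5},
                        {.5, .3, .2}};
    double clock[2] = {400e6, 200e6};                /* M1 = 400 MHz, M2 = 200 MHz */

    for (int c = 0; c < 3; c++) {
        for (int m = 0; m < 2; m++) {
            double eff_cpi = 0;                      /* effective CPI = sum(CPI_i * freq_i) */
            for (int i = 0; i < 3; i++)
                eff_cpi += cpi[m][i] * mix[c][i];
            /* average time per instruction, in seconds */
            printf("C%d on M%d: CPI = %.2f, time/instr = %.4e s\n",
                   c + 1, m + 1, eff_cpi, eff_cpi / clock[m]);
        }
    }
    return 0;
}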

The Instruction Set Architecture (Text, Ch. 2)
- Compare instruction set architectures based on their complexity: instruction format, number of operands, addressing modes, operations supported
- Instruction set architecture types:
  - Register-to-register
  - Register-to-memory
  - Memory-to-memory
HW2 solutions

2.51 Suppose we have made the following measurements of average CPI for instruction classes:
  Arithmetic: 1.0 clock cycles
  Data Transfer: 1.4 clock cycles
  Conditional Branch: 1.7 clock cycles
  Jump: 1.2 clock cycles
Compute the effective CPI for MIPS. Average the instruction frequencies for SPEC2000int and SPEC2000fp in Figure 2.48 to obtain the instruction mix.

Class          CPI   Avg. Freq (int & fp)   CPI x Freq
Arithmetic     1.0   .36                    .36
Data Transfer  1.4   .375                   .525
Cond. Branch   1.7   .12                    .204
Jump           1.2   .03                    .036
                                    Total:  1.125

The effective CPI for MIPS is 1.125. This figure is somewhat inaccurate because the table does not include logical operations, so the frequencies do not sum to 100%.
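In formula form, this is the same weighted average used in Q2.13:

  Effective CPI = sum over instruction classes i of (CPI_i x Frequency_i)
                = 1.0(.36) + 1.4(.375) + 1.7(.12) + 1.2(.03) = 1.125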

The Processor: Datapath and Control (Text, Ch. 5)
- The datapath organization: functional units and their interconnections needed to support the instruction set
- The control unit design: hardwired vs. microprogrammed design
HW3 and HW4

Single-cycle control extended with a JMPReg signal for jr (the values below are the standard single-cycle control settings; JMPReg is asserted only for jr):

Instr    RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  Branch  ALUOp  JMPReg
R-type   1       0       0         1         0        0       10     0
lw       0       1       1         1         1        0       00     0
sw       x       1       x         0         0        0       00     0
beq      x       0       x         0         0        1       01     0
jr       x       x       x         0         0        0       xx     1

Single-cycle control extended with an LUICtr signal for lui (standard settings for the base instructions; the lui row assumes LUICtr selects the immediate shifted left 16 bits as the register write data):

Instr    RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  Branch  ALUOp  LUICtr
R-type   1       0       0         1         0        0       10     0
lw       0       1       1         1         1        0       00     0
sw       x       1       x         0         0        0       00     0
beq      x       0       x         0         0        1       01     0
lui      0       x       x         1         0        0       xx     1
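The two tables above are exactly what a hardwired main control unit implements: a purely combinational decode of the opcode. A minimal C sketch of that decode (the Ctrl struct and function names are illustrative, not from the slides; the MIPS opcodes are standard):

#include <stdint.h>

typedef struct {            /* one bit per control line; ALUOp is 2 bits */
    uint8_t RegDst, ALUSrc, MemtoReg, RegWrite;
    uint8_t MemRead, Branch, ALUOp, JMPReg, LUICtr;
} Ctrl;

/* Combinational decode of the 6-bit opcode (don't-cares encoded as 0). */
Ctrl decode(uint8_t opcode) {
    Ctrl c = {0};
    switch (opcode) {
    case 0x00:  /* R-type (jr would additionally decode funct == 0x08 to set JMPReg) */
        c.RegDst = 1; c.RegWrite = 1; c.ALUOp = 2; break;
    case 0x23:  /* lw */
        c.ALUSrc = 1; c.MemtoReg = 1; c.RegWrite = 1; c.MemRead = 1; break;
    case 0x2b:  /* sw (MemWrite, not shown in the slide's table, is also asserted) */
        c.ALUSrc = 1; break;
    case 0x04:  /* beq */
        c.Branch = 1; c.ALUOp = 1; break;
    case 0x0f:  /* lui */
        c.RegWrite = 1; c.LUICtr = 1; break;
    }
    return c;
}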

The concept of the "critical path", the longest possible path through the machine, was introduced in Section 5.4 (page 315). Based on the single-cycle implementation, which units can tolerate more delay (i.e., are not on the critical path), and which units would benefit from hardware optimization? Quantify your answer using the numbers on page 315.

The longest path is the load instruction: instruction memory -> register file -> ALU -> data memory -> register file.

Using the numbers from page 315:
  Memory units: 200 ps
  ALU and adders: 100 ps
  Register file: 50 ps
  Critical path (lw) = 200 + 50 + 100 + 200 + 50 = 600 ps

The paths through the PC-increment and branch adders can tolerate more delay because they do not lie on the critical path. Optimizing any unit on the critical path (the memories, register file, or ALU) shortens the critical path and therefore the cycle time.
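For comparison, the same component latencies give the total path delay for each instruction class; a small C sketch (the per-instruction unit traversals follow the single-cycle datapath; this enumeration is mine, not from the slide):

#include <stdio.h>

#define MEM 200  /* instruction or data memory, ps */
#define ALU 100  /* ALU or adder, ps */
#define REG 50   /* register file read or write, ps */

int main(void) {
    /* Units traversed by each instruction class in the single-cycle datapath */
    printf("R-type: %d ps\n", MEM + REG + ALU + REG);        /* 400 ps */
    printf("lw:     %d ps\n", MEM + REG + ALU + MEM + REG);  /* 600 ps, the critical path */
    printf("sw:     %d ps\n", MEM + REG + ALU + MEM);        /* 550 ps */
    printf("beq:    %d ps\n", MEM + REG + ALU);              /* 350 ps */
    printf("j:      %d ps\n", MEM);                          /* 200 ps */
    return 0;
}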

(Multicycle control FSM, continued: the fetch state asserts MemRead with IorD = 0 and computes PC = PC + 4; the diagram adds states for LDI.)

Micro-program for LDI

Pipelined Architectures (Text, Ch. 6)
- Stages of a pipelined datapath
- Pipeline hazards
- Pipelined performance: number of cycles to execute a code segment (and the effective CPI); look for dependencies in sequences involving lw and branch instructions (delay cycles)

HW5 6.22
  lw  $4, 100($2)
  sub $6, $4, $3
  add $2, $3, $5
Number of cycles = k + (n - 1) + delay cycles = 5 + 2 + 1 = 8, where k = number of stages and n = number of instructions (the lw/sub dependence costs one delay cycle).
Effective CPI = #cycles / #instructions = 8/3 = 2.67
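The same count falls out of a one-line formula; a small C helper (my sketch, using the slide's k/n/stall terminology):

#include <stdio.h>

/* Total cycles for n instructions on a k-stage pipeline with the given stalls */
int pipeline_cycles(int k, int n, int stalls) {
    return k + (n - 1) + stalls;
}

int main(void) {
    int cycles = pipeline_cycles(5, 3, 1);   /* HW5 6.22: lw/sub dependence = 1 stall */
    printf("cycles = %d, effective CPI = %.2f\n", cycles, (double)cycles / 3);
    return 0;
}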

The Memory Hierarchy (Text, Ch. 7)
- The levels of the memory hierarchy, and the principle of locality
- Cache design: direct-mapped, fully associative, and set-associative
- Cache access: factors affecting the miss rate and the miss penalty
- Virtual memory: address mapping, page tables, and the TLB
HW6
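A useful summary formula for the cache-access bullet (standard Chapter 7 material, not from this slide) ties hit time, miss rate, and miss penalty together:

  Average memory access time (AMAT) = Hit time + Miss rate x Miss penalty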

1 KB Direct-Mapped Cache with 32 B Blocks
(Diagram: the 32-bit address splits into a cache tag in bits 31-10, a cache index in bits 9-5, and a byte select in bits 4-0. Example: tag 0x50, index 0x01, byte 0x00. Each of the 32 entries stores a valid bit and the tag as part of the cache "state", alongside its 32-byte data block; the data array spans bytes 0-1023.)
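Pulling those fields out of an address is straightforward bit manipulation; a minimal C sketch for this 1 KB / 32 B-block geometry (variable names and the example address are mine, chosen to reproduce the slide's tag/index/byte values):

#include <stdint.h>
#include <stdio.h>

/* Field widths for a 1 KB direct-mapped cache with 32 B blocks:
 * 32 blocks -> 5 index bits; 32 B/block -> 5 byte-select bits; tag = 22 bits. */
int main(void) {
    uint32_t addr = 0x00014020;             /* example address */
    uint32_t byte_sel = addr & 0x1f;        /* bits 4-0  */
    uint32_t index = (addr >> 5) & 0x1f;    /* bits 9-5  */
    uint32_t tag = addr >> 10;              /* bits 31-10 */
    printf("tag=0x%x index=0x%x byte=0x%x\n", tag, index, byte_sel);  /* 0x50, 0x01, 0x00 */
    return 0;
}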

And Yet Another Extreme Example: Fully Associative
(Diagram: with 32 B blocks there is no cache index; bits 31-5 form a 27-bit cache tag and bits 4-0 the byte select, e.g. byte 0x01. Every entry stores a valid bit and its tag, and the incoming tag is compared against all entries in parallel.)
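In hardware that tag comparison happens in parallel across all entries; in software a lookup is just a linear search. A hedged C sketch (the Entry layout mirrors the Problem 7.46 code later in this review; names and the entry count are illustrative):

#include <stdbool.h>
#include <stdint.h>

#define NUM_ENTRIES 32

typedef struct { bool valid; uint32_t tag; uint8_t data[32]; } Entry;

/* Returns the matching entry index, or -1 on a miss. In hardware every
 * comparator checks its entry simultaneously; here we scan sequentially. */
int lookup(Entry cache[NUM_ENTRIES], uint32_t addr) {
    uint32_t tag = addr >> 5;               /* 27-bit tag, 5-bit byte select */
    for (int i = 0; i < NUM_ENTRIES; i++)
        if (cache[i].valid && cache[i].tag == tag)
            return i;
    return -1;
}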

Review: 4-way set associative

HW6 Problem 1: 32-bit address space, 32 KB cache

Direct-mapped cache (32-byte blocks):
  Byte select = 5 bits (low-order bits 4-0)
  Number of blocks = 32768 / 32 = 1024, so cache index = log2(1024) = 10 bits (next low-order bits)
  Tag = 32 - byte select - cache index = 32 - 5 - 10 = 17 bits (high-order bits)

8-way set-associative cache (16-byte blocks, 8 blocks per set):
  Byte select = 4 bits
  Sets = 32768 bytes / 128 bytes per set = 256 sets, so cache index = log2(256) = 8 bits
  Tag = 32 - 8 - 4 = 20 bits

Fully associative cache (128-byte blocks):
  Byte select = 7 bits
  No cache index, because a block from memory can be placed in any cache entry
  Tag = 32 - 7 = 25 bits
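All three cases follow one recipe: offset bits from the block size, index bits from the number of sets, tag bits from whatever remains. A small C sketch of that recipe (my generalization of the worked answers above):

#include <stdio.h>

static int log2i(unsigned x) {              /* x is assumed to be a power of two */
    int n = 0;
    while (x >>= 1) n++;
    return n;
}

/* Prints tag/index/offset widths for a cache configuration (32-bit addresses). */
void fields(unsigned cache_bytes, unsigned block_bytes, unsigned assoc) {
    unsigned sets = cache_bytes / (block_bytes * assoc);
    int offset = log2i(block_bytes);
    int index = log2i(sets);                /* 0 bits when fully associative */
    printf("offset=%d index=%d tag=%d\n", offset, index, 32 - index - offset);
}

int main(void) {
    fields(32768, 32, 1);      /* direct-mapped:   offset=5 index=10 tag=17 */
    fields(32768, 16, 8);      /* 8-way set-assoc: offset=4 index=8  tag=20 */
    fields(32768, 128, 256);   /* fully assoc (256 blocks, one set): offset=7 index=0 tag=25 */
    return 0;
}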

Problem 7.46

word ReadDirectMappedCache(address a)
{
    /* One-word blocks: the cache is indexed by a.index, one word per entry. */
    static Entry cache[CACHE_SIZE_IN_WORDS];
    Entry *e = &cache[a.index];             /* reference the entry so updates persist */
    if (e->valid == FALSE || e->tag != a.tag) {
        /* miss: fetch the word from memory and update the entry */
        e->valid = TRUE;
        e->tag = a.tag;
        e->data = load_from_memory(a);
    }
    return e->data;
}

Modified to the following for multi-word blocks:

word ReadDirectMappedCache(address a)
{
    /* Multi-word blocks: one entry per block, e->data is now an array of words. */
    static Entry cache[CACHE_SIZE_IN_BLOCKS];
    Entry *e = &cache[a.index];
    if (e->valid == FALSE || e->tag != a.tag) {
        /* miss: load_from_memory now returns the entire block */
        e->valid = TRUE;
        e->tag = a.tag;
        e->data = load_from_memory(a);
    }
    /* a.word_index comes from the block-offset bits of the address */
    return e->data[a.word_index];
}