1. CS455/CpE 442 Intro. to Computer Architecture
Review for Term Exam
2. The Role of Performance
Text, 3rd Edition, Chapter 4
Main focus topics:
Compare the performance of different architectures or architectural variations in executing a given application
Determine the CPI for an executable application on a given architecture
HW1 solutions: 2.11, 2.12, 2.13
3. Q2.13
Consider two different implementations, M1 and M2, of the same instruction set. There are three classes of instructions (A, B, and C) in the instruction set. M1 has a clock rate of 400 MHz, and M2 has a clock rate of 200 MHz. The average number of cycles per instruction (CPI) for each class of instruction on M1 and M2 is given in the following table:

Class  CPI on M1  CPI on M2  Instruction mix for C1  Instruction mix for C2  Instruction mix for C3
A      4          2          30%                     30%                     50%
B      6          4          50%                     20%                     30%
C      8          3          20%                     50%                     20%

i. Using C1 on both M1 and M2, how much faster can the makers of M1 claim that M1 is compared with M2?
ii. Using C2 on both M1 and M2, how much faster can the makers of M2 claim that M2 is compared with M1?
iii. If you purchase M1, which of the three compilers would you choose?
iv. If you purchase M2, which of the three compilers would you choose?
4. Sol.
Using C1 compiler:
M1: CPU clock cycles (average per instruction) = 0.3*4 + 0.5*6 + 0.2*8 = 5.8
    CPU time = CPU CC / Clock Rate = 5.8 / (400*10^6) = 0.0145*10^-6 s
M2: CPU CC = 3.2
    CPU time = 3.2 / (200*10^6) = 0.016*10^-6 s
Thus, M1 is 0.016 / 0.0145 = 1.10 times as fast as M2.
Using C2 compiler:
Using the above method,
M1: CPU time = 0.016*10^-6 s
M2: CPU time = 0.0145*10^-6 s
Thus, M2 is 0.016 / 0.0145 = 1.10 times as fast as M1.
Using the 3rd-party compiler (C3):
M1: CPU time = 0.0135*10^-6 s
M2: CPU time = 0.014*10^-6 s
Thus, M1 is 0.014 / 0.0135 = 1.04 times as fast as M2.
The third-party compiler is the superior product regardless of machine purchase.
M1 is the machine to purchase when using the third-party compiler.
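The arithmetic in this solution can be checked with a short script. This is only a sketch: the M1 CPI values (4, 6, 8) and the C1 mix come from the worked solution, while the M2 CPI values (2, 4, 3) are an assumption chosen to reproduce the stated M2 average of 3.2 cycles. The function names are illustrative.

```python
# Per-class CPI. M1 values appear in the worked solution; the M2 values are an
# assumption that reproduces the stated M2 average of 3.2 cycles under C1.
CPI = {"M1": {"A": 4, "B": 6, "C": 8}, "M2": {"A": 2, "B": 4, "C": 3}}
CLOCK_HZ = {"M1": 400e6, "M2": 200e6}

# Instruction mix produced by each compiler (fractions of A, B, C).
MIX = {
    "C1": {"A": 0.3, "B": 0.5, "C": 0.2},
    "C2": {"A": 0.3, "B": 0.2, "C": 0.5},
    "C3": {"A": 0.5, "B": 0.3, "C": 0.2},  # the third-party compiler
}


def average_cpi(machine, compiler):
    """Weighted average of cycles per instruction for one machine/compiler pair."""
    return sum(MIX[compiler][cls] * CPI[machine][cls] for cls in "ABC")


def time_per_instruction(machine, compiler):
    """Average CPU time per instruction in seconds: CPI / clock rate."""
    return average_cpi(machine, compiler) / CLOCK_HZ[machine]


def compare(compiler):
    """Return (faster machine, speedup ratio) for the given compiler."""
    t1 = time_per_instruction("M1", compiler)
    t2 = time_per_instruction("M2", compiler)
    if t1 <= t2:
        return "M1", t2 / t1
    return "M2", t1 / t2


for compiler in MIX:
    machine, ratio = compare(compiler)
    print(f"{compiler}: {machine} is {ratio:.2f} times as fast")
```

Comparing the per-instruction times across compilers for a single machine answers parts iii and iv in the same way.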
5. The Instruction Set Architecture
Text, Ch. 2
Compare instruction set architectures based on their complexity (instruction format, number of operands, addressing modes, operations supported)
Instruction set architecture types:
Register-to-register
Register-to-memory
Memory-to-memory
HW2 solutions
6. 2.51
Suppose we have made the following measurements of average CPI for instructions:

Arithmetic          1.0 clock cycles
Data Transfer       1.4 clock cycles
Conditional Branch  1.7 clock cycles
Jump                1.2 clock cycles

Compute the effective CPI for MIPS. Average the instruction frequencies for SPEC2000int and SPEC2000fp in Figure 2.48 to obtain the instruction mix.

Class          CPI  Avg. Freq (int & fp)  C x F
Arithmetic     1.0  .36                   .36
Data Transfer  1.4  .375                  .525
Cond. Branch   1.7  .12                   .204
Jump           1.2  .03                   .036
Effective CPI                             1.125

The effective CPI for MIPS is 1.125. This seems inaccurate because the table does not include the CPI for logical operations.
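The effective CPI is a frequency-weighted sum, which a few lines of Python can reproduce. This is a sketch: the conditional-branch frequency of 0.12 is inferred from the stated total of 1.125, and the variable names are illustrative.

```python
# (class, CPI, average frequency over SPEC2000int and SPEC2000fp)
# The 0.12 branch frequency is inferred so that the products sum to 1.125.
rows = [
    ("Arithmetic", 1.0, 0.36),
    ("Data Transfer", 1.4, 0.375),
    ("Cond. Branch", 1.7, 0.12),
    ("Jump", 1.2, 0.03),
]

# Effective CPI: sum of CPI x frequency over all listed classes.
effective_cpi = sum(cpi * freq for _, cpi, freq in rows)

# Fraction of the instruction mix that the four classes account for.
covered = sum(freq for _, _, freq in rows)

print(f"effective CPI = {effective_cpi:.3f}")
print(f"fraction of instructions covered = {covered:.3f}")
```

The covered fraction comes out below 1, which is the numeric form of the remark that logical operations are missing from the table.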
7. The Processor: Data Path and Control
Text, Ch. 5
The data path organization: functional units and their interconnections needed to support the instruction set
The control unit design
Hardwired vs. microprogrammed design
HW3 and HW4
10. Using the numbers from pg 315
The concept of the "critical path", the longest possible path in the machine, was introduced in 5.4. Based on your understanding of the single-cycle implementation, show which units can tolerate more delays (i.e., are not on the critical path), and which units can benefit from hardware optimization. Quantify your answers using the same numbers presented on page 315.

The longest path is that of the load instruction (instruction memory, register file, ALU, data memory, register file). It can benefit from hardware optimization.
Unit delays:
Memory units: 200 ps
ALU and adders: 100 ps
Register file: 50 ps
Critical path = 200 + 50 + 100 + 200 + 50 = 600 ps (for lw)
The adders and the path between them and the PC can tolerate more delay because they do not lie on the critical path. Any unit on the critical path (ALU, register file, data memory) would benefit from hardware optimization, since speeding it up shortens the critical path.
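The critical-path reasoning can be sketched as a maximum over per-instruction path delays. Only the lw path is spelled out in the solution above; the sw, R-type, and beq paths below are assumptions that follow the usual single-cycle datapath breakdown, and all names are illustrative.

```python
# Unit delays in picoseconds, as quoted in the solution.
DELAY_PS = {
    "instruction memory": 200,
    "register read": 50,
    "ALU": 100,
    "data memory": 200,
    "register write": 50,
}

# Units each instruction class passes through. Only the lw path is given in
# the solution; the other three are assumed from the usual single-cycle design.
PATHS = {
    "lw": ["instruction memory", "register read", "ALU", "data memory", "register write"],
    "sw": ["instruction memory", "register read", "ALU", "data memory"],
    "R-type": ["instruction memory", "register read", "ALU", "register write"],
    "beq": ["instruction memory", "register read", "ALU"],
}


def path_delay(instruction):
    """Total delay in ps along the path used by one instruction class."""
    return sum(DELAY_PS[unit] for unit in PATHS[instruction])


# The critical path belongs to the instruction with the largest total delay.
critical = max(PATHS, key=path_delay)
cycle_time = path_delay(critical)

for name in PATHS:
    slack = cycle_time - path_delay(name)
    print(f"{name}: {path_delay(name)} ps, slack {slack} ps")
```

The slack column shows how much extra delay each non-critical instruction path could absorb before it would lengthen the clock cycle.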
13. Pipelined Architectures
Text, Ch. 6
Stages of a pipelined data path
Pipeline hazards
Pipelined performance: number of cycles to execute a code segment (and the effective CPI); look for dependencies in sequences involving lw and branch instructions (delay cycles)
HW5
6.22
lw $4, 100($2)
sub $6, $4, $3
add $2, $3, $5
The sub reads $4 immediately after the lw loads it, which costs one delay cycle.
number of cycles = k + (n-1) + delay cycles = 5 + 2 + 1 = 8
eff. CPI = #cycles / #instructions = 8/3 = 2.67
k = no. of stages, n = no. of instructions
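The cycle-count formula from this slide, k + (n-1) + delay cycles, is easy to express as a small helper. This is a minimal sketch with illustrative function names; the delay-cycle count is supplied by hand rather than detected from the code sequence.

```python
def pipeline_cycles(stages, instructions, delay_cycles=0):
    """Cycles to run a sequence: fill the pipeline, then one per instruction, plus stalls."""
    return stages + (instructions - 1) + delay_cycles


def effective_cpi(stages, instructions, delay_cycles=0):
    """Effective CPI = total cycles / number of instructions."""
    return pipeline_cycles(stages, instructions, delay_cycles) / instructions


# The 6.22 sequence: 5 stages, 3 instructions, 1 delay cycle after the lw.
print(pipeline_cycles(5, 3, 1))
print(round(effective_cpi(5, 3, 1), 2))
```

For long sequences with few stalls, the k - 1 fill cycles are amortized and the effective CPI approaches 1.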
14. The Memory Hierarchy
Text, Ch. 7
The levels of the memory hierarchy, and the principle of locality
Cache design: direct-mapped, fully associative, and set associative
Cache access, factors affecting the miss rate, and the miss penalty
Virtual memory, address map, page tables, and the TLB
HW6
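15. 1 KB Direct Mapped Cache with 32 B Blocks
[Figure: a 32-bit address (bits 31 down to 0) split into three fields: Cache Tag (example: 0x50), Cache Index (bits 9 down to 5, example: 0x01), and Byte Select (bits 4 down to 0, example: 0x00). The cache has 32 entries, numbered 0 to 31. Each entry holds a Valid Bit, a Cache Tag stored as part of the cache "state", and 32 bytes of Cache Data: Byte 0 to Byte 31 in entry 0, Byte 32 to Byte 63 in entry 1, and so on up to Byte 992 to Byte 1023 in entry 31. The entry shown with tag 0x50 is entry 1.]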
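The address split for this cache can be sketched in a few lines. The code assumes a 32-bit byte address and derives the field widths from the 1 KB capacity and 32 B block size; the function names are illustrative.

```python
CACHE_BYTES = 1024   # 1 KB cache
BLOCK_BYTES = 32     # 32 B blocks
NUM_BLOCKS = CACHE_BYTES // BLOCK_BYTES        # 32 entries

OFFSET_BITS = BLOCK_BYTES.bit_length() - 1     # 5 bits of byte select
INDEX_BITS = NUM_BLOCKS.bit_length() - 1       # 5 bits of cache index


def split_address(addr):
    """Split a byte address into (tag, index, byte_select) for this cache."""
    byte_select = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, byte_select


def join_address(tag, index, byte_select):
    """Rebuild the byte address from its three fields."""
    return (tag << (OFFSET_BITS + INDEX_BITS)) | (index << OFFSET_BITS) | byte_select


# The slide's example fields: tag 0x50, index 0x01, byte select 0x00.
addr = join_address(0x50, 0x01, 0x00)
print(hex(addr))
print([hex(field) for field in split_address(addr)])
```

A lookup would use the index to pick one of the 32 entries, then compare the stored tag with the address tag and check the valid bit to decide hit or miss.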