
1 ECE 371 Microprocessors Chapter 2 Microprocessors: A High Level View Herbert G. Mayer, PSU Status 10/13/2015 For use at CCUT Fall 2015 Some material inherited from Eric Krause @ PSU MS ECE

2 Syllabus: Introduction; Block Diagrams; Simple µP Operation; Arithmetic Logic Unit (ALU); BCD Arithmetic; Instruction Set Architecture (ISA); Iron Law of µP Performance; Amdahl’s Law; Registers and Score Board; Bibliography

3 Introduction A microprocessor (µP) is functionally similar to an old-fashioned mainframe (MF) computer, and even more so to a typical desktop (DT) computer. Yet there are notable differences! Like a mainframe (MF) and desktop (DT), a µP:
 Has a defined instruction set, named ISA, for Instruction Set Architecture
 Has a defined data width, e.g. 8-bit bytes, 60-bit words, etc.
 Has a defined address range, typically 2^32 or 2^64
 Includes an arithmetic unit for integer, floating-point, BCD, and bit operations
 Includes a memory subsystem, plus several levels of caches
 Includes an IO subsystem and has other peripheral devices

4 Introduction Different from an MF computer or DT, a typical µP:
 Has a smaller power envelope, in order to be usable in laptops
 Uses a lower main frequency, in order to generate less heat, also enabling laptop use with long battery life
 Manages power consumption and heat generation through frequency variation and module shut-down
 Integrates some peripheral devices onto the same silicon, again saving electrical power, saving board space, plus gaining some speed
 Provides a smaller number of duplicated arithmetic logic units (ALUs) that compute data; as a result some computations, given identical clock frequencies, run more slowly on a µP due to the lower degree of parallel execution of instructions
 Must fit into a small volume of space, e.g. into a laptop; as a result, there exist fewer options for cooling the actual CPUs

5 Introduction A µP CPU performs arithmetic-logical operations. Memory holds information, and has a path to/from the CPU via the main bus. How does the µP interact with the real world? IO ports connect external peripheral devices with HW registers r_i, and allow the r_i to access memory. Buses connect components: e.g. data bus, address bus, specific enabling bits

6 Introduction IO ports connect a variety of IO devices. A port could be a simple latch to read/write, or a separate embedded system, e.g. an ethernet controller. Common: General Purpose IO, or GPIO, which exposes raw bits. The IO controller and memory controller together can become as complex as the CPU: e.g. on a motherboard the CPU chip, generally the largest component, is surrounded by other, good-sized chips. The trend is to relocate some of their functions onto the same die as the CPU

7 Block Diagram of Itanium µP

8 Actual Photo of Itanium2 µP

9 Block Diagram of AMD Athlon

10 Block Diagram of Compaq Alpha

11 Die Image of Compaq Alpha

12 Simple µP Operation We’ll analyze a hypothetical, abstract µP operation. In effect the simplest generic operation is:
 Read numeric input from a keyboard; here we read an integer from the keyboard
 Perform an operation on the read operand: addition of the constant integer 7
 Display the newly computed result

13 Simple µP Operation, Step 1
1. Read numeric (e.g. integer) input data from keyboard, port 5:
 µP sends memory address of the Port Input instruction on the address bus
 µP sends memory read control signal
 Memory sends machine code for Port Read to µP
 µP decodes instruction, sends out port address 5 on the address bus
 µP sends control signal for Port Read
 Input port sends µP one byte from port 5, by placing it on the data bus

14 Simple µP Operation, Step 2
2. Add integer constant 7 to the data read from port 5:
 µP sends memory address of the Add instruction on the address bus
 µP sends control signal for Read on the control bus
 Memory returns machine code for Add
 µP adds integer constant 7 to the value previously read from input port 5

15 Simple µP Operation, Step 3
3. Display newly computed integer value:
 µP sends memory address of the Port Output instruction on the address bus
 µP sends memory read control signal
 Memory returns opcode (machine code) for the Port Output instruction
 µP decodes instruction, sends output port address (0x7) on the address bus
 µP sends sum from register on the data bus
 µP sends port write signal on the control bus
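The three steps above can be sketched as a toy fetch-decode-execute loop. This is a hedged illustration only: the opcodes IN, ADDI, OUT and the dictionary-of-ports model are invented for this sketch, not a real instruction set.

```python
# Toy µP: memory holds (opcode, operand) pairs; ports model the IO bus.
IN, ADDI, OUT = 0, 1, 2                 # illustrative opcodes

def run(program, ports):
    acc = 0                             # single accumulator register
    for opcode, operand in program:     # fetch + decode each instruction
        if opcode == IN:                # read one value from input port
            acc = ports[operand]
        elif opcode == ADDI:            # add immediate constant
            acc += operand
        elif opcode == OUT:             # write result to output port
            ports[operand] = acc
    return ports

# The slides' program: read port 5, add 7, display on output port 7
ports = run([(IN, 5), (ADDI, 7), (OUT, 7)], {5: 10, 7: 0})
# ports[7] now holds 10 + 7 = 17
```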

16 Arithmetic Logic Unit ALU A full adder produces C_OUT and SUM from three 1-bit inputs, A, B, and C_IN, just via logic gates. What does this do? See result below right:
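The full adder's gate equations can be sketched directly in code; this is the standard SUM/C_OUT logic, expressed with Python's bitwise operators standing in for XOR, AND, and OR gates.

```python
# 1-bit full adder from logic gates only:
#   SUM  = A xor B xor Cin
#   Cout = majority(A, B, Cin) = AB + ACin + BCin
def full_adder(a, b, cin):
    s = a ^ b ^ cin                       # two cascaded XOR gates
    cout = (a & b) | (a & cin) | (b & cin)  # three ANDs into an OR
    return s, cout
```

For every input combination, 2*Cout + SUM equals the arithmetic sum A + B + Cin.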

17 Arithmetic Logic Unit ALU

18 Arithmetic Logic Unit ALU This simple ALU performs any one of 16 arithmetic functions on two 4-bit words and a carry in. And what determines the function it performs? The SELECT inputs plus MODE! We are going to program this device to perform a function on the two 4-bit inputs and produce a result on the output pins

19 Arithmetic Logic Unit ALU

20 Arithmetic Logic Unit ALU Bitwise OR, meaning f_i = a_i + b_i, e.g. f_0 = a_0 + b_0, etc., as if there were a dual-input OR gate for each bit of the 2 inputs. The ALU can perform any of these functions based on how it is programmed! Thus we have a device usable in different ways based on a program; that is a fundamental part of a microprocessor. How do we program it? With an operation code applied to the SELECT lines. We call this the OpCode. We add 7; but how can this be done? Use the OpCode for Add, i.e. a binary value applied to the select lines, to select the operation
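The select-line idea can be sketched as a table of functions indexed by the OpCode. The four-entry opcode table below is illustrative only; it is not the actual 16-function table of the device on the slides.

```python
MASK = 0xF  # 4-bit words: keep only the low 4 result bits

# Illustrative SELECT-code table: the select lines pick the function
OPS = {
    0b0000: lambda a, b: a & b,   # bitwise AND
    0b0001: lambda a, b: a | b,   # bitwise OR: f_i = a_i + b_i per bit
    0b0010: lambda a, b: a ^ b,   # bitwise XOR
    0b0011: lambda a, b: a + b,   # binary add (carry out is dropped here)
}

def alu(select, a, b):
    """Apply the function chosen by the select lines to two 4-bit words."""
    return OPS[select](a, b) & MASK
```

Applying select code 0b0011 with one input wired to 7 is exactly the "use the OpCode for Add" idea from the slide.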

21 Binary Coded Decimal Arithmetic Binary Coded Decimal is a machine-internal data type. BCD uses 4 bits per decimal digit, but computes in binary, to re-use the integer ALU part of the processor. Since only 10 of the 16 numeric choices are used, it is necessary to add 6 to each group of 4 bits whose computed digit exceeds 9 or produces a carry. The raw binary addition would be correct only in binary, and incorrect in BCD; adding 6 per BCD digit is that needed correction

22 Binary Coded Decimal Arithmetic With correction, the BCD ALU part works like the integer add part, with the additional step of adding 6; the adder is already implemented in HW. Only constant storage for decimal 6 is needed. Correction algorithm, given 2 BCD digits A and B:
if sum( A, B ) >= 1010_2 or carry( A, B ) = 1 then
   sum = add( sum( A, B ), 0110_2 );
end if;
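The correction algorithm above can be sketched as code; the digit-list representation (least significant digit first) is an assumption of this sketch, chosen so the carry ripples naturally.

```python
def bcd_add_digit(a, b, cin=0):
    """Add two BCD digits (0..9) plus carry-in; return (digit, carry)."""
    raw = a + b + cin
    if raw >= 10:          # sum >= 1010_2, or the 4-bit add carried out
        raw += 6           # add the 0110_2 correction
        return raw & 0xF, 1
    return raw, 0

def bcd_add(x, y):
    """Add two BCD numbers given as equal-length digit lists, LSD first."""
    out, carry = [], 0
    for a, b in zip(x, y):
        d, carry = bcd_add_digit(a, b, carry)
        out.append(d)
    if carry:
        out.append(1)      # final carry becomes a new leading digit
    return out
```

For example, digits 7 + 5 give raw binary 1100_2; adding 0110_2 yields digit 2 with carry 1, i.e. decimal 12.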

23 Binary Coded Decimal Arithmetic

24 Binary Coded Decimal Arithmetic

25 Binary Coded Decimal Arithmetic

26 BCD Correction Circuit

27 BCD Correction Circuit The ALU is in the center of the above diagram. The Register File is an ordered sequence of flip-flops that can be accessed collectively. The core of the algorithm is to do 2 adds. The /4 means: this is a 4-bit line. MUX B will either pass the second operand, or add 0000_2, or add the correction 0110_2; it has three 4-bit input options. Registers in front of the mux hold values so they are stable for the mux. The output register holds the value of the first addition, usable for the second addition

28 BCD Correction Circuit To read the 2 operands from the reg file, send them into the A/B registers and do the first addition. Then use the second addition: it takes the result of the first addition, and MUX B switches to add 6 or 0. How do we know which? By inspecting the flag register. In this simple case, one bit stores carry information. The output register result goes into the correction detection logic: it decides whether to add 0 or add 6. Combinational logic: if Carry, or if one of these 6 signals, assert the FIX signal that goes to the controller

29 BCD Correction Circuit
 Select desired operand registers in register file
 Transfer operand A from register file to A operand register
 Transfer operand B from register file to B operand register
 Set MUX A to select A for ALU
 Set MUX B to select B for ALU
 Apply ADD code to ALU select inputs
 Transfer first sum into the ALU output register
 Transfer COUT and other flags into the Status register
 Set MUX A to select temporary sum for the ALU A input
 Set MUX B to select correction 0110 or 0000, based on the first result
 Transfer final sum to ALU output register
 Select register in register file to write final result
 Write result to selected register in register file

30 Instruction Set Architecture (ISA) The ISA is the boundary between Software (SW) and Hardware (HW). It specifies the logical machine that is visible to the programmer and compiler, and is the functional specification for processor designers. That boundary is sometimes a very low-level piece of system software that handles exceptions, interrupts, and HW-specific services; it could fall into the domain of the OS

31 Instruction Set Architecture (ISA) Specified by an ISA:
 Operations: what to perform and in which order
 Temporary operand storage in the CPU: accumulator, stack, registers. Note that a stack can be word-sized, even bit-sized (design of the successor for NCR’s Century architecture of the 1970s)
 Number of operands per instruction
 Operand location: where and how to specify/locate the operands
 Type and size of operands
 Instruction encoding in binary

32 Instruction Set Architecture ISA: Dynamic Static Interface (DSI)

33 Iron Law of µP Performance Clock-rate doesn’t count, bus width doesn’t count, the number of registers and operations executed in parallel doesn’t count! What counts is how long it takes for my computational task to complete. That time is of the essence of computing! If a MIPS-based solution runs at 1 GHz, completing a program X in 2 minutes, while an Intel Pentium® 4-based solution runs at 3 GHz and completes that same program X in 2.5 minutes, programmers are more interested in the MIPS solution

34 Iron Law of µP Performance If a solution on an Intel CPU can be expressed in an object program of size Y bytes, but on an IBM architecture requires size 1.1 Y bytes, the Intel solution is generally more attractive, assuming the same execution performance. Meaning of this: wall-clock time (Time) is the time the user has to wait for program completion; program size is an indicator of the overall physical complexity of the computational task

35 Iron Law of µP Performance
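The classic iron law behind these slides decomposes execution time as Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle). A minimal sketch, with illustrative numbers for the MIPS vs. Pentium example (the instruction counts are assumptions, not from the slides):

```python
def exec_time(instructions, cpi, clock_hz):
    """Iron law: wall-clock time = IC * CPI / clock rate."""
    return instructions * cpi / clock_hz

# Assumed workloads: what matters is wall-clock time, not clock rate.
# A 1 GHz machine finishing in 120 s beats a 3 GHz machine needing 150 s.
mips_time = exec_time(120 * 10**9, 1.0, 1 * 10**9)     # 120 s
pentium_time = exec_time(150 * 10**9, 3.0, 3 * 10**9)  # 150 s
```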

36 Amdahl’s Law Articulated by Gene Amdahl during the 1967 AFIPS conference; yes, computers existed then already. It states that the maximum speedup of a program P is dominated by its sequential portion S. I.e. if some part of P can be perfectly or infinitely accelerated due to numerous parallel processors, but some part S of P is inherently sequential, then the resulting performance is dominated by S. See the Wikipedia sample:

37 Amdahl’s Law (From Wikipedia) The speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program. For example, if 95% of the program can be parallelized, the theoretical maximum speedup using parallel computing would be 20, no matter how many processors are used.
n = number of processors
B ∈ [0, 1] = the strictly sequential fraction of the program
T(n) = time to execute with n processors
T(n) = T(1) ( B + (1 − B) / n )
S(n) = speedup = T(1) / T(n)
S(n) = 1 / ( B + (1 − B) / n )
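The formula above translates directly into code; the sketch below reproduces the 95%-parallelizable example, where the speedup approaches 1/B = 20 as n grows.

```python
def speedup(b, n):
    """Amdahl's law: S(n) = 1 / (B + (1 - B)/n), B = sequential fraction."""
    return 1.0 / (b + (1.0 - b) / n)

# 95% parallelizable means B = 0.05; the limit for large n is 1/0.05 = 20,
# no matter how many processors are used.
```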

38 Amdahl’s Law (From Wikipedia)

39 Registers, Dependences, Score Board in a Microprocessor

40 Register Dependencies Inter-instruction dependencies, in engineering parlance also known as dependences, arise between registers or memory locations being defined and used. One instruction computes a result into a register (or memory); another instruction needs that result from that same register, or from that same memory location. Or, one instruction uses a resource, and after its use the same resource is newly recomputed. Dependences require sequential execution, lest the result be unpredictable

41 Register Dependencies
True-Dependence, AKA Data Dependence (the terms are synonymous):
1: r3 ← r1 op r2
2: r5 ← r3 op r4
Read after Write, RAW
Anti-Dependence, not a true dependence; can parallelize under the right condition:
1: r3 ← r1 op r2
2: r1 ← r5 op r4
Write after Read, WAR
Output Dependence, similar to Anti-Dependence:
1: r3 ← r1 op r2 (write)
2: r5 ← r3 op r4 (use)
3: r3 ← r6 op r7 (write)
Write after Write, WAW, with a use in between

42 Register Dependencies Control Dependence:
// r_i, i = 1..4 come in “live”
if ( condition1 ) {
   r3 = r1 op r2;
} else {            //  see the jump here?
   r5 = r3 op r4;
} // end if
write( r3 );

43 Register Renaming Only data dependence is a real dependence, hence called true dependence. The other dependences are artifacts of insufficient resources, generally insufficient registers. This means: if additional registers were available, then replacing some of these conflicting registers with new, other registers could make the conflict disappear! Anti- and Output-Dependences are indeed such false dependences

44 Register Renaming Original Code:
-- r2, r3, r5, r6, r7 come in “live” ... code before
-- r1, r4 are not “live”, don’t have values
L1: r1 ← r2 op r3
L2: r4 ← r1 op r5
L3: r1 ← r3 op r6
L4: r3 ← r1 op r7
Dependences before: which pairs Lx, Ly, with x, y = 1..4, have which dependence?

45 Register Renaming Original Code:
L1: r1 ← r2 op r3
L2: r4 ← r1 op r5
L3: r1 ← r3 op r6
L4: r3 ← r1 op r7
Initial Dependences, lots of dependences:
L1, L2 true-Dep with r1
L1, L3 output-Dep with r1
L1, L4 anti-Dep with r3
L3, L4 true-Dep with r1
L2, L3 anti-Dep with r1
L3, L4 anti-Dep with r3

46 Register Renaming What could we change, if we had some additional registers? Compute and use other temporaries via other registers to reduce dependences! That sometimes allows a higher degree of parallelism, due to fewer dependences. More parallelism means faster execution!

47 Register Renaming
Original Code:
L1: r1 ← r2 op r3
L2: r4 ← r1 op r5
L3: r1 ← r3 op r6
L4: r3 ← r1 op r7
New Code, after adding regs:
L1: r10 ← r2 op r30  -- r10 instead of r1, r30 instead of r3
L2: r4 ← r10 op r5   -- r10 instead of r1
L3: r1 ← r30 op r6
L4: r3 ← r1 op r7
Dependences before:
L1, L2 true-Dep with r1
L1, L3 output-Dep with r1
L1, L4 anti-Dep with r3
L3, L4 true-Dep with r1
L2, L3 anti-Dep with r1
L3, L4 anti-Dep with r3
Dependences after:
L1, L2 true-Dep with r10
L3, L4 true-Dep with r1
// r_i, i = 1..7 are “live”

48 Register Renaming With these additional, renamed regs, the new code could possibly run in half the time! First: compute into r10 instead of r1; you need to have such an additional register r10; no time penalty! Also: compute in the preceding code into r30 instead of r3, if r30 is available; also no time penalty! Then the following regs are live afterwards: r1, r3, r4, plus the non-modified ones, i.e. r2! r2 came in live, must go out live! While r10 and r30 are don’t-cares afterwards
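The dependence bookkeeping of the renaming example can be sketched as code. This is a hypothetical helper (not from the slides' toolchain): each instruction is a (destination, sources) pair, a RAW pairs a read with its most recent writer, a WAW pairs a write with the most recent previous writer, and a WAR pairs a write with every read since that register was last written.

```python
def deps(instrs):
    """instrs: list of (dest, set_of_sources).
    Returns the set of (i, j, kind, reg) dependences, i < j."""
    found = set()
    last_writer = {}   # reg -> index of most recent writer
    readers = {}       # reg -> indices reading it since its last write
    for j, (dest, srcs) in enumerate(instrs):
        for r in srcs:                       # reads precede the write
            if r in last_writer:
                found.add((last_writer[r], j, "RAW", r))
        if dest in last_writer:
            found.add((last_writer[dest], j, "WAW", dest))
        for i in readers.get(dest, []):
            found.add((i, j, "WAR", dest))
        last_writer[dest] = j
        readers[dest] = []                   # write clears pending reads
        for r in srcs:
            readers.setdefault(r, []).append(j)
    return found

# L1..L4 from the slides (indices 0..3 here):
original = [("r1", {"r2", "r3"}), ("r4", {"r1", "r5"}),
            ("r1", {"r3", "r6"}), ("r3", {"r1", "r7"})]
# After renaming with the extra registers r10 and r30:
renamed = [("r10", {"r2", "r30"}), ("r4", {"r10", "r5"}),
           ("r1", {"r30", "r6"}), ("r3", {"r1", "r7"})]
```

deps(original) yields the six dependences listed on the slides; deps(renamed) yields only the two true dependences, which is what permits the faster schedule.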

49 Score Board A score board is an array of single-bit HW programmable resources sb[]. It manages other HW resources, specifically registers. In this single-bit HW array sb[], every bit i in sb[i] is associated with a specific machine register, the one identified by i, i.e. r_i. The association is by index, i.e. by name: sb[i] belongs to reg r_i. Only if sb[i] = 0 does register r_i have valid data, and can therefore be used: if sb[i] = 0 then register r_i is NOT in the process of being written. If bit i is set, i.e. if sb[i] = 1, then register r_i is reserved, i.e. it cannot be used. Initially all sb[*] are set to 0; hence all registers r_i can be used

50 Score Board Execution constraints for r_d ← r_s op r_t:
 If sb[s] or sb[t] is set → RAW dependence, hence stall the computation; wait until both r_s and r_t are available
 If sb[d] is set → WAW dependence, hence stall the write; wait until r_d has been used; SW can sometimes determine to use another register instead of r_d
 Else, if none of the 3 registers are in use, dispatch the instruction immediately
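The execution constraints above can be sketched as a minimal scoreboard class. This is an illustrative model under the slide's conventions (sb[i] = 1 means r_i is reserved), not the design of any specific CPU.

```python
class ScoreBoard:
    def __init__(self, nregs=32):
        self.sb = [0] * nregs          # initially all 0: every r_i usable

    def can_dispatch(self, d, s, t):
        """r_d <- r_s op r_t: dispatch only if no hazard."""
        if self.sb[s] or self.sb[t]:   # RAW: a source is being written
            return False
        if self.sb[d]:                 # WAW: destination still reserved
            return False
        return True

    def issue(self, d):
        self.sb[d] = 1                 # reserve the destination register

    def retire(self, d):
        self.sb[d] = 0                 # result written: r_d usable again
```

For example, after issuing r3 ← r1 op r2, a following r5 ← r3 op r4 must stall until r3 retires.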

51 Score Board Out-of-order (ooo) execution has long been common on microprocessors, where speed is critical. To allow ooo execution, upon computing the value of r_d: update r_d, and clear sb[d]. For uses (reads), HW may use any register i whose sb[i] is 0. For definitions (writes), HW may set any register j whose sb[j] is 0. This is independent of the original order in which the source program was written, i.e. possibly ooo. Ooo instructions retire in original order

52 Bibliography
1. http://en.wikipedia.org/wiki/Flynn's_taxonomy
2. http://www.ajwm.net/amayer/papers/B5000.html
3. http://www.robelle.com/smugbook/classic.html
4. http://www.intel.com/design/itanium/manuals.htm
5. http://www.csupomona.edu/~hnriley/www/VonN.html
6. http://cva.stanford.edu/classes/ee482s/scribed/lect11.pdf

