Chapter 2 Microprocessors: From High-Level View Down to Register Level


1 Chapter 2 Microprocessors: From High-Level View Down to Register Level
ECE 485/585 Microprocessors. Herbert G. Mayer, PSU. Status 10/3/2016. Parts gratefully taken with permission from Eric, PSU

2 Syllabus
- Introduction
- Block Diagrams
- Simple μP Operation
- Arithmetic Logic Unit ALU
- BCD Arithmetic
- Instruction Set Architecture (ISA)
- Iron Law of μP Performance
- Amdahl’s Law
- Registers and Score Board
- Bibliography

3 Introduction
A microprocessor μP is functionally similar to an old-fashioned mainframe (MF) computer, and even more so to a typical desktop (DT) computer
There are notable differences! Some were already listed in “Microprocessor Characterization: CPU & Memory”
Like a mainframe (MF) and desktop (DT), a μP:
- Has a defined instruction set, named ISA, for Instruction Set Architecture
- Has a defined data width, e.g. 8-bit bytes, 60-bit words, etc.
- Has a defined address range, typically 2^32 or 2^64
- Includes an arithmetic unit for integer, floating-point, BCD, and bit operations
- Includes a memory subsystem, plus several levels of caches
- Includes an IO subsystem and has other peripheral devices

4 Introduction
Different from a MF computer or DT, a typical μP:
- Has a smaller power envelope, for use in laptops
- Can vary its main frequency, in order to generate less heat at times, also enabling laptop use with longer battery life
- Manages power consumption and heat generation through frequency variation and selective module shut-down
- Integrates some peripheral devices onto the same silicon, again saving electrical power, saving board space, plus gaining speed
- Provides a smaller number of (or no) duplicated arithmetic logical units (ALU); thus some computations, given identical clock frequencies, run more slowly on a μP due to the lower degree of parallelism
- Must fit into a small physical space, e.g. into a laptop; as a result, there exist fewer options for cooling the CPU

5 Introduction
- A μP CPU performs arithmetic-logical operations
- Memory holds information, has a path to/from the CPU via the main bus
- How does a μP interact with the real world?
- IO ports connect external peripheral devices with HW registers ri, and allow ri to access memory
- Buses connect components, e.g. data bus, address bus, specific enabling bits

6 Introduction
- IO ports connect a variety of IO devices
- Could be a simple latch to read/write
- Or a separate embedded system, e.g. an Ethernet controller
- Common: General-Purpose IO, or GPIO, which exposes raw bits
- IO controller and memory controller together can become as complex as the CPU
- E.g. on a motherboard the CPU chip, generally the largest component, is surrounded by other, good-sized chips
- Trend is to relocate some of their functions onto the same die as the CPU
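
To make the GPIO notion concrete, here is a minimal C sketch of memory-mapped GPIO access; the register address 0x40000000 and the single data register are illustrative assumptions, not taken from any particular μP:

    #include <stdint.h>

    /* Hypothetical memory-mapped GPIO data register (address is an assumption) */
    #define GPIO_DATA ((volatile uint8_t *)0x40000000u)

    void gpio_demo(void) {
        uint8_t bits = *GPIO_DATA;              /* read the raw input bits      */
        *GPIO_DATA  = (uint8_t)(bits | 0x01u);  /* drive bit 0 high, keep rest  */
    }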

7 Block Diagram of Itanium μP

8 Actual Photo of Itanium2 μP

9 Block Diagram of AMD Athlon

10 Block Diagram of Compaq Alpha

11 Die Image of Compaq Alpha

12 Simple μP Operation
We’ll analyze hypothetical, abstract μP operations
In effect the simplest generic operation is to:
1. Read numeric input from a keyboard; here we read an integer from the keyboard
2. Perform an operation on the read operand: addition of the constant integer 7, for example
3. Display the new, computed result

13 Simple μP Operation, Step 1
1. Read numeric (e.g. integer) input data from keyboard, port 5:
- μP sends memory address for Port Input instruction on the address bus
- μP sends memory read control signal
- Memory sends machine code for Port Read to μP
- μP decodes instruction, sends out port 5 on the address bus
- μP sends control signal for Port Read
- Input port sends μP one byte from port 5, by placing it on the data bus

14 Simple μP Operation, Step 2
2. Add integer constant 7 to the data read from port 5:
- μP sends memory address for Add instruction on the address bus
- μP sends control signal for Read on the control bus
- Memory returns machine code for Add
- μP adds integer constant 7 to the value previously read from input port 5

15 Simple μP Operation, Step 3
3. Display the new, computed integer:
- μP sends memory address for Port Output instruction on the address bus
- μP sends memory read control signal
- Memory returns opcode (machine code) for Port Output instruction
- μP decodes instruction, sends output port address (0x7) on the address bus
- μP sends sum from register on the data bus
- μP sends port write signal on the control bus
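
The whole three-step sequence can be mirrored in a short C sketch; read_port() and write_port() are hypothetical helpers standing in for the bus transactions spelled out in steps 1-3, and the port numbers follow the slides:

    #include <stdint.h>

    /* Hypothetical helpers modeling the address/data/control bus traffic */
    extern uint8_t read_port(uint8_t port);             /* Port Read  */
    extern void    write_port(uint8_t port, uint8_t v); /* Port Write */

    void simple_up_operation(void) {
        uint8_t value = read_port(5);   /* Step 1: read one byte from input port 5 */
        uint8_t sum   = value + 7;      /* Step 2: add the integer constant 7      */
        write_port(0x7, sum);           /* Step 3: send the sum to output port 0x7 */
    }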

16 Arithmetic Logic Unit ALU
A full adder produces COUT and SUM from 3 one-bit inputs, A, B, and CIN
Built just from logic gates
What does this do? See the result below right:
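
A one-bit full adder is easy to state in C as pure gate logic, assuming the standard equations (SUM as the XOR of the three inputs, COUT as the majority function):

    /* 1-bit full adder from logic gates: a, b, cin are each 0 or 1 */
    void full_adder(unsigned a, unsigned b, unsigned cin,
                    unsigned *sum, unsigned *cout) {
        *sum  = a ^ b ^ cin;                      /* XOR of all three inputs   */
        *cout = (a & b) | (a & cin) | (b & cin);  /* carry = majority function */
    }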

17 Arithmetic Logic Unit ALU

18 Arithmetic Logic Unit ALU
This simple ALU performs any one of 16 arithmetic functions on two 4-bit words and a carry-in
And what determines the function it performs? The SELECT lines plus MODE!
The device can be programmed to perform a function on the two 4-bit inputs and produce a result on the output pins

19 Arithmetic Logic Unit ALU

20 Arithmetic Logic Unit ALU
- Bitwise OR is expressed as + here, meaning: fi = ai + bi, e.g. f0 = a0 + b0, etc., as if there were a dual-input OR gate for each input bit
- The ALU can perform any of these functions, based on how it is programmed!
- Thus we have a device usable in different ways, based on a program
- That is a fundamental function and part of a microprocessor
- How do we program it? With an operation code applied to the SELECT lines. We call this the OpCode
- We add 7; but how can this be done? Use the OpCode for Add, i.e. a binary value applied to the select lines, to select the operation (see the sketch below)
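
A minimal C sketch of this idea: the select code acts as the OpCode and picks one of several functions over two 4-bit operands. Only a few of the 16 functions are shown, and the particular code values are illustrative assumptions, not the real function table of this device:

    #include <stdint.h>

    /* Illustrative 4-bit ALU: 'select' plays the role of the OpCode.     */
    /* The function codes below are assumptions, not a real device table. */
    uint8_t alu4(uint8_t select, uint8_t a, uint8_t b) {
        uint8_t f;
        switch (select) {
        case 0x0: f = a & b; break;   /* bitwise AND     */
        case 0x1: f = a | b; break;   /* bitwise OR      */
        case 0x2: f = a ^ b; break;   /* bitwise XOR     */
        case 0x9: f = a + b; break;   /* binary Add      */
        default:  f = a;     break;   /* pass A through  */
        }
        return f & 0x0F;              /* keep 4-bit width */
    }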

21 Binary Coded Decimal Arithmetic
- Binary Coded Decimal (BCD) is one of various plausible data types
- BCD uses 4 bits per decimal digit: thus 6 of the 16 encodings are wasted per decimal digit
- But computation can be done in binary, making it possible to re-use the integer ALU part of the processor
- Since only 10 of the 16 numeric choices are used, it is necessary to add 6 to each decimal digit whose sum exceeds 9
- The original binary addition is correct in binary for all 10 digits, but incorrect in BCD for sums > 9; the latter need correction
- Adding the decimal value 6 (six) to each BCD digit sum that is greater than 9 is that needed correction

22 Binary Coded Decimal Arithmetic
With correction, the BCD part of the ALU works like the integer add part, with the additional step of adding 6; the adder is already implemented in HW
Only constant storage for decimal 6 is needed
Correction algorithm, given 2 BCD digits A and B:
if sum( A, B ) >= 10 or carry( A, B ) = 1 then
    sum = add( sum( A, B ), 6 );
end if;
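
A C sketch of this correction rule for one digit position, assuming operands arrive as valid BCD digits 0..9:

    /* Add two BCD digits plus carry-in; apply the +6 correction    */
    /* when the binary sum leaves the decimal range, per the rule.  */
    unsigned bcd_add_digit(unsigned a, unsigned b, unsigned cin,
                           unsigned *cout) {
        unsigned s = a + b + cin;   /* plain binary add, max 9+9+1 = 19   */
        if (s > 9) {
            s += 6;                 /* correction: skip codes 10..15      */
            *cout = 1;              /* decimal carry into the next digit  */
        } else {
            *cout = 0;
        }
        return s & 0x0F;            /* keep only the corrected low digit  */
    }

For example, bcd_add_digit(9, 8, 0, &c) returns 7 with c = 1, i.e. the decimal result 17.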

23 Binary Coded Decimal Arithmetic

24 Binary Coded Decimal Arithmetic

25 Binary Coded Decimal Arithmetic

26 BCD Correction Circuit

27 BCD Correction Circuit
- The ALU is in the center of the diagram above
- The Register File is an ordered sequence of flip-flops that can be accessed collectively
- The core of the algorithm is to do 2 adds
- The /4 means: this is a 4-bit line
- MUX B will pass either the second operand, or 0000, or the correction 0110
- It has 3 four-bit input options
- Registers in front of the mux hold values so they are stable for the mux
- The output register holds the value of the first addition, usable for the second addition

28 BCD Correction Circuit
- To read the 2 operands from the register file, send them into the A/B registers and do the first addition
- Then use the second addition: it takes the result of the first addition, and MUX B switches to add 6 or 0
- How do we know which of the 2 options to apply? By inspecting the flag register
- In this simple case, one bit stores the carry information
- The output register result goes into the correction detection logic: it decides whether to add 0 or to add 6
- Combinational logic: if Carry, or if one of these 6 signals, assert the FIX signal that goes to the controller

29 BCD Correction Circuit
1. Select desired operand registers in the register file
2. Transfer operand A from register file to the A operand register
3. Transfer operand B from register file to the B operand register
4. Set MUX A to select A for the ALU
5. Set MUX B to select B for the ALU
6. Apply the ADD code to the ALU select inputs
7. Transfer the first sum to the ALU output register
8. Transfer COUT and other flags to the status register
9. Set MUX A to select the temporary sum for the ALU A input
10. Set MUX B to select correction 0110 or 0000, based on the 1st result
11. Transfer the final sum to the ALU output register
12. Select a register in the register file to write the final result
13. Write the result to the selected register in the register file
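
The sequence can be modeled in C as a simplified two-pass datapath; the names (regA, regB, MUX B, FIX) follow the diagram, but the model is an assumption-laden sketch, not the actual controller:

    /* Simplified model of the two-pass BCD datapath from the diagram.  */
    /* Pass 1: A + B through the ALU; pass 2: add 0110 or 0000 per FIX. */
    unsigned bcd_datapath(unsigned regA, unsigned regB,
                          unsigned *carry_flag) {
        unsigned sum1  = regA + regB;                   /* first ALU pass     */
        unsigned carry = (sum1 > 0x0F);                 /* COUT into flag reg */
        unsigned fix   = carry || ((sum1 & 0x0F) > 9);  /* correction detect  */
        unsigned muxB  = fix ? 0x6 : 0x0;               /* MUX B: 0110 / 0000 */
        unsigned sum2  = (sum1 & 0x0F) + muxB;          /* second ALU pass    */
        *carry_flag = carry || (sum2 > 0x0F);           /* final decimal carry */
        return sum2 & 0x0F;                             /* corrected digit    */
    }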

30 Instruction Set Architecture (ISA)
- The ISA is the boundary between Software (SW) and Hardware (HW)
- Specifies the logical machine, visible to compiler & programmer
- Is the functional specification for processor designers
- That boundary is sometimes a very low-level piece of system software that handles exceptions, interrupts, and HW-specific services
- Could fall into the OS domain

31 Instruction Set Architecture (ISA)
Specified by an ISA are:
- Operations: what to perform and in which order
- Temporary operand storage in the CPU: accumulator, stack, registers
- Note that the stack can be word-sized, even bit-sized (e.g. the successor to NCR’s Century architecture of the 1970s)
- Operands: number of operands per instruction
- Operand location: where and how to specify/locate the operands
- Type and size of operands
- Instruction encoding in binary

32 Instruction Set Architecture
ISA: the Dynamic-Static Interface (DSI)

33 Iron Law of μP Performance
Clock rate doesn’t count, bus width doesn’t count, the number of registers and operations executed in parallel doesn’t count! What counts is how long it takes for my computational task to complete. That time is of the essence of computing!
If a MIPS-based solution runs at 1 GHz and completes a program X in 2 minutes, while an Intel Pentium® 4 based solution runs at 3 GHz and completes that same program X in 2.5 minutes, the MIPS solution is better, since it is faster!

34 Iron Law of μP Performance
If a solution on an Intel CPU can be expressed in an object program of size Y bytes, but on an IBM architecture requires size 1.1 Y bytes, the Intel solution is generally more attractive
Assuming the same execution time, i.e. the same level of performance
Meaning of this:
- Wall-clock time (Time) is the time the user has to wait for program completion
- Program size is an indicator of the overall physical complexity of the computational task
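
For reference, the classic formulation of the Iron Law, presumably what slide 35 depicts:

    \frac{\text{Time}}{\text{Program}} =
        \frac{\text{Instructions}}{\text{Program}} \times
        \frac{\text{Cycles}}{\text{Instruction}} \times
        \frac{\text{Time}}{\text{Cycle}}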

35 Iron Law of μP Performance

36 Amdahl’s Law
Articulated by Gene Amdahl during the 1967 AFIPS conference; yes, computers existed then already
Stating that the maximum speedup of a program P is dominated by its sequential portion S
I.e. if some part of P can be perfectly or infinitely accelerated due to numerous parallel processors, but some part S of P is inherently sequential, then the resulting performance is dominated by S
See the Wikipedia sample:

37 Amdahl’s Law (From Wikipedia)
The speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program. For example, if 95% of the program can be parallelized, the theoretical maximum speedup using parallel computing would be 20, no matter how many processors are used
- n ∈ N, the number of processors
- B ∈ [0, 1], the strictly serial fraction of the program; 0% parallel code → B = 1, 100% parallel code → B = 0
- T(n) = time to execute the program with n processors: T(n) = T(1) * ( B + (1 - B) / n )
- S(n) = speedup with n processors = T(1) / T(n)
- S(n) = 1 / ( B + (1 - B) / n )
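
A small C check of the formula; it reproduces the 95% example above, where the speedup approaches 1/B = 20 as n grows (the function name is ours):

    #include <stdio.h>

    /* Amdahl speedup S(n) = 1 / (B + (1-B)/n), B = strictly serial fraction */
    double amdahl_speedup(double B, double n) {
        return 1.0 / (B + (1.0 - B) / n);
    }

    int main(void) {
        /* 95% parallelizable => B = 0.05; the limit is 1/B = 20 */
        printf("S(10)  = %.2f\n", amdahl_speedup(0.05, 10.0));   /* ~6.90  */
        printf("S(100) = %.2f\n", amdahl_speedup(0.05, 100.0));  /* ~16.81 */
        printf("S(1e6) = %.2f\n", amdahl_speedup(0.05, 1e6));    /* ~20.00 */
        return 0;
    }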

38 Amdahl’s Law (From Wikipedia)

39 Registers and Dependencies

40 Registers on Intel x86
The Intel x86 is known as a register-starved architecture
The need for object-code compatibility extended the life of the x86 architecture beyond anyone’s imagination

41 Registers on IBM 370
IBM’s 360/370 mainframe architecture preceded the x86 and had a more regular and richer register set
Various formats: half-word, word, extended formats

42 Register & Data Dependencies
- Inter-instruction dependencies, in CS parlance also known as dependences, arise between registers (or memory locations) being defined and used
- One instruction computes a result into a register (or memory); another instruction needs that result from that same register (or that memory location)
- Or, one instruction uses a datum; and after its use the same item is then recomputed
- Dependences require sequential execution, lest the result be unpredictable, i.e. wrong!

43 Register Dependencies
True-Dependence, AKA Data Dependence: <- synonymous!
  r3 ← r1 op r2
  r5 ← r3 op r4
  Read after Write, RAW

Anti-Dependence, not a true dependence; can parallelize under the right condition:
  r3 ← r1 op r2
  r1 ← r5 op r4
  Write after Read, WAR

Output Dependence, similar to Anti-Dependence; can do something:
  r3 ← r1 op r2
  r5 ← r3 op r4
  r3 ← r6 op r7
  Write after Write, WAW, with a use in between

44 Register Dependencies
Control Dependence:
  // ri, i = 1..4 come in “live”
  if ( condition1 ) {
      r3 = r1 op r2;
  } else {              // <- see the jump here?
      r5 = r3 op r4;
  } // end if
  write( r3 );

45 Register Renaming
- Only a data dependence is a real dependence, hence called true dependence
- Other dependences are artifacts of insufficient resources, generally insufficient registers
- This means: if additional registers were available, then replacing some of these conflicting registers with new ones could make the conflict disappear!
- Anti- and Output-Dependences are indeed such false dependences

46 Register Renaming
Original Code:
  L1: r1 ← r2 op r3
  L2: r4 ← r1 op r5
  L3: r1 ← r3 op r6
  L4: r3 ← r1 op r7
Compute the dependences before making register changes
The term “register r is live at instruction foo” means: some other instruction at foo+i is known to reference register r, without there being another assignment to r between foo and foo+i

47 Register Renaming
Original Code:
  L1: r1 ← r2 op r3
  L2: r4 ← r1 op r5
  L3: r1 ← r3 op r6
  L4: r3 ← r1 op r7
Dependences before:
  L1, L2 true-Dep with r1
  L1, L3 output-Dep with r1
  L1, L4 anti-Dep with r3
  L3, L4 true-Dep with r1
  L2, L3 anti-Dep with r1
  L3, L4 anti-Dep with r3

48 Register Renaming
What changes could a programmer or compiler make, for the sake of decreasing register dependences, if more resources (registers) were available?
If fewer dependences existed, a higher degree of parallelism could be achieved
Thus execution speed could be increased!

49 Register Renaming
Original Code:               New Code, after adding regs:
  L1: r1 ← r2 op r3            L1: r10 ← r2  op r30  -- r30 instead
  L2: r4 ← r1 op r5            L2: r4  ← r10 op r5   -- r10 instead
  L3: r1 ← r3 op r6            L3: r1  ← r30 op r6
  L4: r3 ← r1 op r7            L4: r3  ← r1  op r7
// ri, i = 1..7 are “live”
Dependences before:          Dependences after:
  L1, L2 true-Dep with r1      L1, L2 true-Dep with r10
  L1, L3 output-Dep with r1    L3, L4 true-Dep with r1
  L1, L4 anti-Dep with r3
  L3, L4 true-Dep with r1
  L2, L3 anti-Dep with r1
  L3, L4 anti-Dep with r3

50 Register Renaming
- With these additional, renamed regs, the new code could possibly run in half the time!
- First: compute into r10 instead of r1; needs the additional register r10; no time penalty!
- Also: in the preceding code, store the result into r30 instead of r3, if r30 is available; creates no added time penalty!
- Then the following regs are live afterwards: r1, r3, r4, plus the non-modified ones, i.e. r2!
- Caveat: r2 came in live, must go out live!
- While r10 and r30 are don’t-cares afterwards; yet they are live too; no harm

51 Score Board
- The score board sb is an array of HW programmable bits sb[], each identified by an index; not visible in the API! Owned by the μP!
- The score board manages HW resources, specifically registers
- In the single-bit HW array sb[], every bit i in sb[i] is associated with a specific register, the one identified by i, e.g. ri
- Association is by index, i.e. by name: sb[i] belongs to reg ri
- Only if sb[i] = 0 does that register i hold valid data
- If sb[i] = 0 then register ri is NOT in the process of being written
- If bit i is set, i.e. if sb[i] = 1, then that register ri is reserved, i.e. it is off limits for the moment; wait until sb[i] = 0
- Initially all sb[*] are free to use, i.e. set to 0

52 Score Board
Execution constraints for rd ← rs op rt:
- If either sb[s] or sb[t] is set → RAW dependence, hence HW stalls the computation; wait until both rs and rt are available, i.e. until sb[s] = 0 and sb[t] = 0
- If sb[d] is set → WAW dependence, hence HW stalls the write; wait until rd has been used; the μP or even SW (the compiler) can sometimes determine to rather use another register instead of rd
- Else, if none of the 3 registers is in use, i.e. if all score board bits are 0, dispatch the instruction immediately
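
A minimal sketch of these dispatch rules in C; sb[] models the score board, while wait_one_cycle() and execute() are hypothetical stand-ins for a HW stall and the actual operation:

    #include <stdbool.h>

    #define NUM_REGS 32
    static bool sb[NUM_REGS];            /* score board: 1 = register reserved */

    extern void wait_one_cycle(void);          /* hypothetical HW stall        */
    extern void execute(int d, int s, int t);  /* performs rd <- rs op rt      */

    /* Dispatch rd <- rs op rt only when all three registers are free. */
    void dispatch(int d, int s, int t) {
        while (sb[s] || sb[t])           /* RAW hazard: sources still pending  */
            wait_one_cycle();
        while (sb[d])                    /* WAW hazard: destination reserved   */
            wait_one_cycle();
        sb[d] = true;                    /* reserve the destination register   */
        execute(d, s, t);
        sb[d] = false;                   /* result written: clear sb[d]        */
    }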

53 Score Board
To allow out-of-order (ooo) execution, upon computing the value of rd:
- Update rd, and clear sb[d]
- For uses (AKA references), HW may use any register i whose sb[i] is 0
- For definitions (AKA assignments), HW may set any register j whose sb[j] is 0
- Independent of the original order in which the source program was written, i.e. possibly ooo
- Provided, in the end, all API-visible registers hold the programmed results

54 Bibliography
http://en.wikipedia.org/wiki/Flynn's_taxonomy

