
1 ECE 486/586 Computer Architecture Chapter 1 Computer Taxonomy
Herbert G. Mayer, PSU Status 1/12/2017

2 Syllabus
Architecture Types
Architecture Attributes
Flynn Classification 1966
Generic Computer Architecture Model
Instruction Set Architecture (ISA)
Iron Law of Performance
Uniprocessor (UP) Architectures
Multiprocessor (MP) Architectures
Hybrid Architectures
Dependencies
Score Board & Tomasulo Method
Bibliography

3 Architecture Types
Single Accumulator Architecture (early 1940s), e.g. John von Neumann's computer, or John Vincent Atanasoff's computer; basis for the ENIAC computer
General Purpose Register Architecture – GPR
2-Address Architecture (GPR with one operand implied), e.g. IBM 360
3-Address Architecture (GPR with the operands of an arithmetic operation explicit), e.g. VAX-11/780
(Photos: John von Neumann, 1940; John Vincent Atanasoff)

4 Architecture Types
Stack Machines (e.g. B5000, see [2]; B6000; HP 3000, see [3])
Vector Architecture, e.g. Amdahl 470/6, competing with IBM's 360 in the 1970s; blurs the line to Multiprocessor
Shared Memory Architecture
Distributed Memory Architecture
Data Flow Machine; see Jack Dennis' work at MIT
(Photo: Jack Dennis)

5 Architecture Types
Systolic Architecture; see Intel® iWarp and CMU's Warp architecture
Superscalar Architecture; see Intel 80860, AKA i860
VLIW Architecture; see the Multiflow computer
Pipelined Architecture; debatable whether UP or hybrid
EPIC Architecture; see Intel® Itanium® architecture

6 Architecture Attributes
Main memory (main store), separate from the CPU
Program instructions stored in main memory
Data also stored in memory; AKA the von Neumann architecture
Data available in –distributed over– main memory, stack, heap, reserved OS space, free space, IO space
Instruction pointer (AKA instruction counter, program counter), and other special registers
Von Neumann memory bottleneck: everything travels over the same bus
Accumulator (a single register) holds the result of an arithmetic/logical operation

7 Architecture Attributes
The memory controller handles memory access requests from the processor to main memory
The IO controller manages the peripheral devices' connection to the bus; jointly AKA the chipset
The current trend is to move all or part of the memory controller onto the CPU chip; this does not mean the controller is part of the CPU!
Processor units include: FP unit, integer unit, branch unit, control unit, register file, pathways

8 Data-Stream, Instruction-Stream
Data-Stream, Instruction-Stream classification, defined by Michael J. Flynn in 1966!
Single-Instruction, Single-Data Stream (SISD) Architecture, e.g. PDP-11
Single-Instruction, Multiple-Data Stream (SIMD) Architecture, e.g. array processors: Solomon, Illiac IV, BSP, TMC
Multiple-Instruction, Single-Data Stream (MISD) Architecture, e.g. possibly: superscalar, pipelined, VLIW, EPIC machines
Multiple-Instruction, Multiple-Data Stream (MIMD) Architecture; perhaps a true multiprocessor is yet to be built; yes, debatable!

9 Data-Stream, Instruction-Stream

10 Generic Computer Architecture Model

11 Instruction Set Architecture (ISA)
The ISA is the boundary between Software (SW) and Hardware (HW)
Specifies the logical machine that is visible to the programmer & compiler
Is the functional specification for processor designers
That boundary is sometimes a very low-level piece of system software that handles exceptions, interrupts, and HW-specific services
Could fall into the domain of the OS

12 Instruction Set Architecture (ISA)
Specified by the ISA:
Operations: what to perform, and in which order
Temporary operand storage in the CPU: accumulator, stack, registers
Note that the stack can be word-sized, even bit-sized (design of the successor of NCR's Century architecture of the 1970s)
Number of operands per instruction
Operand location: where and how to specify/locate the operands
Type and size of operands
Instruction encoding in binary

13 Instruction Set Architecture (ISA)
ISA: Dynamic Static Interface (DSI)

14 Iron Law of Processor Performance
Clock rate doesn't count, bus width doesn't count, the number of registers and operations executed in parallel doesn't count! What counts is how long it takes for my computational task to complete. That time is of the essence of computing!
If a MIPS-based solution runs at 1 GHz and completes a program X in 2 minutes, while an Intel Pentium® 4-based solution runs at 3 GHz and completes that same program X in 2.5 minutes, programmers are more interested in the former solution

15 Iron Law of Processor Performance
If a solution on an Intel CPU can be expressed in an object program of size Y bytes, but on an IBM architecture requires 1.1 Y bytes, the Intel solution is generally more attractive, assuming the same execution performance
Meaning of this:
Wall-clock time (Time) is the time I have to wait for completion
Program size is an indicator of the overall physical complexity of the computational task

16 Iron Law of Processor Performance
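The Iron Law relates wall-clock time to dynamic instruction count, cycles per instruction (CPI), and cycle time: Time = Instructions × CPI × CycleTime. A minimal C sketch of that product, with illustrative numbers (not measurements from these slides):

/* The Iron Law: time = instruction_count * CPI * cycle_time.
   The numbers below are illustrative only, not data from the slides. */
#include <stdio.h>

int main(void) {
    double instructions = 2.0e9;   /* dynamic instruction count      */
    double cpi          = 1.5;     /* average cycles per instruction */
    double clock_hz     = 1.0e9;   /* 1 GHz -> cycle time = 1/clock  */

    double seconds = instructions * cpi / clock_hz;
    printf("execution time = %.2f s\n", seconds);   /* 2e9 * 1.5 / 1e9 = 3 s */
    return 0;
}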

17 Amdahl's Law
Articulated by Gene Amdahl during the 1967 AFIPS conference
Stating that the maximum speedup of a program P is dominated by its sequential portion S
I.e. if some part of program P can be perfectly accelerated by numerous parallel processors, but some part S of P is inherently sequential, then the resulting performance is dominated by S
See the Wikipedia sample:

18 Amdahl’s Law –Wikipedia Sample
The speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program. For example, if 95% of the program can be parallelized, the theoretical maximum speedup using parallel computing is 20×, as shown in the diagram
n ∈ ℕ: the number of processors
B ∈ [0, 1]: the strictly sequential fraction of the program
T(n) = time to execute the program with n processors
T(n) = T(1) ( B + (1 - B) / n )
S(n) = speedup = T(1) / T(n)
S(n) = 1 / ( B + (1 - B) / n )
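A small C check of the speedup formula above; the values of B and n are chosen only for illustration:

/* Amdahl's Law as stated on this slide: S(n) = 1 / (B + (1 - B) / n),
   with B the sequential fraction. With B = 0.05, the speedup approaches 20x. */
#include <stdio.h>

double amdahl_speedup(double B, int n) {
    return 1.0 / (B + (1.0 - B) / n);
}

int main(void) {
    printf("n = 16:   %.2fx\n", amdahl_speedup(0.05, 16));     /* ~9.14x */
    printf("n = 1024: %.2fx\n", amdahl_speedup(0.05, 1024));   /* ~19.6x */
    return 0;
}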

19 Amdahl’s Law (From Wikipedia)

20 Architecture Taxonomy

21 Uniprocessor (UP) Architectures
Ancient! Not used today in a typical μP: Single Accumulator Architecture (SAA), e.g. von Neumann's machine, developed in the 1940s
A single register holds operation results; it was conventionally called the accumulator
The accumulator is used as the destination of arithmetic operations, and as (one) source
Has a central processing unit, a memory unit, and a connecting memory bus
The pc points to the next instruction (in memory) to be executed
Commercial sample: ENIAC

22 Uniprocessor (UP) Architectures

23 General Purpose Register (GPR)
Accumulates ALU results in n registers, n typically = 4, 8, 16, ...
Allows register-to-register operations: fast!
GPR is essentially a multi-register extension of the SA architecture
A two-address architecture specifies one source operand explicitly, another implicitly, plus one destination
A three-address architecture specifies two source operands explicitly, plus an explicit destination
Variations allow additional index registers, base registers, multiple index registers, etc.

24 General Purpose Register (GPR)

25 Stack Machine Architecture (SMA)
AKA zero-address architecture, since arithmetic operations require no explicit operands, hence no operand addresses; all are implied, except for push and pop
What is the equivalent of push/pop on a GPR architecture?
A pure Stack Machine (SMA) has no registers
Hence performance would be poor, as all operations involve memory!
However, one can design an SMA that implements the n top-of-stack elements as registers: a Stack Cache
Sample architectures: Burroughs B5000, HP 3000

26 Stack Machine Architecture (SMA)
Implement impure stack operations that bypass tos operand addressing
Sample code sequence to compute res := a * ( lit + b ) on an SMA -- operand sizes are implied!
push a -- destination implied: stack
pushlit lit -- also destination implied
push b -- ditto
add -- sources and destination implied
mult -- sources and destination implied
pop res -- source implied: stack
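A minimal C sketch of how a zero-address machine evaluates the sequence above, modeling the operand stack in software; the helper names (push, pushlit, add, mult, pop) and the sample values are illustrative, not any real SMA's instruction set:

/* Minimal software model of the zero-address sequence above.
   Assumes a tiny operand stack; names and values are illustrative. */
#include <stdio.h>

static int stack[16];
static int sp = 0;                       /* next free slot */

static void push(int v)    { stack[sp++] = v; }      /* push variable value */
static void pushlit(int v) { stack[sp++] = v; }      /* push literal        */
static int  pop(void)      { return stack[--sp]; }
static void add(void)  { int b = pop(), a = pop(); push(a + b); }
static void mult(void) { int b = pop(), a = pop(); push(a * b); }

int main(void) {
    int a = 6, b = 7, lit = 3, res;      /* sample operand values */
    push(a);                             /* push a   */
    pushlit(lit);                        /* pushlit  */
    push(b);                             /* push b   */
    add();                               /* add      */
    mult();                              /* mult     */
    res = pop();                         /* pop res  */
    printf("res = %d\n", res);           /* 6 * (3 + 7) = 60 */
    return 0;
}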

27 Stack Machine Architecture (SMA)

28 Pipelined Architecture (PA)
The Arithmetic Logic Unit (ALU) is split into separate, sequentially connected units in a PA
Such a unit is referred to as a stage; the time slot in which its action is done is also referred to as a stage
Each of these stages/units can be initiated once per –very short– cycle
Yet each subunit is implemented in HW just once
Multiple subunits operate in parallel on different sub-ops, each executing a different stage; each stage is part of an overall instruction execution

29 Pipelined Architecture (PA)
Problem posed by non-unit time: differing numbers of cycles per operation cause different completion times
Operations can abort in an intermediate stage, if some earlier instruction changes the flow of control; this is named flushing, followed by re-priming of the pipeline
E.g. due to a branch, exception, return, conditional branch, call
An operation must stall in case of operand dependence; a stall is caused by an interlock, AKA a data or control dependency; for example, a datum is needed due to a load, but is not yet available!
A stall does not require flushing and re-priming of the pipeline; it just causes a wait

30 Pipelined Architecture (PA)

31 Pipelined Architecture (PA)
Ideally each instruction would be partitioned into the same number of stages, i.e. sub-operations
Operations to be pipelined can sometimes be evenly partitioned into equal-length sub-operations
That equal-length time quantum might as well be a single sub-clock
In practice this is hard for the architect to achieve; compare for example a noop and a floating-point divide! Vastly different timing needs!

32 Pipelined Architecture (PA)
Ideally all operations have independent operands, i.e. an operand being computed is not needed as a source of the next few operations
If it were needed –and often it is– this would cause a dependence, which causes a stall:
read after write (RAW)
write after read (WAR)
write after write, with a use in between (WAW)
Also, ideally, all instructions just happen to be arranged sequentially one after another
In reality, there are branches, calls, returns, etc.

33 Pipelined Architecture (PA)
Idealized Pipeline Resource Diagram:

34 Pipelined Architecture (PA)
Some architectures drive pipeline depth to the extreme
Willamette: O(20) stages, versus the P6's O(10) stages
Con: requires very accurate branch prediction
Pro: can run at extremely high clock speed w/o requiring "better" silicon technology

35 Vector Architecture (VA)
Registers implemented as a HW array of identical registers, named vri[j], i = 0 .. n-1, j = 0 .. m-1; AKA vector registers
A VA may also have scalar registers, named r0, r1, etc.
A scalar register can also be the first element (index 0) of a vector register, e.g. vri[0]
Vector registers vri[*] can load/store blocks of contiguous data
Still in sequence, but overlapped; the number of steps to complete a load/store of a vector depends on the bus width
Vector registers perform multiple operations of the same kind on contiguous operand blocks

36 Vector Architecture (VA)
VA operates sequentially, but processes n ≥ 1 operands in overlapped fashion: faster than n scalar ops!

37 Vector Architecture (VA)
Graph shows parallel data processing in single operation

38 Vector Architecture (VA)
Otherwise operations look like in a GPR architecture
Sample vector operations, assuming 64-element vector registers:
ldv vr1, memi -- loads 64 memory locations from [mem + i], i = 0..63
stv vr2, memj -- stores vr2 into 64 contiguous locations
vadd vr1, vr2, vr3 -- register-register vector add
cvaddf r0, vr1, vr2, vr3 -- has conditional meaning:
-- sequential equivalent:
for i = 0 to 63 do
  if bit i in r0 = 1 then
    vr1[i] = vr2[i] + vr3[i]
  else -- must be 0
    -- do not move the corresponding element
  end if
end for
-- parallel syntax equivalent:
forall i = 0 to 63 do parallel
  if bit i in r0 = 1 then vr1[i] = vr2[i] + vr3[i]
end parallel for
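For concreteness, a C sketch of the sequential meaning of cvaddf above, assuming 64-element vectors and a 64-bit mask register r0 (types and names are illustrative):

/* C sketch of the sequential meaning of cvaddf r0, vr1, vr2, vr3, assuming
   64-element vectors and a 64-bit mask register r0; names are illustrative. */
#include <stdint.h>

#define VLEN 64

void cvaddf(uint64_t r0, double vr1[VLEN],
            const double vr2[VLEN], const double vr3[VLEN]) {
    for (int i = 0; i < VLEN; i++) {
        if ((r0 >> i) & 1u) {            /* bit i of the mask selects lane i */
            vr1[i] = vr2[i] + vr3[i];    /* masked element-wise add          */
        }                                /* else: leave vr1[i] unchanged     */
    }
}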

39 Multiprocessor (MP) Architectures
Shared Memory Architecture (SMA):
Equal access to memory for all n processors, p0 to pn-1
Only one will succeed in accessing shared memory, if there are multiple, simultaneous accesses
Simultaneous accesses must be resolved deterministically; this needs a policy or an arbiter that ensures a deterministic order
Von Neumann bottleneck even tighter than for a conventional UP system
Typically there are ~ twice as many loads as there are stores

40 Multiprocessor (MP) Architectures
Generally, some processors are idle due to memory or other conflicts
Typical number of processors n = 4, but n = 8 and greater n are possible, with a large 2nd-level cache, and an even larger 3rd level
Early MP architectures had limited commercial success and acceptance, due to the programming burden, frequently a burden on the human programmer
Morphing in the 2000s into multi-core and hyper-threaded architectures, where the programming burden is partly taken on by a multi-threading OS

41 Multiprocessor (MP) Architectures
Here n > 1 CPUs share memory, via single bus

42 Distributed Memory Architecture (DMA)
Processors have private, AKA local, memories
Yet the programmer has to see a single, logical memory space, regardless of the local distribution
Hence each processor pi always has access to its own memory Memi
And the collection of all memories Memi, i = 0..n-1, is the full program's logical data space
Thus, processors must access the others' memories
Done via Message Passing or Virtual Shared Memory
Messages must be routed, and the route determined
A route may require multiple, intermediate nodes

43 Distributed Memory Architecture (DMA)
Focus of the discussion here is not direct memory access, also abbreviated DMA!
Blocking when: a message is required and expected, but hasn't arrived yet
Blocking when: a message is to be sent, but the destination cannot receive; requires state knowledge of the destination!
Growing the message buffer size increases the illusion of asynchronicity of the sending and receiving operations
Key parameters: the time for 1 hop, and the packet overhead to send an empty message
A message may be delayed even further because of network congestion
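Message passing as described here is commonly written against a library such as MPI; the sketch below is only an illustration of a blocking receive (the slides' DMA machines are not MPI systems): rank 1 blocks in MPI_Recv until rank 0's message has arrived.

/* Minimal message-passing sketch in C with MPI; run with at least 2 ranks,
   e.g. mpirun -np 2. Rank 1 blocks in MPI_Recv until rank 0's message
   arrives -- the "blocking" case described above. Illustration only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);    /* send to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                           /* blocks until message arrives */
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}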

44 Distributed Memory Architecture DMA
Here n > 1 MP CPUs each have their own local memory Total memory is sum of all distributed parts

45 Systolic Array (SA) Architecture
Very few were designed: by CMU and Intel for DARPA
Each processor has private memory
The network is fixed by the architecture-defined Systolic Pathway (SP)
Each node is pre-connected via the SP to some defined subset of the other processors in the SA
Node connectivity: determined by the network topology
The systolic pathway is a high-performance network:
Sending and receiving may be synchronous (includes blocking) or asynchronous (received data are buffered)
Typical network topologies: line, ring, torus, hex grid, mesh, etc.

46 Systolic Array (SA) Architecture
The sample below is a ring; note that the wrap-around along the x and y directions is not shown
A processor can write to the x or y gate; this sends a word off on the x or y SP
A processor can read from the x or y gate; this consumes a word from the x or y SP
A buffered SA can write to a gate, even if the receiver is not ready to read
Reading from a gate when no message is available causes blocking!
Automatic code generation for a non-buffered SA is quite hard; the compiler must keep track of inter-processor synchronization
One can view the SP as an extension of memory with infinite capacity, but with sequential access

47 Systolic Array (SA) Architecture

48 Systolic Array (SA) Architecture
Note that each pathway, x or y, may be bi-directional
May have any number of pathways; nothing magic about 2 (x and y) or 3 (x, y, and z)
Possible to have I/O capability at each node
Typical application: large polynomials of the form
y = k0 + k1*x + k2*x^2 + ... + kn-1*x^(n-1) = Σ ki*x^i
The next example shows a torus without displaying the wrap-around pathways across both the x- and y-dimensions
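A sequential C sketch of the polynomial evaluation that a systolic array pipelines across its cells; each cell typically contributes one multiply-add step (Horner's rule). Function and parameter names are assumptions for illustration:

/* Sequential sketch of the polynomial evaluation a systolic array pipelines:
   each cell performs one multiply-add step (Horner's rule). Assumed names. */
double poly_eval(const double k[], int n, double x) {
    double y = 0.0;
    for (int i = n - 1; i >= 0; i--) {
        y = y * x + k[i];     /* one cell's step: k[i] + x * (partial result) */
    }
    return y;                 /* y = k0 + k1*x + ... + k(n-1)*x^(n-1) */
}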

49 Systolic Array (SA) Architecture

50 Hybrid Architectures: Superscalar (SSA) Architecture
Replicates (duplicates) some execution units in HW
Seems like a scalar architecture w.r.t. the object code
Is a parallel architecture, as it has multiple copies of certain hardware units
Is not an MP architecture: the multiple units do not have, for example, separate program counters
A superscalar architecture has multiple ALU elements, possibly multiple FP add (FPA) units, FP multiply (FPM) units, and/or integer units
Arithmetic operations can be simultaneous with load and store operations, provided there is no data dependence!

51 Hybrid Architectures
Instruction fetch in a superscalar architecture is speculative, since the number of parallel operations is unknown; rule: fetch too much! Speculate this might work! But fetch no more than the longest possible superscalar pattern
The code sequence looks like a sequence of instructions for a scalar processor
Example: 80486® code executed on Pentium® processors; this is 1980s code!!
Famous superscalar architecture example: Intel's i80860 processor
Object code can be custom-tailored by the compiler; i.e. the compiler can have a superscalar target processor in mind and bias code emission, knowing that certain sequences are better suited for superscalar execution

52 Hybrid Architectures
Fetch enough instruction bytes on a superscalar target to support the widest (most parallel) possible object sequence
Decoding is a bottleneck for CISC; it is easier for RISC with its fixed-size 32-bit or 64-bit instructions
Sample superscalar: the i80860 has separate FPA, FPM, 2 integer ops, load, and store with pre-/post- address increment and decrement
The i80860 μP was not very successful commercially
Sample below: a superscalar and pipelined architecture with a max. of 3 instructions per cycle; here the pipeline stages are: IF, DE, EX, and WB

53 Hybrid Architectures: N = 3, i.e. 3 IPC

54 VLIW Architecture
Very Long Instruction Word, typically 128 bits or more
Object code is no longer purely scalar, but explicitly parallel
Just like the limitation in superscalar: this is not a general MP architecture
VLIW sub-instructions do not have general simultaneous memory access
Multiple memory accesses occur only if the addresses are disjoint and multiple ports to memory are provided
VLIW opcodes support parallel execution, with dependences resolved at compile time
The compiler/programmer explicitly packs parallelizable operations into a VLIW instruction

55 VLIW Architecture
Just like horizontal microcode compaction
Non-VLIW opcodes are still scalar and can coexist with VLIW instructions
Partially parallel, even scalar, operation is possible by placing no-ops into some of the VLIW fields; i.e. not all fields must be filled with sub-instructions
Sample: the Compute instruction of CMU's Warp® and Intel's iWarp®
Could be a 1-bit (or few-bit) opcode for the compute instruction, plus sub-opcodes for the sub-instructions
Data dependence example: the result of the FPA cannot be used as an operand of the FPM within one VLIW instruction

56 VLIW Architecture
Likewise, the result of int1 cannot be used as an operand for int2
Thus, need software pipelining (later in this term)
Below: a single VLIW instruction with 7 units; some sub-opcodes may be noop! It still needs a VLIW opcode

57 VLIW Architecture VLIW sample with 5 units: FPA, FPM, INT1, Branch, Load

58 EPIC Itanium® Architecture
Originally code-named Merced, Itanium® is Intel's first published, commercial 64-bit computer product, launched in 2001 and co-developed with HP Corp.; IPF stands for Itanium Processor Family
Published means: smart Intel was diligently developing a contemporaneous, competing 64-bit processor, the extended version of its ancient x86 architecture, just in case, as a secret backup risk hedge
64-bit means that the logical address range spans 2^64 different memory bytes, and natural integer objects are 64 bits wide
The exact format of data objects is described in the section Data and Memory
During its development at Intel, the first generation of Itanium processors was internally code-named Merced
The family is now officially called IPF, for Itanium Processor Family, while early in its development it was referred to as IA-64, for Intel 64-bit architecture; a name conflicting later with x86

59 EPIC Itanium® Architecture
Intel's Itanium architecture is radically different from the widely used Intel x86 32-bit –and also from the x86 64-bit– architecture
The old name IA-32 is obsolete; instead refer to the x86 architecture, lest one incorrectly infers today that it is restricted to 32-bit addresses and integer types of 32-bit length
That limitation no longer exists since the introduction of the 64-bit versions, about ½ year after AMD's extension of IA-32 to 64 bits; note also the official Intel name EM64T
Imagine how Intel felt, when AMD, having produced CPUs compatible with Intel's chips, suddenly had a more advanced, attractive x86 CPU!

60 EPIC Itanium® Architecture
EPIC, AKA Explicitly Parallel Instruction Computing:
Groups instructions into bundles
Straightens out branches by associating a predicate with instructions, and . . .
Executes both paths in parallel, say the else clause and the then clause of an if statement
Decides at run time which predicate is true, and completes that path; aborts the other!
Uses speculation to straighten the branch tree
Uses a large, rotating register file
Provides many registers, not just 64 GPRs
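A rough C analogy to predication (not Itanium code): both the then-path and the else-path results are computed, and the predicate selects which one survives; the work of the untaken path is simply discarded:

/* Rough C analogy to predication (not actual Itanium code): compute both the
   "then" and "else" results, then keep the one whose predicate is true. */
int predicated_select(int cond, int a, int b) {
    int then_val = a + b;                 /* work of the then-path              */
    int else_val = a - b;                 /* work of the else-path              */
    return cond ? then_val : else_val;    /* predicate picks the surviving result */
}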

61 Photo of Itanium 2 Processor

62 EPIC Itanium® Architecture
Former Intel VP Pat Gelsinger, with Itanium Chips

63 EPIC Itanium® Registers
Itanium has 128 general registers (GR), 128 floating-point registers (FR), 64 single-bit predicate registers (PR), 8 branch registers (BR), and 128 application registers (AR)
In addition, there are Performance Monitor Data registers (PMD), processor identifiers (CPUID), a Current Frame Marker register (CFM), a user mask (UM), and the instruction pointer (IP)
GRs, FRs, BRs, ARs, CPUIDs, IP, and PMDs are 64 bits wide
PRs are 1 bit wide, while the UM holds 6 bits and the CFM 38 bits; depicted below:

64 EPIC Itanium® Registers

65 EPIC Itanium® ISA Parallelism, Dependences, and Groups
Itanium instructions packaged into groups can execute in parallel; this allows fast execution, if HW is available!
The assembly programmer or compiler may craft groups as large as desired; the performance consequence is:
All operations embedded in a single group can be executed simultaneously, in parallel, saving time over the equivalent sequential execution
The physical silicon angle of this is:
Of all operations that could be executed in parallel, only those are actually performed in parallel for which HW resources exist
E.g. on an Itanium® 2 implementation of IPF, there are 6 units available to operate in parallel

66 EPIC Itanium® ISA Parallelism, Dependences, and Groups
If fewer actions are enclosed in a group, some HW will idle
If more actions could have been included in a group, then all HW elements are active, yet some degree of possible parallelism is lost; future HW implementations may execute that same object code faster due to the higher degree of parallelism
Parallel execution is not feasible if dependences exist between instructions
On Itanium these dependences are not resolved by the machine
It is the human programmer or the optimizer that explicitly tracks what can be done in parallel and what must be done in sequence. The machine just runs it; goal: TO BE FAST!

67 EPIC Itanium® ISA Parallelism, Dependences, and Groups
If a result has to be computed first before it can be read somewhere else (memory or register), a true dependence exists; AKA data dependence; it is conventional to just say "dependence"
On Itanium this is called a RAW (Read after Write) dependence
If a result has to be read first before it can be re-computed, a false dependence is created, AKA anti-dependence
On Itanium this is named a WAR (Write after Read) dependence
If a result has to be computed first before it can be computed again, assuming that an intermediate reference is possible, an output dependence is created
Itanium calls this third dependence a WAW (Write after Write) dependence

68 Registers, Dependencies

69 Register & Data Dependencies
Inter-instruction dependencies, in CS parlance also known as dependences, arise between registers –just like between program objects or memory locations– being defined and used
One instruction computes a result into a register (or memory); another instruction needs that result from that same register (or that memory location)
Or, one instruction uses a datum; and after its use the same item is then recomputed
Dependences require sequential execution, lest the result be unpredictable, i.e. wrong!

70 Register Dependencies
True Dependence, AKA Data Dependence –synonymous!– Read after Write (RAW):
r3 ← r1 op r2
r5 ← r3 op r4
Anti Dependence, not a true dependence; can parallelize under the right condition; Write after Read (WAR):
r3 ← r1 op r2
r1 ← r5 op r4
Output Dependence, similar to Anti Dependence; can also do something about it; Write after Write (WAW), with a use in between:
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7

71 Register Dependencies
Control Dependence: // ri, i = 1..4 come in "live"
if ( condition1 ) {
    r3 = r1 op r2;
} else {            // <- see the jump here?
    r5 = r3 op r4;
} // end if
write( r3 );

72 Register Renaming
Only a data dependence is a real dependence, hence it is called a true dependence
Other dependences are artifacts of insufficient resources, generally insufficient registers
This means: if additional registers were available, then replacing some of the conflicting registers with new ones could make the conflict disappear!
Anti and output dependences are indeed such false dependences

73 Register Renaming
Original Code:
L1: r1 ← r2 op r3
L2: r4 ← r1 op r5
L3: r1 ← r3 op r6
L4: r3 ← r1 op r7
Compute the dependences before making register changes
The term "register r is live at instruction foo" means: some other instruction at foo+i is known to reference register r, without there being another assignment to r between foo and foo+i

74 Register Renaming
Original Code:
L1: r1 ← r2 op r3
L2: r4 ← r1 op r5
L3: r1 ← r3 op r6
L4: r3 ← r1 op r7
Dependences before renaming:
L1, L2 true dep with r1
L1, L3 output dep with r1
L1, L4 anti dep with r3
L3, L4 true dep with r1
L2, L3 anti dep with r1
L3, L4 anti dep with r3

75 Register Renaming
What changes could a compiler or programmer make, for the sake of decreasing register dependences, if more resources (registers) were available?
Once fewer dependences exist, a higher degree of parallelism is achievable
Thus execution speed can be increased!
But the processor can make those very same changes in HW
Note that x86, for example, has many internal registers, exactly to make such changes for the sake of added speed

76 Register Renaming
Original Code:                    New Code, after adding regs:
L1: r1 ← r2 op r3                 r10 ← r2 op r30  -- r30 instead of r3
L2: r4 ← r1 op r5                 r4 ← r10 op r5   -- r10 instead of r1
L3: r1 ← r3 op r6                 r1 ← r30 op r6
L4: r3 ← r1 op r7                 r3 ← r1 op r7
// ri, i = 1..7 come in "live"
Dependences before:               Dependences after:
L1, L2 true dep with r1           L1, L2 true dep with r10
L1, L3 output dep with r1         L3, L4 true dep with r1
L1, L4 anti dep with r3
L3, L4 true dep with r1
L2, L3 anti dep with r1
L3, L4 anti dep with r3

77 Register Renaming
With these additional renamed registers the new code could possibly run in half the time!
First: compute into r10 instead of r1; needs the additional register r10; no time penalty!
Also: in the preceding code, store the result into r30 instead of r3, if r30 is available; creates no added time penalty!
Then the following regs are live afterwards: r1, r3, r4, plus the non-modified ones, i.e. r2!
Caveat: r2 came in live, so it must go out live!
While r10 and r30 are don't-cares afterwards; yet they are live too; no harm
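A minimal C sketch of the renaming idea from slides 72-77, assuming a simple table that maps architectural registers to physical registers; every write allocates a fresh physical register, which removes WAW and WAR conflicts. All names are illustrative, not any real microarchitecture's interface:

/* Minimal register-renaming sketch: a table maps architectural register
   numbers to physical registers; a write allocates a fresh physical register,
   a read uses the current mapping. Simplified: physical regs never recycled. */
#include <stdio.h>

#define NUM_ARCH 32

static int rename_map[NUM_ARCH];   /* arch reg -> current phys reg */
static int next_free = NUM_ARCH;   /* next unused physical register */

static int read_reg(int arch)  { return rename_map[arch]; }
static int write_reg(int arch) { rename_map[arch] = next_free++; return rename_map[arch]; }

int main(void) {
    for (int i = 0; i < NUM_ARCH; i++) rename_map[i] = i;   /* identity at start */

    /* L1: r1 <- r2 op r3  and  L3: r1 <- r3 op r6 both write r1 (WAW).         */
    /* Renaming gives each write its own physical register, removing that WAW.  */
    int p_l1 = write_reg(1);       /* L1 writes r1: fresh physical register     */
    int p_l2 = read_reg(1);        /* L2 reads r1: sees L1's physical register  */
    int p_l3 = write_reg(1);       /* L3 writes r1: another fresh physical reg  */
    printf("L1 writes p%d, L2 reads p%d, L3 writes p%d -> WAW removed\n",
           p_l1, p_l2, p_l3);
    return 0;
}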

78 Score Board

79 Score Board
The score-board sb is not accessible to programmers!
Instead, the score-board is an array of HW bits sb[], each identified by its index; not visible in the ISA! Owned by the μP!
The score-board manages HW resources, specifically registers
It is a single-bit HW array sb[]. Every bit i in sb[i] is associated with a specific register, the one identified by i, e.g. ri
Association is by index, i.e. by name: sb[i] belongs to reg ri
Only if sb[i] = 0 does register ri hold valid data; or we can say: if sb[i] = 0 then register ri is NOT being written
If bit i is set, i.e. if sb[i] = 1, then register ri is reserved, i.e. it is off limits for the moment; wait until sb[i] = 0
Initially all sb[*] are free to use, i.e. set to 0

80 Score Board
Execution constraints for rd ← rs op rt:
If either sb[s] or sb[t] is set → RAW dependence, hence HW stalls the computation; wait until both rs and rt are available, i.e. until sb[s] = 0 and sb[t] = 0
If sb[d] is set → WAW dependence, hence HW stalls the write; wait until rd has been used; the μP or even SW (compiler) can sometimes decide to use another register instead of rd
Else, if none of the 3 registers is in use, i.e. if all involved score-board bits are 0, dispatch the instruction immediately
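A C sketch of the dispatch rule above, assuming one busy bit per register (sb[i] = 1 means ri is currently being written); the names and the single-bit simplification are illustrative:

/* Sketch of the score-board dispatch rule for rd <- rs op rt, assuming one
   busy bit per register (sb[i] = 1 means ri is being written). Illustrative. */
#include <stdbool.h>

#define NUM_REGS 32
static bool sb[NUM_REGS];    /* all 0 initially: every register free */

/* Returns true if the instruction may dispatch now, false if it must stall. */
bool can_dispatch(int d, int s, int t) {
    if (sb[s] || sb[t]) return false;   /* RAW: a source is still being produced */
    if (sb[d])          return false;   /* WAW: destination still reserved       */
    return true;
}

void dispatch_write(int d) { sb[d] = true;  }   /* reserve rd at dispatch  */
void writeback(int d)      { sb[d] = false; }   /* rd now holds valid data */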

81 Score Board
To allow out-of-order (ooo) execution, upon computing rd:
Update rd, and clear sb[d]
For uses (AKA references), HW may use any register i whose sb[i] is 0
For definitions (AKA assignments), HW may set any register j whose sb[j] is 0
Independent of the original order in which the source program was written, i.e. possibly ooo
Provided that, in the end, all ISA-visible registers hold the programmed results

82 Score Board & ooo Execution
Out-of-order execution, AKA dynamic execution
CDC supercomputers broke complex instructions (e.g. FP divide) into a semantically equivalent sequence of simpler sub-operations
Each of which could be executed very swiftly
On a pipelined architecture, numerous sub-operations of multiple instructions exist and progress in various phases of completion
E.g. CDC 6600 during the 1960s
IBM 360/91 during the 1970s, Tomasulo's genuine ooo algorithm
IBM POWER1 μP in 1990
Intel x86 family, 1995 Pentium Pro®

83 Score Board & ooo Execution
Dozens of sub-operations progress simultaneously, in any order, including out of order, AKA ooo
As long as the retiring order is logically equivalent to the sequential operation of the original instruction sequence
Details of the ooo execution paradigm:
Fetch the next instruction i
Dispatch i to an instruction queue, AKA reservation station
Then i waits in the queue until its input operands are available
Then i can leave the queue, even before earlier, older instructions
i is issued to the appropriate functional unit for execution
Results are queued up, to preserve the original order
Once older instructions have written back their results to the register file, i's result is written back to rd; this is called the retire stage

84 Tomasulo Method (TM)
Tomasulo's Method (TM) was developed for the FP unit of the IBM 360/91 family
For fast execution, TM uses: register renaming (RR), reservation stations (RS), a common data bus (CDB), and out-of-order execution (ooo)
RR: provides additional non-ISA internal registers to free ISA regs and to eliminate dependences
RS: each functional unit (FU) has an associated RS; it controls when an instruction executes and holds all info needed for execution:
1. Whether or not the FU is free to execute immediately
2. Which specific operation is to be executed, generally implied
3. Whether the operands are available; note operands don't have to be –and generally aren't– located in ISA FP registers!
CDB: internal bus that broadcasts values to all RS
(Photo: Robert Tomasulo, IBM)
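A hedged C sketch of what one reservation-station entry might hold, in the spirit of items 1-3 above; the field names follow common textbook usage (Vj/Vk for operand values, Qj/Qk for producer tags), not IBM's original design:

/* Sketch of a reservation-station entry in the spirit of Tomasulo's method
   (field names are illustrative): an instruction waits here until both
   source values have been broadcast on the common data bus. */
#include <stdbool.h>

typedef struct {
    bool busy;        /* entry in use                                 */
    int  op;          /* operation to perform (e.g. add, mul)         */
    bool vj_ready;    /* first source value available?                */
    bool vk_ready;    /* second source value available?               */
    double vj, vk;    /* source values, once available                */
    int  qj, qk;      /* producing RS tags, if a source is not ready  */
} RSEntry;

/* An entry may issue to its functional unit only when both sources are ready. */
bool rs_can_issue(const RSEntry *e) {
    return e->busy && e->vj_ready && e->vk_ready;
}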

85 Tomasulo Method
The CDB's numerous connections increase impedance
Requires careful EE design to allow operation at high clock frequency!
Catchy phrase for Tomasulo's Method: "An algorithm that preserves precedence while encouraging concurrency!"
As a result of the CDB and internal registers, FUs don't have to place results in ISA registers, and generally do not! Frees ISA registers for other purposes!
Exceptions: it is always desirable to associate an exception and its handling with the precise instruction causing it!
On a sequential computer: a non-issue
But as soon as parallelism is used for fast execution, the association of an exception with its instruction becomes fuzzy

86 Tomasulo Circuit, Taken from Wiki

87 Tomasulo Method
See for example the Alpha policy, explicitly allowing imprecise exception handling for the sake of high speed
When precision is needed on Alpha, it is possible to switch to a slower mode of execution, with precise exceptions
TM uses the "imprecise exception" model for such cases
Typical instruction lifecycle on an architecture using TM: 1. Issue, 2. Execute, 3. Post result
1. Instruction issue: if a true dependence exists, the instruction stalls until the needed operand is available. Else, if a WAR or WAW dependence exists, eliminate it via register renaming. Else the instruction is retrieved from the instruction queue to proceed: if the operands are available from registers and the FU is free, send it to execution; else stall until all resources are available

88 Tomasulo Method
2. Instruction execution: if a memory access, then compute the address and place the instruction into the load/store buffer; else it must be an ALU instruction, so execute it at the corresponding FU
3. Posting the result of an instruction: if the instruction is a memory access, complete the load or store. Else it must be an ALU op; then the result of the ALU operation is broadcast on the CDB and received by the RSs waiting for that result
TM was used in various 1970s supercomputers
Less use during the period of the minicomputer
Renewed, strong use in numerous contemporary microprocessors

89 Bibliography
http://en.wikipedia.org/wiki/Flynn's_taxonomy
VLIW Architecture: w-wp.pdf
ACM reference to Multiflow computer architecture: f04/reading/ibm67-anderson-360.pdf

