3Microprocessors in the Seventies Asanovic/DevadasSpring 20026.823Microprocessors in the SeventiesInitial target was embedded controlFirst micro, 4-bit 4004 from Intel, designed for a desktopprinting calculatorConstrained by what could fit on single chipSingle accumulator architectures8-bit micros used in hobbyist personal computersMicral, Altair, TRS-80, Apple-IILittle impact on conventional computer market untilVISICALC spreadsheet for Apple-II (6502, 1MHz)First “killer” business application for personal computers
4DRAM in the Seventies Dramatic progress in MOSFET memory technology Asanovic/DevadasSpring 20026.823DRAM in the SeventiesDramatic progress in MOSFET memory technology1970, Intel introduces first DRAM (1Kbit 1103)1979, Fujitsu introduces 64Kbit DRAM=> By mid-Seventies, obvious that PCs would soon have > 64KBytes physical memory
5Microprocessor Evolution Asanovic/DevadasSpring 20026.823Microprocessor EvolutionRapid progress in size and speed through 70s– Fueled by advances in MOSFET technology and expanding marketsIntel i432– Most ambitious seventies’ micro; started in released 1981– 32-bit capability-based object-oriented architecture– Instructions variable number of bits long– Severe performance, complexity, and usability problemsIntel 8086 (1978, 8MHz, 29,000 transistors)– “Stopgap” 16-bit processor, architected in 10 weeks– Extended accumulator architecture, assembly-compatible with 8080– 20-bit addressing through segmented addressing schemeMotorola (1979, 8MHz, 68,000 transistors)– Heavily microcoded (and nanocoded)– 32-bit general purpose register architecture (24 address pins)– 8 address registers, 8 data registers
6Intel 8086 Typical format R ← R op M[X], many addressing modes Asanovic/DevadasSpring 20026.823Intel 8086Class Register PurposeData: AX,BX “general” purposeCX string and loop ops onlyDX mult/div and I/O onlyAddress: SP stack pointerBP base pointer (can also use BX)SI,DI index registersSegment: CS code segmentSS stack segmentDS data segmentES extra segmentControl: IP instruction pointer (lower 16 bit of PC)FLAGS C, Z, N, B, P, V and 3 control bitsTypical format R ← R op M[X], many addressing modesNot a GPR organization!
7IBM PC, 1981 Hardware Software Open System Asanovic/DevadasSpring 20026.823HardwareTeam from IBM building PC prototypes in 1979Motorola chosen initially, but was lateIBM builds “stopgap” prototypes using 8088 boards fromDisplay Writer word processor8088 is 8-bit bus version of 8086 => allows cheaper systemEstimated sales of 250,000100,000,000s soldSoftwareMicrosoft negotiates to provide OS for IBM. Later buys and modifies QDOS from Seattle Computer Products.Open SystemStandard processor, Intel 8088Standard interfacesStandard OS, MS-DOSIBM permits cloning and third-party software
8The Eighties: Microprocessor Revolution Asanovic/DevadasSpring 20026.823Personal computer market emerges– Huge business and consumer market for spreadsheets, wordprocessing and games– Based on inexpensive 8-bit and 16-bit micros: Zilog Z80, Mostek6502, Intel 8088/86, …Minicomputers replaced by workstations– Distributed network computing and high-performance graphicsfor scientific and engineering applications (Sun, Apollo, HP,…)– Based on powerful 32-bit microprocessors with virtual memory,caches, pipelined execution, hardware floating-point Massively Parallel Processors (MPPs) appear– Use many cheap micros to approachsupercomputer performance (Sequent, Intel, Parsytec)
9The Nineties Distinction between workstation and PC disappears Asanovic/DevadasSpring 20026.823The NinetiesDistinction between workstation and PC disappearsParallel microprocessor-based SMPs take over low- end server and supercomputer marketMPPs have limited success in supercomputing marketHigh-end mainframes and vector supercomputers survive “killer micro” onslaught64-bit addressing becomes essential at high-endIn 2001, 4GB DRAM costs <$5,000CISC ISA (x86) thrives!
10Reduced ISA Diversity in Nineties Asanovic/DevadasSpring 20026.823Reduced ISA Diversity in NinetiesFew major companies in general-purpose market– Intel x86 (CISC)– IBM 390 (CISC)– Sun SPARC, SGI MIPS, HP PA-RISC (all RISCs)– IBM/Apple/Motorola introduce PowerPC (another RISC)– Digital introduces Alpha (another RISC)Software costs make ISA change prohibitively expensive– 64-bit addressing extensions added to RISC instruction sets– Short vector multimedia extensions added to all ISAs, but withoutcompiler support=> Focus on microarchitecture (superscalar, out-of-order)CISC x86 thrives!– RISCs (SPARC, MIPS, Alpha, PowerPC) fail to make significant inroadsinto desktop market, but important in server and technical computingmarkets“RISC advantage” shrinks with superscalar out-of-orderexecution
11Intel Pentium Pro, (1995) During decode, translate complex x86 Asanovic/DevadasSpring 20026.823Intel Pentium Pro, (1995)During decode, translate complex x86instructions into RISC-like micro-operations(uops)– e.g., “R Å R op Mem” translates intoload T, Mem # Load from Mem into temp regR Å R op T # Operate using value in tempExecute uops using speculative out-of-ordersuperscalar engine with register renamingPentium Pro family architecture (P6 family) usedon Pentium-II and Pentium-III processors
14P6 uops Each uop has fixed format of around 118 bits Asanovic/DevadasSpring 20026.823P6 uopsEach uop has fixed format of around 118 bits– opcode, two sources, and destination– sources and destination fields are 32-bits wide to holdimmediate or operandSimple decoders can only handle simple x86instructions that map to one uopComplex decoder can handle x86 translationsof up to 4 uopsComplicated x86 instructions handled bymicrocode engine that generates uopsequenceIntel data shows average of uops per x86instruction on SPEC95 benchmarks, on MSOffice applications
15Allocate ROB, RAT, RS entries Asanovic/DevadasSpring 20026.823P6 Reorder Buffer and RenamingReorder Buffer (ROB)Data Status3 uops/cycleAllocate ROB, RAT, RS entriesuop Buffer(6 entries)40 entries in ROBEAXRegister Alias Table (RAT)EAXEBXECXEDXESIEDIESPEBPRetirementRegister File(RRF)Values move from ROB to architectural register file(RRF) when committed
16P6 Reservation Stations and Asanovic/DevadasSpring 20026.823P6 Reservation Stations andExecution UnitsROB(40 entries)Renamed uops(3/cycle)Reservation Station (20 entries)dispatch up to 5 uops/cycleStoreDataStoreAddr.LoadAddr.Int. ALUInt. ALUFP ALUstores only leave MOB when uop commitsMemory Reorder Buffer(MOB)1 store loadLoad data8KB D-cache, 4-way s.a., 32-byte lines, divided into 4 interleaved banksD-TLBD-TLB has 64 entries for 4KB pages fully assoc., plus 8 entries for 4MB pages, 4-way s.a.
17P6 Retirement After uop writes back to ROB with no outstanding Asanovic/DevadasSpring 20026.823P6 RetirementAfter uop writes back to ROB with no outstandingexceptions or mispredicts, becomes eligible forretirementData written to RRF from ROBROB entry freed, RAT updateduops retired in order, up to 3 per cycleHave to check and report exceptions at valid x86instruction fault points– complex instructions (e.g., string move) maygenerate thousands of uops
20P6 Branch Target Buffer (BTB) Asanovic/DevadasSpring 20026.823P6 Branch Target Buffer (BTB)512 entries, 4-way set-associativeHolds branch target, plus two-level BHT fortaken/not-takenUnconditional jumps not held in BTBOne cycle bubble on correctly predicted takenbranches (no penalty if correctly predicted not-taken)
21Two-Level Branch Predictor Asanovic/DevadasSpring 20026.823Two-Level Branch PredictorPentium Pro uses the result from the last two branchesto select one of the four sets of BHT bits (~90-95% correct)Fetch PC2-bit global branch history shift registerShift in Taken/¬Taken results of each branchTaken/¬Taken?
22P6 Static Branch Prediction Asanovic/DevadasSpring 20026.823P6 Static Branch PredictionIf a branch misses in BTB, then static prediction performedBackwards branch predicted taken, forwards branch predicted not-taken
23P6 Branch Penalties RS Decode BTB predicted taken penalty RS Write BTB Asanovic/DevadasSpring 20026.823P6 Branch PenaltiesRSWriteBTBAccessI-CacheAccessx86uopDecodeRSReadExec.RenameRetireBTBpredictedtaken penaltyuop bufferROBFetch bufferReservationStationBranch resolvedDecode and predictbranch that missed in BTB(backwards taken, forwards not-taken)
24P6 System PCI Bus AGP Bus AGP Graphics Card Memory controller DRAM Asanovic/DevadasSpring 20026.823P6 SystemPCI BusAGP BusAGP Graphics CardMemory controllerDRAMGlueless SMP to 4 procs., split-transactionFrontside busCPUCPUCPUCPUL1 I$ L1 D$L1 I$ L1 D$L1 I$ L1 D$L1 I$ L1 D$Backside busL2 $L2 $L2 $L2 $
25Pentium-III Die Photo 256KB 8-way s.a. 16KB 4-way s.a. D$ Asanovic/DevadasSpring 20026.823ProgrammableExternal and BacksidePacked FP DatapathsInterrupt ControlClockBus LogicPage Miss Handler16KB4-way s.a. D$Integer DatapathsFloating-PointDatapathsMemory OrderBufferMemory Interface Unit (convert floats to/from memory format)MMX DatapathsRegister Alias TableAllocate entries(ROB, MOB, RS)ReservationStationBranchAddress CalcReorder Buffer(40-entry physical regfile + architect. regfile)MicroinstructionSequencer256KB8-way s.a.Instruction Fetch Unit:16KB 4-way s.a. I-cacheInstruction Decoders:3 x86 insts/cycle
26Pentium Pro vs MIPS R10000 Estimates of 30% hit for CISC versus RISC Asanovic/DevadasSpring 20026.823Estimates of 30% hit for CISC versus RISC– compare with original “RISC Advantage” of 2.6“RISC Advantage” decreased because size of out-of-order core largely independent of original ISA