Intel Pentium M
Outline
- History
- P6 pipeline in detail
- New features
  - Improved branch prediction
  - Micro-op fusion
  - SpeedStep technology
  - Thermal Throttle 2
- Power and performance
Quick Review of x86
- 8080 – 8-bit
- 8086/8088 – 16-bit (the 8088 had an 8-bit external data bus); segmented memory model
- 286 – introduced protected mode, which included segment limit checking, privilege levels, and read-only and execute-only segment options
- 386 – 32-bit; segmented and flat memory models; paging
- 486 – first pipeline: expanded the 386's ID and EX units into a five-stage pipeline; first to include on-chip cache; integrated the x87 FPU (previously a coprocessor)
- Pentium (586) – first superscalar, with two pipelines (u and v); virtual-8086 mode extensions; MMX soon after
- Pentium Pro (686 or P6) – three-way superscalar; dynamic execution (out-of-order execution, branch prediction, speculative execution); a very successful microarchitecture
- Pentium II and III – both P6
- Pentium 4 – new NetBurst architecture
- Pentium M – enhanced P6
Pentium Pro Roots
- NexGen 586 (1994)
  - Decomposed IA-32 instructions into simpler RISC-like operations (R-ops, or micro-ops)
  - Decoupled approach
  - NexGen was later bought by AMD
- AMD K5 (1995) – also used micro-ops
- Intel Pentium Pro – Intel's first use of a decoupled architecture
Pentium M Overview
- Introduced March 12, 2003
- Initially codenamed Banias
- Created by Intel's Israeli design team
- Missed its deadline by less than 5 days
- Marketed under Intel's Centrino initiative
- Based on the P6 microarchitecture
P6 Pipeline in a Nutshell
- Divided into three clusters (front, middle, back):
  - In-order front-end
  - Out-of-order execution core
  - Retirement
- Each cluster is independent
  - E.g., if a mispredicted branch is detected in the front-end, the front-end flushes and refetches from the corrected branch target, all while the execution core continues working on previous instructions
P6 Pipeline in a Nutshell (diagram slide)
P6 Front-End
- Major units: IFU, ID, RAT, Allocator, BTB, BAC
- Fetching (IFU)
  - Includes the I-cache, instruction streaming buffer, ITLB, and ILD
  - No pre-decoding; instruction boundaries are marked by the instruction-length decoder (ILD)
- Branch prediction
  - Predicted (speculative) instructions are marked
- Decoding (ID)
  - Conversion of instructions (macro-ops) into micro-ops
- Allocation of buffer entries: RS, ROB, MOB
P6 Execution Core
- Reservation station (RS)
  - Holds waiting micro-ops until they are ready to go
- Scheduler
  - Dispatches micro-ops out of order to independent execution units (EUs)
- Must be careful about out-of-order memory accesses
  - The memory ordering buffer (MOB) interfaces to the memory subsystem
- Requirements for execution: available operands, an available EU, and an available write-back bus
  - Micro-ops are scheduled for optimal performance
P6 Retirement
- In-order updating of architected machine state
- Re-order buffer (ROB)
- Micro-op retirement is "all or none"
  - Architecturally illegal to retire only part of an IA-32 instruction
- In-order handling of exceptions
  - Legal to handle mid-execution, but illegal to handle mid-retirement
PM Changes to P6
- Most changes were made in the P6 front-end
- Added to and expanded on the P4 branch predictor
- Micro-op fusion
- Addition of a dedicated stack engine
- Pipeline length
  - Longer than the P3's, shorter than the P4's
  - Accommodates the extra features above
PM Changes to P6, cont.
- Intel has not released the exact length of the pipeline
  - Known to be somewhere between the P4's (20 stages) and the P3's (10 stages); rumored to be 12 stages
- Trades slightly lower clock frequencies (than the P4) for better performance per clock, smaller branch misprediction penalties, etc.
Blue Man Group Commercial Break
Banias
- First version of the Pentium M
- 77 million transistors, 23 million more than the P4
- 1 MB on-die L2 cache
- 400 MHz FSB (quad-pumped 100 MHz)
- 130 nm process
- Frequencies between 1.3 and 1.7 GHz
- Thermal design power of 24.5 W
http://www.intel.com/pressroom/archive/photos/centrino.htm
Dothan
- Launched May 10, 2004
- 140 million transistors
- 2 MB L2 cache
- 400 or 533 MHz FSB
- Frequencies between 1.0 and 2.26 GHz
- Thermal design power of 21 W (400 MHz FSB) to 27 W
http://www.intel.com/pressroom/archive/photos/centrino.htm
Dothan cont.
- 90 nm process technology on 300 mm wafers
  - 300 mm wafers provide twice the capacity of 200 mm wafers, while the smaller process dimensions double the transistor density
- Gate dimensions are 50 nm, approximately half the diameter of the influenza virus
- P- and n-gate voltages are reduced by enhancing the carrier mobility of the Si lattice by 10–20%
- Draws less than 1 W average power
Bus
- Utilizes a split-transaction, deferred-reply protocol
- 64 bits wide
- Delivers up to 3.2 GB/s (Banias) or 4.2 GB/s (Dothan) in and out of the processor
- Utilizes source-synchronous transfer of addresses and data
  - Data is transferred 4 times per bus clock
  - Addresses can be delivered 2 times per bus clock
Bus Update in Dothan (diagram slide)
http://www.intel.com/technology/itj/2005/volume09issue01/art05_perf_power
L1 Cache
- 64 KB total
  - 32 KB instruction
  - 32 KB data (4 times the P4-M's)
- Write-back, vs. write-through on the P4
  - In a write-through cache, data is written to both L1 and main memory simultaneously
  - In a write-back cache, data can be modified in L1 without immediately writing to main memory, increasing speed by reducing the number of slow memory writes
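The policy difference above can be sketched with a toy model that counts how many writes actually reach main memory. This is an illustrative simplification (a single cache line, no evictions forced by conflicts), not the Pentium M's real cache logic.

```python
# Toy model contrasting write-through and write-back for one cache line.
class WriteThroughLine:
    def __init__(self):
        self.mem_writes = 0

    def store(self, value):
        self.value = value
        self.mem_writes += 1       # every store also goes to memory

class WriteBackLine:
    def __init__(self):
        self.mem_writes = 0
        self.dirty = False

    def store(self, value):
        self.value = value
        self.dirty = True          # memory is updated only on eviction

    def evict(self):
        if self.dirty:
            self.mem_writes += 1   # one write covers all prior stores
            self.dirty = False

wt, wb = WriteThroughLine(), WriteBackLine()
for v in range(8):                 # eight stores to the same line
    wt.store(v)
    wb.store(v)
wb.evict()

print(wt.mem_writes)  # 8 memory writes under write-through
print(wb.mem_writes)  # 1 memory write under write-back
```

Eight stores to a hot line cost eight bus transactions under write-through but only one under write-back, which is exactly the saving the slide describes.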
L2 Cache
- 1–2 MB
- 8-way set associative
- Each way is divided into 4 separate power quadrants
  - Each individual power quadrant can be put into a sleep mode, shutting off power to that quadrant
  - Allows only 1/32 of the cache to be powered at any time
- Trades increased latency for improved power consumption
Prefetch
- Prefetch logic fetches data into the L2 cache before L1 cache requests occur
- Reduces compulsory misses by increasing the amount of valid data in the cache
- Reduces bus cycle penalties
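A minimal sketch of the idea: a next-line prefetcher that, on every L2 access, pulls the following sequential line into the cache. The real Pentium M prefetch heuristics are more sophisticated; this hypothetical version just shows why a streaming access pattern stops missing.

```python
# Next-line prefetcher sketch: each access speculatively fetches line+1.
l2 = set()
hits = misses = 0

def access(line, prefetch=True):
    global hits, misses
    if line in l2:
        hits += 1
    else:
        misses += 1
        l2.add(line)           # demand fill on a miss
    if prefetch:
        l2.add(line + 1)       # speculatively fetch the next sequential line

for line in range(16):         # a sequential streaming access pattern
    access(line)

print(misses)  # 1 : only the first access misses
print(hits)    # 15 : the rest hit on prefetched lines
```

Without the prefetch step, every one of the 16 cold accesses would be a compulsory miss.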
Schedule
- P6 pipeline in detail
  - Front-end
  - Execution core
  - Back-end
- Power issues
  - Intel SpeedStep
- Testing the features
  - x86 system registers
  - Performance testing
P6 Front-end: Instruction Fetching
- IA-32 memory management
  - Classic segmented model (cannot be disabled in protected mode)
    - Separation of code, data, and stack into "segments"
  - Optional paging
    - Segments are divided into pages (typically 4 KB)
    - Adds protection on top of segment protection, e.g. read/write protection on a page-by-page basis
- Stage 11 (stage 1) – selection of the address for the next I-cache access
  - Speculation – the address is chosen from competing sources (BTB, BAC, loop detector, etc.)
  - Calculation of the linear address from the logical address (segment selector + offset)
    - Segment selector – an index into a table of segment descriptors, each of which includes the base address, size, type, and access rights of the segment
    - Remember: there are only six segment registers, so only six segments are usable at a time
    - 32-bit code nowadays uses the flat model, so an OS can make do with only a few (typically four) segments
  - The IFU chooses the address with the highest priority and sends it to the next stage
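The logical-to-linear calculation above can be sketched as a table lookup plus an add with a limit check. The selector values and descriptor contents here are hypothetical, chosen only to illustrate the mechanism.

```python
# Sketch of IA-32 logical-to-linear address translation: the segment
# selector indexes a descriptor table; linear = base + offset, with the
# offset checked against the segment limit.
descriptor_table = {
    0x08: {"base": 0x0000_0000, "limit": 0xFFFF_FFFF},  # flat code segment
    0x10: {"base": 0x0040_0000, "limit": 0x000F_FFFF},  # small data segment
}

def to_linear(selector, offset):
    desc = descriptor_table[selector]
    if offset > desc["limit"]:
        # a real CPU would raise a general-protection fault (#GP)
        raise ValueError("segment limit violation")
    return (desc["base"] + offset) & 0xFFFF_FFFF        # 32-bit wrap

print(hex(to_linear(0x10, 0x1234)))  # 0x401234
print(hex(to_linear(0x08, 0x1234)))  # 0x1234 (flat model: base 0)
```

With a flat-model descriptor (base 0, limit 4 GB), the translation degenerates to the identity, which is why modern 32-bit OSes can effectively ignore segmentation.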
- Stages 12–13 – accessing of caches
  - Accesses the instruction caches with the address calculated in stage 11
    - Includes the standard cache, victim cache, and streaming buffer
  - With paging, consults the ITLB to determine the physical page number (tag bits)
    - Without paging, the linear address from stage 11 becomes the physical address
  - Obtains a branch prediction from the branch target buffer (BTB)
    - The BTB takes two cycles to complete one access
  - Instruction boundary (ILD) and BTB markings
- Stage 14 – completion of the instruction cache access
  - Instructions and their marks are sent to the instruction buffer or steered to the ID
P6 Front-end: Instruction Fetching (diagram slide)
P6 Front-end: Instruction Decoding
- Stages 15–16 – decoding of IA-32 instructions
  - Alignment of instruction bytes
  - Identification of the ends of up to three instructions
  - Conversion of instructions into micro-ops
- Stage 17 – branch decoding
  - If the ID notices a branch that went unpredicted by the BTB (i.e., the BTB had never seen the branch before), it flushes the in-order pipe and re-fetches from the branch target
    - The branch target is calculated by the BAC
    - This early catch keeps bogus speculative instructions from being sent through the pipeline
- Stage 21 – register allocation and renaming
  - Overlaps stage 17 (a reminder that the units work independently)
  - The Allocator allocates the required entries in the ROB, RS, LB, and SB
  - The Register Alias Table (RAT) is consulted
    - Maps logical sources/destinations to physical entries in the ROB (or sometimes the RRF)
- Stage 22 – completion of the front-end
  - Marked micro-ops are forwarded to the RS and ROB, where they await execution and retirement, respectively
P6 Front-end: Instruction Decoding (diagram slide)
Register Alias Table Introduction
- Provides register renaming of integer and floating-point registers and flags
- Maps logical (architected) entries to physical entries, usually in the re-order buffer (ROB)
  - The physical entries are actually allocated by the Allocator
- The physical entry pointers become part of the micro-op's overall state as it travels through the pipeline
RAT Details
- The P6 is 3-way superscalar, so the RAT must be able to rename up to six logical sources per cycle
- Any data dependences must be handled. Ex:
  - op1) ADD EAX, EBX, ECX (dest. = EAX)
  - op2) ADD EAX, EAX, EDX
  - op3) ADD EDX, EAX, EDX
- Instead of making op2 wait for op1 to retire, the RAT provides data forwarding
- The same holds for op3, but the RAT must make sure it gets EAX's value from op2 and not op1
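The three-op example can be traced through a toy RAT: each destination write claims a fresh ROB entry, and each source reads the newest mapping first, so op3 correctly picks up EAX from op2. This is a deliberately simplified model, not Intel's implementation.

```python
# Toy RAT: logical register -> ROB entry holding its newest value.
rat = {}
rob_next = 0

def rename(dest, srcs):
    """Rename one micro-op; returns (new ROB entry, renamed sources)."""
    global rob_next
    mapped = [rat.get(s, s) for s in srcs]  # read current mappings first
    rat[dest] = f"rob{rob_next}"            # then claim a fresh entry
    rob_next += 1
    return rat[dest], mapped

print(rename("EAX", ["EBX", "ECX"]))  # ('rob0', ['EBX', 'ECX'])
print(rename("EAX", ["EAX", "EDX"]))  # ('rob1', ['rob0', 'EDX'])
print(rename("EDX", ["EAX", "EDX"]))  # ('rob2', ['rob1', 'EDX'])
```

Note that op3's EAX source maps to rob1 (op2's result), not rob0 (op1's), which is exactly the dependence-tracking the slide describes; undoing these mappings on a misprediction is what makes speculative renaming hard.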
RAT Implementation Difficulties
- Speculative renaming
  - Since speculative micro-ops flow by, the RAT must be able to undo its mappings in the case of a branch misprediction
- Partial-width register reads and writes
  - Consider a partial-width write followed by a larger-width read
  - The data required by the read is an assimilation of multiple previous writes to the register – to be safe, the RAT must stall the pipeline
- Retirement overrides
  - A common interaction between the RAT and the ROB
  - When a micro-op retires, its ROB entry is removed and its result may be latched into an architected destination register
  - If any active micro-ops source the retired op's destination, they must not reference the outdated ROB entry
- Mismatch stalls
  - Associated with flag renaming
The Allocator
- Works in conjunction with the RAT to allocate required entries
- In each cycle, it assumes it will need three ROB, three RS, three LB, and two SB entries
  - Once the micro-ops arrive, it determines how many entries are really needed
- ROB allocation
  - If three entries aren't available, the allocator stalls
- RS allocation
  - A bitmap is used to determine which entries are free
  - If the RS is full, the pipeline is stalled
  - The RS must make sure valid entries are not overwritten
- MOB allocation
  - Allocation of LB and SB entries is also done by the allocator
PM Changes to P6 Front-End
- Micro-op fusion
- Dedicated stack engine
- Enhanced branch prediction
- Additional stages
  - How many is Intel's secret
  - Most likely required for the extra functionality above
Micro-op Fusion
- Fusion of multiple micro-ops into one micro-op
  - Less contention for buffer entries
  - Similar in spirit to SIMD data packing
- Two examples of fusion from Intel documentation:
  - IA-32 load-and-operate instructions
  - IA-32 store instructions
- Not known for certain whether these are the only cases of fusion
- Possibly inspired by the MacroOps used in the K7 (Athlon)
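The buffer-pressure benefit can be shown with back-of-the-envelope accounting: a load-and-operate (or a store, which normally splits into store-address and store-data µops) occupies one tracking entry when fused instead of two. The two-µop costs assumed here follow the two documented fusion cases; the trace itself is made up.

```python
# Rough accounting of ROB entries with and without micro-op fusion.
def rob_entries(instructions, fusion):
    total = 0
    for kind in instructions:
        if kind in ("load_op", "store") and fusion:
            total += 1   # fused: the µop pair shares one entry
        elif kind in ("load_op", "store"):
            total += 2   # unfused: load+op, or store-addr + store-data
        else:
            total += 1   # simple instruction -> one µop
    return total

trace = ["load_op", "alu", "store", "load_op", "alu"]
print(rob_entries(trace, fusion=False))  # 8
print(rob_entries(trace, fusion=True))   # 5
```

For this small trace, fusion cuts ROB occupancy from 8 entries to 5, which is the "less contention for buffer entries" point above.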
Dedicated Stack Engine
- Traditional out-of-order implementations update the stack pointer register (ESP) by sending a µop to update ESP with every stack-related instruction
- Pentium M implementation
  - A delta register (ESP_D) is maintained in the front-end
  - A historic ESP (ESP_O) is kept in the out-of-order execution core
  - Dedicated logic updates the ESP by adding ESP_O and ESP_D
Improvements
- The ESP_O value kept in the out-of-order machine is not changed during a sequence of stack operations; this allows more parallelism opportunities to be realized
- Since ESP_D updates are now done by a dedicated adder, the execution units are free to work on other µops, and the ALUs are freed for more complex operations
- Decreased power consumption, since large adders are not used for small operations and the eliminated µops do not toggle through the machine
- Approximately 5% of µops are eliminated
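The scheme above can be sketched in a few lines: pushes and pops adjust only the front-end delta, and the architectural ESP is materialized on demand as ESP_O + ESP_D. A simplified illustration, assuming 32-bit pushes and ignoring the speculation-recovery table.

```python
# Dedicated stack engine sketch: ESP = ESP_O + ESP_D.
esp_o = 0x1000     # historic ESP held in the out-of-order core
esp_d = 0          # running delta maintained in the front-end

def push():
    global esp_d
    esp_d -= 4     # tracked by the small front-end adder; no ESP µop issued

def pop():
    global esp_d
    esp_d += 4

def arch_esp():
    return esp_o + esp_d   # materialized only when actually needed

push(); push(); push(); pop()
print(hex(arch_esp()))  # 0xff8 : three pushes and a pop, zero ESP-update µops
```

Four stack operations generated zero ESP-update µops for the execution core, which is where the roughly 5% µop reduction comes from.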
Complications
- Since the new adder lives in the front-end, all of its calculations are speculative
  - This necessitates the addition of a recovery table for all values of ESP_O and ESP_D
- If the architectural value of ESP is needed inside the out-of-order machine, the decode logic needs to insert a µop that carries out the ESP calculation
Branch Prediction
- Longer pipelines mean higher penalties for mispredicted branches
- Improvements yield added performance and hence less energy spent per instruction retired
Branch Prediction in Pentium M
- Enhanced version of the Pentium 4 predictor
- Two branch predictors added that run in tandem with the P4 predictor:
  - Loop detector
  - Indirect branch predictor
- 20% lower misprediction rate than the PIII, resulting in up to a 7% gain in real performance
Branch Prediction (diagram slide)
Based on a diagram found here: http://www.cpuid.org/reviews/PentiumM/index.php
Loop Detector
- A predictor that always predicts taken for a loop branch will always mispredict the final iteration
- The detector analyzes branches for loop behavior
- Benefits a wide variety of program types
http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p05_branch.htm
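A sketch of the idea: once a branch's trip count has been learned from past executions, the predictor can say "taken" for that many iterations and "not taken" on the exit, removing the end-of-loop misprediction. This is a hypothetical simplification of the Pentium M mechanism (the learning phase is assumed to have already happened).

```python
# Loop detector sketch: predict taken until the learned trip count.
class LoopPredictor:
    def __init__(self, trip_count):
        self.trip_count = trip_count   # learned from earlier executions
        self.iteration = 0

    def predict(self):
        self.iteration += 1
        if self.iteration < self.trip_count:
            return True                # predict: loop back
        self.iteration = 0
        return False                   # predict: loop exit

pred = LoopPredictor(trip_count=10)
outcomes = [i < 9 for i in range(10)]           # taken 9 times, then exit
mispredicts = sum(pred.predict() != o for o in outcomes)
print(mispredicts)  # 0 : an always-taken predictor would miss the exit once
```

For a 10-iteration loop, an always-taken scheme mispredicts once per loop execution (10% of that branch's dynamic instances); the loop detector mispredicts zero times once the count is stable.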
Indirect Branch Predictor
- Picks targets based on global control-flow history
- Benefits programs compiled to branch to calculated addresses
http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p05_branch.htm
Reservation Station
- Used as a store where µops wait for their operands and execution units to become available
- Consists of 20 entries
- The control portion of an entry can be written from one of three ports
- The data portion can be written from one of six available ports:
  - 3 for ROB reads
  - 3 for EU write-backs
- The scheduler uses this information to schedule up to 5 µops at a time
- During pipeline stage 31, entries that are ready for dispatch are sent to stage 32
Cancellation
- The reservation station assumes that all cache accesses will be hits
- In the case of a cache miss, micro-ops that are dependent on the write-back data need to be cancelled and rescheduled at a later time
- Cancellation can also occur due to a future resource conflict
Retirement
- Takes 2 clock cycles to complete
- Utilizes the re-order buffer (ROB) to control retirement, or completion, of µops
- The ROB is a multi-ported register file with separate ports for:
  - Allocation-time writes of µop fields needed at retirement
  - Execution unit write-backs
  - ROB reads of sources for the reservation station
  - Retirement-logic reads of speculative result data
- Consists of 40 entries, each 157 bits wide
- The ROB participates in:
  - Speculative execution
  - Register renaming
  - Out-of-order execution
Speculative Execution
- The ROB buffers results of the execution units before commit
- Allows the maximum rate of fetch and execute by assuming that branch prediction is perfect and no exceptions have occurred
- If a misprediction occurs:
  - Speculative results stored in the ROB are immediately discarded
  - The microengine restarts from the committed state in the ROB
Register Renaming
- Entries in the ROB that will hold the results of speculative µops are allocated during stage 21 of the pipeline
- In stage 22, the sources for the µops are delivered based upon the allocation in stage 21
- Data is written to the ROB by the execution unit, into the renamed register, during stage 83
Out-of-order Execution
- Allows µops to complete and write back their results without concern for other µops executing simultaneously
- The ROB reorders the completed µops into the original sequence and updates the architectural state
- Entries in the ROB are treated as a FIFO during retirement
  - µops are allocated in sequential order, so retirement also follows the original program order
- Happens during pipeline stages 92 and 93
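The FIFO retirement rule can be sketched with a deque: µops may be marked done in any order, but only the entry at the head may retire, so architectural state is always updated in program order. A simplified model for illustration.

```python
# ROB sketch: out-of-order completion, strictly in-order retirement.
from collections import deque

rob = deque()                       # entries appended in program order

def allocate(name):
    entry = {"name": name, "done": False}
    rob.append(entry)
    return entry

def retire():
    retired = []
    while rob and rob[0]["done"]:   # only the head of the FIFO may retire
        retired.append(rob.popleft()["name"])
    return retired

a, b, c = allocate("op1"), allocate("op2"), allocate("op3")
c["done"] = True                    # op3 finishes first, out of order
print(retire())                     # [] : op1 still blocks the head
a["done"] = True
b["done"] = True
print(retire())                     # ['op1', 'op2', 'op3'] : program order
```

Even though op3 completed first, nothing retires until op1 and op2 are done, which is what makes precise exceptions and misprediction recovery possible.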
Exception Handling
- Events are sent to the ROB by the EU during stage 83
- Results sent to the ROB from the execution unit are speculative, so any exceptions encountered may not be real
- If the ROB determines that a branch prediction was incorrect, it inserts a clear signal just before the retirement of that operation and then flushes all the speculative operations from the machine
- If the speculation is correct, the ROB invokes the correct microcode exception handler
  - All event records are saved to allow the handler to repair the result or invoke the correct macro handler
  - Pointers to the macro- and micro-instructions are also needed to allow the program to resume after the event handler completes
- If the ROB retires an operation that faults, both the in-order and out-of-order sections are cleared; this happens during pipeline stages 93 and 94
Memory Subsystem
- Memory ordering buffer (MOB)
  - Execution is out of order, but memory accesses cannot simply be done in any order
  - Contains mainly the load buffer (LB) and the store buffer (SB)
- Speculative loads and stores
  - Not all loads can be speculative
    - E.g., a load from memory-mapped I/O could have unrecoverable side effects
  - Stores are never speculative (overwritten bits can't be recovered)
    - To improve performance, stores are queued in the store buffer (SB) to allow pending loads to proceed
    - Similar to a write-back cache
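The store-buffer idea can be sketched as a queue of pending (address, value) pairs: stores wait there until retirement, and a later load to the same address is forwarded the youngest queued value. A simplified model, not the actual MOB design.

```python
# Store buffer sketch with store-to-load forwarding.
store_buffer = []   # (address, value) pairs, oldest first
memory = {}

def store(addr, value):
    store_buffer.append((addr, value))   # queued, not yet in memory

def load(addr):
    for a, v in reversed(store_buffer):  # youngest matching store wins
        if a == addr:
            return v
    return memory.get(addr, 0)

def commit_stores():                     # done only at retirement
    while store_buffer:
        a, v = store_buffer.pop(0)
        memory[a] = v

store(0x100, 1)
store(0x100, 2)
print(load(0x100))    # 2 : forwarded from the store buffer
print(memory)         # {} : nothing has reached memory yet
commit_stores()
print(memory[0x100])  # 2
```

Because the load is satisfied from the buffer, it does not have to wait for the stores to commit, yet memory is never updated speculatively.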
51
Schedule
P6 Pipeline in detail
Front-End
Execution Core
Back-End
Power Issues
Intel SpeedStep
Testing the Features
x86 system registers
Performance Testing
52
Power Issues
Power use = α × C × V² × F
α = activity factor
C = effective capacitance
V = voltage
F = operating frequency
Power use can be reduced linearly by lowering frequency or capacitance, and quadratically by scaling voltage
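To make the linear-vs-quadratic point concrete, here is the α·C·V²·F model evaluated at two operating points. The α and C values are illustrative placeholders, not measured Pentium M parameters; the voltage and frequency pairs are the 1.6GHz part's limits quoted later in this deck.

```python
# Dynamic CMOS power model from the slide: P = alpha * C * V^2 * F.
# alpha and C below are hypothetical values chosen for illustration.

def dynamic_power(alpha, cap_farads, volts, freq_hz):
    """Switching power in watts under the alpha*C*V^2*F model."""
    return alpha * cap_farads * volts ** 2 * freq_hz

# Hypothetical full-speed operating point: 1.6 GHz at 1.484 V.
p_high = dynamic_power(alpha=0.2, cap_farads=30e-9, volts=1.484, freq_hz=1.6e9)

# Lowest operating point: 600 MHz at 0.956 V.
p_low = dynamic_power(alpha=0.2, cap_farads=30e-9, volts=0.956, freq_hz=0.6e9)

# Frequency alone gives 1.6/0.6 ~ 2.7x; the voltage drop adds another
# (1.484/0.956)^2 ~ 2.4x, roughly 6.4x combined.
print(p_high / p_low)
```

The ratio is independent of α and C, which cancel — only the V²·F scaling matters for comparing states of the same chip.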
53
Mobile Use
Mobile workloads are bursty – full power is only necessary for brief periods
Intel developed SpeedStep technology to take advantage of this fact and reduce power consumption during periods of inactivity
http://www.intel.com/technology/itj/2003/volume07issue02/art05_power/p05_thermal.htm
54
SpeedStep I and II
Used in previous generations
Only two states:
High performance (high-frequency mode)
Lower power use (low-frequency mode)
Problems:
Slow transition times
Limited opportunity for optimization
55
Pentium M Goals
Optimize for performance when plugged in
Optimize for long battery life when unplugged

Model | Frequency (max / min) | Vcore (max / min)
Pentium M 1.6GHz | 1.6GHz / 600MHz | 1.484V / 0.956V
Pentium M 1.5GHz | 1.5GHz / 600MHz | 1.484V / 0.956V
Pentium M 1.4GHz | 1.4GHz / 600MHz | 1.484V / 0.956V
Pentium M 1.3GHz | 1.3GHz / 600MHz | 1.388V / 0.956V
Pentium M 1.1GHz Low Voltage | 1.1GHz / 600MHz | 1.180V / 0.956V
Pentium M 900MHz Ultra Low Voltage | 900MHz / 600MHz | 1.004V / 0.844V
56
SpeedStep III
Optimized to fix limitations of previous generations
Three innovations:
Voltage–frequency switching separation
Clock partitioning and recovery
Event blocking

The 6 states of the Pentium M 1.6GHz:
Freq. | Volt.
1.6GHz | 1.484V
1.4GHz | 1.420V
1.2GHz | 1.276V
1GHz | 1.164V
800MHz | 1.036V
600MHz | 0.956V
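Combining the state table above with the P = α·C·V²·F model from the "Power Issues" slide gives a rough feel for how much each state saves. Treating α and C as constant across states is a simplification, but it lets the fixed factors cancel:

```python
# Relative dynamic power of the six Pentium M 1.6GHz SpeedStep states,
# using P ~ V^2 * F with alpha and C assumed constant (a simplification).

states = [  # (frequency in GHz, core voltage in volts) from the slide table
    (1.6, 1.484), (1.4, 1.420), (1.2, 1.276),
    (1.0, 1.164), (0.8, 1.036), (0.6, 0.956),
]

def relative_power(freq, volt, states=states):
    """Power of one state as a fraction of the top (first) state's power."""
    top_freq, top_volt = states[0]
    return (volt ** 2 * freq) / (top_volt ** 2 * top_freq)

for freq, volt in states:
    print(f"{freq:.1f} GHz @ {volt:.3f} V -> {relative_power(freq, volt):.0%}")
```

Under this model the 600MHz/0.956V state runs at roughly a sixth of the full state's dynamic power, which is why stepping voltage together with frequency beats frequency throttling alone.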
57
Voltage–Frequency Switching Separation
Voltage is stepped up and down incrementally
This prevents clock noise and allows the processor to remain responsive during the transition
Once the voltage target is reached, the frequency is throttled
http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p10_speedstep.htm
58
Clock Partitioning and Recovery
During a transition, only the core clock and phase-locked loop are stopped
This keeps the remaining logic active even while the core clock is stopped
http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p10_speedstep.htm
59
Event Blocking
To prevent loss of events while the core clock is stopped during frequency and voltage scaling, interrupts, pin events, and snoop requests are sampled and saved
These events are retransmitted once the core clock becomes available again
http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p10_speedstep.htm
60
Leakage
Transistors in the off state still draw current
As transistors shrink and clock speeds increase, transistors leak more current, causing higher temperatures and greater power use
61
Strained Silicon http://www.research.ibm.com/resources/press/strainedsilicon/
62
Benefits of Strained Silicon
Electrons flow up to 70% faster due to reduced resistance
This leads to chips that are up to 35% faster, without a decrease in chip size
Intel's "uni-axial" strained silicon process reduces leakage by at least five times without reducing performance – the 65nm process will bring a further reduction of at least four times
63
High-K Transistor Gate Dielectric (coming soon)
The gate dielectric used since the 1960s, silicon dioxide, is now so thin that leakage is a significant problem
Intel has developed a high-k (high dielectric constant) material to replace silicon dioxide
This high-k material reduces leakage by a factor of 100 compared with silicon dioxide
64
More Advances to Expect
Continued lowering of capacitance has helped reduce power consumption
Tri-gate transistors decrease leakage by increasing the surface area through which electrons flow
65
Schedule
P6 Pipeline in detail
Front-End
Execution Core
Back-End
Power Issues
Intel SpeedStep
Testing the Features
x86 system registers
Performance Testing
66
x86 System Registers
EFLAGS – various system flags
CPUID – exposes the processor type and available features
Model-Specific Registers (MSRs) – accessed with rdmsr and wrmsr
Examples:
Enabling/disabling SpeedStep
Determining and changing voltage/frequency points
More
67
Performance Testing
P4 2.2GHz vs. PM 1.6GHz
68
Benchmark
69
Battery Life
70
Pentium M vs AMD Turion
72
Gaming
73
Battery Life
74
Future Processors
Yonah
Dual-core processor
Manufactured on a 65nm process
Starting at 2.16GHz with a 667MHz FSB (166MHz quad-pumped)
Shared 2MB L2 cache
Increased floating-point performance with SSE3 instructions
Merom
Based on the EM64T ISA
Consumes ~0.5W of power, half of what the Dothan consumes
Possibility of laptops with 10 hours of battery life