1 Energy-efficient Instruction Dispatch Buffer Design for Superscalar Processors*
Gurhan Kucuk, Kanad Ghose, Dmitry V. Ponomarev
Department of Computer Science, State University of New York, Binghamton, NY
{gurhan, ghose,
Peter M. Kogge
Dept. of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN
International Symposium on Low Power Electronics and Design (ISLPED'01)
* supported in part by DARPA through the PAC-C program and NSF

2 Motivation/Goals
Current trends in microarchitecture: aggressive out-of-order execution, register renaming, multiple FUs, sizable on-chip caches, large register files, ROB, etc.
Impact on energy/power dissipation:
- Absolute power dissipation of the processor is high
- Areal energy/power density of high-end superscalar processors is becoming an immediate, serious concern; it will soon become comparable to that of nuclear reactors
Consequences: intermittent and permanent failures on the die, and serious challenges for cooling facilities/packaging
Goal: limit energy dissipation through technology-independent techniques with no impact on performance

3 Typical Superscalar Datapath

4 The Dispatch Buffer
The Instruction Dispatch Buffer (a.k.a. Issue Queue) is one of the major sources of power dissipation in modern superscalar processors: up to 22% of total chip power.
Major components of power dissipation in the Dispatch Buffer:
1. Dispatch (entry setup: locating free entries plus writing to them), fed from the Decode/Dispatch stage
2. Issue (FU arbitration plus reading the selected instruction from the DB), delivering to the function units
3. Forwarding (tag comparison plus latching), driven from the function units

Breakdown of DB power dissipation:

Component        SPECint 95   SPECfp 95
Issue (I)        50.1%        53.8%
Forwarding (F)   25.7%        28.9%
Dispatch (D)     24.2%        17.3%
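The forwarding step above can be sketched in software: when a function unit completes, it broadcasts the destination physical register tag, and every waiting DB entry compares that tag against its source tags. Each comparison is a potential energy event. This is an illustrative Python model only; the entry layout and names are assumptions, not the paper's hardware.

```python
# Illustrative model of result forwarding in a dispatch buffer.
# Entry structure and field names are hypothetical, for exposition only.

class DBEntry:
    def __init__(self, src_tags):
        # src_tags: physical register numbers this entry still waits on
        self.waiting = set(src_tags)

    def forward(self, tag):
        """Compare the broadcast tag against each waiting source tag.
        Every comparison costs energy in hardware; only a match makes
        an operand ready."""
        comparisons = len(self.waiting)
        matched = tag in self.waiting
        self.waiting.discard(tag)
        return comparisons, matched

# One forwarding broadcast of tag 7 across a small buffer:
db = [DBEntry({3, 7}), DBEntry({9}), DBEntry({7, 12})]
total_cmp = sum(e.forward(7)[0] for e in db)
# 5 comparisons are performed, but only 2 of them match
```

Note how the number of comparisons per broadcast scales with the number of waiting operands, while matches are rare; that asymmetry motivates the comparator redesign on the following slides.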

5 Main Results
60%-plus energy savings within the DB, achieved using three relatively independent techniques:
- Replacing traditional comparators with dissipate-on-match comparators
- Not reading or writing leading zero bytes
- Using bit-line segmentation to reduce bit-line capacitance and dissipation during reads and writes
No impact on cycle time.
Only a 12% increase in the layout area of the DB for a 4-metal-layer, 0.5 micron layout; smaller increase with additional metal layers.

6 Low Power Comparator
Traditional comparators dissipate power on mismatches.
Only 5% of all comparisons match.
This is a major source of power dissipation in the Dispatch Buffer.
[Table: Dispatch Buffer comparator statistics; % of total cases in which the 2, 4, or 6 least significant bits (LSBs), or all 8 bits, match, averaged over SPECint, SPECfp, and all SPEC]

7 Low Power Comparator
[Figure: an 8-bit physical register number broadcast on the forwarding bus to the DB, with matching, waiting, and inactive slots marked; comparator statistics table repeated from slide 6]

8 Low Power Comparator
Idea: design a new comparator that dissipates power only on matches.
New dissipate-on-match comparator: domino logic with a pass-transistor front end.
[Comparator statistics table repeated from slide 6]
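The payoff of the dissipate-on-match design follows directly from the roughly 5% match rate quoted above. A back-of-the-envelope sketch (the per-event energy is an arbitrary unit, not a SPICE measurement):

```python
# Back-of-the-envelope energy comparison of the two comparator styles.
# E_EVENT is an arbitrary unit, not a measured number from the paper.

E_EVENT = 1.0  # energy per dissipating comparison, arbitrary units

def comparator_energy(n_comparisons, match_rate, dissipate_on_match):
    """Total energy if the comparator dissipates only on matches,
    versus only on mismatches (the traditional behavior)."""
    matches = n_comparisons * match_rate
    events = matches if dissipate_on_match else n_comparisons - matches
    return events * E_EVENT

n, rate = 1_000_000, 0.05
traditional = comparator_energy(n, rate, dissipate_on_match=False)
new_design = comparator_energy(n, rate, dissipate_on_match=True)
# with a 5% match rate, the new design dissipates on 19x fewer events
```

Under these assumptions the traditional comparator dissipates on 95% of comparisons and the new one on only 5%, which is why the comparator change alone accounts for a large share of the forwarding-power savings reported later.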

9 Zero-Byte Encoding
Observation: simulated execution of the SPEC 95 benchmarks shows that about half of the byte fields within 32- and 64-bit integer operands (90% of all operands) are all zeros.
Reasons:
- Use of small integer literals (address offsets, literal operands, flags, byte ops, etc.)
- Byte packing and unpacking operations, and the use of bit or byte masks to isolate parts of operands
- Some floating point operands may not use all of the bits allowed in the mantissa field
- Lower-precision data may not make use of the full datapath width

10 Zero-Byte Encoding
Idea: instead of driving a byte of all zeroes, encode it using a ZI (Zero Indicator) bit and drive only that bit, achieving power savings during writes.
[Figure: associated circuit techniques (readout logic)]

11 Zero-Byte Encoding
A stored ZI bit disables reading of the associated byte, avoiding bitline discharge and sense-amp dissipation.
[Figure: associated circuit techniques (readout logic)]

12 Zero-Byte Encoding
[Figure: associated circuit techniques (encoding logic for bytes of all zeroes)]
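The encoding idea across slides 9 to 12 can be sketched functionally: each all-zero byte of an operand is replaced by a set ZI bit, so a write drives one ZI line instead of eight bitlines. This is an illustrative Python sketch; the field widths and the per-byte bitline count are assumptions about a generic bitcell array, not the paper's exact circuit.

```python
# Sketch of zero-byte encoding for a 32-bit operand: each all-zero byte
# is flagged by a set ZI (Zero Indicator) bit, so only that one line is
# driven instead of the byte's 8 bitlines.  Widths are illustrative.

def encode(value, width_bytes=4):
    zi_bits = []
    kept_bytes = []
    for i in range(width_bytes):
        byte = (value >> (8 * i)) & 0xFF
        if byte == 0:
            zi_bits.append(1)          # drive only the ZI bit
        else:
            zi_bits.append(0)
            kept_bytes.append(byte)    # drive all 8 bitlines
    return zi_bits, kept_bytes

def bitlines_driven(zi_bits):
    # one line per ZI bit, plus 8 per byte actually stored
    return len(zi_bits) + 8 * zi_bits.count(0)

zi, data = encode(0x000000FF)   # a small literal: three zero bytes
# zi == [0, 1, 1, 1]; 12 lines driven instead of 32
```

Small literals such as address offsets, the dominant case in the SPEC 95 statistics quoted on slide 9, are exactly the values for which most ZI bits are set.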

13 Bitline Segmentation
The DB is essentially a register file with additional associative logic for data forwarding.
- Write ports: for each instruction dispatched in a cycle, a write port is needed for the entry setup process (from the Decode/Dispatch stage).
- Read ports: for each instruction issued in a cycle, a read port is needed to move the instruction from the DB to an FU (to the function units).
The bitlines associated with each read and write port present a high capacitive load, including a component that varies linearly with the number of rows in the DB. This component is due to the wire capacitance of the bitlines and the diffusion capacitance of the pass transistors that connect the bitcells to the bitlines.

14 Bitline Segmentation
Idea: restructure the DB into segments, lowering the capacitive loading on each segment:
- Each segment is connected to only 16 pass devices
- Wire capacitance is lowered: the wire length of a bitline segment is one fourth of the original bitline
[Figure: bitline-segmented DB]
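A first-order model shows why segmentation helps. The capacitance constants below are illustrative placeholders (not SPICE values), and the 64-row, 4-segment configuration is inferred from the slide's figures of 16 pass devices per segment and a quartered wire length.

```python
# First-order bitline capacitance model (illustrative constants).
# Each of the R rows adds pass-transistor diffusion capacitance C_DIFF
# and a share of bitline wire capacitance C_WIRE; segmentation means an
# access only sees one segment's load plus a short shared trunk.

C_DIFF = 1.0   # diffusion cap per bitcell pass device (arbitrary units)
C_WIRE = 0.5   # wire cap per row of bitline length    (arbitrary units)

def bitline_cap(rows):
    return rows * (C_DIFF + C_WIRE)

def segmented_cap(rows, segments, c_trunk_per_seg=0.5):
    seg_rows = rows // segments
    # active-segment load plus a trunk wire spanning all segments
    return bitline_cap(seg_rows) + segments * c_trunk_per_seg

full = bitline_cap(64)        # unsegmented 64-row bitline
seg = segmented_cap(64, 4)    # 16 rows per segment, 4-segment trunk
# the segmented load is roughly a quarter of the original
```

Since dynamic energy per access scales with the capacitance being switched, cutting the per-access load by this factor translates directly into read/write energy savings.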

15 Evaluation Methodology
- Used a true cycle-by-cycle register-level simulator for a typical superscalar pipeline; Simplescalar was substantially modified for this purpose to mimic real superscalar datapaths.
- Simulated the execution of the SPEC 95 benchmarks, collecting transition counts for each major datapath component.
- Used SPICE measurements of the VLSI layouts of the dispatch buffer and reorder buffer in a 0.5 micron, 4-metal-layer process to estimate the power dissipated by each type of transition within each major component (migrating to 0.18 micron soon!).

16 Evaluation Methodology
[Figure: Datapath Power Estimator]
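The estimator combines the two data sources named on slide 15: simulator transition counts and per-transition energies from SPICE. A minimal sketch of that combination; the event names and energy numbers here are hypothetical placeholders, not the paper's measurements.

```python
# Sketch of a transition-count-based power estimator: energy per event
# comes from SPICE characterization, event counts from the simulator.
# All event names and numbers below are hypothetical placeholders.

ENERGY_PER_TRANSITION = {      # picojoules per event (made-up values)
    "db_write": 12.0,
    "db_read": 9.0,
    "tag_compare": 0.8,
}

def estimate_power(transition_counts, cycles, clock_hz):
    """Average power = total switching energy / elapsed time."""
    energy_pj = sum(ENERGY_PER_TRANSITION[ev] * n
                    for ev, n in transition_counts.items())
    seconds = cycles / clock_hz
    return energy_pj * 1e-12 / seconds     # watts

counts = {"db_write": 4_000, "db_read": 4_000, "tag_compare": 250_000}
p = estimate_power(counts, cycles=10_000, clock_hz=500e6)
```

This technology-characterized-but-event-driven split is what lets the same simulator runs be re-costed for a new process (e.g. the planned 0.18 micron migration) by swapping only the energy table.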

17 Results: Traditional vs. New Comparator
Power dissipation within the DB during forwarding.
[Chart: power dissipation (mW); labeled savings: 44%, 49%, 45%]

18 Results: Traditional vs. New Comparator
Total power dissipation within the DB.
[Chart: power dissipation (mW); labeled savings: 11%, 14%, 12%]

19 Results: Zero-Byte Encoding and Bitline Segmentation
Power dissipation within the DB during instruction dispatch.
[Chart: power dissipation (mW); labeled savings: 54%, 26%, 61%; 53%, 21%, 59%; 53%, 23%, 60%]

20 Results: Zero-Byte Encoding and Bitline Segmentation
Power dissipation within the DB during instruction issue.
[Chart: power dissipation (mW); labeled savings: 41%, 32%, 58%; 41%, 35%, 60%; 40%, 41%, 62%]

21 Results: Zero-Byte Encoding, Bitline Segmentation and New Comparators
Total power dissipation within the DB.
[Chart: power dissipation (mW); labeled savings: 31%, 44%, 59%; 32%, 46%, 60%; 33%, 50%, 61%]

22 Related Work
- Zero-byte encoding of function unit results (Brooks and Martonosi, 1999)
- Zero-byte compression on buses, register files, DB, and ROB in a superscalar datapath (Ponomarev, Ghose, Kucuk, Kogge, and Toomarian, 2000)
- Zero-byte compression for I-caches (Villa, Zhang, and Asanovic, 2000)
- Zero-byte compression in a simple scalar datapath (Canal, Gonzalez, and Smith, 2000)
- Dynamic resizing of the issue queue (Buyuktosunoglu, Albonesi, Schuster, Brooks, Bose, and Cook; Folegnani and Gonzalez, 2001)
- Dynamic resizing of the dispatch buffer and reorder buffer (Ponomarev, Kucuk, and Ghose, 2001)

23 Conclusion
We studied three relatively independent techniques to reduce the energy dissipation in the instruction dispatch buffers of modern superscalar processors:
1. New comparators that dissipate energy mainly on tag matches
2. Zero-byte encoding to reduce the number of bitlines that have to be driven during instruction dispatch and issue, as well as during forwarding of results to the waiting instructions in the DB
3. Bitline segmentation to reduce the length of the bitlines (reducing wire and diffusion capacitances)
Total power reduction is about 60%.
The DB power reductions are achieved without compromising the cycle time and with only a modest growth in the area of the DB (about 12%).
Our studies also show that the techniques that reduce DB power can be used to achieve reductions of a similar scale in other datapath structures that use associative addressing (such as the Reorder Buffer and the LOAD/STORE Queue).
