2015-05-26. Ryota Shioya, Masahiro Goshima and Hideki Ando, MICRO-47. Presented by Kihyuk Sung.

2 For single-thread performance in mobile
– Out-of-order superscalar processors consume much more energy than in-order processors.
– Dynamic instruction scheduling: issue queue, reorder buffer, load/store queue.
Proposal: Front-end Execution Architecture (FXA)
– In-Order Execution Unit (IXU)
– Out-of-Order Execution Unit (OXU)
– The IXU and the OXU are placed in series.

5 The IXU functions as a filter for the OXU.
In-Order Execution Unit (IXU)
– Checks whether each instruction is ready, that is, whether its operands can be:
  read from the physical register file (PRF), or
  bypassed from the functional units (FUs) in the IXU.
– Depending on whether an instruction is ready, it is processed as follows:
  A ready instruction is executed and is not dispatched to the issue queue (IQ).
  A not-ready instruction goes through the IXU as a NOP and is then dispatched to the IQ. (No stall.)
– The instruction is committed as in a conventional superscalar processor (reorder buffer).
Out-of-Order Execution Unit (OXU)
– Executes instructions in the same way as a conventional superscalar processor.
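The filtering rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' simulator: the `Instr` record, the `ixu_filter` function and the register names are hypothetical, every instruction is assumed to take one cycle, and pipeline stages and FU counts are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str      # label used only for this illustration
    dest: str      # destination register
    srcs: tuple    # source registers


def ixu_filter(instrs, ready_regs):
    """Split an in-order instruction stream into IXU-executed and IQ-dispatched.

    ready_regs: registers whose values are already available in the PRF.
    Assumption: a result produced in the IXU is visible to later instructions
    (bypassing), while a result left to the OXU is not visible inside the IXU.
    """
    ready = set(ready_regs)
    executed, dispatched = [], []
    for ins in instrs:
        if all(src in ready for src in ins.srcs):
            executed.append(ins.name)      # ready: executed, never enters the IQ
            ready.add(ins.dest)
        else:
            dispatched.append(ins.name)    # not ready: passes as a NOP, goes to the IQ
            ready.discard(ins.dest)        # its result will come later from the OXU
    return executed, dispatched


program = [
    Instr("I0", "r1", ("r0",)),        # r0 is already in the PRF
    Instr("I1", "r3", ("r2",)),        # r2 is still pending, e.g. an outstanding load
    Instr("I2", "r4", ("r1",)),        # depends on I0, which ran in the IXU
    Instr("I3", "r5", ("r3", "r4")),   # depends on I1, which went to the IQ
]

executed, dispatched = ixu_filter(program, {"r0"})
print(executed)    # ['I0', 'I2']
print(dispatched)  # ['I1', 'I3']
```

Note that the stream never stalls: I2 is executed even though the older I1 was not ready.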

12 The IXU cannot execute I3
– Because of a long, consecutive chain of dependent instructions.
Generally, dependent instructions are rarely placed in a long, consecutive chain.
-> The IXU can execute many instructions.
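Why a long dependent chain defeats the IXU can be illustrated with a simplified model. The sketch below is an assumption-laden toy, not the mechanism from the paper: it considers a single group of instructions entering the IXU together, assumes a dependent instruction can execute no earlier than one stage after its producer, and takes the stage count and per-stage FU counts as parameters (the default is the [3, 1, 1] configuration that appears in later slides). The function name `assign_stages` is hypothetical.

```python
def assign_stages(instrs, ready_regs, fus_per_stage=(3, 1, 1)):
    """Map each instruction to the IXU stage that executes it, or None.

    instrs: list of (name, dest, srcs) tuples in program order.
    ready_regs: registers available on entry (treated as produced at stage 0).
    None means the instruction passes through as a NOP and goes to the IQ.
    """
    free = list(fus_per_stage)                  # FUs still unused in each stage
    produced_at = {reg: 0 for reg in ready_regs}
    placement = {}
    for name, dest, srcs in instrs:
        stage = None
        if all(src in produced_at for src in srcs):
            # A consumer must sit at least one stage behind its latest producer.
            earliest = 1 + max((produced_at[src] for src in srcs), default=0)
            for candidate in range(earliest, len(free) + 1):
                if free[candidate - 1] > 0:
                    free[candidate - 1] -= 1
                    stage = candidate
                    break
        placement[name] = stage
        if stage is None:
            produced_at.pop(dest, None)         # result will come from the OXU
        else:
            produced_at[dest] = stage
    return placement


# A four-instruction chain: each instruction consumes the previous result.
chain = [
    ("I0", "r1", ("r0",)),
    ("I1", "r2", ("r1",)),
    ("I2", "r3", ("r2",)),
    ("I3", "r4", ("r3",)),
]
print(assign_stages(chain, {"r0"}))
# {'I0': 1, 'I1': 2, 'I2': 3, 'I3': None}
```

Under these assumptions a three-stage IXU absorbs a chain of three dependent instructions, and the fourth (I3) falls through to the IQ.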

13 Branch
– The IXU can execute branch instructions, including misprediction handling.
Floating point
– The IXU cannot execute FP operations.
– FP operations have long latency, which would prolong the pipeline.
Load/Store
– Uses the load/store queue (LSQ).

14 Bypassing between the IXU and the OXU
– IXU -> OXU: not necessary, because of instruction order.
– OXU -> IXU: omitted. The performance degradation is not significant.

15 Optimization of the IXU
– The FUs increase the latency of the bypass network.
  Decrease the number of FUs in the backward stages: [3, 1, 1].
– Partially omit operand bypassing in the IXU.
  Omitted: bypassing between FUs that are more than two stages apart.

16 Instructions executed in the IXU
– Instructions that are already ready when they enter the IXU
  Very few (5.5%)
– Instructions that newly become ready in the IXU
– 35% (1 stage) to 54% (3 stages, FU [3, 1, 1])
Performance improvement
– Effects of the FUs in the IXU
  4 stages (conventional superscalar processor) to 7 stages (FXA)
  FUs: 4 (4-issue OoO superscalar) to 7 (5 in the IXU, 2 in the OXU)
– Variable branch misprediction penalty
  IXU / OXU

17 The number of FUs is increased.
– IXU and OXU
– Static energy consumption: increased.
– Dynamic energy consumption: increased or equal.
PRF
– The IXU and the OXU access the PRF simultaneously.
The number of issue queue accesses is decreased.
– Because of the IXU
– Reduces energy consumption by 86%.

18 IPC was evaluated using an in-house cycle-accurate processor simulator.
Benchmarks: SPEC CPU 2006
– Compiled using gcc 4.5.3 with -O3
Energy consumption and chip area were evaluated using the McPAT simulator (parameters: Table 2).

19 BIG
– Out-of-order superscalar (ARM Cortex-A57 big core), the baseline
HALF
– Issue width and IQ capacity are half those in BIG
LITTLE
– In-order processor (ARM Cortex-A53 LITTLE core)
HALF+FX
– HALF with IXU (3 stages, FU [3, 1, 1])
BIG+FX
– BIG with IXU (3 stages, FU [3, 1, 1])

21 Maximum: 67%, geometric mean: 5.7%

22 Geometric mean: 7.4%

23 Geometric mean: 4.5%

28 Proposed FXA, which has two execution units, the IXU and the OXU.
– 5.7% higher performance
– 17% lower energy consumption
– 25% higher performance/energy ratio
