1 Register Write Specialization Register Read Specialization A path to complexity effective wide-issue superscalar processors André Seznec, Eric Toullec,


1 Register Write Specialization, Register Read Specialization: a path to complexity-effective wide-issue superscalar processors
André Seznec, Eric Toullec, Olivier Rochecouste
IRISA / INRIA

2 AS-ET-OR Caps Team Irisa 2 Why design wide-issue superscalar processors?
SMT superscalar processors!

3 Doubling the issue width
- Functional units:
  - Silicon area: 2x
  - Power consumption: 2x
  - Same latency
- Register file:
  - Silicon area: > 8x
  - Power consumption: > 4x
  - Access time: 1.5x
- Wake-up logic entries:
  - monitor twice as many inputs
  - area, consumption, response time
- Bypass network:
  - wider multiplexors: > 2x
  - longer communications

4 An unwritten rule applied in all superscalar processor designs
- For general-purpose registers: any physical register can be the source or the result of any instruction executed on any functional unit

5 The register file issue

6 Silicon area for the physical register file

7 Conventional clustered design
[Figure: clusters C0, C1, C2, C3 and a single register file]

8 Distributed register file
[Figure: clusters C0, C1, C2, C3]
Local register file: shorter read access time but larger silicon area

9 8-way against 4-way (100 nm, 5 GHz)

Register file        Copies               Power            Access time     Silicon area
4-way distributed    2 identical copies   3.1 W            3 cycles        128 x 320 w^2 x W
8-way distributed    4 identical copies   14.5 W (x 4.5)   4 cycles (+1)   256 x 1792 w^2 x W (x 11)
8-way monolithic     -                    16 W (x 5)       5 cycles (+2)   256 x 1120 w^2 x W (x 8)

10 Let us reduce the number of ports on each individual register

11 Register Write Specialization
[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3]

12 Distributed Register File and Register Write Specialization
[Figure: clusters C0, C1, C2, C3]

13 Register Write Specialization
- Each cluster writes only a subset of the registers
- Fewer write ports on every individual physical register
- But allocation to clusters must precede register renaming
- 4-cluster 8-way distributed register file:
  - 512 entries
  - 320 x w^2 per register bit
  - 3 cycles access time
  - 8.5 W

14 Register Write Specialization and Register Renaming
Instructions to rename:
  1: Op R6, R7 -> R5
  2: Op R2, R5 -> R6
  3: Op R6, R3 -> R4
  4: Op R4, R6 -> R2
Inputs: 4 free odd registers, 4 free even registers, 4-bit subset target vector
Dependencies inside the group resolved:
  1: Op L6, L7 -> res1
  2: Op L2, res1 -> res2
  3: Op res2, L3 -> res3
  4: Op res3, res2 -> res4
With the 4 new free registers and the old map table:
  1: Op P6, P7 -> RES1
  2: Op P2, RES1 -> RES2
  3: Op RES2, P3 -> RES3
  4: Op RES3, RES2 -> RES4
Output: new map table
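The renaming flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' hardware design: the function name `rename_group`, the integer register numbering, the initial map table and the target vector `[1, 0, 0, 1]` (0 = even subset, 1 = odd subset) are all hypothetical. It takes one free register per instruction from every subset, as the slide's "4 free odd, 4 free even" suggests, and reports the registers left unused.

```python
from collections import deque

def rename_group(group, target_subsets, free_lists, map_table):
    """Rename one group of (src1, src2, dest) instructions.

    target_subsets[i] is the register subset that instruction i must write
    (decided by cluster allocation, before renaming). map_table is updated
    in place. Returns the renamed instructions and the unused registers.
    """
    n = len(group)
    # Worst case: all results go to one subset, so take n registers from each.
    picked = [[fl.popleft() for _ in range(n)] for fl in free_lists]
    renamed = []
    for i, ((src1, src2, dest), subset) in enumerate(zip(group, target_subsets)):
        # Sources are looked up after earlier instructions of the group have
        # updated the table, which resolves dependencies inside the group.
        p1, p2 = map_table[src1], map_table[src2]
        pd = picked[subset][i]
        picked[subset][i] = None          # slot i of that subset is consumed
        map_table[dest] = pd
        renamed.append((p1, p2, pd))
    unused = [[r for r in regs if r is not None] for regs in picked]
    return renamed, unused

# The four instructions of the slide; physical numbering is hypothetical.
group = [(6, 7, 5), (2, 5, 6), (6, 3, 4), (4, 6, 2)]
free_lists = [deque([8, 10, 12, 14]), deque([9, 11, 13, 15])]  # even, odd
map_table = {r: r for r in range(8)}
renamed, unused = rename_group(group, [1, 0, 0, 1], free_lists, map_table)
```

Half of the eight registers taken from the free lists end up unused in this run, which is the cost that motivates the recycling step on the next slide.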

15 Register Write Specialization and Register Renaming (2)
- Consumes a lot of registers: need for recycling
  1: build two lists of registers to be recycled
  2: pack both lists
  3: concatenate the two lists
  4: append to the free list
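The four recycling steps can be sketched as follows. The slide does not say which two lists are meant or how they are encoded, so this sketch assumes two fixed-width slot vectors in which `None` marks an empty slot; the function name `recycle` and its parameters are hypothetical.

```python
from collections import deque

def recycle(slots_a, slots_b, free_list):
    """Return registers from two sparse slot vectors to a free list."""
    # 1: the two lists of registers to be recycled arrive as slot vectors
    #    that may contain holes (None).
    # 2: pack both lists, i.e. squeeze out the holes.
    packed_a = [r for r in slots_a if r is not None]
    packed_b = [r for r in slots_b if r is not None]
    # 3: concatenate the two packed lists.
    recycled = packed_a + packed_b
    # 4: append the result to the free list.
    free_list.extend(recycled)
    return recycled

free_list = deque([20])
recycled = recycle([8, None, None, 14], [None, 3, None, None], free_list)
```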

16 Register Write Specialization and Register Renaming (3)
- An alternative:
  - Compute the number of registers needed in each register subset
  - Pick the right number of registers from each of the free lists
  - No need for recycling registers
Think about round-robin distribution!
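A minimal sketch of this alternative, assuming one free list per register subset and a per-instruction target subset known before renaming; the name `pick_exact` and the register numbering are hypothetical. With round-robin distribution the count per subset is known in advance (one register per subset for a group of four on four subsets), so the counting step becomes trivial.

```python
from collections import Counter, deque

def pick_exact(target_subsets, free_lists):
    """Take exactly as many free registers per subset as the group needs."""
    needed = Counter(target_subsets)            # registers needed per subset
    taken = {
        s: deque(free_lists[s].popleft() for _ in range(count))
        for s, count in needed.items()
    }
    # Hand the registers out in program order; nothing is left to recycle.
    return [taken[s].popleft() for s in target_subsets]

free_lists = [deque([0, 4, 8]), deque([1, 5, 9]),
              deque([2, 6, 10]), deque([3, 7, 11])]
round_robin = [i % 4 for i in range(4)]         # instruction i -> subset i mod 4
dests = pick_exact(round_robin, free_lists)
```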

17 Performance issues
- Register Write Specialization only:
  - round-robin allocation: no extra stage for register renaming
  - shorter register access time
Overall shorter pipeline: slightly better performance

18 Register Read Specialization
[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1]

19 Register Read Specialization
- Limits the number of read ports on each individual register
- Puts strong constraints on allocation of instructions to clusters
- Caution:
  - Interconnection topology must ensure that every instruction is executable
Personal opinion: don't use it alone!

20 WSRS architectures
Combining Register Read Specialization and Register Write Specialization

21 4-cluster WSRS architecture
[Figure labels: S0 C0 S1 C1 S2 C2 S3 C3 S2]
The positions of an instruction's operands determine the execution cluster

22 4-cluster WSRS architecture: allocating instructions to clusters
[Figure labels: S0 C0 S1 C1 S2 C2 S3 C3 S2]
Op: R6, R7 -> R5    S1, S2 -> S0
- First operand determines top or bottom bicluster
- Second operand determines left or right bicluster

23 4-cluster WSRS architecture: allocating instructions to clusters (2)
Op: R6, R7 -> R5    S1, S2 -> S0
The two bits are computed independently :-)
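The allocation rule can be sketched as a two-bit computation. The transcript does not preserve the figure, so the exact subset-to-bicluster mapping below is an assumption: the first operand's subset gives the row bit (S0, S1 for top; S2, S3 for bottom), the second operand's subset gives the column bit (S0, S2 for left; S1, S3 for right), and clusters are numbered C0, C1 on top and C2, C3 on the bottom. The function name `execution_cluster` is hypothetical. The mapping was chosen because it agrees with the slide's example, where operands in S1 and S2 give a result in S0.

```python
def execution_cluster(first_subset, second_subset):
    """Cluster index for a dyadic instruction under the assumed mapping."""
    row = first_subset >> 1      # 0 = top bicluster, 1 = bottom bicluster
    col = second_subset & 1      # 0 = left bicluster, 1 = right bicluster
    # Each bit depends on one operand only, so both can be computed in parallel.
    return 2 * row + col

# Slide example: operands in S1 and S2. If cluster Ci writes subset Si,
# the result lands in S0.
example = execution_cluster(1, 2)
```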

24 4-cluster 8-way WSRS architecture: the register file
- Each individual physical register: 4 identical copies of (2-read, 3-write) registers
  - 8x smaller than conventional monolithic approach
  - 12.8x smaller than conventional distributed approach
- WSRS: 512 registers, 6.25 W, 3 cycles
- Conventional: 256 registers, (16 W, 5 cycles) or (14.5 W, 4 cycles)

25 4-cluster 8-way WSRS architecture: the wake-up logic
- The wake-up logic monitors all possible sources for each operand
- FUs from only two clusters are possible sources
  - only 6 possible sources!
8-way WSRS architecture wake-up logic entry complexity = 4-way issue wake-up logic entry complexity

26 4-cluster 8-way WSRS architecture: bypass network
- Possible sources for each operand
  - FUs from only two clusters are possible sources
- Bypass points: (pipeline length) x (possible FU sources) + register file

Design              Pipeline length   Bypass points
8-way distributed   4 cycles          49 pos. op.
WSRS                3 cycles          19 pos. op.
8-way monolithic    5 cycles          61 pos. op.
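The three totals on this slide follow from the bypass-point formula. In the sketch below, the 6 FU sources for WSRS come from the wake-up logic slide; the 12 FU sources for the two conventional 8-way designs are not stated in the transcript and are inferred here because they reproduce the slide's totals. The function name `bypass_points` is hypothetical.

```python
def bypass_points(pipeline_length, fu_sources):
    """(pipeline length) x (possible FU sources) + 1 for the register file."""
    return pipeline_length * fu_sources + 1

# (pipeline length in cycles, possible FU sources per operand)
designs = {
    "8-way distributed": (4, 12),   # 12 sources: inferred, not on the slide
    "WSRS": (3, 6),                 # 6 sources: from the wake-up logic slide
    "8-way monolithic": (5, 12),    # 12 sources: inferred, not on the slide
}
points = {name: bypass_points(*params) for name, params in designs.items()}
```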

27 4-cluster WSRS architecture: fast-forwarding
- Local fast-forwarding inside a single cluster: 2 out of 4 consumers are reached on the next cycle
- Partial fast-forwarding inside a pair of adjacent clusters: 3 out of 4 consumers are reached on the next cycle!
- Complete fast-forwarding: consumer is close: may be possible to implement!
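The "2 out of 4" and "3 out of 4" figures can be reproduced by enumerating the places where a result may be consumed. This is a sketch under assumptions that the transcript does not confirm: cluster Ci writes subset Si, a cluster reads its first operand from the two subsets sharing its row bit and its second operand from the two subsets sharing its column bit. The names `consumers` and `reached_next_cycle` are hypothetical.

```python
def consumers(producer):
    """(operand position, cluster) pairs that can read producer's result."""
    # Assumed read specialization: first operand selected by the row bit,
    # second operand by the column bit of the subset index.
    first = [("first", c) for c in range(4) if c >> 1 == producer >> 1]
    second = [("second", c) for c in range(4) if c & 1 == producer & 1]
    return first + second

def reached_next_cycle(producer, fast_clusters):
    """Count consumers located in clusters covered by fast-forwarding."""
    return sum(1 for _, c in consumers(producer) if c in fast_clusters)

local = reached_next_cycle(0, {0})        # forwarding inside C0 only
partial = reached_next_cycle(0, {0, 1})   # forwarding inside the pair C0, C1
total = len(consumers(0))
```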

28 4-cluster WSRS architecture: nothing is entirely free!
- Strong constraint on allocation of instructions to clusters:
  - The cluster executing a dyadic instruction depends on the position of its operands in the register subsets
- Degrees of freedom:
  - Monadic instructions can be executed on two clusters
  - One out of two commutative dyadic instructions can be executed on two clusters
  - Design clusters able to execute instructions in two forms? A - B and -B + A
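The commutativity degree of freedom can be sketched by swapping the operands before computing the cluster. The subset-to-cluster mapping is an assumption (row bit from the first operand's subset, column bit from the second operand's subset), chosen only because it matches the earlier example where operands in S1 and S2 give a result in S0; the function names are hypothetical. The sketch does not try to reproduce the "one out of two" figure, which depends on details the transcript does not preserve.

```python
def execution_cluster(first_subset, second_subset):
    """Cluster index under the assumed row/column mapping."""
    return 2 * (first_subset >> 1) + (second_subset & 1)

def candidate_clusters(first_subset, second_subset, commutative):
    """Clusters on which a dyadic instruction may be executed."""
    clusters = {execution_cluster(first_subset, second_subset)}
    if commutative:
        # Swapping the operands exchanges their roles in the allocation,
        # which may open a second cluster for load balancing.
        clusters.add(execution_cluster(second_subset, first_subset))
    return clusters

fixed = candidate_clusters(1, 2, commutative=False)
flexible = candidate_clusters(1, 2, commutative=True)
```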

29 Using monadic instructions for load balancing
[Figure labels: S0 C0 S1 C1 S2 C2 S3 C3 S2; S0 or S1]

30 Commutativity for load balancing
[Figure labels: S0 C0 S1 C1 S2 C2 S3 C3 S2; S0 op S2]

31 4-cluster WSRS architecture: nothing comes for free (2)
- Extra free lists and associated logic
- Extra pipeline stage(s):
  - Instructions must be allocated to clusters before the last step in register renaming: + 3 cycles
  - But shorter register access time: - 2 cycles

32 Performance issues on 4-way WSRS architectures
- Workload may be unbalanced among the clusters:
  - Use of the degrees of freedom: monadic instructions, « commutative » clusters
- Higher probability of local consumption of a register
Naive allocation policies on WSRS compete favorably with naive policies on conventional architectures

33 Summary
- Register Write Specialization:
  - limits the number of write ports on each physical register
  - leads naturally to a distributed register file
  - keeps power consumption, silicon area and access time under control
- But:
  - some extra complexity in register renaming

34 Summary (2)
- Register Write Specialization + Register Read Specialization:
  - further limits the number of ports on each physical register
  - keeps power consumption, silicon area and access time under control
  - side effects: keeps wake-up logic and bypass network complexity under control
- But:
  - constrains instruction allocation to clusters

35 Future work
- Intelligent instruction allocation policies
- Exploration of other possible interconnections
- Use of heterogeneous clusters
- SMT mode

