1 Presented by Şahin DELİPINAR. Simon Moore, Peter Robinson, Steve Wilcox, Computer Laboratory, University of Cambridge, December 15, 1995. Rotary Pipeline Processors.



2 2/26 OUTLINE
- Abstract
- Introduction
- Rotary Pipeline Concept
- Implementation Issues
- Simulation
- Relation to Other Approaches
- Conclusions

3 3/26 ABSTRACT
- The rotary pipeline processor is a new architecture for superscalar computing.
- Registers flow around the pipeline.
- Performance is limited only by data rates.
- Operation is paced by self-timed logic.

4 4/26 INTRODUCTION
- Most current designs use parallel pipelines to execute multiple instructions...
- Synchronization problems reduce performance in these pipelines.
- In a rotary pipeline, instructions are dispatched to the ALUs from the center of the pipeline. Data circulates clockwise and is processed by the ALUs and memory access units.

5 5/26 ROTARY PIPELINE CONCEPT
Overview:
- A rotary pipeline rotates the registers past the function units placed around the ring. When a register reaches a function unit that needs it, the value is used and the result is reloaded into the register.
- Unused registers are not locked and continue to rotate.
- ALU operations occur in parallel.
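The overview above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the function name `run_ring`, the tuple layout of a unit, and the register names are hypothetical, and the loop visits units one after another, so it shows the data flow around the ring but not the parallel, self-timed operation of real hardware.

```python
import operator


def run_ring(registers, units, laps=1):
    """Carry one register set past every unit in ring order.

    Each entry of `units` is either None (the registers pass through
    untouched) or a (dest, op, src_a, src_b) tuple: the unit reads two
    registers and reloads the result into `dest`.
    """
    regs = dict(registers)  # copy so the caller's values are not mutated
    for _ in range(laps):
        for unit in units:
            if unit is None:
                continue  # unused registers are not locked; they just move on
            dest, op, src_a, src_b = unit
            regs[dest] = op(regs[src_a], regs[src_b])
    return regs


# Hypothetical three-unit ring: two busy ALUs with an idle unit between them.
ring = [
    ("r2", operator.add, "r0", "r1"),  # r2 = r0 + r1
    None,                              # idle unit, registers rotate past it
    ("r3", operator.mul, "r2", "r3"),  # r3 = r2 * r3, uses the fresh r2
]
result = run_ring({"r0": 1, "r1": 2, "r2": 0, "r3": 5}, ring)
```

The second ALU sees the value of r2 produced by the first one simply because the registers reach it later on their way around the ring.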

6 6/26 ROTARY PIPELINE CONCEPT (Cont’d)
Basic Pipeline Construction: A set of flip-flops selects which registers will be used and which will be left to continue around the ring.

7 7/26 ROTARY PIPELINE CONCEPT (Cont’d)
Adding a Register File: If the rotary pipeline is large and there are many registers, a multiported register file is used to store the registers that are waiting (Figure 3).

8 8/26 ROTARY PIPELINE CONCEPT (Cont’d)
Rotary Bus Allocation: Registers are assigned to buses on a first-come, first-served basis. If instructions are independent, their registers continue to travel around the ring. When only one unit is used, the number of buses increases (Figure 4).
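A minimal sketch of first-come, first-served bus allocation, assuming a fixed pool of buses and a request list already sorted by arrival. The function name `allocate_buses` and the convention of returning `None` for a request that has to wait are assumptions made for illustration, not details taken from the slides.

```python
def allocate_buses(requests, num_buses):
    """Assign each requesting register to a bus in arrival order.

    Returns a dict mapping register name to bus index, or to None when
    every bus was already taken by an earlier request.
    """
    free = list(range(num_buses))  # lowest-numbered free bus is handed out first
    allocation = {}
    for reg in requests:
        allocation[reg] = free.pop(0) if free else None
    return allocation


allocation = allocate_buses(["r1", "r4", "r2"], num_buses=2)
```

Here r1 and r4 arrive first and take the two buses, while r2 is left without one until a bus is released.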

9 9/26

10 10/26 ROTARY PIPELINE CONCEPT (Cont’d)
Instruction Issue:
- Sequential instructions are sent in the same direction, so overlapping and register dependencies are resolved.
- If an instruction cannot be processed by a function unit, a NOP is issued instead, which reduces performance.
- Dynamic instruction reordering
- Assume a load followed by an add, where the first unit is an ALU...
- Only a 3% performance gain is obtained.
- A mispredicted branch reduces performance.
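The load-then-add case can be illustrated with a small in-order issue model. Everything here is a hypothetical sketch: units are identified only by a kind string ("alu" or "mem"), instructions are `(kind, text)` tuples, and `issue_in_order` is an invented helper that walks the ring and emits a NOP whenever the next instruction in program order does not match the unit being visited.

```python
NOP = ("nop", "NOP")


def issue_in_order(instructions, unit_kinds):
    """Walk the ring of units, issuing strictly in program order."""
    unknown = {kind for kind, _ in instructions} - set(unit_kinds)
    if unknown:
        raise ValueError(f"no unit can execute: {sorted(unknown)}")
    pending = list(instructions)
    slots = []
    position = 0
    while pending:
        kind = unit_kinds[position % len(unit_kinds)]
        if pending[0][0] == kind:
            slots.append(pending.pop(0))  # unit matches the next instruction
        else:
            slots.append(NOP)             # unit cannot take it: wasted slot
        position += 1
    return slots


units = ["alu", "mem"]  # the first unit on the ring is an ALU
program = [("mem", "LDR r1, [r0]"), ("alu", "ADD r2, r3, r4")]

in_order = issue_in_order(program, units)         # NOP, LDR, ADD
reordered = issue_in_order(program[::-1], units)  # ADD, LDR
```

Because the add in this example does not depend on the loaded value, swapping the two instructions fills the ALU slot that in-order issue wastes on a NOP.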

11 11/26 ROTARY PIPELINE CONCEPT (Cont’d)
Because the rotary pipeline is data driven, instruction ordering is not so important. Instructions complete out of order. Figure 4...

12 12/26 ROTARY PIPELINE CONCEPT (Cont’d)

13 13/26 ROTARY PIPELINE CONCEPT (Cont’d)
CONDITIONAL EXECUTION: Conditional execution of arithmetic and logical instructions may be handled by extra control logic at each ALU. This logic controls whether results are written to the rotary pipeline by controlling the output switch network.
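A rough software analogue of the conditional write-back described above. The function `conditional_alu`, the flag dictionary, and the register names are assumptions for illustration; the point is only that the ALU always computes, and a separate condition decides whether the result or the old value continues around the ring.

```python
import operator


def conditional_alu(regs, flags, dest, op, src_a, src_b, condition):
    """Evaluate an ALU operation, but write it back only if `condition` holds."""
    result = op(regs[src_a], regs[src_b])  # the ALU evaluates regardless
    if condition(flags):
        return {**regs, dest: result}      # output switch passes the new value
    return dict(regs)                      # output switch passes the old values


regs = {"r0": 4, "r1": 6, "r2": 0}


def if_zero(flags):
    """Hypothetical condition: execute only when the Z flag is set."""
    return flags["Z"]


taken = conditional_alu(regs, {"Z": True}, "r2", operator.add, "r0", "r1", if_zero)
skipped = conditional_alu(regs, {"Z": False}, "r2", operator.add, "r0", "r1", if_zero)
```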

14 14/26 ROTARY PIPELINE CONCEPT (Cont’d)
BRANCHES: Branches always have an adverse effect on pipeline performance. Unconditional branches are easy to handle and can be predicted before the operation begins, but conditional branches depend on the outcome of the execution stage and are difficult to handle. This can be addressed with speculative execution.

15 15/26 ROTARY PIPELINE CONCEPT (Cont’d)
SPECULATIVE EXECUTION:
- If an execution is marked as speculative, it can be revoked.
- If the register file is used… (results not written to registers)
- If a larger register file is used… (temporary register files)
- If a larger rotary pipeline is used… (flip-flops)
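One way to picture revocable execution is to keep speculative results in temporary storage until the branch outcome is known. The class below is a minimal sketch loosely following the "temporary register files" option; the class name `SpeculativeRegisters` and its methods are hypothetical and do not describe the authors' circuit.

```python
class SpeculativeRegisters:
    """Register set with a temporary area for speculative results."""

    def __init__(self, committed):
        self.committed = dict(committed)
        self.pending = {}  # speculative values, not yet architecturally visible

    def write(self, reg, value, speculative=False):
        target = self.pending if speculative else self.committed
        target[reg] = value

    def read(self, reg):
        # Later speculative instructions see earlier speculative results.
        return self.pending.get(reg, self.committed[reg])

    def commit(self):
        """Branch was predicted correctly: make the results permanent."""
        self.committed.update(self.pending)
        self.pending.clear()

    def revoke(self):
        """Branch was mispredicted: discard the speculative results."""
        self.pending.clear()


regs = SpeculativeRegisters({"r0": 1, "r1": 2})
regs.write("r1", 99, speculative=True)
seen_while_speculating = regs.read("r1")
regs.revoke()
seen_after_revoke = regs.read("r1")
```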

16 16/26 IMPLEMENTATION
Data encoding and completion detection:
- Determining when a logic block has completed evaluation:
1. Embedding the completion signal within the data
2. Localised timing using matched delays

17 17/26 IMPLEMENTATION (Cont’d)
Embedding the completion signal within the data is done using 1-of-4 encoding, with the coding scheme shown in Figure 5. Bundled data, by contrast, uses binary encoding.
The matched-delay method is subject to variation from thermal effects and manufacturing tolerances.
Figure 5
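To make the idea of an embedded completion signal concrete, here is a small sketch of 1-of-4 encoding. It assumes the common mapping in which each pair of bits drives exactly one of four wires and the all-zero group means "no data yet"; the exact coding table in Figure 5 is not reproduced in the transcript, so the wire assignment and the helper names are assumptions.

```python
def encode_1of4(value, num_pairs):
    """Encode `value` as `num_pairs` groups of four wires, least significant pair first."""
    groups = []
    for i in range(num_pairs):
        pair = (value >> (2 * i)) & 0b11
        groups.append(tuple(1 if wire == pair else 0 for wire in range(4)))
    return groups


def is_complete(groups):
    """Data is complete when every group has exactly one wire high."""
    return all(sum(group) == 1 for group in groups)


def decode_1of4(groups):
    """Recover the binary value from a complete set of groups."""
    if not is_complete(groups):
        raise ValueError("data not yet complete")
    value = 0
    for i, group in enumerate(groups):
        value |= group.index(1) << (2 * i)
    return value


encoded = encode_1of4(0b1101, num_pairs=2)
cleared = [(0, 0, 0, 0), (0, 0, 0, 0)]  # all wires low: no data present
```

No separate "done" wire is needed in this scheme, since a receiver can tell from the data wires alone whether a full value has arrived.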

18 18/26 IMPLEMENTATION (Cont’d)
Using Dynamic Logic:
- Dynamic logic and inverted 1-of-4 encoded data dovetail nicely, because precharging the logic depends on clearing the 1-of-4 encoded data before evaluation.
- Completion detection can be simplified by using AND gates instead of C-elements in the circuit.
Figure 6
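For readers unfamiliar with the two gate types, the following sketch contrasts a Muller C-element with a plain AND gate. It is a behavioural model only, with hypothetical names, and it does not claim to reproduce the circuit in Figure 6: it simply shows that both gates rise when both inputs are high, while only the C-element holds its output until both inputs have returned low.

```python
class CElement:
    """Muller C-element: the output changes only when both inputs agree."""

    def __init__(self):
        self.out = 0

    def update(self, a, b):
        if a == b:
            self.out = a  # both high sets the output, both low clears it
        return self.out   # otherwise the previous output is held


def and_gate(a, b):
    """Stateless AND gate for comparison."""
    return a & b


# Inputs rise one after the other, then fall one after the other.
sequence = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
c_element = CElement()
c_trace = [c_element.update(a, b) for a, b in sequence]
and_trace = [and_gate(a, b) for a, b in sequence]
```

The two traces agree on the rising edge and differ only while the inputs are being cleared.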

19 19/26 IMPLEMENTATION (Cont’d)
Outline of a Stage in the Pipeline: Banks of transistors are used to download and upload register data.
Figure 7

20 20/26 IMPLEMENTATION (Cont’d)
Controlling the Pipeline: Each stage of the pipeline passes through the following states:
- Empty: the ALU is precharged and the flip-flops are reset.
- Waiting for data: precharge and reset are released.
- Latching data: SR flip-flops store the results.
- Precharge: after the data is latched, the ALU precharge commences.
- Reset: once the next stage issues completion, the latches of this stage may be reset.
- Empty: the cycle is complete.
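The control cycle listed above can be written as a small state machine. The state names follow the slide; the event names (`release`, `data_valid`, `latched`, `next_stage_complete`, `reset_done`) are hypothetical labels chosen for this sketch, since the transcript only names the trigger for the reset step.

```python
from enum import Enum


class StageState(Enum):
    EMPTY = "empty"
    WAITING = "waiting for data"
    LATCHING = "latching data"
    PRECHARGE = "precharge"
    RESET = "reset"


# (current state, event) -> next state; event names are illustrative only.
TRANSITIONS = {
    (StageState.EMPTY, "release"): StageState.WAITING,
    (StageState.WAITING, "data_valid"): StageState.LATCHING,
    (StageState.LATCHING, "latched"): StageState.PRECHARGE,
    (StageState.PRECHARGE, "next_stage_complete"): StageState.RESET,
    (StageState.RESET, "reset_done"): StageState.EMPTY,
}


def step(state, event):
    """Return the next state, or stay put if the event does not apply."""
    return TRANSITIONS.get((state, event), state)


def run(events, state=StageState.EMPTY):
    """Feed a sequence of events through the stage controller."""
    history = [state]
    for event in events:
        state = step(state, event)
        history.append(state)
    return history


full_cycle = run(
    ["release", "data_valid", "latched", "next_stage_complete", "reset_done"]
)
```

Note that the stage cannot leave the precharge state in this model until the following stage has signalled completion, which mirrors the handshake described in the reset step.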

21 21/26 IMPLEMENTATION (Cont’d) Figure 8

22 22/26 SIMULATION
Instruction Set Choice: ARM instructions are used for convenient comparison with existing clocked designs.
Characteristics of the instruction set:
1. Conditionals: every instruction can be conditionally executed.
2. PC: the program counter is one of the general-purpose registers and may be written to, thereby causing a branch.
3. Load and store multiple: a single instruction can load or store multiple registers.

23 23/26 SIMULATION (Cont’d)
Initial Results: The ARM instruction set and only the store and compress benchmarks were used to test performance.
- First, ALU, memory access and branch units were taken.
- A number of ALU units were added...
- Dynamic instruction reordering increased performance by 3%.
- Branch prediction and a larger register file increased performance (Figure 9).
- But memory accesses soon limit performance.

24 24/26 Figure 9

25 25/26 RELATION TO OTHER APPROACHES
Data transfer capability within the stages:
- In the rotary pipeline (RP), data is passed through latches between pipeline stages. This is better than clocked designs, where data is only available after each clock period.
- Amulet is a single processor in which the latches are transparent to data while the pipeline is refilling.
- In the CFPP (counterflow pipeline processor), register values filter down as data traverses the pipeline, and at the end of the cycle the operands are gathered at the very beginning of the pipeline.
- RP differs from other superscalar processors by avoiding global communication.

26 26/26 CONCLUSIONS
- Rotary pipelines are self-timed structures that allow multiple instructions to be executed at the same time.
- Variations:
1. Passing complete registers...
2. Passing only active registers…
- The rotary pipeline structure emphasizes performance rather than size and low power.
- RPs have fewer buses compared to other superscalar processors.
- Suitable for self-timed circuits but not for clocked implementations.

27 27/26 Questions?...

