Asynchronous Pipelines Author: Peter Yeh Advisor: Professor Beerel.


1 Asynchronous Pipelines Author: Peter Yeh Advisor: Professor Beerel

2 USC Asynchronous Group: Motivation
Can we reduce the communication overhead of asynchronous pipelines while hiding precharge time?
Can asynchronous pipelines achieve a cycle time as fast as, if not faster than, that of their best synchronous counterparts?

3 Motivation: System Performance
Fixed stage pipeline
–Low pipeline usage: low latency is critical
–High pipeline usage: cycle time is the limiting factor in generating new outputs as fast as possible
Flexible stage pipeline
–With zero forward overhead and a short cycle time, we can achieve a given throughput with fewer stages

4 Motivation: System Performance
Pipelines with loop dependencies
–Optimal cycle time is the sum of the latencies around the loop
–Pipelining is required to keep precharge/reset off the critical path
–Our scheme requires fewer pipeline stages to achieve the same performance
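
The interaction between cycle time and forward latency in a loop can be sketched numerically. The model below is an assumption, not taken from the slides: a ring of N identical stages carries a single token, the token needs N × Lf to travel once around, and each stage needs a full cycle time P before it can evaluate again. Precharge is then hidden once N × Lf ≥ P. The function names and the delay values are hypothetical.

```python
def min_stages_to_hide_precharge(cycle_time: int, forward_latency: int) -> int:
    """Smallest N such that N * forward_latency >= cycle_time.

    Assumed model: one token in a ring of identical stages.
    Delays are integers in arbitrary time units.
    """
    if cycle_time <= 0 or forward_latency <= 0:
        raise ValueError("delays must be positive")
    # Ceiling division without floating point.
    return -(-cycle_time // forward_latency)


def loop_iteration_time(stages: int, cycle_time: int, forward_latency: int) -> int:
    """Time for the token to complete one trip around the ring.

    The trip is limited either by the summed forward latency or by
    the time one stage needs before it can accept the token again.
    """
    return max(stages * forward_latency, cycle_time)


if __name__ == "__main__":
    # Hypothetical numbers: same forward latency, two different cycle times.
    for p in (20, 14):
        n = min_stages_to_hide_precharge(p, forward_latency=4)
        print(f"P={p}, Lf=4 -> at least {n} stages")
```

Under this model, a scheme with the same forward latency but a shorter cycle time needs fewer stages before the loop is limited only by its summed latency.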

5 Introduction
Asynchronous pipeline schemes using a Taken Detector (TD)
Best used in coarse-grained pipelines
Two schemes targeting different requirements (and possibly a third, SI scheme)

6 Outline
Background review
–Sutherland
–Ted Williams
–Renaudin
–Martin
Taken pipeline
Performance comparison
Conclusion

7 Definition
Stage: a collection of logic that is precharged or evaluated at the same time
Cycle: the time between the start of one evaluation of a stage and the start of its next evaluation
Forward Latency: the time between the start of evaluation in the current stage and the start of evaluation in the next stage

8 Background Outline
Sutherland’s Micropipeline scheme
Ted Williams’ PS0 and PC0 pipeline schemes
Renaudin’s DCVSL pipeline scheme
Martin’s deep pipeline scheme

9 Sutherland’s Micropipeline
Father of the asynchronous pipeline; presented in his Turing Award lecture
Delay insensitive
[Figure: micropipeline of alternating REG blocks (ports C, Cd, Pd, P) and LOGIC blocks, controlled by C-elements; signals R(in), A(in), D(in), R(out), A(out), D(out)]

10 Williams’ PC0
Speed independent
Cycle Time (P) = 3tF + 1tF + 4tC + 4tD
Forward Latency (Lf) = 1tF + 1tD + 1tC
[Figure: precharged function blocks F1, F2, F3 with completion detectors D1, D2, D3 and C-elements C1, C2, C3; signals D(in), R(in), A(in), D(out), R(out), A(out)]

11 PC0 Timing Diagram
The cycle time is shown with red arrows, while the blue arrows show the precharge phase

12 Dependency Graph
[Figure: flat dependency graph over the transitions of C1 to C4, F1 to F4 and D1 to D3, and the folded dependency graph over the transitions of C, F and D with edge weights 0 and +1]

13 Williams’ PC1
Cycle Time (P) = 2tF + 4tC + 4tD
Forward Latency (Lf) = 1tF + 2tC + 1tD
[Figure: precharged function blocks F1, F2 with detectors DA, DB, D2, C-elements C1, C2 and a C Latch; signals D(in), R(in), A(in), D(out), R(out), A(out)]

14 Williams’ PS0
Not speed independent
Cycle Time (P) = 3tF + 1tF + 2tD
Forward Latency (Lf) = 1tF
[Figure: precharged function blocks F1, F2, F3 with completion detectors D1, D2, D3; signals D(in), A(in), D(out), A(out)]

15 PS0 Timing Diagram

16 PS0 Timing Assumption
The pipeline has to meet the following timing assumption:
tF 

17 Renaudin’s DCVSL Pipeline
Compared with Ted Williams’ PC0 only
Uses DCVSL exclusively
Introduces latched DCVSL
Improves cycle time but not forward latency
Cycle Time (P) = 1tF + 1tF + 4tC + 2tD
Forward Latency (Lf) = 1tF + 1tC + 1tD

18 DCVS Logic Family
[Figure: DCVS logic and latched DCVS logic circuits]

19 More on DCVSL
Advantages
–Fast, based on dynamic domino-type logic
–Built-in four-phase handshaking
–Robust completion sensing
–Storage element
Disadvantages
–Higher complexity: more transistors and larger area
–Higher power dissipation

20 DCVS Pipeline
[Figure: precharged function blocks F1, F2, F3 with completion detectors D1, D2, D3 and C-elements C1, C2, C3; signals D(in), R(in), A(in), D(out), R(out), A(out)]
Cycle Time (P) = 1tF + 1tF + 4tC + 2tD (2tF + 4tC + 2tD)
Forward Latency (Lf) = 1tF + 1tC + 1tD

21 DCVS Pipeline Timing Diagram

22 DCVS Dependency Graph
[Figure: folded dependency graph over the transitions of C, F and D, with edge weights 0 and +1]
Cycle Time (P) = 1tF + 1tF + 4tC + 2tD
Forward Latency (Lf) = 1tF + 1tC + 1tD

23 Martin’s Pipeline Schemes
Deep pipelining
Quasi Delay-Insensitive (QDI): no timing assumptions
Based on different handshaking reshufflings
The best scheme has high concurrency, which reduces control overhead
Control logic is more complex

24 Basic Asynchronous Handshaking
[Figure: three pipeline stages (1, 2, 3), each with signals L0, L1, Le, Re, R0, R1; handshaking expansion over the transitions of L1, Le, R1, Re]
Reshuffling eliminates the explicit variable x
Large control overhead

25 Handshaking Reshuffling
Still waits for the predecessor to reset before resetting itself, so the overhead grows with the number of inputs
[Figure: three pipeline stages (1, 2, 3), each with signals L0, L1, Le, Re, R0, R1; handshaking expansion over the transitions of L1, Le, R1, Re]

26 Precharge-Logic Half-Buffer
Does not wait for the predecessor to reset before resetting its own outputs; the control logic waits for the predecessor’s reset only after the current stage has reset
[Figure: three pipeline stages (1, 2, 3), each with signals L0, L1, Le, Re, R0, R1; handshaking expansion over the transitions of L1, Le, R1, Re]

27 Precharge-Logic Full-Buffer
Allows the neutrality test of the output data to overlap with raising the left enables
Complex control logic; requires an extra state variable
[Figure: three pipeline stages (1, 2, 3), each with signals L0, L1, Le, Re, R0, R1; handshaking expansion over the transitions of L1, Le, R1, Re and the state variable en]

28 Martin’s PCHB Full-adder

29 Martin’s Pipeline in General
The cycle time is limited by the properties of QDI
–The next stage has to finish precharging before the current stage can evaluate its next input
[Figure: precharged function blocks F1, F2, F3 with completion detectors D1, D2, D3 and Control blocks; signals D(in), D(out), Le, Re]

30 Performance Analysis on PCFB
Control logic can be seen as completion detection (D) plus a C-element (C)
Reshuffling the handshake only changes the degree of concurrency; it does not affect the best-case performance analysis
Cycle Time (P) = 3tF + 1tF + 2tC + 2tD
Forward Latency (Lf) = 1tF
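
The C-element (C) that appears in these control paths has simple behavior: its output follows the inputs when they agree and holds its previous value when they differ. A minimal behavioral sketch follows; the class name and the boolean signal model are illustrative assumptions, not a description of any circuit in the slides.

```python
class CElement:
    """Behavioral model of a two-input Muller C-element."""

    def __init__(self, initial: bool = False) -> None:
        self.output = initial

    def update(self, a: bool, b: bool) -> bool:
        # Both inputs agree: the output follows them.
        if a == b:
            self.output = a
        # Inputs disagree: the output keeps its previous value.
        return self.output


if __name__ == "__main__":
    c = CElement()
    for a, b in [(True, False), (True, True), (False, True), (False, False)]:
        print(f"a={a!s:5} b={b!s:5} -> out={c.update(a, b)}")
```

The hold behavior is what lets a stage's control wait for two events, such as a request and an acknowledge, regardless of the order in which they arrive.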

31 Outline
Background review
–Sutherland
–Ted Williams
–Renaudin
–Martin
Taken pipeline
Performance comparison
Conclusion

32 Taken Pipeline
Uses a Taken Detector
Two schemes to satisfy different requirements
Neither scheme is speed independent

33 Initial Idea
Precharge: only when the next stage has taken the current result
Evaluation: only when the next stage has precharged
Similar in idea to Martin’s pipeline schemes

34 Further Observation
Precharge
–We can precharge the current stage as soon as the first-level logic of the next stage has evaluated, i.e. the next stage has taken the result
Evaluate
–Evaluation can start as soon as the guarded N-transistor in the first-level logic of the next stage has turned off

35 Relax Precharge (RP) Constraint
The current stage can precharge as soon as the first-level logic of the next stage has evaluated: the next stage has Taken the result
The current stage can evaluate as soon as the first-level logic of the next stage has precharged, which blocks the new result from passing through
No extra control logic is needed except the TD, which is similar to a completion detector

36 RP Pipeline Scheme
[Figure: precharged function blocks F1, F2, F3 with taken detectors TD1, TD2, TD3; signals D(in), D(out)]
Cycle Time (P) = 2tF + 1tF1 + 1tF1 + 2tTD
Forward Latency (Lf) = 1tF

37 RP Timing Diagram

38 RP Timing Assumption
Easy-to-meet timing assumption

39 RP Timing Assumption Cont.
tF1_i is the delay of the first-level logic of stage i
tF2_i is the delay of the logic after the first level of stage i
The rising and falling delays of the TD are assumed to be equal

40 Relax Evaluation (RE) Constraint
The current stage can start evaluating at about the same time as the next stage turns off the guarded N-transistors in its first-level logic
Requires a generalized C-element, but improves cycle time

41 RE Pipeline Scheme
The TD can be skewed for fast evaluation detection
Cycle Time (P) = 2tF + 1tF1 + 1tTD + 1tC
Forward Latency (Lf) = 1tF
[Figure: precharged function blocks F1, F2, F3 with taken detectors TD1, TD2, TD3 and generalized C-element GC1; signals D(in), D(out)]

42 RE Timing Diagram

43 RE Timing Assumption 1
Precharge constraint

44 RE Timing Assumption 2
Evaluation constraint (Min Delay)

45 Issue in Fine-Grained Pipelines
In a fine-grained pipeline, such as Martin’s single-gate pipeline, the RE scheme may require buffering because of process variation
–Buffering is needed because of the second timing assumption: the next gate (stage) may not have turned off its N-stack before the result from the current stage reaches it

46 Taken Detector (TD)
Similar to a completion detector
Detects both evaluation and precharge
Its inputs are the outputs of the first-level logic of each stage
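
The detector described above, combined with the RP rule that a stage precharges once the next stage has taken its result, can be sketched behaviorally. Everything in this sketch is an assumption made for illustration: first-level outputs are modeled as dual-rail pairs, a precharged bit is (False, False), an evaluated bit has exactly one rail high, and the names `TakenDetector` and `previous_stage_phase` are hypothetical.

```python
from typing import Iterable, List, Tuple

Rail = Tuple[bool, bool]  # assumed dual-rail encoding of one first-level output


class TakenDetector:
    """Tracks whether a stage's first-level logic has evaluated or precharged."""

    def __init__(self) -> None:
        self.taken = False

    def update(self, first_level_outputs: Iterable[Rail]) -> bool:
        outputs: List[Rail] = list(first_level_outputs)
        if not outputs:
            raise ValueError("at least one first-level output is required")
        if all(r0 != r1 for r0, r1 in outputs):
            # Every bit has evaluated: the stage has taken its input.
            self.taken = True
        elif all(not r0 and not r1 for r0, r1 in outputs):
            # Every bit is precharged again.
            self.taken = False
        # Mixed state: hold the previous value.
        return self.taken


def previous_stage_phase(next_stage_taken: bool) -> str:
    """RP rule as stated in the slides: precharge once the next stage has taken."""
    return "precharge" if next_stage_taken else "evaluate"


if __name__ == "__main__":
    td = TakenDetector()
    taken = td.update([(True, False), (False, True)])
    print(previous_stage_phase(taken))
```

Holding the output in mixed states mirrors a completion detector built with a C-element; a skewed TD, as mentioned for the RE scheme, would need a different model.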

47 Datapath Merging & Splitting
Datapath merging and splitting can be done in a way similar to Williams’ style
[Figure: precharged function blocks F1, F2a, F2b, F3 with taken detectors TD1, TD2a, TD2b, TD3 and a C-element (C); signals D(in), D(out)]

48 Outline
Background review
–Sutherland
–Ted Williams
–Renaudin
–Martin
Taken pipeline
Performance comparison
Conclusions

49 Comparison of RE and Synchronous Skew Tolerant
Assume a 4-stage pipeline (stages 1-4) with 4-phase clocking
Synchronous:
–Stage 1 starts its next evaluation after stage 4 starts evaluation
Asynchronous:
–Stage 1 starts its next evaluation once the completion of the first-level logic of stage 3 is detected

50 Comparison Assumptions
The pipeline is balanced: all stages have equal evaluation time
Precharge time is the same as evaluation time
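
Under these two assumptions the cycle-time and forward-latency formulas quoted on the earlier slides can be evaluated side by side. The sketch below reads the two tF terms in each formula as evaluation and precharge delays and merges them into one value tF, which is an interpretation of the slides rather than something they state; the first-level terms tF1 are merged the same way. The delay values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Delays:
    tF: int   # stage evaluation delay (precharge assumed equal)
    tF1: int  # first-level logic delay within a stage
    tC: int   # C-element delay
    tD: int   # completion detector delay
    tTD: int  # taken detector delay


def compare(d: Delays) -> Dict[str, Tuple[int, int]]:
    """Return {scheme: (cycle time P, forward latency Lf)}."""
    return {
        "PC0":   (4 * d.tF + 4 * d.tC + 4 * d.tD, d.tF + d.tD + d.tC),
        "PC1":   (2 * d.tF + 4 * d.tC + 4 * d.tD, d.tF + 2 * d.tC + d.tD),
        "PS0":   (4 * d.tF + 2 * d.tD, d.tF),
        "DCVSL": (2 * d.tF + 4 * d.tC + 2 * d.tD, d.tF + d.tC + d.tD),
        "PCFB":  (4 * d.tF + 2 * d.tC + 2 * d.tD, d.tF),
        "RP":    (2 * d.tF + 2 * d.tF1 + 2 * d.tTD, d.tF),
        "RE":    (2 * d.tF + d.tF1 + d.tTD + d.tC, d.tF),
    }


if __name__ == "__main__":
    # Hypothetical delays in arbitrary time units.
    table = compare(Delays(tF=4, tF1=1, tC=1, tD=2, tTD=2))
    for scheme, (p, lf) in table.items():
        print(f"{scheme:6} P={p:3} Lf={lf:2}")
```

With other delay values the ordering can change, so the numbers only illustrate how the formulas are applied.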

51 Graphical Comparison

52 Optimum Number of Stages
Optimum Number of Stages (ONS)
Cycle time is not the only factor in system performance; forward latency is also a limiting factor
A larger cycle time can be compensated for by increasing the number of stages
However, a high Lf means that system throughput cannot be increased by adding more stages

53 Conclusion
With Taken logic and some easy-to-meet timing requirements, we can achieve the best cycle time and forward latency
The performance comparison with existing pipeline schemes is favorable
An implementation is still required to validate the theory

