Published by Janie Weber. Modified about 1 year ago.

1
Alexander Smirnov, Alexander Taubin

2
Determine: ◦ max throughput ◦ causes of the throughput limit ◦ max achievable throughput ◦ cost of achieving a given throughput level. Data-independent token flow: ◦ no early evaluation ◦ DEMUXes send data all ways. Cells across the library/design implement the same handshaking protocol.

3
◦ Previous work ◦ Cell characterization ◦ Protocol characterization ◦ Throughput of asynchronous pipelines (reminder) ◦ Throughput analysis ◦ Throughput optimization

4
Early work on the throughput of asynchronous pipelines: M. Greenstreet, K. Steiglitz; T. Williams; A. Lines. Time separation of events (TSE) based approaches to throughput analysis: T. Amon, H. Hulgaard, S. Burns, G. Borriello; S. Chakraborty, D. Dill; P. McGee, S. Nowick. Simulation-based approaches: C. Brej; K. Fazel. Slack matching (throughput optimization) approaches: P. Prakash, A. Martin; P. Beerel, M. Davies, A. Lines, N. Kim.

5
Cell characterization example (in Liberty). A cell (in ASIC) is a physical implementation of a gate. Characterization abstracts away the details and specifies the parameters needed at the higher level of the hierarchy. Cell characterization: ◦ abstracts away cell implementation details ◦ specifies functionality, timing, area, power consumption, etc. ◦ is necessary and sufficient for efficient synthesis, optimization and simulation. De facto standard: Synopsys "Liberty".

6
Conventional gate: ◦ Implements a function of input wires ◦ Special signals: clock, set, clear, etc. Asynchronous stage: ◦ Implements a function of input channels ◦ Special signals: request, acknowledge, data0, data1, reset

7
Reuse Synopsys Liberty whenever possible. Use attributes to specify the roles of pins in handshaking, channels, etc. Specify functionality in terms of channels (abstracting out control functionality). Use Data → Data timing arcs to specify channel → channel attributes: slack and number of tokens at initialization. * PCHB stage example
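
To make the channel-level view concrete, here is a minimal Python sketch of the information such a characterization might carry once it is read out of a library. It is not Liberty syntax and not the authors' data model: the class names, the pin names, the role strings and the slack value of 0.5 for a half-buffer stage are all illustrative assumptions.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChannelArc:
    """Channel -> channel attributes carried by a Data -> Data timing arc."""
    src: str          # input channel name (hypothetical)
    dst: str          # output channel name (hypothetical)
    slack: float      # slack contributed on this arc, in tokens (assumed unit)
    init_tokens: int  # number of tokens on the arc at initialization


@dataclass
class StageCell:
    """One asynchronous stage described in terms of channels, not wires."""
    name: str
    pin_roles: dict[str, str]  # pin name -> handshake role (illustrative names)
    arcs: list[ChannelArc] = field(default_factory=list)

    def channels(self) -> set[str]:
        # Every channel that appears as a source or destination of an arc.
        return {a.src for a in self.arcs} | {a.dst for a in self.arcs}


# Illustrative values only: a half-buffer stage with one input channel L and
# one output channel R, assumed to hold half a token and to start empty.
pchb = StageCell(
    name="PCHB_BUF",
    pin_roles={"L0": "data0", "L1": "data1", "Le": "acknowledge",
               "R0": "data0", "R1": "data1", "Re": "acknowledge",
               "rst": "reset"},
    arcs=[ChannelArc(src="L", dst="R", slack=0.5, init_tokens=0)],
)
print(sorted(pchb.channels()))
```

Keeping slack and initial token count on the arc rather than on the cell mirrors the slide's point that these are channel → channel attributes.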

8
Abstract channel: forward/backward control and forward data propagation. Assumption: the handshake protocol is the same across the library/design. Legend: L - Left/Right; F - Forward/Backward; C - Control/Data; E - Evaluation/Reset

9
Abstract channel: forward/backward control and forward data propagation. Assumption: the handshake protocol is the same across the library/design. Use cell characterization to infer the handshake protocol. Abstraction and characterization allow identifying protocol loops in every stage for every pair of channels. Legend: L - Left/Right; F - Forward/Backward; C - Control/Data; E - Evaluation/Reset

10
Goal: enumerate all handshake cycles ◦ handshake cycles are the same across the design (assumption) ◦ for practical protocols a handshake cycle covers 3 stages ◦ enumerate all possible cycles in the full timing graph of a 4-stage FIFO, normalize the cycles and remove identical ones. Complexity is negligible. * PCHB stage example
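
The enumerate, normalize and deduplicate steps can be sketched in Python as follows. The timing graph here is a made-up toy protocol (one data event `d` and one acknowledge event `a` per stage), not PCHB, and its cycles span two stages rather than three. The function names and graph encoding are assumptions for illustration.

```python
def toy_fifo_graph(n_stages=4):
    """Toy timing graph: node = (stage index, event name). Not a real protocol."""
    g = {}
    for i in range(n_stages):
        # A stage acknowledges its input once its output data is valid.
        g.setdefault((i, "d"), []).append((i, "a"))
        if i + 1 < n_stages:
            g[(i, "d")].append((i + 1, "d"))                   # forward data
            g.setdefault((i + 1, "a"), []).append((i, "d"))    # ack enables next token
    return g


def simple_cycles(graph):
    """Yield every simple cycle once, starting from its smallest node."""
    for start in sorted(graph):
        stack = [(start, (start,))]
        while stack:
            node, path = stack.pop()
            for nxt in graph.get(node, ()):
                if nxt == start:
                    yield path
                elif nxt > start and nxt not in path:
                    stack.append((nxt, path + (nxt,)))


def normalize(cycle):
    """Shift stage indices to start at 0 and rotate to a canonical start."""
    base = min(stage for stage, _ in cycle)
    shifted = [(stage - base, event) for stage, event in cycle]
    k = shifted.index(min(shifted))
    return tuple(shifted[k:] + shifted[:k])


raw = list(simple_cycles(toy_fifo_graph(4)))
unique = {normalize(c) for c in raw}
print(len(raw), "cycles found,", len(unique), "after normalization")
```

Because the same protocol is assumed everywhere, cycles that differ only by their position in the FIFO collapse to one entry after normalization, which is why a short FIFO is enough to characterize the protocol.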

11
Asynchronous pipeline throughput is determined by loops: ◦ Handshaking ◦ Algorithmic (rings) and congestion. Pipeline throughput is known for basic pipeline compositions. Bottleneck-based approach: pipeline compositions are bottleneck candidates.

12
T. Williams (1990), A. Lines (1995): throughput T in terms of ◦ x: token count ◦ s: slack ◦ d: dynamic slack ◦ c: cycle time. x is invariant for a ring in a pipeline with deterministic (data-independent) token flow. [Figure labels: lif, ci]
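
The slide's equation did not survive in this transcript. As a stand-in, the Python sketch below uses the commonly cited "canopy" model for a ring of identical stages: throughput is limited by tokens at low occupancy, by holes at high occupancy, and by the local cycle time in between. This is an assumption, not a reconstruction of the slide. The uniform per-stage forward and backward latencies and the default slack of one token per stage are simplifications added here.

```python
def ring_throughput(x, n_stages, fwd_latency, bwd_latency, cycle_time, slack=None):
    """Throughput of a uniform ring holding x tokens (assumed canopy model)."""
    if slack is None:
        slack = n_stages  # assumption: each stage can hold one token
    token_limited = x / (n_stages * fwd_latency)            # too few tokens
    hole_limited = (slack - x) / (n_stages * bwd_latency)   # too few holes
    local_limit = 1.0 / cycle_time                          # handshake cycle
    return max(0.0, min(token_limited, hole_limited, local_limit))


def dynamic_slack_range(n_stages, fwd_latency, bwd_latency, cycle_time, slack=None):
    """Token counts (d_min, d_max) between which peak throughput is reached."""
    if slack is None:
        slack = n_stages
    d_min = n_stages * fwd_latency / cycle_time
    d_max = slack - n_stages * bwd_latency / cycle_time
    return d_min, d_max


# Illustrative numbers: 10 stages, forward latency 2, backward latency 4,
# local cycle time 10 (arbitrary time units).
for x in (1, 4, 8):
    print(x, ring_throughput(x, 10, 2, 4, 10))
print(dynamic_slack_range(10, 2, 4, 10))
```

Since x is fixed for a ring with data-independent token flow, the operating point on this curve is set at initialization and moves only when stages are added or removed.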

13
For a serial composition of pipelines with throughputs $T_1$, $T_2$, the resulting throughput is $T_{\text{resulting}} = \min\{T_1, T_2\}$. $T_{\text{resulting}}$ is observed at $d_{\min} \le x \le d_{\max}$. [Figure labels: T_j, T_i, T_2, T_1]

14
For a parallel composition of pipelines with throughputs $T_1$, $T_2$, the resulting throughput $T_{\text{resulting}}$ is observed at [...]. [Figure labels: T_2, T_1]

15
Peak throughput of a pipeline is limited by its slowest component. To determine the throughput of a pipeline it is sufficient to find the slowest combination of stages: the throughput bottleneck. Bottleneck candidates (BCs): ◦ Handshake (h/s) cycle ◦ Re-converging paths ◦ Algorithmic cycle (ring). A BC is characterized by its cycle time range.

16
The length of each h/s cycle in the protocol is computed for each window of $2 \le m \le 3$ (HB) stages. Handshake cycles are known from protocol analysis. The lengths of each cycle (minimum and maximum, per cycle i) are computed for each cycle "in place". Heuristic: cycles involving multiple branches are not considered. Complexity: [...], where $v_i$ are primary outputs of a stage's environment (reaction times). * PCHB stages example

17
Theorem: if a BC is a bottleneck, the reaction times on its borders never exceed those used to compute its cycle time. It follows from the theorem that a BC can be analyzed in isolation to determine its cycle time. BCs are sorted with respect to cycle time. The BC with the highest cycle time is a bottleneck: it defines the throughput of the design.
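
A minimal Python sketch of the ranking step follows, assuming each BC has already been analyzed in isolation and carries a cycle time range. The record layout, the example names and numbers, and the interval rule used for the design-level range are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BottleneckCandidate:
    name: str
    kind: str          # "handshake", "reconverging" or "ring"
    cycle_min: float   # lower bound of the cycle time range
    cycle_max: float   # upper bound of the cycle time range


def rank(bcs):
    """Sort BCs so that the one with the largest cycle time comes first."""
    return sorted(bcs, key=lambda bc: bc.cycle_max, reverse=True)


def design_throughput_range(bcs):
    """Throughput range implied by the slowest BC (assumed interval rule)."""
    slowest_max = max(bc.cycle_max for bc in bcs)
    slowest_min = max(bc.cycle_min for bc in bcs)
    return 1.0 / slowest_max, 1.0 / slowest_min


# Hypothetical candidates with made-up cycle times (arbitrary time units).
candidates = [
    BottleneckCandidate("hs_window_3", "handshake", 8.0, 10.0),
    BottleneckCandidate("ring_A", "ring", 16.0, 20.0),
    BottleneckCandidate("fork_join_1", "reconverging", 9.0, 12.5),
]
bottleneck = rank(candidates)[0]
print(bottleneck.name, design_throughput_range(candidates))
```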

18
Requires the results of handshake cycle analysis. Identify pairs of re-converging paths and compute [...]. Reduce the number of pairs of re-converging paths: ◦ one pair of re-converging paths is identified per fork-join ◦ the pipeline is assumed to have deterministic (data-independent) token flow, so the number of initial tokens in any two re-converging paths is the same. The number of BCs can be reduced if optimization is not needed.

19
Heuristics for identifying rings and re-converging paths include: ◦ for any set of rings with common arc(s), consider only two (the longest and the shortest)

20
The throughput of rings and re-converging path pairs is computed using the equations from T. Williams and A. Lines, BUT: ◦ if a handshake cycle covers the re-converging paths (i.e. the length of the shorter branch is 0-2 half-buffer stages), the equations from T. Williams and A. Lines do not apply. The throughput of such a bottleneck candidate is determined by the handshake cycles.

21
Identify handshake bottlenecks (sliding window). Optimize handshake bottlenecks (if necessary). Identify BCs due to algorithmic loops and dynamic slack imbalance: ◦ CPM (critical path method), modified to handle loops ◦ Trade memory for time: store arrival times and significant predecessors ◦ Eliminate unnecessary graph exploration
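
The "store arrival times and significant predecessors" idea can be illustrated with the textbook critical path method on an acyclic graph. The Python sketch below covers only that baseline: the modification for loops mentioned on the slide is not shown, and the standard-library sorter used here raises an error if the graph contains a cycle. The edge encoding and function name are assumptions.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def critical_path(edges):
    """edges: dict mapping (u, v) -> delay. Returns (length, path)."""
    preds = {}
    for (u, v), delay in edges.items():
        preds.setdefault(v, {})[u] = delay
        preds.setdefault(u, {})
    arrival, best_pred = {}, {}
    order = TopologicalSorter({v: set(p) for v, p in preds.items()}).static_order()
    for v in order:
        # Latest arrival time at v, and the predecessor that causes it.
        arrival[v], best_pred[v] = 0.0, None
        for u, delay in preds[v].items():
            if arrival[u] + delay > arrival[v]:
                arrival[v], best_pred[v] = arrival[u] + delay, u
    end = max(arrival, key=arrival.get)
    path = [end]
    while best_pred[path[-1]] is not None:
        path.append(best_pred[path[-1]])
    return arrival[end], path[::-1]


# Made-up delays on a small fork-join graph.
length, path = critical_path({("a", "b"): 2, ("a", "c"): 1,
                              ("b", "d"): 3, ("c", "d"): 1})
print(length, path)
```

Storing one significant predecessor per node is what lets the critical path be read back without exploring the graph a second time.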

22
Predicted throughput variation range (% of the actual simulated throughput). The predicted throughput variation depends on: ◦ asymmetry in library cells, which makes throughput vary with the data (actual throughput variation) ◦ uncertainty introduced by heuristics (currently, incomplete synchronization trees introduce height uncertainty)

23
Throughput estimation is heuristic-based, i.e. error is possible. Shown is the % difference between the actual throughput and the bound of the predicted variation range, weighted by the actual throughput. In 92.5% of test cases the measured throughput is within the predicted variation range; the maximum error observed is 27%.
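
One plausible reading of this error metric, written as a small Python function: the error is zero when the measured throughput falls inside the predicted range, otherwise it is the distance to the nearer bound as a percentage of the measured value. The choice of the nearer bound and the function name are assumptions, and the numbers in the usage lines are made up.

```python
def prediction_error_pct(actual, predicted_low, predicted_high):
    """Percent distance from the predicted range, relative to the actual value."""
    if predicted_low <= actual <= predicted_high:
        return 0.0
    nearest = predicted_low if actual < predicted_low else predicted_high
    return abs(actual - nearest) / actual * 100.0


print(prediction_error_pct(100.0, 90.0, 110.0))  # inside the range
print(prediction_error_pct(80.0, 90.0, 110.0))   # below the range
```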

24
Alleviate bottlenecks with throughput below the goal by: ◦ Handshake pipelining ◦ Ring padding, slack matching. Iteratively: ◦ insert stages ◦ update all BCs
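
The insert-and-update loop can be sketched for a single ring using a simplified ring model: identical stages, one token of capacity per stage, and cycle time bounded by tokens, holes and the local handshake cycle. The model and all names in this Python sketch are assumptions; the tool described in the slides updates every BC after each insertion, which this toy does not attempt.

```python
def ring_cycle_time(n, x, f, b, c):
    """Cycle time of a uniform n-stage ring with x tokens (assumed model).

    f, b: per-stage forward/backward latency; c: local handshake cycle time.
    Requires 1 <= x < n.
    """
    return max(c, n * f / x, n * b / (n - x))


def pad_ring(n, x, f, b, c, goal, max_insert=100):
    """Insert stages one at a time while that reduces the cycle time."""
    inserted = 0
    current = ring_cycle_time(n, x, f, b, c)
    while current > goal and inserted < max_insert:
        candidate = ring_cycle_time(n + inserted + 1, x, f, b, c)
        if candidate >= current:
            break  # padding no longer helps (e.g. the ring is token-limited)
        inserted += 1
        current = candidate
    return inserted, current


# Hole-limited ring: padding adds holes and reaches the goal.
print(pad_ring(n=6, x=4, f=2, b=4, c=8, goal=8))
# Token-limited ring: extra stages only lengthen the loop, so none are added.
print(pad_ring(n=6, x=1, f=2, b=4, c=8, goal=8))
```

The second call shows the kind of ring that stage insertion cannot fix, since adding stages to a ring that lacks tokens only makes the loop longer.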

26
The approach allows automatic optimization of the throughput up to the level limited by: ◦ library cells ◦ data-deficient (long non-pipelined) rings. Fully optimized throughput is higher (cycle time is smaller) for: ◦ FIFOs ◦ circuits without synchronization trees (fan-out 1)

27
Developed an asynchronous cell/stage characterization based on Synopsys Liberty, used for synthesis and throughput analysis/optimization. Protocol characterization is automatically inferred from cell characterization. Support for hierarchical designs (with possible loss of precision). All bottlenecks are identified. All bottlenecks except data-deficient rings are automatically alleviated. Optimization was tested with stage insertion, but other optimizations can be used. Analysis results are easily adjusted to reflect non-structural changes.

28

29
Handshake cycles involving branches are currently not considered. Unless merges/forks are properly characterized, analysis of hierarchical designs is imprecise. Synchronization trees are currently assumed to be balanced; for incomplete trees one sync cell delay is added to the variation range.
