A PLA based Asynchronous Micropipelining Approach for Sub-threshold Circuit Design
Authors: Nikhil Jayakumar*, Rajesh Garg*, Bruce Gamache$, Sunil P. Khatri*


1 A PLA based Asynchronous Micropipelining Approach for Sub-threshold Circuit Design
Authors: Nikhil Jayakumar*, Rajesh Garg*, Bruce Gamache$, Sunil P. Khatri*
*Department of Electrical Engineering, Texas A&M University. $Conexant Systems, Inc.

2 Outline
- Motivation
- Introduction
- Approach
- Results
- Conclusions

3 Sub-threshold Leakage
- As supply voltage scales down, the V_T of the devices is scaled down as well.
- Leakage increases exponentially with decreasing V_T.
- Leakage power is becoming comparable with dynamic power.
- A larger V_T would reduce leakage but increase delay.
- We can turn this dilemma into an opportunity!
- Use sub-threshold leakage current to implement circuits.
- Set VDD less than V_T.

4 Advantages of Sub-threshold Circuit Design
- We performed simulations on a 21-stage ring oscillator (BPTM 65nm).
- Power is significantly lower (100-500X).
- PDP (power-delay product) improves by 10-20X.
- Transconductance is an exponential function of V_gs.
- Circuit noise margins are high.
- I_on/I_off = 100-200.
- Circuits get faster at higher temperature.

5 Disadvantages of Sub-threshold Circuit Design
- I_ds is highly dependent on PVT variations.
- Need dynamic compensating circuitry such as the one described in "A Variation-tolerant Sub-threshold Design Approach", N. Jayakumar, S. Khatri [DAC'05].
  - That work used Adaptive Body Biasing.
- I_ds is small, which results in large delay.
- Delay gets worse by 10-25X.
- Therefore, the application space is very low power applications such as sensor networks.
- Design methodologies for sub-threshold digital circuit design are ad hoc.

6 Contribution of this paper
- Provide a systematic EDA framework for the design of complex digital systems using sub-threshold Network of PLA (NPLA) based circuits.
- Use asynchronous micropipelining to provide greater throughput.
- Ideally suited for data-flow type circuits.

7 Why NPLAs?
- NPLAs are fast and area-efficient when compared to standard-cell based designs.
  - "Cross-talk immune VLSI design using a Network of PLAs Embedded in a Regular Layout Fabric", S. Khatri, R. Brayton, A. Sangiovanni-Vincentelli [ICCAD'00]
- Predictable delay of dynamic PLAs.
- Good circuit implementation choice for sub-threshold/near-threshold logic.
- Regular layout structure.
- Compatible with Restrictive Design Rules (RDRs) required to handle current and future lithographic issues.
- Technology-independent optimizations (literal reduction) are utilized better.
- No intervening technology mapping step.
- Implementing Structured ASICs.
- An array of fixed-size PLAs is ideally suited for implementing Structured ASIC type designs.
  - "A METAL and VIA Mask Customizable VLSI Design Scheme using an Array of Dynamic PLAs", N. Jayakumar, S. Khatri [ICCAD'04]

8 PLA structure – Precharged NOR-NOR
[Figure: PLA schematic with the AND plane and OR plane labeled]

9 PLA structure – Precharged NOR-NOR
- Inputs run vertically.
- Wordlines run horizontally.
- Outputs run vertically.
- A dummy wordline and a dummy output line are provided for self-timing.

10 PLA structure – Precharged NOR-NOR
- completion is the last signal to switch.
- Input latches latch data from the previous level.

11 Asynchronous Micropipeline Structure
Each PLA has:
- Data inputs – D (input)
- Data outputs – O (output)
- Handshaking control signals – P1, P2 (input)
  - Control the asynchronous handshake.
- PLA evaluation/precharge done signal – completion (output)
  - Switches high when evaluation completes, switches low when precharge completes.
- Internal clock signal – INTCLK (output)
  - Generated from completion, P1 and P2 to control operation of the PLA.
  - INTCLK = low → PLA precharges
  - INTCLK = high → PLA evaluates
[Figure: pipeline of PLAs at level 1, level 2, ..., level n]

12 Handshaking Logic
- PLA p (at level k) precharges (INTCLK goes low) if its P1 rises.
  - PLA q at the next higher level has latched the output data of p.
- PLA p evaluates (INTCLK goes high) if its P2 rises and its completion signal is low.
  - PLA p is currently in the precharged state (its completion signal is low).
  - PLA r at the next lower level has completed evaluation and has new data ready (P2 for PLA p has risen).
- The handshaking logic follows from these two conditions.
[Figure: handshaking logic circuit]
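The two conditions on this slide can be read as a small set/reset rule for INTCLK. The sketch below is a behavioural model only, not the circuit from the slide's figure. The class name `PlaHandshake`, the `step` method, and the choice to let a rising P1 take priority over a rising P2 are all assumptions made for illustration.

```python
class PlaHandshake:
    """Behavioural sketch of one PLA's INTCLK control (illustrative, not the slide's circuit)."""

    def __init__(self):
        self.intclk = False      # low = PLA precharges, high = PLA evaluates
        self.completion = False  # low = precharge done, high = evaluation done
        self._p1_prev = False    # previous P1 sample, used for edge detection
        self._p2_prev = False    # previous P2 sample, used for edge detection

    def step(self, p1, p2):
        """Sample P1 and P2 once and return the resulting INTCLK value."""
        p1_rose = p1 and not self._p1_prev
        p2_rose = p2 and not self._p2_prev
        if p1_rose:
            # The next-level PLA has latched our outputs: start precharging.
            # Assumption: P1 wins if both signals rise in the same sample.
            self.intclk = False
        elif p2_rose and not self.completion:
            # New data is ready upstream and this PLA is precharged: evaluate.
            self.intclk = True
        self._p1_prev, self._p2_prev = p1, p2
        return self.intclk


pla = PlaHandshake()
pla.step(p1=False, p2=True)   # P2 rises while completion is low: INTCLK goes high
pla.completion = True         # the PLA reports that evaluation has finished
pla.step(p1=True, p2=True)    # P1 rises: INTCLK goes low and the PLA precharges
```

The model is edge-triggered because the slide speaks of P1 and P2 "rising"; a level-sensitive implementation would behave differently if P2 rose before completion went low.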

13 Micro-Pipeline Operation
- Initially all PLAs are precharged.
- Drive primary inputs (D of level 1 PLAs).
- P2 signals of level 1 PLAs are asserted.
- After evaluation is done, completion signals of level 1 PLAs go high.
- Therefore level 2 PLAs start evaluating.
- Data gets latched at the inputs of level 2 PLAs, and INTCLK of level 2 PLAs goes high.
- This causes level 1 PLAs to start precharging.
- When evaluation of level 2 PLAs is done, their completion signals go high.
- This causes level 3 PLAs to start evaluating.
[Figure: pipeline of PLAs at level 1, level 2, ..., level n]

14 Micro-Pipeline Operation
- This goes on until the PLAs at level n finish evaluation (indicated by their completion signals going high).
- The consumer circuit latches the output and asserts P1 of the level n PLAs.
- This causes the level n PLAs to precharge.
- When completion of the level n-1 PLAs goes high and the level n PLAs have precharged, the level n PLAs can evaluate again.
[Figure: pipeline of PLAs at level 1, level 2, ..., level n]

15 Non-micropipelined vs Micropipelined
- Delay of non-micropipelined NPLA = T_pchg + n x T_eval
- Delay of micropipelined PLA = T_eval + T_pchg + handshaking time
[Figure: pipeline of PLAs at level 1, level 2, ..., level n]
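Dividing the first expression by the second gives a rough measure of the gain from micropipelining. In the sketch below, n is taken to be the number of PLA levels, and T_hs is a symbol introduced here (not on the slide) for the total handshaking time.

```latex
\[
\text{speedup}(n) \;=\;
\frac{T_{pchg} + n\,T_{eval}}{T_{eval} + T_{pchg} + T_{hs}}
\]
```

The numerator grows linearly with n while the denominator does not depend on n, so under these assumptions deeper networks gain more from micropipelining.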

16 Verilog Simulation of Micropipelining
- We simulated the handshaking protocol in Verilog.
- Verified correct operation.
- If the consumer circuit holds off asserting P1 for the level n PLAs, the entire pipeline stalls.
- Note that when level i is in precharge, level i+1 is in evaluation and vice versa.
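The two observations on this slide (alternating precharge/evaluate levels, and a stall when the consumer withholds P1) can be reproduced with a very coarse token model. This is not the authors' Verilog model. It is a sketch that assumes every action takes one abstract time step, that primary inputs are always available, and that each level is either "P" (precharged) or "E" (evaluated and holding data). The function name `step` and the `consumer_ready` flag are hypothetical.

```python
def step(state, consumer_ready=True):
    """Advance the pipeline by one abstract time step.

    state is a list with one entry per level: "P" (precharged) or "E" (evaluated).
    All decisions are made from the old state, so a level cannot precharge
    and re-evaluate within the same step.
    """
    n = len(state)
    nxt = list(state)
    for k in range(n):
        if state[k] == "P":
            # Evaluate when the previous level holds data (level 1 reads primary inputs).
            if k == 0 or state[k - 1] == "E":
                nxt[k] = "E"
        else:
            # Precharge once the next level (or the consumer) is taking our data.
            last = k == n - 1
            if (last and consumer_ready) or (not last and state[k + 1] == "P"):
                nxt[k] = "P"
    return nxt


state = ["P"] * 4                 # initially all PLAs are precharged
for _ in range(5):
    state = step(state)
    print("".join(state))         # settles into an alternating E/P pattern

stalled = ["P"] * 4
for _ in range(10):
    stalled = step(stalled, consumer_ready=False)
print("".join(stalled))           # every level ends up holding data: the pipeline is stalled
```

With the consumer always ready, neighbouring levels alternate between "E" and "P" from the third step on. With the consumer never ready, the levels fill up from the output side until nothing can move.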

17 Synthesis Algorithm
- First levelize the given multi-level network N.
- Generate a DFS ordering of the network nodes and sort it in increasing order of levels.
- Greedily include new nodes from the multi-level network into a current PLA.
- Assume the current PLA p has nodes {n} in it.
- Candidate nodes {m} for inclusion in PLA p are:
  - Nodes in the fanout of nodes in {n}.
  - Nodes at the same level as nodes in {n}.
- We evaluate the favorability of each candidate node m as:
  favorability(m) = 2 * (#common fanins(m, {n})) + (#common fanouts(m, {n}))
- The first term favors sharing of inputs with existing nodes {n}, while the second term favors sharing of outputs.
- Sharing of inputs was empirically determined to be more useful in yielding smaller PLA counts.
- We include the node with the highest favorability value.
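The favorability score can be sketched as follows. The slide does not define "common fanins" precisely; this sketch assumes it means the fanins of m that are also fanins of at least one node already in the PLA, and likewise for fanouts. The function name, the dictionary-based network representation, and the example network are all hypothetical.

```python
def favorability(m, pla_nodes, fanins, fanouts):
    """Score candidate node m against the nodes already placed in the current PLA.

    fanins and fanouts map each node name to a set of signal names.
    Shared inputs are weighted twice as heavily as shared outputs.
    """
    pla_fanins = set().union(*(fanins[n] for n in pla_nodes))
    pla_fanouts = set().union(*(fanouts[n] for n in pla_nodes))
    common_fanins = len(fanins[m] & pla_fanins)
    common_fanouts = len(fanouts[m] & pla_fanouts)
    return 2 * common_fanins + common_fanouts


# Hypothetical network: nodes a and b are already in the PLA, c and d are candidates.
fanins = {"a": {"x", "y"}, "b": {"y", "z"}, "c": {"x", "y", "z"}, "d": {"w"}}
fanouts = {"a": {"o1"}, "b": {"o1"}, "c": {"o1", "o2"}, "d": {"o3"}}
pla = {"a", "b"}

scores = {m: favorability(m, pla, fanins, fanouts) for m in ("c", "d")}
best = max(scores, key=scores.get)
print(scores, best)  # c shares three inputs and one output with the PLA; d shares none
```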

18 Synthesis Algorithm
- The current PLA p is grown until it violates size constraints.
- Nodes {n} in the current PLA are converted into a two-level network N.
- We run espresso on N.
- If the number of inputs, number of outputs and height of this two-level network are within bounds, then PLA p is grown further.
- If not, then we start growing a new PLA.
- Build a PLA dependency graph.
  - Each vertex corresponds to a unique PLA.
  - Each edge connects the output of a PLA to the input of another PLA.
- Nodes being included in the current PLA p are constrained by the following:
  - The node being included should not violate the size constraints of a PLA.
  - The inclusion of this node should not result in a cyclic PLA dependency graph.
- If such a node is not available, pick the next most favorable node.
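The acyclicity constraint can be checked by tentatively assigning the candidate node to the PLA, rebuilding the PLA dependency graph, and running a depth-first cycle test. The sketch below assumes the network is given as a list of (driver, receiver) node pairs and a node-to-PLA dictionary. The helper names and the example are illustrative and are not taken from the authors' tool.

```python
def pla_graph(edges, assignment):
    """Build the PLA dependency graph from node-level edges and a node->PLA map."""
    graph = {p: set() for p in set(assignment.values())}
    for src, dst in edges:
        ps, pd = assignment.get(src), assignment.get(dst)
        # Ignore unassigned nodes and edges that stay inside one PLA.
        if ps is not None and pd is not None and ps != pd:
            graph[ps].add(pd)
    return graph


def has_cycle(graph):
    """Depth-first search with three colours; a grey-to-grey edge means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {v: WHITE for v in graph}

    def visit(v):
        colour[v] = GREY
        for w in graph[v]:
            if colour[w] == GREY:
                return True
            if colour[w] == WHITE and visit(w):
                return True
        colour[v] = BLACK
        return False

    return any(colour[v] == WHITE and visit(v) for v in graph)


def can_include(node, pla, edges, assignment):
    """True if placing node in pla keeps the PLA dependency graph acyclic."""
    trial = dict(assignment)
    trial[node] = pla
    return not has_cycle(pla_graph(edges, trial))


# Hypothetical chain a -> b -> c with a in PLA1 and b in PLA2.
edges = [("a", "b"), ("b", "c")]
assignment = {"a": "PLA1", "b": "PLA2"}
print(can_include("c", "PLA1", edges, assignment))  # PLA1 -> PLA2 -> PLA1 would be a cycle
print(can_include("c", "PLA2", edges, assignment))  # only PLA1 -> PLA2 remains
```

Rebuilding the whole graph for every candidate is simple but not efficient; an incremental reachability check would do the same job with less work.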

19 Synthesis Algorithm
- After synthesis, the output of a PLA at level i may drive PLAs at level > i+1.
- Such a case will cause micropipelining to fail.
- Insert stutter blocks for signals which skip one or more levels of PLAs.
- Stutter blocks are banks of latches that delay signals which traverse more than one level of PLAs.
- Multiple stutter blocks are inserted for signals traversing multiple levels.
[Figure: PLA1 through PLA5 with a stutter block]
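A simple way to see how many stutter stages a connection needs is to count the levels it skips. The sketch below assumes one stutter stage per skipped level, so a signal from level i to level j needs j - i - 1 stages. The slide states only that multiple blocks are used for multi-level signals, so this count, the function name, and the example topology are assumptions. Sharing of stutter blocks between signals is not modelled.

```python
def stutter_stages(src_level, dst_level):
    """Number of stutter stages for a signal from src_level to dst_level.

    Assumes one stage per skipped PLA level; adjacent levels need none.
    """
    if dst_level <= src_level:
        raise ValueError("signals must go from a lower level to a higher level")
    return dst_level - src_level - 1


# Hypothetical three-PLA example: PLA1 also drives PLA3 directly.
levels = {"PLA1": 1, "PLA2": 2, "PLA3": 3}
edges = [("PLA1", "PLA2"), ("PLA2", "PLA3"), ("PLA1", "PLA3")]
needed = {(s, d): stutter_stages(levels[s], levels[d]) for s, d in edges}
print(needed)  # only the PLA1 -> PLA3 connection needs a stutter stage
```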

20 Experiments
- 65nm technology.
- VDD = 0.2V
- PLA size: 16 inputs, 14 outputs, 24 rows
- Delay and energy results from SPICE using 65nm BPTM model cards.
- Comparison made with non-micropipelined PLA.
- Throughput of PLA = 1/(T_eval + T_pchg + 2·H_eval + H_pchg)
  - T_eval = evaluation time for a PLA (~210ns)
  - T_pchg = precharge time for a PLA (~155ns)
  - H_eval = handshake time before start of evaluation (~60ns)
  - H_pchg = handshake time before start of precharge (~25ns)
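Plugging the approximate timings from this slide into the throughput expression gives the micropipelined cycle time used in the results. The sketch below also evaluates the non-micropipelined expression T_pchg + n x T_eval from the earlier comparison slide. The function names are hypothetical, and the value n = 13 is an inference made here (it reproduces the 2885 ns reported for alu4 in the delay results), not a figure stated in the slides.

```python
# Approximate timings from the Experiments slide, in nanoseconds.
T_EVAL_NS = 210   # evaluation time for a PLA
T_PCHG_NS = 155   # precharge time for a PLA
H_EVAL_NS = 60    # handshake time before start of evaluation
H_PCHG_NS = 25    # handshake time before start of precharge


def micropipelined_cycle_ns():
    """Time per result for the micropipelined design (1 / throughput)."""
    return T_EVAL_NS + T_PCHG_NS + 2 * H_EVAL_NS + H_PCHG_NS


def non_micropipelined_delay_ns(n_levels):
    """Delay of a non-micropipelined NPLA with n_levels levels of PLAs."""
    return T_PCHG_NS + n_levels * T_EVAL_NS


cycle = micropipelined_cycle_ns()
print(cycle)                                   # 510 ns per result
print(1e3 / cycle)                             # throughput in MHz (1 / 510 ns)
print(non_micropipelined_delay_ns(13))         # 2885 ns, assuming 13 levels
print(non_micropipelined_delay_ns(13) / cycle) # ratio of the two delays
```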

21 Results - Delay
Ckt     #PLAs   #Stutter Blks   Delay (ns) ↓ Non-µpipe   Delay (ns) ↓ µpipe   Impr.
alu4    14      5               2885                     510                  5.66 X
apex6   24      12              2465                     510                  4.83 X
C432    11      4               2255                     510                  4.42 X
C499    14      4               2255                     510                  4.42 X
C880    16      5               2255                     510                  4.42 X
C1355   21      10              3305                     510                  6.48 X
C1908   24      13              3935                     510                  7.72 X
C2670   34      13              3515                     510                  6.89 X
C3540   67      46              7505                     510                  14.72 X
pair    65      35              4565                     510                  8.95 X
rot     19      13              3095                     510                  6.07 X
Avg     28.09   14.55                                                         6.78 X
- Delay = 1/throughput for the micropipelined design.
- Delay is constant since PLA size is fixed.

22 Results – Area
Ckt     #PLAs   #Stutter Blks   Area (µm²) ↑ Non-µpipe   Area (µm²) ↑ µpipe   Ovh.
alu4    14      5               9408                     12768                1.36 X
apex6   24      12              16128                    24192                1.5 X
C432    11      4               7392                     10080                1.36 X
C499    14      4               9408                     12096                1.29 X
C880    16      5               10752                    14112                1.31 X
C1355   21      10              14112                    20832                1.48 X
C1908   24      13              16128                    24864                1.54 X
C2670   34      13              22848                    31584                1.38 X
C3540   67      46              45024                    75936                1.69 X
pair    65      35              43680                    67200                1.54 X
rot     19      13              12768                    21504                1.68 X
Avg     28.09   14.55                                                         1.47 X
- Area estimates are based on layout of PLAs along with stutter blocks.

23 What about Energy consumption?
- Non-micropipelined NPLAs precharge together and then evaluate in a domino fashion.
- Energy is wasted due to leakage in the "Precharged" and the "Evaluated" states.
- Micropipelined PLAs spend little time in the "Precharged" or "Evaluated" states.
[Figure: timing diagram for a non-micropipelined NPLA]

24 Results – Energy
Ckt     #PLAs   #Stutter Blks   Energy (fJ) ↓ Non-µpipe   Energy (fJ) ↓ µpipe   Impr.
alu4    14      5               5984.8                    1811.43               3.3
apex6   24      12              9033.09                   3261.19               2.77
C432    11      4               3877.22                   1397                  2.78
C499    14      4               4961.02                   1768.64               2.8
C880    16      5               6088.11                   2052.22               2.97
C1355   21      10              10198.86                  2863.68               3.56
C1908   24      13              13814.19                  3307.96               4.18
C2670   34      13              18694.33                  4472.11               4.18
C3540   67      46              73900.56                  9777.18               7.56
pair    65      35              44442.77                  9047.27               4.91
rot     19      13              8966.68                   2774.15               3.23
Avg     28.09   14.55                                                           3.84
- Results show energy consumption for one computation through the NPLA circuit.
- Significant reduction in energy consumption is observed.

25 Conclusions
- We have proposed an asynchronous micropipelined design approach that reclaims some of the speed penalty associated with sub-threshold circuit design.
- Ideally suited for data-flow type applications.
- We implemented:
  - Handshaking protocol for micropipelining.
  - Circuit design aspects of the approach.
  - Logic synthesis for micropipelined NPLAs.
- We validated the approach with Verilog and SPICE simulations.
- Results show that:
  - Design can be sped up by ~7X.
  - Area overhead is ~47%.
  - Energy consumption is lower by ~4X.
- Techniques described can be used for regular operating conditions (VDD > V_T) as well.

26 Thank you. Questions?

