1 Design of a High-Speed Asynchronous Turbo Decoder. Pankaj Golani, George Dimou, Mallika Prakash and Peter A. Beerel. Asynchronous CAD/VLSI Group, Ming Hsieh Electrical Engineering Department, University of Southern California. ASYNC 2007, Berkeley, California, March 12th, 2007.

2 Motivation and Goal
- Mainstream acceptance of asynchronous design requires
  - leveraging the ASIC standard-cell library-based design flow
  - achieving benefits significant enough to overcome synchronous momentum
- Our research goals for async designs
  - a high-speed standard-cell flow
  - applications where async designs yield significant improvements in throughput, throughput per area, and energy efficiency

3 Single Track Full Buffer (Ferretti '02)
- Follows a 2-phase, 1-of-N protocol
- High-performance standard-cell circuit family
- Comparison to synchronous standard cells
  - 4.5x better latency
  - 1+ GHz in 0.18µm, ~2.4x faster than synchronous
  - 2.8x more area
[Figure: STFB stage schematic with forward path and reset path (S, R, RCD, SCD blocks) on 1-of-N channels]

4 Block Processing – Pipelining and Parallelism
- Analogy (Steinhart Aquarium): K cases handled by M people working in pipelines, with latency l and person cycle time c
  - the first M cases arrive at t = l
  - subsequent batches of M cases arrive every c time units
- Consider two scenarios
  - baseline: cycle time C1, latency L1
  - improved: cycle time C2 = C1/2.4, latency L2 = L1/4.5
- Questions
  - How does cycle time affect throughput?
  - How does latency affect throughput?

5 Block Processing – Combined Cycle Time and Latency Effect
- Large K: throughput ratio ≈ cycle time ratio
- Small K: throughput ratio ≈ latency ratio
[Figure: throughput vs. number of cases K for the baseline and improved scenarios; the throughput ratio ranges from 4.32 at small K down to 2.6 at large K]
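A small model makes the combined effect of cycle time and latency on block throughput concrete. The latency, cycle time, and parallelism values below are hypothetical; only the 2.4x cycle-time and 4.5x latency ratios come from the slides.

```python
import math

def throughput(K, M, latency, cycle):
    # K blocks processed by M parallel pipelines: the first M results
    # arrive at t = latency, and each later batch of M arrives one
    # cycle time later.
    batches = math.ceil(K / M)
    finish_time = latency + (batches - 1) * cycle
    return K / finish_time

L1, C1, M = 100.0, 10.0, 4          # hypothetical baseline latency, cycle, parallelism
for K in (4, 1000):
    base = throughput(K, M, L1, C1)
    impr = throughput(K, M, L1 / 4.5, C1 / 2.4)
    print(f"K={K}: throughput ratio = {impr / base:.2f}")
```

As K shrinks to a single batch the ratio approaches the latency ratio (4.5); as K grows the latency term is amortized and the ratio approaches the cycle time ratio (2.4), matching the trend in the slide's plot.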

6 Talk Outline
- Turbo coding and decoding – an introduction
- Tree soft-input soft-output (SISO) decoder
- Synchronous turbo decoder
- Asynchronous turbo decoder
- Comparisons and conclusions

7 Turbo Coding – Introduction
- Error correcting codes add redundancy
  - the input data is K bits; the output code word is N bits (N > K)
  - the code rate is r = K/N
- Types of codes
  - linear codes
  - convolutional codes (CC)
  - turbo codes
[Figure: encoder mapping a K-bit input (01111) to an N-bit code word]

8 Turbo Encoding – Introduction
- Berrou, Glavieux and Thitimajshima (1993)
- Performance close to the Shannon channel capacity
- Typically uses two convolutional codes and an interleaver
- The interleaver improves error correction
  - increases the minimum distance of the code
  - creates a large block code
[Figure: turbo encoder built from an outer CC, an interleaver, and an inner CC]

9 Turbo Decoding
- Turbo decoder components
  - two soft-in soft-out (SISO) decoders, one for the inner CC and one for the outer CC
    - soft input: a priori estimates of the input data
    - soft output: a posteriori estimates of the input data
    - the SISO is often based on the Min-Sum formulation
  - interleaver / de-interleaver
    - maps SISO outputs to SISO inputs
    - same permutation as used in the encoder
- The iterative nature of the algorithm leads to block processing: one SISO must finish before the next SISO starts
[Figure: received data memory feeding the inner SISO, de-interleaver, outer SISO, and interleaver in a loop]
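The interleaver/de-interleaver pair can be sketched as a fixed permutation of the K bit positions shared by the encoder and both SISOs. The pseudo-random permutation here is purely illustrative; the slides do not specify the actual interleaver design.

```python
import random

def make_interleaver(K, seed=0):
    # Hypothetical interleaver: a fixed pseudo-random permutation of
    # the K positions, agreed on by encoder and decoder.
    rng = random.Random(seed)
    perm = list(range(K))
    rng.shuffle(perm)
    return perm

def interleave(values, perm):
    return [values[p] for p in perm]

def deinterleave(values, perm):
    # Inverse mapping: sends interleaved positions back to the original order.
    out = [None] * len(perm)
    for i, p in enumerate(perm):
        out[p] = values[i]
    return out

bits = [0, 1, 1, 1, 1, 0, 0, 1]
perm = make_interleaver(len(bits))
# De-interleaving undoes interleaving, so one SISO's outputs map back
# to the other SISO's inputs in the original order.
assert deinterleave(interleave(bits, perm), perm) == bits
```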

10 The Decoding Problem
- Requires finding paths in a graph called a trellis
  - Node: state j of the encoder at time index k
  - Edge: represents receiving a 0 or a 1 in the node for state j at time k
  - Path: represents a possible decoded sequence; the algorithm finds multiple paths
[Figure: example trellis for a 2-state encoder encoding K bits (t = 0 to t = K), with edges labeled "sent bit is 1" / "sent bit is 0" and decoded sequence 0 1 0 0 0 1 0 1 0 0]

11 Min-Sum SISO Problem Formulation
- Branch and path metrics
  - branch metric (BM): indicates the difference between expected and received values
  - path metric: sum of the associated branch metrics
- Min-Sum formulation: for each time index k, find
  - the minimum path metric over all paths for which bit k = 1
  - the minimum path metric over all paths for which bit k = 0
[Figure: example trellis (t = 0 to t = K) with branch metrics; the minimum path metric with bit k = 1 is 13 and with bit k = 0 is 16]
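A minimal sketch of the Min-Sum formulation on a toy 2-state trellis. The encoder model and branch metrics below are hypothetical, chosen only to make the definition concrete, and brute force over all bit sequences is used instead of a real SISO algorithm.

```python
from itertools import product

# Toy model: taking `bit` from `state` leads to state `bit` and costs
# BM[state][bit]. These metrics are illustrative, not from the slides.
BM = [[0, 2],   # branch metrics out of state 0, for bit 0 / bit 1
      [3, 1]]   # branch metrics out of state 1

def min_sum(K):
    # For each time index k, the minimum total path metric over all
    # paths with bit k = 0 and over all paths with bit k = 1
    # (brute force over every K-bit sequence; fine for a toy K).
    best = [[float("inf")] * 2 for _ in range(K)]
    for bits in product((0, 1), repeat=K):
        state, pm = 0, 0            # start in state 0, path metric 0
        for b in bits:
            pm += BM[state][b]
            state = b
        for k, b in enumerate(bits):
            best[k][b] = min(best[k][b], pm)
    return best

print(min_sum(3))  # → [[0, 4], [0, 3], [0, 2]]
```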

12 Talk Outline
- Turbo coding and decoding – an introduction
- Tree SISO low-latency turbo decoder architecture
- Synchronous turbo decoder
- Asynchronous turbo decoder
- Comparisons and conclusions

13 Conventional SISO – O(K) Latency
- Calculation of the minimum path metric can be divided into two phases
  - forward state metric for time k and state j: α_k(j) = min over edges (i → j) of [ α_{k-1}(i) + BM_k(i, j) ]
  - backward state metric for time k and state j: β_k(j) = min over edges (j → i) of [ β_{k+1}(i) + BM_{k+1}(j, i) ]
- The data dependency loop prevents pipelining
  - cycle time is limited to the latency of a 2-way add-compare-select (ACS)
  - latency is O(K)
[Figure: trellis segment around t = k-1, k, k+1, with edges labeled "received bit is 1" / "received bit is 0"]
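The forward phase can be sketched on a toy 2-state trellis (hypothetical branch metrics; the backward phase is symmetric). The serial k → k+1 dependency in the loop is exactly what limits the conventional SISO's cycle time and makes its latency O(K).

```python
INF = float("inf")

# Toy model: taking `bit` from `state` leads to state `bit` and costs
# BM[state][bit]. These metrics are illustrative, not from the slides.
BM = [[0, 2],
      [3, 1]]

def forward_metrics(K):
    # alpha[k][j] = minimum path metric from t = 0 (state 0) to state j
    # at time k. Each step is a 2-way add-compare-select (ACS): add the
    # branch metric, compare the two incoming candidates, select the min.
    alpha = [0, INF]
    history = [alpha[:]]
    for _ in range(K):
        nxt = [INF, INF]
        for j in (0, 1):
            for bit in (0, 1):                       # edge j -> bit
                nxt[bit] = min(nxt[bit], alpha[j] + BM[j][bit])
        alpha = nxt                                  # k -> k+1 dependency
        history.append(alpha[:])
    return history

print(forward_metrics(3))  # → [[0, inf], [0, 2], [0, 2], [0, 2]]
```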

14 Tree SISO – Low-Latency Architecture
- Tree SISO (Beerel/Chugg, JSAC '01)
  - calculates BMs for larger and larger segments of the trellis
  - analogous to creating group-wise PG logic for tree adders
- Tree SISO can process the entire trellis in parallel
  - no data dependency loops, so finer pipelining is possible
  - latency is O(log K)
[Figure: single-step trellis segments (t = 0..1, 1..2, 2..3, 3..4) combined pairwise into larger segments in a tree]
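The segment-combining idea can be sketched as min-plus matrix composition over trellis segments (again on a toy 2-state trellis with hypothetical metrics). Because the combine operation is associative, K single-step matrices reduce in a balanced tree of depth O(log K), analogous to group PG logic in tree adders.

```python
INF = float("inf")

# One-step min-plus matrix for a toy 2-state trellis:
# M1[s][t] = branch metric from state s to state t (illustrative values).
M1 = [[0, 2],
      [3, 1]]

def combine(A, B):
    # Min-plus "multiply": best metric across two adjacent segments,
    # minimizing over the intermediate state j. This is associative.
    n = len(A)
    return [[min(A[s][j] + B[j][t] for j in range(n)) for t in range(n)]
            for s in range(n)]

def tree_reduce(mats):
    # Combine adjacent pairs level by level: O(log K) tree depth
    # instead of the O(K) serial forward recursion.
    while len(mats) > 1:
        mats = [combine(mats[i], mats[i + 1]) if i + 1 < len(mats) else mats[i]
                for i in range(0, len(mats), 2)]
    return mats[0]

# Metric matrix for a 4-step segment, computed with tree depth 2.
print(tree_reduce([M1] * 4))  # → [[0, 2], [3, 4]]
```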

15 Remainder of Talk Outline
- Turbo coding – an introduction
- Turbo decoding
- Tree SISO low-latency turbo decoder architecture
- Synchronous turbo decoder
- Asynchronous turbo decoder
- Comparisons and conclusions

16 Synchronous Baseline Turbo Decoder
- IBM 0.18µm process with the Artisan standard-cell library
- SCCC code with a rate of 1/2; 6 decoding iterations
- Gate-level pipelined to achieve high throughput
- Timing-driven place-and-route
  - peak frequency of 475 MHz
  - SISO area of 2.46 mm²
- Multiple SISO blocks instantiated to achieve high throughput

17 Asynchronous Turbo Decoder
- Static Single-Track Full-Buffer standard-cell library (Golani '06)
  - a total of (only) 14 cells in the IBM 0.18µm process
  - extensive SPICE simulations performed to optimize the trade-off between performance and robustness
- Chip design
  - standard ASIC place-and-route flow (congestion-based)
  - ECO optimization flow
- Chip-level simulation
  - performed on a critical sub-block (55K transistors)
  - verified the timing constraints
  - measured latency and throughput using Synopsys NanoSim

18 Static Single Track Full Buffer (Ferretti '01)
- 1-of-N static single-track protocol between sender and receiver over an SST channel
- The channel wire is statically driven, which improves the noise margin
[Figure: SSTFB circuit with keeper and transistors M1–M3 and M10–M12 on the channel wire; each end alternately holds low / drives high and holds high / drives low]

19 Asynchronous Implementation Challenges – I
- Unbalanced fork and join structures degrade throughput
  - a token on the short branch is stalled by the imbalance
  - this slows down the entire fork-join structure
- Slack matching improves throughput by adding pipeline buffers
  - identify fork/join bottlenecks and resolve them by adding buffers to the short branch
- After P&R, long wires can create the same problem
  - solved by adding buffers on long wires using the ECO flow
[Figure: fork-join pipeline before and after slack matching]
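The fork/join stall can be illustrated with a toy token-flow model: each pipeline buffer holds at most one token, tokens advance one buffer per step, the join fires only when both branch heads hold tokens, and the fork injects only when both branch tails are empty. The unit delays and buffer counts are illustrative, not taken from the chip.

```python
def simulate(n_a, n_b, steps=200):
    # Fork -> branch A (n_a buffers) and branch B (n_b buffers) -> join.
    A = [0] * n_a                    # short branch
    B = [0] * n_b                    # long branch
    done = 0
    for _ in range(steps):
        if A[-1] and B[-1]:          # join consumes one token from each branch
            A[-1] = B[-1] = 0
            done += 1
        for br in (A, B):            # tokens advance into empty buffers
            for i in range(len(br) - 1, 0, -1):
                if br[i] == 0 and br[i - 1]:
                    br[i], br[i - 1] = 1, 0
        if A[0] == 0 and B[0] == 0:  # fork injects into both branches at once
            A[0] = B[0] = 1
    return done / steps              # throughput in tokens per step

print(simulate(1, 4))                # unbalanced: the short branch stalls
print(simulate(4, 4))                # slack matched with extra buffers on A
```

In this model the unbalanced structure completes roughly one token every four steps, because the single buffer on the short branch stays occupied until the join fires; adding slack buffers lets the structure approach one token per step, which is the slide's argument for slack matching.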

20 Asynchronous Implementation Challenges – II
- SSTFB implements only point-to-point communication, so forks need explicit cells
- Dedicated fork cells
  - create another pipeline stage
  - slack-matching buffers are then needed on the other paths
- Integrating the fork within the full adder
  - 45% less area than a full adder plus a fork cell
  - decreases the number of slack-matching buffers required
[Figure: adder tree built from full adders with separate fork cells vs. full adders with integrated forks]

21 Asynchronous Implementation Challenges – III
- 60% of the design consists of slack-matching buffers, mostly occurring in linear chains
- To save area and power, two new cells were created: SLACK2 and SLACK4
  - SLACK2: 17% area and 10% power improvement
  - SLACK4: 30% area and 19% power improvement
[Figure: a linear chain of single buffers replaced by SLACK2 / SLACK4 cells]

22 Remainder of Talk Outline
- Turbo coding – an introduction
- Turbo decoding
- Tree SISO low-latency turbo decoder architecture
- Synchronous turbo decoder
- Asynchronous turbo decoder
- Comparisons and conclusions

23 Comparisons
- Synchronous: peak frequency of 475 MHz, logic area of 2.46 mm²
- Asynchronous: peak frequency of 1.15 GHz, logic area of 6.92 mm²
- Design time
  - synchronous: ~4 graduate-student months
  - asynchronous: ~12 graduate-student months

24 Sync vs. Async
- M pipelined 8-bit Tree SISOs with latency l; let c be the sync clock cycle time (475 MHz)
  - the first M bits arrive at t = l
  - subsequent M bits arrive every c time units
- Two implementations
  - sync: cycle time C1, latency L1
  - async: cycle time C2 = C1/2.4, latency L2 = L1/4.5
- Desired comparisons
  - throughput vs. block size
  - energy vs. block size
[Figure: received memory and interleaver/de-interleaver feeding the SISOs with K-bit blocks]

25 Comparisons – Throughput / Area
- For small block sizes, asynchronous provides better throughput/area
- As block size increases, the two implementations become comparable
- For block sizes of 512 bits, synchronous cannot achieve the async throughput
[Figure: throughput/area vs. block size, with ratios of 3.91 (M=11), 2.13 (M=8), and 1.28 (M=3) marked]

26 Comparisons – Energy/Block
- For equivalent throughputs and small block sizes, asynchronous is more energy efficient than synchronous
- Async advantages grow with a larger async library (e.g., with BUF1of4)

27 Conclusions
- Asynchronous turbo decoder vs. synchronous baseline: the static STFB design offers significant improvements for small block sizes
  - more than 2x throughput/area
  - higher peak throughput (~500 Mbps)
  - more energy efficient
  - well suited for low-latency applications (e.g., voice)
- High-performance async is advantageous for applications that require
  - high performance (e.g., through pipelining)
  - low latency
  - block processing for which parallelism has diminishing returns; synchronous design requires extensive parallelism to achieve equivalent throughput

28 Future Work
- Library design
  - larger library with more than one size per cell
  - 1-of-4 encoding
- Async CAD
  - automated slack matching
  - static timing analysis

29 Questions?

