Design of a High-Throughput Low-Power IS95 Viterbi Decoder Xun Liu Marios C. Papaefthymiou Advanced Computer Architecture Laboratory Electrical Engineering.

1 Design of a High-Throughput Low-Power IS95 Viterbi Decoder. Xun Liu and Marios C. Papaefthymiou, Advanced Computer Architecture Laboratory, Electrical Engineering and Computer Science Department, University of Michigan.

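2 Digital Communication System
[Figure: block diagram. Transmitter: symbols enter the encoder, which passes codes to the analog transmitter. The analog channel is subject to noise. Receiver: the analog receiver passes codes to the decoder, which outputs symbols. The analog transmitter, analog channel and analog receiver are grouped as the digital channel.]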

3 Convolutional Coding
[Figure: transmitter/receiver diagram with a digital channel between encoder and decoder. Input symbols (010110110 …) enter a rate 1/n convolutional encoder, which emits outputs O0 … On-1, one n-bit code per input symbol.]

4 IS95 Convolutional Encoding
- Used in the reverse link of the IS95 CDMA system
- 256 states (8 state registers)
- Rate 1/3
- Maximum free distance coding
[Figure: shift-register encoder with input I, state registers S7 to S0, and outputs O0, O1, O2.]
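
The encoder described on this slide can be sketched as a shift register with XOR taps. The slide does not list the tap positions; the generator polynomials below (557, 663, 711 octal) are the values commonly cited for the IS-95 reverse link and are an assumption here, as are the function and variable names.

```python
def conv_encode(bits, generators, constraint_length):
    """Encode bits with a rate 1/n feed-forward convolutional code.

    Returns one tuple of n output bits per input bit.
    """
    state = 0  # the K-1 state registers, newest bit in the top position
    codes = []
    for b in bits:
        # K-bit window: current input bit followed by the stored bits.
        window = (b << (constraint_length - 1)) | state
        # Each output is the parity (XOR) of the window bits tapped by one generator.
        codes.append(tuple(bin(window & g).count("1") & 1 for g in generators))
        state = window >> 1  # shift: the oldest bit falls off
    return codes


# Assumed tap polynomials (octal); not given on the slide.
ASSUMED_IS95_REVERSE_GENERATORS = (0o557, 0o663, 0o711)
K = 9  # 8 state registers plus the current input bit
NUM_STATES = 1 << (K - 1)

# Impulse response: a single 1 followed by eight 0s.
impulse_codes = conv_encode([1] + [0] * 8, ASSUMED_IS95_REVERSE_GENERATORS, K)
```

Each input bit yields three output bits (rate 1/3), and the 8 state registers give 2^8 = 256 trellis states.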

5 Viterbi Decoding (VD)
VD is optimal for convolutional codes.
- Maximum likelihood decoding scheme.
- Minimum error for an additive white Gaussian noise channel.
VD procedure.
- Construction of a complex graph called a trellis.
- Computation of the shortest path.

6 IS95 VD Trellis
[Figure: trellis with 256 nodes per layer and one layer per symbol (symbols 1, 2, 3, 4, …).]

7 Challenge of Large-State VD Designs
- High computational complexity: VDs with hundreds of states require multiple Gops of throughput when symbol transfer rates reach Mbps. This calls for parallel processing.
- High interconnect power dissipation: complex routing among the processors.
For large-state VDs, global data transfer and interconnect issues must be considered carefully.

8 Viterbi Decoder Designs
[Figure: published Viterbi decoder designs plotted by number of states (2 to 512) and throughput (Mbps, log scale from 10^-2 to 10^4), each labeled with its power dissipation (0.24 mW to 3 W).]
Our design: 256 states, 20 Mbps, 0.45 W.

9 Presentation Outline
- Viterbi decoding overview
- Our contributions
  - Data transfer oriented hierarchical inter-processor optimization
  - Intra-processor power optimization
- Chip data

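10 Encoding Example
Rate 1/2 code with a 3-bit state register (8 states).
Inputs: …, 0, 0, 1, 0
Outputs: …, 01, 11, 11, 00
[Figure: shift-register encoder with input I, three state bits (all 0), and outputs O0 and O1.]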

11 Viterbi Decoding
[Figure: one trellis stage over the states 000 to 111, with each transition labeled by its output code (00, 11, 01, …). Code received: 00. Edge weights shown: 0 and 2.]

12 Viterbi Decoding
[Figure: trellis over the states 000 to 111 with an edge weight on each transition and a vertex weight at each node, for the received codes 00, 11, 01. Example of an Add-Compare-Select (ACS) operation: vertex weight 0 = Min{0+0, 0+2}.]

13 Viterbi Decoding
[Figure: the same trellis (states 000 to 111, received codes 00, 11, 01) with edge and vertex weights.]

14 VD Summary
Each decoded symbol requires a layer of similar computations:
- 2N edge weight computations (N = number of states).
- N add-compare-select (ACS) operations.
Operations within each layer are independent.
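
The layer structure summarized here can be sketched as a small hard-decision decoder. This is an illustration, not the decoder from the talk: it uses a 4-state rate 1/2 code with assumed generators 7 and 5 (octal), Hamming distance as the edge weight, and hypothetical function names.

```python
def hamming(a, b):
    """Number of differing bits between two small integers."""
    return bin(a ^ b).count("1")


def build_trellis(generators, K):
    """Map (state, input_bit) to (next_state, output_code) for one layer."""
    n_states = 1 << (K - 1)
    table = {}
    for s in range(n_states):
        for b in (0, 1):
            window = (b << (K - 1)) | s
            code = 0
            for g in generators:
                code = (code << 1) | (bin(window & g).count("1") & 1)
            table[(s, b)] = (window >> 1, code)
    return n_states, table


def encode(bits, generators, K):
    _, table = build_trellis(generators, K)
    state, codes = 0, []
    for b in bits:
        state, code = table[(state, b)]
        codes.append(code)
    return codes


def viterbi_decode(codes, generators, K):
    n_states, table = build_trellis(generators, K)
    inf = float("inf")
    metric = [0] + [inf] * (n_states - 1)  # the encoder starts in state 0
    history = []
    for received in codes:  # one trellis layer per received code
        new_metric = [inf] * n_states
        choice = [None] * n_states
        for (s, b), (nxt, code) in table.items():  # 2N edge weights
            candidate = metric[s] + hamming(code, received)  # add
            if candidate < new_metric[nxt]:  # compare
                new_metric[nxt] = candidate  # select
                choice[nxt] = (s, b)
        metric = new_metric
        history.append(choice)
    # Backtrace along the surviving path from the best final state.
    state = min(range(n_states), key=metric.__getitem__)
    bits = []
    for choice in reversed(history):
        state, b = choice[state]
        bits.append(b)
    return bits[::-1]


GENERATORS, K = (0o7, 0o5), 3  # assumed example code with 4 states
message = [1, 0, 1, 1, 0, 0, 1, 0, 0]
received = encode(message, GENERATORS, K)
received[2] ^= 0b01  # inject a single bit error
decoded = viterbi_decode(received, GENERATORS, K)
```

Within one layer, the updates of different next states do not depend on each other, which is what allows them to be spread across several ACS processors.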

15 Viterbi Decoder Architectures
Design space: number of processors used.
Sequential architecture
- One ACS processor
- Low design complexity
- Very low throughput
Parallel architecture
- One ACS processor per state
- Very complex routing problem
- High power dissipation due to long interconnects

16 Viterbi Decoder Architectures
Design space: number of processors used.
Intermediate solutions: several ACS processors connected by buses.
[Figure: four ACS processors attached to shared buses.]

17 Key Issues
- How many ACS processors?
- Which ACS operations are executed in each processor? (Operation partitioning)
- Which ACS operations can be executed concurrently? (Operation packing)
- In what order are the operations executed? (Non-forwarding scheduling)
- Can processors be pipelined? (Non-forwarding scheduling)

18 Q: Which operations are executed in each ACS processor? A: Operation partitioning for global data transfer reduction

19 Operation Partitioning Example
8-state trellis split between two processors, ACS1 and ACS2.
- Partitions {0, 1, 2, 4} and {3, 5, 6, 7}: 4 transfers between the processors.
- Partitions {0, 1, 2, 3} and {4, 5, 6, 7}: 8 transfers between the processors.
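
The transfer counts in this example can be reproduced with a short sketch. It assumes the usual shift-register trellis, in which state s feeds states 2s mod N and 2s+1 mod N, and it counts a transfer as global when the producing and consuming states are assigned to different partitions. The function names are illustrative.

```python
def trellis_edges(n_states):
    """Edges (source, destination) of one layer of a shift-register trellis."""
    return [(s, (2 * s + b) % n_states) for s in range(n_states) for b in (0, 1)]


def global_transfers(partitions, n_states):
    """Count edges whose two endpoints sit in different partitions."""
    owner = {state: p for p, part in enumerate(partitions) for state in part}
    return sum(owner[src] != owner[dst] for src, dst in trellis_edges(n_states))


interleaved = global_transfers([{0, 1, 2, 4}, {3, 5, 6, 7}], 8)
contiguous = global_transfers([{0, 1, 2, 3}, {4, 5, 6, 7}], 8)
print(interleaved, contiguous)  # 4 8
```

A partitioning heuristic would use a count like this as the cost to minimize.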

20 Operation Partitioning Results
- Obtain solution by iterative bi-partitioning (Kernighan-Lin, KL).
- For 64+ partitions, >50% of data transfers are global.
- Largest absolute reduction: 4 to 32 partitions.
[Figure: global data transfers (%, 0 to 100) versus number of partitions (1 to 256), comparing our partitioning approach with C.-M. Wu et al., 2000.]

21 Q: Which operations are executed simultaneously? A: Operation packing for global bus minimization

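22 Operation Packing Example
Partitions {0, 1, 2, 4} and {3, 5, 6, 7} connected by global buses. A slice of operations holds one operation from each partition.
- Slices over time (0, 3), (1, 5), (2, 6), (4, 7): buses required per slice 0, 2, 2, 0, so 2 buses are required.
- Slices over time (0, 5), (1, 3), (4, 6), (2, 7): buses required per slice 1, 1, 1, 1, so 1 bus is required.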

23 Operation Packing
Packing procedure for global bus minimization:
- One operation from each partition in each slice.
- Global data transfers within a slice are done simultaneously.
- Bus cost: the number of ACS units connected.
Our heuristic: distribute global transfers evenly across all slices.
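
The effect of packing on the number of buses can be sketched on the 8-state example with partitions {0, 1, 2, 4} and {3, 5, 6, 7}. The model below is a simplifying assumption, not the cost function from the talk: each operation sends its result over a global bus in the slice where it is produced, the trellis follows the shift-register rule s to 2s mod N and 2s+1 mod N, and the bus requirement is the largest number of simultaneous global transfers in any slice.

```python
def trellis_edges(n_states):
    """Edges (source, destination) of one layer of a shift-register trellis."""
    return [(s, (2 * s + b) % n_states) for s in range(n_states) for b in (0, 1)]


def transfers_per_slice(partitions, slices, n_states):
    """Global transfers issued in each slice, one slice per time step."""
    owner = {state: p for p, part in enumerate(partitions) for state in part}
    sends = {s: 0 for s in range(n_states)}
    for src, dst in trellis_edges(n_states):
        if owner[src] != owner[dst]:
            sends[src] += 1  # this result must cross to the other partition
    return [sum(sends[op] for op in ops) for ops in slices]


partitions = [{0, 1, 2, 4}, {3, 5, 6, 7}]
uneven = transfers_per_slice(partitions, [(0, 3), (1, 5), (2, 6), (4, 7)], 8)
even = transfers_per_slice(partitions, [(0, 5), (1, 3), (4, 6), (2, 7)], 8)
print(uneven, max(uneven))  # [0, 2, 2, 0] 2
print(even, max(even))      # [1, 1, 1, 1] 1
```

Both packings move the same four results between partitions; spreading them evenly over the slices halves the peak, which is the idea behind the heuristic.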

24 Operation Packing Results
- Comparison solution: one bus between any two ACS processors.
- Global bus reduction: 31% on average.
- Most effective range: 8 to 32 partitions.
Partitions:             2    4    8   16   32   64  128
Bus cost (heuristic):   2    2    4    8   16   42  111
Bus cost (comparison):  2    3    7   15   31   63  127
Reduction (%):          0   33   43   47   48   33   13

25 Q: In what order should operations be executed? Q: Can ACS units be pipelined? A: Non-forwarding scheduling

26 Non-forwarding Scheduling
[Figure: timing of two consecutive trellis layers (layer n and layer n+1) for the partitions {0, 1, 2, 4} and {3, 5, 6, 7}, executed as the slices (0, 5), (1, 3), (4, 6), (2, 7).]

27 Non-forwarding Scheduling Results
Greedy heuristic:
- Pick the slice with the fewest dependencies first.
- Iteratively pick the next slice such that the upper bound on the non-forwarding pipeline depth derived from the chosen slices is maximized.
Architectures with 16 or more parallel processors allow only a very limited non-forwarding pipeline depth.
Partitions:          2    4    8   16   32   64  128
Max pipeline depth: 63   28    9    2    0    0    0
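
How the slice order limits the pipeline depth can be sketched under a simple timing model. The model is an assumption, not a definition from the talk: one slice is issued per cycle, every layer uses the same slice order, and a result issued in cycle t can be read by operations issued in cycle t + d or later, where d is the pipeline depth. The depth convention may differ from the one behind the table above, so the numbers are not directly comparable.

```python
def trellis_edges(n_states):
    """Edges (source, destination) of one layer of a shift-register trellis."""
    return [(s, (2 * s + b) % n_states) for s in range(n_states) for b in (0, 1)]


def max_nonforwarding_depth(slice_order, n_states):
    """Largest depth d for which no operation waits on an unfinished input.

    An operation in layer n+1 issues n_slices cycles after the operation in
    the same slice of layer n, so each edge allows a depth of at most
    n_slices + issue[dst] - issue[src].
    """
    n_slices = len(slice_order)
    issue = {op: t for t, ops in enumerate(slice_order) for op in ops}
    return min(n_slices + issue[dst] - issue[src]
               for src, dst in trellis_edges(n_states))


two_processors = max_nonforwarding_depth([(0, 5), (1, 3), (4, 6), (2, 7)], 8)
one_processor = max_nonforwarding_depth([(s,) for s in range(8)], 8)
print(two_processors, one_processor)  # 1 4
```

Under this model, adding processors shortens each layer and leaves less time between producing a value and consuming it, which agrees with the trend in the table.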

28 Q: How many ACS processors should be used?
[Figure: three copies of the partition-count axis (2, 4, 8, 16, 32, 64, 128), marking the ranges that give a large global data transfer reduction, a large global bus reduction, and a deep non-forwarding pipeline.]

29 Viterbi Decoder Architecture
[Figure: chip block diagram. An ACS subcore of 8 ACS processors (numbered 1 to 8) connected by 4 buses, together with global data control, a CPU interface, lock control, backtrace control, a backtrace RAM, a data buffer, address/data buses, and a program counter (pc).]

30 Processor Internal Architecture
- 16-bit datapath
- 8 pipeline stages
[Figure: processor blocks: instruction ROM, data fetch, overflow detect, ACS unit, register file, and branch metric unit, with code, lock, pc, and address/data bus connections and the backtrace RAM.]

31 Processor Level Power Reduction
- Combine precomputation and saturation arithmetic.
- If one or two operands overflow, the ACS is partially shut off.
- No significant degradation of the decoding performance.
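
One way to read this slide is as an ACS unit that checks its inputs for saturation before doing any arithmetic. The sketch below is an interpretation under stated assumptions, not the circuit from the talk: path metrics saturate at the top of the 16-bit range, a saturated metric counts as overflowed, and the third return value (the number of adders that had to run) is a hypothetical stand-in for the power saved.

```python
SATURATED = 0xFFFF  # largest 16-bit path metric; treated as "overflowed"


def saturating_add(metric, branch):
    """Add a branch metric without wrapping around the 16-bit range."""
    return min(metric + branch, SATURATED)


def acs(metric0, branch0, metric1, branch1):
    """Add-compare-select with an overflow precheck.

    Returns (new_metric, selected_input, adders_used).
    """
    over0 = metric0 >= SATURATED
    over1 = metric1 >= SATURATED
    if over0 and over1:
        return SATURATED, 0, 0  # both inputs overflowed: skip add and compare
    if over0:
        return saturating_add(metric1, branch1), 1, 1  # only input 1 is live
    if over1:
        return saturating_add(metric0, branch0), 0, 1  # only input 0 is live
    cand0 = saturating_add(metric0, branch0)
    cand1 = saturating_add(metric1, branch1)
    return (cand0, 0, 2) if cand0 <= cand1 else (cand1, 1, 2)


print(acs(0, 0, 0, 2))                  # (0, 0, 2)
print(acs(SATURATED, 1, 5, 2))          # (7, 1, 1)
print(acs(SATURATED, 1, SATURATED, 1))  # (65535, 0, 0)
```

The first call reproduces the ACS example from the trellis slides, Min{0+0, 0+2} = 0.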

32 Chip Implementation
- Design: RTL Verilog
- Synthesis: Design Analyzer
- Placement: manual floorplan
- Routing: Silicon Ensemble
- Verification: gate-level Verilog
- Power estimation: PrimePower

33 Chip Summary
Technology:      0.25 um
Metal layers:    5
Core size:       2.4 x 2.4 mm^2
Transistors:     325K
Core frequency:  640 MHz
Supply voltage:  2.5 V
Throughput:      20 Mbps
Power:           0.45 W
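
The chip figures can be cross-checked against the workload of a 256-state decoder with a few lines of arithmetic. The check assumes that each of the 8 processors completes one ACS operation per clock cycle, which the slides do not state explicitly.

```python
STATES = 256                  # ACS operations needed per decoded bit
THROUGHPUT_BPS = 20_000_000   # 20 Mbps
PROCESSORS = 8
CORE_HZ = 640_000_000         # 640 MHz

acs_needed_per_s = STATES * THROUGHPUT_BPS
acs_available_per_s = PROCESSORS * CORE_HZ  # assumes 1 ACS per processor per cycle
cycles_per_bit = CORE_HZ // THROUGHPUT_BPS
states_per_processor = STATES // PROCESSORS

print(acs_needed_per_s, acs_available_per_s)  # 5120000000 5120000000
print(cycles_per_bit, states_per_processor)   # 32 32
```

Under that assumption the two rates match at 5.12 billion ACS operations per second, with each processor handling 32 states in the 32 cycles available per decoded bit.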

34 Conclusion
Design case study of a 256-state IS95 VD.
Hierarchical optimization methodology:
- Global data transfer minimization
- Global bus reduction
- Non-forwarding scheduling
- Precomputation and saturation arithmetic
Viterbi decoder:
- 8 pipelined processors
- 4 global buses
- Throughput 20 Mbps
- Power dissipation 450 mW

