Design of a High-Throughput Low-Power IS95 Viterbi Decoder. Xun Liu, Marios C. Papaefthymiou. Advanced Computer Architecture Laboratory, Electrical Engineering and Computer Science Department, University of Michigan.

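Digital Communication System [Block diagram: symbols enter an encoder that produces codes; the codes pass through an analog transmitter, a noisy analog channel, and an analog receiver, which together form the digital channel; a decoder turns the received codes back into symbols.]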

Convolutional Coding [Diagram: input symbols (010110110 …) enter a rate 1/n convolutional encoder with outputs O0 … On-1, which emits one n-bit code per input symbol into the digital channel.]

IS95 Convolutional Encoding. Used in the reverse link of the IS95 CDMA system. 256 states (8 state registers). Rate 1/3. Maximum free distance coding. [Diagram: input I, state registers S7 … S0, outputs O0, O1, O2.]
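
A shift-register encoder of this kind can be sketched in a few lines. This is an illustration, not the authors' implementation: the helper name `conv_encode` and the bit ordering are assumptions, and the generator polynomials 557, 663, 711 (octal) are the values usually quoted for the IS95 reverse link rather than something stated on the slide.

```python
# Generator polynomials (octal) usually quoted for the IS95 reverse link;
# they are not given on the slides, so treat them as an assumption.
IS95_GENERATORS = (0o557, 0o663, 0o711)
IS95_K = 9  # constraint length: 1 input bit + 8 state registers


def conv_encode(bits, generators=IS95_GENERATORS, k=IS95_K):
    """Rate 1/n feedforward convolutional encoder (hypothetical helper).

    Returns one n-bit tuple per input bit, with n = len(generators).
    """
    state = 0  # the k-1 most recent input bits, newest in the top bit
    codes = []
    for b in bits:
        window = (b << (k - 1)) | state  # input bit followed by the state registers
        # Each output is the parity (XOR) of the window bits tapped by one generator.
        codes.append(tuple(bin(window & g).count("1") & 1 for g in generators))
        state = window >> 1  # shift the new bit into the state registers
    return codes


# A single 1 followed by eight 0s walks the 1 through every register.
impulse = conv_encode([1] + [0] * 8)
print(impulse)
```

With 8 state registers the encoder has 2^8 = 256 states, and each input bit yields a 3-bit code, which is where the rate 1/3 comes from.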

Viterbi Decoding (VD). VD is optimal for convolutional codes: maximum likelihood decoding scheme; minimum error for the additive white Gaussian noise channel. VD procedure: construction of a complex graph called the trellis; computation of the shortest path.

IS95 VD Trellis [Figure: trellis with 256 nodes per layer and one layer per symbol (1, 2, 3, 4, …).]

Challenge of Large-State VD Designs. High computational complexity: VDs with hundreds of states require multiple Gops of throughput when symbol transfer rates reach Mbps. Parallel processing. High interconnect power dissipation: complex routing among the processors. For large-state VDs, global data transfer and interconnect issues must be considered carefully.

Viterbi Decoder Designs [Scatter plot: throughput (Mbps, log scale from 10^-2 to 10^4) versus number of states (2 to 512) for published Viterbi decoder designs, each labeled with its power dissipation (from 0.24 mW to 3 W). Our design: 20 Mbps, 0.45 W.]

Presentation Outline. Viterbi decoding overview. Our contributions: data-transfer-oriented hierarchical inter-processor optimization; intra-processor power optimization. Chip data.

Encoding Example. Rate 1/2, 3-bit state register, 8 states. Inputs: …, 0, 0, 1, 0. Outputs: …, 01, 11, 11, 00. [Diagram: input I feeds a 3-bit shift register holding 0 0 0, with outputs O0 and O1.]

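Viterbi Decoding [Trellis figure: states 000 to 111; each edge is labeled with its transition output and an edge weight. With code 00 received, an edge with transition output 00 has weight 0 and an edge with transition output 11 has weight 2.]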

Viterbi Decoding [Trellis figure: states 000 to 111, codes received 00, 11, 01. Each vertex carries a vertex weight computed by an add-compare-select (ACS) operation, for example 0 = Min{0+0, 0+2}.]
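
The edge weight and ACS step shown in the figure can be written out as a small sketch. The function names are hypothetical, and the edge weight is assumed to be the Hamming distance between the transition output and the received code, which matches the weights 0 and 2 in the example.

```python
def edge_weight(transition_output, received):
    """Hamming distance between two codes given as bit strings, e.g. '11' vs '00'."""
    return sum(a != b for a, b in zip(transition_output, received))


def acs(candidates):
    """Add-compare-select over (predecessor_weight, edge_weight) pairs.

    Returns (new_vertex_weight, index_of_surviving_predecessor).
    """
    sums = [vw + ew for vw, ew in candidates]            # add
    best = min(range(len(sums)), key=sums.__getitem__)   # compare
    return sums[best], best                              # select


received = "00"
w_same = edge_weight("00", received)  # edge whose transition output is 00
w_diff = edge_weight("11", received)  # edge whose transition output is 11
weight, survivor = acs([(0, w_same), (0, w_diff)])  # Min{0+0, 0+2}
print(w_same, w_diff, weight, survivor)
```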

Viterbi Decoding [Trellis figure, continued: vertex weights for all states 000 to 111 after codes 00, 11, 01 have been received.]

VD Summary Each decoded symbol requires a layer of similar computations: 2N edge weight computations (N = # of states). N add-compare-select (ACS) operations. Operations within each layer are independent.
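
The layer structure summarized above can be sketched as a small hard-decision decoder. Everything here is an assumption made for illustration: the 8-state rate 1/2 code with taps 15 and 17 (octal), the helper names, and the convention that the message ends with K-1 zero bits so that the backtrace can start from state 0. It is not the decoder built in this work.

```python
K = 4                      # constraint length: 3-bit state, 8 states
GENERATORS = (0o15, 0o17)  # hypothetical taps; the slides do not give them
N_STATES = 1 << (K - 1)
INF = float("inf")


def step(state, bit):
    """Return (next_state, output_code) for one encoder transition."""
    window = (bit << (K - 1)) | state
    code = tuple(bin(window & g).count("1") & 1 for g in GENERATORS)
    return window >> 1, code


def encode(bits):
    state, codes = 0, []
    for b in bits:
        state, code = step(state, b)
        codes.append(code)
    return codes


def viterbi_decode(codes):
    """Decode a sequence of received codes, assuming a zero-terminated message."""
    weights = [0] + [INF] * (N_STATES - 1)  # the encoder starts in state 0
    history = []                            # per layer: (previous state, bit) per state
    for received in codes:                  # one trellis layer per symbol
        new_weights = [INF] * N_STATES
        choice = [None] * N_STATES
        for s in range(N_STATES):
            for bit in (0, 1):              # 2N edge weights per layer
                nxt, code = step(s, bit)
                edge = sum(a != b for a, b in zip(code, received))
                cand = weights[s] + edge    # add
                if cand < new_weights[nxt]:     # compare
                    new_weights[nxt] = cand     # select
                    choice[nxt] = (s, bit)
        weights = new_weights
        history.append(choice)
    # Backtrace from state 0, where the zero tail leaves the encoder.
    state, bits = 0, []
    for choice in reversed(history):
        state, bit = choice[state]
        bits.append(bit)
    return bits[::-1]


message = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0] + [0] * (K - 1)
noisy = encode(message)
noisy[5] = (noisy[5][0] ^ 1, noisy[5][1])  # flip one received bit
print(viterbi_decode(noisy) == message)
```

The inner loops make the independence visible: within a layer, each ACS update reads only the previous layer's weights, so the N updates could run on separate processors.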

Viterbi Decoder Architectures. Design space: number of processors used. Sequential architecture: one ACS processor; low design complexity; very low throughput. Parallel architecture: one ACS processor per state; very complex routing problem; high power dissipation due to long interconnects.

Viterbi Decoder Architectures. Design space: number of processors used. Intermediate solutions: several ACS processors connected by buses. [Diagram: four ACS processors attached to shared buses.]

Key Issues. How many ACS processors? Which ACS operations are executed in each processor? (Operation partitioning.) Which ACS operations can be executed concurrently? (Operation packing.) In what order are the operations executed? Can processors be pipelined? (Non-forwarding scheduling.)

Q: Which operations are executed in each ACS processor? A: Operation partitioning for global data transfer reduction

Operation Partitioning Example [Figure: one layer of an 8-state trellis split between two processors, ACS1 and ACS2. Partition {0, 1, 2, 4} | {3, 5, 6, 7} needs 4 global transfers; partition {0, 1, 2, 3} | {4, 5, 6, 7} needs 8.]
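
The two transfer counts in this example can be reproduced with a short sketch. It assumes a shift-register trellis in which state t is reached from states 2t and 2t+1 (mod 8); the slides do not spell out the trellis wiring, but this choice gives the same counts. The exhaustive search at the end is only a stand-in that works for tiny trellises, not the partitioning method used in this work.

```python
from itertools import combinations

N_STATES = 8


def predecessors(state, n_states=N_STATES):
    """States whose weights feed the ACS operation of `state` in the next layer.

    Assumed wiring: state t is reached from states 2t and 2t+1 (mod n_states).
    """
    return ((2 * state) % n_states, (2 * state + 1) % n_states)


def global_transfers(partitions, n_states=N_STATES):
    """Count predecessor weights that must cross a partition boundary."""
    owner = {s: i for i, part in enumerate(partitions) for s in part}
    return sum(owner[p] != owner[t]
               for t in range(n_states) for p in predecessors(t, n_states))


good = global_transfers([{0, 1, 2, 4}, {3, 5, 6, 7}])
naive = global_transfers([{0, 1, 2, 3}, {4, 5, 6, 7}])

# Exhaustive search over balanced bi-partitions; only feasible for tiny trellises.
best = min(global_transfers([set(a), set(range(N_STATES)) - set(a)])
           for a in combinations(range(N_STATES), N_STATES // 2))
print(good, naive, best)
```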

Operation Partitioning Results. Solution obtained by iterative bi-partitioning (KL). For 64 or more partitions, more than 50% of data transfers are global. Largest absolute reduction: 4 to 32 partitions. [Plot: global data transfers (%) versus number of partitions (1 to 256), comparing our partitioning approach with C.-M. Wu et al., 2000.]

Q: Which operations are executed simultaneously? A: Operation packing for global bus minimization

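Operation Packing Example [Figure: partitions {0, 1, 2, 4} and {3, 5, 6, 7} connected by global buses; a slice of operations takes one operation from each partition. Packing the slices as (0,3), (1,5), (2,6), (4,7) requires 0, 2, 2, 0 buses over time, so 2 buses in total; packing them as (0,5), (1,3), (4,6), (2,7) requires 1 bus in every slice.]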

Operation Packing. Packing procedure for global bus minimization: one operation from each partition in each slice; global data transfers within a slice are done simultaneously. Bus cost: the number of ACS units connected. Our heuristic: distribute global transfers evenly across all slices.
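
The effect of spreading global transfers evenly can be checked on the 8-state example. This sketch assumes the same trellis wiring as before (state t is reached from states 2t and 2t+1 mod 8) and assumes that every global transfer in a slice occupies one bus; both are simplifications, and the helper names are hypothetical.

```python
PARTITIONS = [{0, 1, 2, 4}, {3, 5, 6, 7}]
OWNER = {s: i for i, part in enumerate(PARTITIONS) for s in part}


def predecessors(state, n_states=8):
    # Assumed wiring: state t is reached from states 2t and 2t+1 (mod n_states).
    return ((2 * state) % n_states, (2 * state + 1) % n_states)


def buses_per_slice(slices):
    """Global transfers needed by each slice, assuming one bus per transfer."""
    return [sum(OWNER[p] != OWNER[t] for t in ops for p in predecessors(t))
            for ops in slices]


uneven = buses_per_slice([(0, 3), (1, 5), (2, 6), (4, 7)])
even = buses_per_slice([(0, 5), (1, 3), (4, 6), (2, 7)])
print(uneven, "-> buses needed:", max(uneven))
print(even, "-> buses needed:", max(even))
```

Both packings move the same four weights per layer; only the peak per slice differs, and the peak is what determines how many buses must exist.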

Operation Packing Results. Comparison solution: one bus between any two ACS processors. Global bus reduction: 31% on average. Most effective range: 8 to 32 partitions.
Partitions              2    4    8   16   32   64  128
Bus cost (heuristic)    2    2    4    8   16   42  111
Bus cost (comparison)   2    3    7   15   31   63  127
Reduction (%)           0   33   43   47   48   33   13

Q: In what order should operations be executed? Q: Can ACS units be pipelined? A: Non-forwarding scheduling

Non-forwarding Scheduling [Figure: layers n and n+1 of the 8-state trellis with partitions {0, 1, 2, 4} and {3, 5, 6, 7}, laid out over time for two slice orders: (0,5), (1,3), (4,6), (2,7) and (1,3), (4,6), (2,7), (0,5).]

Non-forwarding Scheduling Results. Greedy heuristic: pick the slice with the fewest dependencies first, then iteratively pick the next slice such that the upper bound on the non-forwarding pipeline depth derived from the chosen slices is maximized. Architectures with 16 or more parallel processors allow very limited non-forwarding pipeline depth.
Partitions           2    4    8   16   32   64  128
Max pipeline depth  63   28    9    2    0    0    0
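
Why the slice order matters can be illustrated with a rough proxy. The sketch below measures, for a fixed slice order repeated in every layer, the smallest number of slices between the moment a weight is produced in layer n and the moment it is read in layer n+1; a pipeline without forwarding cannot be deeper than that gap allows. The exact definition of pipeline depth used in this work is not given on the slides, so the values below are not comparable to the table, and the greedy heuristic itself is not implemented here. The trellis wiring (state t reached from 2t and 2t+1 mod 8) is the same assumption as in the earlier examples.

```python
def predecessors(state, n_states=8):
    # Assumed wiring: state t is reached from states 2t and 2t+1 (mod n_states).
    return ((2 * state) % n_states, (2 * state + 1) % n_states)


def min_slack(slice_order):
    """Smallest gap, in slices, between producing a weight in layer n and
    reading it in layer n+1 when every layer repeats the same slice order."""
    period = len(slice_order)
    slot = {op: i for i, ops in enumerate(slice_order) for op in ops}
    return min(period + slot[t] - slot[p]
               for t in slot for p in predecessors(t))


first = min_slack([(0, 5), (1, 3), (4, 6), (2, 7)])
second = min_slack([(1, 3), (4, 6), (2, 7), (0, 5)])
print(first, second)
```

Rotating the order so that slice (0,5) runs last doubles the smallest gap in this toy case, which is the kind of gain the scheduling step looks for.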

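Q: How many ACS processors should be used? [Figure: three criteria marked along the number of partitions (2, 4, 8, 16, 32, 64, 128): large global data transfer reduction, large global bus reduction, and deep non-forwarding pipeline.] The design presented here uses 8 processors.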

Viterbi Decoder Architecture [Block diagram: eight ACS subcores (1 to 8) connected by 4 global buses, together with global data control, CPU interface, lock control, backtrace control, backtrace RAM, data buffer, and program counter.]

Processor Internal Architecture. 16-bit datapath, 8 pipeline stages. [Block diagram: instruction ROM, data fetch, branch metric, ACS unit, overflow detect, register file, and backtrace RAM, with pc, code, lock, and address/data bus signals.]

Processor Level Power Reduction Combine precomputation and saturation arithmetic. If one or two operands overflow, ACS is partially shut off. No significant degradation of the decoding performance.
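
One plausible reading of this idea is sketched below; the slides give no circuit-level detail, so the function name, the flag logic, and the bookkeeping of which units are active are all assumptions. Overflow flags are checked before the ACS operation: if both incoming metrics are already saturated, nothing needs to be computed, and if only one is saturated, the other path wins without a comparison.

```python
SAT_MAX = (1 << 16) - 1  # largest value on a 16-bit datapath


def gated_acs(pm0, bm0, pm1, bm1):
    """ACS with saturation and a precomputed shut-off (illustrative only).

    pm*: incoming path metrics, bm*: branch metrics.
    Returns (new_metric, survivor_index, adders_used, comparator_used).
    """
    over0, over1 = pm0 >= SAT_MAX, pm1 >= SAT_MAX  # precomputed overflow flags
    if over0 and over1:
        # Both operands already saturated: the result is saturated too,
        # so neither the adders nor the comparator need to switch.
        return SAT_MAX, 0, 0, False
    if over0 or over1:
        # One operand saturated: the other path wins without a comparison.
        keep = 1 if over0 else 0
        pm, bm = (pm1, bm1) if over0 else (pm0, bm0)
        return min(pm + bm, SAT_MAX), keep, 1, False
    s0, s1 = min(pm0 + bm0, SAT_MAX), min(pm1 + bm1, SAT_MAX)
    return (s0, 0, 2, True) if s0 <= s1 else (s1, 1, 2, True)


print(gated_acs(0, 0, 0, 2))
print(gated_acs(SAT_MAX, 1, 5, 2))
print(gated_acs(SAT_MAX, 1, SAT_MAX, 0))
```

In this sketch the returned metric is always the same as that of a plain saturating ACS; only the amount of active logic changes.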

Chip Implementation. Design: RTL Verilog. Synthesis: Design Analyzer. Placement: manual floorplan. Routing: Silicon Ensemble. Verification: gate-level Verilog. Power estimation: PrimePower.

Chip Summary
Technology       0.25 µm
Metal layers     5
Core size        2.4 × 2.4 mm²
Transistors      325K
Core frequency   640 MHz
Supply voltage   2.5 V
Throughput       20 Mbps
Power            0.45 W
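
As a rough consistency check on these figures, assume that each of the 8 ACS processors issues one ACS operation per clock cycle (the slides do not state this explicitly). The 256 ACS operations per decoded bit then divide as

$$\frac{256\ \text{states}}{8\ \text{processors}} = 32\ \text{ACS operations per processor per bit}, \qquad \frac{640\ \text{MHz}}{32} = 20\ \text{Mbps}.$$

Under the same assumption the decoder performs $256 \times 20 \times 10^{6} \approx 5.1 \times 10^{9}$ ACS operations per second, in line with the multi-Gops requirement quoted for large-state decoders.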

Conclusion. Design case study of a 256-state IS95 VD. Hierarchical optimization methodology: global data transfer minimization; global bus reduction; non-forwarding scheduling; precomputation and saturation arithmetic. Viterbi decoder: 8 pipelined processors; 4 global buses; throughput 20 Mbps; power dissipation 450 mW.
