Design of a High-Throughput Low-Power IS95 Viterbi Decoder. Xun Liu, Marios C. Papaefthymiou. Advanced Computer Architecture Laboratory, Electrical Engineering and Computer Science Department, University of Michigan.
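Digital Communication System [Figure: block diagram. Labels: symbols, encoder, codes, analog transmitter, analog channel with noise, analog receiver, codes, decoder, symbols. The encoder sits in the transmitter and the decoder in the receiver; the analog transmitter, analog channel, and analog receiver together form the digital channel.]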
Convolutional Coding [Figure: symbols pass through the encoder, the digital channel, and the decoder. Detail of a rate 1/n convolutional encoder: input symbols enter a shift register, and outputs O0 … On-1 form the n-bit codes.]
IS95 Convolutional Encoding. Used in the reverse link of the IS95 CDMA system. 256 states (8 state registers). Rate 1/3. Maximum free distance coding. [Figure: encoder with input I, state registers S7 … S0, and outputs O0, O1, O2.]
Viterbi Decoding (VD). VD is optimal for convolutional codes: it is a maximum likelihood decoding scheme and gives minimum error on an additive white Gaussian noise channel. VD procedure: construct a graph called a trellis, then compute the shortest path through it.
IS95 VD Trellis [Figure: trellis with 256 nodes per layer; horizontal axis: # of symbols (1, 2, 3, 4, …).]
Challenges of Large-State VD Designs. High computational complexity: VDs with hundreds of states require multiple Gops of throughput once symbol transfer rates reach the Mbps range, which calls for parallel processing. High interconnect power dissipation: routing among the processors is complex. For large-state VDs, global data transfer and interconnect issues must be considered carefully.
Viterbi Decoder Designs [Figure: scatter plot of Viterbi decoder designs, number of states versus throughput (Mbps), each point labeled with its power dissipation (values from a few mW up to 3 W). Our design: 20 Mbps, 0.45 W.]
Presentation Outline. Viterbi decoding overview. Our contributions: data transfer oriented hierarchical inter-processor optimization; intra-processor power optimization. Chip data.
Encoding Example [Figure: rate ½ encoder with a 3-bit shift register (8 states), register contents 0 0 0, input I, outputs O0 and O1.] Inputs: …, 0, 0, 1, 0. Outputs: …, 01, 11, 11, 00.
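The encoding example can be sketched in a few lines. The slide text does not give the tap positions, so the generators below (octal 15 and 17, a common choice for a 3-register rate ½ code) are an assumption, as are the names `conv_encode` and `ASSUMED_GENERATORS`. With these taps the encoder emits 00, 11, 11, 01 for inputs 0, 1, 0, 0, which agrees with the slide if its sequences are read with the oldest symbol on the right.

```python
def conv_encode(bits, generators, memory):
    """Encode a bit sequence with a rate 1/n feedforward convolutional code.

    generators: one tap tuple per output, coefficients of 1, D, D^2, ...
    memory: number of state registers (3 registers give 8 states).
    """
    state = [0] * memory                      # state[0] is the most recent past input
    codes = []
    for b in bits:
        window = [b] + state                  # current input followed by register contents
        codes.append(tuple(
            sum(tap & x for tap, x in zip(g, window)) % 2   # XOR of the tapped positions
            for g in generators
        ))
        state = [b] + state[:-1]              # shift the new input into the register
    return codes


# Assumed taps, not given on the slide: 1 + D + D^3 and 1 + D + D^2 + D^3.
ASSUMED_GENERATORS = [(1, 1, 0, 1), (1, 1, 1, 1)]

codes = conv_encode([0, 1, 0, 0], ASSUMED_GENERATORS, memory=3)
print(codes)  # [(0, 0), (1, 1), (1, 1), (0, 1)]
```

The same function covers a rate 1/3, 8-register code by passing three 9-tap generators and `memory=8`.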
Viterbi Decoding [Figure: trellis with each edge labeled by its transition output and its edge weight for the received code 00.]
Viterbi Decoding [Figure: trellis with vertex weights for the received code.] Add-Compare-Select (ACS): vertex weight = Min{0+0, 0+2} = 0.
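The ACS step can be written as a recurrence. The symbols are introduced here for illustration and do not appear on the slides: W_t(i) is the vertex weight of state i at layer t, and w_t(i → j) is the weight of the edge from i to j for the code received at step t.

```latex
W_{t+1}(j) \;=\; \min_{i \,\in\, \mathrm{pred}(j)} \bigl[\, W_t(i) + w_t(i \to j) \,\bigr]
```

Each state of a rate 1/n code has two predecessors, so the minimum is taken over two sums, as in the slide's instance Min{0+0, 0+2} = 0.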
Viterbi Decoding [Figure: trellis annotated with the received codes.]
VD Summary Each decoded symbol requires a layer of similar computations: 2N edge weight computations (N = # of states). N add-compare-select (ACS) operations. Operations within each layer are independent.
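The layer structure summarized above can be illustrated with a small hard-decision decoder. This is a sketch under stated assumptions, not the chip's implementation: edge weights are taken to be Hamming distances, the generators `G` are hypothetical (1 + D + D^3 and 1 + D + D^2 + D^3 on a 3-register code), the encoder starts in state 0, and the message ends with tail zeros so that backtracing can start from state 0.

```python
INF = float("inf")


def step(state, bit, generators, memory):
    """Follow one trellis edge; return (next_state, output_bits)."""
    window = (state << 1) | bit               # bit 0 = current input, higher bits = older inputs
    output = tuple(bin(window & g).count("1") % 2 for g in generators)
    return window & ((1 << memory) - 1), output


def encode(bits, generators, memory):
    state, codes = 0, []
    for b in bits:
        state, out = step(state, b, generators, memory)
        codes.append(out)
    return codes


def viterbi_decode(received, generators, memory):
    n_states = 1 << memory
    weights = [0] + [INF] * (n_states - 1)    # only state 0 is reachable at the start
    history = []
    for code in received:                     # one layer per decoded symbol
        new_weights = [INF] * n_states
        survivors = [None] * n_states
        for state in range(n_states):         # 2N edge weights per layer
            for bit in (0, 1):
                nxt, output = step(state, bit, generators, memory)
                edge = sum(o != r for o, r in zip(output, code))   # Hamming edge weight
                candidate = weights[state] + edge                  # add
                if candidate < new_weights[nxt]:                   # compare
                    new_weights[nxt] = candidate                   # select
                    survivors[nxt] = (state, bit)
        weights = new_weights
        history.append(survivors)
    state, bits = 0, []                       # backtrace from state 0 (tail zeros assumed)
    for survivors in reversed(history):
        state, bit = survivors[state]
        bits.append(bit)
    return bits[::-1]


G = (0b1011, 0b1111)                          # hypothetical taps, bit i = coefficient of D^i
message = [1, 0, 1, 1, 0, 0, 0]               # the last three zeros are tail bits
received = encode(message, G, 3)
received[2] = (received[2][0] ^ 1, received[2][1])   # inject a single bit error
decoded = viterbi_decode(received, G, 3)
print(decoded == message)
```

The two inner loops touch every edge of a layer independently of the others, which is the property the parallel architectures in the following slides exploit.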
Viterbi Decoder Architectures Design space: number of processors used Sequential architecture One ACS processor Low design complexity Very low throughput Parallel architecture One ACS processor per state Very complex routing problem High power dissipation due to long interconnects
Viterbi Decoder Architectures. Design space: number of processors used. Intermediate solutions. [Figure: several ACS processors connected by buses.]
Key Issues. How many ACS processors? Which ACS operations are executed in each processor? (Operation partitioning.) Which ACS operations can be executed concurrently? (Operation packing.) In what order are the operations executed, and can the processors be pipelined? (Non-forwarding scheduling.)
Q: Which operations are executed in each ACS processor? A: Operation partitioning for global data transfer reduction
Operation Partitioning Example [Figure: example assignment of operations to ACS processors, annotated with the resulting data transfers.]
Operation Partitioning Results. Solution obtained by iterative bi-partitioning (KL). For 64 or more partitions, more than 50% of data transfers are global. Largest absolute reduction: 4 to 32 partitions. [Figure: global data transfers (%) versus number of partitions, comparing our partitioning approach with C.-M. Wu et al., 2000.]
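The quantity that partitioning tries to reduce can be sketched as follows. This is not the KL-based partitioner itself: it only evaluates a given assignment, under the assumption that a transfer is global whenever a state and its trellis successor are handled by different processors. The function names and the naive "top bits" assignment are hypothetical and serve only as a baseline.

```python
def global_transfer_fraction(partition_of, n_states):
    """Fraction of trellis edges whose two endpoints lie in different partitions."""
    total = crossing = 0
    for state in range(n_states):
        for bit in (0, 1):
            nxt = ((state << 1) | bit) % n_states    # shift-register successor
            total += 1
            crossing += partition_of(state) != partition_of(nxt)
    return crossing / total


N_STATES = 256


def by_top_bits(k):
    """Naive assignment: group states by their k most significant bits (2**k partitions)."""
    return lambda s: s >> (8 - k)


for k in range(1, 7):
    print(2 ** k, global_transfer_fraction(by_top_bits(k), N_STATES))
```

Any candidate partition, including one produced by an iterative bi-partitioning pass, can be scored by passing a different `partition_of` function.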
Q: Which operations are executed simultaneously? A: Operation packing for global bus minimization
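Operation Packing Example [Figure: operations 1 to 7 packed into time slices in two alternative ways; for each packing, the number of global buses required is plotted over time. Labels: a slice of operations, global buses.]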
Operation Packing. Packing procedure for global bus minimization: each slice contains one operation from each partition; global data transfers within a slice are done simultaneously; bus cost is the number of ACS units connected. Our heuristic: distribute global transfers evenly across all slices.
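One way to spread global transfers evenly is a simple greedy balancer, sketched below. It is not necessarily the authors' heuristic: it assumes each operation is described only by its number of global transfers, that all partitions hold the same number of operations, and that the per-slice transfer count is a reasonable stand-in for the number of buses needed. The name `pack_slices` and the input layout are hypothetical.

```python
def pack_slices(partitions):
    """Pack one operation from each partition into every slice.

    partitions[p][i] is the number of global transfers of operation i in partition p.
    Returns (slices, loads): slices[s] lists (partition, operation) pairs,
    loads[s] is the total number of global transfers in slice s.
    """
    n_slices = len(partitions[0])
    slices = [[] for _ in range(n_slices)]
    loads = [0] * n_slices
    for p, ops in enumerate(partitions):
        heavy_first = sorted(range(len(ops)), key=lambda i: -ops[i])
        light_first = sorted(range(n_slices), key=lambda s: loads[s])
        # Pair the heaviest remaining operation with the currently lightest slice.
        for op, s in zip(heavy_first, light_first):
            slices[s].append((p, op))
            loads[s] += ops[op]
    return slices, loads


example = [[2, 0, 1], [2, 1, 0]]      # two partitions, three operations each
slices, loads = pack_slices(example)
print(loads)                          # [2, 2, 2]
```

Packing the same operations in their listed order would give slice loads of 4, 1, and 1, so the peak number of simultaneous global transfers drops from 4 to 2 in this toy case.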
Operation Packing Results. Comparison solution: one bus between any two ACS processors. Global bus reduction: 31% on average. Most effective range: 8 to 32 partitions. [Table columns: partitions, bus cost (heuristic), bus cost (comparison), reduction (%).]
Q: In what order should operations be executed? Q: Can ACS units be pipelined? A: Non-forwarding scheduling
Non-forwarding Scheduling [Figure: operations of layer n and layer n+1 placed along the time axis under two alternative schedules.]
Non-forwarding Scheduling Results. Greedy heuristic: pick the slice with the fewest dependencies first, then iteratively pick the next slice such that the upper bound on the non-forwarding pipeline depth derived from the chosen slices is maximized. Architectures with 16 or more parallel processors allow only a very limited non-forwarding pipeline depth. [Figure: maximum pipeline depth versus number of partitions.]
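The pipeline depth bound can be illustrated with a simplified timing model, which is an assumption made here and not taken from the slides: one processor, one operation per slice, the same slice order in every layer, and a result that becomes usable `depth` cycles after its operation is issued. Without forwarding, an operation in layer n+1 can only be issued once both of its layer n inputs are complete, which bounds the depth by the smallest issue-time gap over all trellis edges.

```python
from itertools import permutations


def max_nonforwarding_depth(order, n_states):
    """Largest pipeline depth for which every input is ready when its consumer issues.

    order[i] is the state whose ACS operation is issued in slice i of every layer.
    """
    n_slices = len(order)
    position = {state: i for i, state in enumerate(order)}
    gaps = []
    for state in range(n_states):
        for bit in (0, 1):
            nxt = ((state << 1) | bit) % n_states
            # The producer issues at position[state] in layer n,
            # the consumer at n_slices + position[nxt] in layer n+1.
            gaps.append(n_slices + position[nxt] - position[state])
    return min(gaps)


identity_depth = max_nonforwarding_depth([0, 1, 2, 3], 4)
best_depth = max(max_nonforwarding_depth(p, 4) for p in permutations(range(4)))
print(identity_depth, best_depth)
```

The exhaustive search over orders is only feasible for a toy trellis; for 256 states a greedy slice-by-slice choice such as the one described on the slide is needed.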
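Q: How many ACS processors should be used? Criteria: large global data transfer reduction, large global bus reduction, deep non-forwarding pipeline. The ranges reported for the three criteria overlap at 8 processors, the configuration used in this design.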
Viterbi Decoder Architecture [Figure: block diagram. Blocks: ACS subcore (ACS processors connected by buses), global data control, A/D, data buffer (code), CPU interface, lock control, backtrace control, backtrace RAM, program counter (pc).]
Processor Internal Architecture. 16-bit datapath, 8 pipeline stages. [Figure: block diagram. Blocks: instruction ROM, data fetch, branch metric, ACS unit, overflow detect, register file. Signals: code, lock, pc, address/data buses, backtrace RAM.]
Processor Level Power Reduction Combine precomputation and saturation arithmetic. If one or two operands overflow, ACS is partially shut off. No significant degradation of the decoding performance.
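A behavioural sketch of this idea is given below. The slide does not describe the circuit, so this is one interpretation under stated assumptions: path metrics are unsigned 16-bit values that saturate at `SAT`, an operand counts as overflowed once it has reached `SAT`, and the third return value is a hypothetical count of active add and compare steps standing in for switching activity.

```python
SAT = (1 << 16) - 1          # assumed saturation value for a 16-bit datapath


def sat_add(a, b):
    """Saturating addition: clamp the sum at SAT instead of wrapping around."""
    return min(a + b, SAT)


def acs_precompute(w0, e0, w1, e1):
    """Add-compare-select with an overflow check performed before the main datapath.

    Returns (vertex_weight, selected_branch, active_steps).
    """
    over0, over1 = w0 == SAT, w1 == SAT
    if over0 and over1:
        return SAT, 0, 0                        # both inputs saturated: datapath stays idle
    if over0:
        return sat_add(w1, e1), 1, 1            # one add, compare skipped
    if over1:
        return sat_add(w0, e0), 0, 1            # one add, compare skipped
    c0, c1 = sat_add(w0, e0), sat_add(w1, e1)   # two adds and one compare
    return (c0, 0, 3) if c0 <= c1 else (c1, 1, 3)


print(acs_precompute(10, 2, SAT, 0))            # (12, 0, 1)
```

In every case the returned weight equals the plain saturating minimum of the two candidate sums, so under this model the shortcut changes the amount of activity but not the vertex weight.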
Chip Implementation. Design: RTL Verilog. Synthesis: Design Analyzer. Placement: manual floorplan. Routing: Silicon Ensemble. Verification: gate-level Verilog. Power estimation: PrimePower.
Chip Summary. Technology: 0.25 µm. Metal layers: 5. Core size: 2.4 × 2.4 mm². Transistors: 325K. Core frequency: 640 MHz. Supply voltage: 2.5 V. Throughput: 20 Mbps. Power: 0.45 W.
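The chip figures can be cross-checked with a little arithmetic. The sketch assumes that the 20 Mbps throughput counts decoded bits, that each decoded bit needs one trellis layer of 256 ACS operations, and that the work is split evenly over the 8 processors named in the conclusion. The variable names are illustrative.

```python
core_frequency_hz = 640_000_000
throughput_bps = 20_000_000
power_w = 0.45
n_states = 256
n_processors = 8

cycles_per_bit = core_frequency_hz // throughput_bps      # clock cycles per decoded bit
acs_per_processor_per_bit = n_states // n_processors      # ACS operations each processor owns
acs_per_second = n_states * throughput_bps                # aggregate ACS rate
energy_per_bit_nj = power_w / throughput_bps * 1e9        # energy per decoded bit, in nJ

print(cycles_per_bit, acs_per_processor_per_bit)          # 32 32
print(acs_per_second / 1e9, round(energy_per_bit_nj, 2))  # 5.12 22.5
```

Under these assumptions each processor has exactly one cycle per ACS operation, and the aggregate rate of 5.12 billion ACS operations per second is in line with the multi-Gops requirement stated earlier in the talk.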
Conclusion. Design case study of a 256-state IS95 VD. Hierarchical optimization methodology: global data transfer minimization, global bus reduction, non-forwarding scheduling, precomputation and saturation arithmetic. Viterbi decoder: 8 pipelined processors, 4 global buses, throughput 20 Mbps, power dissipation 450 mW.