Published by Reynold Mathews. Modified over 8 years ago.
1
Design of a High-Throughput Low-Power IS95 Viterbi Decoder
Xun Liu, Marios C. Papaefthymiou
Advanced Computer Architecture Laboratory, Electrical Engineering and Computer Science Department, University of Michigan
2
Digital Communication System
[Figure: block diagram. Transmitter: symbols, encoder, codes, analog transmitter. Analog channel with noise. Receiver: analog receiver, codes, decoder, symbols. Labels: Digital Channel, Analog Channel.]
3
Convolutional Coding
[Figure: the digital channel block diagram with the encoder expanded. Input symbols (010110110 …) enter a rate 1/n convolutional encoder with outputs O0 through On-1, producing n-bit codes.]
4
IS95 Convolutional Encoding
- Used in the reverse link of the IS95 CDMA system
- 256 states (8 state registers)
- Rate 1/3
- Maximum free distance coding
[Figure: shift-register encoder with input I, state registers S7 through S0, and outputs O0, O1, O2.]
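A minimal Python sketch of a rate 1/3 encoder with 8 state registers, for illustration only. The tap connections were part of the slide's figure and are not recoverable from the text, so the generator polynomials below (557, 663, 711 octal, the values commonly cited for the IS-95 reverse link) are an assumption, as are the function and parameter names.

```python
def conv_encode(bits, generators=(0o557, 0o663, 0o711), constraint_length=9):
    """Encode a bit sequence; returns one tuple of len(generators) bits per input bit.

    The generators are assumed values, not taken from the slides.
    """
    state_bits = constraint_length - 1        # 8 state registers -> 2**8 = 256 states
    state = 0
    codes = []
    for b in bits:
        reg = (b << state_bits) | state       # current input bit followed by the state
        codes.append(tuple(bin(reg & g).count("1") & 1 for g in generators))
        state = reg >> 1                      # shift: drop the oldest register bit
    return codes


codes = conv_encode([1, 0, 1, 1])
print(len(codes), len(codes[0]))              # 4 codes of 3 bits each
```

Each input bit produces three output bits, which is what makes the code rate 1/3.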
5
Viterbi Decoding (VD)
- VD is optimal for convolutional codes.
  - Maximum likelihood decoding scheme.
  - Minimum error on an additive white Gaussian noise channel.
- VD procedure:
  - Construct a graph called the trellis.
  - Compute the shortest path through it.
6
IS95 VD Trellis
[Figure: trellis with 256 nodes per layer; the number of layers equals the number of symbols (1, 2, 3, 4, …).]
7
Challenges of Large-State VD Designs
- High computational complexity.
  - VDs with hundreds of states require multiple Gops of throughput once symbol transfer rates reach the Mbps range.
  - This calls for parallel processing.
- High interconnect power dissipation.
  - Complex routing among the processors.
- For large-state VDs, global data transfer and interconnect issues must be considered carefully.
8
Viterbi Decoder Designs
[Figure: scatter plot of Viterbi decoder designs. Axes: number of states (2 to 512) and throughput in Mbps (log scale, 10^-2 to 10^4). Points are labeled with their power dissipation, from 0.24 mW to 3 W. Our design: 20 Mbps, 0.45 W.]
9
Presentation Outline
- Viterbi decoding overview
- Our contributions
  - Data transfer oriented hierarchical inter-processor optimization
  - Intra-processor power optimization
- Chip data
10
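Encoding Example
- Rate 1/2, 3 bit (8 states)
- Inputs: …, 0, 0, 1, 0
- Outputs: …, 01, 11, 11, 00
[Figure: shift-register encoder with input I, three state registers holding 0 0 0, and outputs O0 and O1.]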
11
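Viterbi Decoding
[Figure: one trellis layer over the states 000, 010, 100, 110, 001, 011, 101, 111. Edges are labeled with their transition output (e.g. 00, 11, 01) and, for the received code 00, with the resulting edge weight (e.g. 0, 2).]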
12
Viterbi Decoding
[Figure: trellis over the states 000, 010, 100, 110, 001, 011, 101, 111 for the received codes 00, 11, 01, showing edge weights and vertex weights (initially all 0). Each vertex weight is obtained by an add-compare-select (ACS) operation, e.g. 0 = Min{0+0, 0+2}.]
13
Viterbi Decoding
[Figure: the same trellis for the received codes 00, 11, 01, with edge weights and vertex weights.]
14
VD Summary
- Each decoded symbol requires a layer of similar computations:
  - 2N edge weight computations (N = # of states).
  - N add-compare-select (ACS) operations.
- Operations within each layer are independent.
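The layer of computations summarized above can be sketched in Python as follows. This is an illustrative sketch, not the decoder from the talk: the 8-state rate 1/2 code uses hypothetical generators (17 and 13 octal), the edge weight is taken to be the Hamming distance between the transition output and the received code, and the state update `(2*s + bit) mod N` is an assumed convention.

```python
def parity(x):
    return bin(x).count("1") & 1


def edge_output(state, bit, generators=(0o17, 0o13)):
    """Transition output for leaving `state` with input `bit` (generators are hypothetical)."""
    reg = (state << 1) | bit                  # 3 state bits followed by the new input bit
    return tuple(parity(reg & g) for g in generators)


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def acs_layer(metrics, received):
    """One trellis layer: 2N edge weights and N add-compare-select operations."""
    n = len(metrics)
    new_metrics = [None] * n
    survivors = [None] * n
    for s in range(n):                        # predecessor state
        for bit in (0, 1):
            t = ((s << 1) | bit) % n          # successor state
            candidate = metrics[s] + hamming(edge_output(s, bit), received)  # add
            if new_metrics[t] is None or candidate < new_metrics[t]:        # compare
                new_metrics[t] = candidate                                  # select
                survivors[t] = s
    return new_metrics, survivors


metrics, survivors = acs_layer([0] * 8, (0, 0))
print(metrics[0], survivors[0])               # state 0: min{0+0, 0+2} = 0, survivor 0
```

No iteration of the loop reads a value written by another iteration of the same layer, which is the independence that makes parallel ACS processors possible.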
15
Viterbi Decoder Architectures
Design space: number of processors used
- Sequential architecture
  - One ACS processor
  - Low design complexity
  - Very low throughput
- Parallel architecture
  - One ACS processor per state
  - Very complex routing problem
  - High power dissipation due to long interconnects
16
Viterbi Decoder Architectures
Design space: number of processors used
- Intermediate solutions
[Figure: several ACS processors connected by buses.]
17
Key Issues
- How many ACS processors?
- Which ACS operations are executed in each processor? (Operation partitioning)
- Which ACS operations can be executed concurrently? (Operation packing)
- In what order are the operations executed? Can processors be pipelined? (Non-forwarding scheduling)
18
Q: Which operations are executed in each ACS processor?
A: Operation partitioning for global data transfer reduction
19
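Operation Partitioning Example
[Figure: 8-state trellis layer with the operations split between two processors, ACS1 and ACS2. The partition {0, 1, 2, 4} / {3, 5, 6, 7} needs 4 transfers between the processors; the partition {0, 1, 2, 3} / {4, 5, 6, 7} needs 8.]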
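The transfer counts in this example can be reproduced with a short Python sketch. It assumes the 8-state trellis connects state s to states (2s) mod 8 and (2s+1) mod 8, and counts an edge as a global transfer when its two endpoints belong to different partitions; both the connection rule and the function names are assumptions made for illustration.

```python
def successors(state, n_states=8):
    """Assumed trellis connectivity: s -> (2s) mod N and (2s+1) mod N."""
    return [(2 * state + b) % n_states for b in (0, 1)]


def global_transfers(partitions, n_states=8):
    """Count trellis edges whose endpoints are assigned to different processors."""
    owner = {s: i for i, part in enumerate(partitions) for s in part}
    return sum(
        1
        for s in range(n_states)
        for t in successors(s, n_states)
        if owner[s] != owner[t]
    )


print(global_transfers([{0, 1, 2, 4}, {3, 5, 6, 7}]))  # 4
print(global_transfers([{0, 1, 2, 3}, {4, 5, 6, 7}]))  # 8
```

A partitioner such as the iterative bi-partitioning mentioned in the talk would search for the assignment that minimizes this count.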
20
Operation Partitioning Results
- Obtain solution by iterative bi-partitioning (Kernighan-Lin, KL).
- For 64+ partitions, >50% of data transfers are global.
- Largest absolute reduction: 4 to 32 partitions.
[Figure: global data transfers (%, 0 to 100) versus number of partitions (1 to 256), comparing our partitioning approach with C.-M. Wu et al., 2000.]
21
Q: Which operations are executed simultaneously?
A: Operation packing for global bus minimization
22
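Operation Packing Example
[Figure: partitions {0, 1, 2, 4} and {3, 5, 6, 7} connected by global buses. A slice of operations holds one operation from each partition. Over time, the slice sequence (0, 3), (1, 5), (2, 6), (4, 7) needs 0, 2, 2, 0 buses, so 2 buses are required; the sequence (0, 5), (1, 3), (4, 6), (2, 7) needs 1 bus in every slice.]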
23
Operation Packing
- Packing procedure for global bus minimization
  - One operation from each partition in each slice
  - Global data transfers within a slice done simultaneously
  - Bus cost: the number of ACS units connected
- Our heuristic
  - Distribute global transfers evenly in all slices
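The effect of spreading global transfers evenly can be illustrated with a small Python sketch on the 8-state example with partitions {0, 1, 2, 4} and {3, 5, 6, 7}. It rests on assumptions that are not stated on the slides: the result of operation s is consumed by operations (2s) mod 8 and (2s+1) mod 8, each global transfer in a slice occupies one bus, and the number of buses required is the maximum over all slices.

```python
def buses_per_slice(schedule, partitions, n_states=8):
    """Number of global transfers (assumed: one bus each) produced by every slice."""
    owner = {s: i for i, part in enumerate(partitions) for s in part}
    counts = []
    for slice_ops in schedule:
        counts.append(
            sum(
                1
                for s in slice_ops
                for t in ((2 * s) % n_states, (2 * s + 1) % n_states)
                if owner[t] != owner[s]
            )
        )
    return counts


partitions = [{0, 1, 2, 4}, {3, 5, 6, 7}]
uneven = [(0, 3), (1, 5), (2, 6), (4, 7)]
even = [(0, 5), (1, 3), (4, 6), (2, 7)]

print(buses_per_slice(uneven, partitions))  # [0, 2, 2, 0] -> 2 buses needed
print(buses_per_slice(even, partitions))    # [1, 1, 1, 1] -> 1 bus needed
```

Both schedules carry the same four global transfers; only their distribution over the slices differs.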
24
Operation Packing Results
- Comparison solution: one bus between any two ACS processors
- Global bus reduction: 31% on average
- Most effective range: 8 to 32 partitions

Partitions              2    4    8   16   32   64  128
Bus cost (heuristic)    2    2    4    8   16   42  111
Bus cost (comparison)   2    3    7   15   31   63  127
Reduction (%)           0   33   43   47   48   33   13
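As a consistency check on the table, the reduction row and the 31% average can be recomputed from the two bus-cost rows. The Python sketch below assumes the flattened table reads as seven columns (2 to 128 partitions) with the values shown in the lists.

```python
partitions = [2, 4, 8, 16, 32, 64, 128]
heuristic = [2, 2, 4, 8, 16, 42, 111]      # bus cost with the packing heuristic
comparison = [2, 3, 7, 15, 31, 63, 127]    # bus cost of the comparison solution

# Reduction in percent: (comparison - heuristic) / comparison
reduction = [100 * (c - h) / c for h, c in zip(heuristic, comparison)]
rounded = [round(r) for r in reduction]
average = sum(reduction) / len(reduction)

print(rounded)          # [0, 33, 43, 47, 48, 33, 13]
print(round(average))   # 31
```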
25
Q: In what order should operations be executed?
Q: Can ACS units be pipelined?
A: Non-forwarding scheduling
26
Non-forwarding Scheduling
[Figure: slices (0, 5), (1, 3), (4, 6), (2, 7) of trellis layers n and n+1 scheduled over time.]
27
Non-forwarding Scheduling Results
- Greedy heuristic:
  - Pick the slice with the fewest dependencies first.
  - Iteratively pick the next slice such that the upper bound on the non-forwarding pipeline depth derived from the chosen slices is maximized.
- Architectures with 16 or more parallel processors allow very limited non-forwarding pipeline depth.

Partitions            2    4    8   16   32   64  128
Max pipeline depth   63   28    9    2    0    0    0
28
Q: How many ACS processors should be used?
[Figure: three scales of partition counts (2, 4, 8, 16, 32, 64, 128) marking the ranges with large global data transfer reduction, large global bus reduction, and deep non-forwarding pipeline.]
29
Viterbi Decoder Architecture
[Figure: chip block diagram. Blocks: global data control, ACS subcore with eight ACS processors (1 to 8) connected by 4 buses, A/D buses, CPU interface, lock control, backtrace control, backtrace RAM, data buffer. Signals: code, program counter (pc).]
30
Processor Internal Architecture
- 16-bit datapath
- 8 pipeline stages
[Figure: processor block diagram. Blocks: instruction ROM, data fetch, overflow detect, ACS unit, register file, branch metric. Connections: code, lock, pc, address/data buses, backtrace RAM.]
31
Processor Level Power Reduction
- Combine precomputation and saturation arithmetic.
- If one or two operands overflow, the ACS is partially shut off.
- No significant degradation of the decoding performance.
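One possible reading of this idea is sketched below in Python. The slides do not describe the circuit, so everything here is an assumption: path metrics saturate at the 16-bit maximum, an overflow check acts as the precomputation, and the comparator is skipped whenever one or both candidates have saturated. The returned flag only models whether the comparator would have been active.

```python
SAT_MAX = (1 << 16) - 1   # assumed saturation value for a 16-bit datapath


def saturating_acs(pm0, bm0, pm1, bm1):
    """Add-compare-select with saturation.

    Returns (new_metric, decision, comparator_used). All behavior here is a
    hypothetical model, not the circuit from the talk.
    """
    s0, s1 = pm0 + bm0, pm1 + bm1
    ovf0, ovf1 = s0 >= SAT_MAX, s1 >= SAT_MAX   # precomputed overflow flags
    if ovf0 and ovf1:
        return SAT_MAX, 0, False                # both saturated: nothing to compare
    if ovf0:
        return s1, 1, False                     # only candidate 1 is valid
    if ovf1:
        return s0, 0, False                     # only candidate 0 is valid
    if s0 <= s1:
        return s0, 0, True                      # normal compare-select
    return s1, 1, True


print(saturating_acs(10, 1, 20, 2))             # (11, 0, True)
print(saturating_acs(SAT_MAX, 1, 5, 0))         # (5, 1, False)
```

In this model a saturated candidate can never win against an unsaturated one, so skipping the comparison does not change the selected path.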
32
Chip Implementation
- Design: RTL Verilog
- Synthesis: Design Analyzer
- Placement: manual floorplan
- Routing: Silicon Ensemble
- Verification: gate-level Verilog
- Power estimation: PrimePower
33
Chip Summary

Technology        0.25 um           Metal layers      5
Core size         2.4 x 2.4 mm^2    Transistors       325K
Core frequency    640 MHz           Supply voltage    2.5 V
Throughput        20 Mbps           Power             0.45 W
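The chip figures are mutually consistent under a simple cycle budget, sketched below in Python. The assumption, which the slides do not state, is that each of the 8 pipelined processors completes one ACS operation per clock cycle, so the 256 ACS operations of one trellis layer take 256 / 8 = 32 cycles.

```python
states = 256              # IS95 trellis states, i.e. ACS operations per decoded bit
processors = 8            # pipelined ACS processors
core_frequency_mhz = 640

# Assumed: one ACS operation per processor per clock cycle.
cycles_per_bit = states // processors
throughput_mbps = core_frequency_mhz / cycles_per_bit
acs_rate_gops = states * throughput_mbps / 1000

print(cycles_per_bit)     # 32
print(throughput_mbps)    # 20.0
print(acs_rate_gops)      # 5.12
```

The last value, about 5 billion ACS operations per second, is in line with the multiple-Gops requirement quoted earlier for large-state decoders.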
34
Conclusion
- Design case study of a 256-state IS95 VD
- Hierarchical optimization methodology
  - Global data transfer minimization
  - Global bus reduction
  - Non-forwarding scheduling
- Precomputation and saturation arithmetic
- Viterbi decoder
  - 8 pipelined processors
  - 4 global buses
  - Throughput 20 Mbps
  - Power dissipation 450 mW