
Design of a High-Throughput Low-Power IS95 Viterbi Decoder Xun Liu Marios C. Papaefthymiou Advanced Computer Architecture Laboratory Electrical Engineering and Computer Science Department University of Michigan

Digital Communication System
[Block diagram: transmitter (encoder: symbols → codes) → analog transmitter → analog channel (+ noise) → analog receiver → receiver (decoder: codes → symbols); the digital channel spans encoder to decoder]

Convolutional Coding
[Diagram: a rate-1/n convolutional encoder inside the digital channel; each input symbol produces an n-bit code (outputs O0 … On-1)]

IS95 Convolutional Encoding
- Used in the reverse link of the IS95 CDMA system
- 256 states (8 state registers, S7 … S0)
- Rate 1/3 (input I, outputs O0, O1, O2)
- Maximum free distance coding
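A software sketch of this kind of encoder: the shift register and tap parities map directly to code. The generator polynomials below (557, 663, 711 octal) are not stated on the slide; they are the values commonly cited for the IS-95 rate-1/3 reverse-link code and should be treated as an assumption here.

```python
def conv_encode(bits, gens_octal=(0o557, 0o663, 0o711), k=9):
    """Rate-1/n convolutional encoder with a (k-1)-bit state register."""
    state = 0
    out = []
    for b in bits:
        reg = (b << (k - 1)) | state                 # newest bit on top of the register
        for g in gens_octal:
            out.append(bin(reg & g).count("1") & 1)  # parity of the tapped bits
        state = reg >> 1                             # shift: drop the oldest bit
    return out
```

With k = 9 this gives 256 states, matching the slide; each input bit yields one 3-bit code.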

Viterbi Decoding (VD)
VD is optimal for convolutional codes:
- Maximum-likelihood decoding scheme
- Minimum error for the additive white Gaussian noise channel
VD procedure:
- Construction of a complex graph called the trellis
- Computation of the shortest path through it
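To make the procedure concrete, here is a minimal hard-decision Viterbi decoder sketch. It uses a small rate-½ code with the textbook (7, 5) octal generators rather than the 256-state IS95 code, purely to keep the example readable; the trellis construction and shortest-path (ACS) steps are the same.

```python
def viterbi_decode(received, gens=(0o7, 0o5), k=3):
    """Hard-decision Viterbi decoding of a rate-1/len(gens) code.

    Builds the trellis layer by layer, keeping the shortest path into
    each state (add-compare-select), then backtraces the survivor path.
    """
    n, n_states = len(gens), 1 << (k - 1)
    inf = float("inf")
    metrics = [0.0] + [inf] * (n_states - 1)   # start in the all-zero state
    history = []                               # per layer: (predecessor, input bit)
    for i in range(0, len(received), n):
        sym = received[i:i + n]
        new = [inf] * n_states
        back = [(0, 0)] * n_states
        for s in range(n_states):
            if metrics[s] == inf:
                continue
            for b in (0, 1):                   # hypothesised input bit
                reg = (b << (k - 1)) | s
                expect = [bin(reg & g).count("1") & 1 for g in gens]
                cand = metrics[s] + sum(e != r for e, r in zip(expect, sym))
                nxt = reg >> 1                 # successor state
                if cand < new[nxt]:            # add-compare-select
                    new[nxt], back[nxt] = cand, (s, b)
        metrics = new
        history.append(back)
    state = metrics.index(min(metrics))        # end at the best state
    bits = []
    for back in reversed(history):             # backtrace the survivor path
        state, b = back[state]
        bits.append(b)
    return bits[::-1]
```

For example, input bits 1, 0, 1, 1 encode to 11 10 00 01 under these generators, and the sketch recovers the input even when one received bit is flipped.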

IS95 VD Trellis
[Diagram: 256 trellis nodes per column, one column per received symbol (symbols 1, 2, 3, 4, …)]

Challenge of Large-State VD Designs
High computational complexity:
- VDs with hundreds of states require multi-Gops throughput when symbol transfer rates reach the Mbps range
- Parallel processing is required
High interconnect power dissipation:
- Complex routing among the processors
For large-state VDs, global data transfer and interconnect issues must be considered carefully.

Viterbi Decoder Designs
[Scatter plot: published Viterbi decoder designs by number of states vs. throughput (Mbps), each annotated with power dissipation ranging from a few mW up to 3 W; our design: 20 Mbps at 0.45 W]

Presentation Outline
- Viterbi decoding overview
- Our contributions:
  - Data-transfer-oriented hierarchical inter-processor optimization
  - Intra-processor power optimization
- Chip data

Encoding Example
Rate ½ encoder, 3-bit register, 8 states (input I, outputs O0, O1)
Inputs: …, 0, 0, 1, 0
Outputs: …, 01, 11, 11, 00

Viterbi Decoding
[Trellis diagram: each transition is labeled with its output; comparing that output with the received code (00) gives the edge weight]

Viterbi Decoding
[Trellis diagram: each vertex weight is the minimum over its incoming paths, e.g. min{0+0, 0+2} = 0 — the add-compare-select (ACS) operation]

Viterbi Decoding
[Trellis diagram: the procedure repeats layer by layer for each received code]

VD Summary
Each decoded symbol requires a layer of similar computations:
- 2N edge-weight computations (N = number of states)
- N add-compare-select (ACS) operations
Operations within each layer are independent.
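The per-layer kernel summarized above can be sketched directly: the double loop below evaluates 2N edge weights and performs N ACS selections, and every (state, bit) edge is independent of the others, which is what makes the parallel architectures discussed next possible. The parameters are illustrative (a 4-state rate-½ code), not the IS95 ones.

```python
def acs_layer(metrics, received, gens=(0o7, 0o5), k=3):
    """One trellis layer for a rate-1/len(gens) code with 2**(k-1) states.

    metrics: current path metric per state; received: one n-bit symbol.
    Computes 2N edge weights and performs N add-compare-select (ACS)
    operations; each (state, bit) edge is independent of the others.
    """
    inf = float("inf")
    n_states = 1 << (k - 1)
    new = [inf] * n_states
    pred = [0] * n_states
    for s in range(n_states):                  # predecessor state
        if metrics[s] == inf:
            continue
        for b in (0, 1):                       # hypothesised input bit
            reg = (b << (k - 1)) | s
            expect = [bin(reg & g).count("1") & 1 for g in gens]
            weight = sum(e != r for e, r in zip(expect, received))
            cand = metrics[s] + weight         # add
            nxt = reg >> 1                     # successor state
            if cand < new[nxt]:                # compare-select
                new[nxt], pred[nxt] = cand, s
    return new, pred
```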

Viterbi Decoder Architectures
Design space: number of processors used
Sequential architecture (one ACS processor):
- Low design complexity
- Very low throughput
Parallel architecture (one ACS processor per state):
- Very complex routing problem
- High power dissipation due to long interconnects

Viterbi Decoder Architectures
Design space: number of processors used
Intermediate solutions: several ACS processors connected through shared buses
[Diagram: multiple ACS processors attached to common buses]

Key Issues
- How many ACS processors?
- Which ACS operations are executed in each processor? → operation partitioning
- Which ACS operations can be executed concurrently? → operation packing
- In what order are the operations executed? Can processors be pipelined? → non-forwarding scheduling

Q: Which operations are executed in each ACS processor? A: Operation partitioning for global data transfer reduction

Operation Partitioning Example
[Diagram: trellis operations split between two processors, ACS1 and ACS2; the choice of partition changes the number of inter-processor data transfers]

Operation Partitioning Results
- Solutions obtained by iterative bi-partitioning (Kernighan–Lin)
- For 64+ partitions, more than 50% of data transfers are global
- Largest absolute reduction: 4 to 32 partitions
[Plot: global data transfers (%) vs. number of partitions, comparing our partitioning approach with C.-M. Wu et al., 2000]
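The quantity being minimized can be sketched as follows. In a radix-2 trellis with the usual shift-register convention, state s feeds successors s//2 and s//2 + N/2 (the butterfly); given an assignment of states to ACS processors, every edge that crosses processors is a global transfer. The partitions used below are illustrative only.

```python
def global_transfers(n_states, partition):
    """Count trellis edges that cross ACS-processor partitions.

    partition: maps state index -> processor id. Each state s feeds
    successors s // 2 and s // 2 + n_states // 2 (the trellis butterfly).
    """
    cross = 0
    for s in range(n_states):
        for nxt in (s // 2, s // 2 + n_states // 2):
            if partition[s] != partition[nxt]:
                cross += 1
    return cross
```

For example, with 8 states, splitting into the halves {0..3} and {4..7} leaves 8 of the 16 trellis edges global, which is the kind of cost the bi-partitioning iteratively reduces.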

Q: Which operations are executed simultaneously? A: Operation packing for global bus minimization

Operation Packing Example
[Diagram: operations grouped into time slices; the number of global buses required equals the peak number of simultaneous global transfers in any slice, so different slicings need different bus counts]

Operation Packing
Packing procedure for global bus minimization:
- One operation from each partition in each slice
- Global data transfers within a slice are done simultaneously
- Bus cost: the number of ACS units connected
Our heuristic: distribute global transfers evenly across all slices.
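The even-distribution heuristic can be sketched as a greedy assignment: within each partition, operations with more global transfers are placed into the currently lighter slices, keeping the peak simultaneous global traffic, and hence the bus cost, low. The per-operation transfer counts below are illustrative, and this is a simplification of the packing procedure, not the authors' exact algorithm.

```python
def pack_slices(ops_per_partition):
    """Greedy packing: one operation from each partition per time slice.

    ops_per_partition: for each ACS partition, a list of per-operation
    global-transfer counts (one entry per slice to fill). Heavy operations
    are paired with lightly loaded slices to flatten the peak load.
    """
    n_slices = len(ops_per_partition[0])
    load = [0] * n_slices                      # global transfers per slice
    schedule = [[] for _ in range(n_slices)]
    for ops in ops_per_partition:
        order = sorted(range(n_slices), key=lambda t: load[t])   # lightest first
        for slot, op in zip(order, sorted(ops, reverse=True)):   # heaviest op -> lightest slice
            load[slot] += op
            schedule[slot].append(op)
    return schedule, max(load)
```

With two partitions whose operations cost (3, 1) transfers each, the greedy pairing yields a peak of 4 per slice instead of the 6 a naive same-order packing would produce.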

Operation Packing Results
- Comparison solution: one bus between any two ACS processors
- Global bus reduction: 31% on average
- Most effective range: 8 to 32 partitions
[Table: bus cost of our heuristic vs. the comparison solution, and reduction (%), by number of partitions]

Q: In what order should operations be executed? Q: Can ACS units be pipelined? A: Non-forwarding scheduling

Non-forwarding Scheduling
[Diagram: two orderings of layer-n and layer-(n+1) operations over time, illustrating how a non-forwarding schedule interleaves consecutive layers]

Non-forwarding Scheduling Results
Greedy heuristic:
- Pick the slice with the fewest dependencies first
- Iteratively pick the next slice so that the upper bound on the non-forwarding pipeline depth implied by the chosen slices is maximized
Architectures with 16 or more parallel processors allow only a very limited non-forwarding pipeline depth.
[Table: maximum pipeline depth by number of partitions]

Q: How many ACS processors should be used?
A: The count that simultaneously allows:
- Large global data transfer reduction
- Large global bus reduction
- A deep non-forwarding pipeline

Viterbi Decoder Architecture
[Block diagram: ACS subcore with multiple ACS units on global buses; global data control; A/D buses; CPU interface; lock control; backtrace control; backtrace RAM; data buffer; code input; program counter (pc)]

Processor Internal Architecture
16-bit datapath, 8 pipeline stages
[Block diagram: instruction ROM, data fetch, branch metric, ACS unit, overflow detect, register file; connections to the backtrace RAM and the address/data buses; code, lock, and pc signals]

Processor-Level Power Reduction
- Combine precomputation and saturation arithmetic
- If one or both operands have overflowed, the ACS unit is partially shut off
- No significant degradation of the decoding performance
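A behavioural model of combining precomputation with saturation arithmetic: when an operand is already saturated, the sum is known without running the adder, so that part of the ACS datapath can be shut off (modelled here by the early return). The 16-bit width follows the processor slide; the rest is a sketch, not the chip's actual logic.

```python
SAT_MAX = (1 << 16) - 1   # 16-bit saturating datapath, matching the processor slide

def sat_add(a, b):
    """Saturating add with overflow precomputation.

    If either operand has already saturated, the result is known to be
    SAT_MAX without evaluating the full adder, so the adder can be
    gated off for this operation (the early return below).
    """
    if a == SAT_MAX or b == SAT_MAX:      # precomputed: result must saturate
        return SAT_MAX
    s = a + b
    return SAT_MAX if s > SAT_MAX else s  # ordinary saturation on overflow
```

Because path metrics only matter relative to each other, clamping overflowed metrics at the ceiling changes little, which is why the slide reports no significant decoding degradation.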

Chip Implementation
- Design: RTL Verilog
- Synthesis: Design Analyzer
- Placement: manual floorplan
- Routing: Silicon Ensemble
- Verification: gate-level Verilog
- Power estimation: PrimePower

Chip Summary
Technology: 0.25 µm, 5 metal layers
Core size: 2.4 × 2.4 mm²
Transistors: 325K
Core frequency: 640 MHz
Supply voltage: 2.5 V
Throughput: 20 Mbps
Power: 0.45 W

Conclusion
Design case study of a 256-state IS95 VD
Hierarchical optimization methodology:
- Global data transfer minimization
- Global bus reduction
- Non-forwarding scheduling
- Precomputation and saturation arithmetic
Resulting Viterbi decoder:
- 8 pipelined processors
- 4 global buses
- Throughput: 20 Mbps
- Power dissipation: 450 mW