A Scalable Low-power Architecture For Software Radio

Slides:



Advertisements
Similar presentations
© 2004 Wayne Wolf Topics Task-level partitioning. Hardware/software partitioning.  Bus-based systems.
Advertisements

Lecture 4 Introduction to Digital Signal Processors (DSPs) Dr. Konstantinos Tatas.
University of Michigan Electrical Engineering and Computer Science 1 Libra: Tailoring SIMD Execution using Heterogeneous Hardware and Dynamic Configurability.
VLSI Communication SystemsRecap VLSI Communication Systems RECAP.
Electronics’2004, Sozopol, September 23 Design of Mixed Signal Circuits and Systems for Wireless Applications V. LANTSOV, Vladimir State University
Development of Parallel Simulator for Wireless WCDMA Network Hong Zhang Communication lab of HUT.
1 U NIVERSITY OF M ICHIGAN 11 1 SODA: A Low-power Architecture For Software Radio Author: Yuan Lin, Hyunseok Lee, Mark Woh, Yoav Harel, Scott Mahlke, Trevor.
A Parameterized Dataflow Language Extension for Embedded Streaming Systems Yuan Lin 1, Yoonseo Choi 1, Scott Mahlke 1, Trevor Mudge 1, Chaitali Chakrabarti.
University of Michigan Electrical Engineering and Computer Science 1 Polymorphic Pipeline Array: A Flexible Multicore Accelerator with Virtualized Execution.
11 1 Hierarchical Coarse-grained Stream Compilation for Software Defined Radio Yuan Lin, Manjunath Kudlur, Scott Mahlke, Trevor Mudge Advanced Computer.
Software Defined Radio – A High Performance Embedded Challenge Hyunseok Lee, Yuan Lin, Yoav Harel, Mark Woh, Scott Mahlke, Trevor Mudge, and 1 Krisztian.
Mahapatra-Texas A&M-Fall'001 cosynthesis Introduction to cosynthesis Rabi Mahapatra CPSC498.
A System Solution for High- Performance, Low Power SDR Yuan Lin 1, Hyunseok Lee 1, Yoav Harel 1, Mark Woh 1, Scott Mahlke 1, Trevor Mudge 1 and Krisztian.
1 SODA: A Low-power Architecture For Software Radio Yuan Lin 1, Hyunseok Lee 1, Mark Woh 1, Yoav Harel 1, Scott Mahlke 1, Trevor.
A Programmable Coprocessor Architecture for Wireless Applications Yuan Lin, Nadav Baron, Hyunseok Lee, Scott Mahlke, Trevor Mudge Advance Computer Architecture.
University of Michigan Electrical Engineering and Computer Science From SODA to Scotch: The Evolution of a Wireless Baseband Processor Mark Woh (University.
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science High Performance.
DSI Division of Integrated Systems Design Proven experience in: Applications: Integrated Systems for Multimedia Processing Goals Our group of engineers.
University of Michigan Electrical Engineering and Computer Science 1 Resource Recycling: Putting Idle Resources to Work on a Composable Accelerator Yongjun.
#7 1 Victor S. Frost Dan F. Servey Distinguished Professor Electrical Engineering and Computer Science University of Kansas 2335 Irving Hill Dr. Lawrence,
11 1 The Next Generation Challenge for Software Defined Radio Mark Woh 1, Sangwon Seo 1, Hyunseok Lee 1, Yuan Lin 1, Scott Mahlke 1, Trevor Mudge 1, Chaitali.
Embedded Computing From Theory to Practice November 2008 USTC Suzhou.
University of Michigan Electrical Engineering and Computer Science 1 Liquid SIMD: Abstracting SIMD Hardware Using Lightweight Dynamic Mapping Nathan Clark,
11 1 SPEX: A Programming Language for Software Defined Radio Yuan Lin, Robert Mullenix, Mark Woh, Scott Mahlke, Trevor Mudge, Alastair Reid 1, and Krisztián.
University of Michigan Electrical Engineering and Computer Science 1 Streamroller: Automatic Synthesis of Prescribed Throughput Accelerator Pipelines Manjunath.
1 Design and Implementation of Turbo Decoders for Software Defined Radio Yuan Lin 1, Scott Mahlke 1, Trevor Mudge 1, Chaitali.
1 Summary of SDR Analog radio systems are being replaced by digital radio systems for various radio applications. SDR technology aims to take advantage.
INPUT/OUTPUT ARCHITECTURE By Truc Truong. Input Devices Keyboard Keyboard Mouse Mouse Scanner Scanner CD-Rom CD-Rom Game Controller Game Controller.
Juanjo Noguera Xilinx Research Labs Dublin, Ireland Ahmed Al-Wattar Irwin O. Irwin O. Kennedy Alcatel-Lucent Dublin, Ireland.
1 Presenter: Ming-Shiun Yang Sah, A., Balakrishnan, M., Panda, P.R. Design, Automation & Test in Europe Conference & Exhibition, DATE ‘09. A Generic.
A Reconfigurable Processor Architecture and Software Development Environment for Embedded Systems Andrea Cappelli F. Campi, R.Guerrieri, A.Lodi, M.Toma,
Andrea Marongiu Luca Benini ETH Zurich Daniele Cesarini University of Bologna.
Low-Power Wireless Sensor Networks
Software Defined Radio
University of Michigan Electrical Engineering and Computer Science 1 Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-Thread Applications.
Architectures for mobile and wireless systems Ese 566 Report 1 Hui Zhang Preethi Karthik.
A Flexible Multi-Core Platform For Multi-Standard Video Applications Soo-Ik Chae Center for SoC Design Technology Seoul National University MPSoC 2009.
University of Michigan Electrical Engineering and Computer Science 1 SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures Yongjun.
ASIP Architecture for Future Wireless Systems: Flexibility and Customization Joseph Cavallaro and Predrag Radosavljevic Rice University Center for Multimedia.
High Performance Computing Processors Felix Noble Mirayma V. Rodriguez Agnes Velez Electric and Computer Engineer Department August 25, 2004.
RICE UNIVERSITY DSPs for 4G wireless systems Sridhar Rajagopal, Scott Rixner, Joseph R. Cavallaro and Behnaam Aazhang This work has been supported by Nokia,
TI DSPS FEST 1999 Implementation of Channel Estimation and Multiuser Detection Algorithms for W-CDMA on Digital Signal Processors Sridhar Rajagopal Gang.
Distributed computing using Projective Geometry: Decoding of Error correcting codes Nachiket Gajare, Hrishikesh Sharma and Prof. Sachin Patkar IIT Bombay.
RICE UNIVERSITY DSP architectures for wireless communications Sridhar Rajagopal Department of Electrical and Computer Engineering Rice University, Houston.
RICE UNIVERSITY “Joint” architecture & algorithm designs for baseband signal processing Sridhar Rajagopal and Joseph R. Cavallaro Rice Center for Multimedia.
Introduction of Low Density Parity Check Codes Mong-kai Ku.
XStream: Rapid Generation of Custom Processors for ASIC Designs Binu Mathew * ASIC: Application Specific Integrated Circuit.
System-level power analysis and estimation September 20, 2006 Chong-Min Kyung.
Axel Jantsch 1 Networks on Chip Axel Jantsch 1 Shashi Kumar 1, Juha-Pekka Soininen 2, Martti Forsell 2, Mikael Millberg 1, Johnny Öberg 1, Kari Tiensurjä.
Data Management for Decision Support Session-4 Prof. Bharat Bhasker.
RICE UNIVERSITY DSPs for future wireless systems Sridhar Rajagopal.
DSP Architectural Considerations for Optimal Baseband Processing Sridhar Rajagopal Scott Rixner Joseph R. Cavallaro Behnaam Aazhang Rice University, Houston,
Chair MPSoC MPSoC Programming Solution “ CoreManager” hardware unit for:  Dependency checking  Task scheduling  Local memory management of PEs  C programmable.
Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke
SCORES: A Scalable and Parametric Streams-Based Communication Architecture for Modular Reconfigurable Systems Abelardo Jara-Berrocal, Ann Gordon-Ross NSF.
VEAL: Virtualized Execution Accelerator for Loops Nate Clark 1, Amir Hormati 2, Scott Mahlke 2 1 Georgia Tech., 2 U. Michigan.
A 1.2V 26mW Configurable Multiuser Mobile MIMO-OFDM/-OFDMA Baseband Processor Motivations –Most are single user, SISO, downlink OFDM solutions –Training.
1 of 14 Lab 2: Formal verification with UPPAAL. 2 of 14 2 The gossiping persons There are n persons. All have one secret to tell, which is not known to.
Architectural Effects on DSP Algorithms and Optimizations Sajal Dogra Ritesh Rathore.
University of Michigan Electrical Engineering and Computer Science 1 Stream Compilation for Real-time Embedded Systems Yoonseo Choi, Yuan Lin, Nathan Chong.
RT-OPEX: Flexible Scheduling for Cloud-RAN Processing
A programmable communications processor for future wireless systems
Map-Scan Node Accelerator for Big-Data
Designing an LTE Baseband MPSoC With a Novel Multi-Core HW/SW Platform Concept Gerhard Fettweis, and Emil Matus, Torsten Limberg, Markus Winter, Reimund.
Introduction to cosynthesis Rabi Mahapatra CSCE617
CS294-1 Reading Aug 28, 2003 Jaein Jeong
A 100 µW, 16-Channel, Spike-Sorting ASIC with On-the-Fly Clustering
DSPs in emerging wireless systems
DSP Architectures for Future Wireless Base-Stations
Presentation transcript:

A Scalable Low-power Architecture For Software Radio Scott Mahlke Collaboration between: University of Michigan, Arizona State University, and ARM Ltd.

Anatomy of a Cellular Phone Let’s first look at the overall architecture of a typical 3G cellular phone. First we have the user IO interface (such as camera and speakers) connected to the application processor. The general purpose processor and DSPs are used for user applications and multi-media processing, such as mp3 audio and mpeg video. Second, we have the various wireless protocol communication chipsets connected to the application processor. Looking into a wireless protocol, it can be broken up into two components: analog processing and digital processing. The analog component, consists of only ASICs, receives signals through the antenna, and filter out the carrier frequencies. Digital processing is broken up into multiple layers. from the Transport layer down to Physical layer. Currently, all except the physical layer is implemented in software, running on a general purpose processors. The physcal layer, on the other hand, currently is implemented with a combination of DSPs and ASIC solutions. Therefore, Software Defined Radio is defined to be the use of software routines instead of ASICs for physical layer processing. Given SDR, multiple wireless protocols can therefore be combined into one, requiring only separate antennas and analog processers.

Software Defined Radio (SDR) Use software routines instead of ASICs for the physical layer operations of wireless communication system ASICs (PHY) Software Routines Programmable Hardware Rest of the talk Characteristics of SDR algorithms SODA architecture for power-efficient SDR Compilation challenges and approach

Advantages of SDR Lower costs Time to market Adaptability Platform longevity, higher volume SW has lower development costs Time to market Future protocols will have complex implementations Overlap testing/development cycles Adaptability Standards change over time Multi-mode operation Sharing hardware resources Hit the key points What are some of the advantages of SDR Perhaps the most obvious one is that it enables multi-mode operations, requiring less hardwares for a device to process multiple protocols. But the biggest advantage is probably the lower cost. Implementations in software over hardware has many inherent advantages: it is easier to develop and debug, allowing rapid prototyping, lowering the development cost, faster time-to-market, and enables user-end bug fixes. The share of hardware also increases the chip volumes, driving down the cost. And finally, software enables future wireless communication innovations. One of which is cognitive radio, where the radio has the ability to sense its enviornments, and adjust its transmission behaviors accordingly.

Why is SDR Challenging? Why is SDR challenging? In this graph, let’s compare the power efficiency of different types of processors and the requirements of SDR. the x-axis is the power in Watts, and the y-axis is the peak performance in Gops. General purpose processors, consuming between 10 to 60W, can achieve power efficiency around 1Mops/mw, with Pentium shown as an example Domain specific DSPs achieves higher power efficiencies. In here, we shown both the embedded DSPs and high-performance DSPs, both achieving around 10Mops/mw However, these current domain specific DSPs cannot meet the requirements of SDR, as shown here, which generally requires power efficiency around 100Mops/mW. More specifically, for current generation of 3G and wireless protocols, the physical layer processing has a power budget between 100mW to 500mW with computational requirements around 40Gops.

The Anatomy of Wireless Protocols 1. Filtering: suppress signals outside frequency band In order to design a domain specific processor for SDR, we need to first understand the domain itself. Therefore, we have first implemented and analyzed W-CDMA and 802.11a physical layer processing. We found that there are 4 groups of algorithms that are essential for most wireless protocol. In this slide, we show the diagram of W-CDMA as an example case study. Filter, which suppress signals outside of the frequency band Modulation, which transforms source information into signal waveform Channel Estimation, which estimate channel condition and performance synchronization Error Correction, which correct errors induced by noisy channel Different protocols use different algorithms for these functionalities. For example, in 802.11a, FFT is used for modulation. 2. Modulation: map source information onto signal waveforms 3. Channel Estimation: Estimate channel condition for transceivers 4. Error Correction: correct errors induced by noisy channel

W-CDMA Workload Profile Workload profile varies according to operation state One operation is equivalent to one RISC instruction Searcher, Turbo decoder, and LPF are dominant workloads

SDR Kernel Characteristics 8 to 16-bit precision Vector operations long vectors constant vector size Static data movement patterns Scalar operations Kernels Type of Computation Vector Width W-CDMA Filter Vector 64 Modulation 2560 Channel Est. 320 Error Correction Mixed 8 or 256 802.11a 33 Modulation (FFT) 16 We have talked about system-level design considerations, now we are going to talk about algorithm-level design considerations for a SODA PE. We first analyzed the algorithm behaviors of W-CDMA and 802.11a Majority of the algorithms operates on 8 to 16 bit precisions In addition, the majority of the computations are vector operations with long vector width of constant vector size Data movements usually consist of inter- and intra vector shuffling operations with static permutation patterns. An example is the FFT butterfly permutation pattern In addition to vector operations, there are also some scalar operations, as shown in the table for some of the error correction and channel estimation algorithms

SODA System Architecture for 3G 4 PEs static kernel mapping and scheduling SIMD+Scalar units 1 ARM GPP controller scalar algorithms and protocol controls This is overall diagram of the SODA system, targeted toward 3G It includes 4 PEs, each running at 400MHz, and 1 ARM general purpose processor Static kernel mapping and scheduling Each PE consists of a SIMD and Scalar units protocols consists of both parallelizable and scalar operations

SODA System Architecture for 3G 2-Level scratchpad memories 12KB Local scratchpad memory for stream queues 64KB global scratchpad memory for large buffers Low-throughput shared bus 200MHz 32-bit bus inter-PE communication using DMA This is overall diagram of the SODA system, targeted toward 3G It includes 4 PEs, each running at 400MHz, and 1 ARM general purpose processor Static kernel mapping and scheduling Each PE consists of a SIMD and Scalar units protocols consists of both parallelizable and scalar operations

SODA PE Architecture Based on these algorithm characteristics, we design the SODA PE architecture as shown in this diagram. It consists of three pipelines: SIMD pipeline for vector processing, scalar pipeline for scalar computations, and AGU pipeline for memory addressing 2 issue LIW with two local memories, one for scalar, and another for SIMD pipeline

SODA PE SIMD Pipeline In designing the SIMD pipeline, we start from a very simple 16-bit datapath with a 2 read/1 write port register file and a 16-bit ALU with multiplier.

SODA PE SIMD Pipeline The 16-bit datapath is replicated 32-times for a 32-wide SIMD execution. The datapath are connected together with an interconnect network

SODA PE SIMD Shuffle Network The interconnect network, SSN (SIMD shuffle network), consists of 4 standard shuffle patterns, including a shuffle exchange pattern, inverse shuffle exchange pattern, exchange only pattern, and a feedback path for iterative shuffling. Shuffle exchange Inverse shuffle exchange Exchange only Iterative feedback

SODA PE Scalar Pipeline The scalar pipeline is also 16-bit datapath, but without a multiplier. The scalar and SIMD pipeline are tightly coupled to quickly exchange data between the two pipelines. The exchange is done with the Vector-to-scalar and scalar-to-vector units.

Power Consumption at 180nm We have implemented our design in Verilog, and synthesized it in 180nm technology. The power consumption is broken into multiple architectural components, shown on the x-axis. The blue and yellow bars are for W-CDMA and 802.11a. The y-axis is the power consumption in mW. The biggest power consumer is the SIMD pipeline registers and SIMD register file. This is because SODA has a 32-wide SIMD pipeline, which requires larger number of pipeline registers. 802.11a consumes higher power than W-CDMA, because - Higher number of computations - Higher number of multiplications (FFT) Finally, most of W-CDMA operations are 8-bit computations, whereas 802.11a requires mostly 16-bit computations. SODA currently only support 16-bit computation mode, but if we allow 8-bit computations as well, then we can reduce W-CDMA power even further. 180nm  ~ 3 W, 26.6 mm2 90nm (est)  ~ 0.5 W, 6.7 mm2

SDR Compilation Strategy Two level application description System-level: Concurrent tasks extracted from “C + channels + attributes” Kernel-level: Data parallelism extracted from “C + vectors + Matlab operators” System compilation – Task level parallelism Generates tasks, schedules, communication stubs, DMA requests, timing assertions, synchronization, debug support Kernel compilation – Data level parallelism Lower virtual DLP to physical implementation

Stylized Automatic Parallelization void t1() { for(int i=0; i<N; ++i) { int x = spread(w[i])@P0; putchannel(X,x); }} void t2() { int y = scramb(getchan(X))@P1; putchannel(Y,y); }} void t3() { fir(getchannel(Y)) @ P2; }} main() { fork (t1); fork (t2); fork (t3); } 2. Decompose tasks void main() { for(int i=0; i<N; ++i) { int x = spread(w[i]) @P0; int y = scramb(x) @P1; fir(y) @P2; } 1. Task assignment void main() { for(int i=0; i<N; ++i) { int x = spread(w[i]); int y = scramb(x); fir(y); }

Kernel Level Compilation int main() { FIR<int8, 64, BUF_SIZE> fir65r; ... } 1. object declaration template<class T, TAPS, BSIZE> kernel FIR { vector<T, TAPS> z; vector<T, TAPS> coeff; void set_coeff(vector<T, TAPS> c) { coeff = c; } void run(channel<T, BSIZE> inbuf, channel<T, BSIZE> outbuf) { T in, out; for (i = 0; i < BSIZE; i++) { in = inbuf.pop(); z += coeff * in; out = z[0]; outbuf.push(out); z = (z(1:TAPS-1),0); } }; // in = inbuf.pop(); sld(sin, inbuf); agu_incr(inbuf, sizeof(int8)); // z += coeff * in; vdup(vin, sin); vmul(vt1, _fir65r_coeff_v0, vin); vmul(vt2, _fir65r_coeff_v1, vin); vadd(_fir65r_z_v0, _fir65_z_v0, vt1); vadd(_fir65r_z_v1, _fir65_z_v1, vt2); 2. operation translation

Final Thoughts 2G and 3G SDR solutions are achievable in 90nm 3.9G  4-10x more performance with mW power consumption Core technologies for future networks OFDM  64 – 2048 point FFT MIMO – use of multiple antennas for transmission/reception Low density parity check codes Key insight: SDR requires innovation across algorithm, software and hardware SDR platforms offer low-cost, longevity, and adaptability In conclusion, we find that it is possible to support 2G and 3G wireless protocols in software defined radio. During our implementation, we found that there are many optimization opportunities: algorithms can be modified to better take advantage of wide SIMD computations, software compilation techniques can be used to overlap more computations, and architecture and VLSI design optimizations are also possible. And finally, for future work, we are looking at idle mode operations, understanding 4G protocols and their architectural implications, an application-specific language, as well as a compiler for SODA SDR.