A Scalable Low-power Architecture For Software Radio


1 A Scalable Low-power Architecture For Software Radio
Scott Mahlke Collaboration between: University of Michigan, Arizona State University, and ARM Ltd.

2 Anatomy of a Cellular Phone
Let’s first look at the overall architecture of a typical 3G cellular phone. First, the user I/O interfaces (such as the camera and speakers) are connected to the application processor. The general-purpose processor and DSPs are used for user applications and multimedia processing, such as MP3 audio and MPEG video. Second, the various wireless protocol communication chipsets are connected to the application processor. A wireless protocol can be broken into two components: analog processing and digital processing. The analog component, which consists only of ASICs, receives signals through the antenna and filters out the carrier frequencies. Digital processing is broken into multiple layers, from the transport layer down to the physical layer. Currently, all layers except the physical layer are implemented in software running on general-purpose processors. The physical layer, on the other hand, is currently implemented with a combination of DSPs and ASICs. Software Defined Radio is therefore defined as the use of software routines instead of ASICs for physical layer processing. With SDR, multiple wireless protocols can be combined onto one platform, requiring only separate antennas and analog processors.

3 Software Defined Radio (SDR)
Use software routines instead of ASICs for the physical layer operations of a wireless communication system
Diagram labels: ASICs (PHY); Software Routines; Programmable Hardware
Rest of the talk
  Characteristics of SDR algorithms
  SODA architecture for power-efficient SDR
  Compilation challenges and approach

4 Advantages of SDR
Lower costs
  Platform longevity, higher volume
  SW has lower development costs
Time to market
  Future protocols will have complex implementations
  Overlap testing/development cycles
Adaptability
  Standards change over time
  Multi-mode operation
  Sharing hardware resources

What are some of the advantages of SDR? Perhaps the most obvious one is that it enables multi-mode operation, requiring less hardware for a device to process multiple protocols. But the biggest advantage is probably the lower cost. Implementing in software rather than hardware has many inherent advantages: it is easier to develop and debug, which allows rapid prototyping, lowers development cost, shortens time to market, and enables bug fixes at the user end. Sharing hardware also increases chip volumes, driving down cost. Finally, software enables future innovations in wireless communication. One of these is cognitive radio, where the radio can sense its environment and adjust its transmission behavior accordingly.

5 Why is SDR Challenging?
In this graph, let’s compare the power efficiency of different types of processors with the requirements of SDR. The x-axis is power in watts, and the y-axis is peak performance in Gops. General-purpose processors, consuming between 10 and 60 W, achieve a power efficiency of around 1 Mops/mW, with the Pentium shown as an example. Domain-specific DSPs achieve higher power efficiency. Here we show both embedded DSPs and high-performance DSPs, both achieving around 10 Mops/mW. However, these current domain-specific DSPs cannot meet the requirements of SDR, shown here, which generally requires a power efficiency of around 100 Mops/mW. More specifically, for the current generation of 3G and wireless protocols, physical layer processing has a power budget between 100 mW and 500 mW, with computational requirements of around 40 Gops.
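The numbers quoted on this slide can be checked with a little arithmetic. The sketch below is only an illustration in Python; the function names are hypothetical, and the inputs are the figures from the notes (40 Gops, a 100 to 500 mW budget, and roughly 1 and 10 Mops/mW for general-purpose processors and DSPs).

```python
def required_efficiency_mops_per_mw(gops: float, budget_mw: float) -> float:
    """Efficiency (Mops/mW) needed to deliver `gops` within `budget_mw`."""
    return gops * 1000.0 / budget_mw


def power_needed_mw(gops: float, efficiency_mops_per_mw: float) -> float:
    """Power (mW) a processor of the given efficiency would draw at `gops`."""
    return gops * 1000.0 / efficiency_mops_per_mw


WORKLOAD_GOPS = 40.0  # physical layer requirement quoted in the notes

# Efficiency needed at both ends of the 100-500 mW budget.
tight = required_efficiency_mops_per_mw(WORKLOAD_GOPS, 100.0)
loose = required_efficiency_mops_per_mw(WORKLOAD_GOPS, 500.0)

# Power a ~10 Mops/mW DSP or a ~1 Mops/mW general-purpose CPU would need.
dsp_mw = power_needed_mw(WORKLOAD_GOPS, 10.0)
gpp_mw = power_needed_mw(WORKLOAD_GOPS, 1.0)

print(f"required: {loose:.0f} to {tight:.0f} Mops/mW")
print(f"DSP: {dsp_mw / 1000:.0f} W, GPP: {gpp_mw / 1000:.0f} W")
```

Under these figures the budget implies 80 to 400 Mops/mW, which brackets the roughly 100 Mops/mW target, while a DSP-class design would draw on the order of watts for the same workload.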

6 The Anatomy of Wireless Protocols
1. Filtering: suppress signals outside the frequency band
2. Modulation: map source information onto signal waveforms
3. Channel Estimation: estimate channel condition for transceivers
4. Error Correction: correct errors induced by a noisy channel

In order to design a domain-specific processor for SDR, we first need to understand the domain itself. We therefore implemented and analyzed W-CDMA and 802.11a physical layer processing. We found four groups of algorithms that are essential for most wireless protocols. This slide shows the diagram of W-CDMA as an example case study. Filtering suppresses signals outside the frequency band; modulation transforms source information into signal waveforms; channel estimation estimates channel conditions and performs synchronization; error correction corrects errors induced by a noisy channel. Different protocols use different algorithms for these functions. For example, in 802.11a, FFT is used for modulation.

7 W-CDMA Workload Profile
Workload profile varies according to operation state
One operation is equivalent to one RISC instruction
Searcher, Turbo decoder, and LPF are dominant workloads

8 SDR Kernel Characteristics
8 to 16-bit precision
Vector operations
  long vectors
  constant vector size
Static data movement patterns
Scalar operations

Kernels | Type of Computation | Vector Width
W-CDMA
  Filter | Vector | 64
  Modulation | | 2560
  Channel Est. | | 320
  Error Correction | Mixed | 8 or 256
802.11a
  | | 33
  Modulation (FFT) | | 16

We have talked about system-level design considerations; now we turn to algorithm-level design considerations for a SODA PE. We first analyzed the algorithm behavior of W-CDMA and 802.11a. The majority of the algorithms operate on 8- to 16-bit precision. In addition, the majority of the computations are vector operations on long vectors of constant size. Data movement usually consists of inter- and intra-vector shuffling operations with static permutation patterns; the FFT butterfly permutation pattern is one example. In addition to vector operations, there are also some scalar operations, as shown in the table for some of the error correction and channel estimation algorithms.

9 SODA System Architecture for 3G
4 PEs
  static kernel mapping and scheduling
  SIMD+Scalar units
1 ARM GPP controller
  scalar algorithms and protocol controls

This is the overall diagram of the SODA system, targeted toward 3G. It includes 4 PEs, each running at 400 MHz, and 1 ARM general-purpose processor. Kernels are mapped and scheduled statically. Each PE consists of a SIMD unit and a scalar unit, since protocols consist of both parallelizable and scalar operations.

10 SODA System Architecture for 3G
2-level scratchpad memories
  12KB local scratchpad memory for stream queues
  64KB global scratchpad memory for large buffers
Low-throughput shared bus
  200MHz 32-bit bus
  inter-PE communication using DMA
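To get a feel for what a 200MHz 32-bit shared bus means for inter-PE traffic, here is a rough Python estimate. It assumes one 32-bit word moves per bus cycle with no arbitration or DMA setup overhead, and that 1KB is 1024 bytes; neither assumption comes from the slides, and the function names are hypothetical.

```python
BUS_CLOCK_HZ = 200_000_000   # 200MHz shared bus (from the slide)
BUS_WIDTH_BITS = 32          # 32-bit bus (from the slide)
LOCAL_SCRATCHPAD_BYTES = 12 * 1024   # 12KB local scratchpad, assuming 1KB = 1024 B


def peak_bus_bytes_per_s(clock_hz: int = BUS_CLOCK_HZ,
                         width_bits: int = BUS_WIDTH_BITS) -> int:
    """Upper bound on bus throughput: one full-width word every cycle."""
    return clock_hz * width_bits // 8


def transfer_time_us(n_bytes: int) -> float:
    """Idealized time to move `n_bytes` over the bus, in microseconds."""
    return n_bytes / peak_bus_bytes_per_s() * 1e6


print(f"peak: {peak_bus_bytes_per_s() / 1e6:.0f} MB/s")
print(f"one full local scratchpad: {transfer_time_us(LOCAL_SCRATCHPAD_BYTES):.2f} us")
```

Real transfers would be slower than this bound, since the bus is shared among the PEs and the controller.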

11 SODA PE Architecture
Based on these algorithm characteristics, we designed the SODA PE architecture as shown in this diagram. It consists of three pipelines: a SIMD pipeline for vector processing, a scalar pipeline for scalar computations, and an AGU pipeline for memory addressing. It is a 2-issue LIW design with two local memories, one for the scalar pipeline and another for the SIMD pipeline.

12 SODA PE SIMD Pipeline
In designing the SIMD pipeline, we start from a very simple 16-bit datapath with a 2-read/1-write-port register file and a 16-bit ALU with multiplier.

13 SODA PE SIMD Pipeline
The 16-bit datapath is replicated 32 times for 32-wide SIMD execution. The datapaths are connected together by an interconnect network.
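One way to picture the 32-wide, 16-bit datapath is as a fixed-width vector unit that long kernel vectors are cut into chunks for. The Python sketch below is only a model under stated assumptions: one operation per lane per cycle, wraparound 16-bit arithmetic (the hardware may well saturate instead), and zero-padding of a partial last chunk. The 4 PEs and 400 MHz come from the system architecture slides; all function names are hypothetical.

```python
LANES = 32          # SIMD width from the slide
LANE_MASK = 0xFFFF  # 16-bit lanes; wraparound is an assumption, not from the slides


def simd_add16(a, b):
    """One SIMD add: 32 independent 16-bit additions."""
    assert len(a) == len(b) == LANES
    return [(x + y) & LANE_MASK for x, y in zip(a, b)]


def vector_add(a, b):
    """Add two long vectors by strip-mining them into 32-lane chunks."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    out = []
    for start in range(0, len(a), LANES):
        chunk_a = a[start:start + LANES]
        chunk_b = b[start:start + LANES]
        valid = len(chunk_a)
        pad = [0] * (LANES - valid)          # fill unused lanes of the last chunk
        out.extend(simd_add16(chunk_a + pad, chunk_b + pad)[:valid])
    return out


def simd_passes(vector_width: int) -> int:
    """Number of 32-lane operations needed to cover one vector."""
    return -(-vector_width // LANES)         # ceiling division


def peak_lane_ops_per_s(pes: int = 4, clock_hz: int = 400_000_000) -> int:
    """Idealized peak: every lane of every PE does one operation per cycle."""
    return pes * LANES * clock_hz
```

For the vector widths in the kernel table, `simd_passes` gives 2 passes for a width of 64 and 80 passes for a width of 2560; a width of 33 also needs 2 passes, with most lanes idle in the second.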

14 SODA PE SIMD Shuffle Network
The interconnect network, the SSN (SIMD shuffle network), consists of 4 standard shuffle patterns, including a shuffle exchange pattern, an inverse shuffle exchange pattern, an exchange-only pattern, and a feedback path for iterative shuffling.
Shuffle exchange
Inverse shuffle exchange
Exchange only
Iterative feedback
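The permutation patterns named here can be illustrated with their textbook definitions. This Python sketch assumes the perfect-shuffle, inverse-shuffle, and adjacent-pair exchange permutations; how the SSN hardware actually composes or controls them is not described in the transcript, so treat the function names and the composition as hypothetical.

```python
def shuffle(v):
    """Perfect shuffle: interleave the first half with the second half."""
    half = len(v) // 2
    out = []
    for lo, hi in zip(v[:half], v[half:]):
        out += [lo, hi]
    return out


def inverse_shuffle(v):
    """Inverse perfect shuffle: even positions first, then odd positions."""
    return list(v[0::2]) + list(v[1::2])


def exchange(v):
    """Swap each adjacent pair of elements."""
    out = list(v)
    for i in range(0, len(out) - 1, 2):
        out[i], out[i + 1] = out[i + 1], out[i]
    return out


def iterate(pattern, v, times):
    """Model the feedback path: apply one pattern repeatedly."""
    for _ in range(times):
        v = pattern(v)
    return v


lanes = list(range(8))
print(shuffle(lanes))                   # [0, 4, 1, 5, 2, 6, 3, 7]
print(inverse_shuffle(shuffle(lanes)))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(exchange(lanes))                  # [1, 0, 3, 2, 5, 4, 7, 6]
```

A perfect shuffle rotates the bits of each lane index, so on 32 lanes five iterations through the feedback path return every element to its starting lane. That is why a small set of fixed patterns plus iteration can build up larger static permutations such as FFT butterflies.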

15 SODA PE Scalar Pipeline
The scalar pipeline is also a 16-bit datapath, but without a multiplier. The scalar and SIMD pipelines are tightly coupled so that data can be exchanged quickly between them. The exchange is done with the vector-to-scalar and scalar-to-vector units.

16 Power Consumption at 180nm
We implemented our design in Verilog and synthesized it in 180nm technology. Power consumption is broken down by architectural component, shown on the x-axis. The blue and yellow bars are for W-CDMA and 802.11a. The y-axis is power consumption in mW. The biggest power consumers are the SIMD pipeline registers and the SIMD register file. This is because SODA has a 32-wide SIMD pipeline, which requires a large number of pipeline registers. 802.11a consumes more power than W-CDMA because it has a higher number of computations and a higher number of multiplications (FFT). Finally, most W-CDMA operations are 8-bit computations, whereas 802.11a requires mostly 16-bit computations. SODA currently supports only a 16-bit computation mode; allowing 8-bit computations as well could reduce W-CDMA power even further.
180nm: ~3 W, 26.6 mm2
90nm (est): ~0.5 W, 6.7 mm2

17 SDR Compilation Strategy
Two-level application description
  System-level: Concurrent tasks extracted from “C + channels + attributes”
  Kernel-level: Data parallelism extracted from “C + vectors + Matlab operators”
System compilation – Task level parallelism
  Generates tasks, schedules, communication stubs, DMA requests, timing assertions, synchronization, debug support
Kernel compilation – Data level parallelism
  Lower virtual DLP to physical implementation

18 Stylized Automatic Parallelization
Source loop

    void main() {
      for(int i=0; i<N; ++i) {
        int x = spread(w[i]);
        int y = scramb(x);
        fir(y);
      }
    }

1. Task assignment

    void main() {
      for(int i=0; i<N; ++i) {
        int x =
        int y =
      }
    }

2. Decompose tasks

    void t1() {
      for(int i=0; i<N; ++i) {
        int x =
        putchannel(X,x);
    }}
    void t2() {
        int y =
        putchannel(Y,y);
    }}
    void t3() {
        P2;
    }}
    main() {
      fork (t1);
      fork (t2);
      fork (t3);
    }
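The idea on this slide, splitting one loop into concurrent tasks that talk through channels, can be sketched in ordinary Python. This is not the SODA compiler's output: threads stand in for `fork`, `queue.Queue` stands in for the channels `X` and `Y`, and the bodies of `spread`, `scramb`, and `fir` are placeholders invented so the example runs.

```python
import queue
import threading


def spread(w):          # placeholder body, not the real spreading kernel
    return w * 2


def scramb(x):          # placeholder body, not the real scrambler
    return x ^ 0x5


def fir(y, sink):       # placeholder body: just records its input
    sink.append(y)


def sequential(words):
    """The original single loop: spread, scramble, filter."""
    out = []
    for w in words:
        x = spread(w)
        y = scramb(x)
        fir(y, out)
    return out


def pipelined(words):
    """The same loop decomposed into three tasks joined by two channels."""
    chan_x, chan_y = queue.Queue(), queue.Queue()
    out = []
    n = len(words)

    def t1():
        for w in words:
            chan_x.put(spread(w))           # like putchannel(X, x)

    def t2():
        for _ in range(n):
            chan_y.put(scramb(chan_x.get()))

    def t3():
        for _ in range(n):
            fir(chan_y.get(), out)

    tasks = [threading.Thread(target=t) for t in (t1, t2, t3)]
    for t in tasks:                          # like fork(t1); fork(t2); fork(t3)
        t.start()
    for t in tasks:
        t.join()
    return out
```

Because each channel is a FIFO with one producer and one consumer, `pipelined` yields the same sequence as `sequential`; the decomposition changes where the work runs, not what is computed.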

19 Kernel Level Compilation
    template<class T, TAPS, BSIZE>
    kernel FIR {
      vector<T, TAPS> z;
      vector<T, TAPS> coeff;
      void set_coeff(vector<T, TAPS> c) { coeff = c; }
      void run(channel<T, BSIZE> inbuf, channel<T, BSIZE> outbuf) {
        T in, out;
        for (i = 0; i < BSIZE; i++) {
          in = inbuf.pop();
          z += coeff * in;
          out = z[0];
          outbuf.push(out);
          z = (z(1:TAPS-1),0);
        }
      }
    };

1. Object declaration

    int main() {
      FIR<int8, 64, BUF_SIZE> fir65r;
      ...
    }

2. Operation translation

    // in = inbuf.pop();
    sld(sin, inbuf);
    agu_incr(inbuf, sizeof(int8));
    // z += coeff * in;
    vdup(vin, sin);
    vmul(vt1, _fir65r_coeff_v0, vin);
    vmul(vt2, _fir65r_coeff_v1, vin);
    vadd(_fir65r_z_v0, _fir65r_z_v0, vt1);
    vadd(_fir65r_z_v1, _fir65r_z_v1, vt2);
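The loop body of the FIR kernel (accumulate `coeff * in` into `z`, emit `z[0]`, then shift `z` by one and append a zero) is the transposed form of an FIR filter. The Python sketch below mirrors that loop and compares it with the direct convolution sum. It uses unbounded Python integers, so the `int8` width and any overflow behavior of the real kernel are not modeled; the function names are hypothetical.

```python
def fir_transposed(samples, coeff):
    """FIR filter written the way the kernel's run() loop is written."""
    z = [0] * len(coeff)                 # one accumulator per tap
    out = []
    for s in samples:
        z = [acc + c * s for acc, c in zip(z, coeff)]   # z += coeff * in
        out.append(z[0])                                # out = z[0]
        z = z[1:] + [0]                                 # z = (z(1:TAPS-1), 0)
    return out


def fir_direct(samples, coeff):
    """Reference: out[n] = sum over k of coeff[k] * samples[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeff):
            if n - k >= 0:
                acc += c * samples[n - k]
        out.append(acc)
    return out


impulse = [1, 0, 0, 0]
print(fir_transposed(impulse, [3, 2, 1]))   # [3, 2, 1, 0]
```

The transposed form suits a wide SIMD unit because each input sample is broadcast and multiplied by all coefficients at once, which matches the `vdup` followed by `vmul` and `vadd` in the translated operations.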

20 Final Thoughts
2G and 3G SDR solutions are achievable in 90nm
3.9G: 4-10x more performance with mW power consumption
Core technologies for future networks
  OFDM: 64 – 2048 point FFT
  MIMO – use of multiple antennas for transmission/reception
  Low density parity check codes
Key insight: SDR requires innovation across algorithm, software and hardware
SDR platforms offer low cost, longevity, and adaptability

In conclusion, we find that it is possible to support 2G and 3G wireless protocols in software defined radio. During our implementation, we found many optimization opportunities: algorithms can be modified to take better advantage of wide SIMD computation, software compilation techniques can be used to overlap more computations, and architecture and VLSI design optimizations are also possible. Finally, for future work, we are looking at idle-mode operation, at 4G protocols and their architectural implications, and at an application-specific language and a compiler for SODA SDR.

