Download presentation
Presentation is loading. Please wait.
1
A System Solution for High- Performance, Low Power SDR Yuan Lin 1, Hyunseok Lee 1, Yoav Harel 1, Mark Woh 1, Scott Mahlke 1, Trevor Mudge 1 and Krisztian Flautner 2 1 Advanced Computer Architecture Laboratory University of Michigan 2 ARM, Ltd.
2
Advanced Computer Architecture Laboratory University of Michigan 2 SDR Design Challenges: Hardware design challenges High computational throughput (~40 Gops) Low power consumption (~200mW) Meet real-time requirements DSP programming support System-level development Inter-algorithm communication Algorithm-level development Efficient DSP representations
3
SDR Benchmark Design & Analysis
4
Advanced Computer Architecture Laboratory University of Michigan 4 W-CDMA Protocol: 2Mbps
5
Advanced Computer Architecture Laboratory University of Michigan 5 W-CDMA Characteristics Plenty of vector parallelism 8 & 16-bit DSP algorithms Multiplication is not dominant No floating-point operation Small instruction/data memory Has periodic real-time tasks
6
Advanced Computer Architecture Laboratory University of Michigan 6 802.11a Protocol: 24Mbps
7
Advanced Computer Architecture Laboratory University of Michigan 7 802.11a Characteristics Similar to W-CDMA Plenty of vector parallelism No floating-point operation Small instruction/data memory Different from W-CDMA Mostly 16-bit DSP algorithms Multiplication is more dominant No periodic real-time tasks
8
SDR Processor Architecture Design
9
Advanced Computer Architecture Laboratory University of Michigan 9 System Architecture Design Tradeoffs Amortized fetch Intra-processor Communication Inter-processor Communication
10
Advanced Computer Architecture Laboratory University of Michigan 10 System Architecture Design Tradeoffs Number of Processing Elements x SIMD width For W-CDMA 2Mbps (51.2GOP/sec) 90nm 1V @400MHz
11
Advanced Computer Architecture Laboratory University of Michigan 11 Our SDR System Architecture Design Scalable system design Standardized SoC interface System interface supports multiple (potentially) heterogeneous PEs and memories For WCDMA & 802.11a 4 homogeneous processing elements (PEs) Dual pipelines: scalar pipeline & SIMD pipeline Local scratchpad memory (no data cache) Global scratchpad memory (64KB) Controller -- ARM general purpose processor
12
Advanced Computer Architecture Laboratory University of Michigan 12 System Architecture Design
13
Advanced Computer Architecture Laboratory University of Michigan 13 PE Design (Area < 1mm 2 Power<50mW)
14
Advanced Computer Architecture Laboratory University of Michigan 14 Mapping DSP Algorithms: Filters Z -1 b0b1b2b3 In Out In b z -1 spread V in, S in shift z, z, up mac z, V in, S in
15
Advanced Computer Architecture Laboratory University of Michigan 15 Mapping DSP Algorithm: Filter spread V in, S in shift z, z, up mac z, V in, S in
16
Advanced Computer Architecture Laboratory University of Michigan 16 Efficient Design Wide SIMD width Small register file with minimum ports Small memories Narrow system BUS Data-path optimized for 8bits Vector shuffle reduce memory ports
17
Advanced Computer Architecture Laboratory University of Michigan 17 Processing Element (PE) Design Scalar pipelines 16bit data path SIMD pipeline 8 bit data path 32x8 SIMD ALU Software controlled local scratchpad memory 4KB scalar memory 4KB SIMD memory Inter-PE communication through DMA
18
Advanced Computer Architecture Laboratory University of Michigan 18
19
Advanced Computer Architecture Laboratory University of Michigan 19
20
Advanced Computer Architecture Laboratory University of Michigan 20 802.11a PE Mapping
21
Advanced Computer Architecture Laboratory University of Michigan 21 Power Results Configuration 4 PEs, 1 ARM (Cortex M3) controller Global scratchpad memory (64Kb) 90nm (1V @ 400 MHZ), Synthesized conservatively
22
Advanced Computer Architecture Laboratory University of Michigan 22 Area Results
23
SDR Programming Language Support
24
Advanced Computer Architecture Laboratory University of Michigan 24 Software Development Flow
25
Advanced Computer Architecture Laboratory University of Michigan 25 SPEX (Signal Processing EXtension) Implemented as a library extension to C System-level development Support concurrent DSP kernel function definitions Channel variables for inter-kernel communications Algorithm-level development Native vector & matrix variables Explicit DSP variable attribute definition Native vector & matrix operations
26
Advanced Computer Architecture Laboratory University of Michigan 26 SPEX Overview
27
Advanced Computer Architecture Laboratory University of Michigan 27 SPEX Example Code: Viterbi ACS Concurrent DSP kernel definitions void* acs(void*) { /* variable declaration */ saturated char metrics1, metrics2; saturated char states; saturated char t1, t2; while (!viterbi.stop()) { /* receiving data from BMC */ metrics1 = bmc_to_acs.receive(); metrics2 = bmc_to_acs.receive(); /* add */ metrics1 += states; metrics2 += states; /* compare and select */ t1 = (metrics1(0,2,62),metrics2(0,2,62)); t2 = (metrics1(1,2,63),metrics2(1,2,63)); states(t1<t2) = t1; states(t1>=t2) = t2; /* sending data to TB */ acs_to_tb.send(states); }
28
Advanced Computer Architecture Laboratory University of Michigan 28 SPEX Example Code: Viterbi ACS Native SIMD variable definition with explicit attributes SPEX variable supports 1. saturated/overflow 2. various variable bit-width 3. vector & matrices void* acs(void*) { /* variable declaration */ saturated char metrics1, metrics2; saturated char states; saturated char t1, t2; while (!viterbi.stop()) { /* receiving data from BMC */ metrics1 = bmc_to_acs.receive(); metrics2 = bmc_to_acs.receive(); /* add */ metrics1 += states; metrics2 += states; /* compare and select */ t1 = (metrics1(0,2,62),metrics2(0,2,62)); t2 = (metrics1(1,2,63),metrics2(1,2,63)); states(t1<t2) = t1; states(t1>=t2) = t2; /* sending data to TB */ acs_to_tb.send(states); }
29
Advanced Computer Architecture Laboratory University of Michigan 29 SPEX Example Code: Viterbi ACS Inter-kernel communication through channel operations Channel types: 1. FIFO queue 2. Broadcast queue 3. Sync/control channel 4. Random-read FIFO queue void* acs(void*) { /* variable declaration */ saturated char metrics1, metrics2; saturated char states; saturated char t1, t2; while (!viterbi.stop()) { /* receiving data from BMC */ metrics1 = bmc_to_acs.receive(); metrics2 = bmc_to_acs.receive(); /* add */ metrics1 += states; metrics2 += states; /* compare and select */ t1 = (metrics1(0,2,62),metrics2(0,2,62)); t2 = (metrics1(1,2,63),metrics2(1,2,63)); states(t1<t2) = t1; states(t1>=t2) = t2; /* sending data to TB */ acs_to_tb.send(states); }
30
Advanced Computer Architecture Laboratory University of Michigan 30 SPEX Example Code: Viterbi ACS SPEX vector operations Supports (Matlab-like C code) 1.SIMD arithmetic operations 2. SIMD permutation 3. SIMD predication void* acs(void*) { /* variable declaration */ saturated char metrics1, metrics2; saturated char states; saturated char t1, t2; while (!viterbi.stop()) { /* receiving data from BMC */ metrics1 = bmc_to_acs.receive(); metrics2 = bmc_to_acs.receive(); /* add */ metrics1 += states; metrics2 += states; /* compare and select */ t1 = (metrics1(0,2,62),metrics2(0,2,62)); t2 = (metrics1(1,2,63),metrics2(1,2,63)); states(t1<t2) = t1; states(t1>=t2) = t2; /* sending data to TB */ acs_to_tb.send(states); }
31
Advanced Computer Architecture Laboratory University of Michigan 31 Summary - Hardware & software solutions for SDR - Hardware - 4 dual-issue asymmetric SIMD processing elements - Consumes 200~300mW for 90nm - Meets the performance requirements for WCDMA & 802.11a - Software - SPEX provides efficient DSP algorithm and system implementation
32
Advanced Computer Architecture Laboratory University of Michigan 32 Questions?
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.