RICE UNIVERSITY Flexible wireless communication architectures Sridhar Rajagopal Department of Electrical and Computer Engineering Rice University, Houston.

Presentation transcript:

RICE UNIVERSITY
Flexible wireless communication architectures
Sridhar Rajagopal
Department of Electrical and Computer Engineering, Rice University, Houston, TX
Faculty Candidate Seminar, University of Rochester, March 31, 2003
This work has been supported in part by NSF, Nokia and TI.

Future wireless devices demand flexibility
- High data rate mobile devices with multimedia
- Multiple antennas with complex signal processing algorithms
- High performance and low power needs
- Multiple algorithms and environments supported in the same device
- Fast design time
[Figure labels: Bluetooth/Home Networks, Wireless Cellular, Wireless LAN]

Flexibility needed in different layers
[Figure labels: Analog RF, Physical Layer, MAC Layer, Network Layer, Application Layer. Annotations: "Support for multiple wireless environments and algorithms at high data rates"; "Puppeteer project at Rice"]

Research vision: Attain flexibility
- Architectures:
  - Flexibility: support a variety of sophisticated algorithms
  - High performance: GOPs of computation (Mbps)
  - Low power: < 500 mW
- Algorithms:
  - Need efficient algorithms for mapping to architectures
- Fast design exploration for efficient algorithms and architectures
Design me

My contributions: Algorithms
- Multi-user channel estimation [Jnl. of VLSI Sig. Proc.'02, ASAP'00]
  - Matrix inversions
  - Numerical techniques
  - Conjugate-gradient descent for complexity reduction
- Multi-user detection [ISCAS'01]
  - Block-based computation to streaming computations
  - Pipelining, lower memory requirements
- Parallel, fixed-point, streaming VLSI implementations [IEEE Trans. Wireless Comm.'02]

My contributions: Architectures
- Heterogeneous DSP-FPGA system designs [ICSPAT'00]
- Computer arithmetic [Symp. on Comp. Arith.'01]
  - Dynamic truncation in ASICs using on-line arithmetic
- [Ph.D. Thesis] Scalable Wireless Application-specific Processors (SWAPs)
  - Rapid architecture exploration for flexibility-performance tradeoffs

Scalable Wireless Application-specific Processors
- Family of flexible programmable processors
- Clusters of ALUs
- High performance by supporting 100's of ALUs
- Can provide customization for various algorithms
- Adapts ("swaps") architecture dynamically for power
[Figure: clusters of adders (+) and multipliers (*); labels "Scale Clusters" and "Scale ALUs"]

Rapid design exploration for SWAPs
[Figure labels: Low "complexity", parallel, fixed point algorithms; Architecture Exploration; ASIC design (apply); DSP design (apply); SWAPs (clusters of + and * units)]

Research vision summary
- Provide a framework to rapidly explore:
  - Flexible, high-performance, low-power architectures (SWAPs)
  - Efficient algorithm design for mapping to SWAPs
- Understanding of algorithms, DSPs and ASICs used
- Flexibility-performance trade-off with increasing customization in SWAPs
Inter-disciplinary research: wireless communications, VLSI signal processing, computer architecture, computer arithmetic, CAD, compilers

Talk Outline
- Research vision
- SWAPs: Background
- Algorithm design for SWAPs
- Architecture design for SWAPs
- Current and Future Research Goals

SWAPs borrow from DSPs
- DSPs use:
  - Instruction-level parallelism (ILP)
  - Subword parallelism (MMX)
- Current DSPs:
  - Not enough functional units (ALUs) for GOPs of computation, and cannot extend to more ALUs
  - TI C6x DSP has 8 ALUs; 100's of ALUs are needed
  - Cannot support more registers (area, ports)
  - Difficult to find ILP as ALUs increase

SWAPs borrow from ASICs
- Exploit data parallelism (DP) also
- Available in many wireless algorithms
- This is what ASICs do!

    #define N 1024
    int i, a[N], b[N], c[N];        /* 32 bits */
    short int d[N], e[N], f[N];     /* 16 bits, packed */
    for (i = 0; i < N; ++i) {
        a[i] = b[i] + c[i];
        d[i] = e[i] + f[i];
    }

[Figure annotations on the code: ILP, DP, Subword]
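As a rough illustration of the subword parallelism named on this slide, the sketch below adds two pairs of 16-bit values that are packed into one 32-bit word, using only ordinary integer operations. The helper names (pack16, add_packed16, add_packed_arrays) are hypothetical and do not come from the talk; a real DSP would typically offer a dedicated packed-add instruction instead.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 16-bit lanes into one 32-bit word (hypothetical helper). */
uint32_t pack16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Lane-wise 16-bit addition (modulo 2^16 per lane) on packed words.
 * The low 15 bits of each lane are added normally; the top bit of each
 * lane is then fixed up with XOR so a carry never crosses lanes. */
uint32_t add_packed16(uint32_t x, uint32_t y)
{
    uint32_t low = (x & 0x7FFF7FFFu) + (y & 0x7FFF7FFFu);
    return low ^ ((x ^ y) & 0x80008000u);
}

/* Data-parallel loop over packed arrays: every iteration is independent. */
void add_packed_arrays(uint32_t *d, const uint32_t *e, const uint32_t *f, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        d[i] = add_packed16(e[i], f[i]);
}
```

Each call to add_packed16 does the work of two 16-bit additions, which is the subword effect; the loop in add_packed_arrays is the data-parallel part that could be spread across clusters.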

SWAPs borrow from stream processors
[Figure labels: Kernel, Stream, Input Data, Output Data; blocks: received signal, Correlator, channel estimation, Matched filter, Interference Cancellation, Viterbi decoding, Decoded bits]
- Kernels (computation) and streams (communication)
- Operations in kernels use local data in clusters, providing GOPs support
- Streams expose data parallelism
- Imagine stream processor at Stanford [Rixner'01]
[Rixner'01] Scott Rixner. Stream Processor Architecture. Kluwer Academic Publishers: Boston, MA, 2001.
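The kernel/stream split on this slide can be sketched in a few lines of C. This is only a toy model under stated assumptions: the type kernel_fn, the two stand-in kernels and run_kernels are hypothetical names, and the two buffers merely stand in for whatever storage holds streams between kernels.

```c
#include <assert.h>
#include <stddef.h>

/* A kernel consumes one input stream and produces one output stream. */
typedef void (*kernel_fn)(const int *in, int *out, size_t n);

/* Hypothetical stand-in kernels; real ones would be a correlator,
 * a matched filter, and so on. */
void kernel_scale2(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = 2 * in[i];
}

void kernel_offset1(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] + 1;
}

/* Run kernels back to back. buf_a and buf_b alternate as the stream
 * passed between kernels; the returned pointer is the final stream. */
const int *run_kernels(const kernel_fn *kernels, size_t nk,
                       const int *input, int *buf_a, int *buf_b, size_t n)
{
    assert(kernels && input && buf_a && buf_b && buf_a != buf_b);
    const int *src = input;
    int *dst = buf_a;
    for (size_t k = 0; k < nk; ++k) {
        kernels[k](src, dst, n);
        src = dst;
        dst = (dst == buf_a) ? buf_b : buf_a;
    }
    return src;
}
```

Inside each kernel the element loop has no cross-iteration dependence, which is the data parallelism that streams are said to expose.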

SWAPs: multi-cluster DSPs
[Figure: a DSP (1 cluster) with internal memory exploits ILP; SWAPs use many such clusters fed from a memory, the Stream Register File (SRF), and exploit both ILP and DP]
- SWAPs adapt clusters to DP
- Identical clusters, same operations
- Power down unused FUs and clusters

Arithmetic clusters in SWAPs
- ALUs (+, *, /)
- Scratch-pad (Sp)
  - Indexed accesses
- Comm. unit (CU)
  - Intercluster communication
- Distributed register files
  - Support more ALUs
[Figure labels: Intercluster Network, From/To SRF, Cross Point, Local Register File, CU, Sp, SRF, ALUs (+, *, /)]


SWAPs: Physical layer algorithms
[Figure labels: Antenna, RF Front-end, Baseband processing (Channel estimation, Detection, Decoding), Higher (MAC/Network/OS) Layers]

SWAP mapping example: Viterbi decoding
- Multiple antenna systems (MIMO systems)
  - Complexity exponential with transmit x receive antennas
- Estimation: linear MMSE, blind, conjugate gradient, ...
- Detection: FFT, (blind) interference cancellation, ...
- Decoding: Viterbi, Turbo, LDPC, ... and joint schemes
- SWAP flexibility lets you use the best algorithms for the situation
Example for concept demonstration: Viterbi decoding

Parallel Viterbi decoding for SWAPs
- Add-Compare-Select (ACS): trellis interconnect, computations
  - Parallelism depends on constraint length (#states)
- Traceback: searching
  - Conventional traceback is sequential (no DP) with dynamic branching
  - Difficult to implement in a parallel architecture
  - Use Register Exchange (RE) as the parallel solution
[Figure labels: Detected bits, ACS Unit, Traceback Unit, Decoded bits]
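To make the ACS and register-exchange steps concrete, here is a minimal hard-decision sketch. The code parameters are assumptions chosen for brevity (constraint length 3, rate 1/2, generators 7 and 5 octal), not values from the talk, and all function names are hypothetical. Instead of tracing back through stored decisions, each state carries its own survivor register, which is copied from the winning predecessor on every step.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative code parameters (assumptions, not from the talk). */
#define K 3
#define NSTATES (1u << (K - 1))
#define G0 7u
#define G1 5u

static unsigned parity(unsigned v)
{
    unsigned p = 0;
    for (; v; v >>= 1)
        p ^= v & 1u;
    return p;
}

/* 2-bit encoder output when input bit b leaves state s. */
static unsigned branch_output(unsigned s, unsigned b)
{
    unsigned reg = (b << (K - 1)) | s;
    return (parity(reg & G0) << 1) | parity(reg & G1);
}

/* Encode one bit, updating *state; returns the 2-bit channel symbol. */
unsigned encode_bit(unsigned *state, unsigned b)
{
    unsigned out = branch_output(*state, b);
    *state = ((b << (K - 1)) | *state) >> 1;
    return out;
}

/* Hard-decision Viterbi decoding with register exchange.
 * Assumes the encoder starts in state 0 and is flushed back to state 0.
 * Returns the n decoded bits, first bit in the most significant position. */
uint32_t viterbi_re_decode(const unsigned *symbols, int n)
{
    unsigned pm[NSTATES], new_pm[NSTATES];
    uint32_t surv[NSTATES] = {0}, new_surv[NSTATES];

    assert(n > 0 && n <= 32);
    for (unsigned s = 0; s < NSTATES; ++s)
        pm[s] = (s == 0) ? 0u : 1000u;   /* only state 0 is a valid start */

    for (int t = 0; t < n; ++t) {
        /* Each next state is computed independently of the others. */
        for (unsigned ns = 0; ns < NSTATES; ++ns) {
            unsigned b = ns >> (K - 2);  /* input bit that leads to ns */
            unsigned best = 0, best_s = 0;
            for (unsigned x = 0; x < 2; ++x) {
                unsigned s = ((ns << 1) | x) & (NSTATES - 1);
                unsigned d = branch_output(s, b) ^ symbols[t];
                unsigned cand = pm[s] + (d & 1u) + (d >> 1);  /* add */
                if (x == 0 || cand < best) {                  /* compare */
                    best = cand;                              /* select */
                    best_s = s;
                }
            }
            new_pm[ns] = best;
            /* Register exchange: copy the winner's history, append b. */
            new_surv[ns] = (surv[best_s] << 1) | b;
        }
        for (unsigned s = 0; s < NSTATES; ++s) {
            pm[s] = new_pm[s];
            surv[s] = new_surv[s];
        }
    }
    return surv[0];
}
```

The loop over ns has no dependence between iterations, so both the ACS and the register copy can run in lock-step across states, which is why register exchange fits a data-parallel machine better than a pointer-chasing traceback.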

Parallel Viterbi needs re-ordering for SWAPs
Exploiting Viterbi DP in SWAPs:
- Use RE instead of regular traceback
- Re-order ACS, RE
[Figure: 16-state trellis stage drawn twice. Regular ACS: X(0), X(1), ..., X(15) on both sides. ACS in SWAPs: one side re-ordered as X(0), X(2), ..., X(14), X(1), X(3), ..., X(15), the other side in natural order X(0), X(1), ..., X(15). Label: DP vector]
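One way to read the even/odd ordering in this figure is sketched below. It assumes the common shift-register convention in which old states 2j and 2j+1 both feed new states j and j + 8; the slide does not spell out its convention, so treat the mapping and the names split_even_odd and acs_reordered as assumptions. With the old metrics split into an even vector and an odd vector, every output pair is computed from the same lane index, with no irregular indexing.

```c
#include <assert.h>

#define NS 16            /* number of trellis states, as on the slide */
#define HALF (NS / 2)

/* Re-order metrics X(0..15) into even-indexed and odd-indexed vectors. */
void split_even_odd(const int *x, int *even, int *odd)
{
    for (int j = 0; j < HALF; ++j) {
        even[j] = x[2 * j];
        odd[j] = x[2 * j + 1];
    }
}

static int min_int(int a, int b) { return a < b ? a : b; }

/* One ACS stage on the re-ordered vectors. bm holds four branch-metric
 * vectors of length HALF, stored back to back:
 *   bm[0*HALF + j]: 2j -> j          bm[1*HALF + j]: 2j+1 -> j
 *   bm[2*HALF + j]: 2j -> j + HALF   bm[3*HALF + j]: 2j+1 -> j + HALF
 * Every j is independent, so the loop maps onto HALF parallel lanes. */
void acs_reordered(const int *even, const int *odd, const int *bm, int *next)
{
    assert(even && odd && bm && next);
    for (int j = 0; j < HALF; ++j) {
        next[j] = min_int(even[j] + bm[0 * HALF + j],
                          odd[j] + bm[1 * HALF + j]);
        next[j + HALF] = min_int(even[j] + bm[2 * HALF + j],
                                 odd[j] + bm[3 * HALF + j]);
    }
}
```

The output comes out in natural order X(0) to X(15), so the same split can be applied again before the next stage.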


Designing the SWAP architecture
More clusters are better than more ALUs per cluster
1. Decide how many clusters
   - Exploit DP
2. Decide what to put within each cluster
   - Maximize ILP with high functional unit efficiency
   - Search design space with "explore" tool
Time-power-area characterization
[Figure: clusters of + and * units; labels ILP and DP]

Design a SWAP cluster: "Explore"
[Figure: auto-exploration of adders and multipliers for "ACS"; points labeled (Adder util%, Multiplier util%)]

"Explore" tool benefits
- Instruction count vs. ALU efficiency
- What goes inside each cluster
- Design customized application-specific units
- Better performance with increased ALU utilization
- Explore multiple algorithms
  - Turn off functional units not in use for a given kernel
Explore Algorithm 1: 3 adders, 3 multipliers, 32 clusters
Explore Algorithm 2: 4 adders, 1 multiplier, 64 clusters
Chosen architecture: 4 adders, 3 multipliers, 64 clusters
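The chosen architecture on this slide equals the element-wise maximum of the two per-algorithm results. Whether the tool picks it that way is not stated, so the sketch below is only one plausible reading; struct cluster_config and cover_both are hypothetical names.

```c
#include <assert.h>

/* Hypothetical record for one "explore" result. */
struct cluster_config {
    int adders;
    int multipliers;
    int clusters;
};

static int max_int(int a, int b) { return a > b ? a : b; }

/* Smallest configuration that covers both inputs in every dimension.
 * Units a given kernel does not need can then be switched off. */
struct cluster_config cover_both(struct cluster_config a,
                                 struct cluster_config b)
{
    struct cluster_config c;
    assert(a.adders >= 0 && a.multipliers >= 0 && a.clusters >= 0);
    assert(b.adders >= 0 && b.multipliers >= 0 && b.clusters >= 0);
    c.adders = max_int(a.adders, b.adders);
    c.multipliers = max_int(a.multipliers, b.multipliers);
    c.clusters = max_int(a.clusters, b.clusters);
    return c;
}
```

Applied to the two results on the slide, (3, 3, 32) and (4, 1, 64), this yields 4 adders, 3 multipliers and 64 clusters.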

SWAP flexibility provides power savings
- Multiple algorithms
  - Different ALU requirements
  - Different cluster requirements
- Turning off ALUs
  - Use the right #ALUs for the kernel from the static code schedule
- Turning off clusters
  - Data is spread across the SRF of all clusters
  - Each cluster does not have access to the entire SRF
  - Next kernel may need data from the SRF of other clusters
  - Reconfiguration support needs to be provided

SWAPs provide cluster scaling
- Use mux-demux buffers
- Latency hidden: minimal loss in performance
- Can turn off clusters entirely
[Figure labels: SRF, Mux-Demux buffers, Clusters]

Viterbi reconfiguration using SWAPs
- Packet 1: constraint length 7 (16 clusters)
- Packet 2: constraint length 9 (64 clusters)
- Packet 3: constraint length 5 (4 clusters)
[Figure labels: DP; Can be turned OFF]
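The three data points on this slide fit a simple rule: a code with constraint length K has 2^(K-1) trellis states, and the cluster counts shown correspond to 4 states per cluster (64/16, 256/64, 16/4). That ratio is inferred from the slide rather than stated by it, and the function names below are hypothetical.

```c
#include <assert.h>

/* Number of trellis states for constraint length k: 2^(k-1). */
int viterbi_states(int k)
{
    assert(k >= 1 && k <= 16);
    return 1 << (k - 1);
}

/* Clusters to keep powered on for a packet with constraint length k.
 * states_per_cluster = 4 reproduces the slide's numbers; it is an
 * inferred value, not one given by the author. */
int active_clusters(int k, int states_per_cluster, int max_clusters)
{
    assert(states_per_cluster > 0 && max_clusters > 0);
    int needed = (viterbi_states(k) + states_per_cluster - 1)
                 / states_per_cluster;
    return needed < max_clusters ? needed : max_clusters;
}
```

With a 64-cluster machine, active_clusters(7, 4, 64), active_clusters(9, 4, 64) and active_clusters(5, 4, 64) give 16, 64 and 4, so the remaining clusters could be turned off per packet.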

Run-time SWAP flexibility
[Figure: execution time (cycles) for rate ½ decoding of Packet 1 (K = 7), Packet 2 (K = 9) and Packet 3 (K = 5). Labels: Kernels (Computation), No Data, Memory accesses, Clusters, Memory]

SWAP exploration for Viterbi decoding
[Figure: frequency needed to attain real-time (in MHz) vs. number of clusters, for K = 9, K = 7 and K = 5. Series: Different SWAPs (without reconfiguration), Same SWAP (with reconfiguration), DSP. Annotations: "Ideal C64x (w/o co-proc) needs ~200 MHz for real-time"; "Max DP"]

SWAPs: Salient features
- 1-2 orders of magnitude better than a DSP
- Any constraint length
  - 10 MHz at 128 Kbps
- Same code for all constraint lengths
  - No need to re-compile or load another code
  - As long as the parallelism/cluster ratio is constant
- Power savings due to dynamic cluster scaling

Expected SWAP power consumption
- Power model based on [Khailany'03]
- 64 clusters and 1 multiplier per cluster:
  - 0.13 micron, 1.2 V
  - Peak active power: ~9 mW at 1 MHz (DSP ~1 mW at 1 MHz)
  - Area: ~53.7 mm^2
- 10 MHz, 128 Kbps with reconfiguration (DSP ~200 mW)

Viterbi     Clusters    Peak Power
K = 9       64          ~90 mW
K = 7       16          ~28.57 mW
K = 5       4           ~13.8 mW
overhead    0           ~8.1 mW

[Figure: power (in mW) vs. active clusters (max 64)]
[Khailany'03] Brucek Khailany et al. Exploring the VLSI Scalability of Stream Processors. Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12,

Multiuser Estimation-Detection+Decoding
- Real-time target: 128 Kbps per user
- Ideal C64x (w/o co-proc) needs ~15 GHz for real-time

Expected SWAP power: base-station
- 32-user base-station with 3 multipliers (X's) per cluster and 64 clusters:
  - 0.13 micron, 1.2 V
  - Peak active power: ~18.19 mW at 1 MHz (increased X)
  - Area: ~93.4 mm^2
- Total peak base-station power consumption:
  - ~18.19 W at 1 GHz for 32 users at 128 Kbps/user
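The totals quoted on this slide and on the earlier Viterbi power slide are consistent with peak power scaling linearly with clock frequency. That scaling is read off the numbers rather than stated on the slides, and the symbol names below are introduced only for this check:

```latex
\begin{align*}
P_{\text{peak}}(f) &\approx P_{1\,\text{MHz}} \cdot \frac{f}{1\ \text{MHz}} \\
\text{Viterbi, 64 clusters at 10 MHz:}\quad
  P_{\text{peak}} &\approx 9\ \text{mW} \times 10 = 90\ \text{mW} \\
\text{Base-station, 64 clusters at 1 GHz:}\quad
  P_{\text{peak}} &\approx 18.19\ \text{mW} \times 1000 = 18.19\ \text{W}
\end{align*}
```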


Current research: Flexibility vs. performance
SWAPs: 128 Kbps at ~ mW for Viterbi
- Borrow DP from ASICs!
- Suitable for base-stations
  - Flexibility more important than power
- Suitable for mobile devices
  - Power constraints tighter
  - Can be customized for further power savings
Handset SWAPs (H-SWAPs)
- Borrow task pipelining from ASICs!
- Application-specific units and specialized comm. network

Handset SWAPs: H-SWAPs
- Trade data parallelism for task pipelining
[Figure: SWAPs (max. clusters and reconfigure); SWAPlet (limit clusters), limited DP; H-SWAPs (collection of customized SWAPlets), each with limited DP]

Sample points in architecture exploration
- DSPs (1 cluster): ILP, subword
- SWAPs (multiple): ILP, subword, DP
- H-SWAPs (optimized for handsets): ILP, subword, DP, task pipelining, custom ALUs
Programmable solutions with increased customization: performance, power benefits

Future research: Efficient algorithms
Multiple Antenna Systems

Future research: Architectures
Generalized framework and tools for evaluating algorithm-architecture and area-time-power-flexibility trade-offs
Potential applications
- Image processing:
  - Cameras: variety of compression algorithms
- Biomedical applications:
  - Hearing aids: DSP running on body heat*
- Sensor networks:
  - Compression of data before transmission
*Quote: Gene Frantz, TI Fellow

SWAPs: Flexibility, Performance, Power
- Need flexible architectures for future wireless devices
  - Higher data rates, lower power, more complex algorithms
- Rapid exploration for Scalable Wireless Application-specific Processors
  - Flexibility vs. performance trade-offs
- SWAPs: flexibility, high performance and low power
  - Exploit data parallelism like ASICs
  - 1-2 orders better performance than DSPs
  - Turn off unused clusters and unused ALUs for low power