Architectural Optimization of Decomposition Algorithms for Wireless Communication Systems Ali Irturk †, Bridget Benson †, Nikolay Laptev ‡, Ryan Kastner.

Presentation transcript:

Architectural Optimization of Decomposition Algorithms for Wireless Communication Systems
Ali Irturk †, Bridget Benson †, Nikolay Laptev ‡, Ryan Kastner †
† Department of Computer Science and Engineering, University of California, San Diego {airturk, b1benson,
‡ Department of Computer Science, University of California, Los Angeles
April 2009

Motivation
• Matrix decompositions are essential computations for wireless communications.
• Matrix decompositions simplify matrix inversion, which is used in:
  • equalization algorithms, to remove the effect of the channel on the signal;
  • minimum mean square error algorithms, for pre-coding in spatial multiplexing;
  • detection-estimation algorithms in space-time coding.
(Figure labels: QR, A^-1)

Motivation
• A number of tools translate Matlab algorithms into a hardware description language.
• However, we believe that the majority of these tools take the wrong approach.
• We take a more focused approach: we develop a tool that specifically targets matrix computation algorithms.

Computing Platforms
Platforms shown: ASICs, DSPs, FPGAs, GPU, CELL BE
• ASICs: exceptional performance; long time to market; substantial costs
• DSPs: ease of development; fast time to market; low performance
• FPGAs: ease of development; fast time to market; ASIC-like performance

Major Contributions
• Design of a novel tool, GUSTO, for automatic generation and optimization of application-specific matrix computation architectures from a given Matlab algorithm.
• Comparison of different matrix decomposition methods across matrix dimensions, bit widths and degrees of parallelism.
• Thorough study of area and throughput tradeoffs of matrix decomposition architectures using different parameterizations.
• A case study: implementation of an Adaptive Weight Calculation core using the QRD-RLS algorithm.

GUSTO: General architecture design Utility and Synthesis Tool for Optimization
GUSTO is
• an easy-to-use tool for more efficient design space exploration and development;
• a tool that automatically generates and optimizes application-specific architectures;
• able to create a prototype hardware system in just minutes instead of days or weeks.
(Figure labels: Algorithm (e.g. QR decomposition); Bit width (e.g. 19 bits of precision); Resource Allocation (e.g. 4 multipliers and 3 adders); Modes (e.g. heterogeneous cores connected using hierarchical datapaths); HDL files; Error Analysis, plotted as Average Error against Number of bits used)

Outline
• Motivation
• GUSTO: Design Tool and Methodology
• Decomposition Methods
• Results
  • Inflection Point Analysis
  • Architectural Design Alternatives
• Conclusions

GUSTO Design Flow
(Flowchart stages: Algorithm Analysis; Instruction Generation; Resource Allocation; Error Analysis; Architecture Generation; Collecting Scheduling Information; Resource Trimming for Hardware Optimization. Other labels: Algorithm; Type and # of Arithmetic Resources; Design Library; Data Representation; Area, Latency and Throughput Results; Simulation Results; General Purpose Architecture; Application Specific Architecture)

GUSTO Design Flow: Algorithm Analysis
• GUSTO provides options to divide the given algorithm into smaller processing elements (PEs), each small in area and highly optimized for throughput.
(Figure labels: Algorithm; Processing Element (PE) with instruction controller, memory controller, adders (A) and multipliers (M); Software Defined Radio)

GUSTO Design Flow: Instruction Generation and Resource Allocation
• GUSTO uses instruction scheduling for better resource utilization and provides different scheduling methods.
• GUSTO generates resource-constrained architectures, i.e. the user chooses the number and type of arithmetic units.
(Figure labels: Type and # of Arithmetic Resources; Design Library (+, -, *, /); Processing Element with instruction controller, memory controller, adders (A) and multipliers (M))
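
The resource-constrained scheduling idea on this slide can be illustrated with a small greedy list scheduler. This is only a sketch under stated assumptions, not GUSTO's scheduler: the function name `list_schedule`, the `(name, unit_type, dependencies)` instruction format and the one-cycle latency of every operation are hypothetical choices made for the example.

```python
from collections import defaultdict


def list_schedule(instructions, resources):
    """Greedy resource-constrained list scheduling (illustrative only).

    instructions: list of (name, unit_type, deps) tuples in program order.
    resources:    dict mapping unit_type to the number of units available.
    Returns a dict mapping each instruction name to its issue cycle.
    Assumption: every operation takes exactly one cycle.
    """
    schedule = {}
    pending = list(instructions)
    cycle = 0
    while pending:
        used = defaultdict(int)  # units of each type already busy this cycle
        waiting = []
        for name, unit, deps in pending:
            # An operand is usable only if it was produced in an earlier cycle.
            ready = all(d in schedule and schedule[d] < cycle for d in deps)
            if ready and used[unit] < resources.get(unit, 0):
                schedule[name] = cycle
                used[unit] += 1
            else:
                waiting.append((name, unit, deps))
        if len(waiting) == len(pending):
            # Nothing could be issued, so nothing ever will be.
            raise ValueError("unschedulable: missing dependency or resource")
        pending = waiting
        cycle += 1
    return schedule


# Hypothetical instruction stream for s2 = a*b + c*d + e*f.
PROGRAM = [
    ("m1", "mul", []),
    ("m2", "mul", []),
    ("m3", "mul", []),
    ("s1", "add", ["m1", "m2"]),
    ("s2", "add", ["s1", "m3"]),
]

if __name__ == "__main__":
    for n_mul in (1, 2):
        result = list_schedule(PROGRAM, {"mul": n_mul, "add": 1})
        print(n_mul, "multiplier(s):", max(result.values()) + 1, "cycles")
```

With one multiplier the example needs four cycles and with two multipliers it needs three, which is the kind of area versus latency trade-off the user controls by choosing the number and type of arithmetic units.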

GUSTO Design Flow: Error Analysis
• GUSTO employs fixed-point arithmetic in the generated architectures.
• GUSTO performs error analysis to find an appropriate fixed-point representation, one whose results have accuracy similar to that of a floating-point implementation.
Error analysis metrics:
1) Mean Error
2) Peak Error
3) Standard Deviation of Error
4) Mean Percentage Error
(Figure labels: User Defined Input Data; GUSTO: Fixed Point Arithmetic Results (using variable bit width); MATLAB: Floating Point Arithmetic Results (Single/Double precision))
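
The four metrics named on this slide can be sketched as follows. The slides do not define the metrics, so the definitions here (absolute error, population standard deviation, percentage error relative to the floating-point reference) are assumptions, as are the helper names `quantize`, `error_metrics` and `analyze`. Only the fractional bits are modelled; word length and saturation are ignored.

```python
import math


def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (no saturation)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


def error_metrics(reference, approx):
    """Compare fixed-point results against floating-point reference values.

    Assumed definitions: error = |reference - approx|.
    """
    errors = [abs(r - a) for r, a in zip(reference, approx)]
    n = len(errors)
    mean = sum(errors) / n
    # Percentage error is undefined where the reference is zero; skip those.
    rel = [e / abs(r) for e, r in zip(errors, reference) if r != 0]
    return {
        "mean": mean,
        "peak": max(errors),
        "std": math.sqrt(sum((e - mean) ** 2 for e in errors) / n),
        "mean_pct": 100 * sum(rel) / len(rel) if rel else 0.0,
    }


def analyze(values, frac_bits):
    """Quantize each value and report the error metrics for that bit width."""
    fixed = [quantize(v, frac_bits) for v in values]
    return error_metrics(values, fixed)


if __name__ == "__main__":
    data = [0.3, -1.7, 2.25, 0.8]  # made-up input data
    for bits in (4, 8, 12, 16):
        print(bits, analyze(data, bits))
```

Sweeping the number of fractional bits in this way gives an error curve against bit width, from which the smallest acceptable representation can be read off.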

GUSTO Design Flow: Architecture Generation
• GUSTO generates a CPU-like architecture with:
  • dynamic instruction scheduling;
  • dynamic memory assignments;
  • full connectivity between functional units.
(Figure labels: Instruction Controller; Memory Controller; Arithmetic Units (Adders, Multipliers))

GUSTO Design Flow: Collecting Scheduling Information
• GUSTO collects scheduling information from the instruction and memory controllers.
• GUSTO uses this information to eliminate unneeded resources, automatically creating a small, fast, statically scheduled architecture.
(Figure labels: Instruction Controller; Memory Controller; Arithmetic Units (Adders, Multipliers); Full Connectivity; Static Instruction Scheduling; Static Memory Assignments)

GUSTO Design Flow: Resource Trimming for Hardware Optimization
• GUSTO simulates the architecture to determine the usage of arithmetic units, multiplexers, register entries and input/output ports, and trims away the unused components together with their interconnects.
• GUSTO's optimization provides substantial silicon savings while ensuring the correctness of the solution.
(Figure labels: before trimming, Multiplier, Adder and Memory with Full Connectivity; after trimming, Multiplier, Adder and Memory with Required Connectivity)

GUSTO Trimming Feature
(Figure labels: arithmetic unit A with inputs In_A1, In_A2 and output Out_A; arithmetic unit B with inputs In_B1, In_B2 and output Out_B; memory with input In_mem1 and outputs Out_mem1, Out_mem2. Simulation runs are shown for unit A: candidate sources Out_A, Out_B, Out_mem1, Out_mem2 for In_A1 and In_A2.)

GUSTO Trimming Feature
(Figure labels: same units and ports as the previous slide. Simulation runs are shown for unit B: candidate sources Out_A, Out_B, Out_mem1, Out_mem2 for In_B1 and In_B2.)
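
The trimming step described on these slides can be approximated by a short sketch: start from full connectivity, replay the transfers observed during simulation, and keep only the connections that were actually used. The port names are taken from the slides, but the function name `trim_connections`, the trace format and the trace contents are hypothetical and are not GUSTO's implementation.

```python
def trim_connections(outputs, inputs, trace):
    """Return, for every input port, the sorted list of sources it needs.

    outputs, inputs: port names of the fully connected architecture.
    trace: iterable of (source_output, destination_input) transfers
           observed while simulating the scheduled algorithm.
    """
    full = {(src, dst) for src in outputs for dst in inputs}
    used = set(trace)
    unknown = used - full
    if unknown:
        raise ValueError(f"trace uses ports that do not exist: {sorted(unknown)}")
    # Each input keeps a multiplexer leg only for sources seen in the trace.
    return {dst: sorted(src for src, d in used if d == dst) for dst in inputs}


OUTPUTS = ["Out_A", "Out_B", "Out_mem1", "Out_mem2"]
INPUTS = ["In_A1", "In_A2", "In_B1", "In_B2", "In_mem1"]

# Made-up simulation trace; repeated transfers are harmless.
TRACE = [
    ("Out_mem1", "In_A1"),
    ("Out_mem2", "In_A2"),
    ("Out_A", "In_B1"),
    ("Out_mem1", "In_B2"),
    ("Out_B", "In_mem1"),
    ("Out_mem1", "In_A1"),
    ("Out_A", "In_A1"),
]

if __name__ == "__main__":
    required = trim_connections(OUTPUTS, INPUTS, TRACE)
    kept = sum(len(sources) for sources in required.values())
    print(f"kept {kept} of {len(OUTPUTS) * len(INPUTS)} connections")
    for port, sources in required.items():
        print(port, "<-", sources)
```

In this made-up trace only 6 of the 20 possible connections survive, so most multiplexer inputs and wires could be removed without changing the computed result.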

Matrix Decompositions: QR, LU and Cholesky
• QR: Given Matrix = Orthogonal Matrix × Upper Triangular Matrix
• LU: Given Matrix = Lower Triangular Matrix × Upper Triangular Matrix
• Cholesky: Given Matrix = Unique Lower Triangular Matrix (Cholesky triangle) × Transpose of Lower Triangular Matrix
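
The three factorizations on this slide can be checked numerically with a short pure-Python sketch. The slides do not say which algorithm variants are used, so the choices here (modified Gram-Schmidt for QR, Doolittle LU without pivoting, the standard Cholesky recurrence) and all function names are assumptions made for illustration. Inputs are assumed square and well conditioned, and Cholesky additionally needs a symmetric positive definite matrix.

```python
import math


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(row) for row in zip(*X)]


def max_abs_diff(X, Y):
    return max(abs(a - b) for rx, ry in zip(X, Y) for a, b in zip(rx, ry))


def qr(A):
    """A = Q R via modified Gram-Schmidt on the columns of A."""
    n = len(A)
    cols = transpose(A)
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for i, q in enumerate(q_cols):
            R[i][j] = sum(qk * vk for qk, vk in zip(q, v))
            v = [vk - R[i][j] * qk for qk, vk in zip(q, v)]
        R[j][j] = math.sqrt(sum(vk * vk for vk in v))
        q_cols.append([vk / R[j][j] for vk in v])
    return transpose(q_cols), R


def lu(A):
    """A = L U (Doolittle, unit diagonal in L, no pivoting)."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for j in range(i + 1, n):
            L[j][i] = (A[j][i] - sum(L[j][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U


def cholesky(A):
    """A = G G^T with G lower triangular (A symmetric positive definite)."""
    n = len(A)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(G[i][k] * G[j][k] for k in range(j))
            G[i][j] = math.sqrt(s) if i == j else s / G[j][j]
    return G


# Small symmetric positive definite example matrix.
A = [[4, 12, -16], [12, 37, -43], [-16, -43, 98]]

if __name__ == "__main__":
    Q, R = qr(A)
    L, U = lu(A)
    G = cholesky(A)
    print("QR residual:", max_abs_diff(matmul(Q, R), A))
    print("LU residual:", max_abs_diff(matmul(L, U), A))
    print("Cholesky residual:", max_abs_diff(matmul(G, transpose(G)), A))
```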

Matrix Inversion
• Given Matrix × Inverse Matrix = Identity Matrix
• Full matrix inversion is costly!
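
Why a decomposition helps with inversion can be written out in a few lines. These are standard linear-algebra identities rather than formulas taken from the slides; the symbols A (given matrix), Q (orthogonal), R and U (upper triangular), L and G (lower triangular) and I (identity) are chosen here for illustration.

```latex
\begin{align*}
  A A^{-1} &= I \\
  A = QR   &\;\Rightarrow\; A^{-1} = R^{-1} Q^{T}
           && \text{since } Q^{-1} = Q^{T} \\
  A = LU   &\;\Rightarrow\; A^{-1} = U^{-1} L^{-1} \\
  A = GG^{T} &\;\Rightarrow\; A^{-1} = \left(G^{-1}\right)^{T} G^{-1}
\end{align*}
```

In each case the only matrices that still have to be inverted are triangular, which can be done by forward or back substitution instead of a full inversion.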

Results: Inflection Point Analysis, Sequential

Results: Inflection Point Analysis, Parallel

Results: Finding the Optimal Hardware, Decomposition Methods
General Purpose Architecture vs. Application Specific Architecture
Decrease in Area (Percentage): QR 94%, LU 83%, Cholesky 86%

Results: Finding the Optimal Hardware, Decomposition Methods
General Purpose Architecture (Mode 1) vs. Application Specific Architecture (Mode 2)
Increase in Throughput (Percentage): QR 68%, LU 16%, Cholesky 14%

Results: Finding the Optimal Hardware, Matrix Inversion (using QR)
• Average of 59% decrease in area
• 3X increase in throughput

Results: Architectural Design Alternatives

Results: Comparison with Previously Published Work, Adaptive Weight Calculation (AWC) Core
Designs compared: Edman et al., Karkooti et al., Dick et al., GUSTO
Application: Matrix Inversion, Beamformer, AWC
Method: QR
Matrix Size: 4 × 4, 3 × 3, 5 × 5, 4 × 4
Bit width:
Data type: fixed, floating, NR, fixed
Device type: Virtex 2, Virtex 4
Slices:
DSP48s: NR
BRAMs: NR961
Throughput (10^6 × s^-1):
References:
F. Edman, V. Öwall, "A Scalable Pipelined Complex Valued Matrix Inversion Architecture", IEEE International Symposium on Circuits and Systems (2005).
M. Karkooti, J.R. Cavallaro, C. Dick, "FPGA Implementation of Matrix Inversion Using QRD-RLS Algorithm", Asilomar Conference on Signals, Systems and Computers (2005).
C. Dick, F. Harris, M. Pajic, D. Vuletic, "Real-Time QRD-Based Beamforming on an FPGA Platform", Asilomar Conference on Signals, Systems and Computers (2006).

GUSTO: General architecture design Utility and Synthesis Tool for Optimization
• GUSTO is a tool that provides automatic generation and optimization of a variety of application-specific processing elements (PEs) with different parameterization options.
• Current projects include the implementation of a Short Preamble Processing unit for an OFDM receiver design.
(Figure labels: Algorithm (e.g. QR decomposition); Bit width (e.g. 19 bits of precision); Resource Allocation (e.g. 4 multipliers and 3 adders); Modes (e.g. heterogeneous cores connected using hierarchical datapaths); HDL files; Error Analysis)

Thank You