Slide 1: Combined Instruction and Loop Level Parallelism for Regular Array Synthesis on FPGAs
ISSS'01, Montréal
S. Derrien, S. Rajopadhye, S. Sur-Kolay*
IRISA, France; *ISI Calcutta

Slide 2: Outline
- Context and motivation
- Space-time transformations
- Transformation flow
- Experimental validation
- Conclusion

Slide 3: High-performance IP cores
- High-level specifications
  - Matlab, C, C++, or a specific language (Alpha)
  - Targeting nested loops
  - Core must be formally correct
- Hardware/software co-generation
  - Hardware: RTL module (VHDL)
  - Simple driver API (C)
- Regular processor arrays
  - High data throughput, specialized datapath
  - Well suited for VLSI/FPGA

Slide 4: Targeting FPGAs
- Poor clock speed
  - Typical clock speed is 1/10 of ASIC speed, and very design dependent
  - Good at low-precision arithmetic (8 bits)
  - Really bad for complex operations (floats)
- But high performance
  - Optimized designs can compete with ASICs
  - Performance gain comes from parallelism
  - Pipelining comes for free (lots of DFFs)

Slide 5: Processor Array Synthesis
Matrix multiplication example:

    For i := 1 to 3
      For j := 1 to 3
        For k := 1 to 3
          C[i,j] := C[i,j] + A[i,k]*B[k,j];
        End for;
      End for;
    End for;

- The iteration domain is extracted from the loop bounds
- Data dependence vectors link the iterations
- The iteration domain is projected onto the processor grid
- Iterations are scheduled on their associated PE
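For concreteness, here is a minimal C model (mine, not the paper's) of the projected array for this example: the k axis is mapped to time, so PE(i,j) owns C[i,j] and performs one multiply-accumulate per time step.

    #include <stdio.h>

    #define N 3

    /* Software model of the projected array: processor coordinates (i,j),
     * schedule t = k. Each PE accumulates its own C[i][j] over time. */
    int main(void) {
        int A[N][N], B[N][N], C[N][N] = {{0}};

        /* Fill A and B with sample data. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = i + 1;
                B[i][j] = j + 1;
            }

        for (int t = 0; t < N; t++)          /* time step = k iteration   */
            for (int i = 0; i < N; i++)      /* spatial axis 1: PE row    */
                for (int j = 0; j < N; j++)  /* spatial axis 2: PE column */
                    C[i][j] += A[i][t] * B[t][j];

        printf("C[0][0] = %d\n", C[0][0]);
        return 0;
    }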

Slide 6: PE Architecture
- Combinational datapath connected to registers
- Unidirectional flow and pipelined connections
- N classes of registers (N = loop dimension)
  - Temporal registers act as local memory
  - Spatial registers serve as interconnect between PEs
  - Spatio-temporal registers must be disambiguated
- One critical path for each register class
- Operating frequency set by the worst critical path
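To make the register classes concrete, the following is a hypothetical C model of a single PE, assuming a matrix-multiply-style datapath (the names and structure are mine): the accumulator behaves as a temporal register (local memory), while the latched operands behave as spatial registers feeding neighbouring PEs.

    /* Hypothetical software model of one PE's register classes.
     * Temporal register: local state, written and read by the same PE.
     * Spatial registers: outputs latched for the neighbouring PEs. */
    typedef struct {
        int acc;    /* temporal register: local accumulator (e.g. C[i][j])   */
        int a_out;  /* spatial register: operand forwarded to east neighbour */
        int b_out;  /* spatial register: operand forwarded to south neighbour */
    } PE;

    /* One clock cycle: combinational datapath, then register update. */
    void pe_step(PE *pe, int a_in, int b_in) {
        pe->acc  += a_in * b_in;  /* multiply-accumulate datapath            */
        pe->a_out = a_in;         /* pass operands through spatial registers */
        pe->b_out = b_in;
    }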

Slide 7: Conclusion
- Simplistic schedule inside a PE (no ILP)
- Complex loop bodies induce poor performance
  - Floating-point matrix multiplication operating at 12 MHz
  - 2D SOR on 16 bits operating at 40 MHz
- This PE architecture is not suited to FPGAs!
- Proposed solution: allow pipelined datapaths by altering the PE architecture through simple space-time transformations

Slide 8: Retiming
(Figure: before retiming, Tc = 2 logic levels; after retiming, Tc = 1 logic level)
- Moves registers to minimize the clock period
- Handled by most FPGA RTL synthesis tools
- Efficient iff a sufficient number of registers is available
- We just need to add registers in the PE!
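A toy model of the idea (not from the paper): moving a register from the end of a two-operator path to between the operators halves the per-cycle logic depth at the cost of one cycle of latency. In this C sketch, f and g are stand-ins for the two logic levels.

    #include <stdio.h>

    /* Cycle-accurate toy model of retiming. Before: one register after
     * f(g(x)), so the critical path spans two operators. After: the
     * register sits between g and f, so each cycle evaluates only one
     * operator, with one extra cycle of latency. */
    static int g(int x) { return x + 3; }
    static int f(int x) { return x * 2; }

    int main(void) {
        int in[5] = {1, 2, 3, 4, 5};
        int reg = 0;  /* the retimed register, now holding g's result */

        for (int cycle = 0; cycle < 6; cycle++) {
            int out = f(reg);                   /* stage 2: one logic level */
            if (cycle < 5) reg = g(in[cycle]);  /* stage 1: one logic level */
            if (cycle >= 1)
                printf("cycle %d: out = %d\n", cycle, out);
        }
        return 0;
    }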

Slide 9: Serialization (1/2)
(Figure: original PE array before clustering; array after clustering)
- Regroups PEs into clusters
- Iterations in a cluster are executed sequentially
- Throughput is slowed down by the cluster size
- Local memory is duplicated

Slide 10: Serialization (2/2)
- Decomposed along each spatial dimension
- Serialization impacts the PE according to simple transformation rules:
  - Temporal registers are duplicated by the serialization factor σi
  - Feedback loops are created for all spatial paths along the i-th axis
- Loop-level parallelism is traded for instruction-level parallelism
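As a hedged illustration of these rules (names and structure are mine, not the paper's), here is a C model of a clustered PE with serialization factor SIGMA = 2, showing the duplicated temporal state and the sequential execution of the virtual PEs:

    #define SIGMA 2  /* serialization factor along one spatial axis */

    /* Hypothetical model of a clustered PE: one physical PE emulates
     * SIGMA virtual PEs sequentially, so its temporal register (the
     * accumulator) is duplicated SIGMA times, and a feedback loop
     * replaces the spatial connection inside the cluster. */
    typedef struct {
        int acc[SIGMA];  /* temporal registers, duplicated per virtual PE */
    } ClusteredPE;

    void clustered_pe_cycle(ClusteredPE *pe,
                            const int a_in[SIGMA],
                            const int b_in[SIGMA]) {
        /* The SIGMA virtual iterations execute back to back, so the
         * throughput per virtual PE drops by a factor of SIGMA. */
        for (int v = 0; v < SIGMA; v++)
            pe->acc[v] += a_in[v] * b_in[v];
    }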

Slide 11: Skewing
(Figure: skewing by factor 2 along the vertical PE axis)
- Affects latency, but not throughput
- Adds temporal registers along a spatial axis
- Can be used before and after serialization
- Cannot reduce the original temporal critical path
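For readers who know skewing as a loop transformation, the same effect can be sketched in C (my illustration; the factors and bounds are made up): skewing by S delays the p-th PE by S*p time steps, which the hardware realizes as S extra delay registers per hop along that axis.

    #define S 2   /* skew factor */
    #define T 4   /* original time steps */
    #define P 3   /* number of PEs along the skewed axis */

    /* Classical loop skewing: iteration (t, p) is remapped to skewed
     * time tt = t + S*p. Latency grows by S*(P-1) cycles, while the
     * throughput of one result per cycle is unchanged. */
    void skewed_schedule(void) {
        for (int tt = 0; tt < T + S * (P - 1); tt++)  /* skewed time   */
            for (int p = 0; p < P; p++) {
                int t = tt - S * p;                   /* original time */
                if (t >= 0 && t < T) {
                    /* body(t, p) executes here, at skewed time tt */
                }
            }
    }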

Slide 12: Problem formulation
- Find the optimal set of transformation parameters
- Minimize the number of registers
- Preserve loop-level parallelism
Example (per register class): Tc = 86 ns requires dj = 6 stages to reach Tc = 15 ns; Tc = 70 ns requires di = 5 stages; Tc = 60 ns requires dt = 4 stages.
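The stage counts quoted above are consistent with a simple ceiling rule, stated here as my reading of the slide rather than the paper's exact formulation: to bring a path of delay T_c under a target period, split it into

    d_a = \left\lceil \frac{T_c^{(a)}}{T_{\mathrm{target}}} \right\rceil
    \qquad\text{e.g.}\qquad
    \left\lceil \frac{86}{15} \right\rceil = 6, \quad
    \left\lceil \frac{70}{15} \right\rceil = 5, \quad
    \left\lceil \frac{60}{15} \right\rceil = 4 .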

Slide 13: Proposed heuristic
Goal: determine all the skewing parameters.
1. Assume the serialization factors σi are given (partitioning step)
2. Sort the PE space axes in ascending order of Tc
3. For each PE axis i:
   a. Pre-serialization skewing s_i^pre
   b. Serialization σi
4. For each PE axis i:
   a. Post-serialization skewing s_i^post
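A compact sketch of the heuristic's control flow as reconstructed above; all helper names are hypothetical stubs that merely trace the order of the steps, standing in for the paper's cost models and transformation rules.

    #include <stdio.h>

    static void sort_axes_by_ascending_Tc(void) { puts("sort axes by Tc"); }
    static void pre_serialization_skew(int i)   { printf("pre-skew axis %d\n", i); }
    static void serialize(int i, int sigma)     { printf("serialize axis %d by %d\n", i, sigma); }
    static void post_serialization_skew(int i)  { printf("post-skew axis %d\n", i); }

    void heuristic(int n_axes, const int sigma[]) {
        sort_axes_by_ascending_Tc();            /* step 2                    */
        for (int i = 0; i < n_axes; i++) {      /* step 3                    */
            pre_serialization_skew(i);          /* s_i^pre                   */
            serialize(i, sigma[i]);             /* sigma_i from partitioning */
        }
        for (int i = 0; i < n_axes; i++)        /* step 4                    */
            post_serialization_skew(i);         /* s_i^post                  */
    }

    int main(void) {
        int sigma[2] = {2, 2};  /* assumed given by the partitioning step */
        heuristic(2, sigma);
        return 0;
    }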

Slide 14: Transformation example
1. Pre-skew along axis y by factor s_y^pre = 1
2. Serialization along axis y by factor σy = 2
3. Pre-skew along axis x by factor s_x^pre = 2
4. Serialization along axis x by factor σx = 2
5. Post-skew along axis y by factor s_y^post = 1
6. Apply retiming

Slide 15: Experimental validation
- Chosen benchmarks
  - Matrix multiplication (8 bits, 16 bits, and floats)
  - Adaptive filter (DLMS) (8 bits, 16 bits, and floats)
  - String matching (DNA, protein)
- Performance metrics
  - A_pe: PE area usage
  - f_pe: PE operating frequency
  - Raw performance Φ = N_pe · f_pe, with N_pe approximated by 1/A_pe
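As a worked example of this metric with made-up numbers (not from the paper): a PE occupying 2% of the device, running at 50 MHz, gives

    \Phi \;=\; N_{pe} \cdot f_{pe} \;\approx\; \frac{f_{pe}}{A_{pe}},
    \qquad A_{pe} = 0.02,\ f_{pe} = 50\,\mathrm{MHz}
    \;\Rightarrow\; \Phi \approx \frac{50\,\mathrm{MHz}}{0.02} = 2500\ \mathrm{MOPS}.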

Slide 16: Area overhead
(Chart) The area overhead decreases as the combinational datapath area cost grows.

Slide 17: Frequency improvement
(Chart) Speed improvement of up to one order of magnitude (for floats).

Slide 18: Raw performance
(Chart) Speed improvement of up to one order of magnitude (for floats).

Slide 19: Conclusion
- Extracts very fine-grained ILP from the datapath as a whole
- Simple space-time transformations, yet they yield impressive results
- Preserves circuit correctness as well as control-logic regularity and simplicity
- Performance benefits are limited by the lack of place-and-route-aware retiming tools