Los Alamos National Lab Streams-C Maya Gokhale, Janette Frigo, Christine Ahrens, Marc Popkin- Paine Los Alamos National Laboratory Janice M. Stone Stone.

Slides:



Advertisements
Similar presentations
ECE Synthesis & Verification - Lecture 2 1 ECE 667 Spring 2011 ECE 667 Spring 2011 Synthesis and Verification of Digital Circuits High-Level (Architectural)
Advertisements

High Level Languages: A Comparison By Joel Best. 2 Sources The Challenges of Synthesizing Hardware from C-Like Languages  by Stephen A. Edwards High-Level.
Implementation Approaches with FPGAs Compile-time reconfiguration (CTR) CTR is a static implementation strategy where each application consists of one.
Parallell Processing Systems1 Chapter 4 Vector Processors.
Octavian Cret, Kalman Pusztai Cristian Vancea, Balint Szente Technical University of Cluj-Napoca, Romania CREC: A Novel Reconfigurable Computing Design.
Lecture 9: Coarse Grained FPGA Architecture October 6, 2004 ECE 697F Reconfigurable Computing Lecture 9 Coarse Grained FPGA Architecture.
SCORE - Stream Computations Organized for Reconfigurable Execution Eylon Caspi, Michael Chu, Randy Huang, Joseph Yeh, Yury Markovskiy Andre DeHon, John.
1 Performed By: Khaskin Luba Einhorn Raziel Einhorn Raziel Instructor: Rivkin Ina Spring 2004 Spring 2004 Virtex II-Pro Dynamical Test Application Part.
Term Project Overview Yong Wang. Introduction Goal –familiarize with the design and implementation of a simple pipelined RISC processor What to do –Build.
Aug. 24, 2007ELEC 5200/6200 Project1 Computer Design Project ELEC 5200/6200-Computer Architecture and Design Fall 2007 Vishwani D. Agrawal James J.Danaher.
Mahapatra-Texas A&M-Fall'001 cosynthesis Introduction to cosynthesis Rabi Mahapatra CPSC498.
Topics covered: CPU Architecture CSE 243: Introduction to Computer Architecture and Hardware/Software Interface.
Technion – Israel Institute of Technology Department of Electrical Engineering High Speed Digital Systems Lab Project performed by: Naor Huri Idan Shmuel.
GanesanP91 Synthesis for Partially Reconfigurable Computing Systems Satish Ganesan, Abhijit Ghosh, Ranga Vemuri Digital Design Environments Laboratory.
Center for Embedded Computer Systems University of California, Irvine and San Diego SPARK: A Parallelizing High-Level Synthesis.
1 Chapter 7 Design Implementation. 2 Overview 3 Main Steps of an FPGA Design ’ s Implementation Design architecture Defining the structure, interface.
Implementation of DSP Algorithm on SoC. Characterization presentation Student : Einat Tevel Supervisor : Isaschar Walter Accompany engineer : Emilia Burlak.
HW/SW Co-Design of an MPEG-2 Decoder Pradeep Dhananjay Kiran Divakar Leela Kishore Kothamasu Anthony Weerasinghe.
GallagherP188/MAPLD20041 Accelerating DSP Algorithms Using FPGAs Sean Gallagher DSP Specialist Xilinx Inc.
Students: Oleg Korenev Eugene Reznik Supervisor: Rolf Hilgendorf
Maria-Cristina Marinescu Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology High-level Specification and Efficient Implementation.
L29:Lower Power Embedded Architecture Design 성균관대학교 조 준 동 교수,
Introduction to FPGA AVI SINGH. Prerequisites Digital Circuit Design - Logic Gates, FlipFlops, Counters, Mux-Demux Familiarity with a procedural programming.
RM2D Let’s write our FIRST basic SPIN program!. The Labs that follow in this Module are designed to teach the following; Turn an LED on – assigning I/O.
Softcore Vector Processor Team ASP Brandon Harris Arpith Jacob.
COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.
1 of 23 Fouts MAPLD 2005/C117 Synthesis of False Target Radar Images Using a Reconfigurable Computer Dr. Douglas J. Fouts LT Kendrick R. Macklin Daniel.
A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT Nikolaos Vassiliadis N. Kavvadias, G. Theodoridis, S. Nikolaidis Section.
Automated Design of Custom Architecture Tulika Mitra
Research on Reconfigurable Computing Using Impulse C Carmen Li Shen Mentor: Dr. Russell Duren February 1, 2008.
1 H ardware D escription L anguages Modeling Digital Systems.
FPGA (Field Programmable Gate Array): CLBs, Slices, and LUTs Each configurable logic block (CLB) in Spartan-6 FPGAs consists of two slices, arranged side-by-side.
Array Synthesis in SystemC Hardware Compilation Authors: J. Ditmar and S. McKeever Oxford University Computing Laboratory, UK Conference: Field Programmable.
1 Fly – A Modifiable Hardware Compiler C. H. Ho 1, P.H.W. Leong 1, K.H. Tsoi 1, R. Ludewig 2, P. Zipf 2, A.G. Oritz 2 and M. Glesner 2 1 Department of.
Lithographic Aerial Image Simulation with FPGA based Hardware Acceleration Jason Cong and Yi Zou UCLA Computer Science Department.
Introduction to FPGA Created & Presented By Ali Masoudi For Advanced Digital Communication Lab (ADC-Lab) At Isfahan University Of technology (IUT) Department.
HYPER: An Interactive Synthesis Environment for Real Time Applications Introduction to High Level Synthesis EE690 Presentation Sanjeev Gunawardena March.
® Java Debug Hardware Modules Using JBits by Jonathan Ballagh Eric Keller Peter Athanas Reconfigurable Architectures Workshop 2001.
Lecture 16: Reconfigurable Computing Applications November 3, 2004 ECE 697F Reconfigurable Computing Lecture 16 Reconfigurable Computing Applications.
Parallel architecture Technique. Pipelining Processor Pipelining is a technique of decomposing a sequential process into sub-processes, with each sub-process.
Computer Architecture And Organization UNIT-II General System Architecture.
Algorithm and Programming Considerations for Embedded Reconfigurable Computers Russell Duren, Associate Professor Engineering And Computer Science Baylor.
PROJECT - ZYNQ Yakir Peretz Idan Homri Semester - winter 2014 Duration - one semester.
Lecture 12: Reconfigurable Systems II October 20, 2004 ECE 697F Reconfigurable Computing Lecture 12 Reconfigurable Systems II: Exploring Programmable Systems.
Evaluating and Improving an OpenMP-based Circuit Design Tool Tim Beatty, Dr. Ken Kent, Dr. Eric Aubanel Faculty of Computer Science University of New Brunswick.
Introduction to VHDL Simulation … Synthesis …. The digital design process… Initial specification Block diagram Final product Circuit equations Logic design.
1)Leverage raw computational power of GPU  Magnitude performance gains possible.
CPS 4150 Computer Organization Fall 2006 Ching-Song Don Wei.
RTL Design Methodology Transition from Pseudocode & Interface
Graphical Design Environment for a Reconfigurable Processor IAmE Abstract The Field Programmable Processor Array (FPPA) is a new reconfigurable architecture.
Exploiting Parallelism
04/26/20031 ECE 551: Digital System Design & Synthesis Lecture Set : Introduction to VHDL 12.2: VHDL versus Verilog (Separate File)
VHDL and Hardware Tools CS 184, Spring 4/6/5. Hardware Design for Architecture What goes into the hardware level of architecture design? Evaluate design.
Approximate Computing on FPGA using Neural Acceleration Presented By: Mikkel Nielsen, Nirvedh Meshram, Shashank Gupta, Kenneth Siu.
DDRIII BASED GENERAL PURPOSE FIFO ON VIRTEX-6 FPGA ML605 BOARD PART B PRESENTATION STUDENTS: OLEG KORENEV EUGENE REZNIK SUPERVISOR: ROLF HILGENDORF 1 Semester:
Compilation of XSLT into Dataflow Graphs for Web Service Composition Peter Kelly Paul Coddington Andrew Wendelborn.
Los Alamos National Laboratory Streams-C Maya Gokhale Los Alamos National Laboratory September, 1999.
K-Nearest Neighbor Digit Recognition ApplicationDomainConstraintsKernels/Algorithms Voice Removal and Pitch ShiftingAudio ProcessingLatency (Real-Time)FFT,
CoDeveloper Overview Updated February 19, Introducing CoDeveloper™  Targeting hardware/software programmable platforms  Target platforms feature.
Backprojection Project Update January 2002
RTL Design Methodology Transition from Pseudocode & Interface
SOFTWARE DESIGN AND ARCHITECTURE
Introduction to Programmable Logic
Introduction Introduction to VHDL Entities Signals Data & Scalar Types
Introduction to cosynthesis Rabi Mahapatra CSCE617
High Level Synthesis Overview
RTL Design Methodology
RTL Design Methodology Transition from Pseudocode & Interface
Digital Designs – What does it take
Presentation transcript:

Los Alamos National Lab Streams-C Maya Gokhale, Janette Frigo, Christine Ahrens, Marc Popkin- Paine Los Alamos National Laboratory Janice M. Stone Stone Ergonaut Sc2 C-to-FPGA Compiler

Los Alamos National Lab Overview Language  C subset augmented with parallel communicating processes  FIFO-based streams to communicate data between processes  Signals and Parameters for coordination and flow control  Process located on hardware (FPGA) or software (Linux PC) Compiler  Based on Stanford University Intermediate Format (SUIF) library  Targets Linux PC based AMS Firebird board  Easily re-targetable: board architecture described in a file  Generates Register-Transfer-Level VHDL  Source Code available at Applications  Signal and image processing  Fixed point, use external memory and Block RAM

Los Alamos National Lab Sc2 Processes Process body (the code it contains) is described in a process function  ///PROCESS_FUN directive describes a process function  Process function header describes streams, signals, and parameters that the process function uses Each process is an independent unit  ///PROCESS directive describes the process Processes execute concurrently  Sc_initiate intrinsic is used to start a process  Any software process may initiate another software process or hardware process Arrays of processes can be defined

Los Alamos National Lab Example: Process Function directives /// PROCESS_FUN host1_run /// OUT_STREAM sc_uint32 output_stream /// PROCESS_FUN_BODY... /// PROCESS_FUN_END /// PROCESS_FUN controller_run /// IN_STREAM sc_uint32 input_stream /// OUT_STREAM sc_uint32 output_stream /// PROCESS_FUN_BODY... /// PROCESS_FUN_END Two process functions with input and output streams

Los Alamos National Lab Example: Process and Connect Directives /// PROCESS controller PROCESS_FUN controller_run TYPE HP ON PE0 /// PROCESS host1 PROCESS_FUN host1_run /// CONNECT host1.output_stream controller.input_stream host1 controller Connections can also be described graphically, and /// directives are generated

Los Alamos National Lab Streams and Signals Streams transmit data between processes Streams can be defined between software, hardware-software, and hardware processes Stream intrinsic functions are defined to  Read  Check for end of stream  Write Hardware streams are implemented as hardware FIFOs with user-defined FIFO depth in the Streams-C hardware library Software streams are managed by the thread-based Streams-C software runtime library Signals are used to synchronize processes and coordinate phases of processing Signal intrinsic functions are defined to  Post a signal, along with a single word of data  Wait for a signal and receive a single word of data Hardware and software signal implementation is similar to streams Parameters provide a mechanism for giving each newly initiated process a word of unique data.

Los Alamos National Lab Sc2 Code Example: Polyphase Filter while( !sc_stream_eos(input_stream) ) { #pragma SC pipeline for (i=0 ; i < 4; i++) { #pragma SC unroll 4 s[i] = sc_bit_extract(data, 0, 8); in1[i] = C1[i] * s[i]; in2[i] = C2[i] * s[i]; in3[i] = C3[i] * s[i]; in4[i] = C4[i] * s[i]; if (evenp) { y1[i] = e1[i] + (sc_uint8)sc_bit_extract(in1[i],12,8); e1[i] = e2[i] + (sc_uint8)sc_bit_extract(in2[i],12,8); e2[i] = e3[i] + (sc_uint8)sc_bit_extract(in3[i],12,8); e3[i] = sc_bit_extract(in4[i],12,8); sc_bit_insert(data_o, 0, 8, y1[i]); } else { y2[i] = o1[i] + (sc_uint8)sc_bit_extract(in4[i],12,8); o1[i] = o2[i] + (sc_uint8)sc_bit_extract(in3[i],12,8); o2[i] = o3[i] + (sc_uint8)sc_bit_extract(in2[i],12,8); o3[i] = sc_bit_extract(in1[i],12,8); sc_bit_insert(data_o, 0, 8, y2[i]); } /* end for loop */ sc_stream_write(output_stream, data_o); /* filter output */ data = sc_stream_read(input_stream);} evenp = !evenp; } Access to memory arrays automatically scheduled Loop is pipelined Loop is unrolled

Los Alamos National Lab Analysis & Scheduling Sequence info + Datapath Operations Synthesis Hardware Library Place and Route Place and Route VHDL Generator VHDL Generator RTL for Processing Element RTL for Processing Element RTL for Processing Element RTL for Processing Element RTL for Processing Element RTL for Processing Element Configuration Bit Stream Processing Element Configuration Bit Stream Processing Element Configuration Bit Stream Processing Element Configuration Bit Stream Processing Element Configuration Bit Stream Processing Element Configuration Bit Stream Process Run Methods Process Functions Host Processes Software processes Hardware processes Hardware Synthesis Compiler Functional Simulator Compiler Structure Software Library

Los Alamos National Lab Synthesis Compiler Features Uses the SUIF 1.3 library (suif.stanford.edu) Uses Tim Callahan’s inline pass to inline function calls Optimizations include  SUIF optimizations such as constant folding, common sub-expression elimination, dead code elimination  Loop pipelining of innermost loops  Loop unrolling (directive) Compiler schedules sequential code, performs fine-grained parallelization Compiler reads board architecture from a file  Easily retargetable Compiler source is available at rcc.lanl.gov

Los Alamos National Lab Board Definition File Memory Type EXTERNAL64 Data size 64 bits Read/Write Port OUT MAR width 32 bits BUFFER MDR width 64 bits OUT R_EN width 1 bit OUT W_EN width 1 bit Identify MAR_name MAR Identify MDR_name MDR Identify Read_enable_name R_EN Identify Write_enable_name W_EN Load latency 7 MAR, MDR, MDR, MDR, MDR, MDR, MDR, MDR Store latency 1 (MAR, MDR) Memcopy latency 8 MAR, MDR, MDR, MDR, MDR, MDR, MDR,(MAR,MDR) Architecture Firebird Board Virtex2000 Processor PE0 4 EXTERNAL64 memory mem_ size memory-number 0 controller Mem641 generics ( schedule = priority, LADbase=0x1000, LADinc=0x200, mem_component=EXTERNAL)

Los Alamos National Lab Applications Poly phase filter bank of four  Ppf_a: 32-bit stream input data, external memory for coefficients  Ppf_ab: 32-bit stream input data, block ram for coefficients  Ppf1: 64-bit external memory input data, registers for coefficients  Ppf: 32-bit stream input data, registers for coefficients K-Means Clustering  Unsupervised clustering of multi-spectral data  32-bit stream input data, block ram for centers Fast Folding  Modified butterfly FFT Performance evaluation in progress – automatically generated hardware ppf1 faster than GHz Pentium Applications source code available on web site

Los Alamos National Lab Summary Streams-C compiler - synthesizes hardware circuits that run on reconfigurable hardware from parallel C programs C-to-hardware tool with parallel programming model and efficient hardware libraries Functional Simulator runs on host workstation x faster development time over hand-coded Performance comparable to GHz Pentium Open Source – we welcome collaborations