SCORE - Stream Computations Organized for Reconfigurable Execution Eylon Caspi, Michael Chu, Randy Huang, Joseph Yeh, Yury Markovskiy Andre DeHon, John.


SCORE - Stream Computations Organized for Reconfigurable Execution Eylon Caspi, Michael Chu, Randy Huang, Joseph Yeh, Yury Markovskiy Andre DeHon, John Wawrzynek U.C. Berkeley BRASS group

Outline Lecture 1 – Introduction – Related Work – SCORE Computational Model – Hardware Requirements – Language Instantiation Lecture 2 – Execution Example – SCORE Run-Time Environment – Example: JPEG – Results and Conclusion

Introduction Problem: Lack of a unifying computational model that allows application portability and longevity without sacrificing a substantial fraction of raw capabilities. Solution: Stream-based compute model. Divide computation into fixed-size “pages.” Time-multiplex “pages” onto hardware.

Introduction SCORE – Ease development, deployment, and range of RC applications – Efficient implementation maximizing resources

Introduction Current Issues? – Existing targets not portable Software for RC hardware tied to a particular device – Existing targets expose fixed resource limitations Impaired expressiveness Algorithms used restricted by available hardware No dynamic resource allocation Addressing Issues – Virtualize computation, communication, and memory resources – Convenient and efficient model

Introduction SCORE - Programming model is a natural abstraction of communication between spatial hardware blocks. A data flow communications graph captures the blocks of computation (operators) and the communication (streams) between them. The graph can then be captured and mapped to hardware efficiently.

Related Work Villasenor et al., circa 1995 – Motion-wavelet video coder – Hand-partitioned design into “pages” and manually reconfigured each device Ran on 1/3 as many machines Only experienced 10% overhead SCORE builds on: – Instruction Set Architecture, Data Flow, Distributed and streaming computation models – PRISC, DISC, GARP

SCORE Computational Model Compute Model – Abstract model capturing essential semantics of computation Programming Model – Programming constructs providing convenient way to express computations in the compute model Execution Model – Low-level description of the computation and the semantics which the hardware is expected to provide when interpreting this description

Compute Model Graph of computation operators and memory blocks linked together by streams Streams – Provide node-to-node communication – Single source, single sink FIFO Queues Operators – Finite State Machine (FSM) node Interact via stream links – Turing Complete (TM) node Support resource allocation and stream operations
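The stream abstraction above can be sketched in a few lines of Python. This is only an illustrative model, not SCORE code: the class name `Stream` and its methods (`attach_source`, `attach_sink`, `has_token`) are hypothetical names chosen for this sketch. It shows the two properties the slide lists: FIFO ordering and exactly one source and one sink per stream.

```python
from collections import deque


class Stream:
    """Hypothetical model of a single-source, single-sink FIFO stream."""

    def __init__(self, name):
        self.name = name
        self._fifo = deque()   # tokens in arrival order
        self.source = None     # the one operator allowed to write
        self.sink = None       # the one operator allowed to read

    def attach_source(self, op):
        # Enforce "single source": a second producer is an error.
        if self.source is not None:
            raise ValueError(f"stream {self.name} already has a source")
        self.source = op

    def attach_sink(self, op):
        # Enforce "single sink": a second consumer is an error.
        if self.sink is not None:
            raise ValueError(f"stream {self.name} already has a sink")
        self.sink = op

    def has_token(self):
        # Data presence: True when at least one token is waiting.
        return len(self._fifo) > 0

    def write(self, token):
        self._fifo.append(token)

    def read(self):
        # Tokens leave in the order they were written.
        return self._fifo.popleft()

    def __len__(self):
        return len(self._fifo)
```

Because each stream has one writer and one reader, the order of tokens seen by the consumer depends only on what the producer wrote, not on when either side ran.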

Compute Model Operations are fully deterministic – Determinism of individual operators – Timing independent communication – Operators cannot side-effect each other’s state 1. Communicate through streams which guarantee a timing independent order of execution 2. Memory segments have single unique owner (no multiple read-write hazards)

Programming Model Framework independent of device limits Guidelines for efficient execution on any hardware implementation Key Abstractions for Programming model – Operators – Streams – Memory Segments

Programming Model Operators – Represent an algorithmic transformation of input data to produce output data – Building blocks for computation (Multiplier, FIR, FFT) – Size of operator in hardware is implementation dependent and is not limited by the programming model – Partitioning is an integral part of the automated compilation process

Programming Model Streams – Communication uses streaming data flow – Producer connected to consumer via streams – Defines where data is logically routed – Acts as unbounded length queue for data tokens – Data Presence Signals Operators signal when producing data and consuming data

Programming Model Memory Segments – Contiguous block of memory – serves as the basic unit for memory management – used by giving a specific operating mode, then linking it into a data flow graph

Programming Model Dynamic Features – Dynamic rate operators Consume / produce tokens at data-dependent rates Efficient operators for tasks: – Data Compression (JPEG), decompression, searching, and filtering Scheduling decisions should be made at Run Time – Dynamic graph composition and instantiation Computational graphs can be created, extended or modified during execution – Dynamic handling of uncommon events (Exception Handling)
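A dynamic-rate operator can be illustrated with run-length encoding, a simple stand-in for the compression tasks named above. The function name `run_length_encode` is an assumption of this sketch and does not come from the slides. The number of input tokens consumed per output token depends entirely on the data, so no fixed consume/produce ratio can be known before run time.

```python
def run_length_encode(tokens):
    """Emit (value, count) pairs; tokens consumed per output vary with the data."""
    it = iter(tokens)
    try:
        current = next(it)
    except StopIteration:
        return  # empty input stream produces no output
    count = 1
    for tok in it:
        if tok == current:
            count += 1               # consume without producing
        else:
            yield (current, count)   # produce one token for the finished run
            current, count = tok, 1
    yield (current, count)           # flush the final run
```

For example, `list(run_length_encode([7, 7, 7, 1, 2, 2]))` consumes six tokens and yields three pairs, while an input with no repeats would yield one pair per token.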

Execution Model 3 Key Components – Compute Page (CP) fixed size block of RC logic which is the basic unit of virtualization and scheduling – Memory Segment contiguous block of memory which is the basic unit for data page management – Stream Link logical connection between the output of one page and the input of another page

Hardware Virtualization Compute pages, segments, and streams fundamental units for – allocation – virtualization – management of hardware resources

Example of Stream Buffer Execution

Model Implications Advice for Programmers – Describe computations as spatial pipelines with multiple, independent computational paths – Avoid or minimize feedback cycles – Expose large data streams to SCORE operators

Hardware Requirements Sequential Processor and RC device RC Device divided into a number of equivalent and independent compute pages Multiple distributed memory blocks required to store intermediate data High bandwidth, Low Latency communication, among compute pages and memory, allowing memory pages to be used concurrently

Language Instantiation One could define – subsets of conventional HDLs – subsets of conventional programming languages (C++, Java) Instead they define – RTL language to describe SCORE operators TDF: Intermediate language

Language Requirements SCORE Operators are synchronous, single clock entities with their own state – Communicate only through designed I/O streams – Operation is gated by data presence on the I/O streams – Each operation is viewed as a FSM with associated Data Path SCORE does not have a global shared memory abstraction among operators – Remember memory segments (no two operators can share memory at same time)

TDF RTL description with special syntax for handling input and output data streams from the operator – Data Path operators similar to C To allow dynamic operators, basic form is FSM – Each State specifies the inputs which must be present before it can “fire” – When input arrives, operator consumes the inputs and the FSM may choose to change states
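The firing rule can be sketched as follows. This is Python, not TDF syntax, and the `SelectOperator` class, its state names, and its input names are all hypothetical. Each state lists the inputs that must hold a token before the state may fire; firing consumes those tokens and may change state.

```python
from collections import deque


class SelectOperator:
    """Hypothetical FSM operator: read a selector, then forward one token from a or b."""

    def __init__(self):
        self.inputs = {"sel": deque(), "a": deque(), "b": deque()}
        self.output = deque()
        self.state = "read_sel"
        # Inputs that must hold a token before each state may fire.
        self.needs = {"read_sel": ["sel"], "take_a": ["a"], "take_b": ["b"]}

    def can_fire(self):
        # An empty deque is falsy, so this checks data presence.
        return all(self.inputs[name] for name in self.needs[self.state])

    def fire(self):
        """Fire the current state once; return False if it is stalled on input."""
        if not self.can_fire():
            return False
        if self.state == "read_sel":
            choose_a = self.inputs["sel"].popleft()
            self.state = "take_a" if choose_a else "take_b"
        else:
            src = "a" if self.state == "take_a" else "b"
            self.output.append(self.inputs[src].popleft())
            self.state = "read_sel"
        return True
```

Note that a token waiting on input `a` does not let the operator fire while it is in `take_b`: only the inputs named by the current state gate execution.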

END PART 1 Tune in next week for exciting examples

Execution Example Reference Figure 16 – Shows example of C++ program which uses the merge and uniq operators * SCORE operator instantiation and composition can be performed from C++ code
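Figure 16 is not reproduced in this transcript. As a rough analogue only, the composition of merge and uniq operators can be sketched with Python generators. The sketch assumes that merge combines two ascending streams into one ascending stream and that uniq drops adjacent duplicates; the slides do not define either operator, and the function `merge3_uniq` is a hypothetical name.

```python
def merge(xs, ys):
    """Merge two ascending token streams into one ascending stream (assumed semantics)."""
    xs, ys = iter(xs), iter(ys)
    end = object()  # private end-of-stream marker
    x, y = next(xs, end), next(ys, end)
    while x is not end or y is not end:
        if y is end or (x is not end and x <= y):
            yield x
            x = next(xs, end)
        else:
            yield y
            y = next(ys, end)


def uniq(xs):
    """Drop adjacent duplicate tokens (assumed semantics)."""
    none = object()  # marks "no token seen yet"
    last = none
    for x in xs:
        if last is none or x != last:
            yield x
            last = x


def merge3_uniq(i0, i1, i2):
    # Compose operators by wiring one operator's output stream into the next.
    return uniq(merge(merge(i0, i1), i2))
```

Calling `list(merge3_uniq([1, 3, 5], [1, 2, 5], [2, 4]))` pulls tokens through all three operators on demand, which mirrors how a host program might build a small operator graph and then read its output stream.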

Example - Assumptions Design consists of 3 behavioral operators – Full implementation of each operator requires only one compute page The RC array contains one compute page and three configurable memory blocks (CMBs) – Each CMB partitioned into 4 segments (s0 - s3) s0 and s1 buffer computation data s2 and s3 store state / configuration for a compute page

Example - Assumptions CMB state maintained by controller – Details are not shown in this example Each compute page has 2 input and 2 output FIFO buffers Scheduling and array reconfiguration are performed at the beginning of each timeslice

Execution Example Physical view of array at each point in timeline Single Letter identifiers assigned – A: merge (inputs i0, i1) – B: merge (inputs t1, t2) – C: uniq – Segments: S0, S1
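The time-multiplexing of operators A, B, and C onto one compute page can be approximated by the simulation below. It is a sketch under several assumptions that the slides do not state: round-robin scheduling, a fixed number of firings per timeslice, an explicit end-of-stream marker `EOS`, sorted-merge and adjacent-duplicate semantics for merge and uniq, and a wiring in which t1 is A's output and t2 is a third input. Plain deques stand in for the buffering segments.

```python
from collections import deque

EOS = object()  # end-of-stream marker (an assumption of this sketch)


class Merge:
    def __init__(self, name, in0, in1, out):
        self.name, self.in0, self.in1, self.out = name, in0, in1, out
        self.done = False

    def step(self):
        """Fire once if both inputs hold a token; return True on progress."""
        if self.done or not self.in0 or not self.in1:
            return False
        a, b = self.in0[0], self.in1[0]
        if a is EOS and b is EOS:
            self.in0.popleft()
            self.in1.popleft()
            self.out.append(EOS)
            self.done = True
        elif b is EOS or (a is not EOS and a <= b):
            self.out.append(self.in0.popleft())
        else:
            self.out.append(self.in1.popleft())
        return True


class Uniq:
    def __init__(self, name, inp, out):
        self.name, self.inp, self.out = name, inp, out
        self.last = EOS  # reused here to mean "no token seen yet"
        self.done = False

    def step(self):
        if self.done or not self.inp:
            return False
        tok = self.inp.popleft()
        if tok is EOS:
            self.out.append(EOS)
            self.done = True
        elif self.last is EOS or tok != self.last:
            self.out.append(tok)
            self.last = tok
        return True


def run_example(slice_len=2):
    """Round-robin three operators over one page; return (output, schedule trace)."""
    i0 = deque([1, 3, 5, EOS])
    i1 = deque([1, 2, 5, EOS])
    t2 = deque([2, 4, EOS])
    t1, t3, out = deque(), deque(), deque()
    ops = [Merge("A", i0, i1, t1), Merge("B", t1, t2, t3), Uniq("C", t3, out)]
    trace, t = [], 0
    while not all(op.done for op in ops):
        op = ops[t % len(ops)]  # "reconfigure" the single page for this slice
        fired = 0
        while fired < slice_len and op.step():
            fired += 1
        trace.append((op.name, fired))
        t += 1
    return [x for x in out if x is not EOS], trace
```

Running `run_example` with different `slice_len` values changes the trace but not the output stream, which is the timing-independent behavior the compute model asks for.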

Timeline for Execution Example

Step-by-Step Execution Example

SCORE Run-Time Environment Building Applications Run-Time Environment

Example: JPEG

Conclusion

Figure 18

Figure 19

Figure 20

Table 2

Figure 21

Figure 4