BRASS Analysis of QuasiStatic Scheduling Techniques in a Virtualized Reconfigurable Machine Yury Markovskiy, Eylon Caspi, Randy Huang, Joseph Yeh, Michael.

Presentation transcript:

Analysis of Quasi-Static Scheduling Techniques in a Virtualized Reconfigurable Machine
Yury Markovskiy, Eylon Caspi, Randy Huang, Joseph Yeh, Michael Chu, John Wawrzynek (UC Berkeley BRASS Group)
André DeHon (California Institute of Technology)

BRASS, February 26, 2002, FPGA

Outline
- Hardware Virtualization
- SCORE model
- Run-time scheduler
  - Fully Dynamic
  - Quasi-Static
- Results
  - 7x reduction in scheduling overhead
  - Application performance improved by a factor of 2-7
- Conclusion

Hardware Virtualization
- Traditional mapping tools
  - Expose resource constraints to the designer
- HW virtualization enables:
  - App compatibility/longevity across a device family
  - Automatic performance scaling on larger devices

Stream Computation Organized for Reconfigurable Execution (SCORE) (1)
- Data-flow based framework
  - Programming Model
  - Execution Environment
  - Hardware Platform
- Programming Model
  - Streaming dataflow graph of operators (FSM + datapath)
    - Dynamic data-dependent behavior
    - Arbitrary size operators
- Run-time representation
  - Graph of fixed size compute pages
    - Akin to virtual memory pages
- Run-time scheduling is required to handle dynamic page behavior

Stream Computation Organized for Reconfigurable Execution (SCORE) (2)
- Hardware Platform
  - uP/Reconfigurable array hybrid
    - Array: compute pages (CP) and configurable memory blocks (CMB)
  - Stream interface between resources
  - Global Controller manages reconfiguration
- Array Reconfiguration
- Scheduler Operation
  - Temporal Partitioning
    - Buffer intermediate results
  - Resource Allocation/Mapping
    - Compute pages
    - Memory segments
    - Communication channels

Run-time Scheduler
- Run-time scheduling (late binding of resources)
  - Benefit: automatic performance scaling
  - Extra burden: the scheduler
    - Complex optimization with multiple simultaneous constraints (CPs, CMBs, and network)
    - NP-hard problem
- What is the right timeslice size?
  - Depends on an application's run-time behavior
  - Affected by the scheduler overhead (lower bound)
- Space of scheduling solutions
  - Range in quality and complexity
  - Tradeoffs: timeslice vs. asynchronous, or dynamic vs. static

Problem Statement
- SCORE micro-architecture
  - Parallel reconfiguration of independent CPs/CMBs
  - Reconfiguration time is thousands of cycles
- Problem
  - Investigate scheduling cost
  - Reduce it to a minimum (comparable to reconfiguration time)
  - Understand its effect on application run-times

Initial Scheduling Solution
- Version of priority-list scheduling
  - Availability of input tokens and output space determines the priority
  - Candidates are chosen by BFS
- Fixed timeslice size
- Fully Dynamic Scheduler
  - Perform the scheduling operation each timeslice
  - Large critical loop
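The priority-list idea on this slide can be sketched roughly as follows. This is an illustrative sketch only, not the SCORE scheduler: the function names, the dictionary-based graph, and the scoring rule (the smaller of available input tokens and free output space, with BFS order breaking ties) are all assumptions made for the example.

```python
from collections import deque


def bfs_order(graph, sources):
    """Return operators in breadth-first order starting from the sources."""
    seen, order, queue = set(sources), [], deque(sources)
    while queue:
        node = queue.popleft()
        order.append(node)
        for succ in graph.get(node, []):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return order


def schedule_timeslice(graph, sources, input_tokens, output_space, num_pages):
    """Pick up to num_pages operators to run in one timeslice.

    Assumed priority: min(available input tokens, free output space).
    Operators that cannot make progress (priority 0) are skipped.
    """
    candidates = bfs_order(graph, sources)
    scored = [
        (min(input_tokens[op], output_space[op]), -position, op)
        for position, op in enumerate(candidates)
    ]
    runnable = sorted((s for s in scored if s[0] > 0), reverse=True)
    return [op for _, _, op in runnable[:num_pages]]


# Hypothetical four-operator graph with two compute pages available.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
tokens = {"a": 8, "b": 0, "c": 4, "d": 2}
space = {"a": 8, "b": 8, "c": 4, "d": 1}
chosen = schedule_timeslice(graph, ["a"], tokens, space, num_pages=2)
```

A fully dynamic scheduler would repeat a selection like this every timeslice, which is why the slide calls the critical loop large.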

Fully Dynamic Scheduler (1)
- Two types of overhead:
  - Scheduler (avg. 124 Kcycles)
  - Reconfiguration [array global controller] (avg. 3.5 Kcycles)
- Average overhead per timeslice > 127 Kcycles

Fully Dynamic Scheduler (2)
- Total execution time
  - Scheduler overhead is on average 36% of execution time
  - Timeslice size = 250 Kcycles

Quasi-Static Scheduler
- Small run-time critical loop:
  - Query array
  - Issue script commands
- Pre-compute schedule from
  - Graph topology
  - Back annotations (I/O rates)
- Generate script of configuration commands.
- Timeslice size
  - Dynamically controlled by array hardware stall detect.
  - Hardware continuously (or at small intervals) monitors array activity.
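The split between a precomputed script and a small run-time loop might look like the sketch below. Everything here is hypothetical: `precompute_script`, `run_script`, and the `FakeArray` stand-in are invented for illustration, the partitioning simply chunks a topologically ordered operator list, and the I/O-rate back annotations mentioned on the slide are not modeled.

```python
def precompute_script(topo_order, pages_per_slice):
    """Offline step: split operators into timeslices and emit load commands."""
    script = []
    for start in range(0, len(topo_order), pages_per_slice):
        group = topo_order[start:start + pages_per_slice]
        script.append([("load", slot, op) for slot, op in enumerate(group)])
    return script


class FakeArray:
    """Stand-in for the reconfigurable array (assumption, not a real API)."""

    def __init__(self, polls_until_stall=3):
        self.polls_until_stall = polls_until_stall
        self.polls = 0
        self.log = []

    def stalled(self):
        # Pretend the hardware stall detector fires on every third query.
        self.polls += 1
        return self.polls % self.polls_until_stall == 0

    def issue(self, command):
        self.log.append(command)


def run_script(script, array):
    """Run-time critical loop: query the array, then issue script commands."""
    for commands in script:
        while not array.stalled():
            pass  # the current timeslice is still making progress
        for command in commands:
            array.issue(command)


script = precompute_script(["a", "b", "c"], pages_per_slice=2)
array = FakeArray()
run_script(script, array)
```

The point of the sketch is that no optimization happens inside `run_script`; all placement decisions were made before execution started.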

Results (1)
- A low-overhead scheduling solution
  - Scheduler overhead (avg. 14 Kcycles)
  - Reconfiguration (avg. 4 Kcycles)
- 7x average reduction in overhead
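As a quick consistency check, the 7x figure can be recovered from the average per-timeslice overheads quoted on the slides. The variable names below are illustrative; the four input numbers are the averages reported for the two schedulers.

```python
# Average per-timeslice overheads in Kcycles, as quoted on the slides.
dynamic_scheduler, dynamic_reconfig = 124, 3.5
quasi_static_scheduler, quasi_static_reconfig = 14, 4

dynamic_total = dynamic_scheduler + dynamic_reconfig                  # 127.5
quasi_static_total = quasi_static_scheduler + quasi_static_reconfig   # 18

# Ratio of total overheads; comes out slightly above 7.
reduction = dynamic_total / quasi_static_total
print(f"overhead reduction: {reduction:.1f}x")
```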

Results (2)
- 4.5x average application speedup
  - Reduction in overhead AND
  - Improvement in scheduling quality

Results Summary
- Tested applications:
  - Image de/compression applications, which consist of both dynamic-rate and static-rate operators.
  - All demonstrate similar speedups under the Quasi-Static scheduler.
- Performance improvements can be attributed to:
  - Reduced scheduler overhead
  - Improved scheduling quality: a global view rather than the local (BFS) view of the dynamic scheduler
- Reduction of the lower bound on timeslice size
  - Expands the space of apps well suited for execution under virtualized hardware
- Retained the powerful semantics of dynamic data-dependent dataflow
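One way to see why lower overhead reduces the minimum practical timeslice is a simple serial model, which is an assumption of this note and not something stated on the slides. If a timeslice of T cycles is followed by O cycles of non-overlapped overhead, the overhead fraction is O / (T + O), so keeping it at or below a budget f requires T >= O * (1 - f) / f. The function name and the 10% budget are hypothetical.

```python
def min_timeslice(overhead_kcycles, max_overhead_fraction):
    """Smallest timeslice (Kcycles) keeping overhead within the given fraction.

    Assumes overhead is paid serially once per timeslice:
    overhead / (timeslice + overhead) <= max_overhead_fraction.
    """
    if not 0 < max_overhead_fraction < 1:
        raise ValueError("max_overhead_fraction must be between 0 and 1")
    return overhead_kcycles * (1 - max_overhead_fraction) / max_overhead_fraction


# Average total overheads from the slides: 124 + 3.5 and 14 + 4 Kcycles.
dynamic_bound = min_timeslice(127.5, 0.10)
quasi_static_bound = min_timeslice(18, 0.10)
```

Under this model and a 10% budget, the fully dynamic scheduler needs timeslices of roughly 1,150 Kcycles, while the quasi-static scheduler needs only about 160 Kcycles.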

Conclusion
- Run-time scheduler
  - Required for automatic scaling under hardware virtualization
  - Run-time overhead sets a lower bound on the size of the scheduling step (response time):
    - Restricts the applicability of virtualized hardware
    - Makes this model impractical for some apps
- Low-overhead run-time scheduling is achievable:
  - Without semantic restrictions
  - With higher (or comparable) scheduling quality
- 7x reduction in overhead and a simultaneous performance improvement of 2-7x
- OS is a viable alternative to manual scheduling.

Thank You
- Thanks to:
  - DARPA, Xilinx and STMicro
- For more information