MAPLD 2005: Demand and Penalty-Based Resource Allocation for Reconfigurable Systems with Runtime Partitioning
John Ardini

Slide 2: Motivation
For a given HW architecture, including reconfigurable components
  – Optimize performance in consideration of long reconfiguration times and current demands for processing
Application in systems with unknown runtime processing demands
  – Cognitive systems
  – Multisensor systems
  – Systems with unknown data lengths
Take advantage of the ability to express hardware implementations in a high-level language (C) common to processors and programmable devices

Slide 3: Related Work
Li, Compton, Hauck [00], based on Young [94]
  – “Credit” for an RC unit is proportional to the size of the unit
  – Penalty algorithm for defragmentation
  – A scoring approach is used here as well, but “credit” is proportional to the amount of “acceleration” achieved, with a decision threshold based on size
Vuletić, Pozzi, Ienne [04]
  – HW/SW abstraction layer proposed for an RC-transparent programming model

Slide 4: Goals
Examine possible RTL generators allowing one set of source code for an algorithm
  – Binds to processor or programmable device (FPGA)
  – Minimal changes (I/O only) required to source
  – However, the scheduling approach does not depend on the capability of C-to-RTL generators
Show easy creation of processor and FPGA implementations of logic
Assume task scheduling is unknown at build time and is based on service requests
Allow each task to support SW-only and hardware-accelerated versions
Define simple logic to make “best” use of hardware resources and assign ownership dynamically
Show benefit of RC via DMA in algorithms that can be bound to HW or SW
Define API for application threads
Demonstrate concept in real hardware

Slide 5: Experimental Environment
Worker thread / coprocessor DMA model set up in Windows as a multithreaded VC++ application
Coprocessor is an FPGA on a PCI Alpha-Data card
Implemented algorithm execution with and without the coprocessor
Used DMA to help hide the overhead of reconfiguration: SW-only threads can execute during configuration
Service requests initiated by adjustable timers to exercise the RC logic
Event logging for analysis
[Figure: manager thread, worker threads 1 and 2, coprocessor; labels: dataset, savings, DMA config, worker thread registration, service request]

Slide 6: Hardware Environment
Alpha-Data Virtex-II Pro card on PCI bus
Simple bus wrapper gets coprocessor IP onto the Alpha-Data local bus
PC chosen for easy development and focus on unique logic
[Figure: FPGA (wrapper, IP), local bus to PCI bridge, PC]

Slide 7: RTL Generator
ImpulseC chosen for this study
  – ANSI C-like
  – Simple modifications to algorithm to compile for processor
      – Data I/O path
      – Word types as simple #defines
  – High level of abstraction
      – Small learning curve
      – Give up low-level control of registers/signals
      – Some control over max gate delay using #pragma
  – Desktop simulation for fast algorithm debug

Slide 8: Software
Manager and application in VC++
  – Easily implemented in C as well
For the demo, the Windows “worker thread” model was used, but other static thread + messaging methods could be used as well

Slide 9: Test Algorithms
Two tasks implemented
  – FIR
  – FFT
HW implementation flow
  – Code in C
  – ImpulseC RTL generator
  – Synplify
  – Xilinx implementation tools
SW flow
  – Change I/O in HW algorithm to use shared memory buffer

Slide 10: IP Development Outline
Write task coprocessor for HW using ImpulseC
Modify I/O for processor implementation
Quantify savings in clock cycles for the HW-accelerated version
Wrap both implementations into a “worker thread” that uses one of them based on coprocessor ownership
Need to check coprocessor ownership on thread start
Worker thread registration not considered here
  – Could be defined on power-up, or
  – Dynamically registered
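The wrapping step above can be sketched in C as follows. This is only an illustration, not the presentation's code: `worker_state`, `worker_run`, `exec_path`, and the stand-in `sum_task` are hypothetical names, and the hardware path is simulated by the same software routine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { PATH_SW, PATH_HW } exec_path;

typedef struct {
    bool owns_coproc;   /* hypothetical flag, maintained by the manager */
    bool coproc_ready;  /* false while the bit stream is still loading */
} worker_state;

/* Stand-in task: sums the data set. A real HW path would hand the
   buffer to the coprocessor instead of calling this function. */
static long sum_task(const int *data, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += data[i];
    return acc;
}

/* Checks ownership on thread start and picks an implementation. */
exec_path worker_run(const worker_state *w, const int *data, size_t n,
                     long *result)
{
    assert(w != NULL && result != NULL);
    exec_path path = (w->owns_coproc && w->coproc_ready) ? PATH_HW : PATH_SW;
    /* Both paths produce the same result; only the binding differs. */
    *result = sum_task(data, n);
    return path;
}
```

A worker that owns the coprocessor but whose bit stream is still loading falls back to `PATH_SW`, which matches the idea that SW-only execution continues during configuration.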

Slide 11: Worker Thread Control Block
One instantiated per worker thread
Contains information about the coprocessor bit stream
Points to the HW resource it currently owns
  – Would be used in multiple-coprocessor systems for faster manager logic
Contains base address of its coprocessor
  – Maintained by the manager and used as a semaphore for coprocessor use
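One possible layout for such a control block is sketched below. The field and function names (`worker_tcb`, `tcb_set_access`, `tcb_may_use_coproc`) are assumptions made for illustration; the slides do not give the actual definition. A NULL base address stands for "no coprocessor access", which is one way to read the semaphore role described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rc_resource;  /* hypothetical HW resource control block */

typedef struct {
    const uint8_t      *bitstream;      /* coprocessor bit stream image */
    size_t              bitstream_len;  /* its length in bytes */
    struct rc_resource *owned;          /* HW resource currently owned, or NULL */
    volatile uint32_t  *coproc_base;    /* NULL means "no coprocessor access" */
} worker_tcb;

/* Manager side: grant or revoke access by setting the base address.
   A real multithreaded version would need an atomic or locked update. */
void tcb_set_access(worker_tcb *t, struct rc_resource *res,
                    volatile uint32_t *base)
{
    assert(t != NULL);
    t->owned = res;
    t->coproc_base = base;
}

/* Worker side: a non-NULL base address acts as the "go" signal. */
int tcb_may_use_coproc(const worker_tcb *t)
{
    return t->coproc_base != NULL;
}
```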

Slide 12: RC Thread Control Block
Control block for HW resource
Holds information about the resource, e.g. the ID of the resource
Member function to kick off the bit stream load process via DMA
  – Target thread can continue to run SW only until configuration is complete
Member function to gain coprocessor access on behalf of a worker thread, based on ownership and state (is it done loading the bit stream?)
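The two member functions can be sketched as a small state machine. The original was written in VC++; this C version uses hypothetical names (`rc_block`, `rc_start_load`, `rc_load_done`, `rc_try_access`) and only simulates the DMA transfer by a state change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { RC_EMPTY, RC_LOADING, RC_READY } rc_state;

typedef struct {
    int      id;     /* resource ID */
    int      owner;  /* worker ID that owns the resource, -1 if none */
    rc_state state;
} rc_block;

/* Kicks off a (simulated) bit stream load for new_owner. A real
   version would start the DMA transfer here and return immediately. */
void rc_start_load(rc_block *rc, int new_owner)
{
    assert(rc != NULL);
    rc->owner = new_owner;
    rc->state = RC_LOADING;
}

/* Called when the DMA "done" event arrives. */
void rc_load_done(rc_block *rc)
{
    assert(rc != NULL);
    if (rc->state == RC_LOADING)
        rc->state = RC_READY;
}

/* Grants access only to the owner, and only once loading has finished.
   Any other caller keeps running its software implementation. */
bool rc_try_access(const rc_block *rc, int worker)
{
    return rc->owner == worker && rc->state == RC_READY;
}
```

Between `rc_start_load` and `rc_load_done` the new owner is refused access, so it would keep using its SW-only version until configuration is complete.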

Slide 13: Coprocessor Ownership
All service requests pass through the thread manager
Manager uses “scoring” logic
Upon completion, worker threads report the “savings” that were achieved, or could have been achieved, using a coprocessor
Manager increments the score for that thread
Highest-scoring threads receive a coprocessor
Reassignment not done until a threshold is passed
  – Set based on the relative time penalty of performing a reconfiguration, e.g. do reconfig when the score delta exceeds 10x the reconfiguration time
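A minimal sketch of the score update and the threshold test, assuming savings and reconfiguration time are expressed in the same unit (for example clock cycles). The function names are hypothetical, and the factor of 10 is taken from the example on the slide rather than from the actual code.

```c
#include <assert.h>
#include <stdbool.h>

#define RC_THRESHOLD_FACTOR 10  /* from the slide's "10x" example */

/* Adds the savings a worker reported on completion to its score. */
long score_add(long score, long reported_savings)
{
    assert(reported_savings >= 0);
    return score + reported_savings;
}

/* True when the challenger's lead over the current owner exceeds
   the threshold derived from the reconfiguration time. */
bool should_reconfigure(long challenger, long owner, long reconfig_time)
{
    return challenger - owner > RC_THRESHOLD_FACTOR * reconfig_time;
}
```

With a reconfiguration time of 100 units, a lead of exactly 1000 does not trigger a reassignment; the lead has to exceed the threshold.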

Slide 14: Scoring Logic
Need to bound scores
  – Bound should be greater than RC threshold
  – 2x RC threshold used in these tests
Need to maintain “relative” performance of competing tasks, i.e. can’t have most scores saturating
Therefore, when updating scores at thread completion, subtract the current lowest score from all registered threads
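The bounding and rebasing steps could look like the sketch below. `scores_normalize` is a hypothetical name, and the order of operations (subtract first, then saturate) is an assumption; the slides do not specify it.

```c
#include <assert.h>
#include <stddef.h>

/* Subtracts the current lowest score from every thread, then
   saturates at bound (the slides use 2x the RC threshold). */
void scores_normalize(long *scores, size_t n, long bound)
{
    assert(scores != NULL && n > 0 && bound > 0);
    long lowest = scores[0];
    for (size_t i = 1; i < n; i++)
        if (scores[i] < lowest)
            lowest = scores[i];
    for (size_t i = 0; i < n; i++) {
        scores[i] -= lowest;
        if (scores[i] > bound)
            scores[i] = bound;
    }
}
```

For example, scores of 500, 2600, and 900 with a bound of 2000 become 0, 2000, and 400: the gaps between threads are preserved except where saturation clips them.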

Slide 15: Scoring Details
Simple subtraction of lowest score is not enough
  – One inactive thread would allow “integrator windup” on the remaining threads
      – Slow response when the inactive thread comes back online
      – Saturation logic would prevent the selection of coprocessor owners, i.e. they would all “collect” at the top of the score list
  – Prevents initial accumulation of scores
Therefore, subtract score x from each task, where
  – x is the lowest nonzero score for all tasks other than the top-scoring m threads, where m is the number of available coprocessors
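The refined rule can be sketched as follows. `score_decrement` and `scores_rebase` are hypothetical names, scores are assumed to be nonnegative, and flooring at zero after the subtraction is an assumption the slides do not state.

```c
#include <assert.h>
#include <stdlib.h>

static int cmp_desc(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x < y) - (x > y);  /* larger scores sort first */
}

/* Returns x: the lowest nonzero score among tasks ranked below the
   top m. Returns 0 when no such score exists, so nothing is subtracted. */
long score_decrement(const long *scores, size_t n, size_t m)
{
    assert(scores != NULL && n > 0);
    long *sorted = malloc(n * sizeof *sorted);
    if (sorted == NULL)
        return 0;
    for (size_t i = 0; i < n; i++)
        sorted[i] = scores[i];
    qsort(sorted, n, sizeof *sorted, cmp_desc);
    long x = 0;
    for (size_t i = m; i < n; i++)
        if (sorted[i] > 0)
            x = sorted[i];  /* descending order: last nonzero is lowest */
    free(sorted);
    return x;
}

/* Subtracts x from every task, flooring at zero (an assumption). */
void scores_rebase(long *scores, size_t n, size_t m)
{
    long x = score_decrement(scores, n, m);
    for (size_t i = 0; i < n; i++)
        scores[i] = scores[i] > x ? scores[i] - x : 0;
}
```

With scores 900, 0, 300, 500 and one coprocessor, x is 300, so the inactive thread at 0 no longer blocks the rebase. With scores 900 and 0, x is 0 and the active thread keeps accumulating.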

Slide 16: Coproc Assignment
Get highest-scoring non-owner in top m tasks
Compare its score to that of the lowest-ranking owner
If the difference is greater than the threshold, RC
  – If the current owner is using the resource, skip RC
      – If RC is still the right decision after the current owner finishes, RC will happen at that time
More logic could be used to continue comparing against current coproc owners
[Figure: ranked task scores t3, t2, *t1, *t4, t5 (* = current owner); top m tasks are eligible for coprocessor ownership, lower-ranking tasks will run in SW; test: Δ > thresh?]
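The decision above can be sketched in C. The names `task_entry`, `rank_of`, and `pick_reconfig` are hypothetical. The sketch covers only the replacement case (it returns false when no task owns a coprocessor yet), and tied scores share a rank.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    long score;
    bool owner;  /* currently owns a coprocessor */
    bool busy;   /* currently using that coprocessor */
} task_entry;

/* Number of tasks scoring strictly higher than task i (0 = top rank). */
static size_t rank_of(const task_entry *tasks, size_t n, size_t i)
{
    size_t higher = 0;
    for (size_t k = 0; k < n; k++)
        if (tasks[k].score > tasks[i].score)
            higher++;
    return higher;
}

/* Returns true and fills *winner and *loser when a reconfiguration
   should start now; m is the number of available coprocessors. */
bool pick_reconfig(const task_entry *tasks, size_t n, size_t m,
                   long threshold, size_t *winner, size_t *loser)
{
    assert(tasks != NULL && winner != NULL && loser != NULL);
    size_t best = n, worst = n;  /* n means "not found" */
    for (size_t i = 0; i < n; i++) {
        if (!tasks[i].owner && rank_of(tasks, n, i) < m &&
            (best == n || tasks[i].score > tasks[best].score))
            best = i;   /* highest-scoring non-owner in the top m */
        if (tasks[i].owner &&
            (worst == n || tasks[i].score < tasks[worst].score))
            worst = i;  /* lowest-ranking current owner */
    }
    if (best == n || worst == n)
        return false;
    if (tasks[best].score - tasks[worst].score <= threshold)
        return false;
    if (tasks[worst].busy)
        return false;  /* skip for now; re-evaluate when the owner finishes */
    *winner = best;
    *loser = worst;
    return true;
}
```

Using the ranking from the figure (t3, t2, *t1, *t4, t5) with m = 2, the candidate is t3 and the owner at risk is t4; whether a reconfiguration happens depends on their score difference and on whether t4 is busy.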

Slide 17: Reconfiguration Thread
Created by manager
Kicks off DMA process
Waits for done event
Sends reconfiguration-complete message back to manager
Manager can then give access to the worker thread owner

Slide 18: Test Configuration
Single HW resource available
Two competing threads: FFT and FIR processing
Fixed HW block sizes
Fixed data set sizes = fixed savings
Adjust for mismatch in microprocessor vs. FPGA clock rates
Service request rates for each thread adjustable to exercise RC logic
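One way to account for the clock rate mismatch is to convert both cycle counts to time before computing savings. The slides do not say how the adjustment was actually done, so the sketch below is an assumption; the function names are hypothetical and the figures in the usage note are illustrative, not measured.

```c
#include <assert.h>
#include <stdint.h>

/* Converts a cycle count at clock_mhz to nanoseconds (integer math). */
int64_t cycles_to_ns(int64_t cycles, int64_t clock_mhz)
{
    assert(clock_mhz > 0);
    return cycles * 1000 / clock_mhz;
}

/* Savings of the HW version over the SW version in nanoseconds,
   floored at zero when the coprocessor is not actually faster. */
int64_t savings_ns(int64_t sw_cycles, int64_t cpu_mhz,
                   int64_t hw_cycles, int64_t fpga_mhz)
{
    int64_t diff = cycles_to_ns(sw_cycles, cpu_mhz) -
                   cycles_to_ns(hw_cycles, fpga_mhz);
    return diff > 0 ? diff : 0;
}
```

For instance, 3,000,000 cycles on a 3000 MHz processor take 1,000,000 ns, while 50,000 cycles on a 100 MHz FPGA take 500,000 ns, giving a fixed saving of 500,000 ns per service request.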

Slide 19: Results
[Figure: service request rates for two threads vary with time; annotations: RC event, no owner, thread 1 owns, thread 2 owns, RC threshold, hysteresis, score saturation]

Slide 20: Reconfiguration Detail
[Figure: RC DMA period]

Slide 21: RC DMA with Higher Demand Rate
[Figure: RC DMA period]

Slide 22: Conclusions
Coprocessor ownership given based on best sustained use of the resource
Provides hysteresis to prevent frequent reconfigurations
Low-overhead RC decision logic
Hardware and software implementations allow DMA to hide reconfiguration overhead
IP description in C allows it to be created once and compiled for microprocessor and FPGA targets