
Final Presentation: Hardware DLL. Real-Time Partial Reconfiguration Management of FPGA by OS. Submitters: Alon Reznik, Anton Vainer. Supervisors: Ina Rivkin, Oz Shmueli.

• SW execution can be slow. Most parallel algorithms can be executed faster by dedicated HW accelerators on an FPGA.
• Offloading suitable algorithms to HW accelerators frees the CPU for other tasks.
• An OS may host many SW processes, but an FPGA cannot hold as many HW accelerators.

Application developers:
• HLS-compatible framework for synthesizing HW accelerators.
• Extended flexibility for interfacing with the HW accelerators.
System users:
• The architectural modification is transparent.
• Only the improved performance is noticeable.
Goal: design and implement an innovative embedded system architecture to manage hardware accelerators in real time.

Legend: LM = Loadable Module, RP = Reconfigurable Partition.
Available functions:
• LM A [9:15]: Func A
• LM B [0:2, 8:15]: Func B
• LM C [0:15]: Func C
• LM D [0:2, 7:15]: Func D
• LM E [0:1]: Func E
• LM F [0:15]: Func F
Programmable logic (RP 0 to RP 15):
• RP 0: Func E
• RP 1: Func C
• RP 7: Func D
• RP 8: Func B
• RP 2 to RP 6 and RP 9 to RP 15: no function loaded
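The placement decision implied by this slide can be sketched in C. The sketch assumes that the bracketed ranges (for example [9:15]) list the RPs a module is compatible with; the slide does not say so explicitly. The names `loadable_module`, `rp_range`, and `pick_partition` are hypothetical, and the lowest-index-first policy is only one possible choice.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_RPS 16

/* Hypothetical descriptor: bit i of compat_mask is set when the module
 * can be loaded into RP i (assumed meaning of "LM A[9:15]"). */
typedef struct {
    const char *name;
    uint16_t compat_mask;
} loadable_module;

/* Build a mask with bits lo..hi (inclusive) set. */
uint16_t rp_range(unsigned lo, unsigned hi)
{
    assert(lo <= hi && hi < NUM_RPS);
    uint32_t width = hi - lo + 1u;
    return (uint16_t)(((1u << width) - 1u) << lo);
}

/* Return the lowest-numbered RP that is both free and compatible,
 * or -1 if there is none (the caller would then fall back to software). */
int pick_partition(const loadable_module *lm, uint16_t occupied_mask)
{
    uint16_t candidates = (uint16_t)(lm->compat_mask & (uint16_t)~occupied_mask);
    for (int rp = 0; rp < NUM_RPS; ++rp) {
        if (candidates & (1u << rp))
            return rp;
    }
    return -1;
}
```

With RP 0, 1, 7 and 8 occupied as on the slide, a module compatible with [9:15] would be placed in RP 9, while a module compatible only with [0:1] would find no free partition.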

The application requests an acceleratable function to process input data.
Best case scenario:
• Load the module into its compatible partition.
• Send input data to the loaded module.
• Do other tasks.
• The application reads output data from the loaded module.
Worst case scenario (input data is processed by the original, unaccelerated function):
• Call the software function.
• The software function returns output data to the application.

Software: implementation of the management application for Linux OS.
Interface: Linux OS (on the PS) connects to the HW accelerators (on the PL) via an AXI interconnect.
Hardware: design for a partially reconfigurable FPGA embedded on a Xilinx Zynq-7000 board.

Slave AXI bus:
• Handshaking signals: start, valid, done, idle
• Input data
• Output data
Other signals:
• Clock signal
• Reset signal
• Unused/optional interrupt signal
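A minimal polling sketch of the start/done/idle handshake follows. The bit positions are assumptions made for illustration; the actual register layout comes from the driver header that Vivado HLS generates for the core. The control register is passed as a pointer, so the code can be exercised against an ordinary variable as well as against a memory-mapped address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed control-register bits (hypothetical layout). */
enum {
    CTRL_START = 1u << 0,
    CTRL_DONE  = 1u << 1,
    CTRL_IDLE  = 1u << 2
};

/* Raise the start bit to kick off one run of the accelerator. */
void hwa_start(volatile uint32_t *ctrl)
{
    assert(ctrl != NULL);
    *ctrl |= CTRL_START;
}

/* Poll the done bit at most max_polls times.
 * Returns true as soon as done is observed, false on timeout. */
bool hwa_wait_done(const volatile uint32_t *ctrl, unsigned max_polls)
{
    assert(ctrl != NULL);
    for (unsigned i = 0; i < max_polls; ++i) {
        if (*ctrl & CTRL_DONE)
            return true;
    }
    return false;
}
```

Bounding the number of polls keeps a stuck accelerator from hanging the caller; an interrupt-driven design would remove the loop entirely.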

Vivado HLS flow: C synthesis, utilization estimates/constraints verification, RTL exportation, HDL file.
HDL file ports: control = handshaking signals, in = input data, out = output data.
• The top-level function is included in the bus bundle to generate the following handshaking signals: start, valid, done, idle.
• The handshaking signals and I/O are bundled into a bus.
• An IP-XACT adapter is generated for the AXI4-LiteS bus bundle.
• Address maps are created for the IP-XACT adapter components. These addresses will be used by the PR management application.

[Figure: estimates for the three IP cores: Prime Number, Fibonacci Number, Greatest Common Divisor]
• These estimates do not include routing.

[Figure: Reconfigurable Partition 0, Reconfigurable Partition 1, Reconfigurable Partition 2]
• Due to the FPGA fabric and Pblock geometric constraints, different resources are available to different RPs.
• Some IP core utilization estimates might meet the utilization constraints of only a subset of the RPs. As a result, the synthesized RMs will be compatible with this RP subset only.
• It is up to the user to verify RM/RP utilization compatibility and choose only the compatible RMs.
• It is up to the PR management application to optimize resource utilization and choose the best possible RP for a given RM set.

Reconfigurable Module (can be easily implemented in Vivado HLS):
• Slave AXI bus: handshaking signals (start, valid, done, idle), input data, output data
• Clock signal
• Reset signal
• Unused/optional interrupt signal
• Internal fragmentation

[Figure: reconfigurable partitions RP 0, RP 1, RP 2, RP 3, ..., RP 15]

Reconfigurable partitions: RP 0, RP 1, RP 2, RP 3, ..., RP 15.
Considerations for determining the number of RPs:
• As shown in the latest Xilinx workshop, an RP's physical location on the FPGA fabric is an integral part of an RM design. Thus, a unique partial bitstream has to be created for every RP on the FPGA fabric.
• The size of a typical partial bin file (binary bitstream) is about 100 KB. For example, a system with 10 different HW accelerators and 16 RPs needs 10 × 16 × 100 KB = 16,000 KB, or about 16 MB, of memory.
Considerations for determining the sizes of the RPs:
• We have very limited FPGA design experience, and our custom IP cores are quite small. Therefore, the synthesized RMs fit easily in the large RPs on the FPGA fabric.
• These sizes were chosen empirically for testing purposes, and they will later be adapted for larger and more complex IP cores (floating-point matrix multiplication, FIR filter, Sobel/Sepia filter, etc.).
There is an obvious tradeoff between the number and the sizes of the RPs:
• Increasing the number of RPs increases the number of HW accelerators that can operate simultaneously, but it also creates more routing, which reduces the available FPGA resources, and it increases memory usage.
• Increasing the sizes of the RPs allows larger and more complex HW accelerators to be used, but it also increases the internal fragmentation for smaller HW accelerators and reduces the overall performance.
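The memory estimate from this slide can be written as a small helper: one partial bin file is needed per (accelerator, RP) pair. The function name is hypothetical, and the 100 KB figure is the slide's typical size rather than a measured constant.

```c
#include <assert.h>

/* Total storage in KB when every accelerator needs one partial
 * bitstream per reconfigurable partition. */
unsigned long bitstream_storage_kb(unsigned accelerators,
                                   unsigned partitions,
                                   unsigned kb_per_bitstream)
{
    assert(kb_per_bitstream > 0);
    return (unsigned long)accelerators * partitions * kb_per_bitstream;
}
```

For the slide's example, `bitstream_storage_kb(10, 16, 100)` gives 16000 KB, roughly 16 MB. Doubling the number of RPs doubles this figure, which is the memory side of the tradeoff described above.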

Since we restrict the C/C++ function interface, the data structure holds only the following fields:
• (int) HW index (translated linearly to the HW address)
• (int) Accelerator index (index of the loaded accelerator, -1 if empty)
• (int) Number of inputs
• (int) Number of outputs
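One way the record and the linear index-to-address translation might look in C is sketched below. The field names, the base address, and the per-partition stride are assumptions chosen for illustration (the stride corresponds to a 16-bit address window); none of them is taken from the project sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical field names; the four int fields follow the slide. */
typedef struct {
    int hw_index;     /* RP index, translated linearly to a HW address */
    int accel_index;  /* index of the loaded accelerator, -1 if empty  */
    int num_inputs;
    int num_outputs;
} rp_entry;

/* Assumed values for illustration only. */
#define RP_BASE_ADDR 0x43C00000u  /* base of the first RP's window */
#define RP_STRIDE    0x00010000u  /* one 16-bit address window per RP */

/* Linear translation: address = base + index * stride. */
uint32_t rp_address(const rp_entry *e)
{
    assert(e->hw_index >= 0);
    return RP_BASE_ADDR + (uint32_t)e->hw_index * RP_STRIDE;
}

/* A partition with accelerator index -1 holds no module. */
bool rp_is_empty(const rp_entry *e)
{
    return e->accel_index == -1;
}
```

Under these assumed constants, the entry for RP 3 maps to 0x43C30000.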

• fix_16.tcl: Due to optimizations made during synthesis, the number of address bits varies between accelerators, which causes issues with the standard HW API. This script fixes the issue by expanding the address bits to 16.
• Make_dcp.tcl (runs fix_16.tcl): This script loads the accelerator made by HLS into the design and compiles it to bitstream files that can be loaded onto the FPGA using Xilinx tools.
• Make_bin.bat: This script converts the bitstream files to bin files that can be loaded onto the FPGA using the driver on the embedded Linux.

Register address: base (from the Vivado IDE) plus offset (from Vivado HLS).
X is an input. Y is an output (has a control register).

• void xillix_initialize (void);
• int xillix_load (const char* hwa_repo_path, const int input_param_num, const int output_param_num);
• void xillix_activate (const int rp_idx, const long* input_params);
• bool xillix_check_result (const int rp_idx);
• void xillix_get_result (const int rp_idx, long* output_params);
• void xillix_unload (const int rp_idx);
• void xillix_terminate (void);
The API is built to run in user space, keeping processes consistent with one another by means of files.
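The calling sequence suggested by these prototypes (load, activate, poll, get result, unload) can be sketched with a table of function pointers that mirrors a subset of the API. Everything below is illustrative: `hwa_api`, `run_accelerated`, and the `fake_*` software stand-ins are hypothetical, and the sketch assumes that the load call returns the RP index on success and a negative value on failure, which the slide does not state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Function-pointer table mirroring a subset of the API prototypes. */
typedef struct {
    int  (*load)(const char *hwa_repo_path, int input_param_num, int output_param_num);
    void (*activate)(int rp_idx, const long *input_params);
    bool (*check_result)(int rp_idx);
    void (*get_result)(int rp_idx, long *output_params);
    void (*unload)(int rp_idx);
} hwa_api;

typedef long (*sw_func)(long);

/* One-input, one-output call: use the accelerator when it loads,
 * otherwise fall back to the original software function. */
long run_accelerated(const hwa_api *api, const char *path,
                     sw_func fallback, long input, bool *used_hw)
{
    assert(api != NULL && fallback != NULL && used_hw != NULL);
    int rp = api->load(path, 1, 1);
    if (rp < 0) {                       /* assumed failure convention */
        *used_hw = false;
        return fallback(input);
    }
    long output = 0;
    api->activate(rp, &input);
    while (!api->check_result(rp)) {
        /* busy wait; a real application could do other work here */
    }
    api->get_result(rp, &output);
    api->unload(rp);
    *used_hw = true;
    return output;
}

/* Software stand-ins so the sketch runs without hardware. */
static long fake_output;
static int  fake_polls;

int fake_load(const char *path, int n_in, int n_out)
{
    (void)n_in; (void)n_out;
    return (path != NULL) ? 0 : -1;
}
void fake_activate(int rp, const long *in) { (void)rp; fake_output = in[0] * 2; fake_polls = 0; }
bool fake_check_result(int rp) { (void)rp; return ++fake_polls >= 3; }
void fake_get_result(int rp, long *out) { (void)rp; out[0] = fake_output; }
void fake_unload(int rp) { (void)rp; }

/* The original software function that the accelerator replaces. */
long sw_double(long x) { return x * 2; }
```

The caller gets the same result on both paths, which matches the goal that only the improved performance is noticeable to the user.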

• The API functions use memory mapping to interface with the programmable logic directly from user space.
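A minimal sketch of user-space register access through `mmap` is shown below. The helper names (`reg_window`, `reg_window_open`, `reg_read`, `reg_write`, `reg_window_close`) are hypothetical. On an embedded Linux target the path would typically be a physical-memory device such as /dev/mem, with the offset set to the accelerator's base address; that is an assumption about the setup, not something the slides confirm. The path is a parameter, so the same code also works on an ordinary file.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* A mapped window of 32-bit registers (hypothetical helper type). */
typedef struct {
    volatile uint32_t *regs;
    size_t len;
} reg_window;

/* Map len bytes of path starting at offset (offset must be page-aligned).
 * Returns 0 on success, -1 on failure. */
int reg_window_open(reg_window *w, const char *path, off_t offset, size_t len)
{
    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0)
        return -1;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;
    w->regs = (volatile uint32_t *)p;
    w->len = len;
    return 0;
}

/* Write one 32-bit register at a word-aligned byte offset. */
void reg_write(reg_window *w, size_t byte_off, uint32_t value)
{
    assert(byte_off % 4 == 0 && byte_off + 4 <= w->len);
    w->regs[byte_off / 4] = value;
}

/* Read one 32-bit register at a word-aligned byte offset. */
uint32_t reg_read(const reg_window *w, size_t byte_off)
{
    assert(byte_off % 4 == 0 && byte_off + 4 <= w->len);
    return w->regs[byte_off / 4];
}

/* Unmap the window and clear the handle. */
void reg_window_close(reg_window *w)
{
    munmap((void *)w->regs, w->len);
    w->regs = NULL;
    w->len = 0;
}
```

The `volatile` qualifier keeps the compiler from caching or reordering register accesses, which matters when the other side of the mapping is hardware.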

1. Open the generic HWA template in Vivado HLS.
2. Replace the HWA_func top-level function with your C/C++ function.
3. Bundle the top-level function and I/O for AXI4-LiteS and run C synthesis.
4. Verify that your utilization estimates meet the utilization constraints.
5. Export RTL and verify that your HDL files were created in the IP folder.
6. Execute the automation scripts.
7. Wait for your partial bin files (binary bitstreams) to be created.
8. Add the partial bin files to the SD card.

1. Include the API's header file in the source code.
2. Change direct calls to the targeted function into their equivalent (or other) API function calling sequence.
3. Compile your source code and add the ELF files to the SD card.
That's it!

Three algorithms were tested using the default HLS settings on the ZC702 board.
Results table columns: accelerator, time SW (sec), time HW (sec), difference (sec), inputs, notes. Rows: Fibonacci, prime, GCD (note: algorithm too fast to compare).
• Fibonacci: HLS solved the algorithm in a way that benefits HW; the results show that HW is 60% faster.
• Prime: the algorithm is much faster in SW than in HW (HW is 4 times slower). This might be because the default HLS solution is not well optimized for HW, or because the algorithm itself is faster in SW.
• GCD: the algorithm finished within single microseconds in both SW and HW regardless of input size, so the two are not comparable.

This project lays the foundation for an actual HW DLL.
• Build a Linux kernel module to manage operation. This will enable interrupts and remove polling (busy waiting), increasing performance.
• Build a GUI to view and manage the status of the reconfigurable system.
• Analyse commercial programs, build and optimize the functions in HLS, and demonstrate the multitasking of the system in real HW DLL conditions.
• Define and build an interface and an accelerator that can use more than one reconfigurable block.

For more detailed information, please read final_report_ver1.0.docx. Thank you.