Asymmetric FPGA-loaded hardware accelerators for FPGA-enhanced CPU systems with Linux
Performed by: Avi Werner, William Backshi
Instructor: Evgeny Fiksman


2 Asymmetric FPGA-loaded hardware accelerators for FPGA-enhanced CPU systems with Linux
Performed by: Avi Werner, William Backshi
Instructor: Evgeny Fiksman
Duration: 1 year (2 semesters)
Final project presentation, 30/03/2009

3 RMI Processor

4 RMI – SW Programming Model

5 Agenda
- Project description
- Design considerations and schematics
- System diagram and functionality
- System flow
- User & kernel driver APIs
- Issues
- Future progress

6 Project definition
- An FPGA-based system.
- An asymmetric multiprocessor system, with a master CPU and several slave accelerators (modified soft-core CPUs with RAM), running the same or different opcode sets.
- The master CPU runs a single-processor Linux OS; the accelerators' functionality is provided to applications in the OS through a driver API.

7 The Platform
- Platform: ML310 with PPC405.
- Accelerators: based on uBlaze (MicroBlaze) soft-core microprocessors.
- Controllers: an IRQ controller for each core.
- "Accelerator" refers to microprocessor + IRQ generator + RAM.

8 HW Design considerations
- Scalability: the design is CPU-independent.
- The accelerator works with interrupts, not polling.
- Modularity: stubs need only the Interrupt Generator.
- The OS does not work with interrupts, for generic HW compatibility.
- Separate register space.
- Single-cycle transactions for accessing accelerator status.
- Data-mover stub init includes the chunk size.
- Partial bus separation.

9 SW Design considerations
- Scalability: the design is CPU-independent.
- Works with kernel 2.6 & glibc (libraries).
- Modularity: the SW is built in layers, with APIs for communication.
- The stub doesn't know which slave core it runs on.
- The kernel image is loaded to memory using a CPIO FS.
- The kernel driver polls with a single check.
- The user-land driver gives the user an easy, intuitive API for launching tasks on accelerator cores.
- Separating the driver into user and kernel parts enables flexibility: feature changes can be made solely in the user-land driver and don't require knowledge of kernel internals.
- The stub code enables the target code to work with interrupts, supporting interrupt-handling applications.
- Data-mover stub init includes the chunk size, so no character recognition (terminator scanning) is needed.
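The last point above can be sketched as follows: because the data mover is told its chunk size once at init, every transfer copies a fixed number of bytes and never scans the payload for a terminator. This is a minimal host-side sketch; the type and function names (data_mover, dm_init, dm_move_chunk) are illustrative assumptions, not the project's actual identifiers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical data-mover state: the chunk size is fixed at init time,
 * so the stub never has to scan payloads for a terminator character. */
typedef struct {
    size_t chunk_size;   /* bytes per transfer, set once at init */
} data_mover;

void dm_init(data_mover *dm, size_t chunk_size) {
    dm->chunk_size = chunk_size;
}

/* Copy exactly one chunk from src to dst; returns bytes moved. */
size_t dm_move_chunk(const data_mover *dm, void *dst, const void *src) {
    memcpy(dst, src, dm->chunk_size);
    return dm->chunk_size;
}
```

Fixing the chunk size up front keeps the stub's transfer loop branch-free over the payload contents, which matters on a small soft-core.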

10 Accelerator Schematics
[Block diagram: a uBlaze CPU with dual-port data & instruction RAM (reached through instruction and data buses and memory controllers), an IRQ Generator, and general-purpose registers; the accelerator connects to the PLB v4.6 bus as both master and slave, with an IRQ line out.]

11 HW Design Schematics
[Block diagram: the PPC and DDR MEM sit on one PLB v4.6 bus; the accelerators' data & instruction memories sit on a second PLB v4.6 bus; an MMU and a PLB-to-PLB bridge connect the two.]
1. The PLB-to-PLB bridge is needed for communication between the PPC and the IRQ Generators.
2. The MMU has 2 PLB buses in order to present the main memory to the accelerator cores at a non-zero address.

12 SW/HW layers
[Layer diagram of the accelerated software platform:
- FPGA hardware: PPC 405, accelerators, DDR MEM, MMU, instruction & data memories.
- Low-level SW: protocol communication layer; software stub (data mover & executer).
- Kernel land: Linux (kernel 2.6.25), kernel driver, driver-allocated memory.
- User land: virtual communication layer, user-land driver, demo application.]

13 Partial SW Flow

14 Partial SW Flow (continued)

15 Accelerator Stub API
When the target code finishes, it must pass control back to the stub using the return_control_to_stub function:

void return_control_to_stub(int * ret_val, int count)

The function takes 2 parameters: a pointer to the first int and the number of ints to copy. (The data doesn't have to be ints; it is simply passed through as a predefined number of 32-bit data blocks.)
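A minimal sketch of how target code would hand its results back through this entry point. The stub's real implementation copies the words into shared RAM and signals the master CPU via the IRQ generator; here those effects are modeled with a host-side mailbox so the sequence can run anywhere. The mailbox names (ret_buf, ret_count, target_main) are illustrative assumptions.

```c
#include <stddef.h>

/* Host-side mailbox standing in for the accelerator's return path.
 * In the real stub these words would land in shared RAM and an
 * interrupt would be raised toward the master CPU. */
#define RET_WORDS_MAX 16
static int ret_buf[RET_WORDS_MAX];
static int ret_count;

/* Simplified model of the stub entry point: ret_val points at the
 * first 32-bit word of the result, count is how many words to copy. */
void return_control_to_stub(int *ret_val, int count) {
    for (int i = 0; i < count && i < RET_WORDS_MAX; i++)
        ret_buf[i] = ret_val[i];
    ret_count = count;
}

/* Example target code: computes two results, then yields to the stub. */
void target_main(void) {
    int results[2] = { 42, 7 };
    return_control_to_stub(results, 2);
}
```

Note that the result need not actually be ints; any payload works as long as it is framed as whole 32-bit words, matching the slide's "32-bit data blocks" convention.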

16 User driver API
- void FUD_open_device()
- void FUD_close_device()
- void FUD_allocate_and_load_to_memory(char *file_name, u_core_dsc *ret_st)
- void FUD_run(int core_id)
- void FUD_get_ret_values(int core_id, void* ret_ptr)

typedef struct _userland_core_dsc {
    int valid;
    int core_id;
    int ret_size;
} u_core_dsc;
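A sketch of the call sequence a demo application would use with this API. The function bodies below are stand-ins so the sequence can run host-side (the real driver opens the device node, loads the image into core RAM, and talks to the kernel driver); the file name "accel.elf", the return value 42, and the run_task helper are all hypothetical.

```c
#include <stddef.h>

typedef struct _userland_core_dsc {
    int valid;
    int core_id;
    int ret_size;
} u_core_dsc;

/* Stand-in implementations; real code would use the kernel driver. */
static int device_open;

void FUD_open_device(void)  { device_open = 1; }
void FUD_close_device(void) { device_open = 0; }

void FUD_allocate_and_load_to_memory(char *file_name, u_core_dsc *ret_st) {
    (void)file_name;          /* real code copies the image to core RAM */
    ret_st->valid = 1;
    ret_st->core_id = 0;
    ret_st->ret_size = (int)sizeof(int);
}

void FUD_run(int core_id) { (void)core_id; /* kick the accelerator */ }

void FUD_get_ret_values(int core_id, void *ret_ptr) {
    (void)core_id;
    *(int *)ret_ptr = 42;     /* pretend the core returned 42 */
}

/* Typical sequence for launching one task on an accelerator core. */
int run_task(char *image) {
    u_core_dsc core;
    int result;
    FUD_open_device();
    FUD_allocate_and_load_to_memory(image, &core);
    FUD_run(core.core_id);
    FUD_get_ret_values(core.core_id, &result);
    FUD_close_device();
    return result;
}
```

The u_core_dsc descriptor returned by the load call is what ties the later FUD_run / FUD_get_ret_values calls to the right core.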

17 Kernel driver API
- int FPGA_drv_open(struct inode *inode, struct file *filp)
- int FPGA_drv_release(struct inode *inode, struct file *filp)
- static int FPGA_drv_ioctl(struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg)
- IOCTL (I/O Control) commands:
  GET_CONTINUOUS_MEMORY
  FREE_CONTINUOUS_MEMORY
  READ_FROM_CORE
  WRITE_TO_CORE
  CHECK_IF_CORE_FINISHED
- As mentioned before, the kernel driver doesn't perform the actual polling of the slave cores; that would be bad practice, since the driver runs in ring 0 and might block user applications.
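The ioctl dispatch can be sketched as a switch over the five commands. This is a simplified host-side model, not the project's driver: the command codes are assumed plain enum values (a real driver would define them with the _IO/_IOR/_IOW macros), and the per-command bodies are reduced to comments. The key point from the slide survives in the model: CHECK_IF_CORE_FINISHED is a single status read, and the repeated polling loop stays in user land.

```c
/* Hypothetical command codes; a real driver would build these with
 * the _IO/_IOR/_IOW macros in a header shared with user land. */
enum {
    GET_CONTINUOUS_MEMORY = 1,
    FREE_CONTINUOUS_MEMORY,
    READ_FROM_CORE,
    WRITE_TO_CORE,
    CHECK_IF_CORE_FINISHED,
};

static int core_done;   /* stand-in for the accelerator status register */

/* Simplified model of FPGA_drv_ioctl's dispatch. */
long fpga_ioctl(unsigned int cmd, unsigned long arg) {
    (void)arg;          /* real commands would decode arg */
    switch (cmd) {
    case GET_CONTINUOUS_MEMORY:  return 0;  /* allocate a contiguous chunk */
    case FREE_CONTINUOUS_MEMORY: return 0;
    case READ_FROM_CORE:         return 0;  /* copy out of core RAM */
    case WRITE_TO_CORE:          return 0;  /* copy into core RAM */
    case CHECK_IF_CORE_FINISHED: return core_done;  /* ONE status check */
    default:                     return -1; /* -ENOTTY in a real driver */
    }
}
```

Keeping the blocking loop out of ring 0 means a slow accelerator can never wedge the kernel path; the user-land driver simply reissues CHECK_IF_CORE_FINISHED at its own pace.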

18 Issues along the way
- Running Linux on the ML310 isn't easy.
- Xilinx SystemACE (CF controller) issues.
- An unstable Xilinx memory controller.
- The debug tools are poor.
- PPC emulators don't simulate the PPC405 well.

19 Future progress
Future projects:
- Building a compiler for the system, to simplify and automate preparation of the target code.
- Supporting data coherency in caches (perhaps running the platform as a HW base for a cache-coherency project).
- Improving the platform: switching to Altera (hopefully with a better memory controller).

20 Conclusions
- Code migration works.
- Simulators are a good tool for SW debug.
- Choosing the right simulator is important.
- Don't work with old/deprecated equipment.
- The Xilinx environment requires constant maintenance.
- Embedded debug tools are very useful.
- Visual step-by-step debugging solved most of the issues.
- Working in Xilinx EDK with predefined IPs speeds up generating a complex design, but imposes severe difficulties when trying to customize it.

21 System Flow – how to
1. The HW is loaded on the FPGA; Linux runs on the central PPC core; the accelerators are preloaded with the client software stub.
2. Linux finishes loading; log into busybox (using "--login").
3. The SW driver is loaded into memory (using the famous ./1 script).
4. Play a bit with the shell to test liveness (using "ls" or "cat 1").
5. Launch the demo application (./demo_app).
6. Select the desired functionality and smile.

