DARMSTADT, GERMANY - 11/07/2013 A Framework for Effective Exploitation of Partial Reconfiguration in Dataflow Computing Riccardo Cattaneo∗, Xinyu Niu†,

Presentation transcript:

DARMSTADT, GERMANY - 11/07/2013
A Framework for Effective Exploitation of Partial Reconfiguration in Dataflow Computing
Riccardo Cattaneo∗, Xinyu Niu†, Christian Pilato∗, Tobias Becker†, Wayne Luk†, Marco D. Santambrogio∗
∗ Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano
† Department of Computing, Imperial College London
ReCoSoC13: International Workshop on Reconfigurable Communication-centric Systems-on-Chip

Motivations
• The design of heterogeneous, reconfigurable systems is a complex task
  - Adequate computer-aided design (CAD) tools are required
• One of the foreseen predominant platforms of the future is the MPSoC
  - Many heterogeneous cores on a single chip
• Typically, we want to accelerate an application, or a class of applications, on the MPSoC
  - The starting point should be the application, not the architecture alone
• Decisions in the frontend phase may strongly affect the backend implementation
  - Iterative exploration is a practical requirement
This is an ongoing project at Politecnico di Milano to assist in the design of such complex systems

Contents
• Framework Overview
• Preliminary Results: Test Case
• Conclusions and Future Work

Framework Overview
• Inputs (single XML file):
  - Information about the target device
  - Application source files (.c) plus custom pragmas for additional information (e.g., task-level parallelism/kernels)
  - Architectural template to use
• Application Analysis
  - Task graph generation
  - Dataflow graph (DFG) generation (per function)
• High Level Analysis
  - Estimates of resource consumption for each node (DFG based)
• Mapping and Scheduling
  - Mapping, scheduling
  - Refinement of the architectural template
• Output
  - Project files ready for synthesis with back-end tools

XML Exchange Format
• The entire project is contained inside an XML file
  - Architecture: components' characteristics (e.g., reconfigurable regions), …
  - Applications: source code files and profiling information
  - Library: task implementations with their characterization (time, resources, ...)
  - Partitions: task graph, mapping and scheduling, …
• It allows a modular organization of the framework, and also the sharing of information among the different phases
• Specific details of the target platform are taken into account only in the final phase (interaction with backend tools)

Task Graph Generation
• Application source code files can be analyzed to extract the task graphs
  - Profiling information can drive the generation of such solutions
• The task graph is then specified in the XML file as processing nodes connected by data transfers

#define BPP 3  /* bytes per pixel: assumed value, not defined on the slide */

#pragma omp task
void threshold(unsigned char *original1, unsigned char *result,
               unsigned char thresh, int *p) {
    /* p packs the image width and the bounds of the tile to process */
    int DIMH = p[0];
    int minH1 = p[1];
    int maxH1 = p[2];
    int minV1 = p[3];
    int maxV1 = p[4];
    int v, h;
    for (v = minV1; v < maxV1; v++)
        for (h = minH1; h < maxH1; h++) {
            if (original1[v*DIMH+h] > thresh) {
                /* above the threshold: white pixel on all three channels */
                result[v*DIMH*BPP+h*BPP] = 255;
                result[v*DIMH*BPP+h*BPP+1] = 255;
                result[v*DIMH*BPP+h*BPP+2] = 255;
            } else {
                /* at or below the threshold: black pixel */
                result[v*DIMH*BPP+h*BPP] = 0;
                result[v*DIMH*BPP+h*BPP+1] = 0;
                result[v*DIMH*BPP+h*BPP+2] = 0;
            }
        }
}
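The idea of a task graph made of processing nodes connected by data transfers can be pictured with a small C data structure. This is only a minimal sketch under stated assumptions: the type names (`Task`, `Transfer`, `TaskGraph`), the fixed capacities, and the helper functions are hypothetical and are not taken from the framework or from its XML format.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 16       /* illustrative capacity, not a framework limit */
#define MAX_TRANSFERS 32   /* illustrative capacity, not a framework limit */

/* Hypothetical types: they do not mirror the framework's XML schema. */
typedef struct {
    const char *name;      /* e.g. the C function implementing the task */
} Task;

typedef struct {
    int src, dst;          /* indices into TaskGraph.tasks */
    size_t bytes;          /* amount of data moved from src to dst */
} Transfer;

typedef struct {
    Task tasks[MAX_TASKS];
    Transfer transfers[MAX_TRANSFERS];
    int n_tasks, n_transfers;
} TaskGraph;

/* Adds a processing node; returns its index, or -1 if the graph is full. */
int tg_add_task(TaskGraph *g, const char *name) {
    assert(g != NULL);
    if (g->n_tasks >= MAX_TASKS) return -1;
    g->tasks[g->n_tasks].name = name;
    return g->n_tasks++;
}

/* Adds a data transfer src -> dst; returns 0 on success, -1 on bad input. */
int tg_add_transfer(TaskGraph *g, int src, int dst, size_t bytes) {
    assert(g != NULL);
    if (src < 0 || dst < 0 || src >= g->n_tasks || dst >= g->n_tasks) return -1;
    if (g->n_transfers >= MAX_TRANSFERS) return -1;
    g->transfers[g->n_transfers].src = src;
    g->transfers[g->n_transfers].dst = dst;
    g->transfers[g->n_transfers].bytes = bytes;
    g->n_transfers++;
    return 0;
}

/* Number of incoming transfers; a task with none has no predecessors. */
int tg_in_degree(const TaskGraph *g, int task) {
    int count = 0;
    for (int i = 0; i < g->n_transfers; i++)
        if (g->transfers[i].dst == task) count++;
    return count;
}
```

A zero-initialized `TaskGraph` is an empty graph. Adding two tasks and one transfer between them gives the second task an in-degree of 1, which is the kind of dependency information a scheduler would consume.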

Library Generation: a collection of different implementations
• LLVM-based compiler to extract the dataflow graph of each task
  - Estimation of required resources (including bit-width analysis)
  - Possibility to interact with HLS tools to obtain more accurate results (trading off design time against estimation accuracy)
• Generated implementations are then stored in the XML file to offer opportunities to the mapper and floorplacer
Joint effort of Politecnico di Milano and Imperial College London to integrate High Level Analysis techniques into the toolchain

Mapping, Scheduling and Floorplacing
• We generate one or more configurations in which each task of the application is analyzed and assigned (via Mapping, Scheduling and Floorplanning, M/S/FP) to
  - An available and admissible implementation
  - A component of the architecture (GPP, IP or reconfigurable region)
• This makes it possible to
  - "share" implementations across different tasks (hardware sharing)
  - move a task implementation to another processing element at run-time (task relocation)
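The assignment of each task to an implementation and to a component can be sketched as a small table of records. The code below is an illustration under assumptions, not the framework's mapper: the `Assignment` type, the integer identifiers, and both helper functions are hypothetical. It also assumes that a region needs a reconfiguration only when two consecutive tasks scheduled on it use different implementations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: one entry per task, listed in schedule order. */
typedef struct {
    int task;        /* task index in the task graph */
    int impl;        /* implementation chosen from the library */
    int component;   /* target component (GPP, IP or reconfigurable region) */
} Assignment;

/* Hardware sharing: two tasks use the same implementation on the same component. */
int shares_hardware(const Assignment *a, const Assignment *b) {
    assert(a != NULL && b != NULL);
    return a->impl == b->impl && a->component == b->component;
}

/* Counts implementation switches on one region, walking the schedule in order.
 * The initial configuration of the region is not counted (an assumption). */
int count_reconfigurations(const Assignment *schedule, int n, int region) {
    int last_impl = -1;   /* -1: nothing loaded on the region yet */
    int count = 0;
    assert(schedule != NULL || n == 0);
    for (int i = 0; i < n; i++) {
        if (schedule[i].component != region) continue;
        if (last_impl != -1 && schedule[i].impl != last_impl) count++;
        last_impl = schedule[i].impl;
    }
    return count;
}
```

With three tasks on region 0 using implementations 0, 0 and 1, the first two share hardware and the region is reconfigured once, before the third task.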

Architecture Exploration
• During exploration, the target architecture can be refined
  - Adding/removing processing elements (reconfigurable regions)
  - Modifying their parameters
  - Determining the proper interconnection topology
• The refinement can iteratively affect:
  - mapping and scheduling: modification of the computational resources (especially the number of reconfigurable regions)
  - floorplacing: resources might become scarcer or more available depending on how many components have to be floorplaced
• It allows a progressive and iterative refinement of the solution and a concurrent customization of both architecture and application
  - E.g., mapping and floorplacing can suggest which resources should be added

Supported Platforms
• Virtex-5 XC5VLX110T (embedded)
  - Two XCF32P Platform Flash PROMs (32Mbyte each)
  - SystemACE™ Compact Flash configuration controller
  - 64-bit wide 256Mbyte DDR2 small outline DIMM (SODIMM)
• Maxeler MaxWorkstation (HPC system)
  - Intel i7, 16GB RAM, 500GB HDD
  - Max3 dataflow engine (DFE): Virtex 6 SX475T FPGA, 24GB memory
  - DFE connected to CPU via PCI Express
[Figure: block diagrams of the two platforms. XUPV5: CPU0, CPU1, reconfigurable area, DDR2 (256MB). MaxWorkstation: CPU with DRAM (16GB); MAX3 DFE with interface FPGA, compute FPGA, DRAM (24GB).]

Backend Toolchains
[Figure: backend tool flow. Inputs: .c and .xml. Generated artifacts: source code for the CPU, DFGs for the HW tasks, mapping configurations. CPU path: CPU compiler, exec bin. FPGA-based embedded system path: DFG-C, HLS (C-VHDL), manual VHDL implementations, bitstream generation, bit. MaxWorkstation path: DFG-MaxJ, MaxIDE, HLS (MaxJ-VHDL), manual MaxJ implementations, bitstream generation.]
The code can always be further optimized by hand, e.g., the glue code for data transfers

Helper Graphical User Interface
• Practical GUI that supports the designer, limits errors when interacting with the XML, and allows custom design methodologies

Preliminary results: edge detection
• Edge detection application: 4 stages of computation
  - Description based on C plus custom #pragmas
  - [Figure: extracted task graph and corresponding DFG of the first stage (Scale, 1x parallelism)]
• We generate 4 implementations, with different levels of parallelism and resource consumption, for each of the 4 tasks of the application
  - "Parallelism X": X pixels processed at once
• Maxeler backend

Experimental Results / 1
• Static vs reconfigurable design (both extracted using the framework)
• We limit the available area to 10 kLUT and implement the best-performing design

Reconfigurable (parallelism 8)
Region assignment as labelled on the slide: R0: S,T  R1: B,E

Task name   Area occupation
S           664
B           64
E           7680
T           7376

Region name   Final area occupation
R0            max(664, 64) = 664
R1            max(7680, 7376) = 7680
Total area consumption = 8344

Static (parallelism 4)
IP0: S  IP1: B  IP2: E  IP3: T

Task name   Area occupation
S           332
B           32
E           3840
T           3688
Total area consumption = 7876
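The area figures for the reconfigurable design follow a simple rule: a region must be as large as the largest task time-multiplexed onto it, and the design total is the sum over regions. The sketch below reproduces that arithmetic with the numbers from the slide. The function names are hypothetical, and reading "10klut" as a budget of 10000 LUTs is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Area of a reconfigurable region: the largest of the tasks that share it. */
int region_area(const int *task_areas, int n) {
    assert(task_areas != NULL && n > 0);
    int largest = task_areas[0];
    for (int i = 1; i < n; i++)
        if (task_areas[i] > largest) largest = task_areas[i];
    return largest;
}

/* Total area: plain sum over regions (or over IP cores in a static design). */
int total_area(const int *areas, int n) {
    assert(areas != NULL && n > 0);
    int sum = 0;
    for (int i = 0; i < n; i++) sum += areas[i];
    return sum;
}

/* Returns 1 when the design fits the area budget, 0 otherwise. */
int fits_budget(int total, int budget) {
    return total <= budget;
}
```

Feeding in the slide's pairs (664, 64) and (7680, 7376) gives regions of 664 and 7680 and a total of 8344, which fits the assumed 10000-LUT budget. Summing the same four parallelism-8 areas as separate static cores gives 15784, which would not fit, consistent with the static design being implemented at parallelism 4.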

Experimental Results / 2
• Reconfiguration time is automatically masked (when possible)
• Partial reconfiguration improves the performance of the application via automatic resource multiplexing
  - Automatic because different schedules are explored
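One way to read "masked when possible" is that a region is reconfigured while the next task is still waiting for its input data, so the reconfiguration adds no delay. The sketch below models only that reading and is not the framework's scheduler: the function names, the time units, and the assumption that reconfiguration starts as soon as the region becomes free are all hypothetical.

```c
#include <assert.h>

static long max_long(long a, long b) { return a > b ? a : b; }

/* Start time of the next task on a region that must be reconfigured first.
 * region_free: time at which the previous task on the region finishes.
 * reconf_time: duration of the partial reconfiguration.
 * data_ready:  time at which the next task's inputs are available. */
long start_with_reconfiguration(long region_free, long reconf_time, long data_ready) {
    assert(reconf_time >= 0);
    return max_long(region_free + reconf_time, data_ready);
}

/* Delay caused by reconfiguration, compared with a region that is already
 * configured. Zero means the reconfiguration is fully masked. */
long exposed_overhead(long region_free, long reconf_time, long data_ready) {
    long start_without = max_long(region_free, data_ready);
    return start_with_reconfiguration(region_free, reconf_time, data_ready) - start_without;
}
```

If the region is free at time 100, reconfiguration takes 50, and the data arrives at 200, the overhead is 0 (fully masked). If the data arrives at 120 instead, 30 time units of the reconfiguration remain exposed.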

Experimental Results / 3
• HLA estimates are fairly accurate, given that they are extracted in a matter of seconds on a commodity desktop machine
  - Average values over the set of tasks
• Average accuracy is > 85%

Conclusions and Future Work
• We presented a modular framework to design heterogeneous, reconfigurable systems
  - Easy to plug in alternative methods for each of the phases
  - Possibility to perform progressive refinement of both application and architecture
• Critical part: the multi-objective optimization strategy
  - Different experiments with different heuristics, or possibly different algorithms
  - Easy to plug in different components
• This is becoming part of a larger project (ASAP: Advanced Synthesis of Applications and Platforms)
  - SystemC TLM backend for (co-)simulation and early validation
  - More architectural templates
  - Closer interaction with actual synthesis (e.g., high-level synthesis)
  - Automated methodologies to accelerate the design

Thank you!
Riccardo Cattaneo
Research partially funded by the European Community's Seventh Framework Programme, FASTER project.