
1 Darmstadt, Germany - 11/07/2013

A Framework for Effective Exploitation of Partial Reconfiguration in Dataflow Computing

Riccardo Cattaneo*, Xinyu Niu†, Christian Pilato*, Tobias Becker†, Wayne Luk†, Marco D. Santambrogio*
* Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano
† Department of Computing, Imperial College London

ReCoSoC'13 - International Workshop on Reconfigurable Communication-centric Systems-on-Chip

2 Motivations

 The design of heterogeneous, reconfigurable systems is a complex task
   - Adequate computer-aided design (CAD) tools are required
 One of the foreseen predominant platforms of the future is the MPSoC
   - Many heterogeneous cores on a single chip
 Typically, we want to accelerate an application, or a class of applications, on the MPSoC
   - The starting point should be the application, not the architecture alone
 Decisions in the frontend phase can strongly affect the backend implementation
   - Iterative exploration is a practical requirement

This is an ongoing project at Politecnico di Milano to assist in the design of such complex systems

3 Contents

 Framework Overview
 Preliminary Results – Test Case
 Conclusions and Future Work

4 Framework Overview

 Inputs (single XML file):
   - Information about the target device
   - Application source files (.c) plus custom pragmas carrying additional information (e.g., task-level parallelism/kernels)
   - Architectural template to use
 Application Analysis
   - Task graph generation
   - Dataflow graph (DFG) generation (per function)
 High-Level Analysis
   - Estimates of resource consumption for each node (DFG-based)
 Mapping and Scheduling
   - Mapping and scheduling
   - Refinement of the architectural template
 Output: project files ready for synthesis with the backend tools

5 XML Exchange Format

 The entire project is contained in a single XML file
   - Architecture: components and their characteristics (e.g., reconfigurable regions), ...
   - Applications: source code files and profiling information
   - Library: task implementations with their characterization (time, resources, ...)
   - Partitions: task graph, mapping and scheduling, ...
 It allows a modular organization of the framework, as well as sharing of information among the different phases
 Specific details of the target platform are taken into account only in the final phase (interaction with the backend tools)

6 Task Graph Generation

 Application source code files are analyzed to extract the task graph
   - Profiling information can drive the generation of such solutions
 The task graph is then specified in the XML file as processing nodes connected by data transfers

Example task, annotated with a custom pragma (BPP, bytes per pixel, is assumed to be defined elsewhere in the application):

    #pragma omp task
    void threshold(unsigned char *original1, unsigned char *result,
                   unsigned char thresh, int *p) {
        int DIMH  = p[0];
        int minH1 = p[1]; int maxH1 = p[2];
        int minV1 = p[3]; int maxV1 = p[4];
        for (int v = minV1; v < maxV1; v++)
            for (int h = minH1; h < maxH1; h++) {
                if (original1[v*DIMH + h] > thresh) {
                    result[v*DIMH*BPP + h*BPP]     = 255;
                    result[v*DIMH*BPP + h*BPP + 1] = 255;
                    result[v*DIMH*BPP + h*BPP + 2] = 255;
                } else {
                    result[v*DIMH*BPP + h*BPP]     = 0;
                    result[v*DIMH*BPP + h*BPP + 1] = 0;
                    result[v*DIMH*BPP + h*BPP + 2] = 0;
                }
            }
    }

7 Library Generation: a collection of different implementations

 An LLVM-based compiler extracts the dataflow graph of each task
   - Estimation of the required resources (including bit-width analysis)
   - Possibility to interact with HLS tools to obtain more accurate results (trading design time for estimation accuracy)
 The generated implementations are then stored in the XML file as alternatives offered to the mapper and floorplacer

A joint effort between Politecnico di Milano and Imperial College London to integrate High-Level Analysis techniques into the toolchain

8 Mapping, Scheduling and Floorplacing

 We generate one or more configurations in which each task of the application is analyzed and assigned (via Mapping, Scheduling and Floorplanning - M/S/FP) to:
   - an available and admissible implementation
   - a component of the architecture (GPP, IP or reconfigurable region)
 This makes it possible to:
   - "share" implementations across different tasks (hardware sharing)
   - move a task implementation to another processing element at run time (task relocation)
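As a purely illustrative example (an assumption of this write-up, not the framework's actual data structures or API), the C sketch below shows how one such configuration could be represented: each task is bound to a library implementation, an architectural component and a scheduled start time, and two hardware tasks time-share the same reconfigurable region (hardware sharing). All task, implementation and unit names are hypothetical.

    /* Minimal, purely illustrative sketch (not the framework's actual data
     * structures): one M/S/FP configuration binds every task to a library
     * implementation, a component of the architecture, and a start time. */
    #include <stdio.h>

    typedef enum { GPP, FIXED_IP, RECONF_REGION } component_t;

    typedef struct {
        const char *task;    /* node of the task graph                      */
        const char *impl;    /* implementation chosen from the library      */
        component_t target;  /* kind of processing element it is mapped on  */
        int         unit;    /* which GPP/IP/region instance                */
        int         start;   /* scheduled start time (arbitrary units)      */
    } assignment_t;

    int main(void) {
        /* Two hardware tasks time-share reconfigurable region 0 (hardware
         * sharing); the region is reconfigured between their executions.  */
        assignment_t cfg[] = {
            { "scale",     "scale_par8",     RECONF_REGION, 0,  0 },
            { "blur",      "blur_sw",        GPP,           0, 25 },
            { "threshold", "threshold_par8", RECONF_REGION, 0, 40 },
        };
        for (size_t i = 0; i < sizeof cfg / sizeof cfg[0]; i++)
            printf("%-10s -> %-15s (unit %d, start %d)\n",
                   cfg[i].task, cfg[i].impl, cfg[i].unit, cfg[i].start);
        return 0;
    }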

9 Architecture Exploration

 During exploration, the target architecture can be refined by:
   - adding/removing processing elements (e.g., reconfigurable regions)
   - modifying their parameters
   - determining the proper interconnection topology
 This can iteratively affect:
   - mapping and scheduling: changes to the available computational resources (especially the number of reconfigurable regions)
   - floorplacing: resources may become scarcer or more plentiful depending on how many components must be floorplaced
 It allows a progressive, iterative refinement of the solution and a concurrent customization of both architecture and application
   - E.g., mapping and floorplacing can suggest which resources should be added

10 Supported Platforms

 Virtex-5 XC5VLX110T (embedded, XUPV5 board)
   - Two XCF32P Platform Flash PROMs (32 Mbit each)
   - SystemACE Compact Flash configuration controller
   - 64-bit wide 256 MB DDR2 small-outline DIMM (SODIMM)
 Maxeler MaxWorkstation (HPC system)
   - Intel i7-2600S @ 2.8 GHz, 16 GB RAM, 500 GB HDD
   - MAX3 dataflow engine (DFE): Virtex-6 SX475T FPGA, 24 GB memory
   - DFE connected to the CPU via PCI Express

[Block diagrams on the slide: XUPV5 with CPU0/CPU1, the reconfigurable area and 256 MB DDR2; MaxWorkstation with CPU and 16 GB DRAM connected over PCI Express to the MAX3 DFE (interface FPGA, compute FPGA, 24 GB DRAM).]

11 Backend Toolchains

 For each mapping configuration, the framework emits the source code for the CPU, the DFGs for the hardware tasks, and the mapping itself (.c and .xml files)
 FPGA-based embedded system: DFG-to-C translation followed by HLS (C to VHDL), or manual VHDL implementations; bitstream generation produces the .bit file, while the CPU compiler produces the executable
 MaxWorkstation: DFG-to-MaxJ translation, or manual MaxJ implementations, compiled with MaxIDE (MaxJ to VHDL); bitstream generation produces the DFE binary, while the CPU compiler produces the executable
 The code can always be further optimized by hand (e.g., glue code for data transfers)

12 Helper Graphical User Interface

 A practical GUI supports the designer, limits errors in the interaction with the XML, and allows custom design methodologies

13 Preliminary Results: Edge Detection

 Edge detection application: 4 stages of computation
   - Described in C plus custom #pragmas
   - Extracted task graph and corresponding DFG of the first stage (Scale, 1x parallelism)
 We generate 4 implementations, with different levels of parallelism and resource consumption, for each of the 4 tasks of the application
   - "parallelism X": X pixels processed at once (see the sketch below)
 Maxeler backend
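As a rough illustration of what "parallelism X" means for one of these tasks, the hedged C sketch below rewrites the threshold stage from slide 6 so that PAR pixels are handled per loop iteration; in the generated hardware each iteration of the inner body would become a parallel datapath. PAR, the function name and the simplified argument list are assumptions made for this example only.

    #define PAR 4   /* hypothetical parallelism factor: pixels handled at once */
    #define BPP 3   /* bytes per output pixel, as used by the slide-6 kernel   */

    /* Illustrative only: a "parallelism PAR" variant of the threshold stage. */
    void threshold_par(const unsigned char *in, unsigned char *out,
                       unsigned char thresh, int num_pixels) {
        for (int i = 0; i < num_pixels; i += PAR) {
            /* In hardware these PAR iterations run as parallel datapaths;
             * in software they are simply unrolled by the compiler.        */
            for (int k = 0; k < PAR && i + k < num_pixels; k++) {
                unsigned char v = (in[i + k] > thresh) ? 255 : 0;
                out[(i + k) * BPP + 0] = v;
                out[(i + k) * BPP + 1] = v;
                out[(i + k) * BPP + 2] = v;
            }
        }
    }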

14 Experimental Results / 1

 Static vs. reconfigurable design (both generated with the framework); the available area is limited to 10 kLUT and the best-performing design that fits is implemented. A static design at parallelism 8 would need 664 + 64 + 7680 + 7376 = 15784 LUTs, exceeding the budget, so the static design falls back to parallelism-4 implementations.

 Reconfigurable design (parallelism 8), two reconfigurable regions sharing the tasks (R0: S, B - R1: E, T):

   Task   Area occupation
   S      664
   B      64
   E      7680
   T      7376

   Region   Final area occupation
   R0       max(664, 64) = 664
   R1       max(7680, 7376) = 7680

   Total area consumption: 664 + 7680 = 8344

 Static design (parallelism 4), one dedicated IP per task (IP0: S, IP1: B, IP2: E, IP3: T):

   Task   Area occupation
   S      332
   B      32
   E      3840
   T      3688

   Total area consumption: 332 + 32 + 3840 + 3688 = 7892

15 Experimental Results / 2

 Reconfiguration time is automatically masked (when possible)
 Partial reconfiguration improves application performance via automatic resource multiplexing
   - Automatic thanks to the exploration of different schedules
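The masking condition can be made concrete with a tiny worked illustration (a simplification assumed for this write-up, not the framework's actual scheduler): a partial reconfiguration adds no latency as long as it finishes while another region is still computing; otherwise only the uncovered part of the reconfiguration time is exposed.

    /* Simplified illustration (not the framework's scheduler): the overhead
     * that a partial reconfiguration adds to the schedule, given how long
     * another region keeps computing in the meantime. Arbitrary time units. */
    #include <stdio.h>

    static int reconfig_overhead(int concurrent_compute_time, int reconfig_time) {
        return reconfig_time > concurrent_compute_time
                   ? reconfig_time - concurrent_compute_time  /* partially exposed */
                   : 0;                                       /* fully masked      */
    }

    int main(void) {
        printf("overhead = %d\n", reconfig_overhead(50, 30)); /* 0: fully masked */
        printf("overhead = %d\n", reconfig_overhead(20, 30)); /* 10: exposed     */
        return 0;
    }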

16 Experimental Results / 3

 HLA estimates are fairly accurate, considering that they are obtained in a matter of seconds on a commodity desktop machine
   - Values are averaged over the set of tasks
 Average accuracy is above 85%

17 Conclusions and Future Work

 We presented a modular framework for designing heterogeneous, reconfigurable systems
   - Easy to plug in alternative methods for each of the phases
   - Possibility to progressively refine both application and architecture
 Critical part: the multi-objective optimization strategy
   - Different experiments with different heuristics, or possibly different algorithms
   - Easy to plug in different components
 This is becoming part of a larger project (ASAP - Advanced Synthesis of Applications and Platforms)
   - SystemC TLM backend for (co-)simulation and early validation
   - More architectural templates
   - Closer interaction with actual synthesis (e.g., high-level synthesis)
   - Automated methodologies to accelerate the design

18 Thank you!

Riccardo Cattaneo - rcattaneo@elet.polimi.it

Research partially funded by the European Community's Seventh Framework Programme, FASTER project.

