A Profiler for a Multi-Core Multi-FPGA System, by Daniel Nunes. Supervisor: Professor Paul Chow. September 30th, 2008. University of Toronto, Electrical and Computer Engineering Department.


1 A Profiler for a Multi-Core Multi-FPGA System, by Daniel Nunes. Supervisor: Professor Paul Chow. September 30th, 2008. University of Toronto, Electrical and Computer Engineering Department.

2 Overview: Background; Profiling Model; The Profiler; Case Studies; Conclusions; Future Work

3 How Do We Program This System? Let's look at what traditional clusters use and try to port it to this type of machine. [Diagram: four User FPGAs and one Ctrl FPGA]

4 Traditional Clusters: MPI is a de facto standard for parallel HPC. MPI can also be used to program a cluster of FPGAs.

5 The TMD: A heterogeneous multi-core multi-FPGA system developed at UofT. Uses message passing (TMD-MPI).

6 TMD-MPI: A subset of the MPI standard. Keeps the application independent of the hardware. TMD-MPI functionality is also implemented in hardware (TMD-MPE).

7 TMD-MPI – Rendezvous Protocol: This implementation uses the Rendezvous protocol, a synchronous communication mode. [Diagram: message sequence Req. to Send, Acknowledge, Data]
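
The three-message sequence on this slide can be sketched as a small simulation. This is a hypothetical Python model for illustration only, not TMD-MPI code: the function name `rendezvous_send`, the `receiver_ready` flag, and the tuple layout of the trace are all assumptions.

```python
def rendezvous_send(payload, receiver_ready=True):
    """Simulate one rendezvous transfer and return the ordered message trace.

    Hypothetical model: each trace entry is (direction, message, detail).
    """
    trace = []
    # 1. The sender announces the transfer; no payload travels yet.
    trace.append(("sender->receiver", "REQ_TO_SEND", len(payload)))
    if not receiver_ready:
        # The sender stays blocked until the matching receive is posted.
        trace.append(("receiver", "WAIT_FOR_RECV", None))
    # 2. The receiver has a matching receive and acknowledges the request.
    trace.append(("receiver->sender", "ACK", None))
    # 3. Only after the acknowledge does the sender ship the data.
    trace.append(("sender->receiver", "DATA", payload))
    return trace


if __name__ == "__main__":
    for step in rendezvous_send([1, 2, 3], receiver_ready=False):
        print(step)
```

Because the data moves only after the acknowledge, the sender cannot complete before the receiver is ready, which is what makes the mode synchronous.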

8 The TMD Implementation on BEE2 Boards [Diagram labels: PPC, MB, NoC, User FPGA, Ctrl FPGA]

9 How Do We Profile This System? Let's look at how it is done in traditional clusters and try to adapt it to hardware.

10 MPICH - MPE: Collects information from MPI calls and user-defined states through embedded calls. Includes a tool to view all log files (Jumpshot).

11 Goals of This Work: Implement a hardware profiler capable of extracting the same data as the MPE. Make it less intrusive. Make it compatible with the API used by MPE. Make it compatible with Jumpshot.

12 Tracers: The Profiler interacts with the computation elements through tracers that register important events. TMD-MPE requires two tracers due to its parallel nature. [Diagram labels: PPC with Processor’s Computation Tracer; Engine with Engine’s Computation Tracer; TMD-MPE with Receive Tracer and Send Tracer]

13 Tracers - Hardware Engine [Diagram: Tracer for Hardware Engine; labels: Computation, MUX, R0, Cycle Counter, 32]

14 Tracers - TMD-MPE [Diagram: Tracer for TMD-MPE; labels: R0, R1, R2, R3, R4, MPE Data Reg, MUX, Cycle Counter, TMD-MPE, 32]

15 Tracers – Processors [Diagram: Tracer for PowerPC/MicroBlaze; labels: Computation, Register Bank (9 x 32 bits), Register Bank (5 x 32 bits), MUX, Stack, MPI Calls States, User Defined States, Cycle Counter, PPC, 32]
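
As a rough software analogue of a tracer that pairs a stack of open states with a 32-bit cycle counter, the sketch below timestamps state starts and ends and tolerates one counter wraparound. It is an assumption-laden illustration, not the hardware design: the class `StateTracer`, its methods, and the record layout are hypothetical.

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1


class StateTracer:
    """Hypothetical model of a tracer: nested states timed by a 32-bit counter."""

    def __init__(self):
        self.stack = []    # open states: (state_id, start_cycle)
        self.records = []  # closed states: (state_id, start_cycle, duration)

    def start(self, state_id, cycle):
        # Entering a state pushes it with the current counter value.
        self.stack.append((state_id, cycle & COUNTER_MASK))

    def end(self, cycle):
        # Leaving a state pops the most recently opened one.
        state_id, start = self.stack.pop()
        # Modular subtraction stays correct across one counter wraparound.
        duration = ((cycle & COUNTER_MASK) - start) & COUNTER_MASK
        self.records.append((state_id, start, duration))
        return duration


if __name__ == "__main__":
    tracer = StateTracer()
    tracer.start("MPI_Send", 0xFFFFFFF0)
    tracer.start("user_state", 0xFFFFFFF8)
    print(tracer.end(0x00000004))  # inner state, counter wrapped
    print(tracer.end(0x00000010))  # outer state
```

The stack lets a user-defined state enclose an MPI call (or the reverse) while still producing one record per state.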

16 Profiler’s Network [Diagram labels: Tracer ... Tracer, Gather (User FPGA); Collector, DDR (Control FPGA)]

17 Synchronization: Within the same board, the reset of the cycle counters is released simultaneously. Between boards, messages are exchanged periodically between the root board and all other boards.
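
The slide does not say how the exchanged messages are turned into a correction. One common approach, shown here purely as an assumption, is a round-trip estimate that treats the link delay as symmetric (the same idea used by NTP). The function names and the four-timestamp interface are hypothetical.

```python
def estimate_offset(t_send_root, t_recv_remote, t_send_remote, t_recv_root):
    """Estimate how far a remote cycle counter is ahead of the root counter.

    Assumption: the one-way delay is the same in both directions.
    t_send_root, t_recv_root   : root counter values (request sent, reply received)
    t_recv_remote, t_send_remote: remote counter values (request received, reply sent)
    Returns (offset, round_trip), both in cycles.
    """
    round_trip = (t_recv_root - t_send_root) - (t_send_remote - t_recv_remote)
    offset = ((t_recv_remote - t_send_root) + (t_send_remote - t_recv_root)) / 2
    return offset, round_trip


def to_root_time(remote_cycle, offset):
    """Map a remote timestamp onto the root board's time base."""
    return remote_cycle - offset


if __name__ == "__main__":
    # Remote counter runs 1000 cycles ahead; one-way delay is 20 cycles.
    offset, rtt = estimate_offset(100, 1120, 1125, 145)
    print(offset, rtt, to_root_time(1120, offset))
```

Repeating the exchange periodically, as the slide describes, would let the offset be refreshed as the counters drift.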

18 Profiler’s Flow: Collect Data, Dump to Host, Convert to CLOG2, Convert to SLOG2, Visualize with Jumpshot. [Diagram labels: After Execution, Back End, Front End]

19 Case Studies: Barrier (Sequential vs Binary Tree). TMD-MPE Unexpected Message Queue (Unexpected Message Queue addressable by rank). The Heat Equation (Blocking Calls vs Non-Blocking Calls). LINPACK Benchmark (16-node system calculating an LU decomposition of a matrix).

20 Barrier: Synchronization call. No node will advance until all nodes have reached the barrier. [Diagram: two arrangements of nodes 0 to 7]

21 Barrier Implemented Sequentially [Figure legend: Send, Receive]

22 Barrier Implemented as a Binary Tree [Figure legend: Send, Receive]
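
To make the contrast between the two barrier implementations concrete, the sketch below counts how many messages the busiest node must handle in each scheme. It is a simplified model under stated assumptions: rank 0 is the root, the tree uses heap-style numbering (children of rank r are 2r+1 and 2r+2), and every link carries one arrival message and one release message. None of these details are given in the slides.

```python
def tree_children(rank, size):
    """Children of `rank` in a heap-numbered binary tree of `size` nodes (assumed layout)."""
    return [c for c in (2 * rank + 1, 2 * rank + 2) if c < size]


def busiest_node_sequential(size):
    # Rank 0 receives one arrival from every other rank, then sends one release to each.
    return 2 * (size - 1)


def busiest_node_tree(size):
    worst = 0
    for rank in range(size):
        links = len(tree_children(rank, size)) + (1 if rank > 0 else 0)
        # One message per link on the way up (arrival) and one on the way down (release).
        worst = max(worst, 2 * links)
    return worst


def tree_depth(size):
    """Number of hops from the highest rank up to the root."""
    depth, rank = 0, size - 1
    while rank > 0:
        rank = (rank - 1) // 2
        depth += 1
    return depth


if __name__ == "__main__":
    for n in (8, 16):
        print(n, busiest_node_sequential(n), busiest_node_tree(n), tree_depth(n))
```

In this model the sequential root's load grows linearly with the node count, while no tree node ever handles more than six messages and the critical path grows with the tree depth.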

23 TMD-MPE – Unexpected Messages Queue: All requests to send that arrive at a node before it issues an MPI_RECV are kept in this queue.
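
The case study compares a queue that must be searched against one that is addressable by rank. The sketch below contrasts the two as data structures. It is a simplified, hypothetical model: matching is done on source rank only (real MPI matching also considers tags and wildcards), and the class and method names are invented for illustration.

```python
from collections import deque


class LinearUnexpectedQueue:
    """Single queue in arrival order; a receive must search it."""

    def __init__(self):
        self.entries = []  # (source_rank, tag)

    def push(self, source, tag):
        self.entries.append((source, tag))

    def match(self, source):
        """Return (tag, entries_inspected) for the oldest request from `source`."""
        for i, (src, tag) in enumerate(self.entries):
            if src == source:
                # Removing from the middle shifts later entries: the reorganization cost.
                del self.entries[i]
                return tag, i + 1
        return None, len(self.entries)


class RankIndexedQueue:
    """One slot per source rank; a receive goes straight to its slot."""

    def __init__(self, size):
        self.slots = [deque() for _ in range(size)]

    def push(self, source, tag):
        self.slots[source].append(tag)

    def match(self, source):
        slot = self.slots[source]
        return (slot.popleft(), 1) if slot else (None, 1)


if __name__ == "__main__":
    linear, indexed = LinearUnexpectedQueue(), RankIndexedQueue(8)
    for src in range(7):
        linear.push(src, 100 + src)
        indexed.push(src, 100 + src)
    print(linear.match(6), indexed.match(6))
```

With seven pending requests, the linear queue inspects all seven entries to find the one from rank 6, while the rank-indexed version inspects one slot regardless of how many requests are pending.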

24 TMD-MPE – Unexpected Messages Queue [Figure legend: Send, Receive, Queue Search and Reorganization]

25 TMD-MPE – Unexpected Messages Queue [Figure legend: Send, Receive, Queue Search and Reorganization]

26 TMD-MPE – Unexpected Messages Queue [Figure legend: Send, Receive]

27 The Heat Equation Application: A partial differential equation that describes the change in temperature over time.
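
The slides do not give the discretization used in the application. As an illustration only, the sketch below applies one explicit finite-difference step to the one-dimensional heat equation, with fixed end temperatures. The function name `heat_step` and the parameter `r` (defined as alpha * dt / dx**2) are assumptions.

```python
def heat_step(u, r):
    """Advance a 1D temperature profile by one explicit time step.

    u : list of temperatures at equally spaced points
    r : alpha * dt / dx**2; this explicit scheme is stable for r <= 0.5
    The two end points are held fixed (boundary conditions).
    """
    new = u[:]
    for i in range(1, len(u) - 1):
        # Each interior point moves toward the average of its neighbours.
        new[i] = u[i] + r * (u[i - 1] - 2 * u[i] + u[i + 1])
    return new


if __name__ == "__main__":
    profile = [0.0, 0.0, 100.0, 0.0, 0.0]
    for _ in range(3):
        profile = heat_step(profile, 0.25)
        print(profile)
```

Each point depends only on its immediate neighbours, so a parallel version would typically give each rank a block of points and exchange just the block edges every step. That exchange is where the choice between blocking and non-blocking calls, named in the case study list, would come into play.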

28 The Heat Equation Application

29 [Figure legend: Send, Receive, Computation]

30 The Heat Equation Application [Figure legend: Send, Receive, Computation]

31 The LINPACK Benchmark: Solves a system of linear equations using LU factorization with partial pivoting.
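
For readers unfamiliar with the method, here is a minimal serial sketch of LU factorization with partial pivoting and the solve that follows it. It is plain Python for illustration, not the benchmark code run on the TMD; the function names `lu_partial_pivot` and `solve` are assumptions.

```python
def lu_partial_pivot(a):
    """Factor a square matrix (list of rows) without modifying the input.

    Returns (lu, perm): lu holds U on and above the diagonal and the
    multipliers of L below it; perm[i] is the original index of row i.
    """
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: pick the row with the largest entry in column k.
        pivot = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[pivot][k] == 0:
            raise ValueError("matrix is singular")
        if pivot != k:
            lu[k], lu[pivot] = lu[pivot], lu[k]
            perm[k], perm[pivot] = perm[pivot], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]  # multiplier stored in place of the zeroed entry
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm


def solve(a, b):
    """Solve a x = b using the factorization above."""
    lu, perm = lu_partial_pivot(a)
    n = len(b)
    y = [0.0] * n
    for i in range(n):  # forward substitution with the permuted right-hand side
        y[i] = b[perm[i]] - sum(lu[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (y[i] - sum(lu[i][j] * x[j] for j in range(i + 1, n))) / lu[i][i]
    return x


if __name__ == "__main__":
    print(solve([[2.0, 1.0], [4.0, 3.0]], [3.0, 7.0]))
```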

32 The LINPACK Benchmark [Diagram: indices 0, 1, 2, 3, 4, 5, ..., n-3, n-2, n-1, each assigned to Rank 0, Rank 1, or Rank 2]

33 The LINPACK Benchmark [Figure legend: Send, Receive, Computation]

34 The LINPACK Benchmark [Figure legend: Send, Receive, Computation]

35 Profiler’s Overhead

Block                                       LUTs        Flip-Flops   BRAMs
Collector                                   3856 (5%)   1279 (1%)    0 (0%)
Gather                                      187 (0%)    53 (0%)      0 (0%)
Engine Computation Tracer                   396 (0%)    701 (1%)     0 (0%)
TMD-MPE Tracer                              526 (0%)    1000 (1%)    0 (0%)
Processors Computation Tracer without MPE   1196 (1%)   1521 (2%)    0 (0%)
Processors Computation Tracer with MPE      855 (1%)    1200 (1%)    0 (0%)

36 Conclusions: All major features of the MPE were implemented. The profiler was successfully used to study the behavior of the applications. Less intrusive. More events available to profile. Can profile network components. Compatible with existing profiling software environments.

37 Future Work: Reduce the footprint of the profiler’s hardware blocks. Profile the MicroBlaze and PowerPC in a non-intrusive way. Allow real-time profiling.

38 Thank You (Questions?)

39 The TMD (2) [Diagram labels: Off-Chip Communications Node, Computation Node, PPC, Hardware Engine, TMD-MPE, FSL, Network Interface, InterChip, XAUI, Network On-chip]

40 Profiler (2) [Diagram: Processor Profiler Architecture and Engine Profiler Architecture; labels: PPC, PLB, DCR, DCR2FSL Bridge, GPIO, TMD-MPE, Tracer RX, Tracer TX, Tracer Comp, To Gather, From Cycle Counter]

41 Profiler (1) [Diagram labels: Board 0, Board N, Control FPGA, User FPGA 1, User FPGA 4, XAUI, Switch, PPC, μB, Collector, Gather, IC, DDR, Cycle Counter, Network On-chip]

42 Profiler (2) [Diagram: Processor Profiler Architecture and Engine Profiler Architecture; labels: PPC, PLB, DCR, DCR2FSL Bridge, GPIO, TMD-MPE, Tracer RX, Tracer TX, Tracer Comp, To Gather, From Cycle Counter]

43 Hardware Profiling Benefits: Less intrusive. More events available to profile. Can profile network components. Compatible with existing profiling software environments.

44 MPE PROTOCOL

