A Profiler for a Multi-Core Multi-FPGA System by Daniel Nunes Supervisor: Professor Paul Chow September 30th, 2008 University of Toronto Electrical and Computer Engineering Department

Overview
Background
Profiling Model
The Profiler
Case Studies
Conclusions
Future Work

How Do We Program This System?
Let's look at what traditional clusters use and try to port it to this type of machine
[Diagram: four User FPGAs and a Ctrl FPGA]

Traditional Clusters
MPI is a de facto standard for parallel HPC
MPI can also be used to program a cluster of FPGAs

The TMD
Heterogeneous multi-core multi-FPGA system developed at UofT
Uses message passing (TMD-MPI)

TMD-MPI
Subset of the MPI standard
Makes the application independent of the underlying hardware
TMD-MPI functionality is also implemented in hardware (TMD-MPE)
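As a rough illustration (a sketch only; the exact call set supported by TMD-MPI is not listed on these slides), the same basic point-to-point MPI code can target either a conventional cluster or the TMD, with TMD-MPI supplying the library underneath:

    /* Minimal point-to-point exchange using core MPI calls of the kind
     * TMD-MPI implements a subset of.  Sketch only; the headers and exact
     * call set available on the TMD may differ. */
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);    /* to rank 1   */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                           /* from rank 0 */
        }

        MPI_Finalize();
        return 0;
    }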

TMD-MPI – Rendezvous Protocol
This implementation uses the Rendezvous protocol, a synchronous communication mode
[Diagram: Request to Send → Acknowledge → Data]
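A minimal sketch of the handshake from the sender's side (the helper names are hypothetical placeholders; the real logic lives inside the TMD-MPE hardware):

    /* Sender side of the Rendezvous protocol: request, acknowledge, data.
     * send_packet() and wait_packet() are hypothetical stand-ins for the
     * network-interface operations. */
    enum pkt_type { REQ_TO_SEND, ACKNOWLEDGE, DATA };

    void send_packet(int dest, enum pkt_type type, const void *buf, int len);
    void wait_packet(int src, enum pkt_type type);

    void rendezvous_send(int dest, const void *buf, int len)
    {
        send_packet(dest, REQ_TO_SEND, NULL, 0);  /* 1. announce the message            */
        wait_packet(dest, ACKNOWLEDGE);           /* 2. block until the receiver is ready */
        send_packet(dest, DATA, buf, len);        /* 3. transfer the payload            */
    }

Because the sender blocks until the acknowledge arrives, time spent waiting in step 2 shows up directly in the profiler's timelines.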

The TMD Implementation on BEE2 Boards
[Diagram: PPC and MicroBlaze nodes connected by an on-chip network (NoC) on the User FPGAs and the Ctrl FPGA]

How Do We Profile This System?
Let's look at how it is done in traditional clusters and try to adapt it to hardware

MPICH - MPE
Collects information from MPI calls and user-defined states through embedded calls
Includes a tool to view all log files (Jumpshot)
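For context, logging a user-defined state with MPE looks roughly like this (standard MPE logging calls; exact signatures can vary between MPE releases, so treat this as a sketch):

    #include <mpi.h>
    #include <mpe.h>

    /* Log a user-defined "compute" state around a work loop.  MPI calls
     * themselves are logged automatically when the program is linked
     * against the MPE logging library. */
    int main(int argc, char *argv[])
    {
        int ev_start, ev_end;

        MPI_Init(&argc, &argv);
        MPE_Init_log();

        ev_start = MPE_Log_get_event_number();
        ev_end   = MPE_Log_get_event_number();
        MPE_Describe_state(ev_start, ev_end, "compute", "red");

        MPE_Log_event(ev_start, 0, "start");
        /* ... computation being profiled ... */
        MPE_Log_event(ev_end, 0, "end");

        MPE_Finish_log("run");    /* writes a log file (e.g. run.clog2) that Jumpshot can open */
        MPI_Finalize();
        return 0;
    }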

Goals Of This Work
Implement a hardware profiler capable of extracting the same data as the MPE
Make it less intrusive
Make it compatible with the API used by MPE
Make it compatible with Jumpshot

Tracers
The Profiler interacts with the computation elements through tracers that register important events
TMD-MPE requires two tracers (send and receive) due to its parallel nature
[Diagram: computation tracers for the PPC processors and hardware engines, and send/receive tracers attached to each TMD-MPE]

Tracers - Hardware Engine
[Diagram: tracer for a hardware engine, with the engine's computation signal, a MUX, register R0, and a 32-bit cycle counter]

Tracers - TMD-MPE
[Diagram: tracer for the TMD-MPE, with registers R0–R4, an MPE data register, a MUX, and a 32-bit cycle counter]

Tracers – Processors Computation
[Diagram: tracer for the PowerPC/MicroBlaze, with two register banks (9 x 32 bits and 5 x 32 bits) for MPI call states and user-defined states, a stack, a MUX, and a 32-bit cycle counter]

Profiler’s Network
[Diagram: tracers feed a Gather block on each User FPGA; the Gather blocks forward their data to the Collector and DDR on the Control FPGA]

Synchronization
Synchronization within the same board: release the reset of the cycle counters simultaneously
Synchronization between boards: periodic exchange of messages between the root board and all other boards
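One plausible realization of that message exchange, sketched under assumed details (the slides do not specify the offset-estimation scheme, and the helper functions below are hypothetical), is a ping-pong measurement that lets the root board estimate each remote board's cycle-counter offset:

    /* Root-side offset estimation for one remote board (sketch).
     * read_cycle_counter(), send_word() and recv_word() stand in for the
     * hardware cycle counter and the board-to-board link. */
    #define SYNC_REQUEST 0x1

    long long read_cycle_counter(void);
    void      send_word(int board, long long word);
    long long recv_word(int board);

    long long estimate_offset(int board)
    {
        long long t0, t1, t_remote, rtt;

        t0 = read_cycle_counter();
        send_word(board, SYNC_REQUEST);     /* root -> remote board        */
        t_remote = recv_word(board);        /* remote's counter value      */
        t1 = read_cycle_counter();

        rtt = t1 - t0;                      /* round trip, in root cycles  */
        /* Assume the remote sample was taken halfway through the round trip. */
        return t_remote - (t0 + rtt / 2);
    }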

Profiler’s Flow
Collect Data → Dump to Host → Convert to CLOG2 → Convert to SLOG2 → Visualize with Jumpshot
(the back end collects the data; the front-end conversions and Jumpshot visualization happen after execution)

Case Studies
Barrier: Sequential vs Binary Tree
TMD-MPE Unexpected Message Queue: queue addressable by rank
The Heat Equation: Blocking Calls vs Non-Blocking Calls
LINPACK Benchmark: 16-node system calculating an LU decomposition of a matrix

Barrier Synchronization call – No node will advance until all nodes have reached the barrier

Barrier Implemented Sequentially
[Jumpshot timeline: Send and Receive events]

Barrier Implemented as a Binary Tree
[Jumpshot timeline: Send and Receive events]
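In MPI terms the two variants compared here might look as follows (a generic sketch, not the TMD's actual barrier code):

    #include <mpi.h>

    /* Sequential barrier: every rank reports to rank 0, then rank 0 releases
     * the ranks one by one, so O(N) messages are serialized at the root. */
    void barrier_sequential(int rank, int size)
    {
        int token = 0, i;

        if (rank == 0) {
            for (i = 1; i < size; i++)
                MPI_Recv(&token, 1, MPI_INT, i, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            for (i = 1; i < size; i++)
                MPI_Send(&token, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
        } else {
            MPI_Send(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }

    /* Binary-tree barrier: fan-in to rank 0, then fan-out back down, giving
     * O(log N) steps on the critical path. */
    void barrier_tree(int rank, int size)
    {
        int token = 0, mask = 1;

        /* Gather: receive from children, then report to the parent. */
        while (mask < size) {
            if (rank & mask) {
                MPI_Send(&token, 1, MPI_INT, rank - mask, 0, MPI_COMM_WORLD);
                break;
            }
            if (rank + mask < size)
                MPI_Recv(&token, 1, MPI_INT, rank + mask, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            mask <<= 1;
        }

        /* Release: wait for the parent, then release the children. */
        if (rank != 0)
            MPI_Recv(&token, 1, MPI_INT, rank - mask, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        for (mask >>= 1; mask > 0; mask >>= 1)
            if (rank + mask < size)
                MPI_Send(&token, 1, MPI_INT, rank + mask, 0, MPI_COMM_WORLD);
    }

The sequential version funnels every message through rank 0, while the tree spreads the traffic over log2(N) levels, which is the difference the two timelines are meant to show.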

TMD-MPE – Unexpected Messages Queue
All requests to send that arrive at a node before it has issued a matching MPI_RECV are kept in this queue.
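A rough sketch of why making the queue addressable by rank helps (assumed data structure; the real queue lives inside the TMD-MPE hardware): a posted receive can index directly by source rank instead of searching and reorganizing one shared queue.

    #define MAX_RANKS 64    /* assumed system size, for the sketch only */

    /* One request to send that arrived before the matching receive was posted. */
    struct pending_req {
        int src;                    /* sender's rank  */
        int tag;                    /* message tag    */
        int len;                    /* payload length */
        struct pending_req *next;
    };

    /* Rank-addressable unexpected-message queue: one list head per source
     * rank, so matching a posted receive is a direct index rather than a
     * search-and-reorganization pass over a single shared queue. */
    static struct pending_req *unexpected[MAX_RANKS];

    struct pending_req *match_posted_recv(int src)
    {
        struct pending_req *req = unexpected[src];

        if (req != NULL)
            unexpected[src] = req->next;   /* dequeue the oldest request */
        return req;                        /* NULL: nothing pending yet  */
    }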

TMD-MPE – Unexpected Messages Queue
[Jumpshot timeline: Send, Receive, and Queue Search and Reorganization events]

TMD-MPE – Unexpected Messages Queue
[Jumpshot timeline: Send and Receive events]

The Heat Equation Application
Partial differential equation that describes the temperature change over time
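For reference (standard background, not taken from the slides), the 2-D equation and a typical explicit finite-difference update, with grid spacing h and time step \Delta t, are:

    \frac{\partial u}{\partial t} = \alpha \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right)

    u_{i,j}^{t+1} = u_{i,j}^{t} + \frac{\alpha\,\Delta t}{h^2}
        \left( u_{i+1,j}^{t} + u_{i-1,j}^{t} + u_{i,j+1}^{t} + u_{i,j-1}^{t} - 4\,u_{i,j}^{t} \right)

Each grid point only needs its four neighbours, so a parallel version only has to exchange boundary data between neighbouring ranks each time step.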

The Heat Equation Application

[Jumpshot timeline: Send, Receive, and Computation events]

The Heat Equation Application
[Jumpshot timeline: Send, Receive, and Computation events]
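The blocking-vs-non-blocking comparison in this case study corresponds, in generic MPI terms, to something like the following sketch (buffer names, neighbour ranks and the compute_* helpers are placeholders, not the application's actual code):

    #include <mpi.h>

    void compute_interior(void);     /* placeholder: stencil on interior points */
    void compute_boundaries(void);   /* placeholder: stencil on boundary rows   */

    /* Blocking version: the halo exchange and the computation are serialized.
     * up/down may be MPI_PROC_NULL at the ends of the 1-D decomposition. */
    void step_blocking(double *row_lo, double *row_hi,
                       double *halo_lo, double *halo_hi,
                       int n, int up, int down)
    {
        MPI_Send(row_lo,  n, MPI_DOUBLE, up,   0, MPI_COMM_WORLD);
        MPI_Recv(halo_hi, n, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(row_hi,  n, MPI_DOUBLE, down, 1, MPI_COMM_WORLD);
        MPI_Recv(halo_lo, n, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        compute_interior();
        compute_boundaries();
    }

    /* Non-blocking version: post the exchange, overlap it with the interior
     * update, and wait only before the halo-dependent boundary rows. */
    void step_nonblocking(double *row_lo, double *row_hi,
                          double *halo_lo, double *halo_hi,
                          int n, int up, int down)
    {
        MPI_Request reqs[4];

        MPI_Irecv(halo_hi, n, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(halo_lo, n, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(row_lo,  n, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(row_hi,  n, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &reqs[3]);

        compute_interior();                         /* overlaps the transfers */
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
        compute_boundaries();                       /* needs the halos        */
    }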

The LINPACK Benchmark
Solves a system of linear equations
LU factorization with partial pivoting
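For reference (standard background, not taken from the slides), partial pivoting factors a row permutation of the matrix, and the triangular factors are then used to solve the system:

    PA = LU, \qquad Ax = b \;\Rightarrow\; Ly = Pb, \quad Ux = y

where P is the pivoting permutation, L is unit lower triangular and U is upper triangular; two triangular solves then give x.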

The LINPACK Benchmark
[Diagram: matrix columns 0, 1, …, n distributed among Rank 0, Rank 1, Rank 2, …]

The LINPACK Benchmark
[Jumpshot timelines: Send, Receive, and Computation events]

Profiler’s Overhead
Block                                        LUTs        Flip-Flops   BRAMs
Collector                                    3856 (5%)   1279 (1%)    0 (0%)
Gather                                       187 (0%)    53 (0%)      0 (0%)
Engine Computation Tracer                    396 (0%)    701 (1%)     0 (0%)
TMD-MPE Tracer                               526 (0%)    1000 (1%)    0 (0%)
Processors Computation Tracer without MPE    1196 (1%)   1521 (2%)    0 (0%)
Processors Computation Tracer with MPE       855 (1%)    1200 (1%)    0 (0%)

Conclusions
All major features of the MPE were implemented
The profiler was successfully used to study the behavior of the applications
Less intrusive
More events available to profile
Can profile network components
Compatible with existing profiling software environments

Future Work
Reduce the footprint of the profiler’s hardware blocks
Profile the MicroBlaze and PowerPC in a non-intrusive way
Allow real-time profiling

Thank You (Questions?)

The TMD (2): Off-Chip Communications Node
[Diagram: a computation node (hardware engine or PPC with TMD-MPE, FSLs, network interface, on-chip network) and an off-chip communications node (InterChip FSL, XAUI)]

Profiler (2)
[Diagram: processor and engine profiler architectures. TMD-MPE RX/TX tracers and a computation tracer, driven by the cycle counter, feed the Gather block; the processor version also involves the PPC's PLB and DCR buses, a DCR2FSL bridge, and GPIO]

Profiler (1)
[Diagram: PPC and MicroBlaze nodes, Gather blocks, the cycle counter, and the on-chip network on User FPGAs 1–4; Collector and DDR on the Control FPGA; boards 0–N connected through XAUI links and a switch]

Hardware Profiling Benefits
Less intrusive
More events available to profile
Can profile network components
Compatible with existing profiling software environments

MPE PROTOCOL