
Chris Madill
Molecular Structure and Function, Hospital for Sick Children
Department of Biochemistry, University of Toronto
Supervised by Dr. Paul Chow, Electrical and Computer Engineering, University of Toronto
SHARCNET Symposium on GPU and CELL Computing, 2008

- Many scientific applications can be accelerated by targeting parallel machines.
- This work demonstrates a method for combining high-performance computer clusters with FPGAs for maximum computational power.
- Coarse-grained parallelization allows applications to be distributed across hundreds or thousands of nodes.
- FPGAs can accelerate many computing tasks by two to three orders of magnitude over a CPU.

[Diagram: candidate parallel machine organizations: a conventional cluster of CPUs and memory on an interconnection network; a cluster of application-specific processors (ASPs such as GPUs or FPGAs); and a heterogeneous cluster in which CPUs, FPGAs, and GPUs share one interconnection network and memory.]

- FPGAs can speed up applications, however...
- High barrier of entry for designing digital hardware
- Developing monolithic FPGA designs is very daunting
- How does one easily take advantage of FPGAs for accelerating HPC applications?

- The Toronto Molecular Dynamics (TMD) machine is an investigation into high-performance computing based on a scalable network of FPGAs.
- Applications are defined as a simple collection of computing tasks.
- A task is roughly equivalent to a software process or thread.
- A major focus is facilitating the transition from cluster-based applications to the TMD machine.

Step 1: Application Prototyping
- A software prototype of the application is developed.
- Profiling identifies the compute-intensive routines.

Step 2: Application Refinement
- The application is partitioned into tasks communicating using MPI (a minimal sketch follows this list).
- Communication patterns are analyzed to determine the network topology.

Step 3: TMD Prototyping
- Tasks are ported to soft processors on the TMD.
- The on-chip communication network is verified.

Step 4: TMD Optimization
- Compute-intensive tasks are replaced with hardware engines.
- The MPE handles communication for the hardware engines.
- Hardware engines are easily moved and replicated.

[Diagram: the application prototype, processes A, B, and C communicating over MPI on a CPU cluster, migrates to tasks communicating over TMD-MPI on an FPGA network.]
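The sketch below illustrates the Step 2 partitioning in plain MPI C: one rank acts as the integrator, another as the force engine, so the force task can later be swapped for a hardware engine without touching its peer. The two-task split, names, and toy force law are illustrative assumptions, not code from the TMD project.

```c
/* Illustrative Step 2 partitioning: integrator and force engine as two
 * MPI tasks. The decomposition and the toy force law are assumptions for
 * this sketch, not the actual TMD application code. */
#include <mpi.h>
#include <stdio.h>

#define N 1024  /* number of particles (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    static double pos[N], frc[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 1; }  /* needs two tasks */

    for (int i = 0; i < N; i++) pos[i] = (double)i;

    for (int step = 0; step < 10; step++) {
        if (rank == 0) {
            /* Task A: integrator. Sends positions, receives forces. */
            MPI_Send(pos, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(frc, N, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            for (int i = 0; i < N; i++) pos[i] += 1e-3 * frc[i];
        } else if (rank == 1) {
            /* Task B: force engine. This is the compute-intensive routine
             * profiling flagged in Step 1; Step 4 would replace it with a
             * hardware engine behind an MPE. */
            MPI_Recv(pos, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            for (int i = 0; i < N; i++) frc[i] = -pos[i];  /* toy force */
            MPI_Send(frc, N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
        }
    }
    if (rank == 0) printf("pos[0] = %f\n", pos[0]);
    MPI_Finalize();
    return 0;
}
```

Counting the messages such a prototype exchanges (who talks to whom, and how much) is what determines the network topology in the refinement step.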

- TMD-MPI uses an essential subset of the MPI standard.
- A software library implements MPI for tasks running on processors.
- A hardware Message Passing Engine (MPE) implements MPI for hardware-based tasks.
- Tasks do not know (or care) whether remote tasks run as software processes or hardware engines.
- MPI isolation of tasks facilitates C-to-gates compilers.
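Because tasks address each other only by rank, the same code works whether the peer is a processor or an MPE-fronted hardware engine. The helper below is a sketch written against a minimal call set (rank query, send/recv); the slide does not enumerate the exact subset TMD-MPI supports, so the selection here is an assumption based on the layer descriptions later in the deck.

```c
/* Rank-agnostic exchange with a peer task. The peer may be a software
 * process or a hardware engine behind an MPE; the caller cannot tell.
 * The call set used here is an assumed "essential" MPI subset. */
#include <mpi.h>

void exchange_with_peer(int peer, double *buf, int n)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The lower rank sends first, avoiding a send/send deadlock on
     * unbuffered FIFO-style links such as FSLs. */
    if (rank < peer) {
        MPI_Send(buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    }
}
```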

- The Xilinx Advanced Computing Platform (ACP) consists of modules that plug directly into a CPU socket.
- The FPGA has direct access to the front-side bus (FSB).
- The CPU and FPGA are both peers in the system.
- Both have equal-priority access to main memory.

- The CPU does not have to orchestrate the activity of the FPGA.
- The CPU does not have to relay data to and from the FPGAs.
- The FPGA is not on a slow connection to the CPU.
- All tasks can run independently.

[Slide: the molecular dynamics force calculation. Only the fragments "F", "U", and "=" survive extraction; the standard relation, consistent with the rest of the deck, is F = -∇U, the force as the negative gradient of the potential energy.]
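For reference, a hedged LaTeX reconstruction: the force on each atom is the negative gradient of the potential energy, and in standard molecular dynamics the potential splits into bonded and nonbonded terms; the nonbonded terms are what the NBE and Ewald engines on the system slide evaluate. The specific decomposition below is textbook MD, not recovered from the slide.

```latex
% Standard MD relations (a reconstruction; the slide itself preserves
% only the fragments "F", "U", "="):
\[
  \mathbf{F}_i = -\nabla_{\mathbf{r}_i}\, U(\mathbf{r}_1, \dots, \mathbf{r}_N)
\]
\[
  U = \underbrace{U_{\mathrm{bond}} + U_{\mathrm{angle}} + U_{\mathrm{dihedral}}}_{\mathrm{bonded}}
    + \underbrace{U_{\mathrm{LJ}} + U_{\mathrm{electrostatic}}}_{\mathrm{nonbonded}}
\]
```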

[Diagram: the prototype system. A quad-core CPU and main memory sit on the front-side bus (FSB) together with three Xilinx ACP modules. The first module pairs a communications FPGA with User FPGAs 1 and 2 hosting NBE 1 through 4; the second pairs a communications FPGA with User FPGAs 3 and 4 hosting NBE 5 through 8; the third pairs a communications FPGA with User FPGAs 5 and 6, each hosting an Ewald engine.]

- The target system is a combination of software running on CPUs and FPGA hardware accelerators.
- The key to performance is identifying hotspots and adding corresponding hardware acceleration.
- The hardware engineer must focus on only a small part of the overall application.
- MPI facilitates hardware/software isolation and collaboration.

Acknowledgements (SOCRN)
Affiliations: 1 = Molecular Structure and Function, The Hospital for Sick Children; 2 = Department of Biochemistry, University of Toronto
Prof. Paul Chow; Prof. Régis Pomès (1,2)
TMD Group, Past Members, and Arches Computing: David Chui, Christopher Comis, Sam Lee, Daniel Ly, Lesley Shannon, Mike Yan, Danny Gupta, Alireza Heiderbarghi, Alex Kaganov, Chris Madill (1,2), Daniel Nunes, Emanuel Ramalho, David Woods, Arun Patel, Manuel Saldaña

[Diagram: the TMD-MPI software stack, from the application down through the MPI application interface, point-to-point MPI functions, the send/receive implementation, and the FSL hardware interface.]

Layer 4: MPI Interface. All MPI functions implemented in TMD-MPI that are available to the application.
Layer 3: Collective Operations. Barrier synchronization, data gathering, and message broadcasts.
Layer 2: Communication Primitives. MPI_Send and MPI_Recv are used to transmit data between processes.
Layer 1: Hardware Interface. Low-level methods to communicate with FSLs for both on-chip and off-chip communication.
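As a sketch of how the layers compose, the code below builds a Layer 2 word-stream send on top of a Layer 1 FSL write primitive. The function names and the three-word (dest, tag, length) header are assumptions made for illustration; the actual TMD-MPI packet format is not described on this slide.

```c
/* Sketch: a Layer 2 send built on a Layer 1 FSL write. Names and the
 * (dest, tag, length) header are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

/* Layer 1 primitive (assumed name). On hardware this would push one word
 * into an FSL FIFO; it is stubbed here so the sketch runs on a host. */
static void fsl_write_word(int link, uint32_t word)
{
    printf("FSL%d <- 0x%08x\n", link, (unsigned)word);
}

/* Layer 2: transmit 'count' payload words to rank 'dest' with tag 'tag'. */
static void tmd_send_words(int link, int dest, int tag,
                           const uint32_t *payload, int count)
{
    fsl_write_word(link, (uint32_t)dest);   /* header: destination rank  */
    fsl_write_word(link, (uint32_t)tag);    /* header: message tag       */
    fsl_write_word(link, (uint32_t)count);  /* header: payload length    */
    for (int i = 0; i < count; i++)
        fsl_write_word(link, payload[i]);   /* FIFO provides flow control */
}

int main(void)
{
    uint32_t data[4] = {1, 2, 3, 4};
    tmd_send_words(0, /*dest=*/2, /*tag=*/7, data, 4);
    return 0;
}
```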

- Communication links are based on Fast Simplex Links (FSLs): unidirectional point-to-point FIFOs that provide buffering and flow control and can be used to isolate different clock domains.
- FSLs simplify component interconnects: the standardized interface is used by both hardware engines and processors, so system modules can be assembled rapidly.
- Application-specific network topologies can be defined.
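On a MicroBlaze soft processor, FSLs are typically accessed through the blocking putfsl/getfsl macros Xilinx supplies in mb_interface.h. The echo loop below is a minimal sketch: it assumes an input and an output FSL are both wired as link 0, and it builds only with the MicroBlaze toolchain.

```c
/* Minimal MicroBlaze FSL echo task: read a word from input FSL 0 and
 * write it to output FSL 0. A sketch assuming link 0 is wired up. */
#include <mb_interface.h>

int main(void)
{
    unsigned int word;
    for (;;) {
        getfsl(word, 0);  /* blocking read from input FSL 0  */
        putfsl(word, 0);  /* blocking write to output FSL 0  */
    }
    return 0;
}
```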

- Inter-FPGA communication uses abstracted communication links.
- Communication is independent of the physical link:
  - Single serial transceivers (FSL-over-Aurora)
  - Bonded serial transceivers (FSL-over-XAUI)
  - Parallel buses (FSL-over-Wires)
  - FSL-over-10GbE coming soon…