MPI in uClinux on MicroBlaze
Neelima Balakrishnan, Khang Tran
05/01/2006

Project Proposal
- Port uClinux to run on MicroBlaze
- Add an MPI implementation on top of uClinux
- Configure the NAS Parallel Benchmarks and port them to run on RAMP

What is MicroBlaze?
- Soft-core processor, implemented using general logic primitives
- 32-bit Harvard RISC architecture
- Supported in the Xilinx Spartan and Virtex series of FPGAs
- The customizability of the core makes porting challenging while opening up many possibilities for kernel configuration

Components
- uClinux - kernel v2.4
- MPICH2 - portable, high-performance implementation of the entire MPI-2 standard
  - Communication via different channels - sockets, shared memory, etc.
  - The MPI port for MicroBlaze communicates over FSL (Fast Simplex Links), as sketched below
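As a rough illustration (ours, not from the original slides) of what "communication over FSL" means at the lowest level: an FSL-based channel moves 32-bit words through blocking reads and writes on a point-to-point link. The sketch assumes the Xilinx MicroBlaze putfsl/getfsl macros from mb_interface.h; the wrapper names fsl_send_word/fsl_recv_word are hypothetical.

    /* Hedged sketch: moving one 32-bit word over FSL link 0. Assumes the
     * Xilinx putfsl/getfsl macros from mb_interface.h; the wrapper names
     * are hypothetical, not part of the actual MPI port. */
    #include <mb_interface.h>

    void fsl_send_word(unsigned int word)
    {
        putfsl(word, 0);    /* blocking write to FSL link 0 */
    }

    unsigned int fsl_recv_word(void)
    {
        unsigned int word;
        getfsl(word, 0);    /* blocking read from FSL link 0 */
        return word;
    }

A channel implementation would layer message framing (rank, tag, length) on top of word transfers like these.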

Components (contd.)
- NASPB v2.4 - MPI-based source code implementations written and distributed by NAS
  - 5 kernels
  - 3 pseudo-applications

Porting uClinux to MicroBlaze
- Done by Dr. John Williams - Embedded Systems group, University of Queensland, Brisbane, Australia
- Part of their reconfigurable computing research program; work on the port is ongoing

Challenge in porting uClinux to MicroBlaze
- uClinux is a Linux derivative for microprocessors that lack a memory management unit (MMU)
  - No memory protection
  - No virtual memory
  - For most user applications, the fork() system call is unavailable (see the sketch below)
  - The malloc() function call needs to be modified
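A minimal sketch (ours, not from the slides) of what losing fork() means in practice: on MMU-less uClinux, the usual fork()+exec() idiom is typically rewritten as vfork()+exec(). The child borrows the parent's address space, so it must call exec or _exit almost immediately.

    /* vfork()+exec() pattern commonly used where fork() is unavailable. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        pid_t pid = vfork();          /* child shares the parent's memory */
        if (pid == 0) {
            /* child: only exec*() or _exit() is safe here */
            execlp("/bin/echo", "echo", "hello from uClinux", (char *)NULL);
            _exit(127);               /* reached only if exec fails */
        } else if (pid > 0) {
            int status;
            waitpid(pid, &status, 0); /* parent resumes after the exec */
        } else {
            perror("vfork");
            return EXIT_FAILURE;
        }
        return EXIT_SUCCESS;
    }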

MPI implementation
- MPI - Message Passing Interface
- Standard API used to create parallel applications
- Designed primarily to support the SPMD (single program, multiple data) model, as in the sketch below
- Advantages over older message-passing libraries
  - Portability
  - Speed, as each implementation is optimized for the hardware it runs on
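A minimal SPMD example (ours, added for illustration): every process runs the same binary and learns its identity from MPI, using only standard MPI calls.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);                 /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many of us? */
        printf("hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

Each rank branches on its rank number to divide the work, which is the essence of the SPMD model.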

Interactions between Application and MPI
[Diagram: the initiating application and the applications on the other processors each sit on top of the MPI interface; the MPI process manager launches them, and data moves between processors over the communication channel]

NAS parallel benchmarks
- Set of 8 programs intended to aid in evaluating the performance of parallel supercomputers
- Derived from computational fluid dynamics (CFD) applications
  - 5 kernels
  - 3 pseudo-applications
- Used the NPB 2.4 version - MPI-based source code implementation

Phases
- Studied uClinux and found the initial port done for MicroBlaze
  - Latest kernel (2.4) and distribution from uClinux.org
  - Successfully compiled it for the MicroBlaze architecture
- Chose MPICH2 from among the many MPI implementations
  - Investigated the MPICH2 implementation available from Argonne National Laboratory
  - Encountered challenges in porting MPI onto uClinux

Challenges in porting MPI to uClinux
- Use of fork and a complex state machine
  - The default process manager for Unix platforms is MPD, which is written in Python and uses a wrapper to call fork
  - A simple fork -> vfork substitution is not possible, as the call is made deep inside other functions and would require a lot of stack unwinding
- Alternate approaches
  - Port SMPD, which is written in C
    - Would still involve a complex state machine and stack unwinding after the fork
  - Use pthreads (see the sketch below)
    - Might involve a lot of reworking of the code, as the current implementation does not use pthreads
    - Need to ensure thread safety
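A hedged sketch of the pthreads alternative: where a fork-based manager forks a child per managed process, a thread-based one spawns a detached worker instead. Everything here (handle_client, spawn_handler, the per-connection struct) is hypothetical scaffolding, not MPD or SMPD code.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct client_arg { int sock_fd; };   /* hypothetical per-connection state */

    static void *handle_client(void *p)
    {
        struct client_arg *arg = p;
        /* ... service one managed process over arg->sock_fd ... */
        printf("servicing fd %d in a worker thread\n", arg->sock_fd);
        free(arg);
        return NULL;
    }

    /* Where the fork()-based manager would fork a child per connection,
     * detach a worker thread instead. All shared manager state now needs
     * locking - the thread-safety concern noted on the slide. */
    int spawn_handler(int sock_fd)
    {
        pthread_t tid;
        struct client_arg *arg = malloc(sizeof *arg);
        if (!arg) return -1;
        arg->sock_fd = sock_fd;
        if (pthread_create(&tid, NULL, handle_client, arg) != 0) {
            free(arg);
            return -1;
        }
        return pthread_detach(tid);
    }

    int main(void)
    {
        spawn_handler(0);      /* demo: pretend stdin is a client socket */
        pthread_exit(NULL);    /* let the detached worker finish */
    }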

NAS Parallel Benchmark
- Used NAS PB v2.4
- Compiled and executed it on a desktop and on the Millennium cluster
- Obtained information about
  - MOPS (millions of operations per second)
  - Type of operation
  - Execution time
  - Number of nodes involved
  - Number of processes and iterations

NAS PB simulation result (Millennium cluster, Class A)

Simulation result (cont.)

Estimated statistics for the floating-point group
- The 4 benchmarks that use floating-point operations heavily are BT, CG, MG, and SP
  - Very few floating-point comparison ops in any of them
  - BT (Block Tridiagonal): all floating-point ops are add, subtract, and multiply; about 5% of all ops are division
  - CG (Conjugate Gradient): highest percentage of square-root ops, about 30%; add/multiply about 60%, divide about 10%
  - MG (Multigrid): about 5% square root, 20% division; the rest is add, subtract, and multiply
  - SP (Scalar Pentadiagonal): almost all ops are add; 10% are division

Floating Point Operation Frequency

Most frequently used MPI functions in NASPB v2.4
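The chart itself is not reproduced in this transcript. As a hedged illustration only (not the measured call profile from the slide), NPB-style MPI codes lean heavily on non-blocking point-to-point calls plus collectives; the fragment below shows the typical shape. The function exchange_and_reduce is hypothetical.

    #include <mpi.h>

    /* Typical NPB-style communication step: exchange halo data with
     * neighbors, then reduce a scalar across all ranks. */
    void exchange_and_reduce(double *sendbuf, double *recvbuf, int n,
                             int left, int right, MPI_Comm comm)
    {
        MPI_Request req;
        MPI_Status  status;
        double local_sum = 0.0, global_sum;

        MPI_Irecv(recvbuf, n, MPI_DOUBLE, left, 0, comm, &req);   /* post receive */
        MPI_Send(sendbuf, n, MPI_DOUBLE, right, 0, comm);         /* send to neighbor */
        MPI_Wait(&req, &status);                                  /* complete receive */

        for (int i = 0; i < n; i++)
            local_sum += recvbuf[i];

        /* collective reduction across all ranks */
        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, comm);
        (void)global_sum;
    }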

Observations about NASPB
- Of the 8 benchmarks in the suite, 6 are predictive of parallel performance; the two exceptions are:
  - EP - little/negligible communication between processors
  - IS - high communication overhead

Project status
- Compiled uClinux and loaded it onto MicroBlaze
- Worked on the MPI port, but it is not yet complete
- Compiled and executed NASPB on a desktop and on Millennium (which currently uses 8 computing nodes)