PARUS: a parallel programming framework for heterogeneous multiprocessor systems. Alexey N. Salnikov (salnikov cs.msu.su), Moscow State University, Faculty of Computational Mathematics and Cybernetics.

Brief description

PARUS is intended for automatically building a parallel program from a data-dependency graph and fragments of source code in C. The algorithm is represented as an oriented graph in which vertices are marked with series of instructions and edges correspond to data dependencies. Vertices are assigned to MPI processes according to the performance of both the processors and the communication links.

Steps to a parallel program

1. Source fragments in the C programming language
2. Data dependency graph
3. Graph to C++ with MPI code converter (graph2c++)
4. C++ source code
5. Compiling, linking with libparus, and execution

PARUS implies declaring code fragments in C and then declaring data dependencies between variables and pieces of arrays used by those fragments. Information about the vertices and the data dependencies (edges) is stored in a text file. The graph2c++ utility generates C++ code with MPI calls from the data dependency graph; this code is compiled with the mpiCC compiler and linked with libparus, and the result is a binary executable for a multiprocessor system. The MPI processes execute the vertices of the data dependency graph according to a schedule. PARUS supports two types of scheduling: static (built by graph2sch) and dynamic (built at runtime by libparus). Both types of schedule are generated in view of the performance of both the processors and the communication system.

System architecture

PARUS consists of the kernel (libparus) and a set of utilities. The kernel controls program execution; the utilities manipulate the data dependency graph and gather information about the behavior of the multiprocessor's components:

  Test results visualizer                    ntr_viewer
  GUI for graph editing                      graph_editor
  Noise level meter (in communications)      network_test_noise
  Communications performance tester          network_test
  Processors performance tester              proc_test
  Graph verifier                             graph_touch
  Graph to C++ with MPI calls converter      graph2c++
  Schedule generator                         graph2sch
  C source code data dependency analyzer     parser

Results

We successfully ran PARUS on the following multiprocessor machines: MVS-1000m, MVS, IBM eServer pSeries 690 Regatta, and Fujitsu Sun PRIMEPOWER 850. Several popular scientific tasks have been implemented with PARUS and showed competitive performance.

[Figure: speedup of an associative operation applied to all elements of an array on MVS-1000m, against the number of processors.]
[Figure: speedup of a three-layer perceptron on 16 processors of the Regatta, against the number of perceptron inputs; two perceptron layers are projected onto a set of processors.]

Implemented tasks:
- a three-layer perceptron with a maximum of neurons
- a recursive algorithm that applies an associative operation to all elements of an array (10^9 elements)
- an algorithm of multiple sequence alignment

Sourceforge

PARUS is registered on SourceForge, which facilitates the developers' work and enables anyone to easily download the latest source code and documentation.

Graph structure

The algorithm is represented as a directed graph. Each edge is directed from the vertex the data are sent from to the vertex that receives the data. Thereby, the program may be represented as a network that has source vertices (they usually serve for reading input files), internal vertices (where the data are processed), and drain vertices (where the data are saved to the output files and the execution terminates). The graph description is stored in a text file with an HTML-like structure.
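For illustration, a data-dependency graph of this kind might be sketched in C as below. This is only a sketch: the type and field names (vertex_kind, edge, vertex, graph) are hypothetical and do not reproduce the PARUS internal representation or its text file format.

    #include <stddef.h>

    /* Hypothetical sketch only: a directed data-dependency graph with the
       source / internal / drain vertex kinds described above. */

    enum vertex_kind { VERTEX_SOURCE, VERTEX_INTERNAL, VERTEX_DRAIN };

    struct edge {
        int    from;    /* vertex number the data are sent from       */
        int    to;      /* vertex number that receives the data       */
        size_t bytes;   /* amount of data carried by this dependency  */
    };

    struct vertex {
        int              number;   /* vertex number, as in the text file */
        enum vertex_kind kind;     /* source, internal, or drain         */
        int              weight;   /* relative computational weight      */
        void (*body)(void *data);  /* attached C code fragment           */
    };

    struct graph {
        struct vertex *vertices;
        size_t         n_vertices;
        struct edge   *edges;
        size_t         n_edges;
    };

A scheduler, whether static (graph2sch) or dynamic (libparus at runtime), would then map each vertex of such a graph to an MPI process using the measured processor and link performance.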
Data transmission between vertices

Arrays to be transmitted between vertices are split into sets of non-overlapping chunks. The receiving vertices can join the chunks in any order, and any vertex is capable of receiving and sending simultaneously.

Performance data acquisition subsystem

Processor and communication performance data are collected in order to spread the load evenly across the components of a multiprocessor system while a program executes. Processor performance is determined by the time taken to solve a particular scientific task of fixed dimension. The next step is to analyze the state of the communication system. This is done with several MPI tests that measure the duration of data transfers between MPI processes. The communication tests of this subsystem build a set of matrices, one matrix per message length; a position in such a matrix corresponds to the delay of a data transfer between a pair of MPI processes. The information gathered by these tests is then used by the kernel and by graph2sch. Some of the communication tests: one_to_one, async_one_to_one, send_recv_and_recv_send, all_to_all (see the sketch below, after the vertex description example).

Structure of vertices and edges

Each vertex consists of a head (data reception), a body (the code that processes the data), and a tail (data packing for sending). A vertex description in the graph text file looks like this:

  number 1
  type 1
  weight 100
  layer 2
  num_input_edges 1
  edges ( 1 )
  num_output_edges 1
  edges ( 2 )
  head "head"
  body "node"
  tail ""
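As a rough illustration of the kind of measurement the communication tests described above perform (a minimal sketch, not the actual network_test code), the following C/MPI program times round trips of one fixed message length between process 0 and every other process and derives one row of a delay matrix. A full test would cover every pair of processes and a series of message lengths.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MSG_LEN 4096   /* one fixed message length; real tests use a series */

    int main(int argc, char **argv)
    {
        int rank, size;
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        buf = malloc(MSG_LEN);
        memset(buf, 0, MSG_LEN);

        if (rank == 0) {
            /* one row of the delay matrix: delay from process 0 to process i */
            double *delay = calloc(size, sizeof(double));
            for (int i = 1; i < size; i++) {
                double t0 = MPI_Wtime();
                MPI_Send(buf, MSG_LEN, MPI_CHAR, i, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_LEN, MPI_CHAR, i, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                /* half of the round-trip time approximates the one-way delay */
                delay[i] = (MPI_Wtime() - t0) / 2.0;
                printf("delay 0 -> %d: %g s\n", i, delay[i]);
            }
            free(delay);
        } else {
            /* echo the message back so rank 0 can time the round trip */
            MPI_Recv(buf, MSG_LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }

In PARUS, matrices of this kind, one per message length, are what the kernel and graph2sch consult when assigning vertices to MPI processes.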