STAPL Overview: pAlgorithms, pContainers, pRange, and the ARMI Communication Library

Presentation transcript:

STAPL Overview
[Layer diagram: User Application Code sits on top of the pAlgorithms, pContainers, and pRange components, which run on the ARMI Communication Library, which in turn runs over MPI, OpenMP, Pthreads, or native communication primitives.]

The Standard Template Adaptive Parallel Library (STAPL) is a framework for parallel C++ code. Its core is a library of ISO Standard C++ components with interfaces similar to the (sequential) ISO C++ standard library (STL). The goals of STAPL are:
- Ease of use: the shared object model provides a consistent programming interface, regardless of the actual system memory configuration (shared or distributed).
- Efficiency: application building blocks are based on C++ STL constructs that are extended and automatically tuned for parallel execution.
- Portability: the ARMI runtime system hides machine-specific details and provides an efficient, uniform communication interface.

pContainers
A pContainer is a distributed data structure with parallel methods.
- Provides a shared-memory view of distributed data.
- Deploys an efficient design: base classes implement basic functionality, and new pContainers can be derived from the base classes with extended and optimized functionality. Easy-to-use defaults are provided for basic users; advanced users have the flexibility to specialize and/or optimize methods.
- Supports multiple logical views of the data. For example, a pMatrix can be accessed using a row-based view or a column-based view. Views can be used to specify pContainer (re)distribution. Common views are provided (e.g., row, column, blocked, and block-cyclic for pMatrix); users can build specialized views.
- pVector, pList, pHashMap, pGraph, and pMatrix are provided.

ARMI Communication Library
ARMI provides high-performance, RMI-style communication between the threads in a program: async_rmi, sync_rmi, and standard collective operations (e.g., broadcast and reduce). Execution is transparent over various lower-level protocols such as MPI and Pthreads, including mixed-mode operation, and tuning is controllable (e.g., message aggregation); a minimal mock of this programming model is sketched after the pRange description below. [Chart: effect of aggregation in ARMI.]

Applications using STAPL
- Particle transport computation: an efficient, massively parallel implementation of discrete ordinates particle transport calculations.
- Motion planning: probabilistic roadmap methods for motion planning, with applications to protein folding, intelligent CAD, animation, robotics, etc. [Figure: normal vs. misfolded prion protein.]
- Seismic ray tracing: simulation of the propagation of seismic rays in the Earth's crust. [Figure: oil well logging simulation.]

STAPL: An Adaptive, Generic, Parallel C++ Library
Tao Huang, Alin Jula, Jack Perdue, Tagarathi Nageswar Rao, Timmie Smith, Yuriy Solodkyy, Gabriel Tanase, Nathan Thomas, Anna Tikhonova, Olga Tkachyshyn, Nancy M. Amato, Lawrence Rauchwerger. Parasol Lab, Department of Computer Science, Texas A&M University.

pRange
- Provides a shared view of a distributed work space.
- Subranges of the pRange are executable tasks; a task consists of a function object and a description of the data to which it is applied.
- Supports parallel execution, with clean expression of the computation as a parallel task graph.
- Stores the data dependence graphs used in processing subranges.
[Figure: a pRange defined on a pGraph across two threads; subranges 1-6, each with its function, are assigned to thread 1 and thread 2, and the application data is stored in the pGraph.]
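The following minimal mock illustrates the RMI-style programming model ARMI exposes. The primitive names (async_rmi, sync_rmi, rmi_fence) come from the overview above, but the signatures and the purely local execution are assumptions for illustration, not STAPL's actual interface:

#include <iostream>

// Toy "distributed object" whose methods are invoked via RMI.
struct Counter {
  int value = 0;
  void add(int x) { value += x; }
  int get() const { return value; }
};

// Local stand-ins for the ARMI primitives (assumed signatures). In ARMI
// proper, async_rmi is non-blocking and small messages may be aggregated;
// sync_rmi blocks for a return value; rmi_fence is a barrier that also
// waits for all outstanding RMIs to complete.
template <typename Obj, typename Arg>
void async_rmi(int /*dest_thread*/, Obj& obj, void (Obj::*m)(Arg), Arg a) {
  (obj.*m)(a);  // fire-and-forget remote method invocation
}

template <typename Obj, typename Ret>
Ret sync_rmi(int /*dest_thread*/, Obj& obj, Ret (Obj::*m)() const) {
  return (obj.*m)();  // blocking remote method invocation with a result
}

void rmi_fence() { /* all threads arrive; outstanding RMIs complete */ }

int main() {
  Counter c;                          // imagine c lives on thread 1
  async_rmi(1, c, &Counter::add, 5);  // enqueue work on the remote object
  async_rmi(1, c, &Counter::add, 7);
  rmi_fence();                        // both updates are now visible
  std::cout << sync_rmi(1, c, &Counter::get) << "\n";  // prints 12
}

In the real library, the message aggregation mentioned above batches many such small async_rmi calls into fewer, larger transfers to reduce per-message overhead.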
[Diagram: pContainer data views and the run-time system. The run-time system comprises the ARMI communication library, the scheduler, the executor, and the performance monitor, serving threads 0-3. A pContainer presents either a non-partitioned or a partitioned shared-memory view of data that may physically reside in shared or distributed memory. For a pMatrix distributed by rows, a row-based view is aligned with the distribution, while a column-based view is not aligned with it; a schematic illustration follows.]
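As a rough, hypothetical illustration of why view alignment matters (the Matrix type here is a plain row-major stand-in, not STAPL's pMatrix interface): traversing a row visits contiguous elements that a row-wise distribution keeps on one thread, while traversing a column strides across the distribution and would require communication:

#include <cstddef>
#include <iostream>
#include <vector>

// Toy row-major matrix standing in for a row-distributed pMatrix.
struct Matrix {
  std::size_t rows, cols;
  std::vector<double> data;  // row-major storage
  double& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
};

// Row-based access: contiguous, local to the thread owning the row
// (the "aligned with the distribution" case in the diagram).
double row_sum(Matrix& m, std::size_t r) {
  double s = 0;
  for (std::size_t c = 0; c < m.cols; ++c) s += m.at(r, c);
  return s;
}

// Column-based access over the same storage: strides across rows, so with
// a row-wise distribution it crosses thread boundaries (the "not aligned"
// case) and would incur communication in the distributed setting.
double col_sum(Matrix& m, std::size_t c) {
  double s = 0;
  for (std::size_t r = 0; r < m.rows; ++r) s += m.at(r, c);
  return s;
}

int main() {
  Matrix m{2, 3, {1, 2, 3, 4, 5, 6}};
  std::cout << row_sum(m, 0) << " " << col_sum(m, 0) << "\n";  // 6 5
}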

References
- "A Framework for Adaptive Algorithm Selection in STAPL," N. Thomas, G. Tanase, O. Tkachyshyn, J. Perdue, N. Amato, L. Rauchwerger. Symposium on Principles and Practice of Parallel Programming (PPoPP), June 2005.
- "Parallel Protein Folding with STAPL," S. Thomas, G. Tanase, L. Dale, J. Moreira, L. Rauchwerger, N. Amato. Journal of Concurrency and Computation: Practice and Experience, 2005.
- "ARMI: An Adaptive, Platform Independent Communication Library," S. Saunders, L. Rauchwerger. Symposium on Principles and Practice of Parallel Programming (PPoPP), June 2003.
- "STAPL: An Adaptive, Generic Parallel C++ Library," P. An, A. Jula, S. Rus, S. Saunders, T. Smith, G. Tanase, N. Thomas, N. Amato, and L. Rauchwerger. Workshop on Languages and Compilers for Parallel Computing (LCPC), Aug. 2001.
- "SmartApps: An Application Centric Approach to High Performance Computing," L. Rauchwerger, N. Amato, J. Torrellas. Workshop on Languages and Compilers for Parallel Computing (LCPC), Aug. 2000.

Adaptive Algorithm Selection Framework
[Diagram: at installation time, benchmarks run against the architecture and environment to build an algorithm performance model stored in a data repository; at run time, user code, the parallel algorithm choices, data characteristics, and runtime tests are fed to the model, which selects the algorithm for the STAPL adaptive executable.]

STAPL has a library of multiple functionally equivalent solutions for many important problems. While they produce the same end result, their performance differs depending on:
- System architecture: number of processing elements, memory hierarchy.
- Input characteristics: data type, size of the input, and others (e.g., presortedness for sorting).
[Chart: adaptive sort performance on an input of 120 million random integers.]

Model generation / installation benchmarking occurs once per platform, during STAPL installation:
- Choose the parameters that may affect performance (e.g., input size, algorithm-specific parameters).
- Run a sample of experiments and insert the timings into a performance database, which handles various algorithms/problems with different profiling needs.
- Create a model to predict the "winning" algorithm in each case.

Runtime algorithm selection:
- Gather the parameters.
- Query the model.
- Execute the chosen algorithm.

pAlgorithms
pAlgorithms are parallel equivalents of sequential algorithms: sets of parallel task objects that provide basic functionality, bound to the pContainer by the pRange. STAPL provides parallel STL equivalents (copy, find, sort, etc.), as well as graph and matrix algorithms.

Example algorithm: nth element (the selection problem). The nth element algorithm partially orders a range of elements: it arranges them such that the element located in the nth position is the same as it would be if the entire range had been sorted. Additionally, none of the elements in the range [nth, last) is less than any of the elements in the range [first, nth). There is no guarantee regarding the relative order within the subranges [first, nth) and [nth, last). [Figure: a range from first to last, split at nth into elements < n and elements >= n.] A sequential reference for this behavior is sketched after the pseudocode below.

p_nth_element(pRange& pr, pContainer& pcont, Iterator nth) {
  Select a sample of s elements.
  Select m evenly spaced elements, called splitters.
  Sort the splitters and select k final splitters.
  The splitters determine the ranges of virtual "buckets".
  Total the number of elements in each bucket. [Table: bucket element counts per subrange.]
  Traverse the totals to find the bucket B containing the nth element.
  Recursively call p_nth_element(B.pRange(), B, nth).
}

Our framework automatically chooses an implementation that maximizes performance.
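For a concrete sense of the partial ordering described above, here is its sequential STL counterpart, std::nth_element, in ordinary standard C++ (not the STAPL parallel version):

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
  std::vector<int> v{9, 1, 8, 2, 7, 3, 6, 4, 5};

  // Partially order v so that v[4] holds the value it would have after a
  // full sort (here, the median 5); everything before index 4 is <= 5 and
  // everything after is >= 5, but neither side is itself sorted.
  std::nth_element(v.begin(), v.begin() + 4, v.end());

  std::cout << "5th smallest: " << v[4] << "\n";  // prints 5
}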
Example code (main):

// (STAPL headers omitted in the poster excerpt.)
typedef stapl::pArray<int> pcontainerType;
typedef pcontainerType::PRange prangeType;

void stapl_main(int argc, char** argv) {
  // Parallel container to be partially sorted
  // (nElements and nth are assumed to be set up elsewhere):
  pcontainerType pcont(nElements);

  // Fill the container with values ...

  // Declare a pRange on the parallel container:
  prangeType pr(pcont);

  // Parallel function call:
  p_nth_element(pr, pcont, nth);

  // Synchronization barrier:
  stapl::rmi_fence();
}

Example (distribute elements into virtual "buckets"):

#include <algorithm>  // std::upper_bound
#include <vector>

template <typename pContainer, typename Boundary>
class distribute_elements_wf : public work_function_base {
  pContainer* splitters;           // sorted splitter values defining the buckets
  int nSplitters;
  std::vector<int> bucket_counts;  // one counter per bucket (nSplitters + 1)

public:
  distribute_elements_wf(pContainer& sp)
    : splitters(&sp),
      nSplitters(sp.size()),
      bucket_counts(sp.size() + 1, 0) {}

  void operator()(Boundary& subrange_data) {
    typename Boundary::iterator_type first1 = subrange_data.begin();
    while (first1 != subrange_data.end()) {
      typename pContainer::value_type val = *first1;
      int dest;
      if (nSplitters > 1) {
        // At least two splitters: binary-search the sorted splitters; the
        // insertion point is the index of the bucket that val falls into.
        typename pContainer::iterator d =
            std::upper_bound(splitters->begin(), splitters->end(), val);
        dest = (int)(d - splitters->begin());
      } else if (nSplitters == 1) {
        // One splitter: bucket 0 below it, bucket 1 otherwise.
        dest = (val < *(splitters->begin())) ? 0 : 1;
      } else {
        dest = 0;  // no splitters: everything goes to the single bucket
      }
      ++bucket_counts[dest];  // increment the counter for that bucket
      ++first1;
    }
  }
};
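To see the splitter-to-bucket mapping from the work function in isolation, here is a small standalone check in plain C++ (independent of STAPL):

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
  std::vector<int> splitters{10, 20, 30};  // sorted; defines 4 buckets

  // Bucket index = insertion point among the splitters:
  //   val < 10 -> 0, 10 <= val < 20 -> 1, 20 <= val < 30 -> 2, val >= 30 -> 3
  for (int val : {5, 10, 25, 42}) {
    int dest = (int)(std::upper_bound(splitters.begin(), splitters.end(), val)
                     - splitters.begin());
    std::cout << val << " -> bucket " << dest << "\n";
  }
}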