Parallel Algorithms in STAPL: Implementation and Evaluation
Jeremy Vu, Mauro Bianco, Nancy Amato
stapl-support@tamu.edu
Parasol Lab, Department of Computer Science and Engineering, Texas A&M University, http://parasol.tamu.edu/

STAPL Overview
STAPL (the Standard Template Adaptive Parallel Library) is a framework for developing parallel C++ code. Its core is a library of C++ components with interfaces similar to the (sequential) C++ Standard Template Library (STL).

Applications using STAPL
- Particle Transport Computation: efficient massively parallel implementation of discrete ordinates particle transport calculation.
- Protein and RNA Folding: Probabilistic Roadmap Methods from motion planning adapted to protein and RNA folding (e.g., normal vs. misfolded prion protein).
- Seismic Ray Tracing: simulation of the propagation of seismic rays in the earth's crust.
- Oil well logging simulation.

Project Goals
- Ease of use: the Shared Object Programming Model provides a consistent interface across shared- and distributed-memory systems.
- Efficiency: application building blocks are based on C++ STL constructs, extended and automatically tuned for parallel execution.
- Portability: the ARMI runtime system hides machine-specific details and provides an efficient, uniform communication interface.

[STAPL architecture diagram: user application code sits on top of pAlgorithms, pContainers, Views, and the pRange within the adaptive framework; these are built on the run-time system (ARMI communication library, scheduler, executor, performance monitor), which in turn runs on Pthreads, OpenMP, MPI, native libraries, etc.]

References
- "A Framework for Adaptive Algorithm Selection in STAPL," N. Thomas, G. Tanase, O. Tkachyshyn, J. Perdue, N. M. Amato, L. Rauchwerger. Symposium on Principles and Practice of Parallel Programming (PPoPP), June 2005.
- "Parallel Protein Folding with STAPL," S. Thomas, G. Tanase, L. Dale, J. Moreira, L. Rauchwerger, N. M. Amato. Journal of Concurrency and Computation: Practice and Experience, 2005.
- "ARMI: An Adaptive, Platform Independent Communication Library," S. Saunders, L. Rauchwerger. Symposium on Principles and Practice of Parallel Programming (PPoPP), June 2003.
- "STAPL: An Adaptive, Generic Parallel C++ Library," P. An, A. Jula, S. Rus, S. Saunders, T. Smith, G. Tanase, N. Thomas, N. M. Amato, L. Rauchwerger. Languages and Compilers for Parallel Computing (LCPC), Aug. 2001.
- "Design for Interoperability in STAPL: pMatrices and Linear Algebra Algorithms," A. Buss, G. Tanase, N. Thomas, T. Smith, M. Bianco, N. M. Amato, L. Rauchwerger. Languages and Compilers for Parallel Computing (LCPC), LNCS 5335, pp. 304-315, Aug. 2008.
- "Associative Parallel Containers in STAPL," G. Tanase, C. Raman, M. Bianco, N. M. Amato, L. Rauchwerger. Languages and Compilers for Parallel Computing (LCPC), LNCS 5234, pp. 156-171, 2008.
- "The STAPL pArray," G. Tanase, M. Bianco, N. M. Amato, L. Rauchwerger. MEmory performance: DEaling with Applications, systems and architecture (MEDEA), 2007.

This research is supported in part by NSF Grants EIA-0103742, ACR-0081510, ACR-0113971, CCR-0113974, ACI-0326350, and CRI-0551685, and by the DOE, IBM, and Intel. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

pContainers and Views
pContainer: a distributed collection of generic elements, with methods to access and maintain the collection, that provides a shared-object view to the user. The shared-object view provides uniform access to data independent of the physical location where it is stored.
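As an illustration of this location-independent access, the sketch below shows how a global index could resolve to a component, a location, and a local offset under the blocked, cyclically mapped distribution used in the pArray example that follows. The BlockCyclic helper is purely illustrative and is not part of the STAPL API.

#include <cstdio>

// Illustrative (non-STAPL) mapping for the pArray example below:
// domain [0,6), blocked partition with block size 2,
// components mapped cyclically onto 2 locations.
struct BlockCyclic {
    long block;       // elements per component
    long nlocations;  // number of locations (address spaces)

    long component(long i) const { return i / block; }
    long location(long i)  const { return component(i) % nlocations; }
    long offset(long i)    const { return i % block; }  // index inside the component
};

int main() {
    BlockCyclic map{2, 2};
    for (long i = 0; i < 6; ++i)
        std::printf("p_array[%ld] -> component %ld, location %ld, offset %ld\n",
                    i, map.component(i), map.location(i), map.offset(i));
    // e.g. p_array[2] lives on location 1 and p_array[5] on location 0,
    // yet the user accesses both through the same shared-object interface.
}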
pContainer interfaces are equivalent to their STL counterparts (e.g., pVector and STL vector).

View: an abstract data type that decouples a container's interface from the underlying storage. Views allow a data set to be filtered and traversed in multiple ways, for example a row-wise or column-wise traversal of a pMatrix.

Example of a pArray on 2 processors: pArray p_array(6) with domain [0,6), a blocked partition with block size 2, and a cyclic mapping of components to locations. Component 0 (domain {0,1}) and Component 2 (domain {4,5}) reside on Location 0, while Component 1 (domain {2,3}) resides on Location 1; thus p_array[5] is stored on Location 0 and p_array[2] on Location 1.

[Figure: a pMatrix distributed across four locations. A row-based view is aligned with the distribution; a column-based view of the same pMatrix is not aligned with the distribution.]

pRange
A pRange is a dynamic, composable task graph of a parallel computation. A graph vertex is a task: a work function together with the data it processes (represented by a partition of a view). Tasks that must access the same data in a particular order are connected by a graph edge that enforces the execution order when the graph is processed by the Executor in the runtime. pRanges constructed using Factories that describe the pattern of the computation (e.g., Map-Reduce) are data-parallel; pRanges can also be composed to form task-parallel computations. [Figure: example of a Map-Reduce task graph, showing tasks, views, and work functions.]

pAlgorithm: p_unique
Input: a sequence of elements and a binary relation.
Output: a sequence consisting of only the first element of each group of consecutive duplicate elements; the binary relation determines which elements are duplicates. Example with '=': {1, 1, 2, 2, 3, 3, 4, 4} ---> {1, 2, 3, 4}.
Sequentially there is only one way to do this; in parallel, there are multiple cases.
- Case 1, symmetric and transitive relation (e.g., '='): compare, in parallel, each element to the next element in the sequence using the relation and keep or remove it based on the result. With enough processors, all of the comparisons can happen simultaneously. Constant number of remote accesses.
- Case 2, transitive but not symmetric relation (e.g., '<'): {60, 100, 70, 20, 50, 40, 30, 10} ---> {60, 20, 10}. A parallel prefix computation is required before the keep/remove decisions can be made: each comparison is "prefixed" with the appropriate initial element, and only then are the duplicate elements identified. O(log p) remote accesses.
- Case 3, relation that is not transitive (e.g., 'a is the father of b'): sequential; the comparisons must be done in order, one at a time.

Performance Evaluation
Scalability experiments use 10 million integers in the container. Results were obtained on an IBM Cluster 1600 at TAMU. This is an initial set of results; improvements are planned.
[Performance plots: p_unique_copy, Case 1 (symmetric and transitive), 100% of elements copied; p_unique_copy, Case 2 (transitive but not symmetric), 100% of elements copied; p_unique_copy, Case 2, 0% of elements copied; p_unique_copy, Case 1, 50% of elements copied. Slowdown occurs as the amount of communication increases due to compaction of elements.]
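To make the Case 1 strategy above concrete, here is a minimal shared-memory sketch in plain C++ (with an optional OpenMP pragma), written as an illustration rather than STAPL's distributed implementation: each element is compared with its immediate neighbor independently, and the survivors are then compacted. The name unique_case1 and the sequential compaction step are assumptions for this sketch; a parallel version would compact using a prefix sum of the keep flags.

#include <cstddef>
#include <cstdio>
#include <vector>

// Case 1 sketch (symmetric + transitive relation, e.g. equality):
// each element can be compared with its predecessor independently,
// so all "keep" flags can be computed in parallel.
template <typename T, typename BinaryRelation>
std::vector<T> unique_case1(const std::vector<T>& in, BinaryRelation rel)
{
    std::vector<char> keep(in.size(), 1);

    // Independent pairwise comparisons: parallelizable (shown with OpenMP).
    #pragma omp parallel for
    for (std::ptrdiff_t i = 1; i < (std::ptrdiff_t)in.size(); ++i)
        keep[i] = !rel(in[i - 1], in[i]);

    // Compaction of the surviving elements (sequential here; a parallel
    // version would use a prefix sum over keep[] to compute output offsets).
    std::vector<T> out;
    for (std::size_t i = 0; i < in.size(); ++i)
        if (keep[i]) out.push_back(in[i]);
    return out;
}

int main()
{
    std::vector<int> v{1, 1, 2, 2, 3, 3, 4, 4};
    for (int x : unique_case1(v, [](int a, int b) { return a == b; }))
        std::printf("%d ", x);   // prints: 1 2 3 4
    std::printf("\n");
}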

