Managing the Complexity of Lookahead for LU Factorization with Pivoting
Ernie Chan
SPAA 2010, June 13-15, 2010



Motivation
Solving Linear Systems
 Solve A x = b for x
 Factorize A, O(n^3): P A = L U
 Forward and backward substitution, O(n^2): L y = P b, then U x = y
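The three steps above can be made concrete with a minimal, self-contained C sketch (an illustration of the math, not the talk's library code): unblocked LU with partial pivoting, followed by the two triangular solves.

```c
/* A minimal sketch of the math above (not the talk's library code):
   unblocked LU with partial pivoting, then the two triangular solves.
   The factorization costs O(n^3); each substitution costs O(n^2). */
#include <assert.h>
#include <math.h>

/* Factor A (n x n, row-major) in place as P A = L U; piv[k] records the
   row interchanged with row k (LAPACK-style pivot vector). */
static void lu_piv(double *A, int *piv, int n) {
    for (int k = 0; k < n; k++) {
        int p = k;                            /* find the pivot row */
        for (int i = k + 1; i < n; i++)
            if (fabs(A[i*n + k]) > fabs(A[p*n + k])) p = i;
        piv[k] = p;
        if (p != k)                           /* swap rows k and p */
            for (int j = 0; j < n; j++) {
                double t = A[k*n + j]; A[k*n + j] = A[p*n + j]; A[p*n + j] = t;
            }
        for (int i = k + 1; i < n; i++) {     /* eliminate below the pivot */
            A[i*n + k] /= A[k*n + k];
            for (int j = k + 1; j < n; j++)
                A[i*n + j] -= A[i*n + k] * A[k*n + j];
        }
    }
}

/* Solve A x = b via L y = P b, then U x = y. */
static void lu_solve(const double *A, const int *piv, const double *b,
                     double *x, int n) {
    for (int i = 0; i < n; i++) x[i] = b[i];
    for (int k = 0; k < n; k++) {             /* apply P to b */
        double t = x[k]; x[k] = x[piv[k]]; x[piv[k]] = t;
    }
    for (int i = 0; i < n; i++)               /* forward: unit-lower L */
        for (int j = 0; j < i; j++) x[i] -= A[i*n + j] * x[j];
    for (int i = n - 1; i >= 0; i--) {        /* backward: upper U */
        for (int j = i + 1; j < n; j++) x[i] -= A[i*n + j] * x[j];
        x[i] /= A[i*n + i];
    }
}
```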

Goals
Programmability
 Use the tools provided by FLAME
Parallelism
 Directed acyclic graph (DAG) scheduling

Outline
 LU Factorization with Partial Pivoting
 Algorithm-by-Blocks
 SuperMatrix Runtime System
 Performance
 Conclusion
P A = L U

LU Factorization with Partial Pivoting
Formal Linear Algebra Method Environment (FLAME)
 High-level abstractions for expressing linear algebra algorithms
 Application programming interfaces (APIs) for seamlessly implementing algorithms in code
 Library of commonly used linear algebra operations in libflame


LU Factorization with Partial Pivoting
Blocked Algorithm
 Iteration 1: partition A into quadrants A11, A12, A21, A22
   LUpiv: factor the panel [A11; A21] with partial pivoting
   PIV: apply the row interchanges to the trailing columns [A12; A22]
   TRSM: A12 := L11^-1 A12
   GEMM: A22 := A22 - A21 A12
 Iteration 2: repartition the trailing submatrix into blocks A00 through A22 and repeat the same steps, with PIV now also applied to the already-factored columns on the left ([A10; A20])
 Iteration 3: factor the final diagonal block with LUpiv and apply its pivots to the columns on the left
[Slide figures show the partitioned matrix at each step with the active blocks highlighted; images not preserved in this transcript.]
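The per-iteration steps of the blocked algorithm can be sketched in plain C on a row-major array (a simplified illustration; the talk's implementation uses FLAME, and the function names here are made up):

```c
/* Hedged sketch (not the FLAME implementation): blocked right-looking LU
   with partial pivoting on a row-major n x n array, with the per-iteration
   steps (LUpiv on the panel, PIV left and right, TRSM, GEMM) written out
   as loops. piv[j] records the row interchanged with row j. */
#include <assert.h>
#include <math.h>

/* LUpiv: factor the panel (columns k..k+kb-1, rows k..n-1) with partial
   pivoting; row swaps are applied to the panel columns only. */
static void lupiv_panel(double *A, int n, int k, int kb, int *piv) {
    for (int j = k; j < k + kb; j++) {
        int p = j;
        for (int i = j + 1; i < n; i++)
            if (fabs(A[i*n + j]) > fabs(A[p*n + j])) p = i;
        piv[j] = p;
        if (p != j)
            for (int c = k; c < k + kb; c++) {
                double t = A[j*n + c]; A[j*n + c] = A[p*n + c]; A[p*n + c] = t;
            }
        for (int i = j + 1; i < n; i++) {
            A[i*n + j] /= A[j*n + j];
            for (int c = j + 1; c < k + kb; c++)
                A[i*n + c] -= A[i*n + j] * A[j*n + c];
        }
    }
}

/* PIV: apply the recorded row interchanges to columns [c0, c1). */
static void apply_piv(double *A, int n, int k, int kb, const int *piv,
                      int c0, int c1) {
    for (int j = k; j < k + kb; j++)
        if (piv[j] != j)
            for (int c = c0; c < c1; c++) {
                double t = A[j*n + c];
                A[j*n + c] = A[piv[j]*n + c];
                A[piv[j]*n + c] = t;
            }
}

static void blocked_lu(double *A, int *piv, int n, int b) {
    for (int k = 0; k < n; k += b) {
        int kb = (n - k < b) ? n - k : b;
        lupiv_panel(A, n, k, kb, piv);           /* LUpiv on [A11; A21]     */
        apply_piv(A, n, k, kb, piv, 0, k);       /* PIV on [A10; A20]       */
        apply_piv(A, n, k, kb, piv, k + kb, n);  /* PIV on [A12; A22]       */
        for (int j = k; j < k + kb; j++)         /* TRSM: A12 := L11^-1 A12 */
            for (int i = j + 1; i < k + kb; i++)
                for (int c = k + kb; c < n; c++)
                    A[i*n + c] -= A[i*n + j] * A[j*n + c];
        for (int i = k + kb; i < n; i++)         /* GEMM: A22 -= A21 * A12  */
            for (int j = k; j < k + kb; j++)
                for (int c = k + kb; c < n; c++)
                    A[i*n + c] -= A[i*n + j] * A[j*n + c];
    }
}
```

Because the pivot search still scans the full column below the diagonal, this produces the same factors as the unblocked algorithm; the blocking only reorders the updates into panel-sized tasks.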

Outline
 LU Factorization with Partial Pivoting
 Algorithm-by-Blocks
 SuperMatrix Runtime System
 Performance
 Conclusion
P A = L U

Algorithm-by-Blocks
FLASH
 Storage-by-blocks
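A minimal sketch of the storage-by-blocks idea (illustrative only; the type and function names here are assumptions, not the FLASH API): the matrix is held as a 2-D array of pointers to contiguous b x b blocks, so each task reads and writes whole blocks.

```c
/* Hedged sketch of storage-by-blocks (names are made up, not the FLASH
   API): a square matrix of nb x nb blocks, each a contiguous b x b
   row-major buffer. Tasks then operate on whole blocks, which is what
   makes per-block dependence analysis possible. */
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int nb;        /* blocks per dimension (logical order n = nb * b) */
    int b;         /* block size */
    double **blk;  /* blk[I*nb + J] points to contiguous block (I, J) */
} BlockMatrix;

static BlockMatrix *bm_create(int nb, int b) {
    BlockMatrix *M = malloc(sizeof *M);
    M->nb = nb;
    M->b = b;
    M->blk = malloc((size_t)nb * nb * sizeof *M->blk);
    for (int i = 0; i < nb * nb; i++)
        M->blk[i] = calloc((size_t)b * b, sizeof(double));
    return M;
}

/* Element (i, j) of the logical matrix lives in block (i/b, j/b). */
static double *bm_elem(BlockMatrix *M, int i, int j) {
    double *blk = M->blk[(i / M->b) * M->nb + (j / M->b)];
    return &blk[(i % M->b) * M->b + (j % M->b)];  /* row-major in block */
}

static void bm_free(BlockMatrix *M) {
    for (int i = 0; i < M->nb * M->nb; i++) free(M->blk[i]);
    free(M->blk);
    free(M);
}
```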

June 13-15, 2010SPAA FLA_Part_2x2( A, &ATL, &ATR, &ABL, &ABR, 0, 0, FLA_TL ); FLA_Part_2x1( p, &pT, &pB, 0, FLA_TOP ); while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) { FLA_Repart_2x2_to_3x3( ATL, /**/ ATR, &A00, /**/ &A01, &A02, /* ******** */ /* **************** */ &A10, /**/ &A11, &A12, ABL, /**/ ABR, &A20, /**/ &A21, &A22, 1, 1, FLA_BR ); FLA_Repart_2x1_to_3x1( pT, &p0, /* ** */ /* ** */ &p1, pB, &p2, 1, FLA_BOTTOM ); /* */ FLA_Merge_2x1( A11, A21, &AB1 ); FLASH_LU_piv( AB1, p1 ); FLA_Merge_2x1( A10, A20, &AB0 ); FLASH_Apply_pivots( FLA_LEFT, FLA_NO_TRANSPOSE, p1, AB0 ); FLA_Merge_2x1( A12, A22, &AB2 ); FLASH_Apply_pivots( FLA_LEFT, FLA_NO_TRANSPOSE, p1, AB2 ); FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_UNIT_DIAG, FLA_ONE, A11, A12 ); FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, A21, A12, FLA_ONE, A22 ); /* */ FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR, A00, A01, /**/ A02, A10, A11, /**/ A12, /* ********** */ /* ************* */ &ABL, /**/ &ABR, A20, A21, /**/ A22, FLA_TL ); FLA_Cont_with_3x1_to_2x1( &pT, p0, p1, /* ** */ /* ** */ &pB, p2, FLA_TOP ); }

Algorithm-by-Blocks
LU Factorization with Partial Pivoting
 Iteration 1 generates tasks LUpiv 0, PIV 1, PIV 2, TRSM 3, TRSM 4, and GEMM 5 through GEMM 8
 Iteration 2 generates tasks LUpiv 9, PIV 10, PIV 11, TRSM 12, and GEMM 13
 Iteration 3 generates tasks LUpiv 14, PIV 15, and PIV 16
[Slide figures label each block of the matrix with the task that overwrites it; images not preserved in this transcript.]

[Figure: the resulting directed acyclic graph over tasks LUpiv 0 through PIV 16, with edges for the dependences between them; image not preserved in this transcript.]

Outline
 LU Factorization with Partial Pivoting
 Algorithm-by-Blocks
 SuperMatrix Runtime System
 Performance
 Conclusion
P A = L U

SuperMatrix Runtime System
Separation of Concerns
 Analyzer
   Decomposes subproblems into component tasks
   Stores tasks sequentially in a global task queue
   Internally calculates all dependences between tasks, which form a directed acyclic graph (DAG), using only the input and output parameters of each task
 Dispatcher
   Spawns threads
   Schedules and dispatches tasks to threads in parallel
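The analyzer's dependence test can be illustrated with a small sketch (hypothetical names and data structures, not the SuperMatrix source): each task declares the blocks it reads and writes, and an edge is added from an earlier task to a later one whenever a flow, anti, or output dependence exists between their block sets.

```c
/* Hedged sketch of the analyzer (illustrative names only): tasks are
   enqueued in sequential order, and dependences are inferred purely from
   each task's input (read) and output (write) block lists. Flow, anti,
   and output dependences all add an edge from the earlier task to the
   later one, yielding a DAG. */
#include <assert.h>
#include <string.h>

#define MAX_TASKS 64

typedef struct {
    const char *name;
    int nin, nout;
    int in[4], out[4];            /* block ids read / written       */
    int npred;                    /* incoming edges (unmet deps)    */
    int succ[MAX_TASKS], nsucc;   /* outgoing edges                 */
} Task;

static int overlaps(const int *a, int na, const int *b, int nb) {
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
            if (a[i] == b[j]) return 1;
    return 0;
}

/* For every earlier task t, add an edge t -> u if u reads or writes
   anything t writes, or u writes anything t reads. */
static void analyze(Task *ts, int n) {
    for (int u = 0; u < n; u++)
        for (int t = 0; t < u; t++) {
            int dep =
                overlaps(ts[t].out, ts[t].nout, ts[u].in,  ts[u].nin)   /* flow   */
             || overlaps(ts[t].out, ts[t].nout, ts[u].out, ts[u].nout)  /* output */
             || overlaps(ts[t].in,  ts[t].nin,  ts[u].out, ts[u].nout); /* anti   */
            if (dep) {
                ts[t].succ[ts[t].nsucc++] = u;
                ts[u].npred++;
            }
        }
}
```

A dispatcher then repeatedly picks any task with `npred == 0`, runs it, and decrements `npred` on its successors, which is what lets independent tasks execute in parallel.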

SuperMatrix Runtime System
Dispatcher – Single Queue
 The set of all ready and available tasks
 FIFO or priority ordering
[Figure: processing elements PE 0 through PE p-1 dequeue tasks from a single shared queue; image not preserved in this transcript.]


SuperMatrix Runtime System
Lookahead
 Schedule the GEMM 5 and GEMM 6 tasks first so that LUpiv 9 can be "computed ahead" in parallel with GEMM 7 and GEMM 8
 In High-Performance LINPACK, lookahead is implemented directly within the code, which increases complexity and detracts from programmability

SuperMatrix Runtime System
Scheduling
 Sorting tasks by their height in the DAG mimics lookahead
 Multiple queues
   Data affinity
   Work stealing
 Macroblocks
   Tasks that overwrite more than one block at a time
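Height-based priority can be sketched as follows (illustrative only; names are made up). A task's priority is the length of the longest path from it to a leaf of the DAG, so a GEMM that feeds the next panel factorization outranks GEMMs that feed nothing, reproducing the lookahead effect without hand-coding it.

```c
/* Hedged sketch: priority = height in the DAG, i.e. the longest path from
   a task to a leaf. Among ready tasks, scheduling by decreasing height
   drives the critical path: a GEMM feeding the next LUpiv runs before
   GEMMs with no successors, which is exactly lookahead. */
#include <assert.h>

#define NMAX 32

/* succ[t][0..nsucc[t]-1] are t's successors; memo[] must start at -1. */
static int height(int t, int succ[][NMAX], const int *nsucc, int *memo) {
    if (memo[t] >= 0) return memo[t];
    int h = 0;
    for (int i = 0; i < nsucc[t]; i++) {
        int hs = 1 + height(succ[t][i], succ, nsucc, memo);
        if (hs > h) h = hs;
    }
    return memo[t] = h;
}
```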

Outline
 LU Factorization with Partial Pivoting
 Algorithm-by-Blocks
 SuperMatrix Runtime System
 Performance
 Conclusion
P A = L U

Performance
Implementations
 SuperMatrix + serial BLAS
   Partial and incremental pivoting
 LAPACK dgetrf + multithreaded BLAS
 Multithreaded dgetrf
 Double-precision real floating-point arithmetic
 Block size tuned per problem size

Performance
Target Architecture – Linux
 4-socket 2.3 GHz AMD Opteron Quad-Core
   ranger.tacc.utexas.edu
   3936 SMP nodes, 16 cores per node
   2 MB shared L3 cache per socket
 OpenMP
   Intel compiler 10.1
 BLAS
   GotoBLAS2 1.00, MKL 10.0

Performance
[Two slides of result graphs for the Linux platform; images not preserved in this transcript.]

Performance
Target Architecture – Windows
 4-socket 2.4 GHz Intel Xeon E7330 Quad-Core
   Windows Server 2008 R2 Enterprise
   16-core UMA machine
   Two 3 MB shared L2 caches per socket
 OpenMP
   Microsoft Visual C++
 BLAS
   GotoBLAS2 1.00, Intel MKL 10.2

Performance
[Two slides of result graphs for the Windows platform; images not preserved in this transcript.]

Performance
Results
 SuperMatrix is competitive with GotoBLAS and MKL
 Incremental pivoting ramps up in performance faster, but partial pivoting provides better asymptotic performance
 The Linux and Windows platforms attain similar performance curves

Outline
 LU Factorization with Partial Pivoting
 Algorithm-by-Blocks
 SuperMatrix Runtime System
 Performance
 Conclusion
P A = L U

Conclusion
Separation of Concerns
 Programmability
 Allows us to experiment with different scheduling algorithms

Acknowledgements
Andrew Chapman, Robert van de Geijn
 I thank the other members of the FLAME team for their support
Funding
 Microsoft
 NSF grants CCF– CCF–

Conclusion
More Information
Questions?