
1 1 A Look at Library Software for Linear Algebra: Past, Present, and Future Jack Dongarra University of Tennessee and Oak Ridge National Laboratory

2 2 1960’s  1961 –IBM Stretch delivered to LANL  1962 –Virtual memory from U Manchester, T. Kilburn  1964 –AEC urges manufacturers to look at ``radical new'' machine structures. »This leads to CDC Star-100, TI ASC, and Illiac-IV. –CDC 6600; S. Cray's design »Functional parallelism, leading to RISC »(3 times faster than the IBM Stretch)  1965 –DEC ships first PDP-8 & IBM ships 360 –First CS PhD, U of Penn: Richard Wexelblat –Wilkinson’s The Algebraic Eigenvalue Problem published  1966 –DARPA contract with U of I to build the ILLIAC IV –Fortran 66  1967 –Forsythe & Moler published »Fortran, Algol and PL/I  1969 –ARPANET work begins »4 computers connected: UC-SB, UCLA, SRI and U of Utah –CDC 7600 »Pipelined architecture; 3.1 Mflop/s –Unix developed by Thompson and Ritchie.....

3 3 Wilkinson-Reinsch Handbook  “In general we have aimed to include only algorithms that provide something approaching an optimal solution, this may be from the point of view of generality, elegance, speed, or economy of storage.” –Part 1 Linear System –Part 2 Algebraic Eigenvalue Problem  Before publishing Handbook –Virginia Klema and others at ANL began translation of the Algol procedures into Fortran subroutines

4 4 1970’s  1970 –NATS Project conceived »Concept of certified MS and process involved with production –Purdue Math Software Symposium –NAG project begins  1971 –Handbook for Automatic Computation, Vol II »Landmark in the development of numerical algorithms and software »Basis for a number of software projects: EISPACK, a number of linear algebra routines in IMSL and the F chapters of NAG –IBM 370/195 »Pipelined architecture; out of order execution; 2.5 Mflop/s –Intel 4004; 60 Kop/s –IMSL founded  1972 –Cray Research founded –Intel 8008 –1/4 size Illiac IV installed at NASA Ames »15 Mflop/s achieved; 64 processors –Paper by S. Reddaway on massive bit-level parallelism –EISPACK available »150 installations; EISPACK Users' Guide »5 versions, IBM, CDC, Univac, Honeywell and PDP, distributed free via Argonne Code Center –M. Flynn publishes paper on architectural taxonomy –ARPANet »37 computers connected  1973 –BLAS report in SIGNUM »Lawson, Hanson & Krogh.....

5 5 NATS Project  National Activity for Testing Software (NSF, Argonne, Texas and Stanford)  Project to explore the problems of testing, certifying, disseminating and maintaining quality math software. –First EISPACK, later FUNPACK »Influenced other “PACK”s  ELLPACK, FISHPACK, ITPACK, MINPACK, PDEPACK, QUADPACK, SPARSPAK, ROSEPACK, TOOLPACK, TESTPACK, LINPACK, LAPACK, ScaLAPACK...  Key attributes of math software –reliability, robustness, structure, usability, and validity

6 6 EISPACK  Fortran translations of the Algol algorithms –Restructured to avoid underflow –Check user’s claims, such as whether a matrix is positive definite –Format programs in a unified fashion »Burt Garbow –Field test sites  1971 U of Michigan Summer Conference –JHW algorithms & CBM software  1972 Software released via Argonne Code Center –5 versions  Software certified in the sense that reports of poor or incorrect performance “would gain the immediate attention from the developers”

7 7 EISPACK  EISPAC Control Program, Boyle –One interface allowed access to the whole package on IBM  Argonne’s interactive RESCUE system –Allowed us to easily manipulate tens of thousands of lines of code  Generalizer/Selector –Generalizer converts the IBM version to a general form –Selector extracts the appropriate version  1976 Extensions to package ready –EISPACK II

8 8 1970’s continued  1974 –Intel 8080 –Level 1 BLAS activity started by community; Purdue 5/74 –LINPACK meeting in summer at ANL  1975 –First issue of TOMS –Second LINPACK meeting at ANL »lay the groundwork and hammer out what was and was not to be included in the package. Proposal submitted to the NSF  1976 –Cray 1 - model for vector computing »4 Mflop/s in 79; 12 Mflop/s in 83; 27 Mflop/s in 89 –LINPACK work during summer at ANL –EISPACK second edition of Users' Guide  1977 –DEC VAX 11/780; super mini ».14 Mflop/s; 4.3 GB virtual memory –LINPACK test software developed & sent –EISPACK second release –IEEE Arithmetic standard meetings »paper by Palmer on INTEL std for fl pt  1978 –Fortran 77 –LINPACK software released »Sent to NESC and IMSL for distribution  1979 –John Cocke designs 801 –ICL DAP delivered to QMC, London –Level 1 BLAS published/released –LINPACK Users' Guide »Appendix: 17 machines, PDP-10 to Cray-1....

9 9 Basic Linear Algebra Subprograms  BLAS 1973-1977  Consensus on: –Names –Calling sequences –Functional Descriptions –Low level linear algebra operations  Success results from –Extensive Public Involvement –Careful consideration of implications  A design tool for the development of software in numerical linear algebra  Improve readability and aid documentation  Aid modularity and maintenance, and improve robustness of software calling the BLAS  Improve portability, without sacrificing efficiency, through standardization
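
To make the “names and calling sequences” concrete, here is a minimal hedged sketch of a Level 1 BLAS call in the deck’s Fortran register; the vector contents and program name are made up for illustration:

      program demo1
      integer n, i
      parameter (n = 5)
      double precision alpha, x(n), y(n)
c     fill small example vectors
      do 10 i = 1, n
         x(i) = dble(i)
         y(i) = 1.0d0
   10 continue
      alpha = 2.0d0
c     y := alpha*x + y, the standardized Level 1 BLAS "axpy" operation
      call daxpy(n, alpha, x, 1, y, 1)
      print *, y
      end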

10 10 CRAY and VAX  VAXination of groups and departments  Cray’s introduction of vector computing  Both had a significant impact on scientific computing

11 11 LINPACK  June 1974, ANL –Jim Pool’s meeting  February 1975, ANL –groundwork for project  January 1976 –funded by NSF and DOE: ANL/UNM/UM/UCSD  Summer 1976 - Summer 1977  Fall 1977 –Software to 26 test sites  December 1978 –Software released, NESC and IMSL

12 LINPACK  Research into the mechanics of software production.  Provide a yardstick against which future software would be measured.  Produce a library used by people directly and by those who wished to modify/extend the software to handle special problems.  Hoped it would be used in the classroom  Machine independent and efficient –No mixed-mode arithmetic

13 13 LINPACK Efficiency  Condition estimator  Inner loops via BLAS  Column access  Unrolling loops (done for the BLAS)  BABE Algorithm for tridiagonal matrices  TAMPR system allowed easy generation of versions  LINPACK Benchmark –Today reports machines from Cray T90 to Palm Pilot »(1.0 Gflop/s to 1.4 Kflop/s)
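
As a hedged sketch of the kind of loop unrolling done inside the Fortran BLAS (an illustration of the technique, not the actual LINPACK source):

      subroutine uaxpy(n, da, dx, dy)
      integer n, i, m
      double precision da, dx(*), dy(*)
c     clean-up loop for the leftover n mod 4 elements
      m = mod(n, 4)
      do 10 i = 1, m
         dy(i) = dy(i) + da*dx(i)
   10 continue
c     main loop unrolled to a depth of 4 to reduce loop overhead
      do 20 i = m + 1, n, 4
         dy(i)   = dy(i)   + da*dx(i)
         dy(i+1) = dy(i+1) + da*dx(i+1)
         dy(i+2) = dy(i+2) + da*dx(i+2)
         dy(i+3) = dy(i+3) + da*dx(i+3)
   20 continue
      return
      end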

14 14 1980’s Vector Computing (Parallel Processing)  1980 –Total computers in use in the US exceeds 1M –CDC introduces Cyber 205  1981 –IBM introduces the PC »Intel 8088/DOS –BBN Butterfly delivered –FPS delivers FPS-164 »Start of mini-supercomputer market –SGI founded by Jim Clark and others –Loop unrolling at outer level for data locality and parallelism »Amounts to matrix-vector operations –Cuppen's method for Symmetric Eigenvalue D&C published »Talk at Oxford Gatlinburg (1980)  1982 –Illiac IV decommissioned –Steve Chen's group at Cray produces X-MP –First Denelcor HEP installed (.21 Mflop/s) –Sun Microsystems, Convex and Alliant founded  1983 –Total computers in use in the US exceed 10 M –DARPA starts Strategic Computing Initiative »Helps fund Thinking Machines, BBN, WARP –Cosmic Cube hypercube running at Caltech »John Palmer, after seeing Caltech machine, leaves Intel to found Ncube –Encore, Sequent, TMC, SCS, Myrias founded –Cray 2 introduced –NEC SX-1 and SX-2, Fujitsu ships VP-200 –ETA Systems spun off from CDC –Golub & Van Loan published.........

15 15 1980’s continued  1984 –NSFNET; 5000 computers; 56 Kb/s lines –MathWorks founded –EISPACK third release –Netlib begins 1/3/84 –Level 2 BLAS activity started »Gatlinburg (Waterloo), Purdue, SIAM –Intel Scientific Computers started by J. Rattner »Produce commercial hypercube –Cray X-MP 1 processor, 21 Mflop/s Linpack –Multiflow founded by J. Fisher; VLIW architecture –Apple introduces Mac & IBM introduces PC AT –IJK paper  1985 –IEEE Standard 754 for floating point –IBM delivers 3090 vector; 16 Mflop/s Linpack, 138 Peak –TMC demos CM1 to DARPA –Intel produces first iPSC/1 Hypercube »80286 connected via Ethernet controllers –Fujitsu VP-400; NEC SX-2; Cray 2; Convex C1 –Ncube/10; .1 Mflop/s Linpack 1 processor –FPS-264; 5.9 Mflop/s Linpack 38 Peak –IBM begins RP3 project –Stellar (Poduska), Ardent (Michels) and Supertek Computers founded –Denelcor closes doors.....

16 16 1980’s (Lost Decade for Parallel Software)  1986 –# of computers in US exceeds 30 M –TMC ships CM-1; 64K 1 bit processors –Cray X-MP –IBM and MIPS release first RISC WS  1987 –ETA Systems family of supercomputers –Sun Microsystems introduces its first RISC WS –IBM invests in Steve Chen's SSI –Cray Y-MP –First NA-DIGEST –Level 3 BLAS work begun –LAPACK: Prospectus Development of a LA Library for HPC  1988 –AMT delivers first re-engineered DAP –Intel produces iPSC/2 –Stellar and Ardent begin delivering single user graphics workstations –Level 2 BLAS paper published  1989 –# of computers in the US > 50M –Stellar and Ardent merge, forming Stardent –S. Cray leaves Cray Research to form Cray Computer –Ncube 2nd generation machine –ETA out of business –Intel 80486 and i860 ; 1 M transistors »i860 RISC & 64 bit floating point......

17 17 EISPACK 3 and BLAS 2 & 3  Machine independence for EISPACK  Reduce the possibility of overflow and underflow.  Mods to the Algol from S. Hammarling  Rewrite reductions to tridiagonal form to involve sequential access to memory  “Official” Double Precision version  Inverse iteration routines modified to reduce the size for reorthogonalization.  BLAS (Level 1) vector operations involve too much data movement.  Community effort to define extensions –Matrix-vector ops –Matrix-matrix ops
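
A hedged sketch of what the Level 2 (matrix-vector) extension looks like in use; the array contents and program name below are illustrative, not from the talk:

      program demo2
      integer m, n, i, j
      parameter (m = 4, n = 3)
      double precision a(m,n), x(n), y(m)
c     simple test data
      do 20 j = 1, n
         x(j) = 1.0d0
         do 10 i = 1, m
            a(i,j) = dble(i + j)
   10    continue
   20 continue
c     y := 1.0*A*x + 0.0*y : one Level 2 BLAS call in place of a
c     loop of Level 1 (vector) operations
      call dgemv('No transpose', m, n, 1.0d0, a, m, x, 1, 0.0d0, y, 1)
      print *, y
      end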

18 18 Netlib - Mathematical Software and Data  Began in 1985 –JD and Eric Grosse, AT&T Bell Labs  Motivated by the need for cost-effective, timely distribution of high-quality mathematical software to the community.  Designed to send, by return electronic mail, requested items.  Automatic mechanism for electronic dissemination of freely available software. –Still in use and growing –Mirrored at 9 sites around the world  Moderated collection / Distributed maintenance  NA-DIGEST and NA-Net –Gene Golub, Mark Kent and Cleve Moler

19 19 Netlib Growth  Just over 6,000 in 1985  Over 29,000,000 total  Over 9 million hits in 1997  5.4 million so far in 1998  LAPACK the best seller: 1.6 M hits

20 20 1990’s  1990 –Internet –World Wide Web –Motorola introduces 68040 –NEC ships SX-3; first Japanese parallel vector supercomputer –IBM announces RS/6000 family »has FMA instruction –Intel hypercube based on i860 chip »128 processors –Alliant delivers FX/2800 based on i860 –Fujitsu VP-2600 –PVM project started –Level 3 BLAS published  1991 –Stardent to sell business and close –Cray C-90 –Kendall Square Research delivers 32 processor KSR-1 –TMC produces CM-200 and announces CM-5 MIMD computer –DEC announces the Alpha –TMC produces the first CM-5 –Fortran 90 –Workshop to consider Message Passing Standard, beginnings of MPI »Community effort –Xnetlib running  1992 –LAPACK software released & Users' Guide published....


22 22 Parallel Processing Comes of Age  "There are three rules for programming parallel computers. We just don't know what they are yet." -- Gary Montry  “Embarrassingly Parallel”, Cleve Moler  “Humiliatingly Parallel”

23 23 Memory Hierarchy and LAPACK  ijk implementations –Affect the order in which data is referenced; some orderings are better at keeping data in the higher levels of the memory hierarchy.  Applies for matrix multiply, reductions to condensed form –May do slightly more flops –Up to 3 times faster
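
A hedged sketch of two of the orderings in question for the update C = C + A*B; in Fortran’s column-major storage the jki form walks down columns with unit stride, which is why the ordering matters for the memory hierarchy:

      subroutine mmijk(n, a, b, c)
      integer n, i, j, k
      double precision a(n,n), b(n,n), c(n,n)
c     ijk ordering: inner loop is a dot product striding across a row of a
      do 30 i = 1, n
         do 20 j = 1, n
            do 10 k = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
   10       continue
   20    continue
   30 continue
      return
      end

      subroutine mmjki(n, a, b, c)
      integer n, i, j, k
      double precision a(n,n), b(n,n), c(n,n)
c     jki ordering: inner loop is an axpy on a column of c, unit stride
      do 30 j = 1, n
         do 20 k = 1, n
            do 10 i = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
   10       continue
   20    continue
   30 continue
      return
      end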

24 24 Why Higher Level BLAS?  Arithmetic can only be done on data at the top of the memory hierarchy  Higher level BLAS let us keep data there  Development of blocked algorithms important for performance (figure: Level 3 BLAS, Level 2 BLAS, Level 1 BLAS)
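
A hedged sketch of the blocking idea: a rank-nb update written as a single Level 3 BLAS call, so a block of A stays high in the memory hierarchy while many flops are done on it (the routine and array names here are illustrative):

      subroutine rnbupd(n, nb, a, u, v)
      integer n, nb
      double precision a(n,n), u(n,nb), v(n,nb)
c     A := A - U*V^T as one Level 3 BLAS call (a rank-nb update),
c     instead of nb separate rank-1 (Level 2) updates
      call dgemm('No transpose', 'Transpose', n, n, nb,
     $           -1.0d0, u, n, v, n, 1.0d0, a, n)
      return
      end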

25 25 History of Block Partitioned Algorithms  Ideas not new.  Early algorithms involved use of small main memory using tapes as secondary storage.  Recent work centers on use of vector registers, level 1 and 2 cache, main memory, and “out of core” memory.

26 26 LAPACK  Linear Algebra library in Fortran 77 –Solution of systems of equations –Solution of eigenvalue problems  Combine algorithms from LINPACK and EISPACK into a single package  Block algorithms –Efficient on a wide range of computers »RISC, Vector, SMPs  User interface similar to LINPACK –Single, Double, Complex, Double Complex  Built on the Level 1, 2, and 3 BLAS  Performance from the HP-48G to the CRAY T-90 (Mflop/s)


28 28 1990’s continued  1993 –Intel Pentium systems start to ship –ScaLAPACK prototype software released »First portable library for distributed memory machines »Intel, TMC and workstations using PVM –PVM 3.0 available  1994 –MPI-1 finished  1995 –Templates project  1996 –Internet; 34M users –Nintendo 64 »More computing power than a Cray 1 and much, much better graphics  1997 –MPI-2 finished –Fortran 95  1998 –Issues of parallel and numerical stability –Divide time –DSM architectures –"New" Algorithms »Chaotic iteration »Sparse LU w/o pivoting »Pipeline HQR »Graph partitioning »Algorithmic bombardment....

29 29 Templates Project  Iterative methods for large sparse systems –Communicate to the HPC community “state of the art” algorithms –Subtle algorithmic issues addressed, i.e. convergence, preconditioners, data structures –Performance and parallelism considerations –Gave computational scientists algorithms in the form they wanted.

30 30 ScaLAPACK  Library of software for dense & banded »Sparse direct being developed  Distributed Memory - Message Passing –PVM and MPI  MIMD Computers, Networks of Workstations, and Clumps of SMPs  SPMD Fortran 77 with object based design  Built on various modules –PBLAS (BLACS and BLAS) »PVM, MPI, IBM SP, CRI T3, Intel, TMC »Provides right level of notation.

31 31 High-Performance Computing Directions  Move toward shared memory –SMPs and Distributed Shared Memory –Shared address space w/deep memory hierarchy  Clustering of shared memory machines for scalability –Emergence of PC commodity systems »Pentium based, NT or Linux driven  At UTK, a cluster of 14 (dual) Pentium-based nodes: 7.2 Gflop/s  Efficiency of message passing and data parallel programming –Helped by standards efforts such as PVM, MPI and HPF  Complementing “Supercomputing” with Metacomputing  Computational Grid

32 32 Heterogeneous Computing  Heterogeneity introduces new bugs in parallel code  Slightly different floating point arithmetic can make data-dependent branches go different ways when we expect identical behavior.  A “correct” algorithm on a network of identical workstations may fail if a slightly different machine is introduced.  Some easy to fix (compare s < tol on 1 proc and broadcast the result)  Some hard to fix (handling denorms; getting the same answer independent of # of procs)
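
A hedged MPI sketch of the “easy fix” mentioned above: a single designated process evaluates the data-dependent test and broadcasts the outcome, so every process takes the same branch. The routine and variable names are made up for illustration:

      subroutine brsync(s, tol, conv)
      include 'mpif.h'
      double precision s, tol
      logical conv
      integer rank, ierr
      call mpi_comm_rank(mpi_comm_world, rank, ierr)
c     only the root evaluates the data-dependent test
      if (rank .eq. 0) conv = (s .lt. tol)
c     everyone receives the root's decision and so branches identically
      call mpi_bcast(conv, 1, mpi_logical, 0, mpi_comm_world, ierr)
      return
      end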

33 33 Java - For Numerical Computations?  Java likely to be a dominant language.  Provides for machine independent code.  C++ like language  No pointers, gotos, overloading of arithmetic operators, or explicit memory deallocation  Portability achieved via an abstract machine  Java is a convenient user interface builder which allows one to quickly develop customized interfaces.

34 34 Network Enabled Servers  Allow networked resources to be integrated into the desktop.  Many hosts, co-existing in a loose confederation tied together with high-speed links.  Users have the illusion of a very powerful computer on the desk.  Locate and “deliver” software or solutions to the user in a directly usable and “conventional” form.  Part of the motivation is software maintenance.

35 35 Future: Petaflops (10^15 fl pt ops/s)  A Pflop for 1 second ≈ a typical workstation computing for 1 year.  From an algorithmic standpoint –concurrency –data locality –latency & sync –floating point accuracy  May be feasible and “affordable” by the year 2010 –dynamic redistribution of workload –new language and constructs –role of numerical libraries –algorithm adaptation to hardware failure

36 36 Summary  As a community we have a lot to be proud of in terms of the algorithms and software we have produced. –generality, elegance, speed, or economy of storage  In many cases the software is still being used 30 years after it was written.


38 38 End of talk

39 39 BLAS Were Intended To:  Be a design tool for the development of software in numerical linear algebra  Improve readability and aid documentation  Aid modularity and maintenance, and improve robustness of software calling the BLAS  Promote efficiency  Improve portability, without sacrificing efficiency, through standardization

40 40 LINPACK Benchmark  Appeared in the Users’ Guide –Dense system of linear equations »LU decomposition via Gaussian elimination. –Users’ Guide Appendix  Reported AXPY op time  Has a life of its own  Today reports machines from Cray T90 to Palm Pilot –(1.0 Gflop/s to 1.4 Kflop/s)
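
For reference, the benchmark rate is conventionally derived from the nominal operation count for factoring and solving an n x n dense system; a hedged sketch (the function name is illustrative):

      double precision function dmflop(n, secs)
      integer n
      double precision secs, ops
c     nominal operation count for LU factorization + solve:
c     2/3 n**3 + 2 n**2 floating point operations
      ops = (2.0d0/3.0d0)*dble(n)**3 + 2.0d0*dble(n)**2
      dmflop = ops / (secs*1.0d6)
      return
      end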

41 41 Future Research Directions and Challenges: “ Prediction is hard.”

42 42 ScaLAPACK Structure (figure): ScaLAPACK is layered on the PBLAS and LAPACK; the PBLAS on the BLACS and the BLAS; the BLACS on PVM/MPI/...; the upper layers are global (distributed), the lower layers local.

43 43 Metacomputing Objectives  Flexibility and extensibility  Site autonomy  Scalable architecture  Single global namespace  Easy-to-use, seamless environment  High performance through parallelism  Security  Management / exploitation of heterogeneity  Multilanguage interoperability  Fault-tolerance

44 44 Grid Based Computations  Long running computations  Grid based computation  Network of Workstations  Fault tolerance  Reproducibility of solution  Auditability of computation

45 45 Heterogeneous Conclusions  Defensive programming  Machine parameters  Communication and representation between processors  Controlling processor  Additional communication  Testing strategies

46 46 Metacomputing Issues  Logically only one system  System should determine where the computation executes  Fault-tolerance should be transparent  Applications freed from the mechanics of distributed programming  Self configuring, with new resources added automatically  Just-in-time binding of software, data, and hardware.  User shouldn’t have to decide where the computation is performed  User would be able to walk up to a machine and have their program and data follow them

47 47 Grid Computing and Numerical Computations  The Computational Grid poses certain challenges from a numerical standpoint.  Some users would like/want/expect/demand the same results they obtained yesterday the next time they run the application.  How can we guarantee reproducibility?

48 48 Heterogeneous Computing  Software intended to be used in this context  Machine precision and other machine specific parameters  Communication of fl. pt. numbers between processors  Repeatability –run to run differences, e.g. order in which quantities are summed  Coherency –within run differences, e.g. arithmetic varies from proc to proc  Iterative convergence across clusters of processors  Defensive programming required  Important for the “Computational Grid”

49 49 Machine Precision  Smallest floating point number u such that 1 + u > 1.  Used in: –Convergence tests -- has an iteration converged? –Tolerance tests -- is a matrix numerically singular? –Error analysis -- how does floating point arithmetic affect results?

50 50 Heterogeneous Machine Precision  Return the maximum relative machine precision over all the processors  LAPACK --> DLAMCH  ScaLAPACK --> PDLAMCH  Note that even on IEEE machines, the machine precision can vary slightly. For double precision: eps = 2**(-53) + q, where q = 2**(-105) or q = 2**(-64) + 2**(-105).
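
A hedged sketch of the idea behind PDLAMCH: take the maximum of the local machine precisions over all processes so every process works with the same, safest value (MPI is used directly here instead of the BLACS, purely for illustration):

      double precision function maxeps(comm)
      include 'mpif.h'
      integer comm, ierr
      double precision leps, geps, dlamch
      external dlamch
c     each process queries its own relative machine precision
      leps = dlamch('Epsilon')
c     reduce with MAX so all processes agree on one (largest) value
      call mpi_allreduce(leps, geps, 1, mpi_double_precision,
     $                   mpi_max, comm, ierr)
      maxeps = geps
      return
      end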

51 51 Grid Enabled Math Software  Predictability and robustness of accuracy and performance.  Run-time resource management and algorithm selection.  Support for a multiplicity of programming environments.  Reproducibility, fault tolerance, and auditability of the computations.  New algorithmic techniques for latency tolerant applications.

52 52 Heterogeneous Computing  Software intended to be used in this context  Machine precision and other machine specific parameters  Communication of ft. pt. numbers between processors  Repeatability - –run to run differences, e.g. order in which quantities summed  Coherency - –within run differences, e.g. arithmetic varies from proc to proc  Iterative convergence across clusters of processors  Defensive programming required  Important for the “Computational Grid”

53 53 Automatically Tuned Numerical Software  Automatic generation of computational kernels for RISC architectures.  A package that adapts itself to differing architectures via code generation coupled with timing  Code generator takes about 1-2 hours to run. –Done once for a new architecture. –Written in ANSI C (generates C)  Today Level 2 and 3 BLAS  Extension to higher level operations. –SMPs –SGI/Vector –DSP –FFTs
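
A hedged sketch of the “generate, time, select” idea (not the actual ATLAS generator): try a few candidate block sizes, time the kernel for each, and keep the fastest. The user-supplied timer function is an assumption of this sketch:

      subroutine piknb(ncand, cand, timer, bestnb)
      integer ncand, cand(ncand), bestnb, i
      double precision timer, t, bestt
      external timer
c     timer(nb) is assumed to build/run the blocked kernel with block
c     size nb and return its elapsed time in seconds
      bestt = timer(cand(1))
      bestnb = cand(1)
      do 10 i = 2, ncand
         t = timer(cand(i))
         if (t .lt. bestt) then
            bestt = t
            bestnb = cand(i)
         end if
   10 continue
      return
      end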

54 54 Why Such a System is Needed  BLAS require many man-hours / platform –Only done if financial incentive is there »Many platforms will never have an optimal version –Lags behind hardware –May not be affordable by everyone  Allows for portably optimal codes  Package contains: –Code generators –Sophisticated timers –Robust search routines

55 55 Future Research Directions and Challenges: “Prediction is hard.”

56 56 Future Research Directions and Challenges:  Numerical behavior in a heterogeneous environment.  Efficient software for core routines.  Fault tolerant aspects  "New" Algorithms –Chaotic iteration –Divide and Conquer –Sparse LU w/o pivoting –Pipeline HQR –Graph partitioning –Algorithmic bombardment  “Prediction is hard. Especially the future...” Yogi Berra, philosopher in the Baseball Hall of Fame

57 57 Research Directions  Grid enabled strategies  Parameterizable libraries  Annotated libraries  Hierarchical algorithm libraries  A new division of labor between compiler writers, library writers, algorithm developers, and application developers will emerge.

58 58 Motivation for NetSolve (figure): today everything is at the user’s site and responsibility — search, download, install, and learn Library X / Y / Z from software repositories over the Internet, then link, debug, and run the user’s program to get a result.

59 59 Motivation for NetSolve (figure, continued): with NetSolve, the software repositories (Library X / Y / Z) sit behind NetSolve; the user reaches them through APIs, GUIs, and the Web, and only the result comes back to the user’s site and responsibility.

60 60 Motivation for NetSolve  Basics: design an easy-to-use tool to provide efficient and uniform access to a variety of scientific packages on UNIX platforms  Client-Server Design  Non-hierarchical system  Load Balancing and Fault Tolerance  Heterogeneous Environment Supported  Multiple and simple client interfaces  http://www.cs.utk.edu/netsolve/

61 61 NetSolve - The Big Picture (figure): the client sends a request to the agent; the agent replies with a choice of computational resources; the client then uses those resources and gets the reply back. http://www.cs.utk.edu/netsolve/

62 62 Network Computing Classifications  Code shipping (JAVA model) –Program: Server → Client –Data: Client –Result: Client  Remote Computing (NetSolve model) –Program: Server –Data: Client → Server –Result: Server → Client  http://www.cs.utk.edu/netsolve/

63 http://www.cs.utk.edu/netsolve NetSolve - Typical Utilizations  Intranet, Extranet, Internet  Proprietary, Collaborative, Open, Controlled,...  Used by scientists as a computational engine  Customized local servers  Operating environment for higher level applications  Client/Server/Agent could be on the same machine.

64 http://www.cs.utk.edu/netsolve NetSolve - The Client  Multiple interfaces –C, FORTRAN, MATLAB, & Mathematica –Java: GUI and API  Natural problem specification –some input objects –some output objects –a name (example: ‘LinearSystem’)

65 65 Architecture features  Cache –Goes back to the Atlas computer  Pipelining –CDC 7600  Vector instructions –chaining –overlapping  Loop unrolling

66 66  Linpack benchmark –LINPACK is a benchmark, but it is really a collection of software routines for solving systems of linear equations. –Some vendors, I am told, have been accused of having a switch which is called the "LINPACK switch", a program device to recognize the LINPACK program and try to optimize based on that.

67 67  Data locality  Give Cyber 205 example

68 68 Problem Solving Environments & Computational Grid  Many hosts, co-existing in a loose confederation tied together with high-speed links.  Users have the illusion of a very powerful computer on the desk.  System provides all computational facilities to solve target class of problems  Automatic or semiautomatic selection of solution methods; Ways to easily incorporate novel solution methods.  Communicate using language of target class of problems.  Allow users to do computational steering.

69 69 The Importance of Standards - Software  Writing programs for MPP is hard...  But... they need not be one-off efforts if written in a standard language  Past lack of parallel programming standards... –has restricted uptake of technology (to "enthusiasts") –reduced portability (over a range of current architectures and between future generations) –lost decade  Now standards exist: (PVM, MPI & HPF), which... –allow users & manufacturers to protect software investment –encourage growth of a "third party" parallel software industry & parallel versions of widely used codes –Others: POSIX, CORBA,...

70 70 LAPACK Linear Algebra Library in F77  Solution of systems of equations Ax = b for general dense, symmetric, banded, tridiagonal matrix A,  Solution of linear least squares problems min_x || Ax - b ||_2 for general dense matrix A,  Singular value decomposition A = U Σ V^T for general dense, banded matrix A,  Solution of eigenvalue problems Ax = λx for general dense, symmetric, banded matrix A,  Solution of generalized linear least squares, (GSEP), (GNEP), (GSVD),  Error bounds, condition estimation for (almost) everything,  Real and complex, single and double precision,  Easy-to-use drivers plus computational kernels.

71 71 LAPACK Ongoing Work  Add functionality –updating/downdating, divide and conquer least squares, bidiagonal bisection, bidiagonal inverse iteration, band SVD, Jacobi methods,...  Move to new generation of high performance machines –IBM SPs, CRAY T3E, SGI Origin, clusters of workstations  New challenges –New languages: FORTRAN 90, HPF,... –Many flavors of message passing (CMMD, MPL, NX...), need a standard (PVM, MPI): BLACS  Highly varying ratio of computation to communication speed  Many ways to lay out the data  Fastest parallel algorithm is sometimes less stable numerically.

72 72 LAPACK Blocked Algorithms

      DO 10 J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL STRSM( 'Left', 'Upper', 'Transpose', 'Non-Unit', J-1, JB,
     $               ONE, A, LDA, A( 1, J ), LDA )
         CALL SSYRK( 'Upper', 'Transpose', JB, J-1, -ONE, A( 1, J ),
     $               LDA, ONE, A( J, J ), LDA )
         CALL SPOTF2( 'Upper', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 ) GO TO 20
   10 CONTINUE
   20 CONTINUE

On the Y-MP, Level 3 BLAS squeezes a little more out of 1 proc, but makes a large improvement when using 8 procs.

73 73 Challenges in Developing Distributed Memory Libraries  How to integrate software? –Until recently no standards –Many parallel languages –Various parallel programming models –Assumptions about the parallel environment »granularity »topology »overlapping of communication/computation »development tools  Where is the data? –Who owns it? –Optimal data distribution  Who determines the data layout? –Determined by user? –Determined by library developer? –Allow dynamic data distribution –Load balancing

74 74 Heterogeneous Computing  Software intended to be used in this context  Machine precision and other machine specific parameters  Communication of fl. pt. numbers between processors  Repeatability –run to run differences, e.g. order in which quantities are summed  Coherency –within run differences, e.g. arithmetic varies from proc to proc  Iterative convergence across clusters of processors  Defensive programming required

75 75 Portability of Numerical Software  Numerical software should work portably on different computers with good numerical accuracy, stability and robustness of solution. –EISPACK - DEC, IBM, CDC (early 70’s) –LINPACK - Vector, pipelined (late 70’s) –LAPACK - Vector, RISC, SMPs (late 80’s early 90’s) –ScaLAPACK (late 90’s) »MPP »DSM »Clumps »Heterogeneous NOWs

76 76 ScaLAPACK  LAPACK software expertise/quality –Numerical methods –Library of software dealing with dense & banded routines  Distributed Memory - Message Passing  MPP Computers and NOWs  SPMD Fortran 77 with object based design  Built on various modules –PBLAS Interprocessor communication –BLAS/BLACS »PVM, MPI, IBM SP, CRI T3, Intel, TMC »Provides right level of notation.

77 77 Choosing a Data Distribution  Main issues are: –Load balancing –Use of the Level 3 BLAS  1D block and cyclic column distributions  1D block-cyclic column and 2D block-cyclic distribution  2D block-cyclic used in ScaLAPACK for dense matrices
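
A hedged sketch of the arithmetic behind the 2D block-cyclic layout: which process row owns a given global row when rows are dealt out in blocks of mb over nprow process rows (the analogous formula applies to columns). This assumes 1-based global indices, 0-based process coordinates, and the first block starting on process row 0:

      integer function ownrow(ig, mb, nprow)
      integer ig, mb, nprow
c     block number (ig-1)/mb is assigned round-robin over the
c     nprow process rows
      ownrow = mod((ig - 1)/mb, nprow)
      return
      end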

78 78 Programming Style  SPMD Fortran 77 with object based design  Built on various modules –PBLAS Interprocessor communication –BLACS »PVM, MPI, IBM SP, CRI T3, Intel, TMC »Provides right level of notation. –BLAS  LAPACK software expertise/quality –software approach –numerical methods

79 79 Parallelism in ScaLAPACK  Level 3 BLAS block operations –All the reduction routines  Pipelining –QR Algorithm, Triangular Solvers, classic factorizations  Redundant computations –Condition estimators  Static work assignment –Bisection  Task parallelism –Sign function eigenvalue computations  Divide and Conquer –Tridiagonal and band solvers, symmetric eigenvalue problem and Sign function  Cyclic reduction –Reduced system in the band solver  Data parallelism –Sign function

80 ScaLAPACK - What’s Included  Timing and testing routines for almost all routines; these are a large component of the package  Prebuilt libraries available for SP, PGON, HPPA, DEC, Sun, RS6K

81 81 Direct Sparse Solvers  CAPSS is a package to solve Ax=b on a message passing multiprocessor; the matrix A is SPD and associated with a mesh in 2 or 3D. (Version for Intel only)  SuperLU and UMFPACK - sequential implementations of Gaussian elimination with partial pivoting.

82 82 Parallel Iterative Solvers  Many packages for message passing systems –PETSc, AZTEC, BLOCKSOLVE95, P-Sparselib,...  PETSc is a set of libraries and structures that assist in the solution of PDEs and related problems. –Parallel algebraic data structures, solvers, and related infrastructure required for solving PDEs using implicit methods involving finite elements, finite differences, or finite volumes.

83 83 Parallel Sparse Eigenvalue Solvers  P_ARPACK (D. Sorensen et al) –Designed to compute a few eigenvalues and corresponding eigenvectors of a general matrix. –Appropriate for large sparse or structured matrices A –This software is based on an Arnoldi process called the Implicitly Restarted Arnoldi Method. –Reverse Communication Interface.

84 84 Java for Numerical Computations?  A few months ago: 1 Mflop/s on a 600 Mflop/s processor.  Top performer today is 46 Mflop/s for a P6 using the MS Explorer 3.0 JIT (62 Mflop/s in Fortran).

85 85 LAPACK to JAVA  Allows Java programmers access to the BLAS/LAPACK routines.  Working on a translator to go from LAPACK to Java byte code –lap2j: formal compiler of a subset of f77 sufficient for BLAS & LAPACK  Plan to enable all of LAPACK –Compiler provides quick, reliable translation.  Focus on LAPACK Fortran –Simple - no COMMON, EQUIVALENCE, SAVE, I/O

86 86 Parameterized Libraries  Architecture features for performance –Cache size –Latency –Bandwidth  Latency tolerant algorithms –Latency reduction –Granularity management  High levels of concurrency  Issues of precision

87 87 Motivation for Network Enabled Solvers  Design an easy-to-use tool to provide efficient and uniform access to a variety of scientific packages on UNIX platforms  Client-Server Design  Non-hierarchical system  Load Balancing  Fault Tolerance  Heterogeneous Environment Supported  (figure: client, agent, and computational resources exchanging request, choice, and reply)

88 88 NetSolve - Future Paradigm (figure): the NetSolve client asks the NetSolve agent to do a software lookup (against the Netlib software repository) and a hardware lookup, with dynamic hardware/software binding to a NetSolve server.

89 89 On Mathematical Software  Labor intensive activity  Enabling Technology –Helped many computational scientists –Provides a basis for many commercial efforts  Often software people not given credit –“Lip Service” given to work  Funding to universities and laboratories –NSF, DOE, DARPA, AFOSR,..

90 90 References  Copy of slides in: http://www.netlib.org/utk/people/JackDongarra/siam-797/  Annotated table of freely available software in: http://www.netlib.org/utk/people/JackDongarra/la-sw.html  http://www.netlib.org/  http://www.nhse.org/  http://www.nhse.org/hpc-netlib/

91 91 IBM Stretch  Design started in 1955 by IBM after it lost the bid for the U of C Radiation Lab (LLNL) –Univac won with LARC  Was to be 100 X faster than what was available  At IBM: Fred Brooks, John Cocke, Erich Bloch,...  Look ahead unit  Memory operand prefetch  Out of order execution  Speculative execution based on branch prediction  Partitioned function units

92 92 Stretch

93 93

      double precision function pythag(a,b)
      double precision a,b
c
c     finds dsqrt(a**2+b**2) without overflow or destructive underflow
c
      double precision p,r,s,t,u
      p = dmax1(dabs(a),dabs(b))
      if (p .eq. 0.0d0) go to 20
      r = (dmin1(dabs(a),dabs(b))/p)**2
   10 continue
      t = 4.0d0 + r
      if (t .eq. 4.0d0) go to 20
      s = r/t
      u = 1.0d0 + 2.0d0*s
      p = u*p
      r = (s/u)**2 * r
      go to 10
   20 pythag = p
      return
      end

94 94  Nintendo 64 –more computing power than a Cray 1 and also much better graphics

95 95

      double precision function epslon (x)
      double precision x
c
c     estimate unit roundoff in quantities of size x.
c
      double precision a,b,c,eps
c
c     this program should function properly on all systems
c     satisfying the following two assumptions,
c        1. the base used in representing floating point
c           numbers is not a power of three.
c        2. the quantity  a  in statement 10 is represented to
c           the accuracy used in floating point variables
c           that are stored in memory.
c     the statement number 10 and the go to 10 are intended to
c     force optimizing compilers to generate code satisfying
c     assumption 2.
c     under these assumptions, it should be true that,
c        a  is not exactly equal to four-thirds,
c        b  has a zero for its last bit or digit,
c        c  is not exactly equal to one,
c        eps  measures the separation of 1.0 from
c             the next larger floating point number.
c     the developers of eispack would appreciate being informed
c     about any systems where these assumptions do not hold.
c
c     this version dated 4/6/83.
c
      a = 4.0d0/3.0d0
   10 b = a - 1.0d0
      c = b + b + b
      eps = dabs(c-1.0d0)
      if (eps .eq. 0.0d0) go to 10
      epslon = eps*dabs(x)
      return
      end

96 96 Outline  Architectural Opportunities and Challenges.  Focus on High- Performance Computers –MPP/Clusters/NOWs  Language –Fortran 77, 90 –HPF –C, C++, Java  Dense direct solvers –ScaLAPACK, PLAPACK  Sparse direct solvers  Iterative solvers

97 97 LAPACK Motivation  LAPACK using high level BLAS  Use the manufacturer’s BLAS for high performance.  Portable Fortran 77 (766 K lines)

98 98 LAPACK Motivation  Example: Cray C-90, 1 and 16 processors  LINPACK low performance due to excess data movement through the memory hierarchy

99 99 Derivation of Blocked Algorithms: Cholesky Factorization A = U^T U  Partition A and U by the first j-1 columns, the j-th column, and the rest; equating coefficients of the j-th column in A = U^T U, we obtain
      a_j  = U_11^T u_j
      a_jj = u_j^T u_j + u_jj^2
 Hence, if U_11 has already been computed, we can compute u_j and u_jj from the equations:
      U_11^T u_j = a_j
      u_jj = sqrt( a_jj - u_j^T u_j )

100 100 LINPACK Implementation  Here is the body of the LINPACK routine SPOFA which implements the method:

      DO 30 J = 1, N
         INFO = J
         S = 0.0E0
         JM1 = J - 1
         IF( JM1.LT.1 ) GO TO 20
         DO 10 K = 1, JM1
            T = A( K, J ) - SDOT( K-1, A( 1, K ), 1, A( 1, J ), 1 )
            T = T / A( K, K )
            A( K, J ) = T
            S = S + T*T
   10    CONTINUE
   20    CONTINUE
         S = A( J, J ) - S
C     ...EXIT
         IF( S.LE.0.0E0 ) GO TO 40
         A( J, J ) = SQRT( S )
   30 CONTINUE


102 102 LAPACK from HP-48G to CRAY T-90  (Trans)portable FORTRAN 77 (+BLAS), 766K lines (354K library, 412K testing and timing)  Targets workstations, vector machines, and shared memory computers  Initial release February 1992  Linear systems, least squares, eigenproblems  Manual available from SIAM translated into Japanese  Fifth release later this year.

103 103 Physical and Logical Views of PVM (figure contrasting the physical view and the logical view)

104 104 Performance Today in Terms of LINPACK Benchmark  ASCI Red at 1.3 Tflop/s  Sun Sparc Ultra 2 at 110 Mflop/s  Thinkpad 600 at 50 Mflop/s  Palm Pilot at 1.4 Kflop/s

105 Standards - The MPI Forum  Created using the HPF model  A group of 30-40 “experts” in message- passing: –MPP vendors, –CS researchers, –Application developers.  Met 3 days every 6 weeks for 1.5 years and created the MPI-1 specification draft.  Communicators - communication context needed for safe parallel library design.  Derived Datatypes - user defined datatypes for non-homogeneous, non-contiguous data.  Process Groups - collective operations can operate over a subset of the processes.
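
A hedged Fortran sketch of why communicators and process groups matter for safe library design: split the world communicator so that a collective operation runs over only a subset of processes (the split criterion here is just an example):

      subroutine rowsum(myrow, x, xsum)
      include 'mpif.h'
      integer myrow, rank, rowcomm, ierr
      double precision x, xsum
      call mpi_comm_rank(mpi_comm_world, rank, ierr)
c     processes supplying the same color (here, the same row index)
c     are grouped into the same new communicator
      call mpi_comm_split(mpi_comm_world, myrow, rank, rowcomm, ierr)
c     the collective now involves only that subset of processes
      call mpi_allreduce(x, xsum, 1, mpi_double_precision,
     $                   mpi_sum, rowcomm, ierr)
      call mpi_comm_free(rowcomm, ierr)
      return
      end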

106 Super-LU  Supernodes: –Permit use of higher level BLAS. –Reduce inefficient indirect addressing. –Reduce symbolic time by traversing supernodal graph.  Algorithm: PA = LU, A sparse and unsymmetric.  Exploit dense submatrices in the L & U factors of PA = LU.

107 107 Performance on Computational Kernels  New implementations of architectures every few months.  How to keep up with the changing design so as to have optimized numerical kernels.  Effort to aid the compiler  Pre-compiled software for machine specific implementations

108 108 Grid Enabled Math Software  Predictability and robustness of accuracy and performance.  Run-time resource management and algorithm selection.  Support for a multiplicity of programming environments.  Reproducibility, fault tolerance, and auditability of the computations.  New algorithmic techniques for latency tolerant applications.

