College of Nanoscale Science and Engineering
A uniform algebraically-based approach to computational physics and efficient programming

James E. Raynolds, College of Nanoscale Science and Engineering, University at Albany, State University of New York, Albany, NY
Lenore Mullin, Computer Science, University at Albany, State University of New York, Albany, NY 12309

Matrix Example

• In Fortran 90, an expression such as A = B + C + D is evaluated one operation at a time:
• First temporary computed: T1 = B + C
• Second temporary: T2 = T1 + D
• Last operation: A = T2

Matrix Example (cont)

• Intermediate temporaries consume memory and add to the processing operations.
• Solution: compose the index operations.
• Loop over i, j: A(i,j) = B(i,j) + C(i,j) + D(i,j)
• No temporaries. (See the sketch below.)
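
A minimal NumPy sketch of the contrast; the expression A = B + C + D and all array names here are illustrative assumptions:

    import numpy as np

    # Illustrative 2-D arrays (names and values are assumptions).
    B = np.arange(6.0).reshape(2, 3)
    C = np.ones((2, 3))
    D = np.full((2, 3), 2.0)

    # Fortran-90-style evaluation: each binary "+" materializes a temporary.
    T1 = B + C              # first temporary
    T2 = T1 + D             # second temporary
    A_temp = T2             # last operation: the assignment

    # Composed index operations: one pass over i, j, no intermediate arrays.
    A_fused = np.empty_like(B)
    for i in range(B.shape[0]):
        for j in range(B.shape[1]):
            A_fused[i, j] = B[i, j] + C[i, j] + D[i, j]

    assert np.array_equal(A_temp, A_fused)

In a real system this fusion is not written by hand: reducing the expression through the psi calculus yields the fused loop mechanically.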

Need for formalism

• Few problems are as simple as matrix addition.
• The formalism is designed to handle extremely complicated situations systematically.
• Goal: composition of algorithms.
– For example, radar is built from the composition of numerous algorithms: QR(FFT(X)).
– Optimizations are classically done one algorithm at a time, FFT (or DFT) then QR, even when parallel processors and nodes are used.
– Instead, optimization can be carried out across algorithms, processors, and memories.

MoA and Psi Calculus

Basic properties:
• An index calculus: the ψ function.
• Shape-polymorphic functions and operators: operations are defined using shapes and ψ.
• MoA defines a core set of useful operations and functions; as long as they are defined over shapes, any new function or operation may be defined and reduced.
• The fundamental type is the array: scalars are 0-dimensional arrays.
• Denotational Normal Form (DNF): the reduced form in Cartesian coordinates, independent of data layout (row major, column major, regular sparse, ...).
• Operational Normal Form (ONF): the reduced form for 1-d memory layout(s); it defines how to build the code on processor/memory hierarchies and reveals the loops and control.

Applications

• Levels of the processor/memory hierarchy can be modeled by increasing the dimensionality of the data array.
– An additional dimension for each level of the hierarchy.
– Envision the data as reshaped/transposed to reflect the mapping to the increased dimensionality.
– An index calculus automatically transforms the algorithm to reflect the restructured data array.
– Data, layout, data movement, and scalarization are generated automatically from the MoA descriptions and the Psi Calculus definitions of array operations, functions, and their compositions.
– Arrays may have any dimension, even 0, i.e. scalars. (A sketch follows below.)
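
A minimal sketch of "raising" dimensionality, with assumed sizes (a length-32 vector, cache blocks of 4 elements, 2 processors):

    import numpy as np

    # Illustrative sizes: a length-32 vector, cache blocks of 4 elements,
    # and 2 processors (all assumptions, not from the slide).
    n, cache, procs = 32, 4, 2
    x = np.arange(n)

    # One extra dimension models the cache level: 8 blocks of 4 elements.
    x_blocked = x.reshape(n // cache, cache)

    # A further dimension models distribution over processors:
    # 2 processors x 4 blocks x 4 elements.
    x_dist = x.reshape(procs, n // (procs * cache), cache)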

Processor/Memory Hierarchy (continued)

• Approach: Mathematics of Arrays.
• Math and indexing operations in the same expression.
• Framework for design space search:
– Rigorous and provably correct.
– Extensible to complex architectures.
• Example: "raising" array dimensionality for y = conv(x), which involves intricate math and intricate memory accesses (indexing).

[Figure: x is mapped both across the memory hierarchy (main memory, L2 cache, L1 cache) and across parallel processors P0, P1, P2.]

Manipulation of an array

• Given a 3 by 5 by 4 array A.
• Shape vector: <3 5 4>
• Index vector: <i j k>, with 0 ≤ i < 3, 0 ≤ j < 5, 0 ≤ k < 4
• Used to select a component: <i j k> ψ A
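
A NumPy illustration of this selection; the concrete index <2 0 3> is an assumed example:

    import numpy as np

    # A 3 x 5 x 4 array; the values (iota over 60 elements) are arbitrary.
    A = np.arange(3 * 5 * 4).reshape(3, 5, 4)

    shape_vector = A.shape         # (3, 5, 4)
    index_vector = (2, 0, 3)       # an assumed example index <2 0 3>

    # psi-style selection: a full index vector selects a single component.
    A[index_vector]                # the scalar at position <2 0 3>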

More Definitions

• Reverse: given a vector A of shape <n>.
• The reversal is given through indexing: (reverse A)[i] = A[n - 1 - i] for 0 ≤ i < n.
• Example: reverse <0 1 2 3> = <3 2 1 0>
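
A short NumPy sketch of the indexing-based definition:

    import numpy as np

    A = np.arange(6)                  # <0 1 2 3 4 5>
    n = A.shape[0]

    # Reversal through indexing: (reverse A)[i] = A[n - 1 - i].
    rev = A[n - 1 - np.arange(n)]     # <5 4 3 2 1 0>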

Some Psi Calculus Operations Built Using ψ & Shapes

• take (Vector A, int N): forms a vector of the first N elements of A.
• drop (Vector A, int N): forms a vector of the last (A.size - N) elements of A.
• rotate (Vector A, int N): forms a vector of the last N elements of A concatenated to the other elements of A.
• cat (Vector A, Vector B): forms a vector that is the concatenation of A and B.
• unaryOmega (Operation Op, dimension D, Array A): applies unary operator Op to the D-dimensional components of A (like a forall loop).
• binaryOmega (Operation Op, dimension Adim, Array A, dimension Bdim, Array B): applies binary operator Op to the Adim-dimensional components of A and the Bdim-dimensional components of B (like a forall loop).
• reshape (Vector A, Vector B): reshapes B into an array having A.size dimensions, where the length in each dimension is given by the corresponding element of A.
• iota (int N): forms a vector of size N containing the values 0 .. N-1.

(The operations fall into four categories: index permutation, operators, restructuring, and index generation.)
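
Hedged NumPy counterparts of several of these operations; these one-liners are sketches, while the MoA versions are shape-polymorphic and more general:

    import numpy as np

    # Sketch implementations; "rotate" uses np.roll, whose positive shift
    # matches the table (last N elements moved to the front).
    def take(A, N):    return A[:N]                    # first N elements of A
    def drop(A, N):    return A[N:]                    # last (A.size - N) elements
    def rotate(A, N):  return np.roll(A, N)            # cyclic shift
    def cat(A, B):     return np.concatenate((A, B))   # concatenation
    def iota(N):       return np.arange(N)             # <0 1 ... N-1>

    # reshape(A, B): assumes B supplies exactly prod(A) elements; the MoA
    # version is more general.
    def reshape(A, B): return np.asarray(B).reshape(tuple(A))

    v = iota(6)                    # <0 1 2 3 4 5>
    take(v, 2)                     # <0 1>
    drop(v, 2)                     # <2 3 4 5>
    rotate(v, 2)                   # <4 5 0 1 2 3>
    reshape([2, 3], v)             # 2 x 3 array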

New FFT algorithm: record speed

• Maximize in-cache operations through repeated transpose-reshape operations.
• Similar to partitioning for a parallel implementation.
• Do as many operations in cache as possible.
• Re-materialize the array to achieve locality.
• Continue processing in cache and repeat the process.

Example

• Assume cache size c = 4; input vector length n = 32; number of rows r = n/c = 8.
• Generate a vector of indices: iota(32) = <0 1 ... 31>
• Use the reshape operator to generate an r by c matrix: <8 4> reshape iota(32)
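
The same construction in NumPy, with the sizes taken from the slide:

    import numpy as np

    c, n = 4, 32            # cache size and input length from the slide
    r = n // c              # 8 rows

    v = np.arange(n)        # iota(32) = <0 1 ... 31>
    A = v.reshape(r, c)     # 8 x 4 matrix; each row fits in cache
    # row 0: [0 1 2 3], row 1: [4 5 6 7], ..., row 7: [28 29 30 31]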

Starting Matrix

• Each row is of length equal to the cache size c.
• The standard butterfly is applied to each row.

Next transpose

• To continue further would induce cache misses, so transpose and reshape.
• The transpose-reshape operation is composed over indices (only the result is materialized).
• The transpose of the 8 x 4 matrix is the 4 x 8 matrix whose rows are <0 4 ... 28>, <1 5 ... 29>, <2 6 ... 30>, <3 7 ... 31>; reshaped, it again forms 8 rows of 4.
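
A NumPy sketch of the composed transpose-reshape; in NumPy, A.T is a view (the index composition), and the reshape materializes only the final result:

    import numpy as np

    A = np.arange(32).reshape(8, 4)    # the starting 8 x 4 matrix

    # Composed transpose-reshape: A.T is a view (indices composed, nothing
    # copied); the reshape then materializes only the final 8 x 4 result.
    B = A.T.reshape(8, 4)
    # Row 0 of B is [0 4 8 12]: elements that were a column apart in A are
    # now contiguous, so the next butterfly stage runs in cache.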

Resulting Transpose-Reshape

• Materialize the transpose-reshaped array B.
• Carry out the butterfly operation on each row.
• The weights are re-ordered.
• Access patterns are standard.

Transpose-Reshape again

• As before, proceeding further would induce cache misses, so:
• Do the transpose-reshape again (composing indices).
• Only the result of the composed transpose is materialized.

Last step (in this example)

• Materialize the composed transpose-reshaped array C.
• Carry out the last step of the FFT.
• This last step corresponds to cycles of length 2 involving elements 0 and 16, 1 and 17, etc.

Final Transpose

• The data has been permuted numerous times by the multiple reshape-transposes.
• We could reverse the transformations, but that would take multiple steps and multiple writes.
• Viewing the problem as an n-cube (a hypercube for radix 2) lets us use the number of reshape-transposes as the rotation (shift) amount applied to a vector generated from the dimension of the hypercube; the rotated vector is then the argument to a binary transpose, which permutes everything at once. (See the sketch below.)
• Expressed algebraically, ψ-reduce to the DNF and then to the ONF for a generic design.
• The ONF has only two loops, no matter what dimension hypercube (n-cube for radix n) we start with.
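
A sketch of the single final permutation, assuming n = 32, radix 2, and an illustrative count k of reshape-transposes; the sign of the shift depends on the convention used:

    import numpy as np

    n, radix = 32, 2
    dims = 5                              # n = radix**dims, a 5-cube for n = 32
    k = 3                                 # assumed count of reshape-transposes

    x = np.arange(n)                      # stands in for the permuted FFT data

    # Vector generated from the dimension of the hypercube, rotated by k,
    # then used as the axis permutation of a single transpose. (Whether the
    # shift is +k or -k depends on the transpose-reshape convention.)
    axes = np.roll(np.arange(dims), k)
    restored = x.reshape([radix] * dims).transpose(axes).ravel()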

Summary

• All operations have been carried out in cache, at the price of re-arranging the data.
• Data blocks can be of any size (powers of the radix): they need not equal the cache size.
• Optimum performance is a tradeoff between the reduction of cache misses and the cost of the transpose-reshape operations.
• The number of transpose-reshape operations is determined by the data block size (cache size).
• Record performance: up to a factor of 4 better than libraries.

ScienceDirect 25 Hottest Articles

Book under review at Springer

New paper in J. Comp. Phys.