
1

2 Upon completion of this module, you will be able to:  Define the purpose of MKL  Identify and discuss MKL contents  Describe the MKL environment  Discuss MKL and LAPACK  Describe VML, its features and use

3  Introduction  The Library Sections  Performance Features  Using the Library

4 MKL addresses:  Solvers (BLAS, LAPACK)  Eigenvector/eigenvalue solvers (BLAS, LAPACK)  Some quantum chemistry needs (dgemm)  PDEs, signal processing, seismic, solid-state physics (FFTs)  General scientific, financial (vector transcendental functions (VML) and vector random number generators (VSL))

5 Don't use Intel® Math Kernel Library (Intel® MKL) on "small" counts – don't call vector math functions on small n  For tasks such as software construction and geometric transformation, you could use Intel® Integrated Performance Primitives (Intel® IPP) instead

6  BLAS (Basic Linear Algebra Subroutines)  Level 1 BLAS – vector-vector operations  15 function types  48 functions  Level 2 BLAS – matrix-vector operations  26 function types  66 functions  Level 3 BLAS – matrix-matrix operations  9 function types  30 functions  Extended BLAS – level 1 BLAS for sparse vectors  8 function types  24 functions

7  LAPACK (linear algebra package)  Solvers and eigensolvers – more than 1,000 user-callable and support routines in total  Discrete Fourier Transforms (DFT)  Mixed-radix, multi-dimensional transforms  Multithreaded  VML (Vector Math Library)  Set of vectorized transcendental functions  Most of the libm functions, but faster  VSL (Vector Statistics Library)  Set of vectorized random number generators

8  BLAS and LAPACK* are both Fortran  Legacy of high-performance computation  VSL and VML have Fortran and C interfaces  DFTs have Fortran 95 and C interfaces  The cblas interface makes it more convenient for a C/C++ programmer to call BLAS

9  Supports 32-bit and 64-bit Intel® processors  Large set of examples and tests  Extensive documentation

10 The goal of all optimization is maximum speed. Resource-limited optimization – exhaust one or more resources of the system:  CPU: register use, FP units  Cache: keep data in cache as long as possible; deal with cache interleaving  TLBs: maximally use the data on each page  Memory bandwidth: minimally access memory  Computer: use all the processors available (threading)  System: use all the nodes available (cluster software)

11  Most of Intel MKL could be threaded, but:  The limiting resource is memory bandwidth  Threading level 1 and level 2 BLAS is mostly ineffective (their O(n) and O(n²) operations are bandwidth-bound)  There are numerous opportunities for threading:  Level 3 BLAS (O(n³))  LAPACK* (O(n³))  FFTs (O(n log n))  VML, VSL – depends on processor and function  All threading is via OpenMP*  All of Intel MKL is designed and compiled for thread safety

12 Scenario 1: ifort, BLAS, IA-32 processor: ifort myprog.f mkl_c.lib  Scenario 2: CVF, LAPACK, IA-32 processor: f77 myprog.f mkl_s.lib  Scenario 3: Link a C program against the DLL import library (the DLL is loaded at run time): link myprog.obj mkl_c_dll.lib  Note: Optimal binary code will execute at run time based on the processor.

13

14

15  Most important LAPACK optimizations:  Threading – effectively uses multiple CPUs  Recursive factorization  Reduces scalar time (Amdahl's law: t = t_scalar + t_parallel/p)  Extends blocking further into the code  No runtime library support required

16  One-dimensional, two-dimensional, three-dimensional  Multithreaded  Mixed radix  User-specified scaling, transform sign  Transforms on embedded matrices  Multiple one-dimensional transforms in a single call  Strides  C and F90 interfaces

17  Basically a three-step process  Create a descriptor  Status = DftiCreateDescriptor(MDH, …)  Commit the descriptor (instantiates it)  Status = DftiCommitDescriptor(MDH)  Perform the transform  Status = DftiComputeForward(MDH, X)  Optionally free the descriptor

18  Vector Math Library: vectorized transcendental functions – like libm, but better (faster)  Interface: both Fortran and C interfaces  Multiple accuracies  High accuracy (< 1 ulp)  Lower accuracy, faster (< 4 ulps)  Special value handling: √(−a), sin(0), and so on  Error handling – cannot duplicate libm behavior here

19  Important for financial codes (Monte Carlo simulations)  Exponentials, logarithms  Other scientific codes depend on transcendental functions  Error functions can be big time sinks in some codes

20 Vector Statistical Library (VSL)  Set of random number generators (RNGs)  Numerous non-uniform distributions  VML used extensively for transformations  Parallel computation support – some functions  User can supply own BRNG or transformations  Five basic RNGs (BRNGs) – bits, integer, FP: MCG31, R250, MRG32, MCG59, WH

21 Non-Uniform RNGs Gaussian (two methods) Exponential Laplace Weibull Cauchy Rayleigh Lognormal Gumbel

22 Using VSL: basically a three-step process  Create a stream: VSLStreamStatePtr stream; vslNewStream(&stream, VSL_BRNG_MCG31, seed);  Generate a set of RNGs: vsRngUniform(0, stream, size, out, start, end);  Delete the stream (optional): vslDeleteStream(&stream);

23 Activity: Calculating pi using a Monte Carlo method  Compare the performance of C source code (rand() function) and VSL  Exercise control of the threading capabilities in MKL/VSL

24 Performance Libraries: What’s Been Covered Intel® Math Kernel Library is a broad scientific/engineering math library. It is optimized for Intel® processors. It is threaded for effective use on SMP machines.