BLIS optimized for EPYC™ Processors

BLIS optimized for EPYC™ Processors © 2017 Advanced Micro Devices, Inc. All rights reserved.

AMD EPYC™ Processors
The AMD EPYC™ SoC contains AMD x86 cores codenamed “Zen”. A single EPYC™ package can have up to 32 Zen cores.

CONTENTS
Small matrix GEMM optimizations and their impact on the Caffe benchmark.
Efficiency of GEMM on the AMD EPYC™ processor.
Optimizations of TRSM, AMAX, GEMV, DOT, ...
Impact of optimizations on QR, LU and Cholesky factorizations.

AMD EPYC™ Processor – Hardware configuration
OS: Ubuntu
Hard disk: 1 TB
CPU name: AMD EPYC 7601 32-Core Processor
CPU type: AMD K17 (Zen) architecture
Sockets: 2
Cores per socket: 32
Threads per core: 2
L3 cache: 8 MB
L2 cache: 512 KB
L1 cache: 32 KB
CPU frequency: 3.2 GHz
RAM: 256 GB DDR4 @ 1.2 GHz
*PC manufacturers may vary configurations yielding different results.

Small matrix optimizations and their applications in machine learning

SMALL MATRIX PERFORMANCE AND MACHINE LEARNING
Machine learning frameworks such as Caffe spend a significant* amount of time in 3D convolution layers (expressed as GEMM) and fully connected layers (expressed as GEMV).
GEMM is typically called on small matrices.
Problem size is small (x1, x2, …, xn).
Example: search for faces at different resolutions and locations.
* https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
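To make "convolution expressed as GEMM" concrete, here is a minimal sketch of the usual im2col lowering, reduced to a single-channel 2D image for brevity. The function names (`im2col`, `gemm`, `conv2d_as_gemm`) are illustrative only and are not taken from Caffe or BLIS. Note how small the resulting GEMM operands are.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2D image into one column."""
    h, w = len(image), len(image[0])
    out_h, out_w = h - kh + 1, w - kw + 1
    patches = []
    for i in range(out_h):
        for j in range(out_w):
            patches.append([image[i + di][j + dj]
                            for di in range(kh) for dj in range(kw)])
    # patches is (out_h*out_w) x (kh*kw); transpose so each patch is a column.
    return [list(row) for row in zip(*patches)]


def gemm(a, b):
    """Naive C = A * B for matrices stored as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def conv2d_as_gemm(image, kernels):
    """Apply each kh x kw filter; returns one flattened output map per filter."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    a = [[v for row in kern for v in row] for kern in kernels]  # filters x (kh*kw)
    b = im2col(image, kh, kw)                                   # (kh*kw) x outputs
    return gemm(a, b)


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernels = [[[1, 0],
            [0, 1]]]  # adds the two main-diagonal entries of each 2x2 patch
result = conv2d_as_gemm(image, kernels)
print(result)  # [[6, 8, 12, 14]]
```

Here the GEMM is only 1 x 4 times 4 x 4, which is the kind of problem size the slide refers to.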

PERFORMANCE ANALYSIS FOR SMALL MATRIX
Impact of data packing
Data packing in BLIS enables efficient cache utilization during the GEMM operation.
Packing benefits large matrices but introduces overhead for small matrices: the packed data is re-accessed too few times to cover the packing expense.
Packing data to avoid TLB misses is not necessary for small matrices that easily fit in the L2 cache.
Chart source: https://github.com/flame/blis/wiki/Multithreading
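As a rough illustration of what packing involves, the sketch below copies a block of A into a contiguous buffer of micro-panels. The function name `pack_a_block` and the panel height `mr` are hypothetical, and the layout is a simplified version of the general idea, not the actual BLIS packing routine.

```python
def pack_a_block(a, row0, col0, mc, kc, mr):
    """Copy an mc x kc block of `a` into a flat buffer of mr-row micro-panels.

    Within each micro-panel the elements are stored column by column, so an
    inner kernel can read them with unit stride. Rows past the edge of the
    block are zero-padded.
    """
    packed = []
    for panel_start in range(0, mc, mr):
        for p in range(kc):
            for i in range(panel_start, panel_start + mr):
                packed.append(a[row0 + i][col0 + p] if i < mc else 0.0)
    return packed


# 4 x 4 matrix whose entry at (r, c) is 10*r + c, so the layout is easy to read.
a = [[float(10 * r + c) for c in range(4)] for r in range(4)]
packed = pack_a_block(a, row0=0, col0=1, mc=3, kc=2, mr=2)
print(packed)  # [1.0, 11.0, 2.0, 12.0, 21.0, 0.0, 22.0, 0.0]
```

Every element of the block is copied once before any arithmetic happens. For a large GEMM that copy is amortized over many re-uses; for a small one it is not, which is the overhead the slide describes.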

PROPOSED APPROACH / SOLUTION
Utilize cache through re-use without explicit packing.
Applicable to matrix sizes that are small enough to fit in caches.
Figure: Matrix C = Matrix A x Matrix B
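One way to picture the proposed approach is a dispatcher that skips packing when the operands are small. The sketch below is an assumption-laden illustration: the slides do not state the actual size criterion, so `fits_in_cache` (all three double-precision matrices fit in the 512 KB L2 listed on the hardware slide) and `choose_path` are hypothetical.

```python
L2_BYTES = 512 * 1024  # L2 size from the hardware configuration slide
ELEM_BYTES = 8         # double precision


def fits_in_cache(m, n, k, budget=L2_BYTES):
    """Hypothetical test: do A (m x k), B (k x n) and C (m x n) fit together?"""
    return (m * k + k * n + m * n) * ELEM_BYTES <= budget


def gemm_unpacked(a, b, c):
    """C += A * B, reading A and B in place with no packing buffers."""
    m, k, n = len(a), len(b), len(b[0])
    for i in range(m):
        for p in range(k):
            a_ip = a[i][p]  # re-used across a whole row of B
            for j in range(n):
                c[i][j] += a_ip * b[p][j]
    return c


def choose_path(m, n, k):
    """Pick the unpacked kernel for small problems, the packed one otherwise."""
    return "unpacked" if fits_in_cache(m, n, k) else "packed"


c = gemm_unpacked([[1, 2], [3, 4]], [[5, 6], [7, 8]], [[0, 0], [0, 0]])
print(c)                              # [[19, 22], [43, 50]]
print(choose_path(64, 64, 64))        # unpacked
print(choose_path(1000, 1000, 1000))  # packed
```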

SMALL MATRIX GEMM OPTIMIZATIONS

CAFFE MNIST HANDWRITTEN DIGIT RECOGNITION
Up to 18% improvement
Source: http://yann.lecun.com/exdb/lenet/

Efficiency of GEMM in AMD EPYC™ Processor

GEMM TUNED FOR AMD EPYC™ PROCESSOR
DGEMM efficiency reaches 97% on the AMD EPYC™ processor. The GEMM loop parameters are tuned for higher efficiency.
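The "loop parameters" are the block sizes of the cache-blocking loops around the inner kernel. The sketch below shows where such parameters sit in a blocked GEMM. The names `mc`, `kc`, `nc` and their toy default values are assumptions for illustration and are not the values tuned for EPYC. The efficiency helper assumes the usual definition (achieved rate divided by theoretical peak), which the slides do not spell out.

```python
def gemm_blocked(a, b, c, mc=2, kc=2, nc=2):
    """C += A * B with three cache-blocking loops around a simple inner kernel."""
    m, k, n = len(a), len(b), len(b[0])
    for jc in range(0, n, nc):          # block of columns of B and C
        for pc in range(0, k, kc):      # block of the shared dimension
            for ic in range(0, m, mc):  # block of rows of A and C
                for i in range(ic, min(ic + mc, m)):
                    for p in range(pc, min(pc + kc, k)):
                        a_ip = a[i][p]
                        for j in range(jc, min(jc + nc, n)):
                            c[i][j] += a_ip * b[p][j]
    return c


def gemm_gflops(m, n, k, seconds):
    """GEMM performs 2*m*n*k floating-point operations."""
    return 2.0 * m * n * k / seconds / 1e9


def efficiency(achieved_gflops, peak_gflops):
    """Fraction of the theoretical peak actually achieved (assumed definition)."""
    return achieved_gflops / peak_gflops


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
b = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
c = gemm_blocked(a, b, [[0.0] * 3 for _ in range(3)])
print(c)
```

Changing `mc`, `kc` and `nc` does not change the result, only the order in which memory is touched. That is why these values can be tuned per processor.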

TRSM and other BLIS API optimizations

GEMMTRSM fused kernel implementation
Up to 20% improvement
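The idea behind fusing GEMM and TRSM can be sketched for a lower-triangular solve L X = B. Each block of rows first subtracts the contribution of the rows already solved (the GEMM part), then forward-substitutes within the diagonal block (the TRSM part). The code below is a simplified sketch with hypothetical names (`gemmtrsm_step`, `trsm_lower`), not the BLIS kernel itself.

```python
def gemmtrsm_step(l, x, b, r0, r1):
    """One fused step for rows r0..r1-1 of L X = B (L lower triangular).

    Both parts run back to back on the same rows of B, so those rows are
    touched once instead of once per operation.
    """
    ncols = len(b[0])
    for i in range(r0, r1):
        for j in range(ncols):
            acc = b[i][j]
            for p in range(r0):      # GEMM part: columns left of the block
                acc -= l[i][p] * x[p][j]
            for p in range(r0, i):   # TRSM part: inside the diagonal block
                acc -= l[i][p] * x[p][j]
            x[i][j] = acc / l[i][i]


def trsm_lower(l, b, block=2):
    """Solve L X = B by sweeping diagonal blocks from top to bottom."""
    m = len(l)
    x = [[0.0] * len(b[0]) for _ in range(m)]
    for r0 in range(0, m, block):
        gemmtrsm_step(l, x, b, r0, min(r0 + block, m))
    return x


l = [[2.0, 0.0, 0.0],
     [1.0, 1.0, 0.0],
     [3.0, 2.0, 4.0]]
b = [[2.0], [3.0], [19.0]]
x = trsm_lower(l, b)
print(x)  # [[1.0], [2.0], [3.0]]
```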

GEMV and level 1F kernel optimizations
Up to 80% improvement
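Level 1F kernels fuse several level-1 operations into one pass. A minimal sketch of how GEMV can be built from a fused axpy kernel follows. The fusing factor `fuse` and the list-of-rows matrix layout are assumptions made for illustration, and this is not the BLIS implementation.

```python
def axpyf(alpha, a, x, y, col0, fuse):
    """Fused axpy: y += alpha * sum_j a[:, j] * x[j] over `fuse` columns.

    Each element of y is updated once per group of columns rather than once
    per column, which is the benefit of fusing the level-1 operations.
    """
    ncols = min(fuse, len(x) - col0)
    for i in range(len(y)):
        acc = 0.0
        for j in range(col0, col0 + ncols):
            acc += a[i][j] * x[j]
        y[i] += alpha * acc


def gemv(alpha, a, x, beta, y, fuse=4):
    """y = beta * y + alpha * A * x, built from axpyf calls."""
    for i in range(len(y)):
        y[i] *= beta
    for col0 in range(0, len(x), fuse):
        axpyf(alpha, a, x, y, col0, fuse)
    return y


a = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
y = gemv(2.0, a, [1.0, 1.0, 1.0], 3.0, [1.0, 1.0], fuse=2)
print(y)  # [15.0, 33.0]
```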

Level 1 BLIS API OPTIMIZATIONS
Up to 4x improvement

Matrix factorizations performance improvement in libflame

Performance improvement in QR
Up to 70% improvement

Performance improvement in Cholesky
Up to 12% improvement

Performance improvement in LU
Up to 16% improvement

References & Source Code
http://developer.amd.com/amd-cpu-foundation-libraries-epyctm-processors/
http://developer.amd.com/amd-secure-random-number-generator-library/
http://developer.amd.com/amd-cpu-libraries/
https://github.com/amd/blis
https://github.com/amd/libflame
http://developer.amd.com/accelerating-machine-learning-frameworks-using-blis/

Questions?

DISCLAIMER © 2017 Advanced Micro Devices, Inc. All rights reserved. The information contained herein is for informational purposes only, and is subject to change without notice. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of non infringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD’s products are as set forth in a signed agreement between the parties or in AMD’s Standard Terms and Conditions of Sale. AMD, the AMD Arrow logo, EPYC and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. © 2017 Advanced Micro Devices, Inc. All rights reserved.

EndNotes
Slides 9, 10, 12, 14, 15, 16, 18, 19 and 20: Testing was conducted as of 11th August 2017 on the test system comprising AMD EPYC™ 7601, 64 cores clocked at 3.2 GHz. The machine had DDR4 RAM of 256 GB, clocked at 1.2 GHz, and 1 TB of hard disk. The test system had the Ubuntu operating system installed. PC manufacturers may vary configurations yielding different results.
Endnote codes: Slide 9: TE-08; Slide 10: TE-09; Slide 12: TE-10; Slide 14: TE-11; Slide 15: TE-12; Slide 16: TE-13; Slide 18: TE-14; Slide 19: TE-15; Slide 20: TE-16.