www.clearspeed.com
Wolfram Technology Conference
ENVISION. ACCELERATE. ARRIVE.
Copyright © 2006 ClearSpeed Technology plc. All rights reserved.
12th October 2007


Real Acceleration for Mathematica®
Simon McIntosh-Smith
VP of Applications, ClearSpeed Technology

Agenda
– Introduction
– Accelerators
– ClearSpeed math acceleration technology
– Accelerating Mathematica
– Summary

Introduction

Introduction
Mathematica® is being used to solve more and more computationally intensive problems.
General-purpose CPUs keep getting faster, but a new wave of application accelerators is emerging that could deliver much greater performance
– Much as GPUs have done for graphics.
ClearSpeed has been developing hardware accelerators focused specifically on scientific computing, which accelerate the low-level math libraries used by Mathematica.

Accelerators

Accelerator technologies
Visualization and media processing
– Good for graphics, video, game physics, speech, …
– Graphics Processing Units (GPUs) are well established in the mainstream
– But there was a time not too long ago when your PC still did all its graphics in software on the main CPU…
– Can be applied to some 32-bit applications today (64-bit is coming, at much lower speed), but currently GPUs are fairly hard to program and very power hungry: around 200W!
Embedded content processing
– Data mining, encryption, XML, compression
– Field Programmable Gate Arrays (FPGAs) are often used here, mainly to accelerate integer-intensive codes
– Poor at floating point, especially 64-bit, and they cut corners on precision, so accuracy suffers
– Very hard to program and to get good performance from

Accelerator technologies, continued
Math accelerators
– Mostly floating point; 64-bit performance is crucial; high precision, supporting true IEEE 754 floating point
– Can accelerate numerically intensive applications in finance, oil and gas, economics, electromagnetics, bioinformatics, and many, many more
– This is what ClearSpeed has developed
To accelerate Mathematica, a true math accelerator is needed…

The other benefit of accelerators: low power
Running 1 watt for 1 year costs about $1.
Modern CPUs can consume around 100W
– $100/year running cost for the CPU alone if used 24/7
– Significant associated CO2 emissions
Accelerators typically bring significant performance-per-watt gains
– Examples later in this presentation show 1 CPU plus a 25W ClearSpeed board running as fast as a 4-CPU (8-core) machine
– That power reduction of around 275W, applied 24/7, is a $275/year energy cost saving
– Not to mention how much smaller and quieter the accelerated system can be…
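The "$1 per watt-year" rule of thumb above can be checked in a couple of lines; note the electricity price of $0.115/kWh is an assumed figure, not from the slides:

```mathematica
(* Sanity check of the "1 watt for 1 year costs about $1" rule of thumb.
   The electricity price ($0.115/kWh) is an assumption for illustration. *)
kWhPerYear = 1*24*365/1000.;           (* 1 W running 24/7 = 8.76 kWh/year *)
costPerWattYear = kWhPerYear*0.115     (* ≈ $1.01 per watt-year *)

(* The ~275 W saving quoted above then scales linearly: *)
275*costPerWattYear                    (* ≈ $277/year *)
```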

ClearSpeed’s Math Acceleration Technology

What are ClearSpeed’s products?
Math accelerator boards: ClearSpeed Advance™ e620 & X620
– Dual ClearSpeed CSX600 coprocessors
– R∞ ≈ 66 GFLOPS for 64-bit matrix multiply (DGEMM) calls; the hardware also supports 32-bit floating point
– PCI Express x8 and 133 MHz PCI-X 2/3rds support
– 1 GByte of memory on the board
– Linux drivers today for Red Hat and SUSE
– Low power: 25 to 33 watts
Significantly accelerates the low-level math library used by Mathematica (MKL)
– Target functions: Level 3 BLAS and LAPACK

Which MKL functions can ClearSpeed accelerate?
Previous release (CSXL 2.51 and before):
– L3 BLAS: large matrix arithmetic (preferably at least 1,000 on a side): DGEMM (real matrix multiply)
– LAPACK: factorize and solve for large systems of linear equations: LU (DGETRF)
New release (CSXL 2.52):
– L3 BLAS: ZGEMM (complex matrix multiply), DTRSM (triangular solve); future release: DTRMM, DSYRK and others
– LAPACK: LU (DGETRS), QR (DGEQRF, DORGQR & DORMQR), Cholesky (DPOTRF & DPOTRS); future release: complex versions of the above
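As a rough guide (an illustration, not from the deck), these library routines sit underneath Mathematica calls such as the following; the exact dispatch depends on the Mathematica version:

```mathematica
n = 2000;
A = Table[Random[], {n}, {n}];
b = Table[Random[], {n}];

Dot[A, A];                    (* DGEMM: real matrix multiply *)
LinearSolve[A, b];            (* DGETRF/DGETRS: LU factorize and solve *)
QRDecomposition[A];           (* DGEQRF and friends: QR factorization *)
B = Dot[Transpose[A], A];     (* symmetric positive definite by construction *)
CholeskyDecomposition[B];     (* DPOTRF: Cholesky factorization *)
```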

Software development kit (SDK)
– C compiler with vector extensions (ANSI C based commercial compiler), assembler, libraries, ddd/gdb-based debugger, newlib-based C run-time library, etc.
– ClearSpeed Advance development boards
– Available for Linux and Windows

Accelerating Mathematica

Mathematica uses libraries underneath
(Stack diagram.) Software: Mathematica on top of the BLAS & LAPACK library (Intel’s MKL). Hardware: the CPU.

Mathematica using accelerated libraries
(Stack diagram.) Software: Mathematica on top of Intel’s MKL (BLAS & LAPACK) plus ClearSpeed’s CSXL library. Hardware: the CPU plus the ClearSpeed Advance™ board.

Plug-and-play: no changes to your notebooks
Mathematica has used MKL since v5.2.
ClearSpeed provides a modified kernel
– Uses a modified “math” script that launches the kernel
– Sets the library path to pick up CSXL as well as MKL
Functions supported in Mathematica today include:
– Dot[]
– Det[]
– LUDecomposition[]
– LinearSolve[]
– Inverse[]
– CholeskyDecomposition[] (new!)
– QRDecomposition[] (new!)
If your notebooks spend a high percentage of their total runtime in these functions, and a lot of time in each call, then you may have a candidate for ClearSpeed acceleration!
It is very likely that other functions are also accelerated
– If you find more, let us know!
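One quick way (our suggestion, not from the slides) to gauge whether a notebook is a candidate is to time its dense linear-algebra calls at realistic sizes; acceleration pays off when individual calls on large matrices dominate the runtime:

```mathematica
(* Time representative large calls; if these dominate your notebook's
   total runtime, it is a candidate for acceleration. *)
n = 4000;
A = Table[Random[], {n}, {n}];
b = Table[Random[], {n}];
First[AbsoluteTiming[Dot[A, A];]]          (* seconds in one large multiply *)
First[AbsoluteTiming[LinearSolve[A, b];]]  (* seconds in one large solve *)
```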

What kind of notebooks could be accelerated?
ClearSpeed has been collaborating with ScienceOps to discover what kinds of problems are accelerated.
Early results show a good breadth of applications being accelerated
– Performance improvements
– Ability to run larger problem sets
Initial results show speedups ranging from 2–5X.

Example notebooks
Benchmarked on a fast server for comparison:
– 4 processors, each dual core (8 cores total), AMD Opteron 870 (2 GHz) with 32 GBytes of memory, running Linux RHEL4-64
Comparisons are between:
– Using 2 Opteron cores on their own
– Using all 8 Opteron cores on their own, and
– Using 2 Opteron cores with a single ClearSpeed Advance accelerator board
We haven’t yet re-benchmarked these notebooks on our latest release or on the new PCI Express version of our board, both of which should increase performance.

Example notebook descriptions
ANOVA
– Analysis of variance: a linear least-squares minimisation, fitting a curve to sampled data
Microarray
– Microarray data analysis; determines coexpression networks (sets of genes that are commonly expressed together under different experimental conditions); calculates distance metrics
ImageDecode
– Progressive decoding of images using the Haar wavelet transform; grayscale images are used in this example
Spatial Auto Regression (SAR)
– Simple regressions iterating on large, dense matrices

Example: ANOVA
The ANOVA notebook benefits from a 2X speedup with 4,000 predictors.
Two cores with a ClearSpeed accelerator are equivalent in performance to an eight-core machine!
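The ANOVA notebook itself is not reproduced in the deck; the following is a minimal sketch, with made-up dimensions, of the dense least-squares kernel such a fit leans on. Solving the normal equations turns into large matrix multiplies plus a factorize-and-solve, exactly the calls CSXL accelerates:

```mathematica
m = 8000; n = 4000;               (* observations, predictors: illustrative *)
X = Table[Random[], {m}, {n}];    (* design matrix *)
y = Table[Random[], {m}];         (* responses *)
(* Least-squares fit via the normal equations: (X'X) beta = X'y *)
beta = LinearSolve[Dot[Transpose[X], X], Dot[Transpose[X], y]];
```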

Example: Microarray
The Microarray notebook benefits from nearly a 3X speedup with 4,000 inputs.
Larger problems may see even more speedup
– Data sets with over 6,000 expression levels exist for yeast.

Example: ImageDecode
The ImageDecode notebook’s speedup ranges from 2–3X depending on the image size.
Once tuned, this speedup should also be achieved for images around 960×960 in size (already around 1.6X).

Example: Spatial Auto Regression
The SAR notebook’s speedup is nearly 2X.
Larger problems should see even more speedup
– The run-times are quite substantial, too.

New CholeskyDecomposition[] performance

A = Table[Random[], {n}, {n}];
B = Dot[Transpose[A], A];  (* makes B symmetric positive definite *)
Clear[A];
AbsoluteTiming[CholeskyDecomposition[B];]

New QRDecomposition[] performance

A = Table[Random[], {n}, {n}];
B = Dot[Transpose[A], A];
Clear[A];
AbsoluteTiming[QRDecomposition[B];]

New complex Dot[] performance

A = Table[Complex[1.5, 1.5], {n}, {n}];
AbsoluteTiming[Dot[Transpose[A], A];]

The challenge
Mathematica does a great job of choosing the right method for the right problem…
… which makes it hard to know which method is going to be used, and when!
Consequently it’s proving very difficult to know in advance what is going to be accelerated, and by how much.
Call to action:
– Can you think of any applications that should be significantly accelerated by ClearSpeed?

Summary

Summary
– Accelerators can significantly increase performance and performance per watt across a range of interesting applications in Mathematica.
– You need a real 64-bit math accelerator for Mathematica to deliver the precision you depend upon.
– ClearSpeed can accelerate notebooks making intensive use of Dot[], Det[], LUDecomposition[], LinearSolve[], Inverse[], CholeskyDecomposition[] and QRDecomposition[]; more in the future as the libraries are developed.
– Plug-and-play: no changes to your notebooks.
What could you do if you added 66 GFLOPS of matrix-crunching power to your Mathematica performance?