Scott Michael Indiana University July 6, 2017

Performance Benchmarking of the R Programming Environment on Knights Landing

Who am I?
  Theoretical astrophysicist, NOT a statistician
  HPC application optimization and performance tuning
  Lead the Research Analytics team in Research Technologies at Indiana University

Contributors
  IU: Eric Wernert, Jefferson Davis, James McCombs, Esen Tuna
  TACC: Bill Barth, Tommy Minyard, David Walling

Talk Overview
  Targeting productivity languages for Xeon Phi based architectures:
    Motivation and history
    Benchmark results and lessons learned
    The RHPCBenchmark package
    Future directions
    Conclusions

IU, The Stampede Supercomputer, and Xeon Phi
  IU Research Technologies has a partnership with TACC, collaborating on systems and support
    Stampede: largest XSEDE machine by core count
    Wrangler: data intensive computing and 20 PB out-of-region replication
    Jetstream: XSEDE production science cloud
  IU supports data intensive and “high productivity” languages on Stampede, including R, Python, and Matlab
  Large transition between Stampede 1 and 2

Evolution of Xeon Phi
  Knights Corner                  Knights Landing
  Coprocessor only                Coprocessor or self-hosted
  1 TF peak (DP)                  3 TF peak (DP)
  8GB device + system memory      16GB MCDRAM + system memory

R Support on Stampede 1 & 2
  Primary support on Stampede 1 for R
    Support several methods for distributed R (pbdR, Rmpi, snow, etc.)
    R built in offload mode
    Configured R to use GPUs in a portion of Stampede via HiPLAR
  However, much of the R workload on Stampede didn’t rely on KNC
  Stampede 1
    Nodes: 6,400
    Interconnect: FDR IB
    Filesystem: 14 PB Lustre
    Node configuration
      Processor: Dual E “SandyBridge”
      Phi: SE10P
      Memory: 32GB DDR3, 8GB GDDR5
  Stampede 2
    Nodes: 4,200
    Interconnect: OmniPath v1
    Node configuration
      Processor: Phi 7250
      Memory: 16GB MCDRAM

R Performance on KNL
  KNL is the sole processor on Stampede 2
  Has shown good performance for large scale HPC codes (MD, climate, astro, etc.)
  How does KNL perform with a language like R?

KNL Architecture
  Intel(R) Xeon Phi(TM) CPU 1.60GHz (68 physical cores)
  Features of note for KNL
    Tiled architecture supporting 4 SMT threads per physical core

KNL Architecture (cont.)
  Features of note for KNL
    16GB of on-chip MCDRAM acts as fast memory and can be configured into several modes

Benchmarking Strategy
  Look at industry standard performance benchmarks for R on KNL and compare to Sandy Bridge (SNB)
  Further explore some exemplar workflows in each language and compare to benchmark results
  Compare both single node and multinode benchmarks

Benchmarking Strategy
  R standard benchmark: R-25 benchmark
    Very old, fixed (small) problem sizes, report output challenging to parse
    Reasonable mix of mini-kernels focused on dense matrix operations and linear solvers
  R benchmark for scalability, focused on similar kernels to R-25
    Built to distribute and for flexibility, currently available on CRAN as RHPCBenchmark
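The contrast between fixed small problem sizes and a scalable benchmark can be sketched as a timing harness that sweeps the matrix dimension. This is a minimal illustration in pure Python under stated assumptions, not the RHPCBenchmark implementation: the function names (`matmul`, `time_kernel`, `sweep`) are hypothetical, and the naive kernel stands in for the MKL-backed R kernels the talk measures.

```python
import random
import statistics
import time


def matmul(a, b):
    """Naive dense matrix-matrix product on lists of lists (stand-in kernel)."""
    b_cols = list(zip(*b))  # transpose b once so columns can be iterated directly
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def random_matrix(n, rng):
    """Build an n x n matrix of uniform random values."""
    return [[rng.random() for _ in range(n)] for _ in range(n)]


def time_kernel(kernel, n, trials=3, seed=0):
    """Return the wall-clock time of each trial of kernel on n x n inputs."""
    rng = random.Random(seed)
    a, b = random_matrix(n, rng), random_matrix(n, rng)
    times = []
    for _ in range(trials):
        start = time.perf_counter()
        kernel(a, b)
        times.append(time.perf_counter() - start)
    return times


def sweep(kernel, sizes, trials=3):
    """Time the kernel at each problem size and keep the median per size."""
    return {n: statistics.median(time_kernel(kernel, n, trials)) for n in sizes}


if __name__ == "__main__":
    for n, seconds in sweep(matmul, [20, 40, 80]).items():
        print(f"n={n:4d}  median={seconds:.6f} s")
```

Sweeping the size rather than fixing it is what lets a benchmark show where a many-core processor starts to pay off.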

R Benchmark Results
  Generally R lacks multithreading (some exceptions include mclapply), so we rely on the threading in MKL
  Standard profiling/tracing tools are challenging to employ
    Instrumenting the entire R interpreter creates too much overhead
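Because the threading lives in MKL rather than in R, a thread-scaling study comes down to rerunning the same script under different thread settings. The sketch below only builds such a run plan and does not launch anything. `MKL_NUM_THREADS` and `OMP_NUM_THREADS` are standard environment variables, but the script name `bench.R`, the helper names, and the choice of thread counts are assumptions for illustration, not the setup used in the talk.

```python
import os


def thread_counts(cores, smt_per_core):
    """Powers of two below the core count, then 1..smt_per_core threads per core."""
    counts = []
    t = 1
    while t < cores:
        counts.append(t)
        t *= 2
    counts.extend(cores * k for k in range(1, smt_per_core + 1))
    return counts


def run_plan(script, cores=68, smt_per_core=4):
    """Return (command, environment) pairs, one per thread count.

    The script name is hypothetical; nothing is executed here.
    """
    plan = []
    for n in thread_counts(cores, smt_per_core):
        env = dict(os.environ, MKL_NUM_THREADS=str(n), OMP_NUM_THREADS=str(n))
        plan.append((["Rscript", script], env))
    return plan


if __name__ == "__main__":
    for cmd, env in run_plan("bench.R"):
        print(env["MKL_NUM_THREADS"], " ".join(cmd))
```

Each pair could then be passed to `subprocess.run(cmd, env=env, check=True)` on a machine with R installed.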

R Benchmark Results
  Benchmarks include Cholesky decomposition, eigendecomposition, least squares fit, linear solve, QR decomposition, matrix cross product, matrix determinant, matrix-matrix multiply, and matrix-vector multiply
  Multiple threads per core aren’t useful
    Contrast to KNC
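To make one of the listed kernels concrete, here is a textbook Cholesky factorization in pure Python. It is a sketch of what the kernel computes, not the routine R calls: in R the work would be dispatched to LAPACK/MKL, and the function name `cholesky` here is just a label for the illustration.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T == a, for symmetric positive definite a."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Dot product of the already-computed parts of rows i and j.
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(d)
            else:
                lower[i][j] = (a[i][j] - s) / lower[j][j]
    return lower


if __name__ == "__main__":
    a = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
    for row in cholesky(a):
        print(row)
```

The triple-nested work (two loops plus the inner sum) is why this kernel costs on the order of n^3 operations, like the other dense kernels in the list.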

R Benchmark Results
  For some benchmarks, single core KNL outperforms SNB

R Benchmark Results
  Need large matrices to make full use of all 68 cores
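A rough back-of-the-envelope calculation suggests why small matrices cannot keep 68 cores busy. The sketch assumes the usual 2n^3 operation count for a dense n x n matrix-matrix multiply and the 3 TF double-precision peak quoted earlier in the talk; the function names are hypothetical, and real runs will fall short of peak.

```python
KNL_PEAK_FLOPS = 3.0e12  # 3 TF peak (DP), the figure quoted earlier in the talk
KNL_CORES = 68


def matmul_flops(n):
    """Approximate floating-point operations for an n x n matrix-matrix multiply."""
    return 2 * n ** 3


def ideal_seconds(n, peak=KNL_PEAK_FLOPS):
    """Lower bound on run time if the kernel ran at peak the whole time."""
    return matmul_flops(n) / peak


def flops_per_core(n, cores=KNL_CORES):
    """Work available to each core if the multiply is split evenly."""
    return matmul_flops(n) / cores


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(f"n={n:6d}  ideal={ideal_seconds(n):.3e} s  per-core flops={flops_per_core(n):.3e}")
```

At small n the ideal run time is so short that any fixed cost, such as starting threads or crossing from the R interpreter into the library, takes up a large share of the total.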

R Benchmark Results
  For math intensive kernels, R interpreter overhead isn’t bad

RHPCBenchmark Package
  The RHPCBenchmark initial release is available on CRAN
  Provides a variety of dense matrix, sparse matrix, and machine learning benchmarks
  Users can configure the set of benchmarks to run and the benchmark parameters
  Results are provided in .csv files and a data frame for further analysis
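Since results land in .csv files, follow-up analysis such as a per-kernel speedup between two machines can be done outside R as well. The sketch below assumes a hypothetical column layout (`platform`, `kernel`, `n`, `seconds`) and uses made-up placeholder numbers purely to exercise the code. Neither the layout nor the values come from RHPCBenchmark or from the talk's measurements.

```python
import csv
import io
from collections import defaultdict

# Placeholder data in an ASSUMED layout; the values are invented, not measurements.
SAMPLE_CSV = """platform,kernel,n,seconds
platform_a,matmat,1000,4.0
platform_a,matmat,1000,6.0
platform_b,matmat,1000,2.0
platform_b,matmat,1000,3.0
platform_a,cholesky,1000,1.0
platform_b,cholesky,1000,2.0
"""


def mean_times(csv_text):
    """Average the trial times for each (platform, kernel, n) combination."""
    trials = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["platform"], row["kernel"], int(row["n"]))
        trials[key].append(float(row["seconds"]))
    return {key: sum(values) / len(values) for key, values in trials.items()}


def speedups(means, baseline, candidate):
    """Baseline time divided by candidate time; values above 1 favor the candidate."""
    result = {}
    for (platform, kernel, n), seconds in means.items():
        other = (candidate, kernel, n)
        if platform == baseline and other in means:
            result[(kernel, n)] = seconds / means[other]
    return result


if __name__ == "__main__":
    table = speedups(mean_times(SAMPLE_CSV), "platform_a", "platform_b")
    for (kernel, n), ratio in sorted(table.items()):
        print(f"{kernel:10s} n={n:5d}  speedup={ratio:.2f}")
```

Keying on kernel and problem size together matters here, because the talk's results show the faster platform can change with matrix size.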

Next Steps for R Performance
  Internode performance
  Higher level functions
    Many R packages don’t rely on the building blocks tested (e.g. nnet, cluster)
  Other classes of functions
    Sparse matrix operations
    Data wrangling operations

Conclusions
  R performance on KNL is better for dense matrix operations (3x SNB) and close to native C performance
  Performance is best for large matrices
    SNB does perform better for small matrices
  The new RHPCBenchmark package offers flexibility in benchmarking your hardware and R build

Questions? Suggestions?
  Scott Michael, James McCombs

Backups: KNL Speedup in R

Backups: KNL vs. IvyBridge

Backups: KNL Flat vs. Cached
