Introduction to Data Analysis with R on HPC Texas Advanced Computing Center Feb. 2015.


1 Introduction to Data Analysis with R on HPC Texas Advanced Computing Center Feb. 2015

2 Agenda
8:30–9:00 Welcome and introduction to TACC resources
9:00–9:30 Getting started with running R at TACC
9:30–10:00 Practice and coffee break
10:00–11:00 R basics
11:00–11:30 Data analysis support in R
11:30–1:00 Lunch break
1:00–1:30 Scaling up R computations
1:30–2:00 A walkthrough with the parallel package in R
2:00–3:00 Hands-on lab session
3:00–4:00 Understanding the performance of R programs

3 Introduction to TACC Resources

4 About TACC
TACC is a research division at the University of Texas at Austin
–Origins go back to 1960s Cray CDC 6600 support
–TACC started in 2001 to support research beyond UT needs
TACC is a service provider for XSEDE on several key systems
–Currently providing between 80 and 90% of HPC cycles in XSEDE
–Not limited to supporting NSF research
TACC is also supported through partnerships with UT Austin, UT System, industrial partners, multi-institutional research grants, and donations
TACC is 110+ people (40+ PhDs) bringing enabling technologies and techniques to drive digital research
–Many collaborative research projects and mission-specific proposals to support open research
–Consulting to bring TACC expertise to other communities

5

6 Stampede
Base Cluster (Dell/Intel/Mellanox):
–6,400 nodes
–Intel Sandy Bridge processors
–Dell dual-socket nodes w/32 GB RAM (2 GB/core)
–56 Gb/s Mellanox FDR InfiniBand interconnect
–More than 100,000 cores, 2.2 PF peak performance
Max Total Concurrency:
–Exceeds 500,000 cores
–1.8M threads
–#7 on the TOP500 list
90% allocated through XSEDE
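
The node, memory, and core figures on the Stampede slide can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted above. The figure of 16 cores per node is inferred from 32 GB of RAM at 2 GB per core and is not stated on the slide.

```python
# Figures quoted on the Stampede slide
nodes = 6400
ram_per_node_gb = 32
ram_per_core_gb = 2

# Inferred (not stated on the slide): cores per node from the RAM ratio
cores_per_node = ram_per_node_gb // ram_per_core_gb

# Totals for the base cluster
total_cores = nodes * cores_per_node
total_ram_tb = nodes * ram_per_node_gb / 1000  # decimal terabytes

print(cores_per_node)  # 16
print(total_cores)     # 102400
print(total_ram_tb)    # 204.8
```

The result of 102,400 cores agrees with the slide's "more than 100,000 cores" for the base cluster.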

7 Additional Features of Stampede
6800 Intel Xeon Phi “MIC” Many Integrated Core processors
–Special release of “Knights Corner” (61 cores)
–10+ PF peak performance
Stampede includes 16 1 TB Sandy Bridge shared memory nodes with dual GPUs
128 of the compute nodes are also equipped with NVIDIA Kepler K20 GPUs
Storage subsystem driven by Dell storage nodes:
–Aggregate bandwidth greater than 150 GB/s
–More than 14 PB of capacity
–Similar partitioning of disk space into multiple Lustre filesystems as previous TACC systems ($HOME, $WORK and $SCRATCH)

8 What does this mean?
Faster processors
More memory per node
Starting hundreds of analysis jobs in batch
Access to the latest “massively parallel” hardware
–Intel Xeon Phi
–GPGPU
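
As a rough illustration of "starting hundreds of analysis jobs in batch", the sketch below generates the text of 200 independent batch scripts, each running one R analysis on a different task index. It assumes a Slurm scheduler, which the slides do not mention. The script name `analysis.R`, the queue name `normal`, the job names, and the 30-minute limit are all hypothetical placeholders.

```python
def make_job_script(index, r_script="analysis.R", queue="normal", minutes=30):
    """Return the text of one Slurm batch script that runs a single R analysis.

    All names here (script, queue, job name pattern) are hypothetical.
    """
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH -J r_job_{index:03d}",         # job name
        f"#SBATCH -o r_job_{index:03d}.%j.out",  # output file, %j = job id
        "#SBATCH -N 1",                          # one node
        "#SBATCH -n 1",                          # one task
        f"#SBATCH -p {queue}",                   # queue (partition)
        f"#SBATCH -t 00:{minutes:02d}:00",       # wall-clock limit
        f"Rscript {r_script} {index}",           # pass the task index to R
        "",
    ])

# Build 200 scripts; each could be written to a file and submitted with sbatch
scripts = [make_job_script(i) for i in range(1, 201)]
print(len(scripts))
print(scripts[0])
```

Inside the R script, the task index could be read with `commandArgs(trailingOnly = TRUE)` to select an input file or parameter set.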

9 Automatic offloading with latest hardware
R was originally designed for single-threaded execution.
–Slow performance
–Not scalable with large data
R can be built and linked against libraries that use the latest multi-core technology to run some operations in parallel automatically, most commonly linear algebra computations.

10 Getting more from R
Optimizing R performance on Stampede
–Building with the Intel compiler instead of gcc gave a factor of 2 improvement
–MKL significantly improved performance
–Some Xeon Phi performance enhancement too
–Supporting common parallel packages

11 Maverick Hardware
132-node dual core Ivy Bridge based cluster
–Each node has an NVIDIA Kepler K40 GPU
–128 GB of memory
–FDR interconnect
–Shares the Work file system with Stampede (26 PB unformatted)
–Users get 1 TB of Work to start
Intended for real-time analysis
TACC system, 50% provided to XSEDE in kind, 50% discretionary

12 Visualization and Analysis Portal

13 R and Python
Can launch RStudio Server and IPython Notebook
–Introducing capabilities, best practices, and forms of parallelism to users
–Simplifying the UI with a web interface
–Improving visualization capabilities with the Shiny and googleVis packages

14 Hadoop Cluster: Rustler
A Hadoop cluster with 64 Hadoop data nodes
–2 x 10 core Ivy Bridge processors
–128 GB memory
–16 x 1 TB disks (1 PB usable disk, 333 TB replicated)
Login node, 2 name nodes, 1 web proxy node
10 Gb/s Ethernet network with 40 Gb/s connectivity to the TACC backbone
In early user period today
A pure TACC resource (all discretionary allocations)
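
The Rustler disk figures follow from the node and disk counts. The sketch below assumes three-way replication, which is the HDFS default but is not stated on the slide, and uses only the numbers quoted above.

```python
# Figures quoted on the Rustler slide
data_nodes = 64
disks_per_node = 16
disk_tb = 1

# Raw capacity across all data nodes
raw_tb = data_nodes * disks_per_node * disk_tb

# Assumption: three copies of every block (HDFS default replication factor)
replication = 3

print(raw_tb)                       # 1024, i.e. about 1 PB
print(round(raw_tb / replication))  # 341
print(round(1000 / replication))    # 333, matching the slide's rounded 1 PB
```

The slide's 333 TB figure corresponds to dividing a rounded 1 PB (1000 TB) by three. Starting from the exact 1024 TB gives roughly 341 TB.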

15 Wrangler
Three primary subsystems:
–A 10 PB disk storage system
  –Lustre based (2 R720 dual E5-2680 MD servers, 45 C8000 OSF servers with 6 TB drives)
–An embedded analytics capability of several thousand cores
  –96 Dell R620 Haswell E5-2680 v3 nodes with dual IB FDR/40 Gb/s Ethernet
–A high speed global object store
  –500 TB usable flash via PCI to all 96 analytics nodes
  –1 TB/s I/O rate and 250M+ IOPS

16 Data Intensive Computing Support at TACC
Data Management and Collection group
–Providing data storage services: files, databases, iRODS
–Collection management and curation
Data Mining and Statistics group
–Collaborating with users to develop and implement scalable algorithmic solutions
–In addition to general data mining and analysis methods, also expertise in R, Hadoop, and visual analytics
We are here to help:
–data@tacc.utexas.edu

