Presentation transcript:

Performance Prediction for Random Write Reductions: A Case Study in Modelling Shared Memory Programs
Ruoming Jin, Gagan Agrawal
Department of Computer and Information Sciences, Ohio State University

Outline
- Motivation
- Random Write Reductions and Parallelization Techniques
- Problem Definition
- Analytical Model
  - General Approach
  - Modeling Cache and TLB
  - Modeling Waiting for Locks and Memory Contention
- Experimental Validation
- Conclusions

Motivation
- Frequently need to mine very large datasets
- Large and powerful SMP machines are becoming available
  - Vendors often target data mining and data warehousing as the main market
- Data mining emerging as an important class of applications for SMP machines

Common Processing Structure
Structure of common data mining algorithms:

    {* Outer Sequential Loop *}
    While () {
        {* Reduction Loop *}
        Foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }

This structure applies to the major association mining, clustering, and decision tree construction algorithms. How do we parallelize it on a shared memory machine? (A concrete sketch of the loop follows.)
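Below is a minimal, runnable C++ sketch of this generalized reduction structure in sequential form. The Element type, the process() function, and the use of addition as the reduction op are illustrative assumptions, not code from the paper.

    // Sequential form of the generalized reduction loop above.
    // Element, process(), and the "+=" reduction op are hypothetical stand-ins.
    #include <utility>
    #include <vector>

    struct Element { double value; };

    // Hypothetical per-element processing: returns the index of the reduction
    // element to update and the value to combine into it.
    std::pair<int, double> process(const Element &e, int num_bins) {
        int i = static_cast<int>(e.value) % num_bins;  // data-dependent index
        return {i, e.value};
    }

    void reduction_loop(const std::vector<Element> &data,
                        std::vector<double> &Reduc) {
        for (const Element &e : data) {
            auto [i, val] = process(e, static_cast<int>(Reduc.size()));
            Reduc[i] += val;  // "op" is addition here; could be min, count, ...
        }
    }

The key difficulty is that the index i is computed from the data, so which reduction element an iteration touches is not known until the element is processed.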

Challenges in Parallelization
- Statically partitioning the reduction object to avoid race conditions is generally impossible
- Runtime preprocessing or scheduling also cannot be applied
  - Can't tell what needs to be updated without processing the element
- The size of the reduction object means replication incurs significant memory overheads
- Locking and synchronization costs can be significant because of the fine-grained updates to the reduction object

Parallelization Techniques
- Full Replication: create a copy of the reduction object for each thread (see the sketch below)
- Full Locking: associate a lock with each reduction element
- Optimized Full Locking: put each element and its lock on the same cache block
- Cache-Sensitive Locking: one lock for all elements in a cache block
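A minimal sketch of full replication, reusing the Element and process() stand-ins from the earlier sketch; the static round-robin partitioning of elements across threads is an assumption for illustration.

    // Full replication: each thread updates a private copy of the reduction
    // object, so no locks are needed; copies are merged at the end.
    #include <cstddef>
    #include <thread>
    #include <vector>

    void full_replication(const std::vector<Element> &data,
                          std::vector<double> &Reduc, int num_threads) {
        std::vector<std::vector<double>> copies(
            num_threads, std::vector<double>(Reduc.size(), 0.0));
        std::vector<std::thread> workers;
        for (int t = 0; t < num_threads; ++t) {
            workers.emplace_back([&, t] {
                for (std::size_t j = t; j < data.size(); j += num_threads) {
                    auto [i, val] = process(data[j],
                                            static_cast<int>(Reduc.size()));
                    copies[t][i] += val;  // private copy: no races, no locks
                }
            });
        }
        for (auto &w : workers) w.join();
        for (int t = 0; t < num_threads; ++t)  // sequential merge phase
            for (std::size_t i = 0; i < Reduc.size(); ++i)
                Reduc[i] += copies[t][i];
    }

The trade-off is memory: with t threads and a large reduction object, replication multiplies the memory footprint by t, which is exactly the overhead the locking schemes avoid.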

Memory Layout for Locking Schemes
[Figure: memory layouts of optimized full locking and cache-sensitive locking, showing how locks and reduction elements are placed within cache blocks]
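The two layouts can be sketched as C++ structs, assuming a 64-byte cache block and a one-word test-and-set lock; both sizes are assumptions for illustration.

    #include <atomic>

    constexpr int CACHE_BLOCK = 64;

    // Optimized full locking: each reduction element is paired with its own
    // lock, padded so the pair occupies (and shares) one cache block.
    struct alignas(CACHE_BLOCK) PaddedElement {
        std::atomic<int> lock{0};  // simple test-and-set spinlock word
        double value{0.0};
    };

    // Cache-sensitive locking: one lock guards all the elements that fit in
    // the rest of its cache block (7 doubles here, assuming a 4-byte lock),
    // cutting both the memory overhead and the number of locks.
    struct alignas(CACHE_BLOCK) LockedBlock {
        std::atomic<int> lock{0};
        double values[(CACHE_BLOCK - sizeof(std::atomic<int>)) / sizeof(double)];
    };

In both schemes, acquiring a lock also brings the elements to be updated into the acquiring processor's cache, avoiding a second miss on the update itself.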

Relative Experimental Performance
Different techniques can outperform one another depending on problem and machine parameters.

Problem Definition
- Can we predict the relative performance of the different techniques for given machine, algorithm, and dataset parameters?
- Develop an analytical model capturing the impact of the memory hierarchy and modeling the different parallelization overheads
- Other applications of the model:
  - Predicting speedups possible on parallel configurations
  - Predicting performance as the output size is increased
  - Scheduling and QoS in multiprogrammed environments
  - Choosing the accuracy of analysis and the sampling rate in an interactive environment, or when mining over data streams

Context
- Part of the FREERIDE (Framework for Rapid Implementation of Datamining Engines) system
  - Supports parallelization on shared-nothing configurations
  - Supports parallelization on shared memory configurations
  - Supports processing of large datasets
- Previously reported our work on parallelization techniques and on processing disk-resident datasets (SDM 01, SDM 02)

Analytical Model: Overview
- Input data is read from disks: constant processing time per element
- Reduction elements are accessed randomly, and their total size can vary considerably
- Factors to model:
  - Cache misses on reduction elements (capacity and coherence)
  - TLB misses on reduction elements
  - Waiting time for locks
  - Memory contention

Basic Approach
Focus on modeling the reduction loops:

    T_loop = T_average * N
    T_average = T_compute + T_reduc
    T_reduc = T_update + T_wait + T_cache_miss + T_tlb_miss + T_memory_contention

T_update can be computed by executing the loop with a reduction object that fits into the L1 cache.
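A minimal sketch of that measurement idea, reusing reduction_loop() from the earlier sketch; the 512-element reduction object (about 4 KB) is an assumed size chosen to fit comfortably in L1.

    #include <chrono>
    #include <vector>

    // Time the sequential loop with an L1-resident reduction object, so that
    // cache misses, TLB misses, lock waiting, and contention are absent.
    // The per-iteration time is then T_compute + T_update; subtracting a
    // separately measured T_compute yields T_update.
    double per_iteration_time(const std::vector<Element> &data) {
        std::vector<double> small_reduc(512, 0.0);  // ~4 KB: fits in L1
        auto start = std::chrono::steady_clock::now();
        reduction_loop(data, small_reduc);
        auto end = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(end - start).count() / data.size();
    }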

Modeling Waiting Time for Locks
- The time spent by a thread in one iteration of the loop can be divided into three components:
  - Computing independently (a)
  - Waiting for a lock (T_wait)
  - Holding a lock (b), where b = T_reduc - T_wait
- Each lock is modeled as an M/D/1 queue
- The rate at which requests to acquire a given lock are issued is:

    λ = t / ((a + b + T_wait) * m)

  where t is the number of threads and m is the number of locks.

Modeling Waiting Time for Locks (contd.)
- Standard result for M/D/1 queues:

    T_wait = bU / (2(1 - U))

  where U is the server utilization, given by U = λb
- Substituting for U yields:

    T_wait = b / (2((a/b + 1)(m/t) - 1))
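The closed form transcribes directly into code; a, b, m, and t are as defined on the previous slide (a = independent compute time, b = lock hold time, m = number of locks, t = number of threads), which is our reading of the slides' notation.

    #include <cmath>

    // T_wait = b / (2((a/b + 1)(m/t) - 1)); valid while the denominator is
    // positive, i.e. while the per-lock utilization U stays below 1.
    double t_wait(double a, double b, double m, double t) {
        double denom = 2.0 * ((a / b + 1.0) * (m / t) - 1.0);
        return denom > 0.0 ? b / denom : INFINITY;  // saturated queue otherwise
    }

For example, with a = 9b, m = 100 locks, and t = 10 threads, the denominator is 2(10 * 10 - 1) = 198, so T_wait ≈ b/198: lock contention is negligible until threads become numerous relative to locks or lock hold times grow.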

Modeling Memory Hierarchy
- Need to model:
  - L1 cache misses
  - L2 cache misses
  - TLB misses
- Ignore cold misses
- Consider only a direct-mapped cache, so capacity and conflict misses can be analyzed together
- The analysis of capacity and conflict misses is simple because accesses to the reduction object are random
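As one illustrative instance of such a simple analysis (our assumption, not a formula quoted from the paper): for a direct-mapped cache of C blocks and a reduction object spanning S > C blocks accessed uniformly at random, each cache set holds the most recently accessed of the S/C reduction blocks that map to it, so the probability that an access misses is approximately

    P_miss ≈ 1 - C/S   (for S > C; near zero once the object fits in cache)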

Modeling Coherence Cache Misses
- A coherence miss occurs when a cache block is invalidated by another CPU
- Analyze the probability that, between two accesses to a cache block on a processor, the same memory block is accessed and has not been updated by one of the other processors in the meantime
- Details are available in the paper

Modeling Memory Contention
- Input elements displace reduction elements from the cache
- This results in a write-back followed by a read operation
- The memory system on many machines requires extra cycles to switch between write-back and read operations
- This is a source of contention
- Modeled using M/D/1 queues, similarly to the waiting time for locks

Experimental Platform
- Small SMP machine: Sun Ultra Enterprise
  - 250 MHz Ultra-II processors
  - 1 GB of 4-way interleaved main memory
- Large SMP machine: Sun Fire
  - 900 MHz Sun UltraSparc III processors
  - A 96 KB L1 cache and a 64 MB L2 cache per processor
  - 24 GB main memory

Impact of Memory Hierarchy, Large SMP
[Figure: measured and predicted performance as the size of the reduction object is scaled, for full replication, optimized full locking, and cache-sensitive locking]

Modeling Parallel Performance with Locking, Large SMP
[Figure: parallel performance with cache-sensitive locking for small reduction object sizes, with 1, 2, 4, 8, and 12 threads]

Modeling Parallel Performance, Large SMP
[Figure: performance of optimized full locking with large reduction object sizes, with 1, 2, 4, 8, and 12 threads]

How Good is the Model at Predicting Relative Performance? (Large SMP)
[Figure: performance of optimized full locking and cache-sensitive locking with 12 threads]

Impact of Memory Hierarchy, Small SMP
[Figure: measured and predicted performance as the size of the reduction object is scaled, for full replication, optimized full locking, and cache-sensitive locking]

Parallel Performance, Small SMP
[Figure: performance of optimized full locking with 1, 2, and 3 threads]

Summary
- A new application of performance modeling: choosing among different parallelization techniques
- A detailed analytical model capturing the memory hierarchy and the parallelization overheads
- Evaluated on two different SMP machines:
  - Predicted performance within 20% in almost all cases
  - Effectively captured the impact of both the memory hierarchy and the parallelization overheads
  - Quite accurate in predicting the relative performance of the different techniques