Improving I/O with Compiler-Supported Parallelism

Anna Youssefi, Ken Kennedy

Why Should We Care About I/O?
Disk access speeds are much slower than processor and memory access speeds, so disk I/O may be a major bottleneck in applications such as:
- scientific codes related to image processing
- multimedia applications
- out-of-core computations
Computational optimizations alone may not provide significant improvements to these programs.

Why Should Compilers Be Involved?
Compilers have knowledge of both the application and the computer architecture or operating system. They can reduce the burden on the programmer and increase code portability by requiring little to no change in the user-level program to achieve good performance on different architectures. Compilers can also automatically translate programs written in high-level languages, which may lack robust I/O or operating-system interfaces, into higher-performance languages that provide more control over low-level system activities.

Human Neuroimaging Lab
The Human Neuroimaging Laboratory (HNL) at the Baylor College of Medicine conducts research in the physiology and functional anatomy of the human brain using fMRI technology.

fMRI Technology
Functional magnetic resonance imaging (fMRI) is a technique for determining which parts of the brain are activated when a person responds to stimuli. A high-resolution brain scan is followed by a series of low-resolution scans taken at regular time slices. Brain activity is identified by increased blood flow to specific regions of the brain.

Motivating Application
The HNL wants to optimize a preprocessing application that normalizes brain images of human subjects to a canonical brain, making the images comparable and enabling data analysis. The program uses calls to the SPM (Statistical Parametric Mapping) library.

Transformation: Loop Distribution & Parallelization
We performed a hand transformation on the I/O-intensive loop in the HNL preprocessing application. The original loop reads a different input file and writes a portion of a single output file on each iteration. The loop is distributed into two separate loops: the first loop runs in parallel on four different processors; the second loop runs sequentially across all processors. Standard compiler transformations were implemented by hand to parallelize the loop; dependence analysis can be used to automate the transformation. (The loop schematic at the end of this transcript, followed by an illustrative code sketch, shows the single-processor loop and its distributed form.)

Performance Results
Performance of the transformed loop was constrained by shortcomings of the MPI (Message Passing Interface) implementation we used. This implementation relies on file I/O to share data and results in excessive communication times, as demonstrated by the broadcast overhead. Even with these performance constraints, we achieved a 30-40% improvement in running time. We expect to achieve even better results with a different MPI implementation.

Conclusion and Future Work
Through parallelization, we achieved a minimum of 30% improvement in the running time of an I/O-intensive loop. Standard compiler transformations can be extended to reveal the parallelism in such loops. We plan to implement compiler strategies to automate these transformations. We also plan to implement compiler support for other application-level I/O transformations, such as converting synchronous I/O to asynchronous I/O, prefetching, and overlapping I/O with computation (a hedged sketch of one such overlap appears immediately below).
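The conclusion above mentions converting synchronous I/O to asynchronous I/O and overlapping I/O with computation. The following is a minimal, hypothetical sketch of that overlap using POSIX asynchronous I/O (aio_read / aio_suspend); the file names (input_%03d.dat), the buffer size, and the process() placeholder are assumptions for illustration and are not taken from the HNL application:

/* Hypothetical sketch of the future-work direction: overlapping file reads
 * with computation via POSIX asynchronous I/O.  File names, sizes, and
 * process() are illustrative assumptions; error handling is omitted. */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define N_FILES  192
#define BUF_SIZE 4096

/* placeholder for the per-file computation */
static void process(char *buf, size_t n) { (void)buf; (void)n; }

static int start_read(struct aiocb *cb, char *buf, int i)
{
    char name[64];
    snprintf(name, sizeof name, "input_%03d.dat", i);
    int fd = open(name, O_RDONLY);
    if (fd < 0) return -1;
    memset(cb, 0, sizeof *cb);
    cb->aio_fildes = fd;
    cb->aio_buf    = buf;
    cb->aio_nbytes = BUF_SIZE;
    return aio_read(cb);        /* returns immediately; the read proceeds in the background */
}

int main(void)
{
    static char buf[2][BUF_SIZE];   /* double buffer: read one file while processing the other */
    struct aiocb cb[2];

    start_read(&cb[0], buf[0], 0);  /* prefetch the first file */
    for (int i = 0; i < N_FILES; i++) {
        int cur = i % 2, nxt = (i + 1) % 2;
        if (i + 1 < N_FILES)
            start_read(&cb[nxt], buf[nxt], i + 1);   /* prefetch the next file */

        /* wait for the current read, then compute while the next read is in flight */
        const struct aiocb *wait_list[1] = { &cb[cur] };
        aio_suspend(wait_list, 1, NULL);
        ssize_t n = aio_return(&cb[cur]);
        close(cb[cur].aio_fildes);
        if (n > 0)
            process(buf[cur], (size_t)n);
    }
    return 0;
}

Double buffering lets the read of file i+1 proceed in the background while file i is being processed, which is the overlap described in the future-work item.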
Loop schematic: single-processor loop vs. distributed loops

Single processor:
    for i = 1 to 192:  READ, PROCESS, WRITE

Loop 1 (runs in parallel):
    Processor 1:  for i = 1 to 48:  READ, PROCESS
    Processor 2:  for i = 1 to 48:  READ, PROCESS
    Processor 3:  for i = 1 to 48:  READ, PROCESS
    Processor 4:  for i = 1 to 48:  READ, PROCESS

Loop 2 (runs sequentially across processors):
    Processor 1:  for i = 1 to 48:  WRITE
    Processor 2:  for i = 1 to 48:  WRITE
    Processor 3:  for i = 1 to 48:  WRITE
    Processor 4:  for i = 1 to 48:  WRITE
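The schematic above corresponds roughly to the following C/MPI sketch. It is an illustrative reconstruction under stated assumptions, not the HNL code: the file names (input_%03d.dat, output.dat), the fixed record size, and the process() placeholder are hypothetical, and the program is meant to run with 4 ranks so that each rank handles 48 of the 192 iterations:

/* Illustrative sketch of the hand transformation: the original loop is
 * distributed into a parallel READ/PROCESS loop and a serialized WRITE loop.
 * File names, record size, and process() are assumptions, not HNL code. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N_ITERS  192        /* iterations in the original loop          */
#define REC_SIZE 4096       /* assumed size of one processed record     */

/* placeholder for the per-image computation (e.g., the SPM normalization) */
static void process(char *buf, size_t n) { (void)buf; (void)n; }

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N_ITERS / size;          /* 48 iterations per rank with 4 ranks */
    int lo = rank * chunk;
    char *results = malloc((size_t)chunk * REC_SIZE);

    /* Loop 1 (parallel): each rank reads and processes its own input files. */
    for (int i = lo; i < lo + chunk; i++) {
        char name[64];
        snprintf(name, sizeof name, "input_%03d.dat", i);
        FILE *in = fopen(name, "rb");
        if (!in) MPI_Abort(MPI_COMM_WORLD, 1);
        size_t got = fread(results + (size_t)(i - lo) * REC_SIZE, 1, REC_SIZE, in);
        fclose(in);
        process(results + (size_t)(i - lo) * REC_SIZE, got);
    }

    /* Loop 2 (sequential): ranks take turns appending their portion of the
     * single output file, preserving the original iteration order. */
    for (int r = 0; r < size; r++) {
        if (rank == r) {
            FILE *out = fopen("output.dat", r == 0 ? "wb" : "ab");
            fwrite(results, 1, (size_t)chunk * REC_SIZE, out);
            fclose(out);
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }

    free(results);
    MPI_Finalize();
    return 0;
}

Serializing the second loop with one barrier per rank preserves the original order of the single output file, while the reads and computation in the first loop run fully in parallel.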