Parallel I/O Basics
Claudio Gheller, CINECA
Firenze, 10-11 June 2003



Reading and writing data is an often underestimated problem. However, it can become crucial for:
- Performance
- Porting data across different platforms
- Parallel implementation of I/O algorithms

Performance

Time to access disk: approx. Mbyte/s
Time to access memory: approx. 1-10 Gbyte/s

THEREFORE: when reading or writing on disk, a code is about 100 times slower.

Optimization is platform dependent. In general: write large amounts of data in single shots.

Performance

For example, avoid looped read/write:

do i=1,N
   write(10) A(i)
enddo

This is VERY slow.

Optimization is platform dependent. In general: write large amounts of data in single shots.

Data portability

This is a subtle problem, which becomes crucial only later, when you try to use the data on a different platform. For example, unformatted data written by an IBM system cannot be read by an Alpha station or by a Linux/MS Windows PC.

There are two main problems:
- Data representation
- File structure

Data portability: number representation

There are two different representations.

Big endian (Unix systems: IBM, SGI, SUN, ...): the bytes Byte3 Byte2 Byte1 Byte0 of a 32-bit word are arranged in memory as follows:
Base Address+0: Byte3
Base Address+1: Byte2
Base Address+2: Byte1
Base Address+3: Byte0

Little endian (Alpha, PC): the same bytes Byte3 Byte2 Byte1 Byte0 are arranged in memory as follows:
Base Address+0: Byte0
Base Address+1: Byte1
Base Address+2: Byte2
Base Address+3: Byte3

Data portability: file structure

For performance reasons, Fortran organizes sequential binary files in BLOCKS. Each block is delimited by record markers (typically 4 bytes long). Unfortunately, each Fortran compiler has its own block size and separators!!!

Notice that this problem is typical of Fortran and does not affect C/C++.

Data portability: compiler solutions

Some compilers allow you to overcome these problems with specific options. However, this leads to:
- spending a lot of time re-configuring the compilation on each different system;
- a less portable code (the results depend on the compiler).

Data portability: compiler solutions

For example, the Alpha Fortran compiler can read and write big-endian data with the -convert big_endian option. However, this option is not present in other compilers and, furthermore, data produced with this option are incompatible with the native format of the very system that wrote them!!!

Fortran offers a possible solution to both the performance and the portability problems with DIRECT ACCESS files:

Open(unit=10, file='datafile.bin', form='unformatted', access='direct', recl=N)

The result is a binary file with no blocks and no control characters. Any Fortran compiler writes (and can read) it in THE SAME WAY.

Notice however that the endianness problem is still present: the file is portable between any platforms with the same endianness.

Direct access files

The keyword recl is the basic quantum of written data. It is usually expressed in bytes (except on Alpha, which expresses it in words).

Example 1:

Real*4 x(100)
Inquire(IOLENGTH=IOL) x(1)
Open(unit=10, file='datafile.bin', access='direct', recl=IOL)
Do i=1,100
   write(10,rec=i) x(i)
Enddo
Close(10)

Portable but not performing!!! (Notice that this is precisely the C fread/fwrite style of I/O.)

Direct access files

Example 2:

Real*4 x(100)
Inquire(IOLENGTH=IOL) x
Open(unit=10, file='datafile.bin', access='direct', recl=IOL)
write(10,rec=1) x
Close(10)

Portable and performing!!!

Direct access files

Example 3:

Real*4 x(100), y(100), z(100)
Open(unit=10, file='datafile.bin', access='direct', recl=4*100)
write(10,rec=1) x
write(10,rec=2) y
write(10,rec=3) z
Close(10)

The same result can be obtained as:

Real*4 x(100), y(100), z(100)
Open(unit=10, file='datafile.bin', access='direct', recl=4*100)
write(10,rec=2) y
write(10,rec=3) z
write(10,rec=1) x
Close(10)

The order is not important!!!

Parallel I/O

I/O is not a trivial issue in parallel. Example:

Program Scrivi
   Write(*,*) 'Hello World'
End Program Scrivi

Execute in parallel on 4 processors (Pe 0, Pe 1, Pe 2, Pe 3):

$ ./Scrivi
Hello World
Hello World
Hello World
Hello World

Each processor runs the same program and writes to the same standard output.

Parallel I/O

Goals:
- Improve the performance
- Ensure data consistency
- Avoid communication
- Usability

Parallel I/O. Solution 1: Master-Slave

Only one processor (Pe 0) performs the I/O; Pe 1, Pe 2 and Pe 3 send their data to it.

Goals:
- Improve the performance: NO
- Ensure data consistency: YES
- Avoid communication: NO
- Usability: YES (but in general not portable)

Parallel I/O. Solution 2: Distributed I/O

All the processors read/write their own file (Pe 0 on Data File 0, ..., Pe 3 on Data File 3).

Goals:
- Improve the performance: YES (but be careful)
- Ensure data consistency: YES
- Avoid communication: YES
- Usability: NO

Warning: do not parametrize the data files with the number of processors!!!

Parallel I/O. Solution 3: Distributed I/O on a single file

All the processors read/write on a single ACCESS='DIRECT' file.

Goals:
- Improve the performance: YES for read, NO for write
- Ensure data consistency: NO
- Avoid communication: YES
- Usability: YES (portable!!!)

Parallel I/O. Solution 4: MPI-2 I/O

The I/O is performed by MPI functions, defined in the MPI-2 standard (although support was not yet universal at the time). Asynchronous I/O is supported.

Goals:
- Improve the performance: YES (strongly!!!)
- Ensure data consistency: NO
- Avoid communication: YES
- Usability: YES

Case study: data analysis, case 1

How many clusters are there in the image???

Cluster finding algorithm:
- Input = the image
- Output = a number

Case study: case 1, parallel implementation

Parallel cluster finding algorithm:
- Input = a fraction of the image (one half per processor: Pe 0, Pe 1)
- Output = a number for each processor

All the parallelism is in the setup of the input. Then all processors work independently!!!

Case study: case 1, setup of the input

Each processor reads its own part of the input file:

! The image is NxN pixels, using 2 processors
Real*4 array(N,N/2)
Open(unit=10, file='image.bin', access='direct', recl=4*N*N/2)
Startrecord = mype + 1
read(10,rec=Startrecord) array
Call Sequential_Find_Cluster(array, N_cluster)
Write(*,*) mype, ' found ', N_cluster, ' clusters'

Case study: case 1, boundary conditions

Boundaries must be treated in a specific way: each processor also reads one halo row from the neighbouring domain.

! The image is NxN pixels, using 2 processors
Real*4 array(0:N+1,0:N/2+1)
! Set boundaries on the image side
array(0,:)   = 0.0
array(N+1,:) = 0.0
jside = mod(mype,2)*N/2 + mod(mype,2)
array(:,jside) = 0.0
Open(unit=10, file='image.bin', access='direct', recl=4*N)
Do j=1,N/2
   record = mype*N/2 + j
   read(10,rec=record) array(:,j)
Enddo
If (mype.eq.0) then
   record = N/2 + 1
   read(10,rec=record) array(:,N/2+1)
else
   record = N/2
   read(10,rec=record) array(:,0)
endif
Call Sequential_Find_Cluster(array, N_cluster)
Write(*,*) mype, ' found ', N_cluster, ' clusters'

Case study: data analysis, case 2

From the observed data... to the sky map.

Case study: data analysis, case 2

Each map pixel is measured N times. The final value of each pixel is an average of all the corresponding measurements. Each measured value carries the id of the map pixel it belongs to.

Case study: case 2, parallelization

- Values and ids are distributed between processors in the data input phase (just like case 1).
- Calculation is performed independently by each processor.
- Each processor produces its own COMPLETE map (which is small and can be replicated).
- The final map is the SUM OF ALL THE MAPS calculated by the different processors.

Case study: case 2, parallelization

! N data, M pixels, Npes processors (M << N)
Real*8 value(N/Npes)                  ! define basic arrays
Real*8 map(M)
Integer id(N/Npes)
Open(unit=10, file='data.bin', access='direct', recl=8*N/Npes)
Open(unit=20, file='ids.bin',  access='direct', recl=4*N/Npes)
record = mype + 1
Read(10,rec=record) value             ! read data in parallel (boundaries are neglected)
Read(20,rec=record) id
Call Sequential_Calculate_Local_Map(value,id,map)   ! calculate local maps
Call BARRIER                          ! synchronize processes
Call Calculate_Final_Map(map)         ! parallel calculation of the final map
Call Print_Final_Map(map)             ! print the final map

Case study: case 2, calculation of the final map

The final map is accumulated processor by processor:

Subroutine Calculate_Final_Map(map)
Real*8 map(M)
Real*8 map_aux(M)
Do i=2,npes
   If (mype.eq.0) then
      call RECV(map_aux, i-1)
      map = map + map_aux
   Else if (mype.eq.i-1) then
      call SEND(map, 0)
   Endif
   Call BARRIER
enddo
return

However MPI offers a MUCH BETTER solution (we will see it tomorrow).

Case study: case 2, print the final map

At this point ONLY processor 0 has the final map, so only it can (and does) print the result:

Subroutine Print_Final_Map(map)
Real*8 map(M)
If (mype.eq.0) then
   do i=1,M
      write(*,*) i, map(i)
   enddo
Endif
return