COMPUTING PERFORMANCE
Enzo Papandrea - DFCI - PM3 - Paris, 2nd Mar 2004


1 COMPUTING PERFORMANCE
Enzo Papandrea

2 COMPUTING PERFORMANCE/1
MTR – PT, H2O, O3 – Orbit 2081 – 72 Sequences
November, 2003
  T = 10h 30m [ Alphaserver ES45, CPU 1 GHz ]
  T = 5h 45m [ Intel P-IV, CPU 2.8 GHz ]

3 COMPUTING PERFORMANCE/2
FM = FORWARD MODEL, AL = MATRIX ALGEBRA
November, 2003
  T = 5h 45m [ P-IV ]
    T_FM = 101m x 3 = 303m = 88%
    T_AL = 20m x 2 = 40m = 12%
February, 2004
  T = 54m 54s [ P-IV ]
    T_FM = 53m 32s = 98%
    T_AL = 77s = 2%
  T = 2h 0m 24s [ Alpha ]
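
The gain between the two dates can be worked out directly from the quoted wall-clock times. The sketch below is only an illustration: the program and variable names are hypothetical, and it assumes the 2h 0m 24s Alpha figure is the February 2004 counterpart of the 10h 30m November 2003 Alpha run.

```fortran
program timing_gain
  implicit none
  real :: piv_nov, piv_feb, alpha_nov, alpha_feb

  ! Total run times quoted on the slides, converted to seconds.
  piv_nov   = 5.0*3600.0 + 45.0*60.0    ! 5h 45m     (P-IV, November 2003)
  piv_feb   = 54.0*60.0 + 54.0          ! 54m 54s    (P-IV, February 2004)
  alpha_nov = 10.0*3600.0 + 30.0*60.0   ! 10h 30m    (Alpha, November 2003)
  alpha_feb = 2.0*3600.0 + 24.0         ! 2h 0m 24s  (Alpha, assumed February 2004)

  ! Ratio of old to new run time on each machine.
  print '(a,f5.2)', 'P-IV  gain: ', piv_nov / piv_feb
  print '(a,f5.2)', 'Alpha gain: ', alpha_nov / alpha_feb
end program timing_gain
```

Under these assumptions the run time drops by a factor of roughly 6.3 on the P-IV and roughly 5.2 on the Alpha.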

4 LOOP OPTIMIZATION
When the code was developed, attention was paid mainly to the correctness of the results, and much less to speed.
Stride: the constant offset between the memory addresses of successively accessed elements of an array.
In Fortran, arrays are stored in column-major order.
Stride 1000, slow access:
    do i=1, 1000
      do j=1, 1000
        a(i,j) = b(i,j)
      enddo
    enddo
Stride 1, fast access!
    do j=1, 1000
      do i=1, 1000
        a(i,j) = b(i,j)
      enddo
    enddo
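
A minimal, self-contained sketch for timing the two loop orders is given here. It is not taken from the original code: the program name, array sizes, and the use of `cpu_time` are illustrative choices, and an optimizing compiler may interchange the loops itself, in which case the two timings come out nearly equal.

```fortran
program loop_order
  implicit none
  integer, parameter :: n = 1000
  real(kind=8), allocatable :: a(:,:), b(:,:)
  real :: t0, t1, t_row, t_col
  integer :: i, j

  allocate(a(n,n), b(n,n))
  call random_number(b)

  ! Inner loop over j: successive accesses are n elements apart (stride n).
  call cpu_time(t0)
  do i = 1, n
     do j = 1, n
        a(i,j) = b(i,j)
     enddo
  enddo
  call cpu_time(t1)
  t_row = t1 - t0
  if (any(a /= b)) error stop 'stride-n copy failed'

  a = 0.0d0

  ! Inner loop over i: successive accesses are adjacent in memory (stride 1).
  call cpu_time(t0)
  do j = 1, n
     do i = 1, n
        a(i,j) = b(i,j)
     enddo
  enddo
  call cpu_time(t1)
  t_col = t1 - t0
  if (any(a /= b)) error stop 'stride-1 copy failed'

  print '(a,f8.4,a)', 'inner loop over j (stride n): ', t_row, ' s'
  print '(a,f8.4,a)', 'inner loop over i (stride 1): ', t_col, ' s'
end program loop_order
```

Both loops produce the same array; only the order in which memory is visited differs.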

5 INNER LOOP VECTORIZATION/1
SSE2 architecture (P-IV): can execute two 64-bit double-precision floating-point operations, or four 32-bit single-precision operations, per clock cycle.
[Diagram: the packed operands (x0, x1, x2, x3) and (y0, y1, y2, y3) are added element by element in a single operation, giving (x0+y0, x1+y1, x2+y2, x3+y3).]
INTEL FORTRAN COMPILER (IFC) – PUBLIC DOMAIN
CAN VECTORIZE REAL*4 AND REAL*8 WITH -Xw
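
The kind of loop that maps onto such packed additions is a unit-stride, element-wise operation with no dependence between iterations. The following is a small sketch with hypothetical names, not code from the original program. Whether the loop is actually vectorized depends on the compiler and the flags used.

```fortran
program packed_add
  implicit none
  integer, parameter :: n = 1024
  real(kind=4) :: x(n), y(n), s(n)
  integer :: i

  do i = 1, n
     x(i) = real(i)
     y(i) = 2.0*real(i)
  enddo

  ! Unit stride, independent iterations: four single-precision
  ! additions can be packed into one SSE2 operation.
  do i = 1, n
     s(i) = x(i) + y(i)
  enddo

  ! Small integers are exactly representable, so the check is exact.
  if (any(s /= 3.0*x)) error stop 'unexpected sum'
  print *, s(1), s(n)
end program packed_add
```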

6 INNER LOOP VECTORIZATION/2
Inner loop too small:
    do i=1,1000
      do j=1,3
        x(i,j) = y(i,j) + 9*z(i,j)
      enddo
    enddo
Non unit stride:
    do i=1,1000
      x(1,i) = y(1,i) + 9*z(1,i)
      x(2,i) = y(2,i) + 9*z(2,i)
      x(3,i) = y(3,i) + 9*z(3,i)
    enddo
Vectorizable!
    do i=1,1000
      x(i,1) = y(i,1) + 9*z(i,1)
      x(i,2) = y(i,2) + 9*z(i,2)
      x(i,3) = y(i,3) + 9*z(i,3)
    enddo
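
As a check that the rewrite does not change the results, the sketch below runs the original nest and the unrolled form on the same data and compares them. The program name, the random test data, and the tolerance are assumptions made for illustration only.

```fortran
program unrolled_columns
  implicit none
  integer, parameter :: n = 1000
  real(kind=8) :: x(n,3), xref(n,3), y(n,3), z(n,3)
  integer :: i, j

  call random_number(y)
  call random_number(z)

  ! Original nest: the inner loop has only three iterations.
  do i = 1, n
     do j = 1, 3
        xref(i,j) = y(i,j) + 9*z(i,j)
     enddo
  enddo

  ! Unrolled over j: each statement walks down one column with
  ! stride 1 as i increases, which suits vectorization.
  do i = 1, n
     x(i,1) = y(i,1) + 9*z(i,1)
     x(i,2) = y(i,2) + 9*z(i,2)
     x(i,3) = y(i,3) + 9*z(i,3)
  enddo

  ! A small tolerance allows for differences in how the compiler
  ! fuses the multiply and the add in the two loops.
  if (maxval(abs(x - xref)) > 1.0d-12) error stop 'results differ'
  print *, 'max abs difference:', maxval(abs(x - xref))
end program unrolled_columns
```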

7 PARALLEL ALGORITHM
February, 2004 [ Linux cluster, 8 Intel P-IV, CPU 2.8 GHz ]
T = 8m 20s
  T_FM = 6m 57s = 83%
  T_AL = 77s = 16%
IF THE FM WERE COMPLETELY PARALLEL, THE FM COMPUTING TIME WOULD BE: (53m 32s)/8 = 6m 41s
FM IS PARALLEL (WITH 8 CPUs) AT 96%
WORK OVER THE CPUs IS WELL BALANCED
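
The 96% figure follows from comparing the ideal FM time with the measured one. The arithmetic can be reproduced with a short sketch; the program and variable names are hypothetical, and the times are those quoted on the slides.

```fortran
program fm_efficiency
  implicit none
  integer, parameter :: ncpu = 8
  real :: t_fm_serial, t_fm_parallel, t_ideal

  t_fm_serial   = 53.0*60.0 + 32.0   ! FM time on one P-IV: 53m 32s
  t_fm_parallel = 6.0*60.0 + 57.0    ! FM time on the 8-CPU cluster: 6m 57s

  ! Time the FM would take if it were perfectly parallel.
  t_ideal = t_fm_serial / real(ncpu)

  print '(a,f6.1,a)', 'Ideal FM time:       ', t_ideal, ' s'
  print '(a,f6.1,a)', 'Measured FM time:    ', t_fm_parallel, ' s'
  print '(a,f6.1,a)', 'Parallel efficiency: ', 100.0*t_ideal/t_fm_parallel, ' %'
end program fm_efficiency
```

The ideal time is about 401.5 s (6m 41s), which is about 96% of the measured 417 s.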

8 SCALABILITY AND SPEED-UP
SCALABILITY: THE ABILITY TO ACHIEVE PERFORMANCE PROPORTIONAL TO THE NUMBER OF PROCESSORS
SPEED-UP: A MEASURE OF THE SCALABILITY, GIVEN BY THE TIME SPENT TO SOLVE A PROBLEM ON ONE PROCESSOR DIVIDED BY THE TIME SPENT TO SOLVE THE SAME PROBLEM ON A NUMBER P OF PROCESSORS
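
Written as a formula, with T(P) denoting the time on P processors (the notation is introduced here for illustration), the definition reads as follows. The worked value assumes that the single P-IV run of 54m 54s is the one-processor reference for the 8-CPU cluster run of 8m 20s.

```latex
S(P) = \frac{T(1)}{T(P)},
\qquad
S(8) = \frac{54\,\mathrm{m}\,54\,\mathrm{s}}{8\,\mathrm{m}\,20\,\mathrm{s}}
     = \frac{3294\ \mathrm{s}}{500\ \mathrm{s}} \approx 6.6
```

Perfect scalability would give S(P) = P. The overall value stays below 8 here because it also includes the matrix algebra time (T_AL = 77s), which is the same in both runs.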

9 SCALABILITY

10 SPEED-UP
72 is not divisible by 5 or 7!
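
One way to see why divisibility matters is to assume that the 72 sequences are the units of parallel work, that each costs the same, and that they are distributed statically over the CPUs. The slides do not state the distribution scheme, so this is only a sketch under those assumptions. The busiest CPU then processes ceiling(72/P) sequences, which caps the achievable speed-up at 72/ceiling(72/P).

```fortran
program sequence_balance
  implicit none
  integer, parameter :: nseq = 72
  integer :: p, max_per_cpu
  real :: bound

  print '(a)', ' CPUs  max seq/CPU  speed-up bound'
  do p = 1, 8
     ! Integer ceiling of nseq/p: the load of the busiest CPU.
     max_per_cpu = (nseq + p - 1) / p
     ! The run ends when the busiest CPU finishes.
     bound = real(nseq) / real(max_per_cpu)
     print '(i5,i13,f16.2)', p, max_per_cpu, bound
  enddo
end program sequence_balance
```

Under these assumptions the bound equals P for 1, 2, 3, 4, 6 and 8 CPUs, but drops to 4.80 for 5 CPUs and about 6.55 for 7 CPUs.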

11 CONCLUSIONS & PERSPECTIVES
COMPUTING TIME: SOME IMPROVEMENTS CAN STILL BE OBTAINED (ESPECIALLY ON THE ALPHA SYSTEM)
MEMORY REQUIREMENTS: FOR THIS TEST CASE THE MEMORY REQUIREMENT IS 1.05 Gbyte (1.7 AT THE LAST MEETING). IT CAN BE REDUCED AT THE EXPENSE OF CODE READABILITY (WORK IS IN PROGRESS)

