3rd ACES WG mtg., 2003/06/06, Brisbane
Current Target: Transcurrent Plate Boundary, Preliminary Study (e.g. the San Andreas Fault, CA)



Slide 2: San Andreas Fault, CA. A transcurrent plate boundary more than 1,000 km long (figure: US Geological Survey).

Slide 3: Problem Configuration
- Double fault patches for the initial condition.
- Fault size: 150~1,200 km (length) x 45 km (depth), giving 705~6,000 parameters.
- Plate motion: 50 mm/yr.
- Earth Simulator (1-64 PEs).

Slide 4: Overview of the Hashimoto Code
- Simulation of tectonic stress accumulation at transcurrent plate boundaries.
- Boundary Integral Method.
- Fault length: 150 km - 1,200 km.

Slide 5: Parallel Matrix Assembling for the Linear Equations: MRQCOF

Original:

      do ip= 1, PETOT
        is= (ip-1)*gN
        if (iflagM.eq.1) then
          do j= 1, gN
            wt= dydamatP(is+j)*sig2imatP(ip)
!CDIR NODEP
            do k= 1, gM
              k1= gMTBL(k)
              gA2(j,k)= gA2(j,k) + wt*dydamatP(is+k1)
            enddo
            gB2(j)= gB2(j) + dymatP(ip)*wt
          enddo
        endif
        chisq= chisq + dymatP(ip)*dymatP(ip)*sig2imatP(ip)
      enddo

Optimized:

      if (iflagM.eq.1) then
        do ip= 1, PETOT
          is= (ip-1)*gN
          k = 1
          k1= gMTBL(k)
!CDIR NODEP
          do j= 1, gN
            wt= dydamatP(is+j)*sig2imatP(ip)
            gA2(j,k)= gA2(j,k) + wt*dydamatP(is+k1)
            gB2(j)  = gB2(j)   + wt*dymatP(ip)
          enddo
          do k= 2, gM
            k1= gMTBL(k)
!CDIR NODEP
            do j= 1, gN
              wt= dydamatP(is+j)*sig2imatP(ip)
              gA2(j,k)= gA2(j,k) + wt*dydamatP(is+k1)
            enddo
          enddo
          chisq= chisq + dymatP(ip)*dymatP(ip)*sig2imatP(ip)
        enddo
      else
!CDIR NODEP
        do ip= 1, PETOT
          is= (ip-1)*gN
          chisq= chisq + dymatP(ip)*dymatP(ip)*sig2imatP(ip)
        enddo
      endif

Annotations on the slide: gM = gN/PETOT; x gM additional computation for wt (wt is recomputed in every k loop, but the loop-invariant IF is hoisted out of the ip loop and the long j loop becomes the innermost, vectorized one).
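The same two transformations (hoisting the loop-invariant IF out of the process loop, and making the long j loop the innermost vectorized one so that gA2(j,k) is accessed with unit stride) can be illustrated in isolation. The following is only a minimal, self-contained sketch with simplified shapes and hypothetical stand-in arrays w and d, not the actual MRQCOF data:

      ! Minimal sketch (hypothetical names w, d and sizes) of the pattern used
      ! in the optimized MRQCOF above.
      program hoist_and_interchange
        implicit none
        integer, parameter :: PETOT= 16, gN= 1000, gM= 64
        real(8) :: gA2(gN,gM), w(PETOT,gN), d(PETOT,gM)
        integer :: ip, j, k, iflagM

        iflagM= 1
        call random_number(w)
        call random_number(d)
        gA2= 0.0d0

        ! branch hoisted: tested once, not PETOT times
        if (iflagM.eq.1) then
          do ip= 1, PETOT
            do k= 1, gM
!CDIR NODEP
              ! long loop innermost: unit stride in gA2(j,k)
              do j= 1, gN
                gA2(j,k)= gA2(j,k) + w(ip,j)*d(ip,k)
              enddo
            enddo
          enddo
        endif

        print *, 'gA2(1,1)=', gA2(1,1)
      end program hoist_and_interchange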

Slide 6: Matrix Component: FUNCS (called NDATA times at every time step)

Original:

      gs_d= 0.d0
      do is= 1, stepj-1
        if (dp_d.ne.0) then
          do p= 1, ma
            do it= 0, itcnt-1
              if ((t(it).le.tau(stepj)-tau(is)).and.
     &            (t(it+1).gt.tau(stepj)-tau(is))) then
                gst= gss(p,it)
                goto 111
              endif
            enddo
            gst= gss(p,itcnt)
  111       continue
            gs_d= gs_d + aaj(p,is)*gst
          enddo
        endif
      enddo

Optimized:

      if (itflag.eq.0) then
        do is= 1, stepj-1
          do it= 0, itcnt-1
            if ((t(it).le.tau(stepj)-tau(is)).and.
     &          (t(it+1).gt.tau(stepj)-tau(is))) then
              itCUR(is)= it
              goto 111
            endif
          enddo
          itCUR(is)= itcnt
  111     continue
        enddo
      endif
      ...
      gs_d= 0.0d0
      if (dp_d.ne.0) then
        do is= 1, stepj-1
!CDIR NODEP
          do p= 1, ma
            gs_d= gs_d + aaj(p,is)*gss(p,ITcur(is))
          enddo
        enddo
      endif

Notes:
- Subroutine FUNCS is called NDATA times; stepj is the current step number, so the amount of work in this part grows as the simulation proceeds.
- gst depends only on time and on the location of the parameter point.
- The additional array ITCUR(is) is therefore computed just once per time step.
- The resulting computation of gs_d is very simple and easy to optimize (vectorize).
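The essential idea (search for the bracketing time interval once per step, store the index, and reduce the hot loop to a plain gather-and-sum) can be shown on its own. The following is a minimal, self-contained sketch with invented sizes and dummy data, not the actual FUNCS routine:

      ! Minimal sketch of the ITCUR idea: the interval search is done once per
      ! step and stored, so the hot loop becomes a vectorizable gather + sum.
      program memoize_interval
        implicit none
        integer, parameter :: ma= 512, nstep= 100, itcnt= 200
        real(8) :: t(0:itcnt+1), tau(nstep), aaj(ma,nstep), gss(ma,0:itcnt)
        integer :: itcur(nstep)
        real(8) :: gs_d, dt
        integer :: is, it, p, stepj

        ! monotonically increasing time table and dummy data
        do it= 0, itcnt+1
          t(it)= dble(it)
        enddo
        do is= 1, nstep
          tau(is)= 2.0d0*dble(is)
        enddo
        call random_number(aaj)
        call random_number(gss)
        stepj= nstep

        ! once per time step: find the interval index for every earlier step
        do is= 1, stepj-1
          dt= tau(stepj) - tau(is)
          itcur(is)= itcnt
          do it= 0, itcnt-1
            if (t(it).le.dt .and. t(it+1).gt.dt) then
              itcur(is)= it
              exit
            endif
          enddo
        enddo

        ! hot loop (called many times): simple gather + sum
        gs_d= 0.0d0
        do is= 1, stepj-1
!CDIR NODEP
          do p= 1, ma
            gs_d= gs_d + aaj(p,is)*gss(p,itcur(is))
          enddo
        enddo

        print *, 'gs_d=', gs_d
      end program memoize_interval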

Slide 7: Results on the Earth Simulator (Single PE, 15 steps, 150 km length region)

Flat profile, exclusive time share per routine (the other profiler columns, such as frequency, MOPS, MFLOPS, vector operation ratio, average vector length, cache misses and bank-conflict time, are not recoverable from this transcript):

Original:  mrqcof 64.3%, funcs 28.3%, srcinput 3.7%, pgauss 2.2%, quasi_static 0.8%, consti_parameter 0.7%, mrqmin 0.0%.
Optimized: funcs 46.2%, mrqcof 42.1%, srcinput 5.7%, pgauss 3.5%, consti_parameter 1.2%, quasi_static 1.2%, mrqmin 0.0%.

- Total computational time is reduced dramatically.
- MRQCOF speeds up in spite of the larger amount of computation.
- Bank conflicts remain in FUNCS.

Slide 8: Array Access Pattern in FUNCS, reordered in order to avoid bank conflicts

Original:

      idX1= idint(zz_d)
      do p= 1, ma
        ipp= idnint(dabs(kk(p)-xx_d))
        if (ipp.gt.xmax0) then
          uu(p)= 0.d0
!CDIR NODEP
          do it= 0, itcnt
            gss(p,it)= 0.d0
          enddo
        else
          idX2= idint(ll(p)/3.d0)
          uu(p)= u(ipp, idX1, idX2)
!CDIR NODEP
          do it= 0, itcnt
            gss(p,it)= gs(ipp, idX1, idX2, it)
          enddo
        endif
      enddo

Optimized:

      idX1= idint(zz_d)
      it= 0
!CDIR NODEP
      do p= 1, ma
        ipp= idnint(dabs(kk(p)-xx_d))
        if (ipp.gt.xmax0) then
          uu (p)   = 0.d0
          gss(p,it)= 0.d0
        else
          idX2= idint(ll(p)/3.d0)
          uu (p)   = u (ipp, idX2, idX1)
          gss(p,it)= gs(ipp, it, idX2, idX1)
        endif
      enddo

      do it= 1, itcnt
!CDIR NODEP
        do p= 1, ma
          ipp= idnint(dabs(kk(p)-xx_d))
          if (ipp.gt.xmax0) then
            gss(p,it)= 0.d0
          else
            idX2= idint(ll(p)/3.d0)
            gss(p,it)= gs(ipp, it, idX2, idX1)
          endif
        enddo
      enddo

Notes:
- The innermost loop that fills gss(p,it) is changed from "it" to "p".
- The 4-D table is re-ordered from gs(ipp,idX1,idX2,it) to gs(ipp,it,idX2,idX1) in order to avoid bank conflicts.
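Because Fortran arrays are column-major, the index swept by the vector loop should sit in, or next to, the first dimension; storing the table as gs(ipp,it,idX2,idX1) avoids the large constant stride that the original it loop produced. A minimal, self-contained sketch of this layout, with invented sizes and dummy index arrays rather than the actual FUNCS data, is:

      ! Minimal sketch of the re-ordered layout: the vector loop runs over p
      ! and gathers along the leading dimension instead of striding through
      ! the slowest-varying time index.
      program reorder_dims
        implicit none
        integer, parameter :: nx= 64, nz= 8, nl= 4, itcnt= 200, ma= 512
        real(8) :: gs(nx, 0:itcnt, nl, nz)   ! time index moved next to ipp
        real(8) :: gss(ma, 0:itcnt)
        integer :: ipp(ma), idx2(ma)
        integer :: p, it, idx1

        call random_number(gs)
        idx1= 1
        do p= 1, ma
          ipp(p) = 1 + mod(p, nx)
          idx2(p)= 1 + mod(p, nl)
        enddo

        ! gather: for fixed it, consecutive p read gs with small strides
        do it= 0, itcnt
!CDIR NODEP
          do p= 1, ma
            gss(p,it)= gs(ipp(p), it, idx2(p), idx1)
          enddo
        enddo

        print *, 'gss(1,0)=', gss(1,0)
      end program reorder_dims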

Slide 9: Results on the Earth Simulator (Single PE, 15 steps, 150 km length region)

Flat profile, exclusive time share per routine (remaining profiler columns not recoverable from this transcript):

Original:  mrqcof 64.3%, funcs 28.3%, srcinput 3.7%, pgauss 2.2%, quasi_static 0.8%, consti_parameter 0.7%, mrqmin 0.0%.
Optimized: funcs 46.2%, mrqcof 42.1%, srcinput 5.7%, pgauss 3.5%, consti_parameter 1.2%, quasi_static 1.2%, mrqmin 0.0%.
Final (with the FUNCS access-pattern change of slide 8 also applied): mrqcof 60.2%, funcs 22.8%, srcinput 8.4%, pgauss 5.1%, quasi_static 1.8%, consti_parameter 1.7%, mrqmin 0.0%.

Slide 10: Results on the Earth Simulator (Single PE, 50 steps, 150 km length region)

[Whole-program performance summaries (real/user/system time, vector time, instruction and vector-instruction counts, vector element count, FLOP count, MOPS, MFLOPS, average vector length, vector operation ratio, memory used, MIPS, instruction/operand cache-miss time, bank-conflict time) were shown for the Original, Optimized and Final versions; the numerical values are not recoverable from this transcript. A partially legible comparison line ("SR ... PEs ... 2205. sec ...") was also shown.]

Slide 11: Results on the Earth Simulator (Single PE, 5 steps, 300 km length region)

Flat profile, exclusive time share per routine (remaining profiler columns not recoverable from this transcript):

Optimized: mrqcof 52.1%, funcs 39.7%, pgauss 5.7%, srcinput 1.6%, consti_parameter 0.5%, quasi_static 0.2%, mrqmin 0.0%.
Final:     mrqcof 72.8%, funcs 15.7%, pgauss 8.0%, srcinput 2.3%, consti_parameter 0.7%, quasi_static 0.3%, mrqmin 0.0%.

Slide 12: Results on the Earth Simulator (16 PEs, 20 steps, 1,200 km length region), Optimized version

[Global statistics over the 16 processes (minimum, maximum and average, with the process holding each extreme) were shown for real/user/system time, vector time, instruction and vector-instruction counts, vector element count, FLOP count, MOPS, MFLOPS, average vector length, vector operation ratio, memory used, MIPS, cache-miss and bank-conflict times; the numerical values are not recoverable from this transcript.]

Slide 13: Results on the Earth Simulator (16 PEs, 20 steps, 1,200 km length region), Final version

[The same per-process global statistics as on slide 12 were shown for the Final version; the numerical values are not recoverable from this transcript.]

Slide 14: Parallel Efficiency for the 1st Linear Step on the Earth Simulator

Plots of parallel efficiency for the Original and Final versions, for fault lengths L = 150 km (●), 300 km (○), 450 km (■), 600 km (□) and 1,200 km (▲). The length of the innermost loop is m/PE.
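For reference, and assuming the usual definitions (they are not stated on the slide), the quantities plotted here would be

\[
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p},
\]

where T_p is the elapsed time on p PEs. Since the innermost (vector) loop length is m/PE, the average vector length on each PE shrinks as p grows, which would be expected to pull the efficiency down sooner for the shorter fault models.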