
1. Computer Science Overview
Laxmikant Kale
October 29, 2002
©2002 Board of Trustees of the University of Illinois

2. CS Faculty and Staff
Investigators: M. Bhandarkar, M. Brandyberry, M. Campbell, E. de Sturler, R. Fiedler, D. Guoy, M. Heath, J. Jiao, L. Kale, O. Lawlor, J. Liesen, J. Norris, D. Padua, E. Ramos, D. Reed, P. Saylor, M. Winslett, plus numerous students

3. Computer Science Research Overview
- Parallel programming environment
- Software integration framework
- Parallel component frameworks
- Component integration
- Clusters
- Parallel I/O and data migration
- Performance tools and techniques
- Computational steering
- Visualization
- Computational mathematics and geometry
- Interface propagation and interpolation
- Linear solvers and preconditioners
- Eigensolvers
- Mesh generation and adaptation

4. Software Integration Framework
- Object-oriented philosophy enforcing encapsulation and enabling polymorphism
- Minimal changes required to existing physical modules
- Minimal dependencies in component development
- Maximal flexibility for integration
- Mechanism for data exchange and function invocation between Roc* components
[Diagram: GEN2 architecture, with Roccom mediating among the Solid, Fluid, Combustion, Interface, Orchestration, and HDF I/O components]

5. GEN2 Architecture
[Diagram: GEN2 components, including MPI/Charm, Rocpanda, Rocblas, Rocface, Rocman, Roccom, Rocflo-MP, Rocflu-MP, Rocsolid, Rocfrac, Rocburn2D (ZN, APN, PY models), and the mesh tools Truegrid, Tetmesh, Metis, Gridgen, Makeflo]

6. Parallel Programming Environment
- Charm++ and AMPI embody the idea of processor virtualization
  - Load balancers
  - Communication persistence; collective operations
    - Also non-collective operations that nevertheless happen together
- AMPI
  - Ease of use: automatic packing
  - Asynchronous communication extensions
- Broadly useful across many applications; recent success with molecular dynamics
- Frameworks: unstructured-mesh framework, multiblock; Gen 2.5 success
- Scaling to the next generation of machines: BG/L

7. Virtualization: Object-based Decomposition
- Idea: divide the computation into a large number of pieces
  - Independent of the number of processors
  - Typically larger than the number of processors
- Let the system map objects to processors (sketch below)
- An old idea? G. Fox's book ('86?), DRMS (IBM), ...
- This is "virtualization++":
  - Language and runtime support for virtualization
  - Exploitation of virtualization to the hilt
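To make the over-decomposition idea concrete, here is a minimal, conceptual C++ sketch (not Charm++ itself): the work is split into many more pieces than there are processors, and a separate mapping step assigns pieces to processors. The piece count, the Piece type, and the round-robin map are illustrative assumptions; in Charm++ the runtime owns that map and can change it at run time.

```cpp
// Conceptual sketch of over-decomposition: split a 1-D domain into many
// "virtual processors" (pieces), independently of the physical processor
// count. The round-robin map stands in for the runtime's mapping decisions,
// which it is free to revise later (e.g., for load balancing).
#include <cstdio>
#include <vector>

struct Piece {                   // one virtual processor's share of the work
    int begin = 0, end = 0;      // half-open index range of the global domain
    void compute() { /* ... work on [begin, end) ... */ }
};

int main() {
    const int domainSize = 1000000;
    const int numProcs   = 64;             // physical processors (assumed)
    const int numPieces  = 8 * numProcs;   // many more pieces than processors

    // Decompose the domain into pieces; note that numProcs plays no role here.
    std::vector<Piece> pieces(numPieces);
    for (int p = 0; p < numPieces; ++p) {
        pieces[p].begin = (long long)domainSize * p / numPieces;
        pieces[p].end   = (long long)domainSize * (p + 1) / numPieces;
    }

    // A separate mapping step assigns pieces to processors.
    std::vector<int> pieceToProc(numPieces);
    for (int p = 0; p < numPieces; ++p)
        pieceToProc[p] = p % numProcs;     // simple round-robin placement

    std::printf("%d pieces mapped onto %d processors\n", numPieces, numProcs);
    return 0;
}
```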

8. Adaptive MPI
- A migration path for legacy MPI codes: gives them the dynamic load-balancing capabilities of Charm++
- AMPI = MPI + dynamic load balancing
- Uses Charm++ object arrays and migratable threads
- Minimal modifications needed to convert existing MPI programs; automated via AMPIzer (sketch below)
- Bindings for C, C++, and Fortran 90
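Below is a small, plain MPI time-stepping loop of the kind AMPI is meant to run unchanged, with each MPI "process" becoming a migratable user-level thread. It compiles against any MPI; the comment marks where an AMPI program would give the runtime an opportunity to migrate, and the exact name of that call (e.g. AMPI_Migrate) has varied across AMPI releases, so treat it as an assumption.

```cpp
// A plain MPI time-stepping loop of the kind AMPI runs unchanged, with each
// MPI "process" implemented as a migratable user-level thread.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<double> local(100000, rank);   // this rank's share of the data

    for (int step = 0; step < 100; ++step) {
        // ... local computation on `local` ...

        double mySum = 0.0, globalSum = 0.0;
        for (double x : local) mySum += x;
        MPI_Allreduce(&mySum, &globalSum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        // Under AMPI, a call here (e.g. AMPI_Migrate(...); the name is an
        // assumption and version-dependent) lets the runtime move this
        // virtual processor to another physical processor for load balance.
    }

    if (rank == 0) std::printf("done on %d ranks\n", size);
    MPI_Finalize();
    return 0;
}
```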

9. AMPI: 7 MPI processes
[Figure: seven MPI processes]

10. AMPI: Real Processors
- 7 MPI "processes" implemented as virtual processors (user-level migratable threads)

11. Virtualization Benefits
- Virtualization means running many "virtual processors" on each real processor
  - A VP may be an object, an MPI thread, etc.
- Charm++ and AMPI are examples of programming systems based on virtualization
- Virtualization leads to:
  - Message-driven (i.e., data-driven) execution
  - The ability of the runtime system to remap virtual processors to new processors
  - Several performance benefits

12. Molecular Dynamics in NAMD
- A collection of (charged) atoms, with bonds, evolved by Newtonian mechanics
- Thousands of atoms (1,000 to 500,000)
- 1-femtosecond time step; millions of steps needed
- At each time step (sketch below):
  - Calculate forces on each atom
    - Bonded forces
    - Non-bonded forces: electrostatic and van der Waals
      - Short-range: every time step
      - Long-range: every 4 time steps using PME (3D FFT), i.e., multiple time stepping
  - Calculate velocities and advance positions
- Collaboration with K. Schulten, R. Skeel, and coworkers
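A schematic C++ sketch of the multiple-time-stepping loop described above: bonded and short-range forces every step, the expensive long-range (PME-style) contribution every fourth step, then a simple velocity and position update. The force routines are empty placeholders and the integrator is deliberately naive; this is not NAMD code.

```cpp
// Schematic multiple-time-stepping (MTS) loop: short-range forces every step,
// long-range forces every `kLong` steps (as with PME in the slide). The force
// kernels are stand-ins; this is not NAMD's implementation.
#include <algorithm>
#include <vector>

struct Vec3 { double x = 0, y = 0, z = 0; };

void addBondedAndShortRangeForces(std::vector<Vec3>& f) { /* ... */ }
void addLongRangeForcesPME(std::vector<Vec3>& f)        { /* ... 3-D FFT ... */ }

int main() {
    const int nAtoms = 100000;
    const double dt  = 1.0e-15;     // 1 fs time step
    const int kLong  = 4;           // long-range forces every 4 steps
    const int nSteps = 1000;        // millions in a real simulation

    std::vector<Vec3> pos(nAtoms), vel(nAtoms), force(nAtoms);
    std::vector<Vec3> longRange(nAtoms);        // cached slow forces
    std::vector<double> mass(nAtoms, 1.0);

    for (int step = 0; step < nSteps; ++step) {
        std::fill(force.begin(), force.end(), Vec3{});
        addBondedAndShortRangeForces(force);    // every step
        if (step % kLong == 0) {
            std::fill(longRange.begin(), longRange.end(), Vec3{});
            addLongRangeForcesPME(longRange);   // every kLong steps
        }
        for (int i = 0; i < nAtoms; ++i) {      // simple explicit update
            double ax = (force[i].x + longRange[i].x) / mass[i];
            double ay = (force[i].y + longRange[i].y) / mass[i];
            double az = (force[i].z + longRange[i].z) / mass[i];
            vel[i].x += ax * dt;  vel[i].y += ay * dt;  vel[i].z += az * dt;
            pos[i].x += vel[i].x * dt;
            pos[i].y += vel[i].y * dt;
            pos[i].z += vel[i].z * dt;
        }
    }
    return 0;
}
```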

13. Virtualized Approach to Implementation: Using Charm++
[Figure: NAMD decomposition into 700 VPs, 192 + 144 VPs, and 30,000 VPs]
- These 30,000+ virtual processors (VPs) are mapped to real processors by the Charm++ runtime system

14. Synchronization overhead
- Symptom: too much time spent in barriers and scalar reductions
- Be careful: this may really be load imbalance; most processors arrive at the barrier early and wait
- The problem with barriers is not so much the direct cost of the operation itself as that it prevents the program from adjusting to small variations
  - E.g., K phases separated by barriers (or scalar reductions): the load is effectively balanced overall, but in each phase there may be slight non-deterministic load imbalance
- Let L_{i,j} be the load on the i-th processor in the j-th phase. With barriers the total time is Σ_j max_i L_{i,j}; without them it is approximately max_i Σ_j L_{i,j}, which is never larger and is strictly smaller whenever the most-loaded processor changes from phase to phase (worked example below)
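A tiny worked example of the two totals above, using an invented 2x2 load matrix in which the slower processor alternates between phases: the sum of per-phase maxima (barrier case) exceeds the maximum per-processor total (barrier-free case).

```cpp
// Toy illustration of the barrier cost formula: with barriers the time is the
// sum over phases of the per-phase maximum load; without, roughly the maximum
// over processors of their total load. The loads here are invented.
#include <algorithm>
#include <cstdio>

int main() {
    // L[i][j]: load of processor i in phase j (arbitrary example values).
    // The "slow" processor alternates between the two phases.
    const double L[2][2] = { {1.0, 2.0},
                             {2.0, 1.0} };

    double withBarriers = 0.0;                 // sum_j max_i L[i][j]
    for (int j = 0; j < 2; ++j)
        withBarriers += std::max(L[0][j], L[1][j]);

    double without = 0.0;                      // max_i sum_j L[i][j]
    for (int i = 0; i < 2; ++i)
        without = std::max(without, L[i][0] + L[i][1]);

    std::printf("with barriers: %.1f, without: %.1f\n", withBarriers, without);
    // Prints 4.0 vs 3.0: removing the barrier lets the imbalance in one phase
    // be absorbed by the other.
    return 0;
}
```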

15. Molecular Dynamics: Benefits of avoiding barriers
- In NAMD, the energy reductions were made asynchronous, and no other global barriers are used in cut-off simulations
- This came in handy when running on Pittsburgh's Lemieux (3,000 processors): the machine, plus our way of using the communication layer, produced unpredictable, random delays in communication; a send call could remain stuck for 20 ms, for example
- How did the system handle it? See the timeline plots on the next slide
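NAMD's asynchronous energy reductions go through Charm++; as a rough analogue only, the sketch below overlaps a global energy sum with further computation using the nonblocking MPI_Iallreduce from MPI-3 (which postdates these slides). The overlap pattern, not the specific call, is the point: a delayed message on one node no longer stalls every processor at a collective.

```cpp
// Overlapping a global (energy) reduction with computation, so no processor
// stalls at a barrier-like collective. MPI-3's MPI_Iallreduce is used as a
// stand-in for NAMD's Charm++-based asynchronous reductions.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double localEnergy = 1.0 + rank;   // placeholder per-rank energy
    double totalEnergy = 0.0;

    MPI_Request req;
    MPI_Iallreduce(&localEnergy, &totalEnergy, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    // ... keep computing the next step's forces while the reduction is in
    //     flight; a delayed message on one node no longer stalls all ranks ...

    MPI_Wait(&req, MPI_STATUS_IGNORE);   // complete the reduction when needed
    if (rank == 0) std::printf("total energy: %f\n", totalEnergy);

    MPI_Finalize();
    return 0;
}
```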

16. [Timeline plots from the Lemieux runs]

17. PME Parallelization
[Figure: PME parallelization, from the SC02 paper]

18. Performance: NAMD on Lemieux
- ATPase benchmark: 320,000+ atoms, including water


20. All-to-all on Lemieux for a 76-byte message
[Figure: measured all-to-all performance]

21. Impact on Application Performance
- NAMD performance on Lemieux, with the transpose step implemented using different all-to-all algorithms
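The comparison above is between all-to-all strategies for small messages. One standard strategy of that kind is a two-phase, 2-D mesh all-to-all that combines messages so each processor sends on the order of 2(sqrt(P) - 1) larger messages instead of P - 1 tiny ones. The sketch below illustrates that data movement generically, using row and column sub-communicators; it is not the specific implementation measured on Lemieux, and the block size and grid shape are assumptions.

```cpp
// Two-phase "2-D mesh" all-to-all: ranks form an R x C grid and exchange data
// first within rows, then within columns. Blocks travel in combined messages,
// so a rank sends about (C-1)+(R-1) messages instead of P-1, which helps when
// individual messages are tiny. Generic illustration only.
#include <mpi.h>
#include <algorithm>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    int C = 1;                              // choose a square grid, P = C * C
    while ((C + 1) * (C + 1) <= P) ++C;
    if (C * C != P) {
        if (rank == 0) std::printf("run this sketch with a square number of ranks\n");
        MPI_Finalize();
        return 0;
    }
    const int R = C;
    const int row = rank / C, col = rank % C;
    const int BLOCK = 10;                   // doubles per destination (assumed)

    MPI_Comm rowComm, colComm;
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &rowComm);  // peers share a row
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &colComm);  // peers share a column

    // send[d*BLOCK ..] is the block destined for global rank d.
    std::vector<double> send(P * BLOCK), pack1(P * BLOCK), stage1(P * BLOCK),
                        stage2(P * BLOCK), recv(P * BLOCK);
    for (int d = 0; d < P; ++d)
        for (int k = 0; k < BLOCK; ++k) send[d * BLOCK + k] = rank * 1000 + d;

    // Phase 1 (within my row): to the row peer in column c, send the blocks
    // destined for every rank in column c, packed in row order j = 0..R-1.
    for (int c = 0; c < C; ++c)
        for (int j = 0; j < R; ++j)
            std::copy_n(&send[(j * C + c) * BLOCK], BLOCK,
                        &pack1[(c * R + j) * BLOCK]);
    MPI_Alltoall(pack1.data(), R * BLOCK, MPI_DOUBLE,
                 stage1.data(), R * BLOCK, MPI_DOUBLE, rowComm);

    // Phase 2 (within my column): to the column peer in row j, forward the
    // blocks destined for rank (j, col), one from each source in my row.
    for (int j = 0; j < R; ++j)
        for (int c = 0; c < C; ++c)
            std::copy_n(&stage1[(c * R + j) * BLOCK], BLOCK,
                        &stage2[(j * C + c) * BLOCK]);
    MPI_Alltoall(stage2.data(), C * BLOCK, MPI_DOUBLE,
                 recv.data(), C * BLOCK, MPI_DOUBLE, colComm);

    // recv[s*BLOCK ..] now holds the block sent to me by global rank s.
    if (rank == 0) std::printf("block from rank %d: %.0f\n", P - 1,
                               recv[(P - 1) * BLOCK]);
    MPI_Finalize();
    return 0;
}
```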

22. Rocket simulation via virtual processors
[Figure: many Rocflo, Rocface, and Rocsolid virtual processors, decoupled from the physical processors]

23. AMPI and Roc*: Communication
[Figure: Rocflo, Rocface, and Rocsolid modules on separate sets of virtual processors]
- By separating independent modules onto separate sets of virtual processors, we gained the flexibility to deal with alternate formulations (sketch below):
  - Fluids and solids executing concurrently, or one after the other
  - Changes in the pattern of load distribution within or across modules
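One simple way to realize "separate sets of virtual processors per module" in plain MPI is to split the world communicator, as sketched below, so a fluids module and a solids module run concurrently on disjoint communicators. The even/odd split and the module stubs are illustrative assumptions, not the GEN2 arrangement.

```cpp
// Splitting COMM_WORLD into disjoint communicators so independent modules
// (fluids, solids) can execute concurrently on their own sets of processes.
// The even/odd split and the module stubs are illustrative only.
#include <mpi.h>
#include <cstdio>

void runFluids(MPI_Comm comm) {
    int r, n; MPI_Comm_rank(comm, &r); MPI_Comm_size(comm, &n);
    if (r == 0) std::printf("fluids module on %d ranks\n", n);
    // ... fluid solver communicates only within `comm` ...
}

void runSolids(MPI_Comm comm) {
    int r, n; MPI_Comm_rank(comm, &r); MPI_Comm_size(comm, &n);
    if (r == 0) std::printf("solids module on %d ranks\n", n);
    // ... solid solver communicates only within `comm` ...
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int module = rank % 2;          // 0 = fluids, 1 = solids (assumed)
    MPI_Comm moduleComm;
    MPI_Comm_split(MPI_COMM_WORLD, module, rank, &moduleComm);

    if (module == 0) runFluids(moduleComm);   // both modules run concurrently,
    else             runSolids(moduleComm);   // each on its own communicator

    // Interface data between the two would be exchanged across the split,
    // e.g., through an intercommunicator or a Rocface-like coupler.
    MPI_Comm_free(&moduleComm);
    MPI_Finalize();
    return 0;
}
```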

24. Parallel I/O and Data Migration
- Rocpanda 3.0 integrated into GENx
- Supports HDF for Rocketeer
- Periodic snapshots (sketch below):
  - Data sent as messages from compute processors to I/O processors
  - Computation continues on compute processors while I/O nodes write, store, and migrate data files
  - Hides over 90% of the periodic output cost
- New facility for parallel restart: restart from previously written files, even with a different number of compute or I/O processors
- Automatic tuning of parallel I/O performance
- Data migration concurrent with the application; automatic choice of data migration strategy
- New facility under development for optimizing periodic input for visualization
[Figure: compute processors, I/O processors, and disk]
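A generic MPI sketch of the snapshot pattern described above: compute ranks ship their data to a dedicated I/O rank with a nonblocking send and keep computing, while the I/O rank receives and writes. This only illustrates the idea of overlapping output with computation; Rocpanda's actual interface and its HDF output are not shown, and the single I/O rank is an assumption.

```cpp
// Offloading periodic snapshots to a dedicated I/O rank: compute ranks post a
// nonblocking send and keep computing; the I/O rank receives and "writes".
// Illustration of the pattern only; not Rocpanda's interface.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int ioRank = size - 1;              // last rank acts as I/O server
    const int N = 1000;

    if (rank == ioRank) {
        // I/O rank: collect one snapshot from every compute rank and write it.
        std::vector<double> buf(N);
        for (int src = 0; src < size - 1; ++src) {
            MPI_Recv(buf.data(), N, MPI_DOUBLE, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            // ... write buf to disk (e.g., an HDF file) while computation
            //     continues elsewhere ...
        }
        std::printf("snapshot written for %d compute ranks\n", size - 1);
    } else {
        std::vector<double> field(N, rank);   // this rank's piece of the field
        MPI_Request req;
        MPI_Isend(field.data(), N, MPI_DOUBLE, ioRank, 0,
                  MPI_COMM_WORLD, &req);
        // ... continue the next computation step while the snapshot drains ...
        MPI_Wait(&req, MPI_STATUS_IGNORE);    // must not reuse field before this
    }

    MPI_Finalize();
    return 0;
}
```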

25. Performance Tuning with SvPablo
- Source code browsing and performance instrumentation
- Instrumented GEN2 produces SDDF performance data
- SDDF performance data linked to source code
- Helps identify problematic source code

26. Effect of Job Topology on Performance
[Figure: RocfloMP performance with 16 vs. 15 processors per node]
- RocfloMP on Frost (16-processor-per-node IBM SP3)
- Communication bottleneck identified with SvPablo
- The system OS uses cycles on one processor per node
- Bottleneck eliminated by leaving that processor idle (15 application processes per node)

27. Data Transfer Between Components
- Common refinement of nonmatching meshes
  - Differing topological structures, geometric realizations, and partitionings
  - Complex geometries
  - Efficient data structures
- Accurate and conservative data transfer (sketch below)
  - Node- or element-centered data
  - Conservation enforced; errors minimized
- Efficient parallel implementation
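To make "conservative transfer" concrete, here is a minimal 1-D analogue: element-centered values move from one mesh to a nonmatching one by weighting each source value with the length of the overlap between source and target cells, so the integral of the field is preserved exactly. The real CSAR transfer operates on partitioned 3-D surface meshes via a common refinement; this toy keeps only the conservation idea.

```cpp
// Conservative transfer of element-centered data between two nonmatching 1-D
// meshes: each target cell receives the overlap-length-weighted average of the
// source cells it intersects, so the integral of the field is preserved.
// A toy analogue of common-refinement-based transfer, not the CSAR code.
#include <algorithm>
#include <cstdio>
#include <vector>

// Transfer cell values `srcVal` on mesh `src` (node coordinates, ascending)
// onto mesh `tgt` covering the same interval; returns target cell values.
std::vector<double> transfer(const std::vector<double>& src,
                             const std::vector<double>& srcVal,
                             const std::vector<double>& tgt) {
    std::vector<double> out(tgt.size() - 1, 0.0);
    for (size_t t = 0; t + 1 < tgt.size(); ++t) {
        for (size_t s = 0; s + 1 < src.size(); ++s) {
            double lo = std::max(src[s], tgt[t]);       // overlap of the cells
            double hi = std::min(src[s + 1], tgt[t + 1]);
            if (hi > lo) out[t] += srcVal[s] * (hi - lo);
        }
        out[t] /= (tgt[t + 1] - tgt[t]);                // back to a cell average
    }
    return out;
}

int main() {
    std::vector<double> src = {0.0, 0.3, 0.7, 1.0};     // 3 source cells
    std::vector<double> srcVal = {1.0, 2.0, 4.0};
    std::vector<double> tgt = {0.0, 0.5, 1.0};          // 2 target cells

    std::vector<double> tgtVal = transfer(src, srcVal, tgt);

    double srcInt = 0.0, tgtInt = 0.0;                  // check conservation
    for (size_t s = 0; s + 1 < src.size(); ++s) srcInt += srcVal[s] * (src[s + 1] - src[s]);
    for (size_t t = 0; t + 1 < tgt.size(); ++t) tgtInt += tgtVal[t] * (tgt[t + 1] - tgt[t]);
    std::printf("integrals: source %.3f, target %.3f\n", srcInt, tgtInt);
    return 0;
}
```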

28. Experimental Results
[Figures: CSAR method vs. conventional method]
- Burning cavity with uniform pressure and regression
- Cumulative error in displacements after 500 time steps
- Shortened lab-scale rocket with triangular solid and quadrilateral fluid interface meshes

29. Mesh Adaptation and Refinement
- Dynamic global refinement or coarsening of the mesh (sketch below)
- Adapt the mesh to changing geometry in 3-D
- Example: as propellant burns away (60%), the solid mesh is compressed and needs repair
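As a much simplified, concrete example of global refinement, the sketch below uniformly refines a 2-D triangle mesh by splitting every triangle into four at its edge midpoints, sharing midpoints between neighbors through an edge map. Real 3-D adaptation with coarsening and mesh repair, as in the slide, is considerably more involved.

```cpp
// Uniform ("red") refinement of a 2-D triangle mesh: every triangle is split
// into four using edge midpoints, shared between neighboring triangles via an
// edge-to-midpoint map. A toy stand-in for the 3-D adaptation in the slide.
#include <algorithm>
#include <array>
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

struct Pt { double x, y; };
using Tri = std::array<int, 3>;   // vertex indices

int midpoint(int a, int b, std::vector<Pt>& verts,
             std::map<std::pair<int, int>, int>& cache) {
    std::pair<int, int> key = std::minmax(a, b);  // edge as a sorted pair
    auto it = cache.find(key);
    if (it != cache.end()) return it->second;     // midpoint already created
    verts.push_back({0.5 * (verts[a].x + verts[b].x),
                     0.5 * (verts[a].y + verts[b].y)});
    int idx = (int)verts.size() - 1;
    cache[key] = idx;
    return idx;
}

void refine(std::vector<Pt>& verts, std::vector<Tri>& tris) {
    std::map<std::pair<int, int>, int> mid;       // edge -> midpoint index
    std::vector<Tri> out;
    for (const Tri& t : tris) {
        int ab = midpoint(t[0], t[1], verts, mid);
        int bc = midpoint(t[1], t[2], verts, mid);
        int ca = midpoint(t[2], t[0], verts, mid);
        out.push_back({t[0], ab, ca});
        out.push_back({ab, t[1], bc});
        out.push_back({ca, bc, t[2]});
        out.push_back({ab, bc, ca});              // the central triangle
    }
    tris.swap(out);
}

int main() {
    std::vector<Pt> verts = {{0, 0}, {1, 0}, {0, 1}, {1, 1}};
    std::vector<Tri> tris = {{0, 1, 2}, {1, 3, 2}};   // two triangles, shared edge
    refine(verts, tris);
    std::printf("%zu triangles, %zu vertices after one refinement\n",
                tris.size(), verts.size());            // 8 triangles, 9 vertices
    return 0;
}
```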

30. Linear Solvers and Preconditioners
- Optimally truncated GMRES method adapted for solving sequences of linear systems in time-dependent problems
- Preconditioners developed for indefinite linear systems arising in constrained problems
- Further work planned on domain decomposition and approximate-inverse preconditioners
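One modest payoff of treating the solves as a sequence is sketched below with a deliberately basic stand-in solver (Jacobi on a fixed tridiagonal matrix, my own choice for brevity): seeding each time step's solve with the previous solution reduces the iteration count when the right-hand side varies slowly. The optimally truncated GMRES method and the preconditioners mentioned in the slide go well beyond this and are not reproduced here.

```cpp
// Solving a sequence of slowly varying linear systems A x = b(t): warm-starting
// each solve with the previous time step's solution cuts the iteration count.
// Jacobi iteration on a fixed tridiagonal matrix is a deliberately simple
// stand-in solver; the slide's truncated GMRES is not reproduced here.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Jacobi iterations for the tridiagonal system with diagonal 4 and
// off-diagonals -1; returns the number of iterations to reach `tol`.
int jacobiSolve(const std::vector<double>& b, std::vector<double>& x,
                double tol = 1e-8, int maxIter = 10000) {
    const int n = (int)b.size();
    std::vector<double> xn(n);
    for (int it = 1; it <= maxIter; ++it) {
        double diff = 0.0;
        for (int i = 0; i < n; ++i) {
            double off = 0.0;
            if (i > 0)     off -= x[i - 1];
            if (i + 1 < n) off -= x[i + 1];
            xn[i] = (b[i] - off) / 4.0;
            diff = std::max(diff, std::fabs(xn[i] - x[i]));
        }
        x = xn;
        if (diff < tol) return it;
    }
    return maxIter;
}

int main() {
    const int n = 200, steps = 5;
    std::vector<double> xWarm(n, 0.0);          // carried across time steps

    for (int t = 0; t < steps; ++t) {
        std::vector<double> b(n);
        for (int i = 0; i < n; ++i)             // slowly varying right-hand side
            b[i] = std::sin(0.1 * i) + 0.01 * t;

        std::vector<double> zero(n, 0.0);
        int coldIters = jacobiSolve(b, zero);   // restart from scratch each step
        int warmIters = jacobiSolve(b, xWarm);  // reuse previous step's solution

        std::printf("step %d: cold start %d iterations, warm start %d\n",
                    t, coldIters, warmIters);
    }
    return 0;
}
```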

31. http://www.csar.uiuc.edu

32. Michael T. Heath, Director
Center for Simulation of Advanced Rockets
University of Illinois at Urbana-Champaign
2262 Digital Computer Laboratory
1304 West Springfield Avenue
Urbana, IL 61801 USA
m-heath@uiuc.edu
http://www.csar.uiuc.edu
Telephone: 217-333-6268
Fax: 217-333-1910

33. GEN2 Performance
[Figure: wall clock time per iteration (seconds) vs. number of processors for GEN2 on a scaled problem]

34. GEN2 Performance on Blue Horizon

