1 Resolution of large symmetric eigenproblems on a world-wide grid
Laurent Choy, Serge Petiton, Mitsuhisa Sato
CNRS/LIFL; HPCS Lab., University of Tsukuba
2nd NEGST workshop, Tokyo, May 2007

2 Outline
- Introduction
- Distribution of the numerical method
- Experiments
  - Experiments on world-wide grids: platforms, numerical settings
  - Experiments on Grid'5000: motivations, platforms, numerical settings
  - Results
- YML
  - Progress of YML
  - YvetteML workflow of the real symmetric eigenproblem
  - First experiments
- Conclusion

3 Outline
➔ Introduction
- Distribution of the numerical method
- Experiments
  - Experiments on world-wide grids: platforms, numerical settings
  - Experiments on Grid'5000: motivations, platforms, numerical settings
  - Results
- YML
  - Progress of YML
  - YvetteML workflow of the real symmetric eigenproblem
  - First experiments
- Conclusion

4 Introduction
- Huge number of nodes connected to the Internet
  - Clusters and NOWs of institutions, PCs of individual users (volunteer computing)
- Constant availability of nodes, on-demand access
- HPC and large grid computing are complementary
  - We do not target the highest performance
  - We target a different community of users
- Why the real symmetric eigenproblem?
  - Requires a lot of resources on the nodes
  - Communications, synchronization points
  - Useful problem
  - Few similar studies for very large grid computing

5 Outline
- Introduction
➔ Distribution of the numerical method
- Experiments
  - Experiments on world-wide grids: platforms, numerical settings
  - Experiments on Grid'5000: motivations, platforms, numerical settings
  - Results
- YML
  - Progress of YML
  - YvetteML workflow of the real symmetric eigenproblem
  - First experiments
- Conclusion

6 Distribution of the numerical method (1/2)
- Real symmetric eigenproblem: Au = λu, A real symmetric
- Main steps:
  - Lanczos tridiagonalization
    - T = QᵀAQ, T real symmetric tridiagonal
    - Data accessed only by means of MVPs (matrix-vector products)
  - Bisection and Inverse Iteration
    - Tv = λv, same eigenvalues as A (Ritz eigenvalues)
    - Communication-free parallelism: task-farming
  - Ritz eigenvector computations (u = Qv)
  - Accuracy tests: ‖Au - λu‖₂ < eps
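To make these steps concrete, here is a minimal single-node sketch in Python (NumPy/SciPy). It is not the distributed OmniRPC implementation of the talk: the `lanczos` helper, the toy random matrix, and the problem sizes are illustrative assumptions; only the numerical pipeline named above (Lanczos, then bisection and inverse iteration on T, then Ritz vectors and the residual test) is reproduced.

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

def lanczos(matvec, n, m, seed=0):
    """m steps of Lanczos with full reorthogonalization; A is touched only through matvec."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n, m))
    alpha = np.zeros(m)
    beta = np.zeros(m - 1)
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)
    for j in range(m):
        Q[:, j] = q
        w = matvec(q)                                 # the MVP mentioned on the slide
        alpha[j] = q @ w
        w = w - alpha[j] * q
        if j > 0:
            w -= beta[j - 1] * Q[:, j - 1]
        w -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)      # reorthogonalization against Q
        if j < m - 1:
            beta[j] = np.linalg.norm(w)
            q = w / beta[j]
    return alpha, beta, Q

n, m = 2000, 25
A = np.random.rand(n, n)
A = (A + A.T) / 2                                     # toy real symmetric matrix

alpha, beta, Q = lanczos(lambda x: A @ x, n, m)

# Bisection + inverse iteration on the small tridiagonal T (LAPACK stebz/stein);
# in the talk this stage is task-farmed because it needs no communication.
theta, V = eigh_tridiagonal(alpha, beta, select='i',
                            select_range=(m - 3, m - 1),
                            lapack_driver='stebz')

U = Q @ V                                             # Ritz eigenvectors of A
for lam, u in zip(theta, U.T):
    residual = np.linalg.norm(A @ u - lam * u)        # accuracy test ||Au - λu||2
    print(f"lambda ~= {lam:10.4f}   ||Au - lambda*u||2 = {residual:.2e}")
```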

7 Distribution of the numerical method (2/2)
- Reducing the memory usage
  - Out-of-core storage
  - Restarted scheme
    - Reorthogonalization
    - Bisection, Inverse Iteration
    - Reduces the disk usage too
- Volume of communications: data persistence (A and Q)
- Number of communications: task-farming
- Other issue to be improved: distribution of A
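As an illustration of the data-persistence item above, the sketch below keeps each worker's block of A resident between calls, so a distributed matrix-vector product only ships the vector q and the partial results. It uses plain Python multiprocessing on one machine purely for illustration; the talk's implementation distributes the blocks over a grid with OmniRPC and also persists Q on the workers, which is not shown here. The names `init_worker`, `local_mvp`, and the pool-per-block setup are illustrative.

```python
import numpy as np
from multiprocessing import Pool

A_BLOCK = None               # per-worker persistent state: one block of rows of A

def init_worker(block):
    global A_BLOCK
    A_BLOCK = block          # sent once when the worker starts, then reused

def local_mvp(q):
    return A_BLOCK @ q       # only the vector q travels for each MVP

if __name__ == "__main__":
    n, n_workers = 2000, 4
    A = np.random.rand(n, n)
    A = (A + A.T) / 2                                  # toy real symmetric matrix
    blocks = np.array_split(A, n_workers, axis=0)      # row blocks, one per worker

    # One single-process pool per block, so each worker owns exactly one block of A.
    pools = [Pool(1, initializer=init_worker, initargs=(b,)) for b in blocks]

    q = np.random.rand(n)
    # Distributed MVP: fan q out, gather and concatenate the partial products.
    parts = [p.apply_async(local_mvp, (q,)) for p in pools]
    y = np.concatenate([r.get() for r in parts])
    print(np.allclose(y, A @ q))                       # check against the direct product

    for p in pools:
        p.close()
        p.join()
```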

8 Outline
- Introduction
- Distribution of the numerical method
➔ Experiments
  ➔ Experiments on world-wide grids: platforms, numerical settings
  ➔ Experiments on Grid'5000: motivations, platforms, numerical settings
  - Results
- YML
  - Progress of YML
  - YvetteML workflow of the real symmetric eigenproblem
  - First experiments
- Conclusion

9 World-wide grid experiments: experimental platforms, numerical settings (1/2)
- Computing and network resources
  - University of Tsukuba
    - Homogeneous dedicated clusters
    - Dual Xeon ~3 GHz, 1 to 4 GB
  - University of Lille 1
    - Heterogeneous NOWs
    - Celeron 1.4 GHz to P4 3.2 GHz, 128 MB to 1 GB
    - Shared with students
  - Sites connected through the Internet

10 World-wide grid experiments: experimental platforms, numerical settings (2/2)
- 4 platforms (OmniRPC)
  - 2 local platforms: 29 / 58 nodes, Lille
  - 2 world-wide platforms:
    - 58 (29 Lille + 29 Tsukuba dual-proc.)
    - 116 (58 Lille + 58 Tsukuba dual-proc.)
- Matrix: N = …, … million elements, avg. 48 nnz/row
- Parameters: M = 10, 15, 20, 25; K = 1, 2, 3, 4

11 Grid'5000 experiments: presentation, motivations
- Up to 9 sites distributed in France
  - Dedicated PCs with a reservation policy
  - Fast and dedicated network: RENATER (1 Gbit/s to 10 Gbit/s)
  - PCs are homogeneous (few exceptions)
  - Homogeneous environment (deployment strategy)
- For these experiments:
  - Orsay: up to 300 single-CPU nodes
  - Lille: up to 60 single-CPU nodes
  - Nice: up to 60 dual-CPU nodes
  - Rennes: up to 70 dual-CPU nodes

12 Grid'5000 experiments: platforms and numerical settings (1/2)
- Step 1: goal: improving the previous analysis
- Platforms:
  - 29 Orsay, single-proc.
  - 58 Orsay, single-proc.
  - 58 Lille + Sophia dual-proc.
  - 116 Orsay + Sophia dual-proc. (1 core/proc.)
  - Orsay + Lille + Sophia dual-proc. (1 core/proc.)
  - 1 process per dual-processor node
- Numerical settings:
  - Matrix: N = 47792, 2.5 million elements, avg. 48 nnz/row
  - Parameters: m = 10, 15, 20, 25; k = 1, 2, 3, 4

13 Grid'5000 experiments: platforms and numerical settings (2/2)
- Step 2: goal: increasing the size of the problem (in progress)
  - Matrix: N = 430128, 193 million elements
  - 7 OmniRPC relay nodes, 206 CPUs (3 sites)
  - 11 OmniRPC relay nodes, 412 CPUs (4 sites)
  - k = 1, m = 15

14 Outline
- Introduction
- Distribution of the numerical method
- Experiments
  - Experiments on world-wide grids: platforms, numerical settings
  - Experiments on Grid'5000: motivations, platforms, numerical settings
  ➔ Results
- YML
  - Progress of YML
  - YvetteML workflow of the real symmetric eigenproblem
  - First experiments
- Conclusion

15 World-wide grid experiments: results
[Wall-clock-time chart; platforms in the legend: 29 single-proc. Lille; 58 single-proc. Lille; 58 single-proc. Lille + dual-proc. Tsukuba (all processors used); 116 single-proc. Lille + dual-proc. Tsukuba (all processors used).]

16 Grid'5000 experiments, step 1: results
[Wall-clock-time chart; platforms in the legend: 29 single-proc. Orsay; 58 single-proc. Orsay; 58 single-proc. Lille + dual-proc. Sophia (all processors used); 116 single-proc. Orsay + dual-proc. Sophia (all processors used); 116 single-proc. Orsay + single-proc. Lille + dual-proc. Sophia (1 processor used).]

17 Grid'5000 experiments, step 2: results
Details for N = 430128, m = 15, k = 1; wall-clock times in seconds; accuracy test ‖Au - λu‖₂ < eps; runs on 206 and 412 CPUs:
- Lanczos tridiagonalization: send new column of Q: 20 / 22; MVP: …; reorthogonalization: 159 / 129
- Bisection + Inverse Iteration: 9 / <1
- Ritz eigenvector computation: 119
- Evaluation of the wall-clock time for 1 MVP with the matrix A:
  - In the tridiagonalization: 15 (m) × 5 (nb restarts) = 75 MVPs; 134 s (206 CPUs) and 164 s (412 CPUs) per MVP
  - In the tests of convergence: 5 (nb restarts) MVPs; 138 s (206 CPUs) and 162 s (412 CPUs) per MVP
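Written out, the per-MVP estimate above simply divides the measured wall-clock time of a phase by the number of MVPs that phase performs; the symbols n_MVP, T_tridiag, and t_MVP below are introduced here for clarity and do not appear on the slide:

$$
n_{\mathrm{MVP}}^{\mathrm{tridiag}} = m \times n_{\mathrm{restarts}} = 15 \times 5 = 75,
\qquad
t_{\mathrm{MVP}} \approx \frac{T_{\mathrm{tridiag}}}{n_{\mathrm{MVP}}^{\mathrm{tridiag}}}
\approx 134~\mathrm{s}~(206~\mathrm{CPUs}),\ 164~\mathrm{s}~(412~\mathrm{CPUs}).
$$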

18 Outline
- Introduction
- Distribution of the numerical method
- Experiments
  - Experiments on world-wide grids: platforms, numerical settings
  - Experiments on Grid'5000: motivations, platforms, numerical settings
  - Results
➔ YML
  ➔ Progress of YML
  ➔ YvetteML workflow of the real symmetric eigenproblem
  ➔ First experiments
- Conclusion

19 Progress of YML
- YML
  - Stability, error reporting
  - Collections of data (out-of-core)
  - Variable lists of parameters
  - Parameters in/out of the workflow
- Mainly developed at the PRiSM laboratory, University of Versailles
  - Olivier Delannoy, Nahid Emad

20 Resolution of the eigenproblem with YML
- No data persistence
  - Future work: binary cache
- Re-usability / aggregation of components

21 Experiments with YML & OmniRPC back-end
[Table comparing wall-clock times in minutes for the YML + OmniRPC back-end vs. plain OmniRPC, and the resulting overhead in %.]
- Sources of overhead
  - No computation in the YvetteML workflow
  - Scheduler, (un)packing of the parameters
  - Transfers of binaries
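For reference, the overhead column of such a comparison is presumably the relative extra wall-clock time of the YML run over the plain OmniRPC run; this definition is an assumption, since the slide does not spell it out:

$$
\mathrm{overhead}\ (\%) = \frac{T_{\mathrm{YML+OmniRPC}} - T_{\mathrm{OmniRPC}}}{T_{\mathrm{OmniRPC}}} \times 100.
$$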

22 Outline
- Introduction
- Distribution of the numerical method
- Experiments
  - Experiments on world-wide grids: platforms, numerical settings
  - Experiments on Grid'5000: motivations, platforms, numerical settings
  - Results
- YML
  - Progress of YML
  - YvetteML workflow of the real symmetric eigenproblem
  - First experiments
➔ Conclusion

23 Conclusion (1/3)
- Reminder of the scope of this work
  - Large grid computing and HPC: complementary tools
    - Used by people who have no access to HPC
    - Significant computations (size of the problem)
    - We do not (cannot) target the highest performance
      - The resources are not dedicated
      - Slow networks, heterogeneous machines, external perturbations, etc.
    - Linear algebra problems are useful for many general applications
- Differences with HPC and cluster computing
  - We must not take a "speed-up" approach to the computations
  - Recommendations to save resources on the nodes

24 Conclusion (2/3)
- We propose:
  - A scalable real symmetric eigensolver for large grids
    - Next expected bounding limit: disk space, for much larger or very dense matrices
  - Before implementing the method, key choices must be made
    - Numerical methods and programming paradigms:
      - Bisection (task-farming)
      - Restarted scheme (memory and disk)
      - Out-of-core (memory)
      - Data persistence (communication)
  - A new version of YML
  - A workflow of the eigensolver and re-usable components (in progress)

25 Conclusion (3/3)
- Topics of study for the eigensolver
  - Improving the distribution of A
  - Testing more matrices
    - Different kinds of matrices (e.g. sparse, dense)
    - Larger matrices
  - Scheduling level: adapting the workload balancing to the heterogeneity of the platforms
- Current and future work on YML
  - Finishing the multi-back-end support
  - Binary cache