Gabriel Antoniu Researcher at INRIA Rennes - Bretagne Atlantique, France since 2002 – Ph.D. at Ecole Normale Supérieure de Lyon, France (2001) – Habilitation.

1 Gabriel Antoniu Researcher at INRIA Rennes - Bretagne Atlantique, France since 2002 – Ph.D. at Ecole Normale Supérieure de Lyon, France (2001) – Habilitation thesis at Ecole Normale de Cachan - Brittany Extension (2009) E-mail: Main research interests – Data management on large-scale, distributed architectures – Focus: transparent access model Clusters, grids, clouds Contribution: Grid Data-sharing Service: GDS = DSM + P2P – 3 Ph.D. theses (2003-2009) – Issues Peer-to-peer techniques applied to grid computing (e.g. JXTA on the grid) Data consistency and fault tolerance – For grid data-sharing (past): JuxMem - – For cloud storage (future): BlobSeer -

2 Henri Bal Vrije Universiteit Amsterdam Research interests: Parallel & distributed programming environments Best known for: Orca, Manta, MagPIe, Albatross, Ibis DAS-1 - DAS-4 testbeds Solving Awari HPDC, CCGrid PC chair FT interests: FT grid PE’s (Satin d&c, object replication)

3 George Bosilca Innovative Computing Laboratory University of Tennessee Distributed Computing: Programming Models, Message Passing, Runtime Environments, Scalability, Fault Tolerance OVM, MPICH-V, FT-MPI, Open MPI, STCI

4 Franck Cappello - INRIA & UIUC Director of the INRIA-UIUC Joint Laboratory on PetaScale Computing Initiator of Grid’5000 (and director during its research phase) Leader of INRIA Grand-Large group (MPICH-V, XtremWeb) Executive committee member of IESP (International Exascale Software Project) Main domains of interest: - Fault tolerance for large scale applications on large scale systems - Programming models and environments (including FT concerns) - Convergence between HPC and Cloud (transactional HPC computing model?) Current main work: - Solid: a directive based programming and runtime environment for FT in HPC systems - Example of open question: use of SSD devices for FT in HPC systems

5 Dagstuhl Seminar on Fault Tolerance in High-Performance Computing and Grids, Schloss Dagstuhl, Wadern, Germany, May 3-8, 2009. Dr. Christian Engelmann Oak Ridge National Laboratory (ORNL) Background – 9 years of fault tolerance research at ORNL Motivation – ORNL has large-scale HPC systems (0.5 and 1 PFlops) – Resilience is an urgent priority for systems with 100,000 and more cores Research interests – Proactive fault tolerance Fault prediction Preemptive migration – Soft-error resilience Computational redundancy

6 Dagstuhl, May 2009 Dick H.J. Epema Delft University of Technology, the Netherlands web: Parallel and Distributed Systems Group Main research interests: grids and clouds peer-to-peer systems performance Web resources: Main research achievements: processor co-allocation the KOALA grid scheduler the Grids Workload Archive Condor Flocking the Tribler P2P system P2P live video and VoD cooperative downloading Fault tolerance research interests: how to schedule in the face of failures how to model failures

7 Name: Wolfgang Frings Institution: Jülich Supercomputing Centre, Forschungszentrum Jülich, Germany Research Interests: Parallel I/O, SIONlib, Performance analysis, Benchmarking (parallel applications), HPC-Tool development: JuBE: Juelich Benchmark Environment LLview: Batch system monitoring

8 Richard L. Graham Oak Ridge National Laboratory Research Interests: Run time environments Programming environments Most known for: One of three founders of Open MPI Chairman of the MPI Forum

9 Amina Guermouche PhD Student Grand Large, INRIA Paris Sud University guermou@lri.fr Dagstuhl Fault Tolerance Current main work: Solid: - a directive based programming and runtime environment for FT in HPC systems - transforms a code into a set of blocks, each one with user-specified FT Open question: How to mix different FT approaches in the same code?

10 Paul H. Hargrove Lawrence Berkeley National Lab Berkeley, California U.S.A. Research Areas – Berkeley Lab Checkpoint/Restart – – PGAS Language runtime support (UPC & GASNet) –

11 Hermann Härtig Microkernels (L4) and multi-server OS System Security: Very small application-specific Trusted Computing Bases (for example, 150 LoC for Bank Transaction) Virtual Machines (L4Linux, …) Real-Time (hard, probabilistic, L4Linux as NRT guest)

12 Name: Thomas HERAULT E-mail: Institution(s): – Ass. Prof. at Université Paris-Sud (France) – Member of the Grand-Large team of INRIA – Visiting Scholar at the University of Tennessee (ICL) FT-Areas: – Rollback/Recovery in Message Passing Systems – Self-Stabilization – Application-level Fault Tolerance FT research Focus: – Automatic & Transparent Fault Tolerance in MPI using Rollback/Recovery – Fault Tolerant Runtime Environment (using Self-Stabilization Techniques) Other research areas: – MPI & Grids – Large Data Movements (Grids) – Model Checking – Theoretical Aspects of Self-Stabilization

13 Laxmikant (Sanjay) Kale University of Illinois at Urbana-Champaign – Parallel Programming Laboratory (20+ years) – – Interests: parallel programming abstractions, Adaptive Runtime, CSE apps, Fault tolerance Known for: – Object-based overdecomposition and adaptive RTS – Charm++, Adaptive MPI, recent: Charisma, MSA – Apps: NAMD, OpenAtom (nano), ChaNGa (astro) Fault tolerance: – multiple schemes in Charm++, – One where MTBF can be lower than checkpoint period!

14 Dagstuhl FT Workshop Rainer Keller, ORNL, Past experience at the High-Performance Computing Center Stuttgart (HLRS): – Applications, Models and Tools – PACX-MPI – Open MPI Doing PostDoc at ORNL on: – Open MPI – STCI Main Interest: Fault Tolerance in I/O

15 Barry Linnert pdf

16 Jörg Schneider pdf

17 Volker Lindenstruth Kirchhoff Institute for Physics Chair of Computer Science University Heidelberg, Germany Phone: +49 6221 54 9800 Fax: +49 6221 54 9809 ALICE HLT CHAIR FIAS Fellow GSI FAIR Computing Coordinator ALICE TRD Trigger 280000 core MPP system ALICE HLT Trigger HPC Cluster, FPGA, GPU Frankfurt Landesrechner

18 Xiaosong Ma Appointment – Assistant Professor, NC State University, USA – Joint Faculty Member, Oak Ridge National Lab, USA Contact – Research interest: HEC storage, parallel I/O, high-performance sequence search, cloud computing Known for? Active buffering for collective I/O, parallel I/O in mpiBLAST, FreeLoader

19 Frank Mueller, North Carolina State Univ. FT areas of interest: HPC, OS, I/O – Other areas: HPC tools, compilers, real-time/embedded Focus: FT around MPI (but also in map-reduce) – Scalable network overlays – Reactive and proactive FT – Process and OS level Open problems: – Scalability – I/O bandwidth – Sync/async chkpt abstraction through OS/process layers – Benefit of proactive FT / health monitoring – Standardization – Cross-community fertilization

20 Chokchai Box Leangsuksun SWEPCO Endowed Professor, Computer Science Director, High Performance Computing Initiative Louisiana Tech University *SWEPCO endowed professorship is made possible by LA Board of Regents Research Interest – Resilience in HPC, Cluster computing – Failure Analysis and Modeling – Reliability-aware Runtime & scheduling – Near realtime resilience modeling – Checkpoint/migration scheduling – Virtualization for Resilience – Heterogeneity = Host + accelerator

21 Supporting Fault-Tolerance in Modern High-End Computing Systems with InfiniBand Dhabaleswar K. (DK) Panda The Ohio State University E-mail: Research Interests: HPC, InfiniBand, Fault-Tolerance, MPI (MVAPICH/MVAPICH2 project)

22 Alexander Reinefeld Affiliations – Zuse Institute Berlin (ZIB) – Humboldt-Universität zu Berlin Interests – distributed computing, P2P algorithms, data management – supercomputing (we operate a 300 Tflops system) – HW accelerators: Nvidia, FPGA, Clearspeed, … – parallel tree search algorithms

23 Florian Schintke Affiliation – Zuse Institute Berlin (ZIB) Interests – distributed data management – P2P algorithms, structured overlays – scalable systems

24 Dr. Stephen L. Scott Oak Ridge National Laboratory (ORNL) Background – At ORNL since 1996 Senior Research Scientist Systems Research Team (team lead) – >10 years working in area of Fault Tolerance / Resilience Distributed control → HA/FT clusters → Resilience – HA-OSCAR – HAPCW High Availability and Performance Computing Workshop (since 2003) – Resilience Workshop (CCGrid08, HPDC09 Munich – June 11-13) – Resilience Summit (LACSS08, 09?) Motivation – It's not “if they fail” but “when…” Research interests – I like to solve problems… – Resilience Reactive fault tolerance Proactive fault tolerance Algorithms – Systems software Virtualization Tools

25 Dagstuhl FT Intro Slide Eugen Staab – PhD student – University of Luxembourg – Research interests: – Sabotage Tolerance – Result Checking – Trust Volunteer Computing Desktop Grids P2P Computing Fault tolerance for cases where: No control over machines that execute computations

26 May 4-8, 2009 Dagstuhl FT © NEC Laboratories Europe NEC Laboratories Europe St. Augustin (which is close to Bonn) Germany Known (???) for MPI implementations: MPI/SX for NEC SX-vector machines, Earth Simulator, … Jesper Larsson Träff MPI Forum: one-sided communication, topology interface, collectives (blocking and non-blocking), … Communication algorithms for MPI and other PP/HPC interfaces What algorithmic (and other) support is needed for what kind of FT?

27 Paolo Trunfio University of Calabria - Italy Assistant professor of Computer Engineering at University of Calabria Research interests: – Distributed data mining Weka4WS – Grid computing Knowledge Grid – Peer-to-Peer networks Resource/Service discovery – Distributed programming MapReduce in P2P/Grid systems FT focus: – Making MapReduce reliable in dynamic distributed environments (P2P, Grids) Talk this afternoon

28 Geoffroy Vallée Oak Ridge National Laboratory Email: Research interest: operating systems, system resilience, tools for HPC System resilience – System policies for fault-tolerance – Process checkpoint/restart/migration (Kerrighed) – Virtual machine checkpoint/restart/migration
