Gabriel Antoniu Researcher at INRIA Rennes - Bretagne Atlantique, France since 2002 – Ph.D. at Ecole Normale Supérieure de Lyon, France (2001) – Habilitation.

1 Gabriel Antoniu Researcher at INRIA Rennes - Bretagne Atlantique, France since 2002 – Ph.D. at Ecole Normale Supérieure de Lyon, France (2001) – Habilitation thesis at Ecole Normale de Cachan - Brittany Extension (2009) Main research interests – Data management on large-scale, distributed architectures – Focus: transparent access model Clusters, grids, clouds Contribution: Grid Data-sharing Service: GDS = DSM + P2P – 3 Ph.D. theses ( ) – Issues Peer-to-peer techniques applied to grid computing (e.g. JXTA on the grid) Data consistency and fault tolerance – For grid data-sharing (past): JuxMem - – For cloud storage (future): BlobSeer -

2 Henri Bal Vrije Universiteit Amsterdam Research interests: Parallel & distributed programming environments Best known for: Orca, Manta, MagPIe, Albatross, Ibis DAS-1 - DAS-4 testbeds Solving Awari HPDC, CCGrid PC chair FT interests: FT grid PE’s (Satin d&c, object replication)

3 George Bosilca Innovative Computing Laboratory University of Tennessee Distributed Computing: Programming Models, Message Passing, Runtime Environments, Scalability, Fault Tolerance OVM, MPICH-V, FT-MPI, Open MPI, STCI

4 Franck Cappello - INRIA & UIUC Director of the INRIA-UIUC Joint laboratory on PetaScale Computing Initiator of Grid’5000 (and director during its research phase) Leader of INRIA Grand-Large group (MPICH-V, XtremWeb) Executive committee member of IESP (International Exascale Software Project) Main domains of interest: - Fault tolerance for large scale applications on large scale systems - Programming models and environments (including FT concerns) - Convergence between HPC and Cloud (transactional HPC computing model?) Current main work: - Solid: a directive based programming and runtime environment for FT in HPC systems - Example of open question: use of SSD devices for FT in HPC systems

5 Dagstuhl Seminar on Fault Tolerance in High-Performance Computing and Grids, Schloss Dagstuhl, Wadern, Germany, May 3-8, Dr. Christian Engelmann Oak Ridge National Laboratory (ORNL) Background – 9 years of fault tolerance research at ORNL Motivation – ORNL has large-scale HPC systems (0.5 and 1 PFlops) – Resilience is an urgent priority for systems with 100,000 and more cores Research interests – Proactive fault tolerance Fault prediction Preemptive migration – Soft-error resilience Computational redundancy

6 Dagstuhl, May Dick H.J. Epema Delft University of Technology, the Netherlands web: Parallel and Distributed Systems Group Main research interests: grids and clouds peer-to-peer systems performance Web resources: gwa.ewi.tudelft.nl Main research achievements: processor co-allocation the KOALA grid scheduler the Grids Workload Archive Condor Flocking the Tribler P2P system P2P live video and VoD cooperative downloading Fault tolerance research interests: how to schedule in the face of failures how to model failures

7 Name: Wolfgang Frings Institution: Jülich Supercomputing Centre, Forschungszentrum Jülich, Germany Research Interests: Parallel I/O, SIONlib, Performance analysis, Benchmarking (parallel applications), HPC-Tool development: JuBE: Juelich Benchmark Environment LLview: Batch system monitoring

8 Richard L. Graham Oak Ridge National Laboratory Research Interests: Run time environments Programming environments Most known for: One of three founders of Open MPI Chairman of the MPI Forum

9 AMINA GUERMOUCHE PHD STUDENT GRAND LARGE, INRIA PARIS SUD UNIVERSITY FR Dagstuhl Fault Tolerance Current main work: Solid: - a directive based programming and runtime environment for FT in HPC systems - transforms a code into a set of blocks, each one with user specified FT Open question: How to mix different FT approaches in the same code?

10 Paul H. Hargrove Lawrence Berkeley National Lab Berkeley, California U.S.A. Research Areas – Berkeley Lab Checkpoint/Restart – – PGAS Language runtime support (UPC & GASNet) –

11 Hermann Härtig Microkernels (L4) and multi-server OS System Security: Very small application-specific Trusted Computing Bases (for example, 150LoC for Bank Transaction) Virtual Machines (L4Linux, …) Real-Time (hard, probabilistic, L4Linux as NRT guest)

12 Name: Thomas HERAULT Institution(s): – Ass. Prof. at Université Paris-Sud (France) – Member of the Grand-Large team of INRIA – Visiting Scholar at the University of Tennessee (ICL) FT-Areas: – Rollback/Recovery in Message Passing Systems – Self-Stabilization – Application-level Fault Tolerance FT research Focus: – Automatic & Transparent Fault Tolerance in MPI using Rollback/Recovery – Fault Tolerant Runtime Environment (using Self-Stabilization Techniques) Other research areas: – MPI & Grids – Large Data Movements (Grids) – Model Checking – Theoretical Aspects of Self-Stabilization

13 Laxmikant (Sanjay) Kale University of Illinois at Urbana-Champaign – Parallel Programming Laboratory (20+ years) – – Interests: parallel programming abstractions, Adaptive Runtime, CSE apps, Fault tolerance Known for: – Object-based overdecomposition and adaptive RTS – Charm++, Adaptive MPI, recent: Charisma, MSA – Apps: NAMD, OpenAtom (nano), ChaNGa (astro) Fault tolerance: – multiple schemes in Charm++, – One where MTBF can be lower than checkpoint period!

14 Dagstuhl FT Workshop Rainer Keller, ORNL, Past experience at the High-Performance Computing Center Stuttgart (HLRS): – Applications, Models and Tools – PACX-MPI – Open MPI Doing PostDoc at ORNL on: – Open MPI – STCI Main Interest: Fault Tolerance in I/O

15 Barry Linnert pdf

16 Jörg Schneider pdf

17 Volker Lindenstruth Kirchhoff Institute for Physics Chair of Computer Science University Heidelberg, Germany Phone: Fax: WWW: www.compeng.de ALICE HLT CHAIR FIAS Fellow GSI FAIR Computing Coordinator ALICE TRD Trigger core MPP system ALICE HLT Trigger HPC Cluster, FPGA, GPU Frankfurt Landesrechner

18 Xiaosong Ma Appointment – Assistant Professor, NC State University, USA – Joint Faculty Member, Oak Ridge National Lab, USA Contact – Research interest: HEC storage, parallel I/O, high-performance sequence search, cloud computing Known for? Active buffering for collective I/O, parallel I/O in mpiBLAST, FreeLoader

19 Frank Mueller, North Carolina State Univ. FT areas of interest: HPC, OS, I/O – Other areas: HPC tools, compilers, real-time/embedded Focus: FT around MPI (but also in map-reduce) – Scalable network overlays – Reactive and proactive FT – Process and OS level Open problems: – Scalability – I/O bandwidth – Sync/async chkpt abstraction through OS/process layers – Benefit of proactive FT / health monitoring – Standardization – Cross-community fertilization

20 Chokchai Box Leangsuksun SWEPCO Endowed Professor, Computer Science Director, High Performance Computing Initiative Louisiana Tech University *SWEPCO endowed professorship is made possible by LA Board of Regents Research Interest – Resilience in HPC, Cluster computing – Failure Analysis and Modeling – Reliability-aware Runtime & scheduling – Near realtime resilience modeling – Checkpoint/migration scheduling – Virtualization for Resilience – Heterogeneity = Host + accelerator

21 Supporting Fault-Tolerance in Modern High-End Computing Systems with InfiniBand Dhabaleswar K. (DK) Panda The Ohio State University Research Interests: HPC, InfiniBand, Fault-Tolerance, MPI (MVAPICH/MVAPICH2 project)

22 Alexander Reinefeld Affiliations – Zuse Institute Berlin (ZIB) – Humboldt-Universität zu Berlin Interests – distributed computing, P2P algorithms, data management – supercomputing (we operate a 300 Tflops system) – HW accelerators: Nvidia, FPGA, Clearspeed, … – parallel tree search algorithms

23 Florian Schintke Affiliation – Zuse Institute Berlin (ZIB) Interests – distributed data management – P2P algorithms, structured overlays – scalable systems

24 Dr. Stephen L. Scott Oak Ridge National Laboratory (ORNL) Background – At ORNL since 1996 Senior Research Scientist Systems Research Team (team lead) – >10 years working in area of Fault Tolerance / Resilience Distributed control → HA/FT clusters → Resilience – HA-OSCAR – HAPCW High Availability and Performance Computing Workshop (since 2003) – Resilience Workshop (CCGrid08, HPDC09 Munich – June 11-13) – Resilience Summit (LACSS08, 09?) Motivation – It's not “if they fail” but “when…” Research interests – I like to solve problems… – Resilience Reactive fault tolerance Proactive fault tolerance Algorithms – Systems software Virtualization Tools

25 Dagstuhl FT Intro Slide Eugen Staab – PhD student – University of Luxembourg – Research interests: – Sabotage Tolerance – Result Checking – Trust Volunteer Computing Desktop Grids P2P Computing Fault tolerance for cases where: No control over machines that execute computations

26 Jesper Larsson Träff NEC Laboratories Europe, St. Augustin (which is close to Bonn), Germany Known (???) for MPI implementations: MPI/SX for NEC SX-vector machines, Earth Simulator, … MPI Forum: one-sided communication, topology interface, collectives (blocking and non-blocking), … Communication algorithms for MPI and other PP/HPC interfaces What algorithmic (and other) support is needed for what kind of FT? May 4-8, 2009 Dagstuhl FT © NEC Laboratories Europe

27 Paolo Trunfio University of Calabria - Italy Assistant professor of Computer Engineering at University of Calabria Research interests: – Distributed data mining Weka4WS – Grid computing Knowledge Grid – Peer-to-Peer networks Resource/Service discovery – Distributed programming MapReduce in P2P/Grid systems FT focus: – Making MapReduce reliable in dynamic distributed environments (P2P, Grids) Talk this afternoon

28 Geoffroy Vallée Oak Ridge National Laboratory Research interest: operating systems, system resilience, tools for HPC System resilience – System policies for fault-tolerance – Process checkpoint/restart/migration (Kerrighed) – Virtual machine checkpoint/restart/migration

