Routine-Basis Experiments in the PRAGMA Grid Testbed
Yusuke Tanimura, Grid Technology Research Center, National Institute of Advanced Industrial Science and Technology (AIST)

2. Agenda
- Past status of the PRAGMA testbed
  - Discussions at PRAGMA 6 in May 2004
- Routine-basis experiments
  - Results of the 1st application: technical results, lessons learned
  - Future plans
- Current work toward the production grid
  - Activity as a Grid Operation Center
  - Cooperation with other working groups

3. Status of the Testbed in May 2004
- Computational resources
  - 26 organizations (10 countries), 27 clusters (889 CPUs)
  - Network performance is getting better
- Architecture and technology
  - Based on Globus Toolkit (mostly version 2)
  - Ninf-G (GridRPC programming)
  - Nimrod-G (parametric modeling system)
  - SCMSWeb (resource monitoring)
  - Grid Datafarm / Gfarm (grid file system), etc.
- Operation policy
  - Distributed management (no Grid Operation Center)
  - Volunteer-based administration
  - Less duty, less formality, less documentation

4. Status of the Testbed in May 2004
- Open questions
  - Ready for real science applications?
  - Easy to use for every user?
  - Reliable environment? Stable middleware?
  - Enough documentation? Enough security? etc.
- Direction of the PRAGMA Resource Working Group
  - Run "routine-basis experiments": daily application runs over a long term
  - Find the problems and difficulties
  - Learn what is necessary for a production grid

5. Overview of the Routine-Basis Experiments
- Purpose
  - Through daily runs of a sample application on the PRAGMA testbed, find and understand the operational issues the testbed poses for real science applications
- Case of the 1st application
  - Application: Time-Dependent Density Functional Theory (TDDFT)
  - Software requirements of TDDFT: Ninf-G, Globus and the Intel Fortran Compiler
  - Schedule: June 1, 2004 to August 31, 2004 (3 months)
  - Participants: 10 sites in 8 countries (AIST, SDSC, KU, KISTI, NCHC, USM, BII, NCSA, TITECH, UNAM), 193 CPUs on 106 nodes

6. Rough Schedule
- Timeline (May to November 2004): PRAGMA 6, setup of the resource monitor (SCMSWeb), 1st application start (June), 1st application end (August), PRAGMA 7, 2nd application start, SC'04 (November)
- Per-site ramp-up, repeated throughout the 3 months as the testbed grew from 2 to 5, 8 and finally 10 sites:
  1. Apply for an account
  2. Deploy the application code
  3. Simple test at the local site
  4. Simple test between 2 sites
  5. Join the main executions once the above are done
- A 2nd user started executions later in the period

7. Details of the Application (1)
- TDDFT: Time-Dependent Density Functional Theory, by Nobusada (IMS) and Yabana (Univ. of Tsukuba)
- An application of computational quantum chemistry: simulates how the electronic system evolves in time after excitation
- The time-dependent N-electron wave function is approximated and transformed, then integrated numerically (the slide's equations are summarized by the standard forms below)
- Figure: a spectrum graph computed from the real-time dipole moments
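The equations shown on the original slide are not reproduced in the transcript. As a hedged illustration only, the standard starting point for this kind of real-time TDDFT simulation is the time-dependent many-electron Schrodinger equation, approximated by the time-dependent Kohn-Sham equations, which are then integrated numerically in time; the notation on the slide may have differed.

    % Hedged reconstruction: textbook TDDFT forms; the exact notation on
    % the original slide is not recoverable from this transcript.
    \begin{align}
      i\hbar\,\frac{\partial}{\partial t}\,
          \Psi(\mathbf{r}_1,\dots,\mathbf{r}_N,t)
        &= \hat{H}(t)\,\Psi(\mathbf{r}_1,\dots,\mathbf{r}_N,t)
        && \text{(time-dependent $N$-electron wave function)} \\
      i\hbar\,\frac{\partial}{\partial t}\,\psi_j(\mathbf{r},t)
        &= \Bigl[-\frac{\hbar^2}{2m}\nabla^2
                 + v_{\mathrm{eff}}[\rho](\mathbf{r},t)\Bigr]\psi_j(\mathbf{r},t)
        && \text{(time-dependent Kohn-Sham equations, integrated in time)}
    \end{align}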

8. Details of the Application (2)
- GridRPC model using Ninf-G: some partial calculations are executed on multiple servers in parallel
- The sequential client program of TDDFT issues GridRPC calls; on each server, the gatekeeper launches tddft_func(), which runs func() on the cluster's backend nodes (Cluster 1 to Cluster 4 in the figure)
- Client code sketch from the slide (a fuller sketch follows this slide):

    main(){
        ...
        grpc_function_handle_default(&server, "tddft_func");
        ...
        grpc_call(&server, input, result);
        ...
    }
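To make the call pattern concrete, here is a minimal, self-contained sketch of a GridRPC client in the style of this slide: one function handle per cluster, with the partial calculations issued in parallel through asynchronous calls. The host names, buffer sizes and the use of grpc_call_async()/grpc_wait_all() are illustrative assumptions; only the routine name tddft_func and the overall pattern come from the slide.

    /* Minimal GridRPC client sketch (assumptions: host names, buffers). */
    #include <stdio.h>
    #include "grpc.h"

    #define N_SERVERS 4

    int main(int argc, char *argv[])
    {
        /* Hypothetical front-end host names for the four clusters */
        char *servers[N_SERVERS] = {
            "cluster1.example.org", "cluster2.example.org",
            "cluster3.example.org", "cluster4.example.org"
        };
        grpc_function_handle_t handles[N_SERVERS];
        grpc_sessionid_t ids[N_SERVERS];
        double input[1024];                 /* placeholder input buffer   */
        double result[N_SERVERS][1024];     /* placeholder result buffers */
        int i;

        if (argc < 2 || grpc_initialize(argv[1]) != GRPC_NO_ERROR) {
            fprintf(stderr, "usage: %s <client_config_file>\n", argv[0]);
            return 1;
        }

        /* One handle per remote cluster; "tddft_func" is the remote routine */
        for (i = 0; i < N_SERVERS; i++)
            grpc_function_handle_init(&handles[i], servers[i], "tddft_func");

        /* Issue the partial calculations in parallel (asynchronous RPCs) */
        for (i = 0; i < N_SERVERS; i++)
            grpc_call_async(&handles[i], &ids[i], input, result[i]);

        /* Block until every outstanding RPC has completed */
        grpc_wait_all();

        for (i = 0; i < N_SERVERS; i++)
            grpc_function_handle_destruct(&handles[i]);
        grpc_finalize();
        return 0;
    }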

9. Details of the Application (3)
- Parallelism: well suited to the GridRPC framework
- Real science: long runs and large data; requires 6.1 million RPCs (about one week of execution)
- Example problem: the ligand-protected Au13 molecule
- Figure: the numerical-integration part of the client program runs for 5000 iterations, issuing 122 RPCs per iteration with 3.25 MB transferred per call and about 1 to 2 seconds of calculation on the remote clusters per call

10. Fault-Tolerant Mechanism
- Management of each server's status
  - Status: Down, Idle, or Busy (calculating or initializing)
  - Error detection (e.g. heartbeat from the servers)
  - Reboot of a down server as periodic work (e.g. one trial per hour)
- State diagram on the slide: Start leads to Idle; a task submitted by RPC moves Idle to Busy; a finished task moves Busy back to Idle; an error moves Idle or Busy to Down; a restart moves Down back to Idle (a sketch of this bookkeeping follows this slide)
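The following is a hedged sketch of the client-side bookkeeping this slide describes. The state names, the retry interval, the struct layout and the choice of re-creating the function handle on restart are assumptions for illustration; the slide itself only specifies the Idle/Busy/Down states, error detection via heartbeat or failed RPCs, and a periodic restart attempt (about once per hour).

    /* Sketch of server-status management (assumptions: struct, interval). */
    #include <time.h>
    #include "grpc.h"

    typedef enum { SERVER_IDLE, SERVER_BUSY, SERVER_DOWN } server_state_t;

    typedef struct {
        char                   *host;        /* cluster front-end host name   */
        grpc_function_handle_t  handle;      /* GridRPC handle for tddft_func */
        server_state_t          state;       /* Idle, Busy or Down            */
        time_t                  down_since;  /* when the server went down     */
    } server_t;

    #define RETRY_INTERVAL (60 * 60)         /* retry a down server hourly */

    /* Called when an RPC on this server fails, e.g. its heartbeat is lost */
    void mark_down(server_t *s)
    {
        grpc_function_handle_destruct(&s->handle);
        s->state = SERVER_DOWN;
        s->down_since = time(NULL);
    }

    /* Periodic pass over all servers: try to bring Down servers back to Idle */
    void manage_servers(server_t *servers, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            if (servers[i].state != SERVER_DOWN)
                continue;
            if (time(NULL) - servers[i].down_since < RETRY_INTERVAL)
                continue;
            /* Re-create the handle; success moves the server back to Idle */
            if (grpc_function_handle_init(&servers[i].handle, servers[i].host,
                                          "tddft_func") == GRPC_NO_ERROR)
                servers[i].state = SERVER_IDLE;
            else
                servers[i].down_since = time(NULL);   /* try again next hour */
        }
    }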

11. Experiment Procedure (1)
- Application for a user account (usual procedure)
  - Installation of the AIST GTRC CA certificate
  - Update of the grid-mapfile
  - In some cases, update of the access permissions on firewalls
- Deployment of the TDDFT application
  - Software requirements: Globus version 2.x; Intel Fortran Compiler version 6, 7 or the latest 8
  - Installation of Ninf-G (some sites prepared Ninf-G for the experiment)
  - Installation of the TDDFT server: upload the source code and compile it (the real user's work)

12. Experiment Procedure (2)
- Tests
  - Globus-level test:
      globusrun -a -r <host>
      globus-job-run <host>/jobmanager-fork /bin/hostname
      globus-job-run <host>/jobmanager-pbs -np 4 /bin/hostname
  - Ninf-G level test: confirmed by calling a sample server (see the sketch after this slide)
  - Application-level test: run TDDFT with short-run parameters on 2 sites (client and server)
- Start of the experiment
  - Run TDDFT with long-run parameters
  - Monitor the status of the run: task throughput, faults, communication performance, etc.
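As a hedged illustration of the "Ninf-G level test", the sketch below makes a single synchronous GridRPC call to a sample routine and checks that a result comes back. The host name, the function name sample_func and the integer arguments are assumptions; any sample routine exported by the remote Ninf-G installation would serve the same purpose.

    /* Minimal Ninf-G level test sketch (assumptions: host, sample_func). */
    #include <stdio.h>
    #include "grpc.h"

    int main(int argc, char *argv[])
    {
        grpc_function_handle_t handle;
        int in = 42, out = 0;

        if (argc < 2 || grpc_initialize(argv[1]) != GRPC_NO_ERROR) {
            fprintf(stderr, "usage: %s <client_config_file>\n", argv[0]);
            return 1;
        }

        /* Hypothetical test host and sample routine name */
        grpc_function_handle_init(&handle, "testhost.example.org", "sample_func");

        /* A single synchronous call: success means the Globus and Ninf-G
         * layers on both the client and the server side work end to end */
        if (grpc_call(&handle, in, &out) == GRPC_NO_ERROR)
            printf("Ninf-G test OK, result = %d\n", out);
        else
            fprintf(stderr, "Ninf-G test failed\n");

        grpc_function_handle_destruct(&handle);
        grpc_finalize();
        return 0;
    }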

13. Troubles for a User
- Authentication failures
  - SSH login, Globus GRAM, access to compute nodes
  - Problems with CA/CRL or with UID/GID
- Job submission failures on individual clusters
  - A job was queued and never ran
  - Incomplete configuration of jobmanager-{pbs,sge,lsf,sqms}
- Globus-related failures
  - The Globus installation appeared to be incomplete
- Application (TDDFT) failures
  - Shared libraries of GT and the Intel compiler missing on compute nodes
- Poor network performance within Asia
- Cluster instability (NFS, heat or power supply)

14. Numerical Results (1)
- Application user's work
  - Time to run TDDFT after getting an account: 8.3 days on average
  - Work needed for one round of troubleshooting: 3.9 days and 4 s on average
- Executions
  - Number of major executions by two users: 43
  - Execution time: 1210 hours total (50.4 days); 164 hours maximum (6.8 days); about 28 hours average (1.2 days)
  - Number of RPCs: more than 2,500,000
  - Number of RPC failures: more than 1,600 (an error rate of roughly 0.06%)

15. Result (2): Server Stability
- The longest run used 59 servers over 5 sites
- The network between KU (Thailand) and AIST was unstable

16. Summary
- Issues found
  - In deployment and testing: much work is required of the user, including self-service troubleshooting
  - In execution: unstable networks; it is hard to know each cluster's status (maintenance or trouble?); some middleware improvements are needed
- Details of the lessons learned and the current work toward the production grid follow in the next talk. Please stay.

17. Credits
- KISTI (Jysoo Lee, Jae-Hyuck Kwak)
- KU (Sugree Phatanapherom, Somsak Sriprayoonsakul)
- USM (Nazarul Annuar Nasirin, Bukhary Ikhwan Ismail)
- TITECH (Satoshi Matsuoka, Shirose Ken'ichiro)
- NCHC (Fang-Pang Lin, WeiCheng Huang, Yu-Chung Chen)
- NCSA (Radha Nandkumar, Tom Roney)
- BII (Kishore Sakharkar, Nigel Teow)
- UNAM (Jose Luis Gordillo Ruiz, Eduardo Murrieta Leon)
- UCSD/SDSC (Peter Arzberger, Phil Papadopoulos, Mason Katz, Teri Simas, Cindy Zheng)
- AIST (Yoshio Tanaka, Yusuke Tanimura) and other PRAGMA members

19. Result (3): Task Throughput per Hour
- Reason for the instability: waiting for a slow server causes timeouts on the other servers
- A better fault-detection and recovery mechanism is under discussion

20. Ninf-G
- Grid middleware for developing and executing scientific applications
- Supports the GridRPC API (discussed in GGF's APME working group)
- Built on Globus Toolkit 2.x, 3.0 and 3.2
- May 2004: Version 2 release
- Architecture shown on the slide: the client's grpc_function_handle_default() and grpc_call() go through the server's globus-gatekeeper and job manager, which launch the func() executable on the cluster's backend compute nodes

    main(){
        ...
        grpc_function_handle_default(&handle, "func_name");
        ...
        grpc_call(&handle, A, B, C);
        ...
    }

21. New Features of Ninf-G Ver. 2 (Implementation)
- Remote object
  - Objectification: a server has multiple methods and keeps internal data shared between sessions
  - Effect: reduces extra calculations and communications, and improves programmability
- Error handling and heartbeat function
  - An appropriate error code is returned for any error (the GridRPC API standard is under discussion)
  - Heartbeat: servers send a packet to the client periodically; when no heartbeat reaches the client for a certain time, the blocked GridRPC wait() function returns an error (see the sketch below)
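The sketch below shows, in hedged form, how this error-handling behaviour surfaces to a client: if a server stops sending heartbeats, the blocked grpc_wait() returns an error instead of hanging, and the client can resubmit the task to another server. The failover logic and the primary/backup handles are assumptions for illustration; the heartbeat interval itself is configured in the Ninf-G client configuration, not in this code.

    /* Failover sketch using GridRPC error codes (assumption: two handles). */
    #include <stdio.h>
    #include "grpc.h"

    /* Run one task on the primary server; if the call or the wait fails
     * (for example because the server's heartbeat stopped), resubmit the
     * same task to a backup server. */
    int run_with_failover(grpc_function_handle_t *primary,
                          grpc_function_handle_t *backup,
                          double *in, double *out)
    {
        grpc_sessionid_t sid;

        if (grpc_call_async(primary, &sid, in, out) == GRPC_NO_ERROR &&
            grpc_wait(sid) == GRPC_NO_ERROR)
            return 0;                              /* primary succeeded */

        fprintf(stderr, "primary server failed, retrying on backup\n");

        if (grpc_call_async(backup, &sid, in, out) != GRPC_NO_ERROR)
            return -1;
        if (grpc_wait(sid) != GRPC_NO_ERROR)
            return -1;
        return 0;
    }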