
1 The Open Science Grid, Fermilab

2 The Vision Practical support for end-to-end community systems in a heterogeneous global environment, to transform compute- and data-intensive science through a national cyberinfrastructure that includes organizations from the smallest to the largest.

3 The Scope

4 Goals of OSG Enable scientists to use and share a greater percentage of available compute cycles. Help scientists use distributed systems (storage, processors and software) with less effort. Enable more sharing and reuse of software, and reduce duplication of effort, by providing effort in integration and extensions. Establish an "open-source" community that works together to communicate knowledge and experience and to lower the overheads for new participants.

5 The History Timeline 1999-2009: PPDG (DOE), GriPhyN and iVDGL (NSF), then Trillium and Grid3, leading to OSG (DOE+NSF); campus and regional grids; LHC construction and preparation, then LHC operations; LIGO preparation, then LIGO operation; the European Grid and the Worldwide LHC Computing Grid.

6 The Leaders High Energy & Nuclear Physics (HENP) Collaborations: global communities with large distributed systems in Europe as well as the US. Condor Project: distributed computing across diverse clusters. Globus: Grid security, data movement and information services software. Laser Interferometer Gravitational Wave Observatory (LIGO): a legacy data grid with large data collections. DOE HENP facilities. University groups and researchers.

7 Community Systems

8 Institutions Involved Project staff at: Boston U, Brookhaven National Lab +, Caltech, (Clemson), Columbia, Cornell, Fermilab +, ISI / U of Southern California, Indiana U, LBNL +, (Nebraska), RENCI, SLAC +, UCSD, U of Chicago, U of Florida, U of Illinois Urbana-Champaign/NCSA, U of Wisconsin-Madison. Sites on OSG: many with more than one resource; 46 separate institutions (* = no physics): Florida State U., Nebraska, U. of Arkansas*, Kansas State, LBNL, U. of Chicago, U. of Michigan, U. of Iowa, Notre Dame, U. of California at Riverside, Academia Sinica, Hampton U., Penn State U., UCSD, Brookhaven National Lab, UERJ Brazil, Oklahoma U., U. of Florida, Boston U., Iowa State, SLAC, U. of Illinois Chicago, Cinvestav (Mexico City), Indiana University, Purdue U., U. of New Mexico, Caltech, Lehigh University*, Rice U., U. of Texas at Arlington, Clemson U.*, Louisiana University, Southern Methodist U., U. of Virginia, Dartmouth U.*, Louisiana Tech*, U. of São Paulo, U. of Wisconsin-Madison, Florida International U., McGill U., Wayne State U., U. of Wisconsin-Milwaukee, Fermilab, MIT, TTU, Vanderbilt U.

9 Monitoring the Sites

10 The Value Proposition Increased usage of CPUs and infrastructure alone (i.e. the cost of processing cycles) is not the persuading cost-benefit value. The benefits come from reducing risk in, and sharing support for, large, complex systems which must be run for many years with a short-lifetime workforce: savings in effort for integration, system and software support; the opportunity and flexibility to distribute load and address peak needs; maintenance of an experienced workforce in a common system; lowering the cost of entry for new contributors; enabling new computational opportunities for communities that would not otherwise have access to such resources.

11 OSG Does Release, deploy and support software. Integrate and test new software at the system level. Support operations and Grid-wide services. Provide security operations and policy. Troubleshoot end-to-end user and system problems. Engage and help new communities. Extend capability and scale.

12 And OSG Does Training Grid Schools train students, teachers and new entrants to use grids: 2-3 day training with hands-on workshops and a core curriculum (based on the iVDGL annual week-long schools). 3 held already; several more this year (2 scheduled), some as participants in international schools. 20-60 in each class. Each class is regionally based with a broad catchment area. Gathering an online repository of training material. End-to-end application training in collaboration with user communities.

13 Participants in a recent Open Science Grid workshop held in Argentina. Image courtesy of Carolina Leon Carri, University of Buenos Aires.

14 Virtual Organizations A Virtual Organization is a collection of people (VO members). A VO has responsibilities to manage its members and the services it runs on their behalf. A VO may own resources and be prepared to share in their use.

15 VOs Self-operated research VOs: 15. Collider Detector at Fermilab (CDF), Compact Muon Solenoid (CMS), CompBioGrid, D0 Experiment at Fermilab (DZero), Dark Energy Survey (DES), Functional Magnetic Resonance Imaging (fMRI), Geant4 Software Toolkit (geant4), Genome Analysis and Database Update (GADU), International Linear Collider (ILC), Laser Interferometer Gravitational-Wave Observatory (LIGO), nanoHUB Network for Computational Nanotechnology (NCN), Sloan Digital Sky Survey (SDSS), Solenoidal Tracker at RHIC (STAR), Structural Biology Grid (SBGrid), United States ATLAS Collaboration (USATLAS). Campus grids: 5. Georgetown University Grid (GUGrid), Grid Laboratory of Wisconsin (GLOW), Grid Research and Education Group at Iowa (GROW), University of New York at Buffalo (GRASE), Fermi National Accelerator Center (Fermilab). Regional grids: 4. NYSGRID, Distributed Organization for Scientific and Academic Research (DOSAR), Great Plains Network (GPN), Northwest Indiana Computational Grid (NWICG). OSG-operated VOs: 4. Engagement (Engage), Open Science Grid (OSG), OSG Education Activity (OSGEDU), OSG Monitoring & Operations.

16

17 Sites A Site is a collection of commonly administered computing and/or storage resources and services. Resources can be owned by and shared among VOs.

18 A Compute Element Processing farms are accessed through Condor-G submissions to a Globus GRAM interface, which supports many different local batch systems. Priorities and policies are set through the assignment of VO Roles mapped to accounts and batch queue priorities, modified by Site policies and priorities. From ~20 CPU department computers to 10,000 CPU supercomputers. Jobs run under any local batch system. An OSG gateway machine plus services connects to the network and other OSG resources.
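The Condor-G to GRAM path described on this slide can be illustrated with a submit description file; this is a minimal sketch, and the gatekeeper host, jobmanager name and file names are hypothetical placeholders rather than values taken from the slides.

    # Condor-G submit description file: route a job to a pre-WS GRAM (gt2)
    # gateway at an OSG Compute Element whose gatekeeper fronts a PBS farm.
    universe      = grid
    grid_resource = gt2 ce.example.edu/jobmanager-pbs
    executable    = analyze.sh
    arguments     = run42
    output        = job.out
    error         = job.err
    log           = job.log
    # VOMS proxy carrying the user's VO role, used for authorization at the site
    x509userproxy = /tmp/x509up_u1000
    queue

The site's GRAM jobmanager translates the request into whatever local batch system the farm runs (PBS, Condor, LSF, etc.), which is why the same submit file works across very different sites.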

19 Storage Element Storage services: access to storage through the Storage Resource Manager (SRM) interface and GridFTP. Allocation of shared storage through agreements between a Site and VO(s), facilitated by OSG. From 20 GBytes of disk cache to 4 Petabyte robotic tape systems. Any shared storage. An OSG SE gateway connects to the network and other OSG resources.
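As a rough illustration of the two access routes named above, the commands below move a file to a Storage Element first with plain GridFTP and then through the SRM interface; the hostnames, ports and paths are made-up placeholders, and the particular SRM client shown (srmcp) is only one of several in use.

    # GridFTP transfer of a local file to the SE
    globus-url-copy file:///home/user/input.dat \
        gsiftp://se.example.edu/data/myvo/input.dat

    # The same destination addressed through the SRM interface
    srmcp file:////home/user/input.dat \
        srm://se.example.edu:8443/data/myvo/input.dat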

20 How are VOs supported? Virtual Organization Management services (VOMS) allow registration, administration and control of the members of the group. Facilities trust and authorize VOs, not individual users. Storage and Compute Services prioritize according to VO group. Resources that trust the VO, the VO Management Service, the network and other OSG resources, and VO middleware and applications.
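In practice a member obtains VO-signed attributes before submitting work; the following is a sketch of that step, where the VO name "myvo" and the role are placeholders rather than real OSG VOs.

    # Create a proxy certificate carrying VO membership and a role attribute
    voms-proxy-init -voms myvo:/myvo/Role=production -valid 24:00

    # Show the attributes a site uses for authorization and priority
    voms-proxy-info -all

Sites map these VOMS attributes, not the individual user, to local accounts and queue priorities, which is what "facilities trust and authorize VOs" means operationally.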

21 Running Jobs Condor-G client. Pre-WS or WS GRAM as the Site gateway. Priority through VO role and policy, moderated by Site policy. Pilot jobs submitted through the regular gateway can then bring down multiple user jobs until the batch slot resources are used up. glexec, modelled on Apache suexec, allows jobs to run under the user's identity.
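A hedged sketch of the pilot/glexec hand-off mentioned above: the pilot presents the end user's proxy to glexec, which re-maps the process to that user's local account before running the payload. The environment variable name follows the glexec documentation; the paths and file names are hypothetical.

    # Inside a running pilot job: execute a payload under the payload user's identity
    export GLEXEC_CLIENT_CERT=/tmp/payload_user_proxy.pem   # the end user's VOMS proxy
    /usr/sbin/glexec /bin/sh /scratch/pilot/payload_job.sh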

22 Data and Storage GridFTP data transfer. A Storage Resource Manager to manage shared and common storage. Environment variables on the site let VOs know where to put and leave files. dCache: a large-scale, high-I/O disk caching system for large sites. DRM: an NFS-based disk management system for small sites. Open questions: NFS v4? GPFS?
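The "environment variables on the site" are conventionally $OSG_APP (VO-installed software), $OSG_DATA (shared data area) and $OSG_WN_TMP (per-job scratch); the job script below is a sketch using those conventions, with the VO name, binary and file names invented for illustration.

    # Worker-node job script using the OSG site environment variables
    cd "$OSG_WN_TMP"                              # scratch space local to this job
    "$OSG_APP"/myvo/bin/simulate input.cfg        # VO software pre-installed under $OSG_APP
    cp result.dat "$OSG_DATA"/myvo/results/       # leave output where later jobs can find it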

23

24 Resource Management Many resources are owned by, or statically allocated to, one user community. The institutions which own resources typically have ongoing relationships with (a few) particular user communities (VOs). The remainder of an organization's available resources can be "used by everyone or anyone else", and organizations can decide against supporting particular VOs. OSG staff are responsible for monitoring and, if needed, managing this usage. Our challenge is to maximize good (successful) output from the whole system.

25 An Example of Opportunistic Use D0's own resources are committed to the processing of newly acquired data and analysis of the processed datasets. In Nov '06 D0 asked to use 1500-2000 CPUs for 2-4 months for re-processing of an existing dataset (~500 million events), for science results for the summer conferences in July '07. The Executive Board estimated there were currently sufficient opportunistically available resources on OSG to meet the request; we also looked into the local storage and I/O needs. The Council members agreed to contribute resources to meet this request.

26 D0 Event Throughput

27 How did D0 Reprocessing Go? D0 had 2-3 months of smooth production running using >1,000 CPUs and met their goal by the end of May. To achieve this, D0's testing of the integrated software system took until February. OSG staff and D0 then worked closely together as a team to reach the needed throughput goals, facing and solving problems with sites (hardware, connectivity, software configurations), application software (performance, error recovery) and the scheduling of jobs to a changing mix of available resources.

28 D0 OSG CPU Hours / Week

29 What did this teach us? Consortium members contributed significant opportunistic resources as promised. VOs can use a significant number of sites they "don't own" to achieve a large effective throughput. Combined teams make large production runs effective. How does this scale? How are we going to support multiple requests that oversubscribe the resources? We anticipate this may happen soon.

30 Use by non-Physics Rosetta@Kuhlman lab: in production across ~15 sites since April. Weather Research and Forecasting (WRF): an MPI job running on 1 OSG site; more to come. CHARMM: molecular dynamics simulation applied to the problem of water penetration in staphylococcal nuclease. Genome Analysis and Database Update system (GADU): a portal across OSG & TeraGrid; runs BLAST. nanoHUB at Purdue: BioMOCA and Nanowire production.

31 Rosetta: a user decided to submit jobs... 3,000 jobs.

32 Scale needed in 2008/2009 (e.g. for a single experiment): 20-30 Petabytes of tertiary automated tape storage at 12 centers world-wide for physics and other scientific collaborations. High availability (365x24x7) and high data access rates (1 GByte/sec) locally and remotely. Evolving and scaling smoothly to meet evolving requirements.

33 CMS Data Transfer & Analysis

34 Software Infrastructure (layered, from the bottom up): existing operating systems, batch systems and utilities; core grid technology distributions (Condor, Globus, MyProxy), shared with TeraGrid and others; the Virtual Data Toolkit (VDT), core technologies plus software needed by stakeholders, with many components shared with EGEE; the OSG Release Cache, OSG-specific configurations, utilities etc.; VO middleware, e.g. HEP (data and workflow management etc.), Biology (portals, databases etc.), Astrophysics (data replication etc.); and user science codes and interfaces.

35 Horizontal and Vertical Integrations Infrastructure and applications: user science codes and interfaces, with VO middleware for HEP (data and workflow management etc.), Biology (portals, databases etc.) and Astrophysics (data replication etc.).

36 The Virtual Data Toolkit Software A pre-built, integrated and packaged set of software which is easy to download, install and use to access OSG. Client, Server, Storage and Service versions. Automated build and test: integration and regression testing. Software included: grid software (Condor, Globus, dCache, authorization (VOMS/PRIMA/GUMS), accounting (Gratia)); utilities (monitoring, authorization, configuration); common components, e.g. Apache. Built for >10 flavors/versions of Linux. A support structure. A software acceptance structure.
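For a sense of what "easy to download, install and use" looked like in this era, VDT-based OSG installations were driven by the Pacman package manager; the sketch below assumes a cache alias such as OSG:client as used in contemporary OSG install guides, so treat the cache and package names as illustrative.

    # Fetch and install the VDT-based OSG client stack into the current directory
    cd /opt/osg-client
    pacman -get OSG:client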

37 How we get to a Production Software Stack Input from stakeholders and OSG directors feeds a VDT Release, which is tested on the OSG Validation Testbed, goes into an OSG Integration Testbed Release, and then an OSG Production Release.

38 How we get to a Production Software Stack Input from stakeholders and OSG directors feeds a VDT Release, which is tested on the OSG Validation Testbed, goes into an OSG Integration Testbed Release, and then an OSG Production Release. Validation/integration takes months and is the result of the work of many people.

39 How we get to a Production Software Stack Input from stakeholders and OSG directors feeds a VDT Release, which is tested on the OSG Validation Testbed, goes into an OSG Integration Testbed Release, and then an OSG Production Release. The VDT is used by others than OSG: TeraGrid, Enabling Grids for E-sciencE (EGEE, Europe), APAC, ...

40

41 Security Operational security is a priority: incident response; signed agreements and template policies; auditing, assessment and training. Parity of Sites and VOs: a Site trusts the VOs that use it; a VO trusts the Sites it runs on; VOs trust their users. Infrastructure: X.509 certificate based, with extended attributes for authorization.

42 Illustrative example of the trust model, linking User, VO, Site, jobs, VO infrastructure, data, storage and Compute Elements: "I trust it is the VO (or agent)"; "I trust it is the user"; "I trust it is the user's job"; "I trust the job is for the VO".

43 Operations & Troubleshooting & Support A well-established Grid Operations Center at Indiana University. User support is distributed, and includes osg-general@opensciencegrid community support. A Site coordinator supports the team of sites. Accounting and Site Validation are required services of sites. Troubleshooting looks at targeted end-to-end problems. Partnering with the LBNL Troubleshooting work for auditing and forensics.

44 Campus Grids Sharing across compute clusters is a change and a challenge for many universities. OSG, TeraGrid, Internet2 and Educause are working together on CI Days: working with CIOs, faculty and IT organizations on a one-day meeting where we all come and talk about the needs, the ideas and, yes, the next steps.

45 OSG and TeraGrid Complementary and interoperating infrastructures. TeraGrid networks supercomputer centers; OSG includes small to large clusters and organizations. TeraGrid is based on the Condor & Globus software stack built at the Wisconsin Build and Test facility; OSG is based on the same versions of Condor & Globus in the Virtual Data Toolkit. TeraGrid develops user portals/science gateways; OSG supports jobs and data from TeraGrid science gateways. TeraGrid currently relies mainly on remote login; OSG has no login access, and many sites expect VO attributes in the proxy certificate. Training covers OSG and TeraGrid usage.

