Presentation on theme: "1 Large Scale Computing at PDSF Iwona Sakrejda NERSC User Services Group February ??, 2006."— Presentation transcript:
1 Large Scale Computing at PDSF Iwona Sakrejda NERSC User Services Group ISakrejda@lbl.gov February ??, 2006
2 Outline Role of PDSF in HENP computing. Integration with other NERSC computational and storage systems. User management and user oriented services at NERSC. PDSF layout. Workload management (batch systems). File system implications of data intensive computing. Operating system selection with CHOS. Grid use at PDSF (Grid3, OSG, ITB). Conclusions.
3 PDSF Mission PDSF (Parallel Distributed Systems Facility) is a networked distributed computing environment used to meet the detector simulation and data analysis requirements of large-scale High Energy Physics (HEP) and Nuclear Science (NS) experiments.
4 PDSF Principle of Operation Multiple groups pool their resources together. Need for resources varies through the year – conferences, data taking periods at different times (Quark Matter vs PANIC for example). Peak resource availability enhanced. Idle cycles minimized by allowing groups with small resources to use them (cycle scavenging). Software installation and license sharing (TotalView, IDL, PGI).
5 PDSF at NERSC [System diagram] HPSS: IBM AIX server, 50 TB of cache disk, 8 STK robots, 44,000 tape slots, max capacity 9 PB. PDSF: ~700 processors, ~1.5 TF, 0.7 TB of memory, ~300 TB of shared disk. Opteron cluster Jacquard: 640 processors (peak 2.8 Tflop/s), Opteron/InfiniBand 4X/12X, 3.1 TF / 1.2 TB memory, SSP 0.41 Tflop/s, 30 TB disk. IBM POWER5 Bassi: 888 processors (peak 6.7 Tflop/s), SSP 0.8 Tflop/s, 2 TB memory, 70 TB disk. IBM POWER3 Seaborg: 6,080 processors (peak 9.1 Tflop/s), SSP 1.35 Tflop/s, 7.8 TB memory, 55 TB of shared disk. Analytics server DaVinci: 32 processors, 192 GB memory, 25 TB disk. Other diagram labels: HPSS, FC Disk, STK Robots, Testbeds and servers, SGI, HPSS, Global Filesystem, Storage Fabric, Jumbo 10 Gigabit Ethernet, 10 Gigabit Ethernet.
6 User Management and Support at NERSC With >500 users and >10 projects, a database management system is needed. –Active user management (disabling, password expiration…) –Allocation management (especially mass storage accounting) PIs partly responsible for user management (for their own projects) –Adding users –Assigning users to groups –Removing users Users manage their own info, groups, certificates… Account support. User support and the trouble ticket system. –Call center –Trouble ticket system
8 PDSF Layout ….. Interactive nodes pdsf.nersc.gov Grid gatekeepers Batch pool – several generations of Intel and AMD processors ~1200 1GHz Pool of disk vaults GPFS file systems HPSS
9 Workload Management (Batch) Effective resource sharing via batch workload management. Fair share principle links shares to groups' financial contributions –Fairness concept by groups and within groups –Concept at the heart of PDSF design Unused resources split among running users. Group sharing places additional requirements on batch systems. Impact of batch system –LSF: good scalability, performance and documentation, met requirements, costly –Condor: concept of a group share not implemented when transition was considered (2 years ago) –SGE: met requirements, scales reasonably, documentation lacking at times Changes minimized by SUMS (STAR)
10 Shares System at Work STAR’s 70% share “pushes out” KamLAND (9% share). SNO (1%, light blue) and Majorana (no contribution) get time when the big share owners do not use it.
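The fair share idea on the two slides above can be illustrated with a small sketch. This is not the LSF or SGE algorithm; it is a minimal model, assuming each group has a share weight and a current demand for batch slots, in which slots a group cannot use are re-offered to the groups that still have work. The function name, the slot count and the demand figures are hypothetical; only the 70%, 9%, 1% and zero shares come from the slide.

```python
def fair_share(shares, demand, capacity):
    """Split `capacity` batch slots among groups in proportion to their shares.

    shares:   group -> share weight (e.g. percent of financial contribution)
    demand:   group -> number of slots the group could use right now
    Slots that a group cannot use are re-offered to the remaining groups.
    Groups with a zero share only receive slots nobody else wants.
    """
    alloc = {g: 0.0 for g in demand}
    active = {g for g in demand if demand[g] > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total = sum(shares.get(g, 0) for g in active)
        if total == 0:
            # Only non-contributing groups still want slots: split evenly.
            weights = {g: 1.0 / len(active) for g in active}
        else:
            weights = {g: shares.get(g, 0) / total for g in active}
        satisfied = set()
        handed_out = 0.0
        for g in active:
            offer = remaining * weights[g]
            take = min(offer, demand[g] - alloc[g])
            alloc[g] += take
            handed_out += take
            if demand[g] - alloc[g] <= 1e-9:
                satisfied.add(g)
        remaining -= handed_out
        if not satisfied:
            break  # every active group took its full offer; nothing is left
        active -= satisfied
    return alloc


# Shares from the slide; demands and the 100-slot pool are made up.
SHARES = {"STAR": 70, "KamLAND": 9, "SNO": 1, "Majorana": 0}
busy = fair_share(SHARES, {"STAR": 1000, "KamLAND": 1000, "SNO": 0, "Majorana": 0}, 100)
quiet = fair_share(SHARES, {"STAR": 20, "KamLAND": 0, "SNO": 0, "Majorana": 500}, 100)
```

In the `busy` case STAR and KamLAND split the pool 70:9; in the `quiet` case STAR needs only 20 slots, so Majorana, with no contribution, picks up the other 80.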
11 File System implications of data intensive computing - NFS NFS – cost effective solution but –scales poorly –data corruption during heavy use –data safety (RAID set helps but not 100%) Disk vaults are cheap IDE based centralized storage –Dvio batch-level “resource” integrated with the batch system –defined to limit number of simultaneous read/write access streams –hard to assess load a priori Ganglia facilitates load monitoring and the dvio requirement assessment – available to the users.
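A consumable resource such as dvio can be pictured as a counter attached to each disk vault: a job declares how much I/O it expects and is started only if the vault has enough headroom. The sketch below is an illustration under that assumption, not the PDSF batch configuration; the class name, the capacity value and the job names are hypothetical.

```python
class DiskVault:
    """Toy model of a disk vault guarded by a dvio-style consumable resource."""

    def __init__(self, name, dvio_capacity):
        self.name = name
        self.capacity = dvio_capacity
        self.jobs = {}  # job id -> dvio units held by that job

    @property
    def in_use(self):
        return sum(self.jobs.values())

    def try_start(self, job_id, dvio):
        """Admit the job if its declared dvio fits; return True when admitted."""
        if dvio <= 0:
            raise ValueError("a job must request a positive dvio amount")
        if self.in_use + dvio > self.capacity:
            return False  # job stays queued until running jobs release units
        self.jobs[job_id] = dvio
        return True

    def finish(self, job_id):
        """Release the units held by a finished job."""
        self.jobs.pop(job_id, None)


vault = DiskVault("vault01", dvio_capacity=10)   # hypothetical capacity
started = [vault.try_start(f"job{i}", 4) for i in range(3)]
vault.finish("job0")
retry = vault.try_start("job2", 4)
```

The third job is held back because 4 + 4 + 4 exceeds the capacity of 10, and it is admitted once the first job finishes. The difficulty noted on the slide remains: the value each job should request has to be estimated, which is where load monitoring helps.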
12 Usage per discipline IO and data dominated by Nuclear Physics
13 File System implications of data intensive computing – local storage Local storage on batch nodes –Cheap storage (large and cheap hard drives) –Very good I/O performance –Limited to jobs running on the node –Diversity of the user population does not facilitate batch node sharing; users wary of Xrootd daemons –No redundancy, drive failure causes data loss –File catalog aids in job submission – SUMS does the rest
14 File System implications of data intensive computing - GPFS NERSC purchased GPFS software licenses for PDSF –Reliable (RAID underneath) –Good performance (striping) –Self-repairing: even after disengaging under load, it comes back on-line; compare with “NFS stale file handles” (had to be fixed by either an admin or a cron job) –Expensive PDSF hosts will host several GPFS file systems –7 already in place –~15TB/filesystem – not enough experience with GPFS on Linux
15 File System implications of data intensive computing – beta testing file system (open software version) testing –File system performed reasonably well under high load –support and maintenance manpower intensive Storage units from commercial vendors made available for beta testing –Support provided by vendors –Users get cutting edge, highly capable, storage appliances to use for extensive periods of time –Staff obliged to produce reports – additional workload (light) –Units too expensive to purchase – work related to data uploading –Affordable units from new companies – uncertainty of support continuity
16 Role of mass storage in data management Data intensive experiments require “smart backup” –Only $HOME, system and application areas are automatically backed up –PDSF storage media reliable, but not disaster-proof –Groups have allocations in mass storage to selectively store their data –Users have individual accounts in mass storage to back up their work Network bandwidth (10 Gb to HPSS) –large HPSS cache and large number of tape movers facilitate quick access to stored data –number of drives still an issue
18 Operating system selection with CHOS PDSF is a secondary computing facility for most of the user groups –not free to independently select operating system –tied to the Tier0 selection PDSF projects originated at various times (in the past or still to come) –Tier0s embraced different operating systems, evolution PDSF accommodates needs of diverse groups with CHOS –framework for concurrently running multiple Linux environments (distributions) on a single node –accomplished through a combination of the chroot system call, a Linux kernel module, and some additional utilities –can be configured so that users are transparently presented with their selected distribution on login
19 Operating system selection with CHOS (cont) Support for operating systems based on same kernel version. –RH7.2 –RH8 –RH9 –SL 3.0.2 Base system – SL 3.03 –provides security –More info about CHOS available at: http://www.nersc.gov/nusers/resources/PDSF/chos/faq.php CHOS protected PDSF from fragmentation of resources – Unique approach to multi-group support. Sharing possible even when diverse OS required.
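The chroot part of this approach can be sketched in a few lines. This is not CHOS code: the environment names, the root directories under /os and the helper functions are hypothetical, and the sketch leaves out the kernel module and utilities that let CHOS switch environments transparently for ordinary users. It only shows the basic idea of mapping a user's selection to an alternate root directory and starting a process inside it.

```python
import os

# Hypothetical table of installed environments: selection name -> root directory.
ENVIRONMENTS = {
    "rh72": "/os/rh72",
    "rh8": "/os/rh8",
    "rh9": "/os/rh9",
    "sl302": "/os/sl302",
}


def resolve_root(selection, environments=ENVIRONMENTS, default="/"):
    """Map a user's selection (e.g. the contents of a per-user config file)
    to a root directory, falling back to the base system."""
    name = (selection or "").strip()
    return environments.get(name, default)


def enter_environment(root, argv):
    """Replace the current process with `argv` running inside `root`.

    os.chroot needs root privileges, so this function is only defined here,
    never called; a real deployment would do this step in privileged code.
    """
    if root != "/":
        os.chroot(root)
        os.chdir("/")
    os.execvp(argv[0], argv)


selected = resolve_root("rh9\n")     # a user who picked the RH9 environment
fallback = resolve_root("unknown")   # unknown names fall back to the base system
```

Because every environment runs on the same kernel, only the user-space tree changes, which is consistent with the slide's note that the supported distributions are those based on the same kernel version.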
20 Who Has Used the Grid at NERSC PDSF pioneered introduction of Grid services at NERSC. Participation in the Grid3 project Mostly PDSF (Parallel Distributed Systems Facility) users, who analyze detector data and simulations: –STAR Detector Simulations and Data Analysis Studies the quark-gluon plasma and proton-proton collisions 631 collaborators from 52 project institutions 265 users at NERSC … –Simulations for the ALICE experiment at CERN Studies ion-ion collisions 19 NERSC users from 11 institutions –Simulations for the ATLAS experiment at CERN Studies fundamental particle processes 56 NERSC users from 17 institutions STAR Experiment Detector
21 Caveats - Grid usage thoughts … Most NERSC Users are not Using the Grid The Office of Science “Massively Parallel Processing” (MPP) user communities have not embraced the grid Even on the PDSF, only a few “production managers” use the grid; most users do not Site policy side effects: –ATLAS and CMS stopped using the grid at NERSC due to lack of support for group accounts –Difficult/tedious/confusing to get a Grid certificate –Lack of support at NERSC for Virtual Organizations One grid user’s opinion: instead of writing the middleware and troubleshooting just use a piece of paper to keep track of jobs and pftp for file transfers However, several STAR users have been testing the Grid for user analysis jobs, so interest may be growing.
22 STAR Grid Computing at NERSC Grid computing benefits to STAR: 1. Bulk data transfer RCF->NERSC with Storage Resource Management (SRM) technologies –SRM automates end-to-end transfers: increased throughput and reliability; less monitoring effort by data managers –Source/destination can be files on disk or in HPSS mass storage system –60 TB transferred in CY05 with automatic cataloging –Typical transfers are ~10k files, 5 days duration, 1 TB –Doubles STAR processing power since all data are available at two sites
23 STAR Grid Computing at NERSC (cont.) Grid computing benefits to STAR: 2. Grid-based job submission with STAR scheduler (SUMS) Production grid jobs are running daily from RCF to PDSF –SUMS XML job description -> –condor-g grid job submission -> –SGE submission to PDSF batch system Uses SRMs for input and output file transfers Handles catalog queries, job definitions, grid/local job submission, etc. Underlying technologies largely hidden from user
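The transfer figures on this slide imply a few derived numbers that are easy to work out. The short calculation below assumes decimal units (1 TB = 10^12 bytes) and a transfer spread evenly over the full 5 days; both are assumptions, so the results are rough averages rather than measured rates.

```python
TB = 10**12          # decimal terabyte, an assumption
SECONDS_PER_DAY = 86400


def average_rate_mb_per_s(total_bytes, days):
    """Average sustained rate in MB/s for a transfer lasting `days` days."""
    return total_bytes / (days * SECONDS_PER_DAY) / 10**6


typical_rate = average_rate_mb_per_s(1 * TB, 5)   # about 2.3 MB/s sustained
mean_file_mb = 1 * TB / 10_000 / 10**6            # about 100 MB per file
transfers_per_year = 60 * TB / (1 * TB)           # about 60 typical transfers in CY05
```

An average of a few MB/s over several days suggests that the value of SRM here lies less in peak speed than in unattended, reliable operation, which matches the "less monitoring effort" point on the slide.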
24 STAR Grid Computing at NERSC (cont.) Goal: use SUMS to run STAR user analysis and data mining jobs on OSG sites. Issues are: –Transparent packaging and distribution of STAR software on OSG non-STAR-dedicated sites –SRM services need to be deployed consistently at OSG sites (preferred) or deployed along with the jobs (how to do?) –Inconsistencies of inbound/outbound site policies –SUMS Generic interface adaptable to other VOs running on OSG offer community support
25 NERSC Contributions to the Grid myproxy.nersc.gov –Users don’t have to scp their certs to different sites –Safely stores credentials; uses SSL –Anyone can use it from anywhere –myproxy-init -s myproxy.nersc.gov –myproxy-get-delegation –Part of VDT and OSG software distribution Management of grid-map files –NERSC users put their certs into our NERSC Information Management system –They automatically get propagated to all NERSC resources garchive.nersc.gov –GSI authentication added to the HPSS pftp client and server –Users can log in to HPSS using their grid certs –Software contributed to the HPSS consortium
26 Online Certification Services (in development) Would allow users to use grid services without having to get a grid cert myproxy-logon -s myproxy.nersc.gov Generates a proxy cert on the fly Built on top of PAM and MyProxy Will use RADIUS server to authenticate users RADIUS is a protocol to securely send authentication and auditing information between sites Can authenticate with LDAP, One Time Password or Grid cert Could be used to federate sites
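The grid-map file management described above amounts to turning a central table of certificate subjects into the same mapping file on every resource. A grid-map file line has the form of a quoted certificate subject (DN) followed by a local username. The sketch below generates such lines from a dictionary; the function name, the usernames and the example DNs are hypothetical, and the way the NERSC Information Management system actually stores and propagates certificates is not shown in the source.

```python
def grid_mapfile_lines(cert_subjects):
    """Build grid-map file lines from a username -> [certificate DN, ...] table.

    Each output line is: "<certificate subject DN>" <local username>
    """
    lines = []
    for user in sorted(cert_subjects):
        for dn in cert_subjects[user]:
            lines.append(f'"{dn}" {user}')
    return lines


# Hypothetical registrations; a user may register more than one certificate.
registered = {
    "jdoe": ["/DC=org/DC=example/OU=People/CN=Jane Doe 1234"],
    "asmith": [
        "/DC=org/DC=example/OU=People/CN=Alex Smith 5678",
        "/DC=org/DC=example/OU=Services/CN=Alex Smith robot",
    ],
}
mapfile = grid_mapfile_lines(registered)
```

Generating the file from one source of truth and copying it to every resource is what spares users from registering a certificate separately on each system.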
27 Audit Trail for Group Accounts (proposed development) NERSC needs to trace back sessions and commands to individual users Some projects need to set up a production environment managed by multiple users (who can then jointly manage the production jobs and data) Build an environment that accepts multiple certs or multiple username/passwords for a single account Keep logs that can associate PIDs/UIDs with the actual user Provide audit trail that reconstructs the original authentication associated with the PID/UID
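The proposed audit trail can be illustrated with a toy model: record who authenticated for each login session on the shared account, let child processes inherit that identity, and look the identity up later by PID. This is a sketch of the idea only, not the NERSC design; the class, its methods, the account UID and the identities are all hypothetical.

```python
class AuditTrail:
    """Toy audit trail linking processes on a shared account to real users."""

    def __init__(self):
        self._owner = {}  # pid -> (identity, authentication method)
        self.log = []     # append-only list of (event, pid, uid, identity)

    def record_login(self, pid, uid, identity, method):
        """Called when someone authenticates to the group account."""
        self._owner[pid] = (identity, method)
        self.log.append(("login", pid, uid, identity))

    def record_fork(self, parent_pid, child_pid, uid):
        """Child processes inherit the identity of the session that spawned them."""
        owner = self._owner.get(parent_pid)
        if owner is not None:
            self._owner[child_pid] = owner
            self.log.append(("fork", child_pid, uid, owner[0]))

    def trace(self, pid):
        """Return the identity originally authenticated for this process, if known."""
        owner = self._owner.get(pid)
        return owner[0] if owner else None


GROUP_UID = 5000  # hypothetical UID of a shared production account
trail = AuditTrail()
trail.record_login(101, GROUP_UID, "/DC=org/DC=example/CN=Jane Doe", "grid cert")
trail.record_login(202, GROUP_UID, "asmith", "password")
trail.record_fork(101, 150, GROUP_UID)
who = trail.trace(150)
```

Both sessions run under the same UID, so the UID alone cannot distinguish them; the log keyed by PID is what lets a command be traced back to the certificate or password that started the session.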
28 Conclusions NERSC/PDSF is a fully resource sharing facility –Several storage solutions evaluated, lots of choices and some emerging trends (distributed file systems, IO balanced systems, …) –CPU shared based on financial contributions –Fully opportunistic (if not used, can be taken by others) –NERSC will base its deployment decisions on science and user driven requirements A lot of ongoing research in distributed computing technologies NERSC can contribute to STAR/OSG efforts: –Auditing and login tracing tools –Online certification services (integrate LDAP, One Time Passwords and Grid certs) –Testbed for OSG software on HPC architectures –User Support