1 Scientific Computing Resources Ian Bird – Computer Center Hall A Analysis Workshop December 11, 2001
2 Overview Current Resources –Recent evolution –Mass storage – HW & SW –Farm –Remote data access –Staffing levels Future Plans –Expansion/upgrades of current resources –Other computing – LQCD –Grid Computing: What is it? Should you care?
3 Jefferson Lab Scientific Computing Environment, November (diagram) JLAB Network Backbone: Gigabit Ethernet Switching Fabric; Batch Farm Cluster: 350 Linux nodes (400 MHz – 1 GHz), 10,000 SPECint95, managed by LSF + Java layer + web interface; Interactive Analysis: 2 Sun 450 (4-processor), 2 4-processor Intel/Linux; Lattice QCD Cluster: 40 Alpha/Linux (667 MHz), 256 Pentium 4 (Q2 FY02?), managed by PBS + Web portal; Unix, Linux, Windows desktops; bbftp service; Grid gateway; 16 TB Cache disk: SCSI + EIDE disk, RAID 0 on Linux servers; 2 STK silos, Redwood; 10 Solaris/Linux data movers w/ 300 GB stage; 10 TB work areas: SCSI disk – RAID 5; CUE General Services; JASMine managed Mass Storage Systems; Internet (ESNet: OC-3); 2 TB Farm Cache: SCSI – RAID 0 on Linux servers
4 JLAB Farm and Mass Storage Systems, November 2001 (diagram) Batch Farm – 350 processors: 175 dual nodes, each connected at 100 Mb to a 24-port switch with Gb uplink (8 switches); Foundry BigIron 8000 Switch: 256 Gb backplane, ~45/60 Gb ports in use; Site Router – CUE and general services; 2 STK silos, Redwood; 10 Solaris/Linux data movers, each w/ 300 GB stage & Gb uplink; CH-Router – incoming data from Halls A & C; Fiber Channel direct from CLAS; Cache disk farm: 20 Linux servers, each with Gb uplink, total 16 TB SCSI/IDE – RAID 0; Work disk farm: 4 Linux servers, each with Gb uplink, total 4 TB SCSI – RAID 5; Work disks: 4 MetaStor systems, each with 100 Mb uplink, total 5 TB SCSI – RAID 5
5 CPU Resources Farm –Upgraded this summer with 60 dual 1 GHz P III (4 cpu / 1 u rackmount) –Retired original 10 dual 300 MHz –Now 350 cpu (400, 450, 500, 750, 1000 MHz) ~11,000 SPECint95 –Deliver > 500,000 SI95-hrs / week Equivalent to 75 1 GHz cpu Interactive –Solaris: 2 E450 (4-proc) –Linux: 2 quad systems (4x450, 4x750MHz) –If required can use batch systems (via LSF) to add interactive CPU to these (Linux) front ends
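The "equivalent to 75 1 GHz cpu" figure on the slide above can be checked with a little arithmetic. The sketch below uses only the two numbers quoted on the slide; the per-CPU SPECint95 rating it prints is derived from them and is not a figure given in the talk.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours

delivered_si95_hours_per_week = 500_000  # from the slide ("> 500,000 SI95-hrs / week")
equivalent_1ghz_cpus = 75                # from the slide

# Average SPECint95 capacity that is kept busy around the clock.
sustained_si95 = delivered_si95_hours_per_week / HOURS_PER_WEEK

# Implied rating of a single 1 GHz CPU (derived here, not quoted on the slide).
implied_si95_per_cpu = sustained_si95 / equivalent_1ghz_cpus

print(f"sustained load: {sustained_si95:.0f} SI95")
print(f"implied rating per 1 GHz CPU: {implied_si95_per_cpu:.1f} SI95")
```

Read this way, the delivered work corresponds to a little under 3,000 SPECint95 busy continuously, out of the ~11,000 SPECint95 installed.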
6 Intel Linux Farm (photos) First purchases: 9 duals per 24-inch rack. Last summer: 16 duals (2u) + GB cache (8u) per 19-inch rack. Recently: 5 TB IDE cache disk (5 x 8u) per 19-inch rack
7 Tape storage Added 2nd silo this summer –Required move of room of equipment –Added drives (5 as part of new silo) –Current: 8 Redwood, plus 9840 and 9940 drives –Redwood: 50 GB at 10 MB/s (helical scan, single reel) –9840: 20 GB at 10 MB/s (linear, mid-load cassette (fast)) –9940: 60 GB at 10 MB/s (linear, single reel) –9840 & 9940 are very reliable –9840 & 9940 have upgrade paths that use same media »2nd generation – 100 ?? Add 10 more 9940 this FY (budget..?) Replace Redwoods (reduce to 1-2) –Requires copying 4500 tapes – started – budget for tape? »Reliability, end of support(!)
8 Disk storage Added cache space –For frequently used silo files, to reduce tape accesses –Now have 22 cache servers 4 dedicated to farm ~ 2 TB ~16 TB of cache space allocated to expts –Some bought and owned by groups Dual Linux systems, Gb network, ~ 1 TB disk, RAID 0 –9 SCSI systems –13 IDE systems »Performance approx equivalent –Good match cpu:network throughput:disk space –This is a model that will scale by a few factors, but probably not by 10 (but there is as yet no solution to that) Looking at distributed file systems for the future – to avoid NFS complications – GFS, etc., but no production level system yet. –Nb. Accessing data with jcache does not need NFS, and is fault tolerant Added work space –Added 4 systems to reduce load on fs3,4,5,6 (orig /work) –Dual Linux systems, Gb network, ~ 1 TB disk, SCSI RAID 5 –Performance on all systems is now good Problems – –Some issues with IBM 75 GB ATA drives, 3-ware IDE RAID cards, Linux kernels System is reasonably stable, but not yet perfect – but alternatives are not cost-effective
9 JASMine JASMine – Mass Storage system software Rationale – why write another MSS? –Had been using OSM Not scaleable, not supported, reached limit of sw, had to run 2 instances to get sufficient drive capacity Hidden from users by Tapeserver –Java layer that »Hid complexities of OSM installations »Implemented tape disk buffers (stage) »Provided get, put, managed cache (read copies of archived data) capabilities –Migration from OSM Production environment…. –Timescales driven by experiment schedules, need to add drive capacity –Retain user interface Replace osmcp function – tape to disk, drive and library management –Choices investigated Enstore, Castor, (HPSS) –Timescales, support, adaptability (missing functionality/philosophy – cache/stage) –Provide missing functions within Tapeserver environment, with clean-up and reworking: JASMine (JLAB Asynchronous Storage Manager)
10 Architecture JASMine –Written in Java For data movement, as fast as C code. JDBC makes using and changing databases easy. –Distributed Data Movers and Cache Managers –Scaleable to the foreseeable needs of the experiments –Provides scheduling – Optimizing file access requests User and group (and location dependent) priorities –Off-site cache or ftp servers for data exporting JASMine Cache Software –Stand-alone component – can act as a local or remote client, allows remote access to JASMine –Can be deployed to a collaborator to manage small disk system and as basis for coordinated data management between sites –Cache manager runs on each cache server. Hardware is not an issue. Need a JVM, network, and a disk to store files.
11 Software cont. MySQL database used by all servers. –Fast and reliable. –SQL Data Format –ANSI standard labels with extra information –Binary data –Support to read legacy OSM tapes cpio, no file labels Protocol for file transfers Writes to cache are never NFS Reads from cache may be NFS
13 JASMine Services Database –Stores metadata, which is also presented to users on an NFS filesystem as stubfiles –But could equally be presented as e.g. a web service, LDAP, … Do not need to access stubfiles – just need to know filenames –Tracks status and locations of all requests, files, volumes, drives, etc. Request Manager –Handles user requests and queries. Scheduler –Prioritizes user requests for tape access. priority = share / ( (num_a * ACTIVE_WEIGHT) + (num_c * COMPLETED_WEIGHT) ) –Host vs User shares, farm priorities Log Manager –Writes out log and error files and databases. –Sends out notices for failures. Library Manager –Mounts and dismounts tapes and handles other library-related tasks.
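The scheduler formula on this slide is easy to try out. The following is a minimal sketch, not JASMine's code: the weight values are hypothetical, and the handling of a user with no active or completed requests (which would otherwise divide by zero) is an assumption made here for illustration.

```python
# Hypothetical weights; the slide names the constants but gives no values.
ACTIVE_WEIGHT = 3.0
COMPLETED_WEIGHT = 1.0


def priority(share, num_active, num_completed):
    """priority = share / (num_a * ACTIVE_WEIGHT + num_c * COMPLETED_WEIGHT)."""
    load = num_active * ACTIVE_WEIGHT + num_completed * COMPLETED_WEIGHT
    if load == 0:
        # Assumption: a requester with no recent activity is served first.
        return float("inf")
    return share / load


# Two users with the same share: the one with more recent activity ranks lower.
requests = {
    "light_user": priority(share=10, num_active=1, num_completed=2),
    "heavy_user": priority(share=10, num_active=4, num_completed=8),
}
ordered = sorted(requests, key=requests.get, reverse=True)
print(ordered)
```

The effect is a fair-share scheme: a larger share raises priority, while requests already running or recently completed lower it, so heavy users drift down the queue.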
14 JASMine Services -2 Data Mover –Dispatcher Keeps track of available local resources and starts requests the local system can work on. –Cache Manager Manages a disk or disks for pre-staging data to and from tape. Sends and receives data to and from clients. –Volume Manager Manages tapes for availability. –Drive Manager Manages tape drives for usage.
15 User Access Jput –Put one or more files on tape Jget –Get one or more files from tape Jcache –Copies one or more files from tape to cache Jls –Get metadata for one or more files Jtstat –Status of the request queue Web interface –Query status and statistics for entire system
16 Web interface
18 Data Access to cache NFS –Directory of links points the way. –Mounted read-only by the farm. –Users can mount read-only on their desktop. Jcache –Java client. –Checks to see if files are on cache disks. –Will get/put files from/to cache disks. More efficient than NFS, avoids NFS hangs if server dies, etc., but users like NFS
19 Disk Cache Management Disk Pools are divided into groups –Tape staging. –Experiments. –Pre-staging for the batch farm. Management policy set per group –Cache – LRU files removed as needed. –Stage – Reference counting. –Explicit – manual addition and deletion. –Policies are pluggable – easy to add
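To make the "pluggable policy" idea concrete, here is a small sketch of two of the policies named above: LRU removal for cache groups and reference counting for stage groups. The class and method names are hypothetical and are not taken from JASMine.

```python
from collections import OrderedDict


class LRUPolicy:
    """Cache group: remove least recently used files when space is needed."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.files = OrderedDict()  # path -> size, least recently used first

    def touch(self, path):
        """Record an access so the file becomes the most recently used."""
        self.files.move_to_end(path)

    def add(self, path, size):
        """Add a file, evicting old files as needed; return the evicted paths."""
        if path in self.files:
            self.used -= self.files.pop(path)
        evicted = []
        while self.files and self.used + size > self.capacity:
            old_path, old_size = self.files.popitem(last=False)
            self.used -= old_size
            evicted.append(old_path)
        self.files[path] = size
        self.used += size
        return evicted


class RefCountPolicy:
    """Stage group: a file may be deleted only when nobody references it."""

    def __init__(self):
        self.refs = {}

    def acquire(self, path):
        self.refs[path] = self.refs.get(path, 0) + 1

    def release(self, path):
        """Drop one reference; return True when the file can be deleted."""
        self.refs[path] -= 1
        if self.refs[path] == 0:
            del self.refs[path]
            return True
        return False


pool = LRUPolicy(capacity_bytes=100)
pool.add("/cache/a", 40)
pool.add("/cache/b", 40)
pool.touch("/cache/a")            # "a" is now more recent than "b"
removed = pool.add("/cache/c", 40)
print(removed)
```

Because each policy is a separate object with a small interface, a disk pool group can be switched from one policy to another without touching the code that moves the data.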
20 Protocol for file moving Simple extensible protocol for file copies –Messages are Java serialized objects passed over streams –Bulk data transfer uses raw data transfer over TCP Protocol is synchronous – all calls block –Asynchrony & multiple requests by threading CRC32 checksums at every transfer Fairer than NFS Session may make many connections
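The "CRC32 checksums at every transfer" point can be illustrated with Python's standard `zlib.crc32`. This is a sketch of the general technique only: the chunk size and function names are assumptions, and the real protocol exchanges Java serialized objects rather than Python lists.

```python
import zlib

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size for the bulk transfer


def crc32_of_chunks(chunks):
    """Running CRC32 over a sequence of byte chunks."""
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def send(data):
    """Sender side: split the data and compute the checksum to send along."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    return chunks, crc32_of_chunks(chunks)


def receive(chunks, expected_crc):
    """Receiver side: recompute the checksum and reject a damaged transfer."""
    if crc32_of_chunks(chunks) != expected_crc:
        raise IOError("CRC32 mismatch: transfer corrupted")
    return b"".join(chunks)


payload = b"event data " * 20_000
chunks, checksum = send(payload)
assert receive(chunks, checksum) == payload
```

Computing the checksum incrementally means neither side has to hold the whole file in memory to verify it.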
21 Protocol for file moving Cache server extends the basic protocol –Add database hooks for cache –Add hooks for cache policies –Additional message types were added High throughput disk pool –Database shared by many servers –Any server in the pool can look up a file's location, but data transfer is always direct between the client and the node holding the file –Adding servers and disk to the pool increases throughput with no overhead and provides fault tolerance
22 Example: Get from cache cacheClient.getFile(/foo, halla); –send locate request to any server –receive locate reply –contact appropriate server –initiate direct xfer –Returns true on success Diagram (servers cache1–cache4, shared Database): Client (farm node) asks a cache server "Where is /foo?"; the reply from the Database lookup is "Cache3 has /foo"; the client sends "Get /foo" to cache3, which answers "Sending /foo"
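The locate-then-transfer sequence in this example can be simulated in a few lines. The sketch below is an in-memory toy with hypothetical class names: a dictionary stands in for the shared database, and method calls stand in for network messages. The second argument of the slide's `getFile` call is not explained in the talk and is left out here.

```python
class CacheServer:
    """One node of the disk pool; all nodes share the same catalog."""

    def __init__(self, name, catalog):
        self.name = name
        self.catalog = catalog  # shared "database": path -> holding server name
        self.files = {}         # path -> bytes held on this node

    def store(self, path, data):
        self.files[path] = data
        self.catalog[path] = self.name

    def locate(self, path):
        """Any server can answer a locate request from the shared catalog."""
        return self.catalog.get(path)

    def send(self, path):
        """Only the holding server transfers the data."""
        return self.files[path]


class CacheClient:
    def __init__(self, servers):
        self.servers = servers  # name -> CacheServer

    def get_file(self, path, dest):
        """Locate via any server, then fetch directly from the holder."""
        any_server = next(iter(self.servers.values()))
        holder = any_server.locate(path)
        if holder is None:
            return False
        dest[path] = self.servers[holder].send(path)
        return True


catalog = {}
servers = {n: CacheServer(n, catalog) for n in ("cache1", "cache2", "cache3", "cache4")}
servers["cache3"].store("/foo", b"contents of /foo")

client = CacheClient(servers)
local_disk = {}
ok = client.get_file("/foo", local_disk)
print(ok, local_disk["/foo"])
```

The point of the design is that the lookup is cheap and can go to any node, while the bulk data never passes through an intermediary.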
23 Example: simple put to cache putFile(/quux,halla, ); Diagram (servers cache1–cache4, shared Database): Client (data mover) asks "Where can I put /quux?"; the reply is "Cache4 has room"
24 Fault Tolerance Dead machines do not stop the system –Data Movers work independently Unfinished jobs will restart on another mover –Dead Cache Servers will only impact NFS clients System recognizes dead server and will re-cache file from tape Users who do not use NFS would never see a failure – just extended access time Exception handling for –Received timeouts –Refused connections –Broken connections –Complete garbage on connections
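The failure handling listed on this slide can be sketched as a simple failover loop. This illustrates the idea rather than JASMine's implementation: the function names are hypothetical, servers are modelled as plain callables, and `ValueError` stands in for "complete garbage on connections".

```python
# Failures that should trigger a retry elsewhere rather than an error to the user.
RECOVERABLE = (TimeoutError, ConnectionRefusedError, ConnectionResetError, ValueError)


def fetch_with_failover(path, servers, recache_from_tape):
    """Try each cache server in turn; fall back to re-caching from tape."""
    for fetch in servers:
        try:
            return fetch(path)
        except RECOVERABLE:
            continue  # treat the server as dead and move on
    # No server could deliver: slower, but the request still succeeds.
    return recache_from_tape(path)


def dead_server(path):
    raise ConnectionRefusedError("server is down")


def healthy_server(path):
    return b"cached:" + path.encode()


def tape(path):
    return b"from-tape:" + path.encode()


print(fetch_with_failover("/foo", [dead_server, healthy_server], tape))
print(fetch_with_failover("/foo", [dead_server, dead_server], tape))
```

In both calls the client gets its file; the only visible difference in the second case is the longer access time while the file comes back from tape.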
25 Authorization and Authentication Shared secret for each file transfer session –Session authorization by policy objects –Example: receive 5 files from Plug-in authenticators –Establish shared secret between client and server –No clear text passwords –Extend to be compatible with GSI
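One common way to use a shared secret without ever sending a clear-text password is a challenge-response exchange based on an HMAC. The sketch below shows that general technique with Python's standard library; it is an assumption for illustration and not a description of the authenticators JASMine actually plugs in.

```python
import hashlib
import hmac
import secrets


def new_challenge():
    """Server side: a fresh random challenge for each session."""
    return secrets.token_bytes(16)


def respond(shared_secret, challenge):
    """Client side: prove knowledge of the secret without transmitting it."""
    return hmac.new(shared_secret, challenge, hashlib.sha256).digest()


def verify(shared_secret, challenge, response):
    """Server side: recompute the expected response and compare safely."""
    expected = hmac.new(shared_secret, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)


session_secret = b"per-session shared secret"  # hypothetical value
challenge = new_challenge()
print(verify(session_secret, challenge, respond(session_secret, challenge)))
```

Only the challenge and the keyed digest cross the network, so an eavesdropper learns neither the secret nor anything reusable in a later session.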
26 JASMine Bulk Data Transfers Model supports parallel transfers –Many files at once, but not bbftp style But could replace stream class with a parallel stream –For bulk data transfer over WANs Firewall issues –Client initiates all connections
27 Architecture: Disk pool hardware SCSI Disk Servers –Dual Pentium III 650 (later 933)MHz CPUs –512 Mbytes 100MHz SDRAM ECC –ASUS P2B-D Motherboard –NetGear GA620 Gigabit Ethernet PCI NIC –Mylex eXtremeRAID 1100, 32 MBytes cache –Seagate ST150176LW (Qty. 8) - 50 GBytes Ultra2 SCSI in Hot Swap Disk Carriers –CalPC 8U Rack Mount Case with Redundant 400W Power Supplies IDE Disk Servers –Dual Pentium III 933MHz CPUs –512 Mbytes 133MHz SDRAM ECC –Intel STL2 or ASUS CUR-DLS Motherboard –NetGear GA620 or Intel PRO/1000 T Server Gigabit Ethernet PCI NIC –3ware Escalade 6800 –IBM DTLA (Qty. 12) - 75 GBytes Ultra ATA/100 in Hot Swap Disk Carriers –CalPC 8U Rack Mount Case with Redundant 400W Power Supplies
28 Cache Performance Matches network, disk I/O, and CPU performance with size of disk pool: –~800 GB, –2 x 850MHz –Gb Ethernet
29 Cache status
30 Performance – SCSI vs IDE Disk Array/File System – Ext2 –SCSI Disk Server: GByte disks in a RAID-0 stripe over 2 SCSI controllers; 68 MBytes/sec single disk write; 79 MBytes/sec burst for a single disk write; 52 MBytes/sec single disk read; 56 MBytes/sec burst for a single disk read –IDE Disk Server: GByte disks in a RAID-0 stripe; 64 MBytes/sec single disk write; 77 MBytes/sec burst for a single disk write; 48 MBytes/sec single disk read; 49 MBytes/sec burst for a single disk read
31 Performance NFS vs Jcache NFS v2 udp - 16 clients, rsize=8192 and wsize=8192 –Reads SCSI Disk Servers –7700 NFS ops/sec and 80% cpu utilization –11000 NFS ops/sec burst and 83% cpu utilization –32 MBytes/sec and 83% cpu utilization IDE Disk Servers –7700 NFS ops/sec and 72% cpu utilization –11000 NFS ops/sec burst and 92% cpu utilization –32 MBytes/sec and 72% cpu utilization Jcache - 16 clients –Reads SCSI Disk Servers –32 MBytes/sec and 100% cpu utilization IDE Disk Servers –32 MBytes/sec and 100% cpu utilization
32 JASMine system performance End-to-end performance i.e. tape load, copy to stage, network copy to client –Aggregate sustained performance of 50MB/s is regularly observed in production –During stress tests, up to 120 MB/s was sustained for several hours A data mover with 2 drives can handle ~15MB/s (disk contention is the limit) –Expect current system should handle 150MB/s and is scaleable by adding data movers & drives –N.B. this is performance to a network client! Data handling –Currently the system regularly moves 2-3 TB per day total ~6000 files per day, ~2000 requests
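The throughput figures on this slide fit together arithmetically. The sketch below uses the 10 data movers shown in the system diagrams and the ~15 MB/s per mover quoted here, and converts the daily volume into an average rate, taking 1 TB as 10^6 MB (an assumption about the units).

```python
SECONDS_PER_DAY = 86_400
MB_PER_TB = 1_000_000  # decimal units assumed


def average_mb_per_s(tb_per_day):
    """Average rate implied by a daily data volume."""
    return tb_per_day * MB_PER_TB / SECONDS_PER_DAY


data_movers = 10          # from the system diagrams
mb_per_s_per_mover = 15   # from this slide (2 drives, disk contention limited)
aggregate_capacity = data_movers * mb_per_s_per_mover

low = average_mb_per_s(2)
high = average_mb_per_s(3)

print(f"expected capacity: {aggregate_capacity} MB/s")
print(f"2-3 TB/day averages {low:.1f} to {high:.1f} MB/s")
```

So the daily volume corresponds to an average well below the 50 MB/s sustained rate seen in production, and the 150 MB/s expectation is simply ten movers at about 15 MB/s each.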
35 JASMine performance
36 Tape migration Begin migration of 5000 Redwood tapes to 9940 –Procedure written –Uses any/all available drives –Use staging to allow re-packing of tapes –Expect will last 9-12 months
37 Typical Data Flows (diagram) Batch Farm Cluster: 350 Linux nodes (400 MHz – 1 GHz), 10,000 SPECint95, managed by LSF + Java layer + web interface; 10 TB work areas: SCSI disk – RAID 5; 16 TB Cache disk: SCSI + EIDE disk, RAID 0 on Linux servers; Raw Data < 10 MB/s over Gigabit Ethernet (Halls A & C); Raw Data > 20 MB/s over Fiber channel (Hall B)
38 How to make optimal use of the resources Plan ahead! As a group: –Organize data sets in advance (~week) and use the cache disks for their intended purpose Hold frequently used data to reduce tape access –In a high data rate environment no other strategy works When running farm productions –Use jsub to submit many jobs in one command – as it was designed Optimizes tape accesses –Gather output files together on work disks and make a single jput for a complete tape's worth of data
39 Remote data access Tape copying is deprecated –Expensive, time consuming (for you and us), and inefficient –We have an OC-3 (155 Mbps) connection that is under-utilized; filling it will get us upgraded to OC-12 (622 Mbps) At the moment we often have to coordinate with ESnet and peers to ensure a high-bandwidth path, but this is improving as Grid development continues Use network copies –bbftp service Parallel, secure ftp – optimizes use of WAN bandwidth Future –Remote jcache Cache manager can be deployed remotely – demonstration Feb 02. –Remote silo access, policy-based (unattended) data migration –GridFTP, bbftp, bbcp Parallel, secure ftp (or ftp-like) –As part of a Grid infrastructure PKI authentication mechanism
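To get a feel for what the two link speeds mean for network copies, the sketch below computes the ideal time to move one terabyte. It assumes the full line rate is available and ignores protocol overhead, so real transfers will take longer.

```python
def transfer_hours(volume_tb, link_mbps):
    """Ideal transfer time in hours at full line rate (no protocol overhead)."""
    total_bytes = volume_tb * 1e12
    bytes_per_second = link_mbps * 1e6 / 8
    return total_bytes / bytes_per_second / 3600


for name, mbps in (("OC-3", 155), ("OC-12", 622)):
    print(f"{name} ({mbps} Mbps): {transfer_hours(1, mbps):.1f} hours per TB")
```

At OC-3 rates a terabyte is more than half a day of continuous transfer even in the best case, which is why parallel tools such as bbftp that keep the link full matter.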
40 (Data-) Grid Computing
41 Particle Physics Data Grid Collaboratory Pilot Who we are: Four leading Grid Computer Science Projects and Six international High Energy and Nuclear Physics Collaborations What we do: Develop and deploy Grid Services for our Experiment Collaborators and Promote and provide common Grid software and standards The problem at hand today: Petabytes of storage, Teraops/s of computing Thousands of users, Hundreds of institutions, 10+ years of analysis ahead
42 PPDG Experiments ATLAS - a Toroidal LHC ApparatuS at CERN Runs 2006 on Goals: TeV physics - the Higgs and the origin of mass … BaBar - at the Stanford Linear Accelerator Center Running Now Goals: study CP violation and more CMS - the Compact Muon Solenoid detector at CERN Runs 2006 on Goals: TeV physics - the Higgs and the origin of mass … D0 – at the D0 colliding beam interaction region at Fermilab Runs Soon Goals: learn more about the top quark, supersymmetry, and the Higgs STAR - Solenoidal Tracker At RHIC at BNL Running Now Goals: quark-gluon plasma … Thomas Jefferson National Laboratory Running Now Goals: understanding the nucleus using electron beams …
43 PPDG Computer Science Groups Condor – develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing on large collections of computing resources with distributed ownership. Globus - developing fundamental technologies needed to build persistent environments that enable software applications to integrate instruments, displays, computational and information resources that are managed by diverse organizations in widespread locations SDM - Scientific Data Management Research Group – optimized and standardized access to storage systems Storage Resource Broker - client-server middleware that provides a uniform interface for connecting to heterogeneous data resources over a network and cataloging/accessing replicated data sets.
44 Delivery of End-to-End Applications & Integrated Production Systems to allow thousands of physicists to share data & computing resources for scientific processing and analyses Operators & Users Resources: Computers, Storage, Networks PPDG Focus: - Robust Data Replication - Intelligent Job Placement and Scheduling - Management of Storage Resources - Monitoring and Information of Global Services Relies on Grid infrastructure: - Security & Policy - High Speed Data Transfer - Network management
45 Project Activities, End-to-End Applications and Cross-Cut Pilots Project Activities are focused Experiment – Computer Science Collaborative developments. Replicated data sets for science analysis – BaBar, CMS, STAR Distributed Monte Carlo production services – ATLAS, D0, CMS Common storage management and interfaces – STAR, JLAB End-to-End Applications used in Experiment data handling systems to give real-world requirements, testing and feedback. Error reporting and response Fault tolerant integration of complex components Cross-Cut Pilots for common services and policies Certificate Authority policy and authentication File transfer standards and protocols Resource Monitoring – networks, computers, storage.
46 Year Milestones (1) Align milestones to Experiment data challenges: –ATLAS – production distributed data service – 6/1/02 –BaBar – analysis across partitioned dataset storage – 5/1/02 –CMS – Distributed simulation production – 1/1/02 –D0 – distributed analyses across multiple workgroup clusters – 4/1/02 –STAR – automated dataset replication – 12/1/01 –JLAB – policy driven file migration – 2/1/02
47 Year Milestones Common milestones with EDG: GDMP – robust file replication layer – Joint Project with EDG Work Package (WP) 2 (Data Access) Support of Project Month (PM) 9 WP6 TestBed Milestone. Will participate in integration fest at CERN - 10/1/01 Collaborate on PM21 design for WP2 - 1/1/02 Proposed WP8 Application tests using PM9 testbed – 3/1/02 Collaboration with GriPhyN: SC2001 demos will use common resources, infrastructure and presentations – 11/16/01 Common, GriPhyN-led grid architecture Joint work on monitoring proposed
48 Year ~0.5-1 Cross-cuts Grid File Replication Services used by >2 experiments: –GridFTP – production releases Integrate with D0-SAM, STAR replication Interfaced through SRB for BaBar, JLAB Layered use by GDMP for CMS, ATLAS –SRB and Globus Replication Services Include robustness features Common catalog features and API –GDMP/Data Access layer continues to be shared between EDG and PPDG. Distributed Job Scheduling and Management used by >1 experiment: Condor-G, DAGman, Grid-Scheduler for D0-SAM, CMS Job specification language interfaces to distributed schedulers – D0-SAM, CMS, JLAB Storage Resource Interface and Management Consensus on API between EDG, SRM, and PPDG Disk cache management integrated with data replication services
49 Year ~1 other goals: Transatlantic Application Demonstrators : –BaBar data replication between SLAC and IN2P3 –D0 Monte Carlo Job Execution between Fermilab and NIKHEF –CMS & ATLAS simulation production between Europe/US Certificate exchange and authorization. –DOE Science Grid as CA? Robust data replication. –fault tolerant –between heterogeneous storage resources. Monitoring Services –MDS2 (Metacomputing Directory Service)? –common framework –network, compute and storage information made available to scheduling and resource management.
50 PPDG activities as part of the Global Grid Community Coordination with other Grid Projects in our field: GriPhyN – Grid for Physics Network European DataGrid Storage Resource Management collaboratory HENP Data Grid Coordination Committee Participation in Experiment and Grid deployments in our field: ATLAS, BaBar, CMS, D0, STAR, JLAB experiment data handling systems iVDGL/DataTAG – International Virtual Data Grid Laboratory Use DTF computational facilities? Active in Standards Committees: Internet2 HENP Working Group Global Grid Forum
51 Staffing Levels We are stretched thin –But compared with other labs with similar data volumes we are efficient Systems support group: vacant Farms, MSS development: 2 HW support/ Networks: 3.7 Telecom: 2.3 Security: 2 User services: 3 MIS, Database support: 8 Support for Engineering: 1 –We cannot do as much as we would like
52 Future (FY02) Removing Redwoods is a priority –Copying tapes, replacing drives w/ 9940s Modest farm upgrades – replace older CPU as budget allows –Improve interactive systems Add more /work, /cache Grid developments: –Visible as efficient WAN data replication services After FY02 –Global filesystems – to supersede NFS –10 Gb Ethernet –Disk vs. tape? Improved tape densities, data rates We welcome (coordinated) input as to what would be most useful for your physics needs