
1 DOSAR Workshop VII, April 2, 2009: Louisiana Tech Site Report. Michael S. Bryant, Systems Manager, CAPS/Physics, Louisiana Tech University. www.dosar.org

2 Computing in Louisiana: Louisiana Tech University and LONI

3 Researchers at Louisiana Tech
High Energy Physics and Grid computing
▫ Dr. Dick Greenwood, Dr. Lee Sawyer, Dr. Markus Wobisch
▫ Michael Bryant (staff)
High Availability and High Performance Computing
▫ Dr. Chokchai (Box) Leangsuksun
▫ Thanadech “Noon” Thanakornworakij (Ph.D. student)
LONI Institute Faculty
▫ Dr. Abdelkader Baggag (Computer Science)
▫ Dr. Dentcho Genov (Physics, Electrical Engineering)

4 High Performance Computing Initiative (HPCI)
HPCI is a campus-wide initiative to promote HPC and enable a local R&D community at Louisiana Tech
▫ Started by Dr. Chokchai (Box) Leangsuksun in 2007
Provides a high-performance computational infrastructure for local researchers that supports:
▫ GPGPU and PS3 computing
▫ Highly parallel and memory-intensive HPC applications
Sponsored by Intel equipment donations and University Research funding

5 Local Resources at Louisiana Tech
The HPCI infrastructure consists of three primary clusters:
▫ Intel 32-bit Xeon cluster (Azul)
- 38 nodes (76 CPUs); highly available dual head nodes using HA-OSCAR
▫ Sony PlayStation 3 cluster
- 25 nodes (8 cores/SPEs per PS3 = 200 cores)
▫ Intel 64-bit Itanium 2 (IA-64) cluster
- 7 nodes (14 CPUs); a processor usually found in high-end HPC applications
Local LONI computing resources:
▫ Dell 5 TF Intel Linux cluster (Painter)
- 128 nodes (512 CPUs total)
▫ IBM Power5 AIX cluster (Bluedawg)
- 13 nodes (104 CPUs)

6 Louisiana Optical Network Initiative (LONI)
Over 85 teraflops of computational capacity
Around 250 TB of disk storage and 400 TB of tape
A 40 Gb/s fiber-optic network connected to the National LambdaRail (10 Gb/s) and Internet2
Provides 12 high-performance computing clusters around the state (10 of which are online)
"Louisiana is fast becoming a leader in the knowledge economy. Through LONI, researchers have access to one of the most advanced optical networks in the country, along with the most powerful distributed supercomputing resources available to any academic community." - http://www.loni.org

7 LONI Computing Resources
1 x Dell 50 TF Intel Linux cluster (Queen Bee)
▫ 668 compute nodes (5,344 CPUs), RHEL4
- Two 2.33 GHz quad-core Intel Xeon 64-bit processors per node
- 8 GB RAM per node (1 GB/core)
▫ 192 TB Lustre storage
▫ 23rd on the Top500 list in June 2007
▫ Half of Queen Bee's computational cycles are contributed to TeraGrid
6 x Dell 5 TF Intel Linux clusters housed at 6 LONI sites
▫ 128 compute nodes (512 CPUs), RHEL4
- Two dual-core 2.33 GHz Xeon 64-bit processors per node
- 4 GB RAM per node (1 GB/core)
▫ 12 TB Lustre storage
5 x IBM Power5 AIX supercomputers housed at 5 LONI sites
▫ 13 nodes (104 CPUs), AIX 5.3
- Eight 1.9 GHz IBM Power5 processors per node
- 16 GB RAM per node (2 GB/processor)
(Photos: IBM Power5 and Dell clusters)

8 LONI Cluster at Louisiana Tech
Painter: Dell Linux cluster
▫ 4.77 teraflops peak performance
▫ Red Hat Enterprise Linux 4
▫ 10 Gb/s InfiniBand network interconnect
▫ Located in the new Data Replication Center
▫ Named in honor of Jack Painter, who was instrumental in bringing Tech's first computer (an LGP-30) to campus in the 1950s
Pictures follow…

9 PetaShare (left) and Painter (right)

10 With the lights off…

11 Front and back of Painter

12 Accessing Resources on the Grid: LONI and the Open Science Grid

13 OSG Compute Elements
LONI_OSG1
▫ Official LONI CE (osg1.loni.org), located at LSU
▫ OSG 0.8.0 production site, managed by LONI staff
▫ Connected to the "Eric" cluster
- Opportunistic PBS queue
- 64 CPUs out of 512 CPUs (the 16 nodes are shared with other PBS queues)
LONI_LTU (not active)
▫ LaTech CE (ce1.grid.latech.edu), located at Louisiana Tech
▫ OSG 1.0 production site, managed by LaTech staff
▫ Connected to the "Painter" cluster
- Opportunistic PBS queue
- Highly available; successor to LTU_OSG (caps10)
(A minimal job-submission sketch follows.)
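Since both CEs hand work to an opportunistic PBS queue, the sketch below shows how a short test job might be submitted to such a queue from Python. The queue name "osg" and the resource requests are illustrative assumptions, not the actual LONI or LaTech queue configuration.

```python
#!/usr/bin/env python3
"""Sketch: submit a short test job to an opportunistic PBS queue.

The queue name "osg" and the resource requests below are made-up
examples, not the actual LONI/LaTech queue settings."""
import subprocess
import tempfile

JOB_SCRIPT = """#!/bin/sh
#PBS -q osg
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:05:00
echo "Hello from $(hostname)"
"""

def submit_test_job():
    # Write the job script to a temporary file and hand it to qsub.
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
        f.write(JOB_SCRIPT)
        path = f.name
    # qsub prints the new job identifier (e.g. "12345.headnode") on success.
    job_id = subprocess.check_output(["qsub", path], text=True).strip()
    print("Submitted job:", job_id)

if __name__ == "__main__":
    submit_test_job()
```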

14 Current Status of LONI_OSG1
Installed OSG 0.8.0 at the end of February 2008
Running DZero jobs steadily
Need to set up local storage to increase SAMGrid job efficiency
▫ PetaShare with BeStMan in gateway mode
▫ Waiting to test at LTU before deploying at LSU
(Plot: weekly MC production)

15 DZero Production in Louisiana
In roughly four years (2004-2008), we produced 10.5 million events.
In just one year with LONI resources, we have produced 5.97 million events.
▫ Note: caps10 (LTU_OSG) is included, but its impact is minimal
(Plot: cumulative MC production)

16 Current Status of LONI_LTU
Installed Debian 5.0 (lenny) on two old compute nodes with:
▫ Xen 3.2.1
- Virtualization hypervisor, or virtual machine monitor (VMM)
▫ DRBD 0.8 (Distributed Replicated Block Device)
- Think "network RAID 1"
- Allows an active/active setup with a cluster filesystem (GFS or OCFS)
- ext3 can only be used in an active/passive setup, unless managed by the Heartbeat CRM (which maintains an active/passive setup)
▫ Heartbeat 2.1
- Part of the Linux-HA project
- Node-failure detection and failover/failback software
(A conceptual failover sketch follows.)
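To make the division of labor between these pieces concrete, here is a conceptual Python sketch of the failover step a Heartbeat-style CRM performs when the peer node dies: promote the local DRBD resource, then start the Xen guest on the survivor. This is not the actual Heartbeat configuration used on LONI_LTU; the DRBD resource name "r0" and the domU config path are hypothetical.

```python
#!/usr/bin/env python3
"""Conceptual sketch of the failover step a Heartbeat-style CRM performs.

This is NOT the Heartbeat configuration used on LONI_LTU; it only
illustrates the sequence: detect peer failure, promote the local DRBD
resource, then start the Xen guest on the surviving node. The resource
name "r0" and the domU config path are hypothetical examples."""
import subprocess
import time

DRBD_RESOURCE = "r0"                   # hypothetical DRBD resource name
XEN_DOMU_CFG = "/etc/xen/osg-ce.cfg"   # hypothetical domU config file
HEARTBEAT_TIMEOUT = 10                 # seconds without a heartbeat => peer presumed dead

def peer_is_alive(last_heartbeat):
    """Heartbeat's membership layer does this for real; this stands in for it."""
    return (time.time() - last_heartbeat) < HEARTBEAT_TIMEOUT

def take_over():
    # Promote the local copy of the replicated block device to primary...
    subprocess.check_call(["drbdadm", "primary", DRBD_RESOURCE])
    # ...then boot the guest that was running on the failed node (Xen 3.x "xm").
    subprocess.check_call(["xm", "create", XEN_DOMU_CFG])

def monitor(get_last_heartbeat):
    """Poll the peer's last heartbeat timestamp and fail over when it goes stale."""
    while True:
        if not peer_is_alive(get_last_heartbeat()):
            take_over()
            break
        time.sleep(1)
```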

17 Xen High Availability Architecture (diagram: Xen dom0/domU hosts layered over DRBD and Heartbeat in an active/active configuration)

18 HA Test Results
Tests of Xen live migration of VMs running on a DRBD partition were very successful:
▫ The SSH connection was not lost during migration
▫ Ping round-trip times rose, but we noticed only 1-2% packet loss
▫ The complete state of the system was moved
Due to the lack of a proper fencing device, a split-brain situation was not tested. Split-brain occurs when both nodes think the other has failed and both start the same resources.
▫ A fencing mechanism ensures only one node is online by powering off (or rebooting) the failed node.
▫ HA clusters without a fencing mechanism are NOT recommended for production use.
We plan to test with a real CE soon.
(A small measurement sketch follows.)
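A rough sketch of how such a migration test could be instrumented is shown below: ping the guest while it is live-migrated between the two dom0 hosts, then report ping's packet-loss and RTT summary. The guest name, destination host, and probe count are assumptions, not the actual test parameters.

```python
#!/usr/bin/env python3
"""Sketch of the live-migration test described above: ping the guest while
it is migrated between dom0 hosts, then report packet loss and RTT.

The guest name, destination host, and ping count are illustrative
assumptions, not the actual LONI_LTU test parameters."""
import subprocess

GUEST = "osg-ce"        # hypothetical Xen domU name (also resolvable hostname)
DEST_HOST = "node2"     # hypothetical destination dom0
PING_COUNT = 60         # one probe per second for about a minute

def test_live_migration():
    # Start pinging the guest in the background while the migration runs.
    pinger = subprocess.Popen(
        ["ping", "-c", str(PING_COUNT), GUEST],
        stdout=subprocess.PIPE, text=True)

    # Kick off the live migration (Xen 3.x syntax: xm migrate --live <domain> <dest>).
    subprocess.check_call(["xm", "migrate", "--live", GUEST, DEST_HOST])

    # Wait for ping to finish and print its summary lines (packet loss, rtt stats).
    output, _ = pinger.communicate()
    for line in output.splitlines()[-2:]:
        print(line)

if __name__ == "__main__":
    test_live_migration()
```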

19 Building a Tier3 Grid Services Site: a robust, highly available grid infrastructure

20 Tier3 Grid Services (T3gs)
We are focused on building a robust, highly available grid infrastructure at Louisiana Tech for USATLAS computing and analysis:
▫ OSG Compute Element (grid gateway/head node)
▫ GUMS for grid authentication/authorization
▫ DQ2 Site Services for ATLAS data management
▫ Load-balanced MySQL servers for GUMS and DQ2
Dedicated domain: grid.latech.edu

Service             Hostname                HA / LB
OSG CE              ce1.grid.latech.edu     Yes / No
GUMS                gums.grid.latech.edu    Yes / Yes
DQ2 Site Services   dq2.grid.latech.edu     Yes / No
MySQL               t3db.grid.latech.edu    Yes / Yes

(A simple availability-probe sketch for these endpoints follows.)
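As a companion to the HA/LB column above, the sketch below shows the kind of crude availability probe a load balancer or monitoring script might run against these hosts. The port numbers are assumed service defaults (GRAM gatekeeper 2119, GUMS on 8443, MySQL 3306), not values taken from the slide.

```python
#!/usr/bin/env python3
"""Minimal availability probe for the T3gs service endpoints listed above.

A plain TCP connect is a crude stand-in for the real health checks a load
balancer or Heartbeat resource agent would use; the port numbers are
assumed defaults (GRAM gatekeeper 2119, GUMS 8443, MySQL 3306), not values
taken from the site configuration."""
import socket

ENDPOINTS = {
    "OSG CE": ("ce1.grid.latech.edu", 2119),
    "GUMS":   ("gums.grid.latech.edu", 8443),
    "MySQL":  ("t3db.grid.latech.edu", 3306),
}

def probe(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, (host, port) in ENDPOINTS.items():
        status = "up" if probe(host, port) else "DOWN"
        print(f"{name:8s} {host}:{port} is {status}")
```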

21 New Hardware for T3gs
Our old compute nodes are limited by memory (2 GB), hard disks (ATA-100), and network connectivity (100 Mb)
We hope to purchase:
▫ 2 x Dell PowerEdge 2950 (or similar PowerEdge servers)
- Used by FermiGrid and many others
▫ External storage
- Still investigating (vs. DRBD)
- Will store the virtual machines and the OSG installation, which will be exported to the LONI cluster

22 Looking Ahead: Storage Elements and PetaShare

23 OSG Storage Elements
We have plenty of storage available through PetaShare but no way to access it on the grid.
▫ This is the most challenging component because it involves multiple groups (LONI, LSU/CCT, PetaShare, LaTech)
Our plan is to install BeStMan on the Painter I/O and PetaShare server, where the Lustre filesystem for PetaShare is assembled.
▫ BeStMan would then run in its gateway mode
Alternatively, we could develop an iRODS/SRB interface for BeStMan, which may happen later anyway.

24 PetaShare Storage
PetaShare is "a distributed data archival, analysis and visualization cyberinfrastructure for data-intensive collaborative research." - http://www.petashare.org
▫ Tevfik Kosar (LSU/CCT), PI of the NSF grant
Provides 200 TB of disk and 400 TB of tape storage around the state of Louisiana
▫ Employs iRODS (Integrated Rule-Oriented Data System), the successor to SRB
An initial 10 TB at each OSG site for USATLAS
More details tomorrow in Tevfik Kosar's talk…
(A small data-movement sketch follows.)
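For orientation, here is a minimal sketch of what data movement into an iRODS-backed store like PetaShare looks like when driven from Python. It assumes the iRODS client i-commands (ils, iput) are installed and that the user has already authenticated with iinit; the collection path is a made-up example, not an actual PetaShare path.

```python
#!/usr/bin/env python3
"""Sketch of basic data movement into an iRODS-backed store such as PetaShare.

Assumes the iRODS client i-commands (ils, iput) are installed and that the
user has already authenticated with iinit; the collection path below is a
made-up example, not an actual PetaShare path."""
import subprocess

COLLECTION = "/petashareZone/home/usatlas"   # hypothetical iRODS collection

def upload(local_file, collection=COLLECTION):
    """Copy a local file into the iRODS collection with iput."""
    subprocess.check_call(["iput", local_file, collection])

def listing(collection=COLLECTION):
    """Return the iRODS listing of the collection (ils output)."""
    return subprocess.check_output(["ils", collection], text=True)

if __name__ == "__main__":
    upload("mc_events.tar.gz")   # example file name
    print(listing())
```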

25 OSG Roadmap: Establish three LONI OSG CEs around the state (diagram)

26 Closing Remarks
In the last year, we've produced nearly 6 million DZero MC events on LONI_OSG1 at LSU.
We have expanded our computing resources at LaTech with Painter and HPCI's compute clusters.
▫ In fact, with Painter alone we have over 18 times more CPUs than last year (28 -> 512).
We look forward to becoming a full Tier3 Grid Services site in early to mid summer.

27 Questions / Comments?

