October 27, 2015 Atlas Monitoring Infrastructure in Grid Environment Richard Baker Dantong Yu Brookhaven National Lab.

What needs to be monitored: Linux Farm Monitoring
Description:
– 800 Linux nodes.
– Advertise farm information for grid-level scheduling.
– Performance data must be summarized before it is advertised to the Grid.
Performance events required:
– Configuration information.
– Status information: CPU load (5, 10, and 15 minutes), memory load, disk load, and network load.
Example usage: a resource broker might query the availability of Linux farm resources in order to plan the efficient execution of tasks.

More: Network Monitoring
Description:
– 8 USATLAS testbeds.
– Publish the connectivity of these testbeds and monitor the health of the USATLAS network.
– Archived performance data can be used to predict network behavior, so that a user can choose the source and destination for file replication.
Performance events required:
– Bandwidth, delay (round-trip time), traceroute.
Requirements: interface, overhead, scalability, security, archive, consistency.
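One way archived measurements could guide the choice of a replication source is sketched below in Python. The record layout (source, destination, Mbit/s), the site names, and the mean-bandwidth heuristic are all assumptions made for illustration; the slides do not describe the actual prediction method.

```python
from collections import defaultdict
from statistics import mean


def best_source(archive, destination):
    """Return the source with the highest mean archived bandwidth to `destination`.

    `archive` is an iterable of (source, destination, mbps) tuples; this
    layout is hypothetical. Returns None if nothing was measured for
    the requested destination.
    """
    samples = defaultdict(list)
    for src, dst, mbps in archive:
        if dst == destination:
            samples[src].append(mbps)
    if not samples:
        return None
    return max(samples, key=lambda src: mean(samples[src]))


# Hypothetical archived measurements between made-up sites.
archive = [
    ("siteA", "siteC", 40.0),
    ("siteA", "siteC", 60.0),
    ("siteB", "siteC", 80.0),
    ("siteB", "siteC", 70.0),
    ("siteB", "siteA", 10.0),
]
print(best_source(archive, "siteC"))  # siteB (mean 75 vs. 50 for siteA)
```

A plain mean ignores how old a measurement is; a real predictor would probably weight recent samples more heavily.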

Monitoring System Components
Four-tier structure:
– Sensors. Implementation: Unix command top, /proc, and LSF host load.
– Archive system (database system). Implementation: unixODBC + MyODBC + MySQL database.
– Information providers: Globus 2.0 Beta, MDS 2.1.
– Front-end browsing system: GridView (Grid visualization tool developed at UTA).
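As an illustration of the sensor tier, the following Python sketch parses the Linux /proc/loadavg file, which reports load averages over 1, 5, and 15 minutes. It is not the sensor used in the framework; the function names and the returned dictionary keys are assumptions.

```python
def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into a dictionary.

    The file holds five fields: the 1-, 5-, and 15-minute load averages,
    running/total scheduling entities, and the most recently created PID.
    """
    fields = text.split()
    load1, load5, load15 = (float(x) for x in fields[:3])
    running, total = (int(x) for x in fields[3].split("/"))
    return {
        "load1": load1,
        "load5": load5,
        "load15": load15,
        "running": running,
        "total": total,
    }


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the load averages of the local Linux host."""
    with open(path) as handle:
        return parse_loadavg(handle.read())


# Parsing a sample line keeps the example runnable on non-Linux hosts.
sample = "0.42 0.36 0.30 2/345 12345"
print(parse_loadavg(sample)["load5"])  # 0.36
```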

ATLAS Monitoring Framework
[Figure: architecture diagram. Recoverable labels: LSF Grid cluster (LSF Server 1, LSF Server 2); Gatekeeper; Job manager; Information Provider (GRIS) with top, LSF load, and NSW sensors; Network Sensor (Iperf, GridFtp) server; HPSS sensor; Monitoring Database (ODBC + MySQL or Oracle); data collectors; DB information providers (row to object class); Information Providers (GRIS) registering with the Aggregate Service Index (GIIS); Grid-View (UTA web server); grid-info-search.]

Advantages
– The information provider caches the newest values from the MySQL database.
– Non-intrusiveness: the information provider can eliminate users' random accesses to the database server.
– Scalability can be significantly increased: 800 Linux nodes and the network connectivity of eight USATLAS testbeds are being monitored.
– Flexibility: independent of the sensors. Many sensors can easily be plugged in as long as they have a well-defined protocol and API.
– The archive system is independent of the underlying database: it can be an RDBMS (Oracle, MySQL, Sybase, Informix), flat files, or Objectivity, as long as an ODBC driver is available.
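The caching behavior described above can be sketched in Python as a time-to-live cache in front of a database query. This is a minimal illustration, not the MDS implementation; the class name, the `fetch` callback, and the injectable clock are assumptions introduced for the example.

```python
import time


class CachedValue:
    """Serve a cached value and refetch it only after `ttl_seconds` have passed."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch        # hypothetical callback that queries the database
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be shown without waiting
        self._value = None
        self._expires = None

    def get(self):
        now = self._clock()
        if self._expires is None or now >= self._expires:
            self._value = self._fetch()
            self._expires = now + self._ttl
        return self._value


# Demonstration with a fake clock and a counter standing in for the database.
queries = []
fake_now = [0.0]


def query_database():
    queries.append(fake_now[0])
    return {"average_load": 0.7}


cache = CachedValue(query_database, ttl_seconds=600, clock=lambda: fake_now[0])
cache.get()
cache.get()           # served from the cache, no second query
fake_now[0] = 600.0
cache.get()           # entry expired, so the database is queried again
print(len(queries))   # 2
```

However many clients search the provider within one cache lifetime, the database sees at most one query per lifetime.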

Level of Farm Monitoring
– The Linux farm is divided into subclusters based on site policy, experiment, OS and version, and CPU speed.
– A subcluster contains hosts with the same configuration.
– The BNL ATLAS farm is partitioned into four subclusters: CPU200MHz, CPU400MHz, CPU700MHz, and CPU1GHz.
– The status information of a subcluster is summarized from all nodes in that subcluster.
– The Grid resource broker schedules at the level of farm subclusters.
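Summarizing node status per subcluster might look like the Python sketch below. The node record fields (`subcluster`, `cpus`, `load`), the host names, and the choice of a simple per-host average are assumptions for illustration; the slides do not specify the aggregation used.

```python
from collections import defaultdict


def summarize_subclusters(nodes):
    """Group node records by subcluster and compute one summary per group.

    Each node is a dict with the hypothetical keys 'subcluster', 'cpus',
    and 'load'.
    """
    groups = defaultdict(list)
    for node in nodes:
        groups[node["subcluster"]].append(node)

    summary = {}
    for name, members in groups.items():
        summary[name] = {
            "hosts": len(members),
            "cpus": sum(n["cpus"] for n in members),
            "average_load": sum(n["load"] for n in members) / len(members),
        }
    return summary


# Hypothetical node records; the host names are made up.
nodes = [
    {"host": "node001", "subcluster": "CPU1GHz", "cpus": 2, "load": 0.5},
    {"host": "node002", "subcluster": "CPU1GHz", "cpus": 2, "load": 1.5},
    {"host": "node003", "subcluster": "CPU400MHz", "cpus": 2, "load": 0.25},
]
for name, stats in sorted(summarize_subclusters(nodes).items()):
    print(name, stats)
```

A broker then needs only one record per subcluster rather than one per node.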

Information Schema (Linux Farm Monitoring)
Queue-Info:
objectclass ( 1.3.6.1.4.1.3536.2.6.0.0.0.0
    NAME 'Queue-Info'
    SUP 'Mds'
    STRUCTURAL
    MUST ( MdsQueueNumberOfCpu $ MdsQueueSpeed $ MdsQueueAverageLoad $ MdsQueueAverageUserPercent $ MdsQueueAverageSysPercent ) )

Information Schema (Linux Farm Monitoring)
Host-Info:
objectclass ( 1.3.6.1.4.1.3536.2.6.0.1
    NAME 'Host-Info'
    SUP 'Mds'
    STRUCTURAL
    MUST ( MdsNodeAddress $ MdsHostNodeName $ MdsHostNodeDomainName $ MdsNetMacAddr )
    MAY ( MdsHostVendor $ MdsCpuVendor $ MdsCpuSmpSize $ MdsOsName $ MdsOsKernelVersion $ MdsMemoryRamSizeMB $ MdsMemoryVmSizeMB $ MdsTimeFrom $ MdsCpuLoad5min $ MdsCpuUser15min $ MdsCpuSystem15min ) )

Information Provider (Linux Farm Monitoring)
# generate Farm information every 10 minutes
dn: MdsFarmQueueName=1000, MdsHostNodeDomainName=usatlas.bnl.gov, Mds-Host-hn=gremlin.usatlas.bnl.gov, Mds-Vo-name=local, o=grid
objectclass: GlobusTop
objectclass: GlobusActiveObject
objectclass: GlobusActiveSearch
type: exec
path: /usr/local/globus-new/customize
base: mds-farm-batch-info.pl
args: -dn MdsFarmQueueName=1000,MdsHostNodeDomainName=usatlas.bnl.gov,Mds-Host-hn=gremlin.usatlas.bnl.gov,Mds-Vo-name=local,o=grid -ttl 900
cachetime: 600
timelimit: 20
sizelimit: 400
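The provider script itself (mds-farm-batch-info.pl) is not shown in the slides. As a rough idea of what such a script might print, the Python sketch below formats one LDIF entry using the Queue-Info attribute names from the schema. The function name, the attribute values, and the exact output layout are assumptions; this is not the actual Perl provider.

```python
# Attributes listed as MUST in the Queue-Info object class.
REQUIRED_ATTRIBUTES = (
    "MdsQueueNumberOfCpu",
    "MdsQueueSpeed",
    "MdsQueueAverageLoad",
    "MdsQueueAverageUserPercent",
    "MdsQueueAverageSysPercent",
)


def queue_info_ldif(dn, values):
    """Format one Queue-Info entry as LDIF text (hypothetical helper)."""
    missing = [name for name in REQUIRED_ATTRIBUTES if name not in values]
    if missing:
        raise ValueError("missing required attributes: " + ", ".join(missing))
    lines = ["dn: " + dn, "objectclass: Queue-Info"]
    lines += ["{}: {}".format(name, values[name]) for name in REQUIRED_ATTRIBUTES]
    # A blank line terminates an LDIF entry.
    return "\n".join(lines) + "\n\n"


dn = (
    "MdsFarmQueueName=1000,MdsHostNodeDomainName=usatlas.bnl.gov,"
    "Mds-Host-hn=gremlin.usatlas.bnl.gov,Mds-Vo-name=local,o=grid"
)
# The values below are made up for illustration.
values = {
    "MdsQueueNumberOfCpu": 100,
    "MdsQueueSpeed": 1000,
    "MdsQueueAverageLoad": 0.7,
    "MdsQueueAverageUserPercent": 55.0,
    "MdsQueueAverageSysPercent": 5.0,
}
print(queue_info_ldif(dn, values), end="")
```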

DEMO
http://heppc1.uta.edu/kaushik/computing/grid-status/mds.gremlin.usatlas.bnl.gov.html
http://heppc1.uta.edu/atlas/grid-status/index.html

Observation from Grid-View

Future Work
– Work with the PPDG monitoring group and GGF; install "recommended" sensors on the ATLAS Monitoring Framework.
– Identify what should be monitored for the Grid resource broker and deploy tools.
– Optimize the back-end data structure.
– Scalability test: the system should be able to handle 1600 Linux nodes.
– Extend this prototype to other facility monitoring.
