GridMonitor: Integration of Large Scale Facility Monitoring With MDS
Richard Baker, Antonio Chan, Jason Smith, Dantong Yu, USATLAS/RHIC.


1 GridMonitor: Integration of Large Scale Facility Monitoring With MDS
Richard Baker, Antonio Chan, Jason Smith, Dantong Yu
USATLAS/RHIC Computing Facility, Brookhaven National Lab

2 Outline (6/27/2015 CHEP 03, La Jolla)
• Requirements
• System Framework, Structure and Characteristics
• I: Ganglia and Its Information Provider
• II: Archive and Its Information Provider
• Gridview, Front End System: http://heppc1.uta.edu/atlas/grid-status/mds.gremlin.usatlas.bnl.gov.html
• Current Status and Future Work

3 Requirements
• Modularity and Extensibility: Make Use of Existing Monitoring Pieces
• Flexibility: Adjustable to the Dynamics of the Monitored Systems
• Overhead: Non-intrusive
• Scalability
• Security, Consistency, Inter-operability, Etc-bility

4 What Needs to Be Monitored
• Linux Farm Monitoring
  • Description
    • About 1100 Dual-CPU Linux Nodes
    • Performance Data Must Be Summarized for Advertising to the Grid
  • Performance Events Required:
    • Configuration Information
    • Status Information: CPU Load (1, 5, 10, 15), Memory Load, Disk Load, and Network Load
  • Example Usage: A Resource Broker Might Ask About the Availability of Linux Farm System Resources in Order to Plan the Efficient Execution of Tasks

5 More…
• Network Monitoring:
  • Description:
    • 8 USATLAS Testbeds
    • Publish the Connectivity of These Testbeds, Monitor the Health of the USATLAS Network
    • Archived Performance Data Can Be Used to Predict Network Behavior, So a User Can Choose the Source and Destination for File Replication
  • Performance Events Required:
    • Bandwidth, Delay (Round Trip Time), Trace Route
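The replication use case on this slide can be illustrated with a small sketch. It assumes each of the eight testbeds probes the path to the other seven and that bandwidth samples are archived per directed path. The site names, the `archive` layout, and the helper `best_source` are hypothetical and are not taken from GridMonitor itself.

```python
from itertools import permutations
from statistics import mean

# Hypothetical site names; the slides only say there are eight testbeds.
SITES = [f"site{i}" for i in range(1, 9)]

# Each site probes the path from itself to each of the other seven sites.
PATHS = list(permutations(SITES, 2))  # 8 * 7 directed paths


def best_source(archive, destination, candidates):
    """Return the candidate source with the highest mean archived bandwidth
    towards `destination`, or None when no candidate has any samples."""
    scored = {
        src: mean(archive[(src, destination)])
        for src in candidates
        if archive.get((src, destination))
    }
    return max(scored, key=scored.get) if scored else None


# Made-up bandwidth samples in Mbit/s, for illustration only.
archive = {
    ("site1", "site3"): [80.0, 90.0, 85.0],
    ("site2", "site3"): [40.0, 45.0],
}
choice = best_source(archive, "site3", ["site1", "site2", "site4"])
print(len(PATHS), choice)
```

A mean over archived samples is only one possible predictor; a real broker might prefer recent samples or also weigh round trip time.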

6 Monitoring Framework
[Figure: monitoring framework diagram. Recoverable labels: Sensor; HPSS, Network, Computing Nodes; Data Collectors; Monitoring Database (ODBC+MySQL) or RRD DB; Info. Providers; Information Provider (GRIS), four instances; Aggregate Service Index (GIIS); Grid-info-search Server; Grid-View (Web Server)]

7 Monitoring System Components
Four-Tier Structure
• Sensors
  • Host: Ganglia, top, /proc and LSF Host Load
• Archive System (Database System)
  • Round Robin Database (RRD)
  • Relational Database: unixODBC + MyODBC + MySQL Database
• Information Providers
  • Monitoring and Discovery Service (MDS 2.2), GLUE Schema, a Customized Ganglia Client Tool Reporting the Latest Monitoring Data, and Database Client Tools Reporting the Summary Information
• Front-end Browsing System
  • Gridview (Grid Visualization Tool Developed at Univ. of Texas at Arlington)

8 Advantages
• Information Provider Provides a Cache for the Newest Value From the MySQL Database
• Non-intrusiveness: Information Provider Can Eliminate Users' Random Accesses to the Database Server
• Scalability Can Be Significantly Increased
  • 1000 Linux Nodes Are Being Monitored
  • Network Connectivity of Eight USATLAS Testbeds: Each Site Monitors the Paths From Itself to the Other Seven. Network Topology and Traffic Can Be Easily Constructed
• Flexibility:
  • Independent of Sensors. Many Sensors Can Be Easily Plugged In As Long As They Have a Well-Defined Protocol and API: We Could Switch Among Ganglia, top, /proc
  • Archive System Is Independent of the Underlying Database
    • Can Be an RDBMS (Oracle, MySQL, Sybase, Informix), Flat Files, or Objectivity, As Long As an ODBC Driver Is Available
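The caching point can be sketched as follows. This is a minimal illustration of a time-limited cache in front of an expensive query, not the MDS implementation; the class `CachedProvider`, the stand-in `query_database`, and the 600-second lifetime used in the demonstration are assumptions.

```python
import time


class CachedProvider:
    """Serve the last fetched value until it is `cachetime` seconds old."""

    def __init__(self, fetch, cachetime, clock=time.monotonic):
        self._fetch = fetch        # expensive call, e.g. a database query
        self._cachetime = cachetime
        self._clock = clock
        self._value = None
        self._stamp = None         # time of the last fetch, None if never fetched

    def get(self):
        now = self._clock()
        if self._stamp is None or now - self._stamp >= self._cachetime:
            self._value = self._fetch()
            self._stamp = now
        return self._value


# Demonstration with a fake clock and a counter standing in for the database.
calls = []
fake_now = [0.0]


def query_database():
    calls.append(1)
    return {"load_5": 0.42}  # made-up value


provider = CachedProvider(query_database, cachetime=600, clock=lambda: fake_now[0])
provider.get()
provider.get()          # served from the cache, no second query
fake_now[0] = 601.0
provider.get()          # cache expired, the database is queried again
print(len(calls))
```

However many clients ask within the cache lifetime, the database sees one query, which is the non-intrusiveness argument made on the slide.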

9 I: Ganglia Monitoring with MDS
• Ganglia Information Provider
  • Front-end: GLUE Schema, http://www.cnaf.infn.it/~sergio/datatag/glue/
  • Back-end: XML
[Figure: data-flow diagram. Recoverable labels: Cluster A; Multicast Channel; Gmond; XML; Gmetad (filtered), two instances; Layered Gmetad; Ganglia IP; XML, GLUE; MDS]

10 I: Ganglia Monitoring with MDS

gremlin % grid-info-search -x -h spider.usatlas.bnl.gov -s one

# ATLAS Linux Cluster, local, grid
dn: cl=ATLAS Linux Cluster, mds-vo-name=local, o=grid
objectClass: GlueClusterTop
objectClass: GlueCluster
GlueClusterName: ATLAS Linux Cluster
GlueClusterUniqueID: ATLAS_Linux_Cluster-RCF_and_ACF_Linux_Farm_Group
GlueClusterService: compute

# PHOBOS CAS Linux Cluster, local, grid
dn: cl=PHOBOS CAS Linux Cluster, mds-vo-name=local, o=grid
objectClass: GlueClusterTop
objectClass: GlueCluster
GlueClusterName: PHOBOS CAS Linux Cluster
GlueClusterUniqueID: PHOBOS_CAS_Linux_Cluster-RCF_and_ACF_Linux_Farm_Group
GlueClusterService: compute

# STAR CAS Linux Cluster, local, grid
dn: cl=STAR CAS Linux Cluster, mds-vo-name=local, o=grid
objectClass: GlueClusterTop
objectClass: GlueCluster
GlueClusterName: STAR CAS Linux Cluster
GlueClusterUniqueID: STAR_CAS_Linux_Cluster-RCF_and_ACF_Linux_Farm_Group
GlueClusterService: compute
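Entries of the shape shown on this slide could be derived from Ganglia's XML roughly as sketched here. The sample XML is made up and trimmed, the function `cluster_entries` is hypothetical, and the rule for building `GlueClusterUniqueID` (cluster name, a hyphen, the grid name, with spaces turned into underscores) is inferred from the output on the slide rather than confirmed by the authors.

```python
import xml.etree.ElementTree as ET

# Trimmed, made-up sample in the general shape of a gmetad XML dump.
SAMPLE_XML = """
<GANGLIA_XML SOURCE="gmetad">
  <GRID NAME="RCF and ACF Linux Farm Group">
    <CLUSTER NAME="ATLAS Linux Cluster">
      <HOST NAME="node01"><METRIC NAME="load_one" VAL="0.50"/></HOST>
    </CLUSTER>
    <CLUSTER NAME="STAR CAS Linux Cluster">
      <HOST NAME="node02"><METRIC NAME="load_one" VAL="1.25"/></HOST>
    </CLUSTER>
  </GRID>
</GANGLIA_XML>
"""


def cluster_entries(xml_text, suffix="mds-vo-name=local, o=grid"):
    """Build one LDIF-style text entry per CLUSTER element found under a GRID."""
    root = ET.fromstring(xml_text)
    entries = []
    for grid in root.iter("GRID"):
        for cluster in grid.findall("CLUSTER"):
            name = cluster.get("NAME")
            # Assumed naming rule, inferred from the slide output.
            unique_id = f"{name}-{grid.get('NAME')}".replace(" ", "_")
            entries.append("\n".join([
                f"dn: cl={name}, {suffix}",
                "objectClass: GlueClusterTop",
                "objectClass: GlueCluster",
                f"GlueClusterName: {name}",
                f"GlueClusterUniqueID: {unique_id}",
                "GlueClusterService: compute",
            ]))
    return entries


entries = cluster_entries(SAMPLE_XML)
print("\n\n".join(entries))
```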

11 II: Farm Monitoring
• Linux Farm Is Divided Into Sub-clusters Based on Site Policy, Experiment, OS and Version, and CPU Speed. A Sub-cluster Contains Hosts With the Same Configuration
• BNL ATLAS Farm Is Partitioned Into Sub-clusters by CPU Speed: CPU 400 MHz, CPU 700 MHz, CPU 1 GHz, CPU 1.4 GHz and CPU 2.4 GHz
• The Status Information of a Sub-cluster Is Summarized From All Nodes in That Sub-cluster
• Grid Resource Broker Schedules at the Level of Farm Sub-clusters
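The per-sub-cluster summary can be sketched as a simple group-and-aggregate step. The node records, their field names, and the choice of CPU speed as the only grouping key are assumptions for illustration; the slides also mention site policy, experiment, and OS version as partitioning criteria.

```python
from collections import defaultdict
from statistics import mean

# Made-up node records; the field names are hypothetical.
nodes = [
    {"host": "n1", "cpu_mhz": 1000, "cpus": 2, "load_5": 0.5},
    {"host": "n2", "cpu_mhz": 1000, "cpus": 2, "load_5": 1.5},
    {"host": "n3", "cpu_mhz": 2400, "cpus": 2, "load_5": 0.2},
]


def summarize(nodes):
    """Group nodes by CPU speed and summarize each group as one sub-cluster."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node["cpu_mhz"]].append(node)
    return {
        speed: {
            "hosts": len(members),
            "cpus": sum(n["cpus"] for n in members),
            "avg_load_5": mean(n["load_5"] for n in members),
        }
        for speed, members in groups.items()
    }


summary = summarize(nodes)
for speed, info in sorted(summary.items()):
    print(speed, info)
```

A broker that schedules at sub-cluster level then only needs these few summary records rather than one record per node.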

12 Information Schema (Linux Farm Monitoring)
Queue-Info:
• objectclass ( 1.3.6.1.4.1.3536.2.6.0.0.0.0
      NAME 'Queue-Info'
      SUP 'Mds'
      STRUCTURAL
      MUST ( MdsQueueNumberOfCpu $ MdsQueueSpeed $ MdsQueueAverageLoad $
             MdsQueueAverageUserPercent $ MdsQueueAverageSysPercent ) )
• Needs to Be Replaced by the GLUE Schema

13 Backend Data Structure
Node Status Information

mysql> describe node_load;
+------------+------------------+------+-----+---------+----------------+
| Field      | Type             | Null | Key | Default | Extra          |
+------------+------------------+------+-----+---------+----------------+
| load_index | int(10) unsigned |      | PRI | NULL    | auto_increment |
| sampletime | timestamp(14)    | YES  | MUL | NULL    |                |
| machine_id | varchar(31)      |      |     |         |                |
| owner      | varchar(8)       |      |     |         |                |
| load_5     | float(10,2)      |      |     | 0.00    |                |
| user_cpu   | float(10,2)      |      |     | 0.00    |                |
| sys_cpu    | float(10,2)      |      |     | 0.00    |                |
+------------+------------------+------+-----+---------+----------------+
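A summary query over this table might look like the sketch below. It uses Python's built-in `sqlite3` as a stand-in for the MySQL/ODBC back end, so the column types are approximations; the sample rows and the query itself are illustrative assumptions, not the authors' SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE node_load (
        load_index INTEGER PRIMARY KEY AUTOINCREMENT,
        sampletime TEXT,
        machine_id TEXT NOT NULL,
        owner      TEXT NOT NULL,
        load_5     REAL NOT NULL DEFAULT 0.00,
        user_cpu   REAL NOT NULL DEFAULT 0.00,
        sys_cpu    REAL NOT NULL DEFAULT 0.00
    )
""")

# Made-up samples; timestamps use the 14-digit YYYYMMDDHHMMSS layout.
rows = [
    ("20030324100000", "node01", "atlas", 0.50, 40.0, 5.0),
    ("20030324101000", "node01", "atlas", 1.50, 60.0, 7.0),
    ("20030324101000", "node02", "atlas", 0.50, 20.0, 3.0),
]
conn.executemany(
    "INSERT INTO node_load"
    " (sampletime, machine_id, owner, load_5, user_cpu, sys_cpu)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)

# Average the most recent sample of every machine that belongs to one owner.
LATEST_AVG = """
    SELECT COUNT(*), AVG(n.load_5), AVG(n.user_cpu), AVG(n.sys_cpu)
    FROM node_load AS n
    JOIN (SELECT machine_id, MAX(sampletime) AS latest
          FROM node_load WHERE owner = ? GROUP BY machine_id) AS m
      ON n.machine_id = m.machine_id AND n.sampletime = m.latest
    WHERE n.owner = ?
"""
count, avg_load, avg_user, avg_sys = conn.execute(
    LATEST_AVG, ("atlas", "atlas")
).fetchone()
print(count, avg_load, avg_user, avg_sys)
```

Only the newest row per machine enters the averages, so older samples of `node01` do not skew the summary.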

14 Information Provider (Linux Farm Monitoring)

# generate Farm information every 10 minutes
dn: MdsFarmQueueName=1000, MdsHostNodeDomainName=usatlas.bnl.gov, Mds-Host-hn=gremlin.usatlas.bnl.gov, Mds-Vo-name=local, o=grid
objectclass: GlobusTop
objectclass: GlobusActiveObject
objectclass: GlobusActiveSearch
type: exec
path: /usr/local/globus-new/customize
base: mds-farm-batch-info.pl
args: -dn MdsFarmQueueName=1000,MdsHostNodeDomainName=usatlas.bnl.gov,Mds-Host-hn=gremlin.usatlas.bnl.gov,Mds-Vo-name=local,o=grid -ttl 900
cachetime: 600
timelimit: 20
sizelimit: 400
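The script named in this entry is written in Perl and is not shown in the slides. As a rough sketch of what such a provider does, the Python below accepts the same `-dn` and `-ttl` arguments and prints one entry using the Queue-Info attributes from the schema slide. The attribute values are made up, and the `Mds-validfrom` and `Mds-validto` lines are an assumption about how the time-to-live would be expressed.

```python
import argparse
from datetime import datetime, timedelta, timezone


def build_entry(dn, summary, ttl, now=None):
    """Format one LDIF-style entry for the Queue-Info object class."""
    now = now or datetime.now(timezone.utc)
    stamp = "%Y%m%d%H%M%SZ"
    lines = [
        f"dn: {dn}",
        "objectclass: Queue-Info",
        # Assumed validity attributes derived from the -ttl argument.
        f"Mds-validfrom: {now.strftime(stamp)}",
        f"Mds-validto: {(now + timedelta(seconds=ttl)).strftime(stamp)}",
    ]
    lines += [f"{attr}: {value}" for attr, value in summary.items()]
    return "\n".join(lines) + "\n"


def main(argv=None):
    parser = argparse.ArgumentParser(description="Sketch of a farm info provider")
    parser.add_argument("-dn", required=True)
    parser.add_argument("-ttl", type=int, default=900)
    args = parser.parse_args(argv)

    # Made-up summary; a real provider would read it from the archive database.
    summary = {
        "MdsQueueNumberOfCpu": 4,
        "MdsQueueSpeed": 1000,
        "MdsQueueAverageLoad": 1.0,
        "MdsQueueAverageUserPercent": 40.0,
        "MdsQueueAverageSysPercent": 5.0,
    }
    entry = build_entry(args.dn, summary, args.ttl)
    print(entry)
    return entry


entry = main(["-dn", "MdsFarmQueueName=1000,Mds-Vo-name=local,o=grid", "-ttl", "900"])
```

With `cachetime: 600` in the configuration above, a script of this kind would run at most once every ten minutes regardless of how often the directory is searched.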

15 Observation from Grid-View

16 Current Status and Future Work
• Current Status:
  • Sensors & Local Monitoring Tools Add Less Than 1 Percent CPU Load: Non-intrusive
  • Improved the Ganglia Information Provider: It Can Obtain Information From Both Gmond and Gmetad
  • Multiple & Hierarchical Clusters Are Supported
• Future Work
  • Merge the Ganglia RRD Information Provider and the Archive DB Information Provider
  • Work With the Ganglia Team and the GLUE Schema Group to Help Define Requirements for What Information Should Be Monitored for Job Scheduling
  • Automate the Mapping From XML to GLUE Schema, Provide Flexibility
  • Continue to Optimize the Information Provider to Deliver Data Faster
  • Scalability Test
  • Extend This Prototype to Other Facility Monitoring

