
1 Monitoring and operational management in USLHCNet
Ramiro Voicu, Iosif Legrand, Harvey Newman, Artur Barczyk, Costin Grigoras, Ciprian Dobre, Alexandru Costan, Azher Mughal, Sandor Rozsa
CHEP09, March 2009, Prague

2 Ramiro Voicu CHEP09 Prague March 2009
Outline
• MonALISA Framework
• Architecture
• Data handling
• Automatic actions
• USLHCNet
• Network topology
• Monitoring modules
• Reliable monitoring & accounting
• Alarms & triggers
• Conclusions

3 The MonALISA Architecture
[Diagram] Layers of the architecture:
• HL services: regional or global High Level Services, Repositories & Clients
• Proxies: secure and reliable communication, dynamic load balancing, scalability & replication, AAA for clients
• MonALISA services and agents: distributed system for gathering and analyzing information based on mobile agents (customized aggregation, triggers, actions)
• Network of JINI Lookup Services (secure & public): distributed dynamic registration and discovery, based on a lease mechanism and remote events
Fully distributed system with no single point of failure
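
The lease mechanism named on this slide can be illustrated with a minimal sketch. It is not the JINI API or MonALISA code; `LeaseRegistry`, its method names and the injectable clock are hypothetical. The idea shown is only that a service stays discoverable while it keeps renewing its lease, so a crashed service drops out without an explicit deregistration.

```python
import time


class LeaseRegistry:
    """Toy lookup service: entries vanish unless their lease is renewed."""

    def __init__(self, lease_seconds, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock          # injectable so the behaviour can be tested
        self._expiry = {}           # service name -> absolute expiry time

    def register(self, name):
        # Registering and renewing are the same operation:
        # push the expiry time forward by one lease period.
        self._expiry[name] = self.clock() + self.lease_seconds

    renew = register

    def discover(self):
        # Drop services whose lease has lapsed, then return the live ones.
        now = self.clock()
        self._expiry = {n: t for n, t in self._expiry.items() if t > now}
        return sorted(self._expiry)
```

A service that stops calling `renew` disappears from `discover()` after at most one lease period, which is what lets clients trust the list without polling every service.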

4 MonALISA Service & Data Handling
[Diagram] Components of a MonALISA service:
• Monitoring modules: collect any type of information; dynamic (re)loading; push and pull
• Data cache service and data store (Postgres DB)
• Agents, filters / triggers
• Registration with and discovery through the Lookup Services
• Configuration control (SSL)
• Predicates & agents in, data out (via ML Proxy) to applications, clients or higher-level services
• Web Service (WSDL, SOAP) for WS clients and services

5 Local and Global Decision Framework
Two levels of decisions: local (autonomous) and global (correlations).
Actions triggered by:
• values above/below given thresholds
• absence/presence of values
• correlations between any values
Action types:
• alerts (emails, instant messages, Atom feeds)
• running an external command
• automatic chart annotations in the repository
• running custom code, such as securely ordering an ML service to (re)start a site service
[Diagram] Sensors (traffic, jobs, hosts, apps, temperature, humidity, A/C, power, …) feed the ML services, which take local decisions (actions based on local information); global ML services take global decisions (actions based on global information).
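
The three trigger kinds listed on this slide (thresholds, absence of values, correlations) can be sketched as small predicates over the latest values. This is an illustrative sketch only, not the MonALISA actions framework; the helper names, parameter names and action strings are all hypothetical.

```python
def above(name, limit):
    # Threshold trigger: fires when the value exists and exceeds the limit.
    return lambda values: name in values and values[name] > limit


def absent(name):
    # Absence trigger: fires when no value was received for this parameter.
    return lambda values: name not in values


def both(cond_a, cond_b):
    # Correlation trigger: fires only when two conditions hold together.
    return lambda values: cond_a(values) and cond_b(values)


def evaluate(rules, values):
    """Return the actions whose condition holds for the latest values."""
    return [action for condition, action in rules if condition(values)]


# Hypothetical rule set; the actions are plain strings for illustration.
rules = [
    (above("temperature", 30.0), "email: room too hot"),
    (absent("link_status"), "im: no link status received"),
    (both(above("temperature", 30.0), above("load", 8.0)), "run: throttle jobs"),
]
```

The same `evaluate` function could run inside one service on local values or in a global service on values collected from many services; only the input dictionary changes.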

6 Monitoring architecture in ALICE
[Diagram] ApMon instances attached to AliEn components (Job Agents, CE, SE, Cluster Monitor, TQ, IS, Optimizers, Brokers), MySQL servers, CastorGrid scripts and API services send data to MonALISA services (labeled @Site, LCG Site and @CERN). Aggregated data flows into the MonALISA repository (long-history DB), together with LCG tools, and drives alerts and actions.
Monitored parameters shown: rss, vsz, cpu time, run time, job slots, free space, nr. of files, open files, queued JobAgents, cpu ksi2k, job status, disk used, processes, load, net in/out, jobs status, sockets, migrated MBytes, active sessions, MyProxy status.
See Costin Grigoras’ poster (067): Automated agents for management and control of the ALICE Computing Grid

7 USLHCNet
• USLHCNet provides transatlantic connections between the Tier1 computing facilities at Fermilab and Brookhaven and the Tier0 and Tier1 facilities at CERN, as well as Tier1s elsewhere in Europe and Asia.
• Together with ESnet, Internet2 and GEANT, USLHCNet supports connections between the Tier2 centers.
• The USLHCNet core infrastructure uses Ciena Core Director devices, which provide time-division multiplexing and packet-forwarding protocols that support virtual circuits with bandwidth guarantees. The virtual circuits make it possible to develop efficient data transfer services with support for QoS and priorities.
• Hybrid network: uses both Ciena Core Director (CD) devices and Force10 routers
• 4 transatlantic 10G links at the moment (6 links in the second part of this year)*
* See Harvey Newman’s talk [502] from Monday: “Status and outlook of the HEP network”

8 USLHCNet ML weather map

9 Monitoring modules
We developed a set of monitoring modules for USLHCNet network devices:
• Force10 (SNMP & sFlow)
  - Traffic per interface
  - sFlow traffic
  - Link status monitoring
• Ciena Core Director (TL1, Transaction Language 1)
  - ETTP (Ethernet Termination Point) traffic
  - EFLOW (Ethernet Flow) traffic
  - OSRP (routing protocol) topology
  - Dynamic circuits inside the optical core of the network
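
Per-interface traffic over SNMP is normally derived from octet counters (for example the 32-bit `ifInOctets` or 64-bit `ifHCInOctets` objects of the standard IF-MIB), which only ever increase and eventually wrap around. The slide does not say how the USLHCNet modules compute rates; the following is a minimal sketch of the usual calculation, with a hypothetical function name and parameters, assuming at most one wrap and no counter reset between two polls.

```python
def counter_rate_bps(prev_octets, curr_octets, interval_s, counter_bits=64):
    """Average traffic rate in bits per second between two counter readings."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Modular subtraction handles a single counter wrap-around transparently.
    delta_octets = (curr_octets - prev_octets) % (1 << counter_bits)
    return delta_octets * 8 / interval_s
```

With 32-bit counters on a 10G link the counter can wrap in a few seconds, so the single-wrap assumption only holds for the 64-bit counters or for very short polling intervals.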

10 USLHCNet monitoring
[Diagram] MonALISA services at GVA, CHI, NYC and AMS monitor the network devices over SNMP and TL1.

11 USLHCNet redundant monitoring
[Diagram] MonALISA services at GVA, CHI, NYC and AMS.
Each circuit is monitored at both ends by at least two MonALISA services; the monitored data is aggregated by global filters in the repository.
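
One way to picture the aggregation of redundant readings is to group samples by circuit and time bucket and keep a single value per bucket. The slide does not describe the actual filter logic, so everything here is an assumption: the tuple layout, the 60-second bucket, the circuit name used in examples, and the choice of the mean as the combining function.

```python
from collections import defaultdict
from statistics import mean


def merge_redundant(samples, bucket_s=60):
    """Merge readings of the same circuit reported by several services.

    samples: iterable of (service, circuit, timestamp_s, mbps) tuples.
    Returns {(circuit, bucket_start_s): mean mbps over all reports}.
    """
    buckets = defaultdict(list)
    for _service, circuit, ts, mbps in samples:
        bucket_start = int(ts // bucket_s) * bucket_s
        buckets[(circuit, bucket_start)].append(mbps)
    return {key: mean(vals) for key, vals in sorted(buckets.items())}
```

Because each bucket only needs one report to produce a value, the series stays complete when one of the monitoring services is down.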

12 Local and global filters
• Based on the MonALISA actions framework, a set of triggers has been deployed inside the service to notify the USLHCNet network engineers by email, SMS and IM in case of problems.
• The filters developed for the USLHCNet repository aggregate the redundant monitoring data (traffic and link status) collected from all the MonALISA services.
• The link status is computed as a logical “AND” of the status at both end points of a link. This also cross-checks the status reported by the hardware equipment.
• We collect data in two repository instances, each with replicated database back-ends. These instances are dynamically balanced in DNS.
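
The logical “AND” rule for link status, and the cross-check it enables, can be written down in a few lines. This is a sketch under assumptions: the function names, the input layout and the link names used in examples are hypothetical, not taken from the USLHCNet filters.

```python
def link_states(endpoint_reports):
    """Combine per-endpoint status into one status per link.

    endpoint_reports: {link: (up at end A, up at end B)}
    A link counts as up only if both end points report it up.
    """
    return {link: bool(a and b) for link, (a, b) in endpoint_reports.items()}


def disagreements(endpoint_reports):
    # Cross-check: links whose two end points report different states.
    return sorted(link for link, (a, b) in endpoint_reports.items()
                  if bool(a) != bool(b))
```

A link listed by `disagreements` is reported as down by the AND rule, and the mismatch itself is a useful hint that one device or one monitoring path is misreporting.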

13 USLHCNet: Precise measurements for the operational status on the WAN link
• Operations & management assisted by agent-based software
• Used on the new Ciena equipment for network management

14 USLHCNet: Traffic on different segments

15 USLHCNet: Accounting for Integrated Traffic
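
Integrated traffic is the transferred volume obtained by integrating the measured rate over time. The slide shows only the resulting chart, so the following is a generic sketch of that arithmetic using the trapezoidal rule; the function name and the sample layout are assumptions.

```python
def integrated_bytes(samples):
    """Total volume in bytes from rate samples.

    samples: list of (timestamp_s, rate_bps) pairs sorted by time.
    Uses the trapezoidal rule between consecutive samples.
    """
    total_bits = 0.0
    for (t0, r0), (t1, r1) in zip(samples, samples[1:]):
        total_bits += (r0 + r1) / 2 * (t1 - t0)
    return total_bits / 8
```

For example, one minute at a steady 8 Gbit/s followed by a one-minute linear ramp down to zero integrates to 7.2e11 bits, that is 9e10 bytes (90 GB).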

16 USLHCNet: Ciena alarms monitoring

17 The Need for Planning and Scheduling for Large Data Transfers
[Chart] In parallel vs. sequential: it is 2.5× faster to perform the two reading tasks sequentially.

18 Monitoring Optical Switches
Dynamic restoration of a lightpath if a segment has problems

19 Controlling Optical Planes: Automatic Path Recovery
[Diagram] CERN (Geneva), Starlight, Manlan and CALTECH (Pasadena), connected over USLHCNet and Internet2.
“Fiber cut” simulations:
• 4 fiber cut emulations (marked 1 to 4) during an FDT transfer
• The traffic moves from one transatlantic line to the other
• The FDT transfer (CERN to CALTECH) continues uninterrupted
• TCP fully recovers in ~20 s
• 200+ MBytes/sec from a 1U node
For more details, see Iosif Legrand’s poster (054): A High Performance Data Transfer Service

20 Conclusions
• The MonALISA framework provides a flexible and reliable monitoring infrastructure
  - 350+ installed services, 1.5M+ unique parameters, 25 kHz value updates
  - Truly distributed architecture with no single points of failure
  - Highly modular platform
  - Automatic decision-taking capability at both local and global levels
• USLHCNet provides a state-of-the-art hybrid network with support for circuit-oriented network services
• Monitoring this infrastructure proved to be a challenging task, but we are running with 99.5+% monitoring uptime
• We are investigating dynamic provisioning of circuits from collaborating agents
http://monalisa.caltech.edu
http://repository.uslhcnet.org

