LHC-OPN Monitoring Working Group Update

1 LHC-OPN Monitoring Working Group Update
Shawn McKee
LHC-OPN T0-T1 Meeting
Rome, Italy, April 4th, 2006

2 LHC-OPN Monitoring Overview
The LHC-OPN exists to share LHC data with, and between, the T1 centers. Being able to monitor this network is vital to its success and is required for “operations”. Monitoring is important for:
- Fault notification
- Performance tracking
- Problem diagnosis
- Scheduling and prediction
- Security
See the previous (Amsterdam) talk for an overview and details on all of this.

3 The LHC-OPN Network

4 LHC-OPN Monitoring View
The diagram to the right is a logical representation of the LHC-OPN showing monitoring hosts.
- The LHC-OPN extends to just inside the T1 “edge”.
- Read/query access should be guaranteed on LHC-OPN “owned” equipment.
- We also request read-only (RO) access to devices along the path to enable quick fault isolation.

5 Status Update
During the Amsterdam meeting (Jan 2006) we decided to focus on two areas:
- Important/required metrics
- Prototyping LHC-OPN monitoring
There is an updated LHC-OPN Monitoring document on the LHC-OPN web page emphasizing this new focus.
This meeting:
- What metrics should be required for the LHC-OPN?
- We need to move forward on prototyping LHC-OPN monitoring services ...volunteer sites?

6 Monitoring Possibilities by Layer
For each “layer” we could monitor a number of metrics of the LHC-OPN:
- Layer 1: optical power levels
- Layer 2: packet statistics (e.g., RMON)
- Layer 3/4: Netflow
- All layers:
  - Utilization (bandwidth in use, Mbit/s)
  - Availability (track accessibility of a device over time)
  - Error rates
  - Capacity
  - Topology

7 LHC-OPN “Paths”; Multiple Layers
Each T0-T1 “path” has many views:
- Each OSI layer (1-3) may have different devices involved.
- This diagram is likely simpler than most cases in the LHC-OPN.

8 Metrics for the LHC-OPN (EGEE Network Performance Metrics V2)
For “edge-to-edge” monitoring, the list of relevant metrics includes:
- Availability (of the T0-T1 path, each hop, T1-T1?)
- Capacity (T0-T1, each hop)
- Utilization (T0-T1, each hop)
- Delays (T0-T1 paths: one-way, RTT, jitter)
- Error rates (T0-T1, each hop)
- Topology (L3 traceroute; L1?, L2)
- MTU (each path and hop)
What about scheduled downtime and trouble tickets?

9 Availability
Availability (or “uptime”) measures the fraction of time the network is up and running. It can be measured per “hop” or for a complete “path”.
Methodology:
- Layer 1: measure power levels / bit rate?
- Layer 2: use SNMP to check the interface status
- Layer 3: ‘ping’ (see the sketch below)
Units: expressed as a percentage
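
A minimal sketch of the Layer-3 methodology, assuming a Linux iputils `ping` and a hypothetical target host; an actual deployment would poll continuously and archive the results rather than sample on demand:

```python
# Layer-3 availability sketch: fraction of ICMP echo probes answered.
# Assumes Linux iputils ping (-W timeout in seconds); host name is hypothetical.
import subprocess
import time

def probe_once(host: str, timeout_s: int = 2) -> bool:
    """Return True if the host answers a single ICMP echo request."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def availability(host: str, probes: int = 10, interval_s: float = 1.0) -> float:
    """Percentage of probes answered over the sampling window."""
    up = 0
    for _ in range(probes):
        if probe_once(host):
            up += 1
        time.sleep(interval_s)
    return 100.0 * up / probes

if __name__ == "__main__":
    print(f"availability: {availability('t1-edge.example.org'):.1f}%")
```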

10 Capacity
Capacity is the maximum amount of data per unit time a hop or path can transport. It can be listed by “hop” or “path”.
Methodology:
- Layer 1: surveyed (operator entry)
- Layer 2: SNMP query on the interface
- Layer 3: minimum of the component hops (see the sketch below)
Units: bit rate (bits [K, M, G] per second)
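
A minimal sketch combining the Layer-2 and Layer-3 methodologies: per-hop capacities are read from IF-MIB ifHighSpeed using the Net-SNMP command-line tools (assuming a read-only community string), and the path capacity is taken as the minimum over its hops. The host names and interface indices are hypothetical:

```python
# Path capacity sketch: bottleneck bandwidth = min of per-hop capacities.
# Assumes the Net-SNMP `snmpget` tool is installed and the devices allow
# read-only SNMP v2c access; hosts, ifIndex values, and community are examples.
import subprocess

IF_HIGH_SPEED = "1.3.6.1.2.1.31.1.1.1.15"  # IF-MIB::ifHighSpeed, in Mbit/s

def hop_capacity_mbps(host: str, if_index: int, community: str = "public") -> int:
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Oqv",
         host, f"{IF_HIGH_SPEED}.{if_index}"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return int(out)

def path_capacity_mbps(hops: list[tuple[str, int]]) -> int:
    """Layer-3 path capacity is the minimum of its component hop capacities."""
    return min(hop_capacity_mbps(host, idx) for host, idx in hops)

if __name__ == "__main__":
    hops = [("t0-edge.example.org", 2),
            ("nren-core.example.org", 5),
            ("t1-edge.example.org", 3)]
    print(f"path capacity: {path_capacity_mbps(hops)} Mbit/s")
```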

11 Utilization
Utilization is the amount of capacity being consumed on a hop or path. It can be listed by “hop” or “path”.
Methodology:
- Layer 2: use SNMP to query interface statistics (see the sketch below)
- Layer 3: list of utilization along the path
Units: bits per second
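
The SNMP methodology reduces to a counter-delta calculation: two readings of an interface octet counter (e.g., IF-MIB ifHCInOctets) divided by the polling interval. A minimal sketch with illustrative counter values rather than a live SNMP query:

```python
# Utilization sketch: bits per second from two successive octet-counter samples.
# Counter values and the polling interval are illustrative inputs; 64-bit
# counter wrap is handled for completeness.
COUNTER64_MAX = 2**64

def utilization_bps(octets_t0: int, octets_t1: int, interval_s: float) -> float:
    """Average bits per second between two counter samples."""
    delta = octets_t1 - octets_t0
    if delta < 0:                      # counter wrapped between the two samples
        delta += COUNTER64_MAX
    return delta * 8 / interval_s

def utilization_percent(bps: float, capacity_bps: float) -> float:
    return 100.0 * bps / capacity_bps

# Example: 7.5e9 octets transferred over a 60 s poll on a 10 Gbit/s link
rate = utilization_bps(1_000_000_000, 8_500_000_000, 60.0)
print(f"{rate/1e9:.2f} Gbit/s = {utilization_percent(rate, 10e9):.1f}% of capacity")
```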

12 Delay
Delay metrics are at Layer 3 (IP) and are defined by RFC 2679, RFC 2681, and the IETF IPPM working group. Delay-related information comes in three types: one-way delay (OWD), one-way delay variation (jitter), and round-trip time (RTT).
- One-way delay between two observation points is the time from the first bit of the packet appearing at the first point to the last bit of the packet arriving at the second point.
  - Methodology: an application (OWAMP) generating packets of a defined size, time-stamped, to a target end-host application
  - Units: time (seconds)
- Jitter is the one-way delay difference along a given unidirectional path (RFC 3393).
  - Methodology: statistical analysis of the OWD application's output (see the sketch below)
  - Units: time (positive or negative)
- Round-trip time (RFC 2681) is well defined.
  - Methodology: ‘ping’
  - Units: time (min/max/average) or a histogram of times
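
A minimal sketch of the jitter calculation: per RFC 3393, delay variation is the difference between the one-way delays of consecutive packets. The OWD samples below are illustrative numbers standing in for real OWAMP results between clock-synchronized hosts:

```python
# Jitter sketch: IP delay variation (RFC 3393) as consecutive OWD differences.
# The OWD samples are hypothetical; a deployment would feed in OWAMP output.
from statistics import mean, pstdev

def delay_variation(owd_samples_ms: list[float]) -> list[float]:
    """Pairwise OWD differences between consecutive packets (may be negative)."""
    return [b - a for a, b in zip(owd_samples_ms, owd_samples_ms[1:])]

owd_ms = [12.1, 12.4, 11.9, 13.0, 12.2]          # hypothetical OWAMP results
ipdv = delay_variation(owd_ms)
print("IPDV samples (ms):", [round(v, 2) for v in ipdv])
print(f"mean |IPDV|: {mean(abs(v) for v in ipdv):.2f} ms, "
      f"std dev of OWD: {pstdev(owd_ms):.2f} ms")
```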

13 Error Rates
Error rates track the bit or packet error rate (depending upon the layer). They can be listed by “hop” or “path”.
Methodology:
- Layer 1: read (TL1) the equipment error rate
- Layer 2: SNMP access to the interface error counters (see the sketch below)
- Layer 3: checksum errors on packets
Units: fraction (erroneous/total, for bits or packets)
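
The Layer-2 case amounts to a fraction of interface counters (e.g., IF-MIB ifInErrors over total inbound packets). A minimal sketch with illustrative counter values; in practice the counters would be read over SNMP as on the capacity and utilization slides:

```python
# Error-rate sketch: erroneous packets divided by total packets received.
# Counter values are illustrative, not read from a live device.
def error_fraction(in_errors: int, in_ucast: int, in_non_ucast: int = 0) -> float:
    """Fraction of received packets that were erroneous."""
    total = in_errors + in_ucast + in_non_ucast
    return in_errors / total if total else 0.0

print(f"error rate: {error_fraction(in_errors=42, in_ucast=9_855_021):.2e}")
```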

14 Topology
Topology refers to the connectivity between nodes in the network (and varies by OSI layer).
Methodology:
- Layer 1: surveyed (input)
- Layer 2: surveyed (input)... possible L2 discovery?
- Layer 3: traceroute or equivalent (see the sketch below)
Units: the representation should record a vector of node-link pairs describing the path
Topology may vary with time (that is what is interesting), but that is probably only “trackable” at L3.
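
A minimal sketch of the Layer-3 methodology and the proposed units: a traceroute-derived path recorded as a vector of node-link pairs. It assumes a Unix `traceroute` with numeric (`-n`) output and hypothetical host names; unresponsive hops are kept as unknowns:

```python
# Topology sketch: record an L3 path as a vector of node-link pairs.
# Assumes a Unix traceroute; source and target host names are hypothetical.
import subprocess

def traceroute_hops(target: str) -> list[str]:
    out = subprocess.run(["traceroute", "-n", target],
                         capture_output=True, text=True, check=True).stdout
    hops = []
    for line in out.splitlines()[1:]:          # skip the traceroute header line
        fields = line.split()
        hops.append(fields[1] if len(fields) > 1 else "*")   # hop IP or unknown
    return hops

def node_link_pairs(source: str, hops: list[str]) -> list[tuple[str, str]]:
    """(node, link-to-next-node) pairs representing the measured path."""
    nodes = [source] + hops
    return [(a, f"{a}->{b}") for a, b in zip(nodes, nodes[1:])]

if __name__ == "__main__":
    path = node_link_pairs("t0-mon.example.org",
                           traceroute_hops("t1-edge.example.org"))
    for node, link in path:
        print(node, link)
```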

15 MTU
The Maximum Transmission Unit (MTU) is defined as the maximum size of a packet which an interface can transmit without having to fragment it. It can be listed by “hop” or “path”.
Methodology: use Path MTU Discovery (RFC 1191) (see the sketch below)
Units: bytes
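
A minimal sketch of path MTU probing in the spirit of RFC 1191: send echo requests with the don't-fragment bit set and binary-search the largest payload that is delivered. The `-M do` and `-s` flags assume Linux iputils `ping`, and the target host is hypothetical:

```python
# Path MTU sketch: binary-search the largest DF-marked payload that arrives.
# Assumes Linux iputils ping; 28 bytes account for the IP + ICMP headers.
import subprocess

def fits(host: str, payload: int) -> bool:
    """True if a don't-fragment packet with this ICMP payload size is delivered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(payload), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def path_mtu(host: str, lo: int = 68, hi: int = 9000) -> int:
    """Largest working payload plus 28 bytes of IP/ICMP header = path MTU."""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if fits(host, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo + 28

if __name__ == "__main__":
    print(f"path MTU: {path_mtu('t1-edge.example.org')} bytes")
```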

16 LHC-OPN: Which Metrics Are REQUIRED (if any)?
We should converge on a minimal set of metrics that LHC-OPN monitoring needs to provide. For example, for each T0-T1 path:
- Availability (is the path “up”?)
- Capacity (path bottleneck bandwidth)
- Utilization (current usage along the path)
- Error rates? (bit errors along the path)
- Delay? Topology? MTU?
Do we need/require “hop”-level metrics at various layers?
How do we represent/monitor downtime and trouble tickets? (Is this in scope?)
A possible representation of such a per-path record is sketched below.
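
One hypothetical way to represent the minimal per-path metric set under discussion, should the group settle on it; the field names and optional members are illustrative, not an agreed schema:

```python
# Hypothetical per-path metrics record for a prototype collector.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PathMetrics:
    path_id: str                         # e.g. "T0 -> <T1 site>"
    timestamp: float                     # seconds since the epoch
    available: bool                      # is the path "up"?
    capacity_mbps: int                   # bottleneck bandwidth along the path
    utilization_mbps: float              # current usage along the path
    error_fraction: Optional[float] = None   # bit/packet errors, if required
    rtt_ms: Optional[float] = None           # delay, if required
    mtu_bytes: Optional[int] = None          # path MTU, if required
    hops: list = field(default_factory=list) # optional L3 topology (hop list)
```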

17 REMINDER: T0 Site Requests
A robust machine meeting the following specs must be made available:
- Dual-CPU Xeon 3 GHz processors, or dual Opteron 2.2 GHz, or better
- 4 gigabytes of memory to support monitoring apps and large TCP buffers
- 1 or 10 Gigabit network interface on the LHC-OPN
- 200 GB of disk space to allow for the LHC-OPN apps and data repository
- A separate disk (200+ GB) to back up the LHC-OPN data repository
- OPTIONAL: an out-of-band link for maintenance/problem diagnosis
- Suitably privileged account(s) for software installation/access
- This machine should NOT be used for other services
SNMP RO access for the above machine is required for all L2 and L3 devices, or proxies (in case of security/performance concerns).
Access to Netflow (or equivalent) LHC-OPN data from the edge device.
Appropriate RO access (via proxy?) to the optical components (for optical power monitoring) must be allowed from this same host.
Access (testing/maintenance) must be allowed from all LHC-OPN networks.
The Tier-0 needs a point-of-contact (POC) for LHC-OPN monitoring.

18 REMINDER: T1 Site Requests
A dedicated LHC-OPN monitoring host must be provided:
- A gigabyte of memory
- 2 GHz Xeon or better CPU
- 1 Gigabit network interface on the LHC-OPN
- At least 20 GB of disk space allocated for LHC-OPN monitoring apps
- A suitably privileged account for software installation
- OPTIONAL: an out-of-band network link for maintenance/problem diagnosis
- OPTIONAL: this host should only be used for LHC-OPN monitoring
- OPTIONAL: each Tier-1 site should provide a machine similar to the Tier-0's
SNMP RO access for the above machine is required for all T1 LHC-OPN L2 and L3 devices, or proxies (for security/performance concerns).
Access to Netflow (or equivalent) LHC-OPN data from the edge device.
Appropriate RO access, possibly via proxy, to the T1 LHC-OPN optical components (for optical power monitoring) must be allowed from this host.
Access (testing/maintenance) should be allowed from all LHC-OPN networks.
The Tier-1 needs to provide a point-of-contact (POC) for LHC-OPN monitoring.

19 REMINDER: NREN Desired Access
We expect that we will be unable to “require” anything of all possible NRENs in the LHC-OPN. However, the following list represents what we would like to have for the LHC-OPN:
- SNMP (read-only) access to LHC-OPN-related L2/L3 devices from either a closely associated Tier-1 site or the Tier-0 site. We require associated details about the device(s) involved with the LHC-OPN for this NREN.
- Suitable (read-only) access to the optical components along the LHC-OPN path which are part of this NREN. We require associated details about the devices involved.
- Topology information on how the LHC-OPN maps onto the NREN.
- Information about planned service outages and interruptions, for example URLs containing this information, mailing lists, applications which manage them, etc.
Responsibility for acquiring each NREN's information should be distributed to the various Tier-1 POCs.

20 Prototype Deployments
We would like to begin prototype deployments at the Tier-0 and at least two Tier-1 sites. The goal is to prototype the various software which might be used for LHC-OPN monitoring:
- Active measurements (and scheduling?)
- Various applications which can provide LHC-OPN metrics (perhaps in different ways)
- GUI interfaces to LHC-OPN data
- Metric data management/searching for the LHC-OPN
- Alerts and automated problem-handling applications
- Interactions between all of the preceding
This process should lead to a final LHC-OPN monitoring “system” matched to our needs.

21 Prototype Deployment Needs
For sites volunteering to support the LHC-OPN monitoring prototypes, we need:
- A suitable host (see requirements)
- Account details (username/password); an SSH public key can be provided as an alternative to a password
- Any constraints or limitations on host usage
- Out-of-band access info (if any)
Each site should also provide a monitoring point-of-contact.
VOLUNTEERS?

22 Monitoring Site Requirements
Eventually each LHC-OPN site should provide the following for monitoring:
- Appropriate host(s) (see previous slides)
- A point-of-contact for monitoring
- An L1/L2/L3 “map” to the Tier-0 listing relevant nodes and links:
  - The site is responsible for contacting intervening NRENs
  - The map is used for topology and capacity information
  - It should include node (device) address, description, and access information
- Read-only access to LHC-OPN components
- Suitable account(s) on the monitoring host
Sooner rather than later... the pace is dictated by interest.

23 Future Directions / Related Activities
There are a number of existing efforts we anticipate actively prototyping for LHC-OPN monitoring (alphabetically):
- EGEE JRA4 / EGEE-II SA1 Network Performance Monitoring: this project has been working on an architecture and a series of prototype services intended to provide Grid operators and middleware with both end-to-end and edge-to-edge performance data.
- IEPM (Internet End-to-end Performance Monitoring): the IEPM effort has its origins in the 1995 WAN monitoring group at SLAC. IEPM-BW was developed to provide an infrastructure more focused on making active end-to-end performance measurements for a few high-performance paths.
- MonALISA (Monitoring Agents using a Large-scale Integrated Services Architecture): this framework has been designed and implemented as a set of autonomous agent-based dynamic services that collect and analyze real-time information from a wide variety of sources (grid nodes, network routers and switches, optical switches, running jobs, etc.).
- NMWG Schema: the NMWG (Network Measurement Working Group) focuses on characteristics of interest to grid applications and works in collaboration with other standards groups such as the IETF IPPM WG and the Internet2 End-to-End Performance Initiative. The NMWG will determine which network characteristics are relevant to Grid applications and pursue standardization of the attributes required to describe these characteristics.
- perfSONAR: this project plans to deploy a monitoring infrastructure across Abilene (Internet2), ESnet, and GEANT. A standard set of measurement applications will regularly measure these backbones and store their results in the Global Grid Forum Network Measurement Working Group schema (see NMWG above).

24 Summary and Conclusion
The LHC-OPN monitoring document has been updated to reflect the new emphasis on:
- Determining the appropriate metrics
- Prototyping possible applications/systems
All sites should identify an LHC-OPN point-of-contact to help expedite the monitoring effort.
We have a number of possibilities regarding metrics. Defining which (if any) are required will help direct the prototyping efforts.
Prototyping is ready to proceed -- we need to identify sites which will host this effort.

