Monitoring, Control and Optimization in Large Distributed Systems
Iosif Legrand, California Institute of Technology
ECSAC, August 2009, Veli Lošinj

Monitoring Distributed Systems. An essential part of managing large-scale, distributed data processing facilities is a monitoring system able to monitor computing facilities, storage systems, networks and a very large number of applications running on these systems in near real time. The monitoring information gathered for all the subsystems is essential for design, debugging and accounting, and for the development of "higher level services" that provide decision support and some degree of automated decisions, and for maintaining and optimizing workflows in large-scale distributed systems.

Monitoring information (near-real-time information about the system) is necessary for system design, control, optimization, debugging and accounting: it feeds the computing models, modeling & simulations and optimization algorithms, supports accounting, alarms and debugging, provides control and operational support, and helps create resilient distributed systems.

The LHC Data Grid Hierarchy and the need for distributed computing (MONARC): the online system at the experiment produces ~PByte/sec, of which ~150-1500 MBytes/sec reach the CERN Center (Tier 0+1, with PBs of disk and a tape robot). Eleven Tier1 centers (IN2P3, RAL, INFN, FNAL, ...) are connected at 10-40 Gbps, 120+ Tier2 centers at ~1-10 Gbps, Tier 3 institutes with physics data caches at 1 to 10 Gbps, and Tier 4 workstations below them. Tens of Petabytes of data are expected by ~2010.

Communication in distributed systems uses different types of dedicated protocols. In "traditional" distributed object systems (CORBA, DCOM, RMI) an IDL compiler generates a stub, linked into the client, and a skeleton on the server, and both are registered with a lookup service. The client must know about the service from the beginning and needs the right stub for it: the server and the client code must be created together.
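As a concrete illustration of this stub-based model, here is a minimal Java RMI sketch; the EchoService interface and the registry host are hypothetical, and the point is that the client must be compiled against the same interface the server exports.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;

// The shared interface: client and server must be built against the same definition,
// which is exactly the coupling described above for CORBA/DCOM/RMI.
interface EchoService extends Remote {
    String echo(String msg) throws RemoteException;
}

public class EchoClient {
    public static void main(String[] args) throws Exception {
        // Look up the stub in the RMI registry (the "lookup service" of the slide).
        Registry registry = LocateRegistry.getRegistry("server.example.org", 1099);
        EchoService stub = (EchoService) registry.lookup("EchoService");
        // All calls go through the stub, which marshals the arguments to the remote skeleton.
        System.out.println(stub.echo("hello"));
    }
}
```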

Web Services (WSDL/SOAP): the server publishes its WSDL description through a lookup service, and the client can dynamically generate the data structures and the interfaces for using the remote objects based on the WSDL. The approach is platform independent, but has a large overhead and is based on stateless connections.

Mobile code and distributed services (Jini): the service registers a proxy with the lookup service, the client downloads it through dynamic code loading, and any protocol well suited to the application can then be used, so services can be used dynamically. For remote services the proxy is an RMI stub; for mobile agents the proxy is the entire service; "smart proxies" adjust themselves to the client. Each service acts as a true dynamic service and provides the functionality required by any other service that needs such information: a mechanism to dynamically discover all the "service units", remote event notification for changes in any system, and a lease mechanism for each registered unit. The framework is based on Java.
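In contrast to the stub-based example above, a Jini-style client is not linked against a service-specific stub: it downloads whatever proxy the service registered. Below is a minimal sketch of this pattern; the MonitorService interface and the lookup host are hypothetical, and MonALISA's real discovery code is more elaborate (lease renewal, remote events, multiple lookup services).

```java
import net.jini.core.discovery.LookupLocator;
import net.jini.core.lookup.ServiceRegistrar;
import net.jini.core.lookup.ServiceTemplate;

public class DiscoverMonitorService {
    // Hypothetical service interface; the proxy implementing it is downloaded at run time.
    public interface MonitorService {
        double getLoad() throws java.rmi.RemoteException;
    }

    public static void main(String[] args) throws Exception {
        // Contact a (unicast) lookup service and ask for any service implementing MonitorService.
        // A real client would also set a security manager and codebase to allow code download.
        LookupLocator locator = new LookupLocator("jini://lookup.example.org");
        ServiceRegistrar registrar = locator.getRegistrar();
        ServiceTemplate template =
                new ServiceTemplate(null, new Class[] { MonitorService.class }, null);
        // The returned object is the service's proxy: an RMI stub, a smart proxy,
        // or even the entire (mobile) service, loaded via dynamic code loading.
        MonitorService proxy = (MonitorService) registrar.lookup(template);
        System.out.println("current load: " + proxy.getLoad());
    }
}
```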

The MonALISA Framework. MonALISA is a dynamic, distributed service system capable of collecting any type of information from different systems, analyzing it in near real time, and providing support for automated control decisions and global optimization of workflows in complex grid systems. The MonALISA system is designed as an ensemble of autonomous, multi-threaded, self-describing agent-based subsystems which are registered as dynamic services and are able to collaborate and cooperate in performing a wide range of monitoring tasks. These agents can analyze and process the information in a distributed way and provide optimization decisions in large-scale distributed applications.

The MonALISA architecture is organized in layers: at the top, regional or global high-level services, repositories & clients (HL services); below them, proxies providing secure and reliable communication, dynamic load balancing, scalability & replication, and AAA for clients; then the MonALISA services themselves, a distributed system for gathering and analyzing information based on mobile agents, with customized aggregation, triggers and actions; and at the bottom a network of JINI lookup services (secure & public) providing distributed dynamic registration and discovery based on a lease mechanism and remote events. The result is a fully distributed system with no single point of failure.

MonALISA service & data handling: each service registers with the JINI lookup services and is discovered by WS clients and other services (data flowing via the ML proxy). Dynamically loaded monitoring modules collect any type of information, using both push and pull; the values pass through filters/triggers and agents into the data cache service and a Postgres data store. Clients or higher-level services subscribe with predicates & agents, a WSDL/SOAP web service interface is exposed, and configuration control is done over SSL.
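To make the "dynamically loaded monitoring modules" concrete, the sketch below shows the general shape of such a module; the MonitoringModule interface and Result class are simplified, hypothetical stand-ins for the MonALISA module API, not its real signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical stand-in for a dynamically loaded monitoring module
// (not the real MonALISA module API): the service periodically calls collect()
// ("pull"), while values pushed from instrumented applications enter the same
// data cache through a different path ("push").
interface MonitoringModule {
    String getName();
    List<Result> collect() throws Exception;
}

class Result {
    final String cluster, node, parameter;
    final double value;
    final long time = System.currentTimeMillis();
    Result(String cluster, String node, String parameter, double value) {
        this.cluster = cluster;
        this.node = node;
        this.parameter = parameter;
        this.value = value;
    }
}

class LoadModule implements MonitoringModule {
    public String getName() { return "monLoad"; }

    public List<Result> collect() {
        // Read the 1-minute system load average via the standard JMX bean.
        double load = java.lang.management.ManagementFactory
                .getOperatingSystemMXBean().getSystemLoadAverage();
        List<Result> out = new ArrayList<>();
        out.add(new Result("Site_CPU", "localhost", "load1", load));
        return out;
    }
}
```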

Registration/discovery, admin access and AAA for clients: each MonALISA service registers with the lookup services using a signed certificate, and clients (or other services) discover it through a trust keystore. Applications and data filters & agents reach the services through the services proxy multiplexers, which also handle client authentication; administrative access uses SSL connections, and AAA services control access.

Monitoring grid sites, running jobs, network traffic and connectivity: topology and accounting views.

Monitoring the execution of jobs and their time evolution: submitting a job DAG, split jobs (e.g. a job split into Job1, Job2, Job3), and lifelines for jobs.

Monitoring CMS jobs worldwide. CMS is using MonALISA and ApMon to monitor all the production and analysis jobs; this information is then used in the CMS Dashboard frontend. About 4.5 x 10^9 monitoring values were collected in the first half of 2009 (the computer that runs ML at CERN was replaced during this period). The service organizes and structures the monitoring information, collecting data at rates of more than 1000 values per second, with peaks of more than 1500 jobs reporting concurrently to the same server. It has run for more than 150 days of continuous operation without any problems, and the loss rate of UDP messages (values) is less than 5 x 10^-6.

Monitoring architecture in ALICE: every AliEn component (CE, SE, IS, TQ, brokers, optimizers, cluster monitors, job agents), as well as the API services, MySQL servers, CastorGrid scripts and LCG tools, is instrumented with ApMon and sends monitoring data to the MonALISA service at the site, at CERN and at the LCG sites. The reported parameters include job status and run time, CPU time and KSI2K, load, processes, rss/vsz memory, sockets, open files, number of files, free space, disk used, network in/out, job slots, queued JobAgents, migrated MBytes, active sessions and MyProxy status. Aggregated data are collected in the MonALISA repository with a long-history DB, which also drives alerts and actions.
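A typical way such values are sent is via ApMon, the lightweight library used to instrument the AliEn components. The sketch below is based on ApMon's Java API as commonly documented (apmon.ApMon constructed with a Vector of destination "host[:port]" strings, and sendParameter(cluster, node, parameter, value)); treat the exact signatures as an assumption, and the host, cluster and node names are illustrative.

```java
import java.util.Vector;
import apmon.ApMon;

public class JobAgentInstrumentation {
    public static void main(String[] args) throws Exception {
        // Destination MonALISA service(s); the host name here is illustrative.
        Vector<String> destinations = new Vector<>();
        destinations.add("monalisa.site.example.org:8884");

        ApMon apm = new ApMon(destinations);

        // Report a few job-agent values into a cluster/node of the site's choosing.
        apm.sendParameter("AliEn_JobAgents", "agent-0042", "job_status", "RUNNING");
        apm.sendParameter("AliEn_JobAgents", "agent-0042", "cpu_time", 1234.5);
        apm.sendParameter("AliEn_JobAgents", "agent-0042", "rss", 512.0);
    }
}
```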

ALICE: Global views, status & jobs: http://pcalimonitor.cern.ch

ALICE: Job status history plots.

ALICE: Resource usage monitoring. Cumulative parameters: CPU time & CPU KSI2K, wall time & wall KSI2K, read & written files, input & output traffic (xrootd). Running parameters: resident memory, virtual memory, open files, workdir size, disk usage, CPU usage. All are aggregated per site.

ALICE: Job agents monitoring. From the job agent itself: requesting job, installing packages, running job, done, and error statuses. From the computing element: available job slots, queued job agents, running job agents.

Local and Global Decision Framework. There are two levels of decisions: local (autonomous) and global (correlations). Actions are triggered by values above/below given thresholds, by the absence/presence of values, or by correlations between any values. Action types: alerts (email / instant messaging / Atom feeds), running an external command, automatic chart annotations in the repository, or running custom code, such as securely ordering a ML service to (re)start a site service. Local ML services take decisions based on local information from sensors (temperature, humidity, A/C, power, ...); global ML services take decisions based on global information (traffic, jobs, hosts, applications).
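As an illustration of the local (autonomous) level, the sketch below implements a simple threshold trigger that fires an action when a monitored value crosses a limit; it is a generic, self-contained example, not MonALISA's actual trigger API.

```java
// Minimal sketch of a local threshold trigger: generic code, not the MonALISA trigger API.
public class ThresholdTrigger {
    private final String parameter;
    private final double limit;
    private final Runnable action;
    private boolean fired = false;

    public ThresholdTrigger(String parameter, double limit, Runnable action) {
        this.parameter = parameter;
        this.limit = limit;
        this.action = action;
    }

    /** Called for every new monitoring value of this parameter. */
    public void onValue(double value) {
        if (value > limit && !fired) {
            fired = true;          // fire once until the value drops back below the limit
            action.run();
        } else if (value <= limit) {
            fired = false;
        }
    }

    public static void main(String[] args) {
        ThresholdTrigger t = new ThresholdTrigger("mysqld_vsz_mb", 4096,
                () -> System.out.println("ALERT: restarting MySQL daemon (illustrative action)"));
        for (double v : new double[] {1024, 2048, 5000, 5100, 3000}) {
            t.onValue(v);
        }
    }
}
```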

ALICE: Automatic job submission and service restarts. The MySQL daemon is automatically restarted when it runs out of memory (trigger: threshold on VSZ memory usage). The ALICE production jobs queue is kept full by automatic submission (trigger: threshold on the number of aliprod waiting jobs). Administrators are kept up to date on the services' status (trigger: presence/absence of monitored information).

Automatic actions in ALICE. ALICE uses the monitoring information to automatically: resubmit error jobs until a target completion percentage is reached; submit new jobs when necessary (watching the task queue size for each service account), both for production jobs and for RAW data reconstruction jobs, for each pass; restart site services whenever tests of VoBox services fail but the central services are OK; send email notifications / add chart annotations when a problem was not solved by a restart; and dynamically modify the DNS aliases of central services for efficient load balancing. Most of the actions are defined by configuration files of only a few lines.
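The shape of such a configuration might look like the hypothetical, properties-style snippet below; the keys, syntax, thresholds and paths are invented for illustration, and the real ALICE action definitions differ.

```
# Hypothetical action definitions (illustrative syntax only)
action.restart-mysql.trigger   = value(mysqld_vsz_mb) > 4096
action.restart-mysql.command   = /opt/monitoring/bin/restart_mysql.sh
action.keep-queue-full.trigger = value(aliprod_waiting_jobs) < 500
action.keep-queue-full.command = /opt/alien/bin/submit_production_jobs.sh
action.notify-admins.trigger   = missing(vobox_test_status, 30m)
action.notify-admins.email     = alice-admins@example.org
```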

USLHCnet: advanced standards and dynamic circuits. CIENA Core Directors provide mesh protection, with equipment and link redundancy; together with ESnet, Internet2 and SINET3 (Japan), USLHCnet provides highly resilient data paths for the US Tier1s. (From the US LHCNet status report, Artur Barczyk, 04/16/2009.)

Monitoring link availability (very reliable information). Measured availability per segment: AMS-GVA (GEANT) 99.5%, AMS-NYC (GC) 97.9%, CHI-NYC (Qwest) 99.9%, CHI-GVA (GC) 96.6%, CHI-GVA (Qwest) 99.3%, (ref @ CERN) 98.9%, GVA-NYC (Colt), GVA-NYC (GC) 99.5%; the color scale of the plot spans 0-95%, 95-97%, 97-98%, 98-99%, 99-100% and 100%. (Artur Barczyk, 04/16/2009.)

Monitoring the USLHCnet topology: topology, status & peering, with the real-time topology of the L2 circuits.

USLHCnet: Traffic on different segments.

USLHCnet: Accounting for integrated traffic.

Alarms and automatic notifications for USLHCnet.

The UltraLight network (BNL, ESnet in/out).

Monitoring network topology (L3), latency and routers: real-time topology discovery & display of networks and routers.

Available bandwidth measurements: embedded Pathload module.

EVO: real-time monitoring of the reflectors and of the quality of all possible connections.

EVO: creating a dynamic, global minimum spanning tree to optimize connectivity. The overlay is modeled as a weighted connected graph G = (V, E) with n vertices and m edges; the quality of connectivity between any two reflectors is measured every second. A minimum spanning tree with additional constraints is rebuilt in near real time, producing a resilient overlay network that optimizes real-time communication.
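For reference, a compact Prim's-algorithm sketch for rebuilding such a tree from measured link qualities is shown below; it ignores the additional constraints mentioned above and uses a plain adjacency matrix of link costs (e.g. derived from RTT, jitter and loss), so it is only a starting point, not the EVO implementation.

```java
import java.util.Arrays;

/** Minimal Prim's algorithm on a dense cost matrix; cost[i][j] could be derived from RTT/jitter/loss. */
public class MinimumSpanningTree {
    /** Returns parent[v] for each vertex v (parent[root] = -1), describing the MST edges. */
    public static int[] prim(double[][] cost) {
        int n = cost.length;
        double[] best = new double[n];
        int[] parent = new int[n];
        boolean[] inTree = new boolean[n];
        Arrays.fill(best, Double.POSITIVE_INFINITY);
        Arrays.fill(parent, -1);
        best[0] = 0;                               // start from vertex 0 (the "root" reflector)

        for (int iter = 0; iter < n; iter++) {
            int u = -1;
            for (int v = 0; v < n; v++)            // pick the cheapest vertex not yet in the tree
                if (!inTree[v] && (u == -1 || best[v] < best[u])) u = v;
            inTree[u] = true;
            for (int v = 0; v < n; v++)            // relax the edges leaving u
                if (!inTree[v] && cost[u][v] < best[v]) {
                    best[v] = cost[u][v];
                    parent[v] = u;
                }
        }
        return parent;
    }

    public static void main(String[] args) {
        double[][] cost = {
            {0, 2, 9, 4},
            {2, 0, 3, 8},
            {9, 3, 0, 1},
            {4, 8, 1, 0},
        };
        System.out.println(Arrays.toString(prim(cost)));  // prints [-1, 0, 1, 2]
    }
}
```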

Dynamic MST to optimize the connectivity between reflectors: frequent measurements of RTT, jitter, traffic and lost packets; the MST is recreated in ~1 s in case of communication problems.

EVO: optimizing how clients connect to the system for best performance and load balancing.

FDT – Fast Data Transfer. FDT is an application for efficient data transfers. It is easy to use, written in Java, and runs on all major platforms. It is based on an asynchronous, multithreaded system using the NIO library, and is able to: stream a list of files continuously; use independent threads to read and write on each physical device; transfer data in parallel on multiple TCP streams when necessary; use appropriately sized buffers for disk I/O and networking; and resume a file transfer session.
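As a usage sketch (the flag names are from memory of the FDT command line and may differ between versions): on the receiving host, running java -jar fdt.jar starts FDT in server mode; on the sending host, java -jar fdt.jar -c receiver.example.org -d /data/incoming -P 4 file1 file2 would push two files to that server over 4 parallel TCP streams into /data/incoming. The host name and paths are placeholders.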

FDT architecture: a control connection handles authorization; data are read by independent threads per device into a pool of buffers, transferred over the data sockets/channels (through kernel space on each side), and the files are restored from the buffers at the destination.

FDT features: the architecture allows external security APIs to be plugged in and used for client authentication and authorization. Several security schemes are supported: IP filtering, SSH, GSI-SSH, Globus-GSI and SSL. User-defined loadable modules for pre- and post-processing provide support for dedicated mass-storage (MS) systems, compression, etc. FDT can be monitored and controlled dynamically by the MonALISA system.

FDT memory-to-memory tests in the WAN: ~9.4 Gb/s and ~9.0 Gb/s. Test nodes: dual-core Intel Xeon CPUs @ 3.00 GHz, 4 GB RAM, 4 x 320 GB SATA disks, connected with 10 Gb/s Myricom NICs.

Disk-to-disk transfers in the WAN. New York - Geneva, 1U nodes with 4 disks: reads and writes on 4 SATA disks in parallel on each server; mean traffic ~210 MB/s, i.e. ~0.75 TB per hour. CERN - Caltech, 4U disk servers with 24 disks: reads and writes on two 12-port RAID controllers in parallel on each server; mean traffic ~545 MB/s, i.e. ~2 TB per hour. Lustre read/write reached ~320 MB/s between Florida and Caltech; FDT works with xrootd and interfaces to dCache using the dcap protocol.

Active available bandwidth measurements between all the ALICE grid sites.

Active available bandwidth measurements between all the ALICE grid sites (2).

End-to-end path provisioning on different layers between Site A and Site B: the default IP route on layer 3, VCAT and VLAN channels on layer 2, and an optical path on layer 1. The framework monitors the layout and sets up circuits, monitors interface traffic, monitors hosts and end-to-end paths, sets up end-host parameters, and controls transfers and bandwidth reservations.

Monitoring optical switches: dynamic restoration of the lightpath if a segment has problems.

Monitoring and controlling the topology and the optical power on the fibers used for optical circuits: port power monitoring (Glimmerglass switch example).

"On-demand" end-to-end optical path allocation. The application issues a transfer request (e.g. >FDT A/fileX B/path/); the MonALISA distributed service system and its OS agents create an end-to-end optical path in less than 1 s, configure the interfaces and start the data transfer, with real-time monitoring, alongside the regular IP path over the Internet. A LISA agent on each end host sets up the network interfaces, TCP stack, kernel parameters and routes, and tells the application which interface to use ("use eth1.2, ..."); the optical switches are controlled via TL1. Errors on the active light path are detected and the path is automatically recreated in less than the TCP timeout.

Controlling optical planes: automatic path recovery. An FDT transfer of 200+ MBytes/sec from a 1U node runs between CERN (Geneva) and Caltech (Pasadena) across Starlight, MANLAN, USLHCnet and Internet2. In the "fiber cut" simulations (4 fiber cut emulations), the traffic moves from one transatlantic line to the other, the FDT transfer (CERN - Caltech) continues uninterrupted, and TCP fully recovers in ~20 s.

"On-demand" dynamic circuits: channel and path allocation. The application issues a request (e.g. >FDT A/fileX B/path/), a path or channel is allocated, the interfaces are configured and the data transfer is started, while the regular IP path remains available. Local VLANs are mapped to WAN channels or light paths. It is recommended to use two NICs, one for management and one for data, or to bond two NICs to the same IP.

The need for planning and scheduling for large data transfers: comparing parallel vs. sequential access on the same storage, it is ~2.5x faster to perform the two reading tasks sequentially.

Dynamic path provisioning, queueing and scheduling: channel allocation based on VO/priority (plus wait time, etc.); creation on demand of an end-to-end path or channel and configuration of the end hosts; automatic recovery (rerouting) in case of errors; dynamic reallocation of throughput per channel to manage priorities and control time to completion where needed; and reallocation of resources that were requested but not used. User requests feed the scheduling, with real-time feedback between control, monitoring and the end-host agents.

Dynamic priority for FDT transfers on common segments.

Bandwidth Challenge at SC2005: 151 Gb/s, ~500 TB total in 4 hours.

FDT & MonALISA used at SC2006: 17.7 Gb/s disk-to-disk on a 10 Gb/s link used in both directions, from Florida to Caltech.

SC2006: Hyper BWC and official BWC.

SC2007: 80+ Gb/s disk to disk.

SC2008 Bandwidth Challenge, Austin, TX. Using FDT & MonALISA for storage-to-storage WAN transfers: a bidirectional peak of 114 Gbps (110 Gbps sustained) between Caltech, Michigan, CERN, FermiLab, Brazil, Korea, Estonia, Chicago, New York and Amsterdam.

SC2008 Bandwidth Challenge, Austin, TX: FDT transfers between the CIENA and Caltech booths. A Ciena OTU-4 standard link carrying a 100 Gbps payload (or 200 Gbps bidirectional) with forward error correction, as a 10 x 10 Gbps multiplex over an 80 km fibre spool. Peak: 199.9 Gb/s; mean: 191 Gb/s; ~1 PB transferred in ~12 hours.

The eight fallacies of distributed computing. It is fair to say that at the beginning of this project we underestimated some of the potential problems of developing a large distributed system in the WAN, and indeed the "eight fallacies of distributed computing" are very important lessons:
1) The network is reliable.
2) Latency is zero.
3) Bandwidth is infinite.
4) The network is secure.
5) Topology doesn't change.
6) There is one administrator.
7) Transport cost is zero.
8) The network is homogeneous.

MonALISA usage today: running 24x7 at ~360 sites, collecting ~1.5 million "persistent" parameters in real time and 60 million "volatile" parameters per day, with an update rate of ~25,000 parameter updates/sec. It monitors 40,000 computers, more than 100 WAN links, more than 6,000 complete end-to-end network path measurements, and tens of thousands of grid jobs running concurrently. It controls job summation, different central services for the grid, the EVO topology, FDT and more. The MonALISA repository system serves ~6 million user requests per year. Major communities: ALICE, CMS, ATLAS, PANDA, EVO, VRVS, LCG RUSSIA, OSG, MXG, RoEduNet, USLHCNET, ULTRALIGHT, Enlightened.

Summary. Modeling and simulations are important to understand the way distributed systems scale and their limits, and they help to develop and test optimization algorithms. The algorithms needed to create resilient distributed systems are not easy (handling asymmetric information). For data-intensive applications, it is important to move to more synergetic relationships between the applications, the computing and storage facilities, and the NETWORK. http://monalisa.caltech.edu