ECSAC - August 2009, Veli Lošinj. California Institute of Technology


1 ECSAC - August 2009, Veli Lošinj. California Institute of Technology
Monitoring, Control and Optimization in Large Distributed Systems. ECSAC - August 2009, Veli Lošinj. Iosif Legrand, California Institute of Technology

2 Monitoring Distributed Systems
An essential part of managing large-scale distributed data-processing facilities is a monitoring system able to track computing facilities, storage systems, networks and the very large number of applications running on these systems in near real time. The monitoring information gathered for all the subsystems is essential for design, debugging, accounting and the development of "higher-level services" that provide decision support and some degree of automated decisions, and for maintaining and optimizing workflow in large-scale distributed systems.

3 Monitoring Information is necessary for System Design, Control, Optimization, Debugging and Accounting
[Diagram: near-real-time MONITORING information feeds computing models, modeling & simulations, optimization algorithms, accounting, alarms and debugging, supporting control and operational support and the creation of resilient distributed systems.]

4 The LHC Data Grid Hierarchy
The need for distributed computing (MONARC): 11 Tier1 and 120+ Tier2 centers. [Diagram: the online system at the experiment sends ~PByte/sec to the Tier 0+1 CERN center (PBs of disk; tape robot); Tier 1 centers (IN2P3, RAL, INFN, FNAL) connect over ~10 Gbps links; Tier 2 centers over ~1-10 Gbps; Tier 3 institutes with physics data caches at 1 to 10 Gbps; Tier 4 workstations. Tens of petabytes by ~2010.]

5 Communication in Distributed Systems
Different types of dedicated protocols; distributed object systems: CORBA, DCOM, RMI. In "traditional" distributed object models (CORBA, DCOM), the stub is linked into the client: the client must know about the service from the beginning and needs the right stub for it, generated by an "IDL" compiler. [Diagram: client with stub, server with skeleton, both registered with a lookup service.] The server and the client code must be created together!
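The stub/skeleton coupling described above can be sketched in Java. The `MonitorService` interface, the stub class and the 0.42 load value are all hypothetical; a real CORBA or RMI stub would marshal the call over the network, but the point stands: client and server must share the same precompiled contract.

```java
// Hypothetical remote interface: in CORBA/RMI, both client and server
// must be compiled against this same contract.
interface MonitorService {
    double getLoad(String host);
}

// The stub the client links against; here it simply forwards to a local
// implementation, where a real RMI stub would marshal the call over TCP.
class MonitorServiceStub implements MonitorService {
    private final MonitorService server; // stands in for the network hop
    MonitorServiceStub(MonitorService server) { this.server = server; }
    public double getLoad(String host) { return server.getLoad(host); }
}

public class StubDemo {
    public static double callThroughStub(String host) {
        MonitorService server = h -> 0.42; // server-side skeleton + impl
        MonitorService stub = new MonitorServiceStub(server);
        return stub.getLoad(host);
    }
    public static void main(String[] args) {
        System.out.println(callThroughStub("node01")); // 0.42
    }
}
```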

6 Distributed Object Systems: Web Services (WSDL/SOAP)
[Diagram: the client discovers the service's WSDL interface through a lookup service and calls the server via SOAP.] The client can dynamically generate the data structures and the interfaces for using remote objects based on WSDL. Platform independent; large overhead; based on stateless connections.

7 Mobile Code and Distributed Services
Any protocol well suited to the application can be used. The client downloads a proxy through dynamic code loading, so services can be used dynamically: remote services (proxy == RMI stub), mobile agents (proxy == entire service), "smart proxies" (the proxy adjusts to the client). Such services act as true dynamic services and provide the functionality needed by any other services that require their information: a mechanism to dynamically discover all the "service units", remote event notification for changes anywhere in the system, and a lease mechanism for each registered unit. Is based on Java (Jini).
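By contrast, the proxy-based model can be illustrated with Java's built-in dynamic proxies: the client obtains a proxy object at run time instead of linking a precompiled stub. The `Sensor` interface and the constant reading are hypothetical; a Jini smart proxy would carry real service logic, caching or redirection.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

interface Sensor { int read(); }

public class SmartProxyDemo {
    // A "smart proxy" generated at run time: the client never links a
    // precompiled stub; the handler could cache, batch or redirect calls.
    public static Sensor downloadProxy() {
        InvocationHandler h = (proxy, method, args) -> 7; // pretend remote read
        return (Sensor) Proxy.newProxyInstance(
            Sensor.class.getClassLoader(), new Class<?>[]{Sensor.class}, h);
    }
    public static void main(String[] args) {
        System.out.println(downloadProxy().read()); // 7
    }
}
```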

8 The MonALISA Framework
MonALISA is a Dynamic, Distributed Service System capable of collecting any type of information from different systems, analyzing it in near real time, and providing support for automated control decisions and the global optimization of workflows in complex grid systems. The MonALISA system is designed as an ensemble of autonomous, multi-threaded, self-describing, agent-based subsystems which are registered as dynamic services and are able to collaborate and cooperate in performing a wide range of monitoring tasks. These agents can analyze and process the information in a distributed way and provide optimization decisions in large-scale distributed applications.

9 The MonALISA Architecture
[Diagram, bottom to top: a network of JINI lookup services (secure & public) provides distributed dynamic registration and discovery based on a lease mechanism and remote events. MonALISA services form a distributed system for gathering and analyzing information based on mobile agents, with customized aggregation, triggers and actions. Proxies provide secure and reliable communication, dynamic load balancing, scalability & replication, and AAA for clients, serving regional or global high-level services, repositories & clients. A fully distributed system with no single point of failure.]

10 MonALISA Service & Data Handling
[Diagram: monitoring modules are loaded dynamically and collect any type of information, using both push and pull. Collected values pass through filters, triggers and agents into a data cache, backed by a Postgres data store. The service and its database register with the lookup services; clients or higher-level services discover the service and subscribe with predicates & agents (data flows via the ML proxy), while WS clients use a web-service interface (WSDL/SOAP). Application configuration and control run over SSL.]
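As a minimal sketch of the dynamically loaded, pull-style monitoring modules mentioned above (the `MonitoringModule` interface and `LoadModule` class are hypothetical illustrations, not the real MonALISA API):

```java
// Hypothetical pull-style monitoring module, loosely modeled on the idea
// of MonALISA's dynamically loaded modules (not the real API).
interface MonitoringModule {
    String parameterName();
    double collect();          // pulled periodically by the service
}

class LoadModule implements MonitoringModule {
    public String parameterName() { return "load1"; }
    public double collect() { return 0.5; } // would read /proc/loadavg
}

public class ModuleLoaderDemo {
    // Dynamic loading: the service learns the class name only at run time.
    public static double poll(String className) {
        try {
            MonitoringModule m = (MonitoringModule)
                Class.forName(className).getDeclaredConstructor().newInstance();
            return m.collect();
        } catch (ReflectiveOperationException e) {
            return Double.NaN; // unknown or broken module
        }
    }
    public static void main(String[] args) {
        System.out.println(poll("LoadModule")); // 0.5
    }
}
```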

11 Registration / Discovery, Admin Access and AAA for Clients
[Diagram: each MonALISA service registers with the lookup services using a signed certificate from a trust keystore, and clients (or other services) discover it there. Applications send data through filters & agents to the MonALISA service and on to a proxy multiplexer. Admin access uses client authentication over an SSL connection, mediated by the AAA services.]

12 Monitoring Grid sites, Running Jobs, Network Traffic, and Connectivity
[Screenshots: topology and accounting views of grid sites, running jobs, network traffic and connectivity.]

13 Monitoring the Execution of Jobs and the Time Evolution
[Screenshots: submitting a job DAG (Job -> Job1, Job2, Job3), split jobs, and lifelines for jobs.]

14 Monitoring CMS Jobs Worldwide
CMS is using MonALISA and ApMon to monitor all the production and analysis jobs. This information is then used in the CMS dashboard frontend. [Charts: rate of collected monitoring values and total collected values, April-June 2009; the computer that runs ML at CERN was replaced.] Collected ~4.5 x 10^9 monitoring values in the first half of 2009. Collects monitoring data at rates of more than 1000 values per second, with peaks of more than 1500 jobs reporting concurrently to the same server. Service uptime > 150 days of continuous operation without any problems. The loss rate for UDP messages (values) is less than 5 x 10^-6.
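ApMon reports values over UDP, which is why a (tiny) loss rate is quoted at all. A minimal sketch of the idea in Java, sending one name=value datagram over the loopback interface and reading it back; the real ApMon wire format is XDR-encoded binary, so this is only illustrative:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class UdpMetricDemo {
    // Send one "name=value" datagram to our own socket and read it back.
    // ApMon itself encodes parameters in XDR; this shows only the
    // fire-and-forget UDP idea behind it.
    public static String roundTrip(String metric) {
        try (DatagramSocket sock = new DatagramSocket()) {
            sock.setSoTimeout(2000); // do not hang if the packet is lost
            byte[] out = metric.getBytes(StandardCharsets.UTF_8);
            sock.send(new DatagramPacket(out, out.length,
                    InetAddress.getLoopbackAddress(), sock.getLocalPort()));
            byte[] in = new byte[1024];
            DatagramPacket p = new DatagramPacket(in, in.length);
            sock.receive(p);
            return new String(p.getData(), 0, p.getLength(),
                    StandardCharsets.UTF_8);
        } catch (Exception e) {
            return ""; // UDP is best-effort: loss shows up as an empty result
        }
    }
    public static void main(String[] args) {
        System.out.println(roundTrip("jobs_running=125"));
    }
}
```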

15 Monitoring architecture in ALICE
[Diagram: ApMon instrumentation on every AliEn component (CE, SE, IS, TQ, brokers, optimizers, job agents, cluster monitors, MySQL servers, CastorGrid scripts, LCG tools, MyProxy status, API services) reports parameters such as job status, run time, CPU (ksi2k), load, memory (rss/vsz), open files, number of files, free space, disk used, processes, sockets, sessions, migrated MBytes and network in/out to a MonALISA service at each site and at CERN; aggregated data flows to the MonALISA repository (long-history DB), which raises alerts and actions.]

16 ALICE : Global Views, Status & Jobs

17 ALICE: Job status – history plots

18 ALICE: Resource Usage monitoring
Cumulative parameters: CPU time & CPU KSI2K, wall time & wall KSI2K, read & written files, input & output traffic (xrootd). Running parameters: resident memory, virtual memory, open files, workdir size, disk usage, CPU usage. Aggregated per site.

19 ALICE: Job agents monitoring
From the Job Agent itself: requesting job, installing packages, running job, done, error statuses. From the Computing Element: available job slots, queued Job Agents, running Job Agents.

20 Local and Global Decision Framework
Two levels of decisions: local (autonomous) and global (correlations). Actions are triggered by values above/below given thresholds, absence/presence of values, or correlations between any values. Action types: alerts (e-mails / instant messages / Atom feeds), running an external command, automatic chart annotations in the repository, and running custom code, such as securely ordering an ML service to (re)start a site service. [Diagram: local ML services take decisions on sensor data (temperature, humidity, A/C power); global ML services act on global information about traffic, jobs, hosts and applications.]
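A local decision rule of this kind (a value crossing a threshold, or the absence of fresh values) can be sketched as follows; the parameter names are illustrative:

```java
public class TriggerDemo {
    // Local decision rule: fire when a value crosses a threshold, or when
    // no fresh value has arrived within maxSilenceMs (absence of data).
    public static boolean shouldAlert(double value, double threshold,
                                      long lastSeenMs, long nowMs,
                                      long maxSilenceMs) {
        boolean overThreshold = value > threshold;
        boolean stale = (nowMs - lastSeenMs) > maxSilenceMs;
        return overThreshold || stale;
    }
    public static void main(String[] args) {
        // e.g. CPU load 0.97 against a 0.90 threshold fires an alert
        System.out.println(shouldAlert(0.97, 0.90, 0, 1000, 60000)); // true
    }
}
```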

21 ALICE: Automatic job submission, Restarting Services
The MySQL daemon is automatically restarted when it runs out of memory (trigger: threshold on VSZ memory usage). The ALICE production jobs queue is kept full by automatic submission (trigger: threshold on the number of aliprod waiting jobs). Administrators are kept up to date on the services' status (trigger: presence/absence of monitored information).

22 Automatic actions in ALICE
ALICE is using the monitoring information to automatically: resubmit error jobs until a target completion percentage is reached; submit new jobs when necessary (watching the task queue size for each service account), both production jobs and RAW data reconstruction jobs for each pass; restart site services whenever tests of VoBox services fail but the central services are OK; send notifications / add chart annotations when a problem was not solved by a restart; and dynamically modify the DNS aliases of central services for efficient load balancing. Most of the actions are defined by a few lines in configuration files.

23 The USLHCnet
Advanced standards: dynamic circuits; CIENA Core Directors; mesh protection; equipment and link redundancy. Together with ESnet, and also Internet2 and SINET3 (Japan), USLHCnet provides highly resilient data paths for the US Tier1s. [From the US LHCNet status report, Artur Barczyk, 04/16/2009.]

24 Monitoring Links Availability: Very Reliable Information
[Link availability table: AMS-GVA (GEANT) 99.5%; AMS-NYC (GC) 97.9%; CHI-NYC (Qwest) 99.9%; CHI-GVA (GC) 96.6%; CHI-GVA (Qwest) 99.3%; (CERN) 98.9%; GVA-NYC (Colt); GVA-NYC (GC) 99.5%. Color scale: 0-95%, 95-97%, 97-98%, 98-99%, 99-100%, 100%. From Artur Barczyk, 04/16/2009.]

25 Monitoring USLHCNet Topology
Topology, status & peering. Real-time topology for L2 circuits.

26 USLHCnet: Traffic on different segments

27 USLHCnet: Accounting for Integrated Traffic

28 ALARMS and Automatic notifications for USLHCnet

29 The UltraLight Network
[Map: the UltraLight network topology, including the BNL ESnet in/out links.]

30 Monitoring Network Topology (L3), Latency, Routers
[Views: networks, autonomous systems (AS), routers.] Real-time topology discovery & display.

31 Available Bandwidth Measurements
Embedded Pathload module.

32 EVO : Real-Time monitoring for Reflectors and the quality of all possible connections

33 EVO: Creating a Dynamic, Global, Minimum Spanning Tree to optimize the connectivity
A weighted connected graph G = (V,E) with n vertices and m edges. The quality of connectivity between any two reflectors is measured every second. Building, in near real time, a minimum spanning tree with additional constraints: a resilient overlay network that optimizes real-time communication.
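A minimum spanning tree over the measured link qualities can be computed, for example, with Prim's algorithm; the sketch below uses a small RTT matrix as the edge weights and ignores the additional constraints mentioned on the slide:

```java
import java.util.Arrays;

public class MstDemo {
    // Prim's algorithm on a dense adjacency matrix of link "costs"
    // (e.g. measured RTT between reflectors); returns the total tree weight.
    public static int mstWeight(int[][] w) {
        int n = w.length;
        int[] best = new int[n];        // cheapest edge into the tree
        boolean[] inTree = new boolean[n];
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;
        int total = 0;
        for (int it = 0; it < n; it++) {
            int u = -1;                 // pick the cheapest vertex not yet in
            for (int v = 0; v < n; v++)
                if (!inTree[v] && (u == -1 || best[v] < best[u])) u = v;
            inTree[u] = true;
            total += best[u];
            for (int v = 0; v < n; v++) // relax edges out of u
                if (!inTree[v] && w[u][v] < best[v]) best[v] = w[u][v];
        }
        return total;
    }
    public static void main(String[] args) {
        int[][] rtt = { {0, 2, 9}, {2, 0, 4}, {9, 4, 0} };
        System.out.println(mstWeight(rtt)); // 6: edges of cost 2 and 4
    }
}
```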

34 Dynamic MST to optimize the Connectivity for Reflectors
Frequent measurements of RTT, jitter, traffic and lost packets. The MST is recreated in ~1 s in case of communication problems.

35 EVO: Optimize how clients connect to the system for best performance and load balancing

36 FDT – Fast Data Transfer
FDT is an application for efficient data transfers. Easy to use; written in Java, it runs on all major platforms. It is based on an asynchronous, multi-threaded system using the NIO library and is able to: stream a list of files continuously; use independent threads to read and write on each physical device; transfer data in parallel on multiple TCP streams, when necessary; use appropriately sized buffers for disk I/O and networking; and resume a file transfer session.
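The core NIO primitive behind such transfers can be sketched with `FileChannel.transferTo`; this is not FDT's actual code (FDT adds buffer pools, per-device threads and parallel TCP streams), just the underlying idea:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class NioCopyDemo {
    // Zero-copy style transfer with FileChannel.transferTo, the kind of
    // NIO primitive FDT builds on for moving bytes between channels.
    public static long copy(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long pos = 0, size = in.size();
            while (pos < size)          // transferTo may copy only part of
                pos += in.transferTo(pos, size - pos, out); // the range
            return pos;
        }
    }

    // Self-contained demo on temp files; returns the bytes transferred.
    public static long demo() {
        try {
            Path src = Files.createTempFile("fdt", ".in");
            Files.write(src, "payload".getBytes());
            Path dst = Files.createTempFile("fdt", ".out");
            return copy(src, dst);
        } catch (IOException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 7 bytes for "payload"
    }
}
```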

37 FDT – Fast Data Transfer
[Diagram: independent reader threads per device fill a pool of buffers; data moves through kernel space over the data-transfer sockets/channels; on the receiver, files are restored from a matching pool of buffers. A separate control connection handles authorization.]

38 FDT can be monitored and controlled dynamically by the MonALISA system
FDT features: the FDT architecture allows "plugging in" external security APIs and using them for client authentication and authorization. It supports several security schemes: IP filtering, SSH, GSI-SSH, Globus-GSI, SSL. User-defined loadable modules for pre- and post-processing provide support for dedicated MS (mass storage) systems, compression, etc.

39 FDT – Memory to Memory Tests in WAN
[Charts: ~9.4 Gb/s and ~9.0 Gb/s memory-to-memory throughput. Test nodes: dual-core Intel 3.00 GHz CPUs, 4 GB RAM, 4 x 320 GB SATA disks, connected with 10 Gb/s Myricom NICs.]

40 Disk -to- Disk transfers in WAN
NEW YORK GENEVA Reads and writes on 4 SATA disks in parallel on each server Mean traffic ~ 210 MB/s ~ 0.75 TB per hour MB/s CERN CALTECH Reads and writes on two 12-port RAID Controllers in parallel on each server Mean traffic ~ 545 MB/s ~ 2 TB per hour 1U Nodes with 4 Disks 4U Disk Servers with 24 Disks Lustre read/ write ~ 320 MB/s between Florida and Caltech Works with xrootd Interface to dCache using the dcap protocol October Iosif Legrand Iosif Legrand August

41 Active Available Bandwidth measurements between all the ALICE grid sites

42 Active Available Bandwidth measurements between all the ALICE grid sites (2)

43 End to End Path Provisioning on different layers
[Diagram: between Site A and Site B, a default IP route at Layer 3, VCAT and VLAN channels at Layer 2, and an optical path at Layer 1. The system monitors the layout and sets up circuits, monitors interface traffic, monitors hosts & end-to-end paths and sets up end-host parameters, and controls transfers and bandwidth reservations.]

44 Monitoring Optical Switches
Dynamic restoration of the lightpath if a segment has problems.

45 Monitoring the Topology and Optical Power on Fibers for Optical Circuits
[Example: a Glimmerglass switch, with control and per-port optical power monitoring.]

46 “On-Demand”, End to End Optical Path Allocation
[Diagram: the application issues ">FDT A/fileX B/path/"; the MonALISA distributed service system allocates an optical path, configures the interfaces and starts the data transfer, creating an end-to-end path in < 1 s, with real-time monitoring alongside the regular IP path. A LISA agent on each end host sets up the network interfaces, TCP stack, kernel parameters and routes, and tells the application which interface to use ("use eth1.2, ..."); an OS agent drives the optical switch via TL1. Errors on the active light path are detected and the path is automatically recreated in less than the TCP timeout.]

47 Controlling Optical Planes Automatic Path Recovery
"Fiber cut" simulations between CERN (Geneva) and Caltech (Pasadena) via Starlight and Manlan over USLHCnet / Internet2: in 4 fiber-cut emulations, the traffic moves from one transatlantic line to the other; the FDT transfer (CERN - Caltech, 200+ MBytes/sec from a 1U node) continues uninterrupted and TCP fully recovers in ~20 s.

48 “On-Demand”, Dynamic Circuits Channel and Path Allocation
[Diagram: the application issues ">FDT A/fileX B/path/"; a path or channel is allocated, the interfaces are configured and the data transfer starts, alongside the regular IP path. Local VLANs are mapped to WAN channels or light paths.] It is recommended to use two NICs, one for management and one for data, or to bond two NICs to the same IP.

49 The Need for Planning and Scheduling for Large Data Transfers
[Charts: the same two reading tasks run in parallel vs. sequentially; it is ~2.5x faster to perform the two reading tasks sequentially.]

50 Dynamic Path Provisioning Queueing and Scheduling
Channel allocation based on VO/priority [+ wait time, etc.]. Create on demand an end-to-end path or channel & configure the end hosts. Automatic recovery (rerouting) in case of errors. Dynamic reallocation of throughput per channel, to manage priorities and control time to completion where needed; resources requested but not used are reallocated. [Diagram: user requests enter a scheduling queue; end-host agents and monitoring provide real-time feedback control.]
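The VO/priority-plus-wait-time ordering can be sketched as a simple comparator; the `Request` fields and the tie-breaking rule are assumptions based on the slide:

```java
import java.util.Comparator;
import java.util.List;

public class ChannelSchedulerDemo {
    // Hypothetical request record: lower number = higher VO priority;
    // ties are broken in favor of the request that has waited longest.
    static final class Request {
        final String vo; final int priority; final long waitedMs;
        Request(String vo, int priority, long waitedMs) {
            this.vo = vo; this.priority = priority; this.waitedMs = waitedMs;
        }
    }

    // Pick the VO whose request gets the next available channel.
    public static String nextToSchedule(List<Request> pending) {
        return pending.stream()
                .min(Comparator.comparingInt((Request r) -> r.priority)
                        .thenComparing(r -> -r.waitedMs))
                .map(r -> r.vo)
                .orElse("");
    }

    public static void main(String[] args) {
        List<Request> pending = List.of(
                new Request("alice", 2, 50_000),
                new Request("cms", 1, 10_000),
                new Request("atlas", 1, 30_000));
        // "atlas": same top priority as "cms" but has waited longer
        System.out.println(nextToSchedule(pending)); // atlas
    }
}
```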

51 Dynamic priority for FDT Transfers on common segments

52 Bandwidth Challenge at SC2005
151 Gb/s peak; ~500 TB total in 4 hours.

53 FDT & MonALISA Used at SC2006: 17.7 Gb/s Disk to Disk
On a 10 Gb/s link used in both directions, from Florida to Caltech.

54 SC2006 Hyper BWC and Official BWC (April 2007)

55 SC2007: 80+ Gb/s disk to disk

56 SC2008 – Bandwidth Challenge, Austin, TX
Using FDT & MonALISA. Storage-to-storage WAN transfers: a bi-directional peak of 114 Gbps (110 Gbps sustained) between Caltech, Michigan, CERN, FermiLab, Brazil, Korea, Estonia, Chicago, New York and Amsterdam.

57 SC2008 – Bandwidth Challenge, Austin, TX
FDT transfers between the CIENA and Caltech booths: a Ciena OTU-4 standard link carrying a 100 Gbps payload (or 200 Gbps bidirectional) with forward error correction, as a 10 x 10 Gbps multiplex over an 80 km fibre spool. Mean throughput 191 Gb/s; we transferred ~1 PB in ~12 hours.

58 The eight fallacies of distributed computing
It is fair to say that at the beginning of this project we underestimated some of the potential problems of developing a large distributed system over a WAN, and indeed the "eight fallacies of distributed computing" are very important lessons: 1) The network is reliable. 2) Latency is zero. 3) Bandwidth is infinite. 4) The network is secure. 5) Topology doesn't change. 6) There is one administrator. 7) Transport cost is zero. 8) The network is homogeneous.

59 MonALISA Today
Running 24x7 at ~360 sites. Collecting ~1.5 million "persistent" parameters in real time and 60 million "volatile" parameters per day, with an update rate of ~25,000 parameter updates/sec. Monitoring 40,000 computers, > 100 WAN links, > 6,000 complete end-to-end network path measurements, and tens of thousands of grid jobs running concurrently. Controls job summation, different central services for the grid, the EVO topology, FDT, etc. The MonALISA repository system serves ~6 million user requests per year. [Chart: usage by community, including USLHCnet, VRVS, ALICE, OSG and EVO.] Major communities: ALICE, CMS, ATLAS, PANDA, EVO, LCG RUSSIA, OSG, MXG, RoEduNet, USLHCNET, ULTRALIGHT, Enlightened ...

60 Summary: http://monalisa.caltech.edu
Modeling and simulation are important to understand the way distributed systems scale and their limits, and they help to develop and test optimization algorithms. The algorithms needed to create resilient distributed systems are not easy (handling the asymmetric information). For data-intensive applications, it is important to move to more synergetic relationships between the applications, the computing and storage facilities, and the NETWORK.

