Joint Institute for Nuclear Research. Synthesis of the simulation and monitoring processes for the data storage and big data processing development in physical experiments.

Presentation transcript:

Joint Institute for Nuclear Research. Synthesis of the simulation and monitoring processes for the data storage and big data processing development in physical experiments. Dubna, 2014. Authors: Korenkov V., Nechaevskiy A., Ososkov G., Pryahina D., Trofimov V., Uzhinskiy A. Supported by RFBR.

The LHC Computing Challenge (slide 2)
The Large Hadron Collider (LHC) delivered billions of recorded collisions to the experiments in Run 1; about 100 PB of data are stored on tape at CERN. The Worldwide LHC Computing Grid (WLCG) provides compute and storage resources for data processing, simulation and analysis in more than 200 Tier-1 and Tier-2 centers: ~300k cores, ~200 PB of disk and ~200 PB of tape. The computing challenge turned into a great success: an unprecedented data volume was analyzed in record time, delivering major scientific results (e.g. the Higgs boson discovery).

LHC computing challenges for Run 2 (slide 3)
A considerable LHC renovation is under way now to start Run 2 in 2015, when the event reconstruction time will be doubled while the event rate to storage is expected to be 2.5 times higher. We are coming to the Big Data era! This is needed for potentially new physics, but it faces great challenges in LHC computing: a large increase of CPU and WLCG resources; combined grid and cloud access; distributed parallel computing; an overhaul of most of the simulation and analysis software. We are BIG! Such a fast evolution of LHC distributed computing demands careful dynamical simulation of all the processes of data storage, transfer and analysis.

Grid and Cloud (slide 4)
Question: how can the system behavior be predicted after modifications?
Solution: make the simulation a continuous process and use the statistics accumulated during operation as an additional input to improve the accuracy of the results (see N. Kutovsky's PhD thesis).

What is simulation for? (slide 5)
Simulation allows one to:
- predict Grid system performance under various changes: different workloads, system configurations and scheduling heuristics;
- predict and prevent a number of unexpected situations;
- identify the system's "weak points" (channels, queues, network capacity and other bottlenecks);
- study different system configurations (grid infrastructure architecture, resource configuration, number of resource centers, etc.);
- try out new data access algorithms and replication strategies;
- optimize the equipment needed for data transfer and storage, taking into account the minimum cost and other project requirements;
- optimize the distribution of resources between user groups;
- test the functioning of the system to find its bottlenecks and other weak points.

Analytical or imitative simulation? (slide 6)
Analytical approaches to the simulation of grid and cloud systems:
- the system is treated as a multi-channel queuing system whose state is controlled by a Markov process;
- the system is treated as a dynamic stochastic network, described by a system of equations that takes into account both routing and resource allocation in the network.
Both approaches give the simulation result in the form of asymptotic distributions and, because of their limiting theoretical assumptions, cannot be applied to the concrete modeling of the complex multi-tier architecture of computer networks with realistic distributions of the input task streams, complicated multi-priority service disciplines and dynamic resource allocation. We propose to use imitative (discrete-event) simulation.
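To illustrate what the analytical approach delivers (an example added here, not taken from the slides), a grid site with c identical CPUs can be treated as an M/M/c queue with arrival rate \lambda and service rate \mu; the Markov analysis yields only the stationary (asymptotic) distribution of the number of jobs in the system:

% Stationary distribution of an M/M/c queue, \rho = \lambda/(c\mu) < 1.
\[
  p_n =
  \begin{cases}
    p_0 \dfrac{(c\rho)^n}{n!},      & 0 \le n < c,\\[1ex]
    p_0 \dfrac{c^c \rho^{\,n}}{c!}, & n \ge c,
  \end{cases}
  \qquad
  p_0 = \left[ \sum_{k=0}^{c-1} \frac{(c\rho)^k}{k!}
        + \frac{(c\rho)^c}{c!\,(1-\rho)} \right]^{-1}.
\]

Such closed forms require Poisson arrivals and exponential service times, which is exactly the limitation noted above.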

Simulation toolkits (slide 7)
GridSim:
- allows the simulation of various classes of heterogeneous resources, users, applications and brokers;
- there is no restriction on the number of jobs that can be sent to a resource;
- supports the simulation of static and dynamic schedulers;
- statistics of all or of selected operations can be recorded;
- implemented in Java; configuration files are used to set the simulation parameters; the source code is available;
- the multilevel architecture makes it easy to add new components.
However, GridSim is a job scheduling simulator: it cannot simulate data transfers.
Alea:
- based on the latest GridSim;
- visualization of the output;
- includes schedulers;
- supports single-CPU and multi-CPU jobs;
- intended for the study of advanced scheduling techniques in a Grid environment.

Challenge (slide 8)
Tasks:
- continuous development of Grid systems requires continuous adjustment of the simulation parameters;
- it is necessary to predict the behavior of the system under significant changes;
- the operational statistics obtained from the monitoring tools can be used.
This leads to two requirements:
1. the source data of the simulation must match the real data;
2. simulation validation: the simulation must produce correct results and must not deviate from the behavior of the real system.
Approach:
1. Simulation of existing computational systems: the input data are the system statistics taken from the monitoring and accounting tools (e.g. wlcg-transfers.cern.ch/ui/); the simulation results should match, within the error limits, the monitoring results for the same tasks.
2. Simulation of new computational systems: a hypothesis is made about the parameters of the input data flow and the procedures for processing it; the distributions of the event times generated by processing the input data flow are analyzed and compared with the results obtained from the monitoring of the existing system.
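One simple way to quantify "match within the error limits" is a two-sample test on the per-job completion times exported by the monitoring and by the simulation. This is a sketch of the idea, not part of the original work; the file names are hypothetical.

# Compare simulated and monitored job completion times with a
# two-sample Kolmogorov-Smirnov test (file names are hypothetical).
import numpy as np
from scipy.stats import ks_2samp

monitored = np.loadtxt("monitoring_job_times.txt")  # seconds, from the accounting DB
simulated = np.loadtxt("symsim_job_times.txt")      # seconds, from the simulation run

stat, p_value = ks_2samp(monitored, simulated)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")
# A small KS statistic (and a p-value that does not reject equality)
# indicates that the simulation reproduces the monitored distribution.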

Monitoring system (slide 9)
A set of hardware and software for the analysis and control of a distributed computing system. Main monitoring tasks:
- continuous monitoring of the grid services, both the basic ones (common to the entire infrastructure) and those belonging to individual resource centers;
- obtaining information about the computing resources (number of compute nodes available to run tasks, architecture of the computing system, installed software, available specialized packages) and about CPU consumption;
- data on the access of virtual organizations to the resources and on their use of the quotas for computing resources;
- monitoring of the execution of computational jobs and tasks (start, status changes, exit codes, etc.).
Monitoring parameters required for the simulation:
1. number of tasks (simulation, analysis, reconstruction) coming into the system;
2. RAM used;
3. CPU time used;
4. number of processed events;
5. job execution time;
6. amount of data used.
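The six parameters above map naturally onto a per-job record pulled from the monitoring database. A minimal sketch; the field names are assumptions for illustration, not the real monitoring schema.

# Minimal per-job monitoring record used to drive the simulation
# (field names are illustrative, not the actual monitoring schema).
from dataclasses import dataclass
from enum import Enum

class JobType(Enum):
    SIMULATION = "simulation"
    RECONSTRUCTION = "reconstruction"
    ANALYSIS = "analysis"

@dataclass
class JobRecord:
    job_type: JobType   # 1. kind of task entering the system
    ram_mb: float       # 2. RAM used
    cpu_time_s: float   # 3. CPU time used
    n_events: int       # 4. number of processed events
    wall_time_s: float  # 5. job execution (wall-clock) time
    data_mb: float      # 6. amount of data used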

Synthesis of monitoring and simulation (SyMSim) scheme (slide 10)
Real Grid:
1. jobs are submitted to the server;
2. jobs are requested;
3. jobs are submitted for execution;
4. statistics are written to the database.
Simulation:
5. the simulation is based on the actual monitoring data, which are used to generate the simulation input;
6. the job flow is imitated;
7. jobs are requested;
8. jobs are submitted for execution;
9. the simulation results are analyzed and the statistical information is written.
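A compressed sketch of how such a feedback loop could be wired together (the step numbers refer to the scheme above; the database layout and all function names are illustrative assumptions, not the actual SyMSim code):

# SyMSim-style loop: monitoring statistics drive the generation of the
# simulated job flow, and the simulation output is stored for comparison.
import random
import sqlite3

def load_monitoring_stats(db_path):
    # steps 4-5: read the job statistics accumulated by the real grid
    with sqlite3.connect(db_path) as conn:
        return conn.execute(
            "SELECT job_type, cpu_time_s, data_mb FROM job_stats").fetchall()

def generate_job_flow(stats, n_jobs=99):
    # step 6: imitate the incoming job flow by resampling the recorded jobs
    return [random.choice(stats) for _ in range(n_jobs)]

def simulate(job_flow):
    # steps 7-8: placeholder for the discrete-event model sketched further below
    return {"makespan_s": sum(cpu for _type, cpu, _mb in job_flow)}

def store_results(db_path, results):
    # step 9: keep the simulation output next to the monitoring data
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS sim_results (makespan_s REAL)")
        conn.execute("INSERT INTO sim_results VALUES (?)", (results["makespan_s"],))

def run_cycle(db_path):
    stats = load_monitoring_stats(db_path)
    store_results(db_path, simulate(generate_job_flow(stats)))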

Problem statement (slide 11)
Structure under study: a tape robot, a disk array and a CPU cluster.
Baseline costs: 1 drive = 5 conventional units, 1 CPU = 3 conventional units.
Outcome criterion: the passage time of a test flow of 99 jobs.
Aim: find the optimal ratio of the number of processors to the number of drives within a limited budget of 1000 units.
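The search space implied by this statement is small enough to enumerate directly: try every (CPUs, drives) pair that fits the budget, evaluate each with the simulation, and keep the fastest. A sketch; the cost figures come from the slide, while simulate_makespan() is a crude stand-in for the discrete-event model sketched further below.

# Enumerate configurations under the stated budget and pick the one with the
# smallest simulated passage time of the 99-job test flow.
CPU_COST, DRIVE_COST, BUDGET = 3, 5, 1000  # conventional units, from the slide

def simulate_makespan(n_cpu, n_drives, cpu_work_s=99 * 3600.0, stage_work_s=99 * 600.0):
    # Crude stand-in: the slower of the CPU pool and the drive pool dominates.
    # The work amounts are invented; the real figure comes from the full model.
    return max(cpu_work_s / n_cpu, stage_work_s / n_drives)

def best_configuration():
    best = None
    for n_cpu in range(1, BUDGET // CPU_COST + 1):
        remaining = BUDGET - n_cpu * CPU_COST
        for n_drives in range(1, remaining // DRIVE_COST + 1):
            candidate = (simulate_makespan(n_cpu, n_drives), n_cpu, n_drives)
            if best is None or candidate < best:
                best = candidate
    return best  # (makespan_s, n_cpu, n_drives)

print(best_configuration())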

Job flow analysis (ATLAS) (slide 12)
a) simulation: intensive use of CPU;
b) reconstruction: memory-intensive;
c) analysis: intensive data replication and heavy loading of the communication channels.
The slide shows the approximate ratio of the different job types, their CPU consumption and the number of processed files. Our job flow generator produces these three types of jobs with parameters (number of events, CPU time, RAM, etc.) close to the ATLAS ones.
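A minimal generator in this spirit could look as follows; the job-type mix and the sampled parameters are illustrative placeholders, not the ATLAS figures from the slide.

# Job flow generator producing three ATLAS-like job types.
# The mix and parameter distributions are illustrative placeholders only.
import random

TYPE_MIX = {"simulation": 0.5, "reconstruction": 0.2, "analysis": 0.3}  # assumed

def generate_job():
    job_type = random.choices(list(TYPE_MIX), weights=list(TYPE_MIX.values()))[0]
    if job_type == "simulation":      # CPU-intensive
        return dict(type=job_type, cpu_time_s=random.expovariate(1 / 7200),
                    ram_mb=1500, files=1)
    if job_type == "reconstruction":  # memory-intensive
        return dict(type=job_type, cpu_time_s=random.expovariate(1 / 3600),
                    ram_mb=4000, files=5)
    return dict(type=job_type,        # analysis: many input files
                cpu_time_s=random.expovariate(1 / 900), ram_mb=2000, files=50)

job_flow = [generate_job() for _ in range(99)]  # the 99-job test flow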

Job flow scheme (slide 13)
1. If a free CPU slot and the input file are available, the job is executed.
2. If the file is stored in the tape library, the job reserves a slot but waits until the file arrives on disk.
3. Staging from tape to disk: the tape cartridge is moved to a drive, the cartridge's file system is mounted on the drive, and the file is copied to disk.
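This logic maps directly onto a discrete-event model in which the CPU slots and the tape drives are limited resources. A compact sketch using the SimPy library; the counts and timings are illustrative assumptions, not the SyMSim parameters.

# Discrete-event sketch: a job reserves a CPU slot, stages its file from tape
# through one of the drives if needed, then runs. Numbers are illustrative.
import random
import simpy

MOUNT_TIME_S, COPY_RATE_MB_S = 120, 100

def job(env, name, cpu_slots, drives, on_disk, file_mb, cpu_time_s):
    with cpu_slots.request() as slot:        # 1-2. reserve a CPU slot
        yield slot
        if not on_disk:                      # 3. stage the file from tape
            with drives.request() as drive:
                yield drive
                yield env.timeout(MOUNT_TIME_S + file_mb / COPY_RATE_MB_S)
        yield env.timeout(cpu_time_s)        # 1. execute the job
        print(f"{name} finished at t = {env.now:.0f} s")

env = simpy.Environment()
cpu_slots = simpy.Resource(env, capacity=18)  # e.g. the 18-CPU configuration
drives = simpy.Resource(env, capacity=9)      # e.g. the 9-drive configuration
for i in range(99):                           # the 99-job test flow
    env.process(job(env, f"job{i}", cpu_slots, drives,
                    on_disk=random.random() < 0.3,
                    file_mb=random.uniform(500, 4000),
                    cpu_time_s=random.expovariate(1 / 3600)))
env.run()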

Interaction of the model components (slide 14)
The job flow generator feeds the model. A set of classes allows one to simulate all the processes involved in handling a file copy on tape: loading and unloading of a cartridge by the manipulator, mounting it on a drive, searching for a file on the tape and reading/writing it.
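One possible decomposition into classes, following the operations listed above (a structural sketch only; the class names are illustrative and are not the actual SyMSim classes):

# Structural sketch of the tape-library components; behavior is reduced to
# bookkeeping, and the class names are illustrative.
class Cartridge:
    def __init__(self, label, files):
        self.label, self.files = label, files  # files: {name: size_mb}

class Drive:
    def __init__(self):
        self.mounted = None
    def read_file(self, name):                 # search the mounted tape for a file
        return self.mounted.files[name]

class Manipulator:
    def load(self, cartridge, drive):          # move a cartridge onto a drive
        drive.mounted = cartridge
    def unload(self, drive):
        drive.mounted = None

class TapeRobot:
    def __init__(self, cartridges, n_drives):
        self.cartridges = cartridges
        self.drives = [Drive() for _ in range(n_drives)]
        self.manipulator = Manipulator()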

Simulation results (slide 15)
The plot shows the execution time of the set of tasks as a function of the number of CPUs and the number of drives. The configuration providing the minimum execution time is 18 CPUs and 9 drives.

Result analysis (slide 16)
Determination of the degree of cluster loading: W = T100 / Ta, where T100 is the CPU runtime of the job package and Ta is its astronomical (wall-clock) runtime. The loading falls when the number of CPUs is large: the CPUs idle while waiting for the tapes with data to be mounted on the drives. Consequently, it is necessary to choose the optimal ratio of the number of processors to the number of drives.
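Written out, with a small illustrative example (reading T100 as the runtime the package would take if the CPUs were loaded 100% of the time; the numbers are made up, not taken from the slide):

% Cluster loading degree: ideal (fully loaded) runtime of the job package
% over its observed astronomical runtime.
\[
  W \;=\; \frac{T_{100}}{T_a}, \qquad 0 < W \le 1 .
\]
% Example: if the package would take T_100 = 80 h with the CPUs busy all the
% time but actually finishes in T_a = 100 h because the CPUs wait for tape
% mounts, then W = 0.8.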

Conclusion (slide 17)
The proposed approach to the simulation and analysis of computational structures is universal. It can also be used to solve the design problems and to support the subsequent development of data repositories, and is not limited to the area of physical experiments.

Thank you! (slide 18)