PROOF: the Parallel ROOT Facility
Scheduling and Load-balancing
ACAT 2007
Jan Iwaszkiewicz ¹ ², Gerardo Ganis ¹, Fons Rademakers ¹
¹ CERN PH/SFT, ² University of Warsaw

Outline
- Introduction to the Parallel ROOT Facility
- Packetizer: load balancing
- Resource scheduling

Analysis of Large Hadron Collider data
- Necessity of distributed analysis
- ROOT: a popular particle-physics data-analysis framework
- PROOF (ROOT's extension) automatically parallelizes processing on computing clusters or multi-core machines

Who is using PROOF
- PHOBOS
  - MIT, dedicated cluster, interfaced with Condor
  - Real data analysis, in production
- ALICE
  - CERN Analysis Facility (CAF)
- CMS
  - Santander group, dedicated cluster
  - Physics TDR analysis
- Very positive experience: functionality, large speedup, efficient
- But not really the LHC scenario: usage limited to a few experienced users

Using PROOF: example
- PROOF is designed for the analysis of independent objects, e.g. ROOT trees (the basic data format in particle physics)
- Example of processing a set of ROOT trees:

Local ROOT:
    // Create a chain of trees
    root[0] TChain *c = CreateMyChain();
    // MySelec is a TSelector
    root[1] c->Process("MySelec.C+");

PROOF:
    // Create a chain of trees
    root[0] TChain *c = CreateMyChain();
    // Start PROOF and tell the chain to use it
    root[1] TProof::Open("masterURL");
    root[2] c->SetProof();
    // Process goes via PROOF
    root[3] c->Process("MySelec.C+");
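For context, a minimal sketch of what the selector MySelec.C used above might contain; the branch name "px", the histogram and all member names are illustrative assumptions, not taken from the slides. The same selector runs unchanged in local ROOT and in PROOF.

    // Hypothetical skeleton of MySelec.C (sketch only): PROOF merges the
    // objects placed in fOutput from all workers.
    #include "TSelector.h"
    #include "TTree.h"
    #include "TH1F.h"

    class MySelec : public TSelector {
    public:
       TTree  *fChain;   // tree (or chain) being processed
       TH1F   *fHist;    // histogram filled on each worker, merged by PROOF
       Float_t fPx;      // assumed branch "px" of the tree

       MySelec() : fChain(0), fHist(0), fPx(0) {}
       virtual Int_t Version() const { return 2; }   // use the Process(entry) interface
       virtual void Init(TTree *tree) {
          fChain = tree;
          fChain->SetBranchAddress("px", &fPx);
       }
       virtual void SlaveBegin(TTree *) {
          fHist = new TH1F("hpx", "p_{x} distribution", 100, -5., 5.);
          fOutput->Add(fHist);                 // registered for merging
       }
       virtual Bool_t Process(Long64_t entry) {
          fChain->GetTree()->GetEntry(entry);  // read the current entry
          fHist->Fill(fPx);
          return kTRUE;
       }
       ClassDef(MySelec, 0);
    };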

Classic batch processing
[Diagram: the user's query and myAna.C go through a catalog and the batch farm's queue manager; the job is split into sub-jobs that read data files from storage, and the outputs are merged for the final analysis]
- Static use of resources
- Jobs frozen: 1 job per worker node
- External splitting and merging of outputs

PROOF processing
[Diagram: the PROOF job (data file list + myAna.C) is sent to the MASTER, which uses the catalog and the scheduler to distribute work over the PROOF farm reading files from storage; merged feedbacks and the merged final outputs come back to the user]
- Farm perceived as an extension of the local PC
- Same syntax as in a local session
- More dynamic use of resources
- Real-time feedback
- Automated splitting and merging

Challenges for PROOF
- Remain efficient under heavy load
- 100% exploitation of resources
- Reliability

Levels of scheduling
- The packetizer
  - Load balancing at the level of a single job
- Resource scheduling (assigning resources to different jobs)
  - Introducing a central scheduler
  - Priority-based scheduling on worker nodes

Packetizer's role
- Lookup: check the locations of all files and initiate staging, if needed
- Workers contact the packetizer and ask for new packets (pull architecture, see the sketch below)
- A packet carries
  - which file to open
  - which part of the file to process
- The packetizer keeps assigning packets until the dataset is processed
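As an illustration of this pull protocol, a plain C++ sketch under assumed type and member names (not the actual ROOT packetizer classes):

    #include <deque>
    #include <optional>
    #include <string>
    #include <utility>

    // A packet: the unit of work a worker asks for (file + entry range).
    struct Packet {
        std::string fileName;    // which file to open
        long long   firstEntry;  // which part of the file to process...
        long long   numEntries;  // ...and how many entries it covers
    };

    class Packetizer {
        std::deque<Packet> fPending;   // packets not yet handed out
    public:
        explicit Packetizer(std::deque<Packet> work) : fPending(std::move(work)) {}

        // Called by a worker whenever it has finished its previous packet:
        // the worker pulls work, the packetizer never pushes it.
        std::optional<Packet> GetNextPacket() {
            if (fPending.empty())
                return std::nullopt;   // dataset fully processed, worker stops
            Packet p = fPending.front();
            fPending.pop_front();
            return p;
        }
    };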

PROOF dynamic load balancing
- Pull architecture guarantees scalability
- Adapts to variations in performance
[Diagram: the master hands out packets, the unit of work distribution, to Worker 1 ... Worker N as time progresses]

TPacketizer: the original packetizer
- Strategy (sketched below)
  - Each worker processes its local files first, then the remaining remote files
  - Fixed-size packets
  - Avoid overloading a data server by letting it serve at most 4 remote files
- Problems with the TPacketizer
  - Long tails with some I/O-bound jobs
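A minimal sketch of this file-selection rule with hypothetical helper types (not TPacketizer's real code, just the strategy stated above: local files first, then remote files from servers currently serving fewer than 4 remote files; packets themselves would then be cut from the chosen file in fixed-size chunks):

    #include <map>
    #include <string>
    #include <vector>

    struct FileSlice {
        std::string server;      // data server hosting the file
        std::string name;        // file name
        long long   nextEntry;   // first entry still to be assigned
        long long   lastEntry;   // last entry of the file
    };

    const int kMaxRemote = 4;    // max remote files served per data server

    // Pick the next file for a worker: its own local files first, otherwise a
    // remote file on a server that is not yet serving kMaxRemote remote files.
    FileSlice *NextFile(const std::string &workerHost,
                        std::vector<FileSlice> &files,
                        std::map<std::string, int> &remoteFilesServed)
    {
        for (auto &f : files)
            if (f.server == workerHost && f.nextEntry <= f.lastEntry)
                return &f;                                   // 1) local work available
        for (auto &f : files)
            if (f.nextEntry <= f.lastEntry && remoteFilesServed[f.server] < kMaxRemote) {
                ++remoteFilesServed[f.server];               // 2) claim a remote slot on that server
                return &f;
            }
        return nullptr;                                      // nothing left to assign
    }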

Performance tests with ALICE
- 35 PCs, dual Xeon 2.8 GHz, ~200 GB disk
  - Standard CERN hardware for LHC
- Machine pools managed by xrootd
  - Data of the Physics Data Challenge '06 distributed (~1 M events)
- Tests performed
  - Speedup (scalability) tests
  - System response when running a combination of job types for an increasing number of concurrent users

Example of problems with some I/O-bound jobs
[Plots: processing rate during a query; resource utilization]

How to improve
- Focus on I/O-bound jobs
  - Limited by hard-drive or network bandwidth
- Predict which data servers can become bottlenecks
- Make sure that other workers help analyze data from those servers
- Use time-based packet sizes (see the sketch below)
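One way to read "time-based packet sizes" is sketched below; the target duration and the minimum size are illustrative assumptions, not numbers from the talk.

    // Size the next packet so that it takes roughly a fixed wall-clock time on
    // this worker, based on the processing rate measured for it so far.
    long long TimeBasedPacketSize(double entriesPerSecond,     // measured worker rate
                                  double targetSeconds = 5.0,  // assumed target packet duration
                                  long long minEntries = 1000) // assumed lower bound
    {
        long long n = static_cast<long long>(entriesPerSecond * targetSeconds);
        return (n > minEntries) ? n : minEntries;
    }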

TAdaptivePacketizer
- Strategy (a sketch of the prediction step follows below)
  - Predict the processing time of the local files for each worker
  - For the workers that are expected to finish faster, keep assigning remote files from the beginning of the job
  - Assign remote files from the most heavily loaded file servers
  - Variable packet size
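A hedged sketch of that prediction logic (assumed data members and names, not the actual TAdaptivePacketizer implementation):

    // Estimate when each worker will run out of local data; workers expected to
    // finish earlier than average are the ones given remote packets (taken from
    // the busiest file servers) already at the start of the job.
    struct WorkerState {
        long long localEntriesLeft;  // entries still unassigned in the worker's local files
        double    rate;              // measured processing rate, entries per second
    };

    double PredictedLocalFinish(const WorkerState &w)
    {
        return (w.rate > 0) ? w.localEntriesLeft / w.rate : 0.0;   // seconds of local work left
    }

    bool ShouldAlsoProcessRemote(const WorkerState &w, double meanFinishTime)
    {
        return PredictedLocalFinish(w) < meanFinishTime;
    }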

Improvement by up to 30%
[Plots: processing rate for TPacketizer vs. TAdaptivePacketizer]

Scaling comparison for a randomly distributed data set

Resource scheduling
- Motivation
- Central scheduler
  - Model
  - Interface
- Priority-based scheduling on worker nodes

Why scheduling?
- Controlling resources and how they are used
- Improving efficiency
  - Assigning to a job those nodes that hold the data to be analyzed
- Implementing different scheduling policies
  - e.g. fair share, group priorities and quotas (see the sketch below)
- Efficient use even in case of congestion
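As an example of what a fair-share policy could look like (purely illustrative; the quota model and numbers are assumptions, not part of PROOF):

    // A group that has recently used less than its quota of the cluster gets a
    // higher priority for its next jobs; a group above its quota gets throttled.
    struct GroupShare {
        double quota;      // fraction of the cluster the group is entitled to
        double recentUse;  // fraction actually used in the last accounting window
    };

    double FairSharePriority(const GroupShare &g)
    {
        if (g.recentUse <= 0)
            return g.quota * 10.0;        // idle group: strongly favour it (arbitrary factor)
        return g.quota / g.recentUse;     // >1 below quota, <1 above quota
    }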

PROOF-specific requirements
- Interactive system
  - Jobs should be processed as soon as they are submitted
  - However, when the maximum system throughput is reached, some jobs have to be postponed
- I/O-bound jobs use more resources at the start and less at the end (file distribution)
- Try to process data locally
- The user defines a dataset, not the number of workers
- Possibility to remove/add workers during a job

Starting a query with a central scheduler (planned)
[Diagram, elements: Client, Master, External Scheduler, job, dataset lookup, packetizer, "Start workers", cluster status, user priority and history]
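A possible shape for the master-to-scheduler interaction sketched by this slide; all type and method names below are hypothetical assumptions, not an existing PROOF API. The master passes the job and the outcome of the dataset lookup, and the external scheduler decides which workers to start from the cluster status and the user's priority and history.

    #include <string>
    #include <vector>

    struct JobRequest {                       // what the master forwards
        std::string user;
        std::string macro;                    // e.g. "myAna.C"
        std::vector<std::string> dataFiles;   // result of the dataset lookup
    };

    struct NodeStatus {                       // one entry of the cluster status
        std::string host;
        double load;                          // current load of the node
    };

    struct WorkerSet {
        std::vector<std::string> nodes;       // workers the master should start for this job
    };

    class ExternalScheduler {                 // hypothetical interface only
    public:
        WorkerSet Assign(const JobRequest &job,
                         const std::vector<NodeStatus> &clusterStatus,
                         double userPriority /* derived from user priority and history */);
    };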

Plans
- Interface for scheduling "per job"
  - Special functionality will allow changing the set of nodes during a session without losing user libraries and other settings
- Removing workers during a job
- Integration with a scheduler
  - Maui, LSF?

Priority-based scheduling on nodes
- Priority-based worker-level load balancing
  - Simple and solid implementation, no central unit
  - Group priorities defined in the configuration file
- Performed on each worker node independently
- Lower-priority processes slow down
  - sleep before the next packet request (see the sketch below)
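A minimal sketch of that slowdown mechanism, assuming a simple linear mapping from group priority to sleep time (the mapping is an illustration, not the actual PROOF tuning):

    #include <chrono>
    #include <thread>

    // Before requesting the next packet, a worker running for a lower-priority
    // group pauses, so higher-priority groups get a larger share of the node.
    void ThrottleBeforeNextPacket(int groupPriority /* assumed range: 1 (low) .. 100 (high) */)
    {
        int delayMs = (100 - groupPriority) * 10;   // illustrative mapping: lower priority, longer sleep
        if (delayMs > 0)
            std::this_thread::sleep_for(std::chrono::milliseconds(delayMs));
    }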

Summary
- The adaptive packetizer is working very well in the current environment; it will be further tuned after the scheduler is introduced
- Work on the PROOF interface to the scheduler is well advanced
- Priority-based scheduling on nodes is being tested

The PROOF Team
- Maarten Ballintijn
- Bertrand Bellenot
- Rene Brun
- Gerardo Ganis
- Jan Iwaszkiewicz
- Andreas Peters
- Fons Rademakers

http://root.cern.ch