Latest Improvements in the PROOF system
Bleeding Edge Physics with Bleeding Edge Computing
Fons Rademakers, Gerri Ganis, Jan Iwaszkiewicz (CERN)

2 LHC Data Challenge

The LHC generates 40 × 10^6 collisions/s. Combined, the 4 experiments record:
■ ~100 interesting collisions per second
■ 1–12 MB/collision → 0.1–1.2 GB/s
■ ~10 PB (10^16 B) per year (10^10 collisions/year)
■ LHC data correspond to 20 × 10^6 DVDs per year!
■ Space equivalent to large PC disks
■ Computing power ~10^5 of today's PCs

[Figure: the stack of DVDs holding one year of LHC data (~20 km) compared with a balloon (30 km), an airplane (10 km) and Mt. Blanc (4.8 km)]

Using parallelism is the only way to analyze this amount of data in a reasonable amount of time.
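
As a rough cross-check of these numbers (assuming about 10^7 seconds of data taking per year, a standard figure not stated on the slide):

    100 collisions/s × (1–12) MB/collision ≈ 0.1–1.2 GB/s
    ~1 GB/s × 10^7 s/year ≈ 10 PB/year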

3 HEP Data Analysis

Typical HEP analysis needs a continuous algorithm refinement cycle:
implement algorithm → run over the data set → make improvements → run over the data set again → ...

Ranging from I/O bound to CPU bound:
■ Need many disks to get the needed I/O rate
■ Need many CPUs for processing
■ Need a lot of memory to cache as much as possible
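
In ROOT such an analysis algorithm is typically packaged as a TSelector, which is also the unit that PROOF later distributes. A minimal sketch follows; the class name and the example histogram are illustrative, and real selectors are usually generated with TTree::MakeSelector.

    // MySelector.h (illustrative sketch)
    #include <TSelector.h>
    #include <TH1F.h>

    class MySelector : public TSelector {
    public:
       TH1F *fHist;                                 // example output object

       MySelector() : fHist(0) {}

       Int_t  Version() const { return 2; }         // use the Process(entry) interface
       void   SlaveBegin(TTree *) {                 // runs once per worker
          fHist = new TH1F("h", "example", 100, 0., 100.);
          fOutput->Add(fHist);                      // registered objects are merged
       }
       Bool_t Process(Long64_t entry) {             // runs for every event
          // read the needed branches for this entry and fill fHist ...
          return kTRUE;
       }
       void   Terminate() {                         // runs on the client after merging
          // retrieve the merged fHist from fOutput and draw or save it
       }

       ClassDef(MySelector, 0)
    };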

4 The Traditional Batch Approach

■ Split the analysis task into N stand-alone batch sub-jobs
■ Submit the jobs to the batch scheduler queue; job submission is sequential
■ Very hard to get feedback during processing
■ Collect the sub-jobs and merge them into a single output
■ The analysis is finished only when the last sub-job has finished

[Diagram: query → job splitter → N jobs in the batch queue → batch cluster (CPUs, storage, file catalog) → merger → merged output]

5 The PROOF Approach

■ The cluster is perceived as an extension of the local PC
■ Same macro and syntax as in a local session
■ More dynamic use of resources
■ Real-time feedback
■ Automatic splitting and merging

[Diagram: the client sends a PROOF query (data file list + mySelector.C) to the master of the PROOF cluster, which distributes the work over the CPUs and storage via the file catalog and scheduler, and returns feedback and the merged final output]
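
On the client this amounts to a few lines. A hedged sketch of opening a session and submitting a query (the host name, file URLs and tree name are placeholders):

    {
       // Open a PROOF session; the rest of the ROOT session is unchanged
       TProof *proof = TProof::Open("proofmaster.example.org");

       // Describe the data: a set of files, each containing the tree "events"
       TDSet *dset = new TDSet("TTree", "events");
       dset->Add("root://dataserver.example.org//data/run1.root");
       dset->Add("root://dataserver.example.org//data/run2.root");

       // Same selector as in a local session; feedback arrives while it runs
       proof->Process(dset, "mySelector.C+");
    }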

6 PROOF Design Goals

■ System for running ROOT queries in parallel on a large number of distributed computers or multi-core machines
■ Transparent, scalable and adaptable extension of the local interactive ROOT analysis session
■ Support for running long queries (“interactive batch”)

Where to use PROOF:
■ CERN Analysis Facility (CAF)
■ Departmental workgroups (Tier-2’s)
■ Multi-core, multi-disk desktops (Tier-3/4’s)

7 PROOF Scalability on Multi-Core Machines

[Plot: scalability measured on a MacPro with dual quad-core CPUs]

8 Multi-Tier Architecture

■ Adapts to wide-area virtual clusters: geographically separated domains, heterogeneous machines
■ Network performance is less important on the wide-area links between domains, but VERY important between the workers and the data
■ Optimize for data locality or for high-bandwidth access to the data servers

9 Recent Developments

Dataset management (see the sketch below):
■ Global and user datasets
■ Disk quotas
■ For more see talk 444 on ALICE CAF developments

Load balancing:
■ New packetizers

Scheduling:
■ User priority handling on the worker level
■ Central resource scheduler
■ Abstract interface
■ Selection of workers based on load (CPU, memory, I/O)

Generic task processing:
■ CPU-driven instead of data-driven
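
For the dataset management part, a hedged usage sketch of the user-level calls (dataset name, host and file URLs are placeholders; exact signatures may vary slightly between ROOT versions):

    {
       TProof *proof = TProof::Open("proofmaster.example.org");

       // Build a file collection and register it under a logical dataset name
       TFileCollection *fc = new TFileCollection("run2008files");
       fc->Add(new TFileInfo("root://dataserver.example.org//data/run1.root"));
       fc->Add(new TFileInfo("root://dataserver.example.org//data/run2.root"));
       proof->RegisterDataSet("myAnalysisData", fc);

       // Later: inspect the registered datasets and process one by name
       proof->ShowDataSets();
       proof->Process("myAnalysisData", "mySelector.C+");
    }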

10 Load Balancing: the Packetizer

■ The packetizer is the heart of the system
■ It runs on the master and hands out work to the workers
■ Pull architecture: workers ask for work (sketched below)
■ No complex worker state in the master
■ Different packetizers allow for different data access policies:
  ■ All data on disk, allow network access
  ■ All data on disk, no network access
  ■ Data on mass storage, go file-by-file
  ■ Data on Grid, distribute per Storage Element
  ■ ...
■ The goal is to have all workers end at the same time
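
The pull model can be illustrated with a small toy sketch; this is not the actual PROOF packetizer code, and the class and function names are made up for illustration.

    // Toy sketch of the pull architecture (not the actual PROOF classes)
    #include <queue>
    #include <string>

    struct Packet { std::string file; long long first; long long num; };

    class Packetizer {                      // lives on the master
       std::queue<Packet> fWork;            // remaining work for this query
    public:
       explicit Packetizer(const std::queue<Packet> &work) : fWork(work) {}
       bool GetNextPacket(Packet &out) {    // workers pull; no per-worker plan kept
          if (fWork.empty()) return false;  // nothing left: the query is done
          out = fWork.front();
          fWork.pop();
          return true;
       }
    };

    void WorkerLoop(Packetizer &master) {   // runs on every worker
       Packet p;
       while (master.GetNextPacket(p)) {
          // process entries [p.first, p.first + p.num) of p.file with the selector
       }
    }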

11 Original Packetizer Strategy

■ Each worker first processes its local files and then processes packets from the remaining remote files (if any)
■ Fixed packet size
■ Data server overload is avoided by allowing at most 4 remote workers to be served concurrently
■ Works generally fine, but shows tail effects for I/O-bound queries, due to a reduction of the effective number of workers when access to non-local files is required

12 Issues with the Original Packetizer Strategy

[Plots: processing rate during a query; resource utilization]

13 Where to Improve

■ Focus on I/O-bound jobs:
  ■ Limited by disk or network bandwidth
■ Predict which data servers can become bottlenecks
■ Make sure that other workers help analyzing data from those servers
■ Use variable packet sizes (smaller at the end of a query)

14 Improved Packetizer Strategy

■ Predict the processing time of the local files for each worker
■ For the workers that are expected to finish faster, keep assigning remote packets from the beginning of the job
■ Assign remote packets from the most heavily loaded file servers
■ Variable packet size
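
A hedged sketch of the kind of rule involved (illustrative only; the rate estimate and the tapering packet-size formula are assumptions, not the actual PROOF packetizer code):

    // Illustrative sketch of the improved assignment rule (assumed formulas)
    #include <algorithm>

    struct WorkerState {
       double    rate;        // measured processing rate of this worker (events/s)
       long long localLeft;   // entries still unassigned on its local disks
    };

    // A worker whose local work runs out before the predicted end of the query
    // is fed remote packets (taken from the busiest file servers) from the start.
    bool ShouldAssignRemote(const WorkerState &w, double predictedQueryEnd) {
       double localEnd = (w.rate > 0.) ? w.localLeft / w.rate : 0.;
       return localEnd < predictedQueryEnd;
    }

    // Packets shrink towards the end of the query to avoid a long tail.
    long long NextPacketSize(long long entriesLeft, int nWorkers, long long baseSize) {
       if (nWorkers < 1) nWorkers = 1;
       long long fairShare = entriesLeft / (10LL * nWorkers);   // assumed heuristic
       return std::max(1LL, std::min(baseSize, fairShare));
    }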

15 Improved Packetizer: Results

[Plots: processing rate during a query and resource utilization, showing up to 30% improvement]

16 Why Scheduling?

■ Controlling resources and how they are used
■ Improving efficiency:
  ■ Assigning to a job those nodes that hold the data it needs to analyze
■ Implementing different scheduling policies:
  ■ E.g. fair share, group priorities & quotas
■ Avoiding congestion and the cluster grinding to a halt

17 PROOF-Specific Requirements

■ Interactive system:
  ■ Jobs should be processed as soon as they are submitted
  ■ However, when the maximum system throughput is reached some jobs have to be postponed
■ I/O-bound jobs use more resources at the start and less at the end (file distribution)
■ Try to process data locally
■ The user defines a dataset, not the number of workers
■ Possibility to remove/add workers during a job

18 Starting a Query With a Central Scheduler

[Diagram: the client sends the query to the master; the master performs the dataset lookup and passes the cluster status together with the user's priority and history to the external scheduler; the scheduler decides which workers to start, and the master's packetizer then runs the job on those workers]

19 Scheduler Development Plans

■ Interface for scheduling “per job”:
  ■ Special functionality will allow changing the set of nodes during a session without losing user libraries and other settings
■ Removing workers during a job
■ Integration with a third-party scheduler:
  ■ Maui, LSF

20 User Priority Based Scheduling

■ User-priority-based worker-level scheduling:
  ■ Simple and solid implementation, no global state
  ■ Group priorities defined in a configuration file
  ■ Group priorities can also be obtained from a central scheduler via the master
  ■ Configuration currently being tested at the CAF by ALICE
■ Scheduling is performed on each worker independently
■ Lower-priority processes are slowed down (sketched below):
  ■ Sleep before the next packet request
  ■ Rely on the Round-Robin Linux process scheduler
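
The slowdown mechanism can be sketched as follows (illustrative; the sleep formula is an assumption, not the exact PROOF implementation):

    // Illustrative worker-side throttling before the next packet request
    #include <unistd.h>   // usleep

    // groupPriority in (0, 100]; 100 means run at full speed.  A worker in a
    // group with priority 20 sleeps 4x the time it spent on the last packet
    // before asking the packetizer for more work.
    void ThrottleBeforeNextPacket(int groupPriority, double lastPacketSeconds) {
       if (groupPriority >= 100 || groupPriority <= 0) return;
       double factor = (100.0 - groupPriority) / groupPriority;
       usleep(static_cast<useconds_t>(factor * lastPacketSeconds * 1e6));
    }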

21 Generic Task Processing

■ CPU-driven instead of data-driven
■ Uses the established PROOF infrastructure to distribute jobs (i.e. selectors, input lists, output lists, PAR files, etc.):
  ■ Monte Carlo, image analysis, etc.
  ■ Output files in the output list will be automatically merged
■ A first version will be coming later this year
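
A hedged sketch of how such a data-independent job could be submitted (the selector name and host are placeholders; the exact entry point for this mode may differ once the feature is released):

    {
       TProof *proof = TProof::Open("proofmaster.example.org");

       // No dataset: just ask for N cycles; the selector generates its own work
       // (e.g. Monte Carlo events) in Process() and fills the output list,
       // which PROOF merges in the usual way.
       proof->Process("myMCSelector.C+", 1000000);
    }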

22 Growing Interest by LHC Experiments

The ideal solution for fast AOD analysis, easy to deploy on a bunch of multi-core machines:
■ ALICE: CAF
■ ATLAS: BNL, Wisconsin
■ CMS: FNAL

23 Conclusions

■ The LHC will generate data on a scale not seen anywhere before
■ The LHC experiments will critically depend on parallel solutions to analyze their enormous amounts of data
■ Grids will very likely not provide the stability and reliability needed for repeatable high-statistics analysis