UPPSALA DATABASE LABORATORY Managing Scientific Queries over Distributed Data in a Grid Environment Ruslan Fomkin.

Presentation transcript:


UU - IT - UDBL, Ruslan Fomkin, January 20, 2006, NGN workshop, Uppsala

Uppsala DataBase Laboratory (UDBL)
- Supervisor: Prof. T. Risch
- Database research: how to build extensible middleware query processing that allows scalable, application-oriented search over different kinds of wrapped information sources

AMOS II Virtual Mediator Database
[Figure: AMOS II mediator architecture. Labels: Applications (Simulation, Visualization, Analysis, Patient Monitoring), Queries, Continuous Queries, Queries and views, Plug-ins, Wrappers, Data sources (GRID, hist. Measurements, Relational Databases).]

Ongoing Research at UDBL
- Stream Queries on BlueGene: Erik Zeitler, MSc
- FEM Databases: Kjell Orsborn, PhD
- Mediating Web Services: Manivasakan Sabesan, BSc
- Semantic Web Queries to Hidden Web: Johan Petrini, MSc
- Stream Data Manager: Milena Ivanova, PhD
- Expensive GRID Queries: Ruslan Fomkin, MSc

Outline
- Introduction
- The project
- Test application
- Developed framework
- Conclusion
- Future work

Scientific Applications, Grid and Databases
- A lot of scientific data
  - Complex structure
  - Stored in files distributed in the Grid
- Scientific analyses can be represented as declarative queries
  - Complex queries with numerical computations
  - Long-running or batch queries
- Utilization of the computational resources of the Grid

Parallel Object Query System for Expensive Computations (POQSEC)
- Query processor for scientific applications
  - high-level interface to specify the analyses
  - automatically generates execution plans and evaluates them
- Requirements: scalable, efficient, flexible, transparent
- Properties: distributed and parallel

Layered Architecture of the System
- POQSEC provides scientific query management
- Grid provides
  - computation management
  - file management
  - NorduGrid middleware
- Application area provides
  - computational libraries
  - data management libraries
  - ROOT library
[Figure: layer diagram. Labels: User, POQSEC, Application libraries, ROOT, Grid, NorduGrid, Data, Clusters.]

Our Test Application
- From particle physics
- Analysis of collision events for the presence of Higgs particles
- Data
  - produced by ATLAS simulation software
  - stored in files distributed in the Grid (e.g. NorduGrid)
  - managed by the ROOT library

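Object-Relational Schema of the Application Data
[Figure: schema diagram. Types: Event, Particle, Lepton, Muon, Electron, Jet. Attributes shown: PxMiss, PyMiss; Px, Py, Pz, Kf, Ee. Event is linked to Particle by the one-to-many relationship "particles" (1 to n); the legend distinguishes inheritance edges from relationship edges.]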
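One possible reading of the schema, sketched as Python dataclasses. The flattened diagram does not show which type owns which attribute or the exact inheritance edges, so the layout below is an assumption: PxMiss and PyMiss are placed on Event, the remaining attributes on Particle, Lepton and Jet are taken as subtypes of Particle, and Muon and Electron as subtypes of Lepton. The helper `leptons` is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Particle:
    # Assumed placement: momentum components, Kf code and energy on Particle.
    Px: float
    Py: float
    Pz: float
    Kf: int
    Ee: float


@dataclass
class Lepton(Particle):  # assumed subtype of Particle
    pass


@dataclass
class Muon(Lepton):  # assumed subtype of Lepton
    pass


@dataclass
class Electron(Lepton):  # assumed subtype of Lepton
    pass


@dataclass
class Jet(Particle):  # assumed subtype of Particle
    pass


@dataclass
class Event:
    # Assumed placement: missing-momentum components on Event.
    PxMiss: float
    PyMiss: float
    # One event refers to n particles (the "particles" relationship).
    particles: List[Particle] = field(default_factory=list)


def leptons(ev: Event) -> List[Lepton]:
    """Hypothetical helper: the particles of an event that are leptons."""
    return [p for p in ev.particles if isinstance(p, Lepton)]
```

With this layout, a query over `Lepton` also ranges over muons and electrons, which is what the inheritance edges in the diagram suggest.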

General Query of the Analysis
- Selection of those events that satisfy predicates containing numerical operations

    SELECT ev
    FROM Event ev
    WHERE jetvetocut(ev) AND zvetocut(ev) AND topcut(ev) AND
          misseecuts(ev) AND leptoncuts(ev) AND threeleptoncut(ev);

- Each predicate is called a cut in the application area
- Predicates are defined as queries
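The general query is a conjunction of cuts applied to every event. A minimal Python sketch of that pattern follows; it is not POQSEC code. The event representation (a plain dictionary) and the two stand-in cuts `few_jets` and `three_leptons` are hypothetical and only illustrate the shape of the selection.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, float]       # hypothetical event representation
Cut = Callable[[Event], bool]  # a cut is a predicate over one event


def select_events(events: Iterable[Event], cuts: Iterable[Cut]) -> List[Event]:
    """Keep the events that satisfy every cut (the AND in the WHERE clause)."""
    cut_list = list(cuts)
    return [ev for ev in events if all(cut(ev) for cut in cut_list)]


# Hypothetical stand-in cuts; the real cuts are defined as queries.
def few_jets(ev: Event) -> bool:
    return ev["n_jets"] <= 1


def three_leptons(ev: Event) -> bool:
    return ev["n_leptons"] == 3


EVENTS = [
    {"n_jets": 0, "n_leptons": 3},
    {"n_jets": 4, "n_leptons": 3},
    {"n_jets": 1, "n_leptons": 2},
]
SELECTED = select_events(EVENTS, [few_jets, three_leptons])
```

Because `all` stops at the first failing cut, the order of the cuts affects cost but not the result.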

Example of a predicate: Z-veto cut
- Either the event does not have a pair of oppositely charged leptons
- or the invariant mass of the pair is not close to the mass of a Z particle

    CREATE FUNCTION zvetocut(Event ev) -> Event AS
      SELECT ev
      WHERE NOTANY(oppositeLeptons(ev)) OR
            abs(invMass(oppositeLeptons(ev)) - zMass) >= minZMass;

    CREATE FUNCTION oppositeLeptons(Event ev) -> bag of AS
      SELECT l1, l2
      FROM Lepton l1, Lepton l2
      WHERE l1 = particles(ev) AND l2 = particles(ev) AND
            Kf(l1) = -Kf(l2);
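The Z-veto cut can be sketched in Python under a few stated assumptions. The invariant mass of a pair is computed from the standard relation m^2 = (E1 + E2)^2 - |p1 + p2|^2, taking Ee as the energy and Px, Py, Pz as the momentum components. The threshold `MIN_Z_MASS` is a made-up value, units are assumed to be GeV, and the second branch is read existentially (some pair lies far from the Z mass); a stricter veto would use `all` instead of `any`. Unlike the query, which yields ordered pairs, the sketch enumerates each unordered pair once.

```python
import math
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Lepton(NamedTuple):
    Px: float
    Py: float
    Pz: float
    Kf: int     # particle code; an antiparticle carries the negated code
    Ee: float   # energy


Z_MASS = 91.19      # approximate Z boson mass in GeV
MIN_Z_MASS = 10.0   # hypothetical threshold, not taken from the slides


def inv_mass(a: Lepton, b: Lepton) -> float:
    """Invariant mass of a two-particle system."""
    e = a.Ee + b.Ee
    px, py, pz = a.Px + b.Px, a.Py + b.Py, a.Pz + b.Pz
    m2 = e * e - (px * px + py * py + pz * pz)
    return math.sqrt(max(m2, 0.0))  # clamp tiny negative rounding errors


def opposite_leptons(leptons: List[Lepton]) -> List[Tuple[Lepton, Lepton]]:
    """Pairs whose Kf codes are negations of each other."""
    return [(a, b) for a, b in combinations(leptons, 2) if a.Kf == -b.Kf]


def z_veto_cut(leptons: List[Lepton]) -> bool:
    pairs = opposite_leptons(leptons)
    if not pairs:  # NOTANY(oppositeLeptons(ev))
        return True
    return any(abs(inv_mass(a, b) - Z_MASS) >= MIN_Z_MASS for a, b in pairs)
```

For example, an electron and a positron emitted back to back with 45.6 GeV each have an invariant mass of about 91.2 GeV, so an event containing only that pair fails the cut.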

Current Framework
- Basic tool for utilizing NorduGrid through the Advanced Resource Connector (ARC)
- Submission mechanism
  - submit query
  - parallelize query into several subqueries
  - generate job scripts (one per subquery)
- Babysitter functionality
- Data exchange mechanism through files

Client and Coordinator Part
POQSEC client
- personal database with application schema
- ROOT wrapper
Coordinator server
- receives queries
- creates jobs
Grid Meta-Database
- computational resources
- data files
Submission Database
- received submissions
- created jobs
Babysitter
- interactions with ARC
[Figure: architecture diagram. Labels: Client Node, POQSEC Client, Coordinator server, Query Coordinator, Grid Meta-Database, Submission Database, Job queue, Babysitter, Local Storage, ARC Client, Grid.]

Query Submission
A query submission specifies
- query
- file name selection
- degree of parallelism
- CPU time for each job
The Coordinator server creates jobs with
- the same query
- partitions of data with equal size
- the same CPU time, as provided by the user
- corresponding job script files
Afterwards
- The submission and its jobs are saved in the Submission Database
- Created jobs are added to the Job queue
- Script files are saved to Local Storage
[Figure: architecture diagram. Labels: Client Node, POQSEC Client, Coordinator server, Query Coordinator, Grid Meta-Database, Submission Database, Job queue, Babysitter, Local Storage, ARC Client, Grid.]
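The job-creation step can be sketched as follows. This is an illustration only: it assumes the data is partitioned by file count (the slides say only "partitions of data with equal size"), and the `Job` record, the function names, and the `job_<i>.xrsl` script names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    query: str              # every job runs the same query
    input_files: List[str]  # its own partition of the data
    cpu_time_min: int       # same CPU time for every job, given by the user
    script_name: str        # hypothetical name of the generated job script


def partition(files: List[str], n: int) -> List[List[str]]:
    """Split files into n contiguous parts whose sizes differ by at most one."""
    base, extra = divmod(len(files), n)
    parts, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        parts.append(files[start:start + size])
        start += size
    return parts


def create_jobs(query: str, files: List[str], degree: int,
                cpu_time_min: int) -> List[Job]:
    """One job per non-empty partition; degree is the degree of parallelism."""
    if degree < 1:
        raise ValueError("degree of parallelism must be at least 1")
    return [
        Job(query, part, cpu_time_min, f"job_{i}.xrsl")
        for i, part in enumerate(partition(files, degree))
        if part
    ]
```

With eight input files and a degree of parallelism of four, each job receives two files; when the file count is not divisible, the first partitions take one extra file.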

Jobs Submission
Babysitter
- takes jobs from the Job queue
- submits each job to the ARC client
- changes the status of submitted jobs in the Submission Database
ARC client
- finds a Computing Element (CE)
- submits the job to the corresponding ARC Grid Manager
[Figure: architecture diagram extended with two Computing Elements (CE), each with an ARC Grid Manager.]
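A minimal sketch of the babysitter's submission step. `FakeArcClient` is a stand-in invented for this example and does not reflect the real ARC client interface; the Submission Database is modelled as a plain dictionary and the Job queue as a `deque`.

```python
from collections import deque
from typing import Deque, Dict, List


class FakeArcClient:
    """Hypothetical stand-in for the ARC client; it only hands out job ids."""

    def __init__(self) -> None:
        self._counter = 0

    def submit(self, script: str) -> str:
        self._counter += 1
        return f"grid-job-{self._counter}"


def submit_pending(job_queue: Deque[str], arc: FakeArcClient,
                   submission_db: Dict[str, dict]) -> List[str]:
    """Drain the job queue, submit each job, and record its new status."""
    grid_ids = []
    while job_queue:
        name = job_queue.popleft()
        grid_id = arc.submit(submission_db[name]["script"])
        submission_db[name].update(status="submitted", grid_id=grid_id)
        grid_ids.append(grid_id)
    return grid_ids
```

Keeping the status in the database rather than in the queue means a restarted babysitter can tell which jobs were already handed to the Grid.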

Job Execution
ARC Grid Manager
- downloads input files
- submits the job to the Local Batch System (LBS)
After some delay, the LBS starts the Executor on an allocated CE node
During execution, the Executor
- executes the given subquery
- accesses data through the ROOT wrapper
- saves the result to files on CE Storage
[Figure labels: ARC Grid Manager, SE, LBS, Queue, CE node, Executor, wrapper, CE Storage.]

Downloading Result
Babysitter
- polls the ARC client for job statuses
- requests download of results for finished jobs
Results are downloaded to Local Storage
The user can retrieve the result when all jobs are ready
[Figure: architecture diagram extended with two Computing Elements, each with an ARC Grid Manager and CE Storage.]
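The polling behaviour can be sketched as a simple loop. Again `FakeArcClient` is a hypothetical stand-in, the status strings "running" and "finished" are simplified placeholders rather than actual ARC job states, and the result paths are invented.

```python
import time
from typing import Dict, Iterable


class FakeArcClient:
    """Hypothetical stand-in: each job reports 'running' a fixed number of times."""

    def __init__(self, polls_needed: Dict[str, int]) -> None:
        self._remaining = dict(polls_needed)

    def status(self, job_id: str) -> str:
        if self._remaining[job_id] > 0:
            self._remaining[job_id] -= 1
            return "running"
        return "finished"

    def download(self, job_id: str) -> str:
        return f"results/{job_id}.out"  # invented local path


def collect_results(job_ids: Iterable[str], arc: FakeArcClient,
                    poll_interval_s: float = 0.0,
                    max_rounds: int = 100) -> Dict[str, str]:
    """Poll until every job has finished, downloading results as they appear."""
    pending = set(job_ids)
    downloaded: Dict[str, str] = {}
    for _ in range(max_rounds):
        for job_id in sorted(pending):
            if arc.status(job_id) == "finished":
                downloaded[job_id] = arc.download(job_id)
        pending.difference_update(downloaded)
        if not pending:
            return downloaded  # the user can now retrieve the full result
        time.sleep(poll_interval_s)
    raise TimeoutError(f"jobs still pending: {sorted(pending)}")
```

The `max_rounds` bound is a design choice of the sketch so that a job that never finishes cannot block the loop forever.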

Conclusion
- We provide
  - a declarative query interface for representing scientific queries
  - parallel query execution in the Grid (generating scripts)
  - a babysitter to keep track of job execution
- Query parallelization is important

                        Standalone desktop   Grid, one job   Grid, four jobs
    Response time       190 min              225 min         24 min
    Requested CPU time  -                    200 min         20 min
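The ratios implied by the table can be worked out directly. The short sketch below only restates the reported numbers; the function and variable names are chosen for illustration.

```python
# Response times and requested CPU times from the table, in minutes.
RESPONSE_MIN = {"standalone": 190, "grid_1_job": 225, "grid_4_jobs": 24}
REQUESTED_CPU_MIN = {"grid_1_job": 200, "grid_4_jobs": 20}


def speedup(baseline_min: float, parallel_min: float) -> float:
    """How many times shorter the parallel response time is."""
    return baseline_min / parallel_min


VS_STANDALONE = speedup(RESPONSE_MIN["standalone"], RESPONSE_MIN["grid_4_jobs"])
VS_ONE_GRID_JOB = speedup(RESPONSE_MIN["grid_1_job"], RESPONSE_MIN["grid_4_jobs"])

# Gap between response time and requested CPU time per Grid configuration.
GAP_MIN = {k: RESPONSE_MIN[k] - REQUESTED_CPU_MIN[k] for k in REQUESTED_CPU_MIN}
```

Four Grid jobs respond roughly 7.9 times faster than the standalone desktop (190 / 24) and about 9.4 times faster than a single Grid job (225 / 24); the response time exceeds the requested CPU time by 25 minutes for one job and by 4 minutes for four jobs.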

Future work
- Estimating query execution time
- Dealing with underestimation of execution time
- Automatic decisions on the degree of parallelism and on resource brokering
  - adaptive, based on current load and job statistics
- Dealing with failures in the Grid
- POOL wrapper

Thank you for your attention! Any questions?