Exabyte-Scale Data Management Using an Object-Relational Database: The LHC Project at CERN
Jamie Shiers, CERN-IT-DB, Switzerland

Overview
- Brief introduction to CERN & LHC
- Why we have massive data volumes
- The role of Object-Relational DBs
- A possible solution…

CERN - The European Organisation for Nuclear Research
The European Laboratory for Particle Physics
- Fundamental research in particle physics
- Designs, builds & operates large accelerators
- Financed by 20 European countries (member states) + others (US, Canada, Russia, India, …)
- ~€650M budget: operation + new accelerators
- ~2000 staff; users (researchers) from all over the world
- LHC (starts ~2005): per experiment, ~2000 physicists from 150 universities; apparatus costing ~€300M; computing ~€250M to set up, ~€60M/year to run
- 10+ year lifetime

[Aerial view: Geneva, the airport, the CERN Computer Centre, and the 27 km LHC ring]

The LHC machine
- Two counter-circulating proton beams
- Collision energy: 14 TeV
- 27 km of magnets with a field of 8.4 Tesla
- Superfluid helium cooled to 1.9 K
- The world's largest superconducting structure

The LHC Detectors: CMS, ATLAS, LHCb

Online system: multi-level trigger
- Filters out background, reducing the data volume from 40 TB/s to 100 MB/s
- Level 1 - special hardware: 40 MHz (40 TB/s)
- Level 2 - embedded processors: 75 kHz (75 GB/s)
- Level 3 - PCs: 5 kHz (5 GB/s)
- Data recording & offline analysis: 100 Hz (100 MB/s)
- (1000 TB/s according to recent estimates)
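To make the reduction concrete, here is a minimal editorial sketch (not part of the original slides) that recomputes the bandwidth at each trigger level, assuming an average raw event size of ~1 MB, which is what the 40 MHz → 40 TB/s figure implies:

```python
# Recompute the trigger-level bandwidths quoted on the slide.
# Assumption: an average event size of ~1 MB (implied by 40 MHz -> 40 TB/s).
EVENT_SIZE_MB = 1.0

trigger_levels = [
    ("Level 1 - special hardware",    40_000_000),  # 40 MHz
    ("Level 2 - embedded processors",     75_000),  # 75 kHz
    ("Level 3 - PCs",                      5_000),  # 5 kHz
    ("Data recording",                       100),  # 100 Hz
]

for name, rate_hz in trigger_levels:
    mb_per_s = rate_hz * EVENT_SIZE_MB
    print(f"{name:32s} {rate_hz:>11,d} Hz -> {mb_per_s:>12,.0f} MB/s")
```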

Higgs Search: H → ZZ → 4μ
- Start with protons (quarks + gluons)
- Accelerate & collide
- Observe in massive detectors

LHC Data Challenges
- 4 large experiments, 10+ year lifetime
- Data rates: ~500 MB/s – 1.5 GB/s
- Data volumes: ~5 PB / experiment / year
- Several hundred PB in total!
- Data reduced from "raw data" to "analysis data" in a small number of well-defined steps
- Analysed by thousands of users world-wide

[Chart: planned capacity evolution at CERN - CPU, disk and mass storage for LHC vs. other experiments, compared with Moore's law]

[Diagram: event data tiers per experiment - RAW ~1 PB/yr, ESD ~100 TB/yr, AOD ~10 TB/yr, TAG ~1 TB/yr; access shifts from sequential (RAW) to random (TAG), data volume shrinks and the number of users grows toward TAG; data is split across Tier0 and Tier1]

[Diagram: Data Handling and Computation for Physics Analysis (CERN) - detector → event filter (selection & reconstruction) → raw data → event reprocessing and event simulation → event summary data → analysis objects (extracted by physics topic) → batch and interactive physics analysis]

LHC Data Models
- LHC data models are complex!
- Typically hundreds of structure types (classes)
- Many relations between them
- Different access patterns
- LHC experiments rely on OO technology
- OO applications deal with networks of objects
- Pointers (or references) are used to describe relations
[Diagram: example object network - Event → TrackList → Tracks (tracker/calorimeter) → HitList → Hits]
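As an illustration of such an object network (an editorial sketch with made-up class names and fields, not the experiments' actual data model), the Event → TrackList → Track → HitList → Hit relations might look like this in code; persisting such a graph means translating the in-memory references into database keys or OIDs, which is exactly what an object-relational mapping has to do:

```python
# Illustrative object network: an Event owning tracks, each track referencing its hits.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Hit:
    detector_id: int
    energy: float

@dataclass
class Track:
    momentum: float
    hits: List[Hit] = field(default_factory=list)       # Track -> HitList -> Hit

@dataclass
class Event:
    event_id: int
    tracks: List[Track] = field(default_factory=list)   # Event -> TrackList -> Track

evt = Event(event_id=1,
            tracks=[Track(momentum=12.3, hits=[Hit(7, 0.5), Hit(8, 1.1)])])
print(len(evt.tracks), len(evt.tracks[0].hits))
```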

World-Wide Collaboration → distributed computing & storage capacity
- CMS: 1800 physicists, 150 institutes, 32 countries

[Diagram: the LHC computing centres - physics-department desktops and Tier3 institutes (labs and universities) connect via regional Tier2 centres and national Tier1 centres (Germany, USA, UK, France, Italy, …) to the CERN Tier1 / LHC Computing Centre]

Why use DBs?
OK, you have lots of data, but what have databases, let alone object-relational DBs, got to do with it?

Why not: file = object + GREP?
- It works if you have thousands of objects (and you know them all)
- But it is hard to search millions/billions/trillions of them with GREP
- Hard to put all attributes in the file name
- Minimal metadata
- Hard to do chunking right
- Hard to pivot on space/time/version/attributes

The reality: it's build vs. buy
- If you use a file system, you will eventually build a database system:
  metadata, query, parallel ops, security, reorganization, recovery, distribution, replication, …

OK, so I'll put lots of objects in a file: the Do-It-Yourself Database
- Good news:
  - Your implementation will be 10x faster (at least!)
  - Easier to understand and use
- Bad news:
  - It will cost 10x more to build and maintain
  - Someday you will get bored maintaining/evolving it
  - It will lack some killer features: parallel search; self-describing via metadata; SQL, XML, …; replication; online update and reorganization
  - Chunking is problematic (what granularity, how to aggregate?)

Top 10 reasons to put everything in a DB
1. Someone else writes the million lines of code
2. Captures data and metadata
3. Standard interfaces give tools and quick learning
4. Allows schema evolution without breaking old apps
5. Index and pivot on multiple attributes: space, time, version, attribute, …
6. Parallel terabyte searches in seconds or minutes
7. Moves processing & search close to the disk arm (moves fewer bytes: questions go in, answers come out)
8. Chunking is easier (can aggregate chunks at the server)
9. Automatic geo-replication
10. Online update and reorganization
11. Security
12. If you pick the right vendor, ten years from now there will be software that can read the data

How to build multi-PB DBs
- Total LHC data volume: ~300 PB
- VLDBs today: ~3 TB
- Just 5 orders of magnitude to solve… (one per year)
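A quick check of that gap, using the slide's round numbers (which are themselves rough figures, not precise measurements):

```python
import math

total_lhc = 300e15   # ~300 PB total LHC data volume (round number from the slide)
vldb_today = 3e12    # ~3 TB, a typical very large database at the time

print(math.log10(total_lhc / vldb_today))   # -> 5.0, i.e. five orders of magnitude
```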

Divide & Conquer
- Split data from different experiments
- Split different data types
  - Different schema, users, access patterns, …
- Focus on mainstream technologies & low-risk solutions
- VLDB target: 100 TB databases
  1. How do we build 100 TB databases?
  2. How do we use 100 TB databases to solve a 100 PB problem?

Why 100 TB DBs?
- Possible today
- Vendors must provide support
- Expected to be mainstream within a few years

Decision Support (2000)

Company                 DB Size* (TB)  DBMS Partner  Server Partner  Storage Partner
SBC                     10.50          NCR                           LSI
First Union Nat. Bank    4.50          Informix      IBM             EMC
Dialog                   4.25          Proprietary   Amdahl          EMC
Telecom Italia (DWPT)    3.71          IBM                           Hitachi
FedEx Services           3.70          NCR                           EMC
Office Depot             3.08          NCR                           EMC
AT&T                     2.83          NCR                           LSI
SK C&C                   2.54          Oracle        HP              EMC
NetZero                  2.47          Oracle        Sun             EMC
Telecom Italia (DA)      2.32          Informix      Siemens         TerraSystems

*Database size = sum of user data + summaries and aggregates + indexes

[Chart: size of the largest RDBMS in commercial use for DSS, in terabytes, as projected by respondents. Source: Database Scalability Program 2000]

BT Visit - July 2001
- Oracle VLDB site: enormous proof-of-concept test in 1999
- 80 TB disk, 40 TB mirrored, 37 TB usable
- Performed using Oracle 8i, EMC storage
- "Single instance", i.e. not a cluster
- Same techniques as being used at CERN
- Demonstrated more than 2 years ago!
- No concerns about building 100 TB today!

Physics DB Deployment
- Currently run 1-3 TB / server
  - Dual-processor Intel/Linux
- Scaling to ~10 TB per server in a few years sounds plausible
- 10-node cluster: 100 TB
  - ~100 disks in 2005!
- Can we achieve close to linear scalability?
- Fortunately, our data is write-once, read-many
  - Should be a good match for shared-disk clusters

100 TB DBs & LHC Data
- Analysis data: 100 TB is OK for ~10 years → one DB cluster
- Intermediate data: 100 TB is ~1 year's data → ~40 DB clusters
- RAW data: 100 TB = ~1 month's data → 400 DB clusters to handle all RAW data (10 / year, 10 years, 4 experiments)
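The cluster counts follow from simple arithmetic; the sketch below (an editorial illustration, with the per-node and per-disk capacities taken as rough assumptions from the previous slide) reproduces them from the per-experiment volumes quoted earlier (RAW ~1 PB/yr, ESD ~100 TB/yr, AOD+TAG ~10 TB/yr):

```python
# Back-of-envelope sizing, assuming 10 TB per dual-CPU server, 10-node clusters,
# ~1 TB disks (2005), 4 experiments and a 10-year run.
TB_PER_SERVER, NODES_PER_CLUSTER, DISK_TB = 10, 10, 1
CLUSTER_TB = TB_PER_SERVER * NODES_PER_CLUSTER        # 100 TB per cluster
DISKS_PER_CLUSTER = CLUSTER_TB // DISK_TB             # ~100 disks per cluster

EXPERIMENTS, YEARS = 4, 10
raw_clusters = (1000 // CLUSTER_TB) * YEARS * EXPERIMENTS   # 1 PB/yr -> 10 clusters/yr -> 400
esd_clusters = (100 * YEARS // CLUSTER_TB) * EXPERIMENTS    # ~1 year per cluster       -> 40
aod_clusters = 1 * EXPERIMENTS                              # 100 TB covers ~10 years   -> 4

print(CLUSTER_TB, DISKS_PER_CLUSTER, raw_clusters, esd_clusters, aod_clusters)
```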

RAW Data
- Processed sequentially ~once / year
- Need only the current data plus a historic window online
- Solution: partitioning + offline tablespaces
  - 100 TB = ~10 days of data
  - Ample for (re-)processing
- Partition the tables
- "Old" data → transportable tablespaces → copy to tape
  - Drop from catalog
  - Reload (possibly on a different server) on request
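A conceptual sketch of that rolling window (an editorial illustration in Python rather than actual Oracle DDL; the partition names, the 10-day window and the tape paths are all assumptions): new partitions come online, old ones are exported as self-contained tablespaces to tape and dropped from the catalog, and archived ones can be plugged back in on request.

```python
# Toy model of the rolling RAW-data window: keep ~10 days of partitions online,
# move older ones to tape as transportable tablespaces, reload on request.
# All names, sizes and paths below are illustrative assumptions.
from collections import deque

ONLINE_WINDOW_DAYS = 10            # ~100 TB of RAW data kept online
online = deque()                   # partitions currently attached to the catalog
on_tape = {}                       # partition -> location of its archived tablespace

def ingest_day(day: str) -> None:
    """Attach a new daily partition; age the oldest one out to tape if needed."""
    online.append(day)
    if len(online) > ONLINE_WINDOW_DAYS:
        old = online.popleft()
        on_tape[old] = f"mss://raw/{old}.tbs"   # export tablespace, copy to tape,
                                                # then drop the partition from the catalog

def reload_partition(day: str) -> None:
    """Plug an archived tablespace back in (possibly on a different server)."""
    if day in on_tape:
        online.append(day)
        del on_tape[day]

for d in range(1, 15):
    ingest_day(f"day-{d:02d}")

reload_partition("day-02")         # bring one archived chunk back for reprocessing
print(sorted(online), sorted(on_tape))
```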

Intermediate Data
- Of order 100 TB / experiment / year
- Yotta-byte DBs predicted by 2020! (1 YB = 10^12 TB)
- Can DBMS capabilities grow fast enough to permit just 1 server / experiment?
  - ++500 TB / year
  - An open question …

[Diagram: DB deployment - a DAQ cluster holds only current data (no history) and exports tablespaces to the RAW cluster, which exchanges data to/from the MSS; ESD cluster(s) (one per year? one in total?) serve reconstruction; a single AOD/TAG cluster serves analysis and exchanges data to/from the Regional Centres]

Come & Visit Us!

Come Join Us!

Summary
- Existing DB technologies can be used to build 100 TB databases
- Familiar data-warehousing techniques can be used to handle much larger volumes of historic data
- A paper solution to the problems of LHC data management exists: now we just have to implement it