Applications and the Grid
EDG Tutorial @ CERN, 19.03.2003
Ingo Augustin, CERN DataGrid, HEP Applications
19 March 2003, I. Augustin, CERN

Introduction
You've heard much about WHAT the Grid is, but not much about WHY the Grid is, or will be, or should be, or whatever…
The rationale behind the Grid*:
- Size: the Large Hadron Collider experiments
- Geographical distribution: the MONARC computing model
- Complexity: Earth Observation applications
- User community: biomedical applications
*) I am a physicist! All mistakes in the EO & Bio applications are due to my ignorance.
Electrical Power Grid Metaphor
- Power on demand
- User unaware of actual provider
- Resilience: re-routing, redundancy
- Simple interface: wall socket
- Standardised protocols: 230 V, 50 Hz
LHC Experiments
More Complex Events: < 2000 vs. > 2007
Typical HEP Software Scheme
[flow diagram]
- Generate events, then simulate events (using the simulation geometry, built from the detector description)
- ATLAS detector → raw data
- Reconstruct events (using the reconstruction geometry built from the detector description, plus detector alignment, detector calibration and reconstruction parameters) → ESD, AOD
- Analyze events → physics
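The chain above can be sketched as a toy pipeline; every function and field name here is illustrative, not the actual ATLAS software:

```python
# Toy sketch of the HEP processing chain: generate -> simulate ->
# reconstruct -> analyze.  All names and numbers are illustrative.

def generate_events(n):
    """Event generation: produce n independent simulated collisions."""
    return [{"id": i, "particles": 2 + i % 5} for i in range(n)]

def simulate(event, geometry):
    """Detector simulation: turn generated particles into raw-like hits."""
    event["hits"] = event["particles"] * geometry["granularity"]
    return event

def reconstruct(event, alignment, calibration):
    """Reconstruction: turn hits into physics objects (ESD/AOD-like)."""
    event["tracks"] = round(event["hits"] * calibration * alignment)
    return event

def analyze(events):
    """Analysis: aggregate reconstructed quantities into a physics result."""
    return sum(e["tracks"] for e in events)

geometry = {"granularity": 10}   # stands in for the simulation geometry
events = [simulate(e, geometry) for e in generate_events(100)]
events = [reconstruct(e, alignment=1.0, calibration=0.9) for e in events]
print(analyze(events))
```

Each stage only reads the output of the previous one, which is what makes the scheme easy to distribute.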
Characteristics of HEP Computing
- Event independence: data from each collision is processed independently; a mass of independent problems with no information exchange.
- Massive data storage: modest event size (1-25 MB), but the total is very large: petabytes for each experiment.
- Mostly read-only: data is never changed after recording to tertiary storage, but is read often (cf. magnetic tape as an archive medium).
- Modest floating-point needs: HEP computations involve decision making rather than calculation; computational requirements are quoted in SPECint95 seconds.
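Event independence makes HEP processing embarrassingly parallel; a minimal sketch (illustrative code, not EDG software — a thread pool stands in for the worker nodes of a farm):

```python
# Events are independent, so they can be fanned out to workers with no
# communication between them.  A thread pool stands in for farm nodes.
from multiprocessing.dummy import Pool

def process_event(event_id):
    # No information exchange between events: each decision is local,
    # and mostly decision-making (cuts) rather than heavy floating point.
    passes_cut = event_id % 3 == 0   # stand-in for a physics selection cut
    return event_id if passes_cut else None

with Pool(4) as pool:
    results = pool.map(process_event, range(30))

selected = [r for r in results if r is not None]
print(len(selected))  # events surviving the cut
```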
Typical Layout of a Computing Farm (up to several hundred nodes)
[diagram: application servers, disk servers, tape servers and local network servers, connected to the external network]
The Constraints
Taken from: LHC Computing Review, CERN/LHCC/2001-004. Needed during a year of LHC operations:

  Tape: 29,400 TB    Disk: 9,600 TB    CPU: 6.2 x 10^6 SI95

In today's units: 60 STK silos, 160,000 60-GB disks, 150,000 800-MHz CPUs.
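The "today's units" figures can be cross-checked with a little arithmetic; the per-silo and per-CPU numbers below are derived here, not stated on the slide:

```python
# Sanity-check the conversion of the LHC Computing Review totals into
# 2003-era hardware units (per-silo and per-CPU figures are derived).
tape_tb, disk_tb, cpu_si95 = 29_400, 9_600, 6.2e6

disks = disk_tb * 1000 / 60          # 60 GB per disk
print(round(disks))                  # 160000 disks, as on the slide

per_silo_tb = tape_tb / 60           # 60 STK silos
print(per_silo_tb)                   # 490 TB per silo

per_cpu = cpu_si95 / 150_000         # 150,000 CPUs at 800 MHz
print(round(per_cpu, 1))             # ~41 SI95 per CPU
```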
World Wide Collaboration: distributed computing & storage capacity
LHC: > 5000 physicists, > 270 institutes, > 60 countries
World-wide Computing
Two problems:
- Funding: will funding bodies place all their investment at CERN? No.
- Geography: does a geographically distributed model better serve the needs of the world-wide distributed community? Maybe, if it is reliable and easy to use.
We need to provide physicists with the best possible access to LHC data irrespective of location.
Present LHC Computing Model (les.robertson@cern.ch)
[diagram: CERN physics department at the centre; Tier 1 centres (Karlsruhe, FermiLab (USA), Brookhaven (USA), UK, France, Italy, CERN, …); Tier 2 centres, labs (Lab a-m) and universities (Uni a-y), down to the desktop]
Regional Center
The Dungeon
- Pain (administration): money; manpower reduction by ~30% before start of LHC; commodity PCs & network & …
- Torture (users & history): anarchic user community; legacy (software & structures); evolution instead of projects
- Execution (deadline): 2006/7 start-up of LHC
Earth Observation (WP9)
Global Ozone (GOME) satellite data processing and validation by KNMI, IPSL and ESA. The DataGrid testbed provides a collaborative processing environment for three geographically distributed EO sites (Holland, France, Italy).
ENVISAT
- 3500 MEuro programme cost
- Launched on February 28, 2002
- 10 instruments on board
- 200 Mbps data rate to ground
- 400 TB of data archived per year
- ~100 "standard" products
- 10+ dedicated facilities in Europe
- ~700 approved science user projects
Earth Observation
Two different GOME processing techniques will be investigated:
- OPERA (Holland): tightly coupled, using MPI
- NOPREGO (Italy): loosely coupled, using neural networks
The results are checked by VALIDATION (France): satellite observations are compared against ground-based LIDAR measurements coincident in area and time.
GOME Ozone Data Processing Model
- Level-1 data (raw satellite measurements) are analysed to retrieve actual physical quantities: Level-2 data.
- Level-2 data provides measurements of ozone within a vertical column of atmosphere at a given lat/lon location above the Earth's surface.
- Coincident data consists of Level-2 data co-registered with LIDAR data (ground-based observations) and compared using statistical methods.
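Co-registration amounts to matching satellite retrievals and ground observations that fall within an area and time window; a toy sketch, where the window sizes, station positions and ozone values are all illustrative:

```python
# Toy coincidence matching between Level-2 ozone retrievals and LIDAR
# ground stations.  The 1-degree / 3-hour windows and all data values
# are illustrative choices, not the project's actual criteria.
level2 = [  # (lat, lon, hour-of-day, ozone column in Dobson units)
    (52.1, 4.9, 10.0, 310.0),
    (48.8, 2.3, 11.5, 295.0),
    (40.0, -3.7, 22.0, 305.0),
]
lidar = [   # (lat, lon, hour-of-day, ozone column)
    (52.0, 5.0, 11.0, 312.0),
    (48.7, 2.2, 12.0, 291.0),
]

def coincident(sat, ground, d_deg=1.0, d_hours=3.0):
    """True if the two observations agree in area and time."""
    return (abs(sat[0] - ground[0]) <= d_deg and
            abs(sat[1] - ground[1]) <= d_deg and
            abs(sat[2] - ground[2]) <= d_hours)

pairs = [(s, g) for s in level2 for g in lidar if coincident(s, g)]
# Simple statistical comparison: mean satellite-minus-ground difference.
bias = sum(s[3] - g[3] for s, g in pairs) / len(pairs)
print(len(pairs), bias)
```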
The EO Data Challenge: processing and validation of one year of GOME data
- ESA / KNMI: processing of raw GOME satellite data (Level 1) to ozone profiles (Level 2), with OPERA and NNO, on DataGrid.
- IPSL: validation of the GOME ozone profiles against ground-based (LIDAR) measurements.
- Visualization of the results.
EO Use-Case File Numbers: one year of GOME data (part of a 5-year, global dataset)

  Data                      Files to process & replicate   Size per file
  Level 1 (satellite data)  4,724                          15 MB
  Level 2 (NNO)             9,448,000                      10 kB
  Level 2 (OPERA)           9,448,000                      12 kB
  Coincident (validation)   12                             2.5 MB
  Total                     18,900,736 files               267 GB
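The table's totals can be cross-checked; the 267 GB figure comes out when the per-file sizes are read in binary (1024-based) units:

```python
# Check the file-count and data-volume totals of the GOME use case.
# The 267 GB total matches when MB/kB are read as binary units.
rows = [          # (number of files, size per file in bytes)
    (4_724, 15 * 1024**2),       # Level 1: 15 MB each
    (9_448_000, 10 * 1024),      # Level 2 (NNO): 10 kB each
    (9_448_000, 12 * 1024),      # Level 2 (OPERA): 12 kB each
    (12, int(2.5 * 1024**2)),    # Coincident: 2.5 MB each
]
total_files = sum(n for n, _ in rows)
total_gb = sum(n * s for n, s in rows) / 1024**3
print(total_files)          # 18,900,736 files, as in the table
print(round(total_gb))      # ~267 GB, as in the table
```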
GOME Processing Steps (1-2)
- Step 1: transfer the Level-1 data to a Grid Storage Element.
- Step 2: register the Level-1 data with the Replica Manager; replicate to other SEs if necessary.
[diagram: user at the User Interface; the Replica Manager and Replica Catalog replicate data and metadata across CE/SE pairs at sites B-H]
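Conceptually, the Replica Catalog maps one logical file name (LFN) to the physical file names (PFNs) of its replicas on the various Storage Elements. A toy model of that mapping (this is not the EDG API; all names and URLs are illustrative):

```python
# Toy model of a replica catalog: one logical file name (LFN) maps to
# the physical locations (PFNs) of its replicas.  Not the EDG API;
# LFNs, hostnames and paths are illustrative.
from collections import defaultdict

catalog = defaultdict(list)

def register(lfn, pfn):
    """Register a physical replica of a logical file."""
    catalog[lfn].append(pfn)

def replicate(lfn, target_se):
    """Record a copy of an existing replica on another Storage Element."""
    source = catalog[lfn][0]                 # pick any existing replica
    source_se = source.split("/")[2]         # hostname part of the PFN
    register(lfn, source.replace(source_se, target_se))

register("lfn:gome-l1-orbit-0001",
         "gsiftp://se.site-b.example/data/orbit-0001")
replicate("lfn:gome-l1-orbit-0001", "se.site-c.example")

print(catalog["lfn:gome-l1-orbit-0001"])
```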
GOME Processing Steps (3-4)
- Step 3: submit jobs (a JDL script plus executable) to process the Level-1 data and produce Level-2 data. The Resource Broker checks the user's certificate against the Certificate Authorities, searches the Information Index, and resolves the input data through the Replica Catalog, which maps logical file names (LFNs) to physical file names (PFNs).
- Step 4: transfer the Level-2 data products to the Storage Element; the user retrieves the result.
[diagram: User Interface → Resource Broker → CE/SE pairs at sites B-H]
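Such a job is described in a JDL script. A minimal sketch of what one might look like, where the script name, log names and LFN are illustrative, and only the standard attributes (Executable, Arguments, StdOutput/StdError, sandboxes, InputData) are assumed:

```
Executable    = "process_gome.sh";
Arguments     = "orbit-0001";
StdOutput     = "stdout.log";
StdError      = "stderr.log";
InputSandbox  = {"process_gome.sh"};
OutputSandbox = {"stdout.log", "stderr.log"};
InputData     = {"lfn:gome-l1-orbit-0001"};
```

The InputData LFN is what the Resource Broker hands to the Replica Catalog to find a site that already holds a physical copy.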
GOME Processing Steps (5-6)
- Step 5: produce the Level-2 / LIDAR coincident data and perform the validation.
- Step 6: visualize the results.
[diagram: Level 2 + LIDAR → coincident data → validation]
Genomics and Bioinformatics (WP10)
Challenges for a Biomedical Grid
The biomedical community has NO strong centre of gravity in Europe: no equivalent of CERN (high-energy physics) or ESA (Earth observation). There are many high-level laboratories of comparable size and influence, but no practical activity backbone (EMB-net, national centres, …), leading to:
- little awareness of common needs
- few common standards
- small common long-term investment
The biomedical community is very large (tens of thousands of potential users) and often distant from computer-science issues.
Biomedical Requirements
- Large user community (thousands of users): anonymous/group login
- Data management: data updates and data versioning
- Large volume management: a hospital can accumulate TBs of images in a year
- Security: disk / network encryption
- Limited response time: fast queues
- High-priority jobs: privileged users
- Interactivity: communication between user interface and computation
- Parallelization: MPI, site-wide / grid-wide
- Thousands of images, operated on by tens of algorithms
- Pipeline processing: pipeline description language / scheduling
The Grid Impact on Data Handling
DataGrid will allow mirroring of databases: an alternative to the current costly replication mechanism, allowing web portals on the grid to access up-to-date databases.
[diagram: a Biomedical Replica Catalog linking TrEMBL (EBI) and Swiss-Prot (Geneva)]
Web Portals for Biologists
- The biologist enters sequences through a web interface; bio-informatics algorithms are executed as a pipeline.
- Genomics comparative analysis (thousands of files of ~GB each); genome comparison takes days of CPU (~n**2).
- Phylogenetics; 2D/3D molecular structure of proteins; …
The algorithms are currently executed on local clusters: big labs have big clusters, but there is growing pressure on resources, and the Grid will help. More and more biologists compare larger and larger sequences (whole genomes) against more and more genomes, with fancier and fancier algorithms!
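The ~n**2 cost of all-against-all genome comparison is easy to see, and since each pair is independent it fans out naturally onto a grid; a toy sketch (the scoring function and sequences are illustrative, not a real comparison algorithm such as BLAST):

```python
# All-against-all genome comparison scales as ~n**2 in the number of
# genomes; each pair is an independent job, so it grids well.
# The scoring function and sequences below are toy stand-ins.
from itertools import combinations

def similarity(a, b):
    """Toy pairwise score: fraction of aligned positions that match."""
    return sum(x == y for x, y in zip(a, b)) / min(len(a), len(b))

genomes = {"g1": "ACGTACGT", "g2": "ACGTTCGT", "g3": "TTGTACGA"}

# n genomes -> n*(n-1)/2 independent comparison jobs.
jobs = list(combinations(genomes, 2))
print(len(jobs))

scores = {(a, b): similarity(genomes[a], genomes[b]) for a, b in jobs}
print(scores[("g1", "g2")])
```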
The Visual DataGrid BLAST
- A graphical interface to enter query sequences and select the reference database
- A script to execute the BLAST algorithm on the grid
- A graphical interface to analyze the result
Accessible from the web portal genius.ct.infn.it
Summary of Added Value Provided by the Grid for BioMed Applications
- Data mining on genomics databases (exponential growth)
- Indexing of medical databases (TB/hospital/year)
- Collaborative framework for large-scale experiments (e.g. epidemiological studies)
- Parallel processing for database analysis
- Complex 3D modelling
Conclusions
- Grid or Grid-like systems are clearly needed.
- EDG is a start that has to be followed up.
- EDG is nowhere near being the "real thing".
- The current key focus is resilience and scalability.
References
Some interesting web sites and documents:
- LHC Computing Review: http://lhc-computing-review-public.web.cern.ch/lhc-computing-review-public/Public/Report_final.PDF
- LCG: http://lcg.web.cern.ch/LCG
  - http://lcg.web.cern.ch/LCG/SC2/RTAG6 (model for regional centres)
  - http://lcg.web.cern.ch/LCG/SC2/RTAG4 (HEPCAL Grid use cases)
- GEANT (European research networks): http://www.dante.net/geant/
- POOL: http://lcgapp.cern.ch/project/persist/
- WP8: http://datagrid-wp8.web.cern.ch/DataGrid-WP8/
  - requirements: http://edmsoraweb.cern.ch:8001/cedar/doc.info?document_id=332409
- WP9: http://styx.srin.esa.it/grid
  - requirements: http://edmsoraweb.cern.ch:8001/cedar/doc.info?document_id=332411
- WP10: http://marianne.in2p3.fr/datagrid/wp10/
  - http://www.healthgrid.org
  - http://www.creatis.insa-lyon.fr/MEDIGRID/
  - requirements: http://edmsoraweb.cern.ch:8001/cedar/doc.info?document_id=332412
The End (finally)