Presentation transcript:

Slide 1: The Data Deluge and the Grid
– The Data Deluge
– The Large Hadron Collider
– The LHC Data Challenge
– The Grid
– Grid Applications
– GridPP
– Conclusion
Steve Lloyd, Queen Mary University of London (s.l.lloyd@qmul.ac.uk)

Slide 2: The Data Deluge
Expect massive increases in the amount of data being collected in several diverse fields over the next few years:
– Astronomy: massive sky surveys
– Biology: genome databases etc.
– Earth observation
– Digitisation of paper, film and tape records to create digital libraries, museums...
– Particle physics: the Large Hadron Collider
– ...
1 PByte ~ 1,000 TBytes ~ 1M GBytes ~ 1.4M CDs [Petabyte, Terabyte, Gigabyte]
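As a quick sanity check on those prefixes, here is a minimal Python sketch; the 700 MB CD capacity is an assumption, chosen because it reproduces the slide's ~1.4M figure:

    # Decimal prefixes: 1 PB = 10**15 bytes, 1 GB = 10**9 bytes.
    PB = 10**15
    GB = 10**9
    cd_bytes = 700 * 10**6        # one CD-ROM (assumed capacity)

    print(PB // GB)               # 1,000,000 GBytes in a PByte
    print(round(PB / cd_bytes))   # ~1,430,000 CDs in a PByte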

Slide 3: Digital Sky Project
Federating new astronomical surveys:
– ~40,000 square degrees
– ~1/2 trillion pixels (1 arc second)
– ~1 TB x multi-wavelengths
– >1 billion sources
Integrated catalogue and image database:
– Digital Palomar Observatory Sky Survey
– 2MASS (2 Micron All Sky Survey)
– NRAO VLA Sky Survey
– VLA FIRST Radio Survey
Later:
– ROSAT
– IRAS
– Westerbork 327 MHz Survey

Slide 4: Sloan Digital Sky Survey
– ~1 million spectra
– positions and images of 100 million objects
– 5 wavelength bands
– ~40 TB
Survey of 10,000 square degrees of the Northern Sky over 5 years.

Slide 5: VISTA
Visible and Infrared Survey Telescope for Astronomy

Slide 6: Virtual Observatories
[Images: the Crab Nebula in optical, radio, infra-red and X-ray; the jet in M87 in HST optical, Gemini mid-IR, VLA radio and Chandra X-ray]

Slide 7: NASA's Earth Observing System
1 TB/day
[Image: Galapagos oil spill]

Slide 8: ESA Earth Observation Facilities
[Diagram: ground stations at ESRIN, Matera (I), Neustrelitz (D), Kiruna-Esrange (S), Maspalomas (E) and Tromsø (N) feed standard production chains, products and historical archives for users, covering missions including SeaWiFS, SPOT, IRS-P3, Landsat 7, Terra/MODIS and AVHRR]
GOME analysis detected ozone thinning over Europe, 31 Jan 2002.

Slide 9: Species 2000
Aims to enumerate all ~1.7 million known species of plants, animals, fungi and microbes on Earth.
A federation of initially 18 taxonomic databases, eventually ~200 databases.

Slide 10: Genomics

Slide 11: The LHC
The Large Hadron Collider (LHC) will be a 14 TeV centre-of-mass proton-proton collider operating in the existing 26.7 km LEP tunnel at CERN. Due to start operation in 2006 or later.
– 1,232 superconducting main dipoles of 8.3 T
– 788 quadrupoles
– 2,835 bunches of 10^11 protons per bunch, spaced by 25 ns

Slide 12: Particle Physics Questions
Need to discover (or confirm) the Higgs particle:
– study its properties
– prove that the Higgs couplings depend on masses
Other unanswered questions:
– Does supersymmetry exist?
– How are quarks and leptons related?
– Why are there 3 sets of quarks and leptons?
– What about gravity?
– Anything unexpected?

Slide 13: The LHC

Slide 14: The LEP/LHC Tunnel

Slide 15: LHC Experiments
The LHC will house 4 experiments:
– ATLAS and CMS are large 'general purpose' detectors designed to detect everything and anything
– LHCb is a specialised experiment designed to study CP violation in the b-quark system
– ALICE is a dedicated heavy-ion physics detector

Slide 16: Schematic View of the LHC

Slide 17: The ATLAS Experiment
ATLAS consists of:
– an inner tracker to measure the momentum of each charged particle
– a calorimeter to measure the energies carried by the particles
– a muon spectrometer to identify and measure muons
– a huge magnet system for bending charged particles for momentum measurement
A total of >10^8 electronic channels.

Slide 18: The ATLAS Detector

Slide 19: Simulated ATLAS Higgs Event

Slide 20: LHC Event Rates
The LHC proton bunches collide every 25 ns and each collision yields ~20 proton-proton interactions superimposed in the detector, i.e.
– 40 MHz x 20 = 8x10^8 pp interactions/sec
The (110 GeV) Higgs cross section is 24.2 pb. A good channel is H → γγ, with a branching ratio of 0.19% and a detector acceptance of ~50%.
– At full (10^34 cm^-2 s^-1) LHC luminosity this gives 10^34 x 24.2x10^-12 x 10^-24 x 0.0019 x 0.5 = 2x10^-4 H → γγ per second
A 2x10^-4 needle in an 8x10^8 haystack.
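The slide's arithmetic, spelled out as a short Python sketch; the only unit fact added here is the conversion 1 pb = 10^-36 cm^2:

    # Higgs-rate estimate from the slide's numbers.
    luminosity = 1e34            # cm^-2 s^-1, design LHC luminosity
    sigma_cm2  = 24.2 * 1e-36    # 24.2 pb in cm^2 (1 pb = 1e-36 cm^2)
    branching  = 0.0019          # H -> gamma gamma branching ratio
    acceptance = 0.5             # detector acceptance

    higgs_rate = luminosity * sigma_cm2 * branching * acceptance
    print(higgs_rate)            # ~2.3e-4 Higgs per second

    pp_rate = 40e6 * 20          # crossings/sec x interactions/crossing
    print(pp_rate)               # 8e8 pp interactions per second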

Slide 21: 'Online' Data Reduction
Selecting interesting events based on progressively more detector information:
– Collision rate: 40 MHz (40 TB/sec)
– Level 1, special hardware trigger: 10^4-10^5 Hz (10-100 GB/sec)
– Level 2, embedded processor trigger: 10^2-10^3 Hz (1-10 GB/sec)
– Level 3, processor farm: 10-100 Hz (100-200 MB/sec)
– Raw data storage, then offline data reconstruction
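A rough model of that chain; the ~1 MB/event figure is an assumption, broadly consistent with the per-event sizes on the next slide:

    # Approximate data flow out of each trigger stage, assuming 1 MB/event.
    event_bytes = 1e6
    stages = [
        ("Collisions", 40e6),   # Hz
        ("Level 1",    1e5),    # upper end of 1e4-1e5 Hz
        ("Level 2",    1e3),    # upper end of 1e2-1e3 Hz
        ("Level 3",    100),    # upper end of 10-100 Hz
    ]
    for name, rate_hz in stages:
        gb_per_s = rate_hz * event_bytes / 1e9
        print(f"{name}: {rate_hz:.0e} Hz -> {gb_per_s:g} GB/s")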

Slide 22: Offline Analysis
– Raw data from detector: 1-2 MB/event at 100-400 Hz
– Data reconstruction (digits to energy/momentum etc.)
– Event Summary Data: 0.5 MB/event
– Analysis event selection
– Analysis Object Data: 10 kB/event
– Physics analysis
Total data per year from one experiment: 1 to 8 PBytes (10^15 Bytes).
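Those yearly totals follow from the raw rates if one assumes ~10^7 seconds of running per year, a standard rule of thumb that is not stated on the slide:

    # Yearly raw-data volume at the low and high ends of the slide's rates.
    seconds_per_year = 1e7               # assumed LHC live time per year
    for rate_hz, mb_per_event in [(100, 1), (400, 2)]:
        petabytes = rate_hz * mb_per_event * 1e6 * seconds_per_year / 1e15
        print(f"{rate_hz} Hz x {mb_per_event} MB/event -> {petabytes:g} PB/year")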

Slide 23: Computing Resources Required
CPU power (reconstruction, simulation, user analysis etc.):
– 20 million SpecInt2000
– (a 1 GHz PC is rated at ~400 SpecInt2000)
– i.e. 50,000 of yesterday's/today's PCs
'Tape' storage: 20,000 TB
Disk storage: 2,500 TB
Analysis carried out throughout the world by hundreds of physicists.
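The PC count is just the required capacity divided by the per-machine rating:

    # 20 million SpecInt2000 needed / ~400 SpecInt2000 per 1 GHz PC.
    required_si2k = 20e6
    pc_si2k = 400
    print(int(required_si2k / pc_si2k))   # 50,000 PCs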

Slide 24: Worldwide Collaboration
CMS: 1,800 physicists, 150 institutes, 32 countries.

Slide 25: Solutions
Centralised solution:
– put all resources at CERN
– but funding agencies certainly won't place all their investment at CERN
– sociological problems
Distributed solution:
– exploit established computing expertise & infrastructure in national labs and universities
– reduce dependence on links to CERN
– tap additional funding sources (spin-off)
Is the Grid the solution?

Slide 26: What is the Grid?
Analogy with the electricity power grid:
– unlimited ubiquitous distributed computing
– transparent access to multi-petabyte distributed databases
– easy to plug in
– complexity of infrastructure hidden

Slide 27: The Grid
Ian Foster and Carl Kesselman, editors, The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 1999, http://www.mkp.com/grids
Five emerging models:
– Distributed Computing: synchronous processing
– High-Throughput Computing: asynchronous processing
– On-Demand Computing: dynamic resources
– Data-Intensive Computing: databases
– Collaborative Computing: scientists

Slide 28: The Grid
Ian Foster / Carl Kesselman: "A computational Grid is a hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to high-end computational capabilities."

Slide 29: The Grid
– Dependable: need to rely on remote equipment as much as the machine on your desk
– Consistent: machines need to communicate, so need consistent environments and interfaces
– Pervasive: the more resources that participate in the same system, the more useful they all are
– Inexpensive: important for pervasiveness, i.e. built using commodity PCs and disks

Slide 30: The Grid
You simply submit your job to the 'Grid'; you shouldn't have to know where the data you want is or where the job will run. The Grid software (middleware) will take care of:
– running the job where the data is, or
– moving the data to where there is CPU power available
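A toy sketch of that placement decision (all site names, fields and the function itself are hypothetical; real resource brokers of the period were far more elaborate):

    # Toy broker: prefer a site that already holds the data and has free
    # CPU; otherwise ship the data to the least-loaded site.
    def place_job(dataset, sites):
        for site in sites:
            if dataset in site["datasets"] and site["free_cpus"] > 0:
                return site["name"], "run the job where the data is"
        target = max(sites, key=lambda s: s["free_cpus"])
        return target["name"], "move the data to the free CPU"

    sites = [
        {"name": "CERN", "free_cpus": 0,  "datasets": {"higgs-sim"}},
        {"name": "RAL",  "free_cpus": 64, "datasets": set()},
    ]
    print(place_job("higgs-sim", sites))
    # ('RAL', 'move the data to the free CPU')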

Slide 31: The Grid for the Scientist
[Cartoon: the scientist thinks "E = mc^2" while the Grid middleware absorbs the "@#%&*!"]
Putting the bottleneck back in the scientist's mind.

Slide 32: Grid Tiers
For the LHC we envisage a 'hierarchical' structure based on several 'Tiers', since the data mostly originates at one place:
– Tier-0: CERN, the source of the data
– Tier-1: ~10 major Regional Centres (inc. UK)
– Tier-2: smaller, more specialised Regional Centres (4 in the UK?)
– Tier-3: university groups
– Tier-4: my laptop? Mobile phone?
The structure doesn't need to be hierarchical; for biologists, for example, it's probably not desirable.

Slide 33: Grid Services
[Layered architecture diagram, top to bottom:]
– Applications: chemistry, biology, cosmology, particle physics, environment
– Application toolkits: distributed computing toolkit; data-intensive applications toolkit; collaborative applications toolkit; remote visualisation applications toolkit; problem solving applications toolkit; remote instrumentation applications toolkit
– Grid services (middleware): resource-independent and application-independent services, e.g. authentication, authorisation, resource location, resource allocation, events, accounting, remote data access, information, policy, fault detection
– Grid fabric (resources): resource-specific implementations of basic services, e.g. transport protocols, name servers, differentiated services, CPU schedulers, public key infrastructure, site accounting, directory service, OS bypass

Slide 34: Problems
Scalability:
– Will it scale to thousands of processors, thousands of disks, PetaBytes of data, Terabits/sec of IO?
Wide-area distribution:
– How to distribute, replicate, cache, synchronise and catalogue the data?
– How to balance local ownership of resources with the requirements of the whole?
Adaptability/flexibility:
– Need to adapt to rapidly changing hardware and costs, new analysis methods etc.

Slide 35: SETI@home
A distributed computing project, not really a Grid project: you pull the data from them, rather than them submitting the job to you.
– total of 4,591,332 users
– 963,646,331 results received
– 1,545,634 years of CPU time
– 3.3x10^21 floating point operations
– 125 different CPU types
– 143 different operating systems
Data comes from the Arecibo telescope in Puerto Rico.

Slide 36: SETI@home

Slide 37: Entropia
Uses idle cycles on home PCs for profit and non-profit projects:
– Mersenne Prime Search: 42,519 machines active, 560 years of CPU per day
– FightAIDS@Home: 60,000 machines, 1,400 years of CPU time

Slide 38: NASA Information Power Grid
Knits together widely distributed computing, data, instrumentation and human resources to address complex, large-scale computing and data analysis problems.

Slide 39: Collaborative Engineering
[Diagram: real-time collection of multi-source data from the Unitary Plan Wind Tunnel, with archival storage and data analysis]

Slide 40: Other Grid Applications
Distributed supercomputing:
– simultaneous execution across multiple supercomputers
Smart instruments:
– enhance the power of scientific instruments by providing access to data archives, online processing capabilities and visualisation, e.g. coupling Argonne's Advanced Photon Source to a supercomputer

Slide 41: GridPP
http://www.gridpp.ac.uk

Slide 42: GridPP Overview
– Provide architecture and middleware
– Future LHC experiments: use the Grid with simulation data
– Running US experiments: use the Grid with real data
– Build prototype Tier-1 and Tier-2s in the UK and implement middleware in the experiments

Slide 43: The Prototype UK Tier-1
March 2003: 560 CPUs (450 MHz-1.4 GHz), 50 TB disk, 35 TB tape in use (theoretical tape capacity 366 TB).

Slide 44: Conclusions
– Enormous data challenges in the next few years; the Grid is the likely solution.
– The Web gives ubiquitous access to distributed information; the Grid will give ubiquitous access to computing resources, and hence knowledge.
– Many Grid projects and testbeds are starting to take off.
– GridPP is building a UK Grid for particle physicists to prepare for future LHC data.

