# The Data Deluge and the Grid
Steve Lloyd, Queen Mary, University of London. Culham, July 2006.


Slide 1: The Data Deluge and the Grid

Slide 2: Outline
- What is data?
- Where it comes from: e-Science
- The CERN LHC and experiments
- What is the Grid?
- GridPP
- Challenges ahead

Slide 3: What is Data?
Anything that can be expressed as numbers: raw information, numbers, binary digits.
- Pictures: store the amount of red, green and blue
- Sound: store the loudness at each time
- Lots of pictures + sound = DVD video
- Electrical signals: store the voltage or current
- Text: every character has a numerical code

Slide 4: Digital Data
Numbers are stored as binary digits:
- 1 bit = 0 or 1 (can store yes/no or on/off)
- 1 Byte = 8 bits (can store numbers from 0 to 255, enough for a character a-z, A-Z, *£$@<...)
- Example: 25 = 0x128 + 0x64 + 0x32 + 1x16 + 1x8 + 0x4 + 0x2 + 1x1 = 00011001 in binary

Scales:
- 1 kiloByte = ~1,000 Bytes (typical Word document ~30 kB)
- 1 MegaByte = ~1,000,000 Bytes (a floppy disk ~1.4 MB; a CD ~700 MB)
- 1 GigaByte = ~1,000,000,000 Bytes (typical PC hard drive 120-400 GB)
- 1 TeraByte = ~1,000,000,000,000 Bytes
- 1 PetaByte = ~1,000,000,000,000,000 Bytes (~1.4 million CDs)
- 1 ExaByte = ~1,000,000,000,000,000,000 Bytes
(Chart labels: world annual information production; world annual book production.)
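The slide's binary example and its CD arithmetic can be checked with a few lines of Python (a minimal sketch; the `to_byte` helper is just for illustration):

```python
# The slide's binary example: the number 25 written as 8 bits,
# plus the PetaByte-to-CD conversion quoted above.

def to_byte(n):
    """Return an 8-bit binary string for 0 <= n <= 255."""
    return format(n, "08b")

print(to_byte(25))  # 00011001 = 0x128 + 0x64 + 0x32 + 1x16 + 1x8 + 0x4 + 0x2 + 1x1

# Decimal prefixes as used on the slide (~powers of 1,000)
scales = {"kB": 1e3, "MB": 1e6, "GB": 1e9, "TB": 1e12, "PB": 1e15, "EB": 1e18}

# One CD holds ~700 MB, so one PetaByte is roughly 1.4 million CDs
print(round(scales["PB"] / (700 * scales["MB"])))  # 1428571
```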

Slide 5: Data Analysis
What is done with data? Nothing; read it; listen to it; watch it; analyse it. A computer program (a "job") might read A, read B, compute C = A + B and print C: inputs 2 and 3 give output 5. Bigger jobs calculate how proteins fold, or what the weather is going to do.
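The toy "job" on the slide amounts to a three-line program (the `run_job` name is illustrative):

```python
# The slide's toy "job": read two numbers, add them, print the result.
# Inputs 2 and 3 give 5, as in the slide's example.

def run_job(a, b):
    c = a + b
    return c

print(run_job(2, 3))  # 5
```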

Slide 6: e-Science
In the UK this sort of activity has become known as "e-Science". Dr John Taylor, Director General of the Research Councils:
- "e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it"
- "e-Science will change the dynamic of the way Science is undertaken"
- "Science increasingly done through distributed global collaborations enabled by the internet using very large data collections, terascale computing resources and high performance visualisation"

Slide 7: Astronomy
Virtual Observatories. (Images: the Crab Nebula in optical, radio, infra-red and X-ray; the jet in M87 in HST optical, Gemini mid-IR, VLA radio and Chandra X-ray.)

Slide 8: Earth Observation
1 TB of data per day. (Images: an ozone map; Ottawa; The Institute of Physics on Google Earth.)

Slide 9: Bioinformatics

Slide 10: Healthcare
Dynamic Brain Atlas; breast screening scanning; remote consultancy.

Slide 11: Collaborative Engineering
Real-time collection, multi-source data analysis and archival storage (example: the Unitary Plan Wind Tunnel).

Slide 12: Digital Curation
Digitisation of almost anything, to create digital libraries and museums.

Slide 13: The CERN LHC
The world's most powerful particle accelerator, due to start in 2007, with four large experiments.

Slide 14: The Experiments
- ATLAS: general purpose; origin of mass; supersymmetry; 2,000 scientists from 34 countries
- CMS: general purpose; 1,800 scientists from over 150 institutes
- ALICE: heavy-ion collisions, to create quark-gluon plasmas; 50,000 particles in each collision
- LHCb: to study the differences between matter and antimatter; detects over 100 million b and b-bar mesons each year

Slide 15: ATLAS Detector
7,000 tonnes; 42 m long, 22 m wide, 22 m high (about the height of a 5-storey building). 2,000 physicists, 150 institutes, 34 countries.

Slide 16: (image only)

Slide 17: The Higgs
The primary objective of the LHC: what is the origin of mass? Is it the Higgs particle? A massless particle travels at the speed of light; a low-mass particle travels slower; a high-mass particle travels slower still.

Slide 18: The LHC Data Challenge
Starting from this event, we are looking for this signature.
- ~100,000,000 electronic channels
- 800,000,000 proton-proton interactions per second
- 0.0002 Higgs per second
- Selectivity: 1 in 10^13, like looking for 1 person in a thousand world populations, or for a needle in 20 million haystacks!
- 10 PBytes of data a year (10 million GBytes = 14 million CDs)
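The back-of-envelope numbers above can be reproduced directly (the ~6.5 billion world population of 2006 is an assumption, not from the slide):

```python
# Checking the slide's back-of-envelope numbers.

petabyte = 1e15          # bytes
cd = 700e6               # bytes per CD
# 10 PB/year expressed in millions of CDs
print(round(10 * petabyte / cd / 1e6, 1))   # 14.3, i.e. ~14 million CDs

selectivity = 1e13       # 1 interesting event in 10^13
world_pop = 6.5e9        # assumed 2006 world population
# 10^13 people is roughly this many world populations
print(round(selectivity / world_pop))       # 1538, i.e. "a thousand world populations"
```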

Slide 19: LHC Computing Requirements
- CPU power (reconstruction, simulation, user analysis etc.): ~100,000 of today's PCs
- 'Tape' storage: 20 PetaBytes (= 20 million GBytes)
- Disk storage: 2.5 PetaBytes (= 2.5 million GBytes)
- Distributed computing solution: "The Grid"

Slide 20: Web: Information Sharing
Invented at CERN by Tim Berners-Lee, with agreed protocols: HTTP, HTML, URLs. Anyone can access information and post their own, and it quickly crossed over into public use. (Chart: number of Internet hosts, in millions, by year.)

Slide 21: Distributed Resource Sharing
@Home projects use home PCs to run numerous calculations with dozens of variables. They are distributed computing projects, not Grids. Examples: the BBC Climate Change Experiment, SETI@home, FightAIDS@home.
Distributed file sharing: peer-to-peer networks with no centralised database of files; legal problems with sharing copyrighted material; security problems.

Slide 22: SETI@home
A distributed computing project, not really a Grid project: you pull the data from them rather than them submitting a job to you. Data come from the Arecibo telescope in Puerto Rico.
- Users: 5,240,038
- Results received: 1,632,106,991
- Years of CPU time: 2,121,057
- Extraterrestrials found: 0

Slide 23: The Grid
Ian Foster / Carl Kesselman: "A computational Grid is a hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to high-end computational capabilities." 'Grid' means different things to different people, but all agree it's a funding opportunity!

Slide 24: Electricity Grid
Analogy with the electricity power grid: power stations, a distribution infrastructure and a 'standard interface'.

Slide 25: Computing Grid
The computing analogue: computing and data centres, connected by the fibre optics of the internet.

Slide 26: Middleware
Middleware is the operating system of a distributed computing system. On a single PC, programs (Word/Excel, email/web, your program, games) run on an operating system that manages the CPU and disks. On the Grid, your program goes from a User Interface machine through middleware components (Resource Broker, Information Service, Replica Catalogue, Bookkeeping Service) that manage CPU clusters and disk servers.
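The Resource Broker's role can be sketched in a few lines: match a job's requirements against what an information service advertises and pick a site. This is a toy illustration only; the site names and fields are made up and this is not the real LCG/EGEE middleware.

```python
# Toy Resource Broker: choose a compute cluster for a job by matching
# its requirements against sites advertised by an "information service".
# All names and numbers here are illustrative.

sites = [
    {"name": "ral.ac.uk",  "free_cpus": 120, "free_disk_gb": 5000},
    {"name": "qmul.ac.uk", "free_cpus": 40,  "free_disk_gb": 800},
    {"name": "cern.ch",    "free_cpus": 0,   "free_disk_gb": 9000},
]

def broker(job, sites):
    """Return the name of the matching site with the most free CPUs, or None."""
    ok = [s for s in sites
          if s["free_cpus"] >= job["cpus"] and s["free_disk_gb"] >= job["disk_gb"]]
    return max(ok, key=lambda s: s["free_cpus"])["name"] if ok else None

print(broker({"cpus": 10, "disk_gb": 500}, sites))  # ral.ac.uk
```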

Slide 27: The Grid Vision
(Two images: "from this" to "to this".)

Slide 28: Who are GridPP?
19 UK universities, CERN and CCLRC (RAL & Daresbury), funded by PPARC. GridPP has developed a working, highly functional Grid.
- GridPP1, 2001-2004: from Web to Grid
- GridPP2, 2004-2007: from prototype to production
- GridPP3, 2007-2011 (proposed): from production to exploitation

Slide 29: International Context
- LHC Computing Grid (LCG): Grid deployment project for the LHC
- EU Enabling Grids for e-Science (EGEE), 2004-2008: Grid deployment project for all disciplines. GridPP is part of EGEE and LCG (currently the largest Grid in the world)
- UK National Grid Service: the UK's core production computational and data Grid
- Open Science Grid (USA): science applications from HEP to biochemistry
- NorduGrid (Scandinavia): Grid research and development collaboration

Slide 30: GridPP Middleware Development
Workload management; storage interfaces; network monitoring; security; information services; Grid data management.

Slide 31: What you need to use the Grid
1. Get a digital certificate (UK Certificate Authority). Authentication: who you are.
2. Join a Virtual Organisation (VO). Authorisation: what you are allowed to do.
3. Get access to a local User Interface machine (UI) and copy your files and certificate there.
4. Write some Job Description Language (JDL) and scripts to wrap your programs:

```
############# HelloWorld.jdl #############
Executable    = "/bin/echo";
Arguments     = "Hello welcome to the Grid";
StdOutput     = "hello.out";
StdError      = "hello.err";
OutputSandbox = {"hello.out","hello.err"};
##########################################
```
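With the EDG/LCG command-line tools of that era, running this job looked roughly like the following. The exact commands varied between middleware releases, so treat this as an illustrative sketch rather than a definitive recipe:

```shell
# Create a short-lived Grid proxy certificate from your personal certificate
grid-proxy-init

# Submit the job described in HelloWorld.jdl via the Resource Broker
edg-job-submit HelloWorld.jdl

# Check its status, then fetch the output sandbox (hello.out, hello.err)
edg-job-status <job-id>
edg-job-get-output <job-id>
```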

Slide 32: How it works
From the UI machine, the job description and input sandbox (the script you want to run, plus other files: job options, source...) go to the Resource Broker, along with your Grid proxy certificate. The broker dispatches the job to a Compute Element, which reads input data from and writes output data to Storage Elements. The output sandbox (plots, logs...) is returned to the UI machine.

Slide 33: Tier Structure
- Tier 0: CERN computer centre (fed by the online system and offline farm)
- Tier 1: national centres (RAL in the UK; France, Italy, Germany, USA)
- Tier 2: regional groups (ScotGrid, NorthGrid, SouthGrid, London)
- Institutes (Glasgow, Edinburgh, Durham) and workstations
A useful model for particle physics, but not necessary for others.

Slide 34: UK Tier-1 Centre at RAL
1,000 dual-CPU machines; 200 TB disk; 220 TB tape (capacity 1 PB). High-quality data services, a national and international role, the UK focus for international Grid development, and home of the Grid Operations Centre.

Slide 35: UK Tier-2 Centres
- ScotGrid: Durham, Edinburgh, Glasgow
- NorthGrid: Daresbury, Lancaster, Liverpool, Manchester, Sheffield
- SouthGrid: Birmingham, Bristol, Cambridge, Oxford, RAL PPD
- London: Brunel, Imperial, QMUL, RHUL, UCL
Mostly funded by HEFCE.

Slide 36: The Grid at Queen Mary
The Queen Mary e-Science High Throughput Cluster: 174 PCs (348 CPUs) and 40 TBytes of disk storage, part of the London Tier-2 Centre. Currently adding another 285 PCs.

Slide 37: The LCG Grid Status
- Worldwide: 182 sites, 23,438 CPUs, 9.2 PB disk, 2,200 years of CPU time
- UK: 21 sites, 4,482 CPUs, 180 TB disk, 593 years of CPU time

Slide 38: "GridPP has been developed to help answer questions about the conditions in the Universe just after the Big Bang," said Professor Keith Mason, head of the Particle Physics and Astronomy Research Council (PPARC). "But the same resources and techniques can be exploited by other sciences with a more direct benefit to society."

Slide 39: Future Challenges
(Graphic: a CD stack holding 1 year of LHC data would be ~20 km tall, versus (ex-)Concorde at 15 km; we are at 4 km.)
- Scaling to full size: ~20,000 to 100,000 CPUs
- Stability, robustness etc.
- Security (a hackers' paradise!)
- Sharing resources (in an RAE environment!)
- International collaboration
- Increased industrial take-up
- Spread beyond science
- Continued funding beyond the start of the LHC!

Slide 40: Further Info
http://www.gridpp.ac.uk (RSS news feed)
