Data Challenges in e-Science, Aberdeen. Prof. Malcolm Atkinson, Director. 2nd December 2003.

Presentation transcript:

1 Data Challenges in e-Science, Aberdeen. Prof. Malcolm Atkinson, Director. 2nd December 2003

2 What is e-Science?

3 Foundation for e-Science: sensor nets, shared data archives, computers, software, colleagues and instruments, linked by the Grid. e-Science methodologies will rapidly transform science, engineering, medicine and business, driven by exponential growth (×1000/decade) and enabling a whole-system approach. Diagram derived from Ian Foster's slide

4 Three-way Alliance. Computing Science: systems, notations & formal foundation; process & trust. Theory: models & simulations. Experiment & advanced data collection. Each linked by shared data. Multi-national, multi-discipline, computer-enabled consortia, cultures & societies. Requires much engineering, much innovation. Changes culture: new mores, new behaviours. New opportunities, new results, new rewards

5 Biochemical Pathway Simulator (Computing Science, Bioinformatics, Beatson Cancer Research Labs) DTI Bioscience Beacon Project Harnessing Genomics Programme Slide from Professor Muffy Calder, Glasgow

6 Why is Data Important?

7 Data as Evidence: all disciplines. Hypothesis, curiosity or …; collections of data; analysis & models; derived & synthesised data; information, knowledge, decisions & designs. Driven by creativity, imagination and perspiration

8 Data as Evidence: historically. Hypothesis, curiosity or …; collections of data; analysis & models; derived & synthesised data; information, knowledge, decisions & designs. Individual's idea; personal collection; personal effort; lab notebook. Driven by creativity, imagination, perspiration & personal resources

9 Data as Evidence: enterprise scale. Hypothesis, curiosity or business goals; collections of digital data; analysis & computational models; derived & synthesised data; information, knowledge, decisions & designs. Agreed hypothesis or goals; enterprise databases & (archive) file stores; data production pipelines; data products & results. Driven by creativity, imagination, perspiration & the company's resources

10 Data as Evidence: e-Science. Shared goals, multiple hypotheses; collections of published & private data; analysis, computation, annotation; derived & synthesised data; information, knowledge, decisions & designs. Communities and challenges: multi-enterprise & public curation; synthesis from multiple sources; multi-enterprise models, computation & workflow; shared data products & results. Driven by creativity, imagination, perspiration & shared resources

11 Global in-flight engine diagnostics: in-flight data; airline; maintenance centre; ground station; global network (e.g. SITA); internet, e-mail, pager; DS&S Engine Health Center; data centre. Distributed Aircraft Maintenance Environment: Universities of Leeds, Oxford, Sheffield & York. 100,000 aircraft × 0.5 GB/flight × 4 flights/day = 200 TB/day
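The slide's headline figure follows directly from its per-aircraft numbers; a quick check:

```python
# Back-of-envelope check of the DAME data-rate figures on the slide.
aircraft = 100_000
flights_per_day = 4
gb_per_flight = 0.5

gb_per_day = aircraft * flights_per_day * gb_per_flight
tb_per_day = gb_per_day / 1000  # decimal TB, as on the slide

print(tb_per_day)  # 200.0 TB/day, matching the slide
```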

12 LHC Distributed Simulation & Analysis. Online system; offline farm (~20 TIPS); CERN Computer Centre (>20 TIPS, Tier 0); regional centres (Tier 1: RAL, US, French, Italian; ~Gbits/sec or air freight); Tier 2 centres (~1 TIPS each, e.g. ScotGRID++); institutes (Tier 3, ~0.25 TIPS); workstations (Tier 4). One bunch crossing per 25 ns; 100 triggers per second; each event is ~1 MByte. ~PBytes/sec from the detector; ~100 MBytes/sec recorded; 100-1000 Mbits/sec onwards; physics data cache. Physicists work on analysis channels; each institute has ~10 physicists working on one or more channels; data for these channels should be cached by the institute server. 1 TIPS = 25,000 SpecInt95; a PC (1999) = ~15 SpecInt95
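The slide's rates show why triggering is essential; reproducing the arithmetic from the stated crossing period, trigger rate and event size:

```python
# Reproducing the LHC rate arithmetic from the slide.
crossing_rate_hz = 1e9 / 25          # one bunch crossing per 25 ns -> 40 MHz
event_size_bytes = 1e6               # each event ~1 MByte
trigger_rate_hz = 100                # 100 triggers per second

# If every crossing were kept, the raw rate would be tens of TB/s:
raw_rate = crossing_rate_hz * event_size_bytes       # 4e13 B/s = 40 TB/s
# After the trigger, only selected events are recorded:
recorded_rate = trigger_rate_hz * event_size_bytes   # 1e8 B/s = ~100 MB/s

print(raw_rate, recorded_rate)
```

The recorded figure matches the slide's "~100 MBytes/sec"; the untriggered rate is the detector-level torrent the slide labels at the "~PBytes/sec" end of the diagram.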

13 DataGrid Testbed (>40 testbed sites): Dubna, Moscow, RAL, Lund, Lisboa, Santander, Madrid, Valencia, Barcelona, Paris, Berlin, Lyon, Grenoble, Marseille, Brno, Prague, Torino, Milano, BO-CNAF, PD-LNL, Pisa, Roma, Catania, ESRIN, CERN. HEP sites; ESA sites; IPSL, Estec, KNMI

14 Multiple overlapping communities, each with: shared goals, multiple hypotheses; collections of published & private data; analysis, computation, annotation; derived & synthesised data. Supported by common standards & shared infrastructure

15 Life-science Examples

16 Database Growth. PDB content growth; bases: 45,356,382,990

17 Wellcome Trust: Cardiovascular Functional Genomics. Glasgow, Edinburgh, Leicester, Oxford, London, Netherlands. Shared data; public curated data. BRIDGES; IBM. Depends on building & maintaining security, privacy & trust

18 Comparative Functional Genomics: large amounts of data; highly heterogeneous data types and data forms; community highly complex and inter-related; volatile. myGrid Project: Carole Goble, University of Manchester

19 UCSF, UIUC. From Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign

20 DOE X-ray grand challenge: ANL, USC/ISI, NIST, U. Chicago. Tomographic reconstruction; real-time collection; wide-area dissemination; desktop & VR clients with shared controls. Advanced Photon Source: online access to scientific instruments; archival storage. From Steve Tuecke, 12 Oct. 01

21 Home Computers Evaluate AIDS Drugs. Community = 1000s of home computer users; philanthropic computing vendor (Entropia); research group (Scripps). Common goal = advance AIDS research. From Steve Tuecke, 12 Oct. 01


23 Astronomy Examples


25 Global Knowledge Communities driven by Data: e.g., Astronomy. Number & sizes of data sets as of mid-2002, grouped by wavelength; 12-waveband coverage of large areas of the sky. Total about 200 TB of data, doubling every 12 months; largest catalogues near 1B objects. Data and images courtesy Alex Szalay, Johns Hopkins
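The doubling claim compounds quickly; a simple projection from the slide's ~200 TB mid-2002 baseline (an illustration of the stated growth rate, not survey data):

```python
# Doubling every 12 months from a 200 TB baseline (mid-2002, per the slide).
base_tb = 200
for years in range(6):
    print(2002 + years, base_tb * 2**years, "TB")
# Within five doublings the total passes 6 PB: the growth rate,
# not the absolute size, is the data challenge.
```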

26 Sloan Digital Sky Survey Production System. Slide from Ian Foster's SSDBM 03 keynote

27 Supernova Cosmology Requires Complex, Widely Distributed Workflow Management

28 Engineering Examples

29 Whole-system simulations. Landing gear models: braking performance, steering capabilities, traction, dampening capabilities. Wing models: lift capabilities, drag capabilities, responsiveness. Stabilizer models: deflection capabilities, responsiveness. Airframe models. Human models: crew capabilities (accuracy, perception, stamina, reaction times, SOPs). Engine models: thrust performance, reverse thrust performance, responsiveness, fuel consumption. NASA Information Power Grid: coupling all sub-system simulations. Slide from Bill Johnson

30 Mathematicians Solve NUG30. Looking for the solution to the NUG30 quadratic assignment problem; an informal collaboration of mathematicians and computer scientists. Condor-G delivered 3.46E8 CPU seconds in 7 days (peak 1009 processors) at 8 sites in the U.S. and Italy. Solution: 14,5,28,24,1,3,16,15,10,9,21,2,4,29,25,22,13,26,17,30,6,20,19,8,18,7,27,12,11,23. MetaNEOS: Argonne, Iowa, Northwestern, Wisconsin. From Miron Livny, 7 Aug. 01
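The slide's CPU-time figure implies a sustained level of parallelism well below the 1009-processor peak, which is characteristic of opportunistic Condor-G scavenging:

```python
# What "3.46e8 CPU seconds in 7 days" implies about sustained parallelism.
cpu_seconds = 3.46e8
wall_seconds = 7 * 24 * 3600            # 7 days of wall-clock time

avg_processors = cpu_seconds / wall_seconds
print(round(avg_processors))  # ~572 processors on average, vs. the 1009 peak
```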

31 Network for Earthquake Engineering Simulation NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers, & each other On-demand access to experiments, data streams, computing, archives, collaboration NEESgrid: Argonne, Michigan, NCSA, UIUC, USC From Steve Tuecke 12 Oct. 01

32 National Airspace Simulation Environment. NASA Information Power Grid: aircraft, flight paths, airport operations and the environment are combined to get a virtual national airspace (Virtual National Air Space, VNAS). Simulation drivers: GRC engine models; LaRC airframe models, landing gear models; ARC wing models, stabilizer models, human models; FAA ops data, weather data, airline schedule data, digital flight data, radar tracks, terrain data, surface data. 22,000 commercial US flights a day: 50,000 engine runs; 22,000 airframe impact runs; 132,000 landing/take-off gear runs; 48,000 human crew runs; 66,000 stabilizer runs; 44,000 wing runs

33 Data Challenges

34 It's Easy to Forget How Different 2003 is From 1993 (derived from Ian Foster's slide at SSDBM, July 03). Enormous quantities of data: petabytes; for an increasing number of communities the gating step is not collection but analysis. Ubiquitous Internet: >100 million hosts; collaboration & resource sharing the norm; security and trust are crucial issues. Ultra-high-speed networks: >10 Gb/s; global optical networks; bottlenecks: the last kilometre & firewalls. Huge quantities of computing: >100 Top/s; Moore's law gives us all supercomputers; organising their effective use is the challenge. Moore's law everywhere: instruments, detectors, sensors, scanners, …; organising their effective use is the challenge

35 Terabytes vs Petabytes (May 2003, approximately correct):
                      1 Terabyte                 1 Petabyte
RAM time to move      15 minutes                 2 months
1 Gb WAN move time    10 hours ($1,000)          14 months ($1 million)
Disk cost             7 disks = $5,000 (SCSI)    6,800 disks + 490 units + 32 racks = $7 million
Disk power            100 watts                  100 kilowatts
Disk weight           5.6 kg                     33 tonnes
Disk footprint        inside machine             60 m²
See also Distributed Computing Economics, Jim Gray, Microsoft Research, MSR-TR-2003-24
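The WAN rows can be re-derived under an assumed effective throughput. The ~30 MB/s figure below is a hypothetical value, chosen only because it reproduces the slide's numbers (real throughput on a 1 Gb/s link is typically well below line rate); the slide itself does not state it:

```python
# Rough reconstruction of the slide's transfer-time arithmetic
# (cf. Gray, MSR-TR-2003-24).
effective_mb_per_s = 30  # ASSUMED effective WAN throughput, not from the slide

def move_time_hours(num_bytes):
    """Hours to move num_bytes at the assumed effective throughput."""
    return num_bytes / (effective_mb_per_s * 1e6) / 3600

terabyte = 1e12
petabyte = 1e15
print(move_time_hours(terabyte))        # ~9.3 hours, close to the slide's 10 hours
print(move_time_hours(petabyte) / 730)  # ~12.7 months (730 h/month), close to 14
```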

36 The Story so Far. Technology enables Grids, more data & …; Information Grids will dominate. Collaboration essential: combining approaches; combining skills; sharing resources. (Structured) data is the language of collaboration; data access & integration a ubiquitous requirement. Many hard technical challenges: scale, heterogeneity, distribution, dynamic variation; intimate combinations of data and computation, with unpredictable (autonomous) development of both

37 Scientific Data. Opportunities: global production of published data; volume, diversity, combination, analysis, discovery. Challenges: data huggers; meagre metadata; ease of use; optimised integration; dependability. Opportunities: specialised indexing; new data organisation; new algorithms; varied replication; shared annotation; intensive data & computation. Challenges: fundamental principles; approximate matching; multi-scale optimisation; autonomous change; legacy structures; scale and longevity; privacy and mobility

38 UK e-Science

39 From presentation by Tony Hey

40 e-Science Programmes Vision: the UK will lead in the exploitation of e-Infrastructure. New, faster and better research: engineering design, medical diagnosis, decision support, …; e-Business, e-Research, e-Design & e-Decision. Depends on leading e-Infrastructure development & deployment

41 e-Science and SR2002 (2001-4 allocations in brackets):
Research Council        2004-6           2001-4
Medical                 £13.1M           (£8M)
Biological              £10.0M           (£8M)
Environmental           £8.0M            (£7M)
Eng & Phys              £18.0M           (£17M)
HPC                     £2.5M            (£9M)
Core Prog.              £16.2M + £20M    (£15M)
Particle Phys & Astro   £31.6M           (£26M)
Economic & Social       £10.6M           (£3M)
Central Labs            £5.0M            (£5M)

42 National e-Science Centre: HPC(x); National e-Science Institute; international relationships; Engineering Task Force; Grid Support Centre; Architecture Task Force; OGSA-DAI (one of 11 Centre projects); GridNet to support standards work; one of 6 administration projects; training team; 5 application projects; 15 fundamental research projects; EGEE; Globus Alliance

43 NeSI in Edinburgh: National e-Science Centre

44 NeSI Events held in the 2nd Year (1 Aug 2002 to 31 Jul 2003). We have had 86 events (Year 1 figures in brackets): 11 project meetings (4); 11 research meetings (7); 25 workshops (17 + 1); 2 summer schools (0); 15 training sessions (8); 12 outreach events (3); 5 international meetings (1); 5 e-Science management meetings (7) (though the definitions are fuzzy!). >3600 participant days. Suggestions always welcome. Establishing a training team; investing in community building, skill generation & knowledge development

45 NeSI Workshops: space for real work; crossing communities; creativity: new strategies and solutions; written reports. Topics: Scientific Data Mining, Integration and Visualisation; Grid Information Systems; Portals and Portlets; Virtual Observatory as a Data Grid; Imaging, Medical Analysis and Grid Environments; Grid Scheduling; Provenance & Workflow; GeoSciences & Scottish Bioinformatics Forum

46 e-Infrastructure

47 OGSA Infrastructure Architecture. OGSI: interface to Grid infrastructure; compute, data & storage resources. Data-intensive applications for application area X; distributed simulation, analysis & integration technology for application area X; data-intensive users. Virtual integration architecture: generic virtual data access and integration layer; structured data integration; structured data access; structured data (relational, XML, semi-structured); transformation. Services: registry; job submission; data transport; resource usage; banking; brokering; workflow; authorisation

48 Data Access & Integration Services (SOAP/HTTP; service creation; API interactions).
1a. Request to Registry for sources of data about x.
1b. Registry responds with Factory handle.
2a. Request to Factory for access to database.
2b. Factory creates GridDataService (GDS) to manage access.
2c. Factory returns handle of GDS to client.
3a. Client queries GDS with XPath, SQL, etc.
3b. GDS interacts with the database (XML / relational).
3c. Results of query returned to client as XML.
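The three-phase flow on the slide (registry lookup, factory-created service, query) can be sketched as plain objects. This is an illustrative model only: class and method names are hypothetical, not the OGSA-DAI API, and the real interactions are SOAP/HTTP exchanges between Grid services returning XML.

```python
# Hypothetical sketch of the registry -> factory -> Grid Data Service flow.

class Registry:
    def __init__(self, factories):
        self._factories = factories          # data topic -> factory handle

    def find_factory(self, topic):
        return self._factories[topic]        # steps 1a/1b

class Factory:
    def __init__(self, database):
        self._db = database

    def create_gds(self):
        return GridDataService(self._db)     # steps 2a-2c

class GridDataService:
    def __init__(self, database):
        self._db = database

    def query(self, predicate):
        # Steps 3a-3c: evaluate the query against the database and
        # return the results (XML in the real system; a list here).
        return [row for row in self._db if predicate(row)]

# Client side:
db = [{"id": 1, "x": 10}, {"id": 2, "x": 99}]
registry = Registry({"x": Factory(db)})
factory = registry.find_factory("x")         # 1a/1b
gds = factory.create_gds()                   # 2a-2c
print(gds.query(lambda row: row["x"] > 50))  # 3a-3c -> [{'id': 2, 'x': 99}]
```

The design point the slide is making is the indirection: clients never talk to the database directly, only to a service the factory created for them, which is what lets the infrastructure interpose authorisation, transport and transformation.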

49 Future DAI Services? Scientific application coding; scientific insights; Problem Solving Environment; semantic metadata; application code. Components: Data Registry; Data Access & Integration master; client (analyst); XML database (Sx); relational database (Sy); GDS and GDTS services.
1a. Request to Registry for sources of data about x & y.
1b. Registry responds with Factory handle.
2a. Request to Factory for access and integration from resources Sx and Sy.
2b. Factory creates GridDataServices network.
2c. Factory returns handle of GDS to client.
3a. Client submits a sequence of scripts, each with a set of queries to GDS (XPath, SQL, etc.).
3b. Client tells analyst.
3c. Sequences of result sets returned to analyst as formatted binary described in a standard XML notation.

50 Integration is our Focus. Supporting collaboration: bring together disciplines; bring together people engaged in a shared challenge; inject initial energy; invent methods that work. Supporting collaborative research: integrate compute, storage and communications; deliver and sustain an integrated software stack; operate a dependable infrastructure service; integrate multiple data sources; integrate data and computation; integrate experiment with simulation; integrate visualisation and analysis. High-level tools and automation essential; fundamental research as a foundation

51 Take Home Message. Data is a major source of challenges AND an enabler of new science, engineering, medicine, planning, … Information Grids: support for collaboration; support for computation and data grids; structured data is fundamental; integrated strategies & technologies needed. e-Infrastructure is here; more to do, technically & socio-economically. Join in: explore the potential; develop the methods & standards. NeSC would like to help you develop e-Science; we seek suggestions and collaboration
