New ways of exploring environmental data or: Letting do the hard work Jon Blower (ESSC and Reading e-Science Centre)

1 New ways of exploring environmental data or: Letting do the hard work Jon Blower (ESSC and Reading e-Science Centre) jdb@mail.nerc-essc.ac.uk

2 Motivation The environmental sciences are very data-intensive –Satellite data (high resolution, several spectral bands) –Numerical model output data –Raw data -> analysis -> re-analysis –Ensembles –Easy to get up to terabytes of data Data are expensive to produce and are economically valuable –Strong real-time requirement in many cases Need ways to cope with large datasets and make sense of them Computers get faster and disks get bigger –But we can always fill them But our brains stay the same size!

3 Technical barriers Each data provider has its own preferred data format –NetCDF, HDF, HDF5, GRIB, PP, GeoTIFF, more –and there are many varieties of the above Data exist on a variety of grids –Latitude-longitude –Rotated-pole –Tri-polar –Or might not be on a grid at all (spectral format) Data providers choose different naming conventions –e.g. “temperature”, “temp”, “T” This makes even simple tasks hard –users should not have to care about any of these details
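One common way to hide naming differences from users is a lookup table that maps each provider's variable name onto a single shared name. The sketch below is illustrative only: the alias table and the function `to_standard_name` are hypothetical, and `sea_water_temperature` is used simply as an example of a shared name in the style of the CF conventions.

```python
# Hypothetical alias table: provider-specific variable names -> one shared name.
# The entries are examples only, taken from the names quoted on the slide.
ALIASES = {
    "temperature": "sea_water_temperature",
    "temp": "sea_water_temperature",
    "t": "sea_water_temperature",
}


def to_standard_name(provider_name: str) -> str:
    """Return the shared name for a provider's variable name (case-insensitive)."""
    key = provider_name.strip().lower()
    if key not in ALIASES:
        raise KeyError(f"no standard name known for {provider_name!r}")
    return ALIASES[key]
```

With a table like this, a user asks for one name and never sees whether the provider called the variable "temperature", "temp" or "T".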

4 Solutions Expose data using standard interfaces –irrespective of how data are ultimately stored –Defining these interfaces is a community effort Provide simple tools for simple tasks –e.g. simple Web interface Use distributed computing to work with very large datasets –more of this later…

5 GADS Grid Access Data Service GADS is a software library for accessing gridded data Hides details of storage from users –users don’t have to know internal data formats or naming conventions Uses standard names Can make queries about data … –e.g. “what variables are there in dataset X?” … and get data subsets [Diagram labels: DATA; METADATA; GADS library; Applications]
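The two kinds of request described above (asking what a dataset contains, and extracting a subset) can be sketched with a toy in-memory catalogue. This is not the GADS API: the class `ToyGridCatalogue`, its methods and the `Grid` type are all hypothetical, and a real service would read from files rather than from Python lists.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Grid:
    """A small latitude-longitude grid; values[i][j] sits at (lats[i], lons[j])."""
    lats: List[float]
    lons: List[float]
    values: List[List[float]]


class ToyGridCatalogue:
    """Hypothetical stand-in for a data service that hides storage details."""

    def __init__(self) -> None:
        self._datasets: Dict[str, Dict[str, Grid]] = {}

    def add(self, dataset: str, variable: str, grid: Grid) -> None:
        self._datasets.setdefault(dataset, {})[variable] = grid

    def list_variables(self, dataset: str) -> List[str]:
        """Answer the query: what variables are there in this dataset?"""
        return sorted(self._datasets[dataset])

    def subset(
        self,
        dataset: str,
        variable: str,
        lat_range: Tuple[float, float],
        lon_range: Tuple[float, float],
    ) -> Grid:
        """Return only the grid points that fall inside the requested box."""
        grid = self._datasets[dataset][variable]
        rows = [i for i, lat in enumerate(grid.lats)
                if lat_range[0] <= lat <= lat_range[1]]
        cols = [j for j, lon in enumerate(grid.lons)
                if lon_range[0] <= lon <= lon_range[1]]
        return Grid(
            lats=[grid.lats[i] for i in rows],
            lons=[grid.lons[j] for j in cols],
            values=[[grid.values[i][j] for j in cols] for i in rows],
        )
```

The caller works only with dataset names, variable names and coordinate ranges; file formats and internal naming never appear in the interface.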

6 GODIVA web portal The GODIVA Web portal provides a graphical interface to data at ESSC Uses GADS to query and extract data sets Users can make simple visualisations –pictures and movies

7 GADS as a Web Service Web Services are a standard way of building distributed systems “Black box” subroutines that are executed over the Internet Platform-independent –strong interoperability GADS has a Web Service interface Means that external applications can use the GADS routines at ESSC [Diagram labels: DATA; METADATA; GADS library; WS interface; External applications]

8 GADS application: Search and Rescue British Maritime Technology produce software (SARIS) to help the Coastguard with Search and Rescue Predicts drift patterns of people and objects that have fallen overboard –This significantly cuts the time to rescue Have worked with BMT to produce prototype that uses live Met Office data from GADS to improve its predictions –Uses forecasts of surface winds and surface currents Can also be applied to oil spills

9 Geographical Information Systems (GIS) Many companies produce GIS software for manipulating and visualizing geographical data –e.g. ArcInfo, Maptitude, many more –Big business! Very sophisticated and powerful –Spatial statistics, geoprocessing, mapping… –e.g. identify high-risk flood zones, assess effectiveness of ambulance centres Historically very map-oriented (2-d or “2.5d”) –Hence not so useful in ocean/atmosphere sciences (need 4-d) Vendors typically used proprietary formats and interfaces –Users “locked in” to a particular vendor, hard to share information The Open Geospatial Consortium is addressing these issues

10 OGC Web Services
Web Service: Purpose
Web Map Service (WMS): Serves map images (cf. Streetmap, Multimap)
Web Feature Service (WFS): Serves geographical features (roads, rivers, hospital locations etc.)
Web Coverage Service (WCS): Serves multidimensional data (e.g. numerical model output)
Web Processing Service (WPS): Processes data
Lots more! (Services are listed roughly in decreasing order of maturity.)
Services can be composed to create a distributed geospatial application
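To give a feel for how simple these interfaces are to call, the sketch below builds a WMS 1.1.1 `GetMap` request, which is an ordinary HTTP GET URL with a fixed set of query parameters. The function name `wms_getmap_url`, the server address and the layer name in the usage note are hypothetical; the parameter names are those defined by the WMS 1.1.1 specification.

```python
from urllib.parse import urlencode


def wms_getmap_url(base_url, layer, bbox, width, height):
    """Build a WMS 1.1.1 GetMap URL for one layer in latitude-longitude (EPSG:4326).

    bbox is (min_lon, min_lat, max_lon, max_lat), the axis order used by WMS 1.1.1.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty string asks the server for its default style
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return base_url + "?" + urlencode(params)
```

For example, `wms_getmap_url("http://example.org/wms", "sst", (-180, -90, 180, 90), 512, 256)` yields a URL that any WMS-compliant server hosting a layer called `sst` could answer with a PNG image of the whole globe.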

11 NERC Data Grid (NDG) NERC e-Science project led by BADC Will provide software for discovery and delivery of data Data will be distributed between NDG and other groups (NDG won’t hold everything) Vast diversity of data types (all NERC data!) Rigidly standards-based (ISO) –Metadata is all-important: enables data discovery –Have created CSML (Climate Science Markup Language) – describes 7 feature types Producing whole array of OGC-compliant Web Services –Key task is to add proper security http://ndg.nerc.ac.uk/

12 Some CSML features: ProfileFeature, GridFeature, ProfileSeriesFeature

13 NDG: data extractor and GeoSPLaT

14 Other uses of OGC Web Services DEWS project (Delivering Environmental Web Services) –Deliver Met Office data to end users in marine and health sectors –Marine applications: Search and rescue –Health application: Chronic Obstructive Pulmonary Disease (COPD) prediction –Re-engineering GADS to be WCS-compliant –Using NDG security layer –Will hopefully influence Met Office’s data provision in future GDEVIL project (Data Assimilation Research Centre) –In conjunction with RSI (makers of ENVI and IDL) –Made WCS server and client software for extracting and visualizing large datasets

15 The story so far: summary We can look forward to much easier access to data –Allows more end-users (e.g. industry) to get data in real time and at lower cost Data providers will work with the same OGC standards Web Services are a key technology NERC, Met Office, ECMWF data (and more) will be available to you through the NERC DataGrid Still lots of work to do –e.g. descriptions of community-specific datasets

16 The next generation…

17 Google Maps Web-based “widget” for viewing map data –or any images in fact Like Streetmap, Multimap etc but much slicker –draggable map –fast response time Can mark locations

18 Google Earth “Mapping for the masses” –According to Nature Desktop application (Windows and Mac) for displaying geographical data –Satellite images –Earthquake locations –Live data! All on a 3-D spinning globe Can view data at all scales Very easy to incorporate new data –easy as writing a simple Web page

19 Example of a KML file
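A KML file that marks a single location can be very short. The sketch below is not the file shown on the slide: it is a minimal illustration that builds one `Placemark` as a string. The function name `placemark_kml` is hypothetical, and the namespace used is the one from the later OGC KML 2.2 standard, which is an assumption about the version a reader would target today.

```python
from xml.sax.saxutils import escape


def placemark_kml(name, lon, lat, description=""):
    """Return a minimal KML document containing one point Placemark.

    KML writes coordinates as longitude,latitude,altitude.
    """
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        "  <Placemark>\n"
        f"    <name>{escape(name)}</name>\n"
        f"    <description>{escape(description)}</description>\n"
        "    <Point>\n"
        f"      <coordinates>{lon},{lat},0</coordinates>\n"
        "    </Point>\n"
        "  </Placemark>\n"
        "</kml>\n"
    )
```

Saving the returned string with a `.kml` extension and opening it in Google Earth should place one labelled marker on the globe.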

20 How it renders

21 More examples of Google Earth data Post-Katrina satellite images Sea ice cover and ice velocity Locations of ARGO floats Bird flu outbreaks

22 Google Maps vs Google Earth
Google Maps | Google Earth
Web-based, works on any modern browser (with JavaScript) | Standalone application, Windows and Mac only
Only two layers of pictures per map (base plus overlay) | As many layers of pictures as you like
Some specialist knowledge required to incorporate your own data | Easy to distribute new data via the Web (just write a KML file) or incorporate data from local disk
Relatively feature-poor | Feature-rich
Code has been released to public | Closed-source (black box)
Both load data from servers on the fly
Neither deals with animations very well (if at all)

23 “GODIVA Two” (currently under development) Near-instantaneous previews of data Draggable Google Map for easy navigation Adjustable scale links to Google Earth Now we really are exploring data! An AJAX application (all donkey work is still done by GADS)

24 What can be done with Godiva2? Search through data very quickly using the Web interface Pick your own scale range –crude identification of isotherms Having identified data, explore further in Google Earth –Incorporate multiple data sources into GE –Overlay a lat-lon grid –Measure the size of features –much more! Download data into your application of choice (IDL, Matlab) Future modifications to Godiva2: –Other slices through data e.g. xt (Hovmuller) –Movies –Collaborative GE? –Simple data processing e.g. statistical calculations
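Picking a scale range amounts to choosing which data values map onto the ends of the colour palette. A minimal sketch of that mapping follows, assuming a 256-colour palette and a simple linear scale; the function `colour_index` is hypothetical and is not taken from Godiva2.

```python
def colour_index(value, scale_min, scale_max, n_colours=256):
    """Map a data value linearly onto a palette index in [0, n_colours - 1].

    Values outside the chosen scale range are clamped to the end colours.
    """
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    fraction = (value - scale_min) / (scale_max - scale_min)
    fraction = min(max(fraction, 0.0), 1.0)  # clamp out-of-range values
    return min(int(fraction * n_colours), n_colours - 1)
```

Narrowing the range around a temperature of interest makes everything colder saturate at one end colour and everything warmer at the other, so the boundary between them traces a crude isotherm.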

25 ESSC Data serving architecture [Diagram labels: DATA; METADATA; GADS library; Web Service interface; Google Maps interface; Google Earth interface; Tomcat Application Server; SARIS; Other apps; Google Maps; Google Earth; SOAP messaging; HTTP GET]

26 Geospatial databases A lot of the above relies on fast access to data in a multi-user environment This is the sort of thing that databases do well But most databases don’t deal well with geospatial data –Some exceptions, e.g. PostGIS –Gridded data is still a problem for most systems We have been evaluating software from Barrodale Computing Services –Very advanced geospatial database that supports gridded data –Versions for PostgreSQL, Informix, Oracle –Demos exist at www.barrodale.com Results are very promising –Faster than our system especially for small data extractions –Caches recently-used data for extra speed But this is commercial software –We have an evaluation version, in return for feeding back requirements

27 “New” methods for data processing

28 Data processing Environmental datasets are typically large and distributed In many cases data processing can be sped up through parallel processing Can also help with problem of dealing with multiple users on a data-intensive website –Website must be responsive Often tasks can be “trivially parallelized” –But even this is often awkward Let’s look at some tools we can use to make this easy

29 Condor Mature technology for scheduling jobs (programs) on ordinary desktop machines –“Cycle stealing” Makes good use of existing resources Ideal for applications where you need to run the same executable lots of times on different data sets –Monte Carlo simulations –Parameter sweeps Can also run MPI jobs Very popular world-wide
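Running the same executable many times on different data is expressed in Condor through a submit description file. The sketch below generates one as text; the helper `condor_submit_description` and the file-name patterns (`input_N.dat`, `run_N.out` and so on) are hypothetical, while the keywords `universe`, `executable`, `arguments`, `output`, `error`, `log` and `queue`, and the `$(Process)` macro, are standard Condor submit syntax.

```python
def condor_submit_description(executable: str, n_jobs: int) -> str:
    """Build a submit description that queues n_jobs runs of one executable."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    lines = [
        "universe   = vanilla",
        f"executable = {executable}",
        # $(Process) expands to 0, 1, ..., n_jobs - 1, one value per queued job,
        # so each job reads its own input file and writes its own output files.
        "arguments  = input_$(Process).dat",
        "output     = run_$(Process).out",
        "error      = run_$(Process).err",
        "log        = sweep.log",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"
```

The resulting text would be saved to a file and handed to `condor_submit`, after which Condor decides when and on which idle machine each job runs.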

30 Condor application: TRACK TRACK identifies and tracks storms in numerical model output –Identifies pressure lows and vorticity highs Use Condor to run TRACK over large numbers of datasets –Datasets are downloaded from the Internet on-demand Then produce statistics and diagnostics using the results –Tells us about the predictability of storms Web interface Lizzie Froude and Kevin Hodges

31 BOINC Berkeley Open Infrastructure for Network Computing Used by ClimatePrediction.net and SETI@home Run code on volunteer computers (i.e. home computers) –In background or as a screensaver –Windows, Linux, Mac OS X Each computer downloads a chunk of data to process –In CP.net, each computer runs a simulation of evolution of Earth’s climate Then uploads results Volunteers join BOINC, then decide which projects they want to be involved in Have to deal with users dropping out –Also some volunteers have been known to tamper with results Some users use CP.net running speed for bragging about their computers!

32 ClimatePrediction.net on the BBC

33 Distributed Parallel Processing Environment for Java (DPPEJ) Run jobs in parallel by creating a number of Java threads Each thread runs on a different machine Easy to get started –If you’re a Java programmer Test case: search through 250 OCCAM ¼ degree ocean data files (5 GB total) looking for files that contain extreme temperatures –No point in using more than 4 machines for this task –Limited by disk access speed [Figure: time against number of threads, with 4 threads marked]
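The shape of the test case (one independent check per file, shared out among a small number of workers) can be sketched with an ordinary thread pool. This is not DPPEJ: the threads here all run on one machine, the "files" are in-memory lists standing in for data files, and the names `has_extreme` and `files_with_extremes` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def has_extreme(temperatures, threshold):
    """Return True if any temperature in one file exceeds the threshold."""
    return any(t > threshold for t in temperatures)


def files_with_extremes(files, threshold, workers=4):
    """Check every file in parallel and return the names that contain extremes.

    files maps a file name to its list of temperatures (a stand-in for real I/O).
    """
    names = list(files)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input names.
        flags = list(pool.map(lambda n: has_extreme(files[n], threshold), names))
    return [name for name, flag in zip(names, flags) if flag]
```

Because each file is checked independently, adding workers helps only until something shared becomes the bottleneck, which in the test case above was disk access speed.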

34 MapReduce Google have written papers on how they do some of their distributed computing –All done on clusters of commodity machines –Have to take into account machine failures A key concept is the “Map-Reduce” programming model –One routine maps input data to intermediate output –Another routine reduces this to a final result E.g. Map names of data files to locations of storms contained therein Then plot these data on a single plot (reduce) Open source implementation of this programming model in Java (Hadoop) Programmers don’t have to worry about details of parallelization and fault tolerance –Just write a Map function and a Reduce function
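The programming model itself fits in a few lines. The sketch below runs sequentially on one machine, so it shows only the division of labour between the two user-written functions, not the parallelization or fault tolerance that a framework such as Hadoop adds. All names (`map_reduce`, `storms_by_hemisphere`, `count`) and the sample storm data are hypothetical.

```python
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Apply map_fn to every input, group the emitted values by key, then reduce."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def storms_by_hemisphere(record):
    """Map: one (file name, storm locations) record -> (hemisphere, 1) pairs."""
    _file_name, locations = record
    for lat, _lon in locations:
        yield ("north" if lat >= 0 else "south", 1)


def count(_key, values):
    """Reduce: total the counts emitted for one key."""
    return sum(values)
```

Calling `map_reduce(records, storms_by_hemisphere, count)` on a list of `(file name, [(lat, lon), ...])` records gives the number of storms found in each hemisphere; the programmer writes only the map and reduce functions.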

35 Parallel processing tools: summary Condor –uses spare power of desktop machines –for running a program lots of times –run compiled executables – can write in any language –not real-time (jobs might not run immediately) Many other systems –Sun GridEngine, PBS, etc (often installed with clusters) BOINC (also World Community Grid and others) –Potentially lots of computers involved –Issue of trust in results –Good way to reach general public DPPEJ, MapReduce –Must program in Java, but easy if you know how –Idea is to reduce development time –MapReduce has fault-tolerance –Would probably sit behind a website like Godiva2 – most scientists wouldn’t use these directly

36 What resources are available? ESSC Condor pool Reading Campus Grid –Currently a Condor pool in Computer Science Dept –Will incorporate other resources in future (e.g. library machines, clusters) National Grid Service –2000 processors, and over 36TB –CPUs heavily used, data capacity under-used OxGrid (in future) –Intend to connect this to RCG In ideal world all these would be linked –You would then submit jobs via a single portal –this is Grid computing!

37 Where do we go from here?

38 Environmental e-Science toolkit The Reading e-Science Centre is building a “toolkit” for environmental e-Science Will incorporate many of the ideas we have seen today –Fast web access to data (“Godiva2”) –Google Maps and Google Earth interfaces –Parallel data processing at back end (for common processing tasks) –Perhaps IDL/Matlab/CDAT interfaces to the same back-end –Fast searches through data Easy access to resources such as the National Grid Service, Reading Campus Grid We will work closely with the NERC DataGrid Please tell us what you would like!

39 Stuff that you can do now Think about exposing your data through Google Earth –Easy to do –Reaches a wide range of people including the public –Great for demos –Useful for teaching? Think about what you could achieve if you had more processing power –And easy access to it If you are a data provider, look at the OGC standards and seriously consider using them Talk to us (resc@rdg.ac.uk)! –I would especially like to hear about real science use cases

40 Thank you

