NASA Earth Exchange: Improving access to large-scale data and computational infrastructure ACCESS-11-0034 Annual Review August 20, 2013 Ramakrishna Nemani,

1 NASA Earth Exchange: Improving access to large-scale data and computational infrastructure ACCESS Annual Review August 20, 2013 Ramakrishna Nemani, Petr Votava, Andrew Michaelis, Hirofumi Hashimoto, Forrest Melton

2 Background: NASA Earth Exchange
Vision: To engage and enable the Earth science community to address global Earth science challenges. NEX is a collaborative compute platform that improves the availability of Earth science data, models, analysis tools and scientific results through a centralized environment that fosters knowledge sharing, collaboration, innovation and direct access to compute resources. Engage: network, share and collaborate; discuss and formulate new ideas; portal and Virtual Institute. Enable: access to data; access to computing; access to knowledge.

3 NEX Infrastructure View

4 Outline
Project background. Updated quad chart. Review of schedule and milestones. Description of work accomplished and results. Technical reports and presentations. Discussion of activity for the next 6 months. Schedule and budget summary.

5 Project Background
The main focus of the project is on supporting the NEX community by continuously improving access to data, tools, computing and knowledge. By improving these, we can engage more users and teams and provide them with better and faster support. We need to be able to respond quickly to new requirements. Focus on knowledge acquisition and access. We can also help our users to significantly scale their projects.

6 NASA Earth Exchange: Improving access to large-scale data and computational infrastructure
PI: Ramakrishna Nemani, Ph.D., NASA Ames Research Center
Goals and Objectives: Enhance access, discovery and integration of data, models and services for the NEX communities. Provide an integrated system view of NEX data, metadata, processing libraries, models and QA information. Provide API and client libraries to NEX tools, datasets and search capabilities. Provide a streamlined way for researchers to share their results with the community.
Architecture Overview: [figure]
Approach: Inventory current NEX datasets, tools and models and engage the community in gathering requirements and use cases. Design a common database schema for existing NEX datasets. Develop an API that facilitates search and access to data, tools and models and use it to implement client libraries. Develop migration and dissemination tools for NEX users.
Key Milestones: Preliminaries completed 07/2012. Data integration completed 11/2012. Process integration completed 01/2013. System interface completed 08/2013. Migration tools completed 01/2014. Client libraries and tools completed 02/2014.
Co-Is/Partners: Petr Votava, Andrew Michaelis, Dr. Hirofumi Hashimoto, Forrest Melton/CSU Monterey Bay
TRLin = 6 08/13

7 Project Schedule

8 Project Goal To enhance access, discovery and integration of data, models and tools for the NEX communities.

9 Objectives for Activity During Review Period
Complete inventory of current NEX data, metadata, tools and libraries. Engage NEX users to gather additional data and tools requirements. Complete initial data integration with the key NEX datasets and the existing infrastructure. Continue rapid prototyping of database access tools based on user requirements. Continue integration of utilities and tools with the NEX system. Prototype integration with the NEX semantic infrastructure.

10 Project Drivers = Why
To directly support large-scale NASA projects such as WELD, NAFD, NCA, MEASURES, CMS, CMAC and projects in applied sciences. Efficiently support the fast-growing NEX community both inside and outside of NASA. Earth science research is a global undertaking and we aim to engage the largest possible community: large global collaboratory; global knowledge pool; need critical mass -> everybody benefits. Support for large-scale science while engaging a large community. Place for community contributions and access to these contributions: knowledge, tools, data, workflows, …

11 NEX User and Project Evolution
Number of active compute/data users at the beginning of this ACCESS project: fewer than 50. Current number of active compute/data users: 158. Largest data requirements at the beginning of this ACCESS project: 10s of TB (per project). Current data requirements: 100s of TB – 1PB+ (per project). On the NEX portal: currently 404 users and 1,252 projects (not all active).

12 ACCESS Project Overview
Data: Provide an integrated view of NEX data and metadata through API, command-line tools and query services. Tools: Provide a mechanism to discover and manage environments for tools and utilities required by different projects, and provide APIs. Knowledge: Cross-reference and provide access to information about datasets, tools, users, projects, publications and other docs. Dissemination: Establish process, policies and infrastructure for dissemination of data produced on NEX. Infrastructure: Components and solutions that enable the above within security and policy constraints.

13 Data Organization
Started with inventory: currently over 450TB on-line and 500+TB near-line. Feedback from summer school 2012 users, summer interns in 2013, and NEX users and PIs. Two rounds of “Query Requirements” with the NEX science team. Two-to-three tier system: primary on-line fast storage, secondary on-line cache, near-line tape accessed through DMF.

14 Query Categories and Requirements
“Standard” queries: temporal, spatial, match region by name, what data are available, … Data provenance: How was the data produced (process/workflow)? What were the inputs into the process? Who created this dataset? Knowledge queries: Which projects work with dataset X? In what geographic region? Which publications are relevant to dataset X? Administration queries: How often is the dataset updated? From where? Analytics queries (not addressed by this project): filter based on internal QA, land cover or statistics; there is a large number of requests for these capabilities.
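A "standard" temporal plus spatial query of the kind listed above can be sketched as follows. This is only an illustration: the `scenes` table, its columns, the sample rows and the `find_scenes` helper are all hypothetical, and the standard-library `sqlite3` module stands in for whatever database backs the real system.

```python
import sqlite3

# Hypothetical schema: one row per scene with acquisition date and bounding box.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE scenes (
           scene_id TEXT PRIMARY KEY,
           acquired TEXT,  -- ISO date, so string comparison orders correctly
           min_lon REAL, min_lat REAL, max_lon REAL, max_lat REAL
       )"""
)
conn.executemany(
    "INSERT INTO scenes VALUES (?, ?, ?, ?, ?, ?)",
    [
        # Made-up sample rows for illustration only.
        ("scene_a", "2010-04-12", -123.0, 36.0, -121.0, 38.0),
        ("scene_b", "2010-10-03", -123.0, 36.0, -121.0, 38.0),
        ("scene_c", "2010-04-20", -60.0, -5.0, -58.0, -3.0),
    ],
)


def find_scenes(conn, start, end, bbox):
    """Return ids of scenes acquired in [start, end] whose box overlaps bbox."""
    min_lon, min_lat, max_lon, max_lat = bbox
    rows = conn.execute(
        """SELECT scene_id FROM scenes
           WHERE acquired BETWEEN ? AND ?
             AND min_lon <= ? AND max_lon >= ?
             AND min_lat <= ? AND max_lat >= ?
           ORDER BY scene_id""",
        # Two boxes overlap when each one starts before the other one ends.
        (start, end, max_lon, min_lon, max_lat, min_lat),
    ).fetchall()
    return [row[0] for row in rows]


result = find_scenes(conn, "2010-04-01", "2010-04-30", (-122.5, 36.5, -121.5, 37.5))
print(result)
```

Storing dates as ISO strings keeps the temporal filter a plain range comparison, and the overlap test needs only the four bounding-box columns.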

15 Data Organization Details
Keep metadata in the original format and naming conventions: researchers are used to the metadata names, and at times extensive documentation exists to describe the metadata. Metadata are processed by custom parsers, which differ by sensor (MODIS, Landsat, NAIP, …). Each dataset is stored in a separate set of tables, and when it is added to NEX a custom plug-in is written that overrides abstract methods from the DB class. This is manageable because the set of datasets is not that large (a few dozen at most), and writing generic code while maintaining the original metadata would take longer. We are experimenting with a semantic layer that describes and maps terms in different DBs to a common taxonomy, but it requires dynamic query rewriting and its suitability for this problem is questionable. The best solution in this case seems to be either fully relational (current) or fully graph-based (future). The implementation needs to be hidden behind an API; however, users at times want access to a full RDBMS, in which case maintaining two consistent copies seems the best answer.
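The plug-in pattern described above (a per-dataset class overriding abstract methods of a shared DB class) might look roughly like this. The class names, method names and table name are assumptions made for illustration; the slides do not show the actual NEX DB class.

```python
from abc import ABC, abstractmethod


class DatasetDB(ABC):
    """Hypothetical stand-in for the DB base class mentioned on the slide."""

    @abstractmethod
    def table_name(self) -> str:
        """Name of the table set that holds this dataset's metadata."""

    @abstractmethod
    def parse_metadata(self, raw: str) -> dict:
        """Turn one raw metadata file into a record."""

    def ingest(self, raw: str) -> tuple:
        # Shared logic lives in the base class; only parsing is dataset-specific.
        return (self.table_name(), self.parse_metadata(raw))


class LandsatPlugin(DatasetDB):
    """Illustrative plug-in for 'KEY = VALUE' style metadata text."""

    def table_name(self) -> str:
        return "landsat_metadata"  # hypothetical table name

    def parse_metadata(self, raw: str) -> dict:
        # Keep the original key names unchanged, as the slide recommends.
        record = {}
        for line in raw.splitlines():
            if "=" in line:
                key, value = line.split("=", 1)
                record[key.strip()] = value.strip().strip('"')
        return record


sample = 'SPACECRAFT_ID = "LANDSAT_5"\nWRS_PATH = 44\n'
table, record = LandsatPlugin().ingest(sample)
print(table, record)
```

Adding another sensor then means writing one more subclass, while `ingest` and any other shared behavior stay in the base class.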

16 Tools/Utilities/Models
Tried a number of approaches. Users often want custom solutions with specific library/tool versions, and managing this quickly gets complicated. Using the “modules” infrastructure to provide custom environments for NEX teams: we can easily mix and match versions as per a team’s requirements; also good for easy reproduction/packaging of environments; will be the basis for the tool contribution setup (nex/contrib). Access to almost all tools through a Python API or through regular command-line invocation: great for integration with the VisTrails workflow management system. Mechanism to query a list of modules to be built or request a new module to be built. Working on adding better search and documentation capabilities. Also exposing documentation externally on the NEX portal.

17 Knowledge Organization
Internal NEX knowledge graph: spans data, content, web portal and tools; provenance; RDF/OWL representation; triple and quad store (MySQL and Virtuoso). Knowledge acquisition: manual = documentation, blogs, etc. (internal and external); automatic = entity extraction from text and metadata using natural language processing (location, datasets used by project, sensors). Build relationships: improves search (who is doing what, where). Who is doing work in the Amazon, and what sensors are they using? What are the sensors most frequently used by NEX projects? Can generate project concepts, so that projects can be easily related to each other (LSI).
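To make the "who is working in the Amazon, and with which sensors" question concrete, here is a minimal sketch of triple-pattern matching over an in-memory set. The predicate names, identifiers and sample triples are invented for illustration; the actual NEX graph uses RDF/OWL in a triple or quad store, which this plain-Python stand-in does not attempt to reproduce.

```python
# Hypothetical (subject, predicate, object) triples.
triples = {
    ("project:p1", "studiesRegion", "region:Amazon"),
    ("project:p1", "usesSensor", "sensor:MODIS"),
    ("project:p2", "studiesRegion", "region:Amazon"),
    ("project:p2", "usesSensor", "sensor:Landsat"),
    ("project:p3", "studiesRegion", "region:California"),
    ("project:p3", "usesSensor", "sensor:Landsat"),
}


def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return {
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }


def sensors_in_region(region):
    """Sensors used by any project that studies the given region."""
    projects = {s for s, _, _ in match(p="studiesRegion", o=region)}
    return sorted({o for s, _, o in match(p="usesSensor") if s in projects})


amazon_sensors = sensors_in_region("region:Amazon")
print(amazon_sensors)
```

The query is a two-step join: first find the projects linked to the region, then collect the sensors linked to those projects.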

18 Relating Entities
[Diagram: entities are extracted from NEX projects, wikis, … (NEX web portal) and from publications (NEX web portal, Harvard Database, …) and linked into the NEX graph data store. Extracted entities link to GCMD concepts, or link to / define new concepts in a NEX extension (additional concepts outside the GCMD hierarchy: data hierarchy, …). The store also links to resources such as external docs (LP DAAC, …), records provenance from running processes, and serves queries.]

19 Example queries
What is the provenance of file X? What is the bounding box of region R? What is the usage of each NASA instrument in NEX projects, sorted by number of projects? What instruments are used by projects doing research in the Amazon? What are the most cited datasets in remote sensing publications? Now that the NEX portal has been migrated to NAS, we can integrate this information with the portal much more easily.
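The first example query, "What is the provenance of file X?", amounts to walking recorded process/input links back to the source files. A minimal sketch, assuming provenance is stored as one record per produced file; the file names, process names and the `trace` helper are hypothetical.

```python
# Hypothetical provenance records: output file -> producing process and inputs.
provenance = {
    "ndvi_monthly.hdf": {
        "process": "composite",
        "inputs": ["ndvi_day1.hdf", "ndvi_day2.hdf"],
    },
    "ndvi_day1.hdf": {"process": "compute_ndvi", "inputs": ["refl_day1.hdf"]},
    "ndvi_day2.hdf": {"process": "compute_ndvi", "inputs": ["refl_day2.hdf"]},
}


def trace(filename, records, depth=0):
    """Return (depth, file, process) rows, walking inputs depth-first."""
    record = records.get(filename)
    process = record["process"] if record else None  # None marks a source file
    rows = [(depth, filename, process)]
    for parent in (record["inputs"] if record else []):
        rows.extend(trace(parent, records, depth + 1))
    return rows


lineage = trace("ndvi_monthly.hdf", provenance)
for depth, name, process in lineage:
    print("  " * depth + name, "<-", process or "source")
```

This assumes the provenance links form a tree or DAG; a cyclic record set would need a visited set to terminate.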

20 Data Dissemination
A number of faucets: large-scale data distribution (CMIP-5 for NCA); web-services application support (SIMS); open access (Amazon). Focus not only on the mechanics and implementation, but also on the development of protocols and policies, which is often more time-consuming than implementation.

21 CMIP-5 Dissemination
Downscaled climate dataset produced on NEX (17TB); important and highly requested by the community. First process for NEX data -> NASA distribution facility. Established DOI minting capabilities (through UC Digital Library). Established a technique for DOI dataset verification through checksums, without extensive web services, even when the underlying naming changes. Data available at: [link not preserved in transcript], and internally on NEX. Data had to be aggregated and reformatted for use by NCCS. This raises issues of verification against the original datasets, as well as the fact that there are effectively two copies of the data in different formats. This needed extensive work with users and produced many lessons learned; the result is an updated protocol with NCCS, but the protocol will differ between facilities.
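One way checksum-based verification can survive renaming is to key the published manifest on content digests rather than file names. The sketch below assumes that approach; it is not the documented NEX/NCCS procedure, and the function names and sample file names are hypothetical. In-memory byte strings stand in for file contents to keep the example self-contained.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex digest of the content only; the file name plays no part."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(files: dict) -> set:
    """Set of content digests published alongside the dataset DOI."""
    return {sha256_of(content) for content in files.values()}


def verify(manifest: set, local_files: dict) -> dict:
    """Map each local file name to True if its content is in the manifest."""
    return {
        name: sha256_of(content) in manifest
        for name, content in local_files.items()
    }


# Made-up names and bytes for illustration.
published = {"tasmax_v1.nc": b"original bytes"}
manifest = build_manifest(published)

downloaded = {
    "tasmax_renamed.nc": b"original bytes",  # renamed, content unchanged
    "tasmax_damaged.nc": b"original bytez",  # content altered
}
report = verify(manifest, downloaded)
print(report)
```

A digest-only manifest cannot match files that were aggregated or reformatted, so each distributed format would need its own manifest.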

22 NASA Satellite Irrigation Management Support (SIMS)
ACCESS software infrastructure directly supports the SIMS project (NASA Applied Sciences). Builds partially on efforts from the last ACCESS project. Provides access to near-real-time Landsat data time series through a data cube interface. The goal of the SIMS project is to develop new information products from satellite data to support growers in optimizing irrigation. Currently tested by 12 partner growers. Data visualization and queries via web services built on OPeNDAP. Both web-based and mobile interfaces.

23 [Figure panels: crop condition, % cover, crop coefficient, crop water requirement]
An example of the SIMS web / mobile data interface, which is designed to enhance grower access to satellite-derived measures of crop condition and crop water requirements across 3.7 million ha of irrigated land in California.

24 Amazon Web Services Space Act Agreement
Prototype process for providing access to NEX data through public cloud facilities. Open access to data and workflows. We are reaching capacity on NEX and have restrictions on access. Different cost model: billing for computing is under users’ control. We can add complete virtual machines with packaged environments and workflows developed and managed on NEX and accessible through the NEX web portal. Prototyping effort includes NCA-related activities: NCA downscaled data (CMIP-5). NEX portal linked with Amazon Web Services (open) or internal (NEX members only) NEX work environment.

25 Infrastructure
Database setup: access to database systems from all NEX components; mostly MySQL-based, experimenting with Virtuoso, Neo4j and revisiting MongoDB. Supercomputing setup: work with the NAS system group to enable access even from within the Pleiades supercomputer; needed for easier streaming of provenance information. Applications support: separate OPeNDAP, THREDDS and FTP servers. Security considerations: moderate system = 2-factor authentication required; waiver for the NEX portal for OpenID and NDC users; one of the drivers for testing public cloud solutions to improve access.

26 Immediate Benefits for Many NEX Projects (Examples)
Web-Enabled Landsat Data (WELD): acquisition, organization and access to data and processing capabilities for monthly Landsat vegetation composites; 800+TB total data requirements. North America Forest Dynamics (NAFD): acquisition, organization and access to data, QA, metadata and processing capabilities for Landsat (80TB). BIOCLIM: acquisition and organization of global MODIS land and atmospheric products, including swath mapping to acquisition regions (15 TB).

27 Web Enabled Landsat Data: Going Global, Roy et al.,
Creating Global Monthly Landsat Composites, Present. Takes over 10,000 scenes each month using the WELD system. [Figure panels: April 2010; October 2010]

28 North American Forest Disturbance (NAFD, Goward et al.,)
Expanding from 23 samples to Wall-to-wall coverage Processing scenes from on NEX

29 NEX Software View - Current

30 NEX Software View – Overall Goal
Currently prototyping NASA Cloud/AWS/OpenStack implementation/…

31 Instantiate

32 Snapshots
[Screenshots of launching an instance: Instance Type, START, Instantiate, then a status monitor showing Booting, IP Ready…, INSTANCE READY]

33 Summary of Activity During Review Period (1)
Inventory of NEX tools and datasets: started with 25 existing datasets on NEX comprising about 300TB of data. Worked with NEX users to better understand how they use the data, metadata and QA information, which tools and utilities they use the most, and what functionality is missing from the existing tools and utilities. We have prototyped the database access for a number of use cases, and some parts of it are already being used by NEX science teams. As the science teams have developed highly sought-after downscaled climate datasets, we have prototyped a process through which the data are distributed by NASA’s NCCS facility.

34 Summary of Activity During Review Period (2)
Set up an initial NEX-wide repository based on the “module” utilities, which enables us to customize environments for specific users’ needs in terms of tool/software versions and dependencies. Started to integrate some of the tools and utilities for data manipulation with the NEX semantic infrastructure, and prototyped an end-to-end process of semantic data and process integration with MODIS climatology processes that also includes provenance capture. Worked closely with several NEX projects to establish the initial NEX database and tools API, which is currently in use mainly for access to Landsat and MODIS data and metadata for both gridded and swath datasets.

35 Summary of Activity During Review Period (3)
Added a new metadata collection capability for some datasets that enables us to better estimate future data requirements and provides users with additional information, mainly for QA screening purposes. Prototyped an automated process through which users can submit requests for data, tools and models to be included on NEX using PivotalTracker.

36 Papers and Presentations
“NASA Earth Exchange (NEX): Earth science collaborative for global change science”, presented at IGARSS 2012. “NASA Earth Exchange (NEX)”, presented at Supercomputing 2012. “Connecting Provenance and Semantic Descriptions in NASA Earth Exchange (NEX)”, presented at AGU 2012.

37 ESDSWG Participation
Participated at the 2012 ESDSWG meeting. Participated in the Semantics Working Group until it was dissolved. Currently participate in the Cloud Computing Working Group. Plan to attend the 2013 ESDSWG meeting and expand participation to the Earth Science Collaboratory WG.

38 Relationship to other funded activities
AIST: facilitates access to tools and knowledge through API for workflow integration. CMAC (Data Mining): facilitates access to data and pre-processing tools. CMAC (Recommendations): facilitates access to tools through workflows. National Climate Assessment (NCA): facilitates the process for NEX-produced data distribution for NCA. BIOCLIM: facilitates access to tools, data and libraries for several BIOCLIM projects.

39 Relationship to NEX
Provides the foundation for user/project work environments. Provides access to metadata for integration with the NEX knowledge system. Provides the overarching metadata architecture for data and processes integrated through a semantic layer.

40 Project Schedule

41 Summary of Work During Next Review Period (through 2/14)
Continuous integration of tools and utilities with the NEX infrastructure based on users’ requirements. Continuous integration of data with the NEX infrastructure based on users’ requirements. Continue to work on the data and process interface (API): the initial API is in Python, but we are also working with users on access to data and tools through R and MATLAB; the extent of this will be driven by user requirements. Work with users to continue integration of documentation, FAQs and code samples for the tools and datasets so that they are available both on the computing platform and on the NEX web portal.

42 Cumulative Budget (3/2012 – 8/2013)
FY12: $141,750. All funds have been obligated. FY13: $145,200. Does it match your numbers?

43 Glossary
API: Application Programming Interface
BIOCLIM: Climate and Biological Response: Research and Applications
CMAC: Computational Modeling Algorithms and Cyberinfrastructure
CMS: Carbon Monitoring System
DMF: Data Migration Facility
HEC: High-End Computing
HPC: High-Performance Computing
NAFD: North American Forest Disturbance
NCCS: NASA Center for Climate Simulations
NEX: NASA Earth Exchange
OWL: Web Ontology Language
RDF: Resource Description Framework
SIMS: Satellite Irrigation Management Support

