1 NASA Earth Exchange: Improving access to large-scale data and computational infrastructure ACCESS-11-0034 Annual Review August 20, 2013 Ramakrishna Nemani, Petr Votava, Andrew Michaelis, Hirofumi Hashimoto, Forrest Melton
2 Background: NASA Earth Exchange Vision: To engage and enable the Earth science community to address global Earth science challenges. NEX is a collaborative compute platform that improves the availability of Earth science data, models, analysis tools and scientific results through a centralized environment that fosters knowledge sharing, collaboration, innovation and direct access to compute resources. Engage: Network, share and collaborate; discuss and formulate new ideas; portal, Virtual Institute. Enable: Access to data; access to computing; access to knowledge.
3 NEX Infrastructure View
4 Outline Project background Updated quad chart Review of schedule and milestones Description of work accomplished and results Technical reports and presentations Discussion of next 6 month activity Schedule and budget summary
5 Project Background The main focus of the project is supporting the NEX community by continuously improving access to data, tools, computing and knowledge. Improving these lets us engage more users and teams and provide them with better and faster support -Need to be able to respond quickly to new requirements -Focus on knowledge acquisition and access We can also help our users to significantly scale their projects
6 NASA Earth Exchange: Improving access to large-scale data and computational infrastructure
PI: Ramakrishna Nemani Ph.D., NASA Ames Research Center
Co-Is/Partners: Petr Votava, Andrew Michaelis, Dr. Hirofumi Hashimoto, Forrest Melton/CSU Monterey Bay
Goals and Objectives
-Enhance access, discovery and integration of data, models and services for the NEX communities
-Provide integrated system view of NEX data, metadata, processing libraries, models and QA information
-Provide API and client libraries to NEX tools, datasets and search capabilities
-Provide streamlined way for researchers to share their results with the community
Approach
-Inventory current NEX datasets, tools and models and engage the community in gathering requirements and use cases
-Design a common database schema for existing NEX datasets
-Develop API that facilitates search and access to data, tools and models and use it to implement client libraries
-Develop migration and dissemination tools for NEX users
Key Milestones
-Preliminaries completed 07/2012
-Data integration completed 11/2012
-Process integration completed 01/2013
-System interface completed 08/2013
-Migration tools completed 01/2014
-Client libraries and tools completed 02/ /13
Architecture Overview
TRL in = 6
7 Project Schedule
8 Project Goal To enhance access, discovery and integration of data, models and tools for the NEX communities.
9 Objectives for Activity During Review Period Complete inventory of current NEX data, metadata, tools and libraries Engage NEX users to gather additional data and tools requirements Complete initial data integration with the key NEX datasets and the existing infrastructure Continue rapid prototyping of database access tools based on user requirements. Continue integration of utilities and tools with NEX system. Prototype integration with NEX semantic infrastructure.
10 Project Drivers = Why 1. To directly support large-scale NASA projects such as WELD, NAFD, NCA, MEASURES, CMS, CMAC and projects in applied sciences 2. To efficiently support the fast-growing NEX community both inside and outside of NASA -Earth science research is a global undertaking and we aim to engage the largest possible community -Large global collaboratory; global knowledge pool; need critical mass -> everybody benefits -Support for large-scale science while engaging a large community 3. To provide a place for community contributions and access to these contributions: -Knowledge, tools, data, workflows, …
11 NEX User and Project Evolution Number of active compute/data users at the beginning of this ACCESS project: less than 50 Current number of active compute/data users: 158 Largest data requirements at the beginning of this ACCESS project: 10s of TB (per project) Current data requirements: 100s of TB – 1PB+ (per project) On the NEX portal – currently 404 users and 1,252 projects (not all active)
12 ACCESS Project Overview Data: Provide integrated view of NEX data and metadata through API, command-line tools and query services. Knowledge: Cross-reference and provide access to information about datasets, tools, users, projects, publications and other docs. Dissemination: Establish process, policies and infrastructure for dissemination of data produced on NEX. Tools: Provide mechanism to discover and manage environments for tools and utilities required by different projects and provide APIs. Infrastructure: Components and solutions that enable the above within security and policy constraints.
13 Data Organization Started with inventory Currently over 450TB on-line and 500+TB near-line Feedback from summer school 2012 users, summer interns in 2013 and NEX users and PIs Two rounds of “Query Requirements” with the NEX science team Two-to-three tier system -Primary on-line fast storage, secondary on-line cache, near-line tape accessed through DMF
14 Query Categories and Requirements “Standard” queries -Temporal, spatial, match region by name, what data are available, … Data provenance -How was data produced (process/workflow)? -What were the inputs into the process? -Who created this dataset? Knowledge queries -Which projects work with dataset X? In what geographic region? -Which publications are relevant to dataset X? Administration queries -How often is the dataset updated? From where? Analytics queries (not addressed by this project) -Filter based on internal QA, Landcover or statistics -Large number of requests for these capabilities
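The "standard" temporal and spatial queries listed on this slide can be illustrated with a minimal sketch. It assumes a hypothetical `scenes` table with a bounding box per scene; the table name, columns and sample rows are invented for illustration and are not the NEX schema. SQLite stands in for the MySQL database mentioned elsewhere in the deck so that the example is self-contained.

```python
import sqlite3


def build_catalog():
    """Create a tiny in-memory catalog. Schema and rows are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE scenes (
               scene_id TEXT PRIMARY KEY,
               sensor   TEXT,
               acquired TEXT,  -- ISO 8601 date, so string comparison orders correctly
               west REAL, south REAL, east REAL, north REAL)"""
    )
    rows = [
        ("S1", "Landsat", "2010-04-12", -122.5, 36.0, -120.0, 38.0),
        ("S2", "Landsat", "2010-10-03", -122.5, 36.0, -120.0, 38.0),
        ("S3", "MODIS", "2010-04-15", -70.0, -10.0, -60.0, 0.0),
    ]
    conn.executemany("INSERT INTO scenes VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


def find_scenes(conn, start, end, west, south, east, north):
    """Return ids of scenes acquired in [start, end] whose box overlaps the query box."""
    cur = conn.execute(
        """SELECT scene_id FROM scenes
           WHERE acquired BETWEEN ? AND ?
             AND west <= ? AND east >= ?
             AND south <= ? AND north >= ?
           ORDER BY scene_id""",
        # Two boxes overlap when each one starts before the other ends, per axis.
        (start, end, east, west, north, south),
    )
    return [row[0] for row in cur.fetchall()]


if __name__ == "__main__":
    catalog = build_catalog()
    print(find_scenes(catalog, "2010-04-01", "2010-04-30", -123.0, 35.0, -121.0, 37.0))
```

Matching a region by name would reduce to the same query once the name has been resolved to a bounding box.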
15 Data Organization Details Keep metadata in the original format/naming conventions -Researchers are used to the metadata names -At times extensive documentation exists to describe the metadata Metadata are processed by custom parsers -Different for different sensors (MODIS, Landsat, NAIP, …) Each dataset is stored in a separate set of tables, and when it is added to NEX a custom plug-in is written -Overrides abstract methods from the DB class -This is manageable because the set of datasets is not that large (a few dozen at most), and writing generic code while maintaining the original metadata would take longer -We are experimenting with a semantic layer that describes and maps terms in different DBs to a common taxonomy, but it requires dynamic query rewriting and its suitability for this problem is questionable -The best solution seems to be either fully relational (current) or fully graph-based (future). The implementation needs to be hidden behind an API; however, users at times want access to a full RDBMS, in which case maintaining two consistent copies seems the best answer.
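The plug-in approach described on this slide (one custom class per dataset, overriding abstract methods of a shared DB class) might look roughly like the sketch below. The class names, method names and table name are hypothetical; the slide does not show the actual NEX base class. The parser keeps the original metadata names, as the slide requires, and assumes simple `KEY = VALUE` lines such as those found in Landsat metadata files.

```python
from abc import ABC, abstractmethod


class DatasetDB(ABC):
    """Hypothetical base class standing in for the 'DB class' named on the slide."""

    @abstractmethod
    def table_name(self):
        """Name of the dataset-specific table."""

    @abstractmethod
    def parse_metadata(self, raw):
        """Turn raw sensor metadata text into a dict keyed by the original names."""

    def ingest(self, raw):
        # Shared logic lives in the base class; plug-ins only supply the
        # dataset-specific pieces above.
        return {"table": self.table_name(), "record": self.parse_metadata(raw)}


class LandsatPlugin(DatasetDB):
    """Illustrative plug-in for one dataset."""

    def table_name(self):
        return "landsat_metadata"

    def parse_metadata(self, raw):
        record = {}
        for line in raw.splitlines():
            if "=" in line:
                key, value = line.split("=", 1)
                # Keep the original metadata name untouched; only trim whitespace
                # and surrounding quotes from the value.
                record[key.strip()] = value.strip().strip('"')
        return record


if __name__ == "__main__":
    print(LandsatPlugin().ingest("WRS_PATH = 44\nCLOUD_COVER = 12.5"))
```

Because the base class is abstract, a plug-in that forgets to override one of the methods fails at instantiation rather than at query time.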
16 Tools/Utilities/Models Tried a number of approaches -Users often want custom solutions with specific library/tool versions -Managing this quickly gets complicated Using “modules” infrastructure to provide custom environments for NEX teams -We can easily mix and match versions as per team’s requirements -Also good for easy reproduction/packaging of environments -Will be the basis for tool contribution setup (nex/contrib) Access to almost all tools through a Python API or through regular command-line invocation -Great for integration with VisTrails workflow management system Mechanism to query a list of modules to be built or request a new module to be built. Working on adding better search and documentation capabilities -Also, exposing documentation externally on the NEX portal
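As a rough illustration of mixing and matching versions per team, the sketch below turns a per-team requirements table into `module load name/version` commands, the usual form for the Environment Modules tool. The team names, package versions and the `TEAM_ENVIRONMENTS` table are invented for the example; the slide does not describe how NEX stores these requirements.

```python
# Hypothetical per-team requirements; versions are placeholders, not NEX's.
TEAM_ENVIRONMENTS = {
    "weld": {"gdal": "1.9.2", "python": "2.7.3"},
    "bioclim": {"gdal": "1.10.0", "hdf4": "4.2.9", "python": "2.7.3"},
}


def module_commands(team, environments=TEAM_ENVIRONMENTS):
    """Return the 'module load' lines that set up one team's environment."""
    try:
        modules = environments[team]
    except KeyError:
        raise ValueError("no environment defined for team %r" % team)
    # Sort by module name so the generated script is reproducible.
    return ["module load %s/%s" % (name, version)
            for name, version in sorted(modules.items())]


if __name__ == "__main__":
    print("\n".join(module_commands("bioclim")))
```

Two teams can pin different versions of the same library without affecting each other, which is the property the slide highlights.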
17 Knowledge Organization Internal NEX Knowledge graph -Spans data, content, web portal, tools -Provenance RDF/OWL representation -Triple and quad-store (MySQL and Virtuoso) Knowledge Acquisition -Manual = Documentation, blogs etc. (internal and external) -Automatic = entity extraction from text and metadata using natural language processing: location, datasets used by project, sensors -Build relationships Improves search – who is doing what where -Who is doing work in the Amazon, and what sensors are they using? What are the most frequently used sensors in NEX projects? Can generate project concepts, so that projects can be easily related to each other (LSI)
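The automatic acquisition step (extract entities from text, then build relationships) can be sketched with a simple dictionary lookup. This is a deliberately simplified stand-in for the natural language processing mentioned on the slide, not the NEX pipeline; the gazetteer contents and the predicate names `usesSensor` and `studiesRegion` are assumptions made for the example.

```python
import re

# Hypothetical gazetteer: entity type -> known terms.
GAZETTEER = {
    "sensor": ["MODIS", "Landsat", "NAIP"],
    "location": ["Amazon", "California"],
}

# Hypothetical predicate names used when turning entities into relationships.
PREDICATES = {"sensor": "usesSensor", "location": "studiesRegion"}


def extract_entities(text, gazetteer=GAZETTEER):
    """Return sorted (type, term) pairs for gazetteer terms found in the text."""
    found = []
    for kind, terms in gazetteer.items():
        for term in terms:
            # Whole-word, case-insensitive match.
            if re.search(r"\b%s\b" % re.escape(term), text, flags=re.IGNORECASE):
                found.append((kind, term))
    return sorted(found)


def to_triples(project, entities):
    """Relate a project to each extracted entity as (subject, predicate, object)."""
    return [(project, PREDICATES[kind], term) for kind, term in entities]


if __name__ == "__main__":
    description = "Mapping forest loss in the Amazon with MODIS and landsat data"
    print(to_triples("project:demo", extract_entities(description)))
```

The resulting triples are the kind of relationship a graph store can then answer "who is doing what where" questions from.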
18 Relating Entities [Diagram] Sources: NEX projects, wikis, … (NEX web portal); publications (NEX web portal, Harvard Database, …); links to external docs (LP DAAC, …); provenance from running process. Vocabularies: GCMD Concepts; NEX Extension (additional concepts outside the GCMD hierarchy – data hierarchy, …). Arrow labels: extract entities; link to; link to/define new; link to resources; record provenance; queries. Central store: NEX Graph Data Store.
19 Example queries What is the provenance of file X? What is the bounding box of region R? What is the usage of each NASA instrument across NEX projects, sorted by number of projects? What instruments are used by projects doing research in the Amazon? What are the most cited datasets in the remote sensing publications? Now that the NEX portal has been migrated to NAS, we can integrate this information with the portal much more easily.
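Two of the example queries above can be expressed as pattern matches over (subject, predicate, object) triples. The sketch below uses a plain Python list rather than a real triple store such as Virtuoso, and the project identifiers, predicate names and data are invented for illustration.

```python
from collections import Counter

# Hypothetical knowledge-graph contents.
TRIPLES = [
    ("project:A", "studiesRegion", "Amazon"),
    ("project:A", "usesInstrument", "MODIS"),
    ("project:A", "usesInstrument", "Landsat"),
    ("project:B", "studiesRegion", "Amazon"),
    ("project:B", "usesInstrument", "Landsat"),
    ("project:B", "usesInstrument", "MODIS"),
    ("project:C", "studiesRegion", "California"),
    ("project:C", "usesInstrument", "Landsat"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def instruments_in_region(triples, region):
    """Instruments used by projects doing research in the given region."""
    projects = {s for s, _, _ in match(triples, p="studiesRegion", o=region)}
    return sorted({o for s, _, o in match(triples, p="usesInstrument") if s in projects})


def instrument_usage(triples):
    """(instrument, project count) pairs, most used first, ties broken by name."""
    counts = Counter(o for _, _, o in match(triples, p="usesInstrument"))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    print(instruments_in_region(TRIPLES, "Amazon"))
    print(instrument_usage(TRIPLES))
```

In a production store the same questions would typically be written as graph queries; the pattern-matching logic is the same.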
20 Data Dissemination A number of facets -Large-scale data distribution (CMIP-5 for NCA) -Web-services application support (SIMS) -Open Access – Amazon Focus not only on the mechanics and implementation, but also on protocols and policies development -Often more time-consuming than implementation
21 CMIP-5 Dissemination Downscaled climate dataset produced on NEX (17TB) -Important and highly requested by the community First process for NEX data -> NASA distribution facility -Established DOI minting capabilities (through UC Digital Library) -Established a technique for DOI dataset verification through checksums, without extensive web services, even when underlying naming changes. Data available at: -http://dataserver.nccs.nasa.gov/thredds/idd/bypass.html -And internally on NEX -Data had to be aggregated and reformatted for use by NCCS This raises the issue of verification against the original datasets, as well as the fact that there are effectively two copies of the data in different formats Needed extensive work with users + many lessons learned = updated protocol with NCCS, but it will differ between facilities
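One way checksum-based verification can survive renaming is to key the manifest on file content rather than file name. The sketch below is an assumption about how such a check could work, not the technique actually established for the CMIP-5 DOI; it uses in-memory byte strings in place of files to stay self-contained, and the file names are invented. It covers renaming only: data that has been aggregated or reformatted, as noted on the slide, would produce different digests.

```python
import hashlib


def digest(data):
    """SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(files):
    """files maps name -> content bytes. The manifest ignores the names."""
    return sorted(digest(content) for content in files.values())


def verify(files, manifest):
    """Compare a file collection with a manifest.

    Returns (missing, unexpected): digests listed in the manifest but absent
    from the files, and digests present in the files but not in the manifest.
    """
    have = set(build_manifest(files))
    want = set(manifest)
    return sorted(want - have), sorted(have - want)


if __name__ == "__main__":
    original = {"tasmax_2010.nc": b"grid-a", "pr_2010.nc": b"grid-b"}
    renamed = {"v2/tasmax.nc": b"grid-a", "v2/pr.nc": b"grid-b"}
    print(verify(renamed, build_manifest(original)))
```

For real datasets the content would be read from disk in chunks and fed to the hash incrementally rather than held in memory.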
22 NASA Satellite Irrigation Management Support (SIMS) ACCESS software infrastructure directly supports the SIMS project (NASA Applied Sciences) -Builds partially on efforts from the last ACCESS project -Provides access to near-real-time Landsat data time-series through a data cube interface -The goal of the SIMS project is to develop new information products from satellite data to support growers in optimizing irrigation Currently tested by 12 partner growers -Data visualization and queries via web services built on OPeNDAP -Both web-based and mobile interfaces
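The idea of a data cube interface for time-series access can be sketched as indexing a stack of images along the time axis. The layout `cube[time][row][col]`, the function name and the sample values below are assumptions for illustration; the slide does not describe the actual SIMS interface, which is served through OPeNDAP-based web services.

```python
def pixel_time_series(cube, dates, row, col):
    """Return [(date, value), ...] for one pixel of a cube indexed [time][row][col]."""
    if len(cube) != len(dates):
        raise ValueError("cube has %d layers but %d dates were given"
                         % (len(cube), len(dates)))
    return [(date, layer[row][col]) for date, layer in zip(dates, cube)]


if __name__ == "__main__":
    # Three hypothetical 2x2 acquisitions of one field.
    dates = ["2013-06-01", "2013-06-17", "2013-07-03"]
    cube = [
        [[0.20, 0.25], [0.30, 0.35]],
        [[0.40, 0.45], [0.50, 0.55]],
        [[0.60, 0.65], [0.70, 0.75]],
    ]
    print(pixel_time_series(cube, dates, 1, 0))
```

A client such as a web or mobile interface would request exactly this kind of slice: one location, all available dates.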
23 [Figure: SIMS interface. Panels: crop condition, % cover, crop coefficient, crop water requirement] An example of the SIMS web/mobile data interface, which is designed to enhance grower access to satellite-derived measures of crop condition and crop water requirements across 3.7 million ha of irrigated land in California.
24 Amazon Web Services Space Act Agreement Prototype process for providing access to NEX data through public cloud facilities -Open access to data and workflows We are reaching capacity on NEX and have restrictions on access -Different cost model – billing for computing is under the user’s control -We can add complete virtual machines with packaged environments and workflows developed and managed on NEX and accessible through the NEX web portal -Prototyping effort includes NCA-related activities -NCA downscaled data (CMIP-5) -NEX portal linked with Amazon Web Services (open) or internal (NEX-members only) NEX work environment
25 Infrastructure Database setup -Access to database systems from all NEX components -Mostly MySQL-based, experimenting with Virtuoso, Neo4j and revisiting MongoDB Supercomputing setup -Work with NAS system group to enable access even from within Pleiades supercomputer -Needed for easier streaming of provenance information Applications support -Separate OPeNDAP, THREDDS and FTP server Security considerations -Moderate system = 2-factor authentication required -Waiver for NEX portal for OpenID and NDC users -One of the drivers for testing public cloud solutions to improve access
26 Immediate Benefits for Many NEX Projects (Examples) Web-Enabled Landsat Data (WELD) -Acquisition, organization and access to data and processing capabilities for monthly Landsat vegetation composites – 800+TB total data requirements North American Forest Dynamics (NAFD) -Acquisition, organization and access to data, QA, metadata and processing capabilities for Landsat (80TB) BIOCLIM -Acquisition and organization of global MODIS land and atmospheric products including swath mapping to acquisition regions (15 TB).
27 Creating Global Monthly Landsat Composites, Present. Takes over 10,000 scenes each month using the WELD system. [Figure panels: April 2010, October 2010] Web Enabled Landsat Data: Going Global, Roy et al.
28 North American Forest Disturbance (NAFD, Goward et al.,) Expanding from 23 samples to Wall-to-wall coverage Processing scenes from on NEX
29 NEX Software View - Current
30 NEX Software View – Overall Goal NASA Cloud/AWS/OpenStack implementation/… Currently prototyping
32 [Diagram: instance launch workflow. Labels: Instantiate, Instance Type, START INSTANCE, Status Monitor, Booting, IP Ready, READY] …
33 Summary of Activity During Review Period (1) Inventory of NEX tools and datasets. -Started with 25 existing datasets on NEX comprising about 300TB of data. -Worked with NEX users to better understand: how they use the data, metadata and QA information; which tools and utilities they use the most; and what functionality is missing from the existing tools and utilities. We have prototyped the database access for a number of use cases, and some parts of it are already being used by NEX science teams. As the science teams have developed a highly sought-after downscaled climate dataset, we have prototyped a process through which the data are distributed by NASA’s NCCS facility
34 Summary of Activity During Review Period (2) Set up an initial NEX-wide repository based on the “module” utilities that enables us to customize environments for specific users’ needs in terms of tool/software versions and dependencies. Started to integrate some of the tools and utilities for data manipulation with the NEX semantic infrastructure and prototyped an end-to-end process of semantic data and process integration with MODIS climatology processes that also includes provenance capture. Worked closely with several NEX projects to establish the initial NEX database and tools API, which is currently in use mainly for access to Landsat and MODIS data and metadata for both gridded and swath datasets.
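Provenance capture for a processing step can be sketched as a wrapper that records the process name, inputs and output each time the step runs. This is a minimal illustration under assumed names (`record_provenance`, `PROVENANCE_LOG`, `monthly_mean`); it is not the NEX implementation, which represents provenance in RDF/OWL rather than in a Python list.

```python
import functools

# Hypothetical in-memory log; a real system would write to a persistent store.
PROVENANCE_LOG = []


def record_provenance(process_name):
    """Decorator that logs inputs and output of a step after it succeeds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(inputs, output):
            result = func(inputs, output)
            PROVENANCE_LOG.append(
                {"process": process_name, "inputs": list(inputs), "output": output}
            )
            return result
        return wrapper
    return decorator


@record_provenance("monthly_mean")
def monthly_mean(inputs, output):
    """Placeholder processing step; a real one would read and write files."""
    return "wrote %s from %d inputs" % (output, len(inputs))


def provenance_of(output):
    """Answer 'how was this file produced, and from which inputs?'"""
    return [record for record in PROVENANCE_LOG if record["output"] == output]


if __name__ == "__main__":
    monthly_mean(["day1.hdf", "day2.hdf"], "mean.hdf")
    print(provenance_of("mean.hdf"))
```

Recording only after the step returns means failed runs leave no provenance entry, which keeps the log consistent with the files that actually exist.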
35 Summary of Activity During Review Period (3) Added a new metadata collection capability for some datasets that enables us to better estimate future data requirements as well as provide users with additional information, mainly for QA screening purposes. Prototyped an automated process through which users can submit requests for data, tools and models to be included on NEX using PivotalTracker
36 Papers and Presentations “NASA Earth Exchange (NEX): Earth science collaborative for global change science”, presented at IGARSS. “NASA Earth Exchange (NEX)”, presented at Supercomputing. “Connecting Provenance and Semantic Descriptions in NASA Earth Exchange (NEX)”, presented at AGU 2012.
37 ESDSWG Participation Participated at 2012 ESDSWG meeting Participated in Semantics Working Group until it was dissolved Currently participate in the Cloud Computing Working group Plan to attend 2013 ESDSWG meeting and expand participation to Earth Science Collaboratory WG
38 Relationship to other funded activities AIST -Facilitate access to tools and knowledge through API for workflow integration CMAC (Data Mining) -Facilitates access to data and pre-processing tools CMAC (Recommendations) -Facilitates access to tools through workflows National Climate Assessment (NCA) -Facilitates process for NEX-produced data distribution for NCA BIOCLIM -Facilitates access to tools, data and libraries for several BIOCLIM projects.
39 Relationship to NEX Provides foundation for user/project work environments Provides access to metadata for integration with the NEX knowledge system Provides the overarching metadata architecture for data and processes integrated through a semantic layer
40 Project Schedule
41 Summary of Work During Next Review Period (through 2/14) Continuous integration of tools and utilities with the NEX infrastructure based on user requirements Continuous integration of data with the NEX infrastructure based on user requirements Continue to work on the data and process interface (API) – the initial API is in Python, but we are also working with users on access to data and tools through R and MATLAB -The extent of this will be driven by user requirements Work with users to continue integrating documentation, FAQs and code samples for the tools and datasets so that they are available both on the computing platform and on the NEX web portal.
42 Cumulative Budget (3/2012 – 8/2013) FY12: $141,750 -All funds have been obligated FY13: $145,200 -All funds have been obligated Does it match your numbers?
43 Glossary
-API: Application Programming Interface
-BIOCLIM: Climate and Biological Response: Research and Applications
-CMAC: Computational Modeling Algorithms and Cyberinfrastructure
-CMS: Carbon Monitoring System
-DMF: Data Migration Facility
-HEC: High-End Computing
-HPC: High-Performance Computing
-NAFD: North American Forest Disturbance
-NCCS: NASA Center for Climate Simulations
-NEX: NASA Earth Exchange
-OWL: Web Ontology Language
-RDF: Resource Description Framework
-SIMS: Satellite Irrigation Management Support