Presentation on theme: "NASA Earth Exchange: Improving access to large-scale data and computational infrastructure ACCESS-11-0034 Annual Review August 20, 2013 Ramakrishna Nemani,"— Presentation transcript:
1 NASA Earth Exchange: Improving access to large-scale data and computational infrastructure
ACCESS Annual Review, August 20, 2013
Ramakrishna Nemani, Petr Votava, Andrew Michaelis, Hirofumi Hashimoto, Forrest Melton
2 Background: NASA Earth Exchange
Vision: To engage and enable the Earth science community to address global Earth science challenges.
NEX is a collaborative compute platform that improves the availability of Earth science data, models, analysis tools and scientific results through a centralized environment that fosters knowledge sharing, collaboration, innovation and direct access to compute resources.
Engage:
  Network, share and collaborate
  Discuss and formulate new ideas
  Portal, Virtual Institute
Enable:
  Access to data
  Access to computing
  Access to knowledge
4 Outline
Project background
Updated quad chart
Review of schedule and milestones
Description of work accomplished and results
Technical reports and presentations
Discussion of next 6 month activity
Schedule and budget summary
5 Project Background
The main focus of the project is on supporting the NEX community by continuously improving access to data, tools, computing and knowledge.
By improving these, we can engage more users and teams and provide them with better and faster support.
  Need to be able to respond quickly to new requirements
  Focus on knowledge acquisition and access
We can also help our users significantly scale their projects.
6 NASA Earth Exchange: Improving access to large-scale data and computational infrastructure
PI: Ramakrishna Nemani, Ph.D., NASA Ames Research Center
Goals and Objectives
  Enhance access, discovery and integration of data, models and services for the NEX communities
  Provide integrated system view of NEX data, metadata, processing libraries, models and QA information
  Provide API and client libraries to NEX tools, datasets and search capabilities
  Provide streamlined way for researchers to share their results with the community
Architecture Overview
Approach
  Inventory current NEX datasets, tools and models and engage the community in gathering requirements and use cases.
  Design a common database schema for existing NEX datasets.
  Develop API that facilitates search and access to data, tools and models and use it to implement client libraries
  Develop migration and dissemination tools for NEX users
Key Milestones
  Preliminaries completed 07/2012
  Data integration completed 11/2012
  Process integration completed 01/2013
  System interface completed 08/2013
  Migration tools completed 01/2014
  Client libraries and tools completed 02/2014
Co-Is/Partners
  Co-I: Petr Votava, Andrew Michaelis, Dr. Hirofumi Hashimoto, Forrest Melton/CSU Monterey Bay
TRLin = 6
08/13
8 Project Goal
To enhance access, discovery and integration of data, models and tools for the NEX communities.
9 Objectives for Activity During Review Period
Complete inventory of current NEX data, metadata, tools and libraries
Engage NEX users to gather additional data and tools requirements
Complete initial data integration with the key NEX datasets and the existing infrastructure
Continue rapid prototyping of database access tools based on user requirements
Continue integration of utilities and tools with NEX system
Prototype integration with NEX semantic infrastructure
10 Project Drivers = Why
To directly support large-scale NASA projects such as WELD, NAFD, NCA, MEASURES, CMS, CMAC and projects in applied sciences
Efficiently support fast growing NEX community both inside and outside of NASA
Earth science research is a global undertaking and we aim to engage the largest possible community
  Large global collaboratory
  Global knowledge pool
  Need critical mass -> everybody benefits
Support for large-scale science while engaging large community
Place for community contributions and access to these contributions:
  Knowledge, tools, data, workflows, …
11 NEX User and Project Evolution
Number of active compute/data users at the beginning of this ACCESS project: less than 50
Current number of active compute/data users: 158
Largest data requirements at the beginning of this ACCESS project: 10s of TB (per project)
Current data requirements: 100s of TB to 1PB+ (per project)
On the NEX portal: currently 404 users and 1,252 projects (not all active)
12 ACCESS Project Overview
Data
  Provide integrated view of NEX data and metadata through API, command-line tools and query services.
Tools
  Provide mechanism to discover and manage environments for tools and utilities required by different projects and provide APIs
Knowledge
  Cross-reference and provide access to information about datasets, tools, users, projects, publications and other docs.
Dissemination
  Establish process, policies and infrastructure for dissemination of data produced on NEX.
Infrastructure
  Components and solutions that enable the above within security and policy constraints.
13 Data Organization
Started with inventory
  Currently over 450TB on-line and 500+TB near-line
Feedback from summer school 2012 users, summer interns in 2013 and NEX users and PIs
Two rounds of “Query Requirements” with the NEX science team
Two-to-three tier system
  Primary on-line fast storage, secondary on-line cache, near-line tape accessed through DMF
14 Query Categories and Requirements
“Standard” queries
  Temporal, spatial, match region by name, what data are available, …
Data provenance
  How was data produced (process/workflow)?
  What were the inputs into the process?
  Who created this dataset?
Knowledge queries
  Which projects work with dataset X? In what geographic region?
  Which publications are relevant to dataset X?
Administration queries
  How often is the dataset updated? From where?
Analytics queries (not addressed by this project)
  Filter based on internal QA, Landcover or statistics
  Large number of requests for these capabilities
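A "standard" temporal and spatial query of the kind listed above could look roughly like the following sketch. It uses Python's built-in sqlite3 module purely for illustration (the infrastructure slides mention MySQL); the `scenes` table, its columns, the sample rows and the `find_scenes` helper are all hypothetical and are not the NEX schema or API.

```python
import sqlite3

# Hypothetical schema: one row per scene with an ISO 8601 acquisition date
# and a lon/lat bounding box. This is NOT the actual NEX database layout.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE scenes (
           scene_id TEXT PRIMARY KEY,
           sensor   TEXT,
           acquired TEXT,
           min_lon REAL, min_lat REAL, max_lon REAL, max_lat REAL
       )"""
)
conn.executemany(
    "INSERT INTO scenes VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("s1", "Landsat", "2010-04-15", -122.0, 36.0, -120.0, 38.0),
        ("s2", "Landsat", "2010-10-02", -122.0, 36.0, -120.0, 38.0),
        ("s3", "MODIS", "2010-04-20", -70.0, -10.0, -60.0, 0.0),
    ],
)


def find_scenes(conn, sensor, start, end, bbox):
    """Return IDs of scenes from `sensor` acquired in [start, end]
    whose bounding box intersects `bbox` = (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    rows = conn.execute(
        """SELECT scene_id FROM scenes
           WHERE sensor = ?
             AND acquired BETWEEN ? AND ?
             AND min_lon <= ? AND max_lon >= ?
             AND min_lat <= ? AND max_lat >= ?
           ORDER BY acquired""",
        (sensor, start, end, max_lon, min_lon, max_lat, min_lat),
    ).fetchall()
    return [row[0] for row in rows]


# Landsat scenes over a small box in the first half of 2010.
hits = find_scenes(conn, "Landsat", "2010-01-01", "2010-06-30",
                   (-121.5, 36.5, -121.0, 37.0))
```

Storing dates as ISO 8601 strings keeps the temporal filter a plain string comparison; a production system would more likely use native date and spatial index types.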
15 Data Organization Details
Keep metadata in the original format and naming conventions
  Researchers are used to the metadata names
  At times extensive documentation exists to describe the metadata
Metadata are processed by custom parsers
  Different for different sensors (MODIS, Landsat, NAIP, …)
Each dataset is stored in a separate set of tables, and when it is added to NEX a custom plug-in is written
  Overrides abstract methods from the DB class
  This is manageable because the set of datasets is not that large (a few dozen at most), and writing generic code while maintaining the original metadata would take longer
We are experimenting with a semantic layer that describes and maps terms in different DBs to a common taxonomy, but it requires dynamic query rewriting and its suitability for this problem is questionable.
The best solution seems to be either fully relational (current) or fully graph-based (future). The implementation needs to be hidden behind an API; however, users at times want access to a full RDBMS, in which case maintaining two consistent copies seems the best answer.
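The per-dataset plug-in approach described above can be sketched with Python's abc module. Everything here is an assumption made for illustration: the `DatasetDB` base class, the `LandsatPlugin` subclass, the table name and the simplified `KEY = VALUE` metadata text are hypothetical stand-ins, not the actual NEX classes or parsers.

```python
from abc import ABC, abstractmethod


class DatasetDB(ABC):
    """Hypothetical base class: shared logic lives here, and each
    dataset plug-in overrides the abstract parsing method."""

    table_name = None  # each plug-in names its own set of tables

    @abstractmethod
    def parse_metadata(self, raw):
        """Turn raw sensor metadata into a dict, keeping original names."""

    def to_row(self, raw):
        # Shared step: parse, then pair the record with its target table.
        return self.table_name, self.parse_metadata(raw)


class LandsatPlugin(DatasetDB):
    """Plug-in for simplified 'KEY = VALUE' metadata text (assumed format)."""

    table_name = "landsat_metadata"

    def parse_metadata(self, raw):
        record = {}
        for line in raw.splitlines():
            if "=" not in line:
                continue  # skip blank or structural lines
            key, value = line.split("=", 1)
            # Keys are stored unchanged so researchers see familiar names.
            record[key.strip()] = value.strip().strip('"')
        return record


table, record = LandsatPlugin().to_row('SPACECRAFT_ID = "LANDSAT_5"\nWRS_PATH = 44\n')
```

Adding a new sensor then means writing one more subclass, which stays manageable while the number of datasets is in the dozens.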
16 Tools/Utilities/Models
Tried a number of approaches
  Users often want custom solutions with specific library/tool versions
  Management of this quickly gets complicated
Using “modules” infrastructure to provide custom environments for NEX teams
  We can easily mix and match versions as per a team’s requirements
  Also good for easy reproduction/packaging of environments
  Will be the basis for the tool contribution setup (nex/contrib)
Access to almost all tools through a Python API or through regular command-line invocation
  Great for integration with the VisTrails workflow management system
Mechanism to query a list of modules to be built or request a new module to be built
Working on adding better search and documentation capabilities
  Also, exposing documentation externally on the NEX portal
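As a rough illustration of per-team environments, the sketch below turns a team's pinned tool versions into `module load name/version` commands, the usual Environment Modules invocation. The team names, tool versions and the `module_commands` helper are hypothetical; this is not the NEX Python API.

```python
# Hypothetical per-team version pins; real NEX teams and versions will differ.
TEAM_ENVIRONMENTS = {
    "team_a": {"python": "2.7", "gdal": "1.9"},
    "team_b": {"python": "2.6", "gdal": "1.8", "netcdf": "4.2"},
}


def module_commands(team, environments=TEAM_ENVIRONMENTS):
    """Return the 'module load' commands that reproduce a team's environment."""
    pins = environments[team]
    # Sort by tool name so the generated environment script is deterministic.
    return ["module load {}/{}".format(name, pins[name]) for name in sorted(pins)]


commands = module_commands("team_a")
```

Keeping the pins in one mapping is what makes an environment easy to reproduce or package: the same dictionary can generate a shell script today and a container recipe later.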
17 Knowledge Organization
Internal NEX Knowledge graph
  Spans data, content, web portal, tools
  Provenance
  RDF/OWL representation
  Triple and quad-store (MySQL and Virtuoso)
Knowledge Acquisition
  Manual = Documentation, blogs etc. (internal and external)
  Automatic = entity extraction from text and metadata using natural language processing
    Location, datasets used by project, sensors
Build relationships
  Improves search: who is doing what where
    Who is doing work in the Amazon, and what sensors are they using? What are the most frequent sensors used by NEX projects?
  Can generate project concepts, so that projects can be easily related to each other (LSI)
18 Relating Entities
[Diagram: entities are extracted from NEX projects, wikis, … (NEX web portal) and from publications (NEX web portal, Harvard database, …) and are linked to GCMD concepts, or linked to / defined as new concepts in a NEX extension (additional concepts outside the GCMD hierarchy: data hierarchy, …). The NEX graph data store also links to resources and external docs (LP DAAC, …), records provenance from running processes, and answers queries.]
19 Example queries
What is the provenance of file X?
What is the bounding box of region R?
What is the usage of each NASA instrument in the NEX projects, sorted by number of projects?
What instruments are used by projects doing research in the Amazon?
What are the most cited datasets in the remote sensing publications?
Now that the NEX portal has been migrated to NAS, we can integrate this information with the portal much more easily.
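Two of the example queries can be sketched over a tiny in-memory set of subject-predicate-object triples. This is only an illustration in plain Python, not the RDF/OWL store itself; the project names, the predicates `studiesRegion` and `usesInstrument`, and the helper functions are all hypothetical.

```python
from collections import Counter

# Hypothetical triples; the real graph uses RDF/OWL terms and is far larger.
TRIPLES = [
    ("project:A", "studiesRegion", "Amazon"),
    ("project:A", "usesInstrument", "MODIS"),
    ("project:A", "usesInstrument", "Landsat"),
    ("project:B", "studiesRegion", "California"),
    ("project:B", "usesInstrument", "Landsat"),
    ("project:B", "usesInstrument", "MODIS"),
    ("project:C", "studiesRegion", "Amazon"),
    ("project:C", "usesInstrument", "MODIS"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the given pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def instruments_in_region(triples, region):
    """Instruments used by any project that studies `region`."""
    projects = {s for s, _, _ in match(triples, p="studiesRegion", o=region)}
    used = {o for s, _, o in match(triples, p="usesInstrument") if s in projects}
    return sorted(used)


def instrument_usage(triples):
    """(instrument, project count) pairs, most used first, ties by name."""
    pairs = set(match(triples, p="usesInstrument"))  # de-duplicate repeats
    counts = Counter(o for _, _, o in pairs)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


amazon_instruments = instruments_in_region(TRIPLES, "Amazon")
usage = instrument_usage(TRIPLES)
```

In a real triple store the same questions would typically be expressed as graph pattern queries rather than Python loops; the pattern-matching idea is the same.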
20 Data Dissemination
Number of facets
  Large-scale data distribution (CMIP-5 for NCA)
  Web-services application support (SIMS)
  Open Access: Amazon
Focus not only on the mechanics and implementation, but also on the development of protocols and policies
  Often more time-consuming than implementation
21 CMIP-5 Dissemination
Downscaled climate dataset produced on NEX (17TB)
Important and highly requested by the community
First process for NEX data -> NASA distribution facility
Established DOI minting capabilities (through UC Digital Library)
Established a technique for DOI dataset verification through checksums, without extensive web services, even when the underlying naming changes
Data available at:
And internally on NEX
Data had to be aggregated and reformatted for use by NCCS
  This raises issues of verification against the original datasets, as well as the fact that there are effectively two copies of the data in different formats
Needed extensive work with users + many lessons learned = update protocol with NCCS, but it will be different with different facilities
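One way to verify a dataset by checksum even when file names change is to key the manifest on content digests rather than on names. The sketch below assumes SHA-256 and in-memory byte strings for brevity; the slides do not say which hash or manifest format was used, and `build_manifest` and `verify` are hypothetical helpers.

```python
import hashlib


def digest(content):
    """SHA-256 hex digest of a bytes object (hash choice is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def build_manifest(files):
    """Build a name-independent manifest: the set of content digests.
    `files` maps file name -> file content as bytes."""
    return {digest(content) for content in files.values()}


def verify(files, manifest):
    """Compare a delivered copy against the manifest, ignoring file names.
    Returns (names with unknown content, digests missing from the copy)."""
    seen = {name: digest(content) for name, content in files.items()}
    unknown = sorted(name for name, d in seen.items() if d not in manifest)
    missing = sorted(manifest - set(seen.values()))
    return unknown, missing


original = {"a.nc": b"temperature grid", "b.nc": b"precipitation grid"}
renamed = {"tasmax_2013.nc": b"temperature grid", "pr_2013.nc": b"precipitation grid"}
manifest = build_manifest(original)
result = verify(renamed, manifest)
```

Note that this only works for byte-identical copies: a copy that has been aggregated or reformatted produces different digests, which is exactly the verification issue raised above.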
22 NASA Satellite Irrigation Management Support (SIMS)
ACCESS software infrastructure directly supports the SIMS project (NASA Applied Sciences)
  Builds partially on efforts from the last ACCESS project
  Provides access to near-real-time Landsat data time-series through a data cube interface
The goal of the SIMS project is to develop new information products from satellite data to support growers in optimizing irrigation
  Currently tested by 12 partner growers
Data visualization and queries via web services built on OPeNDAP
  Both web-based and mobile interfaces
23 [Figure panels: crop condition, % cover, crop coefficient, crop water requirement]
An example of the SIMS web / mobile data interface, which is designed to enhance grower access to satellite-derived measures of crop condition and crop water requirements across 3.7 million ha of irrigated land in California.
24 Amazon Web Services Space Act Agreement
Prototype process for providing access to NEX data through public cloud facilities
  Open access to data and workflows
  We are reaching capacity on NEX and have restrictions on access
  Different cost model: billing for computing is under the user's control
  We can add complete Virtual Machines with packaged environments and workflows developed and managed on NEX and accessible through the NEX web portal
Prototyping effort includes
  NCA-related activities
  NCA downscaled data (CMIP-5)
  NEX portal linked with Amazon Web Services (open) or internal (NEX-members only) NEX work environment
25 Infrastructure
Database setup
  Access to database systems from all NEX components
  Mostly MySQL-based, experimenting with Virtuoso, Neo4j and re-visiting MongoDB
Supercomputing setup
  Work with NAS system group to enable access even from within the Pleiades supercomputer
  Needed for easier streaming of provenance information
Applications support
  Separate OPeNDAP, THREDDS and FTP server
Security considerations
  Moderate system = 2-factor authentication required
  Waiver for NEX portal for OpenID and NDC users
  One of the drivers for testing public cloud solutions to improve access
26 Immediate Benefits for Many NEX Projects (Examples)
Web-Enabled Landsat Data (WELD)
  Acquisition, organization and access to data and processing capabilities for monthly Landsat vegetation composites: 800+TB total data requirements
North America Forest Dynamics (NAFD)
  Acquisition, organization and access to data, QA, metadata and processing capabilities for Landsat (80TB)
BIOCLIM
  Acquisition and organization of global MODIS land and atmospheric products including swath mapping to acquisition regions (15 TB)
27 Web Enabled Landsat Data: Going Global, Roy et al.
Creating Global Monthly Landsat Composites, Present
Takes over 10,000 scenes each month using WELD system
[Figure panels: April 2010, October 2010]
28 North American Forest Disturbance (NAFD, Goward et al.)
Expanding from 23 samples to wall-to-wall coverage
Processing scenes from on NEX
32 [Diagram: instance launch flow with the labels Snapshots, Instance Type, START, Instantiate, Status Monitor, Booting, IP Ready…, INSTANCE READY]
33 Summary of Activity During Review Period (1)
Inventory of NEX tools and datasets.
  Started with 25 existing datasets on NEX comprising about 300TB of data.
Worked with NEX users to better understand:
  How they use the data, metadata and QA information
  Which tools and utilities they use the most and what functionality is missing from the existing tools and utilities
We have prototyped database access for a number of use cases, and some parts of it are already being used by NEX science teams.
As the science teams have developed a highly sought-after downscaled climate dataset, we have prototyped a process through which the data are distributed by NASA’s NCCS facility.
34 Summary of Activity During Review Period (2)
Set up an initial NEX-wide repository based on the “module” utilities that enables us to customize environments for specific users’ needs in terms of tool/software versions and dependencies.
Started to integrate some of the tools and utilities for data manipulation with the NEX semantic infrastructure and prototyped an end-to-end process of semantic data and process integration with MODIS climatology processes that also includes provenance capture.
Worked closely with several NEX projects to establish the initial NEX database and tools API, which is currently in use mainly for access to Landsat and MODIS data and metadata for both gridded and swath datasets.
35 Summary of Activity During Review Period (3)
Added a new metadata collection capability for some datasets that enables us to better estimate future data requirements as well as provide users with additional information, mainly for QA screening purposes.
Prototyped an automated process through which users can submit requests for data, tools and models to be included on NEX using PivotalTracker.
36 Papers and Presentations
“NASA Earth Exchange (NEX): Earth science collaborative for global change science”. Presented at IGARSS 2012.
“NASA Earth Exchange (NEX)”. Presented at Supercomputing 2012.
“Connecting Provenance and Semantic Descriptions in NASA Earth Exchange (NEX)”. Presented at AGU 2012.
37 ESDSWG Participation
Participated at 2012 ESDSWG meeting
Participated in Semantics Working Group until it was dissolved
Currently participate in the Cloud Computing Working Group
Plan to attend 2013 ESDSWG meeting and expand participation to Earth Science Collaboratory WG
38 Relationship to other funded activities
AIST
  Facilitates access to tools and knowledge through API for workflow integration
CMAC (Data Mining)
  Facilitates access to data and pre-processing tools
CMAC (Recommendations)
  Facilitates access to tools through workflows
National Climate Assessment (NCA)
  Facilitates process for NEX-produced data distribution for NCA
BIOCLIM
  Facilitates access to tools, data and libraries for several BIOCLIM projects
39 Relationship to NEX
Provides foundation for user/project work environments
Provides access to metadata for integration with the NEX knowledge system
Provides the overarching metadata architecture for data and processes integrated through a semantic layer
41 Summary of Work During Next Review Period (through 2/14)
Continuous integration of tools and utilities with the NEX infrastructure based on user requirements
Continuous integration of data with the NEX infrastructure based on user requirements
Continue to work on the data and process interface (API): the initial API is in Python, but we are also working with users on access to data and tools through R and MATLAB
  The extent of this will be driven by user requirements
Work with users to continue integration of documentation, FAQs and code samples for the tools and datasets so that they are available both on the computing platform and on the NEX web portal.
42 Cumulative Budget (3/2012 – 8/2013)
FY12: $141,750
All funds have been obligated
FY13: $145,200
Does it match your numbers?
43 Glossary
API: Application Programming Interface
BIOCLIM: Climate and Biological Response: Research and Applications
CMAC: Computational Modeling Algorithms and Cyberinfrastructure
CMS: Carbon Monitoring System
DMF: Data Migration Facility
HEC: High-End Computing
HPC: High-Performance Computing
NAFD: North American Forest Disturbance
NCCS: NASA Center for Climate Simulations
NEX: NASA Earth Exchange
OWL: Web Ontology Language
RDF: Resource Description Framework
SIMS: Satellite Irrigation Management Support