1 Artemis: Integrating Scientific Data on the Grid Rattapoom Tuchinda Snehal Thakkar Yolanda Gil Ewa Deelman.


1 Artemis: Integrating Scientific Data on the Grid
Rattapoom Tuchinda, Snehal Thakkar, Yolanda Gil, Ewa Deelman

2 Outline
- Motivation
  - Data integration needs in scientific applications
  - Distributed computing in grids
- Problem statement
- Artemis architecture
- Evaluation
- Related work
- Conclusions and future work

3 Scientific Data Integration
- Large-scale, cross-disciplinary scientific data collection, storage, and analysis exacerbate heterogeneity and dynamics
  - National Virtual Observatory (NVO)
  - Earth System Grid (ESG)

4 Grid Computing [Foster & Kesselman 04]
- Grids provide middleware services for distributed computing:
  - Seamless integration and management of resources: OGSA
  - Job submission and execution management: Condor
  - Resource availability and performance: Monitoring and Discovery Service (MDS)
  - Data replication for robustness and efficiency: Replica Location Service (RLS)
  - Descriptions of data sources: Metadata Catalog Service (MCS)
[Figure, from [Kesselman 04]:
  Discovery: many sources of data, services, computation; registries (R) organize services of interest to a community.
  Access: data integration activities may require access to, and exploration/analysis of, data at many locations; exploration and analysis may involve complex, multi-step workflows.
  Resource management (RM) is needed to ensure progress and arbitrate competing demands.
  Security service, policy service: security and policy must underlie access and management decisions.]

5 Scientific Data Storage and Access
- Data sources are very heterogeneous
  - Data that results from various instruments, disciplines, and types of analyses
  - Wide variety of data storage systems (files, DBs, servers, etc.)
- Data sources are highly distributed
  - Data stored in different locations on the grid
  - Data is replicated in multiple locations
- Data sources are highly dynamic
  - Data grows continuously, new data models are routine
  - New data sources regularly appear
  - Data sources may become unavailable sporadically
- Data available at unprecedented scale
  - Very soon petabytes
These challenges are in the way of scientific progress in many disciplines

6 Data Storage and Access in Grids
- Data described with metadata attributes
  - Attribute names may not be consistent across different sources
  - Metadata descriptions often stored separately from the data itself
- Metadata Catalog Service (MCS) [Moore et al 01, Singh et al 03]
  - Stores descriptive metadata and allows users to query based on desired attributes
  - Addresses heterogeneity of data source implementations and access

7 Sample Query
search constraints:
  keywords = "atmospheric data" or "climate data" or "climate model"
  model type = "CCSM" or "PCM"
  period = 2001
search results (files, collections, or views):
  /CCSM2/b20.007/atm
  /PCM/B06.62/atm
  /PCM/B06.20/atm
  /PCM/B06.21/atm
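
A query of this shape can be read as a filter over metadata records. The sketch below is only an illustration, not the MCS interface: the record layout, the `search` function, the `/example/ocean` entry, and the attribute values attached to each path are all invented, and it assumes that alternatives within one constraint are ORed while the constraints themselves are ANDed.

```python
# Hypothetical metadata records: the first two paths come from the slide, but
# the attribute values attached to them are invented for illustration.
RECORDS = [
    {"path": "/CCSM2/b20.007/atm", "keywords": {"climate model"},
     "model_type": "CCSM", "period": 2001},
    {"path": "/PCM/B06.62/atm", "keywords": {"atmospheric data"},
     "model_type": "PCM", "period": 2001},
    {"path": "/example/ocean", "keywords": {"ocean data"},
     "model_type": "PCM", "period": 2001},
]


def search(records, keywords, model_types, period):
    """Return the paths whose metadata satisfies every constraint.

    Values inside one constraint are alternatives (OR); the constraints
    themselves are combined conjunctively (AND). Both are assumptions.
    """
    return [
        r["path"]
        for r in records
        if r["keywords"] & keywords          # at least one keyword matches
        and r["model_type"] in model_types   # model type is one of the options
        and r["period"] == period            # period matches exactly
    ]


results = search(
    RECORDS,
    keywords={"atmospheric data", "climate data", "climate model"},
    model_types={"CCSM", "PCM"},
    period=2001,
)
print(results)
```

The third record fails the keyword constraint, so only the first two paths are returned.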

8 Problem Statement
- Users should have seamless single-point access
  - Should not have to formulate a different query for each source
  - Should not have to manage the unavailability of data sources
- Users need assistance formulating the queries
  - Data models may have different attribute names and representations (even from the same source)
  - New data models and metadata attributes are created all the time
[Figure: separate queries q1, q2, q3 go to MCS1, MCS2, MCS3, which describe DB1, DB2, DB3. Attribute labels shown: stime, etime; starttime, endtime; descr, sub. One source is marked "currently unavailable".]

9 Artemis
- A mixed-initiative data integration system that aims to:
  - Shield users from diversity in attribute representations
  - Assist users in formulating queries step by step
  - Manage the access and availability of dynamic collections of data sources
- Integrates and extends various AI techniques:
  - Data integration
  - Ontology
  - Dialogue wizards

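10 Approach
[Figure: catalog attribute names (stime, etime, ...; starttime, endtime, ...; description, subject) are linked to an ONTOLOGY in which Time has the terms Start time and End time; stime and starttime correspond to Start time, etime and endtime to End time. A Query Formulation Wizard produces a query over ontology terms (Start time > 500000 ^ End time < 600000). A Query Mediator sits between the wizard and Metadata Catalog1, Metadata Catalog2, and Metadata Catalog3, each attached to a Data Source.]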
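
The idea of posing one query over ontology terms and translating it into each catalog's own attribute names can be sketched as follows. This is a minimal illustration, not the Artemis implementation: the `MAPPINGS` table, the `rewrite` function, and the assignment of `stime`/`etime` to one catalog and `starttime`/`endtime` to another are assumptions.

```python
# Hypothetical mappings from ontology terms to each catalog's attribute names.
# Which catalog uses which spelling is an assumption made for this example.
MAPPINGS = {
    "MetadataCatalog1": {"Start time": "stime", "End time": "etime"},
    "MetadataCatalog2": {"Start time": "starttime", "End time": "endtime"},
}


def rewrite(constraints, catalog):
    """Translate (ontology_term, operator, value) triples into one catalog's names."""
    mapping = MAPPINGS[catalog]
    return [(mapping[term], op, value) for term, op, value in constraints]


# The query from the slide: Start time > 500000 ^ End time < 600000
query = [("Start time", ">", 500000), ("End time", "<", 600000)]

for catalog in MAPPINGS:
    print(catalog, rewrite(query, catalog))
```

The user only ever sees "Start time" and "End time"; the per-catalog spellings stay inside the mapping table, which domain experts would maintain.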

11 Artemis Architecture
[Figure: architecture diagram with these components: MCS Wizard (entity selection, filters), Dynamic Model Generator, Prometheus Query Mediator, three Metadata Catalog Services, Data Source, Ontology, Model Mappings, Models.]

12 MCS Wizard
- Based on the Agent Wizard [Tuchinda 2003]
- Domain experts create mappings between ontologies and metadata attributes
  - Users can then pick the ontology and the mappings relevant to their domain
- Guides the user through available operations and filters consistent with the models of the data

13 Prometheus Query Mediator
- Data integration system from earlier research [Thakkar et al. 2004] [Knoblock et al. 2003]
  - Provides a unified query interface to a wide variety of data sources
  - Relational model
  - Requires a pre-defined domain model relating sources to domain relations
- Extended in Artemis to support:
  - Source relations: various MCSs
  - Domain relations: File, View, Collection
  - Dynamic domain model based on availability of data sources

14 Dynamic Model Generation
- Generate mediator model dynamically by querying MCSs
  - Convert object-oriented model of MCSs to relational model of the mediator
  - Handles dynamic nature of data by generating new domain models at query time
- Intuitive idea
  - Query MCSs one at a time for all possible attributes of different objects
  - Create domain relation for each object type with all possible attributes
  - Create rules defining each MCS as data source
  - Relate various data sources to domain relations

15 Dynamic Model Generator (Cont’d)
Example
- MCS 1:
    File1(starttime, endtime, frequency)
    File2(starttime, endtime, frequency, amplitude)
- MCS 2:
    File3(starttime, endtime, lat, lon, temp)
    File4(starttime, endtime, lat, lon, windspeed)
- Domain relation
    File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name)
- Source relations
    MCS1File(starttime, endtime, frequency, amplitude, name)
    MCS2File(starttime, endtime, lat, lon, temp, windspeed, name)
- Domain rules
    File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
        MCS1File(starttime, endtime, frequency, amplitude, name) ^
        (lat = '') ^ (lon = '') ^ (temp = '') ^ (windspeed = '')
    File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
        MCS2File(starttime, endtime, lat, lon, temp, windspeed, name) ^
        (frequency = '') ^ (amplitude = '')
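
The construction in this example can be reproduced with a short script. It is a sketch under stated assumptions rather than the Dynamic Model Generator itself: `union_in_order`, `generate_model`, and the `catalogs` input format are hypothetical, the rules are emitted as plain strings, and every source relation is assumed to carry a trailing `name` attribute.

```python
def union_in_order(attribute_lists):
    """Union several attribute lists, keeping first-seen order."""
    seen = []
    for attrs in attribute_lists:
        for attr in attrs:
            if attr not in seen:
                seen.append(attr)
    return seen


def generate_model(catalogs):
    """Build the domain relation, source relations, and domain rules.

    `catalogs` maps an MCS name to the attribute lists of its file objects
    (a hypothetical input format standing in for the answers an MCS returns).
    """
    # One source relation per MCS: all attributes of its objects, plus name.
    sources = {
        f"{mcs}File": union_in_order(objects) + ["name"]
        for mcs, objects in catalogs.items()
    }
    # The domain relation holds every attribute seen in any source.
    domain = union_in_order(attrs[:-1] for attrs in sources.values()) + ["name"]
    # One rule per source; attributes the source lacks are bound to ''.
    rules = []
    for relation, attrs in sources.items():
        missing = [a for a in domain if a not in attrs]
        body = [f"{relation}({', '.join(attrs)})"] + [f"({a} = '')" for a in missing]
        rules.append(f"File({', '.join(domain)}) :- " + " ^ ".join(body))
    return domain, sources, rules


catalogs = {
    "MCS1": [["starttime", "endtime", "frequency"],
             ["starttime", "endtime", "frequency", "amplitude"]],
    "MCS2": [["starttime", "endtime", "lat", "lon", "temp"],
             ["starttime", "endtime", "lat", "lon", "windspeed"]],
}

domain, sources, rules = generate_model(catalogs)
for rule in rules:
    print(rule)
```

Because the model is derived from whatever the catalogs report at call time, rerunning `generate_model` after a catalog gains attributes or drops out yields an updated set of rules without any hand-written schema.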

16 Query Processing
- When Prometheus receives a query it determines which MCSs are relevant
  - Relevant MCSs are determined by comparing the constraints of the query with the constraints of the MCSs
  - MCSs that do not satisfy constraints of the query are not used in the query
  - For example, if the query asked for files that contained data for some lat, lon, then MCS1 would not be queried

17 Query Processing: Example
- Let’s say the user uses the MCS Wizard to form the following query:
    Q(name) :-
        File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) ^
        (lat > 33) ^ (lat < 34) ^
        (lon -119) ^
        (starttime > 50000) ^ (endtime < 60000)
- The Prometheus mediator would generate a datalog program with the query and domain rules:
    File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
        MCS1File(starttime, endtime, frequency, amplitude, name) ^
        (lat = '') ^ (lon = '') ^ (temp = '') ^ (windspeed = '')
    File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
        MCS2File(starttime, endtime, lat, lon, temp, windspeed, name) ^
        (frequency = '') ^ (amplitude = '')

18 Query Processing: Example
- Let’s say the user uses the MCS Wizard to form the following query:
    Q(name) :-
        File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) ^
        (lat > 33) ^ (lat < 34) ^
        (lon -119) ^
        (starttime > 50000) ^ (endtime < 60000)
- The Prometheus mediator would generate a datalog program with the query and domain rules:
    File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
        MCS1File(starttime, endtime, frequency, amplitude, name) ^
        (lat = '') ^ (lon = '') ^ (temp = '') ^ (windspeed = '')
    File(starttime, endtime, frequency, amplitude, lat, lon, temp, windspeed, name) :-
        MCS2File(starttime, endtime, lat, lon, temp, windspeed, name) ^
        (frequency = '') ^ (amplitude = '')
- The mediator determines that the constraints on the lat and lon attributes in rule one are not compatible with the order constraints on lat and lon in the query, so only MCS2 is queried
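
The pruning step can be approximated with a simple compatibility check: a rule that binds an attribute to the empty string cannot satisfy an order constraint on that attribute. The sketch below is an assumption-laden simplification, not the Prometheus algorithm: `RULE_EMPTY_ATTRS`, `ORDER_OPS`, and `relevant_sources` are hypothetical names, and only a subset of the query's constraints is used.

```python
# Attributes that each domain rule binds to '' (taken from the rules above).
RULE_EMPTY_ATTRS = {
    "MCS1File": {"lat", "lon", "temp", "windspeed"},
    "MCS2File": {"frequency", "amplitude"},
}

# Operators treated as order constraints in this sketch.
ORDER_OPS = {"<", "<=", ">", ">="}


def relevant_sources(query_constraints, rule_empty_attrs):
    """Keep only sources whose rule does not contradict the query.

    A source is dropped when the query puts an order constraint on an
    attribute that the source's rule fixes to the empty string.
    """
    constrained = {attr for attr, op, _ in query_constraints if op in ORDER_OPS}
    return [
        source
        for source, empty_attrs in rule_empty_attrs.items()
        if not (constrained & empty_attrs)
    ]


# A subset of the constraints in the example query.
query_constraints = [
    ("lat", ">", 33),
    ("lat", "<", 34),
    ("starttime", ">", 50000),
    ("endtime", "<", 60000),
]

print(relevant_sources(query_constraints, RULE_EMPTY_ATTRS))
```

With these constraints the check keeps only `MCS2File`, which agrees with the outcome stated on the slide; a query constraining only `starttime` and `endtime` would keep both sources.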

19 Artemis: Top level Selection

20 Artemis: Filtering

21 Evaluation
- Enabled users to query 12 different MCSs
  - Covering information from three different applications
    - LIGO, ESG, and Geo-spatial data warehouse
  - Covering 17,000 different files
  - Metadata consisted of about 300 different attributes
- Simulated addition of metadata to MCSs and failure of several MCSs while the system was running

22 Related Work
- MCS [Singh et al 03]
  - Organizes metadata about objects on the data grid
  - Object-oriented schema to support user-defined metadata attributes
  - Difficult for users to keep track of diverse attribute names
  - No semantic information is attached to the attributes
- Agent Wizard [Tuchinda et al. 2003]
  - Interactive application that guides the user by dividing complex tasks into a series of simpler question-answering tasks
  - Challenge is to model a complex task as a set of simpler subtasks
- Prometheus Mediator [Thakkar et al. 2004]
  - Data integration system that can efficiently integrate data from a wide variety of data sources
  - Key restriction is that the relational schema for data sources and domain must be known in advance

23 Related Work (Cont’d)
- myGrid [Wroe 2003]
  - Model data sources as semantic web services
  - Integration of data sources is represented as a workflow
  - Requires that data sources have fixed schema and associated semantics
- Model-based mediator system for scientific data management [Ludascher 2003]
  - Data sources provide semantic information regarding their data
  - The provided information is used to generate domain model for a mediator system
  - Assumption is that semantic information is provided by different data sources of interest

24 Conclusions
- Contributions:
  - Mixed-initiative approach to help scientists query objects on the data grid
  - Isolate users from heterogeneity of data sources
  - Manage distributed dynamic data
- Future Work:
  - Algorithm to determine when to dynamically generate domain model
  - Better support for specifying model mappings
  - Artemis available as a grid service
  - More extensive testing and usability studies

25 ?

