Presentation is loading. Please wait.

Presentation is loading. Please wait.

Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher San Diego Supercomputer Center Associate.

Similar presentations


Presentation on theme: "Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher San Diego Supercomputer Center Associate."— Presentation transcript:

1 Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher (ludaesch@sdsc.edu) San Diego Supercomputer Center Associate Professor Dept. of Computer Science & Genome Center University of California, Davis Fellow San Diego Supercomputer Center University of California, San Diego UC DAVIS Department of Computer Science

2 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 2 Outline Semantics & Scientific Data IntegrationSemantics & Scientific Data Integration Semantics & Scientific Workflow ManagementSemantics & Scientific Workflow Management ConclusionsConclusions

3 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 3 Anatomy of the Science Environment for Ecological Knowledge (SEEK) Collaboratory Domain Science DriverDomain Science Driver –Ecology (LTER), biodiversity, … Analysis & Modeling SystemAnalysis & Modeling System –Design & execution of ecological models & analysis ( “ scientific workflows ” ) –{application,upper}-ware  Kepler system Semantic Mediation SystemSemantic Mediation System –Data Integration of hard-to- relate sources and processes –Semantic Types and Ontologies –upper middleware  Sparrow Toolkit EcoGridEcoGrid –Access to ecology data and tools –{middle,under}-ware  unified API to SRB/MCAT, MetaCat, DiGIR, … datasets sample CS problem [DILS’04]

4 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 4 Common Collaboratories / Distributed Science / Cyberinfrastructure Pieces Seamless and uniform data access (“Data-Grid”)Seamless and uniform data access (“Data-Grid”) –data & metadata registry distributed and high performance computing platform (“Compute-Grid”)distributed and high performance computing platform (“Compute-Grid”) –service registry User-friendly workbench / problem-solving environmentUser-friendly workbench / problem-solving environment – scientific workflow system A common problem:A common problem: –integrating (or at least linking) data from multiple sites, investigators, communities, …, scales, …, species, …  Federated, integrated, mediated databases –often use of semantic extensions (e.g. ontologies)

5 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 5 Interoperability & Integration Challenges System aspects: “Grid” Middleware distributed data & computing, SOA web services, WSDL/SOAP, WSRF, OGSA, … sources = functions, files, data sets … Syntax & Structure: (XML-Based) Data Mediators wrapping, restructuring (XML) queries and views sources = (XML) databases Semantics: Model-Based/Semantic Mediators conceptual models and declarative views Knowledge Representation: ontologies, description logics (RDF(S),OWL...) sources = knowledge bases (DB+CMs+ICs) Synthesis: Scientific Workflow Design & Execution Composition of declarative and procedural components into larger workflows (re)sources = services, processes, actors, … Semantic extensions needed here as well!  reconciling S 5 heterogeneities  “gluing” together resources  bridging information and knowledge gaps computationally

6 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 6 Information Integration Challenges: S 4 Heterogeneities System aspectsSystem aspects –platforms, devices, data & service distribution, APIs, protocols, …  Grid middleware technologies + e.g. single sign-on, platform independence, transparent use of remote resources, … Syntax & StructureSyntax & Structure –heterogeneous data formats (one for each tool...) –heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs, flat files, …) –heterogeneous schemas (one for each DB...)  Database mediation and warehousing technologies + XML-based data exchange, integrated views, transparent query rewriting, … SemanticsSemantics –descriptive metadata, different terminologies, implicit assumptions & hidden semantics (“context”) of experiments, simulations, observation, …  Knowledge representation & semantic mediation technologies + “smart” data discovery & integration + e.g. ask about X (‘mafic’); find data about Y (‘diorite’); be happy anyways!

7 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 7 Information Integration Challenges: S 5 Heterogeneities Synthesis of applications, analysis tools, data & query components, … into “scientific workflows”Synthesis of applications, analysis tools, data & query components, … into “scientific workflows” –How to make use of these wonderful things & put them together to solve a scientist’s problem?  Scientific Problem Solving Environments (PSEs)  Portals,Workbench (“scientist’s view”, end user) + ontology-enhanced data registration, discovery, manipulation + creation and registration of new data products from existing ones, …  Scientific Workflow System (“engineer’s view”, tool maker) + for designing, re-engineering, deploying analysis pipelines and scientific workflows; a tool to make new tools … + e.g., creation of new datasets from existing ones, dataset registration, … Not discussed here: the “6th S”: Social challenges …

8 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 8 Our Focus Scientific Data Integration:Scientific Data Integration: –need DB/DI + KR (“semantic mediation”) Automation of Scientific Data Analysis, Process & Application IntegrationAutomation of Scientific Data Analysis, Process & Application Integration –need for scientific workflow systems –need for semantic extensions But first:But first: –Some data & information integration problems

9 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 9 An Online Shopper’s Information Integration Problem El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logicus-Philosophicus within a week?” ?InformationIntegration addall.com “One-World” Mediation amazon.comamazon.com A1books.comA1books.com half.comhalf.com barnes&noble.combarnes&noble.com Mediator (virtual DB) (vs. Datawarehouse) NOTE: non-trivial data engineering challenges!

10 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 10 A Home Buyer’s Information Integration Problem What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population? ? Information Integration Realtor Demographics School Rankings Crime Stats “Multiple-Worlds” Mediation

11 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 11 Information Integration from a Database Perspective Information Integration ProblemInformation Integration Problem –Given: data sources S 1,..., S k (databases, web sites,...) and user questions Q 1,..., Q n that can –in principle– be answered using the information in the S i –Find: the answers to Q 1,..., Q n The Database Perspective: source = “database”The Database Perspective: source = “database”  S i has a schema (relational, XML, OO,...)  S i can be queried  define virtual (or materialized) integrated (or global) view G over local sources S 1,..., S k using database query languages (SQL, XQuery,...)  questions become queries Q i against G(S 1,..., S k )

12 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 12 5. Post processing 5. Post processing 2. Query rewriting Standard Mediator Architecture MEDIATOR Integrated Global (XML) View G Integrated View Definition G(..)  S 1 (..)…S k (..) USER/Client USER/Client 1. Query Q ( G (S 1,..., S k ) ) 1. Query Q ( G (S 1,..., S k ) ) S1S1 Wrapper (XML) View S2S2 Wrapper (XML) View SkSk Wrapper (XML) View web services as wrapper APIs 3. Q1 Q2 Q3 4. {answers(Q1)} {answers(Q2)} {answers(Q3)} 6. {answers(Q)}

13 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 13 Query Planning in Data Integration Given:Given: –Declarative user query Q: answer(…)  …G... –… & { G  … S … } global-as-view (GAV) –… & { S  … G … } local-as-view (LAV) –… & { ic(…)  … S … G… } integrity constraints (ICs) Find:Find: –equivalent (or minimal containing, maximal contained) query plan Q’: answer(…)  … S …  query rewriting (logical/calculus, algebraic, physical levels) Results:Results: –A variety of results/algorithms; depending on classes of queries, views, and ICs: P, NP, …, undecidable –hot research area in core CS (database community)

14 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 14 A Neuroscientist’s Information Integration Problem What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents? ? Information Integration protein localization (NCMIR) neurotransmission (SENSELAB) sequence info (CaPROT) morphometry (SYNAPSE) “Complex Multiple-Worlds” Mediation Biomedical Informatics Research Network http://nbirn.net Inter-source links: unclear for the non-scientists hard for the scientist

15 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 15

16 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 16 Scientific Data Integration using Semantic Extensions

17 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 17 Example: Geologic Map Integration Given:Given: –Geologic maps from different state geological surveys (shapefiles w/ different data schemas) –Different ontologies: Geologic age ontology (e.g. USGS) Rock classification ontologies: –Multiple hierarchies (chemical, fabric, texture, genesis) from Geological Survey of Canada (GSC) –Single hierarchy from British Geological Survey (BGS) Problem:Problem: –Support uniform queries across all maps –… possibly using different ontologies –Support registration w/ ontology A, querying w/ ontology B

18 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 18 Integration Schema Schema Integration (“registering” local schemas to the global schema) Arizona Colorado Utah Nevada Wyoming New Mexico Montana E. Idaho Montana West Formation… Age… Formation…Age… Formation…Age… Formation…Age… Formation…Age… Formation…Age… Formation…Age… …Formation…Age …Composition …Fabric …Texture …Formation…Age …Composition …Fabric …TextureABBREV PERIOD NAME PERIOD TYPE TIME_UNIT FMATN PERIOD NAME PERIOD NAME FORMATION PERIOD FORMATION LITHOLOGY AGE andesitic sandstone Livingston formation Tertiary- Cretaceous Sources

19 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 19 Multihierarchical Rock Classification “Ontology” (Taxonomies) for “Thematic Queries” (GSC) Composition Genesis Fabric Texture

20 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 20 Ontology-Enabled Application Example: Geologic Map Integration Show formations where AGE = ‘Paleozic’ (without age ontology) Show formations where AGE = ‘Paleozic’ (without age ontology) Show formations where AGE = ‘Paleozic’ (with age ontology) Show formations where AGE = ‘Paleozic’ (with age ontology) +/- a few hundred million years domain knowledge domain knowledge Knowledge representation Geologic Age ONTOLOGY Nevada

21 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 21 Querying by Geologic Age …

22 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 22 Querying by Geologic Age: Results

23 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 23 Semantic Mediation (via “semantic registration” of schemas and ontology articulations) Schema elements and/or data values are associated with concept expressions from the target ontologySchema elements and/or data values are associated with concept expressions from the target ontology  conceptual queries “through” the ontology Articulation ontologyArticulation ontology  source registration to A, querying through B Semantic mediation: query rewriting w/ ontologiesSemantic mediation: query rewriting w/ ontologies Database1 Database2 Ontology A Ontology B semantic registration ontology articulations Concept-based (“semantic”) queries semantic registration

24 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 24 Different views on State Geological Maps

25 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 25 Sedimentary Rocks: BGS Ontology

26 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 26 Sedimentary Rocks: GSC Ontology

27 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 27 Some Thoughts … Translate this idea of multiple conceptual (ontology) views to your domain!Translate this idea of multiple conceptual (ontology) views to your domain! –e.g. datasets  biological pathways registration Your data is valuable (time & $$$ spent in producing it)Your data is valuable (time & $$$ spent in producing it)  data (re-)usability Metadata helps to discover, localize, assess relevant data sets, given particular scientific questions & queriesMetadata helps to discover, localize, assess relevant data sets, given particular scientific questions & queries Does your system “understand” what to do with the metadata?Does your system “understand” what to do with the metadata? Capturing more semantics of a data set in a way that humans and systems can exploit it is an investment in reusabilityCapturing more semantics of a data set in a way that humans and systems can exploit it is an investment in reusability –“We are producing more and more data” –Today “we can store everything!” –But can we use anything? (i.e., is anyone looking at the data after the initial creation?) Design system, interfaces, data and metadata models with reusability in mind (think archives and “time capsules”)Design system, interfaces, data and metadata models with reusability in mind (think archives and “time capsules”) This may even be pushed to the experiment/simulation/workflow design…This may even be pushed to the experiment/simulation/workflow design…

28 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 28 Data Semantics and Ontologies should be useful for Humans and “The Machine”

29 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 29 formalized as domain map/ontology Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. Dendritic spines are ion (calcium) regulating components. Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release). domain expert knowledge Made usable for the system using Description Logic Example: Domain Knowledge to “glue” SYNAPSE & NCMIR Data

30 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 30 “Semantic Source Browsing”: Domain Maps/Ontologies (left) & conceptually linked data (right)

31 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 31 A Semantic Mediation Result View A Semantic Mediation Result View

32 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 32 Source Contextualization through Ontology Refinement In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map...  sources can register new concepts at the mediator...  increase your data usability

33 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 33 Outline Semantics & Scientific Data IntegrationSemantics & Scientific Data Integration Semantics & Scientific Workflow ManagementSemantics & Scientific Workflow Management ConclusionsConclusions

34 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 34 What is a Scientific Workflow (SWF)? Goals:Goals: –automate a scientist’s repetitive data management and analysis tasks –typical phases: data access, scheduling, generation, transformation, aggregation, analysis, visualization  design, test, share, deploy, execute, reuse, … SWFs

35 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 35 Promoter Identification Workflow Source: Matt Coleman (LLNL)

36 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 36 Source: NIH BIRN (Jeffrey Grethe, UCSD)

37 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 37 Ecology: GARP Analysis Pipeline for Invasive Species Prediction Training sample (d) GARP rule set (e) Test sample (d) Integrated layers (native range) (c) Species presence & absence points (native range) (a) EcoGrid Query EcoGrid Query Layer Integration Layer Integration Sample Data + A3 + A2 + A1 Data Calculation Map Generation Validation User Validation Map Generation Integrated layers (invasion area) (c) Species presence &absence points (invasion area) (a) Native range prediction map (f) Model quality parameter (g) Environmental layers (native range) (b) Generate Metadata Archive To Ecogrid Registered Ecogrid Database Registered Ecogrid Database Registered Ecogrid Database Registered Ecogrid Database Environmental layers (invasion area) (b) Invasion area prediction map (f) Model quality parameter (g) Selected prediction maps (h) Source: NSF SEEK (Deana Pennington et. al, UNM)

38 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 38

39 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 39 Commercial & Open Source Scientific “Workflow” (often Dataflow) Systems Kensington Discovery Edition from InforSense Taverna Triana

40 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 40 SCIRun: Problem Solving Environments for Large-Scale Scientific Computing SCIRun: PSE for interactive construction, debugging, and steering of large-scale scientific computations and visualizationsSCIRun: PSE for interactive construction, debugging, and steering of large-scale scientific computations and visualizations Component model, based on generalized dataflow programmingComponent model, based on generalized dataflow programming Steve Parker (cs.utah.edu)

41 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 41 Ptolemy II see!see! try!try! read!read! Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

42 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 42 Why Ptolemy II (and thus KEPLER)? Ptolemy II Objective:Ptolemy II Objective: –“The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.” Dataflow Process Networks w/ natural support for abstraction, pipelining (streaming) actor-orientation, actor reuseDataflow Process Networks w/ natural support for abstraction, pipelining (streaming) actor-orientation, actor reuse User-OrientationUser-Orientation –Workflow design & exec console (Vergil GUI) –“Application/Glue-Ware” excellent modeling and design support run-time support, monitoring, … not a middle-/underware (we use someone else’s, e.g. Globus, SRB, …) but middle-/underware is conveniently accessible through actors! PRAGMATICSPRAGMATICS –Ptolemy II is mature, continuously extended & improved, well-documented (500+pp) –open source system –Ptolemy II folks actively participate in KEPLER

43 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 43 KEPLER/CSP: Contributors, Sponsors, Projects (or loosely coupled Communicating Sequential Persons ;-) Ilkay Altintas SDM, Resurgence Kim Baldridge Resurgence, NMI Chad Berkley SEEK Shawn Bowers SEEK Terence Critchlow SDM Tobin Fricke ROADNet Jeffrey Grethe BIRN Christopher H. Brooks Ptolemy II Zhengang Cheng SDM Dan Higgins SEEK Efrat Jaeger GEON Matt Jones SEEK Werner Krebs, EOL Edward A. Lee Ptolemy II Kai Lin GEON Bertram Ludaescher SDM, SEEK, GEON, BIRN, ROADNet Mark Miller EOL Steve Mock NMI Steve Neuendorffer Ptolemy II Jing Tao SEEK Mladen Vouk SDM Xiaowen Xin SDM Yang Zhao Ptolemy II Bing Zhu SEEK Ptolemy II www.kepler-project.org

44 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 44 KEPLER: An Open Collaboration Initiated by members from DOE SDM/SPA and NSF SEEK; now several other projects (GEON, Ptolemy II, EOL, Resurgence/NMI, …)Initiated by members from DOE SDM/SPA and NSF SEEK; now several other projects (GEON, Ptolemy II, EOL, Resurgence/NMI, …) Open Source (BSD-style license)Open Source (BSD-style license) Intensive Communications:Intensive Communications: –Web-archived mailing lists –IRC (!) –Meetings, Hackathons Co-development:Co-development: –via shared CVS repository –joining as a new co-developer (currently): get a CVS account (read-only) local development + contribution via existing KEPLER member be voted “in” as a member/co-developer Software & social engineeringSoftware & social engineering –How to better accommodate new groups/communities? –How to better accommodate different usage/contribution models (core dev … special purpose extender … user)?

45 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 45 Ptolemy II/KEPLER GUI (Vergil) “Directors” define the component interaction & execution semantics Large, polymorphic component (“Actors”) and Directors libraries (drag & drop)

46 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 46 Web Services  Actors (WS Harvester) 1 2 3 4  “Minute-made” (MM) WS-based application integration Similarly: MM workflow design & sharing w/o implemented components

47 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 47 Rapid Web Service-based Prototyping (Here: ROADNet Command & Control Services for LOOKING Kick-Off Mtg) Source: Ilkay Altintas, SDM, NLADR ROADNet: Vernon, Orcutt et al Web services: Tony Fountain et al Source: Ilkay Altintas, SDM, NLADR ROADNet: Vernon, Orcutt et al Web services: Tony Fountain et al

48 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 48 An “early” example: Promoter Identification SSDBM, AD 2003 Scientist models application as a “workflow” of connected components (“actors”) If all components exist, the workflow can be automated/ executed Different directors can be used to pick appropriate execution model (often “pipelined” execution: PN director)

49 PIW Workflow Today

50 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 50 Enter initial inputs, Run and Display results “Run Window”

51 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 51 Custom Output Visualizer

52 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 52 Job Management (here: NIMROD) Job management infrastructure in place Results database: under development Goal: 1000’s of GAMESS jobs (quantum mechanics)

53 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 53 Some Recent Actor Additions

54 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 54 in KEPLER (w/ editable script) Source: Dan Higgins, Kepler/SEEK

55 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 55 in KEPLER (interactive session) Source: Dan Higgins, Kepler/SEEK

56 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 56 Blurring Design (ToDo) and Execution

57 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 57 Some Scientific Workflow Challenges Typical FeaturesTypical Features –data-intensive and/or compute-intensive –plumbing-intensive (consecutive web services won’t fit) –dataflow-oriented –distributed (remote data, remote processing) –user-interaction “in the middle”, … –… vs. (C-z; bg; fg)-ing (“detach” and reconnect) –advanced programming constructs (map(f), zip, takewhile, …) –logging, provenance, “registering back” (intermediate) products…

58 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 58 Scientific Workflows & Semantics Registering data to ontologies: semantic types (in addition to structural data types)Registering data to ontologies: semantic types (in addition to structural data types) Smarter data set discovery & integrationSmarter data set discovery & integration Now also:Now also: –Smarter workflow design –More “intelligent” (semantics-aware) component composition –Improved (re-)usability of data, services (actors), and workflows –Given semantic type of my input ports, what other data sets / actors produce such input

59 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 59 Reengineering a Geoscientist’s Mineral Classification Workflow Add semantic types to ports!!

60 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 60 Beginnings: Ontology-based Actor/Service Discovery Ontology based actor (service) and dataset search Result Display

61 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 61 Semantics & Scientific Workflows Data comes from heterogeneous sources –Real-world observations –Spatial-temporal contexts –Collection/measurement protocols and procedures –Many representations for the same information (count, area, density) –Schematically heterogeneous Data discovered and “synthesized” manually Hard to reuse/repurpose existing analytical steps (another form of heterogeneity)

62 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 62 The KR/SMS “Waterfall” … Ontologies Semantic Annotation Iterative Development Resource Discovery Resource Integration Workflow Planning Workflow Analysis Source: [Bowers-SEEK-AHM-04]

63 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 63 A KR+DI+Scientific Workflow Problem Services can be semantically compatible, but structurally incompatibleServices can be semantically compatible, but structurally incompatible Source Service Source Service Target Service Target Service PsPs PtPt Semantic Type P s Semantic Type P t Structural Type P t Structural Type P s Desired Connection Incompatible Compatible (⋠)(⋠) (⊑)(⊑) (Ps)(Ps) (Ps)(Ps)  (≺)(≺) Ontologies (OWL) Source: [Bowers-Ludaescher, DILS’04]

64 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 64 Ontology-Informed Data Transformation (“Structure-Shim”) Source Service Source Service Target Service Target Service PsPs PtPt Semantic Type P s Semantic Type P t Structural Type P t Structural Type P s Desired Connection Compatible (⊑)(⊑) Registration Mapping (Output) Registration Mapping (Input) Correspondence Generate (Ps)(Ps) (Ps)(Ps) Ontologies (OWL) Transformation Source: [Bowers-Ludaescher, DILS’04]

65 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 65 Outline Scientific Data IntegrationScientific Data Integration Scientific Workflow ManagementScientific Workflow Management Musings & ConclusionsMusings & Conclusions

66 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 66 Some Thoughts … Translate this idea of multiple conceptual (ontology) views to your domain!Translate this idea of multiple conceptual (ontology) views to your domain! –e.g. datasets  biological pathways registration Your data is valuable (time & $$$ spent in producing it)Your data is valuable (time & $$$ spent in producing it)  data (re-)usability Metadata helps to discover, localize, assess relevant data sets, given particular scientific questions & queriesMetadata helps to discover, localize, assess relevant data sets, given particular scientific questions & queries Does your system “understand” what to do with the metadata?Does your system “understand” what to do with the metadata? Capturing more semantics of a data set in a way that humans and systems can exploit it is an investment in reusabilityCapturing more semantics of a data set in a way that humans and systems can exploit it is an investment in reusability –“We are producing more and more data” –Today “we can store everything!” –But can we use anything? (i.e., is anyone looking at the data after the initial creation?) Design system, interfaces, data and metadata models with reusability in mind (think archives and “time capsules”)Design system, interfaces, data and metadata models with reusability in mind (think archives and “time capsules”) This may even be pushed to the experiment/simulation/workflow design…This may even be pushed to the experiment/simulation/workflow design…

67 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 67 The Future We start to see the benefits of semantic technologies in scientific data managementWe start to see the benefits of semantic technologies in scientific data management –BTW: semantic technologies have been there for a while! think conceptual models, ER diagrams, … or Gottlob Frege (German mathematician, logician, philosopher; 1848-1925) Today: momentum through “Semantic Web”, “Semantic Grid” Where will semantics lead us in 10, 20, 50 years?Where will semantics lead us in 10, 20, 50 years?

68 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 68 A 50 year forecast in retrospective (even if a hoax … you get the idea…)

69 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 69 KEPLER – a Collaboration Example A grass-roots projectA grass-roots project –Needed a coalition of the (really!) willing –People matter! Intra-project linksIntra-project links –e.g. in SEEK: AMS  SMS  EcoGrid Inter-project linksInter-project links –SEEK ITR, GEON ITR, ROADNet ITRs, DOE SciDAC SDM, Ptolemy II, NIH BIRN (coming we hope …), UK eScience myGrid, … Inter-technology linksInter-technology links –Globus, SRB, JDBC, web services, soaplab services, command line tools, R, GRASS, XSLT, … Interdisciplinary linksInterdisciplinary links –CS, IT, domain sciences, … (recently: usability engineer)

70 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 70 GEON Dataset Generation & Registration (a co-development in KEPLER) Xiaowen (SDM) Edward et al.(Ptolemy) Yang (Ptolemy) Efrat (GEON) Ilkay (SDM) SQL database access (JDBC) Matt,Chad, Dan et al. (SEEK) % Makefile $> ant run % Makefile $> ant run

71 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 71 Summary/Lessons Learned Semantics mattersSemantics matters Collaboration tools neededCollaboration tools needed –CVS repositories (+cvsview, webcvs) –Mailing lists (e.g. mailman  googlified) –Bugzilla (detailed tracking of tech. issues & bugs) –WIKI (community authored web resource, e.g. high-level tech. issues) People matterPeople matter Repositories matterRepositories matter –EcoGrid (SEEK) registry, GEON registry, BIRN registry  KEPLER actor & datasets repository, … –UDDI what? “Melting Pots”: “Melting Pots”: –Places, projects, organizations (GGF), tools: National Labs, …, SDSC, NCEAS, LTER, NLADR (w/ NCSA), KU Specify, …, new Genome Center@UC Davis (moving in …), … SDM, BIRN, GEON, SEEK, … Kepler, …

72 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 72 Q & A

73 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 73 Further Reading under review – available upon request from ludaesch@sdsc.edu

74 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 74 Related Publications Semantic Data Registration and IntegrationSemantic Data Registration and Integration On Integrating Scientific Resources through Semantic Registration, S. Bowers, K. Lin, and B. Ludäscher, 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), 21-23 June 2004, Santorini Island, Greece.On Integrating Scientific Resources through Semantic Registration, S. Bowers, K. Lin, and B. Ludäscher, 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), 21-23 June 2004, Santorini Island, Greece.On Integrating Scientific Resources through Semantic RegistrationSSDBM'04On Integrating Scientific Resources through Semantic RegistrationSSDBM'04 A System for Semantic Integration of Geologic Maps via Ontologies, K. Lin and B. Ludäscher. In Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003.A System for Semantic Integration of Geologic Maps via Ontologies, K. Lin and B. Ludäscher. In Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003.A System for Semantic Integration of Geologic Maps via OntologiesSCISWA System for Semantic Integration of Geologic Maps via OntologiesSCISW Towards a Generic Framework for Semantic Registration of Scientific Data, S. Bowers and B. Ludäscher. In Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003.Towards a Generic Framework for Semantic Registration of Scientific Data, S. Bowers and B. Ludäscher. In Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003.Towards a Generic Framework for Semantic Registration of Scientific DataSCISWTowards a Generic Framework for Semantic Registration of Scientific DataSCISW The Role of XML in Mediated Data Integration Systems with Examples from Geological (Map) Data Interoperability, B. Brodaric, B. Ludäscher, and K. Lin. In Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003.The Role of XML in Mediated Data Integration Systems with Examples from Geological (Map) Data Interoperability, B. Brodaric, B. Ludäscher, and K. Lin. In Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003.The Role of XML in Mediated Data Integration Systems with Examples from Geological (Map) Data InteroperabilityThe Role of XML in Mediated Data Integration Systems with Examples from Geological (Map) Data Interoperability Semantic Mediation Services in Geologic Data Integration: A Case Study from the GEON Grid, K. Lin, B. Ludäscher, B. Brodaric, D. Seber, C. Baru, and K. A. Sinha. In Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003.Semantic Mediation Services in Geologic Data Integration: A Case Study from the GEON Grid, K. Lin, B. Ludäscher, B. Brodaric, D. Seber, C. Baru, and K. A. Sinha. In Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003.Semantic Mediation Services in Geologic Data Integration: A Case Study from the GEON GridSemantic Mediation Services in Geologic Data Integration: A Case Study from the GEON Grid Query Planning and RewritingQuery Planning and Rewriting Processing First-Order Queries under Limited Access Patterns, Alan Nash and B. Ludäscher, Proc. 23rd ACM Symposium on Principles of Database Systems (PODS'04) Paris, France, June 2004.Processing First-Order Queries under Limited Access Patterns, Alan Nash and B. Ludäscher, Proc. 23rd ACM Symposium on Principles of Database Systems (PODS'04) Paris, France, June 2004.Processing First-Order Queries under Limited Access PatternsPODS'04Processing First-Order Queries under Limited Access PatternsPODS'04 Processing Unions of Conjunctive Queries with Negation under Limited Access Patterns, Alan Nash and B. Ludäscher., 9th Intl. Conference on Extending Database Technology (EDBT'04) Heraklion, Crete, Greece, March 2004, LNCS 2992.Processing Unions of Conjunctive Queries with Negation under Limited Access Patterns, Alan Nash and B. Ludäscher., 9th Intl. Conference on Extending Database Technology (EDBT'04) Heraklion, Crete, Greece, March 2004, LNCS 2992.Processing Unions of Conjunctive Queries with Negation under Limited Access PatternsEDBT'04Processing Unions of Conjunctive Queries with Negation under Limited Access PatternsEDBT'04 Web Service Composition Through Declarative Queries: The Case of Conjunctive Queries with Union and Negation, B. Ludäscher and Alan Nash. Research abstract (poster), 20th Intl. Conference on Data Engineering (ICDE'04) Boston, IEEE Computer Society, April 2004.Web Service Composition Through Declarative Queries: The Case of Conjunctive Queries with Union and Negation, B. Ludäscher and Alan Nash. Research abstract (poster), 20th Intl. Conference on Data Engineering (ICDE'04) Boston, IEEE Computer Society, April 2004.Web Service Composition Through Declarative Queries: The Case of Conjunctive Queries with Union and NegationICDE'04Web Service Composition Through Declarative Queries: The Case of Conjunctive Queries with Union and NegationICDE'04

75 DOE NC Meeting, Boulder, Dec 1 st 2004 Semantic Technologies in SDM, B.Ludäscher 75 Related Publications Scientific WorkflowsScientific Workflows Kepler: An Extensible System for Design and Execution of Scientific Workflows, I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludäscher, S. Mock, 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), 21-23 June 2004, Santorini Island, Greece.Kepler: An Extensible System for Design and Execution of Scientific Workflows, I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludäscher, S. Mock, 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), 21-23 June 2004, Santorini Island, Greece.Kepler: An Extensible System for Design and Execution of Scientific WorkflowsSSDBM'04Kepler: An Extensible System for Design and Execution of Scientific WorkflowsSSDBM'04 Kepler: Towards a Grid-Enabled System for Scientific Workflows, Ilkay Altintas, Chad Berkley, Efrat Jaeger, Matthew Jones, Bertram Ludäscher, Steve Mock, Workflow in Grid Systems (GGF10), Berlin, March 9th, 2004.Kepler: Towards a Grid-Enabled System for Scientific Workflows, Ilkay Altintas, Chad Berkley, Efrat Jaeger, Matthew Jones, Bertram Ludäscher, Steve Mock, Workflow in Grid Systems (GGF10), Berlin, March 9th, 2004.Kepler: Towards a Grid-Enabled System for Scientific WorkflowsWorkflow in Grid Systems (GGF10)Kepler: Towards a Grid-Enabled System for Scientific WorkflowsWorkflow in Grid Systems (GGF10) An Ontology-Driven Framework for Data Transformation in Scientific Workflows, S. Bowers and B. Ludäscher, Intl. Workshop on Data Integration in the Life Sciences (DILS'04), March 25-26, 2004 Leipzig, Germany, LNCS 2994.An Ontology-Driven Framework for Data Transformation in Scientific Workflows, S. Bowers and B. Ludäscher, Intl. Workshop on Data Integration in the Life Sciences (DILS'04), March 25-26, 2004 Leipzig, Germany, LNCS 2994.An Ontology-Driven Framework for Data Transformation in Scientific WorkflowsDILS'04An Ontology-Driven Framework for Data Transformation in Scientific WorkflowsDILS'04 A Web Service Composition and Deployment Framework for Scientific Workflows, I. Altintas, E. Jaeger, K. Lin, B. Ludaescher, A. Memon, In the 2nd Intl. Conference on Web Services (ICWS), San Diego, California, July 2004.A Web Service Composition and Deployment Framework for Scientific Workflows, I. Altintas, E. Jaeger, K. Lin, B. Ludaescher, A. Memon, In the 2nd Intl. Conference on Web Services (ICWS), San Diego, California, July 2004.ICWS


Download ppt "Semantic Technologies: Towards Making a Difference in Scientific Data Management Bertram Ludäscher San Diego Supercomputer Center Associate."

Similar presentations


Ads by Google