Presentation is loading. Please wait.

Presentation is loading. Please wait.

Dimitrios Skoutas Alkis Simitsis

Similar presentations


Presentation on theme: "Dimitrios Skoutas Alkis Simitsis "— Presentation transcript:

1 Ontology-based Conceptual Design of ETL Processes for both Structured and Semi-structured Data
Dimitrios Skoutas Alkis Simitsis National Technical University of Athens Dept. of Electrical and Computer Engineering

2 Outline Introduction Graph-based Datastore Representation
Application Ontology Construction and Representation Datastore Annotation ETL Transformations Conclusions D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

3 Outline Introduction Graph-based Datastore Representation
Application Ontology Construction and Representation Datastore Annotation ETL Transformations Conclusions D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

4 Extract-Transform-Load (ETL)

5 Problem description Conceptual design of ETL processes Two main goals
is a critical task performed at the early stages of a DW project describe the integration of data from heterogeneous sources into the Data Warehouse Two main goals specify inter-schema mappings identify appropriate transformations

6 Motivation The problem of heterogeneity in data sources
structural heterogeneity data stored under different schemata semantic heterogeneity different naming conventions e.g., homonyms, synonyms different representation formats e.g., units of measurement, currencies, encodings different ranges of values

7 Overview of our approach
Key idea an ontology-based approach to facilitate the conceptual design of an ETL scenario An ontology is a “formal, explicit specification of a shared conceptualization” describes the knowledge in a domain in terms of classes, properties, and relationships between them machine processable formal semantics reasoning mechanisms The Web Ontology Language (OWL) is used as the language for the ontology W3C recommendation based on Description Logics D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

8 Overview of our approach
Method Construct a graph representation for each datastore datastore graph Construct a suitable application ontology ontology graph Annotate the datastores Establish mappings between the datastore graph and the ontology graph Apply reasoning techniques to select relevant sources to identify required transformations D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

9 Outline Introduction Graph-based Datastore Representation
Application Ontology Construction and Representation Datastore Annotation ETL Transformations Conclusions D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

10 Datastore schema The schema SD of a datastore comprises
elements containing the actual data elements containing or referring to other elements D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

11 Datastore graph Each element e defined in the schema SD is represented by a node ve ∈ VD. Each containment relationship between elements e1, e2 is represented by an edge (v1, v2). Each reference from element e1 to element e2 is represented by an edge (v1, v2). Each edge is assigned a label of the form [min, max] denoting the corresponding cardinality. Elements containing the actual data are represented by leaf nodes D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

12 Reference example Datastore Schema DW
PARTSUP(pkey, supplier, quantity, cost, city, address, date) DS1 PS(pid, sid, department, address, date, cost, qty) DS2 D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

13 Reference example (cont’d)
Datastore graphs D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

14 Outline Introduction Graph-based Datastore Representation
Application Ontology Construction and Representation Datastore Annotation ETL Transformations Conclusions D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

15 Application Ontology A suitable application ontology is constructed to model the concepts of the domain the relationships between those concepts the attributes characterizing each concept the different representation formats and (ranges of) values for each attribute D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

16 Application Ontology The application ontology comprises
a set of classes C = CC ∪ CT ∪ CG CC : classes representing domain concepts CT : classes representing value types CG : classes representing aggregate functions a set of properties P containing PP : properties representing attributes of concepts or relationships between concepts property: convertsTo property: aggregates property: groups D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

17 Ontology Graph A graph representation specified for the ontology
Graph nodes represent classes in the ontology Graph edges represent properties in the ontology Different symbols are used for the different types of classes and properties D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

18 Ontology Graph D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

19 Reference example (cont’d)
The application ontology graph D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

20 Outline Introduction Graph-based Datastore Representation
Application Ontology Construction and Representation Datastore Annotation ETL Transformations Conclusions D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

21 Datastore annotation The semantic annotation of each datastore consists in establishing the appropriate mappings between the datastore graph GS and the ontology graph GO. Each internal node of GS may be mapped to one concept-node of GO. A leaf node of GS may be mapped to one or more nodes of GO of the following types: type-node format-node range-node aggregated-node A node may have zero or more mappings. Mappings are represented as node labels. D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

22 Datastore annotation A defined class is created in the ontology for each internal labeled node of the datastore graph. The definition for a node is constructed based on its neighbor labeled nodes. A neighbor labeled node of n is each node n΄ such that: n΄ is labeled there is a path p in the datastore graph from node n to node n΄ p contains no other labeled nodes, except n and n΄ D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

23 Reference example (cont’d)
Datastore mappings D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

24 Reference example (cont’d)
Datastore definitions D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

25 Outline Introduction Graph-based Datastore Representation
Application Ontology Construction and Representation Datastore Annotation ETL Transformations Conclusions D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

26 ETL Transformations Generic types of ETL transformations
D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

27 Generating ETL transformations
Two main steps select relevant sources to populate each DW element identify required data transformations between the sources and the DW D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

28 Generating ETL transformations
Selecting relevant sources a source node nS, mapped to class cS a target node nT, mapped to class cT nS is provider for nT, if cS and cT have a common superclass ensures that the integrated data records have the same semantics cS and cT are not disjoint prevents data integration between datastores with conflicting constraints

29 Generating ETL transformations
Identifying data transformations (I) a RETRIEVE operation for each provider node n a MERGE operation to combine data from several provider nodes an EXTRACT operation to extract a portion of data from a provider node

30 Generating ETL transformations
Identifying data transformations (II) if CS ≡ CT or CS ⊏ CT , no transformations are required if CT ⊏ CS, AGGREGATE, FILTER and/or MINCARD/MAXCARD operations are required else, as previous plus CONVERT operations

31 Generating ETL transformations
Identifying data transformations (III) a JOIN operation to combine recordsets from nodes, whose corresponding classes are related by a property. a UNION operation, followed by a DD operation, to combine recordsets from nodes, whose corresponding classes have a common superclass. a STORE operation to denote loading of data to the target datastore.

32 Reference example (cont’d)
Provider nodes and transformations for DS2 D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

33 Outline Introduction Graph-based Datastore Representation
Application Ontology Construction and Representation Datastore Annotation ETL Transformations Conclusions D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

34 Conclusions A graph-based representation, datastore graph, as a common model for the datastores. A suitable application ontology and a corresponding graph representation, ontology graph. Datastore annotation through mappings from the datastore graph to the ontology graph. Reasoning on the mappings to identify relevant sources and required transformations.

35 Current and Future Work
Semi-automatic construction of the application ontology Semi-automatic annotation of the datastores Executable workflow Evaluation on real-world ETL scenarios Maintenance/adaptation of the ETL workflow D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

36 Thank You D. Skoutas, A. DOLAP 2006, Arlington, VA, USA, 10/11/2005

37 Questions


Download ppt "Dimitrios Skoutas Alkis Simitsis "

Similar presentations


Ads by Google