
1 Towards a Benchmark for ETL Workflows
Panos Vassiliadis, Anastasios Karagiannis, Vasiliki Tziovara (Univ. of Ioannina); Alkis Simitsis (Almaden Research Center)

2 Goal of this work
The goal of this paper is to contribute towards a benchmark that can serve as the experimental testbed for the assessment of ETL methods or tools concerning their basic behavioral properties (measures) over a broad range of ETL workflows.

3 Roadmap
– Motivation
– Goals & parameters of the benchmark
– Micro-macro view of an ETL workflow
– Open issues

4 Roadmap
– Motivation
– Goals & parameters of the benchmark
– Micro-macro view of an ETL workflow
– Open issues

5 ETL workflows
[Figure: an example ETL workflow from sources to warehouse. Data from S1_PARTSUPP and S2_PARTSUPP are transferred via FTP1/FTP2 into a data staging area (DSA); DIFF1/DIFF2 compare DS.PS_NEW against DS.PS_OLD on PKEY to isolate new records; surrogate keys are assigned (Add_SPK1/Add_SPK2 with SUPPKEY, via lookups SK1/SK2); further activities perform NotNULL and date-format checks (A2EDate), $-to-€ cost conversion ($2€), date stamping (AddDate, COSTDATE=SYSDATE) and quantity checks (CheckQTY, QTY>0), with rejected rows logged at each step; a union U loads DW.PARTSUPP, from which Aggregate1 (PKEY, DAY, MIN(COST)) and Aggregate2 (PKEY, MONTH, AVG(COST)) populate the views V1 and V2.]

6 Off-line vs. Active ETL
Off-line mode: the workflow is executed during a specific time window of some hours (typically at night), when the systems are not servicing their end-users.
– Due to the low load of both the source systems and the warehouse, both the refreshment of data and any other administrative activities (cleanups, auditing, etc.) are easier to complete.
Active mode: the sources continuously (with high frequency) try to send new data to the warehouse.
– Not necessarily instantly; rather, small groups of data are collected and sent to the warehouse for further processing.
Differences of the two modes:
– frequency of the workflow execution
– load incurred on all the systems involved.
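
To make the active mode concrete, here is a minimal sketch (Python; the `extract_changes` and `load_to_warehouse` helpers are hypothetical, not part of any real tool) of how small groups of source changes are collected into micro-batches rather than shipped tuple by tuple:

```python
import time

BATCH_SIZE = 100      # flush when this many changes have accumulated (assumed value)
MAX_WAIT_SECS = 60    # ...or when the oldest change has waited this long (assumed value)

def active_mode_loop(extract_changes, load_to_warehouse):
    """Collect source changes into small micro-batches and push them
    to the warehouse, as in the 'active' (near-real-time) ETL mode."""
    batch, batch_started = [], None
    while True:
        for change in extract_changes():          # poll the source for new rows
            if batch_started is None:
                batch_started = time.time()
            batch.append(change)
        too_big = len(batch) >= BATCH_SIZE
        too_old = batch_started and time.time() - batch_started > MAX_WAIT_SECS
        if batch and (too_big or too_old):        # flush policy: size or age
            load_to_warehouse(batch)              # one small load, not one per tuple
            batch, batch_started = [], None
        time.sleep(1)                             # high frequency, but not instant
```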

7 Problem & Motivation
When experimenting with ETL workflows:
– what are the important problem parameters & what are the realistic values for them?
– what test suites should we use?
We have faced the problem twice so far:
– Logical optimization of the ETL process (transposition of activities to speed up the workflow – ICDE05, TKDE05)
– Physical optimization of the ETL process (which physical operators to use for optimal execution of the process – DOLAP07)

8 Logical Optimization
– Can we push selection early enough?
– Can we aggregate before $2€ (the $-to-€ conversion) takes place?
– How about naming conflicts?
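
To see why pushing selection early is attractive, consider this minimal sketch (Python; our own illustration, not the authors' optimizer): a selection that does not read the attribute a row-level activity rewrites can be swapped with it, and applying it first shrinks the input of every later activity without changing the result.

```python
rows = [{"pkey": 1, "qty": 5, "cost": 10.0},
        {"pkey": 2, "qty": 0, "cost": 7.0}]

def check_qty(rs):                 # selection: QTY > 0
    return [r for r in rs if r["qty"] > 0]

def dollars_to_euros(rs):          # row-level activity: rewrites COST only
    return [{**r, "cost": r["cost"] * 0.92} for r in rs]   # assumed exchange rate

# CheckQTY only reads 'qty' and $2€ only writes 'cost', so the two
# activities commute: the selection can be pushed before the conversion.
late  = check_qty(dollars_to_euros(rows))
early = dollars_to_euros(check_qty(rows))
assert late == early
```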

9 Problem & Motivation
Existing standards are insufficient:
– TPC-H
– TPC-DS
Practical cases are not publishable.
We resort to devising our own ad-hoc test scenarios:
– either through a specific set of scenarios
– or, through a scenario generator

10 Roadmap
– Motivation
– Goals & parameters of the benchmark
– Micro-macro view of an ETL workflow
– Open issues

11 Goal of this work
We are interested in understanding:
– The fundamental families of activities performed in an ETL scenario
– The frequent ways with which activities and recordsets interconnect in an ETL scenario
– The important parameters to be tuned in an experiment & the appropriate values for them
– The appropriate measures to be measured during an experiment

12 The goal is not…
– To compare existing systems…
– To provide specialized performance measures for very specific tasks in the overall process…
– To exhaustively enumerate all the possible alternatives for specific operations…

13 Fundamental goals of any ETL workflow
Effectiveness:
– Does the workflow execution reach the maximum possible (or, at least, the minimum tolerable) level of data freshness, completeness and consistency in the warehouse within the necessary time (or resource) constraints?
– Is the workflow execution resilient to occasional failures?
Efficiency:
– How fast is the workflow executed?
– What resource overheads does the workflow incur at the source and the warehouse side?

14 Effectiveness
Bring the data from the sources to the DW such that:
– Periodically, as many of them as possible are loaded.
– The loaded data are as fresh as possible.
– At "checkpoint" timestamps, all data have been loaded in the warehouse.
– Data respect both database and business rules.
Potential problems:
– Recovery from failures.
– Missing changes at the source.
– Invalid records due to failed transactions.

15 Efficiency
Data exchange between sources and warehouse is done as fast as possible.
Minimal overhead is incurred:
– at the source systems.
– at the data warehouse.

16 Problem parameters
– the size of the workflow (i.e., the number of nodes contained in the graph)
– the structure of the workflow (i.e., the variation of the nature of the involved nodes and their interconnection as the workflow graph)
– the size of input data originating from the sources
– the overall selectivity of the workflow, based on the selectivities of the activities of the workflow
– the values of probabilities of failure
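
For concreteness, these parameters could be captured in a small configuration object like the following sketch (Python; the field names and example values are our own illustration, not part of the benchmark proposal):

```python
from dataclasses import dataclass

@dataclass
class ETLBenchmarkConfig:
    """One experimental point in the benchmark's parameter space."""
    workflow_size: int          # number of nodes (activities + recordsets) in the graph
    workflow_structure: str     # structural pattern, e.g. "line", "wishbone", "fork"
    source_rows: int            # size of the input data originating from the sources
    overall_selectivity: float  # derived from the activities' selectivities
    failure_probability: float  # probability that an execution fails

# A hypothetical small-scale experiment:
cfg = ETLBenchmarkConfig(workflow_size=20, workflow_structure="line",
                         source_rows=1_000_000, overall_selectivity=0.8,
                         failure_probability=1e-3)
```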

17 Measures
Measures for data freshness and data consistency:
– (M1.1) Percentage of data that violate business rules.
– (M1.2) Percentage of data that should be present at their appropriate warehouse targets, but are not.
Measures for the resilience to failures:
– (M2) Percentage of successfully resumed workflow executions.

18 Measures
Speed of the overall process:
– (M3.1) Throughput of regular workflow execution (this may also be measured as total completion time).
– (M3.2) Throughput of workflow execution including a specific percentage of failures and their resumption.
– (M3.3) Average latency per tuple in regular execution.
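
A minimal sketch of how the speed measures might be computed from a run log (Python; the record format is an assumption for illustration):

```python
def speed_measures(rows_loaded, start_ts, end_ts, per_tuple_latencies):
    """Compute M3.1 and M3.3 for one regular (failure-free) execution.

    rows_loaded         -- tuples that reached their warehouse targets
    start_ts, end_ts    -- wall-clock bounds of the execution (seconds)
    per_tuple_latencies -- seconds between each tuple entering and leaving the flow
    """
    completion_time = end_ts - start_ts
    m3_1_throughput = rows_loaded / completion_time          # tuples per second
    m3_3_avg_latency = sum(per_tuple_latencies) / len(per_tuple_latencies)
    return {"M3.1": m3_1_throughput, "M3.3": m3_3_avg_latency,
            "completion_time": completion_time}
```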

19 Measures
Overheads:
– (M4.1) Min/Max/Avg/timeline of memory consumed by the ETL process at the source system.
– (M4.2) Time needed to complete the processing of a certain number of OLTP transactions in the presence (as opposed to the absence) of ETL software at the source, in regular source operation.
– (M4.3) The same as M4.2, but in the case of source failure, where ETL tasks are to be performed too, concerning the recovered data.
– (M4.4) Min/Max/Avg/timeline of memory consumed by the ETL process at the warehouse system.
– (M4.5) (active warehousing) Time needed to complete the processing of a certain number of decision support queries in the presence (as opposed to the absence) of ETL software at the warehouse, in regular operation.
– (M4.6) The same as M4.5, but in the case of any (source or warehouse) failure, where ETL tasks are to be performed too at the warehouse side.

20 Goal of this work
Covered:
– The important parameters to be tuned in an experiment & the appropriate values for them
– The appropriate measures to be measured during an experiment
Pending:
– The fundamental families of activities performed in an ETL scenario
– The frequent ways with which activities and recordsets interconnect in an ETL scenario

21 Roadmap
– Motivation
– Goals & parameters of the benchmark
– Micro-macro view of an ETL workflow
– Open issues

22 Formal treatment
We model an ETL workflow as a directed acyclic graph G(V,E).
– Each node v ∈ V is either an activity a or a recordset r.
– An edge (a,b) ∈ E denotes that b receives data from node a for further processing.

23 Formal treatment
Intra-node constraints:
– Each recordset r is a pair (r.name, r.schema), with the schema being a finite list of attribute names.
– Each activity a is a tuple (N,I,O,S,A):
  N is the activity's name.
  I is a finite set of input schemata.
  O is a finite set of output schemata.
  S is a declarative description of the relationship of its output schema with its input schemata in an appropriate language (without delving into algorithmic or implementation issues).
  A is the algorithm chosen for the activity's execution.
Inter-node constraints:
– The data consumer of a recordset cannot be another recordset. Still, more than one consumer is allowed for recordsets.
– Each activity must have at least one provider, either another activity or a recordset. When an activity has more than one data provider, these providers can be other activities or activities combined with recordsets.
– Feedback of data is not allowed; i.e., the data consumer of an activity cannot be the same activity.
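
The graph model and its inter-node constraints translate almost directly into code. The following sketch (Python; our own illustrative rendering of the definitions above, not an artifact of the paper) validates the three inter-node constraints on a candidate workflow graph:

```python
from collections import namedtuple

Recordset = namedtuple("Recordset", "name schema")   # schema: finite list of attribute names
Activity  = namedtuple("Activity", "N I O S A")      # (name, inputs, outputs, semantics, algorithm)

def check_workflow(nodes, edges):
    """Validate the inter-node constraints on a graph G(V,E).
    nodes: dict name -> Recordset | Activity; edges: set of (provider, consumer) names."""
    providers = {c: {p for p, c2 in edges if c2 == c} for _, c in edges}
    for p, c in edges:
        # a recordset may not feed another recordset
        assert not (isinstance(nodes[p], Recordset) and isinstance(nodes[c], Recordset)), \
            f"recordset {p} feeds recordset {c}"
        # no feedback: a node may not consume its own output
        assert p != c, f"self-loop on {p}"
    for name, node in nodes.items():
        if isinstance(node, Activity):
            # every activity needs at least one provider
            assert providers.get(name), f"activity {name} has no provider"
```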

24 Resulting problem…
– Due to the intra-node characteristics, we need a "taxonomy" for ETL activities (micro-level).
– Due to the infinite possibilities of connecting nodes (activities and recordsets), we need a set of "design patterns" as abstractions of what frequently encountered ETL graphs look like (macro-level).

25 Micro level
Micro level: derive a set of fundamental classes, where frequently encountered activities can be classified.
Why a taxonomy of ETL activities?
– Impossible to predict every possible script / algorithm / operator / …
– No algebra available right now
Needed not only for the benchmark, but also, orthogonally, for other tasks (e.g., optimization, statistics, etc.)

26 [figure]

27 Logical classification
– Row-level operations, which are locally applied to a single row.
– Router operations, which locally decide, for each row, which of the many (output) destinations it should be sent to.
– Unary Grouper operations, which transform a set of rows to a single row.
– Unary Holistic operations, which perform a transformation to the entire data set. These are usually blocking operations.
– Binary or N-ary operations, which combine many inputs into one output.
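
One way to make the taxonomy tangible is a small class hierarchy (Python; our own illustrative sketch, not code from the paper — the method names are assumptions):

```python
class RowLevel:      # applied locally to a single row (e.g., NotNULL, AddDate, $2€)
    def apply(self, row): ...

class Router:        # decides locally, per row, which output destination gets it
    def route(self, row): ...

class Grouper:       # unary: transforms a set of rows into a single row
    def group(self, rows): ...

class Holistic:      # unary: transforms the entire data set; usually blocking
    def transform(self, rows): ...

class NAry:          # binary or n-ary: combines many inputs into one output
    def combine(self, *inputs): ...

class CheckQTY(RowLevel):
    """The CheckQTY activity from the example workflow, seen as a row-level filter."""
    def apply(self, row):
        return row if row["qty"] > 0 else None
```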

28 Final classification
[Figure: the micro-level classification combines the logical classes (row-level, router, grouper) with physical-level characteristics — blocking / semi-blocking / non-blocking behavior and the number of inputs (unary / binary / N-ary).]

29 Micro-level
                 Unary               N-ary
Blocking         Holistic, Grouper   Holistic, Grouper
Semi-blocking    Grouper             —
Non-blocking     Row-level, Router   —

30 Macro level
Even harder! How to derive a set of typical structural patterns for an ETL scenario?
– Top down: delve into the fundamental constituents of such a scenario
– Bottom up: explore scenarios and try to abstract common parts
We did a little bit of both, and derived a fundamental pattern of structure.

31 Butterflies to the rescue!
A butterfly is an ETL workflow that consists of three distinct components:
– Body: a central, detailed point of persistence (fact or dimension table) that is populated with the data produced by the left wing.
– Left wing: sources, activities and intermediate results; performs extraction, cleaning and transformation, and loads the data to the body.
– Right wing: materialized views, reports, spreadsheets, as well as the activities that populate them, to support reporting and analysis.

32 Butterflies to the rescue!

33 Butterflies to the rescue!

34 Butterfly classes
Butterflies constitute a fundamental pattern of reference:
– Line
– Balanced butterfly
Left-winged variants (heavy on the ETL part):
– Primary flow
– Wishbone
– Tree
Right-winged variants (heavy on the "reporting" part):
– Fork
Irregular variants
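
A rough sketch of how a workflow might be binned into these classes from the relative weight of its wings (Python; the wing-size heuristic and thresholds are entirely our own assumptions, not definitions from the paper):

```python
def classify_butterfly(left_wing_nodes, right_wing_nodes):
    """Bin a workflow by the sizes of its left wing (extract/transform/load part)
    and right wing (reporting part) around the body."""
    l, r = left_wing_nodes, right_wing_nodes
    if l <= 1 and r <= 1:
        return "line"                      # a single linear flow through the body
    if l > 0 and r > 0 and abs(l - r) <= max(l, r) // 3:
        return "balanced butterfly"        # both wings of comparable weight
    if l > r:
        return "left-winged (primary flow / wishbone / tree)"
    return "right-winged (fork)"
```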

35 Line

36 Wishbone

37 Primary Flow

38 Tree

39 Fork

40 Balanced Butterfly

41 Balanced Butterfly – Slowly Changing Dimension of Type II – not a typical butterfly…

42 Roadmap
– Motivation
– Goals & parameters of the benchmark
– Micro-macro view of an ETL workflow
– Open issues

43 Open Issues
Data sizes:
– the numbers given by TPC-H can be a valid point of reference for data warehouse contents.
– Important: fraction of source data over the warehouse contents. Values in the range 0.01 to 0.7?
Selectivity of the left wing of a butterfly:
– Values between 0.5 and 1.2?
Failure rates:
– Range of 10⁻⁴ to 10⁻²?
Workflow size:
– Although we provide scenarios of small scale, medium-size and large-size scenarios are also needed.
A careful assignment of values based on a large number of real-world case studies (that we do not possess) should be a topic for a full-fledged benchmark.
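
If those tentative ranges were adopted, a scenario generator could sample one experiment configuration as in this sketch (Python; the ranges come straight from the questions above and are explicitly provisional):

```python
import random

def sample_scenario(rng=random):
    """Draw one tentative benchmark configuration from the ranges above."""
    return {
        # fraction of source data over the warehouse contents (0.01 to 0.7?)
        "source_fraction": rng.uniform(0.01, 0.7),
        # selectivity of the left wing of the butterfly (0.5 to 1.2?)
        "left_wing_selectivity": rng.uniform(0.5, 1.2),
        # failure rate (10^-4 to 10^-2?), sampled log-uniformly
        "failure_rate": 10 ** rng.uniform(-4, -2),
        # small-, medium- and large-size workflows are all needed
        "workflow_scale": rng.choice(["small", "medium", "large"]),
    }
```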

44 Open Issues
– Nature of data (now: relational; also: XML, multimedia, …)
– Active vs. off-line modus operandi
– Auxiliary structures and processes (indexes, backup & maintenance scenarios, etc.)
– Parallelism and Partitioning

45 Message to you
We need a commonly agreed benchmark that realistically reflects real-world ETL scenarios.
Butterflies to the rescue!
Feedback from the industry is necessary:
– workflow complexity
– the frequencies of typically encountered ETL operations

46 Thank you! (All pictures are imported from MS Clipart.)

47 Auxiliary slides

48 [figure]

49 Line

50 Wishbone

51 Primary Flow

52 Tree

53 Fork

54 Right Deep Hierarchy

55 [figure]

56 Partitioning & parallelism

57 Statistics per pattern
Legend: N+M (left wing + right wing); INCR: incremental maintenance; I/U: insert and/or update; FULL: full recomputation

58 [figure]

