Published by Candice Hart. Modified over 9 years ago.
1
Data Modelling and Cleaning CMPT 455/826 - Week 8, Day 2 Sept-Dec 2009 – w8d21
2
Conceptual Modelling Solutions for the Data Warehouse
Stefano Rizzi
3
Definition 1: Facts
A fact is a focus of interest for the decision-making process. Typically, it models a set of events occurring in the enterprise world. A fact is graphically represented by a box with two sections, one for the fact name and one for the measures.
4
Guideline 1: Facts
Concepts represented in the data source by frequently updated archives are good candidates for facts. Concepts represented by almost-static archives are not good candidates for facts.
5
Definition 2: Measure
A measure is a numerical property of a fact, and describes one of its quantitative aspects of interest for analysis. Measures are included in the bottom section of the fact.
6
Definition 3: Dimension
A dimension is a fact property with a finite domain, and describes one of its analysis coordinates. The set of dimensions of a fact determines its finest representation granularity. Graphically, dimensions are represented as circles attached to the fact by straight lines.
7
Guideline 2: Dimensions
At least one of the dimensions of the fact should represent time, at any granularity.
8
Definition 4: Primary Event
A primary event is an occurrence of a fact, and is identified by a tuple of values, one value for each dimension. Each primary event is described by one value for each measure.
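The definitions so far (fact, measure, dimension, primary event) can be sketched as a small data structure. This is only an illustration: the SALE fact, its dimension and measure names, and the sample values are hypothetical and do not come from the slides.

```python
from typing import Dict, Tuple

# Hypothetical SALE fact (names and values are assumptions for illustration).
DIMENSIONS = ("date", "product", "store")   # analysis coordinates
MEASURES = ("quantity", "revenue")          # numerical properties

# A primary event is identified by one value per dimension (the key)
# and described by one value per measure (the value).
primary_events: Dict[Tuple[str, str, str], Dict[str, float]] = {
    ("2009-10-27", "milk", "S1"): {"quantity": 12, "revenue": 30.0},
    ("2009-10-27", "bread", "S1"): {"quantity": 5, "revenue": 12.5},
    ("2009-10-28", "milk", "S2"): {"quantity": 7, "revenue": 17.5},
}


def is_well_formed(events) -> bool:
    """Check that every event has one value per dimension and per measure."""
    return all(
        len(key) == len(DIMENSIONS) and set(values) == set(MEASURES)
        for key, values in events.items()
    )
```

Using the dimension tuple as the dictionary key mirrors the definition: two events with the same coordinates cannot coexist at the finest granularity.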
9
Definition 5: Dimension Attributes
A dimension attribute is a property, with a finite domain, of a dimension. Like dimensions, it is represented by a circle.
10
Definition 6: Hierarchy
A hierarchy is a directed tree, rooted in a dimension, whose nodes are all the dimension attributes that describe that dimension, and whose arcs model many-to-one associations between pairs of dimension attributes. Arcs are graphically represented by straight lines.
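One way to picture a hierarchy is as a chain of many-to-one mappings that can be followed upward from the dimension. The sketch below assumes a hypothetical store → city → country hierarchy; the attribute names, values, and the `roll_up` helper are illustrative only.

```python
# Each arc of the hierarchy is a many-to-one mapping: child value -> parent value.
# All names and values here are assumptions for illustration.
STORE_TO_CITY = {"S1": "Saskatoon", "S2": "Saskatoon", "S3": "Regina"}
CITY_TO_COUNTRY = {"Saskatoon": "Canada", "Regina": "Canada"}

# attribute -> (parent attribute, mapping along the arc)
HIERARCHY = {
    "store": ("city", STORE_TO_CITY),
    "city": ("country", CITY_TO_COUNTRY),
}


def roll_up(value, from_attr, to_attr):
    """Follow many-to-one arcs from from_attr up to to_attr."""
    attr = from_attr
    while attr != to_attr:
        if attr not in HIERARCHY:
            raise ValueError(f"{to_attr!r} is not above {from_attr!r}")
        parent_attr, mapping = HIERARCHY[attr]
        value = mapping[value]
        attr = parent_attr
    return value
```

Because every arc is many-to-one, each store value maps to exactly one city and one country, which is what makes aggregation along the hierarchy well defined.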
11
Definition 8: Descriptive Attribute
A descriptive attribute specifies a property of a dimension attribute, to which it is related by an x-to-one association. Descriptive attributes are not used for aggregation; they are always leaves of their hierarchy and are graphically represented by horizontal lines.
12
Definition 9: Cross-dimension Attributes
A cross-dimension attribute is an attribute (either dimension or descriptive) whose value is determined by the combination of two or more dimension attributes, possibly belonging to different hierarchies. It is denoted by connecting the arcs that determine it with a curved line.
13
Definition 10: Convergence
A convergence takes place when two dimension attributes within a hierarchy are connected by two or more alternative paths of many-to-one associations. Convergences are represented by letting two or more arcs converge on the same dimension attribute.
14
Definition 13: Ragged Hierarchy
A ragged (or incomplete) hierarchy is a hierarchy where, for some instances, the values of one or more attributes are missing (because they are undefined or unknown). A ragged hierarchy is graphically denoted by marking with a dash the attributes whose values may be missing.
15
Definition 14: Unbalanced Hierarchy
An unbalanced (or recursive) hierarchy is a hierarchy where, though inter-attribute relationships are consistent, the instances may have different lengths. Graphically, it is represented by introducing a cycle within the hierarchy.
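A recursive hierarchy can be illustrated with a self-referencing mapping, where the same relationship is applied repeatedly and different instances reach the root after a different number of steps. The reporting structure below is a hypothetical example, not taken from the slides.

```python
# Hypothetical "reports to" relationship; None marks the root.
# The relationship is consistent (everyone reports to one person),
# but paths to the root differ in length.
REPORTS_TO = {
    "ceo": None,
    "vp": "ceo",
    "manager": "vp",
    "clerk": "manager",
    "advisor": "ceo",
}


def path_to_root(employee):
    """Return the chain from an employee up to the root of the hierarchy."""
    path = [employee]
    while REPORTS_TO[path[-1]] is not None:
        path.append(REPORTS_TO[path[-1]])
    return path
```

Here `path_to_root("clerk")` has four elements while `path_to_root("advisor")` has two, which is the "different lengths" property in the definition.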
16
Definition 15: Additive
A measure is said to be additive along a dimension if its values can be aggregated along the corresponding hierarchy by the sum operator; otherwise it is called nonadditive. A nonadditive measure is nonaggregable if no other aggregation operator can be used on it.
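The distinction can be made concrete with a small sketch. The data below is hypothetical: `units_sold` is treated as additive along every dimension, while `stock_level` is a common textbook example of a measure that is additive across stores but not along time, where an operator such as the average is used instead.

```python
from collections import defaultdict

# Hypothetical rows: (month, store, units_sold, stock_level)
ROWS = [
    ("2009-09", "S1", 10, 100),
    ("2009-10", "S1", 20, 80),
    ("2009-09", "S2", 5, 50),
    ("2009-10", "S2", 15, 40),
]


def aggregate(rows, key_index, value_index, op):
    """Group rows by one column and combine another column with op."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[value_index])
    return {key: op(values) for key, values in groups.items()}


def mean(values):
    return sum(values) / len(values)


# Additive along time: summing monthly sales per store is meaningful.
units_per_store = aggregate(ROWS, 1, 2, sum)

# Additive along store: total stock across stores in a month is meaningful.
stock_per_month = aggregate(ROWS, 0, 3, sum)

# Nonadditive along time: summing stock over months is not meaningful,
# but the measure is still aggregable with another operator (average).
avg_stock_per_store = aggregate(ROWS, 1, 3, mean)
```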
17
Open Issues
Lack of a standard for conceptual models
Need for design patterns to support modelling
Need for a method to model security issues
18
Data Cleaning (Based on Rahm)
19
Single source problems
Lack of appropriate model-specific integrity constraints
– Attribute: illegal values
– Record: uniqueness violation
– Relationship: referential integrity not validated
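The three problem levels above correspond to constraints that a relational schema can enforce. The following is a minimal sketch using Python's built-in `sqlite3` module; the `department` and `employee` tables, their columns, and the age range are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE department (
    dept_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    ssn     TEXT NOT NULL UNIQUE,                       -- record level: uniqueness
    age     INTEGER CHECK (age BETWEEN 16 AND 100),     -- attribute level: legal values
    dept_id INTEGER NOT NULL
            REFERENCES department(dept_id)              -- relationship level: referential integrity
);
INSERT INTO department VALUES (1, 'Sales');
INSERT INTO employee VALUES (1, '123456', 35, 1);
""")


def is_rejected(row):
    """Return True if the schema constraints refuse the employee row."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO employee VALUES (?, ?, ?, ?)", row)
    except sqlite3.IntegrityError:
        return True
    return False


results = {
    "illegal value": is_rejected((2, "999999", 200, 1)),
    "duplicate ssn": is_rejected((3, "123456", 40, 1)),
    "dangling reference": is_rejected((4, "777777", 40, 99)),
    "valid row": is_rejected((5, "555555", 40, 1)),
}
```

When such constraints are missing from the source, the same checks have to be run after the fact as part of cleaning.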
20
Single source problems
Lack of appropriate application-specific integrity constraints can lead to:
– Attribute problems: missing values, misspellings, cryptic abbreviations, embedded values, misfiled values
– Record problems: violated attribute dependencies, word transpositions, duplicated records, contradicted records
– Relationship problems: wrong references
21
Multi-source Problems
In addition to single source problems, there can be:
– overlapping or contradicting data
– schema naming and structural conflicts
– different data types / granularities / interpretations / points in time
22
Data Analysis for cleaning
Using metadata for data profiling
– focuses on the instance analysis of individual attributes
– derives information such as the data type, length, value range, discrete values and their frequency, variance, uniqueness, occurrence of null values, typical string pattern (e.g., for phone numbers)
– provides an exact view of various quality aspects of the attribute
Data mining
– helps discover specific data patterns in large data sets, e.g., relationships holding between several attributes
– focuses on so-called descriptive data mining models including clustering, summarization, association discovery and sequence
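A few of the profiling measures listed above can be computed for a single attribute with a short function. This is a rough sketch, not a full profiler: the `profile` function, the pattern encoding (digits as `9`, letters as `A`), and the sample phone numbers are all assumptions for illustration.

```python
import re
from collections import Counter


def profile(values):
    """Derive simple profiling metadata for one string attribute."""
    non_null = [v for v in values if v is not None]
    lengths = [len(v) for v in non_null]
    # Encode each value as a pattern: digits become 9, letters become A.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v)) for v in non_null
    )
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "is_unique": len(set(non_null)) == len(non_null),
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
        "frequencies": Counter(non_null),
        "patterns": patterns,
    }


# Hypothetical phone number column with a null, a duplicate,
# and one value that deviates from the dominant pattern.
phones = ["306-555-0101", "306-555-0102", "3065550103", None, "306-555-0101"]
report = profile(phones)
```

Values whose pattern differs from the most frequent one (here the number without dashes) are natural candidates for standardization.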
23
Data transformations
Can be done via SQL operations
– which allows tracking of all transformations
– can include: extracting values from free-form attributes (attribute split), validation and correction, standardization, duplicate elimination
May require considerable human involvement
– some transformations will be more complex than others
– some transformations will apply to more or less data
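As a rough illustration of SQL-based transformations, the sketch below runs an attribute split, a simple standardization, and exact-match duplicate elimination through Python's built-in `sqlite3` module. The `raw_customer` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical source table with a free-form "city, province" attribute.
CREATE TABLE raw_customer (name TEXT, city_state TEXT);
INSERT INTO raw_customer VALUES
    ('Ann Lee',  'Saskatoon, sk'),
    ('ann lee ', 'Saskatoon, SK'),
    ('Bob Roy',  'Regina, SK');

CREATE TABLE clean_customer AS
SELECT DISTINCT                                   -- duplicate elimination (exact match)
    lower(trim(name)) AS name,                    -- standardization of case and whitespace
    trim(substr(city_state, 1,
                instr(city_state, ',') - 1)) AS city,          -- attribute split, part 1
    upper(trim(substr(city_state,
                instr(city_state, ',') + 1))) AS province      -- attribute split, part 2
FROM raw_customer;
""")

rows = conn.execute(
    "SELECT name, city, province FROM clean_customer ORDER BY name"
).fetchall()
```

Because the whole transformation is one SQL statement, it can be stored and replayed, which is the tracking benefit the slide mentions. Only duplicates that become identical after standardization are removed; near-duplicates such as misspelled names would still need matching rules or human review.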