
1 Managing Data for DSS II

2 Managing Data for DSS
Data Warehouse
Common characteristics:
  – Database designed for analytical tasks, comprising data from multiple applications
  – Small number of users with intense and long interactions
  – Read-intensive usage
  – Periodic updates to the contents
  – Consists of current as well as historical data
  – Relatively few but large tables
  – Queries return large result sets, involving full table scans and joins spanning several tables
  – Aggregation, vector operations and summarization are common
  – The data frequently resides in external heterogeneous sources

3 Introduction: Terminology
Current Detail Data: data acquired directly from operational databases, often representing the entire enterprise.
Old Detail Data: aged current detail data; historical data organized by subject, which helps in trend analysis.
Data Marts: a large data store for informational needs whose scope is limited to a department, SBU, etc. In a phased implementation, data marts are a way to build a warehouse.
Summarized Data: data aggregated along the lines required for executive reporting, trend analysis and decision support.
Metadata: data about the data, including description of contents, location, structure, end-user views, identification of authoritative data, history of updates, and security authorizations.

4 Introduction: Architecture
[Figure: data warehouse architecture. Labels recoverable from the slide: External and Current data sources; Extract, Cleanup & Load; Repository; Metadata; Realized or Virtual MDDB; Management; Information Delivery System; Report, Query & EIS; OLAP Tools; Data Mining Tools]

5 The Data Warehouse is an integrated, subject-oriented, time-variant, non-volatile database that provides support for decision making.
  – Integrated: the Data Warehouse is a centralized, consolidated database that integrates data retrieved from the entire organization.
  – Subject-Oriented: the Data Warehouse data is arranged and optimized to provide answers to questions coming from diverse functional areas within a company.

6 The Data Warehouse (continued)
  – Time-Variant: the Warehouse data represent the flow of data through time. It can even contain projected data.
  – Non-Volatile: once data enter the Data Warehouse, they are never removed. The Data Warehouse is always growing.

7 Major Tasks in Data Preparation
Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
  – Integration of multiple databases, data cubes, or files
Data transformation
  – Normalization and aggregation
Data reduction
  – Obtains a representation that is reduced in volume but produces the same or similar analytical results
Data discretization
  – Part of data reduction but with particular importance, especially for numerical data

8 Extraction, Cleanup, Integration
Data Cleaning
  – Missing Values
      Ignore the tuple
      Fill in the value manually
      Use a global constant to fill in the value
      Use the attribute mean as the missing value
        – e.g., the average income of all customers is 30,000 per month
      Use the attribute mean of all samples belonging to the same class
        – e.g., fill a missing income with the average income of the same class, where the class is given by an attribute such as credit_risk or emp_status
      Use the most probable value
        – Regression, Bayesian classifiers, decision tree induction
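The two mean-based strategies for missing values can be sketched in a few lines. This is a minimal illustration only: the customer records, the income figures and the helper names `fill_with_mean` and `fill_with_class_mean` are hypothetical and do not come from the slides.

```python
from statistics import mean

def fill_with_mean(rows, attr):
    """Replace missing (None) values of `attr` with the mean of the observed values."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    fill = mean(observed)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]

def fill_with_class_mean(rows, attr, class_attr):
    """Replace missing values of `attr` with the mean over rows of the same class.

    Assumes every class has at least one observed value; otherwise a KeyError is raised.
    """
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[class_attr], []).append(r[attr])
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [
        {**r, attr: class_means[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]

# Hypothetical sample data: one customer has no recorded income.
customers = [
    {"income": 20000, "credit_risk": "high"},
    {"income": 40000, "credit_risk": "low"},
    {"income": None, "credit_risk": "low"},
    {"income": 30000, "credit_risk": "high"},
    {"income": 50000, "credit_risk": "low"},
]

print(fill_with_mean(customers, "income")[2]["income"])                       # 35000
print(fill_with_class_mean(customers, "income", "credit_risk")[2]["income"])  # 45000
```

The class-based fill gives a different value here (45000 rather than 35000) because only the two "low" risk customers contribute to it.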

9 Extraction, Cleanup, Integration
Data Cleaning
  – Noisy Data: a random error or variance in a measured value. Given an attribute such as price, how can we smooth out the data to remove noise?
      Binning
        – Smooth the sorted data by consulting the neighbours.
        – Given: 4 8 15 21 21 24 25 28 34
        – Partition it:
          » Bin 1: 4 8 15
          » Bin 2: 21 21 24
          » Bin 3: 25 28 34
        – Replace the bin values by the bin mean or the bin boundaries
      Clustering
      Regression: smooth the data by fitting it to a function.
  – Inconsistent Data: correct manually or through a rule base
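The binning example can be worked through in code. The sketch below uses the nine values from the slide; the function names are illustrative, and two choices are assumptions: the bins are equal-depth (the same number of values per bin), and a value equally distant from both boundaries goes to the lower one.

```python
def partition_equal_depth(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    data = sorted(values)
    if not data or n_bins <= 0 or len(data) % n_bins != 0:
        raise ValueError("this sketch assumes the values split evenly into bins")
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value in a bin by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = partition_equal_depth(prices, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```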

10 Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
  – min-max normalization
  – z-score normalization
  – normalization by decimal scaling

11 Data Transformation: Normalization
min-max normalization
Suppose that the minimum and maximum values for the attribute income are £12,000 and £98,000, respectively. We map income to the range [0.0, 1.0]. By min-max normalization, a value of £73,600 for income is transformed to (73600 - 12000) / (98000 - 12000) * (1.0 - 0.0) + 0.0 = 0.716.
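The same calculation as a small function. This is a sketch only; the name `min_max_normalize` and its parameter names are illustrative and not taken from the slides.

```python
def min_max_normalize(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly map v from [old_min, old_max] onto [new_min, new_max]."""
    if old_max == old_min:
        raise ValueError("old_min and old_max must differ")
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

# The income example from the slide: 73,600 within [12,000, 98,000].
print(round(min_max_normalize(73600, 12000, 98000), 3))  # 0.716
```

The range endpoints map to the ends of the target range: `min_max_normalize(12000, 12000, 98000)` gives 0.0 and `min_max_normalize(98000, 12000, 98000)` gives 1.0.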

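12 Data Transformation: Normalization
min-max normalization: v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
z-score normalization: v' = (v - mean_A) / stand_dev_A
normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1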
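A minimal sketch of z-score normalization and normalization by decimal scaling, using their standard definitions: z-score subtracts the attribute mean and divides by the standard deviation, and decimal scaling divides by 10^j for the smallest j that brings every absolute value below 1. The sample values and function names are hypothetical. Two assumptions are made: the population standard deviation is used, and j is restricted to non-negative integers.

```python
from statistics import mean, pstdev

def z_score_normalize(values):
    """Return (v - mean) / standard deviation for every value."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("all values are identical; z-score is undefined")
    return [(v - mu) / sigma for v in values]

def decimal_scaling_normalize(values):
    """Divide by 10**j, with j the smallest non-negative integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

print(z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9]))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]

scaled, j = decimal_scaling_normalize([-986, 120, 917])
print(j, scaled)  # 3 [-0.986, 0.12, 0.917]
```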

13 Star Schema
The star schema is a data-modeling technique used to map multidimensional decision support into a relational database.
Star schemas yield an easily implemented model for multidimensional data analysis while still preserving the relational structure of the operational database.
Four Components:
  – Facts
  – Dimensions
  – Attributes
  – Attribute hierarchies

14 A Simple Star Schema

15 Star Schema
Facts
  – Facts are numeric measurements (values) that represent a specific business aspect or activity.
  – The fact table contains facts that are linked through their dimensions.
  – Facts can be computed or derived at run-time (metrics).
Dimensions
  – Dimensions are qualifying characteristics that provide additional perspectives to a given fact.
  – Dimensions are stored in dimension tables.

16 Star Schema
Attributes
  – Each dimension table contains attributes. Attributes are often used to search, filter, or classify facts.
  – Dimensions provide descriptive characteristics about the facts through their attributes.
[Table: Possible Attributes For Sales Dimensions]

17 Three Dimensional View Of Sales

18 Slice And Dice View Of Sales

19 Star Schema
Attribute Hierarchies
  – Attributes within dimensions can be ordered in a well-defined attribute hierarchy.
  – The attribute hierarchy provides a top-down data organization that is used for two main purposes:
      Aggregation
      Drill-down/roll-up data analysis
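Roll-up along an attribute hierarchy amounts to grouping facts by a prefix of the hierarchy and summing the measure; drill-down is the same operation at a deeper level. The sketch below assumes a hypothetical location hierarchy (country, then province, then city) and made-up sales rows, none of which come from the slides.

```python
from collections import defaultdict

# Hypothetical location hierarchy, ordered top-down.
HIERARCHY = ["country", "province", "city"]

def roll_up(facts, level, measure="units_sold"):
    """Sum `measure` over all facts that share the hierarchy path down to `level`."""
    depth = HIERARCHY.index(level) + 1
    keys = HIERARCHY[:depth]
    totals = defaultdict(int)
    for fact in facts:
        totals[tuple(fact[k] for k in keys)] += fact[measure]
    return dict(totals)

# Made-up facts at the most detailed (city) level.
facts = [
    {"country": "Canada", "province": "Ontario", "city": "Toronto", "units_sold": 10},
    {"country": "Canada", "province": "Ontario", "city": "Ottawa", "units_sold": 5},
    {"country": "Canada", "province": "British Columbia", "city": "Vancouver", "units_sold": 7},
    {"country": "USA", "province": "Illinois", "city": "Chicago", "units_sold": 4},
]

print(roll_up(facts, "country"))
# {('Canada',): 22, ('USA',): 4}
print(roll_up(facts, "province"))
# {('Canada', 'Ontario'): 15, ('Canada', 'British Columbia'): 7, ('USA', 'Illinois'): 4}
```

Moving from the "country" result to the "province" result is a drill-down; the reverse direction is a roll-up.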

20 A Location Attribute Hierarchy

21 Attribute Hierarchies In Multidimensional Analysis

22 Example of Star Schema
Sales Fact Table
  – Keys: time_key, item_key, branch_key, location_key
  – Measures: units_sold, dollars_sold, avg_sales
Dimension table: time
  – time_key, day, day_of_the_week, month, quarter, year
Dimension table: item
  – item_key, item_name, brand, type, supplier_type
Dimension table: branch
  – branch_key, branch_name, branch_type
Dimension table: location
  – location_key, street, city, province_or_street, country
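A cut-down version of this schema can be tried out with Python's built-in sqlite3 module. The sketch below is an illustration under stated assumptions: only the time and item dimensions are created (branch and location are left out for brevity), the table names `time_dim`, `item_dim` and `sales_fact` are chosen here rather than taken from the slide, and all rows are made up.

```python
import sqlite3

def build_demo_warehouse():
    """Create an in-memory star schema with two dimensions and one fact table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE time_dim (
            time_key INTEGER PRIMARY KEY, day INTEGER, day_of_the_week TEXT,
            month INTEGER, quarter TEXT, year INTEGER
        );
        CREATE TABLE item_dim (
            item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
            type TEXT, supplier_type TEXT
        );
        CREATE TABLE sales_fact (
            time_key INTEGER REFERENCES time_dim(time_key),
            item_key INTEGER REFERENCES item_dim(item_key),
            units_sold INTEGER,
            dollars_sold REAL
        );
    """)
    # Made-up dimension rows.
    conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?, ?)", [
        (1, 15, "Mon", 1, "Q1", 2024),
        (2, 20, "Sat", 4, "Q2", 2024),
    ])
    conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?, ?, ?)", [
        (1, "Widget", "Acme", "hardware", "wholesale"),
        (2, "Gadget", "Globex", "hardware", "retail"),
    ])
    # Made-up fact rows: (time_key, item_key, units_sold, dollars_sold).
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", [
        (1, 1, 10, 100.0),
        (1, 2, 5, 75.0),
        (2, 1, 20, 200.0),
        (1, 1, 3, 30.0),
    ])
    return conn

def sales_by_quarter_and_brand(conn):
    """Join the fact table to its dimensions and aggregate the measures."""
    return conn.execute("""
        SELECT t.quarter, i.brand, SUM(s.units_sold), SUM(s.dollars_sold)
        FROM sales_fact AS s
        JOIN time_dim AS t ON t.time_key = s.time_key
        JOIN item_dim AS i ON i.item_key = s.item_key
        GROUP BY t.quarter, i.brand
        ORDER BY t.quarter, i.brand
    """).fetchall()

conn = build_demo_warehouse()
for row in sales_by_quarter_and_brand(conn):
    print(row)
# ('Q1', 'Acme', 13, 130.0)
# ('Q1', 'Globex', 5, 75.0)
# ('Q2', 'Acme', 20, 200.0)
```

The query shows the usual star-schema pattern: the fact table supplies the measures, and each dimension is joined on its key so that its attributes (here quarter and brand) can group the facts.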
