

2 Data, Data Everywhere, Yet...
- We can’t find the data we need
  - data is scattered over the network
- We can’t get the data we need
  - need an expert to get the data
- We can’t understand the data we found
  - available data is poorly documented
- We can’t use the data we found
  - data needs to be transformed from one form to another

3 What Is a Data Warehouse?
Definition by Inmon: “A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.”

4 Data Warehouse—Subject-Oriented Organized around major subjects, such as customer, product, sales

5 Data Warehouse—Integrated
- Constructed by integrating multiple, heterogeneous data sources
  - relational databases, flat files, on-line transaction records
- Data cleaning and data integration techniques are applied
  - Ensure consistency in naming conventions, attribute measures, etc. among different data sources
  - When data is moved to the warehouse, it is converted
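As a rough illustration of the conversion step on this slide, the sketch below maps rows from two source systems onto one naming convention and one unit of measure. The source names, field names, and values are all hypothetical.

```python
# Hypothetical rows from two source systems with different names and units.
crm_rows = [{"cust_id": 1, "CustName": "Ann", "revenue_usd": 120.0}]
billing_rows = [{"customer": 2, "name": "bob", "revenue_cents": 4550}]

def from_crm(row):
    """Map a CRM row onto the (assumed) warehouse naming convention."""
    return {
        "customer_id": row["cust_id"],
        "customer_name": row["CustName"].title(),
        "revenue": round(row["revenue_usd"], 2),
    }

def from_billing(row):
    """Map a billing row; cents are converted to the warehouse unit."""
    return {
        "customer_id": row["customer"],
        "customer_name": row["name"].title(),
        "revenue": round(row["revenue_cents"] / 100, 2),
    }

# After conversion, every row has the same attribute names and measures.
warehouse_rows = [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]
```

Each source gets its own small converter, so a new source can be added without touching the others.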

6 Data Warehouse—Time Variant
- The time horizon for the data warehouse is significantly longer than that of operational systems
  - Operational database: current value data
  - Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years)

7 Data Warehouse—Non-Volatile
- Operational update of data does not occur in the data warehouse environment
- Requires only two operations in data accessing: initial loading of data and access of data

8 Data Warehouse vs. Operational DBMS
- OLTP (On-Line Transaction Processing)
  - Major task of traditional relational DBMS
  - Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
- OLAP (On-Line Analytical Processing)
  - Major task of data warehouse system
  - Data analysis and decision making

9 From Tables and Spreadsheets to Data Cubes
- A data warehouse is based on a multidimensional data model, which views data in the form of a data cube
- A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  - Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year)
  - Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables

10 Conceptual Modeling of Data Warehouses
- Modeling data warehouses: dimensions & measures
- Star schema
  - A fact table in the middle connected to a set of dimension tables
- Snowflake schema
  - A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
- Fact constellations
  - Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

11 Example of Star Schema
- Sales Fact Table: Time_key, Item_key, Branch_key, Location_key; Measures: Unit_sold, Euros_sold, Avg_sales
- Time: time_key, day, day_of_the_week, month, quarter, year
- Item: item_key, item_name, brand, type, supplier_type
- Branch: branch_key, branch_name, branch_type
- Location: location_key, street, city, province_or_street, country
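A minimal sketch of how a cut-down version of this star schema might be built and queried, using Python's built-in sqlite3 module. Only two dimensions and a few columns are kept; the table names time_dim and item_dim and all sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (reduced to a few columns from the slide).
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);

-- Fact table: one key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key   INTEGER REFERENCES time_dim(time_key),
    item_key   INTEGER REFERENCES item_dim(item_key),
    unit_sold  INTEGER,
    euros_sold REAL
);

-- Hypothetical sample data.
INSERT INTO time_dim VALUES (1, 'Q1', 2020), (2, 'Q2', 2020);
INSERT INTO item_dim VALUES (10, 'TV', 'Acme', 'electronics'), (11, 'PC', 'Acme', 'electronics');
INSERT INTO sales_fact VALUES (1, 10, 3, 900.0), (1, 11, 1, 700.0), (2, 10, 2, 600.0);
""")

# A typical star join: the fact table joined to each dimension, then aggregated.
rows = conn.execute("""
    SELECT i.item_name, t.quarter, SUM(f.euros_sold)
    FROM sales_fact f
    JOIN item_dim i ON i.item_key = f.item_key
    JOIN time_dim t ON t.time_key = f.time_key
    GROUP BY i.item_name, t.quarter
    ORDER BY i.item_name, t.quarter
""").fetchall()
```

Every dimension is one join away from the fact table, which is what gives the schema its star shape.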

12 Example of Snowflake Schema
- Sales Fact Table: Time_key, Item_key, Branch_key, Location_key; Measures: Unit_sold, Euros_sold, Avg_sales
- Time: time_key, day, day_of_the_week, month, quarter, year
- Item: item_key, item_name, brand, type, supplier_key
  - Supplier: supplier_key, supplier_type
- Branch: branch_key, branch_name, branch_type
- Location: location_key, street, city_key
  - City: city_key, city, province_or_street, country
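The practical effect of snowflaking is an extra join per normalized level. The sketch below, again with sqlite3, keeps only the Location and City tables from this slide; the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The location dimension is normalized: city attributes live in their own table.
CREATE TABLE city (city_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY,
    street       TEXT,
    city_key     INTEGER REFERENCES city(city_key)
);
CREATE TABLE sales_fact (
    location_key INTEGER REFERENCES location(location_key),
    euros_sold   REAL
);

-- Hypothetical sample data.
INSERT INTO city VALUES (1, 'Dublin', 'Ireland'), (2, 'Paris', 'France');
INSERT INTO location VALUES (100, 'Main St', 1), (101, 'High St', 1), (102, 'Rue A', 2);
INSERT INTO sales_fact VALUES (100, 50.0), (101, 25.0), (102, 40.0);
""")

# Reaching "country" now takes two joins: fact -> location -> city.
by_country = dict(conn.execute("""
    SELECT c.country, SUM(f.euros_sold)
    FROM sales_fact f
    JOIN location l ON l.location_key = f.location_key
    JOIN city c ON c.city_key = l.city_key
    GROUP BY c.country
""").fetchall())
```

The city and country values are stored once per city rather than once per street address, at the cost of the longer join path.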

13 Example of Fact Constellation
- Sales Fact Table: Time_key, Item_key, Branch_key, Location_key; Measures: Unit_sold, Euros_sold, Avg_sales
- Shipping Fact Table: Time_key, Item_key, shipper_key, from_location, to_location; Measures: Euros_sold, unit_shipped
- Time: time_key, day, day_of_the_week, month, quarter, year
- Item: item_key, item_name, brand, type, supplier_key
- Branch: branch_key, branch_name, branch_type
- Location: location_key, street, city, Province/street, country
- Shipper: shipper_key, shipper_name, location_key, shipper_type

14 A Sample Data Cube
[Figure] A three-dimensional cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), and Country (Ireland, France, Germany, sum). Highlighted cell: total annual sales of TV in Ireland.
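One way to picture how the "sum" cells of such a cube are filled: every fact row contributes to each combination of its own value or "sum" on each dimension. The sketch below assumes a handful of made-up sales figures; it is not taken from the slide's data.

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact rows: (quarter, product, country, sales).
facts = [
    ("1Qtr", "TV", "Ireland", 10),
    ("2Qtr", "TV", "Ireland", 20),
    ("1Qtr", "PC", "Ireland", 5),
    ("1Qtr", "TV", "France", 7),
]

cube = defaultdict(int)
for quarter, prod, country, sales in facts:
    # Each row lands in 2^3 cells: its own value or "sum" along each dimension.
    for cell in product((quarter, "sum"), (prod, "sum"), (country, "sum")):
        cube[cell] += sales

# The cell highlighted on the slide: total annual sales of TV in Ireland.
tv_ireland_annual = cube[("sum", "TV", "Ireland")]
```

With these sample rows, `tv_ireland_annual` adds the two TV rows for Ireland, and `cube[("sum", "sum", "sum")]` holds the grand total.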

15 Typical OLAP Operations
- Roll up (drill-up): summarize data
  - by climbing up hierarchy or by dimension reduction
- Drill down (roll down): reverse of roll-up
  - from higher level summary to lower level summary or detailed data, or introducing new dimensions
- Slice and dice
  - project and select
- Pivot (rotate)
  - reorient the cube, visualization, 3D to series of 2D planes
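The operations above can be sketched over a tiny cube stored as a dictionary. The cell values, the month-to-quarter mapping, and the function names are assumptions for illustration; a real OLAP engine would work quite differently internally.

```python
from collections import defaultdict

# Hypothetical cube cells: (month, product, country) -> sales.
cells = {
    ("Jan", "TV", "Ireland"): 4,
    ("Feb", "TV", "Ireland"): 6,
    ("Apr", "TV", "France"): 3,
    ("Apr", "PC", "Ireland"): 8,
}
MONTH_TO_QUARTER = {"Jan": "1Qtr", "Feb": "1Qtr", "Mar": "1Qtr", "Apr": "2Qtr"}

def roll_up_time(cube):
    """Roll up by climbing the time hierarchy from month to quarter."""
    out = defaultdict(int)
    for (month, prod, country), value in cube.items():
        out[(MONTH_TO_QUARTER[month], prod, country)] += value
    return dict(out)

def roll_up_drop_country(cube):
    """Roll up by dimension reduction: aggregate the country dimension away."""
    out = defaultdict(int)
    for (month, prod, _country), value in cube.items():
        out[(month, prod)] += value
    return dict(out)

def slice_by_country(cube, country):
    """Slice: fix one dimension to a single value, leaving a 2-D table."""
    return {(m, p): v for (m, p, c), v in cube.items() if c == country}

def dice(cube, months, products):
    """Dice: select a sub-cube by restricting several dimensions."""
    return {k: v for k, v in cube.items() if k[0] in months and k[1] in products}

def pivot(cube):
    """Pivot: reorient the axes, here (month, product, country) -> (country, product, month)."""
    return {(c, p, m): v for (m, p, c), v in cube.items()}
```

Drill-down is the reverse of `roll_up_time`: it needs the month-level cells to be kept, because the quarter totals alone cannot be split back into months.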

16 Data Warehouse Architecture
[Diagram] Data sources: Relational Databases, Legacy Data, Purchased Data, ERP Systems. Components: Extraction, Cleansing, Optimized Loader, Data Warehouse Engine, Metadata Repository, Analyze, Query.

17 Data Warehouse Architecture
- Data Extraction: gathering the data from multiple heterogeneous sources.
- Data Cleaning: finding and correcting the errors in data.
- Data Transformation: converting data from legacy format to warehouse format.
- Data Loading: sorting, summarizing, consolidating, checking integrity, and building indices and partitions.
- Refreshing: updating from data sources to warehouse.
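The first four stages above can be sketched as a chain of small functions. Everything concrete here is an assumption: the source rows, the DD/MM/YYYY legacy date format, and the choice to drop (rather than repair) rows with a missing amount.

```python
from collections import defaultdict

# Hypothetical source extracts; field names and formats are assumptions.
legacy_orders = [
    {"date": "01/02/2020", "amount": " 10 "},
    {"date": "01/02/2020", "amount": ""},  # missing amount: a data error
]
flat_file_orders = [{"date": "15/01/2020", "amount": "5"}]

def extract(*sources):
    """Extraction: gather rows from multiple heterogeneous sources."""
    return [dict(row) for source in sources for row in source]

def clean(rows):
    """Cleaning: drop rows with a missing amount and strip stray whitespace."""
    return [{**r, "amount": r["amount"].strip()} for r in rows if r["amount"].strip()]

def transform(rows):
    """Transformation: convert legacy dates and text amounts to warehouse format."""
    out = []
    for r in rows:
        day, month, year = r["date"].split("/")
        out.append({"date": f"{year}-{month}-{day}", "amount": float(r["amount"])})
    return out

def load(rows):
    """Loading: sort the rows and build a per-date summary."""
    ordered = sorted(rows, key=lambda r: r["date"])
    totals = defaultdict(float)
    for r in ordered:
        totals[r["date"]] += r["amount"]
    return {"rows": ordered, "totals_by_date": dict(totals)}

warehouse = load(transform(clean(extract(legacy_orders, flat_file_orders))))
```

Refreshing would amount to re-running the same chain on new source rows and merging the result into the existing warehouse; that step is left out of the sketch.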

18 Data Warehouse Models
- Enterprise warehouse
  - collects all of the information about subjects spanning the entire organization
- Data Mart
  - a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart

19 Introduction to Data Mining

20 What Motivated Data Mining? We are drowning in data, but starving for knowledge!

21 What Is Data Mining?
- Data mining (knowledge discovery from data)
  - Extraction of interesting (implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
- Alternative names
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

22 Why Data Mining?—Potential Applications
- Data analysis and decision support
  - Market analysis and management
    - Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  - Risk analysis and management
    - Forecasting, customer retention, quality control, competitive analysis
  - Fraud detection and detection of unusual patterns (outliers)

23 Integration of Multiple Technologies
[Diagram] Data Mining at the intersection of: Machine Learning, Database Management, Artificial Intelligence, Statistics, Visualization, Algorithms.

24 What Can Data Mining Do?
- Cluster
- Classify
  - Categorical, Regression
- Summarize
  - Summary statistics, Summary rules
- Link Analysis / Model Dependencies
  - Association rules
- Detect Deviations

25 Clustering
- Find groups of similar data items
- “Group people with similar travel profiles”
  - George, Patricia
  - Jeff, Evelyn, Chris
  - Rob
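The slide does not say which clustering method produced these groups. As one possible illustration, the sketch below runs a simple one-dimensional k-means over an invented "trips per year" figure for each person; the numbers and the starting centers are assumptions chosen so that the groups match the slide.

```python
def kmeans_1d(profiles, centers, iterations=10):
    """Assign each name to its nearest center, then move centers to group means."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for name, value in profiles.items():
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            groups[nearest].append(name)
        for i, members in enumerate(groups):
            if members:  # an empty group keeps its previous center
                centers[i] = sum(profiles[n] for n in members) / len(members)
    return groups

# Hypothetical travel profiles: trips per year.
trips_per_year = {"George": 2, "Patricia": 3, "Jeff": 20, "Evelyn": 22, "Chris": 21, "Rob": 50}
clusters = kmeans_1d(trips_per_year, centers=[0, 25, 60])
```

No group labels are supplied in advance; the groups emerge from the similarity of the values alone, which is what separates clustering from classification.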

26 Classification
- Find ways to separate data items into predefined groups
- Example: a bank loan officer wants to analyse the data to learn which customers (loan applicants) are risky and which are safe.
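A minimal sketch of the loan example, using a nearest-neighbour rule as a stand-in classifier (the slide does not name a method). The two features, income and existing debt in thousands, and all labeled examples are invented.

```python
import math

# Hypothetical labeled applicants: (income, existing debt) in thousands -> label.
training = [
    ((50, 5), "safe"),
    ((20, 15), "risky"),
    ((80, 10), "safe"),
    ((25, 30), "risky"),
]

def classify(applicant):
    """Return the label of the closest training example (1-nearest neighbour)."""
    _, label = min(training, key=lambda example: math.dist(example[0], applicant))
    return label

new_applicant_label = classify((28, 18))
```

Unlike clustering, the groups ("safe", "risky") are fixed beforehand and the labeled examples tell the method what each group looks like.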

27 Association Rules
- Identify dependencies in the data:
  - X makes Y likely
- Indicate significance of each dependency
- “Find groups of items commonly purchased together”
  - People who purchase X are likely to purchase Y
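Two standard measures of how significant a rule "X makes Y likely" is are support (how often X and Y occur together) and confidence (how often Y occurs when X does). The sketch below computes both over a few invented shopping baskets.

```python
# Hypothetical market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(x, y):
    """Of the transactions containing X, the fraction that also contain Y."""
    return support(set(x) | set(y)) / support(x)

bread_milk_support = support({"bread", "milk"})
bread_implies_milk = confidence({"bread"}, {"milk"})
```

Here bread and milk appear together in two of the four baskets, and two of the three bread baskets also contain milk.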

28 Deviation Detection
- Find unexpected values
- Uses: failure analysis, anomaly discovery for analysis
- “Find unusual occurrences in stock prices”
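One simple way to flag unexpected values is to measure how many standard deviations each value lies from the mean (a z-score) and report those beyond a threshold. The prices and the threshold of 2 below are assumptions; the slide does not prescribe a method.

```python
from statistics import mean, pstdev

def find_deviations(values, threshold=2.0):
    """Return indices of values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # all values identical: nothing deviates
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical daily closing prices with one unusual jump.
prices = [10, 11, 10, 12, 11, 30, 10]
unusual_days = find_deviations(prices)
```

A single extreme value also inflates the mean and standard deviation it is judged against, so on real data a more robust baseline (for example the median) may be preferable.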

29 Knowledge Discovery (KDD) Process
- Data mining—core of knowledge discovery process
[Diagram] Databases -> Data Cleaning, Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation

30 Knowledge Process
1. Data cleaning – to remove noise and inconsistent data
2. Data integration – to combine multiple sources
3. Data selection – to retrieve relevant data for analysis
4. Data transformation – to transform data into an appropriate form for data mining
5. Data mining
6. Evaluation
7. Knowledge presentation

31 Knowledge Process Although data mining is only one step in the entire process, it is an essential one since it uncovers hidden patterns for evaluation

32 Knowledge Process
Based on this view, the architecture of a typical data mining system may have the following major components:
- Database, data warehouse, world wide web, or other information repository
- Database or data warehouse server
- Data mining engine
- Pattern evaluation model
- User interface


