2 Data, Data Everywhere, Yet ...
We can’t find the data we need: data is scattered over the network.
We can’t get the data we need: an expert is needed to get the data.
We can’t understand the data we found: the available data is poorly documented.
We can’t use the data we found: the data needs to be transformed from one form to another.
3 What Is a Data Warehouse?
Definition by Inmon:
“A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process”
4 Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer, product, sales
5 Data Warehouse: Integrated
Constructed by integrating multiple, heterogeneous data sources
  relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied
  Ensure consistency in naming conventions, attribute measures, etc. among different data sources
  When data is moved to the warehouse, it is converted
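The conversion step can be illustrated with a minimal sketch. It assumes two hypothetical sources that name and measure the same attributes differently; every field name and value below is made up for illustration and does not come from any real system.

```python
# Hypothetical records from two sources that describe the same kind of sale
# with different attribute names and units (all names are illustrative).
source_a = [{"cust_id": 1, "amount_cents": 1250}]
source_b = [{"customerId": 2, "amount_eur": 8.5}]


def from_source_a(row):
    """Convert a source-A row to the warehouse convention (amounts in euros)."""
    return {"customer_id": row["cust_id"], "amount_eur": row["amount_cents"] / 100}


def from_source_b(row):
    """Source B already uses euros; only the key name differs."""
    return {"customer_id": row["customerId"], "amount_eur": row["amount_eur"]}


# After conversion, both sources share one naming convention and one unit.
warehouse_rows = [from_source_a(r) for r in source_a] + [from_source_b(r) for r in source_b]
```

Keeping one small converter per source makes it easy to see which naming or unit rule applies to which source.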
6 Data Warehouse: Time-Variant
The time horizon for the data warehouse is significantly longer than that of operational systems
  Operational database: current-value data
  Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years)
7 Data Warehouse: Non-Volatile
Operational update of data does not occur in the data warehouse environment
Requires only two operations in data accessing:
  initial loading of data and access of data
8 Data Warehouse vs. Operational DBMS
OLTP (On-Line Transaction Processing)
  Major task of traditional relational DBMS
  Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (On-Line Analytical Processing)
  Major task of data warehouse system
  Data analysis and decision making
9 From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data model, which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year)
  Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
10 Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
Star schema
  A fact table in the middle connected to a set of dimension tables
Snowflake schema
  A refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
Fact constellations
  Multiple fact tables share dimension tables; viewed as a collection of stars, this is called a galaxy schema or fact constellation
11 Example of Star Schema
Sales Fact Table: Time_key, Item_key, Branch_key, Location_key
  Measures: Unit_sold, Euros_sold, Avg_sales
Time: time_key, day, day_of_the_week, month, quarter, year
Item: item_key, item_name, brand, type, supplier_type
Branch: branch_key, branch_name, branch_type
Location: location_key, street, city, province_or_street, country
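As a rough illustration, a cut-down version of this star schema can be built with Python's standard sqlite3 module. The table names time_dim, item_dim, and sales_fact, the reduced column set, and all inserted rows are assumptions made for the sketch, not part of the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Two dimension tables (reduced to a few columns for brevity).
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
-- The fact table holds keys to the dimensions plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    euros_sold REAL
);
-- Made-up sample rows.
INSERT INTO time_dim VALUES (1, 'Q1', 2020), (2, 'Q2', 2020);
INSERT INTO item_dim VALUES (10, 'TV', 'BrandA'), (11, 'PC', 'BrandB');
INSERT INTO sales_fact VALUES (1, 10, 3, 900.0), (1, 11, 1, 700.0), (2, 10, 2, 600.0);
""")

# A typical star join: the fact table joined to each dimension, then aggregated.
rows = conn.execute("""
    SELECT i.item_name, t.quarter, SUM(f.euros_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON f.time_key = t.time_key
    JOIN item_dim AS i ON f.item_key = i.item_key
    GROUP BY i.item_name, t.quarter
    ORDER BY i.item_name, t.quarter
""").fetchall()
```

Every query follows the same pattern: start from the fact table, join only the dimensions the question needs, and aggregate the measures.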
12 Example of Snowflake Schema
Sales Fact Table: Time_key, Item_key, Branch_key, Location_key
  Measures: Unit_sold, Euros_sold, Avg_sales
Time: time_key, day, day_of_the_week, month, quarter, year
Item: item_key, item_name, brand, type, supplier_key
Supplier: supplier_key, supplier_type
Branch: branch_key, branch_name, branch_type
Location: location_key, street, city_key
City: city_key, city, province_or_street, country
13 Example of Fact Constellation
Sales Fact Table: Time_key, Item_key, Branch_key, Location_key
  Measures: Unit_sold, Euros_sold, Avg_sales
Shipping Fact Table: Time_key, Item_key, shipper_key, from_location, to_location
  Measures: Euros_sold, unit_shipped
Time: time_key, day, day_of_the_week, month, quarter, year
Item: item_key, item_name, brand, type, supplier_key
Branch: branch_key, branch_name, branch_type
Location: location_key, street, city, Province/street, country
shipper: shipper_key, shipper_name, location_key, shipper_type
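14 A Sample Data Cube
[Figure: a data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC), and Country (Ireland, France, Germany), plus sum cells. One annotated cell holds the total annual sales of TV in Ireland; the (All, All, All) cell holds the overall total.]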
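The sum cells of such a cube can be sketched in a few lines of Python. The sales figures below are made up, and the label "All" is used here to mark an aggregated dimension, mirroring the (All, All, All) cell.

```python
from collections import defaultdict
from itertools import product

# Made-up sales facts: (date, product, country, euros).
facts = [
    ("1Qtr", "TV", "Ireland", 100),
    ("2Qtr", "TV", "Ireland", 150),
    ("1Qtr", "PC", "France", 200),
    ("3Qtr", "VCR", "Germany", 50),
]

cube = defaultdict(int)
for date, prod, country, euros in facts:
    # Each fact contributes to 2^3 cells: every dimension is either
    # kept at its own value or replaced by "All".
    for cell in product((date, "All"), (prod, "All"), (country, "All")):
        cube[cell] += euros

tv_ireland_annual = cube[("All", "TV", "Ireland")]  # total annual sales of TV in Ireland
grand_total = cube[("All", "All", "All")]
```

Materializing every combination is only practical for small cubes; it is shown here to make the meaning of the sum cells concrete.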
15 Typical OLAP Operations
Roll up (drill-up): summarize data
  by climbing up a hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
  from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
Slice and dice
  project and select
Pivot (rotate)
  reorient the cube, visualization, 3D to series of 2D planes
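The operations can be sketched on a tiny in-memory cuboid. This is an illustrative sketch only: the function names roll_up, slice_, dice, and pivot are hypothetical helpers, and the cell values are made up.

```python
from collections import defaultdict

# Made-up base cuboid: one row per (quarter, product, country) cell.
cells = [
    {"quarter": "1Qtr", "product": "TV", "country": "Ireland", "sales": 100},
    {"quarter": "1Qtr", "product": "PC", "country": "Ireland", "sales": 80},
    {"quarter": "2Qtr", "product": "TV", "country": "Ireland", "sales": 150},
    {"quarter": "2Qtr", "product": "TV", "country": "France", "sales": 120},
]


def roll_up(rows, keep):
    """Roll up by dimension reduction: sum sales over every dimension not in `keep`."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in keep)] += row["sales"]
    return dict(totals)


def slice_(rows, dim, value):
    """Slice: select a single value on one dimension."""
    return [r for r in rows if r[dim] == value]


def dice(rows, **allowed):
    """Dice: select on several dimensions at once."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]


def pivot(rows, row_dim, col_dim):
    """Pivot: lay the cells out as a 2D table of row_dim by col_dim."""
    table = defaultdict(lambda: defaultdict(int))
    for r in rows:
        table[r[row_dim]][r[col_dim]] += r["sales"]
    return {k: dict(v) for k, v in table.items()}


by_country = roll_up(cells, keep=["country"])
ireland = slice_(cells, "country", "Ireland")
tv_first_half = dice(cells, product={"TV"}, quarter={"1Qtr", "2Qtr"})
view = pivot(cells, "product", "quarter")
```

Drill-down is the reverse of the roll-up shown here: calling roll_up with more dimensions in keep, for example ["country", "quarter"], returns the finer-grained summary.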
17 Data Warehouse Architecture
Data Extraction: gathering the data from multiple heterogeneous sources.
Data Cleaning: finding and correcting errors in the data.
Data Transformation: converting data from legacy format to warehouse format.
Data Loading: sorting, summarizing, consolidating, checking integrity, and building indices and partitions.
Refreshing: updating from data sources to warehouse.
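A minimal sketch of the first four steps, using only the Python standard library, might look as follows. The CSV layout, the column names, the sales table, and the cleaning rules are all assumptions chosen for illustration; refreshing is not shown.

```python
import csv
import io
import sqlite3

# Hypothetical legacy export: inconsistent casing, a missing amount, DD/MM/YYYY dates.
RAW = "customer,date,amount\nalice,03/01/2020,10.50\nBOB,04/01/2020,\nalice,05/01/2020,4.50\n"


def extract(text):
    """Extraction: read rows from the source (here an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Cleaning: drop rows with a missing amount and normalise customer names."""
    return [dict(r, customer=r["customer"].strip().title()) for r in rows if r["amount"].strip()]


def transform(rows):
    """Transformation: convert legacy dates to ISO format and amounts to numbers."""
    out = []
    for r in rows:
        day, month, year = r["date"].split("/")
        out.append((r["customer"], f"{year}-{month}-{day}", float(r["amount"])))
    return out


def load(conn, rows):
    """Loading: insert the rows and build an index."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, sale_date TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_customer ON sales (customer)")
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract(RAW))))
totals = conn.execute("SELECT customer, SUM(amount) FROM sales GROUP BY customer").fetchall()
```

Keeping each step a separate function means a refresh could, under these assumptions, rerun the same chain on newly extracted rows.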
18 Data Warehouse Models
Enterprise warehouse
  collects all of the information about subjects spanning the entire organization
Data mart
  a subset of corporate-wide data that is of value to a specific group of users; its scope is confined to specific, selected groups, such as a marketing data mart
20 What Motivated Data Mining?
We are drowning in data, but starving for knowledge!
21 What Is Data Mining?
Data mining (knowledge discovery from data)
  Extraction of interesting (implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
Alternative names
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
22 Why Data Mining? Potential Applications
Data analysis and decision support
  Market analysis and management
    Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  Risk analysis and management
    Forecasting, customer retention, quality control, competitive analysis
  Fraud detection and detection of unusual patterns (outliers)
23 Integration of Multiple Technologies
[Figure: Data Mining at the center, drawing on Artificial Intelligence, Machine Learning, Database Management, Statistics, Algorithms, and Visualization.]
24 What Can Data Mining Do?
Cluster
Classify
  Categorical, Regression
Summarize
  Summary statistics, Summary rules
Link Analysis / Model Dependencies
  Association rules
Detect Deviations
25 Clustering
Find groups of similar data items
“Group people with similar travel profiles”
  George, Patricia
  Jeff, Evelyn, Chris
  Rob
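One common way to find such groups is k-means; the slide does not say which method produced its grouping, so the following is only a sketch of one option. The single feature (trips per year), its values, and the starting centroids are made up so that the result mirrors the groups on the slide.

```python
# Made-up "travel profiles": trips per year for each person.
trips = {"George": 2, "Patricia": 3, "Jeff": 11, "Evelyn": 12, "Chris": 10, "Rob": 30}


def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on one numeric feature, starting from the given centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for name, x in values.items():
            # Assign each person to the nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            groups[nearest].append(name)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [
            sum(values[n] for n in g) / len(g) if g else c
            for g, c in zip(groups, centroids)
        ]
    return groups


groups = kmeans_1d(trips, centroids=[0.0, 15.0, 40.0])
```

Unlike classification, no group labels are given in advance: the groups emerge from the similarity of the data items alone.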
26 Classification
Find ways to separate data items into pre-defined groups
A bank loan officer wants to analyse the data in order to know which customers (loan applicants) are risky and which are safe.
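The loan example can be sketched with a very simple classifier, a one-feature decision stump. The feature (debt-to-income ratio), the training data, and the function names are assumptions for illustration; a real system would use more features and a stronger model.

```python
# Made-up training data: (debt-to-income ratio, label) for past loan applicants.
training = [(0.10, "safe"), (0.20, "safe"), (0.25, "safe"), (0.45, "risky"), (0.60, "risky")]


def fit_threshold(examples):
    """Pick the cut point that misclassifies the fewest training examples."""
    xs = sorted(x for x, _ in examples)
    # Candidate cut points lie halfway between neighbouring training values.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

    def errors(t):
        return sum((x > t) != (label == "risky") for x, label in examples)

    return min(candidates, key=errors)


def classify(ratio, threshold):
    """Assign a new applicant to one of the two pre-defined groups."""
    return "risky" if ratio > threshold else "safe"


threshold = fit_threshold(training)
new_applicant = classify(0.50, threshold)
```

The groups ("safe", "risky") are fixed in advance and learned from labelled examples, which is what distinguishes classification from clustering.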
27 Association Rules
Identify dependencies in the data:
  X makes Y likely
Indicate significance of each dependency
“Find groups of items commonly purchased together”
  People who purchase X are likely to purchase Y
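The significance of a dependency is commonly expressed as support and confidence; the slide does not name these measures, so treat them as one standard choice. The baskets below are made up for illustration.

```python
# Made-up market baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]


def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)


def confidence(x, y):
    """Estimated probability that a basket containing X also contains Y."""
    return support(x | y) / support(x)


bread_butter_support = support({"bread", "butter"})
butter_implies_bread = confidence({"butter"}, {"bread"})
bread_implies_butter = confidence({"bread"}, {"butter"})
```

Note that the rule is directional: in this data every basket with butter also has bread, but only two of the three baskets with bread have butter.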
28 Deviation Detection
Find unexpected values
Uses:
  Failure analysis
  Anomaly discovery for analysis
“Find unusual occurrences in stock prices”
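One simple way to flag unexpected values is a z-score test, sketched below. The price changes and the threshold of 2.5 are assumptions chosen for illustration; the slide does not prescribe a method.

```python
from statistics import mean, pstdev

# Made-up daily price changes (in percent); one day is clearly unusual.
changes = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 5.0, 0.1, -0.1]


def find_deviations(values, z_threshold=2.5):
    """Return the positions whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    # If sigma is 0, all values are identical and nothing is flagged.
    return [i for i, v in enumerate(values) if sigma and abs(v - mu) / sigma > z_threshold]


unusual_days = find_deviations(changes)
```

Because a single extreme value inflates both the mean and the standard deviation, very small samples may need a lower threshold or a median-based measure instead.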
29 Knowledge Discovery (KDD) Process
Data mining: core of the knowledge discovery process
[Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
30 Knowledge Process
Data cleaning: to remove noise and inconsistent data
Data integration: to combine multiple sources
Data selection: to retrieve relevant data for analysis
Data transformation: to transform data into an appropriate form for data mining
Data mining
Evaluation
Knowledge presentation
31 Knowledge Process
Although data mining is only one step in the entire process, it is an essential one, since it uncovers hidden patterns for evaluation
32 Knowledge Process
Based on this view, the architecture of a typical data mining system may have the following major components:
  Database, data warehouse, World Wide Web, or other information repository
  Database or data warehouse server
  Data mining engine
  Pattern evaluation module
  User interface