2 Data, Data everywhere yet ... I can’t find the data I needdata is scattered over the networkmany versions, subtle differencesI can’t get the data I needneed an expert to get the dataI can’t understand the data I foundavailable data poorly documentedI can’t use the data I foundresults are unexpecteddata needs to be transformed from one form to other
3 So What Is a Data Warehouse? Definition: A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context. [Barry Devlin]By comparison: an OLTP (on-line transaction processor) or operational system is used to deal with the everyday running of one aspect of an enterprise.OLTP systems are usually designed independently of each other and it is difficult for them to share information.
4 Why Do We Need Data Warehouses? Consolidation of information resourcesImproved query performanceSeparate research and decision support functions from the operational systemsFoundation for data mining, data visualization, advanced reporting and OLAP tools
5 Why Data Warehousing? Which are our lowest/highest margin customers ? Who are my customers and what products are they buying?What is the most effective distribution channel?What product prom- -otions have the biggest impact on revenue?Which customers are most likely to go to the competition ?What impact will new products/serviceshave on revenue and margins?
6 What Is a Data Warehouse Used for? Knowledge discoveryMaking consolidated reportsFinding relationships and correlationsData miningExamplesBanks identifying credit risksInsurance companies searching for fraudMedical research
7 How Do Data Warehouses Differ From Operational Systems? GoalsStructureSizePerformance optimizationTechnologies used
8 Comparison Chart of Database Types Data warehouseOperational systemSubject orientedTransaction orientedLarge (hundreds of GB up to several TB)Small (MB up to several GB)Historic dataCurrent dataDe-normalized table structure (few tables, many columns per table)Normalized table structure (many tables, few columns per table)Batch updatesContinuous updatesUsually very complex queriesSimple to complex queries
9 Design Differences Operational System Data Warehouse ER Diagram Star Schema
10 Supporting a Complete Solution Operational System-Data EntryData Warehouse-Data Retrieval
11 Data Warehouses, Data Marts, and Operational Data Stores Data Warehouse – The queryable source of data in the enterprise. It is comprised of the union of all of its constituent data marts.Data Mart – A logical subset of the complete data warehouse. Often viewed as a restriction of the data warehouse to a single business process or to a group of related business processes targeted toward a particular business group.Operational Data Store (ODS) – A point of integration for operational systems that developed independent of each other. Since an ODS supports day to day operations, it needs to be continually updated.
12 Decision Support Used to manage and control business Data is historical or point-in-timeOptimized for inquiry rather than updateUse of the system is loosely defined and can be ad-hocUsed by managers and end-users to understand the business and make judgements
13 What are the users saying... Data should be integrated across the enterpriseSummary data had a real value to the organizationHistorical data held the key to understanding data over timeWhat-if capabilities are required
14 Data Warehousing -- It is a process Technique for assembling and managing data from various sources for the purpose of answering business questions. Thus making decisions that were not previous possibleA decision support database maintained separately from the organization’s operational database
16 From the Data Warehouse to Data Marts InformationIndividually StructuredLessMoreHistoryNormalizedDetailedDepartmentally StructuredData WarehouseOrganizationally Structured
17 Users have different views of Data Tourists: Browse information harvested by farmersOLAPFarmers: Harvest informationfrom known access pathsExplorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed dataOrganizationallystructured
18 Wal*Mart Case Study Founded by Sam Walton One the largest Super Market Chains in the USWal*Mart: Retail StoresSAM's Clubs 100+Wholesalers StoresThis case study is from Felipe Carino’s (NCR Teradata) presentation made at Stanford Database Seminar
19 Old Retail Paradigm Suppliers Wal*Mart Accept Orders Promote Products Provide special IncentivesMonitor and Track The IncentivesBill and Collect ReceivablesEstimate Retailer DemandsWal*MartInventory ManagementMerchandise Accounts PayablePurchasingSupplier Promotions: National, Region, Store Level
20 New (Just-In-Time) Retail Paradigm No more dealsShelf-Pass Through (POS Application)One Unit PriceSuppliers paid once a week on ACTUAL items soldWal*Mart ManagerDaily Inventory RestockSuppliers (sometimes SameDay) ship to Wal*MartWarehouse-Pass ThroughStock some Large ItemsDelivery may come from supplierDistribution CenterSupplier’s merchandise unloaded directly onto Wal*Mart Trucks
21 Information as a Strategic Weapon Daily Summary of all Sales InformationRegional Analysis of all Stores in a logical areaSpecific Product SalesSpecific Supplies SalesTrend Analysis, etc.Wal*Mart uses information when negotiating withSuppliersAdvertisers etc.
22 Schema Design Database organization Schema Types must look like businessmust be recognizable by business userapproachable by business userMust be simpleSchema TypesStar SchemaFact Constellation SchemaSnowflake schema
23 Star SchemaA single fact table and for each dimension one dimension tableDoes not capture hierarchies directlyTimeproddate, custno, prodno, cityname, salesfactcustcity
24 Dimension Tables Dimension tables Define business in terms already familiar to usersWide rows with lots of descriptive textSmall tables (about a million rows)Joined to fact table by a foreign keyheavily indexedtypical dimensionstime periods, geographic region (markets, cities), products, customers, salesperson, etc.
25 Fact Table Central table Typical example: individual sales records mostly raw numeric itemsnarrow rows, a few columns at mostlarge number of rows (millions to a billion)Access via dimensions
26 Snowflake schemaRepresent dimensional hierarchy directly by normalizing tables.Easy to maintain and saves storageTimeproddate, custno, prodno, cityname, ...factcustcityregion
27 Fact Constellation Fact Constellation Booking Checkout Multiple fact tables that share many dimension tablesBooking and Checkout may share many dimension tables in the hotel industryHotelsTravel AgentsPromotionRoom TypeCustomerBookingCheckout
28 Data Granularity in Warehouse Summarized data storedreduce storage costsreduce cpu usageincreases performance since smaller number of records to be processeddesign around traditional high level reporting needstradeoff with volume of data to be stored and detailed usage of data
29 Granularity in Warehouse Solution is to have dual level of granularityStore summary data on disks95% of DSS processing done against this dataStore detail on tapes5% of DSS processing against this data
30 Levels of Granularity Banking Example account month # trans withdrawalsdepositsaverage balOperationalaccountactivity dateamounttellerlocationaccount balmonthly accountregister -- up to10 years60 days ofactivityamountactivity dateaccount balNot all fieldsneed be archived
31 Data Integration Across Sources TrustSavingsLoansCredit cardSame datadifferent nameDifferent dataSame nameData found herenowhere elseDifferent keyssame data
32 Data TransformationOperational/Source DataSequentialLegacyRelationalExternalDataTransformationAccessing Capturing Extracting Householding FilteringReconciling Conditioning Loading Validating ScoringData transformation is the foundation for achieving single version of the truthMajor concern for ITData warehouse can fail if appropriate data transformation strategy is not developed
33 Data Transformation Example encodingunitfieldappl A - balanceappl B - balappl C - currbalappl D - balcurrappl A - pipeline - cmappl B - pipeline - inappl C - pipeline - feetappl D - pipeline - ydsappl A - m,fappl B - 1,0appl C - x,yappl D - male, femaleData Warehouse
34 Data Integrity Problems Same person, different spellingsAgarwal, Agrawal, Aggarwal etc...Multiple ways to denote company namePersistent Systems, PSPL, Persistent Pvt. LTD.Use of different namesmumbai, bombayDifferent account numbers generated by different applications for the same customerRequired fields left blankInvalid product codes collected at point of salemanual entry leads to mistakes“in case of a problem use ”
35 Data Transformation Terms ExtractingConditioningScrubbingMergingHouseholdingEnrichmentScoringLoadingValidatingDelta Updating
36 Data Transformation Terms HouseholdingIdentifying all members of a household (living at the same address)Ensures only one mail is sent to a householdCan result in substantial savings: 1 million catalogues at Rs. 50 each costs Rs. 50 million . A 2% savings would save Rs. 1 million
37 Refresh Propagate updates on source data to the warehouse Issues: when to refreshhow to refresh -- incremental refresh techniques
38 When to Refresh?periodically (e.g., every night, every week) or after significant eventson every update: not warranted unless warehouse data require current data (up to the minute stock quotes)refresh policy set by administrator based on user needs and trafficpossibly different policies for different sources
39 Refresh techniques Incremental techniques detect changes on base tables: replication servers (e.g., Sybase, Oracle, IBM Data Propagator)snapshots (Oracle)transaction shipping (Sybase)compute changes to derived and summary tablesmaintain transactional correctness for incremental load
40 How To Detect ChangesCreate a snapshot log table to record ids of updated rows of source data and timestampDetect changes by:Defining after row triggers to update snapshot log when source table changesUsing regular transaction log to detect changes to source data
41 Querying Data Warehouses SQL ExtensionsMultidimensional modeling of dataOLAPMore on OLAP later …
42 SQL Extensions Extended family of aggregate functions rank (top 10 customers)percentile (top 30% of customers)median, modeObject Relational Systems allow addition of new aggregate functionsReporting featuresrunning total, cumulative totals
43 Reporting Tools Andyne Computing -- GQL Brio -- BrioQuery Business Objects -- Business ObjectsCognos -- ImpromptuInformation Builders Inc. -- Focus for WindowsOracle -- Discoverer2000Platinum Technology -- SQL*Assist, ProReportsPowerSoft -- InfoMakerSAS Institute -- SAS/AssistSoftware AG -- EsperantSterling Software -- VISION:Data
45 Deploying Data Warehouses What business information keeps you in business today? What business information can put you out of business tomorrow?What business information should be a mouse click away?What business conditions are the driving the need for business information?
46 Cultural Considerations Not just a technology projectNew way of using information to support daily activities and decision makingCare must be taken to prepare organization for changeMust have organizational backing and support
47 User TrainingUsers must have a higher level of IT proficiency than for operational systemsTraining to help users analyze data in the warehouse effectively
48 Summary: Building a Data Warehouse Data Warehouse LifecycleAnalysisDesignImport dataInstall front-end toolsTest and deploy
49 A case -- the STORET Central Warehouse Improved performance and faster data retrievalAbility to produce larger reportsAbility to provide more data query optionsStreamlined application navigation
57 Nature of OLAP Analysis Aggregation -- (total sales, percent-to-total)Comparison -- Budget vs. ExpensesRanking -- Top 10, quartile analysisAccess to detailed and aggregate dataComplex criteria specificationVisualizationNeed interactive response to aggregate queries
58 Multi-dimensional Data Measure - sales (actual, plan, variance)Dimensions: Product, Region, TimeHierarchical summarization pathsProduct Region TimeIndustry Country YearCategory Region QuarterProduct City Month weekOffice DayRegionWSNProductToothpasteJuiceColaMilkCreamSoapMonth1234765
59 Conceptual Model for OLAP Numeric measures to be analyzede.g. Sales (Rs), sales (volume), budget, revenue, inventoryDimensionsother attributes of data, define the spacee.g., store, product, date-of-salehierarchies on dimensionse.g. branch -> city -> state
60 Operations Rollup: summarize data Drill down: get more details e.g., given sales data, summarize sales for last year by product category and regionDrill down: get more detailse.g., given summarized sales as above, find breakup of sales by city within each region, or within the Andhra region
61 More OLAP OperationsHypothesis driven search: E.g. factors affecting defaultersview defaulting rate on age aggregated over other dimensionsfor particular age segment detail along professionNeed interactive response to aggregate queries=> precompute various aggregates
63 SQL Extensions Cube operator group by on all subsets of a set of attributes (month,city)redundant scan and sorting of data can be avoidedVarious other non-standard SQL extensions by vendors
64 OLAP: 3 Tier DSS Data Warehouse Database Layer Store atomic data in industry standard Data Warehouse.OLAP EngineApplication Logic LayerGenerate SQL execution plans in the OLAP engine to obtain OLAP functionality.Decision Support ClientPresentation LayerObtain multi-dimensional reports from the DSS Client.
65 Strengths of OLAP It is a powerful visualization tool It provides fast, interactive response timesIt is good for analyzing time seriesIt can be useful to find some clusters and outlinersMany vendors offer OLAP tools
66 Brief History Express and System W DSS Online Analytical Processing - coined by EF Codd in white paper by Arbor SoftwareGenerally synonymous with earlier terms such as Decisions Support, Business Intelligence, Executive Information SystemMOLAP: Multidimensional OLAP (Hyperion (Arbor Essbase), Oracle Express)ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)
67 OLAP and Executive Information Systems Oracle -- ExpressPilot -- LightShipPlanning Sciences -- GentiumPlatinum Technology -- ProdeaBeacon, Forest & TreesSAS Institute -- SAS/EIS, OLAP++Speedware -- MediaAndyne Computing -- PabloArbor Software -- EssbaseCognos -- PowerPlayComshare -- Commander OLAPHolistic Systems -- HolosInformation Advantage -- AXSYS, WebOLAPInformix -- MetacubeMicrostrategies --DSS/Agent
68 Microsoft OLAP strategy Plato: OLAP server: powerful, integrating various operational sourcesOLE-DB for OLAP: emerging industry standard based on MDX --> extension of SQL for OLAPPivot-table services: integrate with Office 2000Every desktop will have OLAP capability.Client side caching and calculationsPartitioned and virtual cubeHybrid relational and multidimensional storage