3 Muhammad Khurram Shahzad
- Assistant Professor
- M.Sc. from PUCIT, University of the Punjab, PK
- MS from KTH - Royal Institute of Technology, Sweden, 2006
- PhD from Information Systems Lab, KTH - Royal Institute of Technology & Stockholm University, Sweden (Jan’08 - Inshallah Nov’12)
- At least 26 publications
9 Research Projects
- Digital Repository Service for Academic Performance Assessment and Social Networking in Developing Countries
- Centre for Academic Statistics of Science and Technology
- Productivity and Social Network Analysis of the BPM Community
10 Research Partners
- Stockholm University, Sweden
- Technical University Eindhoven, The Netherlands
- University of Sri Jayewardenepura, Sri Lanka
11 Course Objectives
At the end of the course you will (hopefully) be able to answer these questions:
- Why exactly does the world need a data warehouse?
- How does a DW differ from traditional databases and RDBMSs?
- Where does OLAP stand in the DW picture?
- What are the different DW and OLAP models/schemas? How are they implemented and tested?
- How is ETL performed? What is data cleansing? How is it performed? What are the well-known algorithms?
- Which DW architectures have been reported in the literature? What are their strengths and weaknesses?
- What new areas of research and development are emerging from the DW domain?
12 Course Material
- Course Book
  - Paulraj Ponniah, Data Warehousing Fundamentals, John Wiley & Sons Inc., NY.
- Reference Books
  - W.H. Inmon, Building the Data Warehouse (Second Edition), John Wiley & Sons Inc., NY.
  - Ralph Kimball and Margy Ross, The Data Warehouse Toolkit (Second Edition), John Wiley & Sons Inc., NY.
13 Assignments
- Implementation/research on important concepts. To be submitted in groups of 2 students.
- Include:
  - Modeling and benchmarking of multiple warehouse schemas
  - Implementation of an efficient OLAP cube generation algorithm
  - Data cleansing and transformation of legacy data
  - Literature review paper on:
    - View consistency mechanisms in data warehouses
    - Index design optimization
    - Advanced DW applications
  - May add a couple more
14 Lab Work
- Lab exercises. To be submitted individually.
15 Course Introduction
- What is this course about?
- Decision support cycle: Planning - Designing - Developing - Optimizing - Utilizing
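16 Course Introduction
[Figure: three-tier data warehouse architecture. Information sources (operational DBs, semistructured sources) are extracted, transformed, loaded and refreshed into the data warehouse server (Tier 1: data warehouse and data marts), which serves the OLAP servers (Tier 2: e.g., MOLAP, ROLAP), which in turn serve the clients (Tier 3: analysis, query/reporting, data mining).]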
17 Operational Sources (OLTP’s)
- Operational computer systems did provide information to run day-to-day operations and to answer daily questions, but...
- Also called online transaction processing (OLTP) systems
- Data is read or manipulated with each transaction
- Transactions/queries are simple and easy to write
- Usually for middle management
- Examples:
  - Sales systems
  - Hotel reservation systems
  - COMSIS
  - HRM applications
  - Etc.
18 Typical decision queries
- Data sets are mounting everywhere, but they are not useful for decision support
- Decision-making requires complex questions over integrated data
- Enterprise-wide data is desired
- Decision makers want to know:
  - Where to build a new oil warehouse?
  - Which market should they strengthen?
  - Which customer groups are most profitable?
  - How much is the total sale by month/quarter/year for each office?
  - Is there any relation between promotion campaigns and sales growth?
- Can OLTP answer all such questions efficiently?
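As a rough illustration of the "total sale by month/quarter/year for each office" question, the sketch below aggregates a few integrated sales records at different time levels. The office names, dates, amounts and the `total_sales` helper are all hypothetical and exist only to show the shape of such a query.

```python
from collections import defaultdict
from datetime import date

# Hypothetical integrated sales records: (office, sale date, amount).
sales = [
    ("Lahore", date(2012, 1, 15), 100.0),
    ("Lahore", date(2012, 2, 3), 50.0),
    ("Lahore", date(2012, 4, 20), 70.0),
    ("Karachi", date(2012, 1, 9), 30.0),
]

def total_sales(records, level):
    """Sum sale amounts per (office, period) at the given time level."""
    totals = defaultdict(float)
    for office, day, amount in records:
        if level == "year":
            period = (day.year,)
        elif level == "quarter":
            period = (day.year, (day.month - 1) // 3 + 1)
        elif level == "month":
            period = (day.year, day.month)
        else:
            raise ValueError(f"unknown level: {level}")
        totals[(office, *period)] += amount
    return dict(totals)

by_quarter = total_sales(sales, "quarter")
by_year = total_sales(sales, "year")
```

The same records answer the question at every level, which is only possible once the data from all offices sits in one integrated place.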
19 Information crisis*
- Integrated: must have a single, enterprise-wide view
- Data integrity: information must be accurate and must conform to business rules
- Accessible: easily accessible, with intuitive access paths, and responsive for analysis
- Credible: every business factor must have one and only one value
- Timely: information must be available within the stipulated time frame
* Paulraj 2001.
20 Data Driven-DSS*
* Farooq, lecture slides for ‘Data Warehouse’ course
21 Failure of old DSS
- Inability to provide strategic information
- IT receives too many ad hoc requests, causing a large overload
- Requests are not only numerous, they also change over time
- More understanding requires more reports
- Users are caught in a spiral of reports
- Users have to depend on IT for information
- Cannot provide enough performance; slow
- Strategic information has to be flexible and conducive to analysis
22 OLTP vs. DSS
Trait | OLTP | DSS
User | Middle management | Executives, decision-makers
Function | For day-to-day operations | For analysis & decision support
DB (modeling) | E-R based, after normalization | Star-oriented schemas
Data | Current, isolated | Archived, derived, summarized
Unit of work | Transactions | Complex query
Access, type | DML, read | Read
Access frequency | Very high | Medium to low
Records accessed | Tens to hundreds | Thousands to millions
Quantity of users | Thousands | Very small number
Usage | Predictable, repetitive | Ad hoc, random, heuristic-based
DB size | 100 MB-GB | 100 GB-TB
Response time | Sub-second | Up to minutes
23 Expectations of new soln.
- DB designed for analytical tasks
- Data from multiple applications
- Easy to use
- Ability to perform what-if analysis
- Read-intensive data usage
- Direct interaction with the system, without IT assistance
- Periodically updated, stable contents
- Current & historical data
- Ability for users to initiate reports
24 DW meets expectations
- Provides an enterprise view
- Current & historical data available
- Decision transactions possible without affecting operational sources
- Reliable source of information
- Ability for users to initiate reports
- Acts as a data source for all analytical applications
25 Definition of DW
- Inmon defined: “A DW is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of decision-making”.
- Kelly said: “Separate, available, integrated, time-stamped, subject-oriented, non-volatile, accessible”.
- Four properties of DW: subject-oriented, integrated, time-variant, non-volatile
26 Subject-oriented
- In operational sources, data is organized by applications or business processes
- In a DW, the subject is the organizing method
- Subjects vary with the enterprise
- These are the critical factors that affect performance
- Example of a manufacturing company:
  - Sales
  - Shipment
  - Inventory, etc.
27 Integrated Data
- Data comes from several applications
- Problems of integration come into play
- File layout, encoding, field names, systems, schema and data heterogeneity are the issues
- Bank example, variance in: naming conventions, attributes for a data item, account no., account type, size, currency
- In addition to internal sources, external data sources:
  - Data shared by external companies
  - Websites
  - Others
- Removal of inconsistency
- Hence the process of extraction, transformation & loading
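The bank example can be sketched as follows: two source systems describe the same account with different field names, account number formats, type codes and currencies, and an integration step brings them to one convention. Every name, code table and exchange rate below is an assumption made up for illustration, not taken from a real system.

```python
# Hypothetical records for the same account from two source systems.
savings_system = {"acct_no": "001-234", "type": "S", "balance": 1500.0, "currency": "USD"}
branch_system = {"AccountNumber": 1234, "AccountType": "savings", "Bal": "120000", "Cur": "PKR"}

# Assumed mapping tables; a real project would derive these from source metadata.
TYPE_CODES = {"S": "SAVINGS", "savings": "SAVINGS", "C": "CURRENT", "current": "CURRENT"}
RATES_TO_USD = {"USD": 1.0, "PKR": 0.01}  # illustrative rate, not a real quote

def integrate(record, field_map):
    """Rename fields, then standardise account number, type and currency."""
    row = {target: record[source] for target, source in field_map.items()}
    digits = "".join(ch for ch in str(row["account_no"]) if ch.isdigit())
    return {
        "account_no": digits.zfill(6),  # one size and format for account numbers
        "account_type": TYPE_CODES[row["account_type"]],  # one encoding
        "balance_usd": round(float(row["balance"]) * RATES_TO_USD[row["currency"]], 2),
    }

a = integrate(savings_system, {"account_no": "acct_no", "account_type": "type",
                               "balance": "balance", "currency": "currency"})
b = integrate(branch_system, {"account_no": "AccountNumber", "account_type": "AccountType",
                              "balance": "Bal", "currency": "Cur"})
```

After integration both records carry the same account number and type encoding, so they can be compared or merged.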
28 Time variant
- Operational data has current values
- Comparative analysis is one of the best techniques for business performance evaluation
- Time is a critical factor for comparative analysis
- Every data structure in the DW contains a time element
- To promote a product in certain …, the analyst has to know both current and historical values
- The advantages are:
  - Allows for analysis of the past
  - Relates information to the present
  - Enables forecasts for the future
29 Non-volatile
- Data from operational systems is moved into the DW at specific intervals
- Data is persistent, i.e. non-volatile: it is not removed
- Not every business transaction updates the DW
- Data in the DW is not deleted
- Nor is data changed by individual transactions

Properties summary
- Subject-oriented: organized along the lines of the subjects of the corporation. Typical subjects are customer, product, vendor and transaction.
- Time-variant: every record in the data warehouse has some form of time variancy attached to it.
- Non-volatile: refers to the inability of data to be updated. Every record in the data warehouse is time stamped in one form or another.
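A minimal sketch of the time-variant and non-volatile properties, assuming a hypothetical `account_snapshot` table in an in-memory SQLite database: each periodic load appends time-stamped rows and never updates or deletes earlier ones, so the history stays available for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account_snapshot ("
    " account_no TEXT, balance REAL, snapshot_date TEXT,"
    " PRIMARY KEY (account_no, snapshot_date))"
)

def load_snapshot(conn, snapshot_date, balances):
    """Append one periodic snapshot; existing rows are never updated or deleted."""
    conn.executemany(
        "INSERT INTO account_snapshot VALUES (?, ?, ?)",
        [(acct, bal, snapshot_date) for acct, bal in balances.items()],
    )

# Two loads at different intervals (hypothetical values).
load_snapshot(conn, "2012-01-31", {"001234": 1500.0})
load_snapshot(conn, "2012-02-29", {"001234": 1800.0})

# Every record carries its time element, so past and present can be compared.
history = conn.execute(
    "SELECT snapshot_date, balance FROM account_snapshot"
    " WHERE account_no = ? ORDER BY snapshot_date",
    ("001234",),
).fetchall()
```

An operational system would typically overwrite the balance in place; here the January value survives the February load.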
31 Agenda
- Data warehouse architecture & building blocks
- ER modeling review
- Need for dimensional modeling
- Dimensional modeling & its insides
- Comparison of ER with dimensional modeling
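32 Architecture of DW
[Figure: three-tier data warehouse architecture with a staging area. Information sources (operational DBs, semistructured sources) are extracted, transformed, loaded and refreshed, via the staging area, into the data warehouse server (Tier 1: data warehouse and data marts), which serves the OLAP servers (Tier 2: e.g., MOLAP, ROLAP), which in turn serve the clients (Tier 3: analysis, query/reporting, data mining).]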
33 Components
Major components:
- Source data component
- Data staging component
- Data storage component
- Information delivery component
- Metadata component
- Management and control component
34 1. Source Data Components
Source data can be grouped into 4 components:
- Production data
  - Comes from the operational systems of the enterprise
  - Some segments are selected from it
  - Narrow scope, e.g. order details
- Internal data
  - Private spreadsheets, documents, customer profiles, etc.
  - E.g. customer profiles for a specific offering
  - Special strategies are needed to transform it (e.g. text documents) for the DW
- Archived data
  - Old data is archived
  - The DW has snapshots of historical data
- External data
  - Executives depend upon external sources
  - E.g. market data of competitors; a car rental business requires data about new car manufacturing
  - Conversions must be defined
35 2. Data Staging Components
- After data is extracted, it has to be prepared
- Data extracted from sources needs to be changed, converted and made ready in a suitable format
- Three major functions make the data ready:
  - Extract
  - Transform
  - Load
- The staging area provides a place with a set of functions to:
  - Clean
  - Change
  - Combine
  - Convert
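The three staging functions can be sketched as three small Python functions. This is a toy illustration under stated assumptions: the two sources, their field names and the cleaning rules are hypothetical, and the "warehouse" is just a list.

```python
def extract(sources):
    """Pull raw rows from every source into one staging list."""
    return [row for source in sources for row in source]

def transform(rows):
    """Clean, convert and combine staged rows."""
    cleaned = {}
    for row in rows:
        name = row["customer"].strip().title()           # clean: trim, normalise case
        amount = float(row["amount"])                    # convert: text to number
        cleaned[name] = cleaned.get(name, 0.0) + amount  # combine: merge duplicates
    return [{"customer": n, "amount": a} for n, a in sorted(cleaned.items())]

def load(warehouse, rows):
    """Append prepared rows to the warehouse table (a plain list here)."""
    warehouse.extend(rows)
    return warehouse

# Hypothetical sources with inconsistent formatting.
orders_db = [{"customer": " ali khan ", "amount": "10.5"}]
legacy_file = [{"customer": "ALI KHAN", "amount": "4.5"}, {"customer": "sara", "amount": "3"}]

warehouse = load([], transform(extract([orders_db, legacy_file])))
```

The two spellings of the same customer end up as one row, which is the kind of inconsistency the staging area is meant to remove before loading.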
36 3. Data Storage Components
- Separate repository
- Data is structured for efficient processing
- Redundancy is increased
- Updated after specific periods
- Read-only
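38 Designing DW
[Figure: three-tier data warehouse architecture with a staging area. Information sources (operational DBs, semistructured sources) are extracted, transformed, loaded and refreshed, via the staging area, into the data warehouse server (Tier 1: data warehouse and data marts), which serves the OLAP servers (Tier 2: e.g., MOLAP, ROLAP), which in turn serve the clients (Tier 3: analysis, query/reporting, data mining).]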
39 Background (ER Modeling)
- ER models are hard to remember, due to the increased number of tables
- ER does not answer analytical questions efficiently
- Dimensional modeling focuses on subject orientation and the critical factors of the business
- Critical factors are stored in facts
- Should give a description
- ER is complex for queries over multiple tables
- Redundancy is not a problem; it helps achieve efficiency
40 Need of Dimensional Modeling
- For ER modeling, entities are collected from the environment
- Each entity acts as a table
- Success reasons:
  - Normalized after ER, since normalization removes redundancy
  - But the number of tables is increased
  - So inconsistency is avoided
  - No calculated attributes
  - Is useful for fast access to small amounts of data
  - Tables can have many connections
- De-normalization (in DW):
  - Add primary key
  - Direct relationships
  - Re-introduce redundancy
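De-normalization can be illustrated with a small sketch: three normalized tables (product, brand, category) are flattened into one wide table with a single primary key, a direct relationship to each attribute and deliberately repeated values. The tables, column names and the `denormalise` helper are hypothetical.

```python
# Normalised (ER-style) tables: each brand and category is stored once.
categories = {10: {"category": "Beverages"}}
brands = {7: {"brand": "BrandA", "category_id": 10}}
products = {1: {"sku": "SKU-1", "brand_id": 7}, 2: {"sku": "SKU-2", "brand_id": 7}}

def denormalise(products, brands, categories):
    """Flatten product -> brand -> category into one wide row per product."""
    dimension = []
    for product_key, prod in sorted(products.items()):
        brand = brands[prod["brand_id"]]
        category = categories[brand["category_id"]]
        dimension.append({
            "product_key": product_key,        # single primary key
            "sku": prod["sku"],
            "brand": brand["brand"],           # redundancy: repeated per product
            "category": category["category"],  # redundancy: repeated per product
        })
    return dimension

product_dim = denormalise(products, brands, categories)
```

A query on `product_dim` needs no joins to reach brand or category, at the cost of storing those values once per product.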
41 Dimensional Modeling
- Logical design technique for high performance
- Each model represents a subject in the DW
- Is the modeling technique for storage
- Two important concepts:
  - Fact
    - Numeric measurements that represent a business activity/event
    - Are pre-computed, redundant
    - Example: profit, quantity sold
  - Dimension
    - Qualifying characteristics that give a perspective on a fact
    - Example: date (date, month, quarter, year)
42 Dimensional Modeling (cont.)
- Facts are stored in the fact table
- Calculated attributes are removed in 1NF
- Dimensions are represented by dimension tables
- Dimensions are the degrees along which facts can be judged
- Each fact table is surrounded by dimension tables
- The result looks like a star, hence the name Star Schema
43 Example
- TIME: time_key (PK), SQL_date, day_of_week, month
- STORE: store_key (PK), store_ID, store_name, address, district, floor_type
- CLERK: clerk_key (PK), clerk_id, clerk_name, clerk_grade
- PRODUCT: product_key (PK), SKU, description, brand, category
- CUSTOMER: customer_key (PK), customer_name, purchase_profile, credit_profile
- PROMOTION: promotion_key (PK), promotion_name, price_type, ad_type
- FACT: time_key (FK), store_key (FK), clerk_key (FK), product_key (FK), customer_key (FK), promotion_key (FK), dollars_sold, units_sold, dollars_cost
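A cut-down version of this star schema can be tried out with Python's built-in `sqlite3` module. The sketch keeps only three of the six dimensions and a subset of their columns; the table names (`time_dim`, `store_dim`, `product_dim`, `sales_fact`) and all sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY, sql_date TEXT, day_of_week TEXT, month TEXT);
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_id TEXT, store_name TEXT, district TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, sku TEXT, description TEXT, brand TEXT, category TEXT);
CREATE TABLE sales_fact (
    time_key    INTEGER REFERENCES time_dim(time_key),
    store_key   INTEGER REFERENCES store_dim(store_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    dollars_sold REAL, units_sold INTEGER, dollars_cost REAL,
    PRIMARY KEY (time_key, store_key, product_key)
);
INSERT INTO time_dim VALUES (1, '2012-01-02', 'Monday', 'January'),
                            (2, '2012-02-01', 'Wednesday', 'February');
INSERT INTO store_dim VALUES (1, 'S01', 'Mall Road', 'Lahore');
INSERT INTO product_dim VALUES (1, 'SKU-1', 'Green tea', 'BrandA', 'Beverages'),
                               (2, 'SKU-2', 'Biscuits', 'BrandB', 'Snacks');
INSERT INTO sales_fact VALUES (1, 1, 1, 20.0, 4, 12.0),
                              (1, 1, 2, 9.0, 3, 6.0),
                              (2, 1, 1, 10.0, 2, 6.0);
""")

# Star join: the fact table is joined to the dimensions that qualify it.
rows = conn.execute("""
    SELECT t.month, p.category,
           SUM(f.dollars_sold), SUM(f.dollars_sold - f.dollars_cost)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN product_dim AS p ON p.product_key = f.product_key
    GROUP BY t.month, p.category
    ORDER BY MIN(t.time_key), p.category
""").fetchall()
```

Each result row gives sales and profit for one month and product category; adding another dimension to the analysis means adding one more join to the same fact table.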
44 Inside Dimensional Modeling
- Inside dimension table
  - Key attribute of the dimension table, for identification
  - Large number of columns, wide table
  - Non-calculated, textual attributes
  - Attributes are not directly related
  - Un-normalized in star schema
  - Drill-down and drill-up are two ways of exploiting dimensions
  - Can have multiple hierarchies
  - Relatively small number of records
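Drill-down and drill-up along a dimension hierarchy can be sketched in a few lines. The example assumes a year > quarter > month hierarchy in the time dimension and a handful of made-up fact rows; `aggregate` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

# Hypothetical fact rows already joined to a time dimension with a
# year > quarter > month hierarchy.
facts = [
    {"year": 2012, "quarter": "Q1", "month": "Jan", "units_sold": 4},
    {"year": 2012, "quarter": "Q1", "month": "Feb", "units_sold": 2},
    {"year": 2012, "quarter": "Q2", "month": "Apr", "units_sold": 5},
]
HIERARCHY = ["year", "quarter", "month"]

def aggregate(rows, depth):
    """Total units_sold grouped by the first `depth` levels of the hierarchy."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[level] for level in HIERARCHY[:depth])
        totals[key] += row["units_sold"]
    return dict(totals)

by_year = aggregate(facts, 1)     # drill-up: coarsest level
by_quarter = aggregate(facts, 2)  # drill-down one level
by_month = aggregate(facts, 3)    # drill-down to the finest level
```

Moving between levels only changes which dimension attributes are used for grouping; the fact rows themselves are untouched.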
45 Inside Dimensional Modeling
- Inside fact table
  - Has two types of attributes:
    - Key attributes, for connections
    - Facts
  - Concatenated key
  - Grain or level of data identified
  - Large number of records
  - Limited attributes
  - Sparse data set
  - Degenerate dimensions
  - Fact-less fact table
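Three of these ideas (concatenated key, sparse data set, fact-less fact table) can be shown with plain Python containers. All key values and the promotion-coverage example are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from itertools import product

time_keys, store_keys, product_keys = [1, 2, 3], [1, 2], [1, 2, 3, 4]

# Fact rows identified by the concatenated key (time_key, store_key, product_key).
# Only combinations where a sale actually happened are stored.
sales_fact = {
    (1, 1, 1): {"units_sold": 4, "dollars_sold": 20.0},
    (1, 2, 3): {"units_sold": 1, "dollars_sold": 7.5},
    (3, 1, 4): {"units_sold": 2, "dollars_sold": 9.0},
}

# Sparsity: stored rows compared with all possible key combinations.
possible = len(list(product(time_keys, store_keys, product_keys)))
density = len(sales_fact) / possible

# A fact-less fact table has keys but no measures: each row only records
# that an event occurred (here, a promotion covering a product in a store).
promotion_coverage = {(1, 1, 1), (1, 1, 2), (2, 2, 1)}
promoted_in_store_1 = sorted({p for (_, s, p) in promotion_coverage if s == 1})
```

With 3 of 24 possible combinations present, the fact table is sparse, and the fact-less table still answers "which products were promoted in store 1" purely from its keys.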
46 Advantages of Star Schema
- Easy for users to understand
- Optimized for navigation
  - To go from one table to another
  - For obtaining the relative value of a dimension
- Most suitable for query processing
47 Star Schema Keys
- Primary keys
  - Identifying attribute in the dimension table
  - Relationship attributes combine together to form the P.K.
- Surrogate keys
  - Replacement of the primary key
  - System generated
- Foreign keys
  - Collection of primary keys of dimension tables
- Primary key of the fact table
  - Collection of P.K.s
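One possible way to generate surrogate keys is sketched below: a small generator hands out system-generated integers and remembers which natural (source) key each one replaces. The class name, the source system names and the SKU values are assumptions for illustration only.

```python
from itertools import count

class SurrogateKeyGenerator:
    """Hand out system-generated integer keys for natural (source) keys."""

    def __init__(self):
        self._next = count(1)
        self._by_natural_key = {}

    def key_for(self, source_system, natural_key):
        """Return the existing surrogate key, or assign the next one."""
        lookup = (source_system, natural_key)
        if lookup not in self._by_natural_key:
            self._by_natural_key[lookup] = next(self._next)
        return self._by_natural_key[lookup]

product_keys = SurrogateKeyGenerator()
k1 = product_keys.key_for("sales_system", "SKU-1")
k2 = product_keys.key_for("inventory_system", "SKU-1")  # same code, different source
k3 = product_keys.key_for("sales_system", "SKU-1")      # repeated lookup

# The fact table's primary key is the collection of dimension keys,
# e.g. (time_key, store_key, product_key).
fact_primary_key = (1, 1, k1)
```

Because the surrogate key is independent of the source systems, two systems that reuse the same product code do not collide in the dimension table.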