DATA WAREHOUSING AND DATA MINING Mubarak Banisakher.


1 DATA WAREHOUSING AND DATA MINING Mubarak Banisakher

2 Course Overview
- The course: what and how
- 0. Introduction
- I. Data Warehousing
- II. Decision Support and OLAP
- III. Data Mining
- IV. Looking Ahead
- Demos and Labs

3 0. Introduction
- Data Warehousing, OLAP and data mining: what and why (now)?
- Relation to OLTP
- A case study
- Demos, labs

4 A producer wants to know...
- Which are our lowest/highest margin customers?
- Who are my customers and what products are they buying?
- Which customers are most likely to go to the competition?
- What impact will new products/services have on revenue and margins?
- What product promotions have the biggest impact on revenue?
- What is the most effective distribution channel?

5 Data, Data everywhere yet...
- I can’t find the data I need
  - data is scattered over the network
  - many versions, subtle differences
- I can’t get the data I need
  - need an expert to get the data
- I can’t understand the data I found
  - available data poorly documented
- I can’t use the data I found
  - results are unexpected
  - data needs to be transformed from one form to another

6 What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context. [Barry Devlin]

7 What are the users saying...
- Data should be integrated across the enterprise
- Summary data has a real value to the organization
- Historical data holds the key to understanding data over time
- What-if capabilities are required

8 What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference. [Forrester Research, April 1996]
[Diagram: Data -> Information]

9 Evolution
- 60’s: Batch reports
  - hard to find and analyze information
  - inflexible and expensive, reprogram every new request
- 70’s: Terminal-based DSS and EIS (executive information systems)
  - still inflexible, not integrated with desktop tools
- 80’s: Desktop data access and analysis tools
  - query tools, spreadsheets, GUIs
  - easier to use, but only access operational databases
- 90’s: Data warehousing with integrated OLAP engines and tools

10 Warehouses are Very Large Databases
[Bar chart: share of respondents (0% to 35%) by warehouse size, in the buckets 5GB, 5-9GB, 10-19GB, 20-49GB, 50-99GB, 100-249GB, 250-499GB and 500GB-1TB; two series, Initial and Projected 2Q96. Source: META Group, Inc.]

11 Very Large Data Bases
- Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
- Petabytes -- 10^15 bytes: Geographic Information Systems
- Exabytes -- 10^18 bytes: National Medical Records
- Zettabytes -- 10^21 bytes: Weather images
- Yottabytes -- 10^24 bytes: Intelligence Agency Videos

12 Data Warehousing -- It is a process
- A technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible
- A decision support database maintained separately from the organization’s operational database

13 Data Warehouse
A data warehouse is a
- subject-oriented
- integrated
- time-varying
- non-volatile
collection of data that is used primarily in organizational decision making.
-- Bill Inmon, Building the Data Warehouse 1996
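The four properties in Inmon's definition can be made concrete with a small sketch. The `CustomerSnapshot` and `Warehouse` names, the fields, and the sample values below are all hypothetical and chosen only for illustration; they do not come from the slides.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a loaded row is never modified (non-volatile)
class CustomerSnapshot:
    customer_id: int  # subject-oriented: organized around the customer
    region: str       # integrated: one agreed code set, whatever the source
    balance: float
    as_of: date       # time-varying: every row carries the date it describes


class Warehouse:
    """Append-only collection of snapshots (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._rows: list[CustomerSnapshot] = []

    def load(self, row: CustomerSnapshot) -> None:
        # New facts are appended; earlier history is never overwritten.
        self._rows.append(row)

    def history(self, customer_id: int) -> list[CustomerSnapshot]:
        rows = [r for r in self._rows if r.customer_id == customer_id]
        return sorted(rows, key=lambda r: r.as_of)


wh = Warehouse()
wh.load(CustomerSnapshot(1, "WEST", 120.0, date(1996, 1, 31)))
wh.load(CustomerSnapshot(1, "WEST", 95.5, date(1996, 2, 29)))
balances = [r.balance for r in wh.history(1)]  # both month-end values are kept
```

An operational system would typically overwrite the balance in place; here both month-end values remain available for analysis over time.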

14 Explorers, Farmers and Tourists
- Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data
- Farmers: Harvest information from known access paths
- Tourists: Browse information harvested by farmers

15 Data Warehouse Architecture
[Architecture diagram. Sources: Relational Databases, Legacy Data, Purchased Data, ERP Systems. Load path: Extraction, Cleansing, Optimized Loader. Core: Data Warehouse Engine, Metadata Repository. Access: Analyze, Query.]
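The components named on this slide (extraction, cleansing, loader, warehouse engine, metadata repository, query) can be sketched end to end in a few lines. This is a minimal illustration using an in-memory SQLite database as a stand-in for the warehouse engine; the source rows, table names, and function names are assumptions made up for the example.

```python
import sqlite3

# Hypothetical source extracts; real sources would be databases, files, or ERP exports.
relational_rows = [{"cust": " Alice ", "amount": "120.50"}]
legacy_rows = [{"cust": "BOB", "amount": "80"}, {"cust": "", "amount": "15"}]


def extract():
    """Extraction: pull raw rows from every source, tagging where they came from."""
    for source, rows in (("relational", relational_rows), ("legacy", legacy_rows)):
        for row in rows:
            yield source, row


def cleanse(records):
    """Cleansing: normalize names, convert amounts, drop rows with no customer."""
    for source, row in records:
        name = row["cust"].strip().title()
        if name:
            yield name, float(row["amount"]), source


def load(conn, rows):
    """Loader: bulk-insert cleansed rows and record the load in a metadata table."""
    rows = list(rows)
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL, source TEXT)")
    conn.execute("CREATE TABLE load_metadata (table_name TEXT, row_count INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("INSERT INTO load_metadata VALUES ('sales', ?)", (len(rows),))
    conn.commit()


conn = sqlite3.connect(":memory:")  # stands in for the warehouse engine
load(conn, cleanse(extract()))

# Query: analysis runs against the loaded, cleansed data only.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
```

The row with an empty customer name is rejected during cleansing, so the metadata table records two loaded rows rather than three.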

16 Data Warehouse for Decision Support & OLAP
- Putting information technology to work to help the knowledge worker make faster and better decisions
  - Which of my customers are most likely to go to the competition?
  - What product promotions have the biggest impact on revenue?
  - How did the share price of software companies correlate with profits over the last 10 years?

17 Decision Support
- Used to manage and control business
- Data is historical or point-in-time
- Optimized for inquiry rather than update
- Use of the system is loosely defined and can be ad-hoc
- Used by managers and end-users to understand the business and make judgements

18 Data Mining works with Warehouse Data
- Data Warehousing provides the Enterprise with a memory
- Data Mining provides the Enterprise with intelligence

19 We want to know...
- Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
- Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?
- If I raise the price of my product by Rs. 2, what is the effect on my ROI?
- If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?
- If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
- Which of my customers are likely to be the most loyal?
Data Mining helps extract such information.

20 Application Areas
Industry                 Application
Finance                  Credit Card Analysis
Insurance                Claims, Fraud Analysis
Telecommunication        Call record analysis
Transport                Logistics management
Consumer goods           Promotion analysis
Data Service providers   Value added data
Utilities                Power usage analysis

21 Data Mining in Use
- The US Government uses Data Mining to track fraud
- A Supermarket becomes an information broker
- Basketball teams use it to track game strategy
- Warranty Claims Routing
- Holding on to Good Customers
- Weeding out Bad Customers

22 What makes data mining possible?
- Advances in the following areas are making data mining deployable:
  - data warehousing
  - better and more data (i.e., operational, behavioral, and demographic)
  - the emergence of easily deployed data mining tools and
  - the advent of new data mining techniques.
-- Gartner Group

23 Why Separate Data Warehouse?
- Performance
  - Operational databases are designed and tuned for known transactions and workloads.
  - Complex OLAP queries would degrade performance for operational transactions.
  - Special data organization, access and implementation methods are needed for multidimensional views and queries.
- Function
  - Missing data: Decision support requires historical data, which operational databases do not typically maintain.
  - Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.
  - Data quality: Different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.
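The data quality point can be illustrated with a short sketch of reconciliation. It assumes two hypothetical operational systems, `billing` and `support`, that encode gender and dates differently; the records, code table, and date formats are invented for the example.

```python
from datetime import datetime

# Hypothetical records from two operational systems with inconsistent encodings.
billing = [{"id": 1, "gender": "M", "joined": "1995-03-01"}]
support = [{"id": 2, "gender": "female", "joined": "01/04/1995"}]  # assumed DD/MM/YYYY

# One agreed code set for the warehouse.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def reconcile(record, date_format):
    """Map one source record onto the warehouse's agreed representation."""
    joined = datetime.strptime(record["joined"], date_format).date()
    return {
        "id": record["id"],
        "gender": GENDER_CODES[record["gender"].lower()],
        "joined": joined.isoformat(),
    }


# Consolidation: rows from both sources end up in one uniform collection.
consolidated = [reconcile(r, "%Y-%m-%d") for r in billing] + [
    reconcile(r, "%d/%m/%Y") for r in support
]
```

After reconciliation every record uses single-letter gender codes and ISO dates, so queries no longer need to know which source a row came from.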

24 What are Operational Systems?
- They are OLTP systems
- Run mission critical applications
- Need to work with stringent performance requirements for routine tasks
- Used to run a business!

25 RDBMS used for OLTP
- Database Systems have been used traditionally for OLTP
  - clerical data processing tasks
  - detailed, up to date data
  - structured repetitive tasks
  - read/update a few records
  - isolation, recovery and integrity are critical

26 Operational Systems
- Run the business in real time
- Based on up-to-the-second data
- Optimized to handle large numbers of simple read/write transactions
- Optimized for fast response to predefined transactions
- Used by people who deal with customers, products -- clerks, salespeople etc.
- They are increasingly used by customers

27 Examples of Operational Data

28 So, what’s different?

29 Application-Orientation vs. Subject-Orientation
- Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings
- Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity
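The shift from application orientation to subject orientation can be sketched as regrouping per-application records around the customer. The record layouts and the `build_customer_subject` helper are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical per-application records, as each operational system might hold them.
loans = [{"customer": "C1", "loan_balance": 5000.0}]
credit_cards = [
    {"customer": "C1", "card_balance": 320.0},
    {"customer": "C2", "card_balance": 75.0},
]
savings = [{"customer": "C2", "savings_balance": 1500.0}]


def build_customer_subject(*applications):
    """Merge application-oriented records into one entry per customer."""
    customers = defaultdict(dict)
    for records in applications:
        for record in records:
            attrs = {k: v for k, v in record.items() if k != "customer"}
            customers[record["customer"]].update(attrs)
    return dict(customers)


customer_view = build_customer_subject(loans, credit_cards, savings)
```

Each customer now has a single entry combining whatever the loan, credit card, and savings applications knew about them.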

30 OLTP vs. Data Warehouse
- OLTP systems are tuned for known transactions and workloads, while the workload is not known a priori in a data warehouse
- Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
  - e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
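The example query on this slide restricts three dimensions at once: location, time of day, and month. A minimal sketch using SQLite follows; the `calls` table, its columns, and the sample rows are assumptions invented for illustration, and a production warehouse would use purpose-built storage rather than a plain table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (city TEXT, started_at TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("Pune", "1996-12-02 10:15:00", 12.0),    # matches all three conditions
        ("Pune", "1996-12-05 16:40:00", 8.0),     # matches all three conditions
        ("Pune", "1996-12-05 20:05:00", 30.0),    # outside 9AM-5PM
        ("Pune", "1996-11-20 11:00:00", 50.0),    # wrong month
        ("Mumbai", "1996-12-03 09:30:00", 99.0),  # wrong city
    ],
)

# Average amount for calls started in Pune, in December, from 09:00 through 16:59.
average = conn.execute(
    """
    SELECT AVG(amount)
    FROM calls
    WHERE city = 'Pune'
      AND strftime('%m', started_at) = '12'
      AND strftime('%H', started_at) BETWEEN '09' AND '16'
    """
).fetchone()[0]
```

Only the first two rows satisfy every condition, so the query averages 12.0 and 8.0. Note that this query has to examine every call record, which is the kind of workload an OLTP system is not tuned for.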

31 OLTP vs Data Warehouse
OLTP                    Warehouse (DSS)
Application Oriented    Subject Oriented
Used to run business    Used to analyze business
Detailed data           Summarized and refined
Current, up to date     Snapshot data
Isolated Data           Integrated Data
Repetitive access       Ad-hoc access
Clerical User           Knowledge User (Manager)

32 OLTP vs Data Warehouse
OLTP                                     Data Warehouse
Performance Sensitive                    Performance relaxed
Few Records accessed at a time (tens)    Large volumes accessed at a time (millions)
Read/Update Access                       Mostly Read (Batch Update)
No data redundancy                       Redundancy present
Database Size 100 MB - 100 GB            Database Size 100 GB - few terabytes
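The contrast in access patterns can be shown side by side. The sketch below runs both styles against one in-memory SQLite table purely for brevity; in practice the two workloads would live in separate systems. The `accounts` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, branch TEXT, balance REAL)"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [(1, "north", 100.0), (2, "north", 250.0), (3, "south", 40.0)],
)

# OLTP-style access: touch one record inside a short transaction.
with conn:  # commits on success, rolls back if an exception is raised
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")

# Warehouse-style access: a read-only scan that summarizes every record.
by_branch = dict(
    conn.execute("SELECT branch, SUM(balance) FROM accounts GROUP BY branch")
)
```

The update reads and writes a single row by key, while the summary query reads the whole table and writes nothing.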

33 OLTP vs Data Warehouse
OLTP                                                 Data Warehouse
Transaction throughput is the performance metric     Query throughput is the performance metric
Thousands of users                                   Hundreds of users
Managed in entirety                                  Managed by subsets

34 To summarize...
- OLTP Systems are used to “run” a business
- The Data Warehouse helps to “optimize” the business

35 Why Now?
- Data is being produced
- ERP provides clean data
- The computing power is available
- The computing power is affordable
- The competitive pressures are strong
- Commercial products are available

36 Myths surrounding OLAP Servers and Data Marts
- Data marts and OLAP servers are departmental solutions supporting a handful of users
- Million dollar massively parallel hardware is needed to deliver fast response time for complex queries
- OLAP servers require massive and unwieldy indices
- Complex OLAP queries clog the network with data
- Data warehouses must be at least 100 GB to be effective
Source: Arbor Software Home Page

