1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

2 Evolution of Data Warehousing
- 60’s: Batch reports
  - hard to find and analyze information
  - inflexible and expensive; reprogram for every new request
- 70’s: Terminal-based DSS and EIS (executive information systems)
  - still inflexible, not integrated with desktop tools
- 80’s: Desktop data access and analysis tools
  - query tools, spreadsheets, GUIs
  - easier to use, but only access operational databases
- 90’s: Data warehousing with integrated OLAP engines and tools

3 What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.

4 Data Warehouse Architecture
[Architecture diagram. Sources: Relational Databases, Legacy Data, Purchased Data, ERP Systems. Load path: Extraction, Cleansing, Optimized Loader. Core: Data Warehouse Engine with Metadata Repository. Access: Analyze, Query.]

5 Components of the Warehouse
- Data Extraction and Loading
- The Warehouse
- Analyze and Query -- OLAP Tools
- Metadata
- Data Mining tools

6 Loading the Warehouse
Cleaning the data before it is loaded

7 Data Integrity Problems
- Same person, different spellings
  - Agarwal, Agrawal, Aggarwal, etc.
- Multiple ways to denote company name
  - Persistent Systems, PSPL, Persistent Pvt. Ltd.
- Use of different names
  - Mumbai, Bombay
- Different account numbers generated by different applications for the same customer
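One common way to handle the naming problems on this slide is an alias table for known variants plus approximate matching for spelling differences. The following is a minimal sketch under those assumptions; the `ALIASES` table, the function names, and the 0.8 cutoff are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical alias table: every known variant maps to one canonical value.
ALIASES = {
    "bombay": "Mumbai",
    "mumbai": "Mumbai",
    "pspl": "Persistent Systems",
    "persistent pvt. ltd.": "Persistent Systems",
    "persistent systems": "Persistent Systems",
}

def canonicalize(value: str) -> str:
    """Return the canonical form of a value, or the trimmed input if unknown."""
    cleaned = value.strip()
    return ALIASES.get(cleaned.lower(), cleaned)

def similar(a: str, b: str, cutoff: float = 0.8) -> bool:
    """Flag two spellings as a likely match; the cutoff is an assumed value."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff

print(canonicalize("bombay"), canonicalize("PSPL"))
print(similar("Agarwal", "Aggarwal"), similar("Agarwal", "Kumar"))
```

Approximate matches like these are candidates for review, not proof that two records describe the same person.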

8 Data Transformation Terms
- Extracting
- Conditioning
- Householding
- Enrichment
- Scoring

9 Extracting and Conditioning
- Extracting
  - Capture of data from the operational source in “as is” status
  - Sources for data are generally legacy mainframes (VSAM, IMS, IDMS, DB2); more data today is in relational databases on Unix
- Conditioning
  - The conversion of data types from the source to the target data store (the warehouse), which is typically a relational database
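Conditioning can be pictured as a per-column type conversion applied to each extracted record. This is a minimal sketch, assuming the source delivers every field as text; the column names, the date format, and the `TARGET_TYPES` mapping are hypothetical.

```python
from datetime import date, datetime
from decimal import Decimal

# Hypothetical target schema: column name -> converter for the warehouse type.
TARGET_TYPES = {
    "custid": int,
    "dob": lambda s: datetime.strptime(s, "%d/%m/%Y").date(),
    "balance": Decimal,
}

def condition(record: dict) -> dict:
    """Convert each raw string field to the type the warehouse column expects."""
    return {col: TARGET_TYPES[col](raw.strip()) for col, raw in record.items()}

# A record captured "as is" from an assumed legacy source.
raw = {"custid": "00042", "dob": "07/03/1965", "balance": " 1250.50"}
row = condition(raw)
print(row)
```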

10 Householding and Enrichment
- Householding
  - Identifying all members of a household (living at the same address)
  - Ensures only one mail is sent to a household
  - Can result in substantial savings: 1 lakh catalogues at Rs. 50 each cost Rs. 50 lakhs. A 2% saving would be Rs. 1 lakh.
- Enrichment
  - Bring data from external sources to augment/enrich operational data. Data sources include Dun & Bradstreet, A. C. Nielsen, CMIE, IMRA, etc.
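Householding can be sketched as grouping customers by a normalized address, and the savings figure on this slide can be checked with a few lines of arithmetic. The normalization rule and the sample customers below are assumptions for illustration; real address matching is considerably harder.

```python
from collections import defaultdict

def households(customers):
    """Group customer names by a crudely normalized address."""
    groups = defaultdict(list)
    for name, address in customers:
        key = " ".join(address.lower().replace(",", " ").split())
        groups[key].append(name)
    return dict(groups)

# Hypothetical customers; the first two share an address written two ways.
customers = [
    ("A. Rao", "12 MG Road, Pune"),
    ("S. Rao", "12 mg road,  pune"),
    ("K. Shah", "4 Hill View, Mumbai"),
]
groups = households(customers)
print(len(groups), "mailings instead of", len(customers))

# The slide's arithmetic, with 1 lakh = 100,000.
catalogues, unit_cost = 100_000, 50
total_cost = catalogues * unit_cost   # Rs. 50 lakhs
savings = total_cost * 2 // 100       # 2% of the total, i.e. Rs. 1 lakh
print(total_cost, savings)
```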

11 Scoring
- Computation of the probability of an event, e.g., the chance that a customer will defect to AT&T from MCI, or the chance that a customer will buy a new product
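The slide does not say how a score is computed. One common choice is a logistic model, sketched below; the feature names, weights, and bias are made up for illustration and would normally be fitted to historical data.

```python
import math

# Hypothetical weights; in practice they would be fitted to historical data.
WEIGHTS = {"complaints_last_year": 0.9, "years_as_customer": -0.4}
BIAS = -1.0

def defection_score(customer: dict) -> float:
    """Return a probability in (0, 1) using the logistic function."""
    z = BIAS + sum(w * customer[f] for f, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

loyal = {"complaints_last_year": 0, "years_as_customer": 8}
unhappy = {"complaints_last_year": 4, "years_as_customer": 1}
print(round(defection_score(loyal), 3), round(defection_score(unhappy), 3))
```

With these assumed weights, more complaints raise the score and a longer tenure lowers it.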

12 Data Warehouse Structure
- Subject orientation: customer, product, policy, account, etc.
- A subject may be implemented as a set of related tables; e.g., customer may be five tables.

13 Data Warehouse Structure
- base customer (1985-87)
  - custid, from date, to date, name, phone, dob
- base customer (1988-90)
  - custid, from date, to date, name, credit rating, employer
- customer activity (1986-89) -- monthly summary
- customer activity detail (1987-89)
  - custid, activity date, amount, clerk id, order no
- customer activity detail (1990-91)
  - custid, activity date, amount, line item no, order no

14 Schema Design
- Database organization
  - must look like the business
  - must be recognizable by the business user
  - must be approachable by the business user
  - must be simple
- Schema types
  - Star schema
  - Fact constellation schema
  - Snowflake schema

15 Dimension Tables
- Define business in terms already familiar to users
- Wide rows with lots of descriptive text
- Small tables (about a million rows)
- Joined to the fact table by a foreign key
- Heavily indexed
- Typical dimensions: time periods, geographic region (markets, cities), products, customers, salesperson, etc.

16 Fact Table
- Central table
- Mostly raw numeric items
- Narrow rows, a few columns at most
- Large number of rows (millions to a billion)
- Access via dimensions

17 Star Schema
- A single fact table and, for each dimension, one dimension table
- Does not capture hierarchies directly
[Figure: a central fact table (date, custno, prodno, cityname, ...) linked to the dimension tables time, prod, cust and city]
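The star layout on this slide can be sketched with Python's built-in `sqlite3` module. The table names, the extra columns (such as `region` and `amount`), and the sample rows are assumptions; only the fact-table keys come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (date TEXT PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE prod_dim (prodno INTEGER PRIMARY KEY, prodname TEXT, category TEXT);
CREATE TABLE cust_dim (custno INTEGER PRIMARY KEY, custname TEXT);
-- Star schema: the city-to-region hierarchy is flattened into one table.
CREATE TABLE city_dim (cityname TEXT PRIMARY KEY, state TEXT, region TEXT);
CREATE TABLE sales_fact (
    date TEXT REFERENCES time_dim(date),
    custno INTEGER REFERENCES cust_dim(custno),
    prodno INTEGER REFERENCES prod_dim(prodno),
    cityname TEXT REFERENCES city_dim(cityname),
    amount REAL
);
INSERT INTO city_dim VALUES
    ('Pune', 'Maharashtra', 'West'), ('Chennai', 'Tamil Nadu', 'South');
INSERT INTO sales_fact VALUES
    ('1998-01-05', 1, 10, 'Pune', 200.0),
    ('1998-01-06', 2, 10, 'Pune', 300.0),
    ('1998-01-06', 1, 11, 'Chennai', 150.0);
""")

# Access via a dimension: total sales per region.
rows = conn.execute("""
    SELECT c.region, SUM(f.amount)
    FROM sales_fact AS f JOIN city_dim AS c ON f.cityname = c.cityname
    GROUP BY c.region ORDER BY c.region
""").fetchall()
print(rows)
```

Because `region` is stored as a plain column of `city_dim`, the hierarchy exists only implicitly, which is the sense in which a star schema does not capture hierarchies directly.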

18 Snowflake Schema
- Represents the dimensional hierarchy directly by normalizing tables
- Easy to maintain and saves storage
[Figure: a central fact table (date, custno, prodno, cityname, ...) linked to the dimension tables time, prod, cust and city, with city further linked to region]
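A snowflake version of the same idea moves the region out of the city table into its own table, so each region name is stored once. This is a minimal `sqlite3` sketch; the table and column names and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflake schema: the city dimension is normalized into city and region.
CREATE TABLE region_dim (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE city_dim (
    cityname TEXT PRIMARY KEY,
    region_id INTEGER REFERENCES region_dim(region_id)
);
CREATE TABLE sales_fact (
    date TEXT, custno INTEGER, prodno INTEGER,
    cityname TEXT REFERENCES city_dim(cityname),
    amount REAL
);
INSERT INTO region_dim VALUES (1, 'West'), (2, 'South');
INSERT INTO city_dim VALUES ('Pune', 1), ('Mumbai', 1), ('Chennai', 2);
INSERT INTO sales_fact VALUES
    ('1998-01-05', 1, 10, 'Pune', 200.0),
    ('1998-01-06', 2, 10, 'Mumbai', 300.0),
    ('1998-01-06', 1, 11, 'Chennai', 150.0);
""")

# Rolling up to region now takes one extra join through the hierarchy.
rows = conn.execute("""
    SELECT r.region_name, SUM(f.amount)
    FROM sales_fact AS f
    JOIN city_dim AS c ON f.cityname = c.cityname
    JOIN region_dim AS r ON c.region_id = r.region_id
    GROUP BY r.region_name ORDER BY r.region_name
""").fetchall()
print(rows)
```

The trade-off is visible in the query: renaming a region touches a single row, but every region-level report pays for an additional join.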

19 Fact Constellation
- Multiple fact tables that share many dimension tables
- Booking and Checkout may share many dimension tables in the hotel industry
[Figure: fact tables Booking and Checkout sharing the dimension tables Hotels, Travel Agents, Promotion, Room Type and Customer]
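The hotel example can be sketched as two fact tables that reference the same dimension tables. Only two of the shared dimensions are shown, and the columns and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared dimension tables.
CREATE TABLE hotels (hotel_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE room_type (room_type_id INTEGER PRIMARY KEY, label TEXT);
-- Two fact tables referencing the same dimensions.
CREATE TABLE booking (
    hotel_id INTEGER REFERENCES hotels(hotel_id),
    room_type_id INTEGER REFERENCES room_type(room_type_id),
    nights INTEGER
);
CREATE TABLE checkout (
    hotel_id INTEGER REFERENCES hotels(hotel_id),
    room_type_id INTEGER REFERENCES room_type(room_type_id),
    amount REAL
);
INSERT INTO hotels VALUES (1, 'Sea View'), (2, 'Hill Top');
INSERT INTO room_type VALUES (1, 'Single'), (2, 'Double');
INSERT INTO booking VALUES (1, 1, 2), (1, 2, 3), (2, 1, 1);
INSERT INTO checkout VALUES (1, 1, 4000.0), (2, 1, 1500.0);
""")

# Aggregate each fact table separately so rows are not multiplied by a join.
rows = conn.execute("""
    SELECT h.name,
           (SELECT COALESCE(SUM(b.nights), 0) FROM booking AS b
             WHERE b.hotel_id = h.hotel_id),
           (SELECT COALESCE(SUM(c.amount), 0) FROM checkout AS c
             WHERE c.hotel_id = h.hotel_id)
    FROM hotels AS h ORDER BY h.name
""").fetchall()
print(rows)
```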

20 On-Line Analytical Processing (OLAP)
Making Decision Support Possible

21 What Is OLAP?
- Online Analytical Processing: term coined by E. F. Codd in a 1993 paper commissioned by Arbor Software
- Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
- OLAP = Multidimensional Database
- MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
- ROLAP: Relational OLAP (Informix MetaCube, MicroStrategy DSS Agent)
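The multidimensional view behind OLAP can be illustrated with a roll-up: summing a measure over the dimensions that are not of interest. This is a minimal pure-Python sketch, not how any of the products named above work internally; the dimensions and fact rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales).
facts = [
    (1997, "West", "soap", 100),
    (1997, "South", "soap", 80),
    (1998, "West", "soap", 120),
    (1998, "West", "tea", 60),
]
DIMS = ("year", "region", "product")

def roll_up(rows, keep):
    """Sum the measure over every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

by_year_region = roll_up(facts, ("year", "region"))
by_year = roll_up(facts, ("year",))
grand_total = roll_up(facts, ())
print(by_year_region)
print(by_year)
print(grand_total)
```

Each call corresponds to one level of aggregation in the cube, from the most detailed grouping down to a single grand total.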

22 Data Warehouse for Decision Support & OLAP
- Putting information technology to work to help the knowledge worker make faster and better decisions
  - Which of my customers are most likely to go to the competition?
  - What product promotions have the biggest impact on revenue?
  - How did the share price of software companies correlate with profits over the last 10 years?

23 Decision Support
- Used to manage and control business
- Data is historical or point-in-time
- Optimized for inquiry rather than update
- Use of the system is loosely defined and can be ad hoc
- Used by managers and end users to understand the business and make judgements

24 Data Mining
- The nontrivial extraction of implicit, previously unknown, and potentially useful information from data
- In simple words: searching for new knowledge

25 Data Mining Goals
- Prediction
- Identification
- Classification
- Optimization

26 Data Mining Techniques
- Clustering
- Data summarization
- Learning classification rules
- Finding dependency networks
- Analyzing changes
- Detecting anomalies
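As one concrete instance of the techniques listed above, anomaly detection can be sketched with a simple z-score rule: flag values that lie far from the mean in units of standard deviation. The threshold of 2.0 and the sample data are assumptions; the slides do not prescribe a method.

```python
from statistics import mean, pstdev

def anomalies(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily claim counts with one unusual day.
daily_claims = [12, 15, 11, 14, 13, 12, 95, 14, 13, 12]
print(anomalies(daily_claims))
```

A rule this simple is sensitive to the outliers it is trying to find, since they inflate both the mean and the standard deviation, so it serves only as a starting point.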

27 Data Mining in Use
- The US Government uses data mining to track fraud
- A supermarket becomes an information broker
- Basketball teams use it to track game strategy
- Cross selling
- Warranty claims routing
- Holding on to good customers
- Weeding out bad customers

28 Data Warehouse Pitfalls
- You are going to spend much time extracting, cleaning, and loading data
- Despite best efforts at project management, data warehousing project scope will increase
- You are going to find problems with systems feeding the data warehouse
- You will find the need to store data not being captured by any existing system
- You will need to validate data not being validated by transaction processing systems

29 Data Warehouse Pitfalls
- Some transaction processing systems feeding the warehousing system will not contain detail
- Many warehouse end users will be trained and never or seldom apply their training
- After end users receive query and report tools, requests for IS-written reports may increase
- Your warehouse users will develop conflicting business rules
- Large-scale data warehousing can become an exercise in data homogenizing

30 Applications
- Medicine: drug side effects, hospital cost analysis, genetic sequence analysis, prediction, etc.
- Finance: stock market prediction, credit assessment, fraud detection, etc.
- Marketing/sales: product analysis, buying patterns, sales prediction, target mailing, identifying "unusual behavior", etc.
- Knowledge acquisition
- Scientific discovery: superconductivity research, etc.
- Engineering: automotive diagnostic expert systems, fault detection, etc.

31 Thanking you
Nilesh and Siva Krishna…

