Download presentation
Presentation is loading. Please wait.
Published bySamantha Daniel Modified over 8 years ago
1
1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar
2
2 Evolution of Data Warehousing 60’s: Batch reports hard to find and analyze information inflexible and expensive, reprogram every new request 70’s: Terminal-based DSS and EIS (executive information systems still inflexible, not integrated with desktop tool 80’s: Desktop data access and analysis tools query tools, spreadsheets, GUI’s easier to use, but only access operational databases 90’s: Data warehousing with integrated OLAP engines and tools
3
3 What is a Data Warehouse? A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context. A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context.
4
4 Data Warehouse Architecture Data Warehouse Engine Optimized Loader Extraction Cleansing Analyze Query Metadata Repository Relational Databases Legacy Data Purchased Data ERP Systems
5
5 Components of the Warehouse Data Extraction and Loading Data Extraction and Loading The Warehouse The Warehouse Analyze and Query -- OLAP Tools Analyze and Query -- OLAP Tools Metadata Metadata Data Mining tools Data Mining tools
6
6 Loading the Warehouse Cleaning the data before it is loaded
7
7 Data Integrity Problems Same person, different spellings Same person, different spellings Agarwal, Agrawal, Aggarwal etc... Agarwal, Agrawal, Aggarwal etc... Multiple ways to denote company name Multiple ways to denote company name Persistent Systems, PSPL, Persistent Pvt. LTD. Persistent Systems, PSPL, Persistent Pvt. LTD. Use of different names Use of different names mumbai, bombay mumbai, bombay Different account numbers generated by different applications for the same customer Different account numbers generated by different applications for the same customer
8
8 Data Transformation Terms Extracting Extracting Conditioning Conditioning Householding Householding Enrichment Enrichment Scoring Scoring
9
9 Extracting Extracting Capture of data from operational source in “as is” status Capture of data from operational source in “as is” status Sources for data generally in legacy mainframes in VSAM, IMS, IDMS, DB2; more data today in relational databases on Unix Sources for data generally in legacy mainframes in VSAM, IMS, IDMS, DB2; more data today in relational databases on Unix Conditioning Conditioning The conversion of data types from the source to the target data store (warehouse) -- always a relational database The conversion of data types from the source to the target data store (warehouse) -- always a relational database
10
10 Householding Householding Identifying all members of a household (living at the same address) Identifying all members of a household (living at the same address) Ensures only one mail is sent to a household Ensures only one mail is sent to a household Can result in substantial savings: 1 lakh catalogues at Rs. 50 each costs Rs. 50 lakhs. A 2% savings would save Rs. 1 lakh. Can result in substantial savings: 1 lakh catalogues at Rs. 50 each costs Rs. 50 lakhs. A 2% savings would save Rs. 1 lakh.Enrichment Bring data from external sources to augment/enrich operational data. Data sources include Dunn and Bradstreet, A. C. Nielsen, CMIE, IMRA etc... Bring data from external sources to augment/enrich operational data. Data sources include Dunn and Bradstreet, A. C. Nielsen, CMIE, IMRA etc...
11
11 Scoring computation of a probability of an event. e.g...., chance that a customer will defect to AT&T from MCI, chance that a customer is likely to buy a new product
12
12 Data Warehouse Structure Subject Orientation -- customer, product, policy, account etc... A subject may be implemented as a set of related tables. E.g., customer may be five tables Subject Orientation -- customer, product, policy, account etc... A subject may be implemented as a set of related tables. E.g., customer may be five tables
13
13 Data Warehouse Structure base customer (1985-87) base customer (1985-87) custid, from date, to date, name, phone, dob custid, from date, to date, name, phone, dob base customer (1988-90) base customer (1988-90) custid, from date, to date, name, credit rating, employer custid, from date, to date, name, credit rating, employer customer activity (1986-89) -- monthly summary customer activity (1986-89) -- monthly summary customer activity detail (1987-89) customer activity detail (1987-89) custid, activity date, amount, clerk id, order no custid, activity date, amount, clerk id, order no customer activity detail (1990-91) customer activity detail (1990-91) custid, activity date, amount, line item no, order no custid, activity date, amount, line item no, order no
14
14 Schema Design Database organization Database organization must look like business must look like business must be recognizable by business user must be recognizable by business user approachable by business user approachable by business user Must be simple Must be simple Schema Types Schema Types Star Schema Star Schema Fact Constellation Schema Fact Constellation Schema Snowflake schema Snowflake schema
15
15 Dimension Tables Dimension tables Dimension tables Define business in terms already familiar to users Define business in terms already familiar to users Wide rows with lots of descriptive text Wide rows with lots of descriptive text Small tables (about a million rows) Small tables (about a million rows) Joined to fact table by a foreign key Joined to fact table by a foreign key heavily indexed heavily indexed typical dimensions typical dimensions time periods, geographic region (markets, cities), products, customers, salesperson, etc. time periods, geographic region (markets, cities), products, customers, salesperson, etc.
16
16 Fact Table Central table Central table mostly raw numeric items mostly raw numeric items narrow rows, a few columns at most narrow rows, a few columns at most large number of rows (millions to a billion) large number of rows (millions to a billion) Access via dimensions Access via dimensions
17
17 Star Schema A single fact table and for each dimension one dimension table A single fact table and for each dimension one dimension table Does not capture hierarchies directly Does not capture hierarchies directly T i m e prodprod custcust citycity factfact date, custno, prodno, cityname,...
18
18 Snowflake schema Represent dimensional hierarchy directly by normalizing tables. Represent dimensional hierarchy directly by normalizing tables. Easy to maintain and saves storage Easy to maintain and saves storage T i m e prodprod custcust citycity factfact date, custno, prodno, cityname,... regionregion
19
19 Fact Constellation Fact Constellation Fact Constellation Multiple fact tables that share many dimension tables Multiple fact tables that share many dimension tables Booking and Checkout may share many dimension tables in the hotel industry Booking and Checkout may share many dimension tables in the hotel industry Hotels Travel Agents Promotion Room Type Customer Booking Checkout
20
20 On-Line Analytical Processing (OLAP) Making Decision Support Possible
21
21 What Is OLAP? Online Analytical Processing - coined by EF Codd in 1994 paper contracted by Arbor Software* Online Analytical Processing - coined by EF Codd in 1994 paper contracted by Arbor Software* Generally synonymous with earlier terms such as Decisions Support, Business Intelligence, Executive Information System Generally synonymous with earlier terms such as Decisions Support, Business Intelligence, Executive Information System OLAP = Multidimensional Database OLAP = Multidimensional Database MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express) MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express) ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent) ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)
22
22 Data Warehouse for Decision Support & OLAP Putting Information technology to help the knowledge worker make faster and better decisions Putting Information technology to help the knowledge worker make faster and better decisions Which of my customers are most likely to go to the competition? Which of my customers are most likely to go to the competition? What product promotions have the biggest impact on revenue? What product promotions have the biggest impact on revenue? How did the share price of software companies correlate with profits over last 10 years? How did the share price of software companies correlate with profits over last 10 years?
23
23 Decision Support Used to manage and control business Used to manage and control business Data is historical or point-in-time Data is historical or point-in-time Optimized for inquiry rather than update Optimized for inquiry rather than update Use of the system is loosely defined and can be ad-hoc Use of the system is loosely defined and can be ad-hoc Used by managers and end-users to understand the business and make judgements Used by managers and end-users to understand the business and make judgements
24
24 Data Mining The non trivial extraction of implicit, previously unknown, and potentially useful information from data The non trivial extraction of implicit, previously unknown, and potentially useful information from data In simple words In simple words Searching for new knowledge Searching for new knowledge
25
25 Data Mining Goals Prediction Prediction Identification Identification Classification Classification Optimization Optimization
26
26 Data mining techniques clustering clustering data summarization data summarization learning classification rules learning classification rules finding dependency net works finding dependency net works analyzing changes analyzing changes detecting anomalies detecting anomalies
27
27 Data Mining in Use The US Government uses Data Mining to track fraud The US Government uses Data Mining to track fraud A Supermarket becomes an information broker A Supermarket becomes an information broker Basketball teams use it to track game strategy Basketball teams use it to track game strategy Cross Selling Cross Selling Warranty Claims Routing Warranty Claims Routing Holding on to Good Customers Holding on to Good Customers Weeding out Bad Customers Weeding out Bad Customers
28
28 Data Warehouse Pitfalls You are going to spend much time extracting, cleaning, and loading data You are going to spend much time extracting, cleaning, and loading data Despite best efforts at project management, data warehousing project scope will increase Despite best efforts at project management, data warehousing project scope will increase You are going to find problems with systems feeding the data warehouse You are going to find problems with systems feeding the data warehouse You will find the need to store data not being captured by any existing system You will find the need to store data not being captured by any existing system You will need to validate data not being validated by transaction processing systems You will need to validate data not being validated by transaction processing systems
29
29 Data Warehouse Pitfalls Some transaction processing systems feeding the warehousing system will not contain detail Some transaction processing systems feeding the warehousing system will not contain detail Many warehouse end users will be trained and never or seldom apply their training Many warehouse end users will be trained and never or seldom apply their training After end users receive query and report tools, requests for IS written reports may increase After end users receive query and report tools, requests for IS written reports may increase Your warehouse users will develop conflicting business rules Your warehouse users will develop conflicting business rules Large scale data warehousing can become an exercise in data homogenizing Large scale data warehousing can become an exercise in data homogenizing
30
30 Applications Medicine - drug side effects, hospital cost analysis, genetic sequence analysis, prediction etc. Finance - stock market prediction, credit assessment, fraud detection etc. Marketing/sales - product analysis, buying patterns, sales prediction, target mailing, identifying `unusual behavior' etc. Knowledge Acquisition Scientific Discovery - superconductivity research, etc. Engineering - automotive diagnostic expert systems, fault detection etc.
31
31 Thanking you Nilesh and Siva Krishna…
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.