An Introduction
Student Name: Riaz Ahmad
Program: MSIT (2005-2007)
Subject: Data Warehouse & Data Mining

Terms Concerned with My Research
KDD process
Data warehouse
Materialized or indexed view
Data mining
Data mining techniques
Classification
Decision tree
ID3 algorithm
Objective
Conclusion

KDD Process (flow)
Databases → Data Cleaning and Data Integration → Data Warehouse → Selection of Task-relevant Data → Data Mining → Pattern Evaluation

Steps of a KDD Process
1. Learning the application domain: relevant prior knowledge and the goals of the application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing (may take 60% of the effort!)
4. Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
5. Choosing the functions of data mining: summarization, classification, regression, association, clustering
6. Choosing the mining algorithm(s)
7. Data mining: search for patterns of interest
8. Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
9. Use of discovered knowledge

Data Warehouse
A data warehouse is a subject-oriented, integrated (consolidated), non-volatile (read-only), time-variant collection of data designed to support management DSS needs.
Any read-only collection of accumulated historical data is called a data warehouse.
A data warehouse is a database specifically structured for query and analysis.
A data warehouse typically contains data representing the business history of an organization.

Materialized or Indexed View
A materialized view is a special type of summary table that is constructed by aggregating one or more columns of data from a single table, or from a series of tables that are joined together.
Materialized views can dramatically improve query performance and significantly decrease the load on the system.
They are needed more in a data warehouse environment than in an OLTP environment: you need them for huge databases, not for a table that has 3 records.
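As a concrete illustration (an addition of this rewrite, not from the original slides), the sketch below simulates a materialized view in Python with SQLite. SQLite has no native materialized views, so an ordinary summary table plays that role, and the sales table with its region/product/amount columns is hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical detail table (stands in for a warehouse fact table).
    CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('north', 'pc', 1200.0), ('north', 'pc', 900.0),
        ('south', 'laptop', 1500.0), ('south', 'pc', 700.0);

    -- "Materialized view": the aggregates are computed once and stored,
    -- so queries read a few summary rows instead of scanning the detail.
    CREATE TABLE sales_summary AS
        SELECT region, product, SUM(amount) AS total, COUNT(*) AS n
        FROM sales GROUP BY region, product;
""")

# Queries now hit the pre-aggregated table.
for row in conn.execute("SELECT * FROM sales_summary"):
    print(row)

In a real warehouse the summary is refreshed by the database itself (on commit, on demand, or on a schedule) rather than rebuilt by hand.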

Data Mining
Data mining (DM) is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semi-automatic. The patterns discovered must be meaningful in that they lead to some advantage.
Simply defined, data mining refers to extracting or mining "knowledge" from large amounts of data.
There are many other terms, such as knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.
Data mining is concerned with finding hidden relationships present in business data to allow businesses to make predictions for future use.

Data Mining Tasks or Techniques
Classification
Regression
Segmentation
Association
Forecasting
Text Analysis
Advanced Data Exploration
Among these data mining techniques or tasks, we select only classification for this research work.

Classification
Classification is a form of data analysis which can be used to extract models describing important data classes or to predict future data trends.
Given a number of pre-specified classes, examine a new object, record, or individual and assign it, based on a model, to one of these classes.
Examples:
Which credit applicants are low, medium, or high risk?
Which hotel customers are likely or unlikely to return?
Which residents are likely or unlikely to vote?

Classification Techniques
Decision tree based methods
Rule-based methods
Memory based reasoning
Neural networks
Genetic algorithms
Bayesian networks
Among these classification techniques, we select only the decision tree.

Decision Tree
A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes. The topmost node in a tree is the root node.
A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision. Decision trees are commonly used for gaining information for the purpose of decision-making. A decision tree starts with a root node on which users take action. From this node, users split each node recursively according to the decision tree learning algorithm. The final result is a decision tree in which each branch represents a possible scenario of a decision and its outcome.

Decision Tree

Generating Classification Rules from a Decision Tree
IF age = "<30" AND student = no THEN buys computer = no
IF age = "<30" AND student = yes THEN buys computer = yes
IF age = "30-40" THEN buys computer = yes
IF age = ">40" AND credit rating = excellent THEN buys computer = yes
IF age = ">40" AND credit rating = fair THEN buys computer = no
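To show how such rules are evaluated mechanically, here is a minimal Python sketch; the nested-dict tree encoding and the classify helper are illustrative assumptions of this rewrite, not a representation the slides prescribe.

# The buys-computer tree behind the rules above, as nested dicts
# (an illustrative encoding, not from the slides).
tree = {
    "attribute": "age",
    "branches": {
        "<30":   {"attribute": "student",
                  "branches": {"no": "no", "yes": "yes"}},
        "30-40": "yes",
        ">40":   {"attribute": "credit rating",
                  "branches": {"excellent": "yes", "fair": "no"}},
    },
}

def classify(node, record):
    # Follow one branch per internal node until a leaf (a class label).
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node

print(classify(tree, {"age": "<30", "student": "yes"}))  # -> yes

Each IF/THEN rule corresponds to one root-to-leaf path through this structure.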

ID3 Algorithm
Originator of the ID3 algorithm: ID3 and its successors were developed by Ross Quinlan, who introduced the algorithm in the 1970s.
Implementation of the ID3 algorithm:
ID3(Learning Set S, Attribute Set A, Attribute Values V) returns a Decision Tree.
Begin
Load the learning set first and create the decision tree root node 'rootNode'; add learning set S into the root node as its subset.
For the root node, compute Entropy(rootNode.subset) first.
If Entropy(rootNode.subset) == 0, then rootNode.subset consists of records all with the same value for the categorical attribute; return a leaf node with decision attribute: attribute value.
If Entropy(rootNode.subset) != 0, then compute the information gain for each attribute that has not yet been used in splitting, find the attribute A with Maximum(Gain(S, A)), create child nodes of this root node, and add them to the root node in the decision tree.
For each child of the root node, apply ID3(S, A, V) recursively until reaching a node with entropy = 0 or a leaf node.
End ID3.
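To make the pseudocode concrete, here is a compact runnable Python sketch. It follows the steps above (entropy test for a pure node, maximum-gain split, recursion), but the records-as-dicts representation and the function names are illustrative choices of this rewrite.

import math
from collections import Counter

def entropy(records, target):
    # Entropy of the class label over a set of records.
    counts = Counter(r[target] for r in records)
    total = len(records)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def gain(records, attr, target):
    # Information gain of splitting the records on attribute attr.
    total = len(records)
    remainder = 0.0
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(records, target) - remainder

def id3(records, attributes, target):
    # Return a decision tree as nested dicts; leaves are class labels.
    if entropy(records, target) == 0:   # pure node -> leaf
        return records[0][target]
    if not attributes:                  # nothing left to split on -> majority
        return Counter(r[target] for r in records).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(records, a, target))
    rest = [a for a in attributes if a != best]
    return {"attribute": best,
            "branches": {value: id3([r for r in records if r[best] == value],
                                    rest, target)
                         for value in {r[best] for r in records}}}

The returned tree has the same nested-dict shape as the classification sketch shown earlier, so the two examples compose.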

Mathematical Formulae
The following mathematical formulae are used for the calculation of Entropy and Gain.
Entropy equation: $Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$, where $p_i$ is the relative frequency of class $i$ in dataset $S$.
Information gain equation: $Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$, where $S_v$ is the subset of $S$ for which attribute $A$ has value $v$.

Classification Experiments
First we take a training dataset (S) for classification purposes: 14 records with a two-valued class attribute (9 Yes and 5 No).
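The table itself did not survive in this transcript. The counts used in the following steps (9 Yes / 5 No; outlook subsets of 5, 4 and 5 records) match the standard Quinlan weather dataset, reproduced below as Python records so it can be fed to the id3 sketch above. Note the assumption: the non-outlook attribute values come from the published dataset, not from these slides.

from collections import Counter

# The standard Quinlan weather/"play" dataset (14 records, 9 Yes / 5 No).
S = [
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": "false", "play": "no"},
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": "true",  "play": "no"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "high",   "windy": "false", "play": "yes"},
    {"outlook": "rain",     "temperature": "mild", "humidity": "high",   "windy": "false", "play": "yes"},
    {"outlook": "rain",     "temperature": "cool", "humidity": "normal", "windy": "false", "play": "yes"},
    {"outlook": "rain",     "temperature": "cool", "humidity": "normal", "windy": "true",  "play": "no"},
    {"outlook": "overcast", "temperature": "cool", "humidity": "normal", "windy": "true",  "play": "yes"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "high",   "windy": "false", "play": "no"},
    {"outlook": "sunny",    "temperature": "cool", "humidity": "normal", "windy": "false", "play": "yes"},
    {"outlook": "rain",     "temperature": "mild", "humidity": "normal", "windy": "false", "play": "yes"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "normal", "windy": "true",  "play": "yes"},
    {"outlook": "overcast", "temperature": "mild", "humidity": "high",   "windy": "true",  "play": "yes"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "normal", "windy": "false", "play": "yes"},
    {"outlook": "rain",     "temperature": "mild", "humidity": "high",   "windy": "true",  "play": "no"},
]

print(Counter(r["play"] for r in S))  # -> Counter({'yes': 9, 'no': 5})
# Feeding it to the earlier sketch:
#   id3(S, ["outlook", "temperature", "humidity", "windy"], "play")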

Step (1): Entropy of the Original Dataset
The entropy calculation process for the dataset is shown below. First count the records whose class value is No: there are five (5), while there are nine (9) records with the Yes class value. The total number of records is fourteen (14).
Relative frequency of the No class: 5/14.
Relative frequency of the Yes class: 9/14.
The entropy of dataset S is calculated with the entropy formula above:
Entropy(5, 9) = -5/14 log2(5/14) - 9/14 log2(9/14) = 0.940
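A quick numeric check of this value (a minimal sketch; the two-argument Entropy(count1, count2) convention follows the slide):

import math

def entropy2(a, b):
    # Two-class entropy from the counts of each class.
    total = a + b
    return -sum(c / total * math.log2(c / total) for c in (a, b) if c)

print(f"{entropy2(5, 9):.3f}")  # -> 0.940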

Step (2): Calculate the Gain of Each Input Attribute in the Dataset
The following information is required for the gain calculation of the outlook attribute. For the calculation of an attribute's gain, first check the number of values of this attribute, and then classify the S dataset on the basis of each value. The outlook attribute has three values (rain, overcast and sunny), so three subsets of the S dataset are possible on the basis of the outlook attribute values:
I. The first subset (S1) contains five (5) records, on the basis of the rain value of the outlook attribute.
II. The second subset (S2) contains four (4) records, on the basis of the overcast value of the outlook attribute.
III. The third subset (S3) contains five (5) records, on the basis of the sunny value.
The proportionality measure for S1 is 5/14.
The proportionality measure for S2 is 4/14.
The proportionality measure for S3 is 5/14.

Step (2): Calculate the Gain of Each Input Attribute in the Dataset
The following steps are required for the calculation of the gain value of each attribute in the original dataset (S):
1. Calculate the entropy of each subset (S1, S2, S3)
2. Calculate the attribute entropy (outlook)
3. Calculate the gain of the attribute (outlook)

Calculate the entropy of each subset (S1, S2, S3)
In the first subset S1 there are three (3) Yes class and two (2) No class records; five (5) records in total.
Entropy(3, 2) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.971
In the second subset S2 there are four (4) Yes class records and none of the No class; four (4) records in total.
Entropy(4, 0) = -4/4 log2(4/4) = 0
In the third subset S3 there are three (3) No class and two (2) Yes class records; five (5) records in total.
Entropy(3, 2) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.971

Calculate the entropy of the outlook attribute
The following formula is used for the calculation:
Entropy(S1, S2, S3) = |S1|/|S| * Entropy(S1) + |S2|/|S| * Entropy(S2) + |S3|/|S| * Entropy(S3)
Entropy(5, 4, 5), i.e. Entropy(outlook) = 5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971 = 0.694

Calculate the gain value of the outlook attribute
The following formula is used for the calculation:
Gain(S, A) = Entropy(S) - Entropy(outlook)
Gain(S, outlook) = 0.940 - 0.694 = 0.246
The above three steps are repeated for the three remaining input attributes. The following tables contain the gains of the attributes for the original set, the rain subset, and the sunny subset.
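The same arithmetic, checked in code (a self-contained sketch; the (Yes, No) counts are the ones given in the slides):

import math

def entropy(yes, no):
    # Two-class entropy from Yes/No counts; a zero count contributes 0.
    total = yes + no
    return -sum(c / total * math.log2(c / total) for c in (yes, no) if c)

# (Yes, No) counts of the three outlook subsets.
subsets = {"rain": (3, 2), "overcast": (4, 0), "sunny": (2, 3)}

weighted = sum((y + n) / 14 * entropy(y, n) for y, n in subsets.values())
gain_outlook = entropy(9, 5) - weighted
print(f"{weighted:.3f} {gain_outlook:.3f}")  # -> 0.694 0.247

Computed without intermediate rounding the gain is 0.2467; the slides round the intermediate values first, giving 0.940 - 0.694 = 0.246.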

Attributes Along with Gain Information (Original set, rain and sunny subsets)
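The gain tables themselves did not survive in this transcript. For reference, computing gain() over the reconstructed dataset above gives, for the original set: outlook 0.247 (0.246 with the slides' rounding), humidity 0.152, windy 0.048, temperature 0.029; these values hold only under the dataset reconstruction assumed earlier.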

Step (3): Select the Maximum-Gain Attribute for the Classification of Dataset (S)
In the above table, the attribute with the maximum gain value is outlook. The gain value of outlook is 0.246, which is the highest value. After this process we can split the dataset into three different subsets on the basis of the outlook attribute values, which are rain, overcast and sunny. The complete classification process generates the decision tree or the classification rules.

Decision Tree

Decision Tree developed in my work

Objectives
1. To integrate the decision tree with the data warehouse or database.
2. To reduce the time of construction of the decision tree at the root node.
The computational process for constructing the tree is highly complex and recursive in nature. It includes repeatedly calculating various values, i.e. the entropy of the dataset and the entropy and gain values of each input attribute in the dataset. Here, I have pre-computed the results required at least for the selection and classification of the root node. There is no change in the intelligence approach; only the values required are stored, instead of being calculated at run time in memory. However, this integration has given a jump start to the construction of the classification model, enhancing the overall efficiency of the model.

Pre-Calculated Structure
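The figure for this slide did not survive in the transcript. Below is one plausible minimal sketch, under this rewrite's own assumptions about the layout, of such a pre-calculated structure: the per-value class counts needed for the root-node entropy and gain calculations are stored in a table kept beside the materialized training view, instead of being recomputed at run time.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical pre-calculated structure: for each attribute value,
    -- the class counts needed for root-node entropy/gain are stored
    -- ahead of time instead of being scanned from the training data.
    CREATE TABLE root_stats (
        attribute TEXT,      -- e.g. 'outlook'
        value     TEXT,      -- e.g. 'rain'
        yes_count INTEGER,
        no_count  INTEGER
    );
    INSERT INTO root_stats VALUES
        ('outlook', 'rain',     3, 2),
        ('outlook', 'overcast', 4, 0),
        ('outlook', 'sunny',    2, 3);
""")

# The classifier reads these counts directly to choose the root split.
for row in conn.execute("SELECT * FROM root_stats WHERE attribute = 'outlook'"):
    print(row)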

Conclusion
Classification algorithms are memory resident, calculating various statistical values at runtime. Storing these statistical values, even just for the selection of the best attribute at the root node, greatly increases the performance of the classification algorithms. A materialized view will hold the input training dataset, while these statistical values will be stored in a dependent table. This table will be updated according to the policy chosen. Modern data warehouses offer many methods to update the materialized view. However, each time a new target class is introduced or new data is loaded, the table containing the statistical values will be updated accordingly. The accuracy of the algorithm is in no way affected, neither in a positive nor a negative direction. The significant improvement introduced is in efficiency, in the selection of the root-level attribute.