
1 An Introduction
Student Name: Riaz Ahmad
Program: MSIT (2005-2007)
Subject: Data Warehouse & Data Mining

2 Terms Relevant to My Research
KDD process
Data warehouse
Materialized or indexed view
Data mining
Data mining techniques
Classification
Decision tree
ID3 algorithm
Objective
Conclusion

3 KDD Process
Databases → Data Cleaning → Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation

4 Steps of a KDD Process
1. Learning the application domain: relevant prior knowledge and goals of the application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing (may take 60% of the effort!)
4. Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
5. Choosing the functions of data mining: summarization, classification, regression, association, clustering
6. Choosing the mining algorithm(s)
7. Data mining: search for patterns of interest
8. Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
9. Use of discovered knowledge

5 Data Warehouse
A data warehouse is a subject-oriented, integrated (consolidated), non-volatile (read-only), time-variant collection of data designed to support management DSS needs.
Any read-only collection of accumulated historical data is called a data warehouse.
A data warehouse is a database specifically structured for query and analysis.
A data warehouse typically contains data representing the business history of an organization.

6 Materialized or Indexed View
A materialized view is a special type of summary table that is constructed by aggregating one or more columns of data from a single table, or from a series of tables that are joined together.
Materialized views can dramatically improve query performance and significantly decrease the load on the system.
You need materialized views in a data warehouse environment more than in an OLTP environment: they pay off in huge databases, not in a table that has 3 records.
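
To make the idea concrete, here is a small Python sketch that simulates a materialized view with sqlite3 (SQLite has no native materialized views): an aggregate over a detail table is precomputed into a summary table and refreshed on demand. The sales/sales_by_region names are invented for the example.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Hypothetical detail table standing in for a large fact table.
    cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    cur.executemany("INSERT INTO sales VALUES (?, ?)",
                    [("north", 100.0), ("north", 250.0), ("south", 75.0)])

    # The "materialized view": the aggregate is stored, not recomputed per query.
    cur.execute("CREATE TABLE sales_by_region (region TEXT, total REAL)")

    def refresh_view(cur):
        # Full refresh: rebuild the summary from the detail table.
        cur.execute("DELETE FROM sales_by_region")
        cur.execute("""INSERT INTO sales_by_region
                       SELECT region, SUM(amount) FROM sales GROUP BY region""")

    refresh_view(cur)
    print(cur.execute("SELECT * FROM sales_by_region").fetchall())
    # e.g. [('north', 350.0), ('south', 75.0)]

Queries then read the small summary table instead of scanning the detail table; the cost is paid once, at refresh time.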

7 Data Mining
Data mining (DM) is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic.
The patterns discovered must be meaningful in that they lead to some advantage.
Simply defined, data mining refers to extracting or mining knowledge from large amounts of data.
There are many other terms, such as knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.
Data mining is concerned with finding hidden relationships present in business data to allow businesses to make predictions for future use.

8 Data Mining Tasks or Techniques
Classification
Regression
Segmentation
Association
Forecasting
Text Analysis
Advanced Data Exploration
Among these data mining techniques or tasks, we select only classification for the research work.

9 Classification
Classification is a form of data analysis that can be used to extract models describing important data classes or to predict future data trends.
Given a number of pre-specified classes, examine a new object, record, or individual and assign it, based on a model, to one of these classes.
Examples:
Which credit applicants are low, medium, or high risk?
Which hotel customers are likely or unlikely to return?
Which residents are likely or unlikely to vote?

10 Classification Techniques
Decision tree based methods
Rule-based methods
Memory based reasoning
Neural networks
Genetic algorithms
Bayesian networks
Among these classification techniques, we select only the decision tree.

11 Decision Tree
A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes. The topmost node in a tree is the root node.
A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision. Decision trees are commonly used for gaining information for the purpose of decision-making. A decision tree starts with a root node on which users take action. From this node, the decision tree learning algorithm splits each node recursively. The final result is a decision tree in which each branch represents a possible scenario of a decision and its outcome.

12 Decision Tree

13 Generating Classification Rules from a Decision Tree
IF age = "<30" AND student = no THEN buys computer = no
IF age = "<30" AND student = yes THEN buys computer = yes
IF age = "30-40" THEN buys computer = yes
IF age = ">40" AND credit rating = excellent THEN buys computer = yes
IF age = ">40" AND credit rating = fair THEN buys computer = no
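
As an illustration, a minimal Python sketch of these rules encoded as a nested-dict tree, with a function that walks the tree to classify a record. The dict encoding and the underscored identifiers are assumptions for the example; the branch values follow the rules above.

    # Internal nodes map an attribute to its branches; leaves are class labels.
    tree = {
        "age": {
            "<30":   {"student": {"no": "no", "yes": "yes"}},
            "30-40": "yes",
            ">40":   {"credit_rating": {"excellent": "yes", "fair": "no"}},
        }
    }

    def classify(node, record):
        # Walk the tree until a leaf (a plain class label) is reached.
        while isinstance(node, dict):
            attribute = next(iter(node))      # attribute tested at this node
            node = node[attribute][record[attribute]]
        return node

    print(classify(tree, {"age": "<30", "student": "yes"}))  # -> yes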


15 ID3 Algorithm
Originator of the ID3 Algorithm: ID3 and its successors were developed by Ross Quinlan, who devised it in the 1970s.
Implementation of the ID3 Algorithm:
ID3(Learning set S, Attribute set A, Attribute values V) returns a decision tree.
Begin
  Load the learning set and create the decision tree root node 'rootNode'; add learning set S to the root node as its subset.
  For rootNode, first compute Entropy(rootNode.subset).
  If Entropy(rootNode.subset) == 0, then rootNode.subset consists of records that all have the same value for the categorical attribute; return a leaf node labelled with that attribute value.
  If Entropy(rootNode.subset) != 0, then compute the information gain for each attribute not yet used in splitting, find the attribute A with maximum Gain(S, A), and create child nodes of rootNode in the decision tree.
  For each child of rootNode, apply ID3(S, A, V) recursively until reaching a node with entropy = 0 or a leaf node.
End ID3.
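
Below is a compact, runnable Python sketch following the pseudocode above; it is an illustrative implementation, not the author's code. Records are assumed to be dicts with the class label stored under a target key, and the tree is returned as nested dicts.

    import math
    from collections import Counter

    def entropy(records, target):
        # Entropy of the class distribution: sum of -p * log2(p) over classes.
        counts = Counter(r[target] for r in records)
        total = len(records)
        return sum(-(c / total) * math.log2(c / total) for c in counts.values())

    def gain(records, attribute, target):
        # Information gain: Entropy(S) minus the size-weighted entropy of the
        # subsets produced by splitting on the attribute.
        total = len(records)
        remainder = 0.0
        for value in {r[attribute] for r in records}:
            subset = [r for r in records if r[attribute] == value]
            remainder += (len(subset) / total) * entropy(subset, target)
        return entropy(records, target) - remainder

    def id3(records, attributes, target):
        # Pure subset: return a leaf labelled with the single class.
        if entropy(records, target) == 0:
            return records[0][target]
        # No attributes left to split on: return the majority class.
        if not attributes:
            return Counter(r[target] for r in records).most_common(1)[0][0]
        # Split on the attribute with maximum information gain (the ID3 rule).
        best = max(attributes, key=lambda a: gain(records, a, target))
        rest = [a for a in attributes if a != best]
        return {best: {v: id3([r for r in records if r[best] == v], rest, target)
                       for v in {r[best] for r in records}}}

Calling id3(S, ["outlook", "temperature", "humidity", "wind"], "play") on the dataset sketched below yields a tree rooted at outlook, matching the gain calculation in the later slides.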

16 Mathematical Formulae
The following mathematical formulae are used for the calculation of Entropy and Gain.
Entropy equation: Entropy(S) = -Σ_i p_i log2 p_i, where p_i is the relative frequency of class i in S.
Information gain equation: Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) Entropy(S_v), where the sum runs over the values v of attribute A and S_v is the subset of S with A = v.

17 Classification Experiments
First we take a training dataset (S) for classification purposes. (The slide shows the training table, not reproduced in this transcript.)
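
Every count quoted in the following slides (14 records; 9 Yes and 5 No; outlook splitting 5/4/5 into rain, overcast, and sunny with class splits of 3/2, 4/0, and 2/3) matches Quinlan's classic weather ("play tennis") dataset, so the sketches below assume that dataset:

    # Classic weather ("play tennis") dataset; assumed here because its class
    # and outlook counts match every figure quoted in the slides.
    S = [
        {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "wind": "weak",   "play": "no"},
        {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "wind": "strong", "play": "no"},
        {"outlook": "overcast", "temperature": "hot",  "humidity": "high",   "wind": "weak",   "play": "yes"},
        {"outlook": "rain",     "temperature": "mild", "humidity": "high",   "wind": "weak",   "play": "yes"},
        {"outlook": "rain",     "temperature": "cool", "humidity": "normal", "wind": "weak",   "play": "yes"},
        {"outlook": "rain",     "temperature": "cool", "humidity": "normal", "wind": "strong", "play": "no"},
        {"outlook": "overcast", "temperature": "cool", "humidity": "normal", "wind": "strong", "play": "yes"},
        {"outlook": "sunny",    "temperature": "mild", "humidity": "high",   "wind": "weak",   "play": "no"},
        {"outlook": "sunny",    "temperature": "cool", "humidity": "normal", "wind": "weak",   "play": "yes"},
        {"outlook": "rain",     "temperature": "mild", "humidity": "normal", "wind": "weak",   "play": "yes"},
        {"outlook": "sunny",    "temperature": "mild", "humidity": "normal", "wind": "strong", "play": "yes"},
        {"outlook": "overcast", "temperature": "mild", "humidity": "high",   "wind": "strong", "play": "yes"},
        {"outlook": "overcast", "temperature": "hot",  "humidity": "normal", "wind": "weak",   "play": "yes"},
        {"outlook": "rain",     "temperature": "mild", "humidity": "high",   "wind": "strong", "play": "no"},
    ]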

18 Step (1): Entropy of the Original Dataset
The entropy calculation for the dataset is shown below. First count the records with the No class value: five (5); the records with the Yes class value: nine (9). The total number of records is fourteen (14).
Relative frequency of the No class: 5/14.
Relative frequency of the Yes class: 9/14.
The entropy of dataset S is calculated with the entropy formula above:
Entropy (5, 9) = -5/14 log2 5/14 - 9/14 log2 9/14 = 0.9403
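
Using the entropy function from the ID3 sketch above on the assumed dataset reproduces this value:

    print(Counter(r["play"] for r in S))  # Counter({'yes': 9, 'no': 5})
    print(round(entropy(S, "play"), 4))   # -> 0.9403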

19 Step (2): Calculate the Gain of Each Input Attribute in the Dataset
The following information is required for the gain calculation of the outlook attribute. To calculate an attribute's gain, first check the number of values for the attribute, then classify the S dataset on the basis of each value. The outlook attribute has three values: rain, overcast, and sunny. Three subsets of the S dataset are therefore possible on the basis of the outlook attribute values:
I. The first subset (S1) contains five (5) records, on the basis of the rain value of the outlook attribute.
II. The second subset (S2) contains four (4) records, on the basis of the overcast value of the outlook attribute.
III. The third subset (S3) contains five (5) records, on the basis of the sunny value.
Proportionality measure for S1: 5/14
Proportionality measure for S2: 4/14
Proportionality measure for S3: 5/14

20 Step (2), continued
The following steps are required for the calculation of the gain value of each attribute in the original dataset (S):
1. Calculate the entropy of each subset (S1, S2, S3).
2. Calculate the attribute entropy (outlook).
3. Calculate the gain of the attribute (outlook).

21 Calculate the Entropy of Each Subset (S1, S2, S3)
The first subset S1 has three (3) Yes records and two (2) No records, five (5) records in total.
Entropy (3, 2) = -3/5 log2 3/5 - 2/5 log2 2/5 = 0.971
The second subset S2 has four (4) Yes records, four (4) records in total.
Entropy (4, 0) = -4/4 log2 4/4 = 0
The third subset S3 has three (3) No records and two (2) Yes records, five (5) records in total.
Entropy (3, 2) = -3/5 log2 3/5 - 2/5 log2 2/5 = 0.971

22 Calculate the Entropy of the Outlook Attribute
The following formula is used for the calculation:
Entropy (S1, S2, S3) = |S1|/|S| * Entropy (S1) + |S2|/|S| * Entropy (S2) + |S3|/|S| * Entropy (S3)
Entropy (5, 4, 5), i.e. Entropy (outlook) = 5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971 = 0.694

23 Calculate the Gain Value of the Outlook Attribute
The following formula is used for the calculation:
Gain (S, A) = Entropy (S) - Entropy (A)
Gain (S, outlook) = 0.9403 - 0.694 = 0.2463
The above three steps repeat for the other three remaining input attributes. The following tables contain the gain of the attributes for the original set, the rain subset, and the sunny subset.
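
The gain function from the ID3 sketch reproduces these numbers on the assumed dataset (the small difference in the last digit comes from the slide rounding 0.9403 and 0.694 before subtracting):

    for value in ("rain", "overcast", "sunny"):
        subset = [r for r in S if r["outlook"] == value]
        print(value, len(subset), round(entropy(subset, "play"), 3))
    # rain 5 0.971 / overcast 4 0.0 / sunny 5 0.971
    print(round(gain(S, "outlook", "play"), 4))
    # -> 0.2467 (the slide's 0.2463 reflects the rounded intermediates)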

24 Attributes Along with Gain Information (Original set, rain and sunny subsets)

25 Step (3): Select the Maximum-Gain Attribute for the Classification of Dataset (S)
In the above table, the attribute with the maximum gain value is outlook. The gain value of outlook is 0.2463, which is the highest value. After this we can split the dataset into three different subsets on the basis of the outlook attribute values: rain, overcast, and sunny. The following slides show the complete classification process, which generates the decision tree and the classification rules.

26 Decision Tree

27 Decision Tree developed in my work

28 Objectives
1. To integrate the decision tree with the data warehouse or database.
2. To reduce the time of construction of the decision tree at the root node.
The computational process for constructing the tree is highly complex and recursive in nature. It includes repeatedly calculating various values, i.e. the entropy of the dataset and the entropy and gain values of each input attribute in the dataset. Here, I have pre-computed the results required at least for the selection and classification of the root node. There is no change in the intelligence approach; only the required values are stored, instead of being calculated at run time in memory. However, this integration gives a jump start to the construction of the classification model, enhancing the overall efficiency of the model.
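
A minimal sketch of the idea, assuming the dataset and functions above: the per-attribute gain values for the root split are computed once and stored (here a dict stands in for the dependent table described in the conclusion), so the root attribute can be selected later without rescanning the training data.

    # Precompute once, e.g. whenever the materialized view is refreshed.
    attributes = ["outlook", "temperature", "humidity", "wind"]
    root_stats = {a: gain(S, a, "play") for a in attributes}  # the stored table

    # At model-construction time the root choice is a lookup, not a scan.
    root_attribute = max(root_stats, key=root_stats.get)
    print(root_attribute)  # -> outlook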

29 Pre-Calculated Structure

30 Conclusion
Classification algorithms are memory resident, calculating various statistical values at runtime. Storing these statistical values, even just for the selection of the best attribute at the root node, greatly increases the performance of the classification algorithms. A materialized view will hold the input training dataset, while these statistical values will be stored in a dependent table. This table will be updated according to the policy chosen. Modern data warehouses offer many methods to update the materialized view. Each time a new target class is introduced or new data is loaded, the table containing the statistical values will be updated accordingly. The accuracy of the algorithm is in no way affected, positively or negatively. The significant improvement is in efficiency, in the selection of the root-level attribute.

