
1 Tutorial Document of ITM638 Data Warehousing and Data Mining, Dr. Chutima Beokhaimook, 24th March 2012

2 DATA WAREHOUSES AND OLAP TECHNOLOGY

3 What is a Data Warehouse? Data warehouses have been defined in many ways. "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process" – W.H. Inmon. The four keywords: subject-oriented, integrated, time-variant and non-volatile.

4 So, what is data warehousing? A process of constructing and using data warehouses. The utilization of a data warehouse necessitates a collection of decision support technologies. These allow knowledge workers (e.g. managers, analysts and executives) to use the data warehouse to obtain an overview of the data and make decisions based on information in the warehouse. The term "warehouse DBMS" refers to the management and utilization of data warehouses. Constructing a data warehouse involves data integration, data cleaning and data consolidation.

5 Operational Databases vs. Data Warehouses Operational DBMS – OLTP (on-line transaction processing): supports the day-to-day operations of an organization such as purchasing, inventory, manufacturing, banking, etc. Data warehouses – OLAP (on-line analytical processing): serve users or knowledge workers in the role of data analysis and decision making; the system can organize and present data in various formats.

6 OLTP vs. OLAP
Feature | OLTP | OLAP
Characteristic | Operational processing | Informational processing
Users | Clerk, IT professional | Knowledge worker
Orientation | Transaction | Analysis
Function | Day-to-day operations | Long-term informational requirements, DSS
DB design | ER-based, application-oriented | Star/snowflake, subject-oriented
Data | Current, guaranteed up-to-date | Historical; accuracy maintained over time
Summarization | Primitive, highly detailed | Summarized, consolidated
# of records accessed | Tens | Millions
# of users | Thousands | Hundreds
DB size | 100 MB to GB | 100 GB to TB

7 Why Have a Separate Data Warehouse? High performance for both systems: an operational database is tuned for OLTP (access methods, indexing, concurrency control, recovery), while a data warehouse is tuned for OLAP (complex OLAP queries, multidimensional views, consolidation). Different functions and different data: DSS require historical data, whereas operational DBs do not maintain historical data; DSS require consolidation (such as aggregation and summarization) of data from heterogeneous sources, resulting in high-quality, clean, and integrated data, whereas operational DBs contain only detailed raw data, which needs to be consolidated before analysis.

8 A Multidimensional Data Model (1) Data warehouses and OLAP tools are based on a multidimensional data model, which views data in the form of a data cube. A data cube allows data to be modeled and viewed in multiple dimensions. Dimensions are the perspectives or entities with respect to which an organization wants to keep records. Example: a sales data warehouse keeps records of the store's sales with respect to the dimensions time, item, branch and location. Each dimension may have a table associated with it, called a dimension table, which further describes the dimension, e.g. item(item_name, brand, type). Dimension tables can be specified by users or experts, or automatically adjusted based on data distribution.

9 A Multidimensional Data Model (2) A multidimensional model is organized around a central theme, for instance sales, which is represented by a fact table. Facts are numerical measures such as quantities: dollars_sold, units_sold, amount_budget.

10 Example: A 2-D View Table 2.1 shows a 2-D view of sales data according to the dimensions time and item, where the sales are from branches located in Vancouver. The measure shown is dollars_sold (in thousands).

11 Example: A 3-D View Table 2.2 shows a 3-D view of sales data according to the dimensions time, item and location. The measure shown is dollars_sold (in thousands).

12 Example: A 3-D Data Cube A 3-D data cube represents the data in Table 2.2 according to the dimensions time, item and location. The measure shown is dollars_sold (in thousands).

13 Star Schema The most common modeling paradigm, in which the data warehouse contains: 1. A large central table (fact table) containing the bulk of the data, with no redundancy. 2. A set of smaller attendant tables (dimension tables), one for each dimension.

14 Example: star schema of a data warehouse for sales The central fact table is sales, which contains keys to each of the four dimensions, along with two measures: dollars_sold and units_sold.

15 Example: snowflake schema of a data warehouse for sales

16 Example: fact constellation schema of a data warehouse for sales and shipping, which contains two fact tables.

17 Concept Hierarchies A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts (the example below is for location).

18 Concept Hierarchies (2) Many concept hierarchies are implicit within the database schema. The location dimension, described by the attributes number, street, city, province_or_state, zipcode and country, has the hierarchy street < city < province_or_state < country, which is a total order. The time dimension, described by the attributes day, week, month, quarter and year, has the hierarchy day < {month < quarter; week} < year, which is a partial order.

19 Typical OLAP Operations for Multidimensional Data (1) Roll-up (drill-up): climbing up a concept hierarchy or reducing a dimension – summarizes data. Drill-down (roll-down): stepping down a concept hierarchy or introducing additional dimensions; the reverse of roll-up; navigates from less detailed data to more detailed data. Slice and dice: the slice operation performs a selection on one dimension of the given cube, resulting in a subcube; the dice operation defines a subcube by performing a selection on two or more dimensions.

20 Typical OLAP Operations for Multidimensional Data (2) Pivot (rotate): a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data. Other OLAP operations include drill-across (executes queries involving more than one fact table) and drill-through.

21 [Figure: roll-up on location (from cities to countries) applied to a sales cube with dimensions time (quarters Q1–Q4), item (type: home entertainment, computer, phone, security) and location (cities: Vancouver, Toronto, New York, Chicago), aggregating the dollars_sold measure to the country level (Canada, USA).]

22 [Figure: drill-down on time (from quarters to months) applied to the same sales cube, expanding each quarter into its months (Jan–Dec).]

23 [Figure: dice for (location = "Toronto" or "Vancouver") and (time = "Q1" or "Q2") and (item = "home entertainment" or "computer"), producing a subcube of the sales cube.]

24 [Figure: slice for (time = "Q1"), producing a 2-D plane of item (type) by location (cities), followed by a pivot that rotates the item and location axes.]

25 MINING FREQUENT PATTERNS AND ASSOCIATIONS

26 What is Association Mining? Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories. Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc. Rule form: "Body → Head [support, confidence]", e.g. buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]; major(x, "CS") ∧ takes(x, "DB") → grade(x, "A") [1%, 75%].

27 A typical example of association rule mining is market basket analysis.

28 The information that customers who purchase a computer also tend to buy antivirus software at the same time is represented in the association rule: computer → antivirus_software [support = 2%, confidence = 60%]. Rule support and confidence are two measures of rule interestingness. Support = 2% means that 2% of all transactions under analysis show that computer and antivirus software are purchased together. Confidence = 60% means that 60% of the customers who purchased a computer also bought the software. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Such thresholds can be set by users or domain experts.

29 Rule Measures: Support and Confidence Support: the probability that a transaction contains {A, B, C}. Confidence: the conditional probability that a transaction containing {A, B} also contains {C}.
TransID | Items Bought
T001 | A, B, C
T002 | A, C
T003 | A, D
T004 | B, E, F
Find all rules of the form A ∧ B → C with minimum confidence and support; let min_sup = 50% and min_conf = 50%. Typically association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Such thresholds can be set by users or domain experts.
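To make the two measures concrete, here is a minimal sketch (not from the original slides) that counts support and confidence over the small transaction table above; the example rule {A} → {C} and the helper names are illustrative assumptions.

```python
# Minimal sketch: computing support and confidence over the toy transactions above.
# The rule {A} -> {C} is an illustrative choice, not taken from the slides.
transactions = [
    {"A", "B", "C"},   # T001
    {"A", "C"},        # T002
    {"A", "D"},        # T003
    {"B", "E", "F"},   # T004
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(body, head, transactions):
    """support(body ∪ head) / support(body)."""
    return support(body | head, transactions) / support(body, transactions)

print(support({"A", "C"}, transactions))          # 0.5  -> 50% support
print(confidence({"A"}, {"C"}, transactions))     # 0.666... -> about 67% confidence
```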

30 Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. A set of items is referred to as an itemset; an itemset that contains k items is a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset. An itemset satisfies minimum support if its occurrence frequency >= min_sup × total no. of transactions. An itemset that satisfies minimum support is a frequent itemset.

31 Two Steps in Mining Association Rules Step 1: Find all frequent itemsets. A subset of a frequent itemset must also be a frequent itemset, i.e. if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets. Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets). Step 2: Generate strong association rules from the frequent itemsets.

32 Mining Single-Dimensional Boolean Association Rules from Transaction Databases The Apriori algorithm is a method for mining the simplest form of association rules: single-dimensional, single-level, boolean association rules. The Apriori algorithm finds the frequent itemsets for boolean association rules: the frequent k-itemsets L_k are used to explore L_k+1, and each pass consists of a join step and a prune step. 1. The join step: a set of candidate k-itemsets (C_k) is generated by joining L_k-1 with itself. 2. The prune step: determine L_k using the property that any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.

33 The Apriori Algorithm Pseudo-code (C_k: candidate itemsets of size k; L_k: frequent itemsets of size k):
L_1 = {frequent 1-itemsets};
for (k = 1; L_k != ∅; k++) do begin
    C_k+1 = candidates generated from L_k;
    for each transaction t in database D do
        increment the count of all candidates in C_k+1 that are contained in t;
    L_k+1 = candidates in C_k+1 with min_support;
end
return ∪_k L_k;
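The pseudo-code translates fairly directly into Python. The sketch below is an illustrative, unoptimized implementation (candidate generation by pairwise union rather than the ordered-prefix join, and a full database scan per pass); names such as `apriori` and `min_support_count` are my own, not from the slides.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Compact Apriori sketch following the pseudo-code above.
    `transactions` is a list of sets of items; returns a dict mapping each
    frequent itemset (as a frozenset) to its support count."""
    # C1: count every individual item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_support_count}   # L1
    frequent = dict(L)
    k = 1
    while L:
        # Join step: combine frequent k-itemsets into candidate (k+1)-itemsets.
        prev = list(L)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1:
                    # Prune step: every k-subset of a candidate must be frequent.
                    if all(frozenset(sub) in L for sub in combinations(union, k)):
                        candidates.add(union)
        # Count candidate occurrences by scanning the database once.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        L = {s: c for s, c in counts.items() if c >= min_support_count}
        frequent.update(L)
        k += 1
    return frequent

# Example usage on the four-transaction table from the earlier slide:
# apriori([{"A","B","C"}, {"A","C"}, {"A","D"}, {"B","E","F"}], min_support_count=2)
# -> {A}: 3, {B}: 2, {C}: 2, {A,C}: 2
```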

34 Example: Finding frequent itemsets in a transaction database D with |D| = 9. 1. Each item is a member of the set of candidate 1-itemsets (C_1); count the number of occurrences of each item. 2. Suppose the minimum transaction support count is 2; the set L_1 consists of the candidate 1-itemsets that satisfy minimum support. 3. Generate C_2 = L_1 ⋈ L_1. 4. Continue the algorithm until C_4 = ∅.

35 [Figure-only slide: no transcript text.]

36 Example of Generating Candidates L_3 = {abc, abd, acd, ace, bcd}. Self-joining L_3 ⋈ L_3 gives C_4 = {abcd, acde}. Pruning: acde is removed because ade is not in L_3. Therefore C_4 = {abcd}.

37 Generating Association Rules from Frequent Itemsets confidence(A → B) = P(B|A) = support_count(A ∪ B) / support_count(A), where support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B and support_count(A) is the number of transactions containing the itemset A. Association rules can be generated as follows: for each frequent itemset l, generate all nonempty subsets of l; for every nonempty subset s of l, output the rule s → (l − s) if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.
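As an illustration of this rule-generation step, here is a small sketch (my own naming, not from the slides) that assumes the frequent itemsets and their support counts have already been computed, e.g. by the Apriori sketch above.

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """`frequent` maps frozenset itemsets to support counts.
    Yields (body, head, confidence) for every rule meeting min_conf."""
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        # Every nonempty proper subset s of the itemset is a candidate rule body.
        for r in range(1, len(itemset)):
            for body in map(frozenset, combinations(itemset, r)):
                conf = count / frequent[body]   # support_count(l) / support_count(s)
                if conf >= min_conf:
                    yield body, itemset - body, conf
```

By the Apriori property every subset of a frequent itemset is itself frequent, so the lookup `frequent[body]` is always available.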

38 Example Suppose the data contain the frequent itemset l = {I1, I2, I5}. What are the association rules that can be generated from l? The nonempty subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}. The resulting association rules are:
1. I1 ∧ I2 → I5, confidence = 2/4 = 50%
2. I1 ∧ I5 → I2, confidence = 2/2 = 100%
3. I2 ∧ I5 → I1, confidence = 2/2 = 100%
4. I1 → I2 ∧ I5, confidence = 2/6 = 33%
5. I2 → I1 ∧ I5, confidence = 2/7 = 29%
6. I5 → I1 ∧ I2, confidence = 2/2 = 100%
If the minimum confidence threshold is 70%, the output is rules 2, 3 and 6.

39 CLASSIFICATION AND PREDICTION

40 Lecture 5: Classification and Prediction, Chutima Pisarn, Faculty of Technology and Environment, Prince of Songkla University, 976-451 Data Warehousing and Data Mining

41 What Is Classification? Cases: a bank loans officer needs an analysis of her data in order to learn which loan applicants are "safe" and which are "risky" for the bank; a marketing manager at AllElectronics needs a data analysis to help guess whether a customer with a given profile will buy a new computer; a medical researcher wants to analyze breast cancer data in order to predict which one of three specific treatments a patient should receive. The data analysis task is classification, where a model or classifier is constructed to predict categorical labels, such as "safe" or "risky" for the loan application data, "yes" or "no" for the marketing data, and "treatment A", "treatment B" or "treatment C" for the medical data.

42 What Is Prediction? Suppose that the marketing manager would like to predict how much a given customer will spend during a sale at AllElectronics. This data analysis task is numeric prediction, where the model constructed predicts a continuous value or ordered values, as opposed to a categorical label; this model is a predictor. Regression analysis is the statistical methodology most often used for numeric prediction.

43 How Does Classification Work? Data classification is a two-step process. In the first step, the learning step or training phase: a model is built describing a predetermined set of data classes or concepts; the model is constructed by analyzing database tuples described by attributes; each tuple is assumed to belong to a predefined class, as determined by the class label attribute. The data tuples used to build the model form the training data set, and the individual tuples in the training set are referred to as training samples. If the class label is provided, this step is known as supervised learning; otherwise it is unsupervised learning (or clustering). The learned model is represented in the form of classification rules, decision trees or mathematical formulae.

44 How Does Classification Work? In the second step, the model is used for classification. First, estimate the predictive accuracy of the model. The holdout method is a technique that uses a test set of class-labeled samples which are randomly selected and are independent of the training samples. The accuracy of a model on a given test set is the percentage of test set samples correctly classified by the model. If the accuracy of the model were estimated based on the training data set, the model would tend to overfit the data. If the accuracy of the model is considered acceptable, the model can be used to classify future data tuples or objects for which the class label is unknown.

45 How Is Prediction Different from Classification? Data prediction is a two-step process, similar to that of data classification. For prediction, the attribute for which values are being predicted is continuous-valued (ordered) rather than categorical (discrete-valued and unordered). Prediction can also be viewed as a mapping or function, y = f(X).

46 Classification by Decision Tree Induction A decision tree is a flow-chart-like tree structure: each internal node denotes a test on an attribute, each branch represents an outcome of the test, each leaf node represents a class, and the top-most node in the tree is the root node. Example tree for the concept buys_computer:
age? "<=30" → student? (no → no; yes → yes)
age? "31…40" → yes
age? ">40" → credit_rating? (excellent → no; fair → yes)

47 Attribute Selection Measure The information gain measure is used to select the test attribute at each node in the tree. Information gain is referred to as an attribute selection measure or a measure of the goodness of split. The attribute with the highest information gain is chosen as the test attribute for the current node. Let S be a set consisting of s data samples, and let the class label attribute have m distinct values defining m distinct classes C_i (for i = 1, ..., m). Let s_i be the number of samples of S in class C_i. The expected information is I(s_1, s_2, ..., s_m) = -Σ_{i=1}^{m} p_i log2(p_i), where p_i is the probability that a sample belongs to class C_i, p_i = s_i / s.

48 Attribute Selection Measure (cont.) Next, find the entropy of attribute A. Let A have v distinct values {a_1, a_2, ..., a_v}, which partition S into {S_1, S_2, ..., S_v}. For each S_j, let s_ij be the number of samples of class C_i in S_j. The entropy, or expected information based on attribute A, is E(A) = Σ_{j=1}^{v} ((s_1j + ... + s_mj) / s) · I(s_1j, ..., s_mj), and Gain(A) = I(s_1, s_2, ..., s_m) − E(A). The algorithm computes the information gain of each attribute; the attribute with the highest information gain is chosen as the test attribute for the given set S.
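A small sketch of these formulas in Python follows (illustrative only; the function names and the representation of the data as a list of dicts are my own assumptions).

```python
from collections import Counter
from math import log2

def expected_info(class_counts):
    """I(s1, ..., sm) = -sum(p_i * log2(p_i)) over the class counts."""
    s = sum(class_counts)
    return -sum((c / s) * log2(c / s) for c in class_counts if c > 0)

def gain(samples, attribute, class_label):
    """Information gain of `attribute` for a list of dict-like samples."""
    total = expected_info(list(Counter(row[class_label] for row in samples).values()))
    entropy = 0.0
    for value in {row[attribute] for row in samples}:
        subset = [row for row in samples if row[attribute] == value]
        counts = list(Counter(row[class_label] for row in subset).values())
        entropy += (len(subset) / len(samples)) * expected_info(counts)
    return total - entropy

# On the 14-tuple buys_computer table of the next slide (as a list of dicts),
# gain(data, "age", "buys_computer") comes out to about 0.246, matching the
# hand calculation two slides below.
```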

49 Example
RID | age | income | student | credit_rating | Class: buys_computer
1 | <=30 | high | no | fair | no
2 | <=30 | high | no | excellent | no
3 | 31…40 | high | no | fair | yes
4 | >40 | medium | no | fair | yes
5 | >40 | low | yes | fair | yes
6 | >40 | low | yes | excellent | no
7 | 31…40 | low | yes | excellent | yes
8 | <=30 | medium | no | fair | no
9 | <=30 | low | yes | fair | yes
10 | >40 | medium | yes | fair | yes
11 | <=30 | medium | yes | excellent | yes
12 | 31…40 | medium | no | excellent | yes
13 | 31…40 | high | yes | fair | yes
14 | >40 | medium | no | excellent | no
The class label attribute has 2 classes: I(s_1, s_2) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940.

50 I(s_1, s_2) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940. Compute the entropy of each attribute. For attribute "age": for age = "<=30", s_11 = 2, s_21 = 3; for age = "31…40", s_12 = 4, s_22 = 0; for age = ">40", s_13 = 3, s_23 = 2. Gain(age) = I(s_1, s_2) − E(age) = 0.940 − [(5/14)I(2,3) + (4/14)I(4,0) + (5/14)I(3,2)] = 0.246. For attribute "income": for income = "high", s_11 = 2, s_21 = 2; for income = "medium", s_12 = 4, s_22 = 2; for income = "low", s_13 = 3, s_23 = 1. Gain(income) = I(s_1, s_2) − E(income) = 0.940 − [(4/14)I(2,2) + (6/14)I(4,2) + (4/14)I(3,1)] = 0.029.

51 For attribute "student": for student = "yes", s_11 = 6, s_21 = 1; for student = "no", s_12 = 3, s_22 = 4. Gain(student) = I(s_1, s_2) − E(student) = 0.940 − [(7/14)I(6,1) + (7/14)I(3,4)] = 0.151. For attribute "credit_rating": for credit_rating = "fair", s_11 = 6, s_21 = 2; for credit_rating = "excellent", s_12 = 3, s_22 = 3. Gain(credit_rating) = I(s_1, s_2) − E(credit_rating) = 0.940 − [(8/14)I(6,2) + (6/14)I(3,3)] = 0.048. Since age has the highest information gain, age is selected as the test attribute: a node is created and labeled with age, and branches are grown for each of the attribute's values.

52 Partitioning on age produces three subsets.
S_1 (age = "<=30"):
income | student | credit_rating | Class
high | no | fair | no
high | no | excellent | no
medium | no | fair | no
low | yes | fair | yes
medium | yes | excellent | yes
S_2 (age = "31…40"):
income | student | credit_rating | Class
high | no | fair | yes
low | yes | excellent | yes
medium | no | excellent | yes
high | yes | fair | yes
S_3 (age = ">40"):
income | student | credit_rating | Class
medium | no | fair | yes
low | yes | fair | yes
low | yes | excellent | no
medium | yes | fair | yes
medium | no | excellent | no

53 For the partition age = "<=30": find the information gain for each attribute in this partition, then select the attribute with the highest information gain as the test node (i.e. call generate_decision_tree(S_1, {income, student, credit_rating})); student has the highest information gain. For student = "yes" (low/fair → yes; medium/excellent → yes), all samples belong to class yes, so a leaf node labeled "yes" is created. For student = "no" (high/fair → no; high/excellent → no; medium/fair → no), all samples belong to class no, so a leaf node labeled "no" is created.

54 For the partition age = "31…40": all samples belong to class yes, so a leaf node labeled "yes" is created. For the partition age = ">40" (tuples: medium/no/fair/yes, low/yes/fair/yes, low/yes/excellent/no, medium/yes/fair/yes, medium/no/excellent/no): consider credit_rating and income; credit_rating has the higher information gain.

55 For the partition age = ">40", the test node is credit_rating, with branch excellent → no and branch fair → yes. The only attribute left is income, but the sample set is empty, so generate_decision_tree terminates. Assignment 1: Show the construction of this decision tree in detail, including the calculations.

56 Example: Generating Rules from the Decision Tree
1. IF age = "<=30" AND student = "no" THEN buys_computer = "no"
2. IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
3. IF age = "31…40" THEN buys_computer = "yes"
4. IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
5. IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"

57 Naïve Bayesian Classification The naïve Bayesian classifier, also called the simple Bayesian classifier, works as follows: 1. Each data sample is represented by an n-dimensional feature vector X = (x_1, x_2, ..., x_n) from n attributes, respectively A_1, A_2, ..., A_n. Example training samples (attributes Outlook, Temperature, Humidity, Windy; class Play):
Outlook | Temperature | Humidity | Windy | Play
rainy | mild | normal | false | Y
overcast | cool | normal | true | Y
sunny | hot | high | true | N
overcast | hot | high | false | Y
The sample of unknown class is X = (sunny, hot, high, false).

58 Naïve Bayesian Classification (cont.) 2. Suppose that there are m classes, C_1, C_2, ..., C_m. Given an unknown data sample X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier assigns an unknown sample X to class C_i if and only if P(C_i|X) > P(C_j|X) for 1 <= j <= m, j != i. It therefore finds the maximum posterior probability among P(C_1|X), P(C_2|X), ..., P(C_m|X); the class C_i for which P(C_i|X) is maximized is called the maximum posteriori hypothesis.

59 For the training samples above, m = 2, with C_1: Play = "Y" and C_2: Play = "N", and the unknown sample X = (sunny, hot, high, false). If P(Play = "Y" | X) > P(Play = "N" | X), predict Y.

60 Naïve Bayesian Classification (cont.) 3. By Bayes' theorem, P(C_i|X) = P(X|C_i) P(C_i) / P(X). As P(X) is constant for all classes, only P(X|C_i) P(C_i) needs to be maximized. If the P(C_i) are not known, it is commonly assumed that P(C_1) = P(C_2) = ... = P(C_m), and therefore only P(X|C_i) needs to be maximized. Otherwise, we maximize P(X|C_i) P(C_i), where P(C_i) = s_i / s, i.e. the number of training samples of class C_i divided by the total number of training samples.

61 For the example above (m = 2; C_1: Play = "Y", C_2: Play = "N"; unknown sample X = (sunny, hot, high, false)): P(Play = "Y" | X) ∝ P(X | Play = "Y") P(Play = "Y") = P(X | Play = "Y") × (3/4), and P(Play = "N" | X) ∝ P(X | Play = "N") P(Play = "N") = P(X | Play = "N") × (1/4).

62 Naïve Bayesian Classification (cont.) 4. Given a data set with many attributes, it is expensive to compute P(X|C_i). To reduce computation, the naïve assumption of class conditional independence is made (there are no dependence relationships among the attributes): P(X|C_i) = Π_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) × P(x_2|C_i) × ... × P(x_n|C_i). If A_k is categorical, then P(x_k|C_i) = s_ik / s_i, the number of training samples of class C_i having the value x_k for A_k divided by the total number of training samples belonging to class C_i. If A_k is continuous-valued, a Gaussian distribution is used (not covered in this class).

63 Naïve Bayesian Classification (cont.) 5. In order to classify an unknown sample X, P(X|C_i) P(C_i) is evaluated for each class C_i. Sample X is assigned to the class C_i for which P(X|C_i) P(C_i) is the maximum.

64 Example: Predicting a Class Label Using Naïve Bayesian Classification
RID | age | income | student | credit_rating | Class: buys_computer
1 | <=30 | high | no | fair | no
2 | <=30 | high | no | excellent | no
3 | 31…40 | high | no | fair | yes
4 | >40 | medium | no | fair | yes
5 | >40 | low | yes | fair | yes
6 | >40 | low | yes | excellent | no
7 | 31…40 | low | yes | excellent | yes
8 | <=30 | medium | no | fair | no
9 | <=30 | low | yes | fair | yes
10 | >40 | medium | yes | fair | yes
11 | <=30 | medium | yes | excellent | yes
12 | 31…40 | medium | no | excellent | yes
13 | 31…40 | high | yes | fair | yes
14 | >40 | medium | no | excellent | no
15 | <=30 | medium | yes | fair | (unknown sample)

65 C_1: buys_computer = "yes", C_2: buys_computer = "no". The unknown sample we wish to classify is X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair"). We need to maximize P(X|C_i) P(C_i) for i = 1, 2. For i = 1: P(buys_computer = "yes") = 9/14 = 0.64. P(X|buys_computer = "yes") = P(age = "<=30" | buys_computer = "yes") × P(income = "medium" | buys_computer = "yes") × P(student = "yes" | buys_computer = "yes") × P(credit_rating = "fair" | buys_computer = "yes") = 2/9 × 4/9 × 6/9 × 6/9 = 0.044. P(X|buys_computer = "yes") P(buys_computer = "yes") = 0.044 × 0.64 = 0.028.

66 For i = 2: P(buys_computer = "no") = 5/14 = 0.36. P(X|buys_computer = "no") = P(age = "<=30" | buys_computer = "no") × P(income = "medium" | buys_computer = "no") × P(student = "yes" | buys_computer = "no") × P(credit_rating = "fair" | buys_computer = "no") = 3/5 × 2/5 × 1/5 × 2/5 = 0.019. P(X|buys_computer = "no") P(buys_computer = "no") = 0.019 × 0.36 = 0.007. Therefore, X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair") should be in class buys_computer = "yes".
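The whole calculation can be reproduced with a few counting loops. The sketch below is an illustrative implementation of the count-based naïve Bayesian classifier for categorical attributes only; the function and variable names are my own, not from the slides.

```python
from collections import Counter

def naive_bayes_predict(samples, class_label, x):
    """Count-based naïve Bayes for categorical attributes.
    `samples` is a list of dicts, `x` a dict of attribute values to classify."""
    class_counts = Counter(row[class_label] for row in samples)
    scores = {}
    for c, s_i in class_counts.items():
        score = s_i / len(samples)                      # P(C_i)
        rows_c = [row for row in samples if row[class_label] == c]
        for attr, value in x.items():                   # P(x_k | C_i) = s_ik / s_i
            s_ik = sum(row[attr] == value for row in rows_c)
            score *= s_ik / s_i
        scores[c] = score                               # P(X|C_i) P(C_i)
    return max(scores, key=scores.get), scores

# On the 14-tuple buys_computer table above, classifying
# x = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
# gives scores of roughly 0.028 for "yes" and 0.007 for "no", so the prediction is "yes".
```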

67 Assignment 2: Use the naïve Bayesian classifier to predict the class of the unknown data samples below.
Outlook | Temperature | Humidity | Windy | Play
sunny | hot | high | false | N
sunny | hot | high | true | N
overcast | hot | high | false | Y
rainy | mild | high | false | Y
rainy | cool | normal | false | Y
rainy | cool | normal | true | N
overcast | cool | normal | true | Y
sunny | mild | high | false | N
sunny | cool | normal | false | Y
rainy | mild | normal | false | Y
sunny | mild | normal | true | Y
overcast | hot | normal | false | Y
overcast | mild | high | true | Y
rainy | mild | high | true | N
sunny | cool | normal | false | ? (unknown)
rainy | mild | high | false | ? (unknown)

68 Prediction: Linear Regression The prediction of continuous values can be modeled by the statistical technique of regression. Linear regression is the simplest form of regression: Y = α + βX, where Y is called the response variable, X is called the predictor variable, and α and β are regression coefficients specifying the Y-intercept and slope of the line. These coefficients can be solved for by the method of least squares, which minimizes the error between the actual data and the estimate of the line.

69 Example: Find the linear regression of the salary data.
X, years of experience | Y, salary (in $1000s)
3 | 30
8 | 57
9 | 64
13 | 72
3 | 36
6 | 43
11 | 59
21 | 90
1 | 20
16 | 83
With x̄ = 9.1 and ȳ = 55.4: β = Σ_{i=1}^{s} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{s} (x_i − x̄)² = 3.5, and α = ȳ − β x̄ = 23.6. The predicted line is Y = 23.6 + 3.5X; for example, for X = 10, Y = 23.6 + 3.5(10) = 58.6.
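A minimal sketch of the least-squares fit for these data (plain Python, no libraries; the function name is my own):

```python
def linear_regression(xs, ys):
    """Least-squares fit of Y = alpha + beta * X."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
           / sum((x - x_bar) ** 2 for x in xs)
    alpha = y_bar - beta * x_bar
    return alpha, beta

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]
alpha, beta = linear_regression(years, salary)
# beta ≈ 3.54 (≈ 3.5) and alpha ≈ 23.2; the slide's alpha = 23.6 comes from
# rounding beta to 3.5 before solving for alpha.
print(round(alpha + beta * 10, 1))   # predicted salary for 10 years ≈ 58.6
```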

70 Classifier Accuracy Measures The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier (called the recognition rate in the pattern recognition literature). The error rate or misclassification rate of a classifier M is simply 1 − Acc(M), where Acc(M) is the accuracy of M. If we were to use the training set to estimate the error rate of the model, this would be the resubstitution error. A confusion matrix is a useful tool for analyzing how well a classifier can recognize tuples of different classes.

71 Confusion Matrix: Example
Actual class \ Predicted class | buys_computer = yes | buys_computer = no | Total | Recognition (%)
buys_computer = yes | 6,954 | 46 | 7,000 | 99.34
buys_computer = no | 412 | 2,588 | 3,000 | 86.27
Total | 7,366 | 2,634 | 10,000 | 95.52
Here 6,954 is the number of tuples of class buys_computer = yes that were labeled by the classifier as class buys_computer = yes. In general, for classes C_1 and C_2: an actual C_1 tuple predicted as C_1 is a true positive, an actual C_1 tuple predicted as C_2 is a false negative, an actual C_2 tuple predicted as C_1 is a false positive, and an actual C_2 tuple predicted as C_2 is a true negative.

72 Are There Alternatives to the Accuracy Measure? Sensitivity refers to the true positive (recognition) rate: the proportion of positive tuples that are correctly identified. Specificity is the true negative rate: the proportion of negative tuples that are correctly identified. sensitivity = t_pos / pos; specificity = t_neg / neg; precision = t_pos / (t_pos + f_pos); accuracy = sensitivity × pos / (pos + neg) + specificity × neg / (pos + neg), where pos is the number of positive tuples and neg is the number of negative tuples.
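These measures can be computed directly from the four confusion-matrix cells; the short sketch below (my own naming) reproduces the numbers of the example two slides back.

```python
def classifier_measures(t_pos, f_neg, f_pos, t_neg):
    """Sensitivity, specificity, precision and accuracy from confusion-matrix counts."""
    pos, neg = t_pos + f_neg, f_pos + t_neg
    sensitivity = t_pos / pos
    specificity = t_neg / neg
    precision = t_pos / (t_pos + f_pos)
    accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
    return sensitivity, specificity, precision, accuracy

# Using the buys_computer confusion matrix (TP=6954, FN=46, FP=412, TN=2588):
# sensitivity ≈ 0.9934, specificity ≈ 0.8627, precision ≈ 0.9441, accuracy ≈ 0.9552
print(classifier_measures(6954, 46, 412, 2588))
```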

73 Predictor Error Measures Loss functions measure the error between y_i and the predicted value y_i'. The most common loss functions are the absolute error, |y_i − y_i'|, and the squared error, (y_i − y_i')². Based on the above, the test error (rate), or generalization error, is the average loss over the test set. Thus, over d test tuples we get the following error rates: mean absolute error = Σ_{i=1}^{d} |y_i − y_i'| / d, and mean squared error = Σ_{i=1}^{d} (y_i − y_i')² / d.

74 Evaluating the Accuracy of a Classifier or Predictor How can we use these measures to obtain a reliable estimate of classifier accuracy (or predictor accuracy)? Accuracy estimates help in the comparison of different classifiers. Common techniques for assessing accuracy, based on randomly sampled partitions of the given data, are holdout, random subsampling, cross-validation and the bootstrap.

75 Evaluating the Accuracy of a Classifier or Predictor Holdout method: the given data are randomly partitioned into two independent sets, a training set and a test set; typically, 2/3 of the data form the training set and 1/3 the test set. The training set is used to derive the classifier; the test set is used to estimate the accuracy of the derived classifier.

76 Evaluating the Accuracy of a Classifier or Predictor Random subsampling: a variation of the holdout method in which the holdout method is repeated k times; the overall accuracy estimate is the average of the accuracies obtained from each iteration.

77 Evaluating the Accuracy of a Classifier or Predictor k-fold cross-validation: the initial data are randomly partitioned into k equal-sized subsets ("folds") S_1, S_2, ..., S_k; training and testing are performed k times; in iteration i, the subset S_i is the test set, and the remaining subsets are collectively used to train the classifier. Accuracy = (overall number of correct classifications from the k iterations) / (total number of samples in the initial data).
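A minimal sketch of the k-fold procedure (plain Python; the `train`, `classify` and `label_of` callables are placeholders assumed to be supplied by the caller, not from the slides):

```python
import random

def k_fold_accuracy(samples, k, train, classify, label_of):
    """k-fold cross-validation: accuracy = total correct over all k folds / |samples|."""
    samples = samples[:]
    random.shuffle(samples)                          # random partition into k folds
    folds = [samples[i::k] for i in range(k)]
    correct = 0
    for i in range(k):
        test_set = folds[i]
        training_set = [s for j, f in enumerate(folds) if j != i for s in f]
        model = train(training_set)                  # train on the other k-1 folds
        correct += sum(classify(model, s) == label_of(s) for s in test_set)
    return correct / len(samples)
```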

78 Evaluating the Accuracy of a Classifier or Predictor Bootstrap method: the training tuples are sampled uniformly with replacement; each time a tuple is selected, it is equally likely to be selected again and re-added to the training set. There are several bootstrap methods; the commonly used one is the .632 bootstrap, which works as follows. Given a data set of d tuples, the data set is sampled d times, with replacement, resulting in a bootstrap training set of d samples. It is very likely that some of the original data tuples will occur more than once in this sample. The data tuples that did not make it into the training set end up forming the test set. If we try this out several times, on average 63.2% of the original data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set.
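For illustration, a short sketch of drawing one bootstrap split in this style (my own naming; it produces only the training/test split, not the .632-weighted accuracy combination):

```python
import random

def bootstrap_split(samples):
    """Sample d tuples with replacement for training; unused tuples form the test set."""
    d = len(samples)
    chosen = [random.randrange(d) for _ in range(d)]
    chosen_set = set(chosen)
    training_set = [samples[i] for i in chosen]
    test_set = [s for i, s in enumerate(samples) if i not in chosen_set]
    # On average about 63.2% of the original tuples appear in the training sample,
    # since the chance a given tuple is never picked is (1 - 1/d)^d ≈ e^-1 ≈ 0.368.
    return training_set, test_set
```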

79 CLUSTER ANALYSIS

80 What is Cluster Analysis? Clustering: the process of grouping data into classes or clusters such that the objects within a cluster have high similarity to one another, but are very dissimilar to objects in other clusters. Typical applications of clustering: in business, discovering distinct groups in customer bases and characterizing customer groups based on purchasing patterns; in biology, deriving plant and animal taxonomies and categorizing genes; etc. Clustering is also called data segmentation in some applications, because clustering partitions large data sets into groups according to their similarity.

81 What is Cluster Analysis? Clustering can also be used for outlier detection, where outliers (values that are "far away" from any cluster) may be more interesting than common cases; applications include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce. In machine learning, clustering is an example of unsupervised learning (it does not rely on predefined classes).

82 How to Compute the Dissimilarity Between Objects The dissimilarity (or similarity) between objects described by interval-scaled variables is typically computed based on the distance between each pair of objects. Euclidean distance: d(i,j) = sqrt(|x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|²). Manhattan (or city block) distance: d(i,j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|. Minkowski distance, a generalization of both Euclidean and Manhattan distance: d(i,j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + ... + |x_ip − x_jp|^q)^(1/q); q = 2 gives Euclidean distance and q = 1 gives Manhattan distance.
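All three distances reduce to a single function of q; a small sketch (my own naming):

```python
def minkowski(x, y, q):
    """Minkowski distance between two numeric vectors; q=1 is Manhattan, q=2 is Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

print(minkowski((2, 10), (5, 8), 1))   # Manhattan distance: 5
print(minkowski((2, 10), (5, 8), 2))   # Euclidean distance: ~3.606
```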

83 Centroid-Based Technique: The K-Means Method Cluster similarity is measured with regard to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity.

84 The k-means Algorithm Input: the number of clusters k and a database containing n objects. Output: a set of k clusters that minimizes the squared-error criterion. 1. Randomly select k of the objects, each of which initially represents a cluster mean or center. 2. Assign each remaining object to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. 3. Compute the new mean for each cluster. 4. Iterate until the criterion function converges.

85 The K-Means Method The criterion used is the square-error criterion, E = Σ_{i=1}^{k} Σ_{p ∈ C_i} |p − m_i|², where E is the sum of square error for all objects in the database, p is the point representing a given object, and m_i is the mean of cluster C_i. Assignment 3: Suppose the data mining task is to cluster the following eight points into three clusters: A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9). The distance function is Euclidean distance; suppose A1, B1 and C1 are assigned as the centers of the clusters.
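A compact k-means sketch for 2-D points, in the spirit of the algorithm above (my own naming; it could be used to check Assignment 3 by starting from the given centers instead of random ones):

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def k_means(points, centers, iterations=10):
    """Plain k-means on 2-D points with explicit initial centers."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: recompute each center as the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else m
            for c, m in zip(clusters, centers)
        ]
    return centers, clusters

points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
centers, clusters = k_means(points, centers=[(2, 10), (5, 8), (1, 2)])  # A1, B1, C1
```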

