KDD and Data Mining
Instructor: Dragomir R. Radev
Winter 2005
Database Processing: Fundamentals, Design, and Implementation, 9/e, by David M. Kroenke. Chapter 9. Copyright © 2004.

The big problem
- Billions of records
- A small number of interesting patterns
- "Data rich but information poor"
Data mining
- Knowledge discovery
- Knowledge extraction
- Data/pattern analysis
Types of source data
- Relational databases
- Transactional databases
- Web logs
- Textual databases
Association rules
- Example: 65% of all customers who buy beer and tomato sauce also buy pasta and chicken wings
- Association rules have the form X → Y
Association analysis
IF 20 < age < 30 AND 20K < income < 30K
THEN buys("CD player")
[support = 2%, confidence = 60%]
Basic concepts
- Minimum support threshold
- Minimum confidence threshold
- Itemsets
- Occurrence frequency of an itemset
Association rule mining
1. Find all frequent itemsets
2. Generate strong association rules from the frequent itemsets
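Step 1 above can be sketched in Python. This is a minimal, hypothetical Apriori-style miner (real tools such as Weka's Apriori add candidate pruning and smarter data structures); the function name and the level-wise loop are ours:

```python
def apriori(transactions, min_sup):
    """Return all frequent itemsets (support >= min_sup, a fraction).

    Minimal sketch: build candidate k-itemsets as unions of frequent
    (k-1)-itemsets, then keep those that meet the support threshold.
    """
    n = len(transactions)
    tsets = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in tsets if itemset <= t) / n

    # Level 1: frequent single items
    items = {i for t in tsets for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}
    result, k = set(frequent), 2
    while frequent:
        # Join step: unions of frequent (k-1)-itemsets that have size k
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        frequent = {c for c in candidates if support(c) >= min_sup}
        result |= frequent
        k += 1
    return result

# The four transactions of Example 2 on a later slide, with min_sup = 60%:
freq = apriori([{"K", "A", "D", "B"}, {"D", "A", "C", "E", "B"},
                {"C", "A", "B", "E"}, {"B", "A", "D"}], 0.6)
print(sorted("".join(sorted(s)) for s in freq))
# ['A', 'AB', 'ABD', 'AD', 'B', 'BD', 'D']
```

Only A, B, and D (and their combinations) reach 60% support; C, E, and K appear in at most half of the transactions and are pruned at level 1.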
Support and confidence
- Support(X) = fraction of transactions that contain X
- Confidence(X → Y) = Support(X ∪ Y) / Support(X)
Example
TID  | List of item IDs
T100 | I1, I2, I5
T200 | I2, I4
T300 | I2, I3
T400 | I1, I2, I4
T500 | I1, I3
T600 | I2, I3
T700 | I1, I3
T800 | I1, I2, I3, I5
T900 | I1, I2, I3
Example (cont'd)
Frequent itemset l = {I1, I2, I5}
- I1 AND I2 → I5, confidence = 2/4 = 50%
- I1 AND I5 → I2, confidence = 2/2 = 100%
- I2 AND I5 → I1, confidence = 2/2 = 100%
- I1 → I2 AND I5, confidence = 2/6 = 33%
- I2 → I1 AND I5, confidence = 2/7 = 29%
- I5 → I1 AND I2, confidence = 2/2 = 100%
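These confidences can be checked directly against the nine transactions in the table. A small sketch (the helper names `count` and `confidence` are ours):

```python
# Transactions T100..T900 from the example table
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def count(itemset):
    """Occurrence frequency: number of transactions containing itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(lhs, rhs):
    """Confidence(lhs -> rhs) = support(lhs | rhs) / support(lhs)."""
    return count(lhs | rhs) / count(lhs)

# The six rules derivable from the frequent itemset {I1, I2, I5}:
for lhs, rhs in [({"I1", "I2"}, {"I5"}), ({"I1", "I5"}, {"I2"}),
                 ({"I2", "I5"}, {"I1"}), ({"I1"}, {"I2", "I5"}),
                 ({"I2"}, {"I1", "I5"}), ({"I5"}, {"I1", "I2"})]:
    print(sorted(lhs), "->", sorted(rhs), round(confidence(lhs, rhs), 2))
# confidences: 0.5, 1.0, 1.0, 0.33, 0.29, 1.0
```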
Example 2
TID  | Date     | Items
T100 | 10/15/99 | {K, A, D, B}
T200 | 10/15/99 | {D, A, C, E, B}
T300 | 10/19/99 | {C, A, B, E}
T400 | 10/22/99 | {B, A, D}
min_sup = 60%, min_conf = 80%
Correlations
Corr(A, B) = P(A AND B) / (P(A) × P(B))
If Corr < 1: A discourages B (negative correlation)
This quantity is also the lift of the association rule A → B
Contingency table
       | game  | ^game | Sum
video  | 4,000 | 3,500 | 7,500
^video | 2,000 |   500 | 2,500
Sum    | 6,000 | 4,000 | 10,000
Example
P({game}) = 0.60
P({video}) = 0.75
P({game, video}) = 0.40
P({game, video}) / (P({game}) × P({video})) = 0.40 / (0.60 × 0.75) = 0.89
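The lift follows directly from the contingency table; a quick check (counts taken from the table, variable names are ours):

```python
n = 10_000                       # total transactions in the contingency table
n_game, n_video, n_both = 6_000, 7_500, 4_000

p_game, p_video, p_both = n_game / n, n_video / n, n_both / n
lift = p_both / (p_game * p_video)
print(round(lift, 2))            # 0.89: < 1, so game and video are negatively correlated
```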
Example 2
            | hotdogs | ^hotdogs | Sum
hamburgers  |         |          |
^hamburgers |         |          |
Sum         |         |          |
Classification using decision trees
Expected information need:
I(s1, s2, …, sm) = − Σᵢ pᵢ log₂(pᵢ)
where s = number of data samples, m = number of classes, pᵢ = sᵢ / s
RID | Age    | Income | Student | Credit    | Buys?
1   | <=30   | High   | No      | Fair      | No
2   | <=30   | High   | No      | Excellent | No
3   | 31..40 | High   | No      | Fair      | Yes
4   | >40    | Medium | No      | Fair      | Yes
5   | >40    | Low    | Yes     | Fair      | Yes
6   | >40    | Low    | Yes     | Excellent | No
7   | 31..40 | Low    | Yes     | Excellent | Yes
8   | <=30   | Medium | No      | Fair      | No
9   | <=30   | Low    | Yes     | Fair      | Yes
10  | >40    | Medium | Yes     | Fair      | Yes
11  | <=30   | Medium | Yes     | Excellent | Yes
12  | 31..40 | Medium | No      | Excellent | Yes
13  | 31..40 | High   | Yes     | Fair      | Yes
14  | >40    | Medium | No      | Excellent | No
Decision tree induction
I(s1, s2) = I(9, 5)
= − 9/14 log₂(9/14) − 5/14 log₂(5/14)
= 0.940
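A small helper makes this value easy to check (a sketch; `info` is our name for I(s1, …, sm)):

```python
from math import log2

def info(*counts):
    """Expected information I(s1, ..., sm) = -sum_i p_i * log2(p_i)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(f"{info(9, 5):.3f}")  # 0.940
```

Note the boundary cases: a pure class distribution such as I(4, 0) gives 0 (no information needed), while an even split such as I(7, 7) gives exactly 1 bit.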
Entropy and information gain
E(A) = Σⱼ [(s1j + … + smj) / s] × I(s1j, …, smj)
Entropy = expected information based on the partitioning into subsets by A
Gain(A) = I(s1, s2, …, sm) − E(A)
Entropy
Age <= 30: s11 = 2, s21 = 3, I(s11, s21) = 0.971
Age in 31..40: s12 = 4, s22 = 0, I(s12, s22) = 0
Age > 40: s13 = 3, s23 = 2, I(s13, s23) = 0.971
Entropy (cont'd)
E(age) = 5/14 I(s11, s21) + 4/14 I(s12, s22) + 5/14 I(s13, s23) = 0.694
Gain(age) = I(s1, s2) − E(age) = 0.940 − 0.694 = 0.246
Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit) = 0.048
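E(age) and Gain(age) can be recomputed from the (yes, no) counts per age bucket in the table. A sketch (note that exact arithmetic gives Gain(age) ≈ 0.247; 0.246 results from subtracting the rounded intermediate values):

```python
from math import log2

def info(*counts):
    """I(s1, ..., sm) = -sum_i p_i * log2(p_i)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# (yes, no) counts in each age bucket of the 14-sample table
buckets = {"<=30": (2, 3), "31..40": (4, 0), ">40": (3, 2)}
s = 14
e_age = sum(sum(c) / s * info(*c) for c in buckets.values())
gain_age = info(9, 5) - e_age
print(f"E(age) = {e_age:.3f}, Gain(age) = {gain_age:.3f}")
# E(age) = 0.694, Gain(age) = 0.247
```

Since age has the largest gain of the four attributes, it becomes the root of the decision tree.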
Final decision tree
age <= 30 → student? (no → no; yes → yes)
age 31..40 → yes
age > 40 → credit? (excellent → no; fair → yes)
Other techniques: Bayesian classifiers
X: age <= 30, income = medium, student = yes, credit = fair
P(yes) = 9/14 = 0.643
P(no) = 5/14 = 0.357
Example
P(age <= 30 | yes) = 2/9 = 0.222
P(age <= 30 | no) = 3/5 = 0.600
P(income = medium | yes) = 4/9 = 0.444
P(income = medium | no) = 2/5 = 0.400
P(student = yes | yes) = 6/9 = 0.667
P(student = yes | no) = 1/5 = 0.200
P(credit = fair | yes) = 6/9 = 0.667
P(credit = fair | no) = 2/5 = 0.400
Example (cont'd)
P(X | yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
P(X | yes) P(yes) = 0.044 × 0.643 = 0.028
P(X | no) P(no) = 0.019 × 0.357 = 0.007
Answer: yes/no?
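The whole naive Bayes decision fits in a few lines; a sketch using the fractions from the previous slides (variable names are ours):

```python
# Class priors
p_yes, p_no = 9 / 14, 5 / 14          # 0.643, 0.357

# P(X | class): product of the four conditional probabilities for
# X = (age <= 30, income = medium, student = yes, credit = fair)
px_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)   # ~ 0.044
px_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)    # ~ 0.019

score_yes = px_yes * p_yes            # ~ 0.028
score_no = px_no * p_no               # ~ 0.007
print("yes" if score_yes > score_no else "no")   # yes -- the larger score wins
```

So the classifier predicts that X buys a computer: 0.028 > 0.007.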
Predictive models
- Inputs (e.g., medical history, age)
- Output (e.g., will the patient experience any side effects?)
- Some models are better than others
Principles of data mining
- Training/test sets
- Error analysis and overfitting
- Cross-validation
- Supervised vs. unsupervised methods
[Figure: training and test error curves as a function of input size]
Representing data
- Vector space
[Figure: points in (salary, credit) space labeled "pay off" / "default"]
Decision surfaces
[Figure: a decision surface separating "pay off" from "default" points in (salary, credit) space]
Decision trees
[Figure: axis-parallel decision-tree boundaries separating "pay off" from "default" in (salary, credit) space]
Linear boundary
[Figure: a linear boundary separating "pay off" from "default" in (salary, credit) space]
kNN models
- Assign each element the majority label among its k nearest neighbors
- Demo: 2.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html
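A minimal kNN classifier in Python, in the spirit of the salary/credit examples above (the data points are made up; distance is plain squared Euclidean):

```python
def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    train: list of ((x, y), label) pairs.
    """
    nearest = sorted(train, key=lambda p: (p[0][0] - query[0]) ** 2
                                        + (p[0][1] - query[1]) ** 2)[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

# Made-up (salary, credit score) points labeled "pay off" / "default"
train = [((30, 600), "default"), ((35, 620), "default"),
         ((80, 750), "pay off"), ((90, 780), "pay off"), ((85, 760), "pay off")]
print(knn_predict(train, (82, 755)))  # pay off
```

Note that kNN draws no explicit decision surface: the boundary is implicit in the stored examples, which is why it is called a "lazy" method (cf. Weka's lazy.IBk below).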
Other methods
- Decision trees
- Neural networks
- Support vector machines
- Demos: ost/
arff
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
Weka
Methods:
- rules.ZeroR
- bayes.NaiveBayes
- trees.j48.J48
- lazy.IBk
- trees.DecisionStump
kMeans clustering
tware.html
java weka.clusterers.SimpleKMeans -t data/weather.arff
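What SimpleKMeans does can be sketched in a few lines of Python (Lloyd's algorithm on 2-D points; the data and parameters here are made up for illustration):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x, y in points:
            j = min(range(k),
                    key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2)
            clusters[j].append((x, y))
        # Recompute centroids; keep the old one if a cluster went empty
        centroids = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                     if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 0), (8, 8), (9, 9), (8, 9)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]: the two natural groups
```

Unlike the classifiers above, this is unsupervised: no labels are used, only distances between points.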
More useful pointers
More types of data mining
- Classification and prediction
- Cluster analysis
- Outlier analysis
- Evolution analysis