Fundamentals, Design, and Implementation, 9/e KDD and Data Mining Instructor: Dragomir R. Radev Winter 2005.


1 Fundamentals, Design, and Implementation, 9/e KDD and Data Mining Instructor: Dragomir R. Radev Winter 2005

2 The big problem
• Billions of records
• A small number of interesting patterns
• “Data rich but information poor”
(Slide footer: Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke, Chapter 9/2, Copyright © 2004)

3 Data mining
• Knowledge discovery
• Knowledge extraction
• Data/pattern analysis

4 Types of source data
• Relational databases
• Transactional databases
• Web logs
• Textual databases

5 Association rules
• 65% of all customers who buy beer and tomato sauce also buy pasta and chicken wings
• Association rules: X → Y

6 Association analysis
• IF 20 < age < 30 AND 20K < income < 30K
• THEN
  – Buys (“CD player”)
• SUPPORT = 2%, CONFIDENCE = 60%

7 Basic concepts
• Minimum support threshold
• Minimum confidence threshold
• Itemsets
• Occurrence frequency of an itemset

8 Association rule mining
• Find all frequent itemsets
• Generate strong association rules from the frequent itemsets

9 Support and confidence
• Support(X)
• Confidence(X → Y) = Support(X ∪ Y) / Support(X)

10 Example
TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

11 Example (cont’d)
• Frequent itemset l = {I1, I2, I5}
• I1 AND I2 → I5   C = 2/4 = 50%
• I1 AND I5 → I2
• I2 AND I5 → I1
• I1 → I2 AND I5
• I2 → I1 AND I5
• I5 → I1 AND I2
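The confidence values left open on this slide can be checked mechanically. The following is a minimal sketch, assuming the nine transactions from the preceding example slide; the helper names `support_count` and `confidence` are illustrative and do not come from the slides.

```python
from itertools import combinations

# Transactions from the "Example" slide (TID -> set of item IDs).
transactions = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
    "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"},
    "T500": {"I1", "I3"},
    "T600": {"I2", "I3"},
    "T700": {"I1", "I3"},
    "T800": {"I1", "I2", "I3", "I5"},
    "T900": {"I1", "I2", "I3"},
}


def support_count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)


def confidence(antecedent, consequent):
    """Confidence(X -> Y) = Support(X u Y) / Support(X)."""
    return support_count(antecedent | consequent) / support_count(antecedent)


# Enumerate every rule lhs -> rhs that splits the frequent itemset in two.
frequent = frozenset({"I1", "I2", "I5"})
rules = {}
for size in range(1, len(frequent)):
    for lhs in combinations(sorted(frequent), size):
        lhs = frozenset(lhs)
        rhs = frequent - lhs
        rules[(tuple(sorted(lhs)), tuple(sorted(rhs)))] = confidence(lhs, rhs)

for (lhs, rhs), conf in rules.items():
    print(f"{' AND '.join(lhs)} -> {' AND '.join(rhs)}: {conf:.0%}")
```

Under these assumptions the sketch reports 50% for I1 AND I2 → I5, which matches the value on the slide.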

12 Example 2
TID    date       items
T100   10/15/99   {K, A, D, B}
T200   10/15/99   {D, A, C, E, B}
T300   10/19/99   {C, A, B, E}
T400   10/22/99   {B, A, D}
min_sup = 60%, min_conf = 80%
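The two mining steps (find frequent itemsets, then generate strong rules) can be tried on this small example. The sketch below enumerates every candidate itemset by brute force, which is only reasonable because there are six distinct items; it is not an Apriori implementation, and the variable names are illustrative.

```python
from itertools import combinations

# Item sets of the four transactions from the "Example 2" slide.
transactions = [
    {"K", "A", "D", "B"},
    {"D", "A", "C", "E", "B"},
    {"C", "A", "B", "E"},
    {"B", "A", "D"},
]
MIN_SUP = 0.60
MIN_CONF = 0.80


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


items = sorted(set().union(*transactions))

# Step 1: find all frequent itemsets, smallest first.
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        s = support(frozenset(combo))
        if s >= MIN_SUP:
            frequent[frozenset(combo)] = s

# Step 2: generate strong rules lhs -> rhs from each frequent itemset.
# Every subset of a frequent itemset is itself frequent, so the
# lookup frequent[lhs] always succeeds.
strong_rules = []
for itemset, sup in frequent.items():
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            conf = sup / frequent[lhs]
            if conf >= MIN_CONF:
                strong_rules.append((sorted(lhs), sorted(itemset - lhs), conf))

for lhs, rhs, conf in strong_rules:
    print(f"{lhs} -> {rhs} (confidence {conf:.0%})")
```

With four transactions, min_sup = 60% means an itemset must appear in at least three of them, so only combinations of A, B, and D survive step 1.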

13 Correlations
• Corr(A, B) = P(A AND B) / (P(A) P(B))
• If Corr < 1: A discourages B (negative correlation)
• (lift of the association rule A → B)

14 Contingency table
          Game     ^Game    Sum
Video     4,000    3,500    7,500
^Video    2,000      500    2,500
Sum       6,000    4,000   10,000

15 Example
• P({game}) = 0.60
• P({video}) = 0.75
• P({game, video}) = 0.40
• P({game, video}) / (P({game}) × P({video})) = 0.40 / (0.60 × 0.75) = 0.89

16 Example 2
               hotdogs   ^hotdogs   Sum
hamburgers     2000       500       2500
^hamburgers    1000      1500       2500
Sum            3000      2000       5000
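Both contingency tables can be run through the same correlation (lift) calculation. This is a small sketch; the function name `lift` and its parameter names are illustrative, and the counts are read from the two tables above.

```python
def lift(n_both, n_a, n_b, n_total):
    """Lift of A -> B: P(A and B) / (P(A) * P(B)), from raw counts."""
    p_a = n_a / n_total
    p_b = n_b / n_total
    p_both = n_both / n_total
    return p_both / (p_a * p_b)


# Game/video table: 4,000 buy both, 6,000 buy games, 7,500 buy videos.
game_video = lift(n_both=4000, n_a=6000, n_b=7500, n_total=10000)

# Hotdog/hamburger table: 2,000 buy both, 3,000 hotdogs, 2,500 hamburgers.
hotdogs_hamburgers = lift(n_both=2000, n_a=3000, n_b=2500, n_total=5000)

print(f"game/video lift: {game_video:.2f}")
print(f"hotdogs/hamburgers lift: {hotdogs_hamburgers:.2f}")
```

The first value is below 1, the negative correlation described on the Correlations slide; the second is above 1, so hotdog and hamburger purchases occur together more often than independence would predict.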

17 Classification using decision trees
• Expected information need
• I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i
• s = data samples
• m = number of classes
• p_i = s_i / s = proportion of samples in class i

18
RID   Age      Income   Student   Credit      Buys?
1     <= 30    High     No        Fair        No
2     <= 30    High     No        Excellent   No
3     31..40   High     No        Fair        Yes
4     > 40     Medium   No        Fair        Yes
5     > 40     Low      Yes       Fair        Yes
6     > 40     Low      Yes       Excellent   No
7     31..40   Low      Yes       Excellent   Yes
8     <= 30    Medium   No        Fair        No
9     <= 30    Low      Yes       Fair        Yes
10    > 40     Medium   Yes       Fair        Yes
11    <= 30    Medium   Yes       Excellent   Yes
12    31..40   Medium   No        Excellent   Yes
13    31..40   High     Yes       Fair        Yes
14    > 40     Medium   No        Excellent   No

19 Decision tree induction
• I(s_1, s_2) = I(9, 5) = -(9/14) log_2(9/14) - (5/14) log_2(5/14) = 0.940

20 Entropy and information gain
E(A) = \sum_j \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})
Entropy = expected information based on the partitioning into subsets by A
Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)

21 Entropy
• Age <= 30: s_{11} = 2, s_{21} = 3, I(s_{11}, s_{21}) = 0.971
• Age in 31..40: s_{12} = 4, s_{22} = 0, I(s_{12}, s_{22}) = 0
• Age > 40: s_{13} = 3, s_{23} = 2, I(s_{13}, s_{23}) = 0.971

22 Entropy (cont’d)
• E(age) = (5/14) I(s_{11}, s_{21}) + (4/14) I(s_{12}, s_{22}) + (5/14) I(s_{13}, s_{23}) = 0.694
• Gain(age) = I(s_1, s_2) - E(age) = 0.246
• Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit) = 0.048
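The expected information, entropy, and gain figures can be recomputed from the 14-row table. The sketch below assumes base-2 logarithms (consistent with I(9, 5) = 0.940); the function names `info`, `expected_info`, and `gain` are illustrative. Last-digit differences from the slide values come from rounding intermediate results.

```python
from collections import Counter
from math import log2

COLUMNS = ("age", "income", "student", "credit", "buys")
# The 14 training rows from the table slide, in RID order.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
data = [dict(zip(COLUMNS, row)) for row in ROWS]


def info(rows):
    """I(s_1, ..., s_m): expected information over the class labels."""
    counts = Counter(r["buys"] for r in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())


def expected_info(rows, attribute):
    """E(A): weighted information of the subsets induced by `attribute`."""
    total = len(rows)
    result = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r for r in rows if r[attribute] == value]
        result += len(subset) / total * info(subset)
    return result


def gain(rows, attribute):
    """Gain(A) = I(s_1, ..., s_m) - E(A)."""
    return info(rows) - expected_info(rows, attribute)


gains = {a: gain(data, a) for a in COLUMNS[:-1]}
for attribute, g in gains.items():
    print(f"Gain({attribute}) = {g:.3f}")
print("Split on:", max(gains, key=gains.get))
```

Age has the largest gain, which is why it is chosen as the root of the tree.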

23 Final decision tree
age?
  <= 30: student?
    no: no
    yes: yes
  31..40: yes
  > 40: credit?
    excellent: no
    fair: yes

24 Other techniques
• Bayesian classifiers
• X: age <= 30, income = medium, student = yes, credit = fair
• P(yes) = 9/14 = 0.643
• P(no) = 5/14 = 0.357

25 Example
• P(age <= 30 | yes) = 2/9 = 0.222
• P(age <= 30 | no) = 3/5 = 0.600
• P(income = medium | yes) = 4/9 = 0.444
• P(income = medium | no) = 2/5 = 0.400
• P(student = yes | yes) = 6/9 = 0.667
• P(student = yes | no) = 1/5 = 0.200
• P(credit = fair | yes) = 6/9 = 0.667
• P(credit = fair | no) = 2/5 = 0.400

26 Example (cont’d)
• P(X | yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
• P(X | no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
• P(X | yes) P(yes) = 0.044 × 0.643 = 0.028
• P(X | no) P(no) = 0.019 × 0.357 = 0.007
• Answer: yes/no?
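The naive Bayes calculation can be reproduced directly from the 14-row table. This is a minimal sketch without smoothing, assuming conditional independence of the attributes given the class; `posterior_score` is an illustrative name.

```python
COLUMNS = ("age", "income", "student", "credit", "buys")
# The 14 training rows from the table slide, in RID order.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
data = [dict(zip(COLUMNS, row)) for row in ROWS]


def posterior_score(x, label):
    """P(X | label) * P(label), with each factor estimated by counting."""
    in_class = [r for r in data if r["buys"] == label]
    score = len(in_class) / len(data)  # prior P(label)
    for attribute, value in x.items():
        matches = sum(1 for r in in_class if r[attribute] == value)
        score *= matches / len(in_class)  # P(attribute = value | label)
    return score


x = {"age": "<=30", "income": "medium", "student": "yes", "credit": "fair"}
scores = {label: posterior_score(x, label) for label in ("yes", "no")}
prediction = max(scores, key=scores.get)

for label, score in scores.items():
    print(f"P(X | {label}) P({label}) = {score:.3f}")
print("Prediction:", prediction)
```

The score for yes (about 0.028) exceeds the score for no (about 0.007), so the classifier predicts that X buys.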

27 Predictive models
• Inputs (e.g., medical history, age)
• Output (e.g., will patient experience any side effects)
• Some models are better than others

28 Principles of data mining
• Training/test sets
• Error analysis and overfitting
• Cross-validation
• Supervised vs. unsupervised methods
[Figure: error versus input size, with training and test curves]
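As an illustration of the training/test and cross-validation bullets, the sketch below splits sample indices into k folds, each fold serving once as the test set. The function name `k_fold_indices` is hypothetical, and the indices are not shuffled here; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Example: 14 samples (the size of the buys? table) split into 4 folds.
folds = list(k_fold_indices(14, 4))
for train, test in folds:
    print(f"train on {len(train)} samples, test on {test}")
```

A model would be trained on each `train` list and evaluated on the matching `test` list, with the errors averaged over the folds.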

29 Representing data
• Vector space
[Figure: plot with axes salary and credit; classes: pay off, default]

30 Decision surfaces
[Figure: plot with axes salary and credit; classes: pay off, default]

31 Decision trees
[Figure: plot with axes salary and credit; classes: pay off, default]

32 Linear boundary
[Figure: plot with axes salary and credit; classes: pay off, default]

33 kNN models
• Assign each element to the closest cluster
• Demos:
  – http://www-2.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html
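In its common formulation, kNN labels a query point by majority vote among its k nearest labelled training points. The sketch below uses the salary/credit feature space from the earlier slides; the training points are hypothetical values invented for illustration, and `knn_predict` is an illustrative name.

```python
from collections import Counter
from math import dist

# Hypothetical training points: (salary, credit) -> class label.
training = [
    ((30.0, 55.0), "default"),
    ((35.0, 60.0), "default"),
    ((40.0, 58.0), "default"),
    ((70.0, 72.0), "pay off"),
    ((80.0, 75.0), "pay off"),
    ((90.0, 70.0), "pay off"),
]


def knn_predict(query, k=3):
    """Return the majority label among the k training points nearest to query."""
    neighbours = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


print(knn_predict((75.0, 73.0)))
print(knn_predict((33.0, 57.0)))
```

Euclidean distance is used here; features on very different scales would normally be rescaled first so that one axis does not dominate the distance.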

34 Other methods
• Decision trees
• Neural networks
• Support vector machines
• Demos
  – http://www.cs.technion.ac.il/~rani/LocBoost/

35 arff files
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

36 Weka
http://www.cs.waikato.ac.nz/ml/weka
Methods:
• rules.ZeroR
• bayes.NaiveBayes
• trees.j48.J48
• lazy.IBk
• trees.DecisionStump

37 kMeans clustering
• http://www.cc.gatech.edu/~dellaert/html/software.html
• java weka.clusterers.SimpleKMeans -t data/weather.arff

38 More useful pointers
• http://www.kdnuggets.com/
• http://www.twocrows.com/booklet.htm

39 More types of data mining
• Classification and prediction
• Cluster analysis
• Outlier analysis
• Evolution analysis

