Fundamentals, Design, and Implementation, 9/e KDD and Data Mining Instructor: Dragomir R. Radev Winter 2005.

Presentation transcript:

Fundamentals, Design, and Implementation, 9/e KDD and Data Mining Instructor: Dragomir R. Radev Winter 2005

Database Processing: Fundamentals, Design, and Implementation, 9/e by David M. Kroenke, Chapter 9/2, Copyright © 2004
The big problem
• Billions of records
• A small number of interesting patterns
• “Data rich but information poor”

Data mining
• Knowledge discovery
• Knowledge extraction
• Data/pattern analysis

Types of source data
• Relational databases
• Transactional databases
• Web logs
• Textual databases

Association rules
• 65% of all customers who buy beer and tomato sauce also buy pasta and chicken wings
• Association rules: X ⇒ Y

Association analysis
• IF 20 < age < 30 AND 20K < income < 30K
• THEN buys(“CD player”)
• SUPPORT = 2%, CONFIDENCE = 60%

Basic concepts
• Minimum support threshold
• Minimum confidence threshold
• Itemsets
• Occurrence frequency of an itemset

Association rule mining
• Find all frequent itemsets
• Generate strong association rules from the frequent itemsets

Support and confidence
• $\mathrm{Support}(X)$
• $\mathrm{Confidence}(X \Rightarrow Y) = \dfrac{\mathrm{Support}(X \cup Y)}{\mathrm{Support}(X)}$, where $X \cup Y$ denotes transactions containing both X and Y

Example
TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Example (cont’d)
• Frequent itemset l = {I1, I2, I5}
• I1 AND I2 ⇒ I5, C = 2/4 = 50%
• I1 AND I5 ⇒ I2
• I2 AND I5 ⇒ I1
• I1 ⇒ I2 AND I5
• I2 ⇒ I1 AND I5
• I5 ⇒ I1 AND I2
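The confidence of each rule that can be generated from the frequent itemset {I1, I2, I5} can be checked with a short script. This is a minimal sketch over the nine transactions of the example; the function names (`support_count`, `confidence`, `rules_from`) are illustrative and not taken from the slides.

```python
from itertools import combinations

# Transactions copied from the example (TID -> set of item IDs).
transactions = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
    "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"},
    "T500": {"I1", "I3"},
    "T600": {"I2", "I3"},
    "T700": {"I1", "I3"},
    "T800": {"I1", "I2", "I3", "I5"},
    "T900": {"I1", "I2", "I3"},
}


def support_count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions.values() if items <= t)


def confidence(antecedent, consequent):
    """Confidence(X => Y) = Support(X and Y together) / Support(X)."""
    both = set(antecedent) | set(consequent)
    return support_count(both) / support_count(antecedent)


def rules_from(itemset):
    """Yield (antecedent, consequent, confidence) for every non-trivial split."""
    items = sorted(itemset)
    for r in range(1, len(items)):
        for antecedent in combinations(items, r):
            consequent = tuple(i for i in items if i not in antecedent)
            yield antecedent, consequent, confidence(antecedent, consequent)


rule_table = list(rules_from({"I1", "I2", "I5"}))
for antecedent, consequent, conf in rule_table:
    print(f"{' AND '.join(antecedent)} => {' AND '.join(consequent)}: {conf:.0%}")
```

Comparing each printed confidence against a minimum confidence threshold then decides which of the six rules count as strong.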

Example 2
TID    Date       Items
T100   10/15/99   {K, A, D, B}
T200   10/15/99   {D, A, C, E, B}
T300   10/19/99   {C, A, B, E}
T400   10/22/99   {B, A, D}
min_sup = 60%, min_conf = 80%
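The two-step procedure (find the frequent itemsets, then generate strong rules) can be sketched for this second example. The code below uses a simple level-wise search in the spirit of Apriori; the slides do not name an algorithm, so that choice and all function names are assumptions made for illustration.

```python
from itertools import combinations

# The four transactions of Example 2.
transactions = [
    {"K", "A", "D", "B"},
    {"D", "A", "C", "E", "B"},
    {"C", "A", "B", "E"},
    {"B", "A", "D"},
]
MIN_SUP = 0.60
MIN_CONF = 0.80


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def frequent_itemsets():
    """Level-wise search: build larger candidates only from frequent itemsets."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items if support([i]) >= MIN_SUP]
    found = []
    while level:
        found.extend(level)
        # Join step: unions of two frequent k-itemsets that have exactly k+1 items.
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= MIN_SUP]
    return found


def strong_rules(itemsets):
    """Return (antecedent, consequent, confidence) for rules meeting MIN_CONF."""
    rules = []
    for itemset in itemsets:
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                conf = support(itemset) / support(antecedent)
                if conf >= MIN_CONF:
                    consequent = tuple(sorted(itemset - set(antecedent)))
                    rules.append((antecedent, consequent, conf))
    return rules


frequent = frequent_itemsets()
rules = strong_rules(frequent)
for antecedent, consequent, conf in rules:
    print(f"{set(antecedent)} => {set(consequent)} (confidence {conf:.0%})")
```

With four transactions, a 60% minimum support means an itemset must appear in at least three of them, which is why C, E, and K drop out at the first level.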

Correlations
• $\mathrm{Corr}(A, B) = \dfrac{P(A \wedge B)}{P(A)\,P(B)}$
• If Corr < 1: A discourages B (negative correlation)
• Corr(A, B) is also called the lift of the association rule A ⇒ B

Contingency table
          Game     ^Game    Sum
Video     4,000    3,500    7,500
^Video    2,000      500    2,500
Sum       6,000    4,000   10,000

Example
• P({game}) = 0.60
• P({video}) = 0.75
• P({game, video}) = 0.40
• P({game, video}) / (P({game}) × P({video})) = 0.40 / (0.60 × 0.75) = 0.89
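The same correlation (lift) value can be computed directly from the counts in the game/video contingency table. This is a small sketch; the function and variable names are illustrative.

```python
# Counts taken from the game/video contingency table.
n_total = 10_000
n_game = 6_000
n_video = 7_500
n_game_and_video = 4_000


def lift(n_a, n_b, n_ab, n):
    """P(A and B) / (P(A) * P(B)), computed from raw counts."""
    p_a, p_b, p_ab = n_a / n, n_b / n, n_ab / n
    return p_ab / (p_a * p_b)


corr = lift(n_game, n_video, n_game_and_video, n_total)
print(f"lift(game, video) = {corr:.2f}")
```

A value below 1 indicates that buying one item makes buying the other less likely than it would be under independence.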

Example 2
              hotdogs    ^hotdogs    Sum
hamburgers
^hamburgers
Sum

Classification using decision trees
• Expected information need: $I(s_1, s_2, \dots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i$
• s = data samples
• m = number of classes
• $p_i = s_i / s$, the fraction of samples that belong to class i

RID   Age      Income   Student   Credit      Buys?
1     <= 30    High     No        Fair        No
2     <= 30    High     No        Excellent   No
3     31..40   High     No        Fair        Yes
4     > 40     Medium   No        Fair        Yes
5     > 40     Low      Yes       Fair        Yes
6     > 40     Low      Yes       Excellent   No
7     31..40   Low      Yes       Excellent   Yes
8     <= 30    Medium   No        Fair        No
9     <= 30    Low      Yes       Fair        Yes
10    > 40     Medium   Yes       Fair        Yes
11    <= 30    Medium   Yes       Excellent   Yes
12    31..40   Medium   No        Excellent   Yes
13    31..40   High     Yes       Fair        Yes
14    > 40     Medium   No        Excellent   No

Decision tree induction
• $I(s_1, s_2) = I(9, 5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$

Entropy and information gain
• $E(A) = \sum_{j} \dfrac{s_{1j} + \dots + s_{mj}}{s}\, I(s_{1j}, \dots, s_{mj})$, summed over the subsets j produced by attribute A
• Entropy = expected information based on the partitioning into subsets by A
• $\mathrm{Gain}(A) = I(s_1, s_2, \dots, s_m) - E(A)$

Entropy
• Age <= 30: $s_{11} = 2$, $s_{21} = 3$, $I(s_{11}, s_{21}) = 0.971$
• Age in 31..40: $s_{12} = 4$, $s_{22} = 0$, $I(s_{12}, s_{22}) = 0$
• Age > 40: $s_{13} = 3$, $s_{23} = 2$, $I(s_{13}, s_{23}) = 0.971$

Entropy (cont’d)
• $E(\text{age}) = \frac{5}{14} I(s_{11}, s_{21}) + \frac{4}{14} I(s_{12}, s_{22}) + \frac{5}{14} I(s_{13}, s_{23}) = 0.694$
• $\mathrm{Gain}(\text{age}) = I(s_1, s_2) - E(\text{age}) = 0.940 - 0.694 = 0.246$
• Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit) = 0.048
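The information gain of every attribute can be recomputed from the 14 training records. The sketch below assumes the age label "31..40" for records 3, 7, 12, and 13, whose age value is missing in the transcript; the function names are illustrative. Results may differ from the slide values in the third decimal because the slides subtract already rounded numbers.

```python
from collections import Counter
from math import log2

COLUMNS = ("age", "income", "student", "credit")
# (age, income, student, credit, buys) for records 1..14.
# The "31..40" age label is an assumption (see the note above).
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def info(labels):
    """Expected information I(s_1, ..., s_m) of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def expected_info(rows, col):
    """E(A): weighted information of the subsets produced by column `col`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row[-1])
    return sum(len(g) / len(rows) * info(g) for g in groups.values())


def gain(rows, col):
    """Gain(A) = I(s_1, ..., s_m) - E(A)."""
    return info([r[-1] for r in rows]) - expected_info(rows, col)


gains = {name: gain(ROWS, i) for i, name in enumerate(COLUMNS)}
for name, value in gains.items():
    print(f"Gain({name}) = {value:.3f}")
print("best split:", max(gains, key=gains.get))
```

The attribute with the largest gain is chosen as the test at the root of the tree, and the procedure is then repeated on each subset.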

Final decision tree
age?
  <= 30: student?
    no: no
    yes: yes
  31..40: yes
  > 40: credit?
    excellent: no
    fair: yes
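A decision tree of this kind can be represented as nested data and applied to a new record. The sketch below encodes a tree with age at the root, a student test for age <= 30, and a credit test for age > 40, which is the tree consistent with the 14 training records; the "31..40" branch label and the `TREE`/`classify` names are assumptions made for illustration.

```python
# Inner nodes are (attribute, {value: subtree}); leaves are class labels.
TREE = ("age", {
    "<=30": ("student", {"no": "no", "yes": "yes"}),
    "31..40": "yes",
    ">40": ("credit", {"excellent": "no", "fair": "yes"}),
})


def classify(record, node=TREE):
    """Follow the branch matching the record's attribute value until a leaf."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[record[attribute]]
    return node


x = {"age": "<=30", "income": "medium", "student": "yes", "credit": "fair"}
print("buys?", classify(x))
```

Only the attributes tested along the path matter, so income never influences the prediction in this tree.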

Other techniques
• Bayesian classifiers
• X: age <= 30, income = medium, student = yes, credit = fair
• P(yes) = 9/14 = 0.643
• P(no) = 5/14 = 0.357

Example
• P(age <= 30 | yes) = 2/9 = 0.222
• P(age <= 30 | no) = 3/5 = 0.600
• P(income = medium | yes) = 4/9 = 0.444
• P(income = medium | no) = 2/5 = 0.400
• P(student = yes | yes) = 6/9 = 0.667
• P(student = yes | no) = 1/5 = 0.200
• P(credit = fair | yes) = 6/9 = 0.667
• P(credit = fair | no) = 2/5 = 0.400

Example (cont’d)
• P(X | yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
• P(X | no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
• P(X | yes) P(yes) = 0.044 × 0.643 = 0.028
• P(X | no) P(no) = 0.019 × 0.357 = 0.007
• Answer: yes/no?
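The naive Bayes calculation can be reproduced from the training records by counting. This is a minimal sketch without smoothing; it assumes the age label "31..40" for the four records whose age is missing in the transcript (this does not affect the result for X), and the function names are illustrative.

```python
COLUMNS = {"age": 0, "income": 1, "student": 2, "credit": 3}
# (age, income, student, credit, buys) for records 1..14.
# The "31..40" age label is an assumption.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def prior(label):
    """P(class): fraction of records with this class label."""
    return sum(1 for r in ROWS if r[-1] == label) / len(ROWS)


def likelihood(attr, value, label):
    """P(attr = value | class), estimated by counting within the class."""
    in_class = [r for r in ROWS if r[-1] == label]
    return sum(1 for r in in_class if r[COLUMNS[attr]] == value) / len(in_class)


def score(x, label):
    """P(X | class) * P(class) under the attribute-independence assumption."""
    p = prior(label)
    for attr, value in x.items():
        p *= likelihood(attr, value, label)
    return p


x = {"age": "<=30", "income": "medium", "student": "yes", "credit": "fair"}
scores = {label: score(x, label) for label in ("yes", "no")}
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)
```

The class with the larger product P(X | class) P(class) is the prediction; the two scores are not normalized probabilities, but normalization would not change which one is larger.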

Predictive models
• Inputs (e.g., medical history, age)
• Output (e.g., will the patient experience any side effects?)
• Some models are better than others

Principles of data mining
• Training/test sets
• Error analysis and overfitting
• Cross-validation
• Supervised vs. unsupervised methods
[Figure: error plotted against input size, with separate training and test curves]
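Cross-validation can be illustrated with a very small example. The sketch below splits the 14 class labels of the buys? table into k folds and estimates the error of a baseline that always predicts the majority class of its training folds; the baseline, the fold assignment, and all names are assumptions chosen to keep the example short.

```python
from collections import Counter

# Class labels ("buys?") of the 14 training records, in record order.
LABELS = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]


def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) so each example is tested once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def majority_label(labels):
    """Most frequent label in a list (the baseline 'model')."""
    return Counter(labels).most_common(1)[0][0]


def cross_validated_error(labels, k):
    """Fraction of held-out examples the majority baseline gets wrong."""
    errors = 0
    for train, test in k_fold_splits(len(labels), k):
        guess = majority_label([labels[j] for j in train])
        errors += sum(1 for j in test if labels[j] != guess)
    return errors / len(labels)


cv_error = cross_validated_error(LABELS, k=7)
print(f"cross-validated error of the majority baseline: {cv_error:.3f}")
```

Because every record is held out exactly once, the estimate reflects error on data the model did not see, which is what separates the test curve from the training curve.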

Representing data
• Vector space
[Figure: salary vs. credit; classes: pay off, default]

Decision surfaces
[Figure: salary vs. credit; classes: pay off, default]

Decision trees
[Figure: salary vs. credit; classes: pay off, default]

Linear boundary
[Figure: salary vs. credit; classes: pay off, default]

kNN models
• Assign each element the majority class among its k nearest training examples
• Demos: 2.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html
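A k-nearest-neighbour classifier fits in a few lines. The sketch below uses made-up, already normalized (salary, credit) points with pay off/default labels; the data, the choice of Euclidean distance, and the function name are all hypothetical.

```python
from collections import Counter
from math import dist

# Hypothetical training points: (salary, credit) scaled to [0, 1], with a label.
TRAINING = [
    ((0.20, 0.30), "default"),
    ((0.30, 0.20), "default"),
    ((0.25, 0.40), "default"),
    ((0.70, 0.80), "pay off"),
    ((0.80, 0.70), "pay off"),
    ((0.90, 0.90), "pay off"),
]


def knn_predict(point, k=3):
    """Majority label among the k training points closest to `point`."""
    nearest = sorted(TRAINING, key=lambda example: dist(point, example[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_predict((0.75, 0.75)))
print(knn_predict((0.20, 0.20)))
```

Using an odd k with two classes avoids tied votes, and scaling the features matters because the distance treats both axes equally.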

Other methods
• Decision trees
• Neural networks
• Support vector machines
• Demos: ost/

arff
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

Weka
Methods:
• rules.ZeroR
• bayes.NaiveBayes
• trees.j48.J48
• lazy.IBk
• trees.DecisionStump

kMeans clustering
• tware.html
• java weka.clusterers.SimpleKMeans -t data/weather.arff
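The idea behind k-means can be sketched independently of Weka. The code below is a plain implementation of the usual assign-then-recompute loop on made-up 2-D points; it is not Weka's SimpleKMeans, and the data, the initialization from the first k points, and the names are assumptions.

```python
from math import dist


def k_means(points, k, iterations=20):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    centroids = list(points[:k])  # simple deterministic initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep the centroid of an empty cluster
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments can no longer change
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical points forming two visibly separate groups.
POINTS = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.5)]
centroids, clusters = k_means(POINTS, k=2)
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

Unlike the classifiers earlier in the deck, k-means uses no class labels, which makes it an unsupervised method.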

More useful pointers

More types of data mining
• Classification and prediction
• Cluster analysis
• Outlier analysis
• Evolution analysis