
1 WEKA Application Overview

2 Definitions of Data Mining
Search for valuable information in large volumes of data.
Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.
The automated extraction of predictive information from large databases.

3 Knowledge Discovery in Databases

4 Methodology
1. Identify the problem
2. Prepare the data
3. Explore models
4. Use the model
5. Monitor the model

5 Data Mining Is...

6 Data Mining Is Not...

7 Data Mining Tasks
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996

8 @relation escape.symbolic
@attribute age {junior, adult, senior}
@attribute from {Europe, America, Asia}
@attribute education {university, college}
@attribute occupation {TRUE, FALSE}
@attribute tourX {yes, no}
@data
junior,Europe,university,FALSE,no
junior,Europe,university,TRUE,no
adult,Europe,university,FALSE,yes
senior,America,university,FALSE,yes
senior,Asia,college,FALSE,yes
senior,Asia,college,TRUE,no
adult,Asia,college,TRUE,yes
junior,America,university,FALSE,no
junior,Asia,college,FALSE,yes
senior,America,college,FALSE,yes
junior,America,college,TRUE,yes
adult,America,university,TRUE,yes
adult,Europe,college,FALSE,yes
senior,America,university,TRUE,no
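A minimal sketch of loading such an ARFF file with the WEKA Java API, assuming the data above is saved as escape.arff (the file name and path are illustrative, not from the slides):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadArff {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file (path is illustrative)
        DataSource source = new DataSource("escape.arff");
        Instances data = source.getDataSet();
        // Use the last attribute (tourX) as the class to predict
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Loaded " + data.numInstances() + " instances, "
                + data.numAttributes() + " attributes.");
    }
}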

9 Automatic construction of a decision tree: at each level, search for the most discriminating attribute.
partition(data P)
  if all elements of P belong to the same class then return;
  otherwise, for each attribute A do
    evaluate the quality of the split on A
  use the best split to divide P into P1, P2, ..., Pn
  for i = 1 to n do partition(Pi)
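A hedged Java sketch of the "evaluate the quality of the split on A" step, using information gain over a few nominal rows. The slide does not specify the criterion; entropy-based gain is an assumption, and the small data subset is only for illustration:

import java.util.*;

public class InfoGain {
    // Entropy (in bits) of a list of class labels
    static double entropy(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String l : labels) counts.merge(l, 1, Integer::sum);
        double h = 0.0, n = labels.size();
        for (int c : counts.values()) {
            double p = c / n;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Information gain of splitting 'rows' on attribute column 'attr';
    // the class label is assumed to be in the last column.
    static double gain(List<String[]> rows, int attr) {
        List<String> all = new ArrayList<>();
        Map<String, List<String>> byValue = new HashMap<>();
        for (String[] r : rows) {
            String label = r[r.length - 1];
            all.add(label);
            byValue.computeIfAbsent(r[attr], k -> new ArrayList<>()).add(label);
        }
        double remainder = 0.0;
        for (List<String> part : byValue.values())
            remainder += (part.size() / (double) rows.size()) * entropy(part);
        return entropy(all) - remainder;
    }

    public static void main(String[] args) {
        // Columns: age, from, education, occupation, tourX (class);
        // four rows taken from the escape data above
        List<String[]> rows = Arrays.asList(
            new String[]{"junior", "Europe", "university", "FALSE", "no"},
            new String[]{"adult", "Europe", "university", "FALSE", "yes"},
            new String[]{"senior", "Asia", "college", "TRUE", "no"},
            new String[]{"senior", "America", "university", "FALSE", "yes"});
        String[] names = {"age", "from", "education", "occupation"};
        for (int a = 0; a < names.length; a++)
            System.out.printf("gain(%s) = %.3f%n", names[a], gain(rows, a));
    }
}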

10 Decision tree (diagram): the root splits on age; the junior branch leads to an education node (university: NO, college: YES), the adult branch leads to YES, and the senior branch leads to an occupation node (TRUE: NO, FALSE: YES).

11 age = junior
|   education = university: no (3.0)
|   education = college: yes (2.0)
age = adult: yes (4.0)
age = senior
|   occupation = TRUE: no (2.0)
|   occupation = FALSE: yes (3.0)

Number of Leaves : 5
Size of the tree : 8
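A minimal sketch of producing a tree like this with WEKA's J48 decision tree learner, reusing the illustrative escape.arff file; default settings may not reproduce the slide's tree exactly:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BuildTree {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("escape.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // class = tourX
        J48 tree = new J48();                           // C4.5-style pruned tree
        tree.buildClassifier(data);
        System.out.println(tree);                       // prints the learned tree as text
    }
}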

12 Association Rules
Another tool related to exploratory data analysis, knowledge discovery and machine learning.
Example: LHS ==> RHS, where both LHS and RHS are sets of items; if every item in LHS is purchased in a transaction, then it is likely that the items in RHS will also be purchased.

13 Apriori
Minimum support: 0.2
Minimum confidence: 0.9
Number of cycles performed: 17

Best rules found:
1. education=college occupation=FALSE 4 ==> tourX=yes 4 (1)
2. from=Asia 4 ==> education=college 4 (1)
3. age=adult 4 ==> tourX=yes 4 (1)
4. from=Asia tourX=yes 3 ==> education=college 3 (1)
5. age=senior occupation=FALSE 3 ==> tourX=yes 3 (1)
6. age=senior tourX=yes 3 ==> occupation=FALSE 3 (1)
7. age=junior education=university 3 ==> tourX=no 3 (1)
8. age=junior tourX=no 3 ==> education=university 3 (1)
9. from=Asia occupation=FALSE 2 ==> education=college tourX=yes 2 (1)
10. from=Asia education=college occupation=FALSE 2 ==> tourX=yes 2 (1)
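A sketch of running WEKA's Apriori with the same settings (minimum support 0.2, minimum confidence 0.9), again assuming the illustrative escape.arff file:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MineRules {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("escape.arff").getDataSet();
        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.2);  // minimum support
        apriori.setMinMetric(0.9);             // minimum confidence
        apriori.setNumRules(10);               // report the 10 best rules
        apriori.buildAssociations(data);
        System.out.println(apriori);           // prints the rule list
    }
}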

14 Naïve Bayesian Classifiers
Bayes rule (or law, or theorem): P(Y|X) = P(X|Y) P(Y) / P(X)
Conditional independence: A and B are conditionally independent given C if P(A,B|C) = P(A|C) P(B|C).
Combining the two, a naïve Bayesian classifier assumes the attributes are conditionally independent given the class, so P(C|A1,...,An) is proportional to P(C) P(A1|C) ... P(An|C).

15 Class yes: P(C) = 0.625
Attribute age:        junior 0.25   adult 0.41666667   senior 0.33333333
Attribute from:       Europe 0.25   America 0.41666667 Asia 0.33333333
Attribute education:  university 0.36363636   college 0.63636364
Attribute occupation: TRUE 0.36363636   FALSE 0.63636364

Class no: P(C) = 0.375
Attribute age:        junior 0.5    adult 0.125        senior 0.375
Attribute from:       Europe 0.375  America 0.375      Asia 0.25
Attribute education:  university 0.71428571   college 0.28571429
Attribute occupation: TRUE 0.57142857   FALSE 0.42857143
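A sketch of training WEKA's NaiveBayes on the same data and querying the class distribution for one instance; the file name and the instance index are illustrative:

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BayesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("escape.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);
        System.out.println(nb);  // per-class, per-attribute estimates
        // Posterior distribution over {yes, no} for the first training instance
        double[] dist = nb.distributionForInstance(data.instance(0));
        System.out.printf("P(yes)=%.3f  P(no)=%.3f%n", dist[0], dist[1]);
    }
}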

16 Clustering
Aim: discover regularities in the data, i.e., find clusters of similar instances.

17 Distance measures
Similarity is usually defined in terms of a distance measure; two basic distance measures are sketched below.
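The slide does not name the two measures; a common choice, assumed here, is the Euclidean and the Manhattan (city-block) distance over numeric vectors:

public class Distances {
    // Euclidean distance: square root of the sum of squared differences
    static double euclidean(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(s);
    }

    // Manhattan (city-block) distance: sum of absolute differences
    static double manhattan(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += Math.abs(x[i] - y[i]);
        return s;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3}, b = {4, 6, 3};
        System.out.println(euclidean(a, b));  // 5.0
        System.out.println(manhattan(a, b));  // 7.0
    }
}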

18 Number of clusters: 2

Cluster: 0   Prior probability: 0.5643
Attribute age:        Discrete Estimator. Counts = 2.9 3.63 4.37 (Total = 10.9)
Attribute from:       Discrete Estimator. Counts = 2.34 3.79 4.78 (Total = 10.9)
Attribute education:  Discrete Estimator. Counts = 2.56 7.34 (Total = 9.9)
Attribute occupation: Discrete Estimator. Counts = 4.22 5.68 (Total = 9.9)
Attribute tourX:      Discrete Estimator. Counts = 7.74 2.16 (Total = 9.9)

Cluster: 1   Prior probability: 0.4357
Attribute age:        Discrete Estimator. Counts = 4.1 2.37 2.63 (Total = 9.1)
Attribute from:       Discrete Estimator. Counts = 3.66 4.21 1.22 (Total = 9.1)
Attribute education:  Discrete Estimator. Counts = 6.44 1.66 (Total = 8.1)
Attribute occupation: Discrete Estimator. Counts = 3.78 4.32 (Total = 8.1)
Attribute tourX:      Discrete Estimator. Counts = 3.26 4.84 (Total = 8.1)

=== Clustering stats for training data ===
Cluster  Instances
0        7 (50%)
1        7 (50%)
Log likelihood: -4.00994
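A sketch of reproducing this clustering with WEKA's EM clusterer, set to 2 clusters; no class index is set, so all five attributes, including tourX, take part in the clustering, as in the output above. The escape.arff file name is the illustrative one used earlier:

import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClusterDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("escape.arff").getDataSet();
        EM em = new EM();
        em.setNumClusters(2);      // otherwise EM chooses the number of clusters itself
        em.buildClusterer(data);
        System.out.println(em);    // prior probabilities and per-attribute estimators
    }
}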

19 Example: WEKA

20-26 (no transcribed text on these slides)

27 References
Theory: Berry, Linoff. Data Mining Techniques, 1997. http://www3.shore.net/~kht/dmintro/dmintro.htm
Software: http://www.cs.waikato.ac.nz/ml/weka/

