
1 WEKA Application Overview

2 Definitions of Data Mining Search for valuable information in large volumes of data. Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules. The automated extraction of predictive information from large databases.

3 Knowledge Discovery in Databases

4 Methodology
1. Identify the problem
2. Prepare the data
3. Explore models
4. Use the model
5. Monitor the model

5 Data Mining Is...

6 Data Mining Is Not...

7 Data Mining Tasks Classification [Predictive] Clustering [Descriptive] Association Rule Discovery [Descriptive] Sequential Pattern Discovery [Descriptive] Regression [Predictive] Deviation Detection [Predictive] From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996

8 @relation …
@attribute age {junior, adult, senior}
@attribute from {Europe, America, Asia}
@attribute education {university, college}
@attribute occupation {TRUE, FALSE}
@attribute tourX {yes, no}
@data
junior,Europe,university,FALSE,no
junior,Europe,university,TRUE,no
adult,Europe,university,FALSE,yes
senior,America,university,FALSE,yes
senior,Asia,college,FALSE,yes
senior,Asia,college,TRUE,no
adult,Asia,college,TRUE,yes
junior,America,university,FALSE,no
junior,Asia,college,FALSE,yes
senior,America,college,FALSE,yes
junior,America,college,TRUE,yes
adult,America,university,TRUE,yes
adult,Europe,college,FALSE,yes
senior,America,university,TRUE,no
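Outside WEKA, the same 14 instances can be held in plain Python for the experiments on the following slides. A minimal sketch (attribute order as in the file above; the variable names are my own, not WEKA's):

```python
from collections import Counter

# The 14 instances from the ARFF file, one tuple per row:
# (age, from, education, occupation, tourX)
DATA = [tuple(line.split(",")) for line in """\
junior,Europe,university,FALSE,no
junior,Europe,university,TRUE,no
adult,Europe,university,FALSE,yes
senior,America,university,FALSE,yes
senior,Asia,college,FALSE,yes
senior,Asia,college,TRUE,no
adult,Asia,college,TRUE,yes
junior,America,university,FALSE,no
junior,Asia,college,FALSE,yes
senior,America,college,FALSE,yes
junior,America,college,TRUE,yes
adult,America,university,TRUE,yes
adult,Europe,college,FALSE,yes
senior,America,university,TRUE,no""".splitlines()]

# Class distribution of the target attribute tourX
print(Counter(row[-1] for row in DATA))  # Counter({'yes': 9, 'no': 5})
```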

9 Automatic construction of a decision tree: at each level, search for the most discriminating attribute.
partition(data P):
  if all elements of P are in the same class, return;
  otherwise, for each attribute A, evaluate the quality of the split on A;
  use the best split to divide P into P1, P2, ..., Pn;
  for i = 1 to n: partition(Pi)
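The partition pseudocode can be turned into a runnable sketch. The version below is my own illustration rather than WEKA's implementation; it scores each candidate attribute by the weighted entropy of its partition (lower means more discriminating) and recurses:

```python
import math
from collections import Counter

# Same 14 tour instances as on slide 8: (age, from, education, occupation, tourX)
DATA = [tuple(line.split(",")) for line in """\
junior,Europe,university,FALSE,no
junior,Europe,university,TRUE,no
adult,Europe,university,FALSE,yes
senior,America,university,FALSE,yes
senior,Asia,college,FALSE,yes
senior,Asia,college,TRUE,no
adult,Asia,college,TRUE,yes
junior,America,university,FALSE,no
junior,Asia,college,FALSE,yes
senior,America,college,FALSE,yes
junior,America,college,TRUE,yes
adult,America,university,TRUE,yes
adult,Europe,college,FALSE,yes
senior,America,university,TRUE,no""".splitlines()]

def entropy(rows):
    """Shannon entropy of the class labels (last column)."""
    n = len(rows)
    return -sum(c / n * math.log2(c / n)
                for c in Counter(r[-1] for r in rows).values())

def build_tree(rows, attrs):
    """partition(P): stop when every row has the same class; otherwise split
    on the attribute whose partition has the lowest weighted entropy."""
    if len({r[-1] for r in rows}) == 1 or not attrs:
        return Counter(r[-1] for r in rows).most_common(1)[0][0]  # leaf label
    def split(a):
        parts = {}
        for r in rows:
            parts.setdefault(r[a], []).append(r)
        return parts
    def quality(a):  # weighted entropy of the partition induced by attribute a
        return sum(len(p) / len(rows) * entropy(p) for p in split(a).values())
    best = min(attrs, key=quality)
    rest = [a for a in attrs if a != best]
    return (best, {v: build_tree(p, rest) for v, p in split(best).items()})

tree = build_tree(DATA, [0, 1, 2, 3])  # indices: age, from, education, occupation
print(tree[0])  # 0: the most discriminating root attribute is age
```

On the tour data this picks age at the root, then education under junior and occupation under senior, matching the tree shown on the next slides.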

10 Decision tree
Age
  junior → Education: university → NO, college → YES
  adult → YES
  senior → Occupation: true → NO, false → YES

11 age = junior
|   education = university: no (3.0)
|   education = college: yes (2.0)
age = adult: yes (4.0)
age = senior
|   occupation = TRUE: no (2.0)
|   occupation = FALSE: yes (3.0)

Number of Leaves : 5
Size of the tree : 8
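Transcribed by hand into if/else form, the tree above classifies all 14 training instances correctly (a sketch for illustration; `classify` is my own helper name, not WEKA output):

```python
# Same 14 tour instances as on slide 8: (age, from, education, occupation, tourX)
DATA = [tuple(line.split(",")) for line in """\
junior,Europe,university,FALSE,no
junior,Europe,university,TRUE,no
adult,Europe,university,FALSE,yes
senior,America,university,FALSE,yes
senior,Asia,college,FALSE,yes
senior,Asia,college,TRUE,no
adult,Asia,college,TRUE,yes
junior,America,university,FALSE,no
junior,Asia,college,FALSE,yes
senior,America,college,FALSE,yes
junior,America,college,TRUE,yes
adult,America,university,TRUE,yes
adult,Europe,college,FALSE,yes
senior,America,university,TRUE,no""".splitlines()]

def classify(age, education, occupation):
    """Hand transcription of the decision tree printed above."""
    if age == "junior":
        return "no" if education == "university" else "yes"
    if age == "adult":
        return "yes"
    return "no" if occupation == "TRUE" else "yes"  # age == "senior"

correct = sum(classify(r[0], r[2], r[3]) == r[4] for r in DATA)
print(correct, "of", len(DATA))  # 14 of 14: the tree fits the training data exactly
```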

12 Association Rules Another tool related to exploratory data analysis, knowledge discovery, and machine learning. Example: LHS ==> RHS, where both LHS and RHS are sets of items: if every item in LHS is purchased in a transaction, then it is likely that the items in RHS will also be purchased.
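Support (how often LHS and RHS occur together) and confidence (how often RHS holds when LHS does) can be computed by direct counting. A minimal sketch over (attribute index, value) pairs; the helper names are my own:

```python
# Same 14 tour instances as on slide 8: (age, from, education, occupation, tourX)
DATA = [tuple(line.split(",")) for line in """\
junior,Europe,university,FALSE,no
junior,Europe,university,TRUE,no
adult,Europe,university,FALSE,yes
senior,America,university,FALSE,yes
senior,Asia,college,FALSE,yes
senior,Asia,college,TRUE,no
adult,Asia,college,TRUE,yes
junior,America,university,FALSE,no
junior,Asia,college,FALSE,yes
senior,America,college,FALSE,yes
junior,America,college,TRUE,yes
adult,America,university,TRUE,yes
adult,Europe,college,FALSE,yes
senior,America,university,TRUE,no""".splitlines()]

def support(rows, items):
    """Fraction of rows matching every (attribute_index, value) pair in items."""
    return sum(all(r[i] == v for i, v in items) for r in rows) / len(rows)

def confidence(rows, lhs, rhs):
    """conf(LHS ==> RHS) = support(LHS and RHS) / support(LHS)."""
    return support(rows, lhs + rhs) / support(rows, lhs)

# Example rule: age=adult ==> tourX=yes (attribute indices 0 and 4)
print(support(DATA, [(0, "adult"), (4, "yes")]))       # 4 of 14 rows
print(confidence(DATA, [(0, "adult")], [(4, "yes")]))  # 1.0
```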

13 Apriori
Minimum support: 0.2
Minimum confidence: 0.9
Number of cycles performed: 17
Best rules found:
1. education=college occupation=FALSE 4 ==> tourX=yes 4 (1)
2. from=Asia 4 ==> education=college 4 (1)
3. age=adult 4 ==> tourX=yes 4 (1)
4. from=Asia tourX=yes 3 ==> education=college 3 (1)
5. age=senior occupation=FALSE 3 ==> tourX=yes 3 (1)
6. age=senior tourX=yes 3 ==> occupation=FALSE 3 (1)
7. age=junior education=university 3 ==> tourX=no 3 (1)
8. age=junior tourX=no 3 ==> education=university 3 (1)
9. from=Asia occupation=FALSE 2 ==> education=college tourX=yes 2 (1)
10. from=Asia education=college occupation=FALSE 2 ==> tourX=yes 2 (1)

14 Naïve Bayesian Classifiers
Bayes rule (or law, or theorem): P(Y|X) = P(X|Y)P(Y) / P(X)
Conditional independence: if A and B are conditionally independent given C, then P(A,B|C) = P(A|C)P(B|C)
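Together, the two facts give the classifier: choose the class c that maximizes P(c) times the product of P(x_i | c) over the attributes. A counting sketch on the tour data, using plain maximum-likelihood estimates (no Laplace smoothing, so it can differ slightly from WEKA's NaiveBayes; `nb_predict` is my own name):

```python
from collections import Counter

# Same 14 tour instances as on slide 8: (age, from, education, occupation, tourX)
DATA = [tuple(line.split(",")) for line in """\
junior,Europe,university,FALSE,no
junior,Europe,university,TRUE,no
adult,Europe,university,FALSE,yes
senior,America,university,FALSE,yes
senior,Asia,college,FALSE,yes
senior,Asia,college,TRUE,no
adult,Asia,college,TRUE,yes
junior,America,university,FALSE,no
junior,Asia,college,FALSE,yes
senior,America,college,FALSE,yes
junior,America,college,TRUE,yes
adult,America,university,TRUE,yes
adult,Europe,college,FALSE,yes
senior,America,university,TRUE,no""".splitlines()]

def nb_predict(rows, x):
    """Pick argmax_c P(c) * prod_i P(x_i | c), with probabilities
    estimated by counting over the training rows."""
    n = len(rows)
    best_class, best_score = None, -1.0
    for c, nc in Counter(r[-1] for r in rows).items():
        score = nc / n  # class prior P(c)
        for i, v in enumerate(x):
            # conditional estimate P(x_i = v | class = c)
            score *= sum(1 for r in rows if r[-1] == c and r[i] == v) / nc
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(nb_predict(DATA, ("junior", "Europe", "university", "TRUE")))  # no
```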

15 Naïve Bayes model for the tour data: for each class (yes and no), a class prior P(C) and conditional probability estimates for every attribute value: age (junior, adult, senior), from (Europe, America, Asia), education (university, college), occupation (TRUE, FALSE). The numeric estimates did not survive the transcript.

16 Clustering Aim: discover regularities in data = find clusters of instances

17 Distance measures Similarity is usually defined by distance. Two basic distance measures are the Euclidean and the Manhattan distance.
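The formulas themselves are not in the transcript; assuming the two basic measures meant here are the Euclidean and Manhattan (city-block) distances, they can be sketched for numeric instances as:

```python
import math

def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```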

18 Number of clusters: 2

Cluster 0. Prior probability:
Attribute age: Discrete Estimator. Counts = (Total = 10.9)
Attribute from: Discrete Estimator. Counts = (Total = 10.9)
Attribute education: Discrete Estimator. Counts = (Total = 9.9)
Attribute occupation: Discrete Estimator. Counts = (Total = 9.9)
Attribute tourX: Discrete Estimator. Counts = (Total = 9.9)

Cluster 1. Prior probability:
Attribute age: Discrete Estimator. Counts = (Total = 9.1)
Attribute from: Discrete Estimator. Counts = (Total = 9.1)
Attribute education: Discrete Estimator. Counts = (Total = 8.1)
Attribute occupation: Discrete Estimator. Counts = (Total = 8.1)
Attribute tourX: Discrete Estimator. Counts = (Total = 8.1)

=== Clustering stats for training data ===
Cluster  Instances
0        7 ( 50%)
1        7 ( 50%)
Log likelihood:

19 Example: WEKA


27 References Theory: Berry & Linoff, Data Mining Techniques. Software: WEKA.

