



Presentation on theme: "Interestingness Measures. Quality in KDD" — Presentation transcript:

1 Interestingness Measures

2 Quality in KDD

3

4 Levels of Quality Quality of discovered knowledge: f(D, M, U). Data Quality (D): noise, accuracy, missing values, bad values, … [Berti-Equille 2004]. Model Quality (M): accuracy, generalization, relevance, … User-based Quality (U): relevance for decision making

5 User-based Quality 2 categories: Objective (D, M): computed from the data only. Subjective (U): hypotheses about the user's goals and domain knowledge; hard to formalize (novelty)

6 Examples of Quality Criteria 7 interestingness criteria [Hussein 2000]. Objective: Generality (ex: support); Validity (ex: confidence); Reliability (ex: high generality and validity). Subjective: Common sense (reliable + already known); Actionability (utility for decision making); Novelty (previously unknown); Surprise (unexpectedness: contradicts prior knowledge). Objective Measures

7 Quality and Association Rules

8 Association Rules Association rules [Agrawal et al. 1993]: market-basket analysis; unsupervised learning; algorithms + 2 measures (support and confidence). Problems: enormous amount of rules (rough rules); little semantics in the support and confidence measures → need to help the user select the best rules. AR Quality

9 Association Rules Solutions: redundancy reduction; structuring (classes, closed rules); improved quality measures; interactive decision aid (rule mining)

10 Association Rules Input: data with p Boolean attributes (V0, V1, … Vp) as columns and n transactions as rows. Output: association rules, i.e., implicative tendencies X → Y where X and Y are itemsets (ex: V0 ∧ V4 ∧ V8 → V1), with negative examples. 2 measures: Support: supp(X → Y) = freq(X ∪ Y); Confidence: conf(X → Y) = P(Y|X) = freq(X ∪ Y)/freq(X). Algorithm properties (monotonicity). Ex: diapers → beer (supp=20%, conf=90%) (NB: max number of rules is 3^p)
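The two measures above can be sketched in a few lines; the transaction table and item names below are illustrative, not from the slides:

```python
# Toy transaction table: each transaction is a set of items (illustrative).
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "bread"},
    {"diapers", "beer", "bread"},
]

def support(X, Y, transactions):
    """supp(X -> Y) = freq(X u Y): fraction of transactions containing X and Y."""
    items = set(X) | set(Y)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """conf(X -> Y) = P(Y|X) = freq(X u Y) / freq(X)."""
    return support(X, Y, transactions) / support(X, set(), transactions)

print(support({"diapers"}, {"beer"}, transactions))     # 3/5 = 0.6
print(confidence({"diapers"}, {"beer"}, transactions))  # ~0.75 (3 of 4 diaper baskets)
```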

11 Limits of Support Support: supp(X → Y) = freq(X ∪ Y) measures the generality of the rule. A minimum support threshold (ex: 10%) reduces the complexity but loses nuggets (support pruning). Nugget: a specific rule (low support) that is valid (high confidence) → high potential of novelty/surprise

12 Limits of Confidence Confidence: conf(X → Y) = P(Y|X) = freq(X ∪ Y)/freq(X) measures the validity/logical aspect of the rule (inclusion). A minimal confidence threshold (ex: 90%) reduces the amount of extracted rules, but interestingness ≠ validity: confidence does not detect independence. Independence: X and Y are independent when P(Y|X) = P(Y); if P(Y) is high → nonsense rule with high confidence. Ex: diapers → beer (supp=20%, conf=90%) is meaningless if supp(beer)=90%. [Guillaume et al. 1998], [Lallich et al. 2004]
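The slide's point, that confidence alone cannot flag independence, can be shown with its own numbers (conf = 90%, supp(beer) = 90%); lift, one of the measures covered later in the deck, is a standard way to detect it:

```python
# Sketch: confidence looks impressive, yet the rule is uninformative.
# Numbers follow the slide's example; lift = 1 signals independence.
def lift(conf_xy, p_y):
    """lift(X -> Y) = P(Y|X) / P(Y); equals 1 when X and Y are independent."""
    return conf_xy / p_y

conf = 0.90    # conf(diapers -> beer), high on its own
p_beer = 0.90  # but beer is in 90% of all transactions anyway

print(lift(conf, p_beer))  # 1.0 -> the rule carries no information
```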

13 Limits of the Pair Support-Confidence In practice: a high support threshold (10%) and a high confidence threshold (90%) → valid and general rules → common sense, but no novelty. Efficient measures, but insufficient to capture quality

14 Subjective Measures

15 Criteria User-oriented measures (U) of quality/interestingness: Unexpectedness [Silberschatz 1996]: unknown or contradictory rule. Actionability (usefulness) [Piatetsky-Shapiro 1994]: usefulness for decision making, gain. Anticipation [Roddick 2001]: prediction on the temporal dimension. AR Quality : Subjective Measures

16 Criteria Unexpectedness and actionability: Unexpected + useful = high interestingness; Expected + useful = reinforcement; Expected + non-useful = ?; Unexpected + non-useful = ?

17 Principle Algorithm principle: 1. Extraction of the decision maker's knowledge 2. Formalization of the knowledge K (expected and actionable) 3. KDD (K’) 4. Comparison of K and K’ 5. Selection (via subjective measures) of the rules Δ(K, K’) of K’ which differ the most from K (unexpectedness) or are the most similar to it (actionability)

18 Rule Templates User knowledge (K): syntactic constraints. Patterns/forms of rules: A1, A2, …, Ak → Ak+1, where Ai is a constraint on attribute Vi (interval of values). K = K1 + K2: K1 = interesting patterns (select); K2 = uninteresting patterns (reject). Goal: select the interesting rules inside K’. Boolean criterion: rules X → Y of K’ satisfying the K1 patterns but not the K2 ones, plus threshold constraints on support, confidence, and rule size (|X ∪ Y|). [Klemettinen et al. 1994]

19 Interestingness User knowledge (K): beliefs. A set K of beliefs (Bayes rules); a belief α ∈ K is weighted by p(α). K = K1 + K2: K1 = hard beliefs (p(α) constant); K2 = soft beliefs (p(α) can vary). Goal: update the soft beliefs K2 according to the part of K’ that satisfies K1. Interest criterion of R = X → Y of K’: the change it induces in the weights p(α). [Silberschatz & Tuzhilin 1995]

20 Logical Contradiction User knowledge (K): a set of rules. Goal: select the unexpected rules in K’. Unexpectedness criterion, for a rule A → B of K’ and a belief X → Y of K: A → B is unexpected if B and Y are contradictory (p(B and Y) = 0), (A and X) is frequent (p(A and X) high), and (A and X) → B is true (hence (A and X) → ¬Y also holds: an exception!). [Padmanabhan and Tuzhilin 1998]

21 Attribute Costs User knowledge (K): costs. Each attribute/item Ai has a cost Cost(Ai). Goal: select the low-cost rules in K’. Cost of a rule A1, A2, …, Ak → B: the mean cost of its attributes (low mean cost preferred). [Freitas 1999]
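The mean-cost ranking described above can be sketched as follows; the attribute names and cost values are hypothetical, not from Freitas' paper:

```python
# Hedged sketch of attribute-cost ranking: each attribute carries a
# user-supplied acquisition cost, and a rule A1,...,Ak -> B is scored
# by the mean cost of its antecedent attributes (lower = preferable).
costs = {"blood_test": 5.0, "x_ray": 20.0, "age": 0.1}  # illustrative values

def mean_rule_cost(antecedent, costs):
    """Mean cost of the attributes appearing in the rule's antecedent."""
    return sum(costs[a] for a in antecedent) / len(antecedent)

print(mean_rule_cost(["age", "blood_test"], costs))  # (0.1 + 5.0) / 2 = 2.55
print(mean_rule_cost(["x_ray"], costs))              # 20.0 -> expensive rule
```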

22 Other Subjective Measures Projected Savings (KEFIR system's interestingness) [Matheus & Piatetsky-Shapiro 1994]; Fuzzy Matching Interestingness Measure [Liu et al. 1996]; General Impressions [Liu et al. 1997]; Logical Contradiction [Padmanabhan & Tuzhilin 1997]; Misclassification Costs [Freitas 1999]; Vague Feelings (Fuzzy General Impressions) [Liu et al. 2000]; Anticipation [Roddick and Rice 2001]; Interestingness [Shekar & Natarajan 2001]. AR Quality : Subjective Measures

23 Classification of Interestingness Measures (Year / Application / Foundation / Scope / Subjective Aspects / User's Knowledge Representation):
1. Matheus and Piatetsky-Shapiro's Projected Savings (1994): Summaries / Utilitarian / Single Rule / Unexpectedness / Pattern Deviation
2. Klemettinen et al. Rule Templates (1994): Association Rules / Syntactic / Single Rule / Unexpectedness & Actionability / Rule Templates
3. Silberschatz and Tuzhilin's Interestingness (1995): Format Independent / Probabilistic / Rule Set / Unexpectedness / Hard & Soft Beliefs
4. Liu et al. Fuzzy Matching Interestingness Measure (1996): Classification Rules / Syntactic Distance / Single Rule / Unexpectedness / Fuzzy Rules
5. Liu et al. General Impressions (1997): Classification Rules / Syntactic / Single Rule / Unexpectedness / GI, RPK
6. Padmanabhan and Tuzhilin Logical Contradiction (1997): Association Rules / Logical, Statistic / Single Rule / Unexpectedness / Beliefs X → Y
7. Freitas' Attribute Costs (1999): Association Rules / Utilitarian / Single Rule / Actionability / Cost Values
8. Freitas' Misclassification Costs (1999): Association Rules / Utilitarian / Single Rule / Actionability / Cost Values
9. Liu et al. Vague Feelings (Fuzzy General Impressions) (2000): Generalized Association Rules / Syntactic / Single Rule / Unexpectedness / GI, RPK, PK
10. Roddick and Rice's Anticipation (2001): Format Independent / Probabilistic / Single Rule / Temporal Dimension / Probability Graph
11. Shekar and Natarajan's Interestingness (2002): Association Rules / Distance / Single Rule / Unexpectedness / Fuzzy-graph-based taxonomy

24 Conclusion Algorithm + measures to compare K and K’; focus on interesting rules. Knowledge is domain-specific: how to acquire K? Representing the knowledge and goals of the decision maker is a hard task; many improvements remain to be made

25 Objective Measures Principles and Classification

26 Principle Statistics on the data D (transactions) for each rule R = X → Y. Interestingness measure = i(R, D, H): degree of satisfaction of the hypothesis H in D, independently of U. AR Quality : Objective Measures

27 Contingency Rule X → Y with X and Y disjoint itemsets → inclusion of E(X) in E(Y). 5 observable parameters in E: n = |E|: number of transactions; n_x = |E(X)|: cardinal of the premise (left-hand side); n_y = |E(Y)|: cardinal of the conclusion (right-hand side); n_xy = |E(X and Y)|: number of positive examples; n_x¬y = |E(X and ¬Y)|: number of negative examples
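The five parameters can be counted directly from a transaction table; a minimal sketch on illustrative data:

```python
# Count the five observable parameters of the slide from a toy table.
# X and Y are item sets (assumed disjoint, as in the slide).
transactions = [
    {"a", "b"}, {"a", "b"}, {"a"}, {"b"}, {"c"},
]

def contingency(X, Y, transactions):
    n = len(transactions)                            # |E|
    n_x = sum(X <= t for t in transactions)          # |E(X)|, premise
    n_y = sum(Y <= t for t in transactions)          # |E(Y)|, conclusion
    n_xy = sum((X | Y) <= t for t in transactions)   # positive examples
    n_xnoty = n_x - n_xy                             # negative examples
    return n, n_x, n_y, n_xy, n_xnoty

print(contingency({"a"}, {"b"}, transactions))  # (5, 3, 3, 2, 1)
```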

28 Independence p(X) is estimated by the frequency n_x/n. Hypothesis of independence of X and Y: p(X and Y) = p(X) p(Y), i.e., n_xy/n = (n_x/n)(n_y/n). Inclusion ≠ dependence

29 Equiprobability (Equilibrium) Rule X → Y. Equilibrium: same number of negative examples (e-) and positive examples (e+), i.e., n_xy = n_x¬y, hence when n_xy = n_x/2. 2 situations: n_xy > n_x/2 (or P(Y|X) > 0.5): e+ higher, rule X → Y; n_xy < n_x/2 (or P(Y|X) < 0.5): e- higher, rule X → ¬Y. Contra-positive: ¬Y → ¬X

30 Interestingness Measure - Definition i(X → Y) = f(n, n_x, n_y, n_xy). General principles: semantics and readability for the user; value increasing with quality; sensitivity to equiprobability (inclusion); statistical likelihood (confidence in the measure itself); noise resistance, stability over time; surprisingness, nuggets?

31 Properties in the Literature Properties of i(X → Y) = f(n, n_x, n_y, n_xy). [Piatetsky-Shapiro 1991] (strong rules): (P1) = 0 if X and Y are independent; (P2) increases with the number of examples n_xy; (P3) decreases with the premise n_x (or the conclusion n_y) (?). [Major & Mangano 1993]: (P4) increases with n_xy when confidence (n_xy/n_x) is constant. [Freitas 1999]: (P5) asymmetry (i(X → Y) ≠ i(Y → X)); small disjunctions (nuggets). See also [Tan et al. 2002], [Hilderman & Hamilton 2001] and [Gras et al. 2004]
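The Piatetsky-Shapiro properties can be illustrated on his own Rule Interest measure, RI = n_xy - n_x*n_y/n (covered later in the deck); the sample counts below are illustrative:

```python
# Check (P1)-(P3) on Rule Interest, RI = n_xy - n_x * n_y / n,
# using the contingency counts n, n_x, n_y, n_xy from the slides.
def rule_interest(n, n_x, n_y, n_xy):
    return n_xy - n_x * n_y / n

# (P1) RI = 0 at independence (n_xy = n_x * n_y / n):
assert rule_interest(100, 20, 50, 10) == 0
# (P2) RI increases with the number of positive examples n_xy:
assert rule_interest(100, 20, 50, 15) > rule_interest(100, 20, 50, 10)
# (P3) RI decreases when the conclusion n_y grows, everything else fixed:
assert rule_interest(100, 20, 80, 15) < rule_interest(100, 20, 50, 15)
print("P1-P3 hold on these examples")
```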

32 Selected Properties Inclusion and equiprobability: value 0, interval of security. Independence: value 0, interval of security. Bounded maximum value: comparability, global threshold, inclusion. Non-linearity: noise resistance, intervals of security for independence and equiprobability. Sensitivity: to n (nuggets), dilation (likelihood). Frequency p(X) vs. cardinal n_x. Reinforcement by similar rules (contra-positive, negative rule, …). [Smyth & Goodman 1991], [Kodratoff 2001], [Gras et al. 2001], [Gras et al. 2004]

33 What Could Be a Good Measure? Based on the negative examples n_x¬y. Bounded maximum + independence + equiprobability → constraints upon the other dimensions

34 Consequences on the Other Dimensions Conclusion n_y: decrease with n_y (n_y → n: Ind ↓). Size of the data n: increase by dilation (Ind ↑); increase with n (Ind ↑)

35 List

36 Classification Classification according to three criteria: the Object of the index (concept measured by the index); the Range of the index (entity concerned by the measurement); the Nature of the index (statistical or descriptive character of the index)

37 Classification The Object: Certain indices take a fixed value at independence (P(a ∩ b) = P(a) × P(b)); they evaluate a deviation from independence. Certain indices take a fixed value at equilibrium (P(a ∩ b) = P(a)/2); they evaluate a deviation from equilibrium. Others take a fixed value neither at independence nor at equilibrium: statistical indices

38 Classification The Range: Certain indices evaluate more than a single rule. They relate simultaneously to a rule and its contra-positive, I(a → b) = I(¬b → ¬a): indices of quasi-implication. They relate simultaneously to a rule and its reciprocal, I(a → b) = I(b → a): indices of quasi-conjunction. They relate simultaneously to all three, I(a → b) = I(b → a) = I(¬b → ¬a): indices of quasi-equivalence

39 Classification The Nature: if the index models a statistical variation, it is a statistical index; if not, it is a descriptive index

40 Classification

41 List of Quality Measures Monodimensional (e+, e-): Support [Agrawal et al. 1996]; Ralambondrainy [Ralambondrainy 1991]. Bidimensional - Inclusion: Descriptive-Confirm [Kodratoff 1999]; Sebag and Schoenauer [Sebag & Schoenauer 1991]; Example and counterexample rate (*). Bidimensional - Inclusion - Conditional Probability: Confidence [Agrawal et al. 1996]; Wang index [Wang et al. 1988]; Laplace (*). Bidimensional - Analogous Rules: Descriptive Confirmed-Confidence [Kodratoff 1999] (*)

42 List of Quality Measures Tridimensional - Analogous Rules: Causal Support [Kodratoff 1999]; Causal Confidence [Kodratoff 1999] (*); Causal Confirmed-Confidence [Kodratoff 1999]; Least Contradiction [Aze & Kodratoff 2004] (*). Tridimensional - Linear - Independence: Pavillon index [Pavillon 1991]; Rule Interest [Piatetsky-Shapiro 1991] (*); Pearl index [Pearl 1988], [Acid et al. 1991], [Gammerman & Luo 1991]; Correlation [Pearson 1996] (*); Loevinger index [Loevinger 1947] (*); Certainty factor [Tan & Kumar 2000]; Rate of connection [Bernard & Charron 1996]; Interest factor [Brin et al. 1997]; Top spin (*); Cosine [Tan & Kumar 2000] (*); Kappa [Tan & Kumar 2000]

43 List of Quality Measures Tridimensional - Nonlinear - Independence: Chi-squared distance; Logarithmic lift [Church & Hanks 1990] (*); Predictive association [Tan & Kumar 2000] (Goodman & Kruskal); Conviction [Brin et al. 1997b]; Odds ratio [Tan & Kumar 2000]; Yule's Q [Tan & Kumar 2000]; Yule's Y [Tan & Kumar 2000]; Jaccard [Tan & Kumar 2000]; Klosgen [Tan & Kumar 2000]; Interestingness [Gray & Orlowska 1998]; Mutual information ratio (Uncertainty) [Tan et al. 2002]; J-measure [Smyth & Goodman 1991], [Goodman & Kruskal 1959] (*); Gini [Tan et al. 2002]; General measure of rule interestingness [Jaroszewicz & Simovici 2001] (*)

44 List of Quality Measures Quadridimensional - Linear - Independence: Lerman index of similarity [Lerman 1981]; Implication index [Gras 1996]. Quadridimensional - Likelihood of Dependence: Probability of error of Chi2 (*); Intensity of implication [Gras 1996] (*). Quadridimensional - Inclusion - Dependence - Analogous Rules: Entropic intensity of implication [Gras 1996] (*); TIC [Blanchard et al. 2004] (*). Others: Surprisingness (*) [Freitas 1998]; rules of exception [Duval et al. 2004]; rule distance, similarity [Dong & Li 1998]

45 Objective Measures Simulations and Properties

46 Monodimensional Measures (e+, e-) Support [Agrawal et al. 1996] Definition: supp = n_xy / n. Semantics: degree of general information. Sensitivity: 1 parameter; measures frequency; linear; insensitive to independence; disequilibrium?; symmetric. Quality of Rules : Objective Measures

47 Monodimensional Measures (e+, e-) Ralambondrainy Measure [Ralambondrainy 1991] Definition: rate of counter-examples n_x¬y / n (to be minimized). Semantics: scarcity of the e-. Sensitivity: 1 parameter; measures frequency; linear; insensitive to independence; disequilibrium?; increasing

48 Bidimensional Measures - Inclusion Descriptive-Confirm [Kodratoff 1999] Definition: (n_xy - n_x¬y) / n. Semantics: variation e+ minus e- (improved support). Sensitivity: 2 parameters; measures frequency; linear; insensitive to independence; 0 at equilibrium

49 Bidimensional Measures - Inclusion Sebag and Schoenauer [Sebag & Schoenauer 1991] Definition: n_xy / n_x¬y. Semantics: ratio e+/e-. Sensitivity: 2 parameters; measures frequency; non-linear (very selective); insensitive to independence; 1 at equilibrium; maximum value not bounded

50 Bidimensional Measures - Inclusion Example and Counterexample Rate (*) Definition: 1 - n_x¬y / n_xy. Semantics: ratio e+/e-. Sensitivity: 2 parameters; measures frequency; non-linear (tolerant); insensitive to independence; 0 at equilibrium; maximum value bounded

51 Bidimensional Measures - Inclusion Confidence [Agrawal et al. 1996] Definition: conf = n_xy / n_x. Semantics: inclusion, validity. Sensitivity: 2 parameters; measures frequency; linear; insensitive to independence; 0.5 at equilibrium; maximum value bounded. Variations: [Ganascia 1991]: Charade; or Descriptive Confirmed-Confidence [Kodratoff 1999]

52 Bidimensional Measures - Inclusion Wang [Wang et al. 1988] Definition: Semantics: improved support (confidence threshold integrated). Sensitivity: 2 parameters; measures frequency; linear; insensitive to independence; disequilibrium?

53 Bidimensional Measures - Inclusion Laplace [Clark & Boswell 1991], [Tan & Kumar 2000] Definition: (n_xy + 1) / (n_x + 2). Semantics: estimate of confidence (decreases with lowering support). Sensitivity: 2 parameters; does not measure frequency when the numbers are small; linear; insensitive to independence; maximum value bounded

54 Bidimensional Measures - Similar Rules Descriptive Confirmed-Confidence [Kodratoff 1999] Definition: conf(X → Y) - conf(X → ¬Y) = (n_xy - n_x¬y) / n_x. Semantics: confidence confirmed by the negative rule (X → ¬Y). Sensitivity: 2 parameters; measures frequency; linear; insensitive to independence; 0 at equilibrium; maximum value bounded; reinforcement by the negative rule

55 Tridimensional Measures - Similar Rules Causal Support [Kodratoff 1999] Definition: (n_xy + n_¬x¬y) / n. Semantics: support improved by the use of the contra-positive. Sensitivity: 3 parameters; measures frequency; linear; insensitive to independence; disequilibrium?; reinforcement by the contra-positive rule

56 Tridimensional Measures - Similar Rules Causal Confidence [Kodratoff 1999] Definition: ½ (conf(X → Y) + conf(¬Y → ¬X)). Semantics: confidence reinforced by the contra-positive. Sensitivity: 3 parameters; measures frequency; linear; insensitive to independence; disequilibrium?; maximum value bounded; reinforcement by the contra-positive rule. Evolution: Causal Confirmed-Confidence: contra-positive + negative rule

57 Tridimensional Measures - Similar Rules Least Contradiction [Aze & Kodratoff 2004] Definition: (n_xy - n_x¬y) / n_y. Semantics: little contradiction. Sensitivity: 3 parameters; measures frequency; linear; 0 at equilibrium; favors inclusion; reinforcement by the negative rule; coupled with an algorithm

58 Tridimensional Measures - Independence Centered Confidence (Pavillon Index) [Pavillon 1991] Definition: P(Y|X) - P(Y). Semantics: deviation from independence, correcting for the size of the conclusion. Sensitivity: 3 parameters; measures frequency; linear; 0 at independence; disequilibrium?. Called Added Value in [Tan et al. 2002]

59 Tridimensional Measures - Independence Rule Interest [Piatetsky-Shapiro 1991] Definition: n_xy - n_x n_y / n. Semantics: gap to independence (strong rules). Sensitivity: 3 parameters; measures frequency; linear; 0 at independence; disequilibrium?. Alternative symmetric measure: Pearl [Pearl 1988], [Acid et al. 1991], [Gammerman & Luo 1991]

60 Tridimensional Measures - Independence Coefficient of Correlation [Pearson 1996] Definition: (P(X and Y) - P(X)P(Y)) / sqrt(P(X)P(Y)P(¬X)P(¬Y)). Semantics: correlation. Sensitivity: 3 parameters; measures frequency; linear; 0 at independence; disequilibrium?

61 Tridimensional Measures - Independence Loevinger (*) [Loevinger 1947] Definition: (P(Y|X) - P(Y)) / (1 - P(Y)) = 1 - P(X and ¬Y) / (P(X)P(¬Y)). Semantics: implicative dependence. Sensitivity: 3 parameters; measures frequency; linear; 0 at independence; maximum value bounded (inclusion); disequilibrium?. Equivalent measure: Certainty Factor [Tan & Kumar 2000]
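A minimal sketch of the Loevinger index from the standard definition (P(Y|X) - P(Y)) / (1 - P(Y)), showing its two reference values (0 at independence, 1 at inclusion):

```python
# Loevinger index: 0 when X and Y are independent (P(Y|X) = P(Y)),
# 1 for a logical rule with no counter-examples (P(Y|X) = 1).
def loevinger(conf_xy, p_y):
    return (conf_xy - p_y) / (1.0 - p_y)

print(loevinger(0.6, 0.6))  # 0.0 -> independence
print(loevinger(1.0, 0.6))  # 1.0 -> inclusion (no counter-examples)
```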

62 Tridimensional Measures - Independence Rate of Connection [Bernard & Charron 1996] Definition: P(X and Y) / (P(X)P(Y)) - 1. Semantics: dependence. Sensitivity: 3 parameters; measures frequency; linear; 0 at independence; inclusion?; disequilibrium?. Variations: Measure of interest (interest factor) [Brin et al. 1997], equivalent to Lift; alternative: logarithmic measure of lift [Church & Hanks 1990]

63 Tridimensional Measures - Independence Measure of Interest (Interest Factor) [Brin et al. 1997], Lift (*), Logarithmic Measure of Lift (*) [Church & Hanks 1990], Cosine (*) [Tan & Kumar 2000] Definitions: Lift = P(X and Y) / (P(X)P(Y)); logarithmic lift = log(P(X and Y) / (P(X)P(Y))); Cosine = P(X and Y) / sqrt(P(X)P(Y)). Semantics: dependence. Sensitivity: 3 parameters; measures frequency; linear; inclusion?; disequilibrium?
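The three measures on this slide can be sketched from their standard definitions on probabilities; the probability values are illustrative:

```python
import math

# Three symmetric "gap to independence" measures on p_xy, p_x, p_y.
def lift(p_xy, p_x, p_y):
    return p_xy / (p_x * p_y)               # 1 at independence

def log_lift(p_xy, p_x, p_y):
    return math.log2(lift(p_xy, p_x, p_y))  # 0 at independence

def cosine(p_xy, p_x, p_y):
    return p_xy / math.sqrt(p_x * p_y)      # geometric-mean normalization

p_x, p_y, p_xy = 0.4, 0.5, 0.3  # dependent case: 0.3 > 0.4 * 0.5
print(lift(p_xy, p_x, p_y))     # ~1.5
print(cosine(p_xy, p_x, p_y))   # ~0.67
```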

64 Tridimensional Measures - Independence Kappa [Tan & Kumar 2000] Definition: (P_o - P_e) / (1 - P_e) with P_o = P(X and Y) + P(¬X and ¬Y) and P_e = P(X)P(Y) + P(¬X)P(¬Y). Semantics: agreement beyond chance. Sensitivity: 3 parameters; measures frequency; linear; 0 at independence; disequilibrium?; maximum value bounded; strengthened by the contra-positive

65 Tridimensional Measures - Independence Predictive Association (*) [Tan & Kumar 2000] (Goodman & Kruskal) Definition: Goodman & Kruskal's lambda. Semantics: X is a good predictor of Y. Sensitivity: 3 parameters; measures frequency; piecewise linear; 0 at independence?; maximum value?; disequilibrium?

66 Tridimensional Measures - Independence Conviction [Brin et al. 1997b] Definition: P(X)P(¬Y) / P(X and ¬Y). Semantics: conviction. Sensitivity: 3 parameters; measures frequency; non-linear (very selective); 1 at independence; maximum value not bounded; disequilibrium?. Shape similar to Sebag and Schoenauer [Sebag & Schoenauer 1991], except at independence
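Conviction can be sketched from its standard definition P(X)P(¬Y) / P(X and ¬Y), including its two characteristic values (1 at independence, infinite for a logical rule):

```python
# Conviction: 1 when X and Y are independent, infinite when the rule
# has no counter-examples (P(X and not Y) = 0, a logical rule).
def conviction(p_x, p_y, p_xy):
    p_x_noty = p_x - p_xy
    if p_x_noty == 0:
        return float("inf")  # logical rule: no counter-examples
    return p_x * (1.0 - p_y) / p_x_noty

print(conviction(0.4, 0.5, 0.2))  # 1.0 -> independence (0.2 = 0.4 * 0.5)
print(conviction(0.4, 0.5, 0.4))  # inf -> inclusion
```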

67 Tridimensional Measures - Independence Odds Ratio, Yule's Q, Yule's Y [Tan & Kumar 2000] (close to Conviction) Definitions: Odds Ratio OR = (n_xy n_¬x¬y) / (n_x¬y n_¬xy); Yule's Q = (OR - 1) / (OR + 1); Yule's Y = (sqrt(OR) - 1) / (sqrt(OR) + 1). Semantics: correlation. Sensitivity: 3 parameters; measures frequency; non-linear (resistance to noise?); 1 or 0 at independence; maximum value bounded (1, or not); disequilibrium?; strengthened by the similar rules

68 Tridimensional Measures - Independence Jaccard, Klosgen [Tan & Kumar 2000] Definitions: Jaccard = P(X and Y) / (P(X) + P(Y) - P(X and Y)); Klosgen = sqrt(P(X and Y)) (P(Y|X) - P(Y)). Semantics: correlation. Sensitivity: 3 parameters; measures frequency; non-linear; 0 at independence; maximum value bounded (0 or 1); disequilibrium?; strengthened by similar rules

69 Tridimensional Measures - Independence Interestingness Weighting Dependency [Gray & Orlowska 1998] Definition: Semantics: interest?. Sensitivity: 3 parameters; measures frequency; non-linear; 0 at independence; inclusion?; disequilibrium?

70 Tridimensional Measures - Independence Mutual Information (Uncertainty) [Tan et al. 2002] Definition: Semantics: information gain provided by X about Y. Sensitivity: 3 parameters; measures frequency; non-linear, entropic; 0 at independence; inclusion?; disequilibrium?; strongly symmetric; low values

71 Tridimensional Measures - Independence J-Measure (*) [Smyth & Goodman 1991], [Goodman & Kruskal 1959] Definition: P(X) [ P(Y|X) log(P(Y|X)/P(Y)) + P(¬Y|X) log(P(¬Y|X)/P(¬Y)) ]. Semantics: cross entropy (by mutual information). Sensitivity: 3 parameters; measures frequency; non-linear, entropic; 0 at independence + concave; inclusion?; disequilibrium?; symmetric; low values; strengthened by the negative rule (X → ¬Y)
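The J-measure can be sketched from its standard definition; the probability values below are illustrative:

```python
import math

# J-measure: information gained about Y by knowing X, weighted by P(X).
# Zero at independence; the 0 * log(0) term is taken as 0 by convention.
def j_measure(p_x, p_y, p_xy):
    p_y_given_x = p_xy / p_x
    terms = 0.0
    for p_cond, p_prior in ((p_y_given_x, p_y), (1 - p_y_given_x, 1 - p_y)):
        if p_cond > 0:
            terms += p_cond * math.log2(p_cond / p_prior)
    return p_x * terms

print(j_measure(0.4, 0.5, 0.2))   # 0.0 -> X tells nothing about Y
print(j_measure(0.4, 0.5, 0.35))  # > 0 -> informative rule
```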

72 Tridimensional Measures - Independence Gini Index Definition: P(X)[P(Y|X)² + P(¬Y|X)²] + P(¬X)[P(Y|¬X)² + P(¬Y|¬X)²] - P(Y)² - P(¬Y)². Semantics: quadratic entropy. Sensitivity: 3 parameters; measures frequency; non-linear, entropic; 0 at independence + concave; inclusion?; disequilibrium?; very symmetric; low values

73 Tridimensional Measures - Independence General Measure of Rule Interestingness (*) [Jaroszewicz & Simovici 2001] Definition: a continuum of measures between Gini and Chi2. Semantics: ?. Sensitivity: 3 parameters; measures frequency; non-linear (Gini → Chi2 distance); 0 at independence; inclusion?; disequilibrium?; not symmetric → symmetric. Notation: Δ_α: family of difference measures conditioned by a real factor α (Gini → Chi2 distance); Δ_X (resp. Δ_Y): distribution vector of X (resp. Y); Δ_XY: joint distribution vector of X and Y; Δ_X × Δ_Y: joint distribution vector of X and Y under the hypothesis of independence; θ: a priori distribution vector of Y

74 Quadridimensional Measures-Independence: Lerman Similarity [Lerman 1981]
Definition: s(X→Y) = (n_XY − n_X·n_Y/n) / √(n_X·n_Y/n)
Semantics: centered and normalized number of examples
Sensitivity: 4 parameters
Statistical measure (counts)
Linear
0 at independence
Inclusion? Disequilibrium?
Quality of Rules: Objective Measures
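A sketch of the Lerman index as described, centered and normalized number of examples (count names are mine):

```python
from math import sqrt

def lerman(n, n_x, n_y, n_xy):
    """Lerman similarity index: the number of examples n_xy, centered on
    its expected value under independence and normalized by its scale.
    s = (n_xy - n_x * n_y / n) / sqrt(n_x * n_y / n)
    """
    expected = n_x * n_y / n        # examples expected under independence
    return (n_xy - expected) / sqrt(expected)

print(lerman(1000, 200, 500, 100))  # -> 0.0 at independence
```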

75 Quadridimensional Measures-Independence: Variant, the Implication Index [Gras 1996]
Definition: q(X→Y) = (n_X¬Y − n_X·n_¬Y/n) / √(n_X·n_¬Y/n)
Semantics: centered and normalized number of counter-examples
Sensitivity: 4 parameters
Statistical measure (counts)
Linear
0 at independence
Inclusion? Disequilibrium?
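The implication index only swaps examples for counter-examples in the Lerman index; a sketch with the same hypothetical count names:

```python
from math import sqrt

def implication_index(n, n_x, n_y, n_xy):
    """Implication index [Gras 1996]: centered, normalized number of
    counter-examples n_x_noty = n_x - n_xy.
    q = (n_x_noty - n_x * n_noty / n) / sqrt(n_x * n_noty / n)
    """
    n_x_noty = n_x - n_xy            # observed counter-examples
    expected = n_x * (n - n_y) / n   # counter-examples expected at independence
    return (n_x_noty - expected) / sqrt(expected)

# Fewer counter-examples than expected gives a negative index (a good rule):
print(implication_index(1000, 200, 500, 180))  # -> -8.0
```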

76 Quadridimensional Measures-Independence: Lerman Similarity [Lerman 1981]
Definition: (probabilistic modeling, chi-2 law)
Semantics: probability of a dependence between X and Y
Sensitivity: 4 parameters
Measures a probability, not a frequency
Non-linear, with tolerance to counter-examples
0 at independence; real-valued
Bounded maximum value
Inclusion? Disequilibrium?
Strongly symmetric => related to the interest measure of [Brin et al., 1997]
Alternative: likelihood ratio [Ritschard et al., 1998]
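The slide's formula is not transcribed; one common way to turn the chi-2 statistic of the 2x2 contingency table into a probability of dependence uses the chi-2 CDF with one degree of freedom, which equals erf(sqrt(t/2)). A sketch under that reading (not necessarily the slide's exact formulation):

```python
from math import erf, sqrt

def chi2_dependence(n, n_x, n_y, n_xy):
    """Probability of a dependence between X and Y via the chi-2 law.
    Computes the 2x2 chi-square statistic t, then its CDF for 1 degree
    of freedom: P(Chi2_1 <= t) = erf(sqrt(t / 2)).
    """
    t = 0.0
    for obs, row, col in (
        (n_xy, n_x, n_y),                     # X and Y
        (n_x - n_xy, n_x, n - n_y),           # X and not Y
        (n_y - n_xy, n - n_x, n_y),           # not X and Y
        (n - n_x - n_y + n_xy, n - n_x, n - n_y),
    ):
        expected = row * col / n
        t += (obs - expected) ** 2 / expected
    return erf(sqrt(t / 2))

print(chi2_dependence(1000, 200, 500, 100))  # -> 0.0 at independence
```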

77 Quadridimensional Measures-Independence: Intensity of Implication (*) [Gras 1996] (Statistical Implication Analysis)
Definition: (probabilistic modeling of the counter-examples)
Semantics: likelihood of the scarcity of counter-examples (statistical surprise)
Sensitivity: 4 parameters
Measures a probability, not a frequency
Non-linear, with tolerance to counter-examples
0.5 at independence; a likelihood value
Bounded maximum value
Inclusion? Disequilibrium?
Can be 0 for logical rules
Inspired by the likelihood of the link [Lerman et al. 1981]
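Gras's intensity of implication is usually presented with a Poisson model of the counter-example count; a sketch under that assumption (count names are mine):

```python
from math import exp

def implication_intensity(n, n_x, n_y, n_xy):
    """Intensity of implication [Gras 1996], Poisson variant: the
    probability that a random number of counter-examples N, with mean
    lam = n_x * n_noty / n, exceeds the number actually observed.
    """
    observed = n_x - n_xy               # observed counter-examples
    lam = n_x * (n - n_y) / n           # expected at independence
    term, cdf = exp(-lam), 0.0
    for k in range(observed + 1):       # accumulate P(N <= observed)
        cdf += term
        term *= lam / (k + 1)           # next Poisson pmf term
    return 1.0 - cdf                    # P(N > observed)

# Scarce counter-examples -> intensity close to 1:
print(implication_intensity(1000, 200, 500, 195))
```

At independence the observed counter-examples equal the Poisson mean, so the value sits near 0.5, as stated above.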

78 Intensity of Implication and Statistical Implication Analysis: Extensions and Applications
Modeling: binary variables => numerical, ordinal, interval, fuzzy [Bernadet 2000, Guillaume 2002, ...]
Large data: entropic intensity of implication [Gras et al. 2001]
Sequences: prediction rules [Blanchard et al. 2002]
Structuring: implicative hierarchy (cohesion) [Gras et al. 2001]
Typicality, variable reduction (implication inertia) [Gras et al. 2002]
Applications: CHIC (http://www.ardm.asso.fr/CHIC.html), SIPINA (University of Lyon 2), FELIX (PerformanSE SA)
AR Quality: Objective Measures

79 Quadridimensional Measures: Entropic Intensity of Implication (*) [Gras et al. 2001] (Statistical Implication Analysis)
Definition:
Inclusion rate:
Information: (increases with …)
Asymmetric entropy: the entropy H′(Y|X) decreases with P(Y|X)
Semantics: statistical surprise + inclusion (removal of the disequilibrium)
Sensitivity: 4 parameters
Frequency-based, non-probabilistic measure
Non-linear, with tolerance to counter-examples (selectivity adjusted with α, e.g. α = 2)
Max 0.5 at independence; real-valued
0 at disequilibrium
Strengthened by the contrapositive
Bounded maximum value (1)

80 Tridimensional Measures-Independence: TIC (*) [Blanchard et al. 2004] (Statistical Implication Analysis)
Definition:
Information rate:
Asymmetric entropy: the entropy Ê(X) varies with P(X)
Semantics: statistical surprise + inclusion (removal of the disequilibrium)
Sensitivity: 4 parameters
Frequency-based measure
Non-linear, entropic
0 at independence
0 at disequilibrium
Strengthened by the contrapositive
Bounded maximum value (1)

81 Tridimensional Measures-Independence: Surprisingness (*) [Freitas 1998]
Definition: rule X₁ ∧ X₂ ∧ … ∧ Xp → Y
Information gain provided by attribute Xi: InfoGain(Xi) = H(Y) − H(Y|Xi)
Conditional entropy: H(Y|Xi)
Semantics: surprise as the information gain contributed by the premise
Frequency-based measure
Non-linear, entropic
Can be used to assess the individual contribution of each attribute
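The information gain of a premise attribute, H(Y) − H(Y|Xi), can be sketched directly on value lists (toy data and naming are mine):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(xs, ys):
    """InfoGain(X) = H(Y) - H(Y|X): the information the premise
    attribute X contributes about the class Y."""
    n = len(ys)
    h_y_given_x = 0.0
    for v in set(xs):
        sub = [y for x, y in zip(xs, ys) if x == v]
        h_y_given_x += len(sub) / n * entropy(sub)
    return entropy(ys) - h_y_given_x

xs = [0, 0, 1, 1]          # attribute values
ys = ["a", "a", "b", "b"]  # class labels
print(info_gain(xs, ys))   # X determines Y -> gain = H(Y) = 1.0
```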

82 Theoretical Comparison

83 Intensity of Implication: Comparison by Simulation (Confidence, J-Measure, Coverage Rate)

84 Intensity of Implication: Comparison by Simulation (Confidence, J-Measure, Coverage Rate)

85 Intensity of Implication: Comparison by Simulation (Confidence, J-Measure, Coverage Rate)

86 Intensity of Implication: Comparison by Simulation (Confidence, PS, Intensity of Implication)

87 TIC: Comparison by Simulation (Confidence, TIM, J-Measure, Gini Index)

88 TIC: Comparison by Simulation (Confidence, TIM, J-Measure, Gini Index)

89 TIC: Comparison by Simulation (Confidence, J-Measure, Coverage Rate)

90 Comparison by Simulation




94 Intensity of Implication: Comparison by Simulation (Confidence, J-Measure, Coverage Rate)

95 Synthesis and Comparative Studies
[Bayardo & Agrawal, 1999]: influence of support; 9 measures, monotone/antitone functions of support, optimization
[Hilderman & Hamilton, 2001]: interestingness of summaries; 16 measures, 5 principles of independence, correlation study
[Azé & Kodratoff, 2001]: resistance to noise in the data
[Tan & Kumar 2000]: interestingness of association rules; 9 symmetric measures, study of the relationships observed between pairs of measures, influence of support
[Tan et al., 2002]: interestingness of association rules; 21 symmetric measures, 8 principles, correlation study, influence of support
[Gras et al. 04]: interestingness of association rules; 10 criteria
[Lenca et al., 2004]: interestingness of association rules; 20 measures, 8 criteria, multi-criteria decision support
[Lallich & Teytaud 2004]: interestingness of association rules; 15 measures, 10 principles, learning and the VC-dimension
Quality of Rules: Subjective Measures
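What these studies do can be illustrated in miniature: compute several measures on the same rules and compare the rankings they induce. The rule counts below are invented for illustration; the measure definitions are standard:

```python
from math import sqrt

def conf(n, n_x, n_y, n_xy):    # confidence P(Y|X)
    return n_xy / n_x

def lift(n, n_x, n_y, n_xy):    # interest / lift
    return n * n_xy / (n_x * n_y)

def ps(n, n_x, n_y, n_xy):      # Piatetsky-Shapiro: P(XY) - P(X)P(Y)
    return n_xy / n - (n_x / n) * (n_y / n)

def lerman(n, n_x, n_y, n_xy):  # Lerman similarity index
    return (n_xy - n_x * n_y / n) / sqrt(n_x * n_y / n)

# Hypothetical rules as (n, n_x, n_y, n_xy) contingency counts:
rules = {"r1": (1000, 200, 500, 180),
         "r2": (1000, 400, 500, 300),
         "r3": (1000, 50, 500, 45)}

for name, m in [("conf", conf), ("lift", lift), ("PS", ps), ("Lerman", lerman)]:
    ranking = sorted(rules, key=lambda r: m(*rules[r]), reverse=True)
    print(name, ranking)
```

The measures disagree on which rule is best (the generality-sensitive PS favors the wide rule r2, while confidence prefers the precise r1 and r3), which is exactly the kind of divergence these correlation studies quantify.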

96 Study of Comparative Experiments

97 Project AR-QAT: Quality Measures Analysis Tool

98 Experimental Results: 30 Objective Measures, Input Data Sets

99 Experimental Results – Positive Correlations


101 Experimental Results: Stable Strong Positive Correlations, Average Correlation

102 ARVAL
A workbench for computing quality measures, made available to the scientific community
http://www.univ-nantes.fr/arval

103 ARVAL

104 ARVAL

105 ARVAL

106 ARVAL

107 Conclusion

108 Conclusion and Outlook
Quality = a multidimensional concept:
Subjective (decision-maker)
Interest varies with the decision-maker's knowledge
PB1: extract knowledge matching the decision-maker's goal
Objective (data and rules)
Interest = hypotheses on the data: inclusion, independence, disequilibrium, nuggets, robustness, ...
Antagonism between independence and disequilibrium
Many indices (~50!) =>
PB2: tools restricted to support/confidence => a workbench for computing the indices
PB3: comparative studies (properties, simulations) and experiments (behavior on data): a platform?
PB4: combining indices, choosing the right index => decision support
PB5: new indices?
PB6: what makes a good index? (the ingredients of quality)

109 Perspective (PB1)
Search for knowledge
Anthropocentric approach
Adaptive extraction
FELIX [Lehn et al. 1999]
AR-VIS [Blanchard et al. 2003]
Axis: quality assessment of knowledge combining the subjective and objective aspects of quality

110 Perspective (PB 2, 3, 4, 5)
Computation: ARVAL? (www.polytech.univ-nantes.fr/arval)
Analysis: AR-QAT? [Popovici 2003]
Decision support: HERBS? [Lenca et al. 2003] (www-iasc.enst-bretagne.fr/ecd-ind/HERBS)
Axis: quality assessment of knowledge; a platform for experimentation and decision support

111 Bibliography [Agrawal et al., 1993] R. Agrawal, T. Imielinski et A. Swami. Mining association rules between sets of items in large databases. Proc. of ACM SIGMOD'93, 1993, p. 207-216. [Azé & Kodratoff, 2001] J. Azé et Y. Kodratoff. Evaluation de la résistance au bruit de quelques mesures d'extraction de règles d'association. Extraction des connaissances et apprentissage 1(4), 2001, p. 143-154. [Azé & Kodratoff, 2001] J. Azé et Y. Kodratoff. Extraction de « pépites » de connaissances dans les données : une nouvelle approche et une étude de sensibilité au bruit. Rapport d'activité du groupe gafoQualité de l'AS GafoDonnées. A paraître dans [Briand et al. 2004]. [Bayardo & Agrawal, 1999] R. J. Bayardo et R. Agrawal. Mining the most interesting rules. Proc. of the 5th Int. Conf. on Knowledge Discovery and Data Mining, 1999, p. 145-154. [Bernadet 2000] M. Bernadet. Basis of a fuzzy knowledge discovery system. Proc. of Principles of Data Mining and Knowledge Discovery, LNAI 1510, pages 24-33. Springer, 2000. [Bernard et Charron 1996] J.-M. Bernard et C. Charron. L'analyse implicative bayésienne, une méthode pour l'étude des dépendances orientées. I. Données binaires. Revue Mathématique Informatique et Sciences Humaines (MISH), vol. 134, 1996, p. 5-38. [Berti-Equille 2004] L. Berti-Equille. Etat de l'art sur la qualité des données : un premier pas vers la qualité des connaissances. Rapport d'activité du groupe gafoQualité de l'AS GafoDonnées. A paraître dans [Briand et al. 2004]. [Blanchard et al. 2001] J. Blanchard, F. Guillet, et H. Briand. L'intensité d'implication entropique pour la recherche de règles de prédiction intéressantes dans les séquences de pannes d'ascenseurs. Extraction des Connaissances et Apprentissage (ECA), Hermès Science Publication, 1(4):77-88, 2002. [Blanchard et al. 2003] J. Blanchard, F. Guillet, F. Rantière, H. Briand. Vers une Représentation Graphique en Réalité Virtuelle pour la Fouille Interactive de Règles d'Association.
Extraction des Connaissances et Apprentissage (ECA), vol. 17, n°1-2-3, 105-118, 2003. Hermès Science Publication. ISSN 0992-499X, ISBN 2-7462-0631-5. [Blanchard et al. 2003a] J. Blanchard, F. Guillet, H. Briand. Une visualisation orientée qualité pour la fouille anthropocentrée de règles d'association. In Cognito - Cahiers Romans de Sciences Cognitives. A paraître. ISSN 1267-8015. [Blanchard et al. 2003b] J. Blanchard, F. Guillet, H. Briand. A User-driven and Quality-oriented Visualization for Mining Association Rules. In Proc. of the Third IEEE International Conference on Data Mining, ICDM'2003, Melbourne, Florida, USA, November 19-22, 2003. [Blanchard et al., 2004] J. Blanchard, F. Guillet, R. Gras, H. Briand. Mesurer la qualité des règles et de leurs contraposées avec le taux informationnel TIC. EGC2004, RNTI, Cépaduès, 2004. A paraître. [Blanchard et al., 2004a] J. Blanchard, F. Guillet, R. Gras, H. Briand. Mesure de la qualité des règles d'association par l'intensité d'implication entropique. Rapport d'activité du groupe gafoQualité de l'AS GafoDonnées. A paraître dans [Briand et al. 2004]. [Breiman et al. 1984] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Chapman & Hall, 1984. [Briand et al. 2004] H. Briand, M. Sebag, G. Gras et F. Guillet (eds). Mesures de Qualité pour la fouille de données. Revue des Nouvelles Technologies de l'Information, RNTI, Cépaduès, 2004. A paraître. [Brin et al., 1997] S. Brin, R. Motwani and C. Silverstein. Beyond Market Baskets: Generalizing Association Rules to Correlations. In Proceedings of SIGMOD'97, pages 265-276, AZ, USA, 1997. [Brin et al., 1997b] S. Brin, R. Motwani, J. Ullman et S. Tsur. Dynamic itemset counting and implication rules for market basket data. Proc. of the Int. Conf. on Management of Data, ACM Press, 1997, p. 255-264.

112 Bibliography [Church & Hanks, 1990] K. W. Church et P. Hanks. Word association norms, mutual information and lexicography. Computational Linguistics, 16(1), 22-29, 1990. [Clark & Boswell 1991] P. Clark and R. Boswell. Rule Induction with CN2: Some Recent Improvements. In Proceedings of the European Working Session on Learning EWSL-91, 1991. [Dong & Li, 1998] G. Dong and J. Li. Interestingness of Discovered Association Rules in terms of Neighborhood-Based Unexpectedness. In X. Wu, R. Kotagiri and K. Korb, editors, Proc. of 2nd Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD '98), Melbourne, Australia, April 1998. [Duval et al. 2004] B. Duval, A. Salleb, C. Vrain. Méthodes et mesures d'intérêt pour l'extraction de règles d'exception. Rapport d'activité du groupe gafoQualité de l'AS GafoDonnées. A paraître dans [Briand et al. 2004]. [Fleury 1996] L. Fleury. Découverte de connaissances pour la gestion des ressources humaines. Thèse de doctorat, Université de Nantes, 1996. [Frawley & Piatetsky-Shapiro 1992] W. Frawley, G. Piatetsky-Shapiro and C. Matheus. Knowledge discovery in databases: an overview. AI Magazine, 14(3), 1992, pages 57-70. [Freitas, 1998] A. A. Freitas. On Objective Measures of Rule Surprisingness. In J. Zytkow and M. Quafafou, editors, Proceedings of the Second European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD '98), pages 1-9, Nantes, France, September 1998. [Freitas, 1999] A. Freitas. On rule interestingness measures. Knowledge-Based Systems Journal 12(5-6), 1999, p. 309-315. [Gago & Bento, 1998] P. Gago and C. Bento. A Metric for Selection of the Most Promising Rules. PKDD'98, 1998. [Gray & Orlowska, 1998] B. Gray and M. E. Orlowska. CCAIIA: Clustering Categorical Attributes into Interesting Association Rules. In X. Wu, R. Kotagiri and K. Korb, editors, Proc. of 2nd Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD '98), pages 132-143, Melbourne, Australia, April 1998.
[Goodman & Kruskal 1959] L. A. Goodman and W. H. Kruskal. Measures of Association for Cross Classification, II: Further discussion and references. Journal of the American Statistical Association, ??? 1959. [Gras et al. 1995] R. Gras, H. Briand and P. Peter. Structuration sets with implication intensity. Proc. of the Int. Conf. on Ordinal and Symbolic Data Analysis - OSDA 95. Springer, 1995. [Gras, 1996] R. Gras et coll. L'implication statistique - Nouvelle méthode exploratoire de données. La pensée sauvage éditions, 1996. [Gras et al. 2001] R. Gras, P. Kuntz, et H. Briand. Les fondements de l'analyse statistique implicative et quelques prolongements pour la fouille de données. Mathématiques et Sciences Humaines : Numéro spécial Analyse statistique implicative, 1(154-155):9-29, 2001. [Gras et al. 2001b] R. Gras, P. Kuntz, R. Couturier, et F. Guillet. Une version entropique de l'intensité d'implication pour les corpus volumineux. Extraction des Connaissances et Apprentissage (ECA), Hermès Science Publication, 1(1-2):69-80, 2001. [Gras et al. 2002] R. Gras, F. Guillet, et J. Philippe. Réduction des colonnes d'un tableau de données par quasi-équivalence entre variables. Extraction des Connaissances et Apprentissage (ECA), Hermès Science Publication, 1(4):197-202, 2002. [Gras et al. 2004] R. Gras, R. Couturier, J. Blanchard, H. Briand, P. Kuntz, P. Peter. Quelques critères pour une mesure de la qualité des règles d'association. Rapport d'activité du groupe gafoQualité de l'AS GafoDonnées. A paraître dans [Briand et al. 2004]. [Guillaume et al. 1998] S. Guillaume, F. Guillet, J. Philippé. Improving the discovery of association rules with intensity of implication. Proc. of 2nd European Symposium Principles of Data Mining and Knowledge Discovery, LNAI 1510, p. 318-327. Springer, 1998. [Guillaume 2002] S. Guillaume. Discovery of Ordinal Association Rules. M.-S. Cheng, P. S. Yu, B. Liu (Eds.), Proc.
of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2002, LNCS 2336, pages 322-327. Springer, 2002.

113 Bibliography [Guillet et al. 1999] F. Guillet, P. Kuntz, et R. Lehn. A genetic algorithm for visualizing networks of association rules. Proc. of the 12th Int. Conf. on Industrial and Engineering Appl. of AI and Expert Systems, LNCS 1611, pages 145-154. Springer, 1999. [Guillet 2000] F. Guillet. Mesures de qualité de règles d'association. Cours DEA-ECD. Ecole polytechnique de l'université de Nantes, 2000. [Hilderman & Hamilton, 1998] R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and Interestingness Measures: A Survey. (KDD '98), ??? New York, 1998. [Hilderman et Hamilton, 2001] R. Hilderman et H. Hamilton. Knowledge discovery and measures of interest. Kluwer Academic Publishers, 2001. [Hussain et al. 2001] F. Hussain, H. Liu, E. Suzuki and H. Lu. Exception Rule Mining with a Relative Interestingness Measure. ??? [Jaroszewicz & Simovici, 2001] S. Jaroszewicz et D. A. Simovici. A general measure of rule interestingness. Proc. of the 7th Int. Conf. on Knowledge Discovery and Data Mining, L.N.C.S. 2168, Springer, 2001, p. 253-265. [Klemettinen et al. 1994] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen and A. I. Verkamo. Finding Interesting Rules from Large Sets of Discovered Association Rules. In N. R. Adam, B. K. Bhargava and Y. Yesha, editors, Proc. of the Third International Conf. on Information and Knowledge Management, pages 401-407, Gaithersburg, Maryland, 1994. [Kodratoff, 1999] Y. Kodratoff. Comparing Machine Learning and Knowledge Discovery in Databases: An Application to Knowledge Discovery in Texts. Lecture Notes on AI (LNAI) - Tutorial series, 2000. [Kuntz et al. 2000] P. Kuntz, F. Guillet, R. Lehn and H. Briand. A User-Driven Process for Mining Association Rules. In D. Zighed, J. Komorowski and J. M. Zytkow (Eds.), Principles of Data Mining and Knowledge Discovery (PKDD 2000), Lecture Notes in Computer Science, vol. 1910, pages 483-489, 2000. Springer. [Kodratoff, 2001] Y. Kodratoff.
Comparing machine learning and knowledge discovery in databases: an application to knowledge discovery in texts. Machine Learning and Its Applications, Paliouras G., Karkaletsis V., Spyropoulos C.D. (eds.), L.N.C.S. 2049, Springer, 2001, p. 1-21. [Kuntz et al. 2001] P. Kuntz, F. Guillet, R. Lehn and H. Briand. A user-driven process for mining association rules. Proc. of Principles of Data Mining and Knowledge Discovery, LNAI 1510, pages 483-489. Springer, 2000. [Kuntz et al. 2001b] P. Kuntz, F. Guillet, R. Lehn, et H. Briand. Vers un processus d'extraction de règles d'association centré sur l'utilisateur. In Cognito, Revue francophone internationale en sciences cognitives, 1(20):13-26, 2001. [Lallich et al. 2004] S. Lallich et O. Teytaud. Évaluation et validation de l'intérêt des règles d'association. Rapport d'activité du groupe gafoQualité de l'AS GafoDonnées. A paraître dans [Briand et al. 2004]. [Lehn et al. 1999] R. Lehn, F. Guillet, P. Kuntz, H. Briand and J. Philippé. Felix: an interactive rule mining interface in a KDD process. In P. Lenca (editor), Proc. of the 10th Mini-Euro Conference, Human Centered Processes, HCP'99, pages 169-174, Brest, France, September 22-24, 1999. [Lenca et al. 2004] P. Lenca, P. Meyer, B. Vaillant, P. Picouet, S. Lallich. Evaluation et analyse multi-critères des mesures de qualité des règles d'association. Rapport d'activité du groupe gafoQualité de l'AS GafoDonnées. A paraître dans [Briand et al. 2004]. [Lerman et al. 1981] I. C. Lerman, R. Gras et H. Rostam. Elaboration et évaluation d'un indice d'implication pour les données binaires. Revue Mathématiques et Sciences Humaines, 75, p. 5-35, 1981. [Lerman, 1981] I. C. Lerman. Classification et analyse ordinale des données. Paris, Dunod, 1981. [Lerman, 1993] I. C. Lerman. Likelihood linkage analysis classification method. Biochimie 75, p. 379-397, 1993. [Lerman & Azé 2004] I. C. Lerman et J. Azé. Indice probabiliste discriminant de vraisemblance du lien pour des données volumineuses. Rapport d'activité du groupe gafoQualité de l'AS GafoDonnées. A paraître dans [Briand et al. 2004].

114 Bibliography [Liu et al., 1999] B. Liu, W. Hsu, L. Mun et H. Lee. Finding interesting patterns using user expectations. IEEE Transactions on Knowledge and Data Engineering 11, 1999, p. 817-832. [Loevinger, 1947] J. Loevinger. A systemic approach to the construction and evaluation of tests of ability. Psychological Monographs, 61(4), 1947. [Mannila & Pavlov, 1999] H. Mannila and D. Pavlov. Prediction with Local Patterns using Cross-Entropy. Technical Report, Information and Computer Science, University of California, Irvine, 1999. [Matheus & Piatetsky-Shapiro, 1996] C. J. Matheus and G. Piatetsky-Shapiro. Selecting and Reporting what is Interesting: The KEFIR Application to Healthcare Data. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (eds), Advances in Knowledge Discovery and Data Mining, p. 401-419, 1996. AAAI Press/MIT Press. [Meo 2000] R. Meo. Theory of dependence values. ACM Transactions on Database Systems 5(3), p. 380-406, 2000. [Padmanabhan et Tuzhilin, 1998] B. Padmanabhan et A. Tuzhilin. A belief-driven method for discovering unexpected patterns. Proc. of the 4th Int. Conf. on Knowledge Discovery and Data Mining, 1998, p. 94-100. [Pearson, 1896] K. Pearson. Mathematical contributions to the theory of evolution. III. Regression, heredity and panmixia. Philosophical Transactions of the Royal Society, vol. A, 1896. [Piatetsky-Shapiro, 1991] G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. Knowledge Discovery in Databases, Piatetsky-Shapiro G., Frawley W.J. (eds.), AAAI/MIT Press, 1991, p. 229-248. [Popovici, 2003] E. Popovici. Un atelier pour l'évaluation des indices de qualité. Mémoire de D.E.A. E.C.D., IRIN/Université Lyon 2/RACAI Bucarest, Juin 2003. [Ritschard et al., 1998] G. Ritschard, D. A. Zighed and N. Nicoloyannis. Maximiser l'association par agrégation dans un tableau croisé. In J. Zytkow and M. Quafafou, editors, Proc. of the Second European Conf.
on the Principles of Data Mining and Knowledge Discovery (PKDD '98), Nantes, France, September 1998. [Sebag et Schoenauer, 1988] M. Sebag et M. Schoenauer. Generation of rules with certainty and confidence factors from incomplete and incoherent learning bases. Proc. of the European Knowledge Acquisition Workshop (EKAW'88), Boose J., Gaines B., Linster M. (eds.), Gesellschaft für Mathematik und Datenverarbeitung mbH, 1988, p. 28.1-28.20. [Shannon & Weaver, 1949] C. E. Shannon et W. Weaver. The mathematical theory of communication. University of Illinois Press, 1949. [Silberschatz & Tuzhilin, 1995] A. Silberschatz and A. Tuzhilin. On Subjective Measures of Interestingness in Knowledge Discovery. (KD. & DM. '95) ???, 1995. [Smyth & Goodman, 1991] P. Smyth et R. M. Goodman. Rule induction using information theory. Knowledge Discovery in Databases, Piatetsky-Shapiro G., Frawley W.J. (eds.), AAAI/MIT Press, 1991, p. 159-176. [Tan & Kumar 2000] P. Tan, V. Kumar. Interestingness Measures for Association Patterns: A Perspective. Workshop tutorial (KDD 2000). [Tan et al., 2002] P. Tan, V. Kumar et J. Srivastava. Selecting the right interestingness measure for association patterns. Proc. of the 8th Int. Conf. on Knowledge Discovery and Data Mining, 2002, p. 32-41.

