Published by Mikayla Ozanne. Modified about 1 year ago.

1
Ranking Interesting Subgroups
Stefan Rüping
Fraunhofer IAIS
stefan.rueping@iais.fraunhofer.de

4
Motivation

Fraunhofer Web Project, kick-off on 17 July 2008

Example subgroups:
1. name_score >= 1 & geoscore >= 1 & housing >= 5                          p = 41.6%
2. Income_score >= 5 & name_score >= 5 & housing >= 5                      p = 36.0%
3. Active_housholds >= 3 & queries_per_household >= 1 & housing >= 5       p = 43.8%
4. Families == 0 & name_score >= 1 & housing == 0                          p = 28.9%
5. Financial_status == 0 & name_score >= 3 & housing <= 5                  p = 66.1%

- Applying ranking to complex data: subgroup models
- Optimization of data mining models for non-expert users

5
Overview
- Introduction to Subgroup Discovery
- Interesting Patterns
- Ranking Subgroups
  - Representation
  - Ranking SVMs
  - Iterative algorithm
- Experiments
- Conclusions

6
Subgroup Discovery

Input
- Data: $X$ defined by nominal attributes $A_1, \dots, A_d$
- Subgroup language: propositional formula $A_{i_1} = v_{j_1} \wedge A_{i_2} = v_{j_2} \wedge \dots$

For a subgroup $S$ let
- $g(S) = \#\{x_i \in S\} / n$
- $p(S) = \#\{x_i \in S \mid y_i = 1\} / \#\{x_i \in S\}$
- $p_0 = \#\{y_i = 1\} / n$
- $q(S) = g(S)^a \, (p(S) - p_0)$

Subgroup quality = significance of the pattern. It combines subgroup size and class probability; $a = 0.5$ corresponds to the t-test.

Task
- Find the $k$ subgroups with highest significance (maximal quality $q$)
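The link between $a = 0.5$ and a significance test can be made explicit. The following is a sketch that assumes the standard one-sample z statistic for a proportion (a textbook approximation, not a formula taken from the slides), using the fact that $n \, g(S)$ is the number of examples in $S$:

```latex
z(S) = \frac{p(S) - p_0}{\sqrt{p_0 (1 - p_0) / (n \, g(S))}}
     = \sqrt{\frac{n}{p_0 (1 - p_0)}} \; g(S)^{0.5} \, (p(S) - p_0)
     = \sqrt{\frac{n}{p_0 (1 - p_0)}} \; q(S) \quad \text{for } a = 0.5
```

The leading factor is the same for every subgroup of a given data set, so under this assumption ranking by $q$ with $a = 0.5$ gives the same order as ranking by the test statistic.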

10
Subgroup Discovery: Example

Weather   Advertised   Ice Cream Sales
good      yes          high
good      no           high
good      no           high
good      no           high
bad       no           low
bad       yes          high
bad       no           low
bad       no           low

S1: Weather = good → sales = high
    g(S1) = 4/8, p(S1) = 4/4
    q(S1) = (4/8)^0.5 (4/4 - 5/8) ≈ 0.265

S2: Advertised = yes → sales = high
    g(S2) = 2/8, p(S2) = 2/2
    q(S2) = (2/8)^0.5 (2/2 - 5/8) = 0.1875

Significance ≠ Interestingness
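The two quality values can be rechecked with a few lines of Python. This is a minimal sketch: the function name `quality` and the tuple layout of `rows` are illustrative choices, not part of the presentation, and only single-condition subgroups are handled.

```python
# Ice cream data from the example: (weather, advertised, sales)
rows = [
    ("good", "yes", "high"),
    ("good", "no", "high"),
    ("good", "no", "high"),
    ("good", "no", "high"),
    ("bad", "no", "low"),
    ("bad", "yes", "high"),
    ("bad", "no", "low"),
    ("bad", "no", "low"),
]


def quality(rows, column, value, a=0.5, target="high"):
    """Return (g, p, q) for the subgroup `rows[column] == value`."""
    n = len(rows)
    members = [r for r in rows if r[column] == value]
    g = len(members) / n                                     # relative size g(S)
    p = sum(r[2] == target for r in members) / len(members)  # p(S)
    p0 = sum(r[2] == target for r in rows) / n               # base rate p_0
    return g, p, g ** a * (p - p0)


g1, p1, q1 = quality(rows, 0, "good")  # S1: Weather = good
g2, p2, q2 = quality(rows, 1, "yes")   # S2: Advertised = yes
print(f"S1: g={g1:.3f} p={p1:.3f} q={q1:.4f}")
print(f"S2: g={g2:.3f} p={p2:.3f} q={q2:.4f}")
```

Both subgroups are pure (p = 1), so the larger one, S1, gets the higher quality value purely because of its size.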

11
Interesting Patterns

What makes a pattern interesting to the user? It depends on prior knowledge, but heuristics exist:
- Attributes: actionability, acquaintedness
- Sub-space: novelty
- Complexity: not too complex, not too simple
- ?

12
Overview: Ranking Interesting Subgroups

(Diagram with the components: Data, Subgroup Discovery, Subgroup Representation, user feedback "S1 > S2", Ranking SVM, Task Modification)

13
Subgroup Representation (1/3)

Subgroups become examples of the ranking learner!

Notation
- $A_i$ = original attribute
- $r(S)$ = representation of subgroup $S$

Remember: important properties of subgroups
- Attributes
- Examples
- Complexity

Representing complexity
- $r(S)$ includes $g(S)$ and $p(S) - p_0$

14
Subgroup Representation (2/3)

Representing attributes
- For each attribute $A_i$ of the original examples, include an attribute in the subgroup representation (formula not preserved)
- Observation: a TF/IDF-like representation performs even better

15
Subgroup Representation (3/3)

Representing examples
- The user may be more interested in a subset of the examples
- Construct a list of known relevant and irrelevant subgroups from user feedback
- For each subgroup $S$ and each known relevant/irrelevant subgroup $T$, define the relatedness of $S$ to the known subgroup $T$ (formula not preserved)

16
Ranking Optimization Problem

Rationale
- Subgroup discovery gives quality $q(S) = g(S)^a \, (p(S) - p_0)$
- The user defines a ranking by pairs "S1 > S2" (S1 is better than S2)
- Find the true ranking $q^*$ such that $S_1 > S_2 \Leftrightarrow q^*(S_1) > q^*(S_2)$

Assumption (justified by assuming hidden labels of interestingness of examples)
- Define a linear ranking function $\log q^*(S) = (a, 1, w) \cdot r(S)$
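The linear form follows from taking the logarithm of the quality function: $\log q(S) = a \log g(S) + 1 \cdot \log(p(S) - p_0)$. The sketch below assumes that the first two components of $r(S)$ are $\log g(S)$ and $\log(p(S) - p_0)$, which is the layout that makes the weight vector $(a, 1, w)$ reduce to the original quality when $w = 0$; the slides do not spell this out. The function names and the `extra` features are hypothetical.

```python
import math


def representation(g, p, p0, extra=()):
    """Assumed layout of r(S): (log g(S), log(p(S) - p0), extra features...).

    Only defined for subgroups with p(S) > p0, since the log must exist.
    """
    return (math.log(g), math.log(p - p0), *extra)


def log_q_star(r, a, w=()):
    """Linear ranking function: dot product of (a, 1, w) with r(S)."""
    weights = (a, 1.0, *w)
    return sum(wi * ri for wi, ri in zip(weights, r))


# Subgroup S1 from the ice cream example: g = 4/8, p = 4/4, p0 = 5/8
r1 = representation(4 / 8, 4 / 4, 5 / 8)
score = log_q_star(r1, a=0.5)
print(math.exp(score))  # equals g^a * (p - p0) when no extra features are used
```

With extra features and non-zero `w`, the same function shifts the score away from the plain quality, which is where user feedback can enter.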

19
Ranking Optimization Problem (2/2)

Solution similar to ranking SVM
- Optimization problem: (formula not preserved)
- Equivalent problem: (formula not preserved)
  where $z = r(S_{i,1}) - r(S_{i,2})$. Remember $\log q^*(S) = (a, 1, w) \cdot r(S)$

Annotations
- Deviation from parameter $a_0$ in subgroup discovery
- Constant weight for $g(S)$ defines margin
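The optimization problems themselves are not preserved in this transcript, so the following is not the author's formulation. It is a generic pairwise hinge-loss ranker in the spirit of a ranking SVM, trained on difference vectors $z$ by plain subgradient descent. It does not fix one weight to 1 or regularize toward $a_0$ as the annotations describe. The function name and the parameters `c`, `lr` and `epochs` are hypothetical.

```python
def train_pairwise_ranker(pairs, dim, c=1.0, lr=0.1, epochs=200):
    """Fit weights so that each 'better' vector scores above its 'worse' one.

    pairs: list of (r_better, r_worse) feature tuples, each of length dim.
    Minimizes 0.5 * ||w||^2 + c * sum(max(0, 1 - w . z)) with z = r_better - r_worse.
    """
    diffs = [[b - s for b, s in zip(better, worse)] for better, worse in pairs]
    w = [0.0] * dim
    for _ in range(epochs):
        grad = list(w)  # gradient of the regularizer 0.5 * ||w||^2
        for z in diffs:
            margin = sum(wi * zi for wi, zi in zip(w, z))
            if margin < 1.0:  # pair violates the margin: hinge term is active
                for j in range(dim):
                    grad[j] -= c * z[j]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def score(w, r):
    return sum(wi * ri for wi, ri in zip(w, r))


# Toy feedback: in every pair the first subgroup is preferred ("S1 > S2").
pairs = [
    ((1.0, 0.0), (0.0, 0.0)),
    ((2.0, 1.0), (1.0, 1.0)),
    ((1.0, 1.0), (0.0, 1.0)),
]
w = train_pairwise_ranker(pairs, dim=2)
print(w)
```

A fixed step size keeps the sketch short; the weights oscillate near the optimum rather than converging exactly, which is enough to order the toy pairs correctly.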

20
Iterative Procedure

Why?
- Google: ~10^12 web pages
- Same number of possible subgroups on a 12-dimensional data set with 9 distinct values per attribute
- Hence all subgroups cannot be computed for single-step ranking

Approach
- The optimization problem gives a new estimate of $a$
- Transform the weights of subgroup features into weights for the original examples
- Idea: replace the binary $y$ with a numeric value. An appropriate offset guarantees that the subgroup quality $q$ approximates the optimized $q^*$

(Diagram: subgroup search, ranking)
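The figure of 10^12 can be reproduced by counting conjunctions. The sketch assumes that each attribute is either left unconstrained or fixed to exactly one of its values, which matches the subgroup language of attribute = value conjunctions; the function name is illustrative, and the count includes the empty conjunction that covers the whole data set.

```python
def count_conjunctive_subgroups(n_attributes, n_values):
    """Number of conjunctions where each attribute is unconstrained
    or set equal to one of its n_values values."""
    return (n_values + 1) ** n_attributes


total = count_conjunctive_subgroups(n_attributes=12, n_values=9)
print(total)  # (9 + 1) ** 12
```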

21
Experiments

Simulation on UCI data
- Replace the true label with the most correlated attribute
- Use the true label to simulate the user
- Measure the correspondence of the algorithm's ranking with the subgroups found on the true label
- Tests the ability of the approach to flexibly adapt to correlated patterns

Performance measure
- Area under the curve: retrieval of the true top 100 subgroups
- Kendall's τ: internal consistency of the returned ranking
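For reference, Kendall's τ compares two rankings of the same items by counting concordant and discordant pairs. The sketch below implements the simple tau-a variant without tie correction; the slides do not say which variant or implementation was used, so this is only an illustration of the measure.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same items."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1    # both rankings order the pair the same way
        elif sign < 0:
            discordant += 1    # the rankings disagree on this pair
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # identical order
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # reversed order
```

Values near 1 mean the two rankings agree on almost every pair, values near -1 mean they are nearly reversed.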

22
Results

Data set         AUC       τ (header symbol lost in extraction; inferred from the performance measures)
Diabetes         0.256     0.008
Breast-w         0.759     0.120
Vote             0.664     0.051
Segment          0.596     0.601
Vehicle          0.053     0.500
Heart-c          0.180     0.036
Primary-tumor    0.739     0.532
Hypothyroid      0.729     0.307
Ionosphere       0.227     0.708
Credit-a         0.050     0.241
Credit-g         0.019     0.285
Colic            1.9E-4    0.213
Anneal           0.030     0.329
Soybean          1.9E-4    0.040
Mushroom         0.542     0.320
mean             0.323     0.286

- Wilcoxon signed rank test confirms significance
- The 3 data sets with minimal AUC are exactly the ones with minimal correlation between true and proxy label!

23
Conclusions
- Example of ranking on complex, knowledge-rich data
- Interestingness of subgroup patterns can be significantly increased with an interactive ranking-based method
- Step toward automating machine learning for end-users

Future work
- Validation with true users
- Active learning approach
