Published by Matthew Flynn. Modified over 2 years ago.

1
Index Driven Selective Sampling for CBR
Nirmalie Wiratunga, Susan Craw, Stewart Massie
School of Computing, The Robert Gordon University, Aberdeen

2
Overview
- Selective sampling
- Cluster creation using an index
- Cluster and case utility scores
- Evaluation

3
Selective Sampling
[Diagram labels: unlabelled cases (pool); select interesting cases; selected cases; labelled cases; index; case-base]
- Relevance feedback
- Distance learning
- Patient monitoring

4
Uncertainty and Representativeness
[Figure: cases marked + and - among cases marked ?]

5
Sampling Procedure
L = set of labelled cases
U = set of unlabelled cases
LOOP
    model <= create-domain-model(L)
    clusters <= create-clusters(model, L, U)
    k-clusters <= select-clusters(k, clusters, L, U)
    FOR 1 to Max-Batch-Size
        case <= select-case(k-clusters, L, U)
        L <= L ∪ get-label(case, oracle)
        U <= U \ case
UNTIL stopping-criterion
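The loop on this slide can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the model, cluster creation and cluster selection steps are folded into a caller-supplied `select_case` function, and the stopping criterion is replaced by a fixed iteration count. The names `selective_sampling`, `select_case` and `max_iterations` are hypothetical.

```python
def selective_sampling(labelled, unlabelled, oracle, select_case,
                       max_batch_size=2, max_iterations=3):
    """Pool-based selective sampling loop (illustrative sketch).

    labelled:    dict mapping case -> label (the set L)
    unlabelled:  set of cases without labels (the pool U)
    oracle:      callable returning the label of a case
    select_case: callable(labelled, unlabelled) -> one case from the pool;
                 stands in for model building, clustering and ranking
    """
    labelled = dict(labelled)      # work on copies, leave inputs untouched
    unlabelled = set(unlabelled)
    for _ in range(max_iterations):            # assumed stopping criterion
        for _ in range(max_batch_size):
            if not unlabelled:                 # pool exhausted
                return labelled, unlabelled
            case = select_case(labelled, unlabelled)
            labelled[case] = oracle(case)      # add the newly labelled case to L
            unlabelled.remove(case)            # remove the case from the pool U
    return labelled, unlabelled


# Toy usage: cases are integers, the oracle labels by threshold,
# and the (hypothetical) selector simply picks the smallest pooled case.
L1, U1 = selective_sampling(
    labelled={1: "X"},
    unlabelled={2, 3, 4, 5},
    oracle=lambda c: "X" if c < 4 else "Y",
    select_case=lambda L, U: min(U),
    max_batch_size=2,
    max_iterations=1,
)
```

With one iteration and a batch size of 2, two cases move from the pool into the labelled set.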

6
Overview
- Selective sampling
- Cluster creation using an index
- Cluster and case utility scores
- Evaluation

7
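Forming Clusters
[Figure: index tree with tests on features f1, f2 and f3 (branches < N and >= N) and leaves a, b, c, d, e. Leaf contents shown: 5 labelled (4X, 1Y) with 6 unlabelled; 0 labelled with 6 unlabelled; 5 labelled (2X, 2Z, 1Y) with 0 unlabelled; 5 labelled (2X, 2Y, 1Z) with 6 unlabelled; 5 labelled (4Y, 1Z) with 0 unlabelled]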
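One way to read this slide is that each leaf of the index becomes a cluster holding the labelled and unlabelled cases that reach it. The sketch below illustrates that reading only; the tree structure, thresholds and the names `leaf_of` and `form_clusters` are hypothetical and not taken from the presentation.

```python
from collections import defaultdict


def leaf_of(case, node):
    """Follow threshold tests from the root until a leaf name is reached."""
    while isinstance(node, dict):
        branch = "low" if case[node["feature"]] < node["threshold"] else "high"
        node = node[branch]
    return node


def form_clusters(tree, labelled, unlabelled):
    """Group labelled (case, label) pairs and unlabelled cases by index leaf."""
    clusters = defaultdict(lambda: {"labelled": [], "unlabelled": []})
    for case, label in labelled:
        clusters[leaf_of(case, tree)]["labelled"].append((case, label))
    for case in unlabelled:
        clusters[leaf_of(case, tree)]["unlabelled"].append(case)
    return dict(clusters)


# Hypothetical two-level index: test f1 first, then f2 on the ">= N" branch.
tree = {"feature": "f1", "threshold": 5,
        "low": "a",
        "high": {"feature": "f2", "threshold": 3, "low": "b", "high": "c"}}
labelled = [({"f1": 2, "f2": 1}, "X"), ({"f1": 7, "f2": 4}, "Y")]
unlabelled = [{"f1": 1, "f2": 9}, {"f1": 8, "f2": 1}]
clusters = form_clusters(tree, labelled, unlabelled)
```

In this toy run, leaf b ends up with no labelled cases and one unlabelled case, the same kind of cluster as the "0 labelled, 6 unlabelled" leaf on the slide.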

8
Analysing Clusters
[Figure: clusters of cases with class labels X, Y and Z]

9
Overview
- Selective sampling
- Cluster creation
- Cluster and case utility scores
- Evaluation

10
Ranking Clusters - Cluster Utility Score

11
Ranking Cases - Case Utility Score

12
Overview
- Selective sampling
- Cluster creation
- Cluster and case utility scores
- Evaluation

13
Evaluation
- Selection Heuristics
  - Rnd: randomly select cluster and cases
  - Rnd-Cluster: random cluster with highest ranked cases
  - Rnd-Case: highest ranked cluster, random cases
  - Informed-S: highest ranked cluster and cases
  - Informed-M: highest ranked clusters and case
- UCI ML (6 datasets)
  - smaller data sets (Zoo, Iris, Lymph, Hep)
  - medium data sets (house votes, breast cancer)
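The five selection heuristics differ only in whether the cluster and the cases are chosen at random or by rank. The sketch below makes that contrast concrete. It assumes the utility scores are already computed and passed in as plain numbers (the scores here are made up), and it reads Informed-M as "the top-ranked case from each of the highest ranked clusters", which is an interpretation of the slide rather than a confirmed definition. The function name `select` and the data layout are hypothetical.

```python
import random


def select(heuristic, clusters, batch_size, rng=random):
    """Pick up to batch_size cases.

    clusters: dict name -> {"score": cluster utility,
                            "cases": {case: case utility}}
    """
    def ranked_cases(name):
        cases = clusters[name]["cases"]
        return sorted(cases, key=cases.get, reverse=True)

    def random_cases(name):
        cases = list(clusters[name]["cases"])
        rng.shuffle(cases)
        return cases

    by_score = sorted(clusters, key=lambda n: clusters[n]["score"], reverse=True)
    best = by_score[0]
    any_cluster = rng.choice(sorted(clusters))

    if heuristic == "Rnd":            # random cluster, random cases
        return random_cases(any_cluster)[:batch_size]
    if heuristic == "Rnd-Cluster":    # random cluster, highest ranked cases
        return ranked_cases(any_cluster)[:batch_size]
    if heuristic == "Rnd-Case":       # highest ranked cluster, random cases
        return random_cases(best)[:batch_size]
    if heuristic == "Informed-S":     # highest ranked cluster and cases
        return ranked_cases(best)[:batch_size]
    if heuristic == "Informed-M":     # assumed: best case from each top cluster
        picks = [ranked_cases(n)[0] for n in by_score if clusters[n]["cases"]]
        return picks[:batch_size]
    raise ValueError(f"unknown heuristic: {heuristic}")


# Made-up utility scores for two clusters.
example_clusters = {
    "a": {"score": 0.9, "cases": {"a1": 0.2, "a2": 0.8}},
    "b": {"score": 0.4, "cases": {"b1": 0.5, "b2": 0.1}},
}
informed_s = select("Informed-S", example_clusters, 2)
informed_m = select("Informed-M", example_clusters, 2)
```

Here Informed-S takes both cases from cluster a, while Informed-M spreads the batch over clusters a and b.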

14
Experimental Design
[Diagram labels: index; case-base; sampling pool; test set; increments Inc, 2Inc, 3Inc, 4Inc, 5Inc; kNN accuracy]
case base size = L + selected cases
selected cases = sampling iterations * Max-Batch-Size

15
Results I
[Plots comparing Rnd, Rnd-cluster, Rnd-case, Informed-M and Informed-S]
- Zoo (7C, 18F, A, P9)
- Iris (3C, 4F, #+A, P3)

16
Results II
[Plots comparing Rnd, Rnd-cluster, Rnd-case, Informed-M and Informed-S]
- Lymphography (4C, 19F, #+A, P9)
- Hepatitis (2C, 20F, A+?, P7)

17
Results III
[Plots comparing Rnd, Rnd-cluster, Rnd-case, Informed-M and Informed-S]
- House (2C, 16F, A+?, P3)
- Breast (2C, 9F, A+?, P7)

18
Conclusions
- Developed a case selection mechanism exploiting case base partitions
- Utility Scores to rank clusters and cases
  - ClUS captures uncertainty within clusters and uses entropy to further weight this score
  - CaUS captures the impact on other cases
- Significant improvement with informed selection on 6 data sets
- The influence of votes, partitions and entropy needs further investigation

19
Training Time
[Plots: training time ratio (Informed-M/Rnd) against training set size; one plot for Zoo, Iris, Lymphography and Hepatitis, one for House Votes and Breast Cancer]
- Small data sets (difference 2 sec to 15 sec)
- Large data sets (difference 15 sec to 60 sec)

20
Discussion
- Improving the utility scores
  - the changing performance of Informed-M and Informed-S with different partition numbers needs to be examined
  - should distances employed with CaUS be transformed?
  - what about considering the votes of the labelled cases?
  - should the training accuracy play a more active role in ClUS?
- How can the presented approach be used for
  - hole discovery?
  - case base maintenance?
- Should be evaluated with other sampling methods
  - Uncertainty sampling

21
Entropy
L = labelled cases
M = 2
$p_{\oplus}$ is the proportion of positive cases in L
$p_{\ominus}$ is the proportion of negative cases in L
Entropy measures the impurity of L:
$$\mathrm{Entropy}(L) = p_{\oplus}(-\log_2 p_{\oplus}) + p_{\ominus}(-\log_2 p_{\ominus}) = -p_{\oplus}\log_2 p_{\oplus} - p_{\ominus}\log_2 p_{\ominus}$$
[Plot: Entropy against p, with $\log_2 m$ marked on the Entropy axis]
Entropy(C_unlabelled) = 0
Entropy(+1, -1) = 1
Entropy(+6, -1) = 0.59
Entropy(+7, -2) = 0.76
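The example values on this slide can be checked with a few lines of Python. This is a small sketch: the function name `entropy` and the convention of passing class counts as a list are choices made here, and a cluster with no labelled cases is given entropy 0 to match the slide's Entropy(C_unlabelled) = 0.

```python
from math import log2


def entropy(counts):
    """Entropy (base 2) of a list of per-class counts of labelled cases."""
    total = sum(counts)
    if total == 0:          # no labelled cases: entropy defined as 0
        return 0.0
    # Classes with zero count contribute nothing (0 * log 0 is taken as 0).
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


balanced = entropy([1, 1])            # one positive, one negative
six_one = round(entropy([6, 1]), 2)   # six positive, one negative
seven_two = round(entropy([7, 2]), 2) # seven positive, two negative
```

The three results are 1.0, 0.59 and 0.76, matching the slide. For m classes the maximum possible value is log2(m), reached when all classes are equally frequent.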

22
Creation, Sampling, Maintenance
- Case generation
- Meta Knowledge
- Sampling
- Impact of Sampling

23
Some Requirements for Sampling
- Uncertainty is not enough?
  - Consider the effect of sampling on the rest of the unlabelled cases
  - Sampling in dense regions may be good compared to isolated points, because it influences many cases
  - Selecting more than one case may help pick representatives from dense areas, i.e. informed

24
Forming Clusters
[Figure: index tree with tests on features f1, f2 and f3 (branches < N and >= N) and leaves a, b, c, d, e. Leaf annotations include: 5 labelled (4X, 1Y); 0 labelled; 5 labelled (2X, 2Y, 1Z); 5 labelled (4Y, 1Z); and labelled/unlabelled counts per leaf such as 5 labelled with 6 unlabelled, 0 labelled with 6 unlabelled, and 5 labelled with 0 unlabelled]

25
Experimental Design
- UCI ML (6 datasets)
  - Larger data sets (house votes, breast cancer)
  - Smaller data sets (Zoo, Iris, Lymph, Hep)
- 5 increasing train / test set sizes
  - equally sized splits for selection pool / test sets
  - Training set or case base initialised with labelled cases: 150 with an increment of [...] with an increment of 25
- kNN accuracy on test set averaged over 25 runs
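The final measurement, kNN accuracy on a test set averaged over repeated runs, can be sketched as below. This is a simplified illustration, not the experimental code: it uses a plain Euclidean kNN with majority vote, splits the data in half at random on each run, and uses the whole first half as the case base with no selective sampling step. The names `knn_predict`, `knn_accuracy` and `mean_accuracy`, the value of k and the toy data are all assumptions.

```python
import random
from collections import Counter


def knn_predict(case_base, query, k=3):
    """Majority label among the k nearest cases (squared Euclidean distance)."""
    nearest = sorted(
        case_base,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(c[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def knn_accuracy(case_base, test_set, k=3):
    """Fraction of test cases whose label is predicted correctly."""
    hits = sum(knn_predict(case_base, x, k) == y for x, y in test_set)
    return hits / len(test_set)


def mean_accuracy(data, runs=25, k=3, seed=0):
    """Average test accuracy over repeated random half/half splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        scores.append(knn_accuracy(shuffled[:half], shuffled[half:], k))
    return sum(scores) / runs


# Toy one-feature data with two well separated classes.
data = ([((float(i),), "X") for i in range(10)]
        + [((float(i),), "Y") for i in range(100, 110)])
avg = mean_accuracy(data, runs=25, k=1)
```

To compare selection heuristics in this style, the case base passed to `knn_accuracy` would be the labelled set produced by each heuristic rather than a random half of the data.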
