Metamorphic Exploration of an Unsupervised Clustering Program


1 Metamorphic Exploration of an Unsupervised Clustering Program
Sen Yang, Dave Towey
School of Computer Science, University of Nottingham Ningbo China, Zhejiang, People’s Republic of China
Zhi Quan Zhou
Institute of Cybersecurity and Cryptology, University of Wollongong, Wollongong, NSW 2522, Australia

2 Acknowledgements
UNNC FoSE
UNNC IDIC
… and others
Thank you to Sen for drafting the PPTs

3 UNNC China UK Malaysia

4 UNNC University of Nottingham Ningbo China
First Sino-foreign university
Established in 2004
English Medium of Instruction (EMI)
About 8,000 students (~10% international)
More than 750 staff (academic and professional)
From more than 70 countries and regions around the world
“An innovation, and centre for innovation”

5 Background I
The oracle problem makes it difficult to test Machine Learning (ML) software
Because of ML’s popularity, there are also many (potential) ML users who are not experts
Most related research is about supervised ML

6 Background II
Evaluating the quality of clustering algorithms can be challenging
No labels are available for validation
Users’ subjective expectations matter a lot
Metamorphic Testing (MT) can alleviate the oracle problem

7 Background III
ML has been applied in numerous industrial domains, including: medicine, the economy, automated driving, decision making, …
ML algorithms: supervised ML (classification problems) and unsupervised ML (clustering problems)

8 Metamorphic Testing
The idea:
It is possible to identify relationships amongst multiple inputs and outputs of the software under test (SUT), even if we don’t know the correctness of individual outputs
Metamorphic Relations (MRs) are necessary properties of the SUT
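
As an illustration of the idea (not taken from the slides), the sketch below checks a textbook metamorphic relation, sin(x) = sin(π − x), without knowing the correct value of any individual output. The helper name `sine_symmetry_holds` and the faulty function `buggy_sin` are hypothetical and exist only for this example.

```python
import math


def sine_symmetry_holds(f, x, tol=1e-9):
    """Check the relation f(x) == f(pi - x) for one source input x."""
    source_output = f(x)                # source test case
    follow_up_output = f(math.pi - x)   # follow-up test case
    return abs(source_output - follow_up_output) <= tol


def buggy_sin(x):
    """Hypothetical faulty implementation, used only for illustration."""
    return math.sin(x) + 0.01 * x


inputs = [0.1, 1.0, 2.5, 10.0]
# The relation is checked on pairs of outputs; no expected value is needed.
print([sine_symmetry_holds(math.sin, x) for x in inputs])
print([sine_symmetry_holds(buggy_sin, x) for x in inputs])
```

A violated relation signals a problem even though no single output was compared against a known-correct answer.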

9 Metamorphic Exploration
Recently, MRs have been applied to enhance system understanding and use
These “MRs” need not be necessary properties for software correctness
They can be hypothesised by the users
We call them “Hypothesised MRs” (HMRs)

10 Our Study
A case study of metamorphic exploration using a clustering program
Weka, one of the most popular data science platforms for ML and data mining
K-means clustering algorithm

11 K-means Clustering Algorithm
Attempts to (iteratively) partition a dataset into K distinct, non-overlapping subgroups (clusters), where each data point belongs to one and only one group
Let X = {x_i}, i = 1, …, n, be the set of n d-dimensional points to be clustered into a set of K clusters, C = {c_k}, k = 1, …, K. Let μ_k be the mean of cluster c_k.
Aims to minimize the following cost function (the sum of squared errors):
$$J(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \lVert x_i - \mu_k \rVert^2$$

12 K-means Clustering Algorithm
Fig. 1. Basic steps of K-means algorithm (k = 2). (a) Original dataset. (b) Random initial cluster centroids. (c-f) Illustration of running two iterations.
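
The steps in Fig. 1 can be sketched in a few lines of plain Python. This is a minimal illustration, not Weka's implementation: the function names, the seeded random choice of initial centroids, and the stopping rule (stop when assignments no longer change) are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=10, max_iter=100):
    """Toy K-means; returns (labels, centroids, sum of squared errors)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed: random initial centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids, sse = kmeans(points, k=2)
print(labels, centroids, sse)
```

The returned `sse` is the cost that the algorithm tries to minimize; the loop only guarantees a local minimum of that cost.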

13 Hypothesised MRs
HMR1: Translation of 2D points along a line parallel to the x- or y-axis should not have an impact on the clustering results
HMR2: Adding a duplicate point should not have an impact on the clustering results
HMR3: Moving an existing point towards the cluster center should not have an impact on the clustering results
HMR4: Adding a dimension in which all points have an equal value should not have an impact on the clustering results
HMR5: Swapping the x- and y-coordinates of each and every point (which is essentially conducting a geometric transformation on the existing points) should not have an impact on the clustering results
HMR6: Using the same set of points while changing their input order to the SUT should not have an impact on the clustering results
HMR7: In 2D, flipping the data points along one (x- or y-) axis should not have an impact on the clustering results
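
One way to picture how such HMRs can be checked is sketched below. Everything here is an assumption made for illustration: the toy clusterer `kmeans_first_k` (which takes the first k points as initial centroids, unlike Weka), the transformation helpers, and the four-point dataset. HMR3 is left out because it needs the centroids of the source run. Each transformation returns the follow-up points plus an index map, so that labels can be compared point by point regardless of cluster naming.

```python
def kmeans_first_k(points, k, max_iter=100):
    """Toy K-means. ASSUMPTION: the first k points are the initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def same_partition(labels_a, labels_b):
    """True if both labelings group the points identically, ignoring cluster names."""
    return (len(labels_a) == len(labels_b)
            and len(set(zip(labels_a, labels_b))) == len(set(labels_a)) == len(set(labels_b)))


def identity_map(points):
    return list(range(len(points)))


# Each helper returns (follow-up points, index of each source point in the follow-up).
def hmr1_translate(points, dx=5.0, dy=0.0):
    return [(x + dx, y + dy) for x, y in points], identity_map(points)

def hmr2_duplicate(points, i=0):
    return points + [points[i]], identity_map(points)

def hmr4_constant_dimension(points, value=1.0):
    return [p + (value,) for p in points], identity_map(points)

def hmr5_swap_xy(points):
    return [(y, x) for x, y in points], identity_map(points)

def hmr6_reorder(points, order=(0, 2, 1, 3)):
    order = list(order)  # order[j] is the source index placed at position j
    return [points[i] for i in order], [order.index(i) for i in range(len(points))]

def hmr7_flip_x(points):
    return [(-x, y) for x, y in points], identity_map(points)


def hmr_holds(cluster, points, k, transform):
    """Run source and follow-up test cases and compare the resulting partitions."""
    source_labels = cluster(points, k)
    follow_up, index_map = transform(points)
    follow_labels = cluster(follow_up, k)
    return same_partition(source_labels, [follow_labels[j] for j in index_map])


points = [(0, 0), (0, 1), (4, 0), (4, 1)]
transforms = {"HMR1": hmr1_translate, "HMR2": hmr2_duplicate,
              "HMR4": hmr4_constant_dimension, "HMR5": hmr5_swap_xy,
              "HMR6": hmr6_reorder, "HMR7": hmr7_flip_x}
results = {name: hmr_holds(kmeans_first_k, points, 2, t) for name, t in transforms.items()}
print(results)
```

On this toy dataset and toy clusterer, the sketch reports HMR2 and HMR6 as violated and the others as holding. This is not a reproduction of the Weka experiment; it only shows the mechanics of comparing a source run with a follow-up run.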

15 Set-up
Applied the approach to the implementation of the K-means clustering algorithm in Weka 3.8.3
Source test case (a single test case): Iris.2D.arff data
Default settings, with K=5 and S=10
Follow-up test cases were constructed manually

16 (Preliminary) Results

17 (Preliminary) Results

18 Violation of HMR2
Fig. 2. Clustering results of source test data
Fig. 3. Clustering results of HMR2

19 (Preliminary) Results

20 (Preliminary) Results

21 Violation of HMR6
Fig. 2. Clustering results of source test data
Fig. 4. Clustering results of HMR6

22 Analysis & Discussion
Adding a new data point triggers a new round of Euclidean distance calculations when the cluster centroids are relocated
It also influences the selection of the initial cluster centers
Changing the order in which the data are entered affects the clustering results
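
A possible mechanism behind the order effect can be sketched as follows. The sketch assumes that initial centroids are picked by seeded random index; this is an assumption for illustration and is not confirmed by the slides as Weka's exact procedure. The function name `pick_initial_centroids` is hypothetical.

```python
import random


def pick_initial_centroids(points, k, seed=10):
    """Pick k initial centroids by seeded random index (an assumed strategy)."""
    rng = random.Random(seed)
    indices = rng.sample(range(len(points)), k)
    return indices, [points[i] for i in indices]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
reordered = list(reversed(points))

idx_a, init_a = pick_initial_centroids(points, k=2)
idx_b, init_b = pick_initial_centroids(reordered, k=2)
print(idx_a == idx_b)   # same seed and same dataset size: same indices
print(init_a, init_b)   # but different points sit at those indices
```

Under this assumption, a fixed seed fixes the chosen positions, not the chosen points, so reordering the input changes the starting centroids. Appending a point changes the dataset size, which may change the drawn positions as well.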

23 Analysis & Discussion
This reflects a characteristic of the K-means clustering algorithm itself, and is not a defect in the implementation
The K-means algorithm is sensitive to the selection of the initial cluster centroids
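
The sensitivity to initial centroids can be seen on a tiny example. The sketch below is a toy K-means that takes the initial centroids as an explicit argument (an assumption made so that the effect is easy to control); the dataset and function names are illustrative only.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iter=100):
    """Toy K-means started from explicit centroids; returns (labels, sse)."""
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, sse


# Four corners of a 4-by-1 rectangle.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

labels_a, sse_a = kmeans(points, [(0, 0), (0, 1)])  # both starts on the left edge
labels_b, sse_b = kmeans(points, [(0, 0), (4, 0)])  # one start on each side
print(labels_a, sse_a)  # [0, 1, 0, 1] 16.0  (bottom row vs. top row)
print(labels_b, sse_b)  # [0, 0, 1, 1] 1.0   (left pair vs. right pair)
```

Both runs converge, but to different partitions with different costs, which is consistent with the slide's point that the behaviour comes from the algorithm rather than from a defect in a particular implementation.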

24 Analysis & Discussion

25 Recommendation
If the user needs to add new data containing some duplicates to the original dataset, then it is better to choose a hierarchical clustering algorithm, to keep clustering performance stable
To reproduce an earlier test result, keep the original order of the data points when they are entered into the system

26 Sen’s Conclusion
Compared with our understanding of the SUT before the experiment, we have now gained new knowledge and understanding of the system
Metamorphic Exploration (ME) can be used to help users explore the system
ME can be used to guide users to better use the algorithm

27 Discussion & Future Work
Opportunity to embrace ME as a step towards MT
Undergraduate curricula, and elsewhere
Metamorphic Exploration (Journal First): Thursday 30 May 15:10, Van-Horne

28 Thank you!

29 Q & A

