Metamorphic Exploration of an Unsupervised Clustering Program

Sen Yang, Dave Towey
School of Computer Science, University of Nottingham Ningbo China, Zhejiang, People’s Republic of China
Zhi Quan Zhou
Institute of Cybersecurity and Cryptology, University of Wollongong, Wollongong, NSW 2522, Australia

Acknowledgements
UNNC FoSE, UNNC IDIC, … and others
Thank you to Sen for drafting the slides

UNNC China UK Malaysia

UNNC: University of Nottingham Ningbo China
First Sino-foreign university, established in 2004
English Medium of Instruction (EMI)
About 8,000 students, ~10% international
More than 750 staff (academic and professional), from more than 70 countries and regions around the world
“An innovation, and centre for innovation”

Background I
The oracle problem makes it difficult to test Machine Learning (ML) software
Because of ML’s popularity, there are also many (potential) ML users who are not experts
Most related research is about supervised ML

Background II
Evaluating the quality of clustering algorithms can be challenging
No labels for validation
Users’ subjective expectations matter a lot
Metamorphic Testing (MT) can alleviate the oracle problem

Background III
ML has been applied in numerous industrial domains, including: Medical, Economy, Automated driving, Decision making, …
ML algorithms: Supervised ML (classification problem) and Unsupervised ML (clustering problem)

Metamorphic Testing
The idea: it is possible to identify relationships amongst multiple inputs and outputs of a software under test (SUT), even if we don’t know the correctness of individual outputs
Metamorphic Relations (MRs) are necessary properties of the SUT
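As a minimal illustration of the idea, the sketch below checks a textbook MR that is not part of this study: sin(x) = sin(π − x). The function name `check_sine_mr` and the sample inputs are hypothetical. The point is that the relation between two outputs can be checked without knowing the correct value of either one.

```python
import math

def check_sine_mr(x: float) -> bool:
    """Check the MR sin(x) == sin(pi - x) for one source input x."""
    source_output = math.sin(x)               # output for the source test case
    follow_up_output = math.sin(math.pi - x)  # output for the follow-up test case
    # Compare the two outputs with each other; no oracle for sin(x) is needed.
    return math.isclose(source_output, follow_up_output,
                        rel_tol=1e-9, abs_tol=1e-12)

# Hypothetical source inputs, chosen only for illustration.
results = [check_sine_mr(x) for x in (0.0, 0.3, 1.0, 2.5, -4.0)]
print(results)
```

A violation of a necessary MR like this one would reveal a fault, even though no individual output was ever checked against an expected value.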

Metamorphic Exploration
Recently, MRs have been applied to enhance system understanding and use
These “MRs” need not be necessary properties for software correctness
They can be hypothesized by the users
Call them “Hypothesized MRs” (HMRs)

Our Study
A case study of metamorphic exploration using a clustering program
Weka, one of the most popular data science platforms for ML and data mining
K-means Clustering Algorithm

K-means Clustering Algorithm
Attempts to (iteratively) partition a dataset into K distinct, non-overlapping subgroups (clusters), where each data point belongs to one and only one group
Let X = {x_i}, i = 1,…,n be the set of n d-dimensional points to be clustered into a set of K clusters, C = {c_k}, k = 1,…,K. Let μ_k be the mean of cluster c_k.
Aims to minimize the following cost function (sum of the squared error):
J(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \lVert x_i - \mu_k \rVert^2
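The sum of squared errors described above can be computed directly from a given partition. The following is a minimal pure-Python sketch; the function names `centroid` and `sse` and the toy data are assumptions made for illustration, not taken from the study.

```python
from typing import List, Sequence

Point = Sequence[float]

def centroid(points: List[Point]) -> List[float]:
    """Mean of a non-empty cluster, computed coordinate by coordinate."""
    dims = len(points[0])
    return [sum(p[j] for p in points) / len(points) for j in range(dims)]

def sse(clusters: List[List[Point]]) -> float:
    """Sum over clusters of squared Euclidean distances to the cluster mean."""
    total = 0.0
    for cluster in clusters:
        mu = centroid(cluster)
        for x in cluster:
            total += sum((a - b) ** 2 for a, b in zip(x, mu))
    return total

# Toy partition with K = 2 (hypothetical data).
clusters = [
    [(0.0, 0.0), (2.0, 0.0)],      # mean (1, 0): squared distances 1 + 1
    [(10.0, 10.0), (10.0, 12.0)],  # mean (10, 11): squared distances 1 + 1
]
print(sse(clusters))  # 4.0
```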

K-means Clustering Algorithm
Fig. 1. Basic steps of K-means algorithm (k = 2). (a) Original dataset. (b) Random initial cluster centroids. (c-f) Illustration of running two iterations.
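The steps in Fig. 1 (assign each point to its nearest centroid, then recompute the centroids, and repeat) can be sketched as below. This is a minimal pure-Python version of the standard iteration, not Weka's implementation. To keep it deterministic, the initial centroids are passed in explicitly rather than chosen at random, and the names `assign`, `update` and `kmeans` are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]

def assign(points: List[Point], centroids: List[List[float]]) -> List[int]:
    """Label each point with the index of its nearest centroid."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))  # ties go to the lowest index
    return labels

def update(points, labels, centroids) -> List[List[float]]:
    """Recompute each centroid as the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_centroids.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centroids.append(list(old))  # assumption: keep an empty cluster's centroid
    return new_centroids

def kmeans(points, init_centroids, max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Iterate assignment and update steps until the labels stop changing."""
    centroids = [list(c) for c in init_centroids]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Toy data with k = 2, using the first two points as initial centroids.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = kmeans(points, init_centroids=points[:2])
print(labels, centroids)  # [0, 0, 1, 1] [[0.0, 0.5], [10.0, 10.5]]
```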

Hypothesised MRs
HMR1: Translation of 2D points along a line parallel to the x- or y-axis should not have an impact on the clustering results
HMR2: Adding a duplicate point should not have an impact on the clustering results
HMR3: Moving an existing point towards the cluster center should not have an impact on the clustering results
HMR4: Adding a dimension in which all points have an equal value should not have an impact on the clustering results
HMR5: Swapping the x- and y-coordinates of each and every point (which is essentially conducting a geometric transformation on the existing points) should not have an impact on the clustering results
HMR6: Using the same set of points while changing their input order to the SUT should not have an impact on the clustering results
HMR7: In 2D, flipping the data points along one (x- or y-) axis should not have an impact on the clustering results
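Each HMR above corresponds to a transformation that turns a source test case into a follow-up test case. The sketch below shows one possible encoding of these transformations in Python; in the study the follow-up test cases were constructed manually, so all function names here are hypothetical. Reading "flipping along an axis" as negating one coordinate is likewise an assumption.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def translate(points: List[Point], dx: float = 0.0, dy: float = 0.0) -> List[Point]:
    """HMR1: shift every 2D point parallel to the x- and/or y-axis."""
    return [(x + dx, y + dy) for x, y in points]

def add_duplicate(points: List[Point], index: int) -> List[Point]:
    """HMR2: append a copy of an existing point."""
    return list(points) + [points[index]]

def move_towards(points: List[Point], index: int, centre: Sequence[float],
                 fraction: float = 0.5) -> List[Point]:
    """HMR3: move one point part of the way towards its cluster centre."""
    moved = tuple(a + fraction * (c - a) for a, c in zip(points[index], centre))
    return list(points[:index]) + [moved] + list(points[index + 1:])

def add_constant_dimension(points: List[Point], value: float = 0.0) -> List[Point]:
    """HMR4: add a dimension in which all points share the same value."""
    return [tuple(p) + (value,) for p in points]

def swap_xy(points: List[Point]) -> List[Point]:
    """HMR5: swap the x- and y-coordinates of every 2D point."""
    return [(y, x) for x, y in points]

def reorder(points: List[Point], seed: int = 0) -> List[Point]:
    """HMR6: same set of points, different input order."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def flip(points: List[Point], negate: str = "x") -> List[Point]:
    """HMR7 (assumed reading): mirror 2D points by negating one coordinate."""
    return [(-x, y) if negate == "x" else (x, -y) for x, y in points]

def same_partition(labels_a: List[int], labels_b: List[int]) -> bool:
    """True if two labelings group the points identically, ignoring label names."""
    if len(labels_a) != len(labels_b):
        return False
    forward, backward = {}, {}
    for a, b in zip(labels_a, labels_b):
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True
```

`same_partition` expects both labelings to list the points in the same order, so for HMR6 the follow-up labels would need to be mapped back to the source order first. For HMR2, only the labels of the original points would be compared.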

Set-up
Applied the approach to the implementation of the K-means clustering algorithm in Weka 3.8.3
Source test case (a single test case): the Iris.2D.arff data
Default settings, with K=5, S=10
Follow-up test cases were constructed manually

(Preliminary) Results

Violation of HMR2
Fig. 2. Clustering results of source test data
Fig. 3. Clustering results of HMR2

(Preliminary) Results

Violation of HMR6
Fig. 2. Clustering results of source test data
Fig. 4. Clustering results of HMR6

Analysis & Discussion
Adding a new data point triggers a new round of Euclidean distance calculations when the cluster centroids are re-located
It also influences the selection of the initial cluster centroids
Changing the input order of the data has an impact on the clustering results

Analysis & Discussion
This reflects a characteristic of the K-means clustering algorithm itself, and is not a defect in the implementation
The K-means algorithm is sensitive to the selection of the initial cluster centroids
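The sensitivity to initial centroids, and hence to input order, can be reproduced on a toy example. The sketch below assumes a deliberately simple seeding rule (the first K points in input order become the initial centroids), which is not claimed to be what Weka does. The data and function names are hypothetical. Under that assumption, reordering the same six points yields a different final partition, mirroring the HMR6 violation.

```python
from typing import List, Set, FrozenSet, Tuple

Point = Tuple[float, ...]

def assign(points: List[Point], centroids: List[List[float]]) -> List[int]:
    """Label each point with the index of its nearest centroid."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels

def kmeans_first_k(points: List[Point], k: int, max_iter: int = 100) -> List[int]:
    """K-means seeded with the first k points in input order (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels

def partition(points: List[Point], labels: List[int]) -> Set[FrozenSet[Point]]:
    """Order-independent view of a clustering: a set of point groups."""
    groups = {}
    for p, label in zip(points, labels):
        groups.setdefault(label, set()).add(p)
    return {frozenset(g) for g in groups.values()}

# Source test case and a reordered follow-up test case (same set of points).
source = [(0.0,), (1.0,), (10.0,), (11.0,), (20.0,), (21.0,)]
follow_up = [(20.0,), (21.0,), (0.0,), (1.0,), (10.0,), (11.0,)]

partition_a = partition(source, kmeans_first_k(source, k=2))
partition_b = partition(follow_up, kmeans_first_k(follow_up, k=2))
print(partition_a == partition_b)  # False: the input order changed the clusters
```

In this run the source order groups {0, 1} against {10, 11, 20, 21}, while the reordered input groups {0, 1, 10, 11} against {20, 21}. Both are valid local optima of the same cost function, which is why the difference reflects the algorithm rather than a bug.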

Analysis & Discussion

Recommendation
If the user needs to add new data containing some duplicates to the original dataset, then it is better to choose a hierarchical clustering algorithm, to keep the clustering performance stable
To reproduce an earlier test result, keep the original order in which the data points were entered into the system

Sen’s Conclusion
Compared with our understanding of the SUT before the experiment, we have now gained new knowledge and understanding of the system
Metamorphic Exploration (ME) can help users explore the system
ME can guide users to make better use of the algorithm

Discussion & Future Work
Opportunity to embrace ME as a step towards MT
Undergraduate curricula, and elsewhere
Metamorphic Exploration (Journal First): Thursday 30 May 15:10, Van-Horne

Thank you!
