An Investigation of Subspace Outlier Detection Alex Wiegand Supervisor: Jiuyong Li.


Contents
– Research Question
– Introduction to Outliers
– The problem with many dimensions
– Subspace outlier detection
– Techniques considered
– Evaluation
– Achievements
– New framework
– Left to Do
– End

Research Question
What is the best way to find outliers in high-dimensional data?

DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. DO NOT REMOVE THIS NOTICE.

The attributes-are-dimensions metaphor
– Central to the concept of outliers
– Each attribute is considered to be a dimension
– A database is a dataset
– Each object, or tuple, is a point
– The schema is a space
– So finding unusual objects is a geometric problem

Outliers
– “an outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.” (Hawkins, 1980)
– Outliers point to interesting phenomena
– Being able to explain outliers adds strength to a model
– Outliers can signify important events
  – network intrusions
  – credit card fraud
  – disease outbreaks

The Low Dimensional Case
– For 1 to about 4 attributes, outlier detection is a solved problem
– A few techniques exist
– The most popular is LOF (Local Outlier Factor)
– But they become less reliable as the number of dimensions increases
  – Because of the curse of dimensionality
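To make the LOF idea concrete, here is a minimal brute-force sketch that follows the standard definition (k-distance, reachability distance, local reachability density). The function name `lof_scores` and the toy data are made up for illustration; this is not the implementation used in the project.

```python
import math

def lof_scores(points, k):
    """Brute-force Local Outlier Factor for a small list of points.

    Assumes fewer than k exact duplicates of any point (otherwise a
    reachability sum can be zero).
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_distance = []     # distance from each point to its k-th nearest neighbour
    neighbours = []     # indices within that distance (ties included)
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][others[k - 1]]
        k_distance.append(kd)
        neighbours.append([j for j in others if dist[i][j] <= kd])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average density of the neighbours relative to the point's own density.
    return [
        sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]

# Toy data (hypothetical): a 3x3 grid of "normal" points plus one far-away point.
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
data = grid + [(10.0, 10.0)]
scores = lof_scores(data, k=3)
for point, score in zip(data, scores):
    print(point, round(score, 2))
```

Points inside the grid get scores close to 1, while the isolated point gets a much larger score, which is how LOF flags it.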

The Curse of Dimensionality
– Consider a dataset with d dimensions
– For any three points, let a, b and c be the distances between them
– As d → ∞, a, b and c → ∞, but a/b → 1 and a/c → 1
– i.e. the distances become more similar as the number of dimensions increases
– This happens under most common conditions
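The effect is easy to see empirically. The sketch below assumes uniformly random points in the unit cube (one of the "common conditions", chosen here only for convenience) and measures how much farther the farthest point is than the nearest one, relative to the nearest. The function name `relative_contrast` and its parameters are illustrative.

```python
import math
import random

def relative_contrast(d, n=200, seed=0):
    """(farthest - nearest) / nearest distance from one query point to
    n uniform random points in the d-dimensional unit cube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(d)]
    dists = []
    for _ in range(n):
        p = [rng.random() for _ in range(d)]
        dists.append(math.dist(query, p))
    return (max(dists) - min(dists)) / min(dists)

# The contrast shrinks towards 0 as d grows, i.e. distance ratios approach 1.
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 3))
```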

The Curse of Dimensionality
– So every pair of points ends up roughly the same distance apart: some large distance h
– Traditional approaches can't find outliers
  – no points are relatively far away
– But what if there are unusual points to be found, and some attributes are distracting us?

Subspaces
– A space within another space
– Fewer dimensions
  – e.g. a 2D plane crossing 3D space
– Can be created by selecting a subset of the set of attributes
  – This is called feature selection
  – Equivalent to database projection

Subspace Outlier Detection
– Outlier detection in the subspaces
– Actually looking for “subspace outliers”
  – “point x is an outlier in subspace S”
  – i.e. object x has unusual values for some attributes
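A small sketch of what "outlier in subspace S" can mean in practice. The data is invented: attribute 0 is tightly clustered except for the last point, while attributes 1 to 3 are widely spread and act as distractions. The isolation measure (mean distance to all other points) is a deliberately crude stand-in, not one of the algorithms discussed in this talk, and the helper names are hypothetical.

```python
import math

def project(points, attrs):
    """Keep only the chosen attributes (column indices): a database projection."""
    return [tuple(p[a] for a in attrs) for p in points]

def most_isolated(points):
    """Index of the point with the largest mean distance to all other points."""
    def mean_dist(i):
        others = [q for j, q in enumerate(points) if j != i]
        return sum(math.dist(points[i], q) for q in others) / len(others)
    return max(range(len(points)), key=mean_dist)

# Ten ordinary rows: attribute 0 stays near 5, attributes 1..3 are scattered.
data = [(5.0 + 0.1 * i, (37 * i) % 100, (53 * i) % 100, (71 * i) % 100)
        for i in range(10)]
# One row with an unusual value in attribute 0 only.
data.append((9.0, 50, 50, 50))

full_space_pick = most_isolated(data)
subspace_pick = most_isolated(project(data, [0]))
print("most isolated in the full space:", full_space_pick)
print("most isolated in subspace {0}:  ", subspace_pick)
```

In the full space the scattered attributes dominate the distances, so the unusual row is not the one picked; projecting onto attribute 0 alone makes it stand out.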

Existing Techniques
– Four subspace outlier detection algorithms were looked at
  – Aggarwal Evolutionary Search
  – Subspace Outlier Degree (SOD)
  – Lazarevic Feature Bagging (LazFB)
  – Most Interesting Subspace Top N Outlier Detection (MOIS)
– Not much evaluation done
  – Three of them have results for some test data in the papers that define them
  – Different test data for each one
  – No comparisons between them

Distance Metrics
– The usual distance is Euclidean distance
– A couple of other distance metrics have been found to increase the contrast between distances when there are many dimensions
  – Nearest neighbour ranking: dist(x, y) = k, where y is the k-th nearest point to x
  – Fractional L_p norm: $\mathrm{dist}(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$, where 0 < p < 1
– These have not been tried in outlier detection
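Both alternative metrics are short to write down. The sketch below uses the standard fractional L_p formula and a straightforward reading of nearest neighbour rank (one plus the number of other points strictly closer to x than y is). The function names, the index-based signature of `nn_rank` and the toy points are assumptions for illustration.

```python
import math

def fractional_lp(x, y, p=0.5):
    """Fractional L_p distance: (sum_i |x_i - y_i|^p)^(1/p) with 0 < p < 1."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def nn_rank(points, i, j, base=math.dist):
    """Rank k such that points[j] is the k-th nearest point to points[i].

    Computed as 1 + the number of other points strictly closer to points[i].
    Note that this is not symmetric in i and j.
    """
    d_ij = base(points[i], points[j])
    closer = sum(
        1 for m, q in enumerate(points)
        if m != i and m != j and base(points[i], q) < d_ij
    )
    return 1 + closer

pts = [(0, 0), (1, 0), (3, 0), (7, 0)]
print(fractional_lp((0, 0), (1, 1)))   # (1 + 1) ** 2
print(nn_rank(pts, 3, 2))              # (3, 0) is the nearest point to (7, 0)
print(nn_rank(pts, 2, 3))              # but (7, 0) is only third nearest to (3, 0)
```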

Research Plan
– Compare
  – MOIS
  – Lazarevic Feature Bagging
  – Two benchmark algorithms (these are non-subspace outlier detection algorithms)
    – LOF
    – Distance-based outlier
– Try new distance metrics
  – nearest neighbour rank
  – fractional Lp norm
  – Use Lazarevic Feature Bagging and LOF
  – replace the Euclidean distance function with the chosen metric
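The "replace the Euclidean distance function" step can be sketched as a detector that takes the metric as a parameter. The code below is a simplified sketch in the spirit of feature bagging: each round draws a random attribute subset, scores every point in that subspace, and the per-round scores are summed. Two loud assumptions: the base detector is the distance to the k-th nearest neighbour, swapped in for LOF to keep the example short, and the subset-size range (between d/2 and d-1 attributes) and the summing rule reflect my reading of the feature bagging paper and should be checked against it. All names are hypothetical.

```python
import math
import random

def kth_nn_distance(points, k, metric):
    """Base detector (a stand-in for LOF): distance to the k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        ds = sorted(metric(p, q) for j, q in enumerate(points) if j != i)
        scores.append(ds[k - 1])
    return scores

def feature_bagging(points, rounds=10, k=3, metric=math.dist, seed=0):
    """Sum base-detector scores over randomly chosen attribute subsets."""
    rng = random.Random(seed)
    d = len(points[0])
    total = [0.0] * len(points)
    for _ in range(rounds):
        size = rng.randint(max(1, d // 2), max(1, d - 1))
        attrs = rng.sample(range(d), size)
        projected = [tuple(p[a] for a in attrs) for p in points]
        for i, s in enumerate(kth_nn_distance(projected, k, metric)):
            total[i] += s
    return total

def half_norm(x, y):
    """Fractional L_p distance with p = 0.5, used as a drop-in metric."""
    return sum(abs(a - b) ** 0.5 for a, b in zip(x, y)) ** 2

# Toy data: 20 ordinary rows plus one row that is far away in every attribute.
data = [(i % 3, (i // 3) % 3, i % 2, i % 5) for i in range(20)] + [(10, 10, 10, 10)]

euclid_scores = feature_bagging(data)
frac_scores = feature_bagging(data, metric=half_norm)
print(euclid_scores.index(max(euclid_scores)), frac_scores.index(max(frac_scores)))
```

Swapping the metric is a one-argument change, which is the kind of variation the research plan calls for.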

Evaluation
– ROC curve
  – Find a parameter that controls sensitivity
    – number of reported outliers (positives) per number of points
  – Run the algorithm for many values of that parameter
  – Draw a scatter plot of the true positive rate against the false positive rate
  – Connect the dots
– The area under the curve (AUC) measures the quality of the algorithm for the test data set
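The procedure can be sketched in a few lines. Here the sensitivity parameter is the number of reported outliers: points are reported in order of decreasing score, each step adds one ROC point, and the area is computed with the trapezoid rule. The function names and the toy scores are illustrative, and ties between scores are not handled specially.

```python
def roc_points(scores, labels):
    """ROC points as the number of reported outliers grows from 0 to n.

    labels[i] is True when point i is a real outlier. Returns a list of
    (false positive rate, true positive rate) pairs.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    positives = sum(1 for flag in labels if flag)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the connected dots (trapezoid rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical detector output: higher score means "more outlying".
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [True, False, True, False, False, False]
curve = roc_points(scores, labels)
print(curve)
print(auc(curve))
```

An AUC of 1.0 means every real outlier is ranked above every normal point; 0.5 is what a random ranking would achieve on average.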

Achievements
– Implementation
– New framework

Implementation
– Only one existing algorithm, MOIS, had an available implementation
  – The others need new implementations
– Implementing all the algorithms involves much repetition
  – Loading datasets
  – Accessing data points
  – Calculating distances
– A system for code reuse is desirable
– Variations of the algorithms must be easy to create
  – For any improved algorithms I design
– Running many tests should be easy

Framework
– I decided to use a software framework
  – Standardised API for algorithms
  – Inversion of control
    – User commands framework
    – Framework commands algorithms
– Existing frameworks considered
  – Weka
  – RapidMiner
  – ELKI
– Weka and ELKI (0.1) don't natively support outlier detection algorithms
– RapidMiner carries a high implementation overhead
  – Due to its architecture
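Inversion of control with a standardised API can be illustrated as follows. This is a sketch in Python rather than the project's actual code; the class and function names (`OutlierAlgorithm`, `DistanceFromMean`, `run_experiments`) are hypothetical, and the single algorithm shown is a trivial placeholder.

```python
import math
from abc import ABC, abstractmethod

class OutlierAlgorithm(ABC):
    """Hypothetical standard interface that every algorithm implements."""
    name = "unnamed"

    @abstractmethod
    def score(self, points):
        """Return one outlier score per point (higher means more unusual)."""

class DistanceFromMean(OutlierAlgorithm):
    """Placeholder algorithm: distance of each point from the dataset mean."""
    name = "distance-from-mean"

    def score(self, points):
        d = len(points[0])
        centre = [sum(p[a] for p in points) / len(points) for a in range(d)]
        return [math.dist(p, centre) for p in points]

def run_experiments(algorithms, datasets):
    """The framework owns the loop: the user asks for a run, and the
    framework calls each algorithm on each dataset."""
    results = {}
    for dataset_name, points in datasets.items():
        for algorithm in algorithms:
            results[(dataset_name, algorithm.name)] = algorithm.score(points)
    return results

results = run_experiments(
    [DistanceFromMean()],
    {"toy": [(0, 0), (0, 1), (1, 0), (10, 10)]},
)
print(results)
```

Because algorithms never load data or drive the test loop themselves, adding a new algorithm only means writing one `score` method.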

Framework
– Looking for the quickest way to implement the algorithms
– The drawbacks mentioned made those frameworks unsuitable for my task
  – Unless they were extended
– My decision: create a new framework just for subspace outlier detection

The New Framework
– Some design decisions
  – Use of the Weka API
    – to use functionality already available in Weka
  – Interactive command-line interface
    – scriptability
    – e.g. easy to run tests on an arbitrary set of datasets
  – Inheritance-friendly design
    – for quick creation of modified algorithms, metrics and data structures
  – The Metric class
    – some functions are implemented as subclasses of Metric
    – makes those functions easier to replace with new ones
    – used for distance metrics
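The idea behind the Metric class can be sketched like this. The real framework builds on the Weka API; the Python classes below, including the `distance` method name and the `SubspaceMetric` wrapper, are illustrative assumptions rather than the framework's actual design.

```python
import math
from abc import ABC, abstractmethod

class Metric(ABC):
    """Base class: anything that can measure the distance between two points."""

    @abstractmethod
    def distance(self, x, y):
        """Return the distance between points x and y."""

class Euclidean(Metric):
    def distance(self, x, y):
        return math.dist(x, y)

class SubspaceMetric(Metric):
    """Wraps another metric and only looks at the chosen attributes."""

    def __init__(self, base, attrs):
        self.base = base
        self.attrs = list(attrs)

    def distance(self, x, y):
        px = [x[a] for a in self.attrs]
        py = [y[a] for a in self.attrs]
        return self.base.distance(px, py)

def nearest(points, i, metric):
    """Index of the point closest to points[i] under the given metric."""
    candidates = [j for j in range(len(points)) if j != i]
    return min(candidates, key=lambda j: metric.distance(points[i], points[j]))

pts = [(0, 0), (3, 4), (1, 9)]
print(Euclidean().distance(pts[0], pts[1]))
print(nearest(pts, 0, Euclidean()))
print(nearest(pts, 0, SubspaceMetric(Euclidean(), [0])))
```

Code such as `nearest` only depends on the `Metric` interface, so a new metric is a new subclass and nothing else has to change.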

Left to Do
– Complete LOF implementation
– Complete Lazarevic Feature Bagging implementation
– Implement new distance metrics
– Run tests
– Analyse results

Thank you for listening
Questions?
