Dr. Hongqin FAN Department of Building and Real Estate


Resolution-based outlier factor and outlier mining algorithm for engineering applications
Dr. Hongqin FAN
Department of Building and Real Estate, The Hong Kong Polytechnic University, Hong Kong SAR, China
Monday, 26 September 2016

Outline
1. Introduction
2. Resolution-Based (RB) outlier
3. RB-outlier mining algorithm
4. Engineering applications
5. Discussions
6. Conclusions

1. Introduction
Outlier mining aims to identify observations that deviate from the majority of the data or from local data clusters.
Outliers often represent observations of interest:
- Fraudulent transactions in financial applications;
- Natural disasters and climate change;
- Abnormal effects of medical treatment;
- Anomalies due to changes in system status or operations;
- Anomalies in work performance due to problematic management decisions.

1. Introduction
Opportunities in engineering applications:
- There is a clear need to identify outliers in system operations or management in real time;
- Doing so implies substantial savings in many applications.
Challenges in engineering applications:
- The system is difficult to define or describe because of its inherent complexity or complex operating environment;
- The clusters in the observations are difficult to describe;
- There is no clear demarcation between local outliers and global outliers;
- The outliers are difficult to rank effectively.

1. Introduction
Some outlier mining algorithms:
- Distance-based outlier mining algorithm (Knorr and Ng 1998)
- Local outlier mining algorithm (Breunig et al. 2000)
- Connectivity-based outlier mining algorithm (Tang et al. 2002)
Current outlier definitions and outlier mining algorithms are difficult to apply to engineering applications.

2. Resolution-Based (RB) outlier
Resolution-based outlier factor (ROF): if the resolution of a dataset changes consecutively between the maximum resolution, where all the points are non-neighbours, and the minimum resolution, where all the points are neighbours, the resolution-based outlier factor of an object is defined as the accumulated ratios of the sizes of the clusters containing this object at two consecutive resolutions.
Here r1, r2, ..., ri, ..., rR are the resolutions at each step; R is the total number of resolution change steps from Smax to Smin; ClusterSize(O, r) is the number of objects in the cluster containing object O at resolution r; and r0 is the state before the resolution scaling begins. At r0 every cluster has size 1 (i.e. one point) and the ROF of every point is 0.
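The formula shown on this slide did not survive in the transcript. One form that is consistent with the verbal definition above, and with the statement that every ROF is 0 at r0, is given below. It should be read as a reconstruction to be checked against Fan et al. (2009), not as the slide's exact notation.

```latex
\mathrm{ROF}(O) \;=\; \sum_{i=1}^{R}
  \frac{\mathrm{ClusterSize}(O, r_{i-1}) - 1}{\mathrm{ClusterSize}(O, r_{i})}
```

Under this reading, an object that stays isolated over many resolution steps contributes terms close to 0, so the objects with the smallest accumulated ROF are the strongest outlier candidates.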

2. Resolution-Based (RB) outlier
Example: cluster sizes at different resolution levels as the view gradually zooms out. ROF values are collected and accumulated at each resolution level.
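The figure for this example is not in the transcript, so the accumulation can be illustrated with a small sketch instead. It assumes that each step adds (ClusterSize(O, r_{i-1}) - 1) / ClusterSize(O, r_i), which is one reading of the "accumulated ratios" definition. The function name and the two cluster-size histories are hypothetical.

```python
from typing import Sequence


def rof_from_cluster_sizes(sizes: Sequence[int]) -> float:
    """Accumulate ROF for one object from its cluster sizes per resolution.

    sizes[0] is the size at r0 (always 1, the object alone);
    sizes[i] is the size of the object's cluster at resolution r_i.
    Assumed step term: (previous size - 1) / current size.
    """
    if not sizes or sizes[0] != 1:
        raise ValueError("sizes must start with 1 (the state at r0)")
    total = 0.0
    for prev, curr in zip(sizes, sizes[1:]):
        total += (prev - 1) / curr
    return total


# Hypothetical histories while zooming out over three resolution steps.
inlier_sizes = [1, 4, 9, 10]   # joins a cluster early and grows with it
outlier_sizes = [1, 1, 1, 10]  # stays alone until everything merges

print(rof_from_cluster_sizes(inlier_sizes))   # 0/4 + 3/9 + 8/10
print(rof_from_cluster_sizes(outlier_sizes))  # 0/1 + 0/1 + 0/10
```

With these made-up histories the late-merging object accumulates nothing, while the object that joins a cluster early accumulates a clearly larger value, which is why a low ROF marks an outlier candidate under this reading.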

3. RB-outlier mining algorithm RB-CLUSTER RB-MINE
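The slide only names the two procedures, and their pseudocode is not in the transcript. The following is a minimal sketch of the overall idea, not the authors' RB-CLUSTER or RB-MINE. It assumes that lowering the resolution can be imitated by a growing Euclidean neighbourhood radius, that clusters are the connected components of the neighbour relation, and that each step adds (previous size - 1) / current size to the ROF. All function names, the radii, and the sample points are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def cluster_sizes(points: Sequence[Point], radius: float) -> List[int]:
    """Size of each point's cluster, where points within `radius` are neighbours
    and a cluster is a connected component of the neighbour relation."""
    n = len(points)
    parent = list(range(n))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= radius:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj

    roots = [find(i) for i in range(n)]
    counts: Dict[int, int] = {}
    for r in roots:
        counts[r] = counts.get(r, 0) + 1
    return [counts[r] for r in roots]


def rb_mine_sketch(points: Sequence[Point], radii: Sequence[float],
                   top_n: int) -> Tuple[List[int], List[float]]:
    """Accumulate ROF over increasing radii and return the top_n lowest-ROF indices."""
    prev = [1] * len(points)   # state at r0: every point is its own cluster
    rof = [0.0] * len(points)
    for radius in radii:
        curr = cluster_sizes(points, radius)
        for k in range(len(points)):
            rof[k] += (prev[k] - 1) / curr[k]
        prev = curr
    ranked = sorted(range(len(points)), key=lambda k: rof[k])
    return ranked[:top_n], rof


# Hypothetical data: a tight group of four points and one distant point.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
top, scores = rb_mine_sketch(pts, radii=[0.2, 1.0, 10.0], top_n=1)
print(top, scores)
```

The pairwise loop is quadratic in the number of points, which is acceptable for illustration only. The last radius is chosen large enough that all points end up in one cluster, matching the minimum-resolution state described earlier.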

3. RB-outlier mining algorithm

3. RB-outlier mining algorithm
On a synthetic database, the RB-outlier mining algorithm is compared with two well-known outlier mining algorithms: distance-based (DB) and local outlier factor (LOF) based outlier mining.
[Figure panels: DB-outlier, LOF outlier, RB-outlier]

4. Engineering applications
Decision support in construction equipment management, using a contractor’s equipment fleet described by three attributes:
- Yearly repair and maintenance cost ($ per yr)
- Rate of charge ($ per hr)
- Age (yrs)
Some combinations of the attribute values indicate abnormal equipment behaviour. Decisions can then be made on equipment repair or disposal/replacement.

4. Engineering applications
Other applications:
- Identifying abnormal construction equipment operations on the jobsite, based on daily records;
- Identifying abnormal productivity data in construction project management to improve decisions;
- Identifying abnormal sensing data in structural health monitoring;
- Removing noisy data for improved analysis of system operations (focusing on the majority only).

5. Discussions
Pros:
- No domain-dependent input parameters;
- Handles databases with many clusters of arbitrary shapes;
- Takes both local and global features of the database into account;
- Ranks the top outliers for analysis;
- The number of outliers can be identified automatically from the trend of the ROF values (i.e. the elbow point).
Cons:
- The resolution change step size needs some trial, although no significant changes are observed once the step size is small enough;
- The attributes need to be standardized. Important attributes can be given larger weights through transformation.
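The standardization requirement in the last point can be sketched as follows. The slide does not say which transformation was used, so this assumes a plain z-score per attribute followed by a multiplicative weight. The function name, the weights, and the two equipment records are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Optional, Sequence


def standardize(rows: Sequence[Sequence[float]],
                weights: Optional[Sequence[float]] = None) -> List[List[float]]:
    """Z-score each attribute (column), then scale it by its weight.

    A weight above 1 stretches that attribute, so differences in it
    count for more in any distance-based neighbourhood test.
    """
    columns = list(zip(*rows))
    if weights is None:
        weights = [1.0] * len(columns)
    scaled_columns = []
    for column, weight in zip(columns, weights):
        mu = mean(column)
        sigma = pstdev(column)
        if sigma == 0:
            # A constant attribute carries no information; map it to 0.
            scaled = [0.0] * len(column)
        else:
            scaled = [weight * (x - mu) / sigma for x in column]
        scaled_columns.append(scaled)
    return [list(row) for row in zip(*scaled_columns)]


# Hypothetical records: (yearly repair cost $, rate of charge $/hr, age yrs).
fleet = [(1000.0, 50.0, 2.0), (3000.0, 70.0, 6.0)]
# Assumed weighting: repair cost counts twice as much as the other attributes.
print(standardize(fleet, weights=[2.0, 1.0, 1.0]))
```

Without this step the repair cost, measured in thousands of dollars, would dominate an attribute such as age, measured in single-digit years.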

6. Conclusions
- The RB-outlier, the ROF, and RB-outlier mining are all based on the concept of resolution change;
- The RB-outlier mining algorithm can be used effectively in engineering applications, which are distinct from other applications;
- The RB-outlier mining algorithm is easy to use, flexible, and performs well on both synthetic and real-life data.

Acknowledgement
Special acknowledgement to the following people:
- Professor Osmar Zaiane
- Dr. Andrew Foss
- Mr. Junfeng WU

References
Knorr E, Ng R (1998) Algorithms for mining distance-based outliers in large datasets. In: Proceedings of the 24th international conference on very large databases (VLDB), New York, USA.
Breunig M, Kriegel H, Ng R, Sander J (2000) LOF: Identifying density-based local outliers. In: Proceedings of the ACM SIGMOD international conference on management of data, Dallas.
Tang J, Chen Z, Fu AW, Cheung DW (2002) Enhancing effectiveness of outlier detections for low density patterns. In: Proceedings of the 6th Pacific-Asia conference on advances in knowledge discovery and data mining, Taipei, Taiwan, pp 535–548.
Fan H, Kim H, AbouRizk S, Han S (2008) Decision support in construction equipment management using a nonparametric outlier mining algorithm. Expert Systems with Applications 35(4).
Fan H, Zaiane O, Foss A, Wu J (2009) Resolution-based outlier factor: detecting the top-n most outlying data points in engineering data. Knowledge and Information Systems (Springer) 19(1), 31-51.