Spatial Outlier Detection and implementation in Weka Implemented by: Shan Huang Jisu Oh CSCI8715 Class Project, April 27 2004 Presented by Jisu.


Spatial Outlier Detection and Implementation in Weka. Implemented by: Shan Huang, Jisu Oh. CSCI8715 Class Project, April 27, 2004. Presented by Jisu Oh (Group 2). Slides available at http://www.users.cs.umn.edu/~joh/csci8715/HW-list.htm

Topics: Motivation, Problem Statement, Key Concepts, Major Contributions, Validation Methodology, Assumptions, Conclusions, Future Work

Motivation: Machine learning / data mining. Enables a computer program to analyze large-scale data and identify important information that can be used to make predictions or decisions faster and more accurately.

Motivation: Weka. A collection of machine learning algorithms for solving real-world data mining problems. Provides data mining functions (e.g., regression, association rules, and clustering algorithms). Limitation: operates on traditional non-spatial databases.

Problem Statement. Input: the Minneapolis/St. Paul traffic data set. Output: detected outliers, shown as plain text (timeslot, time, station, Zs(x)), overall traffic volume, and a neighbor relationship graph between stations.

Problem Statement (cont.) Constraints: the algorithm is taken from the paper “A Unified Approach to Detecting Spatial Outliers”; the dataset should be numeric. Objective: to find sets of spatial outliers and show the results visually.

Key Concepts: Spatial outliers. Definition: spatially referenced objects whose non-spatial attribute values are significantly different from the values of their neighborhood. Example: a new house in an old neighborhood of a growing metropolitan area. In this project, an outlier is a station that has a high volume compared to its neighboring stations at a certain time slot.

Key Concepts (contd.) Algorithm: proposed in the paper “A Unified Approach to Detecting Spatial Outliers” by S. Shekhar, C. T. Lu, and P. Zhang. $S(x) = f(x) - E_{y \in N(x)}(f(y))$ is the difference between $f(x)$, the attribute value of a sensor located at $x$, and $E_{y \in N(x)}(f(y))$, the average attribute value of $x$'s neighbors $N(x)$. The spatial statistic is $Z_{S(x)} = \left| \frac{S(x) - \mu_s}{\sigma_s} \right| > \theta$, where $\mu_s$ and $\sigma_s$ are the mean and standard deviation of $S(x)$, and $\theta$ is a z-score for a user-specified confidence interval.
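The statistic above can be sketched in a few lines of Python. This is a minimal illustration, not the project's Weka code: the function names, the dictionary-based neighbor graph, the use of the population standard deviation, and the sample values are all assumptions made for the example.

```python
from statistics import mean, pstdev

def spatial_z_scores(values, neighbors):
    """Return Z_{S(x)} for every location x.

    values:    dict mapping location -> attribute value f(x)
    neighbors: dict mapping location -> list of neighboring locations N(x)
    (both structures are hypothetical, chosen for this sketch)
    """
    # S(x) = f(x) - average attribute value of x's neighbors
    s = {x: values[x] - mean(values[y] for y in neighbors[x]) for x in values}
    mu = mean(s.values())
    sigma = pstdev(s.values())  # assumption: population standard deviation
    if sigma == 0:
        return {x: 0.0 for x in s}  # all differences equal, nothing stands out
    return {x: abs((s[x] - mu) / sigma) for x in s}

def detect_outliers(values, neighbors, theta=2.0):
    """Locations whose Z_{S(x)} exceeds the threshold theta."""
    z = spatial_z_scores(values, neighbors)
    return sorted(x for x, zx in z.items() if zx > theta)

# Made-up data: ten stations in a chain, each adjacent to the previous and next.
volumes = {1: 5, 2: 6, 3: 5, 4: 6, 5: 100, 6: 6, 7: 5, 8: 6, 9: 5, 10: 6}
chain = {i: [j for j in (i - 1, i + 1) if j in volumes] for i in volumes}
outliers = detect_outliers(volumes, chain, theta=2.0)
```

With a very small number of locations the largest attainable z-score is bounded, so a single extreme value may not cross the threshold; the chain here is long enough for station 5 to be flagged.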

Key Concepts (contd.) Algorithm (example)
Example grid of attribute values (column labels 1 to 5):
20 6 7 8 9
2 5 10 11 12
7 8 100 2 1
3 6 7 8 9
For the value 100, whose neighbors are 8 and 2: $S(x) = f(x) - E_y = 100 - (2+8)/2 = 95$. With mean $\mu_s = 0.22$ and standard deviation $\sigma_s = 23.8$ of $S(x)$, $Z_{S(x)} = |S(x) - \mu_s| / \sigma_s = 3.98$. The z-score for a 95% C.I. is approximately 2, and 3.98 > 2; thus 100 is an outlier. The outlier is replaced by $E_y$: 100 -> 5.
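The detect-then-replace step in the example can be sketched as a loop: find the location with the largest z-score, and if it exceeds the threshold, record it and substitute the neighbor average before recomputing. The sketch below is only an illustration under assumptions: the function names, the `max_rounds` cap, the population standard deviation, and the chain-shaped sample data are not from the slides, and the numbers are not meant to reproduce the slide's 0.22 and 23.8.

```python
from statistics import mean, pstdev

def z_scores(values, neighbors):
    """Z_{S(x)} for each location, with S(x) = f(x) - mean of neighbor values."""
    s = {x: values[x] - mean(values[y] for y in neighbors[x]) for x in values}
    mu, sigma = mean(s.values()), pstdev(s.values())
    if sigma == 0:
        return {x: 0.0 for x in s}
    return {x: abs(s[x] - mu) / sigma for x in s}

def detect_and_replace(values, neighbors, theta=2.0, max_rounds=10):
    """Repeatedly flag the strongest outlier and replace it by its neighbor average.

    Returns (found, cleaned): the flagged (location, original value) pairs and
    a copy of the data with each flagged value replaced.
    max_rounds is a hypothetical safety cap for this sketch.
    """
    cleaned = dict(values)  # work on a copy, leave the input untouched
    found = []
    for _ in range(max_rounds):
        z = z_scores(cleaned, neighbors)
        worst = max(z, key=z.get)
        if z[worst] <= theta:
            break  # nothing exceeds the threshold any more
        found.append((worst, cleaned[worst]))
        cleaned[worst] = mean(cleaned[y] for y in neighbors[worst])
    return found, cleaned

# Made-up data: ten stations in a chain with one extreme reading at station 5.
volumes = {1: 5, 2: 6, 3: 5, 4: 6, 5: 100, 6: 6, 7: 5, 8: 6, 9: 5, 10: 6}
chain = {i: [j for j in (i - 1, i + 1) if j in volumes] for i in volumes}
found, cleaned = detect_and_replace(volumes, chain, theta=2.0)
```

Replacing the flagged value before the next pass keeps one extreme reading from inflating the mean and standard deviation used to judge the remaining locations.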

Major Contributions: Top-k outliers query processing. User interface similar to the Weka UI. Visualization of outliers: plain text (time slot, time, station, Zs(x)), overall traffic volume, and a neighbor relationship graph between stations. Keeping user-specified results.

Major Contributions (contd.) Top k outliers query processing Fig.1. Top 3 outliers from dataset 19970115N.dat
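A top-k outlier query can be sketched as selecting the k largest z-scores. The snippet below is an assumption-laden illustration: the function name, the `(timeslot, station)` keys, and the score values are invented for the example and are not the results shown in Fig.1.

```python
import heapq

def top_k_outliers(z_scores, k):
    """Return the k (key, z-score) pairs with the largest z-scores, largest first.

    z_scores: dict mapping a hypothetical (timeslot, station) key -> Zs(x)
    """
    return heapq.nlargest(k, z_scores.items(), key=lambda item: item[1])

# Made-up scores keyed by (timeslot, station).
scores = {
    (10, 24): 3.98,
    (11, 24): 3.10,
    (10, 7): 2.40,
    (200, 124): 4.50,
    (5, 3): 0.70,
}
top3 = top_k_outliers(scores, 3)
```

`heapq.nlargest` avoids sorting the whole score table, which matters when k is small relative to the number of (timeslot, station) pairs.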

Major Contributions (contd.) User Interface: Weka-based; adds one more button to Weka. Uses the same framework but works independently. Simple and easy to use; satisfies user interface properties (simple, user language, reduce memory, …). Users specify a confidence interval (68%, 95%, or 99%) and the number of outliers to find. Weka doesn't provide enough options for detecting spatial outliers, so a dedicated interface is needed. Fig.2 User interface of the spatial outlier detection application vs. Weka
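One way to turn a user-specified confidence level into the threshold θ is the inverse CDF of the standard normal distribution. This is a sketch under the assumption of a two-sided interval; the function name is hypothetical, and the slides do not say how the application itself maps 68%, 95%, and 99% to thresholds.

```python
from statistics import NormalDist

def theta_for_confidence(confidence):
    """Two-sided z-score threshold for a confidence level strictly between 0 and 1."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # A two-sided interval leaves (1 - confidence) / 2 in each tail.
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)

# The three confidence levels offered in the user interface.
thresholds = {c: theta_for_confidence(c) for c in (0.68, 0.95, 0.99)}
```

Under this mapping 95% gives roughly 1.96, which is commonly rounded to 2.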

Major Contributions (contd.) Visualization of outliers: benefits, and where this information (visualization) can be applied. Fig.3 Plain text results of detected outliers

Major Contributions (contd.) Visualization of outliers: detected outliers. Fig.4 Overall traffic volume and neighbor relationship graph between stations

Major Contributions (contd.) Keeping Results: users can save and print all of their user-specified results, both text results and images (traffic volume, station relationships). Why is this function needed? Users can compare and contrast results using this information. Let’s go to the DEMO!

Validation Methodology: experiments with three different data sets.
Data set         Most outliers found at station
19970115N.dat    24
19970116N.dat    24
19970125N.dat    124
Showing station relationships: the station number chosen as one of the outliers serves as the parameter for the station visualization. This lets users easily see the neighbor relationships between stations. In other words, users can see why that station is one of the outliers.

Assumptions: The data format is fixed. The original data consists of traffic volume and occupancy; outlier detection is based on volume. Data format:
@relation 19970115N
@station 150
@timeslot 288
1 3 4 7 45 100 ….
Users are familiar with statistical concepts (e.g., confidence interval, C.I.).
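A reader for the data format described above might look like the following sketch. Several details are assumptions, because the slides only show the header fields and the first few values: each `@` header is taken to sit on its own line, the values are taken to be whitespace-separated integer volumes, and they are taken to be ordered one timeslot after another with one value per station. The function name and the tiny sample are hypothetical.

```python
def parse_traffic(text):
    """Parse '@relation/@station/@timeslot' headers followed by integer volumes.

    Returns (relation, volume), where volume[t][s] is the value for timeslot t
    and station s. The timeslot-major ordering is an assumption of this sketch.
    """
    header = {}
    numbers = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            # Header line such as "@station 150"
            key, _, value = line[1:].partition(" ")
            header[key] = value.strip()
        else:
            numbers.extend(int(token) for token in line.split())
    stations = int(header["station"])
    timeslots = int(header["timeslot"])
    if len(numbers) != stations * timeslots:
        raise ValueError("value count does not match @station x @timeslot")
    volume = [numbers[t * stations:(t + 1) * stations] for t in range(timeslots)]
    return header["relation"], volume

# Tiny made-up sample: 3 stations, 2 timeslots.
sample = "@relation demo\n@station 3\n@timeslot 2\n1 3 4\n7 45 100\n"
relation, volume = parse_traffic(sample)
```

Checking the value count against the header catches truncated files before any statistics are computed.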

Conclusion: Added one more package to Weka to find sets of spatial outliers. Shows results visually in a user interface similar to Weka's by: top-k outliers query processing, providing visualization of outliers, and allowing users to keep user-specified results.

Future work: Upgrade to allow various file formats and data types. Experiments to find a more efficient algorithm using different outlier detection algorithms. Add more spatial data mining options, e.g., SAR (Spatial Auto Regression), co-location.

Thanks!