Data Anomalies in Data Mining and Knowledge Discovery in Data

Presentation transcript:

An Analysis of Data Anomalies in Data Mining and Knowledge Discovery in Data
The 2008 International Conference on Data Mining, Las Vegas.
Mary C. Malone, Comcast Cable Corporation, USA
James Dullea, Villanova University, Pennsylvania, USA
2010.12.06. 김 현 숙

Contents
1. Definitions: Data Anomalies
2. Types: Data Acquisition Anomalies (Errors in Manual Data Entry), Data Semantic Anomalies, Outliers (Outliers as Noise, Outliers as Interesting Data)
3. Detecting: Applications for Detecting Data Anomalies
4. Managing: Data Cleaning, Outlier Mining

Definitions of Data Anomalies
English language definition: a deviation from the common rule; data that is irregular, somehow different, and unusual enough that it is not easily classified.
Other definitions:
Frequent events are considered normal; anomalies are considered rare.
Anomalies are events that deviate substantially from a known (explicit or implicit) model of some domain.
Anomalies are noise in the form of irrelevant or meaningless data.
Anomalous data and erroneous data are used synonymously.
A considerable part of data has quality problems. These problems are labeled as errors, anomalies, or even dirtiness.
The percentage of anomalous data that occurs in a data set is estimated to be relatively small, about 5%.

Types of Data Anomalies (1/2)
Single-source anomalies occur within a single data source.
Multi-source anomalies occur when different data sources are combined to form a whole.
Anomalies can also be classified according to how they originate within a data source.
A taxonomy of dirty data that uses a “successive hierarchical refinement” approach defines three classes: missing data; not missing but wrong data; not missing and not wrong but unusable data.

Types of Data Anomalies (2/2)
Five categories defined by Dullea: data acquisition anomalies, data semantic anomalies, maverick values, predictor issues, and organization and granularity issues.
Classification scheme defined by Müller: syntax, semantics, coverage.
We combine these two approaches to classifying data anomalies: data acquisition anomalies, data semantic anomalies, and maverick values serve as the outline, and these categories are augmented with the classifications defined by Müller.
Vocabulary: maverick = not belonging to either side; organization and granularity = organization and grain.

Data Acquisition Anomalies (1/4): Errors in Manual Data Entry
Missing values: a field is left blank. Ex) when reporting sex, M and F are the only given choices, but neither choice is selected.
Empty values: a field in a data tuple may or may not have a value, but it was left blank because the respondent was unsure of the correct value. Ex) zip codes
Improper or incorrect input: occurs as a result of typographical errors or incorrect data entry. Ex) PATIENT_ID or SSN
Bad questions: Ex) “What color are your grandfather’s eyes?” Everyone has two grandfathers.
Respondent guessing: the person entering the data “guesses.” Ex) a Cleveland State University study showed 50% of majors as “undecided” for both the spring and fall 1999 terms.

Data Acquisition Anomalies (2/4)
Such errors can be prevented by creating better data entry systems, by defining more accurate questions in surveys, or by employing better data collection forms, such as radio buttons that enforce mutually exclusive choices on a respondent for a selection such as M or F.
Postal systems can also be used to check zip codes against input addresses.
Because of human error, such as typos, some data entry problems are unavoidable.
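As a rough illustration of catching entry errors at input time, the sketch below validates one record before it is stored. The field names, the M/F rule, and the US-style ZIP pattern are assumptions made for this example, not rules taken from the paper.

```python
import re

# Hypothetical field rules, chosen only for illustration.
ALLOWED_SEX = {"M", "F"}
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")  # assumes US-style ZIP codes


def validate_entry(entry):
    """Return a list of problems found in one data-entry record (a dict)."""
    problems = []

    # A blank or absent selection is reported as a missing value.
    sex = (entry.get("sex") or "").strip().upper()
    if not sex:
        problems.append("sex: missing value")
    elif sex not in ALLOWED_SEX:
        problems.append("sex: improper input")

    # A blank ZIP is an empty value; a malformed one is improper input.
    zip_code = (entry.get("zip") or "").strip()
    if not zip_code:
        problems.append("zip: empty value")
    elif not ZIP_PATTERN.match(zip_code):
        problems.append("zip: improper input")

    return problems
```

A form that only offers the two choices makes the "improper input" branch for sex unreachable, which is the point of the radio-button suggestion; the format check plays the same role for free-text fields.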

Data Acquisition Anomalies (3/4)
Syntactic errors involve erroneous formatting and values used for representing the entities in a database.
Lexical error: the table is expected to have five columns, but some or all of the rows contain only four columns.
A missing or empty value can result in a semantic anomaly, that is, a contradiction between values that violates a dependency between attributes.
It can result in a coverage anomaly if the missing value is defined for an attribute that has been assigned a NOT NULL constraint in the database.
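The two structural checks on this slide can be sketched as follows. The five-column schema and the set of NOT NULL columns are hypothetical and exist only to make the example concrete.

```python
EXPECTED_COLUMNS = ["id", "name", "sex", "zip", "age"]  # hypothetical schema
NOT_NULL = {"id", "name"}  # hypothetical NOT NULL constraints


def check_row(row):
    """Classify structural problems in one row given as a list of strings."""
    # Lexical error: the row does not have the expected number of columns.
    if len(row) != len(EXPECTED_COLUMNS):
        return ["lexical error: expected %d columns, got %d"
                % (len(EXPECTED_COLUMNS), len(row))]

    # Coverage anomaly: an empty value in a column declared NOT NULL.
    problems = []
    for column, value in zip(EXPECTED_COLUMNS, row):
        if column in NOT_NULL and value.strip() == "":
            problems.append("coverage anomaly: %s is empty" % column)
    return problems
```

The column count is checked first because a row of the wrong length cannot be lined up with the schema reliably, so the per-column check would be meaningless for it.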

Data Acquisition Anomalies (4/4): Data Semantic Anomalies
Data semantic anomalies occur when the mapping between the data and its meaning is corrupted, thus altering the meaning of the data in some way.
Attributes with the same name but different meanings, or different names and the same meaning. Ex) YY or Y for year.
Dependencies between attributes, where the naming of one attribute depends on the value of another attribute. Ex) the relationship between AGE and DATE_OF_BIRTH for a tuple representing a person.
Overloaded attributes, where an attribute has more than one meaning. Ex) a clothing code in a store, “V00403sp33100502”: (V) = style, (004) = the color BLUE, etc.
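The AGE and DATE_OF_BIRTH dependency lends itself to a simple consistency check. The sketch below is one possible way to express it; the function names and the choice of a reference date are assumptions for illustration.

```python
from datetime import date


def age_on(date_of_birth, as_of):
    """Age in whole years on the date `as_of`."""
    # Subtract one year if the birthday has not yet occurred in `as_of`'s year.
    had_birthday = (as_of.month, as_of.day) >= (date_of_birth.month, date_of_birth.day)
    return as_of.year - date_of_birth.year - (0 if had_birthday else 1)


def age_is_consistent(age, date_of_birth, as_of):
    """True when the stored AGE agrees with DATE_OF_BIRTH on `as_of`."""
    return age == age_on(date_of_birth, as_of)
```

A tuple that stores AGE = 25 together with a date of birth of 1980-06-15 would fail this check for a reference date in December 2010, which flags it as a candidate semantic anomaly without saying which of the two attributes is wrong.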

Outliers
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
In a database, outliers appear as objects that do not conform to the general behavior of the data.
Most data mining methods consider outliers to be noise that causes problems for classification or clustering.
Outlier detection methods:
Statistical tests are based on a probability model of the data.
Distance-based methods use distance measures to identify objects that are a significantly large distance from any other cluster in the data.
Deviation-based methods examine differences in the main characteristics of objects in a group to identify outliers.
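One common statistical test is a z-score rule, which assumes the data is roughly normally distributed. The sketch below is a minimal version of that idea; the threshold of 2.5 is an arbitrary assumption and not a value from the paper.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.5):
    """Return the values whose z-score magnitude exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]
```

For the hypothetical sample `[10, 12, 11, 13, 12, 11, 10, 12, 11, 95]`, only 95 is returned. In very small samples an extreme value inflates the standard deviation enough to hide itself, so the threshold has to be chosen with the sample size in mind.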

Outliers as Noise
Dullea defines outliers as maverick values.
Maverick values occur at the extreme end of a distribution and vary greatly from the distribution mean.
They occur when one or two values strongly bias an average (ex: the average age in a group).
These maverick values are considered detrimental noise that causes the analysis to suffer.
The goal in most data mining applications, especially classifiers and predictors, is to detect and remove these outliers from the data.
Vocabulary: detrimental = harmful, not beneficial.
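The average-age example can be made concrete with a small calculation. The ages below are invented for illustration; the point is only how far a single maverick value moves the mean compared with the median.

```python
from statistics import mean, median

# Hypothetical ages in a group; 95 plays the role of the maverick value.
ages = [21, 22, 23, 24, 25, 95]

mean_with_maverick = mean(ages)          # pulled upward by the single extreme value
median_with_maverick = median(ages)      # barely affected by it
mean_without_maverick = mean(ages[:-1])  # mean after removing the maverick value

print(mean_with_maverick, median_with_maverick, mean_without_maverick)
```

Here the mean is 35 with the maverick value and 23 without it, while the median of the full group is 23.5. One value out of six shifts the average by twelve years.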

Outliers as Interesting Data
The data is considered interesting only when values occur that fall outside the range of what is expected.
Ex) computer network intrusion detection systems
The systems first establish a baseline of normal network traffic data.
The network is then monitored against this baseline.
When network data changes significantly in comparison to the baseline, it indicates a possible intrusion. The salient characteristics of the outlying data are thus what make it useful, valued, and deemed interesting in such systems.
Vocabulary: salient = prominent, striking; deem = to hold an opinion, to consider.
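The baseline-then-monitor idea can be sketched with a single traffic metric. The class name, the metric, and the tolerance of three standard deviations are all assumptions for this example; real intrusion detection systems model far more than one number.

```python
from statistics import mean, pstdev


class BaselineMonitor:
    """Flags observations that deviate strongly from a learned baseline."""

    def __init__(self, baseline, tolerance=3.0):
        # Step 1: establish the baseline from traffic assumed to be normal.
        self.center = mean(baseline)
        self.spread = pstdev(baseline)
        self.tolerance = tolerance

    def is_anomalous(self, observation):
        # Step 2: compare each new observation against the baseline.
        if self.spread == 0:
            return observation != self.center
        return abs(observation - self.center) > self.tolerance * self.spread
```

With a hypothetical baseline of `[100, 102, 98, 101, 99]` requests per second, an observation of 103 stays within tolerance while 250 is flagged. In this setting the flagged value is the result the system is looking for, not noise to be discarded.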

Applications for Detecting Data Anomalies
Types of anomaly detection systems:
Real time: detect anomalies as they occur, logging issues for later analysis or sometimes sounding alarms that require immediate human intervention.
Offline: detect anomalies and manage them within an iterative cycle of detection and cleaning.

Outlier Detection
Outlier detection methods:
Statistical and distance-based: depend on analyzing the overall distribution of the data.
Density-based: used when data is not uniformly distributed and therefore cannot benefit from the statistical and distance-based methods.
Deviation-based: identifies outliers by examining the main characteristics in a group and how objects deviate from those characteristics.
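As one concrete instance of the distance-based family, the sketch below scores each point by the distance to its k-th nearest neighbour, so isolated points receive large scores. This is a generic k-nearest-neighbour formulation chosen for illustration, not the specific method of the paper; the function name and the default k are assumptions.

```python
import math


def knn_distance_scores(points, k=2):
    """Score each point by its distance to its k-th nearest neighbour.

    Requires more than k points. Larger scores mean more isolated points.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores
```

For the hypothetical points `[(1, 1), (1, 2), (2, 1), (2, 2), (9, 9)]`, the four clustered points each score 1.0 and the point at (9, 9) scores above 10. This global comparison is what breaks down when clusters have very different densities, which is the case the density-based methods are meant to handle.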

Managing Data Anomalies (1/2): Examples of Data Cleaning
Missing or empty values are often replaced with a mean, median, or midrange value determined by the data expert.
Names and addresses: some tools help convert names and addresses in several countries, and some can even detect and correct wrongly entered street addresses.
Discrepancies between attributes: some discrepancies, such as those between AGE and DATE_OF_BIRTH that cause data semantic anomalies, can be handled using data profiling techniques that deduce the correct values by creating metadata, or data about the data.
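Replacing missing values with a mean, median, or midrange can be sketched in a few lines. The function name, the use of `None` to mark a missing value, and the sample numbers are assumptions for this example.

```python
from statistics import mean, median


def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean, median, or midrange of the rest."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("no known values to derive a replacement from")

    if strategy == "mean":
        fill = mean(known)
    elif strategy == "median":
        fill = median(known)
    elif strategy == "midrange":
        fill = (min(known) + max(known)) / 2  # halfway between the extremes
    else:
        raise ValueError("unknown strategy: %r" % strategy)

    return [fill if v is None else v for v in values]
```

For `[10, None, 20, 60]` the three strategies fill in 30, 20, and 35.0 respectively. The spread between those results shows why the slide leaves the choice to the data expert: the midrange in particular depends only on the two extreme values and is sensitive to outliers.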

Managing Data Anomalies (2/2): Outlier Mining
Outlier mining is the term used to describe systems that focus on detecting, or mining for, outliers as indicators of change in the normal behavior of a system.
Often, the outliers are used to trigger alarms that require some immediate response or system intervention.
Applications: credit card fraud, severe weather prediction, computer network intrusion, outbreaks of disease.

Conclusion
The paper described some of the most commonly known data anomaly types and compared some of them.
It described how manual data entry systems are highly susceptible to human error. Ex) poorly designed data entry systems
An outlier can be noise that should be removed, or an indicator or alarm that is sought as a goal in outlier mining.
Data cleaning is necessary: noisy data may result in unreliable classifiers or predictors.
Not much literature is available on the topic of anomaly detection and management, and what is available is mostly highly case specific, so opportunities for research in this area are wide.
Vocabulary: susceptible = easily affected.