Data Anomalies in Data Mining and Knowledge Discovery in Data

An Analysis of Data Anomalies in Data Mining and Knowledge Discovery in Data
The 2008 International Conference on Data Mining, Las Vegas
Mary C. Malone, Comcast Cable Corporation, USA
James Dullea, Villanova University, Pennsylvania, USA
김 현 숙

Contents
1. Definitions of Data Anomalies
2. Types of Data Anomalies
  Data Acquisition Anomalies (Errors in Manual Data Entry)
  Data Semantic Anomalies
  Outliers (Outliers as Noise, Outliers as Interesting Data)
3. Detecting Data Anomalies
  Applications for Detecting Data Anomalies
4. Managing Data Anomalies
  Data Cleaning
  Outlier Mining

Definitions of Data Anomalies
English language definition
  Data that deviates from the common rule: irregular data that is somehow different, peculiar, and not easily classified.
Other definitions
  Frequent events are considered normal; anomalies are considered rare.
  Events that deviate substantially from a known (explicit or implicit) model of some domain.
  Noise in the form of irrelevant or meaningless data.
  Anomalous data and erroneous data are used synonymously.
  A considerable part of data has quality problems. These problems are labeled as errors, anomalies, or even dirtiness.
  The percentage of anomalous data in a data set is estimated to be relatively small, about 5%.

Types of Data Anomalies(1/2)
Classification by where the anomaly occurs
  Single-source anomalies: occur within a single data source.
  Multi-source anomalies: occur when different data sources are combined to form a whole.
Classification of dirty data by how it originates within a data source, using a "successive hierarchical refinement" approach
  Missing data
  Not missing but wrong data
  Not missing and not wrong but unusable data

Types of Data Anomalies(2/2)
Five categories defined by Dullea
  Data acquisition anomalies
  Data semantic anomalies
  Maverick values
  Predictor issues
  Organization and granularity issues
Classification scheme defined by Müller
  Syntax
  Semantics
  Coverage
Vocabulary: maverick = belonging to neither side; organization and granularity = organization and grain.
We incorporate the two approaches for classifying data anomalies: data acquisition anomalies, data semantic anomalies, and maverick values serve as an outline, and these categories are augmented with the classifications defined in Müller.

Data Acquisition Anomalies(1/4)
Errors in Manual Data Entry
  Missing values: missing values in a data tuple can occur. Ex) when reporting sex, where M or F are the only given choices, neither choice is selected and the field is left blank.
  Empty values: the field may or may not have a value, but because the respondent was unsure of the correct value, the field was left blank. Ex) zip codes
  Improper input or incorrect input: occurs as a result of typographical errors or incorrect data entry. Ex) PATIENT_ID or SSN
  Bad questions: Ex) "What color are your grandfather's eyes?" Everyone has two grandfathers.
  Respondent guessing: the person entering the data "guesses." Ex) A Cleveland State University study showed 50% of majors as "undecided" for both the spring and fall 1999 terms.

Such errors can be prevented by
  creating better data entry systems,
  defining more accurate questions in surveys, or
  employing better data collection forms, such as radio buttons that enforce mutually exclusive choices on a respondent for a selection such as M or F.
Postal systems can also be used to check zip codes against input addresses.
But because of human error, such as typos, some data entry problems are unavoidable.
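The kind of entry-time checking described above might be sketched as follows. This is a minimal illustration, not the authors' system: the record layout, the field names `sex` and `zip`, the function name `check_entry`, and the US ZIP/ZIP+4 pattern are all assumptions made for the example. It only checks the format of a zip code; checking it against an address would need a postal service lookup, which is not shown.

```python
import re

# Assumption: the only valid choices for sex are M and F, as in the slide's example.
ALLOWED_SEX = {"M", "F"}
# Assumption: zip codes follow the US 5-digit or ZIP+4 format.
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")


def check_entry(record):
    """Return a list of (field, problem) pairs for one data-entry record."""
    problems = []

    sex = (record.get("sex") or "").strip()
    if not sex:
        problems.append(("sex", "missing"))          # neither choice selected
    elif sex not in ALLOWED_SEX:
        problems.append(("sex", "improper input"))   # typo or invalid choice

    zip_code = (record.get("zip") or "").strip()
    if not zip_code:
        problems.append(("zip", "empty"))            # left blank by the respondent
    elif not ZIP_PATTERN.match(zip_code):
        problems.append(("zip", "improper input"))   # wrong format

    return problems


if __name__ == "__main__":
    print(check_entry({"sex": "M", "zip": "19085"}))
    print(check_entry({"sex": "", "zip": "1908"}))
```

A form with radio buttons makes the "improper input" branch for `sex` unreachable, which is the point of the slide: the better the entry form, the fewer checks are needed afterwards.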

Syntactic errors involve erroneous formatting and values used for representing the entities in a database.
  Lexical error: the table is expected to have five columns, but some or all of the rows contain only four columns.
  A missing or empty value results in a coverage anomaly if the missing value belongs to an attribute that has been assigned a NOT NULL constraint in the database.
Semantic anomaly: a contradiction between values that violates a dependency between attributes.
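As a rough illustration of the lexical and coverage cases above, the sketch below classifies one row at a time. The five-column expectation comes from the slide; the choice of column 0 as the NOT NULL attribute and the name `classify_row` are assumptions for the example.

```python
EXPECTED_COLUMNS = 5   # from the slide's example: the table should have five columns
NOT_NULL = {0}         # assumption: column 0 carries a NOT NULL constraint


def classify_row(row):
    """Classify a row as 'lexical error', 'coverage anomaly', or 'ok'."""
    # Lexical error: the row does not have the expected number of columns.
    if len(row) != EXPECTED_COLUMNS:
        return "lexical error"
    # Coverage anomaly: a NOT NULL attribute is missing or empty.
    for idx in NOT_NULL:
        if row[idx] is None or str(row[idx]).strip() == "":
            return "coverage anomaly"
    return "ok"


if __name__ == "__main__":
    print(classify_row(["p1", "Ann", "F", "19085"]))
    print(classify_row(["", "Ann", "F", "19085", "blue"]))
    print(classify_row(["p1", "Ann", "F", "19085", "blue"]))
```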

Data Semantic Anomalies
Data semantic anomalies occur when the mapping between the data and its meaning is corrupted, thus altering the meaning of the data in some way.
  Attributes with the same name but different meanings, or different names and the same meaning. Ex) YY or Y for year.
  The naming of one attribute depends on the value of another attribute. Ex) the relationship between AGE and DATE_OF_BIRTH for a tuple representing persons.
  Overloaded attributes, where an attribute has more than one meaning. Ex) a clothing code in a store, "V00403sp": (V) = style, (004) = the color BLUE, etc.
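The AGE and DATE_OF_BIRTH example can be checked mechanically. The sketch below is one possible way to do so, assuming AGE is stored in completed years and that the date the record was captured (`as_of`) is known; both function names are hypothetical.

```python
from datetime import date


def age_on(date_of_birth, as_of):
    """Completed years between date_of_birth and as_of."""
    years = as_of.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


def age_is_consistent(age, date_of_birth, as_of):
    """True when the stored AGE agrees with DATE_OF_BIRTH on the as_of date."""
    return age == age_on(date_of_birth, as_of)


if __name__ == "__main__":
    dob = date(1980, 6, 15)
    print(age_is_consistent(28, dob, date(2008, 7, 1)))  # stored AGE matches
    print(age_is_consistent(30, dob, date(2008, 7, 1)))  # contradiction
```

A record that fails this check has a contradiction between two dependent attributes; the check itself does not say which of the two values is wrong.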

Outliers
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
In a database, outliers appear as objects that do not conform to the general behavior of the data.
Most data mining methods consider outliers to be noise that causes problems for classification or clustering.
Outlier detection methods
  Statistical tests based on a probability model of the data.
  Distance measures that identify objects lying a significantly large distance from any other cluster in the data.
  Deviation-based methods that examine differences in the main characteristics of objects in a group to identify outliers.

Outliers as Noise
Dullea defines outliers as maverick values, which
  occur at the extreme end of a distribution and vary greatly from the distribution mean;
  occur when one or two values strongly bias an average (ex: the average age in a group).
These maverick values are considered detrimental noise that causes the analysis to suffer.
The goal in most data mining applications, especially classifiers and predictors, is to detect and remove these outliers from the data.
Vocabulary: detrimental = harmful, not beneficial.
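A small worked example shows how a single maverick value biases an average age. The ages below are made up for illustration and do not come from the paper.

```python
from statistics import mean, median

# Hypothetical ages in a group; not data from the paper.
ages = [21, 22, 23, 24, 25]
ages_with_maverick = ages + [95]  # one extreme value added

if __name__ == "__main__":
    print("mean without maverick:  ", mean(ages))                  # 23
    print("mean with maverick:     ", mean(ages_with_maverick))    # 35
    print("median without maverick:", median(ages))                # 23
    print("median with maverick:   ", median(ages_with_maverick))  # 23.5
```

One value moves the mean from 23 to 35 while the median barely changes, which is why a mean-based analysis suffers when such values are left in the data.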

Outliers as Interesting Data
The data is considered interesting only when values occur that fall outside the range of what is expected.
Ex) computer network intrusion detection systems
  The systems first establish a baseline of normal network traffic data.
  The network is then monitored against this baseline.
  When network data changes significantly in comparison to the baseline, it indicates a possible intrusion.
Thus, the salient characteristics of the outlying data are what make it useful, valued, and deemed interesting in such systems.
Vocabulary: salient = prominent, conspicuous; deem = to hold an opinion, to consider.
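The baseline-then-monitor idea can be sketched with a simple statistical rule. This is only an illustration under stated assumptions: traffic is reduced to a single number per interval, "changes significantly" is interpreted as more than three standard deviations from the baseline mean, and the function names and sample values are hypothetical. Real intrusion detection systems use far richer models.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize normal traffic as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value that lies more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # Degenerate baseline: any change at all is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold


if __name__ == "__main__":
    # Hypothetical packets-per-second readings under normal conditions.
    baseline = build_baseline([100, 102, 98, 101, 99])
    print(is_anomalous(101, baseline))  # within normal variation
    print(is_anomalous(140, baseline))  # far from the baseline
```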

Applications for Detecting Data Anomalies
Types of anomaly detection systems
  Real time: detect anomalies as they occur, logging issues for later analysis, or sometimes sounding alarms that require immediate human intervention.
  Offline: detect anomalies and manage them within an iterative cycle of detection and cleaning.

Outlier Detection
Outlier detection methods
  Statistical and distance-based: depend on analyzing the overall distribution of the data.
  Density-based: used when data is not uniformly distributed and therefore cannot benefit from the statistical and distance-based methods.
  Deviation-based: identifies outliers by examining the main characteristics in a group and how objects deviate from those characteristics.
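As one concrete instance of the distance-based family, the sketch below flags a point when the distance to its k-th nearest neighbor exceeds a fixed radius. The rule, the parameter values, and the function names are assumptions chosen for illustration; the slides do not specify a particular algorithm. It requires Python 3.8 or later for `math.dist` and assumes there are more than `k` points.

```python
import math


def knn_distance(point, others, k):
    """Distance from `point` to its k-th nearest neighbor among `others`."""
    dists = sorted(math.dist(point, other) for other in others)
    return dists[k - 1]


def distance_outliers(points, k=2, radius=2.0):
    """Return the points whose k-th nearest neighbor is farther than `radius`."""
    flagged = []
    for i, p in enumerate(points):
        others = points[:i] + points[i + 1:]   # every point except p itself
        if knn_distance(p, others, k) > radius:
            flagged.append(p)
    return flagged


if __name__ == "__main__":
    # Hypothetical 2-D objects: four close together and one far away.
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(distance_outliers(data))
```

Because a single global radius is used, this rule works best when the data has roughly uniform density, which matches the slide's remark that density-based methods are preferred otherwise.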

Managing Data Anomalies(1/2)
Examples of Data Cleaning
  Missing or empty values: often replaced with a mean, median, or midrange value determined by the data expert.
  Converting names and addresses: some tools help convert names and addresses in several countries, and some can even detect and correct wrongly entered street addresses.
  Discrepancies between attributes: some discrepancies, such as those between AGE and DATE_OF_BIRTH that cause data semantic anomalies, can be handled using data profiling techniques that deduce the correct values by creating metadata, or data about the data.
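Replacing missing values with a mean, median, or midrange can be sketched as below. Missing entries are represented as `None`, and the function name `fill_missing` and the sample values are assumptions for the example; in practice the data expert decides which statistic is appropriate.

```python
from statistics import mean, median


def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean, median, or midrange of the present values."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("no observed values to compute a replacement from")

    if strategy == "mean":
        fill = mean(present)
    elif strategy == "median":
        fill = median(present)
    elif strategy == "midrange":
        fill = (min(present) + max(present)) / 2   # halfway between the extremes
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    return [fill if v is None else v for v in values]


if __name__ == "__main__":
    column = [10, None, 20, 60]   # hypothetical attribute with one missing value
    print(fill_missing(column, "mean"))      # [10, 30, 20, 60]
    print(fill_missing(column, "median"))    # [10, 20, 20, 60]
    print(fill_missing(column, "midrange"))  # [10, 35.0, 20, 60]
```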

Managing Data Anomalies(2/2)
Outlier Mining
  The term describes systems that focus on detecting, or mining for, outliers as indicators of change in the normal behavior of a system.
  Often, the outliers are used to trigger alarms that require some immediate response or system intervention.
  Applications: credit card fraud, severe weather prediction, computer network intrusion, outbreaks of disease.

Conclusion
  The paper described some of the most commonly known data anomaly types and compared some of them.
  It described how manual data entry systems are highly susceptible to human error, for example when data entry systems are poorly designed.
  An outlier can be either noise that should be removed, or an indicator or alarm that is sought as a goal in outlier mining.
  Data cleaning is necessary because noisy data may result in unreliable classifiers or predictors.
  Not much literature is available on the topic of anomaly detection and management, and what is available is mostly highly case specific, so opportunities for research in this area are wide.
Vocabulary: susceptible = easily influenced.
