Data Mining: Data Preparation

What is Data Mining?
Data mining is the process of discovering interesting patterns (or knowledge) from large amounts of data. The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically.

Some Definitions
Data: any facts, numbers, or text that can be processed by a computer. Examples include:
- operational or transactional data, such as sales, cost, inventory, payroll, and accounting
- nonoperational data, such as forecast data
- metadata (data about the data itself), such as logical database design or data dictionary definitions
Information: the patterns, associations, or relationships among all these data can provide information.

Definitions (continued)
Knowledge: information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in terms of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.
Data warehouses: data warehousing is defined as a process of centralized data management and retrieval.

Data Rich, Information Poor

Data Mining process

Knowledge discovery as a process
Knowledge discovery is depicted in the accompanying figure and consists of an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
Steps 1 to 4 are different forms of data preprocessing, where the data are prepared for mining. The data mining step may interact with the user or a knowledge base.

Knowledge Discovery (KDD) Process
Data mining is the core of the knowledge discovery process.
[Figure: KDD pipeline. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

Data Mining System Classification
[Figure: data mining shown at the intersection of Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, and Other Disciplines]

A data mining system can be classified according to the following criteria:
- Database Technology
- Statistics
- Machine Learning
- Information Science
- Visualization
- Pattern Recognition
- Other Disciplines
Apart from these, a data mining system can also be classified based on the kind of (a) databases mined, (b) knowledge mined, (c) techniques utilized, and (d) applications adapted.

Major Issues in Data Mining
Mining different kinds of knowledge in databases: Because different users can be interested in different kinds of knowledge, data mining should cover a wide spectrum of data analysis and knowledge discovery tasks. These tasks may use the same database in different ways.
Interactive mining of knowledge: Because it is difficult to know exactly what can be discovered within a database, the data mining process should be interactive.

Incorporation of background knowledge: Background knowledge, or information regarding the domain under study, may be used to guide the discovery process.
Data mining query languages and ad hoc data mining: Ad hoc analysis is a business intelligence process designed to answer a single, specific business question. Relational query languages (such as SQL) allow users to pose ad hoc queries for data retrieval.
Presentation and visualization of data mining results: Discovered knowledge should be expressed in high-level languages, visual representations, or other expressive forms, so that the knowledge can be easily understood and directly used by humans.

Handling noisy or incomplete data: The data stored in a database may reflect noise, exceptional cases, or incomplete data objects. As a result, the accuracy of the discovered patterns can be poor. Data cleaning methods and data analysis methods that can handle noise are required.
Pattern evaluation (the interestingness problem): A data mining system can uncover a large number of patterns, presented in forms such as tables, graphs, and charts. Many of the patterns discovered may be uninteresting to the given user, either because they represent common knowledge or because they lack novelty.
Performance issues: These include efficiency, scalability, and ease of understanding.

Data Warehouse example

Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Why Data Preprocessing?
Data in the real world is dirty:
- incomplete: lacking attribute values or lacking certain attributes of interest
- noisy: containing errors (noisy data is meaningless data)
- inconsistent: containing inconsistencies in codes or names
No quality data, no quality mining results!
- Quality decisions must be based on quality data (quality data is used in decision making and planning).
- A data warehouse needs consistent integration of quality data.

Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
- Accuracy
- Completeness
- Consistency
- Timeliness (the data are up to date)
- Believability
- Accessibility

Major Tasks in Data Preprocessing
Data cleaning: fill in missing values; smooth, identify, or remove noisy data; and resolve inconsistencies.
Data integration: integration of multiple databases or files.
Data transformation: converting data from one format to another, so that the data are transformed or consolidated into forms appropriate for mining, for example by normalization and aggregation.
Data reduction: obtains a representation that is reduced in volume but produces the same or similar analytical results.

Forms of data preprocessing

Data Cleaning
Data cleaning tasks:
- Fill in missing values
- Smooth out noisy data
- Correct inconsistent data

Missing Data
Data is not always available. For example, many tuples have no recorded value for several attributes, such as customer income in sales data.
Missing data may be due to:
- data that was inconsistent with other recorded data and was therefore deleted
- data not entered due to misunderstanding
- certain data not being considered important at the time of entry
- failure to register history or changes of the data
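
The slides list "fill in missing values" as a cleaning task without naming a method. As one possible illustration, the sketch below replaces each missing entry with the mean of the recorded values. Mean imputation is an assumption chosen here for simplicity, and the function name and the income figures are hypothetical.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Return a copy of values where None entries are replaced by the
    mean of the recorded (non-None) entries.

    Assumption: mean imputation is only one of several possible
    strategies; the slides do not prescribe a specific one.
    """
    recorded = [v for v in values if v is not None]
    if not recorded:
        # Nothing to estimate from, so leave the data unchanged.
        return list(values)
    fill = mean(recorded)
    return [fill if v is None else v for v in values]


# Hypothetical customer income attribute with two missing entries.
incomes = [30000, None, 50000, None, 40000]
filled = fill_missing_with_mean(incomes)
```

Filling with a single constant keeps every tuple but shrinks the variance of the attribute, so it is best treated as a baseline rather than a recommended default.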

Noisy Data
Noise: random error or variance in a measured variable.
Incorrect attribute values may be due to:
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions
Other data problems that require data cleaning:
- duplicate records
- incomplete data
- inconsistent data

How to Handle Noisy Data?
Binning method: first sort the data and partition it into (equi-depth) bins, then smooth the values within each bin, for example by replacing them with the bin mean.
Clustering: detect and remove outliers.
Combined computer and human inspection: detect suspicious values automatically and have a human check them.
Regression: smooth by fitting the data to regression functions.
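
The binning method can be sketched as follows: sort the values, split them into equi-depth bins (bins holding roughly the same number of values), and replace each value with the mean of its bin. Smoothing by bin means is one common variant and is an assumption here, as are the function name and the sample prices.

```python
from statistics import mean


def smooth_by_bin_means(values, n_bins):
    """Sort values, partition them into n_bins equi-depth bins, and
    replace every value with the mean of the bin it falls into.

    Returns the smoothed values in sorted order.
    """
    ordered = sorted(values)
    n = len(ordered)
    smoothed = []
    for b in range(n_bins):
        # Integer arithmetic spreads any remainder across the bins.
        start = b * n // n_bins
        end = (b + 1) * n // n_bins
        bin_values = ordered[start:end]
        if bin_values:
            bin_mean = mean(bin_values)
            smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


# Illustrative sample values (not taken from the slides).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
smoothed = smooth_by_bin_means(prices, 3)
# Bins: [4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]
# Bin means: 9, 22.75, 29.25
```

Smoothing by bin medians or bin boundaries follows the same structure; only the value substituted within each bin changes.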

Data Integration
Data integration: combines data from multiple sources into a coherent store.
Schema integration: integrate metadata from different sources.
Entity identification problem: identify real-world entities from multiple data sources, e.g., matching A.cust-id with B.cust-#.
Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ.

Handling Redundant Data
Redundant data often occur when multiple databases are integrated. For example, the same attribute may have different names in different databases.
Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies, and improve mining speed and quality.

Data Transformation
Smoothing: remove noise from data.
Aggregation: summarization, data cube construction.
Generalization: concept hierarchy climbing.
Normalization: the attribute data are scaled so as to fall within a small, specified range, such as -1.0 to 1.0 or 0.0 to 1.0. Methods include:
- min-max normalization
- z-score normalization
- normalization by decimal scaling
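
The three normalization methods named above can be sketched with the standard library alone. The function names and sample data are hypothetical, and the z-score version uses the population standard deviation, which is an assumption since the slides do not say which variant is meant.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min, max] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower bound.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]


def z_score(values):
    """Center on the mean and divide by the population std deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, where j is the smallest non-negative integer
    such that every scaled absolute value is below 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


# Illustrative attribute values (not taken from the slides).
data = [200, 300, 400, 600, 1000]
mm = min_max(data)          # scaled into [0.0, 1.0]
zs = z_score(data)          # mean 0, unit population variance
ds = decimal_scaling(data)  # divided by 10**4 here
```

Min-max normalization preserves the relative spacing of the values but is sensitive to outliers, because a single extreme value sets the range; z-score normalization is usually less affected by this.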

Data Reduction Strategies
A warehouse may store terabytes of data, so complex data analysis or mining may take a very long time to run on the complete data set.
Data reduction: obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
Data reduction strategies:
- Data cube aggregation
- Dimensionality reduction (encoding mechanisms are used to reduce the data set size)
- Numerosity reduction (the data are replaced by alternative, smaller representations)
- Discretization and concept hierarchy generation (a powerful tool for data mining)

Data Cube Aggregation
The lowest level of a data cube: the aggregated data for an individual entity of interest.
Multiple levels of aggregation in data cubes: further reduce the size of the data.
Reference appropriate levels: use the smallest representation that is enough to solve the task.
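
Aggregation to a coarser level can be illustrated by rolling hypothetical quarterly sales records up to the year level, or to the (year, branch) level. The record layout, the field names, and the `aggregate` helper are assumptions made for this sketch and do not come from the slides.

```python
from collections import defaultdict


def aggregate(records, key_fields, measure="sales"):
    """Sum the measure over all records sharing the same key_fields.

    Each distinct combination of key_fields corresponds to one cell
    of the aggregated cube.
    """
    totals = defaultdict(float)
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        totals[key] += rec[measure]
    return dict(totals)


# Hypothetical quarterly sales records.
quarterly = [
    {"year": 2008, "quarter": "Q1", "branch": "A", "sales": 100.0},
    {"year": 2008, "quarter": "Q2", "branch": "A", "sales": 150.0},
    {"year": 2008, "quarter": "Q1", "branch": "B", "sales": 50.0},
    {"year": 2009, "quarter": "Q1", "branch": "A", "sales": 200.0},
]

by_year = aggregate(quarterly, ["year"])
by_year_branch = aggregate(quarterly, ["year", "branch"])
```

If the task only needs annual totals, `by_year` (two cells) can replace the four original records, which is the sense in which the smallest sufficient representation should be used.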

Dimensionality Reduction
Feature selection (i.e., attribute subset selection): Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features

Clustering
Partition the data set into clusters and store only the cluster representation.
This can be very effective if the data is clustered, but not if the data is dirty.

Sampling
Choose a representative subset of the data.
Simple random sampling may perform very poorly in the presence of skew.
Develop adaptive sampling methods.

Sampling
[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]
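
The difference between the two schemes can be sketched with Python's `random` module: `sample` never returns the same item twice, while `choices` may. The wrapper names and the `seed` parameter are hypothetical and are only there to make runs repeatable.

```python
import random


def srswor(data, n, seed=None):
    """Simple random sample WITHOUT replacement: each item is drawn
    at most once, so n must not exceed len(data)."""
    rng = random.Random(seed)
    return rng.sample(data, n)


def srswr(data, n, seed=None):
    """Simple random sample WITH replacement: each draw is made from
    the full data set, so the same item can appear several times."""
    rng = random.Random(seed)
    return rng.choices(data, k=n)


# Hypothetical raw data: tuple identifiers 0..99.
raw = list(range(100))
without_replacement = srswor(raw, 10, seed=0)
with_replacement = srswr(raw, 10, seed=0)
```

Because SRSWR draws from the full data set every time, it can produce a sample larger than the data itself, whereas SRSWOR cannot.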

Discretization
Three types of attributes:
- Nominal: values from an unordered set
- Ordinal: values from an ordered set
- Continuous: real numbers
Discretization: divide the range of a continuous attribute into intervals.
- Some classification algorithms only accept categorical attributes.
- Reduce data size by discretization.
- Prepare for further analysis.

Discretization and Concept hierarchy
Discretization: reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
Concept hierarchies: reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
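
Both ideas can be sketched briefly: equal-width discretization assigns each value an interval index, and a one-level concept hierarchy maps a numeric age to a label. The cut points 30 and 60 are hypothetical, since the slides give only the labels and not the boundaries, and equal-width binning is just one of several possible discretization schemes.

```python
from bisect import bisect_right

# Hypothetical boundaries: below 30 is "young", 30 to 59 is
# "middle-aged", 60 and above is "senior".
AGE_CUTS = [30, 60]
AGE_LABELS = ["young", "middle-aged", "senior"]


def age_concept(age):
    """Replace a numeric age with its higher-level concept label."""
    return AGE_LABELS[bisect_right(AGE_CUTS, age)]


def equal_width_bins(values, k):
    """Split the range [min, max] into k equal-width intervals and
    return the interval index (0..k-1) of each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin k, so clamp it to k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]


ages = [22, 30, 45, 59, 60, 71]
labels = [age_concept(a) for a in ages]
bins = equal_width_bins([0, 5, 10, 15, 20], 4)
```

Storing the label or interval index in place of the raw value is what reduces the number of distinct values the mining step has to handle.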

Summary
Data preparation is a big issue for both warehousing and mining.
Data preparation includes:
- Data cleaning and data integration
- Data reduction and feature selection
- Discretization
Many methods have been developed, but this is still an active area of research.
