
2 Chapter 2: Data Preprocessing

3 Chapter 3: Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
Summary

4 Data Quality: Why Preprocess the Data?
Measures for data quality: a multidimensional view
  Accuracy: are the values correct or wrong, accurate or not?
  Completeness: are values not recorded, unavailable, …?
  Consistency: were some values modified but others not, are there dangling references, …?
  Timeliness: are the data updated in time?
  Believability: how far can the data be trusted to be correct?
  Interpretability: how easily can the data be understood?

5 Why Data Preprocessing?
Data in the real world is dirty; once there is enough data, every kind of problem shows up.
  Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    e.g., occupation=""
  Noisy: containing errors or outliers
    e.g., Salary="-10"
  Inconsistent: containing discrepancies in codes or names
    e.g., Age="42", Birthday="03/07/1997"
    e.g., rating was "1, 2, 3", now rating is "A, B, C"
    e.g., discrepancy between duplicate records
No quality data, no quality mining results.
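The three kinds of dirtiness listed on this slide can be checked mechanically. The following is a minimal sketch, not a general-purpose validator: the function name `find_problems`, the record layout, and the reference date are all hypothetical, and the slide's ambiguous Birthday="03/07/1997" is assumed here to mean 3 July 1997.

```python
from datetime import date


def find_problems(record, today):
    """Return a list of data-quality problems found in one record (illustrative rules only)."""
    problems = []

    # Incomplete: an attribute of interest is absent or blank.
    if not record.get("occupation", "").strip():
        problems.append("incomplete: occupation is empty")

    # Noisy: a value outside its plausible range.
    if record.get("salary", 0) < 0:
        problems.append("noisy: salary is negative")

    # Inconsistent: the stated age disagrees with the age derived from the birthday.
    birthday = record.get("birthday")
    if birthday is not None and "age" in record:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        derived_age = today.year - birthday.year - (0 if had_birthday else 1)
        if derived_age != record["age"]:
            problems.append("inconsistent: age does not match birthday")

    return problems


# The slide's examples combined into one record (birthday format is an assumption).
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)}
problems = find_problems(record, today=date(2024, 1, 1))
for p in problems:
    print(p)
```

Real checks would be driven by domain rules for each attribute; the thresholds above only mirror the slide's examples.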

6 Why Is Data Dirty?
Incomplete data may come from
  "Not applicable" data values at collection time
  Different considerations between the time the data was collected and the time it is analyzed
  Human, hardware, or software problems
Noisy data (incorrect values) may come from
  Faulty data collection instruments
  Human or computer error at data entry
  Errors in data transmission
Inconsistent data may come from
  Different data sources
  Functional dependency violations (e.g., modifying some linked data)
Duplicate records also need data cleaning

7 Missing values - example
Hospital Check-in Database
A value may be missing because it is unrecorded or because it is inapplicable.
In medical data, the value of the Pregnant? attribute for Jane is missing, while for Joe or Anna it should be considered not applicable.
Some programs can infer missing values.

Name  Age  Sex  Pregnant?  ..
Mary  25   F    N
Jane  27        -
Joe   30   M
Anna  2
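The distinction between an unrecorded value and an inapplicable one can be made explicit with a rule. The sketch below is only an illustration: the rule itself (the attribute applies only to non-male patients at or above a hypothetical minimum age of 12) is an assumption, as is Jane's Sex value of "F", which the slide text implies but the extracted table does not show.

```python
MISSING = "missing"
NOT_APPLICABLE = "not applicable"
MIN_AGE = 12  # hypothetical threshold, not taken from the slide


def classify_pregnant(row):
    """Return the recorded value, or say whether the gap is missing or not applicable."""
    value = row.get("pregnant")
    if value is not None:
        return value  # a value was recorded, keep it
    age = row.get("age")
    if row.get("sex") == "M" or (age is not None and age < MIN_AGE):
        return NOT_APPLICABLE  # the question does not apply to this patient
    return MISSING  # the question applies, but nobody recorded an answer


rows = [
    {"name": "Mary", "age": 25, "sex": "F", "pregnant": "N"},
    {"name": "Jane", "age": 27, "sex": "F", "pregnant": None},  # sex assumed
    {"name": "Joe", "age": 30, "sex": "M", "pregnant": None},
    {"name": "Anna", "age": 2, "sex": None, "pregnant": None},
]
results = {row["name"]: classify_pregnant(row) for row in rows}
print(results)
```

Keeping the two cases apart matters later: a missing value is a candidate for imputation, while a not-applicable value is not.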

8 Why Is Data Preprocessing Important?
No quality data, no quality mining results!
  Quality decisions must be based on quality data
    e.g., duplicate or missing data may cause incorrect or even misleading statistics
  A data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse; data preprocessing is the most labor-intensive step in building a data warehouse or in data mining.
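A small numeric illustration of how duplicate or missing data can distort a statistic. The customer IDs and amounts are made-up values chosen only to show the effect.

```python
from statistics import mean

# (customer_id, purchase_amount); "c2" was entered twice and "c4" has no amount.
records = [("c1", 100), ("c2", 500), ("c2", 500), ("c3", 300), ("c4", None)]

# Mean over the raw known values: the duplicate of c2 is counted twice.
with_duplicates = mean(v for _, v in records if v is not None)

# Deduplicate by customer id, then average only the known amounts.
unique = dict(records)
deduplicated = mean(v for v in unique.values() if v is not None)

# Deduplicated, but the missing amount is silently treated as zero.
missing_as_zero = mean(0 if v is None else v for v in unique.values())

print(with_duplicates, deduplicated, missing_as_zero)
```

The same data set yields three different averages depending on how duplicates and missing values are handled, which is why these decisions belong in an explicit cleaning step.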

9 Major Tasks in Data Preprocessing
Data cleaning
  Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
  Integration of multiple databases, data cubes, or files
Data transformation
  Normalization and aggregation
Data reduction
  Obtains a representation that is much smaller in volume but produces the same or similar analytical results
  Includes data aggregation, attribute subset selection, dimensionality reduction, concept hierarchies, and data discretization
Data discretization
  Part of data reduction, carried out through concept hierarchies and discretization; of particular importance for numerical data
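Three of these tasks can be sketched on a single numeric attribute. The slide names the tasks but not specific methods, so the choices below (mean imputation for cleaning, min-max scaling for normalization, equal-width binning for discretization) are assumptions, and the function names are hypothetical.

```python
from statistics import mean


def fill_missing(values):
    """Data cleaning: replace None with the mean of the known values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Data transformation: rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Data discretization: map each value to one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [20, None, 40, 60]
cleaned = fill_missing(ages)
normalized = min_max_normalize(cleaned)
bins = equal_width_bins(cleaned, n_bins=2)
print(cleaned, normalized, bins)
```

Each function takes and returns a plain list, so the steps can be reordered or swapped for other methods without changing the rest of the pipeline.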

10 Forms of Data Preprocessing

