Data Mining: Data Prepossessing What is to be done before we get to Data Mining?

Slides:



Advertisements
Similar presentations
UNIT – 1 Data Preprocessing
Advertisements

UNIT-2 Data Preprocessing LectureTopic ********************************************** Lecture-13Why preprocess the data? Lecture-14Data cleaning Lecture-15Data.
Descriptive Measures MARE 250 Dr. Jason Turner.
DATA PREPROCESSING Why preprocess the data?
1 Copyright by Jiawei Han, modified by Charles Ling for cs411a/538a Data Mining and Data Warehousing v Introduction v Data warehousing and OLAP for data.
Ch2 Data Preprocessing part3 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.

Calculating & Reporting Healthcare Statistics
Descriptive Statistics A.A. Elimam College of Business San Francisco State University.
Chapter 3 Pre-Mining. Content Introduction Proposed New Framework for a Conceptual Data Warehouse Selecting Missing Value Point Estimation Jackknife estimate.
B a c kn e x t h o m e Parameters and Statistics statistic A statistic is a descriptive measure computed from a sample of data. parameter A parameter is.
Pre-processing for Data Mining CSE5610 Intelligent Software Systems Semester 1.
Intro to Descriptive Statistics
Business Statistics: A Decision-Making Approach, 6e © 2005 Prentice-Hall, Inc. Chap 3-1 Introduction to Statistics Chapter 3 Using Statistics to summarize.
Coefficient of Variation
Business Statistics BU305 Chapter 3 Descriptive Stats: Numerical Methods.
Major Tasks in Data Preprocessing(Ref Chap 3) By Prof. Muhammad Amir Alam.
Chapter 1 Data Preprocessing
Chapter 2: Data Preprocessing
Describing Data: Numerical
September 5, 2015Data Mining: Concepts and Techniques1 Chapter 2: Data Preprocessing Why preprocess the data? Descriptive data summarization Data cleaning.
CS685: Special Topics in Data Mining Jinze Liu September 9,
Chapter 3 – Descriptive Statistics
Descriptive Statistics Roger L. Brown, Ph.D. Medical Research Consulting Middleton, WI Online Course #1.
1 1 Slide Descriptive Statistics: Numerical Measures Location and Variability Chapter 3 BA 201.
M07-Numerical Summaries 1 1  Department of ISM, University of Alabama, Lesson Objectives  Learn when each measure of a “typical value” is appropriate.
1 1 Slide © 2007 Thomson South-Western. All Rights Reserved.
Data Preprocessing Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2010.
1 Univariate Descriptive Statistics Heibatollah Baghi, and Mastee Badii George Mason University.
9/28/2012HCI571 Isabelle Bichindaritz1 Working with Data Data Summarization.
Business Statistics Spring 2005 Summarizing and Describing Numerical Data.
Data Preprocessing Dr. Bernard Chen Ph.D. University of Central Arkansas.
PCB 3043L - General Ecology Data Analysis. PCB 3043L - General Ecology Data Analysis.
Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.

January 17, 2016Data Mining: Concepts and Techniques 1 What Is Data Mining? Data mining (knowledge discovery from data) Extraction of interesting ( non-trivial,
Data Mining: Concepts and Techniques — Chapter 2 —
Statistics topics from both Math 1 and Math 2, both featured on the GHSGT.
Unit 2 Section 2.3. What we will be able to do throughout this part of the chapter…  Use statistical methods to summarize data  The most familiar method.
February 18, 2016Data Mining: Babu Ram Dawadi1 Chapter 3: Data Preprocessing Preprocess Steps Data cleaning Data integration and transformation Data reduction.
Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
Data Mining What is to be done before we get to Data Mining?
Statistics -Descriptive statistics 2013/09/30. Descriptive statistics Numerical measures of location, dispersion, shape, and association are also used.
PCB 3043L - General Ecology Data Analysis Organizing an ecological study What is the aim of the study? What is the main question being asked? What are.
AP PSYCHOLOGY: UNIT I Introductory Psychology: Statistical Analysis The use of mathematics to organize, summarize and interpret numerical data.
Lecture 8 Data Analysis: Univariate Analysis and Data Description Research Methods and Statistics 1.
Pattern Recognition Lecture 20: Data Mining 2 Dr. Richard Spillman Pacific Lutheran University.
Course Outline 1. Pengantar Data Mining 2. Proses Data Mining
Statistics in Management
Data Mining: Concepts and Techniques
Data Mining: Concepts and Techniques
Data Mining: Data Preparation
Measures of Central Tendency
UNIT-2 Data Preprocessing
Central Tendency and Variability
Averages and Variation
Numerical Descriptive Measures
Central tendency and spread
Numerical Descriptive Measures
Data Preprocessing Modified from
Chapter 1 Data Preprocessing
BUSINESS MATHEMATICS & STATISTICS.
Lecture 1: Descriptive Statistics and Exploratory
Data Mining Data Preprocessing
Data Mining: Characterization
By Sandeep Patil, Department of Computer Engineering, I²IT
Numerical Descriptive Measures
Tel Hope Foundation’s International Institute of Information Technology, (I²IT). Tel
Presentation transcript:

Data Mining: Data Prepossessing What is to be done before we get to Data Mining?

Data Preprocessing Why preprocess the data? Descriptive data summarization Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary

Why Data Preprocessing? Data in the real world is dirty – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation=“ ” – noisy: containing errors or outliers e.g., Salary=“-10” – inconsistent: containing discrepancies in codes or names e.g., Age=“42” Birthday=“03/07/1997” e.g., Was rating “1,2,3”, now rating “A, B, C” e.g., discrepancy between duplicate records

Why Is Data Dirty? Incomplete data may come from – “Not applicable” data value when collected – Different considerations between the time when the data was collected and when it is analyzed. – Human/hardware/software problems Noisy data (incorrect values) may come from – Faulty data collection instruments – Human or computer error at data entry – Errors in data transmission Inconsistent data may come from – Different data sources – Functional dependency violation (e.g., modify some linked data) Duplicate records also need data cleaning

Why Is Data Preprocessing Important? No quality data, no quality mining results! – Quality decisions must be based on quality data e.g., duplicate or missing data may cause incorrect or even misleading statistics – Data warehouse needs consistent integration of quality data Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse

Multi-Dimensional Measure of Data Quality A well-accepted multidimensional view: – Accuracy – Completeness – Consistency – Timeliness – Believability – Value added – Interpretability – Accessibility

Major Tasks in Data Preprocessing 1- Data cleaning – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies 2- Data integration – Integration of multiple databases, data cubes, or files 3- Data transformation – Normalization and aggregation 4- Data reduction – Obtains reduced representation in volume but produces the same or similar analytical results 5- Data discretization – Part of data reduction but with particular importance, especially for numerical data – Discretization is the process of putting values into buckets so that there are a limited number of possible states(distinct values).

Chapter 2: Data Preprocessing Why preprocess the data? Descriptive data summarization Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary

Mining Data Descriptive Characteristics Motivation – To better understand the data: central tendency, variation and spread Measures of central tendency refers to finding the mean, median and mode. – mean, median, mode, and midrange Measures of data dispersion measure how spread out a set of data is. – quartiles, interquartile range (IQR), and variance

Mining Data Descriptive Characteristics Distributive measure – A measure (i.e., function) that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results in order to arrive at the measure’s value for the original (entire) data set. Algebraic measure – An algebraic measure is a measure that can be computed by applying an algebraic function to one or more distributive measures. Holistic measure –A measure that must be computed on the entire data set as a whole.

Measuring the Central Tendency Mean (algebraic measure) (sample vs. population): – Weighted arithmetic mean: – Trimmed mean: chopping extreme values Median: A holistic measure – Middle value if odd number of values, or average of the middle two values otherwise – Estimated by interpolation (for grouped data): Mode – Value that occurs most frequently in the data – Unimodal, bimodal, trimodal – Empirical formula:

Measuring the Central Tendency Midrange is the average of the largest and smallest values in the set.

Symmetric vs. Skewed Data Median, mean and mode of symmetric, positively and negatively skewed data