Data Mining: Data Prepossessing What is to be done before we get to Data Mining?

Slides:

Advertisements

Similar presentations

UNIT – 1 Data Preprocessing

Advertisements

UNIT-2 Data Preprocessing LectureTopic ********************************************** Lecture-13Why preprocess the data? Lecture-14Data cleaning Lecture-15Data.

Descriptive Measures MARE 250 Dr. Jason Turner.

DATA PREPROCESSING Why preprocess the data?

1 Copyright by Jiawei Han, modified by Charles Ling for cs411a/538a Data Mining and Data Warehousing v Introduction v Data warehousing and OLAP for data.

Ch2 Data Preprocessing part3 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.

Calculating & Reporting Healthcare Statistics

Descriptive Statistics A.A. Elimam College of Business San Francisco State University.

Chapter 3 Pre-Mining. Content Introduction Proposed New Framework for a Conceptual Data Warehouse Selecting Missing Value Point Estimation Jackknife estimate.

B a c kn e x t h o m e Parameters and Statistics statistic A statistic is a descriptive measure computed from a sample of data. parameter A parameter is.

Pre-processing for Data Mining CSE5610 Intelligent Software Systems Semester 1.

Intro to Descriptive Statistics

Business Statistics: A Decision-Making Approach, 6e © 2005 Prentice-Hall, Inc. Chap 3-1 Introduction to Statistics Chapter 3 Using Statistics to summarize.

Coefficient of Variation

Business Statistics BU305 Chapter 3 Descriptive Stats: Numerical Methods.

Major Tasks in Data Preprocessing(Ref Chap 3) By Prof. Muhammad Amir Alam.

Chapter 1 Data Preprocessing

Chapter 2: Data Preprocessing

Describing Data: Numerical

September 5, 2015Data Mining: Concepts and Techniques1 Chapter 2: Data Preprocessing Why preprocess the data? Descriptive data summarization Data cleaning.

CS685: Special Topics in Data Mining Jinze Liu September 9,

Chapter 3 – Descriptive Statistics

Descriptive Statistics Roger L. Brown, Ph.D. Medical Research Consulting Middleton, WI Online Course #1.

1 1 Slide Descriptive Statistics: Numerical Measures Location and Variability Chapter 3 BA 201.

M07-Numerical Summaries 1 1  Department of ISM, University of Alabama, Lesson Objectives  Learn when each measure of a “typical value” is appropriate.

1 1 Slide © 2007 Thomson South-Western. All Rights Reserved.

Data Preprocessing Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2010.

1 Univariate Descriptive Statistics Heibatollah Baghi, and Mastee Badii George Mason University.

9/28/2012HCI571 Isabelle Bichindaritz1 Working with Data Data Summarization.

Business Statistics Spring 2005 Summarizing and Describing Numerical Data.

Data Preprocessing Dr. Bernard Chen Ph.D. University of Central Arkansas.

PCB 3043L - General Ecology Data Analysis. PCB 3043L - General Ecology Data Analysis.

Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.

January 17, 2016Data Mining: Concepts and Techniques 1 What Is Data Mining? Data mining (knowledge discovery from data) Extraction of interesting ( non-trivial,

Data Mining: Concepts and Techniques — Chapter 2 —

Statistics topics from both Math 1 and Math 2, both featured on the GHSGT.

Unit 2 Section 2.3. What we will be able to do throughout this part of the chapter…  Use statistical methods to summarize data  The most familiar method.

February 18, 2016Data Mining: Babu Ram Dawadi1 Chapter 3: Data Preprocessing Preprocess Steps Data cleaning Data integration and transformation Data reduction.

Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.

Data Mining What is to be done before we get to Data Mining?

Statistics -Descriptive statistics 2013/09/30. Descriptive statistics Numerical measures of location, dispersion, shape, and association are also used.

PCB 3043L - General Ecology Data Analysis Organizing an ecological study What is the aim of the study? What is the main question being asked? What are.

AP PSYCHOLOGY: UNIT I Introductory Psychology: Statistical Analysis The use of mathematics to organize, summarize and interpret numerical data.

Lecture 8 Data Analysis: Univariate Analysis and Data Description Research Methods and Statistics 1.

Pattern Recognition Lecture 20: Data Mining 2 Dr. Richard Spillman Pacific Lutheran University.

Course Outline 1. Pengantar Data Mining 2. Proses Data Mining

Statistics in Management

Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques

Data Mining: Data Preparation

Measures of Central Tendency

UNIT-2 Data Preprocessing

Central Tendency and Variability

Averages and Variation

Numerical Descriptive Measures

Central tendency and spread

Numerical Descriptive Measures

Data Preprocessing Modified from

Chapter 1 Data Preprocessing

BUSINESS MATHEMATICS & STATISTICS.

Lecture 1: Descriptive Statistics and Exploratory

Data Mining Data Preprocessing

Data Mining: Characterization

By Sandeep Patil, Department of Computer Engineering, I²IT

Numerical Descriptive Measures

Tel Hope Foundation’s International Institute of Information Technology, (I²IT). Tel

Presentation transcript:

Data Mining: Data Prepossessing What is to be done before we get to Data Mining?

Data Preprocessing Why preprocess the data? Descriptive data summarization Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary

Why Data Preprocessing? Data in the real world is dirty – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation=“ ” – noisy: containing errors or outliers e.g., Salary=“-10” – inconsistent: containing discrepancies in codes or names e.g., Age=“42” Birthday=“03/07/1997” e.g., Was rating “1,2,3”, now rating “A, B, C” e.g., discrepancy between duplicate records

Why Is Data Dirty? Incomplete data may come from – “Not applicable” data value when collected – Different considerations between the time when the data was collected and when it is analyzed. – Human/hardware/software problems Noisy data (incorrect values) may come from – Faulty data collection instruments – Human or computer error at data entry – Errors in data transmission Inconsistent data may come from – Different data sources – Functional dependency violation (e.g., modify some linked data) Duplicate records also need data cleaning

Why Is Data Preprocessing Important? No quality data, no quality mining results! – Quality decisions must be based on quality data e.g., duplicate or missing data may cause incorrect or even misleading statistics – Data warehouse needs consistent integration of quality data Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse

Multi-Dimensional Measure of Data Quality A well-accepted multidimensional view: – Accuracy – Completeness – Consistency – Timeliness – Believability – Value added – Interpretability – Accessibility

Major Tasks in Data Preprocessing 1- Data cleaning – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies 2- Data integration – Integration of multiple databases, data cubes, or files 3- Data transformation – Normalization and aggregation 4- Data reduction – Obtains reduced representation in volume but produces the same or similar analytical results 5- Data discretization – Part of data reduction but with particular importance, especially for numerical data – Discretization is the process of putting values into buckets so that there are a limited number of possible states(distinct values).

Chapter 2: Data Preprocessing Why preprocess the data? Descriptive data summarization Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary

Mining Data Descriptive Characteristics Motivation – To better understand the data: central tendency, variation and spread Measures of central tendency refers to finding the mean, median and mode. – mean, median, mode, and midrange Measures of data dispersion measure how spread out a set of data is. – quartiles, interquartile range (IQR), and variance

Mining Data Descriptive Characteristics Distributive measure – A measure (i.e., function) that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results in order to arrive at the measure’s value for the original (entire) data set. Algebraic measure – An algebraic measure is a measure that can be computed by applying an algebraic function to one or more distributive measures. Holistic measure –A measure that must be computed on the entire data set as a whole.

Measuring the Central Tendency Mean (algebraic measure) (sample vs. population): – Weighted arithmetic mean: – Trimmed mean: chopping extreme values Median: A holistic measure – Middle value if odd number of values, or average of the middle two values otherwise – Estimated by interpolation (for grouped data): Mode – Value that occurs most frequently in the data – Unimodal, bimodal, trimodal – Empirical formula:

Measuring the Central Tendency Midrange is the average of the largest and smallest values in the set.

Symmetric vs. Skewed Data Median, mean and mode of symmetric, positively and negatively skewed data