Data Mining: Concepts and Techniques (January 17, 2016)

Presentation transcript:

Slide 1: What Is Data Mining?
- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
  - Data mining: a misnomer?
- Alternative names
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Watch out: is everything “data mining”?
  - Simple search and query processing
  - (Deductive) expert systems

Slide 2: Multi-Dimensional View of Data Mining
- Data to be mined
  - Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
- Knowledge to be mined
  - Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  - Multiple/integrated functions and mining at multiple levels
- Techniques utilized
  - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
- Applications adapted
  - Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Slide 3: Major Issues in Data Mining
- Mining methodology
  - Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  - Performance: efficiency, effectiveness, and scalability
  - Pattern evaluation: the interestingness problem
  - Incorporation of background knowledge
  - Handling noise and incomplete data
  - Parallel, distributed, and incremental mining methods
  - Integration of the discovered knowledge with existing knowledge: knowledge fusion
- User interaction
  - Data mining query languages and ad-hoc mining
  - Expression and visualization of data mining results
  - Interactive mining of knowledge at multiple levels of abstraction
- Applications and social impacts
  - Domain-specific data mining and invisible data mining
  - Protection of data security, integrity, and privacy

Slide 4: Chapter 2: Data Preprocessing
- Why preprocess the data?
- Descriptive data summarization
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Slide 5: Why Data Preprocessing?
- Data in the real world is dirty
  - Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    - e.g., occupation=“ ”
  - Noisy: containing errors or outliers
    - e.g., Salary=“-10”
  - Inconsistent: containing discrepancies in codes or names
    - e.g., Age=“42” Birthday=“03/07/1997”
    - e.g., was rating “1,2,3”, now rating “A, B, C”
    - e.g., discrepancy between duplicate records
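The three kinds of dirtiness listed on this slide can be illustrated with a small check over a few records. This is only a sketch under stated assumptions: the records, the field names, the helper functions `age_on` and `find_problems`, and the reference date (taken from the date printed on the slides) are all hypothetical, and the birthday “03/07/1997” is read as March 7, 1997.

```python
from datetime import date

# Hypothetical records echoing the slide's examples; field names are assumptions.
records = [
    {"id": 1, "occupation": "", "salary": 52000, "age": 42, "birthday": date(1973, 3, 7)},
    {"id": 2, "occupation": "clerk", "salary": -10, "age": 30, "birthday": date(1985, 6, 1)},
    {"id": 3, "occupation": "nurse", "salary": 61000, "age": 42, "birthday": date(1997, 3, 7)},
]


def age_on(birthday, today):
    """Whole years elapsed between birthday and today."""
    before_birthday = (today.month, today.day) < (birthday.month, birthday.day)
    return today.year - birthday.year - before_birthday


def find_problems(record, today):
    """Return a list of data-quality problems found in one record."""
    problems = []
    if not str(record["occupation"]).strip():
        problems.append("incomplete: occupation is empty")
    if record["salary"] < 0:
        problems.append("noisy: salary is negative")
    if age_on(record["birthday"], today) != record["age"]:
        problems.append("inconsistent: age does not match birthday")
    return problems


REFERENCE_DATE = date(2016, 1, 17)  # assumed: the date shown on the slides
report = {r["id"]: find_problems(r, REFERENCE_DATE) for r in records}
for record_id, problems in report.items():
    print(record_id, problems)
```

Each record here trips exactly one check, so the report separates the incomplete, noisy, and inconsistent cases. Real data would need domain-specific rules in place of these three.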

Slide 6: Why Is Data Dirty?
- Incomplete data may come from
  - “Not applicable” data value when collected
  - Different considerations between the time when the data was collected and when it is analyzed
  - Human/hardware/software problems
- Noisy data (incorrect values) may come from
  - Faulty data collection instruments
  - Human or computer error at data entry
  - Errors in data transmission
- Inconsistent data may come from
  - Different data sources
  - Functional dependency violation (e.g., modify some linked data)
- Duplicate records also need data cleaning

Slide 7: Why Is Data Preprocessing Important?
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
    - e.g., duplicate or missing data may cause incorrect or even misleading statistics
  - A data warehouse needs consistent integration of quality data
- Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

Slide 8: Multi-Dimensional Measure of Data Quality
- A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
- Broad categories:
  - Intrinsic, contextual, representational, and accessibility

Slide 9: Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtains a reduced representation in volume but produces the same or similar analytical results
- Data discretization
  - Part of data reduction but with particular importance, especially for numerical data
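Two of the tasks named on this slide, filling in missing values (data cleaning) and normalization (data transformation), can be sketched in a few lines. The slide does not say which techniques to use. Mean imputation and min-max scaling are assumptions chosen here for brevity, and the function names and the income values are hypothetical.

```python
def fill_missing_with_mean(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a mean from")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Data transformation: rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so map them all to the lower bound.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


incomes = [30000, None, 50000, 70000]  # hypothetical attribute with one missing value
cleaned = fill_missing_with_mean(incomes)
normalized = min_max_normalize(cleaned)
print(cleaned)
print(normalized)
```

Mean imputation is only one option. Depending on the data, a median, a class-conditional mean, or a model-based estimate may be more appropriate.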

Slide 10: Forms of Data Preprocessing

Slide 11: Chapter 2: Data Preprocessing
- Why preprocess the data?
- Descriptive data summarization
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Slide 12: Mining Data Descriptive Characteristics
- Motivation
  - To better understand the data: central tendency, variation, and spread
- Data dispersion characteristics
  - Median, max, min, quantiles, outliers, variance, etc.
- Numerical dimensions correspond to sorted intervals
  - Data dispersion: analyzed with multiple granularities of precision
  - Boxplot or quantile analysis on sorted intervals
- Dispersion analysis on computed measures
  - Folding measures into numerical dimensions
  - Boxplot or quantile analysis on the transformed cube

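Slide 13: Measuring the Central Tendency
- Mean (algebraic measure) (sample vs. population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
  - Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  - Trimmed mean: chopping extreme values
- Median: a holistic measure
  - Middle value if odd number of values, or average of the middle two values otherwise
  - Estimated by interpolation (for grouped data): $\text{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$, where $L_1$ is the lower boundary of the median interval, $n$ is the number of values, $(\sum f)_l$ is the sum of the frequencies of all intervals below the median interval, $f_{\text{median}}$ is the frequency of the median interval, and $c$ is the interval width
- Mode
  - Value that occurs most frequently in the data
  - Unimodal, bimodal, trimodal
  - Empirical formula: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$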
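The measures on this slide can be computed with a short sketch such as the one below. The function names, the sample values, and the `(lower_bound, width, frequency)` layout for grouped data are assumptions made for illustration. The grouped median uses the standard interpolation: the lower boundary of the median interval, plus the fraction of that interval needed to reach half of the values, times the interval width.

```python
from collections import Counter
from statistics import mean, median


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after chopping `fraction` of the values off each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def modes(values):
    """All values that occur most frequently (one for unimodal data)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def grouped_median(intervals):
    """Interpolated median for (lower_bound, width, frequency) tuples,
    sorted by lower_bound."""
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no observations")
    cumulative = 0
    for lower, width, freq in intervals:
        if cumulative + freq >= n / 2:
            return lower + (n / 2 - cumulative) * width / freq
        cumulative += freq


# Illustrative values only.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(mean(data), median(data), modes(data))
print(trimmed_mean(data, 0.1))
print(weighted_mean([1, 2, 3], [1, 1, 2]))
print(grouped_median([(0, 10, 2), (10, 10, 5), (20, 10, 3)]))
```

On this sample, the single large value pulls the mean above the median, and trimming one value from each end moves the mean back toward the median. This is why the trimmed mean is used when extreme values are present.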

Slide 14: Measuring the Dispersion of Data
- Quartiles, outliers and boxplots
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Inter-quartile range: IQR = Q3 – Q1
  - Five-number summary: min, Q1, M, Q3, max
  - Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend from the box, and outliers are plotted individually
  - Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
- Variance and standard deviation (sample: s, population: σ)
  - Variance (algebraic, scalable computation): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$ (sample), $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$ (population)
  - Standard deviation s (or σ) is the square root of variance s² (or σ²)
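A minimal sketch of these dispersion measures follows. Several quartile conventions exist and the slide does not specify one. The median-of-halves rule used here is an assumption, as are the function names and the sample values. The one-pass variance keeps running sums of x and x², which is one reading of "scalable computation" since the data is scanned only once.

```python
from statistics import median, variance


def quartiles(values):
    """Q1, median, Q3 using the median-of-halves convention (one of several)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + (n % 2):]  # skip the middle value when n is odd
    return median(lower), median(ordered), median(upper)


def five_number_summary(values):
    """min, Q1, median, Q3, max."""
    q1, m, q3 = quartiles(values)
    return min(values), q1, m, q3, max(values)


def iqr_outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


def sample_variance_one_pass(values):
    """Sample variance from running sums of x and x^2 (single scan)."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    return (total_sq - total * total / n) / (n - 1)


# Illustrative values only.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(data))
print(iqr_outliers(data))
print(sample_variance_one_pass(data), variance(data))
```

The sum-of-squares form is convenient for incremental or distributed computation, but it can lose precision when the mean is large relative to the spread. The two-pass definition, or `statistics.variance`, is the safer choice when the data fits in memory.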