Data, pre-processing and exploration

Chapter Three: Data, pre-processing and exploration

Chapter Overview
  Data, data types and operations
  Properties of various data sets
  Data source and data warehouse
  Issues of data quality
  Data pre-processing operations
  Data summary and visualisation
  Online analytic processing (OLAP)
  Data exploration and visualisation in Weka

Data, Data Types and Operations: Data object and attributes
  Data object or instance: an individual, independent recording of a real-life object or event, characterised by its recorded values on a fixed set of features or attributes.
  Feature or attribute: a specific property or characteristic of the data object.
  Measurement: assigning a valid value to an attribute according to an appropriate measurement scale.
  Collection: collecting measurement results or recorded values.

Data, Data Types and Operations: Data object and attributes (cont’d)
  An example: 123, “John Smith”, “03/02/1990”, 20, “male”, 1.82, 78
    123: ID number (collected)
    “John Smith”: name (collected)
    “03/02/1990”: birthday (collected)
    20: age (calculated)
    “male”: gender (collected)
    1.82: body height (measured)
    78: body weight

Data, Data Types and Operations: Data object and attributes (cont’d)
  Measurement and measurement errors
    Precision: the closeness of repeated measurements to one another, represented by the standard deviation of the measurements, e.g. repeated measurement of body temperature.
    Bias: a systematic deviation of measurements from the intended quantity, only known when an external reference is available, e.g. bias in a weighing instrument.
    Accuracy: the closeness of the measurement to the true value, indicated by the number of significant digits used in the measurement, e.g. measuring money in pounds vs. pence.
  Collection errors
    Incorrect data recording at the point of entry, e.g. “Hongpo Do” entered for “Hongbo Du”.
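The definitions of precision and bias above can be illustrated with a minimal sketch. The readings and the reference value are made-up numbers, not taken from the slides: precision is estimated as the standard deviation of repeated readings, and bias as the difference between their mean and the external reference.

```python
import statistics

# Hypothetical repeated readings (kg) of a reference weight known to be 1.000 kg.
readings = [1.015, 1.013, 1.016, 1.014, 1.012]
true_value = 1.000  # external reference, assumed known

mean_reading = statistics.mean(readings)

# Precision: spread of the readings around their own mean.
precision = statistics.stdev(readings)

# Bias: systematic offset of the readings from the reference value.
bias = mean_reading - true_value

print(f"precision (std dev) = {precision:.4f} kg, bias = {bias:.3f} kg")
```

In this made-up case the instrument is precise (the readings agree closely) but biased (they all sit above the reference).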

Data, Data Types and Operations: Attribute domain types and operations
  Categorical/Qualitative types
    Nominal, e.g. Gender (M, F)
      A set of names: no concept of order or difference
      Operators applicable: =, ≠
      1:1 transformation permissible, e.g. ID: 11 → e901
    Ordinal, e.g. Grade (A, B, C, D, E)
      A set of names: with order but no concept of difference
      Operators applicable: =, ≠, <, >, ≤, ≥
      Order-preserving transformation permitted, e.g. Grade: A → First, B → Second, C → Third, D → Pass, E → BarePass

Data, Data Types and Operations: Attribute domain types and operations
  Numeric/Quantitative types
    Interval, e.g. Temperature in °C
      A set of numeric values: both order and difference exist
      Operators applicable: =, ≠, <, >, ≤, ≥, +, −
      e.g. temperature (°F and °C), calendar year
      Transformation new = a*old + b permitted, e.g. °F → °C
    Ratio, e.g. Length
      A set of numeric values: order, difference and ratio
      The set has an absolute zero
      Operators applicable: =, ≠, <, >, ≤, ≥, +, −, ×, ÷
      Transformation new = a*old permitted, e.g. metre → feet
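A short sketch of why ratios are meaningful for ratio attributes but not for interval attributes. The function names are illustrative; the conversion formulas are the standard ones. A ratio of two temperatures changes under the permitted interval transformation new = a*old + b, whereas a ratio of two lengths survives the permitted ratio transformation new = a*old.

```python
import math

def celsius_to_fahrenheit(c):
    # Interval-scale transformation: new = a*old + b
    return 9 / 5 * c + 32

def metres_to_feet(m):
    # Ratio-scale transformation: new = a*old (1 ft = 0.3048 m)
    return m / 0.3048

# "20 degrees is twice 10 degrees" does not survive a change of unit.
ratio_c = 20 / 10
ratio_f = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# "4 m is twice 2 m" does survive a change of unit.
ratio_m = 4 / 2
ratio_ft = metres_to_feet(4) / metres_to_feet(2)

print(ratio_c, round(ratio_f, 2))   # ratios differ for the interval attribute
print(ratio_m, round(ratio_ft, 2))  # ratios agree for the ratio attribute
```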

Data Sets: Various forms
  Table of records
    Relational table
    Join of relational tables
    Numerical spreadsheet (data matrix)
    Boolean strings (document-term matrix)
  Ordered data
    Time series and temporal sequence
    Data sequence
    Spatial data
  Graph-based data
  Non record-based data

Data Sets: Various forms (illustrated)
  [Figure: examples of a data matrix, a relational table, a transaction database, a web structure (pages connected by links), spatial data, and a data sequence (e.g. GGTTCCGCCTTCAGCC CCGCGCCCGCAGGG…)]

Data Sets: Properties
  Type: file structure, e.g. ARFF for Weka, DAT for See5
  Size: measured in terms of the total number of records or total number of bytes, e.g. small (MB), medium (GB) and large (TB)
  Dimensionality: number of attributes
  Sparsity
    Values are skewed to some extreme or sub-ranges
    Asymmetric values (some are more important than others)
  Resolution
    Right level of data details
    Related to the intended purpose

Data Sets: Properties (example insurance data set)
  Type: ARFF
  Size: 14722 records
  Dimensionality: 7
  Asymmetric: Y/N
  Skewed?
  Resolution: detailed

Data Source and Data Warehouse: Sources of data
  Local data source available
  Local operational systems from different departments
  Third-party external data source
  Enterprise/Organisational data warehouse
    An organisational database for decision making
    A central data repository separate from operational systems
    Enforcing organisation-wide data consistency and integration
    Providing data details as well as data summarisation
    Providing data values as well as meta-data
    Equipped with data analysis and reporting tools
    As a data source for data mining

Data Source and Data Warehouse: Star schema for data warehouse
  Central fact table
  Dimension tables
  Limited use of join operations
  Example tables:
    Part(p#, pname, weight, colour)
    Supply(s#, p#, pj#, qty)
    Supplier(s#, sname, city, status)
    Project(pj#, jname, status, date)

Issues of Data Quality: Main quality indicators
  Accuracy: data recorded with sufficient precision and little bias
  Correctness: data recorded without errors and spurious objects
  Completeness: whether any parts of the data records are missing
  Consistency: compliance with established rules and constraints
  Redundancy: unnecessary duplicates
  Using the indicators to quantify the quality of a data set
  Improving quality if possible

Issues of Data Quality: Some examples
  Accuracy & correctness with the road accident reports in Exercise 1.3(c).
  Completeness with the UK family expenditure surveys in Exercise 1.3(a).
  Incompleteness introduced by data integration using the outer join operation.
  Consistency in questionnaires, e.g. eating fruit & veg.
    Q1: “give the fruit&veg portion consumed yesterday”: 2
    Q2: “give the fruit&veg portion consumed today”: 3
    Q3: “do you eat more today than yesterday?” No.
  Redundancy in a local company’s database of 40,000 records about 15,000 client companies.

Issues of Data Quality: Why is quality important?
  “Garbage in, garbage out!”
  Total data quality control requires a cultural change (compare with total product quality control).
  For data mining, tackling the quality issue at the data source cannot always be expected. The issue is addressed instead:
    By cleaning the data as much as possible
    By developing and using more tolerant mining solutions
  Data quality is relevant to the intended purpose of data mining, e.g. do spelling errors in student names really matter when only the increase/decrease of student numbers in particular subject areas over the years is of interest?

Data Pre-processing: Overview
  Purpose: speedy, cost-effective and high-quality outcomes of data mining
  Pre-processing tasks (not all are independent of each other)
    Data aggregation
    Data sampling
    Dimension reduction
    Feature selection
    Feature creation
    Discretisation/binarisation
    Variable transformation
    Dealing with missing values

Data Pre-processing: Data aggregation
  What: to summarise low-level data details into a higher-level data abstraction
  Why: to reduce the time of mining, to rescale data values, and to discover more stable patterns
  How:
    By generalisation using a given concept hierarchy
    By applying aggregate functions (e.g. count, sum, average)
    By dropping some attributes
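As an illustration of aggregation, the sketch below generalises hypothetical daily sales records to the month level (a day → month concept hierarchy) and applies count, sum and average. The record layout and values are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical daily sales records: (date as "YYYY-MM-DD", amount).
daily_sales = [
    ("2024-01-03", 120.0),
    ("2024-01-17", 80.0),
    ("2024-02-02", 50.0),
    ("2024-02-20", 150.0),
    ("2024-02-28", 100.0),
]

def aggregate_by_month(records):
    """Summarise day-level records into one row per month."""
    groups = defaultdict(list)
    for date, amount in records:
        month = date[:7]  # generalise day -> month ("YYYY-MM")
        groups[month].append(amount)
    return {
        month: {
            "count": len(amounts),
            "sum": sum(amounts),
            "average": sum(amounts) / len(amounts),
        }
        for month, amounts in groups.items()
    }

monthly = aggregate_by_month(daily_sales)
for month, summary in sorted(monthly.items()):
    print(month, summary)
```

Five detailed records become two summary rows, which is smaller to mine and less sensitive to day-to-day fluctuation.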

Data Pre-processing: Data sampling
  What: selecting a subset of the given data set
  Why: to make it possible to use sophisticated mining algorithms within a time limit
  Caution: the sample must be representative of the original data set
  How:
    Random sampling
    Stratified sampling
    Progressive sampling
    With or without replacement
  [Figure: data population → sampling method → selected subset]
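The sampling options above can be sketched with Python's standard `random` module. The population, the 10% sampling fraction and the helper `stratified_sample` are assumptions for illustration; progressive sampling (growing the sample until results stabilise) is not shown.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, rng):
    """Sample the same fraction from each stratum, without replacement."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 80 records of class "M" and 20 of class "F".
population = [("M", i) for i in range(80)] + [("F", i) for i in range(20)]

simple = rng.sample(population, 10)              # random, without replacement
with_replacement = rng.choices(population, k=10)  # random, with replacement
stratified = stratified_sample(
    population, key=lambda r: r[0], fraction=0.1, rng=rng
)

print(sum(1 for r in stratified if r[0] == "M"),
      sum(1 for r in stratified if r[0] == "F"))
```

Stratified sampling keeps the 80:20 class proportion of the population in the sample, which simple random sampling only does on average.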

Data Pre-processing: Feature selection
  What: reducing dimensionality by selecting a subset of attributes
  Purposes:
    To remove/reduce redundant features
    To remove irrelevant features with no useful information for the mining task
  How:
    Manually, with common sense and domain knowledge
    Letting the mining solution select suitable features (the embedded approach)
    Filter and wrapper approaches
  [Figure: subset selection loop: attributes → subset selection → one subset → evaluation → stopping criterion (not ok: back to subset selection; ok: selected subset) → validate with mining task]

Data Pre-processing: Data dimension reduction
  What: to reduce the redundancy implied among attributes, e.g. are all 9600 dimensions for a 120x80 pixel image necessary?
  Curse of dimensionality: as dimensionality increases
    Data become more diverse, and any patterns become less significant and more peculiar.
    The processing time may increase substantially.
  Why: to reduce redundancy and the effects of the curse
  How:
    Linear algebra techniques
      Principal component analysis (PCA)
      Independent component analysis (ICA)
      Singular value decomposition (SVD)
    Feature selection (as described before)

Data Pre-processing: Feature creation
  What: to create a new set of features from the original features
  Purpose: in the new feature space, meaningful and relevant patterns can be extracted more easily. The number of features may be reduced.
  How:
    Using feature extraction methods to extract new features from the existing ones, e.g. extracting colour, texture and shape from an image of pixel values
    Mapping data to a new space, e.g. wavelet transformation of the pixel values of images to a frequency domain
    Constructing new features from the existing ones using domain knowledge, e.g. using transaction dates to construct a new feature, customer tenure, that indicates the loyalty of the customer to the company
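The customer tenure example can be sketched as follows. The customer IDs, transaction dates and reference date are made up, and defining tenure as the number of days since the first transaction is one possible reading, not necessarily the one intended by the slides.

```python
from datetime import date

# Hypothetical transaction dates per customer.
transactions = {
    "c001": [date(2021, 3, 1), date(2022, 7, 15), date(2023, 1, 10)],
    "c002": [date(2023, 5, 20)],
}
reference_date = date(2023, 12, 31)  # assumed "today" for the calculation

def tenure_days(dates, today):
    """New feature: days elapsed since the customer's first transaction."""
    return (today - min(dates)).days

tenure = {
    customer: tenure_days(dates, reference_date)
    for customer, dates in transactions.items()
}
print(tenure)
```

The raw date list is hard for most mining algorithms to use directly; the single numeric tenure value is not.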

Data Pre-processing: Data discretisation
  What: to convert continuous attribute values to discrete categorical values
  The purposes:
    Requirement for some data mining solutions
    Better data mining results (not always)
  How:
    Deciding how many categories to have and where the split points should be
    Mapping values to categories
  [Figure: determine the number and locations of the split points t1, t2, t3, t4; map the values within each sub-range to a category label]

Data Pre-processing: Data discretisation (cont’d)
  Discretisation methods:
    Unsupervised: without concern for the outcome of a specific attribute, normally used for clustering and association rule mining, e.g. equal width, equal depth, clustering
    Supervised: with respect to the outcome of the class attribute, normally used for classification
      Simple methods: sorting according to the class attribute, and then discretising the attribute values for each class.
      Sophisticated methods: the discretisation of the attribute values purifies the outcome of the class, e.g. using entropy to measure the degree of purity, and deciding the split points recursively, similar to decision tree induction
      Merging methods: merging small intervals into a larger one with a stop criterion
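The two simplest unsupervised methods named above, equal width and equal depth (equal frequency), can be sketched as below. The function names and the sample values are illustrative; the equal-width version assumes the values are not all identical, and the equal-depth version may place tied values in different bins.

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes hi > lo
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_depth_bins(values, k):
    """Split the sorted values into k bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

ages = [1, 3, 4, 8, 10, 15, 22, 30]
print(equal_width_bins(ages, 3))  # → [0, 0, 0, 0, 0, 1, 2, 2]
print(equal_depth_bins(ages, 3))  # → [0, 0, 0, 1, 1, 1, 2, 2]
```

With skewed data, equal width leaves most values in one bin, while equal depth balances the bin sizes at the cost of uneven interval widths.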

Data Pre-processing: Data binarisation
  What: to convert discrete categorical values to binary (Boolean) attribute values
  The purpose: the same as for discretisation
  How:
    Convert the m categorical values to integer values in [0, m-1]
    Convert each integer to a binary number of n bits, where n = ⌈log2 m⌉
    Alternatively, use m asymmetric binary variables, one for each of the m values
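Both binarisation schemes can be sketched for the five-valued Grade attribute used earlier. The helper names are illustrative. With m = 5 values, the compact encoding needs 3 bits (the smallest n with 2^n ≥ 5), while the second scheme uses one binary variable per value.

```python
grades = ["A", "B", "C", "D", "E"]           # m = 5 categorical values
index = {g: i for i, g in enumerate(grades)}  # map values to 0 .. m-1

# Smallest number of bits n such that 2**n >= m.
n_bits = max(1, (len(grades) - 1).bit_length())

def to_bits(value):
    """Encode a value as n_bits binary attributes (most significant first)."""
    i = index[value]
    return [(i >> b) & 1 for b in reversed(range(n_bits))]

def one_per_value(value):
    """Encode a value as m binary attributes, exactly one of which is 1."""
    return [1 if g == value else 0 for g in grades]

print(n_bits)              # → 3
print(to_bits("D"))        # → [0, 1, 1]
print(one_per_value("C"))  # → [0, 0, 1, 0, 0]
```

The compact encoding uses fewer attributes but its individual bits carry no meaning on their own; the one-variable-per-value encoding is longer but each attribute answers a simple yes/no question.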

Data Pre-processing: Variable transformation
  What: to transform all values of an attribute to other values
  The purposes:
    Remove the effect of outlier values
    Make the resulting data visualisation more interpretable
    Make the values more comparable
  How:
    Transformation using a function, e.g. log(x)
    Standardisation/normalisation, e.g. division-by-range
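A minimal sketch of the transformations mentioned above, on made-up values. Min-max scaling is shown as one common reading of "division-by-range", and z-score standardisation is added as a widely used alternative; neither is confirmed as the exact variant intended by the slides.

```python
import math
import statistics

incomes = [10, 100, 1000]  # hypothetical values spanning orders of magnitude

# Function transformation: log compresses the large values.
log_values = [math.log10(x) for x in incomes]

# Min-max normalisation: subtract the minimum, divide by the range.
lo, hi = min(incomes), max(incomes)
min_max = [(x - lo) / (hi - lo) for x in incomes]

# Z-score standardisation: subtract the mean, divide by the std deviation.
mean = statistics.mean(incomes)
sd = statistics.pstdev(incomes)
z_scores = [(x - mean) / sd for x in incomes]

print(log_values)
print([round(v, 3) for v in min_max])
print([round(v, 3) for v in z_scores])
```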

Data Pre-processing: Handling missing values
  What: to treat attributes with null values
  The purposes:
    Improve data quality
    Better mining results
  How:
    Elimination (may not always be possible)
    Using a sensible default, e.g. Spending Amount is set to 0
    By data imputation
      Average, median or mode of the whole data population
      Average, median or mode of the nearest neighbours
    Postponing the handling and making the mining methods adaptive to missing values
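The default-value and whole-population imputation options can be sketched as follows. Missing values are represented by `None`, and the data and the `impute` helper are illustrative assumptions; nearest-neighbour imputation is not shown.

```python
import statistics

def impute(values, strategy):
    """Replace None with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[strategy](observed)
    return [fill if v is None else v for v in values]

spending = [20, None, 30, 100, None, 30]   # hypothetical numeric attribute
gender = ["male", None, "female", "male"]  # hypothetical categorical attribute

with_default = [0 if v is None else v for v in spending]  # sensible default
by_mean = impute(spending, "mean")
by_median = impute(spending, "median")
by_mode = impute(gender, "mode")

print(with_default)
print(by_mean)
print(by_median)
print(by_mode)
```

The single large value (100) pulls the mean up to 45, while the median stays at 30, which is why the median is often preferred for skewed attributes.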

Data Exploration: Exploring data before mining
  Knowing the data is essential for successful data mining
  Purposes:
    Better understanding of the characteristics of the data
    Better decisions over data pre-processing tasks
    Even being able to discover some hidden patterns
  Categories of data exploration techniques
    Summary statistics: using a small set of descriptors to describe the characteristics of a large data set
    Data visualisation: using graphical or tabular forms to reveal hidden data patterns
    Online Analytic Processing (OLAP)
  Data exploration and exploratory data analysis (EDA)

Data Exploration: Summary statistics
  Frequency and mode for categorical attributes:
    Frequency of a value
    Mode: the most frequently occurring value
  Percentiles for ordinal or continuous attributes:
    Given an attribute x and an integer p (0 ≤ p ≤ 100), the percentile x_p is a value of x such that p% of the observed values of x are less than x_p.
  Mean and median for continuous attributes:
    Median is a better indication of the “average” when the data distribution is skewed or outliers are present
    Trimmed mean and median (after trimming the top and bottom p%)
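A small sketch of these descriptors on made-up data containing one outlier. Several percentile conventions exist; the nearest-rank rule used here is an assumption, chosen because it is simple, and the helper names are illustrative.

```python
import statistics

def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: smallest value with at least p% at or below it."""
    s = sorted(values)
    rank = max(1, -(-p * len(s) // 100))  # ceil(p * n / 100), at least 1
    return s[rank - 1]

def trimmed_mean(values, p):
    """Mean after dropping the lowest and highest p% of the values."""
    s = sorted(values)
    k = int(len(s) * p / 100)
    return statistics.mean(s[k:len(s) - k])

data = [2, 3, 3, 4, 5, 5, 6, 7, 8, 90]  # 90 is an outlier

print(statistics.mode(data))              # → 3 (first of the most frequent values)
print(statistics.mean(data))              # → 13.3
print(statistics.median(data))            # → 5.0
print(trimmed_mean(data, 10))             # → 5.125
print(percentile_nearest_rank(data, 90))  # → 8
```

The outlier drags the mean to 13.3, well above every other value, while the median and the 10% trimmed mean stay close to the bulk of the data.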

Data Exploration: Summary statistics (cont’d)
  Measures of spread:
    Range
    Variance (σ²)
    Standard deviation (σ)
    Absolute average deviation (AAD)
  Multivariate summary statistics
    Mean vector
    Matrix of covariance
    Correlation
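The univariate measures of spread can be computed as in the sketch below, on made-up data. Population formulas (dividing by n) are used for the variance and standard deviation; this is an assumption, and the sample versions (dividing by n − 1) are available as `statistics.variance` and `statistics.stdev`.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

value_range = max(data) - min(data)
variance = statistics.pvariance(data)  # mean squared deviation from the mean
std_dev = statistics.pstdev(data)      # square root of the variance

# Absolute average deviation: mean absolute deviation from the mean.
mean = statistics.mean(data)
aad = sum(abs(x - mean) for x in data) / len(data)

print(value_range, variance, std_dev, aad)  # → 7 4 2.0 1.5
```

Because it does not square the deviations, AAD is less affected by extreme values than the variance and standard deviation.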

Data Exploration: Data visualisation
  Rationale: human eyes are good at spotting patterns, particularly visual patterns.
  Major ways of visualising data
    Tabular form
    Graphical form
    Points and links
  Visual representation must be related to the data types of the attributes
  Visualising data as well as all its implicit relationships
  The visualisation must be comprehensible
  The visualisation of data must tell the truth

Data Exploration: Data visualisation techniques
  Pie chart
  Bar chart
  Scatter plot
  Stem & leaf plot
  Parallel dimension chart
  Star dimension chart

Data Exploration: Online analytic processing (OLAP)
  Interactive reporting tool
  Treating a data set as a multidimensional hypercube
  Fast operation and fast result delivery
  A typical OLAP query: “For each product, find its market share in its category today minus its market share in its category in 1994”
  [Figure: result of the OLAP query]

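Data Exploration: OLAP: Multidimensional hypercube
  [Figure: a data cube with the dimensions month (Jan, Feb, March, …, Dec), town (Buckingham, Milton Keynes, Northampton) and year (1998, 1999, 2000); the cell for March, Milton Keynes, 1999 holds Total Customer = 5 and the customer names]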

Data Exploration: OLAP: Hierarchies
  [Figure: the month dimension of the cube (towns: Buckingham, Milton Keynes, Northampton; years: 1998, 1999, 2000) rolled up into seasons using the hierarchy below]
  January, February, March → winter
  April, May, June → spring
  July, August, September → summer
  October, November, December → autumn

Data Exploration: OLAP: Operations
  Pivoting
    Selecting attributes to define the cube
    Visually rotating the cube to show a face
  Slicing and dicing
    Selecting a part of a cube
    Visually slicing a segment of a cube along a dimension
  Rolling-up
    Moving up along a hierarchy
  Drilling-down
    Moving down along a hierarchy
  Performing aggregate functions while rolling-up or drilling-down
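Slicing and rolling-up can be sketched on a tiny cube held in a Python dictionary. The cell values are made up (apart from the March, Milton Keynes, 1999 cell with 5 customers from the hypercube slide), the function names are illustrative, and the month-to-season mapping follows the hierarchy slide. Real OLAP tools precompute and index such aggregates; this only shows the logic.

```python
from collections import defaultdict

# Cube cells keyed by (month, town, year) -> number of customers.
cube = {
    ("Jan", "Buckingham", 1999): 2,
    ("Feb", "Milton Keynes", 1999): 1,
    ("Mar", "Milton Keynes", 1999): 5,
    ("Apr", "Milton Keynes", 1999): 3,
    ("Mar", "Northampton", 2000): 4,
}

# Month -> season hierarchy, as given on the hierarchy slide.
SEASON = {
    "Jan": "winter", "Feb": "winter", "Mar": "winter",
    "Apr": "spring", "May": "spring", "Jun": "spring",
    "Jul": "summer", "Aug": "summer", "Sep": "summer",
    "Oct": "autumn", "Nov": "autumn", "Dec": "autumn",
}

def slice_by_year(cube, year):
    """Slicing: fix one dimension and keep the remaining two."""
    return {(m, t): v for (m, t, y), v in cube.items() if y == year}

def roll_up_to_season(cube):
    """Rolling-up: move from month to season, summing the merged cells."""
    rolled = defaultdict(int)
    for (month, town, year), count in cube.items():
        rolled[(SEASON[month], town, year)] += count
    return dict(rolled)

print(slice_by_year(cube, 1999))
print(roll_up_to_season(cube))
```

Drilling-down is the reverse navigation: it needs the month-level cells to be retained, because the season totals alone cannot be split back into months.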

Data Exploration in Weka Explorer: ARFF file format
  Schema section
    Data set name
    Numeric attribute names and types
    Categorical attribute names and values
  Data section
    One data record per line
    Values separated by “,”
    “?” represents unknown
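The layout described above can be made concrete with a small ARFF text and a minimal reader. The data set name, attributes and records are made up, and `parse_arff` is a simplified sketch rather than Weka's own loader: it ignores quoted names, sparse records and date or string attribute handling.

```python
# A small ARFF document: schema section first, then the data section.
arff_text = """\
% Lines starting with % are comments.
@relation customers

@attribute age numeric
@attribute height numeric
@attribute gender {male,female}

@data
20,1.82,male
35,?,female
"""

def parse_arff(text):
    """Return (attributes, rows); '?' values become None. Simplified sketch."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            rows.append([None if v == "?" else v for v in values])
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            attributes.append((name, spec))
        elif lower.startswith("@data"):
            in_data = True
    return attributes, rows

attributes, rows = parse_arff(arff_text)
print(attributes)
print(rows)
```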

Data Exploration in Weka Explorer: Glance of an opened data set
  Summary statistics
  Visualisation of value distribution

Data Exploration in Weka Explorer Visualisation in Weka (limited)

Data Exploration in Weka Explorer: Filters for pre-processing
  Many filters
    Supervised/unsupervised
    Attribute/instance
  Choose a filter, then set its parameters in the command line

Chapter Summary
  The domain types determine the validity of the operations applied.
  Transformation from one domain to another must preserve the domain characteristics.
  Data sets can be of various forms and from different sources.
  A data warehouse serves as a data source for data mining.
  Data quality is relevant to the intended application purpose.
  Data pre-processing operations are essential for good mining.
  Knowing the data is important for good data mining.
  Understanding of data is achieved via exploring, summarising and visualising data.
  OLAP serves as a data exploration and summarisation tool.

References
  Read Chapter 3 of Data Mining Techniques and Application
  Useful further references
    Tan, P-N., Steinbach, M. and Kumar, V. (2006), Introduction to Data Mining, Addison-Wesley, Chapters 2 and 3