Data Mining: Concepts and Techniques — Chapter 2 —

Slides:



Advertisements
Similar presentations
Descriptive Measures MARE 250 Dr. Jason Turner.
Advertisements

Descriptive Statistics
ISE 261 PROBABILISTIC SYSTEMS. Chapter One Descriptive Statistics.
Descriptive Statistics
Distance Measures Tan et al. From Chapter 2.
B a c kn e x t h o m e Classification of Variables Discrete Numerical Variable A variable that produces a response that comes from a counting process.
Descriptive Statistics  Summarizing, Simplifying  Useful for comprehending data, and thus making meaningful interpretations, particularly in medium to.
Data Mining: Concepts and Techniques — Chapter 2 —
1 Data Mining: Concepts and Techniques — Chapter 2 —
Chapter 2 Describing Data with Numerical Measurements
© 2005 The McGraw-Hill Companies, Inc., All Rights Reserved. Chapter 12 Describing Data.
September 4, 2015Data Mining: Concepts and Techniques1 Data Mining: Concepts and Techniques — Chapter 2 —
Chapter 1 Descriptive Analysis. Statistics – Making sense out of data. Gives verifiable evidence to support the answer to a question. 4 Major Parts 1.Collecting.
Descriptive Statistics  Summarizing, Simplifying  Useful for comprehending data, and thus making meaningful interpretations, particularly in medium to.
Descriptive Statistics Used to describe the basic features of the data in any quantitative study. Both graphical displays and descriptive summary statistics.
Chapter 2 Describing Data with Numerical Measurements General Objectives: Graphs are extremely useful for the visual description of a data set. However,
COMP5331: Knowledge Discovery and Data Minig
Exploratory Data Analysis. Computing Science, University of Aberdeen2 Introduction Applying data mining (InfoVis as well) techniques requires gaining.
1 DATA DESCRIPTION. 2 Units l Unit: entity we are studying, subject if human being l Each unit/subject has certain parameters, e.g., a student (subject)
McGraw-Hill/IrwinCopyright © 2009 by The McGraw-Hill Companies, Inc. All Rights Reserved. Chapter 3 Descriptive Statistics: Numerical Methods.
Data Mining & Knowledge Discovery Lecture: 2 Dr. Mohammad Abu Yousuf IIT, JU.
Chapter 3 Descriptive Statistics: Numerical Methods Copyright © 2014 by The McGraw-Hill Companies, Inc. All rights reserved.McGraw-Hill/Irwin.
1 Laugh, and the world laughs with you. Weep and you weep alone.~Shakespeare~
October 15, 2015Data Mining: Concepts and Techniques1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining Basics: Data Remark: Discusses “basics concerning data sets (first half of Chapter.
Chapter 2 Describing Data.
Biostatistics Class 1 1/25/2000 Introduction Descriptive Statistics.
McGraw-Hill/Irwin Copyright © 2010 by The McGraw-Hill Companies, Inc. All rights reserved. Chapter 3 Descriptive Statistics: Numerical Methods.
Data Preprocessing Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2010.
Big Data Analysis and Mining Lecture 2: Getting to Know Your Data Weixiong Rao 饶卫雄 Tongji University 同济大学软件学院 2015 Fall
1 1 MSCIT-5210: Knowledge Discovery and Data Minig Instructor: Lei Chen Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by.
Sampling Design and Analysis MTH 494 Ossam Chohan Assistant Professor CIIT Abbottabad.
Categorical vs. Quantitative…
9/28/2012HCI571 Isabelle Bichindaritz1 Working with Data Data Summarization.
Chapter 2: Getting to Know Your Data
1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.
Business Statistics, 4e, by Ken Black. © 2003 John Wiley & Sons. 3-1 Business Statistics, 4e by Ken Black Chapter 3 Descriptive Statistics.
LIS 570 Summarising and presenting data - Univariate analysis.
Tallahassee, Florida, 2016 CIS4930 Introduction to Data Mining Getting to Know Your Data Peixiang Zhao.
Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.
1 Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser.
Chapter 6: Descriptive Statistics. Learning Objectives Describe statistical measures used in descriptive statistics Compute measures of central tendency.
(Unit 6) Formulas and Definitions:. Association. A connection between data values.
1 By maintaining a good heart at every moment, every day is a good day. If we always have good thoughts, then any time, any thing or any location is auspicious.
Chapter 2 Getting to Know Your Data Yubao (Robert) Wu Georgia State University.
Exploratory Data Analysis
Chapter 2: Getting to Know Your Data
Data Mining: Concepts and Techniques
Data Mining: Concepts and Techniques Data Understanding
Data Mining: Concepts and Techniques — Chapter 2 —
Chapter 2: Getting to Know Your Data
Lecture Notes for Chapter 2 Introduction to Data Mining
Lecture 2-2 Data Exploration: Understanding Data
Data Mining: Concepts and Techniques
COP 6726: New Directions in Database Systems
Lecture Notes for Chapter 2 Introduction to Data Mining
Description of Data (Summary and Variability measures)
Data Mining: Concepts and Techniques — Chapter 2 —
Understanding Data Characteristics
CS6220: Data Mining Techniques
Understanding Basic Characteristics of Data
Data Mining: Concepts and Techniques Data Understanding
Basic Statistical Terms
Data Mining: Concepts and Techniques — Chapter 2 —
Welcome!.
Group 9 – Data Mining: Data
Data Pre-processing Lecture Notes for Chapter 2
Data Mining: Concepts and Techniques — Chapter 2 —
Presentation transcript:

Data Mining: Concepts and Techniques — Chapter 2 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign Simon Fraser University ©2013 Han, Kamber, and Pei. All rights reserved.

Chapter 2: Getting to Know Your Data Data Objects and Attribute Types Basic Statistical Descriptions of Data Data Visualization Measuring Data Similarity and Dissimilarity Summary

Types of Data Sets Record Relational records Data matrix, e.g., numerical matrix, crosstabs Document data: text documents: term-frequency vector Transaction data Graph and network World Wide Web Social or information networks Molecular Structures Ordered Video data: sequence of images Temporal data: time-series Sequential Data: transaction sequences Genetic sequence data Spatial, image and multimedia: Spatial data: maps Image data: Video data:

Data Objects Data sets are made up of data objects. A data object represents an entity. Examples: sales database: customers, store items, sales medical database: patients, treatments university database: students, professors, courses Also called samples , examples, instances, data points, objects, tuples. Data objects are described by attributes. Database rows -> data objects; columns ->attributes.

Attributes Attribute (or dimensions, features, variables): a data field, representing a characteristic or feature of a data object. E.g., customer _ID, name, address Types: Nominal Binary Ordinal Numeric Interval-scaled Ratio-scaled

Attribute Types Nominal: categories, states, or “names of things”, order not meaningful Hair_color = {auburn, black, blond, brown, grey, red, white} Binary Nominal attribute with only 2 states (0 and 1) Symmetric binary: both outcomes equally important e.g., gender Asymmetric binary: outcomes not equally important. e.g., medical test (positive vs. negative) Convention: assign 1 to most important outcome (e.g., HIV positive) Ordinal Values have a meaningful order (ranking) but magnitude between successive values is not known. Size = {small, medium, large}, grades, army rankings

Numeric Attribute Types Quantity (integer or real-valued) Interval Measured on a scale of equal-sized units Values have order E.g., temperature in C˚or F˚, calendar dates No true zero-point Ratio Inherent zero-point We can speak of values as being an order of magnitude larger than the unit of measurement (10 K˚ is twice as high as 5 K˚). e.g., temperature in Kelvin, length, counts, monetary quantities

Discrete vs. Continuous Attributes Discrete Attribute Has only a finite or countably infinite set of values E.g., zip codes, profession, or the set of words in a collection of documents Continuous Attribute Has real numbers as attribute values E.g., temperature, height, or weight Continuous attributes are typically represented as floating-point variables

Chapter 2: Getting to Know Your Data Data Objects and Attribute Types Basic Statistical Descriptions of Data Data Visualization Measuring Data Similarity and Dissimilarity Summary

Basic Statistical Descriptions of Data Measures of central tendency Mean, median, mode Dispersion of data range, quartiles and interquartile range, five-number summary and boxplots, variance and standard deviation

Measuring the Central Tendency Mean: Note: n is sample size and N is population size. Weighted arithmetic mean: Trimmed mean: chopping extreme values Median: Middle value if odd number of values, or average of the middle two values otherwise Estimated by interpolation (for grouped data): Mode Value that occurs most frequently in the data Unimodal, bimodal, trimodal Empirical formula: Median interval

Symmetric vs. Skewed Data Median, mean and mode of symmetric, positively and negatively skewed data symmetric positively skewed negatively skewed April 17, 2017 Data Mining: Concepts and Techniques

Measuring the Dispersion of Data Quartiles: Q1 (25th percentile), Q3 (75th percentile) Inter-quartile range: IQR = Q3 – Q1 Five number summary: min, Q1, median, Q3, max Boxplot: A simple graphical device to display the overall shape of a distribution, including the outliers. Ends of the box are the quartiles; median is marked within the box; add whiskers to mark min and max, and plot outliers individually Outlier: Values less than Q1-1.5*IQR and greater than Q3+1.5*IQR are outliers

Boxplot Analysis Draw a box plot for the following dataset. 10.2, 14.1, 14.4. 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4 Here, Q2(median) = 14.6 Q1 = 14.4 Q3 = 14.9 IQR = Q3 – Q1 = 14.9-14.4 = 0.5 Outliers will be any points below Q1 – 1.5×IQR = 14.4 – 0.75 = 13.65 or above Q3 + 1.5×IQR = 14.9 + 0.75 = 15.65. So, the outliers are at 10.2, 15.9, and 16.4. The ends of the box are at 14.4 and 14.9. The median 14.6 is marked within the box. The whiskers extend to 14.1 and 15.1. The outliers 10.2, 15.9, and 16.4 are plotted individually.

Measuring the Dispersion of Data Variance and standard deviation Variance: Standard deviation σ is the square root of variance σ2

Example of Standard Deviation Find out the Mean, the Variance, and the Standard Deviation of the following dataset. 9, 2, 5, 4, 12, 7, 8, 11, 9, 3, 7, 4, 12, 5, 4, 10, 9, 6, 9, 4 Here, the mean = 7 Using the formula for variance, σ2 = 8.9 Therefore, the standard deviation is σ = 2.98

Graphic Displays of Basic Statistical Descriptions Boxplot: graphic display of five-number summary Histogram: x-axis are values, y-axis represents frequencies Quantile plot: each value xi is paired with fi indicating that approximately fi * 100% of data are below the value xi Quantile-quantile (q-q) plot: graphs the quantiles of one univariant distribution against the corresponding quantiles of another Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane

Histogram Example

Quantile Plot Used to check whether your data is normal To make a quantile plot: If the data distribution is close to normal, the plotted points will lie close to a slopped straight line

Quantile Plot Example Create a quantile plot for the following dataset. 3, 5, 1, 4, 10 First sort the data: 1, 3, 4, 5, 10 Calculate the sample quantiles: 0.1, 0.3, 0.5, 0.7, 0.9 Plot the graph

Scatter Plot

Positively and Negatively Correlated Data

Uncorrelated Data

Chapter 2: Getting to Know Your Data Data Objects and Attribute Types Basic Statistical Descriptions of Data Data Visualization Measuring Data Similarity and Dissimilarity Summary

Data Visualization Why data visualization? Gain insight into the data Search for patterns, trends, structure, irregularities, relationships among data Categorization of visualization methods: Pixel-oriented visualization techniques Geometric projection visualization techniques Icon-based visualization techniques Hierarchical visualization techniques Visualizing complex data and relations

Pixel-Oriented Visualization Techniques For a data set of m dimensions, create m windows on the screen, one for each dimension The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows The colors of the pixels reflect the corresponding values Income (b) Credit Limit (c) transaction volume (d) age 26 26

Geometric Projection Visualization Techniques Visualization of geometric transformations and projections of the data Helps users find interesting projections of multidimensional data Methods Scatterplot and scatterplot matrices

Icon-Based Visualization Techniques Visualization of the data values as features of icons Typical visualization methods Chernoff Faces Stick Figures

Chernoff Faces A way to display multidimensional data of up to 18 dimensions as a cartoon human face Components of the faces such as eyes, ears, mouth, and nose represent values of the dimensions by their shape, size, placement and orientation

Data Mining: Concepts and Techniques Stick Figure A 5-piece stick figure (1 body and 4 limbs) Two attributes mapped to the two axes, remaining attributes mapped to angle or length of limbs A census data figure showing age, income, gender, education, etc. used by permission of G. Grinstein, University of Massachusettes at Lowell Data Mining: Concepts and Techniques

Hierarchical Visualization Techniques Visualization of the data using a hierarchical partitioning into subspaces instead of visualizing all dimensions at the same time Methods Dimensional Stacking Worlds-within-Worlds Tree-Map Cone Trees InfoCube

Dimensional Stacking Used by permission of M. Ward, Worcester Polytechnic Institute Visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes

Visualizing Complex Data and Relations Visualizing non-numerical data: text and social networks Tag cloud: visualizing user-generated tags The importance of tag is represented by font size/color Newsmap: Google News Stories in 2005

Chapter 2: Getting to Know Your Data Data Objects and Attribute Types Basic Statistical Descriptions of Data Data Visualization Measuring Data Similarity and Dissimilarity Summary

Similarity and Dissimilarity Numerical measure of how alike two data objects are Value is higher when objects are more alike Often falls in the range [0,1] Dissimilarity Numerical measure of how different two data objects are Lower when objects are more alike Two data structures commonly used to measure the above Data matrix Dissimilarity matrix

Data Matrix and Dissimilarity Matrix Stores n data points with p dimensions Two modes – stores both objects and attributes Dissimilarity matrix Stores n data points, but registers only the dissimilarity between objects i and j A triangular matrix Single mode as it only stores dissimilarity values

Proximity Measure for Nominal Attributes Can take 2 or more states, e.g., map_color may have attributes red, yellow, blue and green Dissimilarity can be computed based on the ratio of mismatches m: # of matches, p: total # of attributes

Dissimilarity between Nominal Attributes Here, we have one nominal attribute, test-1, so p=1. The dissimilarity matrix is as shown below:

Proximity Measure for Binary Attributes Object j A contingency table for binary data Dissimilarity for symmetric binary attributes: Dissimilarity for asymmetric binary attributes : Jaccard coefficient (similarity measure for asymmetric binary attributes): Object i

Dissimilarity between Binary Variables Example Gender is a symmetric attribute, others are asymmetric binary Let the values Y and P be 1, and the value N 0 Suppose the distance between patients is computed based only on the asymmetric attributes

Distance on Numeric Data: Minkowski Distance Minkowski distance: A popular distance measure where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p- dimensional data objects, and h is the order (the distance so defined is also called L-h norm) Properties d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness) d(i, j) = d(j, i) (Symmetry) d(i, j)  d(i, k) + d(k, j) (Triangle Inequality) A distance that satisfies these properties is a metric

Special Cases of Minkowski Distance h = 1: Manhattan distance h = 2: Euclidean distance h  . supremum distance This is the maximum difference between any component (attribute) of the vectors

Example: Minkowski Distance Dissimilarity Matrices Manhattan (L1) Euclidean (L2) Supremum

Proximity Measures for Ordinal Variables Let M represent the number of possible values that an ordinal attribute can have replace each xif by its corresponding rank map the range of each variable onto [0, 1] by replacing i-th object in the f-th variable by compute the dissimilarity using any of the distance measures for numeric attributes, e.g., Euclidean distance

Proximity Measures for Ordinal Variables Consider the data in the adjacent table: Here, the attribute Test has three states: fair, good and excellent, so Mf=3 For step 1, the four attribute values are assigned the ranks 3,1,2 and 3 respectively. Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5 and rank 3 to 1.0 For step 3, using Euclidean distance, a dissimilarity matrix is obtained as shown Therefore, students 1 and 2 are most dissimilar, as are students 2 and 4 Student Test 1 Excellent 2 Fair 3 Good 4

Attributes of Mixed Type A database may contain all attribute types : Nominal, symmetric binary, asymmetric binary, numeric, ordinal One may use a weighted formula to combine their effects For nominal and ordinal attributes, use the technique mentioned earlier to compute dissimilarity matrix For numeric attributes use the following formula to calculate dissimilarity

Attributes of Mixed Type - Example Consider the data in the table: The dissimilarity matrices for the nominal and ordinal data are shown to the right computed using the methods discussed before To compute dissimilarity matrix for the numeric attribute, maxhxh=64, minhxh=22. Using the formula from previous slide, the dissimilarity matrix is obtained as shown below:

Attributes of Mixed Type - Example The three dissimilarity matrices can now be used to compute the overall dissimilarity between two objects using the equation The resulting dissimilarity matrix is

Cosine Similarity Cosine similarity is a measure of similarity that can be used to compare documents. A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document called term-frequency vector. Cosine measure: If d1 and d2 are two term-frequency vectors, then cos(x, y) = (x  y) /||x|| ||y|| , where  indicates dot product, ||x|| is the length of vector x defined as . Similarly, ||y|| is the length of vector y. A cosine value of 0 means that the two vectors are at 90 degrees to each other and there is no match. The closer the cosine value to 1, the greater the match between the vectors

Example: Cosine Similarity Ex: Find the similarity between documents 1 and 2 from previous slide. d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0) d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1) Using the formula, cos(d1, d2) = (d1  d2) /||d1|| ||d2|| , d1d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25 ||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5 = 6.481 ||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)0.5=(17)0.5 = 4.12 cos(d1, d2 ) = 0.94 The cosine similarity shows that the two documents are quite similar.

Chapter 2: Getting to Know Your Data Data Objects and Attribute Types Basic Statistical Descriptions of Data Data Visualization Measuring Data Similarity and Dissimilarity Summary

Summary Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled Many types of data sets, e.g., numerical, text, graph, image, etc. Gain insight into the data by: Basic statistical data description: central tendency, dispersion, graphical displays Data visualization: Measure data similarity Above steps are the beginning of data preprocessing

Exercise 1 of 2 Find the dissimilarity value between Alyssa and Chris and between Alyssa and Diane. Student Gender HairColor Test1 Test2 Alyssa F Black Excellent 80 Chris M Good 85 Jessica Brown Fair 55 Diane Blonde

Exercise 2 of 2 Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8), compute the: Euclidean distance Manhattan distance Supremum distance Cosine similarity