CIS4930 Introduction to Data Mining: Getting to Know Your Data. Peixiang Zhao. Tallahassee, Florida, 2016.

Data
A data set is a collection of data objects and their attributes.
A data object represents an entity.
– Examples: in a sales database: customers, store items, sales; in a medical database: patients, treatments; in a university database: students, professors, courses
– Also called records, examples, instances, points, objects, or tuples
Data objects are described by attributes.
– Properties or characteristics of data objects
– Also called variables, fields, characteristics, or features

Example (figure): a data table in which each row is an object and each column is an attribute.

Data Types
Text – each textual document is a collection of words
Transactional data – each transaction involves a set of items
Graph data – vertices and edges
Sequential data – an ordered sequence, e.g., a DNA sequence over A, T, C, G
Spatial-temporal data – time and location are implicit attributes
Multimedia data – audio, video, …

Types of Attributes
Nominal: categories, states, or "names of things"
– Special case: binary (exactly two states)
– Examples: eye color, race, gender, zip codes
Ordinal: values have a meaningful order, but the magnitude between successive values is unknown
– Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}
Interval: measured on a scale of equal-sized units, but with no true zero point
– Examples: calendar dates, temperatures in Celsius or Fahrenheit
Ratio: an inherent zero point, so both differences and ratios of values are meaningful
– Examples: temperature in Kelvin (10 K is twice as high as 5 K), length, time, counts

Types of Attributes

Attribute Type | Description | Meaningful Operations | Examples
Nominal / Binary | The values are just different names that provide only enough information to distinguish one object from another | =, ≠ | zip codes, employee ID numbers, eye color, gender
Ordinal | The values provide enough information to order objects | <, > | pain level, ratings, grades, street numbers
Interval | The differences between values are meaningful, i.e., a unit of measurement exists | +, − | calendar dates, temperature in Celsius or Fahrenheit
Ratio | Both differences and ratios are meaningful | *, / | temperature in Kelvin, monetary quantities, counts, age, mass, length

Discrete and Continuous Attributes
Discrete attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables
– Note: binary attributes are a special case of discrete attributes
Continuous attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight
– In practice, real values can only be measured and represented with a finite number of digits
– Continuous attributes are typically represented as floating-point variables

Basic Statistical Descriptions
Motivation
– To better understand the data: central tendency, variation, and spread
Data dispersion characteristics
– Median, max, min, quantiles, outliers, variance, etc.
– Numerical dimensions correspond to sorted intervals: boxplot or quantile analysis on the sorted intervals
– Dispersion analysis on computed measures: folding measures into numerical dimensions

Measuring the Central Tendency
Mean: x̄ = (x1 + x2 + … + xn) / n, where n is the sample size
– Weighted arithmetic mean: x̄ = Σ wi xi / Σ wi
Median (2nd quartile): arrange all data points from lowest to highest and pick the middle one
– The middle value if there is an odd number of values, or the average of the middle two values otherwise
Mode: the value that occurs most frequently in the data
– Not necessarily unique
(Figure: positions of mean, median, and mode for symmetric, positively skewed, and negatively skewed distributions.)

Measuring the Central Tendency
Comparison of common central statistics for the values {1, 2, 2, 3, 4, 7, 9}

Type | Description | Example | Result
Arithmetic mean | Sum of the values of a data set divided by the number of values | (1 + 2 + 2 + 3 + 4 + 7 + 9) / 7 | 4
Median | Middle value separating the greater and lesser halves of a data set | 1, 2, 2, 3, 4, 7, 9 | 3
Mode | Most frequent value in a data set | 1, 2, 2, 3, 4, 7, 9 | 2
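A minimal sketch, in plain Python (no external libraries assumed), that reproduces the numbers in the table above for the sample {1, 2, 2, 3, 4, 7, 9}:

from collections import Counter

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    # middle value for odd n, average of the two middle values for even n
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

def mode(xs):
    # most frequent value; not necessarily unique, Counter returns one of them
    return Counter(xs).most_common(1)[0][0]

data = [1, 2, 2, 3, 4, 7, 9]
print(mean(data), median(data), mode(data))   # 4.0 3 2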

Measuring the Dispersion of Data
Quartiles and outliers
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  Q1: the middle number between the smallest value and the median of the data set
  Q3: the middle number between the median and the highest value of the data set
– Inter-quartile range: IQR = Q3 − Q1
– Five-number summary: min, Q1, median, Q3, max
– Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
Variance and standard deviation (sample: s, population: σ)
– Sample variance: s² = (1 / (n − 1)) Σ (xi − x̄)²; population variance: σ² = (1 / n) Σ (xi − μ)²
– The standard deviation s (or σ) is the square root of the variance s² (or σ²)
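A small sketch of these dispersion measures, assuming NumPy is available; note that np.percentile's default linear interpolation can differ slightly from the "middle number" rule on small samples:

import numpy as np

data = np.array([1, 2, 2, 3, 4, 7, 9])

q1, med, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
five_num = (data.min(), q1, med, q3, data.max())

# values more than 1.5 * IQR beyond the quartiles are flagged as outliers
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

sample_var = data.var(ddof=1)   # s^2, divides by n - 1
pop_var = data.var(ddof=0)      # sigma^2, divides by n
sample_std = np.sqrt(sample_var)

print(five_num, iqr, outliers, sample_var, sample_std)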

Measuring the Dispersion of Data
(Figure: a boxplot alongside the normal distribution N(0, σ²), illustrating how much of the data falls within one, two, and three standard deviations of the mean.)

Boxplot
The data is represented with a box
– The ends of the box are at the first and third quartiles, so the height of the box is the IQR
The median is marked by a line within the box
Whiskers: two lines outside the box, extended toward the minimum and maximum
– Maximum whisker length = 1.5 × IQR
Outliers: points beyond a specified outlier threshold, plotted individually
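A minimal sketch of drawing such a boxplot with Matplotlib (assumed available); whis=1.5 caps the whisker length at 1.5 × IQR, and points beyond that are plotted individually as outliers:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=200)

# whis=1.5 limits whiskers to 1.5 * IQR; points beyond are drawn as fliers (outliers)
plt.boxplot(data, whis=1.5)
plt.title("Boxplot of a sample")
plt.show()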

Histogram
A graphical display of tabulated frequencies, shown as bars
– Shows what proportion of cases fall into each of several categories
– The categories are usually specified as non-overlapping intervals of some variable; the categories (bars) must be adjacent

Histograms Often Tell More than Boxplots
Two histograms may have the same boxplot representation
– The same values for min, Q1, median, Q3, and max
But they can have rather different data distributions

Quantile-Quantile (Q-Q) Plot
Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
– View: is there a shift in going from one distribution to the other?
Example: the unit price of items sold at Branch 1 vs. Branch 2 for each quantile; unit prices of items sold at Branch 1 tend to be lower than those at Branch 2
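A sketch of a Q-Q plot built directly from sample percentiles, assuming NumPy and Matplotlib; the two "branch" price samples here are fabricated for illustration, not the data from the slide:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# hypothetical unit prices at two branches (illustrative only)
branch1 = rng.normal(loc=55, scale=10, size=300)
branch2 = rng.normal(loc=60, scale=10, size=300)

# compare the two samples quantile by quantile
qs = np.linspace(1, 99, 99)
q1 = np.percentile(branch1, qs)
q2 = np.percentile(branch2, qs)

plt.scatter(q1, q2, s=10)
lo, hi = min(q1.min(), q2.min()), max(q1.max(), q2.max())
plt.plot([lo, hi], [lo, hi], linestyle="--")   # reference line y = x
plt.xlabel("Branch 1 quantiles")
plt.ylabel("Branch 2 quantiles")
plt.show()

Points consistently above the y = x line indicate that Branch 2 prices tend to be higher than Branch 1 prices at every quantile.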

Scatter Plot
Provides a first look at bivariate data, revealing clusters of points, outliers, etc.
– Each pair of values is treated as a pair of coordinates and plotted as a point in the plane

Scatterplot Matrix
A matrix of pairwise scatterplots of k-dimensional data
– A total of (k² − k) / 2 distinct scatterplots, one per unordered pair of attributes
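A sketch using pandas.plotting.scatter_matrix (pandas and Matplotlib assumed available); with k = 4 attributes there are (4² − 4) / 2 = 6 distinct attribute pairs, each drawn twice in mirrored panels:

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])

# one panel per ordered attribute pair; histograms on the diagonal
scatter_matrix(df, diagonal="hist", figsize=(6, 6))
plt.show()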

Similarity and Dissimilarity
Similarity
– A numerical measure of how alike two data objects are
– Higher when objects are more alike
– Often falls in the range [0, 1]
Dissimilarity (e.g., distance)
– A numerical measure of how different two data objects are
– Lower when objects are more alike
– The minimum dissimilarity is often 0; the upper limit varies
Proximity refers to either a similarity or a dissimilarity

Proximity Measure for Nominal Attributes
Method 1: simple matching
– For objects i and j: d(i, j) = (p − m) / p, where m is the number of attributes on which i and j match and p is the total number of attributes
Method 2: use a large number of binary attributes
– Create a new binary attribute for each of the M nominal states
– Example: for a color attribute with values red, yellow, blue, green, etc., create a series of new binary attributes red?, yellow?, blue?, green?, …
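A small sketch of both methods in plain Python; the example objects and the color states are illustrative, not taken from the slide:

def simple_matching_distance(x, y):
    # method 1: d(i, j) = (p - m) / p, m = matching attributes, p = total attributes
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p

def one_hot(value, states):
    # method 2: one binary attribute per nominal state
    return [1 if value == s else 0 for s in states]

i = ["red", "M", "single"]
j = ["yellow", "M", "married"]
print(simple_matching_distance(i, j))                      # 2/3 ~ 0.667
print(one_hot("red", ["red", "yellow", "blue", "green"]))  # [1, 0, 0, 0]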

Proximity Measure for Binary Attributes
A contingency table for binary data: for objects i and j, let q = the number of attributes where both are 1, r = the number where i is 1 and j is 0, s = the number where i is 0 and j is 1, and t = the number where both are 0
Distance measure for symmetric binary variables: d(i, j) = (r + s) / (q + r + s + t)
Distance measure for asymmetric binary variables (negative matches t are ignored): d(i, j) = (r + s) / (q + r + s)
Jaccard coefficient (a similarity measure for asymmetric binary variables): sim(i, j) = q / (q + r + s)
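A plain-Python sketch that computes the contingency counts q, r, s, t for two binary vectors and derives the three measures above; the vectors x and y are illustrative:

def binary_counts(x, y):
    # q: both 1, r: x=1 y=0, s: x=0 y=1, t: both 0
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t

def symmetric_distance(x, y):
    q, r, s, t = binary_counts(x, y)
    return (r + s) / (q + r + s + t)

def asymmetric_distance(x, y):
    q, r, s, _ = binary_counts(x, y)
    return (r + s) / (q + r + s)

def jaccard(x, y):
    q, r, s, _ = binary_counts(x, y)
    return q / (q + r + s)

x = [1, 0, 1, 1, 0, 0]
y = [1, 1, 0, 1, 0, 0]
print(symmetric_distance(x, y), asymmetric_distance(x, y), jaccard(x, y))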

Example
Compute the distance between different individuals based on asymmetric binary attributes
– Gender is a symmetric attribute; the remaining attributes are asymmetric binary
– Let the values Y and P be 1, and the value N be 0

Distance on Numeric Data
Minkowski distance: d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + … + |xip − xjp|^h)^(1/h)
– where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
Properties
– Positive definiteness: d(i, j) > 0 if i ≠ j, and d(i, i) = 0
– Symmetry: d(i, j) = d(j, i)
– Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j)
A distance that satisfies these properties is a metric

Special Cases of Minkowski Distance
h = 1: Manhattan distance (city block, L1 norm): d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xip − xjp|
– E.g., the Hamming distance: the number of bits that differ between two binary vectors
h = 2: Euclidean distance (L2 norm): d(i, j) = (|xi1 − xj1|² + |xi2 − xj2|² + … + |xip − xjp|²)^(1/2)
h → ∞: "supremum" distance (L∞ norm): d(i, j) = max over k of |xik − xjk|
– This is the maximum difference between any component (attribute) of the vectors

Example: distance matrices for the same set of data points under the Manhattan (L1), Euclidean (L2), and supremum distances.
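A small NumPy sketch of the three special cases; the two 2-dimensional points below are illustrative, not the points from the slide's example:

import numpy as np

def minkowski(x, y, h):
    # L-h norm distance; h=1 gives Manhattan, h=2 gives Euclidean
    return np.sum(np.abs(x - y) ** h) ** (1.0 / h)

def supremum(x, y):
    # limit h -> infinity: maximum coordinate-wise difference
    return np.max(np.abs(x - y))

# two illustrative 2-dimensional points
i = np.array([1.0, 2.0])
j = np.array([4.0, 6.0])

print(minkowski(i, j, 1))   # Manhattan: 7.0
print(minkowski(i, j, 2))   # Euclidean: 5.0
print(supremum(i, j))       # Supremum: 4.0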

Distance on Ordinal Variables
An ordinal variable can be discrete or continuous
– Order is important, e.g., rank
Can be treated like an interval-scaled variable
– Replace each value xif by its rank rif ∈ {1, …, Mf}
– Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by zif = (rif − 1) / (Mf − 1)
– Compute the dissimilarity using methods for interval-scaled variables
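A small sketch of the rank-normalization step in plain Python; the attribute values and their ordering are assumptions made for illustration:

def normalize_ordinal(values, ordered_states):
    # rank r_if in {1, ..., M_f}, then z_if = (r_if - 1) / (M_f - 1)
    rank = {state: idx + 1 for idx, state in enumerate(ordered_states)}
    m = len(ordered_states)
    return [(rank[v] - 1) / (m - 1) for v in values]

grades = ["fair", "good", "excellent", "fair"]
z = normalize_ordinal(grades, ["fair", "good", "excellent"])
print(z)   # [0.0, 0.5, 1.0, 0.0]

The normalized values z can then be fed into any of the numeric distances above, e.g., the Euclidean distance.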

Cosine Similarity
A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document
– Applications: information retrieval, biological taxonomy, gene feature mapping
If d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)
where · indicates the vector dot product and ||d|| is the length (Euclidean norm) of vector d

Example
Find the similarity between documents 1 and 2:
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 ≈ 6.48
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = 17^0.5 ≈ 4.12
So cos(d1, d2) = 25 / (6.48 * 4.12) ≈ 0.94
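A plain-Python sketch that reproduces this worked example using only the standard library:

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
print(round(cosine_similarity(d1, d2), 2))   # 0.94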

Cosine Similarity
This measure captures orientation rather than magnitude: it compares documents in a normalized space, because it considers not the magnitude of each document's word counts but the angle between the document vectors.