Tallahassee, Florida, 2016 CIS4930 Introduction to Data Mining Getting to Know Your Data Peixiang Zhao.


1 Tallahassee, Florida, 2016
CIS4930 Introduction to Data Mining
Getting to Know Your Data
Peixiang Zhao

2 Data
Collection of data objects and their attributes
A data object represents an entity
  – Examples:
      Sales database: customers, store items, sales
      Medical database: patients, treatments
      University database: students, professors, courses
  – Also called records, examples, instances, points, objects, tuples
Data objects are described by attributes
  – Properties or characteristics of data objects
  – Also called variables, fields, characteristics, features

3 Example
[Figure: example data table with labels "Attributes" and "Objects"]

4 Data Types
Text
  – Each textual document is a collection of words
Transactional data
  – Each transaction involves a set of items
Graph
  – Vertices and edges
Sequential data
  – An ordered sequence, e.g., a DNA sequence with A, T, C, G
Spatial-temporal data
  – Time and location are implicit attributes
Multimedia data
  – Audio, video, …

5 Types of Attributes
Nominal: categories, states or "names of things"
  – Special case: Binary
  – Examples: eye color, race, gender, zip codes
Ordinal: values have a meaningful order, but the magnitude between successive values is unknown
  – Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}
Interval: measured on a scale of equal-sized units, with no true zero point
  – Examples: calendar dates, temperatures in Celsius or Fahrenheit
Ratio: has an inherent zero point, so ratios of values are meaningful
  – Examples: temperature in Kelvin (10 K is twice as high as 5 K), length, time, counts

6 Types of Attributes
Attribute Type | Description | Examples
Nominal / Binary | The values are just different names that provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, gender
Ordinal | The values provide enough information to order objects. (<, >) | pain level, rating, grades, street numbers
Interval | The differences between values are meaningful, i.e., a unit of measurement exists. (+, -) | calendar dates, temperature in Celsius or Fahrenheit
Ratio | Both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length

7 Discrete and Continuous Attributes
Discrete Attribute
  – Has only a finite or countably infinite set of values
  – Examples: zip codes, counts, or the set of words in a collection of documents
  – Often represented as integer variables
  – Note: binary attributes are a special case of discrete attributes
Continuous Attribute
  – Has real numbers as attribute values
  – Examples: temperature, height, or weight
  – Practically, real values can only be measured and represented using a finite number of digits
  – Continuous attributes are typically represented as floating-point variables

8 Basic Statistical Description
Motivation
  – To better understand the data: central tendency, variation and spread
Data dispersion characteristics
  – Median, max, min, quantiles, outliers, variance, etc.
  – Numerical dimensions correspond to sorted intervals
      Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
  – Folding measures into numerical dimensions

9 Measuring the Central Tendency
Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$, where n is the sample size
  – Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Median (2nd quartile): arrange all data points from lowest to highest value and pick the middle one
  – Middle value if there is an odd number of values, otherwise the average of the middle two values
Mode: value that occurs most frequently in the data
  – Not necessarily unique
[Figure: symmetric, positively skewed, and negatively skewed distributions]
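The weighted arithmetic mean mentioned on this slide can be sketched in a few lines of Python. This is a minimal illustration that assumes the usual definition (sum of weight times value, divided by the sum of weights); the function name `weighted_mean` and the sample scores and weights are hypothetical and not part of the slides.

```python
from statistics import mean


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical data: three scores, the last one counted twice as heavily.
scores = [70, 80, 90]
weights = [1, 1, 2]
print(weighted_mean(scores, weights))  # 82.5

# With equal weights the weighted mean reduces to the ordinary mean.
print(weighted_mean(scores, [1, 1, 1]) == mean(scores))  # True
```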

10 Measuring the Central Tendency
Comparison of common central stats of values { 1, 2, 2, 3, 4, 7, 9 }
Type | Description | Example | Result
Arithmetic mean | Sum of values of a data set divided by number of values | (1+2+2+3+4+7+9) / 7 | 4
Median | Middle value separating the greater and lesser halves of a data set | 1, 2, 2, 3, 4, 7, 9 | 3
Mode | Most frequent value in a data set | 1, 2, 2, 3, 4, 7, 9 | 2
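The three results in the comparison above can be checked with Python's standard `statistics` module. This is only a quick verification sketch; the variable name `values` is an assumption for illustration.

```python
from statistics import mean, median, multimode

values = [1, 2, 2, 3, 4, 7, 9]

print("mean:", mean(values))        # 28 / 7, which equals 4
print("median:", median(values))    # middle of the 7 sorted values: 3
print("modes:", multimode(values))  # list of most frequent values: [2]
```

`multimode` returns a list because, as the previous slide notes, the mode is not necessarily unique.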

11 Measuring the Dispersion of Data
Quartiles, outliers
  – Quartiles: Q1 (25th percentile), Q3 (75th percentile)
      Q1: the middle number between the smallest value and the median of the data set
      Q3: the middle number between the median and the largest value of the data set
  – Inter-quartile range: IQR = Q3 - Q1
  – Five-number summary: min, Q1, median, Q3, max
  – Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
Variance and standard deviation (sample: s, population: σ)
  – Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$ (sample), $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$ (population)
  – Standard deviation s (or σ) is the square root of the variance s² (or σ²)
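A minimal Python sketch of these dispersion measures follows. It assumes one common quartile convention (Q1 and Q3 are the medians of the lower and upper halves, excluding the overall median when the count is odd); other conventions give slightly different quartiles. The helper names `five_number_summary` and `outliers` and the sample data are hypothetical.

```python
from statistics import median, pvariance, variance


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # Skip the middle element when n is odd so both halves have equal size.
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


def outliers(data):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]


values = [1, 2, 2, 3, 4, 7, 9]
print(five_number_summary(values))  # (1, 2, 3, 7, 9)
print(outliers(values))             # []
print(outliers(values + [30]))      # [30]

# Sample variance divides by n - 1, population variance by n.
print(round(variance(values), 3), round(pvariance(values), 3))
```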

12 Measuring the Dispersion of Data
[Figure: boxplot shown against a normal distribution N(0, σ²)]

13 Boxplot
Data is represented with a box
The ends of the box are at the first and third quartiles
  – The height of the box is the IQR
The median is marked by a line within the box
Whiskers: two lines outside the box, extended to the minimum and maximum
  – Max length = 1.5 * IQR
Outliers: points beyond a specified outlier threshold, plotted individually

14 Histogram
A graph display of tabulated frequencies, shown as bars
  – Shows what proportion of cases fall into each of several categories
  – The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
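The tabulation behind a histogram can be sketched as follows. This is a minimal illustration that assumes equal-width, adjacent, non-overlapping bins over a fixed range; the function name `histogram_counts`, the range, and the bin count are hypothetical choices, not taken from the slides.

```python
def histogram_counts(values, low, high, bins):
    """Count values into `bins` adjacent equal-width intervals over [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if not low <= v <= high:
            continue  # values outside the range are ignored in this sketch
        # A value equal to `high` is placed in the last bin.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts


values = [1, 2, 2, 3, 4, 7, 9]
counts = histogram_counts(values, low=0, high=10, bins=5)
print(counts)  # [1, 3, 1, 1, 1]

# Proportion of cases per bin, which is what the bar heights convey.
print([c / len(values) for c in counts])
```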

15 Histograms Often Tell More than Boxplots
Two histograms may have the same boxplot representation
  – The same values for: min, Q1, median, Q3, max
But they have rather different data distributions

16 Quantile-Quantile (Q-Q) Plot
Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
  – View: is there a shift in going from one distribution to another?
Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2
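The points of a Q-Q plot can be computed by pairing matching quantiles of the two samples. The sketch below uses `statistics.quantiles` from the Python standard library with its default interpolation method; the unit prices are made-up numbers for illustration, not the data behind the slide's figure, and `qq_pairs` is a hypothetical helper name.

```python
from statistics import quantiles


def qq_pairs(sample_a, sample_b, n=4):
    """Pair the n-1 cut points of sample_a with those of sample_b."""
    return list(zip(quantiles(sample_a, n=n), quantiles(sample_b, n=n)))


# Hypothetical unit prices at two branches.
branch_1 = [40, 45, 52, 60, 68, 75, 80, 95]
branch_2 = [48, 55, 60, 72, 78, 88, 96, 110]

for q1, q2 in qq_pairs(branch_1, branch_2):
    # Points with q2 > q1 lie above the diagonal y = x,
    # meaning Branch 1 prices are lower at that quantile.
    print(q1, q2, q2 > q1)
```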

17 Scatter Plot
Provides a first look at bivariate data to see clusters of points, outliers, etc.
  – Each pair of values is treated as a pair of coordinates and plotted as points in the plane

18 Scatterplot Matrix
Matrix of scatterplots of the k-dimensional data
  – A total of (k² - k)/2 = k(k - 1)/2 distinct scatterplots, one per unordered pair of dimensions
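The number of distinct panels in a scatterplot matrix is the number of unordered attribute pairs, k(k - 1)/2. A small Python sketch, with hypothetical attribute names, enumerates them:

```python
from itertools import combinations


def scatterplot_pairs(attributes):
    """Every unordered pair of attributes gets one distinct scatterplot."""
    return list(combinations(attributes, 2))


# Hypothetical 4-dimensional data set.
attributes = ["age", "income", "height", "weight"]
pairs = scatterplot_pairs(attributes)
k = len(attributes)

print(len(pairs))                       # 6
print(len(pairs) == k * (k - 1) // 2)   # True
```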

19 Similarity and Dissimilarity
Similarity
  – Numerical measure of how alike two data objects are
  – Value is higher when objects are more alike
  – Often falls in the range [0, 1]
Dissimilarity (e.g., distance)
  – Numerical measure of how different two data objects are
  – Lower when objects are more alike
  – Minimum dissimilarity is often 0
  – Upper limit varies
Proximity refers to a similarity or dissimilarity

20 Proximity Measure for Nominal Attributes
Method 1: Simple matching
  – For objects i and j, with m the number of matches and p the total number of variables: $d(i, j) = \frac{p - m}{p}$
Method 2: Use a large number of binary attributes
  – Create a new binary attribute for each of the M nominal states
      A color attribute with values of red, yellow, blue, green, etc.
      Create a series of new attributes red?, yellow?, blue?, green?, …
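Both methods can be sketched briefly in Python. The sketch assumes the common simple-matching form d(i, j) = (p - m) / p, since the formula itself did not survive in the transcript; the function names and the sample objects are hypothetical.

```python
def simple_matching_distance(obj_i, obj_j):
    """Method 1: fraction of nominal attributes on which the objects differ."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)                                  # total number of variables
    m = sum(a == b for a, b in zip(obj_i, obj_j))   # number of matches
    return (p - m) / p


def to_binary_attributes(value, states):
    """Method 2: one new binary attribute per nominal state."""
    return [1 if value == state else 0 for state in states]


# Hypothetical objects described by (color, size, shape).
print(simple_matching_distance(("red", "small", "round"),
                               ("red", "large", "round")))

colors = ["red", "yellow", "blue", "green"]
print(to_binary_attributes("blue", colors))  # [0, 0, 1, 0]
```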

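21 Proximity Measure for Binary Attributes
A contingency table for binary data, comparing object i with object j over all attributes
  – q: number of attributes that are 1 for both i and j
  – r: number of attributes that are 1 for i and 0 for j
  – s: number of attributes that are 0 for i and 1 for j
  – t: number of attributes that are 0 for both i and j
Distance measure for symmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s + t}$
Distance measure for asymmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s}$
Jaccard coefficient (similarity measure for asymmetric binary variables): $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$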

22 Example
Compute the distance between different individuals based on asymmetric binary attributes
  – Gender is a symmetric attribute; the remaining attributes are asymmetric binary
  – Let the values Y and P be 1, and the value N be 0
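The binary measures can be sketched in Python as below. The sketch assumes the standard contingency counts (q: both 1, r: 1 for i and 0 for j, s: 0 for i and 1 for j, t: both 0) and the usual formulas built from them. The two records are hypothetical stand-ins, since the example table is not part of the transcript; they follow the stated encoding Y/P = 1, N = 0.

```python
def contingency(i, j):
    """Return the counts (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))
    return q, r, s, t


def symmetric_distance(i, j):
    q, r, s, t = contingency(i, j)
    return (r + s) / (q + r + s + t)


def asymmetric_distance(i, j):
    # Negative matches (t) are ignored for asymmetric attributes.
    # Undefined (division by zero) if neither object has any 1s.
    q, r, s, _ = contingency(i, j)
    return (r + s) / (q + r + s)


def jaccard_similarity(i, j):
    q, r, s, _ = contingency(i, j)
    return q / (q + r + s)


def encode(record):
    """Map Y and P to 1 and N to 0."""
    return [1 if v in ("Y", "P") else 0 for v in record]


# Hypothetical asymmetric attributes for two individuals.
person_a = encode(["Y", "N", "P", "N", "N", "N"])
person_b = encode(["Y", "N", "P", "N", "P", "N"])

print(contingency(person_a, person_b))                    # (2, 0, 1, 3)
print(round(asymmetric_distance(person_a, person_b), 2))  # 0.33
print(round(jaccard_similarity(person_a, person_b), 2))   # 0.67
```

For asymmetric attributes the distance and the Jaccard coefficient sum to 1, which is a quick sanity check on any implementation.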

23 Distance on Numeric Data
Minkowski distance: $d(i, j) = \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h}$
  – where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
Properties
  – Positive definiteness: d(i, j) > 0 if i ≠ j, and d(i, i) = 0
  – Symmetry: d(i, j) = d(j, i)
  – Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j)
A distance that satisfies these properties is a metric
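A minimal Python sketch of the Minkowski distance, together with a spot check of the three metric properties on a few points, is shown below. It assumes the standard form (sum of |x_if - x_jf|^h, raised to 1/h) with h ≥ 1; the points i, j, k are arbitrary illustrative values, and checking three points is of course not a proof.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h (h >= 1) between equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    if h < 1:
        raise ValueError("h must be at least 1 for the result to be a metric")
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


# Arbitrary illustrative points.
i, j, k = (1, 2), (3, 5), (2, 0)

for h in (1, 2, 3):
    d_ij = minkowski(i, j, h)
    assert minkowski(i, i, h) == 0 and d_ij > 0              # positive definiteness
    assert d_ij == minkowski(j, i, h)                        # symmetry
    assert d_ij <= minkowski(i, k, h) + minkowski(k, j, h)   # triangle inequality
    print(h, round(d_ij, 3))
```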

24 Special Cases of Minkowski Distance
h = 1: Manhattan distance (city block, L1 norm): $d(i, j) = \sum_{f=1}^{p} |x_{if} - x_{jf}|$
  – E.g., the Hamming distance: the number of bits that are different between two binary vectors
h = 2: Euclidean distance (L2 norm): $d(i, j) = \sqrt{\sum_{f=1}^{p} (x_{if} - x_{jf})^2}$
h → ∞: "supremum" distance (L∞ norm): $d(i, j) = \max_{f} |x_{if} - x_{jf}|$
  – This is the maximum difference between any component (attribute) of the vectors

25 Example
[Figure: the same example points compared under Manhattan (L1), Euclidean (L2), and supremum distance]
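The three special cases can be compared side by side in a short Python sketch. The points and bit vectors below are illustrative values chosen here, since the data behind the example figure is not in the transcript.

```python
import math


def manhattan(x, y):
    """L1 norm: sum of absolute component differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """L2 norm: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def supremum(x, y):
    """L-infinity norm: largest absolute component difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def hamming(x, y):
    """Number of positions at which two vectors differ."""
    return sum(a != b for a, b in zip(x, y))


x1, x2 = (1, 2), (3, 5)
print(manhattan(x1, x2))            # 5
print(round(euclidean(x1, x2), 3))  # 3.606
print(supremum(x1, x2))             # 3

# On 0/1 vectors the Manhattan distance equals the Hamming distance.
bits_a, bits_b = (1, 0, 1, 1, 0), (1, 1, 0, 1, 0)
print(hamming(bits_a, bits_b), manhattan(bits_a, bits_b))  # 2 2
```

For any pair of points the three values are ordered supremum ≤ Euclidean ≤ Manhattan, which the output above illustrates.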

26 Distance on Ordinal Variables
An ordinal variable can be discrete or continuous
  – Order is important, e.g., rank
Can be treated like interval-scaled
  – Replace each value x_if by its rank r_if ∈ {1, …, M_f}, where M_f is the number of ordered states of variable f
  – Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  – Compute the dissimilarity using methods for interval-scaled variables
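The three steps for ordinal variables can be sketched in Python as follows. The sketch assumes the usual normalization z = (rank - 1) / (M - 1), where M is the number of ordered states; the state names and sample values are hypothetical.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Replace each ordinal value by its rank, then scale ranks onto [0, 1]."""
    m = len(ordered_states)
    if m < 2:
        raise ValueError("need at least two ordered states")
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (m - 1) for v in values]


# Hypothetical ordinal attribute with three ordered states.
states = ["fair", "good", "excellent"]
observed = ["excellent", "fair", "good", "excellent"]

z = ordinal_to_unit_interval(observed, states)
print(z)  # [1.0, 0.0, 0.5, 1.0]

# The scaled values can now be compared like interval-scaled data,
# e.g. with an absolute difference for a single attribute.
print(abs(z[0] - z[1]), abs(z[0] - z[2]), abs(z[0] - z[3]))  # 1.0 0.5 0.0
```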

27 Cosine Similarity
A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document
  – Applications: information retrieval, biological taxonomy, gene feature mapping
If d1 and d2 are two vectors (e.g., term-frequency vectors), then
  cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||)
where • indicates the vector dot product and ||d|| is the length of vector d

28 Example
Find the similarity between documents 1 and 2
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.123
So, cos(d1, d2) = 25 / (6.481 * 4.123) = 0.94
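The worked example can be reproduced with a few lines of Python. The function name `cosine_similarity` is an illustrative choice; the two vectors are the ones given on the slide.

```python
import math


def cosine_similarity(d1, d2):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm_1 = math.sqrt(sum(a * a for a in d1))
    norm_2 = math.sqrt(sum(b * b for b in d2))
    if norm_1 == 0 or norm_2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_1 * norm_2)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(sum(a * b for a, b in zip(d1, d2)))    # 25
print(round(cosine_similarity(d1, d2), 2))   # 0.94
```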

29 Cosine Similarity
Cosine similarity measures orientation, not magnitude. It can be seen as a comparison between documents in a normalized space: it does not consider the magnitude of each document's word counts, only the angle between the document vectors.
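The point about orientation versus magnitude can be illustrated with a small Python sketch: doubling every word count in a document (for example, by concatenating the document with itself) leaves its cosine similarity to the original at 1, while the Euclidean distance between the two grows. The doubled document is a hypothetical construction for illustration.

```python
import math


def cosine_similarity(d1, d2):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (math.sqrt(sum(a * a for a in d1)) *
                  math.sqrt(sum(b * b for b in d2)))


def euclidean(d1, d2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))


doc = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
doc_doubled = tuple(2 * count for count in doc)  # same direction, twice the length

print(round(cosine_similarity(doc, doc_doubled), 6))  # 1.0
print(round(euclidean(doc, doc_doubled), 3))          # 6.481
```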

