Data Mining Strategies. Scales of Measurement  Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 677-680  Four Scales  Categorical.


1 Data Mining Strategies

2 Scales of Measurement  Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 677-680  Four Scales  Categorical (nominal)  Ordinal (only order matters)  Interval (the difference between two values is meaningful)  Ratio (a value of 0.0 means none of the quantity is present; Kelvin is a ratio scale, but Celsius and Fahrenheit are not)

3 What to Know about the Scales  The measurement principle involved for each scale  Examples of the measurement scales  Permissible arithmetic operations for each scale

4 Categorical Scale Data  The values of the scale have no numeric meaning  Examples  Gender  Ethnicity  Marital Status  Hair Color  Operations  Counting (only)
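Since counting is the only operation allowed on categorical data, a frequency table is the natural summary. The following is a minimal sketch; the hair-color observations are made up for illustration and do not come from the slides.

```python
from collections import Counter

# Hypothetical hair-color observations (illustrative data only).
hair_colors = ["brown", "black", "brown", "blond", "red", "brown", "black"]

# Counting occurrences per category is the only permissible operation.
counts = Counter(hair_colors)

# The most frequent category (the mode) follows directly from the counts.
mode, mode_count = counts.most_common(1)[0]

print(dict(counts))
print(mode, mode_count)
```

Averaging the categories would be meaningless here, because the labels carry no numeric value.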

5 Ordinal Scale Data  The categories can be ordered  But the intervals between adjacent scale values are indeterminate  Examples  Movie ratings (0, 1 or 2 thumbs up)  U.S.D.A. beef (good, choice, prime)  The rank order of anything  Operations  Counting  Greater than or less than operations
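Ordinal data supports comparison, so values can be sorted and a middle value picked, even though the gaps between grades are unknown. The sketch below uses the U.S.D.A. grades from the slide; the sample list and the names `GRADE_ORDER` and `rank` are assumptions made for illustration.

```python
# Grades listed from lowest to highest, as on the slide.
GRADE_ORDER = ["good", "choice", "prime"]

# Map each grade to its position so that grades can be compared.
rank = {grade: position for position, grade in enumerate(GRADE_ORDER)}

# Hypothetical sample of graded cuts (illustrative data only).
samples = ["choice", "good", "prime", "choice", "good"]

# Greater-than / less-than comparisons are permitted on ordinal data.
prime_beats_choice = rank["prime"] > rank["choice"]

# Sorting relies only on order, so the median grade is well defined.
sorted_samples = sorted(samples, key=rank.__getitem__)
median_grade = sorted_samples[len(sorted_samples) // 2]

print(prime_beats_choice, sorted_samples, median_grade)
```

The integer ranks are used only for ordering; subtracting or averaging them would assume equal intervals, which an ordinal scale does not guarantee.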

6 Interval Scale Data  Intervals between adjacent scale values are equal  Examples  Degrees Fahrenheit  Most personality measures  IQ intelligence score  Operations  Counting  Greater than or less than operations  Addition and subtraction of scale values.

7 Ratio Scale Data  There is a rational zero point for the scale  An absolute zero  Examples  Temperature in kelvins  Annual income in dollars  Length, distance, size (cm, kB, inches, km)  Operations  All of the interval-scale operations, plus  Multiplication and division of scale values.
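The difference between interval and ratio scales can be seen with temperature. In this sketch (the sample temperatures are arbitrary), subtracting Fahrenheit values is meaningful, but dividing them is not; the ratio only makes sense after converting to Kelvin, which has an absolute zero.

```python
def fahrenheit_to_kelvin(degrees_f):
    """Convert a Fahrenheit temperature to Kelvin."""
    return (degrees_f - 32.0) * 5.0 / 9.0 + 273.15

# Two arbitrary temperatures in degrees Fahrenheit (interval scale).
f1, f2 = 40.0, 80.0

# Subtraction is permitted on an interval scale.
difference_f = f2 - f1

# Division runs, but the result is not meaningful: 80 F is not "twice as hot".
naive_ratio = f2 / f1

# On the Kelvin (ratio) scale, division is meaningful.
kelvin_ratio = fahrenheit_to_kelvin(f2) / fahrenheit_to_kelvin(f1)

print(difference_f, naive_ratio, round(kelvin_ratio, 3))
```

The Kelvin ratio comes out close to 1.08, far from the misleading factor of 2 obtained on the Fahrenheit scale.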

8 Variables  Independent  Input x  Dependent  Output f(x)  f(x) = 3 + 2x^2
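To make the input/output distinction concrete, the sketch below evaluates a function at a few inputs. It takes the slide's function to be f(x) = 3 + 2x^2, which is an assumption about the intended exponent; the input values are arbitrary.

```python
def f(x):
    """Dependent variable as a function of the independent variable x.

    Assumes the slide's formula is f(x) = 3 + 2x^2.
    """
    return 3 + 2 * x ** 2

# Independent variable: values we choose.
inputs = [0, 1, 2, 3]

# Dependent variable: values determined by the inputs.
outputs = [f(x) for x in inputs]

print(list(zip(inputs, outputs)))
```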

9 Data Mining Strategies  Unsupervised (No dependent variables used)  Clustering  Market Basket Analysis  Information Visualization  Supervised (At least one dependent variable used for training)  Classification  Estimation  Prediction

10 Clustering  Cluster analysis divides data into groups (clusters) that are meaningful, useful or both  Clusters capture the natural structure of the data  Clustering allows us to think about the data at a new level of abstraction  Cluster analysis is often the first step in a data mining project

11 Cluster of Stars

12 Water Clusters

13 Cellular Clusters

14 Cluster Analysis  Uses information found in the data that describes objects and their relationships  Goal: That objects within a group be similar to one another and different from objects in other groups  The greater the similarity within groups and the greater the difference between groups, the better the clustering
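One simple way to check the goal stated above is to compare the average distance between points in the same group with the average distance between points in different groups. This is a rough sketch, not a standard named index; the two clusters and their points are invented for illustration.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points already grouped into two clusters.
clusters = {
    "A": [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (9.0, 8.5), (8.5, 9.5)],
}

def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

# Average distance between pairs of points inside the same cluster.
within = mean(
    dist(p, q)
    for points in clusters.values()
    for p, q in combinations(points, 2)
)

# Average distance between points that lie in different clusters.
between = mean(dist(p, q) for p in clusters["A"] for q in clusters["B"])

print(round(within, 2), round(between, 2))
```

A small within-group distance together with a large between-group distance indicates a better clustering in the sense described on the slide.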

15 How Many Clusters?

16 Three Clusters Identified

17 Six Clusters Identified

18 Types of Clustering  Partitional clustering  Hierarchical clustering  Exclusive clustering  Overlapping clustering  Fuzzy clustering  Complete clustering  Partial clustering

19 Partitional Clustering  A division of a set of data into non-overlapping clusters  Each data point is in exactly one cluster  Example of Partitional Clustering

20 Hierarchical Clustering  Permits subclusters (nested clusters within clusters)  Example of Hierarchical Clustering
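Nested clusters can be produced by repeatedly merging the two closest groups. The sketch below is one possible bottom-up (agglomerative, single-linkage) approach on 1-D data; the slides do not name a specific method, and the function name `agglomerate` and the sample values are assumptions.

```python
def agglomerate(values):
    """Merge the two closest clusters until one remains.

    Works on 1-D data: once sorted, the closest pair of clusters
    (single linkage) is always a pair of neighbours.
    Returns the list of merged clusters, in merge order.
    """
    clusters = [[v] for v in sorted(values)]
    merges = []
    while len(clusters) > 1:
        # Gap between the end of each cluster and the start of the next.
        gaps = [
            clusters[i + 1][0] - clusters[i][-1]
            for i in range(len(clusters) - 1)
        ]
        i = gaps.index(min(gaps))
        merged = clusters[i] + clusters[i + 1]
        merges.append(merged)
        # Replace the two neighbours with their union.
        clusters[i:i + 2] = [merged]
    return merges

merges = agglomerate([1.0, 1.5, 5.0, 5.2, 9.0])
for step, cluster in enumerate(merges, start=1):
    print(step, cluster)
```

Each later merge contains earlier ones as subclusters, which is the nesting the slide describes.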

21 Exclusive clustering  Each object is assigned to a single cluster

22 Overlapping Clustering  Non-exclusive  A data point can belong to two or more clusters simultaneously

23 Fuzzy Clustering  Every data point belongs to every cluster with a membership weight  Membership ranges from 0 (absolutely does not belong) to 1 (absolutely belongs)  The sum of the membership weights for each point is 1  [Figure: clusters C1 and C2 with example points weighted C1 40% / C2 60%, C1 1% / C2 99%, and C1 75% / C2 25%]
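Membership weights that sum to 1 can be derived from distances to the cluster centers. The sketch below uses inverse squared distance, normalized per point, which matches the fuzzy c-means membership formula with fuzzifier m = 2; the slides do not specify a weighting scheme, so this choice, the function name `memberships`, and the sample centroids are assumptions.

```python
from math import dist

def memberships(point, centroids):
    """Return one weight per centroid, each in [0, 1], summing to 1."""
    squared = [dist(point, c) ** 2 for c in centroids]
    if 0.0 in squared:
        # The point sits exactly on a centroid: full membership there
        # (shared equally if several centroids coincide with it).
        hits = [1.0 if d == 0.0 else 0.0 for d in squared]
        return [h / sum(hits) for h in hits]
    inverse = [1.0 / d for d in squared]
    total = sum(inverse)
    return [w / total for w in inverse]

# Two hypothetical cluster centers, C1 and C2.
centroids = [(0.0, 0.0), (4.0, 0.0)]

weights = memberships((1.0, 0.0), centroids)
midpoint_weights = memberships((2.0, 0.0), centroids)

print([round(w, 2) for w in weights])
print([round(w, 2) for w in midpoint_weights])
```

A point close to C1 gets most of its weight there, while a point midway between the centers is split evenly.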

24 Complete Clustering  Assigns every data point to a cluster  No data point is left out of a cluster

25 Partial Clustering  Does not assign every data point to a cluster  Some data points may belong to no cluster  Noise  Outliers  Uninteresting background  Example: classifying newspaper stories  Many fall into common topics  Global warming  Terrorism  Some stories are unique  Cable Tie just graduated from the CofC in CS
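One simple way to obtain a partial clustering is to leave a point unassigned when it is too far from every cluster center. This is only a sketch of that idea; the distance threshold, the function name `assign_partial`, and the sample data are all assumptions for illustration.

```python
from math import dist

def assign_partial(points, centroids, max_distance):
    """Label each point with its nearest centroid index, or None.

    A point farther than max_distance from every centroid is treated
    as noise or an outlier and is left out of all clusters.
    """
    labels = []
    for point in points:
        distances = [dist(point, c) for c in centroids]
        nearest = min(range(len(centroids)), key=distances.__getitem__)
        if distances[nearest] <= max_distance:
            labels.append(nearest)
        else:
            labels.append(None)
    return labels

# Two hypothetical cluster centers and three points.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, 0.2), (9.5, 10.1), (5.0, 5.0)]

labels = assign_partial(points, centroids, max_distance=2.0)
print(labels)
```

The third point lies far from both centers, so it stays unlabeled, much like a unique news story that fits no common topic.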

26 K-Means  1. Select K points as initial centroids  2. Repeat: (a) form K clusters by assigning each point to its closest centroid; (b) recompute the centroid of each cluster  3. Until the centroids do not change  Chris Starr: A centroid is the center of a cluster

27 The centroids are repositioned until stable in the K-means algorithm.
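The K-means steps can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the function name `k_means`, the `max_iterations` safeguard, the handling of empty clusters, and the sample points are assumptions beyond what the slides state.

```python
from math import dist

def k_means(points, initial_centroids, max_iterations=100):
    """Run K-means until the centroids stop changing."""
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        # Form K clusters by assigning each point to its closest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: dist(point, centroids[i]),
            )
            clusters[nearest].append(point)

        # Recompute each centroid as the mean of its cluster.
        # An empty cluster keeps its old centroid (an assumption).
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]

        # Stop when the centroids do not change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D data with two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

# Step 1: select K = 2 points as initial centroids.
centroids, clusters = k_means(points, initial_centroids=[points[0], points[3]])

print(centroids)
print(clusters)
```

On this data the centroids settle after the second pass, at the mean of each group of three points. Different initial centroids can lead to different final clusters, so the starting choice matters.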

28 Observe Your Environment  Start looking for clusters around you  Think about how the clusters are formed  Are they hierarchical?  Are they fuzzy clusters?  Are they complete clusters?

