© Vipin Kumar CSci 8980 Fall 2002 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance Computing Research Center Department of Computer.

© Vipin Kumar CSci 8980 Fall 2002 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance Computing Research Center Department of Computer Science University of Minnesota http://www.cs.umn.edu/~kumar

What is Data?
• Data is a collection of objects and the attributes of those objects.
  – Attribute: also called a variable, field, characteristic, feature, or observation.
  – Object: also called a record, point, case, sample, entity, or item.
  – Objects have attributes.
  – Attributes describe objects.
• A data set is a collection of data objects.

Types of Attributes
• There are different types of attributes:
  – Nominal: The values are just labels. Examples: ID numbers, eye color, zip codes.
  – Ordinal: The values can be ordered. Examples: street numbers, rankings (e.g., taste of potato chips on a scale from 1-10), grades.
  – Interval: Differences are meaningful. Examples: calendar dates, temperatures in Celsius or Fahrenheit.
  – Ratio: Ratios are meaningful. Examples: temperature in Kelvin, length, time, counts.

Properties
• The type of an attribute depends on which of the following properties it has:
  – Distinctness: = ≠
  – Order: < >
  – Addition: + -
  – Multiplication: * /
• Length has all of these properties.

Attribute Type: Description, Examples, Operations

Nominal
  – Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  – Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  – Operations: mode, entropy, contingency correlation, χ² test

Ordinal
  – Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
  – Examples: hardness of minerals, {good, better, best}, grades, street numbers
  – Operations: median, percentiles, rank correlation, run tests, sign tests

Interval
  – Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
  – Examples: calendar dates, temperature in Celsius or Fahrenheit
  – Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio
  – Description: For ratio variables, both differences and ratios are meaningful. (*, /)
  – Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  – Operations: geometric mean, harmonic mean, percent variation
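
As a rough illustration of how the attribute type limits the statistics that make sense, the sketch below applies one summary operation per type using Python's standard `statistics` module. The sample values and variable names are hypothetical and are not taken from the slides.

```python
import statistics

# Hypothetical sample values, one list per attribute type.
eye_color = ["brown", "blue", "brown", "green"]   # nominal: labels only
grades = [1, 2, 2, 3, 5]                          # ordinal: rank codes
temps_celsius = [10.0, 20.0, 30.0]                # interval: differences meaningful
lengths_m = [1.0, 2.0, 4.0]                       # ratio: ratios meaningful

# Nominal: the mode only needs equality comparisons (= and !=).
nominal_mode = statistics.mode(eye_color)

# Ordinal: the median additionally needs an ordering (< and >).
ordinal_median = statistics.median(grades)

# Interval: the mean additionally needs addition and subtraction.
interval_mean = statistics.mean(temps_celsius)

# Ratio: the geometric mean additionally needs multiplication and division.
ratio_geometric_mean = statistics.geometric_mean(lengths_m)

print(nominal_mode, ordinal_median, interval_mean, ratio_geometric_mean)
```

Each operation is also valid for the types listed after it: a mode can be computed for lengths, but a geometric mean of eye colors has no meaning.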

Attribute Level: Transformation, Comments

Nominal
  – Transformation: Any permutation of values.
  – Comments: If all employee ID numbers were reassigned, would it make any difference?

Ordinal
  – Transformation: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function.
  – Comments: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Interval
  – Transformation: new_value = a * old_value + b, where a and b are constants.
  – Comments: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).

Ratio
  – Transformation: new_value = a * old_value
  – Comments: Length can be measured in meters or feet.
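
The permitted transformations can be checked numerically. The sketch below is an illustration only, with hypothetical sample values: it converts Celsius to Fahrenheit (new_value = a * old_value + b) and meters to feet (new_value = a * old_value), then compares which quantities survive the change of scale.

```python
def celsius_to_fahrenheit(c):
    # Interval-level transformation: a = 1.8, b = 32.
    return 1.8 * c + 32.0


def meters_to_feet(m):
    # Ratio-level transformation: a = 1 / 0.3048, no offset.
    return m / 0.3048


c1, c2, c3 = 10.0, 20.0, 40.0  # hypothetical temperatures in Celsius
f1, f2, f3 = (celsius_to_fahrenheit(c) for c in (c1, c2, c3))

# Ratios of differences are preserved on an interval scale...
diff_ratio_c = (c3 - c2) / (c2 - c1)
diff_ratio_f = (f3 - f2) / (f2 - f1)

# ...but plain ratios are not, because the zero point moves.
ratio_c = c2 / c1
ratio_f = f2 / f1

# On a ratio scale, plain ratios are preserved.
ratio_m = 4.0 / 2.0
ratio_ft = meters_to_feet(4.0) / meters_to_feet(2.0)

# An order-preserving recoding keeps the ranking of ordinal values.
recode = {1: 0.5, 2: 1, 3: 10}
ordinal = [3, 1, 2]
same_order = sorted(ordinal) == sorted(ordinal, key=lambda v: recode[v])

print(diff_ratio_c, diff_ratio_f, ratio_c, ratio_f, ratio_m, ratio_ft, same_order)
```

This is why "20 degrees is twice as warm as 10 degrees" is not a meaningful statement in Celsius, while "4 meters is twice as long as 2 meters" holds in any unit of length.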

Discrete and Continuous Attributes
• Discrete
  – A discrete attribute has only a finite or countably infinite set of values, e.g., zip codes, counts, or the set of words in a collection of documents.
  – Discrete attributes are often represented as integer variables.
  – Binary attributes are a special case of discrete attributes and assume only two values, e.g., true/false, yes/no, male/female. They are often represented as Boolean variables, or as integer variables that take on the values 0 or 1.
• Continuous
  – A continuous attribute is one whose values are real numbers, e.g., temperature, height, or weight. (In practice, real values can only be measured and represented to a finite number of digits.)
  – Continuous attributes are typically represented as floating-point variables.

Record Data
Much of the original data mining work, and much current work, focuses on record data, i.e., data that consists of a collection of records (data objects), each of which consists of a fixed set of data fields (attributes).

Data Matrix
If the data objects in a collection of data all have the same fixed set of numeric attributes, then the data objects can be thought of as points (vectors) in a multi-dimensional space, where each dimension represents a distinct attribute describing the object. Thus, a set of data objects can be interpreted as an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute.
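
A minimal sketch of this view, using a plain list of lists and hypothetical values: each inner list is one object (row), and each position within it is one attribute (column). Treating rows as points makes geometric operations such as Euclidean distance available.

```python
import math

# Hypothetical data: m = 3 objects (rows), n = 2 numeric attributes (columns).
data_matrix = [
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
]

m = len(data_matrix)      # number of objects
n = len(data_matrix[0])   # number of attributes


def column(matrix, j):
    """Return all values of attribute j, one per object."""
    return [row[j] for row in matrix]


# Because each row is a point in n-dimensional space, distances are defined.
dist_01 = math.dist(data_matrix[0], data_matrix[1])

print(m, n, column(data_matrix, 1), dist_01)
```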

Document Data
• Each document becomes a 'term' vector, where each term is a component (attribute) of the vector, and where the value of each component is the number of times the corresponding term occurs in the document.
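
One possible way to build such term vectors is sketched below. The two documents are hypothetical, and splitting on whitespace is a simplifying assumption; real text would usually need more careful tokenization.

```python
from collections import Counter

# Hypothetical documents.
documents = [
    "the team won the game",
    "the coach lost the game",
]

# The vocabulary fixes the order of the vector components (one per term).
vocabulary = sorted({term for doc in documents for term in doc.split()})


def term_vector(doc, vocabulary):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]


vectors = [term_vector(doc, vocabulary) for doc in documents]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Stacking the vectors row by row gives a document-term matrix, which is a data matrix whose attributes are terms.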

Transaction Data
• Transaction data is a special type of record data, where each record (transaction) involves a set of items. For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
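
The grocery example can be sketched with Python sets. The item names are hypothetical, and the conversion to a 0/1 record per transaction is one common representation rather than something the slides prescribe.

```python
# Hypothetical shopping trips: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers"},
]

# All distinct items, in a fixed order, become the attributes.
items = sorted(set().union(*transactions))

# Each transaction becomes a record of 0/1 flags, one per item.
binary_records = [
    [1 if item in transaction else 0 for item in items]
    for transaction in transactions
]

# Number of transactions that contain a given item.
bread_count = sum(1 for transaction in transactions if "bread" in transaction)

print(items)
print(binary_records)
print(bread_count)
```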

Ordered Data: Spatio-Temporal Data
A key interest is finding connections between the ocean and the land.
• Global snapshots of values for a number of variables on land surfaces or water.
• Monthly over a range of 10 to 50 years.
Research Goals:
• Find global climate patterns of interest to Earth Scientists.

Data Quality
How can we detect problems with the data? What can we do about these problems? We need to know what kinds of problems are possible, i.e., what sorts of situations correspond to poor data quality. The following are some well-known problems:
• noise and outliers
• missing values
• duplicate data
• inconsistent values
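
Some of these problems can be detected with simple checks. The sketch below is an illustration under stated assumptions: the records, the use of `None` to mark a missing value, and the rule that an age must be non-negative are all hypothetical. Detecting noise and outliers generally needs more than such rules and is not covered here.

```python
# Hypothetical records: (name, age, height in meters). None marks a missing value.
records = [
    ("alice", 34, 1.65),
    ("bob", None, 1.80),    # missing age
    ("alice", 34, 1.65),    # exact duplicate of the first record
    ("carol", -5, 1.70),    # inconsistent: age cannot be negative
]


def rows_with_missing(rows):
    """Indices of records that contain at least one missing value."""
    return [i for i, row in enumerate(rows) if any(v is None for v in row)]


def duplicate_rows(rows):
    """Indices of records that exactly repeat an earlier record."""
    seen = set()
    duplicates = []
    for i, row in enumerate(rows):
        if row in seen:
            duplicates.append(i)
        else:
            seen.add(row)
    return duplicates


def rows_with_negative_age(rows, age_index=1):
    """Indices of records whose age violates the assumed non-negativity rule."""
    return [
        i for i, row in enumerate(rows)
        if row[age_index] is not None and row[age_index] < 0
    ]


print(rows_with_missing(records))
print(duplicate_rows(records))
print(rows_with_negative_age(records))
```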

Missing Values

Eliminate Data Objects
A simple and effective strategy is to eliminate those records with missing values. A related strategy is to eliminate attributes that have missing values.

Estimate Missing Values
Sometimes the data set is such that missing data can be reliably estimated. For example, consider a time series that changes in a reasonably smooth fashion, but has a few, widely scattered missing values. In such cases, the missing values can be estimated (interpolated) by using the remaining values. As another example, consider a data set that has many similar data points. In this situation, a nearest neighbor approach can be used to estimate the missing value. More specifically, the attribute values of the points closest to the point with the missing value are used to estimate the missing value. If the attribute is continuous, then the average attribute value of the nearest neighbors is used, while if the attribute is categorical, then the most commonly occurring attribute value can be taken.

Ignore the Missing Value During Analysis
Many data mining approaches can be modified to operate by ignoring missing values. For example, suppose that objects are being clustered and the similarity between pairs of data objects needs to be calculated. If one or both objects of a pair have missing values for some attributes, then the similarity can be calculated by using only the other attributes. It is true that the similarity will only be approximate, but unless the number of attributes is small and/or the number of missing values is high, this degree of inaccuracy may not matter much. Likewise, many classification schemes can handle missing values relatively straightforwardly.
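
The estimation and ignoring strategies above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `None` marks a missing value, linear interpolation is one assumed choice of interpolation, Euclidean distance stands in for the similarity measure, and only the continuous (averaging) case of nearest neighbor estimation is shown. All function names and data are hypothetical.

```python
import math


def interpolate(series):
    """Fill gaps between known values of a time series by linear interpolation.

    Missing values before the first or after the last known value are left as None.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for lo, hi in zip(known, known[1:]):
        for i in range(lo + 1, hi):
            fraction = (i - lo) / (hi - lo)
            filled[i] = series[lo] + fraction * (series[hi] - series[lo])
    return filled


def partial_distance(a, b):
    """Euclidean distance using only attributes present in both objects.

    Returns None when the two objects share no known attribute.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return None
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))


def knn_estimate(target, others, attr, k=2):
    """Estimate target[attr] as the mean of that attribute over the k nearest objects."""
    candidates = []
    for other in others:
        if other[attr] is None:
            continue
        # Compare on every attribute except the one being estimated.
        d = partial_distance(
            target[:attr] + target[attr + 1:],
            other[:attr] + other[attr + 1:],
        )
        if d is not None:
            candidates.append((d, other[attr]))
    candidates.sort(key=lambda pair: pair[0])
    nearest = [value for _, value in candidates[:k]]
    return sum(nearest) / len(nearest) if nearest else None


smooth_series = [1.0, None, 3.0, None, None, 6.0]
filled_series = interpolate(smooth_series)

target = [1.0, 2.0, None]
others = [[1.0, 2.1, 10.0], [1.1, 2.0, 12.0], [9.0, 9.0, 50.0]]
estimate = knn_estimate(target, others, attr=2, k=2)

print(filled_series)
print(estimate)
```

For a categorical attribute, the averaging step in `knn_estimate` would be replaced by taking the most commonly occurring value among the nearest neighbors.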
