1 Data Mining Data (last modified 1/16/19)

2 What is Data?
Data is a collection of data objects and their attributes.
- An attribute is a property or characteristic of an object
  - Examples: eye color of a person, temperature, etc.
  - Also known as a variable, field, characteristic, or feature
- A collection of attributes describes an object
  - An object is also known as a record, point, case, sample, entity, or instance

3 Attribute Values
Attribute values: numbers or symbols assigned to an attribute.
Attributes versus attribute values:
- The same attribute can be mapped to different values
  - Example: height can be measured in feet or meters
- Different attributes can map to the same set of values
  - Example: attribute values for ID and age are both integers, but the properties of the values differ: ID has no limit, while age has a minimum and a maximum value

4 Types of Attributes
There are different types of attributes:
- Nominal (categorical). Examples: ID numbers, eye color, zip codes
- Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
- Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit
- Ratio. Examples: temperature in Kelvin, length, time, counts

5 Properties of Attribute Values
The type of an attribute depends on which of the following four properties it possesses:
- Distinctness: =, ≠
- Order: <, >
- Addition: +, -
- Multiplication: *, /
Attribute types and their properties:
- Nominal: distinctness
- Ordinal: distinctness & order
- Interval: distinctness, order & addition
- Ratio: all four properties

6 Attribute Types: Description, Examples, Operations
- Nominal: the values are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another (=, ≠). Examples: zip codes, employee ID numbers, eye color, sex {male, female}. Operations: mode, entropy.
- Ordinal: the values provide enough information to order objects (<, >). Examples: hardness of minerals, {good, better, best}, grades, street numbers. Operations: median, percentiles.
- Interval: the differences between values are meaningful, i.e., a unit of measurement exists (+, -). Examples: calendar dates, temperature in Celsius or Fahrenheit. Operations: mean, standard deviation.
- Ratio: both differences and ratios are meaningful (*, /). Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current.
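As a quick illustration of which operations fit which attribute type, here is a minimal Python sketch (not from the slides; the toy values are made up) using the standard library's statistics module:

```python
from statistics import mean, median, mode

eye_color = ["brown", "blue", "brown", "green", "brown"]  # nominal
grades = [1, 2, 3, 3, 4]                                  # ordinal (coded ranks)
temps_c = [21.5, 23.0, 19.8, 22.1]                        # interval
ages = [34, 29, 41, 52]                                   # ratio

print(mode(eye_color))    # nominal: the mode is meaningful -> 'brown'
print(median(grades))     # ordinal: median/percentiles are meaningful -> 3
print(mean(temps_c))      # interval: mean and standard deviation are meaningful
print(ages[0] / ages[1])  # ratio: ratios themselves are meaningful
```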

7 Permissible Transformations by Attribute Level
- Nominal: any permutation of values. (If all employee ID numbers were reassigned, would it make any difference?)
- Ordinal: any order-preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function. (An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.)
- Interval: new_value = a * old_value + b, where a and b are constants. (Thus, the Fahrenheit and Celsius temperature scales differ in where their zero value is and in the size of a unit, i.e., a degree.)
- Ratio: new_value = a * old_value. (Length can be measured in meters or feet.)

8 Discrete & Continuous Attributes
- Discrete attribute
  - Has only a finite set of values
  - Examples: zip codes, counts, or the set of words in a collection of documents
  - Often represented as integer variables
  - Note: binary attributes are a special case of discrete attributes
  - Note: even if represented as integers, rely only on distinctness; do not use relations like < (less than)
- Continuous attribute
  - Has real numbers as attribute values
  - Examples: temperature, height, or weight
  - In practice, real values can only be measured and represented with a finite number of digits
  - Typically represented as floating-point variables
  - Can use numerical relations like < (less than)

9 Characteristics of Structured Data
- Dimensionality: the number of features/variables
  - What is the "Curse of Dimensionality"? With many dimensions, data becomes sparse relative to the instance space, and patterns become harder to find
- Sparsity: most values are missing or zero, and typically only presence counts
  - Example: document representation using a "bag of words"
- Resolution: patterns depend on the scale
  - Give an example of how changing resolution can help (hint: think about weather patterns, e.g., rainfall over a time period)

10 Types of Data Sets
- Record: data matrix, document data, transaction data
- Graph: World Wide Web, molecular structures
- Ordered: spatial data, temporal data, sequential data, genetic sequence data

11 Record Data
Data that consists of a collection of records, each of which consists of a fixed set of attributes.

12 Data Matrix
If data objects have the same fixed set of numeric attributes, then they can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute.
Such a data set can be represented by an m-by-n matrix, with m rows (one for each object) and n columns (one for each attribute).
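A minimal sketch of this idea in NumPy (the values are made up): rows are objects, columns are attributes.

```python
import numpy as np

# 3 objects described by 2 numeric attributes, e.g. (x, y) coordinates
data = np.array([
    [0.0, 2.0],
    [2.0, 0.0],
    [3.0, 1.0],
])
m, n = data.shape
print(m, "objects,", n, "attributes")
print(data[0])     # the first object, as a point in 2-D space
print(data[:, 1])  # the second attribute across all objects
```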

13 Document Data
Each document becomes a "term" vector:
- each term is a component (attribute) of the vector
- the value of each component is the number of times the corresponding term occurs in the document
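One common way to build such term vectors is scikit-learn's CountVectorizer; a minimal sketch, assuming a recent scikit-learn and using made-up documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the team played the season",
    "the coach timed out the game",
]
vec = CountVectorizer()
X = vec.fit_transform(docs)          # sparse m-by-n term-count matrix
print(vec.get_feature_names_out())   # the vocabulary (one column per term)
print(X.toarray())                   # dense view: mostly zeros, i.e., sparse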

14 Document Data
What can we say about document data using the terms we just introduced? Is it highly dimensional? Is it sparse?
- It is both highly dimensional and sparse.
- How many words do you think a college-educated English speaker knows? Uses?
  - May know 80,000 words, but uses only about 5,000 in speech and 10,000 in writing
  - The Oxford English Dictionary has 171,476 words
- Zipf's law: word frequency is inversely proportional to rank
  - The 2nd-ranked word occurs half as frequently as the 1st-ranked word
  - "the" ~7%, "of" ~3.5%, "and" ~2%, ...
  - 135 words cover half of all word usage (but not after stemming)

15 Transaction Data
A special type of record data, where each record (transaction) involves a set of items.
For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products purchased are the items.

16 Graph Data
Examples: hypertext documents linked into a web; a generic graph.

17 Ordered Data
Sequences of transactions, where each element of the sequence is a set of items/events.

18 Ordered Data Examples
- Sensor data
- Error logs / alarm data
- Genomic sequence data

19 Time Series Data
Graphical representation of smartphone accelerometer data for the walking (top) and jogging (bottom) activities. The accelerometer measures acceleration along all three spatial axes.

20 Ordered Data: Spatio-Temporal Data
Example: average monthly temperature of land and ocean.

21 Data Quality Issues
What kinds of data quality problems arise?
- Noise (the random component of measurement error)
- Artifacts (deterministic/systematic errors)
- Outliers (anomalous objects; not necessarily errors)
- Missing values
- Inconsistent values (e.g., a zip code that conflicts with the city)
- Duplicate data

22 Precision, Bias, and Accuracy
- Precision: the closeness of repeated measurements (of the same quantity) to one another
  - Can be measured by the standard deviation
- Bias: a systematic variation of measurements from the quantity being measured
  - Can only be determined if the "true" value is known
- Accuracy: the closeness of a measurement to the true value
  - Depends on precision and bias, but there is no general formula for it

23 More on Bias and Accuracy
In machine learning and data mining, bias generally means something very different. What does it mean?
- Here bias is not systematic error but an extra-evidentiary leap. It is good: without it, learning is not possible. When we use data mining to build a model, there must be a bias in order to generalize from the data.
- Accuracy does not really have a different meaning, but it can be easily measured: compare the predicted value (from the model) against the true value (which must be provided, or else training is not possible).
- Accuracy is defined for discrete classes and expressed as a percentage (%); error rate = 100 - accuracy.

24 Noise
Noise refers to the modification of original values.
- Examples: distortion of a person's voice on a poor phone connection, "snow" on a television screen
- As defined in some textbooks, noise must be random
[Figures: two sine waves, and the same two sine waves with noise added]

25 Outliers
Outliers are data objects with characteristics that are considerably different from those of most other data objects in the data set.
Are outliers sometimes important?

26 Missing Values
Reasons for missing values:
- Information was not collected (e.g., people decline to give their age or weight)
- Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
What are methods for handling missing values? (A sketch of the first few follows below.)
- Eliminate the data objects (examples)
- Estimate the missing values
- Ignore the missing value during analysis
- Replace with all possible values (weighted by their probabilities)
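A minimal pandas sketch of the first three strategies (the tiny data frame is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 47, 33], "weight": [70, 80, np.nan, 60]})

dropped = df.dropna()             # eliminate objects that have missing values
estimated = df.fillna(df.mean())  # estimate: impute each column's mean
# Many analyses can also simply skip NaNs, as pandas itself does here:
print(df["age"].mean())           # computed over the non-missing values only
```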

27 Duplicate Data
A data set may include data objects that are duplicates, or objects that correspond to the same entity but vary slightly.
- A major issue when merging data from heterogeneous sources
- Example: the same person with multiple addresses
- Data cleaning: the process of dealing with duplicate-data issues

28 Issues Related to Applications
- Timeliness: some data can become stale
- Relevance: the data must be relevant to build a good model. For predicting car accident rates, you would want age information, since it is relevant.
- Knowledge about the data: ideally you want good documentation. For example, it could explain that some features are correlated.

29 Data Preprocessing
- Aggregation
- Sampling
- Dimensionality reduction
- Feature subset selection
- Feature creation
- Discretization and binarization
- Variable transformation

30 Aggregation
Combining two or more attributes (or objects) into a single attribute (or object).
Purpose:
- Data reduction: reduce the number of attributes or objects
- Change of scale: cities aggregated into regions, states, countries, etc.
- More "stable" data: aggregated data tends to have less variability

31 Aggregation: Variation of Precipitation in Australia
[Figures: standard deviation of average monthly precipitation vs. standard deviation of average yearly precipitation]

32 Aggregation Example
Build an activity recognition model (classifier) from smartphone accelerometer data.
- What type of data is this? Time series (numerical).
- How do you build a classifier from this data? Based on our limited knowledge of classification, what is the problem? We expect record data (which forms an example), not time-series data.
- What is the solution? Use a sliding-window approach to aggregate the series into simple higher-level features (e.g., mean, standard deviation); see the sketch below.
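A rough sketch of the sliding-window idea, assuming (arbitrarily) a 20 Hz one-axis signal and 10-second non-overlapping windows:

```python
import numpy as np

rate, window_sec = 20, 10
signal = np.random.randn(6000)  # stand-in for a 1-axis accelerometer stream
size = rate * window_sec        # 200 samples per window
windows = signal[: len(signal) // size * size].reshape(-1, size)

# One record (example) per window: mean and standard deviation features
records = np.column_stack([windows.mean(axis=1), windows.std(axis=1)])
print(records.shape)            # (30, 2): 30 examples, 2 features
```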

33 Sampling
Sampling is often used for data selection, both for the preliminary investigation of the data and for the final analysis.
- Statisticians sample because obtaining the entire set of data of interest is too expensive or time-consuming.
- Data miners sample because processing the entire set of data of interest is too expensive or time-consuming.
- Either way, you want a representative sample.

34 Types of Sampling
- Simple random sampling: an equal probability of selecting any particular item. Two variations:
  - Sampling without replacement (an item can't be selected again)
  - Sampling with replacement (an item can be selected again)
- Stratified sampling: split the data into several partitions, then draw random samples from each partition.
  - In data mining, it is most common to stratify based on the class.
  - Stratified sampling matters most when there are unusual or rare objects; in such cases simple random sampling could lead to large differences from the true proportions (see the sketch below).
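A minimal pandas sketch contrasting the two (requires pandas 1.1+ for groupby(...).sample; the 99:1 class mix is made up):

```python
import pandas as pd

df = pd.DataFrame({"x": range(1000), "cls": ["common"] * 990 + ["rare"] * 10})

random_sample = df.sample(frac=0.1, random_state=0)  # may under/over-sample "rare"
stratified = df.groupby("cls", group_keys=False).sample(frac=0.1, random_state=0)
print(stratified["cls"].value_counts())              # keeps the 99:1 proportions
```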

35 Types of Sampling (cont.)
Progressive sampling:
- The proper sample size may be difficult to determine, so adaptive or progressive sampling techniques can be used.
- May require extra computation, as well as a way to determine when the sample is good enough.
- For predictive models, learning curves plot accuracy versus training set size, so one can progressively sample until near a plateau.
- One example schedule is a geometric progression (sketched below).
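A rough sketch of progressive sampling with a geometric (doubling) schedule, using scikit-learn; the synthetic data set, the stopping threshold of 0.005, and the starting size of 100 are all arbitrary illustrations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

prev, size = 0.0, 100
while size <= len(X_tr):
    # Train on the current sample size and measure held-out accuracy
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr[:size], y_tr[:size])
    acc = clf.score(X_te, y_te)
    if acc - prev < 0.005:   # near the plateau of the learning curve: stop
        break
    prev, size = acc, size * 2  # geometric progression of sample sizes
print("chosen training size:", size, "accuracy:", round(acc, 3))
```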

36 Sample Size
[Figures: the same data set drawn at decreasing sample sizes, starting from 8000 points]

37 Sample Size
What sample size is necessary to get at least one object from each of 10 groups?

38 Curse of Dimensionality
- When dimensionality increases, data becomes increasingly sparse in the space that it occupies.
- It becomes much harder to find meaningful patterns.
- Definitions of density and distance between points, which are critical for clustering and outlier detection, also become less meaningful.

39 Dimensionality Reduction
Purpose:
- Avoid the curse of dimensionality
- Reduce the amount of time and memory required by data mining algorithms
- Allow the data to be more easily visualized
- May help eliminate irrelevant features or reduce noise

40 Linear Algebra Techniques
- One way to reduce dimensionality is to create new features that are combinations of several original features.
- Linear algebra techniques can project data from higher dimensions into lower dimensions.
- Principal Component Analysis (PCA) is a linear algebra technique for continuous attributes that finds new attributes (principal components) that are linear combinations of the original attributes and are perpendicular to each other.
- Singular Value Decomposition (SVD) is a specific technique related to PCA.
- The mathematics is beyond the scope of this course, but you may want to try PCA in practice (a minimal sketch follows below).
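A minimal PCA sketch with scikit-learn; the iris data set is just a convenient 4-attribute stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data       # 150 objects x 4 continuous attributes
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)  # 150 x 2: linear combinations of the originals
print(X2.shape, pca.explained_variance_ratio_)
```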

41 Feature Subset Selection
Another way to reduce the dimensionality of data.
- Redundant features duplicate much or all of the information contained in one or more other attributes. Example: the purchase price of a product and the amount of sales tax paid.
- Irrelevant features contain no information that is useful for the data mining task at hand. Example: a student's ID is often irrelevant to the task of predicting the student's GPA.

42 Feature Subset Selection Techniques
- Brute-force approach: try all possible feature subsets as input to the DM algorithm. (What is the problem with this?)
- Embedded approaches: feature selection occurs naturally as part of the data mining algorithm (e.g., decision trees).
- Filter approaches: features are selected before the data mining algorithm is run, e.g., check each feature's correlation with the class and drop it if there is none (sketched below).
- Wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes (without trying all possibilities).
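A hedged sketch of a simple filter approach: score each feature by its absolute correlation with the class and drop the weak ones. The 0.1 threshold is an arbitrary illustration, not a recommendation:

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)
# Absolute Pearson correlation of each feature with the class label
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
keep = scores > 0.1              # drop features uncorrelated with the class
X_reduced = X[:, keep]
print("kept", keep.sum(), "of", X.shape[1], "features")
```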

43 Feature Creation
Create new attributes that capture the important information in a data set much more efficiently than the original attributes.
- Feature extraction: new features from the original features; typically domain specific.
  - Image recognition: add features corresponding to edges.
  - Density computed from volume and mass is useful when trying to identify material composition.
- Mapping data to a new space: for time series, a Fourier transform may identify frequency characteristics.

44 Discretization & Binarization
- Some data mining tasks allow only discrete features.
  - Association analysis needs to know only the presence or absence of something, not the quantity.
- Other tasks may be simplified, and easier to interpret, with fewer values.
- Data mining toolkits provide methods for performing discretization.

45 Discretization of Categorical Values
Suppose you have one feature with m values.
- View it as an integer and encode it with ceil(log2 m) binary features.
  - Book example: awful, poor, OK, good, great encoded as 0-4 using 3 binary variables x1, x2, x3.
  - Possible problem: this introduces correlations between the binary variables that are not meaningful.
- Or encode it more simply using m binary features, one per value, with all others set to 0 (often called one-hot encoding; sketched below).
  - Appropriate for association analysis, e.g., gender male/female replaced with two binary features.
  - Also useful for neural networks with multiple outputs, e.g., automatic steering with outputs turn-right, straight, turn-left; only one output is enabled.
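A minimal sketch of the m-binary-features (one-hot) encoding with pandas; the data frame is made up:

```python
import pandas as pd

df = pd.DataFrame({"quality": ["awful", "poor", "OK", "good", "great"]})
onehot = pd.get_dummies(df["quality"])  # one binary column per value
print(onehot)                           # exactly one 1 (True) per row
```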

46 Discretization of Continuous Values
[Figures: original data discretized by equal interval width, equal frequency, and K-means]
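scikit-learn's KBinsDiscretizer happens to support all three schemes named above through its strategy parameter; a minimal sketch on random data:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

x = np.random.randn(200, 1)
for strategy in ("uniform", "quantile", "kmeans"):  # width / frequency / k-means
    disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    bins = disc.fit_transform(x)
    print(strategy, np.bincount(bins.ravel().astype(int)))  # objects per bin
```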

47 Architecture for Feature Subset Selection
Four main steps:
- A measure for evaluating a subset (for the wrapper approach, see how the algorithm performs)
- A search strategy for selecting subsets
- A stopping criterion (sufficient performance, improvement not expected in future iterations, maximum feature-set size)
- A validation procedure

48 Attribute Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values.
- Simple functions: x^k, log(x), e^x, |x|
  - Used mostly in statistical methods where certain properties are wanted (symmetry, a balanced distribution)
- Standardization and normalization
  - For distance-based methods, mapping all features to the same range can prevent one feature from dominating (sketched below)
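A minimal NumPy sketch of standardization (z-scores) and min-max normalization, applied column-wise so every feature lands on a comparable scale; the values are made up:

```python
import numpy as np

X = np.array([[180.0, 70.0], [165.0, 90.0], [172.0, 60.0]])  # height, weight
z = (X - X.mean(axis=0)) / X.std(axis=0)                     # standardization
minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # map to [0, 1]
print(z.round(2))
print(minmax.round(2))
```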

49 Similarity and Dissimilarity
Why might you need to measure these things?
- Similarity: a numerical measure of how alike two data objects are
  - Higher when objects are more alike
  - Often falls in the range [0, 1]
- Dissimilarity: a numerical measure of how different two data objects are
  - Lower when objects are more alike
  - The minimum dissimilarity is often 0; the upper limit varies
- Proximity refers to either a similarity or a dissimilarity

50 Similarity/Dissimilarity for Simple Features
The following table shows the similarity and dissimilarity between two objects, x and y, with respect to a single, simple attribute.

51 Euclidean Distance
dist(p, q) = sqrt( sum_{k=1}^{n} (p_k - q_k)^2 )
where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Standardization is necessary if scales differ.

52 Euclidean Distance: Distance Matrix
[Figure: example points and their pairwise Euclidean distance matrix]

53 Minkowski Distance
Minkowski distance is a generalization of Euclidean distance:
dist(p, q) = ( sum_{k=1}^{n} |p_k - q_k|^r )^(1/r)
where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
- r = 1: Manhattan distance
- r = 2: Euclidean distance
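A minimal NumPy sketch of Minkowski distance; r = 1 reproduces Manhattan and r = 2 Euclidean, matching the definitions above (the two points are made up):

```python
import numpy as np

def minkowski(p, q, r):
    # ( sum_k |p_k - q_k|^r )^(1/r)
    return (np.abs(p - q) ** r).sum() ** (1 / r)

p, q = np.array([0.0, 2.0]), np.array([3.0, 1.0])
print(minkowski(p, q, 1))  # Manhattan: 4.0
print(minkowski(p, q, 2))  # Euclidean: sqrt(10) ~ 3.162
```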

54 Similarity Between Binary Vectors
A common situation is that objects p and q have only binary attributes.
Compute similarities using the following quantities:
- M01 = the number of attributes where p was 0 and q was 1
- M10 = the number of attributes where p was 1 and q was 0
- M00 = the number of attributes where p was 0 and q was 0
- M11 = the number of attributes where p was 1 and q was 1
Simple Matching and Jaccard coefficients:
- SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00)
- J = number of 1-1 matches / number of not-both-zero attribute values = M11 / (M01 + M10 + M11)
  - Useful when almost all values are 0, since SMC would then always be close to 1

55 SMC versus Jaccard: Example
p = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
- M01 = 2 (the number of attributes where p was 0 and q was 1)
- M10 = 1 (the number of attributes where p was 1 and q was 0)
- M00 = 7 (the number of attributes where p was 0 and q was 0)
- M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
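The same computation in NumPy, as a quick check of the slide's numbers:

```python
import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

m11 = ((p == 1) & (q == 1)).sum()
m00 = ((p == 0) & (q == 0)).sum()
smc = (m11 + m00) / len(p)      # (0 + 7) / 10 = 0.7
jaccard = m11 / (len(p) - m00)  # 0 / 3 = 0.0
print(smc, jaccard)
```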

56 Cosine Similarity
If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||)
where . indicates the vector dot (inner) product and ||d|| is the length of vector d (i.e., the square root of the vector's dot product with itself).
We compute the cosine of the angle between the two vectors:
- cos(0°) = 1 means the vectors are maximally similar
- cos(90°) = 0 means the vectors are totally dissimilar
Example:
d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
d1 . d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.48
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = 6^0.5 = 2.45
cos(d1, d2) = 5 / (6.48 * 2.45) = 0.3150
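The same computation in NumPy:

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])
cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))  # 0.315
```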

57 Cosine Similarity for Text
- Cosine similarity is used for text matching; the vectors are term frequencies.
- Cosine similarity then varies from 0 to 1, since negative values are not possible.

58 Correlation
Measures the linear relationship between objects (that is, between their attributes).
Based on the covariance and the standard deviations:
corr(x, y) = covariance(x, y) / (SD(x) * SD(y))
where covariance(x, y) = (1/(n-1)) * sum_{i=1}^{n} (x_i - x̄)(y_i - ȳ), and the bar over a variable indicates its mean.
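A minimal NumPy sketch; np.corrcoef implements exactly this ratio (the two made-up series are roughly linearly related):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # roughly y = 2x
print(np.corrcoef(x, y)[0, 1])           # close to +1
```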

59 Correlation (cont.)
- Correlation is always in the range -1 to +1.
  - +1 (-1) means a perfect positive (negative) linear relationship.
- There are also metrics for rank correlation, which compare rankings.
  - Useful if your data mining output is a ranking: you can compare it to the ground truth.

60 Visually Evaluating Correlation
[Figure: scatter plots of data with correlations ranging from -1 to 1]

61 Statistics versus Data Mining

62 Data Mining and Statistics: What's the Connection?
A 1997 paper by Jerome H. Friedman.
- What did you think about the article?
- Did it help you understand the difference between these two fields?
- What comments or points resonated with you?
- Do you think this old article is extremely dated, or are its points still relevant?

63 Key Points in the Friedman Article
- Data mining is a vaguely defined field
  - Numerous definitions
  - Defined partially by its methods (decision trees, neural nets, nearest neighbor)
  - Associated with commercial products
- Largely a commercial enterprise, selling hardware and software
  - Initially, hardware manufacturers were in the DM business to sell hardware (e.g., Silicon Graphics, IBM)

64 Friedman: Contents of DM Packages
Data mining packages:
- Easy-to-use graphical interfaces and graphical output of models and results
- Include certain methods: classification, clustering, association analysis
Data mining packages do not include:
- Hypothesis testing and experimental design
- ANOVA, linear/logistic regression (no longer true today), factor analysis

65 Friedman: Is DM an Intellectual Discipline?
- According to Friedman, only the methodology is an intellectual discipline
- But he expected DM to become one in the future, as computer power increases
- We can probably conclude it is one now, though perhaps not as coherent as other disciplines
- How would you describe it as an intellectual discipline? I would say that it emphasizes computational methods to analyze data

66 Friedman: Should Statistics Include DM?
If DM were part of statistics, then:
- DM would be published in statistics journals
- DM would be taught in statistics departments
The answer was unclear (then). Many disciplines started in statistics and perhaps could have stayed there: pattern recognition, machine learning, neural networks, data visualization.

67 Friedman: What Is Statistics?
- Statistics was defined by a set of tools: probability theory, real analysis, decision theory
  - Probabilistic inference based on mathematics
  - Some felt that statistics should remain focused on this area of success
  - Statistics journals required proofs; DM rarely uses proofs
- Statistics was not defined by a set of problems, namely data analysis
  - Methods not based on probability were not included
  - This is an odd way to define a field; most disciplines are defined by the set of problems they address

68 What Happened to Statistics?
Friedman raised issues but did not say what should happen; he listed possible futures and their implications. What actually happened?
- Clearly, DM took off on its own, largely defined by the problems it can address
- It includes statistics, but with a preference for computational/algorithmic methods
- It may diminish in time if data science takes over
- As of 2018, statistics programs still did not incorporate data mining (but did require programming)

