
1 Data Science Input: Concepts, Instances and Attributes WFH: Data Mining, Chapter 2
Rodney Nielsen Many/most of these slides were adapted from: I. H. Witten, E. Frank and M. A. Hall

2 Input: Concepts, Instances, Attributes
Preparing the input
Missing values
Getting to know the data

3 Metadata Information about the data that encodes background knowledge
Can be used to restrict the search space
Examples:
Dimensional considerations (i.e. expressions must be dimensionally correct)
Circular orderings (e.g. degrees on a compass)
Partial orderings (e.g. generalization/specialization relations)

4 Preparing the Input Denormalization Other Issues
Problem: different data sources (e.g. sales department, customer billing department, …)
Differences: styles of record keeping, conventions, time periods, data aggregation, primary keys, errors
Data must be assembled, integrated, cleaned up
“Data warehouse”: consistent point of access
External data may be required (“overlay data”)
Critical: type and level of data aggregation

5 The ARFF Format
%
% ARFF file for weather data with some numeric features
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}

@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...

6 Sparse Data
In some applications most attribute values in a dataset are zero
E.g.: word counts in a text categorization problem
ARFF supports sparse data:
0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"
0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"
become
{1 26, 6 63, 10 "class A"}
{3 42, 10 "class B"}
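
A minimal sketch in plain Python (not part of the original slides; the helper name is made up) of how a dense row maps to ARFF's sparse-instance notation, which lists only the index/value pairs of non-zero entries:

    # Convert a dense row to ARFF sparse notation: {index value, index value, ...}
    # Assumes the class label is the last attribute and is never the value 0.
    def to_sparse_arff(row):
        pairs = [(i, v) for i, v in enumerate(row) if v not in (0, "0")]
        return "{" + ", ".join(f"{i} {v}" for i, v in pairs) + "}"

    print(to_sparse_arff([0, 26, 0, 0, 0, 0, 63, 0, 0, 0, '"class A"']))
    # -> {1 26, 6 63, 10 "class A"}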

7 Missing Values Frequently indicated by out-of-range entries
Types: unknown, unrecorded, irrelevant
Reasons:
Malfunctioning equipment
Changes in experimental design
Collation of different datasets
Measurement not possible
Missing value may have significance in itself (e.g. missing test in a medical examination)
Most schemes assume there are no missing values
Might need to be coded as an additional value
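
As a hedged illustration of the last point, assuming pandas and a made-up column, missing nominal values can be coded as an explicit extra value so that "missingness" is visible to the learner:

    import pandas as pd

    df = pd.DataFrame({"outlook": ["sunny", None, "rainy", "overcast", None]})
    # Code missing entries as an additional value rather than leaving them blank,
    # since the fact that a value is missing may itself be significant.
    df["outlook"] = df["outlook"].fillna("missing")
    print(df["outlook"].value_counts())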

8 Inaccurate Values Reason: data has not been collected for mining
Result: errors and omissions that don’t affect the original purpose of the data (e.g. age of customer)
Typographical errors in nominal attributes → values need to be checked for consistency
Typographical and measurement errors in numeric attributes → outliers need to be identified
Errors may be deliberate (e.g. wrong zip codes)
Other problems: duplicates, stale data

9 Getting to Know the Data
Simple visualization tools are very useful
Nominal attributes: histograms (Distribution consistent with background knowledge?)
Numeric attributes: graphs (Any obvious outliers?)
2-D and 3-D plots show dependencies
Need to consult domain experts
Too much data to inspect? Take a sample!
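
A quick sketch of these checks, assuming pandas and matplotlib and hypothetical file/column names: a bar chart for a nominal attribute, a 2-D scatter plot for two numeric ones, and sampling when there is too much data to inspect:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("weather.csv")                  # hypothetical dataset
    if len(df) > 100_000:                            # too much data? take a sample
        df = df.sample(n=100_000, random_state=0)

    df["outlook"].value_counts().plot(kind="bar")    # nominal: histogram of counts
    plt.show()

    df.plot.scatter(x="temperature", y="humidity")   # numeric: any obvious outliers?
    plt.show()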

10 Data Transformations
Attribute selection: scheme-independent, scheme-specific
Dirty data: data cleansing, robust regression, anomaly detection

11 Just Apply a Learner? Scheme/parameter selection
Treat the selection process as part of the learning process
Modifying/creating the input: feature engineering to make learning possible or easier

12 Attribute Selection
Adding a random (i.e. irrelevant) attribute can significantly degrade C4.5’s performance
Problem: attribute selection based on smaller and smaller amounts of data
IBL very susceptible to irrelevant attributes
Number of training instances required increases exponentially with the number of irrelevant attributes
Naïve Bayes doesn’t have this problem
Relevant attributes can also be harmful

13 Scheme-independent Attribute Selection
Filter approach: assess attributes based on general characteristics of the data
One method: find the smallest subset of attributes that separates the data
Another method: use a different learning scheme, e.g. use the attributes selected by C4.5, or the coefficients of a linear model, possibly applied recursively (recursive feature elimination)
IBL-based attribute weighting techniques: can’t easily find redundant attributes
Correlation-based Feature Selection (CFS):
Correlation between attributes measured by symmetric uncertainty: U(A, B) = 2 [H(A) + H(B) − H(A, B)] / [H(A) + H(B)]
Goodness of a subset of attributes measured by Σ_j U(A_j, C) / √(Σ_i Σ_j U(A_i, A_j)), breaking ties in favor of smaller subsets
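
A rough sketch of these two quantities, assuming NumPy and nominal (hashable) attribute values; it is an illustration of the formulas above, not WEKA's implementation:

    import numpy as np
    from collections import Counter

    def entropy(values):
        counts = np.array(list(Counter(values).values()), dtype=float)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    def symmetric_uncertainty(a, b):
        # U(A, B) = 2 [H(A) + H(B) - H(A, B)] / [H(A) + H(B)]
        h_ab = entropy(list(zip(a, b)))
        return 2.0 * (entropy(a) + entropy(b) - h_ab) / (entropy(a) + entropy(b))

    def cfs_merit(attribute_columns, class_column):
        # Attribute-class correlations over the square root of the summed
        # attribute-attribute correlations (larger is better).
        num = sum(symmetric_uncertainty(a, class_column) for a in attribute_columns)
        den = np.sqrt(sum(symmetric_uncertainty(a, b)
                          for a in attribute_columns for b in attribute_columns))
        return num / den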

14 Attribute Subsets for Weather Data

15 Searching Attribute Space
Number of attribute subsets is exponential in the number of attributes
Common greedy approaches:
forward selection
backward elimination
More sophisticated strategies:
Bidirectional search
Best-first search: can find the optimum solution
Beam search: approximation to best-first search
Genetic algorithms
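
A bare-bones Python sketch of greedy forward selection; `evaluate` is a placeholder for any subset-goodness measure (for instance the CFS merit above, or the cross-validation evaluator on the next slide):

    def forward_selection(all_attributes, evaluate):
        # Greedily add the attribute that most improves evaluate(subset);
        # stop as soon as no single addition helps.
        selected, best_score = [], float("-inf")
        improved = True
        while improved:
            improved = False
            for a in set(all_attributes) - set(selected):
                score = evaluate(selected + [a])
                if score > best_score:
                    best_score, best_attr, improved = score, a, True
            if improved:
                selected.append(best_attr)
        return selected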

16 Scheme-specific Selection
Wrapper approach to attribute selection: implement a “wrapper” around the learning scheme
Evaluation criterion: cross-validation performance
Time consuming:
greedy approach, k attributes → k² × time
prior ranking of attributes → linear in k
Can use a significance test to stop cross-validation for a subset early if it is unlikely to “win” (race search)
Race search can be used with forward selection, backward elimination, prior ranking, or special-purpose schemata search
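
A hedged sketch of the wrapper idea using scikit-learn (not the book's code): cross-validation accuracy of the chosen learner serves as the subset evaluator that a search such as the forward selection above can call:

    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def make_cv_evaluator(X, y, learner=None, folds=10):
        # X, y are NumPy arrays; the default learner is an arbitrary choice
        learner = learner or DecisionTreeClassifier()
        def evaluate(attribute_indices):
            if not attribute_indices:
                return float("-inf")
            # Cross-validation performance of the learner on just these attributes
            return cross_val_score(learner, X[:, attribute_indices], y, cv=folds).mean()
        return evaluate

    # e.g.: selected = forward_selection(range(X.shape[1]), make_cv_evaluator(X, y))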

17 Student Questions: Feature Selection
In attribute selection, fewer attributes mean a smaller space to search during model construction and fewer opportunities to make wrong decisions and arrive at misleading, insufficiently justified generalizations. Computationally, though, wouldn't the actual model building on big data be demanding, given that this approach is motivated by computational savings?

18 Student Questions: Feature Engineering
Throughout Section 7.1, selecting which attributes to use in the machine learning algorithm is discussed in detail, but I am curious whether there is a specific way to generate attributes or whether that is wholly decided by the data scientist and specialists in the field.

19 Automatic Data Cleansing
To improve a decision tree: remove misclassified instances, then re-learn!
Better (of course!): a human expert checks the misclassified instances
Attribute noise vs. class noise:
Attribute noise should be left in the training set (When? Pros and cons?)
Don’t train on a clean set and test on a dirty one!
Systematic class noise (e.g. one class substituted for another): leave in the training set? (Pros and cons?)
Unsystematic class noise: eliminate from the training set, if possible
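
A small sketch of the first idea, assuming scikit-learn and NumPy arrays: learn a decision tree, drop the training instances it misclassifies, and re-learn on the filtered data:

    from sklearn.tree import DecisionTreeClassifier

    def clean_and_relearn(X, y):
        # X, y are NumPy arrays
        first = DecisionTreeClassifier().fit(X, y)
        keep = first.predict(X) == y               # instances the first tree gets right
        return DecisionTreeClassifier().fit(X[keep], y[keep])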

20 Robust Regression
“Robust” statistical method → one that addresses the problem of outliers
To make regression more robust:
Minimize absolute error, not squared error
Remove outliers (e.g. the 10% of points farthest from the regression plane)
Minimize the median instead of the mean of squares (copes with outliers in the x and y directions); finds the narrowest strip covering half the observations
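
A minimal NumPy sketch of the outlier-removal option above (an illustration under stated assumptions, not a full robust-regression method): fit an ordinary least-squares line, discard the 10% of points with the largest residuals, and refit:

    import numpy as np

    def trimmed_regression(x, y, trim=0.10):
        slope, intercept = np.polyfit(x, y, 1)                # initial least-squares fit
        residuals = np.abs(y - (slope * x + intercept))
        keep = residuals <= np.quantile(residuals, 1.0 - trim)
        return np.polyfit(x[keep], y[keep], 1)                # refit without the outliers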

21 Detecting Anomalies Visualization can help to detect anomalies
Automatic approach: committee of different learning schemes, e.g.:
decision tree
nearest-neighbor learner
linear discriminant function
Conservative approach: delete instances incorrectly classified by all of them
Problem: might sacrifice instances of small classes
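
A sketch of the conservative committee approach, assuming scikit-learn: an instance is deleted only if every member of a diverse committee misclassifies it:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def committee_filter(X, y):
        committee = [DecisionTreeClassifier(), KNeighborsClassifier(),
                     LinearDiscriminantAnalysis()]
        wrong_for_all = np.ones(len(y), dtype=bool)
        for clf in committee:
            wrong_for_all &= clf.fit(X, y).predict(X) != y
        return X[~wrong_for_all], y[~wrong_for_all]   # delete only unanimous errors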

22 One-Class Learning Usually training data is available for all classes
Some problems exhibit only a single class at training time
Test instances may belong to this class or to a new class not present at training time
One-class classification: predict either target or unknown
Some problems can be re-formulated into two-class ones
Other applications truly don't have negative data

23 Outlier Detection
Outlier/novelty detection is sometimes called one-class classification
Generic approach: identify outliers as instances that lie beyond distance d from percentage p of the training data
Alternatively, estimate the density of the target class and mark low-probability test instances as outliers
Threshold can be adjusted to obtain a suitable rate of outliers
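
A sketch of the density-estimation variant, assuming scikit-learn's kernel density estimator: fit the density of the target class, then flag test instances whose estimated log-density falls below a threshold chosen to give a suitable rejection rate:

    import numpy as np
    from sklearn.neighbors import KernelDensity

    def fit_outlier_detector(X_target, outlier_rate=0.05):
        kde = KernelDensity(bandwidth=1.0).fit(X_target)
        # Threshold so that roughly `outlier_rate` of the target data would be rejected
        threshold = np.quantile(kde.score_samples(X_target), outlier_rate)
        return lambda X_test: kde.score_samples(X_test) < threshold   # True = outlier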

24 Generating Artificial Data
Another possibility is to generate artificial data for the outlier class
Can then apply any off-the-shelf classifier
Can tune the rejection-rate threshold if the classifier produces probability estimates
Generate uniformly random data
Curse of dimensionality: as the number of attributes increases, it becomes infeasible to generate enough data to get good coverage of the space
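
A sketch of the artificial-data idea, assuming scikit-learn and NumPy: generate uniformly random "outlier" instances over the observed attribute ranges, then train an ordinary two-class classifier whose probability estimates give a tunable rejection threshold:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def one_class_via_artificial_data(X_target, n_artificial=None, seed=0):
        rng = np.random.default_rng(seed)
        n = n_artificial or len(X_target)
        # Uniformly random instances spanning the observed range of each attribute
        X_fake = rng.uniform(X_target.min(axis=0), X_target.max(axis=0),
                             size=(n, X_target.shape[1]))
        X = np.vstack([X_target, X_fake])
        y = np.concatenate([np.ones(len(X_target)), np.zeros(n)])
        return RandomForestClassifier().fit(X, y)   # use predict_proba to tune rejection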

25 Questions

26 Student Questions
Section 7.1 mentions that there have been attempts to come up with universally acceptable measures or terms for relevance. What would be an example?
How can you systematically add noise to the training set so that it correctly models noise in the test (and hopefully real-world) data?
In robust regression, why is the least-squares regression line the most affected by anomalous data?
Is there a way to remove classification errors in the training set due to noise other than by going through each entry manually?
With regard to sparse data, can't a high enough amount of missing data invalidate any experiments or results gained from the data?
In outlier detection, how do you distinguish an outlier from noisy data?

27 Student Questions
What is the benefit of the ARFF format over other file formats (e.g. XML)?
What are some methods for editing the training set for misspelled words or synonyms?
What is the goal when choosing which attributes to branch on in a decision tree?
Naive Bayes works well with forward selection. How about backward selection?

28 Student Questions
When preparing input, why is external data needed in some cases?
Although instances can be removed after learning from them to reduce overfitting and the complexity of the algorithm, couldn't leaving some of these incorrect instances in the data set cause the wrong attributes to be selected and therefore undermine the method for identifying incorrect instances? I.e., wouldn't leaving incorrect instances in the training set render the methods discussed in Section 7.5 invalid?

