CSCI 347 / CS 4206: Data Mining Module 02: Input Topic 03: Attribute Characteristics.

Slides:



Advertisements
Similar presentations
Molecular Biomedical Informatics 分子生醫資訊實驗室 Machine Learning and Bioinformatics 機器學習與生物資訊學 Machine Learning & Bioinformatics 1.
Advertisements

Rule extraction in neural networks. A survey. Krzysztof Mossakowski Faculty of Mathematics and Information Science Warsaw University of Technology.
Data Mining Methodology 1. Why have a Methodology  Don’t want to learn things that aren’t true May not represent any underlying reality ○ Spurious correlation.
Statistics It is the science of planning studies and experiments, obtaining sample data, and then organizing, summarizing, analyzing, interpreting data,
TYPES OF DATA. Qualitative vs. Quantitative Data A qualitative variable is one in which the “true” or naturally occurring levels or categories taken by.
Math 2260 Introduction to Probability and Statistics
Introduction to Statistics Quantitative Methods in HPELS 440:210.
Data preprocessing before classification In Kennedy et al.: “Solving data mining problems”
Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall.
Introduction to Quantitative Research
Lecture Notes for Chapter 2 Introduction to Data Mining
Algorithms: The basic methods. Inferring rudimentary rules Simplicity first Simple algorithms often work surprisingly well Many different kinds of simple.
Measurement Cal State Northridge  320 Andrew Ainsworth PhD.
Chapter 3 Data Issues. What is a Data Set? Attributes (describe objects) Variable, field, characteristic, feature or observation Objects (have attributes)
Data Basics. Data Matrix Many datasets can be represented as a data matrix. Rows corresponding to entities Columns represents attributes. N: size of the.
SOWK 6003 Social Work Research Week 10 Quantitative Data Analysis
Input: Concepts, Attributes, Instances. 2 Module Outline  Terminology  What’s a concept?  Classification, association, clustering, numeric prediction.
2015年7月2日星期四 2015年7月2日星期四 2015年7月2日星期四 Data Mining: Concepts and Techniques1 Data Transformation and Feature Selection/Extraction Qiang Yang Thanks: J.
Data Mining Techniques
McGraw-Hill/Irwin © 2004 by The McGraw-Hill Companies, Inc. All rights reserved. Chapter 9 Processing the Data.
Module 04: Algorithms Topic 07: Instance-Based Learning
Data Presentation.
Graphical Analysis. Why Graph Data? Graphical methods Require very little training Easy to use Massive amounts of data can be presented more readily Can.
Statistical Process Control Chapters A B C D E F G H.
1 Statistics 202: Statistical Aspects of Data Mining Professor David Mease Tuesday, Thursday 9:00-10:15 AM Terman 156 Lecture 2 = Start chapter 2 Agenda:
Sections 1-3 Types of Data. PARAMETERS AND STATISTICS Parameter: a numerical measurement describing some characteristic of a population. Statistic: a.
Data Mining – Input: Concepts, instances, attributes Chapter 2.
Data Mining Practical Machine Learning Tools and Techniques Chapter 2: Input: Concepts, Instances and Attributes Rodney Nielsen Many of these slides were.
Slides for “Data Mining” by I. H. Witten and E. Frank.
Data Mining & Knowledge Discovery Lecture: 2 Dr. Mohammad Abu Yousuf IIT, JU.
1 Data preparation: Selection, Preprocessing, and Transformation Literature: Literature: I.H. Witten and E. Frank, Data Mining, chapter 2 and chapter 7.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining Basics: Data Remark: Discusses “basics concerning data sets (first half of Chapter.
Quantitative Analysis. Quantitative / Formal Methods objective measurement systems graphical methods statistical procedures.
Descriptive Statistics becoming familiar with the data.
V. Rouillard  Introduction to measurement and statistical analysis GRAPHICAL PRESENTATION OF EXPERIMENTAL DATA It is nearly always the case that.
Chapter 1 Introduction to Statistics. Statistical Methods Were developed to serve a purpose Were developed to serve a purpose The purpose for each statistical.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach,
1 Data Mining: Data Lecture Notes for Chapter 2. 2 What is Data? l Collection of data objects and their attributes l An attribute is a property or characteristic.
 Classification 1. 2  Task: Given a set of pre-classified examples, build a model or classifier to classify new cases.  Supervised learning: classes.
Algorithms for Classification: The Basic Methods.
Data Mining – Algorithms: Naïve Bayes Chapter 4, Section 4.2.
Data Mining Practical Machine Learning Tools and Techniques Chapter 4: Algorithms: The Basic Methods Sections 4.1 Inferring Rudimentary Rules Rodney Nielsen.
1 What is Data? l An attribute is a property or characteristic of an object l Examples: eye color of a person, temperature, etc. l Attribute is also known.
Anthony J Greene1 Central Tendency 1.Mean Population Vs. Sample Mean 2.Median 3.Mode 1.Describing a Distribution in Terms of Central Tendency 2.Differences.
Data Science Input: Concepts, Instances and Attributes WFH: Data Mining, Chapter 2 Rodney Nielsen Many/most of these slides were adapted from: I. H. Witten,
Data Mining What is to be done before we get to Data Mining?
1 By maintaining a good heart at every moment, every day is a good day. If we always have good thoughts, then any time, any thing or any location is auspicious.
Data Mining Chapter 4 Algorithms: The Basic Methods Reporter: Yuen-Kuei Hsueh.
2 NURS/HSCI 597 NURSING RESEARCH & DATA ANALYSIS GEORGE MASON UNIVERSITY.
Data Entry, Coding & Cleaning SPSS Training Thomas Joshua, MS July, 2008.
Machine Learning in Practice Lecture 4 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute.
Data Mining Practical Machine Learning Tools and Techniques
Data Preliminaries CSC 600: Data Mining Class 1.
Data Mining – Input: Concepts, instances, attributes
Introduction to Quantitative Research
Elementary Statistics
Populations.
Lecture Notes for Chapter 2 Introduction to Data Mining
Data Science Input: Concepts, Instances and Attributes WFH: Data Mining, Chapter 2 Rodney Nielsen Many/most of these slides were adapted from: I. H.
CSE 711: DATA MINING Sargur N. Srihari Phone: , ext. 113.
Classification Algorithms
Creating a Codebook.
Data Transformations targeted at minimizing experimental variance
Lecture 1: Descriptive Statistics and Exploratory
Group 9 – Data Mining: Data
Data Preliminaries CSC 576: Data Mining.
Data Pre-processing Lecture Notes for Chapter 2
Data Mining CSCI 307 Spring, 2019
Biostatistics Lecture (2).
Data Mining CSCI 307, Spring 2019 Lecture 6
Presentation transcript:

CSCI 347 / CS 4206: Data Mining Module 02: Input Topic 03: Attribute Characteristics

What’s in an attribute? Each instance is described by a fixed predefined set of features, its “attributes” But: number of attributes may vary in practice Possible solution: “irrelevant value” flag Related problem: existence of an attribute may depend of value of another one Possible attribute types (“levels of measurement”): Nominal, ordinal, interval and ratio 2

Nominal quantities Values are distinct symbols Values themselves serve only as labels or names Nominal comes from the Latin word for name Example: attribute “outlook” from weather data Values: “sunny”, “overcast”, and “rainy” No relation is implied among nominal values (no ordering or distance measure) Only equality tests can be performed 3

Ordinal quantities Impose order on values But: no distance between values defined Example: attribute “temperature” in weather data Values: “hot” > “mild” > “cool” Note: addition and subtraction don’t make sense Example rule: If temperature < hot Then pl ay = yes Distinction between nominal and ordinal not always clear (e.g. attribute “outlook”) 4

Interval quantities Interval quantities are not only ordered but measured in fixed and equal units Example 1: attribute “temperature” expressed in degrees Fahrenheit Example 2: attribute “year” Difference of two values makes sense Sum or product doesn’t make sense Zero point is not defined! 5

Ratio quantities ● Ratio quantities are ones for which the measurement scheme defines a zero point ● Example: attribute “distance”  Distance between an object and itself is zero ● Ratio quantities are treated as real numbers  All mathematical operations are allowed 6

Attribute types used in practice Most schemes accommodate just two levels of measurement: nominal and ordinal Nominal attributes are also called “categorical”, “enumerated”, or “discrete” “enumerated” and “discrete” unnecessarily imply order Special case: dichotomy (“boolean” attribute) Ordinal attributes are called “numeric”, or “continuous” “continuous” unnecessarily implies mathematical continuity 7

Metadata Information about the data that encodes background knowledge Can be used to restrict search space Examples: Dimensional considerations (i.e. expressions must be dimensionally correct) Circular orderings (e.g. degrees in compass) Partial orderings (e.g. generalization/specialization relations) 8

Nominal vs. ordinal ● Attribute “age” nominal ● Attribute “age” ordinal (e.g. “young” < “pre-presbyopic” < “presbyopic”) 9 If age = young and astigmatic = no and tear production rate = normal then recommendation = soft If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft If age  pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft

Missing values Frequently indicated by out-of-range entries Types: unknown, unrecorded, irrelevant Reasons: Malfunctioning equipment Changes in experimental design Collation of different datasets Measurement not possible Missing value may have significance in itself (e.g. missing test in a medical examination) Most algorithms don’t assume significance: “missing” may need to be coded as additional value 10

Inaccurate values Reason: data has not been collected for mining it Result: errors and omissions that don’t affect original purpose of data (e.g. age of customer) Typographical errors in nominal attributes  values need to be checked for consistency Typographical and measurement errors in numeric attributes  outliers need to be identified Errors may be deliberate (e.g. wrong phone numbers) Other problems: duplicates, stale data 11

Getting to know the data Simple visualization tools are very useful Nominal attributes: histograms (Is distribution consistent with background knowledge?) Numeric attributes: graphs (Any obvious outliers?) 2-D and 3-D plots show dependencies Need to consult domain experts Too much data to inspect? Take a sample! 12

The Mystery Sound  Can you tell what this sound is? 13