Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall.

Slides:



Advertisements
Similar presentations
Comp3776: Data Mining and Text Analytics Intro to Data Mining By Eric Atwell, School of Computing, University of Leeds (including re-use of teaching resources.
Advertisements

Molecular Biomedical Informatics 分子生醫資訊實驗室 Machine Learning and Bioinformatics 機器學習與生物資訊學 Machine Learning & Bioinformatics 1.
Demo: Classification Programs C4.5 CBA Minqing Hu CS594 Fall 2003 UIC.
Classification Techniques: Decision Tree Learning
Data preprocessing before classification In Kennedy et al.: “Solving data mining problems”
1 Input and Output Thanks: I. Witten and E. Frank.
CSCI 347 / CS 4206: Data Mining Module 02: Input Topic 03: Attribute Characteristics.
Introduction to Data Mining with XLMiner
Lecture Notes for Chapter 2 Introduction to Data Mining
Algorithms: The basic methods. Inferring rudimentary rules Simplicity first Simple algorithms often work surprisingly well Many different kinds of simple.
Chapter 3 Data Issues. What is a Data Set? Attributes (describe objects) Variable, field, characteristic, feature or observation Objects (have attributes)
Department of Computer Science, University of Waikato, New Zealand Eibe Frank WEKA: A Machine Learning Toolkit The Explorer Classification and Regression.
© Vipin Kumar CSci 8980 Fall CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance Computing Research Center Department of Computer.
Input: Concepts, Attributes, Instances. 2 Module Outline  Terminology  What’s a concept?  Classification, association, clustering, numeric prediction.
2015年7月2日星期四 2015年7月2日星期四 2015年7月2日星期四 Data Mining: Concepts and Techniques1 Data Transformation and Feature Selection/Extraction Qiang Yang Thanks: J.
Fall 2004Data Mining1 IE 483/583 Knowledge Discovery and Data Mining Dr. Siggi Olafsson Fall 2003.
Chapter 6 Decision Trees
Data Mining – Algorithms: OneR Chapter 4, Section 4.1.
 The Weka The Weka is an well known bird of New Zealand..  W(aikato) E(nvironment) for K(nowlegde) A(nalysis)  Developed by the University of Waikato.
McGraw-Hill/Irwin © 2004 by The McGraw-Hill Companies, Inc. All rights reserved. Chapter 9 Processing the Data.
Berendt: Knowledge and the Web, 2014, 1 Knowledge and the Web – Inferring new knowledge from data(bases):
Department of Computer Science, University of Waikato, New Zealand Geoff Holmes WEKA project and team Data Mining process Data format Preprocessing Classification.
Data Mining Joyeeta Dutta-Moscato July 10, Wherever we have large amounts of data, we have the need for building systems capable of learning information.
Short Introduction to Machine Learning Instructor: Rada Mihalcea.
Appendix: The WEKA Data Mining Software
Data Mining – Input: Concepts, instances, attributes Chapter 2.
Data Mining Practical Machine Learning Tools and Techniques Chapter 2: Input: Concepts, Instances and Attributes Rodney Nielsen Many of these slides were.
Slides for “Data Mining” by I. H. Witten and E. Frank.
Data Mining & Knowledge Discovery Lecture: 2 Dr. Mohammad Abu Yousuf IIT, JU.
Classification I. 2 The Task Input: Collection of instances with a set of attributes x and a special nominal attribute Y called class attribute Output:
Basic Data Mining Technique
Data Mining: Classification & Predication Hosam Al-Samarraie, PhD. Centre for Instructional Technology & Multimedia Universiti Sains Malaysia.
1 Data preparation: Selection, Preprocessing, and Transformation Literature: Literature: I.H. Witten and E. Frank, Data Mining, chapter 2 and chapter 7.
1 Knowledge Discovery Transparencies prepared by Ho Tu Bao [JAIST] ITCS 6162.
Data Mining Practical Machine Learning Tools and Techniques Chapter 4: Algorithms: The Basic Methods Section 4.4: Covering Algorithms Rodney Nielsen Many.
1 CSE 711: DATA MINING Sargur N. Srihari Phone: , ext. 113.
1 1 Slide Using Weka. 2 2 Slide Data Mining Using Weka n What’s Data Mining? We are overwhelmed with data We are overwhelmed with data Data mining is.
1 Data Mining: Data Lecture Notes for Chapter 2. 2 What is Data? l Collection of data objects and their attributes l An attribute is a property or characteristic.
1Weka Tutorial 5 - Association © 2009 – Mark Polczynski Weka Tutorial 5 – Association Technology Forge Version 0.1 ?
Data Mining Practical Machine Learning Tools and Techniques Chapter 4: Algorithms: The Basic Methods Section 4.2 Statistical Modeling Rodney Nielsen Many.
 Classification 1. 2  Task: Given a set of pre-classified examples, build a model or classifier to classify new cases.  Supervised learning: classes.
Algorithms for Classification: The Basic Methods.
Data Mining – Algorithms: Naïve Bayes Chapter 4, Section 4.2.
Data Mining Practical Machine Learning Tools and Techniques Chapter 4: Algorithms: The Basic Methods Sections 4.1 Inferring Rudimentary Rules Rodney Nielsen.
Data Mining Practical Machine Learning Tools and Techniques Chapter 4: Algorithms: The Basic Methods Section 4.5: Mining Association Rules Rodney Nielsen.
1 What is Data? l An attribute is a property or characteristic of an object l Examples: eye color of a person, temperature, etc. l Attribute is also known.
Data Mining By Farzana Forhad CS 157B. Agenda Decision Tree and ID3 Rough Set Theory Clustering.
Data Science Input: Concepts, Instances and Attributes WFH: Data Mining, Chapter 2 Rodney Nielsen Many/most of these slides were adapted from: I. H. Witten,
DATA MINING TECHNIQUES (DECISION TREES ) Presented by: Shweta Ghate MIT College OF Engineering.
Review of Decision Tree Learning Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.
Data Mining Practical Machine Learning Tools and Techniques Chapter 6.3: Association Rules Rodney Nielsen Many / most of these slides were adapted from:
Department of Computer Science, University of Waikato, New Zealand Geoff Holmes WEKA project and team Data Mining process Data format Preprocessing Classification.
Machine Learning in Practice Lecture 4 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute.
Data Mining Practical Machine Learning Tools and Techniques
Data Preliminaries CSC 600: Data Mining Class 1.
Data Mining – Input: Concepts, instances, attributes
Data Science Algorithms: The Basic Methods
Lecture Notes for Chapter 2 Introduction to Data Mining
Prepared by: Mahmoud Rafeek Al-Farra
Data Science Algorithms: The Basic Methods
Data Science Input: Concepts, Instances and Attributes WFH: Data Mining, Chapter 2 Rodney Nielsen Many/most of these slides were adapted from: I. H.
CSE 711: DATA MINING Sargur N. Srihari Phone: , ext. 113.
Machine Learning Techniques for Data Mining
Clustering.
Lecture 1: Descriptive Statistics and Exploratory
Data Preliminaries CSC 576: Data Mining.
Data Pre-processing Lecture Notes for Chapter 2
Data Mining CSCI 307 Spring, 2019
Data Mining CSCI 307, Spring 2019 Lecture 18
Data Mining CSCI 307, Spring 2019 Lecture 6
Presentation transcript:

Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall

2 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) Input: Concepts, instances, attributes Terminology ● What’s a concept?  Classification, association, clustering, numeric prediction ● What’s in an example?  Relations, flat files, recursion ● What’s in an attribute?  Nominal, ordinal, interval, ratio ● Preparing the input  ARFF, attributes, missing values, getting to know data

3 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) Terminology ● Components of the input:  Concepts: kinds of things that can be learned ● Aim: intelligible and operational concept description  Instances: independent examples of a concept ● Note: more complicated forms of input are possible  Attributes: measuring aspects of an instance ● We will focus on nominal and numeric ones

4 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) What’s a concept? ● Styles of learning: Classification learning: predicting a discrete class Association learning: detecting associations between features Clustering: grouping similar instances into clusters Numeric prediction: predicting a numeric quantity ● Concept: thing to be learned ● Concept model/hypothesi: ● output of learning scheme

5 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) What’s in an attribute? ● Each instance is described by a fixed predefined set of features, its “attributes” ● Possible attribute types (“levels of measurement”):  Nominal  Ordinal  Interval

6 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) Nominal quantities Values are distinct symbols  Values themselves serve only as labels or names  Nominal comes from the Latin word for name Example: attribute “outlook” from weather data  Values: “sunny”,”overcast”, and “rainy” No relation is implied among nominal values (no ordering or distance measure) ● Only equality tests can be performed

7 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) Ordinal quantities Imposes order on values, but: no distance between values defined ● Example: attribute “temperature” in weather data  Values: “hot” > “mild” > “cool” ● Example rule: temperature < hot  play = yes ● Distinction between nominal and ordinal not always clear (e.g. attribute “outlook”)

8 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) Interval quantities Interval quantities are not only ordered but measured in fixed and equal units ● Example 1: attribute “temperature” expressed in degrees Fahrenheit ● Example 2: attribute “year” ● Difference of two values makes sense ● Sum or product doesn’t make sense  Zero point is not defined!

9 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) Attribute types used in practice Most schemes accommodate just two levels of measurement: nominal and ordinal Nominal attributes are also called “categorical”, ”enumerated”, or “discrete”  But: “enumerated” and “discrete” imply order ● Special case: dichotomy (“boolean” attribute) Ordinal attributes are called “numeric”, or “continuous”  But: “continuous” implies mathematical continuity

10 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) Metadata Information about the data that encodes background knowledge: Source Date recorded Type Range Resolution

11 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) The ARFF format % % ARFF file for weather data with some numeric features outlook {sunny, overcast, temperature humidity windy {true, play? {yes, sunny, 85, 85, false, no sunny, 80, 90, true, no overcast, 83, 86, false, yes...

12 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) Additional attribute types ● ARFF supports string attributes:  Similar to nominal attributes but list of values is not pre-specified ● It also supports date attributes:  Uses the ISO-8601 combined date and time format description today date

13 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) Sparse data ● In some applications most attribute values in a dataset are zero  E.g.: word counts in a text categorization problem ● ARFF supports sparse data ● This also works for nominal attributes (where the first value corresponds to “zero”) 0, 26, 0, 0, 0,0, 63, 0, 0, 0, “class A” 0, 0, 0, 42, 0, 0, 0, 0, 0, 0, “class B” {1 26, 6 63, 10 “class A”} {3 42, 10 “class B”}

14 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) Nominal vs. ordinal ● Attribute “age” nominal ● Attribute “age” ordinal (e.g. “young” < “pre-presbyopic” < “presbyopic”) If age = young and astigmatic = no and tear production rate = normal then recommendation = soft If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft If age  pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft

15 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) Missing values Frequently indicated by out-of-range entries  Types: unknown, unrecorded, irrelevant  Reasons: ● malfunctioning equipment ● changes in experimental design ● collation of different datasets ● measurement not possible ● Missing value may have significance in itself  May need to be “imputed” or estimated  Instances or attributes may need to be removed

16 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) Inaccurate values ● Often data has not been collected for data analytics ● Result: garbage in  garbage out ● Reasons for errors: ● Typographical errors in nominal  edits needed ● Measurement errors in numeric attributes  outliers need to be identified ● Equipment failure  regular calibrations important ● Errors may be deliberate (e.g. wrong zip codes) ● Other problems: duplicate records, stale data

17 Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) Getting to know the data ● Simple visualization tools are very useful  Nominal attributes: histograms  Is the distribution consistent with background knowledge?  Numeric attributes: graphs  Any obvious outliers? ● 2-D and 3-D plots show dependencies ● Need to consult domain experts ● Too much data to inspect? Take a sample!