BUS 297D: Data Mining, Professor David Mease, Lecture 1


BUS 297D: Data Mining
Professor David Mease
Lecture 1
Agenda:
1) Go over information on course web page
2) Lecture over Chapter 1
3) Discuss necessary software
4) Lecture over Chapter 2

BUS 297D: Data Mining
Professor David Mease
Course web page and greensheet: http://www.cob.sjsu.edu/mease_d/bus297D/
- This page is linked from my SJSU web page.
- It is also linked from my personal page, www.davemease.com, which is easily found by querying “David Mease” or simply “Mease” on any search engine.

Introduction to Data Mining by Tan, Steinbach, Kumar Chapter 1: Introduction

What is Data Mining?
- Data mining is the process of automatically discovering useful information in large data repositories. (page 2)
- There are many other definitions.

In class exercise #1: Find a different definition of data mining online. How does it compare to the one in the text on the previous slide?

Data Mining Examples and Non-Examples
Data Mining:
- Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
- Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com, etc.)
NOT Data Mining:
- Look up a phone number in a phone directory
- Query a Web search engine for information about “Amazon”

Why Mine Data? Scientific Viewpoint
- Data is collected and stored at enormous speeds (GB/hour):
  - remote sensors on a satellite
  - telescopes scanning the skies
  - microarrays generating gene expression data
  - scientific simulations generating terabytes of data
- Traditional techniques are infeasible for raw data
- Data mining may help scientists:
  - in classifying and segmenting data
  - in hypothesis formation

Why Mine Data? Commercial Viewpoint
- Lots of data is being collected and warehoused:
  - Web data, e-commerce
  - Purchases at department/grocery stores
  - Bank/credit card transactions
- Computers have become cheaper and more powerful
- Competitive pressure is strong:
  - Provide better, customized services for an edge

In class exercise #2: Give an example of something you did yesterday or today which resulted in data which could potentially be mined to discover useful information.

Origins of Data Mining (page 6)
- Draws ideas from machine learning, AI, pattern recognition and statistics
- Traditional techniques may be unsuitable due to:
  - Enormity of data
  - High dimensionality of data
  - Heterogeneous, distributed nature of data
[Figure labels: AI/Machine Learning/Pattern Recognition; Statistics; Data Mining]

2 Types of Data Mining Tasks (page 7)
- Prediction Methods: Use some variables to predict unknown or future values of other variables.
- Description Methods: Find human-interpretable patterns that describe the data.

Examples of Data Mining Tasks
- Classification [Predictive] (Chapters 4, 5)
- Regression [Predictive] (covered in stats classes)
- Visualization [Descriptive] (in Chapter 3)
- Association Analysis [Descriptive] (Chapter 6)
- Clustering [Descriptive] (Chapter 8)
- Anomaly Detection [Descriptive] (Chapter 10)
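The predictive/descriptive split can be illustrated with a minimal R sketch. The built-in `cars` and `iris` data sets, the use of `lm()` for regression and `kmeans()` for clustering, and the value `speed = 21` are all assumptions chosen for illustration; none of them come from the lecture.

```r
# Predictive task (regression): use speed to predict stopping distance.
fit <- lm(dist ~ speed, data = cars)
predicted_dist <- predict(fit, newdata = data.frame(speed = 21))

# Descriptive task (clustering): group flowers by their four measurements.
set.seed(1)  # k-means starts from random centers, so fix the seed
clusters <- kmeans(iris[, 1:4], centers = 3, nstart = 10)
table(clusters$cluster)  # cluster sizes
```

The regression produces a value for an unseen input, while the clustering only summarizes structure already present in the data.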

Software We Will Use:
You should make sure you have access to the following two software packages for this course:
- Microsoft Excel
- R, which can be downloaded from http://cran.r-project.org/ for Windows, Mac or Linux

Downloading R for Windows:


Introduction to Data Mining by Tan, Steinbach, Kumar Chapter 2: Data

What is Data?
- An attribute is a property or characteristic of an object
  - Examples: eye color of a person, temperature, etc.
  - An attribute is also known as a variable, field, characteristic, or feature
- A collection of attributes describes an object
  - An object is also known as a record, point, case, sample, entity, instance, or observation
[Figure labels: Attributes; Objects]
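In R, this vocabulary maps naturally onto a data frame: one row per object and one column per attribute. The sketch below uses made-up values and hypothetical column names (`eye_color`, `temperature`) purely for illustration.

```r
# Each row is an object (record); each column is an attribute (variable).
# The values below are made up for illustration.
people <- data.frame(
  eye_color   = c("brown", "blue", "green"),
  temperature = c(98.6, 99.1, 98.2)
)

nrow(people)    # number of objects
ncol(people)    # number of attributes
names(people)   # attribute names
people[2, ]     # all attributes of the second object
```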

Reading Data into Excel
- Download the data file from the web at http://www.cob.sjsu.edu/mease_d/stats202log.txt
- Then right click on it and select “Open With” then “Choose Program”

Reading Data into Excel
- Choose Excel
- Oops! Everything is in the first column!
- Solution: click on the first column, then select “Data”, then “Text to Columns”

Reading Data into Excel Choose “Delimited” and hit “Next”

Reading Data into Excel Check only “Space” then “Next” again

Reading Data into Excel
Q: Why did this work? Why don’t all spaces cause column splits?
A: The file uses quotes to escape fields that contain spaces, so spaces inside a quoted field do not split columns. (Read http://en.wikipedia.org/wiki/Delimiter for more information.)
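The same quoting behaviour can be checked in R. In the sketch below, `log_line` is a hypothetical entry written in the style of a web server log; the exact contents of the course file are assumed, not quoted. The `quote` argument of `read.table()` controls whether quoted fields are kept together.

```r
# Hypothetical log entry: the request field contains spaces and is quoted.
log_line <- '69.224.117.122 - - [19/Jun/2007:00:31:46 -0400] "GET / HTTP/1.1" 200 2867'

# Quotes respected: spaces inside "GET / HTTP/1.1" do not split the field.
with_quotes <- read.table(text = log_line, sep = " ", quote = "\"",
                          stringsAsFactors = FALSE)

# Quotes ignored: every space starts a new column.
no_quotes <- read.table(text = log_line, sep = " ", quote = "",
                        stringsAsFactors = FALSE)

ncol(with_quotes)  # fewer columns: the request stays in one field
ncol(no_quotes)    # more columns: the request is split into three pieces
```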

Reading Data into R
- Download the data file from the web at http://www.cob.sjsu.edu/mease_d/stats202log.txt
- Set your working directory:
  setwd("C:/Documents and Settings/David/Desktop")
- Read it in:
  data<-read.csv("stats202log.txt", sep=" ",header=F)

Reading Data into R
Look at the first 5 rows:
> data[1:5,]
              V1 V2 V3                    V4     V5                        V6  V7   V8                V9
1 69.224.117.122  -  - [19/Jun/2007:00:31:46 -0400]            GET / HTTP/1.1 200 2867 www.davemease.com
2 69.224.117.122  -  - [19/Jun/2007:00:31:46 -0400]   GET /mease.jpg HTTP/1.1 200 4583 www.davemease.com
3 69.224.117.122  -  - [19/Jun/2007:00:31:46 -0400] GET /favicon.ico HTTP/1.1 404 2295 www.davemease.com
4 128.12.159.164  -  - [19/Jun/2007:02:50:41 -0400]            GET / HTTP/1.1 200 2867 www.davemease.com
5 128.12.159.164  -  - [19/Jun/2007:02:50:42 -0400]   GET /mease.jpg HTTP/1.1 200 4583 www.davemease.com
                                                             V10                                                                                        V11 V12
1 http://search.msn.com/results.aspx?q=mease&first=21&FORM=PERE2                      Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322)   -
2                                      http://www.davemease.com/                      Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322)   -
3                                                              -                      Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322)   -
4                                                              - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4   -
5                                      http://www.davemease.com/ Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4   -

Reading Data into R
Look at the first column:
> data[,1]
[1] 69.224.117.122 69.224.117.122 69.224.117.122 128.12.159.164 128.12.159.164 128.12.159.164 128.12.159.164 128.12.159.164 128.12.159.164 128.12.159.164
…
[1901] 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11
[1911] 65.57.245.11 67.164.82.184 67.164.82.184 67.164.82.184 171.66.214.36 171.66.214.36 171.66.214.36 65.57.245.11 65.57.245.11 65.57.245.11
[1921] 65.57.245.11 65.57.245.11
73 Levels: 128.12.159.131 128.12.159.164 132.79.14.16 171.64.102.169 171.64.102.98 171.66.214.36 196.209.251.3 202.160.180.150 202.160.180.57 ... 89.100.163.185

Experimental vs. Observational Data (Important but not in book)
- Experimental data describes data which was collected by someone who exercised strict control over all attributes.
- Observational data describes data which was collected with no such controls.
- Almost all data used in data mining is observational data, so be careful.
Examples:
- Diet Coke vs. Weight
- Carbon Dioxide in Atmosphere vs. Earth’s Temperature

Types of Attributes: Qualitative vs. Quantitative (P. 26)
- Qualitative (or Categorical) attributes represent distinct categories rather than numbers. Mathematical operations such as addition and subtraction do not make sense.
  Examples: eye color, letter grade, IP address, zip code
- Quantitative (or Numeric) attributes are numbers and can be treated as such.
  Examples: weight, failures per hour, number of TVs, temperature

Types of Attributes (P. 25):
- All Qualitative (or Categorical) attributes are either Nominal or Ordinal.
  - Nominal = categories with no order
  - Ordinal = categories with a meaningful order
- All Quantitative (or Numeric) attributes are either Interval or Ratio.
  - Interval = no “true” zero, division makes no sense
  - Ratio = true zero exists, division makes sense

Types of Attributes: Some examples:
- Nominal Examples: ID numbers, eye color, zip codes
- Ordinal Examples: letter grades, height in {tall, medium, short}
- Interval Examples: calendar dates, temperatures in Celsius or Fahrenheit, GMAT score
- Ratio Examples: temperature in Kelvin, length, time, counts

Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
  Distinctness: = ≠
  Order: < >
  Addition and subtraction: + -
  Multiplication and division: * /
- Nominal attribute: distinctness
- Ordinal attribute: distinctness & order
- Interval attribute: distinctness, order & addition
- Ratio attribute: all 4 properties
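The four properties can be tried out directly in R. This is a minimal sketch with made-up values; the variable names (`eye`, `size`, `celsius`, `kelvin`) are hypothetical, and the ordered factor is one possible way to represent an ordinal attribute, not the only one.

```r
# Nominal: only distinctness (==, !=) is meaningful.
eye <- factor(c("brown", "blue", "brown"))
same_eye <- eye[1] == eye[3]

# Ordinal: order (<, >) is also meaningful; levels are listed from low to high.
size <- factor(c("Medium", "X-Large", "Large"),
               levels = c("Medium", "Large", "X-Large"), ordered = TRUE)
bigger <- size[2] > size[1]

# Interval: differences are meaningful, ratios are not
# (20 degrees Celsius is not "twice as hot" as 10 degrees Celsius).
celsius <- c(10, 20)
diff_c <- celsius[2] - celsius[1]

# Ratio: a true zero exists, so division is meaningful.
kelvin <- celsius + 273.15
ratio_k <- kelvin[2] / kelvin[1]
```

Note that R will happily compute `celsius[2] / celsius[1]`; deciding whether that number means anything is up to the analyst.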

Discrete vs. Continuous (P. 28)
- Discrete Attribute
  - Has only a finite or countably infinite set of values
  - Examples: zip codes, counts, or the set of words in a collection of documents
  - Often represented as integer variables
  - Note: binary attributes are a special case of discrete attributes which have only 2 values
- Continuous Attribute
  - Has real numbers as attribute values
  - Can be measured as accurately as instruments allow
  - Examples: temperature, height, or weight
  - Practically, real values can only be measured and represented using a finite number of digits
  - Continuous attributes are typically represented as floating-point variables

Discrete vs. Continuous (P. 28)
- Qualitative (categorical) attributes are always discrete
- Quantitative (numeric) attributes can be either discrete or continuous
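A short R sketch of how discrete and continuous attributes are typically stored, and of what "a finite number of digits" means in practice. The values and variable names are made up for illustration.

```r
calls      <- c(12L, 0L, 47L)          # counts: discrete, stored as integers
owns_phone <- c(TRUE, FALSE, TRUE)     # binary: a discrete attribute with 2 values
height_cm  <- c(170.2, 165.85, 181)    # continuous, stored as floating point

is.integer(calls)
is.double(height_cm)

# Finite precision: 0.1 + 0.2 is not stored exactly as 0.3,
# so compare floating-point values with a tolerance instead of ==.
exact_equal <- (0.1 + 0.2) == 0.3
near_equal  <- isTRUE(all.equal(0.1 + 0.2, 0.3))
```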

In class exercise #3: Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.
a) Number of telephones in your house
b) Size of French Fries (Medium or Large or X-Large)
c) Ownership of a cell phone
d) Number of local phone calls you made in a month
e) Length of longest phone call
f) Length of your foot
g) Price of your textbook
h) Zip code
i) Temperature in degrees Fahrenheit
j) Temperature in degrees Celsius
k) Temperature in Kelvins

Types of Data in R
- R often distinguishes between qualitative (categorical) attributes and quantitative (numeric) attributes
- In R,
  qualitative (categorical) = “factor”
  quantitative (numeric) = “numeric”

Types of Data in R
For example, the IP address in the first column of http://www.cob.sjsu.edu/mease_d/stats202log.txt is a factor:
> data<-read.csv("stats202log.txt", sep=" ",header=F)
> data[,1]
[1] 69.224.117.122 69.224.117.122 69.224.117.122 128.12.159.164 128.12.159.164 128.12.159.164 128.12.159.164 128.12.159.164 128.12.159.164 128.12.159.164
…
[1901] 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11 65.57.245.11
[1911] 65.57.245.11 67.164.82.184 67.164.82.184 67.164.82.184 171.66.214.36 171.66.214.36 171.66.214.36 65.57.245.11 65.57.245.11 65.57.245.11
[1921] 65.57.245.11 65.57.245.11
73 Levels: 128.12.159.131 128.12.159.164 132.79.14.16 171.64.102.169 171.64.102.98 171.66.214.36 196.209.251.3 202.160.180.150 202.160.180.57 ... 89.100.163.185
> is.factor(data[,1])
[1] TRUE
> data[,1]+10
[1] NA NA NA NA NA NA NA NA …
Warning message:
+ not meaningful for factors …
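One caveat when reproducing this output: since R 4.0.0, `read.csv()` no longer converts text columns to factors by default, so the first column may come back as plain character data unless `stringsAsFactors = TRUE` is passed. The sketch below shows both behaviours on a hypothetical three-line excerpt (`raw`), not on the real course file.

```r
# Hypothetical excerpt: IP address and status code only.
raw <- "69.224.117.122 200\n69.224.117.122 404\n128.12.159.164 200"

# Ask for factors explicitly (the default behaviour before R 4.0.0).
as_factor <- read.csv(text = raw, sep = " ", header = FALSE,
                      stringsAsFactors = TRUE)

# Keep text as character (the default from R 4.0.0 onward).
as_text <- read.csv(text = raw, sep = " ", header = FALSE,
                    stringsAsFactors = FALSE)

is.factor(as_factor[, 1])
is.character(as_text[, 1])
nlevels(as_factor[, 1])   # number of distinct IP addresses
table(as_factor[, 1])     # number of requests per IP address
```

Either way the IP address remains a qualitative attribute; the factor representation simply makes R refuse arithmetic on it.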

Types of Data in R
However, the 8th column looks like it should be numeric. Why is it not? How do we fix this?
> data[,8]
[1] 2867 4583 2295 2867 4583 2295 1379 2294 4432 7134 2296 2297 3219968 1379 2294 4432 7134 2293 2297 2294
…
[1901] 2294 4432 7134 2294 4432 7134 2294 2867 4583 2295 2294 4432 7134 2294 4432 7134 2294 2294 2294 2294
[1921] 2294 2294
Levels: - 1135151 122880 1379 1510 2290 2293 2294 2295 2296 2297 2309 238 241 246 248 250 2725487 280535 2867 3072 3219968 4432 4583 626 7134 7482
> is.factor(data[,8])
[1] TRUE
> is.numeric(data[,8])
[1] FALSE

Types of Data in R
A: We should have told R that “-” means missing when we read it in.
> data<-read.csv("stats202log.txt", sep=" ",header=F, na.strings = "-")
> is.factor(data[,8])
[1] FALSE
> is.numeric(data[,8])
[1] TRUE
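The effect of `na.strings` can be reproduced without the course file. The sketch below uses a hypothetical three-line excerpt (`raw`) in which one size is recorded as "-"; `stringsAsFactors = TRUE` is passed in the first call only to mimic the older default behaviour shown above.

```r
# Hypothetical excerpt: status code and response size, with one missing size.
raw <- "200 2867\n404 -\n200 4583"

# Without na.strings, the "-" forces the whole column to be non-numeric.
bad <- read.csv(text = raw, sep = " ", header = FALSE,
                stringsAsFactors = TRUE)

# With na.strings, "-" becomes NA and the column is read as numbers.
good <- read.csv(text = raw, sep = " ", header = FALSE, na.strings = "-")

is.factor(bad[, 2])
is.numeric(good[, 2])
mean_size <- mean(good[, 2], na.rm = TRUE)  # average size, ignoring the NA
```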

Types of Data in R
Q: How would we create an attribute giving the following zip codes 94550, 00123, 43614 for three observations in R?
A: Use quotes:
> zip_codes<- as.factor(c("94550","00123","43614"))
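A brief sketch of why the quotes matter: if the zip codes are treated as numbers, the leading zeros are lost. The variable names `zip_numeric` and `zip_factor` are hypothetical.

```r
# Treated as numbers, "00123" collapses to 123.
zip_numeric <- as.numeric(c("94550", "00123", "43614"))

# Treated as a factor, each zip code keeps its exact text.
zip_factor <- as.factor(c("94550", "00123", "43614"))

zip_numeric[2]
as.character(zip_factor[2])
levels(zip_factor)
```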

Types of Data in Excel
- Excel is not quite as picky and allows you to mix types more
- Also, you can change between a lot of different predefined formats in Excel by right clicking a column, selecting “Format Cells”, and looking under the “Number” tab

Types of Data in Excel
Q: How would we create an attribute giving the following zip codes 94550, 00123, 43614 for three observations in Excel?
A: Right click on the column, then choose “Format Cells”, then under the “Number” tab select “Text”

Working with Data in R
Creating Data:
> aa<-c(1,10,12)
> aa
[1]  1 10 12
Some simple operations:
> aa+10
[1] 11 20 22
> length(aa)
[1] 3

Working with Data in R
Creating More Data:
> bb<-c(2,6,79)
> my_data_set<-data.frame(attributeA=aa,attributeB=bb)
> my_data_set
  attributeA attributeB
1          1          2
2         10          6
3         12         79

Working with Data in R
Indexing Data:
> my_data_set[,1]
[1]  1 10 12
> my_data_set[1,]
  attributeA attributeB
1          1          2
> my_data_set[3,2]
[1] 79
> my_data_set[1:2,]
  attributeA attributeB
1          1          2
2         10          6

Working with Data in R
Indexing Data:
> my_data_set[c(1,3),]
  attributeA attributeB
1          1          2
3         12         79
Arithmetic:
> aa/bb
[1] 0.5000000 1.6666667 0.1518987
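Besides numeric positions, data frames can also be indexed by column name or by a logical condition. The sketch below rebuilds the same small data set so it runs on its own; the threshold `5` and the new column name `ratio` are arbitrary choices for illustration.

```r
my_data_set <- data.frame(attributeA = c(1, 10, 12),
                          attributeB = c(2, 6, 79))

col_a <- my_data_set$attributeA          # same values as my_data_set[, 1]
col_b <- my_data_set[, "attributeB"]     # column selected by name

# Logical indexing: keep only the rows where attributeB exceeds 5.
big_b <- my_data_set[my_data_set$attributeB > 5, ]

# Arithmetic on columns, stored as a new attribute.
my_data_set$ratio <- my_data_set$attributeA / my_data_set$attributeB
```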

Working with Data in R
Summary Statistics:
> mean(my_data_set[,1])
[1] 7.666667
> median(my_data_set[,1])
[1] 10
> sqrt(var(my_data_set[,1]))
[1] 5.859465
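The quantity `sqrt(var(...))` is the sample standard deviation, which R also provides directly as `sd()`. The sketch below checks this against a by-hand calculation using the n - 1 denominator; the variable names are hypothetical.

```r
x <- c(1, 10, 12)
n <- length(x)

# Sample standard deviation computed from its definition (n - 1 denominator).
sd_by_hand <- sqrt(sum((x - mean(x))^2) / (n - 1))

# Built-in equivalent of sqrt(var(x)).
sd_builtin <- sd(x)

summary(x)  # minimum, quartiles, median, mean and maximum in one call
```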

Working with Data in R
Writing Data:
> setwd("C:/Documents and Settings/David/Desktop")
> write.csv(my_data_set,"my_data_set_file.csv")
Help!:
> ?write.csv
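A quick way to confirm that a file was written as intended is to read it back. The sketch below writes to a temporary directory rather than the Desktop path above, and passes `row.names = FALSE` so that R's row numbers are not saved as an extra column; both choices are assumptions made for illustration, not part of the lecture.

```r
my_data_set <- data.frame(attributeA = c(1, 10, 12),
                          attributeB = c(2, 6, 79))

# Write to a temporary location so the example runs on any machine.
out_file <- file.path(tempdir(), "my_data_set_file.csv")
write.csv(my_data_set, out_file, row.names = FALSE)

# Read the file back and compare it with the original.
read_back <- read.csv(out_file)
same_values <- all(read_back == my_data_set)
```

Without `row.names = FALSE`, the file gains an unnamed first column holding 1, 2, 3, which then shows up as an extra attribute when the file is read back.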