Introduction Exploring Categorical Variables Exploring Numerical Variables Exploring Categorical/Numerical Variables Selecting Interesting Subsets of Data.

Presentation transcript:

Chapter 3: Exploratory Data Analysis
– Introduction
– Exploring Categorical Variables
– Exploring Numerical Variables
– Exploring Categorical/Numerical Variables
– Selecting Interesting Subsets of Data
– Binning

Review: data mining tasks
– Description (Ch. 3: Exploratory Data Analysis)
– Estimation (Ch. 4: Statistical Approaches)
– Classification (Ch. 5, 6 & 7: k-NN, DT & NN)
– Prediction (Chapters 4 to 7)
– Clustering (Ch. 8 & 9: k-means & Kohonen Networks)
– Association (Ch. 10)

Hypothesis Testing vs. EDA
Hypothesis testing is used when there are a priori hypotheses.
– Example: testing whether a recent increase in the cell phone fee has led to a decrease in market share
Otherwise, EDA (graphical data analysis) is used, especially when the database is large and its contents are unknown.

What EDA can do
– Delve into the data set
– Examine the interrelationships among the attributes
– Identify interesting subsets of the observations
– Develop an initial idea of possible associations between the attributes and the target variable, if any

How? Three combinations:
– 2 categorical variables (gender, race, state, …): pie chart, bar chart, …; cross tabulation; …
– 2 numerical variables (height, age, salary, price, …): regression or correlation; histogram; analyzing mean, median, standard deviation, …; scatter plot; …
– 1 categorical, 1 numerical variable: side-by-side box plots

Churn dataset (slides 6 to 8; figures not reproduced)

Review: two types of data
– Categorical variables (qualitative)
– Numerical variables (quantitative)

Exploring Categorical Variables: Bar chart: International Plan
Those who have selected the International Plan have a greater chance of leaving the company’s service.
(Figure annotation: equal bar length)

Exploring Categorical Variables: Cross Tabulation: International Plan
To quantify the relationship between the International Plan and churning, the cross tabulation reports:
– Total churned
– Of total churned, did not use the International Plan
– Of total churned, used the International Plan

Exploring Categorical Variables: Cross Tabulation: International Plan
1. We should investigate what it is about the International Plan that is inducing customers to leave.
2. The data mining model we use to predict churn will probably include whether or not the customer selected the International Plan.
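The cross tabulation idea can be sketched in a few lines of Python. This is a minimal illustration only: the records are a hypothetical toy sample, not the churn dataset from the slides, and the function names `cross_tab` and `churn_rate_by_plan` are made up for this example.

```python
from collections import Counter

# Hypothetical toy records: (international_plan, churned).
# These are NOT values from the churn dataset shown on the slides.
records = [
    ("no", False), ("no", False), ("no", False), ("no", True),
    ("yes", False), ("yes", True), ("yes", True), ("no", False),
]

def cross_tab(rows):
    """Count the records in every (plan, churned) cell of the table."""
    return Counter(rows)

def churn_rate_by_plan(rows):
    """Return the share of churners within each International Plan group."""
    counts = cross_tab(rows)
    rates = {}
    for plan in sorted({p for p, _ in rows}):
        total = counts[(plan, True)] + counts[(plan, False)]
        rates[plan] = counts[(plan, True)] / total
    return rates

table = cross_tab(records)
rates = churn_rate_by_plan(records)
for plan, rate in rates.items():
    print(f"International Plan = {plan}: churn rate {rate:.0%}")
```

Comparing the churn rate within each group, rather than the raw counts, is what makes the two groups comparable when they differ in size.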

Uncover Anomalous Fields
Cross tabulation of Area Code and State => eliminate the redundant field

Exploring Numerical Variables: statistical parameters
Compare the mean and median in each field.

Mean vs Median Symmetric distribution

Mean vs Median Right skewed distribution

Mean vs Median Left skewed distribution

Extreme example
Income in a small town of 6 people: $25,000 $27,000 $29,000 $35,000 $37,000 $38,000
Mean is $31,833 and median is $32,000.
Bill Gates moves to town: $25,000 $27,000 $29,000 $35,000 $37,000 $38,000 $40,000,000
Mean is $5,741,571 and median is $35,000.
The mean is pulled by the outlier while the median is not. The median is a better measure of center for these data.
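The arithmetic in this example can be checked with Python's standard `statistics` module. The incomes are the ones listed on the slide; the variable names are chosen here for illustration.

```python
from statistics import mean, median

# Incomes from the slide: six residents, then the same town with one outlier.
incomes = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000]
with_outlier = incomes + [40_000_000]

mean_before = mean(incomes)
median_before = median(incomes)
mean_after = mean(with_outlier)
median_after = median(with_outlier)

print(f"before: mean = {mean_before:,.0f}, median = {median_before:,.0f}")
print(f"after:  mean = {mean_after:,.0f}, median = {median_after:,.0f}")
```

With an even number of values the median is the average of the two middle values, here $29,000 and $35,000.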

Exploring Numerical Variables: Correlation

Dealing with Correlated Variables
=> Eliminate the redundant variable
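One way to spot a redundant variable is to compute the Pearson correlation between pairs of numerical fields. The sketch below assumes a hypothetical case in which a charge field is an exact multiple of a minutes field; the values and the 0.17 rate are illustrative assumptions, not figures taken from the slides.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical data: the charge is a fixed (assumed) rate times the minutes,
# so the two fields carry the same information.
day_minutes = [120.0, 180.5, 200.0, 265.1, 310.4]
day_charge = [0.17 * m for m in day_minutes]
service_calls = [3, 1, 4, 1, 5]  # unrelated hypothetical field

r_charge = pearson_r(day_minutes, day_charge)
r_calls = pearson_r(day_minutes, service_calls)
print(f"minutes vs charge: r = {r_charge:.3f}")
print(f"minutes vs calls:  r = {r_calls:.3f}")
```

A correlation at or very near 1 (or -1) suggests that one of the two fields can be dropped without losing information.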

Exploring Numerical Variables: Histogram

Summary of one-variable EDA

EDA: 1 Categorical, 1 Numerical variable: Box plots
A box plot is a graph of five numbers (often called the five-number summary):
– Minimum
– 1st quartile
– Median
– 3rd quartile
– Maximum
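The five-number summary behind a box plot can be computed as sketched below. Quartile conventions differ between textbooks and software; this example assumes the "inclusive" method of Python's `statistics.quantiles`, and the two groups are hypothetical values rather than data from the slides.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # n=4 yields the three quartile cut points; "inclusive" treats the
    # data as the whole population (an assumed convention, see lead-in).
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical measurements for two groups of a categorical variable.
groups = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "B": [10, 12, 14, 16, 18],
}

summaries = {name: five_number_summary(v) for name, v in groups.items()}
for name, summary in summaries.items():
    print(name, summary)
```

Computing one summary per category is what a side-by-side box plot displays graphically.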

EDA: 1 Categorical, 1 Numerical variable: Box plots
Example: Pets and Stress, Allen et al. (1988). Three groups: Alone, Friend, Pet

Selecting Interesting Subsets of the Data
43% of the subset are churners.

Binning (or banding)
Binning refers to the categorization of numerical or categorical variables into a manageable set of classes that are convenient for analysis.
Example:
– The number of day minutes could be categorized (binned) into three classes: low, medium, high

Binning: ways to define the classes
1. Classes of equal width
2. Classes with an equal number of records
3. Identifiable groups of records that behave similarly with respect to the target variable
4. Clustering
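The first two strategies can be sketched as follows. This is a minimal illustration under stated assumptions: the day-minutes values are hypothetical, the helper names are invented for this example, and the equal-width version assumes the values are not all identical (otherwise the bin width would be zero).

```python
def equal_width_bins(values, k):
    """Assign each value to one of k bins that span equal ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes hi > lo
    # The maximum would fall just past the last bin, so clamp it to k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding (nearly) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

labels = ["low", "medium", "high"]
day_minutes = [0, 30, 60, 90, 120, 150, 180, 210, 300]  # hypothetical values

width_bins = equal_width_bins(day_minutes, 3)
freq_bins = equal_frequency_bins(day_minutes, 3)

for minutes, w, f in zip(day_minutes, width_bins, freq_bins):
    print(f"{minutes:>3} minutes: equal width = {labels[w]}, equal frequency = {labels[f]}")
```

On skewed data the two strategies disagree: equal-width bins can leave some classes nearly empty, while equal-frequency bins can place similar values in different classes.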

If Customer Service Calls > 3, class = High; else class = Low
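The rule above translates directly into code. The sketch below also compares churn rates across the two classes; the customer records and the helper names are hypothetical and serve only to show how the binned variable would be used.

```python
def service_call_class(calls, threshold=3):
    """Apply the slide's rule: more than `threshold` calls is High, else Low."""
    return "High" if calls > threshold else "Low"

def churn_rate_by_class(rows):
    """Share of churners in each class for (calls, churned) records."""
    totals, churners = {}, {}
    for calls, churned in rows:
        label = service_call_class(calls)
        totals[label] = totals.get(label, 0) + 1
        churners[label] = churners.get(label, 0) + int(churned)
    return {label: churners[label] / totals[label] for label in totals}

# Hypothetical (customer_service_calls, churned) records, not slide data.
customers = [
    (0, False), (1, False), (2, False), (3, False),
    (4, True), (5, True), (4, False), (1, True),
]

class_rates = churn_rate_by_class(customers)
for label, rate in class_rates.items():
    print(f"class = {label}: churn rate {rate:.0%}")
```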