Exploring Data: Summary Statistics and Visualizations

Slides:



Advertisements
Similar presentations
Lesson Describing Distributions with Numbers parts from Mr. Molesky’s Statmonkey website.
Advertisements

Descriptive Measures MARE 250 Dr. Jason Turner.
Unit 1.1 Investigating Data 1. Frequency and Histograms CCSS: S.ID.1 Represent data with plots on the real number line (dot plots, histograms, and box.
© Tan,Steinbach, Kumar Introduction to Data Mining 8/05/ Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan,
© Tan,Steinbach, Kumar Introduction to Data Mining 8/05/ Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan,
Business Statistics: A Decision-Making Approach, 7e © 2008 Prentice-Hall, Inc. Chap 3-1 Business Statistics: A Decision-Making Approach 7 th Edition Chapter.
Statistics Lecture 2. Last class began Chapter 1 (Section 1.1) Introduced main types of data: Quantitative and Qualitative (or Categorical) Discussed.
Descriptive Statistics  Summarizing, Simplifying  Useful for comprehending data, and thus making meaningful interpretations, particularly in medium to.
AP Statistics Chapters 0 & 1 Review. Variables fall into two main categories: A categorical, or qualitative, variable places an individual into one of.
Descriptive Statistics  Summarizing, Simplifying  Useful for comprehending data, and thus making meaningful interpretations, particularly in medium to.
Exploratory Data Analysis. Computing Science, University of Aberdeen2 Introduction Applying data mining (InfoVis as well) techniques requires gaining.
Let’s Review for… AP Statistics!!! Chapter 1 Review Frank Cerros Xinlei Du Claire Dubois Ryan Hoshi.
2011 Summer ERIE/REU Program Descriptive Statistics Igor Jankovic Department of Civil, Structural, and Environmental Engineering University at Buffalo,
Tan,Steinbach, Kumar: Exploratory Data Analysis (with modifications by Ch. Eick) Data Mining: “New” Teaching Road Map 1. Introduction to Data Mining and.
Lecture 5 Dustin Lueker. 2 Mode - Most frequent value. Notation: Subscripted variables n = # of units in the sample N = # of units in the population x.
Chap 3-1 A Course In Business Statistics, 4th © 2006 Prentice-Hall, Inc. A Course In Business Statistics 4 th Edition Chapter 3 Describing Data Using Numerical.
Practice Page 65 –2.1 Positive Skew Note Slides online.
© Tan,Steinbach, Kumar Introduction to Data Mining 8/05/ Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan,
© Tan,Steinbach, Kumar Introduction to Data Mining 8/05/ Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan,
Lecture 5 Dustin Lueker. 2 Mode - Most frequent value. Notation: Subscripted variables n = # of units in the sample N = # of units in the population x.
Notes Unit 1 Chapters 2-5 Univariate Data. Statistics is the science of data. A set of data includes information about individuals. This information is.
Data Mining: Exploring Data
3/13/2016 Data Mining 1 Lecture 2-1 Data Exploration: Understanding Data Phayung Meesad, Ph.D. King Mongkut’s University of Technology North Bangkok (KMUTNB)
(Unit 6) Formulas and Definitions:. Association. A connection between data values.
1 By maintaining a good heart at every moment, every day is a good day. If we always have good thoughts, then any time, any thing or any location is auspicious.
Statistics Unit 6.
Introduction to Statistics
Introduction to Statistics
Exploratory Data Analysis
Data Mining: Exploring Data
Chapter 16: Exploratory data analysis: numerical summaries
BAE 6520 Applied Environmental Statistics
ISE 261 PROBABILISTIC SYSTEMS
Data Mining: EXPLORING DATA
Chapter 3 Describing Data Using Numerical Measures
Chapter 5 : Describing Distributions Numerically I
Data Mining: Exploring Data
Practice Page Practice Page Positive Skew.
Statistics Unit Test Review
Objective: Given a data set, compute measures of center and spread.
Unit 6 Day 2 Vocabulary and Graphs Review
Averages and Variation
Statistical Reasoning
Description of Data (Summary and Variability measures)
Chapter 3 Describing Data Using Numerical Measures
Numerical Descriptive Measures
Data Mining: Exploring Data
Descriptive Statistics
Unit 4 Statistics Review
DAY 3 Sections 1.2 and 1.3.
Chapter 5: Describing Distributions Numerically
Statistics Unit 6.
Topic 5: Exploring Quantitative data
CHAPTER 1 Exploring Data
Data Mining: Exploring Data
Data Mining: “New” Teaching Road Map
STA 291 Spring 2008 Lecture 5 Dustin Lueker.
Drill {A, B, B, C, C, E, C, C, C, B, A, A, E, E, D, D, A, B, B, C}
STA 291 Spring 2008 Lecture 5 Dustin Lueker.
Displaying Distributions with Graphs
Displaying and Summarizing Quantitative Data
Displaying and Summarizing Quantitative Data
Data Analysis and Statistical Software I Quarter: Spring 2003
Numerical Descriptive Measures
Day 52 – Box-and-Whisker.
Honors Statistics Review Chapters 4 - 5
MCC6.SP.5c, MCC9-12.S.ID.1, MCC9-12.S.1D.2 and MCC9-12.S.ID.3
Data Mining: Exploring Data
Data exploration and visualization
Presentation transcript:

Exploring Data: Summary Statistics and Visualizations CSC 600: Data Mining Class 2

Today Exploring Data: Summary Statistics, Visualizations Python Crash Course

Exploring Data Data Exploration: a preliminary investigation of the data in order to better understand its specific characteristics Patterns can sometime be found simply by visualizing the data (and then can be used to explain the data mining results) Summary statistics also used Aid in selection appropriate preprocessing and data analysis techniques

Summary Statistics Capture various characteristics of a large set of values Common summary statistics: Mean Standard deviation Range Mode Most summary statistics can be calculated in a single pass through the data.

Frequency Class Size Frequency Freshman 200 0.33 Sophomore 160 0.27 Junior 130 0.22 Senior 110 0.18 Often used with categorical values.

Mode The value that has the highest frequency. The mode (especially with discrete / continuous data) may reveal value that symbolizes a missing value. The value that has the highest frequency. Class Size Frequency Freshman 200 0.33 Sophomore 160 0.27 Junior 130 0.22 Senior 110 0.18 Often used with categorical values.

Percentiles For ordered data, percentile is useful. Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile xp is a value of x such that p% of the observed values are less than xp. Example: the 75th percentile is the value such that 75% of all values are less than it.

Mean Measure the “location” of a set of values Mean is a very, very common measurement But is sensitive to outliers

Median Commonly used instead of mean if outliers are present Median is the middle value if odd number of values are present; average of the two middle values if even number of values Ordered set of n values: {x1, …, xn}

Trimmed Mean Specify a percentage p between 0 and 100 Top and bottom (p/2)% of data is not used in mean calculation p=0, corresponds to standard mean p=50, corresponds to median calculation

Range “Measure of spread” Ordered set of n values: {x1, …, xn} Can be misleading if most values are concentrated, but a few values are extreme

Variance / Standard Deviation “Measure of spread” Ordered set of n values: {x1, …, xn}

Because variance and standard deviation measures use the mean, they can also be sensitive to outliers.

Other Measures of Spread Absolute Average Deviation Median Absolute Deviation Interquartile Range

Skewness Measures the degree to which the values are symmetrically distributed about the center

Mean vs. Median If the distribution of values is skewed, then the median is a better indicator of the middle, compare to the mean.

Visualizations The display of information in a graphic or tabular format. Many visualization formats exist (contour plots, graphs, heat maps) to display high-dimensional information. Since this isn’t a visualization course, we’ll mainly use “traditional” two-dimensional graphic types. How can we transform a dataset with many attributes into two dimensions? Selection: Typically by selecting two dimensions at a time Can also only “select” a subset of records to display Other techniques also exist

Iris Data Set Next few slides will demonstrate visualization using the classic Iris dataset Freely available from UCI (University of California at Irvine) Machine Learning Lab Relatively very small 150 records of Iris flowers (50 for each species) Attributes: Sepal length (centimeters) Sepal width (centimeters) Petal length (centimeters) Petal width (centimeters) Class (species of Iris) {Setosa, Versicolour, Virginica}

Visualization: Histogram For showing the distribution of values Divide values into bins; show number of objects that fall into each bin Shape of histogram depends on number of bins

Sepal length data Bins of equal width Histogram (10 bins) Histogram (20 bins)

Histogram Previous slide showed histogram of a continuous attribute For categorical attributes, each category is a bin. If there are too many bins, then values need to be combined in some way.

Relative Frequency Histogram Instead of counts on the y-axis, the relative frequency (density) is used. > hist(iris$Sepal.Length,freq = F,xlab="Sepal Length",breaks=20)

Visualization: Box Plot Note: only 50% of the data is in the box! Box plots show the distribution of the values for a single numerical attribute. Whiskers: top and bottom lines of the box plot outlier 10th percentile 25th percentile 75th percentile 50th percentile 90th percentile

Box Plot Whiskers can represent several possible alternate values Note: only 50% of the data is in the box! Whiskers can represent several possible alternate values Best to describe the convention used in a legend along the chart 1 sd above mean Greatest data value within 1.5 of IQR Maximum of data (no outliers graphed in this case) 1 sd below mean Smallest data value within 1.5 of IQR Minimum of data

Box plots are relatively compact; many of them can be shown on the same plot.

Visualization: Scatter Plot Data objects are plotted as a point in a 2d-plane: one attribute on x-axis, the other on y-axis Assumed that both attributes are discrete or continuous Scatter plot matrix: organized way to examine a number of scatter plots simultaneously Scatter plots for multiple pairs of attributes

When class labels are available, a scatter plot matrix can visualize how much two attributes separate the classes.

References Introduction to Data Mining, 1st edition, Tan et al. http://en.wikipedia.org/wiki/Box_plot