Ch. Eick: Some Ideas for Task4 Project2


Ideas on Creating Summaries that Characterize Clustering Results
October 9, 2014

Focus: The primary focus is cluster summarization (what kind of objects does each cluster contain; what is unique about each particular cluster). The secondary focus is relationships between clusters.

Remark: Include the 13th attribute when creating summaries.

1. Using a method of your choice (e.g. box plots), compare the distribution in a particular cluster with the distribution in the dataset:
   – Create summaries of clusters based on properties of a particular cluster that significantly deviate from the properties of the whole dataset.
   – Create interestingness scores for clusters based on the degree of deviation.
   – Create summaries about major differences between clusters (analyze how one cluster differs from the other clusters).
2. Assess other relationships between different clusters (e.g. their distances).
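As a rough illustration of idea 1, the sketch below compares one attribute of a cluster against the same attribute in the whole dataset using box-plot statistics. The function names and the "median shift in dataset-IQR units" score are assumptions made for this example; the slides leave the choice of deviation measure open.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a box plot displays."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


def median_shift(cluster_values, dataset_values):
    """Hypothetical deviation score: how far the cluster median lies from
    the dataset median, measured in units of the dataset IQR."""
    _, d_q1, d_med, d_q3, _ = five_number_summary(dataset_values)
    c_med = five_number_summary(cluster_values)[2]
    iqr = d_q3 - d_q1
    # Assumption: an attribute with zero spread in the dataset scores 0.
    return (c_med - d_med) / iqr if iqr > 0 else 0.0


# Toy data for a single attribute (made up for illustration).
dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9]
cluster = [7, 8, 9]

print("dataset:", five_number_summary(dataset))
print("cluster:", five_number_summary(cluster))
print("median shift:", median_shift(cluster, dataset))
```

Attributes with a large absolute shift are candidates for the cluster summary, since they are where the cluster differs most from the dataset as a whole.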

Ideas on Creating Summaries and Evaluations of Clusterings II

More Post Analysis Ideas:

3. Learn a decision tree (or some other model) that separates the instances of a particular cluster from the instances of the other 4 clusters:
   – Use the accuracy of the decision tree as a measure of the quality of the cluster.
   – Use a highly pruned version of the decision tree as a summary (or use rules derived from the decision tree; e.g. report all paths that lead to choosing the cluster's class as a set of rules).
   – …
4. Using a method of your choice (e.g. box plots), compare the distributions of pairs of clusters:
   – Analyze which clusters are similar to each other and which deviate from each other.
   – Summarize the patterns they have in common and the patterns in which they differ.
5. Using a method of your choice (e.g. box plots), analyze how each cluster differs from the other 4 clusters.
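Idea 3 can be illustrated with a deliberately tiny stand-in for a decision tree: a one-level tree (decision stump) that tests a single attribute against a threshold. This is a swapped-in simplification, not a full tree learner such as C4.5, and the accuracy it reports is training accuracy rather than a held-out estimate. The function name and the toy data are assumptions for this sketch.

```python
def best_stump(rows, labels, target):
    """Find the single-attribute threshold test that best separates
    cluster `target` from all other clusters (one-vs-rest).

    Returns (accuracy, attribute_index, threshold, target_is_leq).
    """
    is_target = [lab == target for lab in labels]
    n = len(rows)
    best = (0.0, None, None, None)
    for att in range(len(rows[0])):
        values = sorted({row[att] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            th = (lo + hi) / 2
            # Count rows classified correctly by "target if value <= th".
            hits = sum((row[att] <= th) == t for row, t in zip(rows, is_target))
            # The opposite rule "target if value > th" gets the rest right.
            for leq, correct in ((True, hits), (False, n - hits)):
                acc = correct / n
                if acc > best[0]:
                    best = (acc, att, th, leq)
    return best


# Toy clustering result: two attributes, cluster labels 0 and 1 (made up).
rows = [(1.0, 5.0), (1.5, 1.0), (0.5, 3.0), (3.0, 4.0), (3.5, 2.0), (2.5, 6.0)]
labels = [0, 0, 0, 1, 1, 1]

acc, att, th, leq = best_stump(rows, labels, target=0)
op = "<=" if leq else ">"
print(f"cluster 0: attribute {att} {op} {th} (accuracy {acc:.2f})")
```

The winning test doubles as a one-rule summary of the cluster, and its accuracy gives a crude indication of how well the cluster is separated from the rest.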

Example1: Using Box Plot Cluster Summaries

1. Compute the interquartile range (IQR) for each attribute, for the dataset and for each cluster.
2. Compute the overlap of each cluster box plot with the dataset box plot. Let (a, b) be the cluster IQR with a > b and (a', b') the dataset IQR with a' > b' for attribute att; then:
   $$\mathrm{overlap}_{att} = \frac{\max\bigl(0,\ \min(a', a) - \max(b', b)\bigr)}{\max(a', a) - \min(b', b)}$$
3. Discard the cluster box plot for att if $\mathrm{overlap}_{att} > th$ (e.g. th = 0.7).
4. Use the surviving box plots as the cluster summary for the clusters, also reporting the overlap values for all clusters (including those of the discarded box plots).
5. Compute cluster interestingness as follows:
   – Let $O = \{\mathrm{overlap}_1, \dots, \mathrm{overlap}_r\}$ be the overlaps of a cluster c for its r attributes; in general, Interestingness(c) = f(O); e.g. f(O) = average(O.values).
   – Let $v_1$, $v_2$, $v_3$ be the lowest, second-lowest, and third-lowest values in O:
     $$\mathrm{Interestingness}(O) = 1 - \frac{3 v_1 + 2 v_2 + 1 v_3}{6}$$
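The steps of Example1 can be sketched in a few lines of Python. The overlap formula, the threshold th = 0.7, and the weighted interestingness score follow the slide; the function names, the dictionary layout, and the handling of a zero-width union (treated as full overlap) are assumptions of this sketch.

```python
def overlap(a, b, a_ds, b_ds):
    """Overlap of the cluster IQR (a, b) with the dataset IQR (a_ds, b_ds).

    As on the slide, a and a_ds are the upper quartiles, b and b_ds the
    lower ones: length of the intersection divided by length of the union.
    """
    union = max(a_ds, a) - min(b_ds, b)
    if union == 0:
        # Assumption: two identical zero-width boxes overlap completely.
        return 1.0
    return max(0.0, min(a_ds, a) - max(b_ds, b)) / union


def surviving_attributes(overlaps, th=0.7):
    """Step 3: keep box plots whose overlap with the dataset is at most th."""
    return [att for att, ov in overlaps.items() if ov <= th]


def interestingness(overlap_values):
    """Step 5, weighted variant: the three lowest overlaps count 3, 2, 1.

    Needs at least three attributes.
    """
    v1, v2, v3 = sorted(overlap_values)[:3]
    return 1 - (3 * v1 + 2 * v2 + 1 * v3) / 6


# Toy quartiles for one cluster: attribute -> (a, b, a_ds, b_ds), made up.
quartiles = {
    "att1": (4.0, 2.0, 7.0, 3.0),  # partial overlap
    "att2": (9.0, 8.0, 7.0, 3.0),  # disjoint boxes
    "att3": (7.0, 3.0, 7.0, 3.0),  # identical boxes
}
overlaps = {att: overlap(*q) for att, q in quartiles.items()}

print("overlaps:", overlaps)
print("kept for the summary:", surviving_attributes(overlaps))
print("interestingness:", interestingness(overlaps.values()))
```

A low overlap means the cluster's box sits in a different place than the dataset's box for that attribute, so the weighted score rewards clusters with a few strongly deviating attributes.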