

3 Exploring data

- Using summary statistics to explore data
- Exploring data using visualization
- Finding problems and issues during data exploration

University of Seoul (서울시립대학교), Department of Electrical, Electronic and Computer Engineering, G 이가희
Course: Advanced Computer Algorithms (고급컴퓨터알고리듬)

Using summary statistics to explore data

summary(data): shows the overall shape of the data. What it reports depends on the data type:
- numeric: a variety of summary statistics
- categorical data (factor and logical): count statistics

    custdata <- read.table('custdata.tsv', header=TRUE, sep='\t')
    str(custdata)
    summary(custdata)

Data source: zmPDSwR.zip

Typical problems revealed by summaries:
- Missing values
- Invalid values and outliers
- Data range
- Units
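Since custdata.tsv is not included here, the following is a minimal sketch on a toy data frame, showing how summary() reports numeric, factor and logical columns differently. The data frame `toy` and all of its values are assumptions made up for illustration; only the column names echo custdata.

```r
# Toy stand-in for custdata; every value here is invented for illustration.
toy <- data.frame(
  age          = c(23, 41, 0, 67, 35),
  income       = c(22000, 54000, -8700, 0, 615000),
  is.employed  = c(TRUE, NA, TRUE, FALSE, NA),
  marital.stat = factor(c("Married", "Never Married", "Married",
                          "Widowed", "Divorced/Separated"))
)

str(toy)  # column types at a glance

age_summary     <- summary(toy$age)             # numeric: min, quartiles, mean, max
marital_summary <- summary(toy$marital.stat)    # factor: one count per level
employed_na     <- sum(is.na(toy$is.employed))  # number of missing logical values

print(age_summary)
print(marital_summary)
print(summary(toy$is.employed))  # logical: counts of FALSE, TRUE and NA's
```

Calling summary(toy) on the whole data frame prints the same information column by column.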

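Typical problems revealed by summaries: missing values

MISSING VALUES: the value is absent (which is not the same as 0).
- Is dropping rows the only solution? You need to work out why the values are missing and whether the affected data is still worth using.
- Many missing values may carry meaning, for example "not in the active workforce" (students or stay-at-home partners).
- Only a few values missing -> drop the rows.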
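The two options above can be sketched as follows. This is an illustration on invented data, not the presenter's code; the level names "employed", "not employed" and "missing" are hypothetical labels chosen for the example.

```r
# Toy data; the column names mirror custdata but the values are invented.
toy <- data.frame(
  is.employed  = c(TRUE, NA, FALSE, TRUE, NA, TRUE),
  housing.type = c("Rented", "Rented", NA, "Homeowner", "Rented", "Homeowner"),
  stringsAsFactors = FALSE
)

# Count missing values per column before deciding what to do with them.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Option 1: the NAs may carry meaning -> keep the rows and make
# "missing" its own category (label names are hypothetical).
employment <- ifelse(is.na(toy$is.employed), "missing",
                     ifelse(toy$is.employed, "employed", "not employed"))
toy$employment <- factor(employment,
                         levels = c("employed", "not employed", "missing"))
print(table(toy$employment))

# Option 2: only a few NAs -> drop the affected rows.
toy_complete <- toy[!is.na(toy$housing.type), ]
print(nrow(toy_complete))
```

Which option is appropriate depends on why the values are missing, which the data alone usually cannot tell you.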

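Typical problems revealed by summaries: invalid values and outliers, data range

INVALID VALUES: values that carry no meaning. Missing values are sometimes recorded as invalid values.
- Example: numeric data that must be non-negative (age, income) but contains negative values.

    summary(custdata$income)
    summary(custdata$age)

- A suspicious age value may stand for "age unknown" or "refuse to state".
- A negative income might mean "amount of debt"; otherwise it is bad data.

DATA RANGE: is the range wide or narrow? The range you need depends on what you are analyzing.
- Example: when predicting reading ability for children between 5 and 10 years old, age is a useful variable. If the data also covers people in their twenties and older, transform the data or bin it into age ranges.
- If the data range is narrow relative to the problem you need to predict, use a rough rule of thumb: the ratio of the standard deviation to the mean.
- Income here runs from 0 to 615,000: a very wide range.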
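A minimal sketch of both checks, on invented numbers: flagging negative values in a variable that should be non-negative, and computing the ratio of standard deviation to mean as a rough indicator of how much a variable varies. Recoding negative incomes as NA is an assumption made for the example; whether that is the right treatment depends on what the negative values mean.

```r
# Invented values for illustration only.
income   <- c(22000, 54000, -8700, 0, 31000, 615000)
age_wide <- c(23, 41, 0, 67, 35, 146.7)
age_narrow <- c(50, 51, 52, 50, 51)

# Invalid values: income should not be negative.
negative_income <- income < 0
print(sum(negative_income))           # how many suspicious entries
income_clean <- ifelse(negative_income, NA, income)  # assumption: treat as missing

# Data range: ratio of standard deviation to mean.
# A very small ratio suggests the variable barely varies.
ratio_wide   <- sd(age_wide) / mean(age_wide)
ratio_narrow <- sd(age_narrow) / mean(age_narrow)
print(c(wide = ratio_wide, narrow = ratio_narrow))
```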

Typical problems revealed by summaries: units

UNITS: check which units the data is recorded in (days, hours, minutes, kilometers per second, ...).
- Is income an "hourly wage" or a "yearly income in units of $1,000"?

    summary(custdata$income)
    Income <- custdata$income/1000
    summary(Income)

Dividing by 1000 shrinks the range of the values.

Spotting problems using graphics and visualization

ggplot2: a visualization package offering an interface similar to plot(), which R provides by default. Making good use of layers is the key.

    ggplot(custdata, aes(x=age)) + geom_density()
    ggplot(custdata) + geom_density(aes(x=age))

Things to look for in the plot: invalid values? outliers?

General form:

    ggplot(data, aes(x=column, y=column), FUN…) + geometric_object() + FUN…

- data: must be a data.frame.
- aes(): aesthetic mapping, used when plotting the data; x and y take the column names of the data to plot.
- geometric objects: geom_point() (scatter plot), geom_line() (line plot), geom_bar() (bar chart), geom_density() (density plot), geom_histogram() (histogram), …

Spotting problems using graphics and visualization: a single variable

1 HISTOGRAM: shows the distribution of the data bin by bin.
- examines the data range
- checks the number of modes
- checks whether the distribution is normal/lognormal
- checks for anomalies and outliers

    ggplot(custdata) + geom_histogram(aes(x=age), binwidth=5, fill='gray')

Things to look for: invalid values, outliers.
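What a histogram computes can be sketched in base R as counting values per bin. The ages below are invented, and the bin edges (multiples of 5, closed on the left) are an assumption for the example; ggplot2 places its bin boundaries by its own defaults, so the counts need not match geom_histogram() exactly.

```r
# Invented ages, including two suspicious values (0 and 146).
age <- c(0, 18, 22, 25, 31, 34, 38, 42, 47, 51, 58, 63, 70, 146)

# Bins of width 5: [0,5), [5,10), ..., [145,150)
breaks <- seq(0, 150, by = 5)
bins   <- cut(age, breaks = breaks, right = FALSE)
counts <- table(bins)

# Show only the non-empty bins; isolated bins at the extremes
# are candidates for invalid values or outliers.
print(counts[counts > 0])
```

The isolated counts in the first and last bins are the same signal that shows up as lone bars at the edges of the plotted histogram.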

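Spotting problems using graphics and visualization: a single variable

2 DENSITY PLOT: unlike a histogram, whose shape changes with the choice of bins, the shape of a density plot does not depend on bins. The distribution does not change abruptly at bin boundaries (it is a smooth curve).
- examines the data range
- checks the number of modes
- checks whether the distribution is normal/lognormal
- checks for anomalies and outliers

    library(scales)  # provides dollar for the axis labels
    ggplot(custdata) + geom_density(aes(x=income)) + scale_x_continuous(labels=dollar)

scale_x_continuous(): continuous position scales.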

Spotting problems using graphics and visualization: a single variable

3 LOG-SCALED DENSITY PLOT: a density plot on a logarithmic x axis.

    ggplot(custdata) + geom_density(aes(x=income)) +
      scale_x_log10(breaks=c(100,1000,10000,100000), labels=dollar) +
      annotation_logticks(sides='bt')

- annotation_logticks(): annotation that adds log tick marks.
- sides='bt': log ticks on the bottom and top (the default is 'bl', bottom and left).
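As a rough illustration of why a log scale helps with a wide-ranging variable such as income, the sketch below compares the spread of some invented values before and after log10(). It also shows that zero and negative values do not survive the transformation, which matters for data containing such incomes.

```r
# Invented incomes spanning a wide range.
income <- c(12000, 25000, 38000, 52000, 75000, 110000, 615000)

spread_raw <- max(income) / min(income)   # largest value relative to smallest
log_income <- log10(income)
spread_log <- diff(range(log_income))     # width of the range in powers of ten

print(spread_raw)
print(spread_log)

# Non-positive values cannot be placed on a log scale:
print(log10(0))                           # -Inf
print(suppressWarnings(log10(-8700)))     # NaN
```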

Spotting problems using graphics and visualization: a single variable

4 BAR CHART: compares relative or absolute frequencies of the values of a categorical variable.

    ggplot(custdata) + geom_bar(aes(x=marital.stat), fill='gray')

Spotting problems using graphics and visualization: a single variable

5 HORIZONTAL BAR CHART

    ggplot(custdata) +
      geom_bar(aes(x=state.of.res), fill='gray') +
      coord_flip() +
      theme(axis.text.y=element_text(size=rel(0.8)))

- coord_flip(): flipped cartesian coordinates
- theme(): modifies theme settings
- rel(): relative sizing for theme elements

Spotting problems using graphics and visualization: a single variable

5 HORIZONTAL BAR CHART (sorted by count)

    statesums <- table(custdata$state.of.res)
    statef <- as.data.frame(statesums)
    colnames(statef) <- c('state.of.res', 'count')
    statef <- transform(statef, state.of.res=reorder(state.of.res, count))
    ggplot(statef) +
      geom_bar(aes(x=state.of.res, y=count), stat='identity', fill='gray') +
      coord_flip() +
      theme(axis.text.y=element_text(size=rel(0.8)))

- reorder(): reorders the levels of a factor
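The effect of reorder() is easiest to see on a small example. The sketch below runs the same table / as.data.frame / reorder steps on a few invented state names and prints the factor levels before and after.

```r
# Invented values standing in for custdata$state.of.res.
state <- c("California", "Texas", "California", "Ohio", "Texas", "California")

statesums <- table(state)
statef <- as.data.frame(statesums)
colnames(statef) <- c("state.of.res", "count")

levels_before <- levels(statef$state.of.res)   # alphabetical order

# reorder() sorts the factor levels by the value of count (ascending).
statef <- transform(statef, state.of.res = reorder(state.of.res, count))
levels_after <- levels(statef$state.of.res)

print(levels_before)
print(levels_after)
```

Because ggplot2 draws bars in factor-level order, changing the level order is what sorts the bars in the chart.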

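Spotting problems using graphics and visualization: relationships between two variables

6 STACKED BAR CHART: shows the distribution of var2 within each value of var1.
7 SIDE-BY-SIDE BAR CHART: places the var2 values next to each other for each value of var1.
8 FILLED BAR CHART: shows the relative proportion of var2 within bars of a fixed height.

    # stacked (default)
    ggplot(custdata) + geom_bar(aes(x=marital.stat, fill=health.ins))
    # side-by-side
    ggplot(custdata) + geom_bar(aes(x=marital.stat, fill=health.ins), position='dodge')
    # filled
    ggplot(custdata) + geom_bar(aes(x=marital.stat, fill=health.ins), position='fill')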
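The numbers behind these three charts can be sketched with a two-way table: the stacked and side-by-side charts show the raw counts, while the filled chart shows each row rescaled to proportions. The values below are invented.

```r
# Invented values standing in for marital.stat and health.ins.
marital <- c("Married", "Married", "Married",
             "Never Married", "Never Married", "Widowed")
insured <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)

# Counts of var2 within each value of var1 (stacked / side-by-side bars).
counts <- table(marital, insured)
print(counts)

# Row-wise proportions (filled bars): each row sums to 1.
props <- prop.table(counts, margin = 1)
print(props)
```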

Spotting problems using graphics and visualization: relationships between two variables

9 BAR CHART WITH FACETING: when a chart involves columns with a large number of categories, split it up and look at each category in its own panel.

    custdata2 <- subset(custdata, (custdata$age>0 & custdata$age 0))

(Part of the subset condition is missing from the transcript; the line is reproduced as it survives.)

    # side-by-side bar chart
    ggplot(custdata2) +
      geom_bar(aes(x=housing.type, fill=marital.stat), position='dodge') +
      theme(axis.text.x=element_text(angle=45, hjust=1))

    # faceted bar chart
    ggplot(custdata2) +
      geom_bar(aes(x=marital.stat), position='dodge', fill='darkgray') +
      facet_wrap(~housing.type, scales='free_y') +
      theme(axis.text.x=element_text(angle=45, hjust=1))

- hjust: horizontal justification
- scales: should the scales be free in one dimension? With the default (fixed), the distribution is almost impossible to make out.
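The filtering condition in the transcript is incomplete, so the following is only a sketch of the kind of filter that builds a cleaned subset before plotting. The upper age bound `max_age` and the `income > 0` condition are assumptions introduced for this example, and the data is invented.

```r
# Invented rows; some have implausible ages or non-positive incomes.
toy <- data.frame(
  age    = c(0, 25, 40, 146, 33),
  income = c(30000, -500, 52000, 41000, 0)
)

max_age <- 100  # hypothetical upper bound, not taken from the transcript

# Keep only rows with a plausible age and a positive income (assumed rule).
toy2 <- subset(toy, age > 0 & age < max_age & income > 0)
print(toy2)
```

Inside subset(), column names can be used directly, so the `toy$` prefix is not needed in the condition.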

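Spotting problems using graphics and visualization: relationships between two variables

10 LINE PLOT: shows the relationship between two variables. It is not useful, however, when the data points are not related to each other.

    x <- runif(100)
    y <- x^ *x
    ggplot(data.frame(x=x, y=y), aes(x=x, y=y)) + geom_line()

(The numeric terms in the expression for y are missing from the transcript; the line is reproduced as it survives.)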

Spotting problems using graphics and visualization: relationships between two variables

11 SCATTER PLOT + α: the relationship between two numeric variables. Q. How are age and income related?

    # correlation
    cor(custdata2$age, custdata2$income)

    ggplot(custdata2, aes(x=age, y=income)) +
      geom_point() +
      ylim(0, )

    ggplot(custdata2, aes(x=age, y=income)) +
      geom_point() +
      stat_smooth(method='lm') +
      ylim(0, )

(The upper limit in ylim(0, ) is missing from the transcript.)

- From the scatter plot alone, the relationship is hard to make out.
- stat_smooth(method='lm'): draws a line using the given smoothing method. se is TRUE by default, which adds a confidence band around the line.
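As a sketch of what the correlation and the fitted line summarize, the example below computes cor() and fits lm() on simulated data. The simulated relationship (intercept, slope, noise level) is entirely made up; stat_smooth(method='lm') draws a line of the kind that lm() estimates here.

```r
set.seed(1)

# Simulated data: income rises with age, plus noise (all values invented).
age    <- runif(200, min = 20, max = 80)
income <- 20000 + 500 * age + rnorm(200, sd = 15000)

r   <- cor(age, income)   # Pearson correlation
fit <- lm(income ~ age)   # straight-line fit

print(r)
print(coef(fit))

# For a simple linear regression the slope equals r * sd(y) / sd(x).
slope_from_r <- r * sd(income) / sd(age)
print(slope_from_r)
```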

Spotting problems using graphics and visualization: relationships between two variables

12 SMOOTHING CURVE

    ggplot(custdata2, aes(x=age, y=income)) +
      geom_point() +
      geom_smooth() +
      ylim(0, )

(The upper limit in ylim(0, ) is missing from the transcript.)

- geom_smooth(): a smoothed conditional mean.
- Up to about age 40, income increases; from about 55, it decreases.

A continuous variable against a boolean:

    ggplot(custdata2, aes(x=age, y=as.numeric(health.ins))) +
      geom_point(position=position_jitter(w=0.05, h=0.05)) +
      geom_smooth()
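A smoothed conditional mean can be sketched in base R with loess(), a local-regression smoother (geom_smooth() chooses its smoother automatically, and loess is what it uses for smaller datasets). The data below is simulated with an invented rise-then-fall shape purely so the curve has something to follow.

```r
set.seed(2)

# Simulated data: income rises until the mid-40s and falls after 55 (invented shape).
age    <- runif(300, min = 20, max = 80)
income <- 30000 + 1500 * pmin(age, 45) - 800 * pmax(age - 55, 0) +
          rnorm(300, sd = 8000)

fit <- loess(income ~ age)   # local regression with default settings

# Estimated average income at a few ages.
smoothed <- predict(fit, newdata = data.frame(age = c(25, 45, 75)))
print(smoothed)
```

Reading the smoothed values at several ages gives the same "increases, then decreases" story that the curve shows graphically.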

Spotting problems using graphics and visualization: relationships between two variables

13 HEXBIN PLOT: a 2-dimensional histogram.

    ggplot(custdata2, aes(x=age, y=income)) +
      geom_hex(binwidth=c(5, 10000)) +
      geom_smooth(color='white', se=FALSE) +
      ylim(0, )

(The upper limit in ylim(0, ) is missing from the transcript.)
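The idea of a 2-dimensional histogram can be sketched in base R by binning both variables and counting points per cell. Note the swap: this uses rectangular bins, not the hexagonal bins that geom_hex() draws. The bin widths (5 years, 10000 in income) follow the binwidth in the slide; the data and the income ceiling are invented.

```r
set.seed(3)

# Simulated data (all values invented); income is clipped to [0, 199999].
age    <- runif(500, min = 20, max = 80)
income <- pmin(pmax(rnorm(500, mean = 60000, sd = 30000), 0), 199999)

# Bin each variable, then cross-tabulate to get a count per 2-D cell.
age_bin    <- cut(age,    breaks = seq(20, 80, by = 5),         right = FALSE)
income_bin <- cut(income, breaks = seq(0, 200000, by = 10000),  right = FALSE)
counts2d   <- table(age_bin, income_bin)

print(dim(counts2d))   # number of age bins x number of income bins
print(max(counts2d))   # the densest cell
```

In the plotted version, the count in each cell is mapped to the fill color of the corresponding bin.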

Key points!

- Take the time to examine your data before modeling.
- summary(): helps you spot issues with data range, units, data type, and missing or invalid values.
- Visualization: helps you see the distribution of each variable and the relationships between variables.