Stat 251 (2009, Summer) Lab 2 TA: Yu, Chi Wai.



Review: 1) Basic commands in R; 2) Basic operations for a scalar and a vector: algebraic manipulations with a scalar and a vector, and algebraic manipulations with vectors; 3) Summary/descriptive statistics: mean, median, sd, var, quantile, IQR, etc.; 4) Graphics: scatterplot, histogram and boxplot.

Histogram. Advantages: detects outliers; shows the skewness/symmetry of the distribution; shows the modality of the distribution. Disadvantage: its shape depends on the choice of bin width (bandwidth).

Boxplot. Advantages: detects outliers; shows the skewness/symmetry of the distribution; shows the lower (1st), middle (2nd) and upper (3rd) quartiles. Disadvantage: cannot show the modality of the distribution.

x = c(rep(-2,5), rep(0, 3), rep(2, 5), rep(-1, 10), rep(1,10))
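A minimal sketch of why this vector is a useful test case: it has two peaks (at -1 and 1), which a histogram reveals but a boxplot hides. The break points passed to hist() are an assumption chosen for illustration and do not come from the slides.

```r
# Data from the slide: peaks at -1 and 1, a dip at 0
x <- c(rep(-2, 5), rep(0, 3), rep(2, 5), rep(-1, 10), rep(1, 10))

# Frequency of each distinct value
counts <- table(x)
print(counts)

par(mfrow = c(1, 2))                      # two plots side by side
hist(x, breaks = seq(-2.5, 2.5, by = 1))  # one bar per value (assumed breaks)
boxplot(x)                                # symmetric box, no hint of the two peaks
```

The boxplot only reports the quartiles, so two very different shapes can produce the same box.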

Skewness/Symmetry of a distribution. Symmetric: the left and right halves of the distribution mirror each other.

Skewness/Symmetry of a distribution. Asymmetric, left skewed: heavy tail on the left.

Skewness/Symmetry of a distribution. Asymmetric, right skewed: heavy tail on the right.

Outliers: an unusual point (or points) far away from the majority of the data. Possible causes: human mistake, measurement error. Outliers skew the distribution of the data. Example: x = c(rep(0,10), rep(2, 5), rep(-2, 5)) hist(x)

x[4] = -10

Boxplot to show potential outliers. Some small-valued observations are far away from the majority of the data: left skewed.

x[4] = 10

Boxplot to show potential outliers. Some large-valued observations are far away from the majority of the data: right skewed.
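The two outlier slides can be reproduced with the sketch below. It uses the vector from the Outliers slide and the two replacements x[4] = -10 and x[4] = 10; the names x_low, x_high, out_low and out_high are assumptions introduced here. boxplot.stats() is the base R helper that boxplot() relies on to decide which points to flag.

```r
# Vector from the Outliers slide
x <- c(rep(0, 10), rep(2, 5), rep(-2, 5))

# One copy with a very small value, one with a very large value
x_low  <- x; x_low[4]  <- -10
x_high <- x; x_high[4] <- 10

# Points that a boxplot would draw as potential outliers
out_low  <- boxplot.stats(x_low)$out
out_high <- boxplot.stats(x_high)$out
print(out_low)
print(out_high)

par(mfrow = c(1, 2))
boxplot(x_low,  main = "x[4] = -10")
boxplot(x_high, main = "x[4] = 10")
```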

Lab 2 outline: matrix manipulation; reading external data in R; data transformation; more techniques for graphics.

Matrix. Scalar: 0 dimensions, e.g. x = 1.3, 5, 200, … Vector: 1 dimension, e.g. x = c(4,0.35,0.9, 1.1, 5), where x[5] is the 5th element of x.

Matrix: 2 dimensions. x[1,2]: the element in the 1st row AND 2nd column, i.e. the (1,2)th element of x. x[,1]: the 1st column vector. x[3,]: the 3rd row vector.

Create a Matrix. Use matrix(); for help, type ?matrix or help(matrix). Example: x = matrix(c(3,5,6,3,4,9,2,1,7), ncol=3, nrow=3, byrow=TRUE). With byrow=TRUE the matrix is filled by rows.
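A short sketch of the matrix from the slide together with the indexing forms described earlier. The second matrix, x_col, is an assumption added here only to contrast filling by rows with R's default of filling by columns.

```r
# Matrix from the slide, filled row by row
x <- matrix(c(3, 5, 6, 3, 4, 9, 2, 1, 7), ncol = 3, nrow = 3, byrow = TRUE)
print(x)

x[1, 2]  # element in the 1st row and 2nd column
x[, 1]   # 1st column as a vector
x[3, ]   # 3rd row as a vector

# Same numbers with the default byrow = FALSE (filled column by column)
x_col <- matrix(c(3, 5, 6, 3, 4, 9, 2, 1, 7), ncol = 3, nrow = 3)
print(x_col)
```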

Matrix Manipulation x[-i,]: drop the ith row vector of x. x[-c(i,j),]: drop the ith and jth row vectors of x. x[,-i]: drop the ith column vector of x. x[,-c(i,j)]: drop the ith and jth column vectors of x.

Matrix Manipulation. Does x[-i,-j] drop the (i,j)th element of x?

No: x[-1,-3] removes all the data in the 1st row and the 3rd column, i.e. it applies x[-1,] AND x[,-3] together.
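The effect of negative indices can be checked on the 3 x 3 matrix from the Create a Matrix slide. This is a sketch; marking a single cell as NA is an assumption about what one might do instead, since a matrix cannot lose just one element and stay rectangular.

```r
x <- matrix(c(3, 5, 6, 3, 4, 9, 2, 1, 7), ncol = 3, nrow = 3, byrow = TRUE)

# Drops the whole 1st row and the whole 3rd column: a 2 x 2 matrix remains
sub <- x[-1, -3]
print(sub)

# Same result in two steps
sub2 <- x[-1, ][, -3]

# Possible alternative (assumption): mark only the (1,3) element as missing
y <- x
y[1, 3] <- NA
print(y)
```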

Read External Data. 1) Save the data set to the Z: drive; 2) Open R; 3) Change the working directory by clicking on [File] in the menu bar at the top left and then choosing the option [Change dir].

Read External Data. Two ways to read data into R: i) read.table(): reads a table of data, i.e. a data set with multiple columns and rows of data. ii) scan(): reads a single column of numerical data; with its default settings it will ONLY read that type of data set.

Read External Data. Caution!! Is there a header in the 1st row or not? With header: rain = read.table("rain.txt", header=TRUE). Without header: rain = read.table("rain.txt", header=FALSE). There is no option for a header if we use scan().

Separation? read.table(file, header = FALSE, sep = "", ...). sep: the field separator character that separates values on each line of the file. With the default setting sep = "", values are separated by white space (one or more spaces or tabs), so a file separated by single spaces can be read with the default or with sep=" ". Example file, separated by a single space:
25 28 14
26 29 28
27 22 30
28 31 27
29 25 33
30 20 11

Separation? read.table(file, header = FALSE, sep = ",", ...). sep: the field separator character that separates values on each line of the file. Example file, separated by commas, read with sep=",":
25,28,14
26,29,28
27,22,30
28,31,27
29,25,33
30,20,11

Separation? read.table(file, header = FALSE, sep = "\t", ...). sep: the field separator character that separates values on each line of the file. Example file, separated by tabs, read with sep="\t":
25	28	14
26	29	28
27	22	30
28	31	27
29	25	33
30	20	11
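The three separator settings can be tried without any course files. The sketch below writes the example values to temporary files; write_tmp() is a hypothetical helper defined here, not part of R or of the lab material.

```r
# The six rows from the slides, written with commas
rows <- c("25,28,14", "26,29,28", "27,22,30",
          "28,31,27", "29,25,33", "30,20,11")

# Hypothetical helper: write lines to a temporary file and return its path
write_tmp <- function(lines) {
  f <- tempfile(fileext = ".txt")
  writeLines(lines, f)
  f
}

f_comma <- write_tmp(rows)
f_space <- write_tmp(gsub(",", " ", rows))   # same values, single spaces
f_tab   <- write_tmp(gsub(",", "\t", rows))  # same values, tabs

d_comma <- read.table(f_comma, header = FALSE, sep = ",")
d_space <- read.table(f_space, header = FALSE)            # default sep = ""
d_tab   <- read.table(f_tab,   header = FALSE, sep = "\t")

print(d_comma)  # columns are named V1, V2, V3 because there is no header
```

All three calls should give the same 6 x 3 data frame; only the separator argument has to match the file.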

Data manipulation. Back to our data set: the column Volume is our focus. It is the annual rainfall volume in Sydney, Australia over 49 years. There are 3 methods to focus our attention on Volume.

Data Manipulation. Option A: the function attach(). Option B: specify a column by "$". Option C: manipulate the dataset as if it's a matrix.

Option A: attach(). attach(rain) attaches the columns in rain to pseudo-variables.

Option A: attach(). Now type > attach(rain). 4 pseudo-variables will be created, namely ID, Year, Volume, and State. Typing > Volume then gives the data in the column Volume in vector form.

Option A: attach(). Attach only ONE data set at a time, so that columns with the same name do not mask each other. To attach another data set, first detach the previous data set by > detach(rain), then attach the new data set by > attach(MyNewDataset).

Option B: $. The variable rain belongs to the class data.frame. A data frame is similar to a matrix, but it can store different types of data in different columns, e.g. some columns contain numerical values, others contain letters, strings, etc.

rain (first 5 rows):
ID  Year  Volume     State
1   1937  387.93104  …
2   1938  157.51727  …
3   1939  102.96855  …
4   1940  114.0309   …
5   1941  131.01787  …

Option B: $. To bring up a specific column, we use rain$Volume, where rain is the data frame and Volume is the name of the column.

Option B: $. We can store the column in a new variable: volume = rain$Volume

Option B: $. The function names() returns the names of the columns in a data frame: > names(rain) [1] "ID" "Year" "Volume" "State"
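Option B can be tried on a small stand-in for the rain data. The ID, Year and Volume values below are the five rows shown on the slides; the State values are hypothetical, since the per-row values are not recoverable from the transcript.

```r
# Small stand-in for the rain data set (State values are hypothetical)
rain <- data.frame(
  ID     = 1:5,
  Year   = 1937:1941,
  Volume = c(387.93104, 157.51727, 102.96855, 114.0309, 131.01787),
  State  = c(1, 2, 1, 2, 1)
)

class(rain)   # "data.frame"
names(rain)   # column names

# Option B: pick a column by name and store it in a new variable
volume <- rain$Volume
print(volume)

# Option C gives the same vector by position
identical(volume, rain[, 3])
```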

Data Manipulation. rain = read.table("rain.txt", header=TRUE). rain has 4 columns of data (ID, Year, Volume, State) and 49 rows of data. To exclude the 3rd and 5th-10th rows of the volume data: v = rain[,3] and then v[-c(3,5:10)], OR rain[-c(3,5:10), 3]. To keep the first 10 rows of the volume data: v[1:10] OR rain[1:10,3].
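Since rain.txt is not available here, the row selections can be sketched on a made-up data frame with the same shape (49 rows, 4 columns). Every value below is a placeholder assumption, chosen only so the indexing is easy to follow.

```r
# Placeholder data with the same layout as rain (values are NOT the real data)
rain <- data.frame(
  ID     = 1:49,
  Year   = 1937:1985,
  Volume = seq(100, 580, by = 10),
  State  = rep(c(1, 2), length.out = 49)
)

v <- rain[, 3]  # Option C: the 3rd column as a vector

# Exclude the 3rd and the 5th to 10th observations (7 rows dropped)
kept_a <- v[-c(3, 5:10)]
kept_b <- rain[-c(3, 5:10), 3]

# Keep only the first 10 observations
first_a <- v[1:10]
first_b <- rain[1:10, 3]

length(kept_a)   # 42
length(first_a)  # 10
```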

Data Transformation. Removing outliers may make the data distribution symmetric, and symmetrically distributed data make the analysis easier. Common transformations of data: square root; square; natural logarithm; exponential, etc.

WARNING! Please don’t think that we can always deal with an outlier by simply removing it

Multiple Graphs. How to place multiple plots on one display: par(mfrow=c(r,c)) brings up a display with r rows and c columns.

Find a better function for transforming the data: par(mfrow=c(2,2)) hist(v^2) hist(sqrt(v)) hist(log(v)) hist(exp(v))
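A sketch of comparing transformations on one display. The vector v below is a made-up right-skewed example (an assumption, not the rainfall data), constructed so that the square root is exactly symmetric.

```r
# Made-up right-skewed data: squares of 1..49, scaled down
v <- (1:49)^2 / 10

# Mean above median is a simple sign of right skew
mean(v)
median(v)

par(mfrow = c(2, 2))      # 2 rows and 2 columns of plots
hist(v,       main = "v")
hist(sqrt(v), main = "sqrt(v)")
hist(log(v),  main = "log(v)")
hist(v^2,     main = "v^2")

# After the square root, mean and median coincide
mean(sqrt(v))
median(sqrt(v))
```

Which transformation works best depends on the data; here the square root is best only because of how v was built.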

hist(v)

Sorting. Sort a vector into ascending (or descending) order by using sort(). Descending: sv = sort(v, decreasing = T). Remove the largest observation of v: sv1 = sv[-1]

par(mfrow=c(2,2)) hist(sv1^2) hist(sqrt(sv1)) hist(log(sv1)) hist(exp(sv1))

Sorting. Ascending: sv2 = sort(v, decreasing = F). Remove the three smallest observations of v: sv2[-(1:3)] or sv2[-c(1:3)]
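The two sorting slides can be checked on a tiny vector. The values of v here are an assumption for illustration.

```r
v <- c(5, 1, 9, 3, 7, 2)  # illustrative values

# Descending order, then drop the largest observation
sv  <- sort(v, decreasing = TRUE)
sv1 <- sv[-1]

# Ascending order, then drop the three smallest observations
sv2 <- sort(v, decreasing = FALSE)
sv3 <- sv2[-(1:3)]

print(sv1)
print(sv3)
```

Note that the parentheses in -(1:3) matter: -1:3 would mean the sequence from -1 to 3, which is not a valid index.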

Side by side Boxplot. boxplot(rain[,3]~rain[,4]) gets a boxplot of the volume data BY State. Volume: rain[,3], the 3rd column vector of the dataset rain. State: rain[,4], the 4th column vector of the dataset rain.

Summary statistics How to find the summary statistics of volume data by state (=1 or 2)? What is the sample mean of the volume data for state=1 ? What is the sample variance of the volume data for state=2 ?

What is the sample mean of the volume data for state = 1? s = rain[,4]; v1 = rain[s==1, 3] OR v1 = v[s==1]. Then use mean(v1), mean(rain[s==1, 3]) or mean(v[s==1]).

OR by(v, s, mean): get the sample means of v by s. by(v, s, median): get the sample medians of v by s. by(v, s, var): get the sample variances of v by s.
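Both routes to group-wise statistics can be compared on a small example. The vectors v and s below are made-up values (an assumption); tapply() is used alongside by() only because its result is a plain named vector that is easy to inspect.

```r
v <- c(10, 20, 30, 40, 50, 60)  # illustrative volumes
s <- c(1, 2, 1, 2, 1, 2)        # illustrative state labels

# Route 1: logical indexing
v1 <- v[s == 1]
v2 <- v[s == 2]
mean(v1)  # sample mean for state 1
var(v2)   # sample variance for state 2

# Route 2: by() applies a function within each group
print(by(v, s, mean))
print(by(v, s, var))

# tapply() gives the same numbers as a named vector
group_means <- tapply(v, s, mean)
group_vars  <- tapply(v, s, var)
```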

More Graphics x = seq(-3.14,3.14, by=0.2) plot(x, cos(x))

x = seq(-3.14,3.14, by=0.2) plot(x, cos(x), type = "l")

type: the type of plot. Possible types are "p" for points, "l" for lines, "b" for both, "c" for the lines part alone of "b", "o" for both ‘overplotted’, "h" for ‘histogram’ like (or ‘high-density’) vertical lines, "s" for stair steps, "n" for no plotting.

x = seq(-3.14,3.14, by=0.2) plot(x, cos(x), type = "l", lty = 2)

lty: the line type. Line types can either be specified as an integer (0=blank, 1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash".

x = seq(-3.14,3.14, by=0.2) plot(x, cos(x), xlab="variable x", ylab="function of x", main = "Stat 305", type="l", lty=2, col = 2)

col: the colors for lines and points. "col" can either be specified as an integer, col=1 (black), col=2 (red), col=3 (green), col=4 (blue), etc., or as one of the character strings col="black", col="red", col="green", col="blue", col="white", col="brown", etc.

Use abline() to add a line to the existing picture. A line y = a + bx: abline(a=2, b=3). Vertical line: abline(v=10). Horizontal line: abline(h=-5).

x = seq(-3.14,3.14, by=0.2) plot(x, cos(x), xlab="variable x", ylab="function of x", main = "Stat 305", type="l", lty=2, col=2) abline(v=0, lty=2, col=3)
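Put together, the plotting options from the last few slides look like this as a runnable script. The plot() and first abline() calls follow the slides; the extra horizontal line at h = 0 is an assumption added to show a second use of abline().

```r
x <- seq(-3.14, 3.14, by = 0.2)

# Dashed red cosine curve with axis labels and a title
plot(x, cos(x),
     xlab = "variable x", ylab = "function of x", main = "Stat 305",
     type = "l",  # lines instead of points
     lty  = 2,    # dashed
     col  = 2)    # red

# Add reference lines to the existing plot
abline(v = 0, lty = 2, col = 3)  # vertical dashed green line at x = 0
abline(h = 0, lty = 3, col = 4)  # horizontal dotted blue line (added here)
```

abline() only draws on an existing plot, so it has to come after plot().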

Exercise for students. Go to our course website at www.slate.ubc.ca, then Stat251 -> Mike -> Lab materials, and click "summary of R commands".

matrix(0, ncol=10, nrow=5)