Stat 251 (2009, Summer) Lab 2 TA: Yu, Chi Wai.


Review
1) Basic commands in R;
2) Basic operations for a scalar and a vector: algebraic manipulations with a scalar and a vector; algebraic manipulations with vectors;
3) Summary/descriptive statistics: mean, median, sd, var, quantile, IQR, etc.;
4) Graphics: scatterplot, histogram and boxplot.

Histogram
Adv: Detect outliers; Show the skewness/symmetry of the distribution; Show the modality of the distribution.
Disadv: Depends on the choice of bin width.

Boxplot
Adv: Detect outliers; Show the skewness/symmetry of the distribution; Show lower (1st), mid (2nd) and upper (3rd) quartiles.
Disadv: Cannot show the modality of the distribution.

x = c(rep(-2,5), rep(0, 3), rep(2, 5), rep(-1, 10), rep(1,10))
[Figure: plot of x; labelled values -2, -1, 1, 2]

Skewness/Symmetry of a distribution
Symmetric:

Asymmetric, left skewed: heavy tail on the left

Asymmetric, right skewed: heavy tail on the right

Outliers
An outlier is an unusual point (or points) far away from the majority of the data.
Possible causes: human mistake, measurement error.
Effect: outliers skew the distribution of the data.
x = c(rep(0,10), rep(2, 5), rep(-2, 5))
hist(x)

x[4] = -10

Boxplot to show potential outliers
Some small-valued observations are far away from the majority of the data. Left skewed

x[4] = 10

Boxplot to show potential outliers
Some large-valued observations are far away from the majority of the data. Right skewed

Lab 2
Matrix manipulation
Read external data in R
Data transformation
More techniques for graphics

Matrix
Scalar: 0 dimensions, e.g. x = 1.3, 5, 200, ...
Vector: 1 dimension, e.g. x = c(4, 0.35, 0.9, 1.1, 5)
x[5]: the 5th element of x

Matrix: 2 dimensions (rows and columns)
x[1,2]: the element in the 1st row AND 2nd column, i.e. the (1,2)th element of x
x[,1]: the 1st column vector
x[3,]: the 3rd row vector

Create a Matrix
Use matrix(). For help, type ?matrix or help(matrix).
x = matrix(c(3,5,6,3,4,9,2,1,7), ncol=3, nrow=3, byrow=T)
With byrow=T the matrix is filled by rows.
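As a minimal sketch of the creation and indexing rules, the following builds the same 3 x 3 matrix and extracts an element, a column and a row. The values in the comments follow directly from filling the matrix by rows.

```r
# Build the 3 x 3 matrix from the slide, filled row by row.
x <- matrix(c(3, 5, 6,
              3, 4, 9,
              2, 1, 7), nrow = 3, ncol = 3, byrow = TRUE)

x[1, 2]   # element in the 1st row and 2nd column: 5
x[, 1]    # 1st column vector: 3 3 2
x[3, ]    # 3rd row vector: 2 1 7
dim(x)    # number of rows and columns: 3 3
```

TRUE is spelled out here instead of T because T is an ordinary variable that can be overwritten, whereas TRUE cannot.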

Matrix Manipulation
x[-i,]: drop the ith row vector of x.
x[-c(i,j),]: drop the ith and jth row vectors of x.
x[,-i]: drop the ith column vector of x.
x[,-c(i,j)]: drop the ith and jth column vectors of x.

Matrix Manipulation
Does x[-i,-j] drop the (i,j)th element of x? No.
x[-i,-j] removes the whole ith row AND the whole jth column.

Example: x[-1,-3] removes the data in the 1st row and the 3rd column.
It is the same as applying x[-1,] AND x[,-3] together.
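The point about x[-1,-3] can be checked with a short sketch on the 3 x 3 matrix used earlier in the lab. The variable names y and z are introduced here only for illustration.

```r
x <- matrix(c(3, 5, 6,
              3, 4, 9,
              2, 1, 7), nrow = 3, byrow = TRUE)

y <- x[-1, -3]   # drops the whole 1st row AND the whole 3rd column
dim(y)           # 2 2
y                # rows 2-3 and columns 1-2 of x

# Same result as dropping the row first and the column second.
z <- x[-1, ][, -3]
identical(y, z)  # TRUE
```

A matrix must stay rectangular, so a single cell cannot be deleted on its own.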

Read External Data
1) Save the data set into the Z: drive;
2) Open R;
3) Change the directory by clicking on [File] in the menu bar at the top left and then choosing the option [Change dir].

Read External Data
Two ways to read data into R:
i) read.table(): reads a table of data, i.e. a data set with multiple columns and rows.
ii) scan(): reads a single column of numerical data into a vector; with its default settings it expects numeric values ONLY.

Read External Data
Caution!! Is there a header in the 1st row or not?
With header: rain = read.table("rain.txt", head=T)
Without header: rain = read.table("rain.txt", head=F)
There is no option for a header if we use scan().
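To see what the header option changes, here is a small sketch that writes a stand-in file and reads it both ways. The file is a temporary one created for illustration, and its Volume and State values are made up; they are not the contents of the real rain.txt.

```r
# Hypothetical stand-in for rain.txt: a small file with a header row.
# The Volume and State values are invented for illustration.
path <- tempfile(fileext = ".txt")
writeLines(c("ID Year Volume State",
             "1 1937 350.5 2",
             "2 1938 410.2 1",
             "3 1939 298.7 1"), path)

with_header <- read.table(path, header = TRUE)
names(with_header)   # "ID" "Year" "Volume" "State"
nrow(with_header)    # 3

# Reading the same file with header = FALSE treats the names as a data row.
no_header <- read.table(path, header = FALSE)
names(no_header)     # "V1" "V2" "V3" "V4"
nrow(no_header)      # 4
```

Forgetting the header option on a file that has one also stops the columns from being read as numbers, because the first row contains text.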

Separation: read.table()
read.table(file, header = FALSE, sep = "", ...)
sep: the field separator character that separates values on each line of the file.

Separated by white space: sep="" (default setting; one or more spaces, tabs or newlines count as a separator).

Separated by commas: sep=","
25,28,14
26,29,28
27,22,30
28,31,27
29,25,33
30,20,11

Separated by tabs: sep="\t"
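The effect of sep can be sketched with the comma-separated values from the slide. The temporary files and the names d, d_default and d_tab are assumptions made for this example.

```r
# The comma-separated values from the slide, written to a temporary file.
csv_path <- tempfile(fileext = ".txt")
writeLines(c("25,28,14", "26,29,28", "27,22,30",
             "28,31,27", "29,25,33", "30,20,11"), csv_path)

d <- read.table(csv_path, header = FALSE, sep = ",")
dim(d)           # 6 3

# With the default separator (white space) the commas are not split,
# so each line is read as a single field.
d_default <- read.table(csv_path, header = FALSE)
dim(d_default)   # 6 1

# Tab-separated copy of the same data, read back with sep = "\t".
tab_path <- tempfile(fileext = ".txt")
write.table(d, tab_path, sep = "\t", row.names = FALSE, col.names = FALSE)
d_tab <- read.table(tab_path, header = FALSE, sep = "\t")
identical(dim(d_tab), dim(d))   # TRUE
```

Checking dim() right after reading is a quick way to notice a wrong separator.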

Data manipulation
Back to our data set: the column Volume is our focus.
It is the annual rainfall volume in Sydney, Australia over 49 years.
There are 3 methods to focus our attention on Volume.

Data Manipulation
Option A: Function attach()
Option B: Specify a column by "$"
Option C: Manipulate the dataset as if it's a matrix

Option A : attach()
attach(rain)
Attach the columns in rain to pseudo-variables. Pass the data frame itself, without quotes.

Option A : attach()
Now type
> attach(rain)
4 pseudo-variables will be created, namely ID, Year, Volume, and State.
> Volume
We get the data in the column Volume in vector form.

Option A : attach()
Attach only ONE data set at a time: columns with the same name in different attached data sets would mask each other.
To attach another data set, first detach the previous one by
> detach("rain")
Then attach the new data set
> attach(MyNewDataset)
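A minimal sketch of the attach/detach cycle follows, using a hypothetical two-row data frame in place of the real rain data. The name v_attached is introduced only to keep a copy of the column.

```r
# Hypothetical two-row data frame standing in for rain (values invented).
rain <- data.frame(ID = 1:2, Year = c(1937, 1938),
                   Volume = c(350.5, 410.2), State = c(2, 1))

attach(rain)              # columns become visible by name
v_attached <- Volume      # the Volume column as a vector
"rain" %in% search()      # TRUE: rain is now on the search path

detach("rain")            # remove it before attaching another data set
"rain" %in% search()      # FALSE
```

search() lists everything currently attached, which helps when it is unclear whether a data set was already detached.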

Option B : $
The variable rain belongs to the class data.frame (a data frame).
A data frame is similar to a matrix, but it can store different types of data in different columns, e.g. some columns contain numerical values, others contain letters, strings, etc.

[Table: first rows of rain, with columns ID (1, 2, 3, 4, 5, ...), Year (1937, 1938, 1939, 1940, 1941, ...), Volume (e.g. 387.93104) and State (1 or 2)]

Option B : $
To bring up a specific column, we go by rain$Volume
Here rain is the data frame and Volume is the name of the column.

Option B : $
We can store the column in a new variable:
volume = rain$Volume

Option B : $
Function: names() returns the names of the columns in a data frame.
> names(rain)
[1] "ID" "Year" "Volume" "State"
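The $ operator and names() can be tried on a small stand-in for rain. The three rows below are invented for illustration; only the column names come from the lab.

```r
# Hypothetical stand-in for rain (values invented).
rain <- data.frame(ID = 1:3, Year = c(1937, 1938, 1939),
                   Volume = c(350.5, 410.2, 298.7), State = c(2, 1, 1))

names(rain)                          # "ID" "Year" "Volume" "State"
volume <- rain$Volume                # store the column in a new variable
is.data.frame(rain)                  # TRUE
identical(rain$Volume, rain[, 3])    # TRUE: same column, selected by position
```

Selecting by name with $ keeps working if the column order changes, which selecting by position does not.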

Data Manipulation (Option C)
rain = read.table("rain.txt", head=T)
rain: 4 columns of data (ID, Year, Volume, State) and 49 rows of data.

Exclude the 3rd and 5th-10th rows:
v = rain[,3]
v[-c(3,5:10)]
OR
rain[-c(3,5:10), 3]

Keep the first 10 rows of volume data:
rain[1:10,3]
OR
v[1:10]
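The two equivalent ways of dropping or keeping rows can be checked on a simulated stand-in for rain. The 49 Volume values below are random numbers, and the Year and State columns are assumptions chosen only to give the data frame the right shape.

```r
# Hypothetical stand-in for rain: 49 rows, all values simulated.
set.seed(1)
rain <- data.frame(ID = 1:49, Year = 1937:1985,
                   Volume = round(runif(49, 200, 600), 1),
                   State = rep(c(1, 2), length.out = 49))

v <- rain[, 3]                      # the Volume column as a vector

a <- v[-c(3, 5:10)]                 # drop the 3rd and 5th-10th values
b <- rain[-c(3, 5:10), 3]           # same rows dropped via the data frame
identical(a, b)                     # TRUE
length(a)                           # 42, i.e. 49 minus the 7 dropped rows

first10 <- rain[1:10, 3]            # keep the first 10 rows
identical(first10, v[1:10])         # TRUE
```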

Data Transformation
Removing outliers may make the data distribution symmetric.
Symmetrically distributed data make analysis easier.
Common transformations of data: square root; square; natural logarithm; exponential, etc.

WARNING! Please don’t think that we can always deal with an outlier by simply removing it

Multiple Graphs
How to place multiple plots on one display:
par(mfrow=c(r,c))
This brings up a display with r rows and c columns.

Find a better function for transforming the data.
par(mfrow=c(2,2))
hist(v^2)
hist(sqrt(v))
hist(log(v))
hist(exp(v))
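Here is a self-contained sketch of the 2 x 2 layout. It uses simulated right-skewed data rather than the lab's Volume column, and it draws on a null PDF device so that it runs without opening a window. Both choices are assumptions for illustration; in an interactive session the pdf(NULL) and dev.off() lines can be dropped.

```r
# Simulated right-skewed data standing in for v (assumption, not the lab data).
set.seed(251)
v <- rexp(49, rate = 1 / 2)

pdf(NULL)                          # null device: nothing is written to disk
old_par <- par(mfrow = c(2, 2))    # 2 rows x 2 columns; remember the old settings
hist(v,       main = "v")
hist(sqrt(v), main = "sqrt(v)")
hist(log(v),  main = "log(v)")
hist(v^2,     main = "v^2")
par(old_par)                       # back to one plot per display
dev.off()
```

Saving the value returned by par() and restoring it afterwards keeps later plots from landing in the 2 x 2 grid by accident.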

hist(v)

Sorting
Sort a vector into ascending (or descending) order by using sort().
Remove the largest observation of v (sort in descending order, then drop the first element):
sv = sort(v, decreasing = T)
sv1 = sv[-1]

par(mfrow=c(2,2))
hist(sv1^2)
hist(sqrt(sv1))
hist(log(sv1))
hist(exp(sv1))

Sorting
Remove the three smallest observations of v (sort in ascending order, then drop the first three elements):
sv2 = sort(v, decreasing = F)
sv2[-(1:3)]
or
sv2[-c(1:3)]
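Both sorting recipes can be traced on a short made-up vector. The six numbers below are arbitrary and chosen so that the results are easy to read off.

```r
v <- c(12, 3, 45, 7, 30, 1)            # arbitrary example values

sv  <- sort(v, decreasing = TRUE)      # 45 30 12 7 3 1
sv1 <- sv[-1]                          # largest value removed: 30 12 7 3 1

sv2 <- sort(v)                         # ascending is the default: 1 3 7 12 30 45
sv3 <- sv2[-(1:3)]                     # three smallest removed: 12 30 45
identical(sv2[-(1:3)], sv2[-c(1:3)])   # TRUE: both forms are equivalent
```

The parentheses in -(1:3) matter: -1:3 would mean the sequence from -1 to 3, which is not a valid index.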

Side by side Boxplot
Get a boxplot of the volume data BY State.
Volume: rain[,3], the 3rd column vector of the dataset rain
State: rain[,4], the 4th column vector of the dataset rain
boxplot(rain[,3]~rain[,4])
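The same grouped boxplot can also be written with column names in the formula. The sketch below uses a hypothetical six-row data frame with invented values and draws on a null PDF device so it runs without opening a window.

```r
# Hypothetical stand-in for rain, with only the two columns needed here.
rain <- data.frame(Volume = c(350.5, 410.2, 298.7, 520.1, 380.0, 445.3),
                   State  = c(2, 1, 1, 2, 1, 2))

pdf(NULL)                                   # null device; omit interactively
b <- boxplot(Volume ~ State, data = rain,   # one box per value of State
             xlab = "State", ylab = "Volume")
dev.off()

b$names   # group labels: "1" "2"
b$n       # number of observations per group: 3 3
```

boxplot() returns its summary invisibly, so storing it in b gives access to the group labels and counts behind the picture.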

Summary statistics How to find the summary statistics of volume data by state (=1 or 2)? What is the sample mean of the volume data for state=1 ? What is the sample variance of the volume data for state=2 ?

What is the sample mean of the volume data for state = 1?
s = rain[,4]
v1 = rain[s==1, 3]
OR
v1 = v[s==1]
Then use mean(v1), mean(rain[s==1, 3]) or mean(v[s==1]).

OR
by(v, s, mean): get the sample means of v by s.
by(v, s, median): get the sample medians of v by s.
by(v, s, var): get the sample variances of v by s.
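The subsetting approach can be cross-checked against a grouped summary. This sketch uses six invented values, and it swaps in tapply(), a base R function that returns the per-group results as a plain named vector, in place of by().

```r
# Invented volume values and state labels, for illustration only.
v <- c(350.5, 410.2, 298.7, 520.1, 380.0, 445.3)
s <- c(2, 1, 1, 2, 1, 2)

v1 <- v[s == 1]               # volumes for state 1: 410.2 298.7 380.0
v2 <- v[s == 2]               # volumes for state 2: 350.5 520.1 445.3
mean(v1)                      # sample mean for state 1
var(v2)                       # sample variance for state 2

means <- tapply(v, s, mean)   # one mean per state, named "1" and "2"
vars  <- tapply(v, s, var)    # one variance per state
```

The entries of means and vars are ordered by the sorted values of s, so the first entry belongs to state 1.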

More Graphics
x = seq(-3.14,3.14, by=0.2)
plot(x, cos(x))

x = seq(-3.14,3.14, by=0.2)
plot(x, cos(x), type = "l")

type: the type of plot. Possible types are "p" for points,
"l" for lines, "b" for both, "c" for the lines part alone of "b", "o" for both ‘overplotted’, "h" for ‘histogram’ like (or ‘high-density’) vertical lines, "s" for stair steps, "n" for no plotting.

x = seq(-3.14,3.14, by=0.2)
plot(x, cos(x), type = "l", lty = 2)

lty: the line type. Line types can be specified either as an integer
(0=blank, 1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash)
or as one of the character strings "blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash".

x = seq(-3.14,3.14, by=0.2)
plot(x, cos(x), xlab="variable x", ylab="function of x", main = "Stat 305", type="l", lty=2, col = 2)

col: the colors for lines and points.
col can be specified either as an integer, e.g. col=1 (black), col=2 (red), col=3 (green), col=4 (blue), etc.,
or as one of the character strings, e.g. col="black", col="red", col="green", col="blue", col="white", col="brown", etc.

Use abline()
Add a line to the existing picture.
A line y = a + bx: abline(a=2, b=3)
Vertical line: abline(v=10)
Horizontal line: abline(h=-5)

x = seq(-3.14,3.14, by=0.2)
plot(x, cos(x), xlab="variable x", ylab="function of x", main = "Stat 305", type="l", lty=2, col=2)
abline(v=0, lty=2, col=3)

Exercise for students
Go to our course website: Stat251 -> Mike -> Lab materials.
Click "summary of R commands".

matrix(0, ncol=10, nrow=5)
