Session 3: More features of R and the Central Limit Theorem
Class web site: Statistics for Microarray Data Analysis

Session 3: More features of R and the Central Limit Theorem
Class web site: Statistics for Microarray Data Analysis with R
> tumor.info <- data.frame(localisation, tumorsize, progress)
> rownames(tumor.info) <- c("XX348","XX234","XX987")
> tumor.info$tumorsize
[1]  6.3  8.0 10.0

Today’s Outline
Further features of the R language
Preliminary data analysis exercise
Central Limit Theorem (CLT)
CLT exercise
Some of the material included here was adapted from other sources and is used by permission.

R: factors
Categorical variables in R should be specified as factors
Factors can take on a limited number of values, called levels
Levels of a factor may have a natural order
Functions in R for creating factors: factor(), ordered()
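A brief sketch (reusing the localisation values from the tumor.info example; not part of the original slides):
> localisation <- c("proximal", "distal", "proximal")
> loc <- factor(localisation)           # unordered factor with two levels
> levels(loc)
[1] "distal"   "proximal"
> grade <- ordered(c("low", "high", "low"), levels = c("low", "high"))   # levels have a natural order
> grade
[1] low  high low
Levels: low < high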

R: data frames (review)
data frame: the type of R object normally used to store a data set
A data frame is a rectangular table with rows and columns
–data within each column has the same type (e.g. number, character, logical)
–different columns may have different types
Example:
> tumor.info
      localisation tumorsize progress
XX348     proximal       6.3    FALSE
XX234       distal       8.0     TRUE
XX987     proximal      10.0    FALSE

R: making data frames
Data frames can be created in R by importing a data set
A data frame can also be created from pre-existing variables
Example:
> localisation <- c("proximal","distal","proximal")
> tumorsize <- c(6.3,8,10)
> progress <- c(FALSE,TRUE,FALSE)
> tumor.info <- data.frame(localisation,tumorsize,progress)
> rownames(tumor.info) <- c("XX348","XX234","XX987")
> tumor.info$tumorsize
[1]  6.3  8.0 10.0

R: more on subsetting
> tumor.info[c(1,3),]                                   # subset rows by a vector of indices
      localisation tumorsize progress
XX348     proximal       6.3    FALSE
XX987     proximal      10.0    FALSE
> tumor.info[c(TRUE,FALSE,TRUE),]                       # subset rows by a logical vector
      localisation tumorsize progress
XX348     proximal       6.3    FALSE
XX987     proximal      10.0    FALSE
> tumor.info$localisation                               # subset a column
[1] "proximal" "distal"   "proximal"
> tumor.info$localisation=="proximal"                   # comparison resulting in a logical vector
[1]  TRUE FALSE  TRUE
> tumor.info[ tumor.info$localisation=="proximal", ]    # subset the selected rows
      localisation tumorsize progress
XX348     proximal       6.3    FALSE
XX987     proximal      10.0    FALSE

R: loops
Used when the same or similar tasks need to be performed multiple times in an iterative fashion
Examples:
> for (i in 1:10) {
+   print(i*i)
+ }
> i = 1
> while (i <= 10) {
+   print(i*i)
+   i = i + sqrt(i)
+ }
Explicit loops such as these should be avoided where possible
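For instance, the squares printed by the for loop above can be obtained with a single vectorised expression (a small sketch, not from the original slides):
> (1:10)^2       # the same squares, with no explicit loop
 [1]   1   4   9  16  25  36  49  64  81 100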

R: lapply, sapply
Used when the same or similar tasks need to be performed for all elements of a list (or for all columns of a data frame)
These implicit loops are generally faster than explicit ‘for’ loops
lapply(the.list, the.function)
–the.function is applied to each element of the.list
–the result is a list whose elements are the individual results of the.function
sapply(the.list, the.function)
–like lapply, but tries to simplify the result by converting it into a vector or array of appropriate size
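A short illustration (the list below is made up for this sketch, not taken from the slides):
> num.list <- list(a = 1:3, b = c(6.3, 8, 10))
> lapply(num.list, mean)      # result is a list
$a
[1] 2
$b
[1] 8.1
> sapply(num.list, mean)      # result simplified to a named vector
  a   b
2.0 8.1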

R: apply
apply(array, margin, the.function)
–applies the.function along the dimension of array specified by margin (margin 1 = rows, margin 2 = columns)
–the result is a vector or matrix of the appropriate size
Example:
> apply(x, 1, sum)   # row sums of the matrix x
> apply(x, 2, sum)   # column sums of the matrix x
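To make the example concrete, here is a runnable version with made-up values for x (the numbers on the original slide are not recoverable):
> x <- matrix(1:12, nrow = 4, ncol = 3)
> x
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
> apply(x, 1, sum)
[1] 15 18 21 24
> apply(x, 2, sum)
[1] 10 26 42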

R: sweep and scale
sweep(...) removes a summary statistic (e.g. the median) from the rows or columns of an array
Example: subtract column medians
> col.med <- apply(my.data, 2, median)
> sweep(my.data, 2, col.med)
scale(...) centers and/or rescales the columns of a matrix
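A small made-up example (my.data here is just a 3 x 2 matrix invented for this sketch):
> my.data <- matrix(c(1, 4, 7, 2, 6, 10), nrow = 3)
> col.med <- apply(my.data, 2, median)
> col.med
[1] 4 6
> sweep(my.data, 2, col.med)      # each column has its median subtracted
     [,1] [,2]
[1,]   -3   -4
[2,]    0    0
[3,]    3    4
> scale(my.data)                  # centres each column and rescales it to unit SD (output not shown)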

R: importing and exporting data (review)
There are many ways to get data into and out of R
One straightforward way is to use tab-delimited text files (e.g. save an Excel sheet as tab-delimited text for easy import into R)
Useful R functions: read.delim(), read.table(), read.csv(), write.table()
Example:
> x <- read.delim("filename.txt")
> write.table(x, file = "x.txt", sep = "\t")

R: introduction to object orientation
Primitive (or atomic) data types in R are:
–numeric (integer, double, complex)
–character
–logical
–function
From these, vectors, arrays, and lists can be built
An object is an abstract term for anything that can be assigned to a variable
Components of objects are called slots
Example: a microarray experiment
–probe intensities
–patient data (tissue location, diagnosis, follow-up)
–gene data (sequence, IDs, annotation)
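An illustrative sketch (the experiment list below is invented for this example, not from the slides):
> mode(6.3)
[1] "numeric"
> mode("proximal")
[1] "character"
> mode(FALSE)
[1] "logical"
> experiment <- list(intensities = matrix(0, nrow = 2, ncol = 3),   # probe intensities
+                    patient = tumor.info)                          # patient data
> class(experiment)
[1] "list"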

R: classes and generic functions
Object-oriented programming aims to create coherent systems of data objects and methods that work on them
In general, there is a class of data objects and a (print, plot, etc.) method for that class
Generic functions, such as print, act differently depending on the class of their argument
This means that we don’t need to worry about a lot of the programming details
In R, an object has a (character vector) class attribute which determines the mode of action of generic functions
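A minimal sketch of how a class attribute drives dispatch (the class name tumorsize is hypothetical, not from the slides):
> x <- c(6.3, 8, 10)
> class(x) <- "tumorsize"                    # attach a class attribute
> print.tumorsize <- function(obj, ...) {    # a print method for that class
+   cat("Tumor sizes (cm):", unclass(obj), "\n")
+ }
> print(x)                                   # the generic print() dispatches to print.tumorsize
Tumor sizes (cm): 6.3 8 10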

Exercises: Bittner et al. dataset
You should have downloaded the dataset gene_list-Cutaneous_Melanoma.xls from the web
Use the handout as a guide to get this dataset into R and do some preliminary analyses
If you do not have this dataset, you can use your own data
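One possible way to start, assuming the Excel sheet has been saved as a tab-delimited text file (the file name below is hypothetical; follow the handout for the exact steps):
> melanoma <- read.delim("gene_list-Cutaneous_Melanoma.txt")
> dim(melanoma)              # number of rows and columns
> summary(melanoma[, 1:5])   # quick look at the first few columns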

Sample surveys
Surveys are carried out with the aim of learning about characteristics (or parameters) of a target population, the group of interest
The survey may select all population members (a census) or only a part of the population (a sample)
Studies typically sample individuals (rather than take a census) because of time, cost, and other practical constraints

Sampling variability
Say we sample from a population in order to estimate the population mean of some (numerical) variable of interest (e.g. weight, height, number of children)
We would use the sample mean as our guess for the unknown value of the population mean
Our sample mean is very unlikely to be exactly equal to the (unknown) population mean, simply because of chance variation in sampling
Thus, it is useful to quantify the likely size of this chance variation (also called ‘chance error’ or ‘sampling error’, as distinct from ‘nonsampling errors’ such as bias)

Sampling variability of the sample mean
Say the SD of the variable in the population is known to be some number σ
If a sample of n individuals has been chosen ‘at random’ from the population, then the likely size of the chance error of the sample mean (called the ‘standard error’) is SE(mean) = σ/√n
If σ is not known, you can substitute an estimate (such as the sample SD)
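A small numerical sketch (the heights are invented values, not from the slides):
> heights <- c(162, 175, 158, 181, 170, 166)        # hypothetical sample, n = 6
> se.mean <- sd(heights) / sqrt(length(heights))    # sample SD substituted for the unknown σ
> round(se.mean, 2)
[1] 3.46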

Sampling variability of the sample proportion
Similarly, we could use the sample proportion as a guess for the unknown population proportion p with some characteristic (e.g. the proportion of females)
If a sample of n individuals has been chosen ‘at random’ from the population, then the likely size of the chance error of the sample proportion is SE(proportion) = √(p(1-p)/n)
Of course, we don’t know p (or we would not need to estimate it), so we substitute our estimate
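For example (made-up numbers), with 40 ‘successes’ observed in a sample of n = 100:
> p.hat <- 40/100                               # estimated proportion
> se.prop <- sqrt(p.hat * (1 - p.hat) / 100)    # estimate substituted for the unknown p
> round(se.prop, 3)
[1] 0.049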

Central Limit Theorem (CLT)
The CLT says that if we
–repeat the sampling process many times
–compute the sample mean (or proportion) each time
–make a histogram of all the means (or proportions)
then that histogram of sample means (or proportions) should look like the normal distribution
Of course, in practice we only get one sample from the population
The CLT provides the basis for making confidence intervals and hypothesis tests for means or proportions
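A quick simulation sketch of this idea (the exponential population and the sample size 30 are arbitrary choices, not from the slides):
> sample.means <- replicate(1000, mean(rexp(30, rate = 1)))   # 1000 sample means, n = 30 each
> hist(sample.means)   # roughly bell-shaped and centred near the population mean of 1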

What the CLT does not say
The CLT does not say that the histogram of the individual variable values will look like the normal distribution
For a big enough sample, the distribution of the individual variable values will look like the population distribution of those values
That population distribution does not have to be normal, and in practice it typically is not

CLT: technical details
A few technical conditions must be met for the CLT to hold
The most important ones in practice are that
–the sampling should be random (in a carefully defined sense)
–the sample size should be ‘big enough’
How big is ‘big enough’? There is no single answer, because it depends on the variable’s distribution in the population: the less symmetric the distribution, the larger the sample size needed

Exercises: CLT simulations
Here, you will simulate flipping coins
The coins will have differing probabilities of ‘heads’
The object is to see how many coin flips are required for the distribution of the proportion of heads in the simulated flips to become approximately normal
See the handout for details
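A possible starting point (the handout may take a different approach; the probability 0.1 and the sample sizes below are only examples):
> p <- 0.1                                             # probability of 'heads'
> for (n in c(10, 50, 200)) {
+   props <- rbinom(1000, size = n, prob = p) / n      # 1000 simulated proportions of heads
+   hist(props, main = paste("n =", n))                # approximate normality improves as n grows
+ }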