Normalization Intro to R Carol Bult The Jackson Laboratory Functional Genomics (BMB550) Spring 2012 February 7, 2012.



[Figure: analysis workflow diagram. Box labels: Biological question (differentially expressed genes, sample class prediction, etc.), Experimental design, Microarray experiment, Image analysis, Normalization, Estimation, Testing, Clustering, Discrimination, Biological verification and interpretation.]

Microarray Data Analysis
[Figure: pipeline diagram. Box labels: Microarray experiment, Image analysis, Gene level expression values, Normalized data, Unsupervised analysis (clustering), Supervised analysis, Decomposition techniques, Networks & data integration.]

Why normalize?
- Microarray data have significant systematic variation, both within arrays and between arrays, that is not true biological variation
- Accurate comparison of genes’ relative expression within and across conditions requires normalization of these effects
- Sources of variation:
  - Spatial location on the array
  - Dye biases which vary with spot intensity
  - Plate origin
  - Printing/spotting quality
  - Experimenter

Sources of Systematic Bias
Individual factors:
- Print (20% - 30%)
- Experimenter (20% - 30%)
- Organism (3% - 10%)
- Date (5%)
- Software (2%)
- Number of tips (3%)
Interactions:
- Print - Experimenter (40%)
- Print - Date (40%)
- Experimenter - Date (40%)
(Slide from Catherine Ball; based on ~4,600 experiments in the Stanford Microarray Database analyzed by ANOVA.)

KO #8 Probes: ~6,000 cDNAs, including 200 related to lipid metabolism. Arranged in a 4x4 array of 19x21 sub-arrays. Clearly visible plate effects

Spatial Biases

Spatial plots: background from two slides

Log transform
[Figure: two panels. Typical Affymetrix probe intensity distribution; the same distribution after log-transform.]
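The effect of the log transform can be sketched with simulated values. The data below are a hypothetical, right-skewed stand-in for probe intensities (the seed, distribution parameters, and variable names are assumptions, not taken from the slides).

```r
# Simulated right-skewed "intensities"; hypothetical data, not from the slides
set.seed(1)
intensity <- rlnorm(1000, meanlog = 7, sdlog = 1.2)

# log2 is a common choice for microarray data: one unit is a two-fold change
log_intensity <- log2(intensity)

# A mean far above the median indicates a long right tail
gap_raw <- mean(intensity) - median(intensity)
gap_log <- mean(log_intensity) - median(log_intensity)
print(c(raw = gap_raw, log2 = gap_log))
```

On the raw scale the mean sits well above the median; after the transform the two are close, which is what the "after log-transform" panel illustrates.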

#Create a boxplot of the normalized data
boxplot(mydata[-1], main = "Normalized Intensities", xlab="Array", ylab="Intensities", col="blue")
#To save the boxplot as a jpeg file
jpeg("normal_boxplot.jpg")
boxplot(mydata[-1], main = "Normalized Intensities", xlab="Array", ylab="Intensities", col="blue")
dev.off()

Normalization
There are many different ways to normalize data:
- Global median, LOWESS, LOESS, RMA, etc.
- By print tip, spatial, etc.
BUT: don’t expect it to fix bad data!
- Won’t make up for lack of replicates
- Won’t make up for horrible slides
Assignment: Read about normalization (Quackenbush and RMA article).

RMA Processing
Background correction:
- Corrects for background noise and processing effects
- Adjusts for cross hybridization
- Adjusts estimated expression values to fall on the proper scale
Normalization:
- Reduces unwanted variation across chips
- Ensures the quantiles of each chip are equal, which gives the same distribution to each chip
Probe summarization:
- Once probe-level PM values have been background-corrected and normalized, they need to be summarized into expression measures, so that the result is a single expression measure per probe-set, per chip.
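The "same distribution for each chip" idea can be illustrated with a simplified quantile normalization in base R. This is a sketch under stated assumptions: quantile_normalize is a hypothetical helper, the input matrix is simulated, and the rma() function in the affy package performs this step internally on probe-level data and may treat ties differently.

```r
# Hypothetical helper: simplified quantile normalization (columns = chips)
quantile_normalize <- function(x) {
  # Rank of every value within its own column (chip)
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Sort each column, then average across columns rank by rank
  target <- rowMeans(apply(x, 2, sort))
  # Replace each value with the target value for its rank
  normalized <- apply(ranks, 2, function(r) target[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Simulated intensities: 10 probes on 3 chips (hypothetical data)
set.seed(42)
chips <- matrix(rlnorm(30, meanlog = 7), nrow = 10,
                dimnames = list(paste0("probe", 1:10),
                                c("chipA", "chipB", "chipC")))
normalized_chips <- quantile_normalize(chips)

# After normalization every chip has the same quantiles
print(apply(normalized_chips, 2, quantile))
```

Within each chip the ordering of probes is unchanged; only the values are mapped onto a shared reference distribution.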

Using R: RMA processing
Log into the cloud (don’t forget to have an X Windows server running). At the prompt, type “R” to invoke R. To quit R, type q() or quit() at the prompt. Save workspace.

You can invoke on-line help with the help.start() command. Note: R is a case sensitive language! What happens if you type Help.Start() at the command prompt?
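Case sensitivity can be checked without triggering an error by asking R whether a name exists. A minimal sketch using the base function exists():

```r
# help.start() opens the HTML help in a browser; help(mean) or ?mean
# shows the help page for a single function.

# R distinguishes upper and lower case, so only one of these names is defined
exists("help.start")   # TRUE
exists("Help.Start")   # FALSE
```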

At the prompt (>) type the command, library(), to see which packages are available in your R installation. A library is a location where R goes to find packages. What is listed for you may differ from what is shown here.

The “>” symbol is the command prompt, ‘library()’ is the command, and the parentheses hold the arguments that define specific operations for the function. You might find this R reference card helpful. Use it as a stub to create your own reference card!

Use this command to get a list of functions associated with the stats package.
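The command itself is not shown in the transcript. One command that displays the index of a package in an interactive session is library(help = "stats"). The sketch below instead uses ls() on the attached package, which returns the names as a character vector that can be inspected or searched; the variable name stats_functions is an assumption.

```r
# Names of all objects exported by the stats package (attached by default)
stats_functions <- ls("package:stats")
length(stats_functions)
head(stats_functions, 10)

# Interactive alternative that shows the package index page:
# library(help = "stats")
```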

Running this command will list all of the data sets that come with your R installation.
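The command is not shown in the transcript. In an interactive session, data() with no arguments displays the available data sets. As a sketch, the same listing for the built-in datasets package can be captured as a matrix (the variable name available is an assumption):

```r
# data(package = ...) returns an object whose $results matrix holds the listing
available <- data(package = "datasets")$results

# Show the name and description of the first few data sets
head(available[, c("Item", "Title")])
```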

In this series of commands, we load the BOD (Biological Oxygen Demand) data set and then print out the data to the screen.
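The commands are not shown in the transcript; a minimal sketch of loading and printing the built-in BOD data set would be:

```r
data(BOD)   # copy the built-in BOD data set into the workspace
BOD         # typing an object's name prints it to the screen
```

In current R versions the built-in data frame has the column names Time and demand.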

Data Frames in R The data sets in R are objects called data frames – You can think of a data frame as a table where the columns are variables and rows are observations

Types of Objects in R
- Vector: One-dimensional array of arbitrary length. All members of a vector must be of the same type (numeric, alpha, etc.)
- Matrix: Two-dimensional array with an arbitrary number of rows and columns. All elements in a matrix must be of the same type.
- Array: Similar to a matrix but with an arbitrary number of dimensions.
- Data frame: Organized similarly to a matrix, except that each column can hold its own type of data. Not all columns in a data frame need to contain the same type of data.
- Function: A type of R object that performs a specific operation. R contains many built-in functions.
- List: An arbitrary collection of R objects.

Create the BOD Data Frame from Scratch 1. Create a vector object for time using the c() function. 2. Create a vector object for demand. 3. Use the data.frame() function to create the data frame object
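The three steps can be sketched as follows. The values are copied from the built-in BOD data set; the lowercase vector names and the object name MyBOD are assumptions based on the names used in the later slides.

```r
# 1. Vector of measurement times
time <- c(1, 2, 3, 4, 5, 7)

# 2. Vector of oxygen demand values
demand <- c(8.3, 10.3, 19.0, 16.0, 15.6, 19.8)

# 3. Combine the two vectors into a data frame; column names come from the vectors
MyBOD <- data.frame(time, demand)
MyBOD
```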

Reading in Data from Files
Use the read.table() function to read in data from a text file and use it as a data frame in R. Other input functions for R include read.csv() and read.delim(). If you had data in an Excel spreadsheet, how could you import it into R? Note: <- can be used as an assignment operator, but = should also work.
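A minimal sketch of reading a tab-delimited file. The file is created in a temporary directory first so that the example is self-contained; the file name and its contents are hypothetical.

```r
# Write a small tab-delimited file to a temporary location (hypothetical data)
input_file <- file.path(tempdir(), "bod_input.txt")
writeLines(c("time\tdemand", "1\t8.3", "2\t10.3", "3\t19.0"), input_file)

# header = TRUE: the first line holds column names; sep = "\t": tab-delimited
newdata <- read.table(input_file, header = TRUE, sep = "\t")

# read.delim() uses these two settings as its defaults
newdata2 <- read.delim(input_file)

str(newdata)
```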

How would you find out more about the c() function?

Writing Output to Files Want to save the data in your data frame as a text file on your computer? Use the write.table() function to output the MyBOD data frame to a text file. This file will be saved to the directory that R is working from.

Writing Output to Files Use the write.csv() function to output the MyBOD data frame to a comma separated file (which can be opened easily in Excel). This file will be saved to the directory that R is working from.
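Both output functions can be sketched together. The example writes to a temporary directory rather than the working directory so that it leaves no files behind; the file names are assumptions.

```r
MyBOD <- data.frame(time = c(1, 2, 3, 4, 5, 7),
                    demand = c(8.3, 10.3, 19.0, 16.0, 15.6, 19.8))

txt_file <- file.path(tempdir(), "MyBOD.txt")
csv_file <- file.path(tempdir(), "MyBOD.csv")

# Tab-delimited text; row.names = FALSE leaves out the row numbers
write.table(MyBOD, txt_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Comma-separated file that opens directly in Excel
write.csv(MyBOD, csv_file, row.names = FALSE)

# Read the CSV back in to confirm the values survived the round trip
check <- read.csv(csv_file)
check
```

Passing a plain file name such as "MyBOD.csv" instead of a full path saves the file in the directory reported by getwd().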

Editing Data For smaller data files, you can edit the file using the edit() function. This launches an R Editor window. Always write the edited file to a new object name! In this case we will edit the newdata object and store the results as an object called Mynewdata. To store the edited object as a file on your computer, use the write.table() function.

Exploring What is In a Data Frame The names() and str() commands let you get an overview of what is in a data frame. The names() function allows you to access the column names and edit them. In this code snippet, the first [1] and second [2] column names are changed from lower case to sentence case.
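The snippet is not in the transcript; a sketch of what it might look like, assuming a data frame MyBOD2 with lowercase column names:

```r
MyBOD2 <- data.frame(time = c(1, 2, 3, 4, 5, 7),
                     demand = c(8.3, 10.3, 19.0, 16.0, 15.6, 19.8))

names(MyBOD2)   # column names
str(MyBOD2)     # compact overview: dimensions, column types, first values

# Assigning to names() edits the column names in place
names(MyBOD2)[1] <- "Time"
names(MyBOD2)[2] <- "Demand"
names(MyBOD2)
```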

Accessing Data in a Data Frame
Use the name of the data frame (MyBOD2), a dollar sign ($) and the name of the variable (Time or Demand) to see a list of all of the observation values. To access a specific value, you simply indicate the position in the vector. For example, MyBOD2$Demand[2] will access the second value for that variable, which is 10.3. If you “attach” the data frame using the attach() command, you can access the variables and observations without the cumbersome need to specify the name of the data frame or the $. Using what you know, how could you change the value of Demand[2] from 10.3 to 10.5? (Be careful that you don’t make such changes to the original data frames!)
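A sketch of these access patterns. MyBOD3 is a hypothetical name for a working copy, and with() is shown as one alternative to attach() that does not need a later detach().

```r
MyBOD2 <- data.frame(Time = c(1, 2, 3, 4, 5, 7),
                     Demand = c(8.3, 10.3, 19.0, 16.0, 15.6, 19.8))

MyBOD2$Demand                        # all observations for one variable
second_demand <- MyBOD2$Demand[2]    # one value, selected by position

# with() evaluates an expression using the column names directly
mean_demand <- with(MyBOD2, mean(Demand))

# Make changes on a copy so the original data frame stays untouched
MyBOD3 <- MyBOD2
MyBOD3$Demand[2] <- 10.5
```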

Adding columns to a Data Frame You can add and delete columns from a data frame. Here we add a column for the sex of whatever it is we are measuring oxygen demand for. Oops!!! We have a data entry error. The value for sex should all be female (F). How would you fix this?

Deleting columns from a Data Frame You can delete columns from a data frame. Here we deleted the column for sex that we just created from the MyBOD2 data frame.
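Adding and deleting a column can be sketched as below. The column name Sex and its values are assumptions for illustration; the slides' own commands are not in the transcript.

```r
MyBOD2 <- data.frame(Time = c(1, 2, 3, 4, 5, 7),
                     Demand = c(8.3, 10.3, 19.0, 16.0, 15.6, 19.8))

# Add a column by assigning a vector with one value per row
MyBOD2$Sex <- rep("F", nrow(MyBOD2))
cols_after_add <- names(MyBOD2)

# Delete a column by assigning NULL to it
MyBOD2$Sex <- NULL
names(MyBOD2)
```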

Note: When you are done using a data frame it is a good practice to “detach” it.

Displaying Data in R R comes with an incredible array of built in data analysis tools for exploring, analyzing, and visualizing data. Here is a plot of the Time and Demand variables for the MyBOD2 data frame using the plot() command. Note that because we “attached” this data frame we can just use the names of the variables to access the observation data. Use help(plot) to look up the details of this command. Figure out how to change the command to add a title to the plot.

Displaying Data in R Here is a box plot of the Demand variables for the MyBOD2 data frame using the boxplot() command.
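Both plots can be sketched as follows. The example draws into a PDF file in a temporary directory so that it also runs without a display, and it refers to the columns through MyBOD2$ instead of attaching the data frame; the file name is an assumption.

```r
MyBOD2 <- data.frame(Time = c(1, 2, 3, 4, 5, 7),
                     Demand = c(8.3, 10.3, 19.0, 16.0, 15.6, 19.8))

plot_file <- file.path(tempdir(), "MyBOD2_plots.pdf")
pdf(plot_file)                                    # open a PDF graphics device

# Scatter plot of Demand against Time (page 1)
plot(MyBOD2$Time, MyBOD2$Demand, xlab = "Time", ylab = "Demand")

# Box plot of the Demand values (page 2)
boxplot(MyBOD2$Demand, ylab = "Demand")

invisible(dev.off())                              # close the device and write the file
```

In an interactive session the pdf() and dev.off() lines can be left out and the plots appear in a graphics window.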

Analyzing Data The summary() command provides summary statistics for a data frame.

Analyzing Data
Here is a series of commands to generate some basic statistics for the Demand variable in the MyBOD2 data frame. The data frame has been attached so that the variable names can be used directly. Remember that the case of the variable names was changed relative to the original BOD data set (Time vs time; Demand vs demand)!

Examples of stats functions in R mean() median() table() – there is no function to find the mode of a data set but the table() function will show how many times a value is observed. max() min() There is no built in function for midrange so you have to construct a formula to calculate this based on the values from the max() and min() functions.
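A sketch of these functions on the Demand values. The midrange formula follows the description above; the vector x used for the mode is hypothetical.

```r
Demand <- c(8.3, 10.3, 19.0, 16.0, 15.6, 19.8)

mean(Demand)
median(Demand)
max(Demand)
min(Demand)

# Midrange: halfway between the smallest and largest value
midrange <- (max(Demand) + min(Demand)) / 2

# table() counts how often each value occurs; the most frequent is the mode
x <- c(2, 3, 3, 5)                       # hypothetical values with one repeat
counts <- table(x)
mode_value <- names(counts)[counts == max(counts)]
```

Because table() labels its counts with character names, mode_value is returned as text and needs as.numeric() before further arithmetic.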

Measuring data spread
Remember that the case of the variable names was changed relative to the original BOD data set (Time vs time; Demand vs demand)! Here is a series of commands to generate some basic statistics related to the spread of measurements for the Demand variable in the MyBOD2 data frame. The data frame has been attached so that the variable names can be used directly.

More examples of stats functions in R var() sd() There is no built in function for calculating the standard error of the mean (sem) so you have to create a formula to calculate this. There is no built in function for calculating the range so you have to construct a formula to calculate this based on the values from the max() and min() functions.
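A sketch of the spread calculations on the Demand values. The names sem and range_width are assumptions. Note that base R does have a range() function, but it returns the minimum and maximum as a pair rather than their difference.

```r
Demand <- c(8.3, 10.3, 19.0, 16.0, 15.6, 19.8)

var(Demand)   # sample variance
sd(Demand)    # sample standard deviation (square root of the variance)

# Standard error of the mean: standard deviation divided by sqrt(n)
sem <- sd(Demand) / sqrt(length(Demand))

# Range as a single number: largest value minus smallest value
range_width <- max(Demand) - min(Demand)

# range() gives c(min, max); diff() turns that pair into the same width
diff(range(Demand))
```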

What is meant by mode? What do the variance, standard deviation and standard error of the mean tell us about a data set?

Your Turn
Create a data frame for age and frequency using the data on this slide. Calculate the cumulative frequency and add it as a column to the data frame. Save the data frame as a comma separated text file and then open it in Excel. Plot age versus cumulative frequency. What are the mean and median age? What are the variance, standard deviation, and standard error of the mean for frequency?

Affymetrix Data Files
- CEL file = results of the intensity calculations on the pixel values of the DAT file (the DAT file is the image file)
- CDF = chip description file; describes the layout for an Affymetrix gene chip array
I have put some CEL files and the related CDF file in my home directory for you to work with.

1. Log into the Maine Innovation Cloud
2. Copy the CEL files from my home directory to your directory
3. Also copy a file called MoGene-1_0-st-v1.cdf to your directory
4. Create a subdirectory called TestData
5. Move the CEL files and cdf file into TestData
6. Change directory to TestData
7. Start R at the command line
8. After R starts, type the commands on the next slide at the prompts in R

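Performing RMA and saving the normalized data file
library(affy)
library(makecdfenv)
#Read all CEL files in the current working directory
CELData=ReadAffy()
#Build a CDF environment from the chip description file
Array.CDF=make.cdf.env("MoGene-1_0-st-v1.cdf")
#Background-correct, normalize, and summarize the probe-level data
rma.CELData=rma(CELData)
#Extract the expression values (one row per probe set, one column per array)
rma.expr=exprs(rma.CELData)
rma.expr.df=data.frame(ProbeID=rownames(rma.expr), rma.expr)
#Save the normalized values as a tab delimited text file
write.table(rma.expr.df, "rma.expr.dat", sep="\t", row.names=FALSE, quote=FALSE)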

Create a plot of the non-normalized intensities
#CELData boxplot - plot CEL data intensities (non-normalized)
jpeg("boxplot.jpeg")
boxplot(CELData, names=CELData$sample, col="blue")
dev.off()
#CELData histogram - density histogram of CEL data (non-normalized)
jpeg("histogram.jpeg")
hist(CELData)
dev.off()
#print out a key to sample order (for plots and such)
pData(CELData)
#save R workspace to file (OPTIONAL)
save.image("CEL files.Rdata")

Read data back in and create a plot of normalized intensities
Open the rma.expr.dat text file using Excel. Save as a new file (tab delimited text format) such as “mydata.dat”.
#Create an R object and read your tab delimited text file into it
mydata <- read.table("mydata.dat", header=T, sep="\t")
summary(mydata)
#Exclude probe information
summary(mydata[-1])
#Create a boxplot of the normalized data
boxplot(mydata[-1], main = "Normalized Intensities", xlab="Array", ylab="Intensities", col="blue")
#To save the boxplot as a jpeg file
jpeg("normal_nobg_boxplot.jpg")
boxplot(mydata[-1], main = "Normalized Intensities", xlab="Array", ylab="Intensities", col="blue")
dev.off()
