Introduction to Microarray Gene Expression

Introduction to Microarray Gene Expression Shyamal D. Peddada Biostatistics Branch National Inst. Environmental Health Sciences (NIH) Research Triangle Park, NC

Outline of the four talks
  1. A general overview of microarray data
    Some important terminology and background
    Various platforms
    Sources of variation
    Normalization of data
  2. Analysis of gene expression data - nominal explanatory variables
    Two types of explanatory variables
    Scientific questions of interest
    A brief discussion on false discovery rate (FDR) analysis
    Some existing methods of analysis
  3. Analysis of ordered gene expression data
    Common experimental designs
    Some existing statistical methods
    An example
    Demonstration of ORIOGEN
    Some open research problems
  4. Analysis of data from cell-cycle experiments
    Some background on cell-cycle experiments
    Modeling the data
    Data from multiple experiments
    Some open research problems

Talk 1: An overview of microarray data

To perform a statistical analysis of any given data set, it is important to:
  Understand all sources of (i) bias and (ii) variability
  Have some basic understanding of the underlying technology
  Understand the sampling/experimental design

Some Important Terminology and Background

Central Dogma of Molecular Biology

Some background terminology: DNA and RNA
  DNA (deoxyribonucleic acid) - contains the genetic code, or instructions, for the development and function of living organisms. It is double stranded.
  Four nucleotides (building blocks of DNA): Adenine (A), Guanine (G), Thymine (T), Cytosine (C)
  Base pairs: (A, T) and (G, C)
  E.g.
    5’ ---AAATGCAT--- 3’
    3’ ---TTTACGTA--- 5’
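The pairing rule on this slide can be checked mechanically. The following is a minimal sketch (the function name `complement` is chosen here for illustration) that reproduces the lower strand of the example from the upper one.

```python
# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
DNA_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the base-by-base complement of a DNA strand, in the same left-to-right order."""
    return "".join(DNA_PAIRS[base] for base in strand)


top = "AAATGCAT"          # upper strand, read 5' -> 3'
bottom = complement(top)  # lower strand, read 3' -> 5' when aligned under the upper one
print(bottom)
```

Because the two strands run in opposite directions, the lower strand read in its own 5' to 3' direction is the reverse of the string returned here.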

Some background terminology: DNA and RNA
  RNA (ribonucleic acid) - transcribed (or copied) from DNA. It is single stranded (a complementary copy of one of the strands of DNA).
  RNA polymerase - an enzyme that helps in the transcription of DNA to form RNA.
  Four nucleotides (building blocks of RNA): Adenine (A), Guanine (G), Uracil (U), Cytosine (C)
  Base pairs: (A, U) and (G, C)

Some background terminology: Types of RNA
  Types of RNA - transfer RNA (tRNA), ribosomal RNA (rRNA), etc.
  mRNA - messenger RNA. Carries information from DNA to the ribosomes, where protein synthesis takes place. It is less stable than DNA.

Some background terminology: Oligos
  Oligonucleotide - a short segment of DNA consisting of a few base pairs, commonly called an “oligo” for short.
  “mer” - the unit of measurement for an oligo: the number of base pairs. So a 30 base pair oligo is a 30-mer.

Some background terminology: Probes
  cDNA - complementary DNA. A DNA sequence that is complementary to a given mRNA, obtained using an enzyme called reverse transcriptase.
  Probe - a short segment of DNA (about 100-mer or longer) used to detect DNA or RNA that complements the sequence present in the probe.

Some background terminology: “Blots” - Origins of Microarrays
  Southern blot (Edwin Southern, 1975, J. Molec. Biol.) - a method used to identify the presence of a DNA sequence in a sample of DNA.
  Western blot (immunoblot) - used to identify a specific protein from a tissue extract.

Some background terminology
  Southwestern blot - used to identify and characterize DNA-binding proteins.
  Northern blot - a method used to study gene expression from a sample of mRNA.

Microarrays …

Northern blot vs. Microarray
  Rate of expression analysis
    Microarray: thousands of genes at a time (high throughput)
    Northern blot: few genes at a time
  Automation
    Microarray: automation possible
    Northern blot: manual
  Scope
    Microarray: allows exploring relationships among several hundred genes at the same time
    Northern blot: limited

What is a Microarray?
  Sequences from thousands of different genes are immobilized, or attached, at fixed locations on a support.
  The sequences are either spotted onto the support or synthesized directly on it.

Microarray Technology
  Two-color dye arrays (spotted arrays)
    Spotted cDNA microarrays
    Spotted oligo microarrays
  Single-dye arrays
    In situ oligo microarrays

Microarray Technology

Spotted Microarrays

Spotted DNA Microarray
  Slides carrying spots of target DNA are hybridized to fluorescently labeled cDNA from experimental and control cells, and the arrays are imaged at two or more wavelengths.
  Expression profiling involves the hybridization of fluorescently labeled cDNA, prepared from cellular mRNA, to microarrays carrying thousands of unique sequences.

Spotted DNA Microarray
  A spotted DNA array is typically “home made”, so you need to think about:
    cDNA or oligo
    Location of the oligo in a given gene
    Oligo length - number of bp?

Spotted DNA Microarray
  Gene expression:
    Y < 0: the gene is over-expressed in the green-labeled sample compared to the red-labeled sample
    Y = 0: the gene is equally expressed in both samples
    Y > 0: the gene is over-expressed in the red-labeled sample compared to the green-labeled sample
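The definition of Y is not in the transcript. The three cases are consistent with Y being a log ratio of red to green intensity, and the sketch below assumes Y = log2(R/G). The function names and intensity values are illustrative only.

```python
import math


def log_ratio(red, green):
    """Assumed definition: Y = log2(red intensity / green intensity)."""
    return math.log2(red / green)


def interpret(y):
    """Map the sign of Y to the three cases listed on the slide."""
    if y > 0:
        return "over-expressed in red-labeled sample"
    if y < 0:
        return "over-expressed in green-labeled sample"
    return "equally expressed"


# Illustrative spot intensities (red, green); not real data.
for red, green in [(400.0, 100.0), (100.0, 100.0), (100.0, 400.0)]:
    y = log_ratio(red, green)
    print(red, green, y, interpret(y))
```

A base-2 logarithm is a common choice because a two-fold change in either direction then corresponds to Y = +1 or Y = -1.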

Single Dye Microarrays

Major Commercial Platforms
  More than 50 companies are currently offering various DNA microarray platforms, reagents and software.
  Affymetrix dominated the market for many years.
  Agilent has one- and two-color microarray platforms.

Affymetrix GeneChip
  Each gene is represented by 11 to 20 oligos, each a 25-mer
  Probe: a 25-mer oligo
  Probe pair: a PM and MM pair
  Perfect match (PM): a 25-mer complementary to a reference sequence of interest (part of the gene)
  Mismatch (MM): same as the PM except for a single base change at the middle (13th) base (G <-> C, A <-> T)
  Probe set: a collection of probe pairs (11 to 20) related to a fraction of a gene
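The PM/MM construction can be written out directly. This is a minimal sketch of the rule as stated on the slide (swap G <-> C or A <-> T at the 13th base of a 25-mer). The function name and the example sequence are made up for illustration.

```python
# Base swap applied at the middle position of a mismatch probe.
MIDDLE_SWAP = {"G": "C", "C": "G", "A": "T", "T": "A"}


def mismatch_probe(pm):
    """Build the MM probe from a 25-mer PM probe by changing only the 13th base."""
    if len(pm) != 25:
        raise ValueError("expected a 25-mer")
    middle = 12  # 13th base, using 0-based indexing
    return pm[:middle] + MIDDLE_SWAP[pm[middle]] + pm[middle + 1:]


pm = "ACGTACGTACGTACGTACGTACGTA"  # hypothetical 25-mer, not a real probe
mm = mismatch_probe(pm)
print(pm)
print(mm)
```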

Affymetrix call for the presence of a signal
  The Affymetrix detection algorithm uses probe pair intensities to obtain a detection p-value.
  Using this p-value, the algorithm decides whether the signal is “present”, “marginal” or “absent”.

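Affy call: computing the detection p-value
  Calculate the discrimination score R for each probe pair: R = (PM - MM) / (PM + MM)
  Compare the scores against a small threshold tau with a one-sided Wilcoxon signed-rank test.
  The resulting p-value measures the statistical significance of the signal for the gene.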
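The per-probe-pair statistic (PM - MM) / (PM + MM) and the resulting call can be sketched as follows. Affymetrix documentation describes a one-sided Wilcoxon signed-rank test of these scores against a small threshold. The threshold 0.015 and the cut-offs 0.04 and 0.06 used here are commonly cited defaults and should be treated as assumptions, as should the tie handling. All function names and intensities are illustrative, and the exact p-value is found by brute-force enumeration, which is only practical for small probe sets.

```python
from itertools import product


def discrimination_scores(pm, mm):
    """Score (PM - MM) / (PM + MM) for each probe pair."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]


def signed_rank_pvalue(scores, tau=0.015):
    """Exact one-sided signed-rank p-value for 'scores tend to exceed tau'."""
    diffs = [s - tau for s in scores if s != tau]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_obs = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null every sign pattern is equally likely; enumerate all 2^n of them.
    hits = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(ranks, signs) if s) >= w_obs
    )
    return hits / 2 ** n


def detection_call(p_value, alpha1=0.04, alpha2=0.06):
    """Translate the p-value into a present / marginal / absent call."""
    if p_value < alpha1:
        return "present"
    if p_value < alpha2:
        return "marginal"
    return "absent"


# Made-up intensities for an 11-pair probe set.
pm = [1200, 900, 1500, 1100, 1300, 1000, 1400, 950, 1250, 1050, 1150]
mm = [300] * 11
p = signed_rank_pvalue(discrimination_scores(pm, mm))
print(p, detection_call(p))
```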

Affy call Ref: Affymetrix Technical Manual

Affymetrix Vs Illumina Ref: Pan Du & Simon Lin

Which Platform to Choose?
  Every platform has its unique features
  Choose a platform based on:
    Nature of the study
    Amount of available RNA
    Cost
  Platform comparison in the MAQC study

MAQC Project
  Objective: to generate a set of quality control tools for the microarray research community
  137 participants representing 51 organizations
  Gene expression from two distinct RNA samples (total 4 samples):
    Sample A = Universal Human Reference RNA (UHRR), 100%
    Sample B = Human Brain Reference RNA (HBRR), 100%
    Sample C = 75% UHRR + 25% HBRR
    Sample D = 25% UHRR + 75% HBRR

Microarray Data Analysis

Why Normalize Data? To “calibrate”/adjust data so as to reduce or eliminate the effects arising from variation in technology and other sources rather than due to true biological differences between test groups.

Sources of bias/variation
  Tissue or cell lines
  mRNA
    It can degrade over time, so there is a potential batch effect if portions of the experiment are performed at different times
    Purity and quantity
  Dye color effect (spotted arrays)
  Variation due to technology (substantially reduced with improved technology)
  Etc.

A useful graphical representation of data Data matrix: Let

A useful graphical representation of data Let its spectral decomposition be given by where

A useful graphical representation of data Then Plot

Common Normalization Methods
  Internal control normalization
  Global normalization
  Linear normalization (spotted arrays)
  Non-linear normalization (spotted arrays) - LOWESS curve
  ANOVA
  ComBat (for batch effect)

Internal control normalization (housekeeping gene(s))
  Expression of each gene is measured relative to the average of the housekeeping genes.
  Basic assumption: expression of the housekeeping genes does not change.
  Disadvantages:
    Housekeeping genes may sometimes be highly expressed.
    Unexpected regulation of housekeeping gene(s) leads to misinterpretation.
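As a minimal sketch of internal control normalization, assume expression values are already on a log2 scale, so that "relative to the average of the housekeeping genes" becomes a subtraction. GAPDH and ACTB are commonly used housekeeping genes; `geneX`, the values and the function name are hypothetical.

```python
from statistics import mean


def normalize_to_housekeeping(log_expr, housekeeping):
    """Subtract the mean log2 expression of the housekeeping genes from every gene."""
    reference = mean(log_expr[g] for g in housekeeping)
    return {gene: value - reference for gene, value in log_expr.items()}


# Illustrative log2 expression values for one array; not real data.
array = {"GAPDH": 10.0, "ACTB": 12.0, "geneX": 13.0}
print(normalize_to_housekeeping(array, ["GAPDH", "ACTB"]))
```

If a housekeeping gene is itself regulated by the treatment, the reference shifts and every normalized value shifts with it, which is the misinterpretation risk named on the slide.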

Global Normalization
  Basic assumption: the mean/median expression ratio of all monitored mRNAs is constant across a chip.
  Regression of the log ratio on a constant term only.
  In simple terms, the log ratios are corrected by a common “mean” or “median”.
  This method can also be applied to single-dye data.
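A minimal sketch of global normalization for a two-color array, assuming log2 ratios and median centering (the slide allows either mean or median). The function name and intensities are illustrative.

```python
import math
from statistics import median


def global_normalize(red, green):
    """Center the log2(red/green) ratios of one array at zero using their median."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    shift = median(log_ratios)
    return [value - shift for value in log_ratios]


# Illustrative intensities for three spots; not real data.
red = [200.0, 400.0, 800.0]
green = [100.0, 100.0, 100.0]
print(global_normalize(red, green))
```

The same correction applies to every spot regardless of its intensity, which is what distinguishes this method from the intensity-dependent ones.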

Linear Normalization (for spotted arrays)
  Basic assumption: the mean/median expression ratio of all monitored mRNAs depends upon the average intensity.
  Regression of the log expression ratio on the average intensity (a straight-line fit).
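The regression formula is missing from the transcript. A conventional reading, assumed here, is to fit the log ratio M = log2(R/G) as a straight line in the average log intensity A = 0.5 * log2(R*G) and keep the residuals as normalized values. The names below are illustrative.

```python
def linear_normalize(m_values, a_values):
    """Fit M = intercept + slope * A by least squares and return the residuals."""
    n = len(m_values)
    a_bar = sum(a_values) / n
    m_bar = sum(m_values) / n
    sxx = sum((a - a_bar) ** 2 for a in a_values)
    sxy = sum((a - a_bar) * (m - m_bar) for a, m in zip(a_values, m_values))
    slope = sxy / sxx
    intercept = m_bar - slope * a_bar
    return [m - (intercept + slope * a) for a, m in zip(a_values, m_values)]


# Illustrative data with a purely intensity-dependent dye bias; not real data.
a_values = [8.0, 9.0, 10.0, 11.0, 12.0]
m_values = [0.5 + 0.1 * a for a in a_values]
print(linear_normalize(m_values, a_values))  # residuals close to zero
```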

Non-Linear Normalization (for spotted arrays)
  Basic assumption: the mean/median expression ratio of all monitored mRNAs depends upon the average intensity.
  Regression of the log expression ratio on a smooth function of the average intensity, where the smooth function is estimated by the robust scatterplot smoother LOWESS (LOcally WEighted Scatterplot Smoothing).
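The idea behind LOWESS normalization can be illustrated with a simplified smoother: a tricube-weighted local straight-line fit at each point, without the robustness iterations of full LOWESS. This sketch swaps in that simpler technique. The span `frac`, the function names and the data are assumptions made for illustration.

```python
def tricube(u):
    """Tricube weight: largest at u = 0 and zero at or beyond u = 1."""
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def local_linear_fit(x, y, x0, frac):
    """Weighted straight-line fit around x0 using roughly frac * n nearest points."""
    n = len(x)
    k = min(n, max(3, round(frac * n)))
    bandwidth = sorted(abs(xi - x0) for xi in x)[k - 1]
    if bandwidth == 0:
        weights = [1.0 if xi == x0 else 0.0 for xi in x]
    else:
        weights = [tricube(abs(xi - x0) / bandwidth) for xi in x]
    total = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / total
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / total
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    if sxx == 0:
        return y_bar  # not enough spread for a slope; fall back to the local mean
    sxy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y))
    return y_bar + (sxy / sxx) * (x0 - x_bar)


def smooth_normalize(m_values, a_values, frac=0.5):
    """Subtract the locally fitted trend of M on A from each M value."""
    return [
        m - local_linear_fit(a_values, m_values, a, frac)
        for a, m in zip(a_values, m_values)
    ]


# Illustrative data: average log intensity A and log ratio M with a curved bias.
a_values = [float(i) for i in range(6, 16)]
m_values = [0.02 * (a - 10.0) ** 2 for a in a_values]
print(smooth_normalize(m_values, a_values))
```

Full LOWESS additionally down-weights points with large residuals and refits, which protects the trend from genuinely differentially expressed genes.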

Analysis of Variance (ANOVA)
  Standard analysis of variance model
  Response variable: gene expression
  Explanatory variables:
    Dye color
    Batch
    Other potential effects?
  Advantage: statistically significant genes can be identified while controlling for the various experimental conditions/factors.

Some important experimental designs
  Pooled samples versus separate samples
  Sometimes there may not be sufficient biological sample/specimen from a given animal. In such cases, biological samples are pooled from several identical animals to form a sample.

An example of a pooling design (for each treatment group) Subjects Pool Observations (Microarray chips)

The pooling design
  Subjects   Pools   Observations (microarray chips)
  9          3       6                                  (3 subjects per pool)
  More generally:
  n          p       m                                  (r = n/p subjects per pool)

The standard design
  Subjects   Pools   Observations (microarray chips)
  9          9       9                                  (r = 1)
  More generally:
  n          p = n   m = n

Some issues
  What are the underlying parameters?
  Effect of pooling on power
  The basic assumption
  Validity of the assumption

Parameters
  Total variation in the expression of a gene can be decomposed into:
    Biological variation
    Technical variation
  Biological samples (n)
  Number of pools (p)
  Biological samples per pool (r = n/p)
  Observed number of samples, e.g. microarrays (m)

Some comments about pooling
  The variance of the estimated mean expression of a gene depends on:
    Number of pools (p)
    Number of biological samples per pool (r)
    Number of arrays (m)
    Biological variation
    Technical variation
  Pooling works well when the biological variation in the gene expression is substantially larger than the technical variation.
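The slide does not give the variance formula. Under a standard set of assumptions (biological averaging within pools, equal pool sizes, arrays spread evenly over pools, independent errors), the variance of the estimated mean is sigma_b^2 / (p * r) + sigma_t^2 / m. The sketch below uses that form with illustrative variance values. The function name is hypothetical.

```python
def variance_of_mean(sigma_b2, sigma_t2, p, r, m):
    """Variance of the estimated mean expression under the assumptions stated above.

    sigma_b2: biological variance, sigma_t2: technical variance,
    p: number of pools, r: biological samples per pool, m: number of arrays.
    """
    return sigma_b2 / (p * r) + sigma_t2 / m


# Biological variance four times the technical variance (illustrative values).
sigma_b2, sigma_t2 = 4.0, 1.0

standard = variance_of_mean(sigma_b2, sigma_t2, p=9, r=1, m=9)  # 9 subjects, 9 arrays
pooled = variance_of_mean(sigma_b2, sigma_t2, p=3, r=3, m=6)    # 9 subjects, 6 arrays
print(standard, pooled)
```

With these values the pooled design uses three fewer arrays while the variance rises only from 5/9 to 11/18, because the biological term depends on the total number of subjects rather than on the number of arrays. The variance is only one ingredient of power: fewer arrays also leave fewer degrees of freedom for estimating the error.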

Power comparisons
  # Bio samples   # Microarrays   Pool size                   Power
  5/group         5/group         1 (standard design)         0.81
  6/group         6/group         1 (standard design)         0.95
  6/group         3/group         2 (i.e. 3 pools/group)      0.30
  8/group         4/group         2 (i.e. 4 pools/group)      0.80
  10/group        5/group         2 (i.e. 5 pools/group)      0.98
  Source: Zhang and Gant (2005)

Power comparisons
  Conditions of the simulation study:
    Biological variation is 4 times the technical variation.
    False positive rate is 0.001.
    Detect 2-fold expression.
    Data are normally distributed.

A fundamental assumption
  Biological averaging: suppose an experiment consists of pooling “r” samples. Then the expression of a gene in the pooled sample is assumed to be the average of the gene’s expression in the “r” samples.
  This assumption need not hold, especially if the expression values are transformed non-linearly.
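A small numeric illustration of why a non-linear transformation interferes with biological averaging: if pooling averages expression on the raw scale, the log of the pooled value is not the average of the individual log values. The values and function names are illustrative.

```python
import math


def log_of_pooled(raw_values):
    """log2 of the pooled sample, assuming pooling averages raw expression."""
    return math.log2(sum(raw_values) / len(raw_values))


def mean_of_logs(raw_values):
    """Average of the individual log2 expression values."""
    return sum(math.log2(v) for v in raw_values) / len(raw_values)


# Raw-scale expression of one gene in three subjects (made-up numbers).
raw = [100.0, 400.0, 1600.0]
print(log_of_pooled(raw))  # log2(700)
print(mean_of_logs(raw))   # log2(400)
```

The log of the average is never smaller than the average of the logs, and the gap grows with the spread among the pooled subjects.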

Some important experimental designs
  Reference designs (spotted arrays): each treatment sample is hybridized against a common reference control.
  Loop designs (spotted arrays): suppose we have a control and three experimental groups A, B and C. Then hybridize Control with A, A with B, B with C and C with Control, which closes the loop.

Data Analysis - Preliminaries
  Normalization
  Transformation of data (usual methods)
    Perhaps first fit ANOVA and plot the residuals
    Log transformation
    Square root
    More generally, the Box-Cox family of transformations
  Identify potential outliers in the data (again, perhaps use the residuals)
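The Box-Cox family covers both the log and (up to a linear rescaling) the square root transformation. A minimal sketch of the standard one-parameter form, with an illustrative function name:

```python
import math


def box_cox(y, lam):
    """One-parameter Box-Cox transform of a positive value y.

    lam = 0 gives the natural log; lam = 0.5 is a shifted and scaled square root.
    """
    if y <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam


for lam in (0, 0.5, 1):
    print(lam, box_cox(9.0, lam))
```

In practice the parameter is usually chosen from the data, for example so that the ANOVA residuals look as close to normal as possible.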

Data Analysis Method of Analysis depends upon the scientific question of interest. In the next three lectures we describe several general methods and illustrate some using real data!