Microarray Data Analysis The Bioinformatics side of the bench.

Presentation transcript:

Microarray Data Analysis The Bioinformatics side of the bench

The anatomy of your data files from Affymetrix array analysis
.DAT = image file (~10^7 pixels)
.CEL = measured cell intensities
.CDF = cell description file (identifies probe sets and probe pairs)
.CHP = calculated probe set data
.RPT = report generated from .CHP

Quality Control (QC) of the chip – visual inspection
Look at the .DAT file or the .CHP file image
–Scratches? Spots?
–Corners and outside border checkerboard appearance (B2 oligo)
   Positive hybridization control
   Used by software to place grid over image
–Array name is written out in oligos!

Chip defects

Internal controls
B. subtilis genes (added poly-A tails)
–Assessment of quality of sample preparation
–Also as hybridization controls
Hybridization controls (bioB, bioC, bioD, cre)
–E. coli and P1 bacteriophage biotin-labeled cRNAs
–Spiked into the hybridization cocktail
–Assess hybridization efficiency
Actin and GAPDH assess RNA sample/assay quality
–Compare signal values from 3’ end to signal values from 5’ end; the ratio generally should not exceed 3
Percent genes present (%P)
–Replicate samples: similar %P values
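The 3’/5’ rule of thumb above can be written as a small check. This is a minimal sketch: the function names and the example signal values are hypothetical, and the cutoff of 3 is the only number taken from the slide.

```python
def three_to_five_ratio(signal_3p, signal_5p):
    """Ratio of the 3' probe set signal to the 5' probe set signal."""
    if signal_5p <= 0:
        raise ValueError("5' signal must be positive")
    return signal_3p / signal_5p


def passes_ratio_qc(signal_3p, signal_5p, max_ratio=3.0):
    """True when the 3'/5' ratio does not exceed the cutoff (3 on the slide)."""
    return three_to_five_ratio(signal_3p, signal_5p) <= max_ratio


# Hypothetical GAPDH signals from two chips.
print(passes_ratio_qc(1200.0, 800.0))   # ratio 1.5
print(passes_ratio_qc(4000.0, 1000.0))  # ratio 4.0
```

A chip that fails this check is a candidate for closer inspection of RNA quality, not an automatic rejection.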

Microarray Data Process/Outline
1. Experimental Design
2. Image Analysis – scan to intensity measures (raw data)
3. Normalization – “clean” data
4. More “low level” analysis: fold change, ANOVA, data filtering
5. Data mining: how to interpret > 6000 measures
–Databases
–Software
–Techniques: clustering, pattern recognition, etc.
–Comparing to prior studies, across platforms?
6. Validation

Experimental Design
A good microarray design has 4 elements:
1. A clearly defined biological question or hypothesis
2. Treatment, perturbation and observation of biological materials should minimize systematic bias
3. Simple and statistically sound arrangement that minimizes cost and gains maximal information
4. Compliance with MIAME (Minimum Information About a Microarray Experiment)
The goal of statistics is to find signals in a sea of noise.
The goal of experimental design is to reduce the noise so signals can be found with as small a sample size as possible.

Observational Study vs. Designed Experiment
Observational study
–Investigator is a passive observer who measures variables of interest, but does not attempt to influence the responses
Designed experiment
–Investigator intervenes in the natural course of events
What type is our DMSO experiment?

Experimental Replicates
Why?
–In any experimental system there is a certain amount of noise, so even 2 identical processes yield slightly different results
–Sources?
–To understand how much variation there is, it is necessary to repeat an experiment a number of independent times
–Replicates allow us to use statistical tests to ascertain whether the differences we see are real

Technical vs. Biological Replicates
As we progress from the starting material to the scanned image, we move from a system dominated by biological effects to one dominated by chemistry and physics noise.
Within the Affy platform the dominant variation is usually biological, so the best strategy is to produce replicates as high up the experimental tree as possible.

Low level data analysis / pre-processing
–Varying biological or cellular composition among sample types
–Differences in sample preparation, labeling or hybridization
–Non-specific cross-hybridization of target to probes
These lead to systematic differences between individual arrays.
Steps: raw data, quality control, scaling, normalization and filtering.

Image Analysis - Raw Data

From probe level signals to gene abundance estimates
The job of the expression summary algorithm is to take a set of Perfect Match (PM) and Mismatch (MM) probes, and use these to generate a single value representing the estimated amount of transcript in solution, as measured by that probeset.
To do this, .DAT files containing array images are first processed to produce a .CEL file, which contains measured intensities for each probe on the array. It is the .CEL files that are analyzed by the expression calling algorithm.

MAS 5.0 output files
For each transcript (gene) on the chip:
–signal intensity
–a “present” or “absent” call (presence call)
–p-value (significance value) for making that call
Each gene is associated with a GenBank accession number (NCBI database)

How are transcripts determined to be present or absent?
Probe pair (PM vs. MM) intensities
–generate a detection p-value
–assign “Present”, “Absent”, or “Marginal” call for transcript
Every probe pair in a probe SET has a potential “vote” for presence call
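The “voting” idea can be sketched in a few lines. MAS 5.0 itself is usually described as applying a one-sided Wilcoxon signed-rank test to per-pair discrimination scores R = (PM - MM) / (PM + MM). The sketch below swaps in a simpler one-sided sign test so that it needs only the standard library, so its p-values will not match MAS 5.0 output. The threshold tau = 0.015 and the cutoffs 0.04 and 0.06 are the commonly quoted MAS 5.0 defaults and are assumptions here, not values from the slides; all function names are hypothetical.

```python
from math import comb


def discrimination_scores(pm, mm):
    """Per-pair score R = (PM - MM) / (PM + MM); assumes positive intensities."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]


def sign_test_p_value(scores, tau=0.015):
    """One-sided sign test: probability of at least this many 'votes'
    (scores above tau) if each pair were a fair coin flip."""
    votes_for = sum(1 for r in scores if r > tau)
    n = sum(1 for r in scores if r != tau)  # ties carry no vote
    if n == 0:
        return 1.0
    return sum(comb(n, k) for k in range(votes_for, n + 1)) / 2 ** n


def detection_call(pm, mm, tau=0.015, alpha1=0.04, alpha2=0.06):
    """Return (call, p_value) for one probe set."""
    p_value = sign_test_p_value(discrimination_scores(pm, mm), tau)
    if p_value < alpha1:
        return "Present", p_value
    if p_value < alpha2:
        return "Marginal", p_value
    return "Absent", p_value


# Hypothetical probe set with 11 probe pairs.
print(detection_call([500.0] * 11, [100.0] * 11))
print(detection_call([100.0] * 11, [100.0] * 11))
```

The sign test only counts how many pairs vote for presence, while a signed-rank test also weighs how strongly each pair votes, which is why the real algorithm is more sensitive.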

PM and MM Probes The purpose of each MM probe is to provide a direct measure of background and stray-signal (perhaps due to cross-hybridization) for its perfect-match partner. In most situations the signal from each probe-pair is simply the difference PM - MM. For some probe-pairs, however, the MM signal is greater than the PM value; we have an apparently impossible measure of background.
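A short illustration of the PM - MM difference and of the awkward case where MM exceeds PM. The intensities are made up and the function names are hypothetical. The sketch only flags the problematic pairs; it does not reproduce how MAS 5.0 corrects them.

```python
def probe_pair_differences(pm, mm):
    """Background-corrected signal per probe pair: PM - MM."""
    return [p - m for p, m in zip(pm, mm)]


def problematic_pairs(pm, mm):
    """Indices of probe pairs where MM >= PM, i.e. the 'background'
    estimate is at least as large as the signal itself."""
    return [i for i, (p, m) in enumerate(zip(pm, mm)) if m >= p]


pm = [300.0, 250.0, 120.0]  # made-up perfect-match intensities
mm = [100.0, 260.0, 80.0]   # made-up mismatch intensities
print(probe_pair_differences(pm, mm))  # second value is negative
print(problematic_pairs(pm, mm))       # index of the offending pair
```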

Thank goodness for software!!!
MAS 5.0 does these calculations for you
–.CHP file
Basic analysis in MAS 5.0, but it won’t handle replicates
Import MAS 5.0 (.CHP) data into other software: GeneSifter, GCOS, Spotfire, and many others

Signal Intensity
Following these calculations, the MAS 5.0 algorithm now has a measure of the signal for each probe in a probeset.
Other algorithms, e.g. RMA, GCRMA, dChip, PLIER and others, have been developed, largely by academic teams, to improve the precision and accuracy of this calculation.
In our experiment we will use RMA and GCRMA.

How do we want to analyze this data?
Pairwise analysis is most appropriate
–Control vs. DMSO
List of genes that are “upregulated” or “downregulated”
Determine fold up or down cutoffs
–What is significant? 1.5 fold up/down? 2 fold up/down? 10 fold up/down?
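The cutoff question is easier to reason about on a log2 scale, where a 2 fold increase is +1 and a 2 fold decrease is -1. A minimal sketch, assuming positive signal values; the gene labels and signals are invented for illustration and the function names are hypothetical.

```python
from math import log2


def log2_fold_change(control, treated):
    """log2(treated / control); symmetric around 0 for up and down changes."""
    return log2(treated / control)


def classify(control, treated, cutoff=2.0):
    """Label a gene 'up', 'down' or 'unchanged' for a given fold cutoff."""
    lfc = log2_fold_change(control, treated)
    threshold = log2(cutoff)
    if lfc >= threshold:
        return "up"
    if lfc <= -threshold:
        return "down"
    return "unchanged"


# Invented (control, DMSO) signals for three placeholder genes.
signals = {"geneA": (100.0, 200.0), "geneB": (200.0, 100.0), "geneC": (100.0, 140.0)}
for cutoff in (1.5, 2.0, 10.0):
    calls = {gene: classify(c, t, cutoff) for gene, (c, t) in signals.items()}
    print(cutoff, calls)
```

Running the loop shows how the gene list shrinks as the cutoff rises, which is the trade-off the slide is asking about. A fold cutoff alone says nothing about reproducibility, so it is usually paired with a statistical test on replicates.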

Normalization - “clean” data
“Normalizing” data allows comparisons ACROSS different chips
–Intensity of fluorescent markers might be different from one batch to the other
–Normalization allows us to compare those chips without altering the interpretation of changes in GENE EXPRESSION
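One simple way to make chips comparable is global scaling: multiply every intensity on a chip by a single factor so that the chip average hits a common target. This is a sketch of the idea only. The target of 500 and the use of a plain mean are assumptions for illustration (production software may use a trimmed mean instead), and the function name is hypothetical.

```python
from statistics import mean


def scale_to_target(intensities, target=500.0):
    """Rescale one chip so that its mean intensity equals the target."""
    factor = target / mean(intensities)
    return [x * factor for x in intensities]


# Two hypothetical chips; the second was scanned noticeably brighter overall.
chip_a = [100.0, 200.0, 300.0]
chip_b = [210.0, 390.0, 600.0]
print(scale_to_target(chip_a))
print(scale_to_target(chip_b))
```

Because every probe on a chip is multiplied by the same factor, ratios between genes within a chip are unchanged. Only the overall brightness difference between chips is removed.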

Why Normalize Data?
The experimental goal is to identify biological variation (expression changes between samples).
Technical variation can hide the real data.
Unavoidable systematic bias should be recognized and corrected.
Normalization is necessary to effectively make comparisons between chips, and sometimes within a single chip.
There are different methods of normalization; the assumptions about where variation exists determine which normalization technique is used.
Always look at data before and after normalization.
Spike-in controls can help show which method may be best.
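As one example of a method with a strong assumption, quantile normalization (the approach used in RMA) assumes that all chips share the same overall intensity distribution and forces them to. The sketch below is a bare-bones version that ignores ties and assumes every chip has the same number of probes; the function name and data are hypothetical.

```python
def quantile_normalize(arrays):
    """Give every array the same distribution: the mean of the sorted
    values across arrays, assigned back by each value's rank."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Mean of the k-th smallest value across all arrays.
    rank_means = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = rank_means[rank]
        result.append(normalized)
    return result


before = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
after = quantile_normalize(before)
print(before)
print(after)  # both arrays now contain the same set of values
```

Printing the data before and after, as here, is the cheapest form of the slide's advice to always look at both.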

Caveat…
There is NO standard way to analyze microarray data
Still figuring out how to get the “best” answers from microarray experiments
Best to combine knowledge of biology, statistics, and computers to get answers

Venn Diagrams (figure: MAS 5.0, RMA, GCRMA)

Data processing is completed, now what?
Fold change, ANOVA, data filtering

Where are we now?
Ran analysis, output is a GENE LIST
–List indicates what genes are up or down regulated
–p values for t-test
–Graphs of signal levels
Absolute numbers not as important here as the trends you see
–Now what????
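For readers curious where the t-test column comes from, here is a sketch of the statistic behind it for one gene. It assumes Welch's unequal-variance form, which may or may not be what a given analysis package uses, and it stops at the t statistic and degrees of freedom because the Python standard library has no t distribution for the final p-value. The replicate values and the function name are invented.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(control, treated):
    """Welch's t statistic and approximate degrees of freedom.
    Positive t means the treated mean is higher than the control mean."""
    va = variance(control) / len(control)  # squared standard error, control
    vb = variance(treated) / len(treated)  # squared standard error, treated
    t = (mean(treated) - mean(control)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(control) - 1) + vb ** 2 / (len(treated) - 1))
    return t, df


# Invented log2 signals for one gene, three replicates per condition.
t, df = welch_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(round(t, 3), round(df, 1))
```

With only a handful of replicates the degrees of freedom are small, so even a large t value can correspond to a modest p-value. That is one reason the slide stresses trends over absolute numbers.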

What is the first set of genes on our chips that will be “filtered” out?

Follow the links
Click on a gene
Find links to other databases
Follow links to discover what the protein does
Now the fun part begins….

Back to Biology
Do the changes you see in gene expression make sense BIOLOGICALLY?
If they don’t make sense, can you hypothesize as to why those genes might be changing?
Leads to many, many more experiments

A Common Language for Annotation of Genes from Yeast, Flies and Mice The Gene Ontologies …and Plants and Worms …and Humans …and anything else!

Gene Ontology Objectives
GO represents concepts used to classify specific parts of our biological knowledge:
–Biological Process
–Molecular Function
–Cellular Component
GO develops a common language applicable to any organism
GO terms can be used to annotate gene products from any species, allowing comparison of information across species

The ontology. Dividing human knowledge into a clean set of categories is a lot like trying to figure out where to find that suspenseful black comedy at your corner video store. Questions inevitably come up, like are Movies part of Art or Entertainment? (Yahoo! lists them under the latter.)
Srinija Srinivasan, Chief Ontologist, Yahoo! (Wired Magazine, May 1996)

The 3 Gene Ontologies
Molecular Function = elemental activity/task
–the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
Biological Process = biological goal or objective
–broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
Cellular Component = location or complex
–subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme

Example: Gene Product = hammer

Function (what)            Process (why)
Drive nail (into wood)     Carpentry
Drive stake (into soil)    Gardening
Smash roach                Pest Control
Clown’s juggling object    Entertainment

Biological Examples (figure: Molecular Function, Biological Process, Cellular Component)

Validation
Not enough to just do microarrays
Usually “validate” microarray results via some other technique
–RT-PCR
–TaqMan
–Northern analysis
–Protein level analysis
No technique is perfect…

Yeast Genome and Data Mining

Dynamic Nature of Yeast Genome
eORF = essential
kORF = known
hORF = homology identified
shORF = short
tORF = transposon identified
qORF = questionable
dORF = disabled
The first published sequence claimed 6274 genes, a number that has been revised many times. Why?

The Affy detection oligonucleotide sequences are frozen at the time of synthesis. How does this impact downstream data analysis?

Terms, Definitions, IDs
term: MAPKKK cascade (mating sensu Saccharomyces)
goid: GO:
definition: MAPKKK cascade involved in transduction of mating pheromone signal, as described in Saccharomyces
definition_reference: PMID:

SGD

SGD public microarray data sets available for public query

Homework
1. Go to and find 3 candidate genes of known function and one of undefined function that you might predict to be altered by DMSO treatment: http://
2. What GO biological processes and molecular functions are associated with your candidate genes?
3. Where in the cell does the protein reside?
4. What other proteins are known or inferred to interact with yours? How was this interaction determined? Is this a genetic or physical interaction?
5. Find the expression of at least one of your known genes in another publicly deposited microarray data set.
   1. What is the name of the data set, and how did you find it?
   2. What is the largest fold change observed for this gene in the public study?
6. Now that you are microarray technology experts, can you give me 3 reasons why the observed transcript level difference may not be confirmed through a second technology like RT-qPCR?

Suggested Reading