4. Functional Genomics and Microarray Analysis (1)


Background: Functional Genomics
Functional genomics is the systematic analysis of gene activity in healthy and diseased tissues. Its goal is an overall picture of genome function, including expression profiles at the mRNA level and the protein level.
Functional genome analysis is used to understand the functions of genes and proteins in an organism (typically known as genome annotation); in integrative biology and systems biology studies that aim to understand health and disease states (e.g. cancer, obesity); and as an important step in the search for new target molecules in the drug discovery process (which genes or proteins to target, and how).

What is…?
Gene Expression: The process by which the information encoded in a gene is converted into an observable phenotype (most commonly production of a protein). Also, the degree to which a gene is active in a certain tissue of the body, measured by the amount of mRNA in the tissue.
Microarrays: Tools used to measure the presence and abundance of gene expression (measured as mRNA) in tissue. Microarray technologies provide a powerful tool by which the expression patterns of thousands of genes can be monitored simultaneously and measured quantitatively.
[Diagram: DNA sequence (split into genes), amino acid sequence, protein 3D structure, function, cell activity; arrows labelled "codes for", "folds into", "dictates", "has", "determines".]

Applications of Microarray Technology
Applications are covered only as example contexts; the emphasis is on analysis methods. Microarrays can be used to: identify genes expressed in different cell types (e.g. liver vs. kidney); learn how expression levels change across developmental stages (embryo vs. adult); learn how expression levels change during disease development (cancerous vs. non-cancerous); learn how groups of genes inter-relate (gene-gene interactions); and identify the cellular processes that genes participate in (structure, repair, metabolism, replication, etc.).

Microarrays: Basic Idea
A microarray is a device that detects the presence and abundance of labelled nucleic acids in a biological sample. In the majority of experiments, the labelled nucleic acids are derived from the mRNA of a sample or tissue. The microarray consists of a solid surface onto which known DNA molecules have been chemically bonded at specific locations. Each array location is typically known as a probe and contains many replicates of the same molecule. The molecules in each array location are carefully chosen so as to hybridise only with mRNA molecules corresponding to a single gene.
Affymetrix Inc. is the leading provider of microarray technology (GeneChip®): http://www.affymetrix.com/

Basic Idea
A microarray works by exploiting the ability of a given mRNA molecule to bind specifically to, or hybridize to, the DNA template from which it originated. By using an array containing many DNA samples, scientists can determine, in a single experiment, the expression levels of hundreds or thousands of genes within a cell by measuring the amount of mRNA bound to each site on the array. With the aid of a computer, the amount of mRNA bound to the spots on the microarray is precisely measured, generating a profile of gene expression in the cell.
Several companies sell equipment to make DNA chips, including spotters to deposit the DNA on the surface and scanners to detect the fluorescent or radioactive signals.

Microarray Process
The molecules in the target biological sample are labelled with a fluorescent dye before the sample is applied to the array. If a gene is expressed in the sample, the corresponding mRNA hybridises with the molecules on a given probe (array location). If a gene is not expressed, no hybridisation occurs on the corresponding probe.
Reading the array output: After the sample is applied, a laser light source is applied to the array. The fluorescent label reveals which probes have hybridised (presence) via the light emitted from the probe. If a gene is highly expressed, more mRNA exists and thus more mRNA hybridises to the probe molecules, so the intensity of the emitted light indicates abundance.

The Array: Chemistry Basics
Surface chemistry is used to attach the probe molecules to the glass substrate. Chemical reactions are used to attach the fluorescent dyes to the target molecules. Probe and target hybridise to form a double helix.

Affymetrix GeneChip: Example of Single-Label Chips
Hundreds of thousands of oligonucleotide probes are packed at extremely high densities. The probes are designed to maximize sensitivity, specificity, and reproducibility, allowing consistent discrimination between specific and background signals, and between closely related target sequences. RNA is labelled and scanned in a single "color", with one sample per chip.

From Microarray Images to Gene Expression Matrices
Raw data: array scans (images). Intermediate data: spot/image quantitations (spots). Final data: the gene expression matrix, with samples along one dimension, genes along the other, and gene expression levels as entries.

Steps of a Microarray Experiment
Biological question; sample attributes; experimental design; platform choice; microarray experiment; 16-bit TIFF files; image analysis (quantify the dots); normalization; clustering, statistical analysis, data mining, pattern discovery, classification; biological verification and interpretation.

Qualitative Interpretation of Reads
GREEN represents high Control hybridization. RED represents high Sample hybridization. YELLOW represents a combination of Control and Sample where both hybridized equally. BLACK represents areas where neither the Control nor the Sample hybridized.
The main issue is to quantify the results: How green is green? What is the ratio of the signal to background noise? How do we compare multiple experiments using different chips? How do we quantify cross hybridization (if any)?

Normalization
Normalisation is a general term for a collection of methods that are directed at reasoning about and resolving the systematic errors and bias introduced by microarray experimental platforms. Normalisation methods stand in contrast with the data analysis methods described in other lectures (e.g. differential gene expression analysis, classification and clustering). Our overall aim is to be able to quantify measured or calculated variability, differentials and similarity: are they biologically significant, or just side effects of the experimental platforms and conditions?

Why Normalization: Sources of Microarray Data Variability
The measured gene expression in any experiment includes the true gene expression together with contributions from many sources of variability. There are several levels of variability in the measured gene expression of a feature. At the highest level, there is biological variability in the population from which the sample derives. At an experimental level, there is variability between preparations and labelling of the sample, variability between hybridisations of the same sample to different arrays, and variability between the signal on replicate features on the same array.
[Figure: the true gene expression of an individual, combined with variability between individuals, between sample preparations, between arrays and hybridisations, and between replicate features, gives the measured gene expression.]

Normalisation Examples: Probe Intensity Value
The raw intensities of signal from each spot on the array are not directly comparable. Depending on the types of experiments done, a number of different approaches to normalization may be needed. Not all types of normalization are appropriate in all experiments, and some experiments may use more than one type. A typical problem is that there is usually more variability at low intensity.
Reasonable assumption: the intensities of fluorescent molecules reflect the abundance of the mRNA molecules. This is generally true but can be problematic.
Example: the intensity of the gene A spot is 100 units in the normal-tissue array and 50 units in the cancer-tissue array. Naive conclusion: gene A's expression level in normal tissue is significantly higher than in cancer tissue.

Normalisation Examples: Probe Intensity Value
Problem: what if the overall background intensity of the normal-tissue array is 95 units while the background intensity of the cancer-tissue array is 10 units?
Solutions: subtract the background intensity value, or take the ratio of spot intensity to background intensity (preferable). In both cases you have to decide where to measure background intensity (e.g. local to the spot or globally per chip).
In general, many factors could contribute to the background intensity of a microarray chip. To compare microarray data across different chips, the data (intensity levels) need to be normalized to the "same" level.
[Images: examples of how background intensity can be calculated.]
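The two background corrections can be tried on the slide's own numbers (spot 100 over background 95 on the normal-tissue array, spot 50 over background 10 on the cancer-tissue array). This is a minimal Python sketch; the function names are illustrative and do not come from any microarray package.

```python
def subtract_background(spot, background):
    """Background-subtracted spot intensity."""
    return spot - background


def signal_to_background(spot, background):
    """Ratio of spot intensity to background intensity."""
    return spot / background


# Values taken from the example in the text.
normal = {"spot": 100, "background": 95}
cancer = {"spot": 50, "background": 10}

normal_sub = subtract_background(normal["spot"], normal["background"])
cancer_sub = subtract_background(cancer["spot"], cancer["background"])
normal_ratio = signal_to_background(normal["spot"], normal["background"])
cancer_ratio = signal_to_background(cancer["spot"], cancer["background"])

print(normal_sub, cancer_sub)
print(round(normal_ratio, 2), round(cancer_ratio, 2))
```

With either correction the reading of the raw intensities is reversed: once background is accounted for, gene A appears more strongly expressed on the cancer-tissue array than on the normal-tissue array.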

Differential Gene Expression Analysis
Consider a microarray experiment that measures gene expression in two groups of rat tissue (>5000 genes in each experiment). The rat tissues come from two groups: WT (wild-type rat tissue) and KO (knock-out treatment rat tissue). Gene expression for each group is measured under similar conditions.
Questions: Which genes are affected by the treatment? How significant is the effect? How big is the effect?

Calculating Expression Ratios
In differential gene expression analysis, we are interested in identifying genes with different expression across two states, e.g.: tumour cell lines vs. normal cell lines; treated tissue vs. diseased tissue; different tissues, same organism; same tissue, different organisms; same tissue, same organism; time course experiments.
We can quantify the difference (effect) by taking a ratio: for gene k, this is the expression in state a divided by the expression in state b. This provides a relative value of change (e.g. expression has doubled). If the expression level has not changed, the ratio is 1.

Fold change (Fold ratio)
A gene is up-regulated in state 2 compared to state 1 if it has a higher value in state 2, and down-regulated if it has a lower value in state 2.
Ratios are troublesome because up-regulated and down-regulated genes are treated differently. Genes up-regulated by a factor of 2 have a ratio of 2, while genes down-regulated by the same factor have a ratio of 0.5. As a result, down-regulated genes are compressed between 1 and 0, while up-regulated genes expand between 1 and infinity. Taking the logarithm to base 2 of the ratio rectifies the problem; the result is typically known as the fold change.

Examples of fold change

Gene ID   Expression in state 1   Expression in state 2   Ratio   Fold change
A         100                     50                      2       1
B         10                      5                       2       1
C         -                       -                       0.5     -1
D         200                     -                       -       7.65
E         -                       -                       1       0

(A dash marks a value that is missing here.) A, B and D are down-regulated, C is up-regulated, and E has no change.
You can calculate fold change between pairs of expression values, e.g. between state 1 and state 2 for gene A, or between the mean values of all measurements for a gene in the WT/KO experiments: mean(WT1..WT4) vs. mean(KO1..KO4).
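The ratio and its base-2 logarithm can be computed directly. The sketch below is a minimal Python illustration that assumes the ratio is taken as state 1 over state 2 (the convention implied by gene A's row); the replicate values used for the group means are made up.

```python
import math
from statistics import mean


def ratio(state1, state2):
    """Expression ratio, state 1 over state 2."""
    return state1 / state2


def fold_change(state1, state2):
    """Log2 of the ratio: 0 means no change; +1 and -1 are a factor of 2 either way."""
    return math.log2(ratio(state1, state2))


fc_a = fold_change(100, 50)   # gene A from the table
fc_b = fold_change(10, 5)     # gene B from the table
fc_half = fold_change(5, 10)  # hypothetical values giving a ratio of 0.5

# Fold change between group means; these replicate values are hypothetical.
wt = [100, 120, 90, 110]
ko = [50, 55, 45, 60]
fc_means = fold_change(mean(wt), mean(ko))

print(fc_a, fc_b, fc_half, fc_means)
```

On the log scale a halving and a doubling sit at the same distance from zero (-1 and +1), which is the symmetry the raw ratio lacks.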

Statistics: Significance of Fold Change
For our problem we can calculate an average fold ratio for each gene (each row). This gives us an average effect value for each gene: 2, 1.7, 10, 100, etc.
Question: which of these values are significant? We can use a threshold, but what threshold value should we set? Instead, we use statistical techniques based on the number of members in each group, the type of measurements, etc. This is significance testing.

Statistics: Unpaired statistical experiments
Overall setting: 2 groups of 4 individuals each. Group 1: Imperial students. Group 2: UCL students.
Experiment 1: We measure the height of all students. We want to establish whether members of one group are consistently (or on average) taller than members of the other, and whether the measured difference is significant.
Experiment 2: We measure the weight of all students. We want to establish whether members of one group are consistently (or on average) heavier than the other, and whether the measured difference is significant.
Experiment 3: …
[Table layout: one condition, with columns for Group 1 members and Group 2 members.]

Statistics: Unpaired statistical experiments
In unpaired experiments, you typically have two groups of people that are not related to one another, and you measure some property for each member of each group. For example, to test whether a new drug is effective, you divide similar patients into two groups: one group takes the drug and the other takes a placebo. You measure (quantify) the effect in both groups some time later, and you want to establish whether there is a significant difference between the groups at that later point.
The WT/KO example is an unpaired experiment if the rats in the two experiments are different!

Statistics: Unpaired statistical experiments
The WT/KO example is an unpaired experiment if the rats in the two experiments are different!

Experiment for WT rats, gene 96608_at
Rat #   WT gene expression
WT1     100
WT2     (missing)
WT3     200
WT4     300

Experiment for KO rats, gene 96608_at
Rat #   KO gene expression
KO1     150
KO2     300
KO3     100
KO4     (missing)

Statistics: Unpaired statistical experiments
How do we address the problem? Compare the two sets of results (alternatively, calculate the mean for each group and compare the means).
Graphically: scatter plots, box plots, etc.
Statistically: use the unpaired t-test.
Are these two series significantly different?

Statistics: Paired statistical experiments
In paired experiments, you typically have one group of people and measure some property for each member before and after a particular event, so measurements come in pairs of before and after. For example, to test the effectiveness of a new tanning cream, you measure the tan of each individual before the cream is applied and again after it is applied. You want to establish whether there is a significant difference between the measurements before and after applying the cream for the group as a whole.
[Table layout: one row per group member, with columns for Condition 1 and Condition 2.]

Statistics: Paired statistical experiments
The WT/KO example is a paired experiment if the rats in the two experiments are the same!
Experiments for gene 96608_at (rat #: WT gene expression, KO gene expression; some values are missing here):
Rat1: 100, 200
Rat2: 300
Rat3: 400
Rat4: 500

Statistics: Paired statistical experiments
How do we address the problem? Calculate the difference for each pair and compare the differences to zero (alternatively, compare the average difference to zero).
Graphically: scatter plot of the differences, box plots, etc.
Statistically: use the paired t-test.
Are the differences close to zero?

Statistics: Significance testing
In both cases (paired and unpaired) you want to establish whether the difference is significant. Significance testing is a statistical term and refers to estimating (numerically) the probability of a measurement occurring by chance. To do this, you need to review some basic statistics: normal distributions (mean, standard deviation, etc.); hypothesis testing; t-distributions; t-tests and p-values.

Mean and standard deviation
The mean and standard deviation tell you the basic features of a distribution.
Mean: the average value of all members of the group, $\mu = (x_1 + x_2 + x_3 + \dots + x_N)/N$.
Standard deviation: a measure of how much the values of individual members vary in relation to the mean.
The normal distribution is symmetrical about the mean, and 68% of the normal distribution lies within 1 s.d. of the mean.

Note on s.d. calculation
Throughout the following slides and in the tutorials, I use the simple form of the standard deviation formula. Some people use the unbiased form instead (for good reasons). Please use the simple form if you want the answers to add up at the end.
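The two forms can be compared in Python's standard library. This sketch assumes that the "simple form" is the one that divides the sum of squared deviations by N, and that the "unbiased form" divides by N - 1; the measurements are hypothetical.

```python
from statistics import mean, pstdev, stdev

values = [100, 200, 300, 400]  # hypothetical measurements

m = mean(values)
sd_simple = pstdev(values)    # assumed "simple form": divides by N
sd_unbiased = stdev(values)   # "unbiased form": divides by N - 1

print(m, round(sd_simple, 2), round(sd_unbiased, 2))
```

The simple form is always the smaller of the two, and the gap matters most for small groups such as the four-member groups used in the examples here.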

The Normal Distribution
Many continuous variables follow a normal distribution, and it plays a special role in the statistical tests we are interested in. The x-axis represents the values of a particular variable. The y-axis represents the proportion of members of the population that have each value of the variable. The area under the curve represents probability, i.e. the area under the curve between two values on the x-axis represents the probability of an individual having a value in that range. For example, 68% of the distribution lies within 1 s.d. of the mean.
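The "area under the curve" reading can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (3.8 or later); the helper name `probability_between` is illustrative.

```python
from statistics import NormalDist

standard = NormalDist(mu=0.0, sigma=1.0)


def probability_between(dist, low, high):
    """Area under the density between low and high."""
    return dist.cdf(high) - dist.cdf(low)


within_1sd = probability_between(standard, -1.0, 1.0)
within_2sd = probability_between(standard, -2.0, 2.0)

print(round(within_1sd, 4), round(within_2sd, 4))
```

The first value reproduces the 68% figure quoted for one standard deviation either side of the mean.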

Hypothesis Testing: Are two data sets different?
We use the z-test (normal distribution) if the standard deviations of the two populations from which the data sets came are known (and are the same). If the standard deviation is not known, we use the t-test.
We pose a null hypothesis (Ho) that the means are equal, and try to refute it by using the distribution curves to calculate a p-value: the probability of seeing a difference at least as large as the one observed if the null hypothesis were true.
If the probability is low (low p), reject the null hypothesis and accept the alternative hypothesis (Ha: the means are different). If the probability is high (high p), do not reject the null hypothesis: the data are consistent with equal means.
Ho: Population 1 and Population 2 have the same mean. Ha: Population 1 and Population 2 have different means.
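For the known-standard-deviation case, the z-test can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function name, the height data and the assumed known sigma of 5 cm are all hypothetical, and the test is two-sided.

```python
import math
from statistics import NormalDist, mean


def z_test_two_sample(x, y, sigma):
    """Two-sided z-test for equal means when both populations share a known s.d. sigma."""
    z = (mean(x) - mean(y)) / (sigma * math.sqrt(1 / len(x) + 1 / len(y)))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical heights (cm) for two groups of four, with an assumed known sigma.
group1 = [170, 175, 180, 185]
group2 = [160, 165, 170, 175]
z, p = z_test_two_sample(group1, group2, sigma=5.0)

print(round(z, 3), round(p, 4))
```

Here the p-value falls well below 0.05, so the null hypothesis of equal means would be rejected for these made-up data.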

Comparing Two Samples: Graphical interpretation
To compare two groups, you can compare the mean of one group with that of the other graphically. The graphical comparison lets you see the distributions of the two groups. If the p-value is low, chances are there will be little overlap between the two distributions. If the p-value is not low, there will be a fair amount of overlap between the two groups. We can set a critical value on the x-axis based on the p-value threshold.

t-test terminology
t-test: used to compare the mean of a sample to a known number.
Assumptions: subjects are randomly drawn from a population, and the distribution of the mean being tested is normal.
Test: the hypotheses for a single-sample t-test are Ho: $\mu = \mu_0$ and Ha: $\mu \neq \mu_0$, where $\mu_0$ denotes the hypothesized value to which you are comparing a population mean.
p-value: the probability, if the hypothesis of no difference is true, of obtaining a result at least as extreme as the one observed. Loosely, it is the risk of error in rejecting that hypothesis.

t-Tests Intuitively

t-test terminology: Unpaired vs. paired t-test
Same as before: which one you use depends on your experiment.
Unpaired t-test: the hypotheses for the comparison of two independent groups are Ho: $\mu_1 = \mu_2$ (the means of the two groups are equal) and Ha: $\mu_1 \neq \mu_2$ (the means of the two groups are not equal).
Paired t-test: the hypotheses for paired measurements in the same individuals are Ho: $D = 0$ (the difference between the two observations is 0) and Ha: $D \neq 0$ (the difference is not 0).

Calculating t-test (t statistic)
Remember these formulae! First calculate the t statistic value and then calculate the p-value.
For the paired t-test, t is calculated from d, the differences between the paired measurements, and n, the number of pairs being tested.
For an unpaired (independent group) t-test, t is calculated from the two group means together with σ(x), the standard deviation of x, and n(x), the number of elements in x (and likewise for the second group).
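The formulae themselves are not reproduced here, so the following Python sketch uses the standard textbook forms as an assumption: the paired statistic divides the mean difference by its standard error, and the unpaired statistic divides the difference of the group means by the combined standard error of the two groups. It uses the N - 1 standard deviation (`statistics.stdev`), so the numbers can differ from tutorial answers worked with the simple form. All data are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t(before, after):
    """t = mean(d) / (sd(d) / sqrt(n)), with d the per-pair differences."""
    d = [a - b for b, a in zip(before, after)]
    n = len(d)
    return mean(d) / (stdev(d) / math.sqrt(n))


def unpaired_t(x, y):
    """t = (mean(x) - mean(y)) / sqrt(sd(x)^2 / n(x) + sd(y)^2 / n(y))."""
    se = math.sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    return (mean(x) - mean(y)) / se


# Hypothetical expression values for four rats under two conditions.
before = [100, 200, 300, 400]
after = [210, 290, 420, 480]

t_paired = paired_t(before, after)
t_unpaired = unpaired_t(before, after)

print(round(t_paired, 3), round(t_unpaired, 3))
```

The same eight numbers give a much larger t when treated as pairs, because the large rat-to-rat spread cancels out of the differences. Note that `paired_t` divides by zero when every difference is identical (s.d. of 0), which is the "t near infinity" case.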

Calculating p-value for t-test
When carrying out a test, a p-value can be calculated from the t-value and the degrees of freedom. There are three ways of calculating p: one-tailed (>), one-tailed (<), and two-tailed. In each case p(t, v) is looked up from the t-distribution table.
The number of degrees of freedom (v) is calculated as: unpaired, $v = n(x) + n(y) - 2$; paired, $v = n - 1$ (where n is the number of pairs).
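Instead of a table lookup, the tail area of the t-distribution can be computed numerically. The sketch below is a standard-library-only Python illustration that integrates the t density with Simpson's rule. The function names are illustrative, and reading "one-tailed >" as P(T > t) and "one-tailed <" as P(T < t) is an assumption.

```python
import math


def t_pdf(t, v):
    """Density of Student's t-distribution with v degrees of freedom."""
    c = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return c * (1 + t * t / v) ** (-(v + 1) / 2)


def upper_tail(t, v, steps=2000):
    """P(T > t), using Simpson's rule on [0, |t|] and the symmetry of the density."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, v) + t_pdf(a, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    area = total * h / 3   # integral of the density from 0 to |t|
    tail = 0.5 - area      # P(T > |t|)
    return tail if t >= 0 else 1 - tail


def p_values(t, v):
    """One-tailed (>), one-tailed (<) and two-tailed p-values for a t statistic."""
    greater = upper_tail(t, v)
    less = 1 - greater
    two_tailed = 2 * upper_tail(abs(t), v)
    return greater, less, two_tailed


greater, less, two_tailed = p_values(1.812, 10)
print(round(greater, 3), round(less, 3), round(two_tailed, 3))
```

For t = 1.812 with 10 degrees of freedom, the upper-tail value comes out close to 0.05, in line with standard t tables. `math.gamma` overflows for very large degrees of freedom, so this sketch is only suited to small groups.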

p-values
Results of the t-test: If the p-value associated with the t-test is small (the threshold is usually set at p < 0.05), there is evidence to reject the null hypothesis in favour of the alternative. In other words, there is evidence that the mean is significantly different from the hypothesized value. If the p-value associated with the t-test is not small (p > 0.05), there is not enough evidence to reject the null hypothesis, and you cannot conclude that the mean differs from the hypothesized value.

t-value and p-value
Given a t-value and the degrees of freedom, you can look up a p-value. Alternatively, if you know what p-value you need (e.g. 0.05) and the degrees of freedom, you can set the threshold for the critical t.

Finding a critical t
The table provides the t values (tc) for which P(tx > tc) = A.
[Figure: t-distribution with an area of A = .05 in each tail, beyond -tc = -1.812 and tc = 1.812. Table columns: t.100, t.05, t.025, t.01, t.005.]
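A critical t can also be found numerically by searching for the value whose upper-tail area equals A. The sketch below is a standard-library Python illustration using Simpson's rule and bisection. It assumes 0 < A < 0.5 and a critical value below 100, and it assumes 10 degrees of freedom for the example, which is the row of standard t tables where t.05 = 1.812.

```python
import math


def upper_tail(t, v, steps=2000):
    """P(T > t) for t >= 0, by Simpson's rule on the t density with v degrees of freedom."""
    c = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))

    def pdf(x):
        return c * (1 + x * x / v) ** (-(v + 1) / 2)

    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3


def critical_t(alpha, v, lo=0.0, hi=100.0):
    """Find tc with P(T > tc) = alpha by bisection (the tail shrinks as t grows)."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if upper_tail(mid, v) > alpha:
            lo = mid   # tail still too large: tc lies further out
        else:
            hi = mid
    return (lo + hi) / 2


tc = critical_t(0.05, 10)
print(round(tc, 3))
```

Any t statistic beyond the returned value (in the relevant tail) corresponds to a one-tailed p-value below A.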

Meaning of t-value: High t-value
Take Gene A, assuming a paired test (the same reasoning holds for either type of test). The average difference is 100 and the s.d. is 0, so the t value is near infinity and p is extremely low.

Consider Gene M for a paired experiment
Here d is calculated from the differences between the paired measurements. The average difference is 0, so the t value is zero. What does this mean?

Consider Gene T for a paired experiment

Hypothesis Testing
Differential expression analysis uses hypothesis testing methodology. For each gene (>5,000): pose the null hypothesis (Ho) that the gene is not affected; pose the alternative hypothesis (Ha) that the gene is affected; use statistical techniques to calculate a p-value for the test of Ho; if the p-value is below some critical value, reject Ho and accept Ha.
The issues: there is a large number of genes (or experiments); we need a quick way to pick out the significant genes that have a high fold change; and we also need to sort genes by fold change and significance.
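The filtering and sorting step can be sketched as follows in Python. The thresholds (an absolute log2 fold change of at least 1 and p < 0.05), the record layout and the per-gene numbers are all hypothetical choices made for illustration.

```python
def select_genes(results, min_abs_fold_change=1.0, max_p=0.05):
    """Keep genes passing both thresholds, most significant first."""
    hits = [g for g in results
            if abs(g["fold_change"]) >= min_abs_fold_change and g["p"] < max_p]
    # Sort by p-value, breaking ties by larger absolute fold change.
    return sorted(hits, key=lambda g: (g["p"], -abs(g["fold_change"])))


# Hypothetical per-gene results: log2 fold change and t-test p-value.
results = [
    {"gene": "A", "fold_change": 1.0, "p": 0.001},
    {"gene": "B", "fold_change": 0.2, "p": 0.0005},
    {"gene": "C", "fold_change": -1.5, "p": 0.03},
    {"gene": "D", "fold_change": 2.0, "p": 0.40},
]
selected = [g["gene"] for g in select_genes(results)]
print(selected)
```

Gene B is dropped despite its very low p-value because its effect is small, and gene D is dropped despite its large effect because it is not significant; only genes with both properties survive.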

Volcano Plots: A visual approach
Volcano plots are a graphical means of visualising the results of large numbers of t-tests. They let us plot both the effect and the significance of each test in an easy-to-interpret way. For each gene, calculate the significance of the change (t-test, p-value). For each gene, compare the value of the effect between the populations WT and KO (fold change). Then identify the genes with both high effect and high significance.

Volcano plots
In a volcano plot, the x-axis represents the effect measured as fold change: $\text{Effect} = \log_2(\text{WT}) - \log_2(\text{KO}) = \log_2(\text{WT}/\text{KO})$. If WT = KO, the fold change is 0; if WT = 2 KO, the fold change is 1; and so on. The y-axis represents significance as the number of zeroes in the p-value.

Numerical Interpretation (Significance)
Using log10 for the y-axis, $y = -\log_{10}(p)$: p < 0.1 (1 decimal place) gives y > 1, and p < 0.01 (2 decimal places) gives y > 2.
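The coordinates of a single point on a volcano plot follow directly from these two definitions. A minimal Python sketch, with an illustrative function name and hypothetical expression means and p-values:

```python
import math


def volcano_point(wt_mean, ko_mean, p_value):
    """Return (x, y): x = log2(WT / KO), y = -log10(p)."""
    x = math.log2(wt_mean / ko_mean)
    y = -math.log10(p_value)
    return x, y


equal = volcano_point(100, 100, 0.5)     # WT = KO
doubled = volcano_point(200, 100, 0.01)  # WT = 2 KO, p = 0.01
halved = volcano_point(50, 100, 0.001)   # WT = KO / 2, p = 0.001

for point in (equal, doubled, halved):
    print(round(point[0], 3), round(point[1], 3))
```

Genes of interest are those far from zero on the x-axis and high on the y-axis, i.e. the upper-left and upper-right corners of the plot.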