BIOINFORMATICS. Dr. Aladdin Hamwieh, Khalid Al-shamaa, Abdulqader Jighly. 2010-2011. Lecture 8: Analyzing Microarray Data. Aleppo University, Faculty of Technical Engineering.

BIOINFORMATICS
Dr. Aladdin Hamwieh, Khalid Al-shamaa, Abdulqader Jighly
Lecture 8: Analyzing Microarray Data
Aleppo University, Faculty of Technical Engineering, Department of Biotechnology

Microarrays can monitor many genes at once. A DNA microarray is an inert, solid, flat, transparent surface (e.g., a microscope slide) onto which 20,000 to 60,000 short DNA probes of specified sequences are tethered in an ordered layout. Each probe corresponds to a particular short section of a gene, so a single gene is covered by several probes that span different parts of the gene sequence.

REPOSITORIES OF MICROARRAY STUDIES
Because microarrays are so widely used, data repositories have flourished worldwide. Some of the largest gene expression databases are:
1. The Gene Expression Omnibus (GEO), hosted by the National Center for Biotechnology Information (NCBI)
2. The Stanford Microarray Database (SMD)
And for plants: the Plant Expression Database (PLEXdb)

DNA microarrays measure RNA abundance with either one channel (one color) or two channels (two colors).
Affymetrix GeneChip arrays have one channel and use either the red fluorescent dye Cy5 or the green fluorescent dye Cy3.
Stanford microarrays have two channels: by competitive hybridization they measure the relative expression under a given condition (labeled with the red fluorescent dye Cy5) compared to its control (labeled with the green fluorescent dye Cy3).

[Figure: analysis workflow. Biological question (differentially expressed genes, sample class prediction, etc.) → Experimental design → Microarray experiment → 16-bit TIFF files → Image analysis → (Rfg, Rbg), (Gfg, Gbg) → Normalization → R, G → Estimation, Testing, Clustering, Discrimination → Biological verification and interpretation]

[Workflow figure repeated, highlighting the step: Experimental design]

VIDEO

[Workflow figure repeated, highlighting the step: Microarray experiment]

MICROARRAY EXPERIMENT
1. Isolate the mRNA
2. Make a labelled cDNA library
3. Apply the labelled cDNA to the slide
4. Scan the slide
5. Clean up the image
6. Extract the data
7. Analyse the data

RESULTS
The colors denote the degree of expression in the experimental versus the control cells. The legend distinguishes genes that are:
- not expressed in control or in experimental cells
- expressed only in control cells
- expressed mostly in control cells
- expressed only in experimental cells
- expressed mostly in experimental cells
- expressed equally in both cells

Let us turn to the analysis and its mathematical problems. We now have many images that contain a huge amount of information, so:
1. we have to clean up the images
2. we have to extract our data

[Workflow figure repeated, highlighting the step: Image analysis]

IMAGE ANALYSIS
The raw data from a cDNA microarray experiment consist of pairs of image files (16-bit TIFFs), one for each of the dyes. Image analysis is required to extract measures of the red and green fluorescence intensities for each spot on the array.

STEPS IN IMAGE ANALYSIS
1. Addressing. Estimate the location of the spot centers.
2. Segmentation. Classify pixels as foreground (signal) or background.
3. Information extraction. For each spot on the array and each dye, compute the foreground intensities, the background intensities, and quality measures.

WHY DO WE CALCULATE THE BACKGROUND INTENSITIES?
Motivation behind background adjustment: a spot's measured fluorescence intensity includes a contribution that is not specifically due to the hybridization of the target to the probe, but to something else, e.g. the chemical treatment of the slide or autofluorescence. We want to estimate and remove this unwanted contribution.

QUANTIFICATION OF EXPRESSION
For each spot on the slide we calculate:
Red intensity = Rfg - Rbg
Green intensity = Gfg - Gbg
where fg = foreground and bg = background.
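The subtraction above can be sketched in a few lines of Python. This is only an illustration: the function name `background_corrected`, the `floor` parameter (which keeps the result positive so a later logarithm stays defined), and the intensity values are all assumptions, not part of the lecture.

```python
def background_corrected(foreground, background, floor=1.0):
    """Return foreground minus background for one spot and one dye.

    The floor is an assumption of this sketch: it clamps the result to a
    small positive value when the background exceeds the foreground.
    """
    return max(foreground - background, floor)


# Made-up measurements for a single spot.
r_fg, r_bg = 5200.0, 400.0   # red channel: foreground, background
g_fg, g_bg = 1500.0, 300.0   # green channel: foreground, background

red = background_corrected(r_fg, r_bg)      # Red intensity = Rfg - Rbg
green = background_corrected(g_fg, g_bg)    # Green intensity = Gfg - Gbg
print(red, green)
```

How to treat spots whose background exceeds the foreground is a design choice; clamping is just one simple option.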

cDNA GENE EXPRESSION DATA
[Figure: data matrix on p genes (rows) for n mRNA samples (columns: sample1, sample2, sample3, sample4, sample5, …), with examples of a down-regulated gene, an up-regulated gene, and a gene with unchanged expression]
Each entry is a gene expression level, e.g. the expression level of gene 5 in mRNA sample 4 = log2(Red intensity / Green intensity).
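As a rough illustration of the log ratio, the sketch below computes log2(Red / Green) for a tiny gene-by-sample matrix and labels each entry. The two-fold threshold (|log2 ratio| ≥ 1), the function names, and all intensities are assumptions made for this example; the slides do not fix a cutoff.

```python
import math


def log_ratio(red, green):
    """Expression level as log2(Red intensity / Green intensity)."""
    return math.log2(red / green)


def call_expression(m, threshold=1.0):
    """Label a log ratio; the two-fold threshold is an illustrative assumption."""
    if m >= threshold:
        return "up-regulated"
    if m <= -threshold:
        return "down-regulated"
    return "unchanged"


# Made-up background-corrected intensities: rows are genes, columns are samples.
red = [[4800.0, 1000.0],
       [500.0, 1000.0]]
green = [[1200.0, 1000.0],
         [2000.0, 1000.0]]

expression = [[log_ratio(r, g) for r, g in zip(red_row, green_row)]
              for red_row, green_row in zip(red, green)]
calls = [[call_expression(m) for m in row] for row in expression]
print(expression)
print(calls)
```

A positive value means more red (experimental) signal than green (control) signal, a negative value the opposite, and a value near zero similar signal in both channels.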

[Workflow figure repeated, highlighting the step: Normalization]

NORMALIZATION
Why? To correct for systematic differences between samples on the same slide, or between slides, that do not represent true biological variation between samples. For example:
1. Dye activity
2. Dye quantity
3. Scanning parameters
4. Location on the array
5. Air bubbles

SELF-SELF HYBRIDIZATIONS
How do we know normalization is necessary? By examining self-self hybridizations: we label one sample, from a single tissue, with both dyes (Cy3 and Cy5). Since both channels carry the same material, any systematic difference between them reveals dye bias.
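One simple way to act on this observation is global median centering of the log ratios. The slides do not name a normalization method, so this choice is an assumption of the sketch, as are the function name and the made-up values. In a self-self hybridization the true log ratio of every spot is 0, so the median of the observed log ratios gives a crude estimate of the dye bias.

```python
from statistics import median


def median_center(log_ratios):
    """Subtract the median log ratio from every spot.

    Returns the centered values and the estimated offset (dye bias).
    """
    offset = median(log_ratios)
    return [m - offset for m in log_ratios], offset


# Made-up log2(Red / Green) values from a self-self hybridization.
observed = [0.4, 0.5, 0.6, 0.5, 0.3]

centered, dye_bias = median_center(observed)
print(dye_bias)
print(centered)
```

This removes only a constant offset. Biases that depend on spot intensity or on location on the array need more flexible corrections than this sketch provides.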

[Workflow figure repeated, highlighting the steps: Estimation, Testing, Clustering, Discrimination]

HOMOGENEITY AND SEPARATION PRINCIPLES
- Homogeneity: elements within a cluster are close to each other
- Separation: elements in different clusters are further apart from each other
Clustering is not an easy task! Given these points, a clustering algorithm might make two distinct clusters as follows:

BAD CLUSTERING
This clustering violates both the homogeneity and separation principles: points in separate clusters are close together, and points in the same cluster are far apart.

GOOD CLUSTERING
This clustering satisfies both the homogeneity and separation principles.
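The two principles can be checked numerically. The sketch below uses one strict reading of them, which is an assumption of this example rather than a definition from the slides: a clustering passes if the largest distance within any cluster is smaller than the smallest distance between points in different clusters. The points are made up.

```python
import math
from itertools import combinations


def max_within(clusters):
    """Largest distance between two points that share a cluster."""
    return max((math.dist(a, b)
                for c in clusters
                for a, b in combinations(c, 2)), default=0.0)


def min_between(clusters):
    """Smallest distance between two points in different clusters."""
    return min(math.dist(a, b)
               for c1, c2 in combinations(clusters, 2)
               for a in c1 for b in c2)


# Made-up points: the same six points grouped in two different ways.
good = [[(0, 0), (0, 1), (1, 0)], [(5, 5), (5, 6), (6, 5)]]
bad = [[(0, 0), (5, 5), (1, 0)], [(0, 1), (5, 6), (6, 5)]]

for name, clustering in (("good", good), ("bad", bad)):
    ok = max_within(clustering) < min_between(clustering)
    print(name, "satisfies both principles:", ok)
```

Real expression data rarely separate this cleanly, which is why clustering is usually framed as an optimization problem instead of a pass/fail test.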

CLUSTERING TECHNIQUES
- Agglomerative: start with every element in its own cluster, and iteratively join clusters together
- Divisive: start with one cluster and iteratively divide it into smaller clusters
- Hierarchical: organize elements into a tree in which leaves represent genes and the length of the paths between leaves represents the distances between genes. Similar genes lie within the same subtrees

HIERARCHICAL CLUSTERING

HIERARCHICAL CLUSTERING ALGORITHM
The algorithm takes an n x n matrix d of pairwise distances between points as input.

HierarchicalClustering(d, n)
  Form n clusters, each with one element
  Construct a graph T by assigning one vertex to each cluster
  while there is more than one cluster
    Find the two closest clusters C1 and C2
    Merge C1 and C2 into a new cluster C with |C1| + |C2| elements
    Compute the distance from C to all other clusters
    Add a new vertex C to T and connect it to vertices C1 and C2
    Remove the rows and columns of d corresponding to C1 and C2
    Add a row and column to d corresponding to the new cluster C
  return T
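The pseudocode above can be turned into a short Python sketch. Two choices here are assumptions, because the slide leaves them open: the distance between clusters is taken as the minimum pairwise distance between their members (single linkage), and the tree T is returned as nested tuples instead of an explicit graph. The distance matrix is made up.

```python
def hierarchical_clustering(d):
    """Agglomerative clustering on an n x n distance matrix d.

    Returns the tree as nested tuples of point indices.
    Cluster distance is single linkage (an assumption of this sketch).
    """
    n = len(d)
    # Active clusters: id -> (subtree, list of member indices).
    clusters = {i: (i, [i]) for i in range(n)}
    # Distances between active clusters, keyed by (smaller id, larger id).
    dist = {(i, j): d[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)            # the two closest clusters
        tree_a, members_a = clusters.pop(a)
        tree_b, members_b = clusters.pop(b)
        # Drop every distance that involves the two merged clusters.
        dist = {pair: v for pair, v in dist.items()
                if a not in pair and b not in pair}
        members = members_a + members_b
        # Distance from the new cluster to every remaining cluster.
        for c, (_, members_c) in clusters.items():
            dist[(c, next_id)] = min(d[x][y] for x in members for y in members_c)
        clusters[next_id] = ((tree_a, tree_b), members)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Made-up example: four points on a line at positions 0, 1, 5 and 7.
d = [[0, 1, 5, 7],
     [1, 0, 4, 6],
     [5, 4, 0, 2],
     [7, 6, 2, 0]]
print(hierarchical_clustering(d))  # → ((0, 1), (2, 3))
```

Other linkage rules, such as the average pairwise distance, change only the line that computes the distance to the new cluster.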

K-MEANS CLUSTERING PROBLEM: FORMULATION
Input: a set V consisting of n points, and a parameter k
Output: a set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V, X) over all possible choices of X
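The slides do not spell out the squared error distortion. The sketch below assumes the usual definition: the mean, over all points in V, of the squared Euclidean distance from each point to its nearest center in X. The function name and the points are illustrative.

```python
import math


def squared_error_distortion(points, centers):
    """Mean squared distance from each point to its nearest center.

    This definition is an assumption of the sketch; the slides only
    name the quantity d(V, X).
    """
    total = 0.0
    for v in points:
        nearest = min(math.dist(v, x) for x in centers)
        total += nearest ** 2
    return total / len(points)


# Made-up points forming two tight groups.
V = [(0, 0), (0, 2), (10, 0), (10, 2)]

two_centers = squared_error_distortion(V, [(0, 1), (10, 1)])
one_center = squared_error_distortion(V, [(5, 1)])
print(two_centers, one_center)
```

Placing one center in each group gives a much smaller distortion than a single center in the middle, which is the quantity the k-means problem asks to minimize.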


1-MEANS CLUSTERING PROBLEM: AN EASY CASE
Input: a set V consisting of n points
Output: a single point x (cluster center) that minimizes the squared error distortion d(V, x) over all possible choices of x
The 1-means clustering problem is easy. However, the problem becomes very difficult (NP-hard) for more than one center. An efficient heuristic method for k-means clustering is the Lloyd algorithm.
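A short derivation shows why the single-center case is easy. It assumes that the squared error distortion is the mean squared Euclidean distance to the center, a definition the slides do not state explicitly. Setting the gradient with respect to x to zero gives the center of gravity of the points:

```latex
\begin{align*}
d(V, x) &= \frac{1}{n} \sum_{i=1}^{n} \lVert v_i - x \rVert^2 \\
\nabla_x \, d(V, x) &= -\frac{2}{n} \sum_{i=1}^{n} (v_i - x) = 0 \\
x &= \frac{1}{n} \sum_{i=1}^{n} v_i
\end{align*}
```

Because d(V, x) is a convex quadratic in x, this stationary point is the unique minimum. No search over candidate centers is needed when k = 1.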

K-MEANS CLUSTERING: LLOYD ALGORITHM
[Figure: cluster centers x1, x2, x3]

Lloyd Algorithm
  Arbitrarily assign the k cluster centers
  while the cluster centers keep changing
    Assign each data point to the cluster Ci corresponding to the closest cluster representative (center) (1 ≤ i ≤ k)
    After the assignment of all data points, compute new cluster representatives according to the center of gravity of each cluster; that is, the new cluster representative is ∑v / |C| over all v in C, for every cluster C
*This may lead to merely a locally optimal clustering.
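A minimal Python sketch of the Lloyd algorithm follows. Several details are assumptions of this example: the initial centers are passed in by the caller, an iteration cap guards against very long runs, and a cluster that ends up empty keeps its previous center. The points are made up.

```python
import math


def lloyd(points, centers, max_iter=100):
    """Alternate assignment and center-of-gravity updates until stable.

    Returns the final centers and the clusters from the last assignment.
    """
    centers = [tuple(c) for c in centers]
    clusters = []
    for _ in range(max_iter):
        # Assign each point to the cluster of its closest center.
        clusters = [[] for _ in centers]
        for v in points:
            i = min(range(len(centers)), key=lambda j: math.dist(v, centers[j]))
            clusters[i].append(v)
        # Recompute each center as the center of gravity of its cluster.
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                dim = len(members[0])
                new_centers.append(tuple(sum(m[k] for m in members) / len(members)
                                         for k in range(dim)))
            else:
                new_centers.append(old)  # assumption: empty cluster keeps its center
        if new_centers == centers:       # centers stopped changing
            break
        centers = new_centers
    return centers, clusters


# Made-up points and initial centers.
V = [(0, 0), (0, 2), (10, 0), (10, 2)]
centers, clusters = lloyd(V, [(0, 0), (10, 0)])
print(centers)   # → [(0.0, 1.0), (10.0, 1.0)]
print(clusters)  # → [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
```

The result depends on the initial centers, so running the sketch from several starting points and keeping the clustering with the lowest distortion is a common precaution against poor local optima.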

THANK YOU