BIOINFORMATICS Dr. Aladdin Hamwieh, Khalid Al-shamaa, Abdulqader Jighly 2010-2011 Lecture 8: Analyzing Microarray Data. Aleppo University, Faculty of technical.


1 BIOINFORMATICS Dr. Aladdin Hamwieh, Khalid Al-shamaa, Abdulqader Jighly 2010-2011 Lecture 8: Analyzing Microarray Data. Aleppo University, Faculty of Technical Engineering, Department of Biotechnology

2 A microarray can monitor many genes at once. A DNA microarray is an inert, solid, flat and transparent surface (e.g. a microscope slide) onto which 20,000 to 60,000 short DNA probes of specified sequences are tethered in an ordered layout. Each probe corresponds to a particular short section of a gene, so a single gene is covered by several probes that span different parts of the gene sequence.

3 REPOSITORIES OF MICROARRAY STUDIES Owing to the widespread use of microarrays, data repositories have flourished worldwide. Among the largest gene expression databases are: 1. The Gene Expression Omnibus (GEO), hosted by the National Center for Biotechnology Information (NCBI) 2. The Stanford Microarray Database (SMD). For plants: the Plant Expression Database (PLEXdb).

4 DNA microarrays measure RNA abundance with either 1 channel (one color) or 2 channels (two colors). The Affymetrix GeneChip has 1 channel and uses either the red fluorescent dye Cy5 or the green fluorescent dye Cy3. Stanford microarrays have 2 channels: by competitive hybridization they measure the relative expression under a given condition (labeled with the red fluorescent dye Cy5) compared to its control (labeled with the green fluorescent dye Cy3).

5 [Figure: analysis workflow] Biological question (differentially expressed genes, sample class prediction, etc.) -> Experimental design -> Microarray experiment -> Image analysis of the 16-bit TIFF files, giving R, G as (Rfg, Rbg), (Gfg, Gbg) -> Normalization -> Estimation, Testing, Clustering, Discrimination -> Biological verification and interpretation.

6 [Figure: the analysis workflow diagram of slide 5, repeated]

7 VIDEO

8 [Figure: the analysis workflow diagram of slide 5, repeated]

9 MICROARRAY EXPERIMENT 1. Isolate mRNA 2. Make a labelled cDNA library 3. Apply the labelled cDNA to the slide 4. Scan the slide 5. Clean up (purify) the image 6. Extract the data 7. Analyse the data

10 RESULTS The colors denote the degree of expression in the experimental versus the control cells. Legend: gene not expressed in control or in experimental cells; expressed only in control cells; mostly in control cells; only in experimental cells; mostly in experimental cells; the same in both cells.

11

12 Let us turn to the analysis and its mathematical problems. We now have many images that contain a huge amount of information, so we have to: 1. clean up (purify) the image; 2. extract our data.

13 [Figure: the analysis workflow diagram of slide 5, repeated]

14 IMAGE ANALYSIS The raw data from a cDNA microarray experiment consist of pairs of image files (16-bit TIFFs), one for each of the dyes. Image analysis is required to extract measures of the red and green fluorescence intensities for each spot on the array.

15 STEPS IN IMAGE ANALYSIS 1. Addressing: estimate the location of the spot centers. 2. Segmentation: classify pixels as foreground (signal) or background. 3. Information extraction: for each spot on the array and each dye, obtain foreground intensities, background intensities and quality measures.

16 WHY DO WE CALCULATE THE BACKGROUND INTENSITIES? Motivation behind background adjustment: a spot's measured fluorescence intensity includes a contribution that is not specifically due to the hybridization of the target to the probe but to something else, e.g. the chemical treatment of the slide or autofluorescence. We want to estimate and remove this unwanted contribution.

17 QUANTIFICATION OF EXPRESSION For each spot on the slide we calculate Red intensity = Rfg - Rbg and Green intensity = Gfg - Gbg, where fg = foreground and bg = background.
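The subtraction on slide 17 can be sketched in a few lines of Python. This is only an illustration: the function name background_corrected and the intensity values are made up for the example and do not come from the lecture.

```python
def background_corrected(foreground, background):
    """Return foreground minus background for one spot and one channel."""
    return foreground - background


# Hypothetical raw measurements for a single spot (illustrative values only).
spot = {"Rfg": 5200.0, "Rbg": 400.0, "Gfg": 2900.0, "Gbg": 500.0}

red = background_corrected(spot["Rfg"], spot["Rbg"])      # Red intensity
green = background_corrected(spot["Gfg"], spot["Gbg"])    # Green intensity
print(red, green)
```

In real data a spot's background can exceed its foreground, giving a non-positive intensity; how to handle that case is not covered by the slide.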

18 cDNA GENE EXPRESSION DATA Data on p genes (rows) for n mRNA samples (columns). Gene expression level of gene 5 in mRNA sample 4 = log2(Red intensity / Green intensity).
Gene   sample1  sample2  sample3  sample4  sample5  ...
1         0.46     0.30     0.80     1.51     0.00  ...
2        -0.10     0.49     0.24     0.06     0.46  ...
3         0.15     0.74     0.04     0.10     0.20  ...
4        -0.45    -1.03    -0.79    -0.56    -0.32  ...
5        -0.06     1.06     1.35     1.09    -1.09  ...
A positive value indicates an up-regulated gene, a negative value a down-regulated gene, and a value near zero unchanged expression.
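As a rough sketch of how the values on slide 18 are obtained and read, the Python below computes log2(Red / Green) and labels a value as up-regulated, down-regulated or unchanged. The cutoff of 1.0 (a two-fold change) and the function names are assumptions made for this example; the slide does not give a threshold.

```python
import math


def log_ratio(red, green):
    """log2(Red / Green) for background-corrected intensities (both must be > 0)."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive after background correction")
    return math.log2(red / green)


def label(m, cutoff=1.0):
    """Label a log ratio; the cutoff (two-fold change) is an assumed value."""
    if m >= cutoff:
        return "up-regulated"
    if m <= -cutoff:
        return "down-regulated"
    return "unchanged"


# Rows are genes 1-5, columns are samples 1-5, copied from slide 18.
expression = [
    [0.46, 0.30, 0.80, 1.51, 0.00],
    [-0.10, 0.49, 0.24, 0.06, 0.46],
    [0.15, 0.74, 0.04, 0.10, 0.20],
    [-0.45, -1.03, -0.79, -0.56, -0.32],
    [-0.06, 1.06, 1.35, 1.09, -1.09],
]

gene5_sample4 = expression[4][3]  # zero-based indices for gene 5, sample 4
print(gene5_sample4, label(gene5_sample4))
print(log_ratio(4800.0, 2400.0))  # red is twice green
```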

19 [Figure: the analysis workflow diagram of slide 5, repeated]

20 NORMALIZATION Why? To correct for systematic differences between samples on the same slide, or between slides, that do not represent true biological variation between samples, for example: 1. Dye activity 2. Dye quantity 3. Scanning parameters 4. Location on the array 5. Air bubbles

21 SELF-SELF HYBRIDIZATIONS How do we know normalization is necessary? By examining self-self hybridizations: we label a single sample from the same tissue with both dyes, Cy3 and Cy5. Since both channels carry the same sample, any systematic difference between them reveals dye bias.
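The lecture does not name a normalization method. As one simple possibility, the sketch below estimates a constant dye bias as the median log ratio of a self-self hybridization (where every true log ratio should be 0) and subtracts it from the log ratios of an experiment. The function names and all numbers are hypothetical, and a single global constant is a deliberate simplification.

```python
from statistics import median


def estimate_dye_bias(self_self_log_ratios):
    """Median log2(R/G) of a self-self hybridization; ideally this is 0."""
    return median(self_self_log_ratios)


def normalize(log_ratios, bias):
    """Subtract a constant bias from every log ratio (global normalization)."""
    return [m - bias for m in log_ratios]


# Hypothetical self-self log ratios: the same sample labeled with both dyes.
self_self = [0.25, 0.5, 0.75, 0.5, 0.5]
bias = estimate_dye_bias(self_self)

# Hypothetical log ratios from an experimental slide.
experiment = [1.5, -0.5, 0.5]
normalized = normalize(experiment, bias)
print(bias, normalized)
```

A constant shift cannot correct effects that depend on spot intensity or on location on the array, which are also listed on slide 20.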

22 [Figure: the analysis workflow diagram of slide 5, repeated]

23 HOMOGENEITY AND SEPARATION PRINCIPLES Homogeneity: elements within a cluster are close to each other. Separation: elements in different clusters are further apart from each other. Clustering is not an easy task! Given these points, a clustering algorithm might make two distinct clusters as follows.

24 BAD CLUSTERING This clustering violates both the Homogeneity and the Separation principles: points in separate clusters are close together, and points in the same cluster are far apart.

25 GOOD CLUSTERING This clustering satisfies both the Homogeneity and the Separation principles.
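One way to make the two principles concrete is to compare the largest distance inside any cluster with the smallest distance between points of different clusters. The Python sketch below does this for two hand-made clusterings of the same four points; the point coordinates, the function names and the choice of these two particular measures are assumptions for illustration, not part of the lecture.

```python
import math
from itertools import combinations


def max_within(clusters):
    """Largest distance between two points of the same cluster (homogeneity)."""
    return max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )


def min_between(clusters):
    """Smallest distance between two points of different clusters (separation)."""
    return min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )


# The same four hypothetical points, grouped in two different ways.
good = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]
bad = [[(0, 0), (5, 0)], [(0, 1), (5, 1)]]

for name, clustering in (("good", good), ("bad", bad)):
    print(name, max_within(clustering), min_between(clustering))
```

For the good grouping the within-cluster distances are smaller than the between-cluster distances; for the bad grouping the relation is reversed.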

26 CLUSTERING TECHNIQUES Agglomerative: start with every element in its own cluster and iteratively join clusters together. Divisive: start with one cluster and iteratively divide it into smaller clusters. Hierarchical: organize elements into a tree; leaves represent genes, and the length of the paths between leaves represents the distances between genes. Similar genes lie within the same subtrees.

27 HIERARCHICAL CLUSTERING [Figure: nine points labeled 1 to 9 and the tree built over them]

28 HIERARCHICAL CLUSTERING ALGORITHM The algorithm takes an n x n distance matrix d of pairwise distances between points as input.
HierarchicalClustering(d, n)
  Form n clusters, each with one element
  Construct a graph T by assigning one vertex to each cluster
  while there is more than one cluster
    Find the two closest clusters C1 and C2
    Merge C1 and C2 into a new cluster C with |C1| + |C2| elements
    Compute the distance from C to all other clusters
    Add a new vertex C to T and connect it to vertices C1 and C2
    Remove the rows and columns of d corresponding to C1 and C2
    Add a row and column to d corresponding to the new cluster C
  return T
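The pseudocode on slide 28 can be turned into a small runnable Python sketch. Two choices here are assumptions, not taken from the slide: the distance between two clusters is the minimum pairwise distance between their members (single linkage), and instead of editing rows and columns of d, cluster distances are recomputed from the original matrix at every step, which is simpler but slower. The tree T is returned as nested tuples.

```python
def hierarchical_clustering(d, linkage=min):
    """Agglomerative clustering of n elements from an n x n distance matrix d.

    Returns the tree as nested tuples; leaves are element indices 0..n-1.
    """
    n = len(d)
    # Key: the subtree built so far (a leaf index or a nested tuple).
    # Value: the list of element indices contained in that cluster.
    clusters = {i: [i] for i in range(n)}
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        # Find the two closest clusters under the chosen linkage.
        for a in range(len(keys)):
            for b in range(a + 1, len(keys)):
                dist = linkage(
                    d[i][j] for i in clusters[keys[a]] for j in clusters[keys[b]]
                )
                if best is None or dist < best[0]:
                    best = (dist, keys[a], keys[b])
        _, c1, c2 = best
        # Merge C1 and C2; the new vertex is connected to both subtrees.
        clusters[(c1, c2)] = clusters.pop(c1) + clusters.pop(c2)
    (tree,) = clusters
    return tree


# Hypothetical example: four points on a line at positions 0, 1, 5 and 6.
d = [
    [0, 1, 5, 6],
    [1, 0, 4, 5],
    [5, 4, 0, 1],
    [6, 5, 1, 0],
]
tree = hierarchical_clustering(d)
print(tree)
```

In this example the two nearby pairs are merged first and then joined at the root.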

29 K-MEANS CLUSTERING PROBLEM: FORMULATION Input: a set, V, consisting of n points and a parameter k. Output: a set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V, X) over all possible choices of X.
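Slide 29 does not spell out the squared error distortion. The Python sketch below assumes the usual definition: the mean, over all points in V, of the squared Euclidean distance from the point to its closest center in X. The function names and the sample points are illustrative only.

```python
import math


def distance_to_centers(v, centers):
    """Euclidean distance from point v to the closest center in X."""
    return min(math.dist(v, x) for x in centers)


def squared_error_distortion(points, centers):
    """Mean squared distance from each point to its closest center."""
    return sum(distance_to_centers(v, centers) ** 2 for v in points) / len(points)


# Hypothetical data: two pairs of points and one candidate center per pair.
V = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
X = [(0.0, 1.0), (10.0, 1.0)]
print(squared_error_distortion(V, X))

# A single center in the middle gives a much larger distortion.
print(squared_error_distortion(V, [(5.0, 1.0)]))
```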

30 1-MEANS CLUSTERING PROBLEM: AN EASY CASE Input: a set, V, consisting of n points. Output: a single point x (cluster center) that minimizes the squared error distortion d(V, x) over all possible choices of x.

31 1-MEANS CLUSTERING PROBLEM: AN EASY CASE (continued) The 1-Means Clustering problem is easy. However, the problem becomes very difficult (NP-hard) for more than one center. An efficient heuristic method for k-means clustering is the Lloyd algorithm.
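Why the single-center case is easy can be seen from a short calculation. It assumes, as is usual but not stated on the slide, that the distortion is the mean squared Euclidean distance from the points v_1, ..., v_n of V to the center x.

```latex
\begin{align*}
d(V, x) &= \frac{1}{n} \sum_{i=1}^{n} \lVert v_i - x \rVert^2, \\
\nabla_x\, d(V, x) &= -\frac{2}{n} \sum_{i=1}^{n} (v_i - x) = 0
\quad\Longrightarrow\quad
x = \frac{1}{n} \sum_{i=1}^{n} v_i .
\end{align*}
```

Because d(V, x) is a convex quadratic in x, this stationary point is the unique minimum: under this assumption the optimal single center is the center of gravity of the points.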

32 K-MEANS CLUSTERING: LLOYD ALGORITHM [Figure: data points with cluster centers x1, x2, x3]

33 LLOYD ALGORITHM
  Arbitrarily assign the k cluster centers
  while the cluster centers keep changing
    Assign each data point to the cluster Ci corresponding to the closest cluster representative (center) (1 ≤ i ≤ k)
    After the assignment of all data points, compute new cluster representatives according to the center of gravity of each cluster; that is, the new cluster representative is (∑ v) / |C| over all v in C, for every cluster C
*This may lead to merely a locally optimal clustering.
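A minimal Python sketch of the steps on slide 33 follows. It departs from the slide in a few labeled ways: the initial centers are passed in explicitly rather than chosen arbitrarily (so the run is reproducible), an iteration cap max_iter is added as a safety net, and a cluster that ends up empty keeps its old center. The function names and the data are hypothetical.

```python
import math


def centroid(points):
    """Center of gravity: coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def lloyd(points, centers, max_iter=100):
    """Lloyd's heuristic for k-means, starting from the given initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest center.
        clusters = [[] for _ in centers]
        for v in points:
            i = min(range(len(centers)), key=lambda j: math.dist(v, centers[j]))
            clusters[i].append(v)
        # Update step: move each center to its cluster's center of gravity.
        # An empty cluster keeps its old center (an assumption of this sketch).
        new_centers = [
            centroid(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # the centers stopped changing
        centers = new_centers
    return centers, clusters


# Hypothetical data: two well-separated pairs of points, k = 2.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
final_centers, final_clusters = lloyd(points, [(0.0, 0.0), (10.0, 0.0)])
print(final_centers)
print(final_clusters)
```

With a different starting choice, such as one center between the two upper points and one between the two lower points, the loop stops at a worse clustering, which is the local-optimum caveat noted on the slide.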

34 THANK YOU

