Gene expression and data analysis. RNA Detection by Northern Blotting.

Gene expression and data analysis

RNA Detection by Northern Blotting

DNA Microarray
- DNA microarray is a new technology to measure the level of the RNA gene products of a living cell.
- A microarray chip is a rectangular chip on which is imposed a grid of DNA spots. These spots form a two-dimensional array.
- Each spot in the array contains millions of copies of some DNA strand, bonded to the chip.
- Chips are made tiny so that only a small amount of RNA is needed from experimental cells.

DNA Microarray
- Many applications in both basic and clinical research
  - determining the role a gene plays in a pathway, disease, diagnostics and pharmacology, …
- There are three main platforms for performing microarray analyses.
  - cDNA arrays (generic, multiple manufacturers)
  - Oligonucleotide arrays (genechips) (Affymetrix)
  - BeadArray (BeadChip) (Illumina)
- cDNA membranes (radioactive detection)

cDNA Microarray
- Spot cloned cDNAs onto a glass/nylon microscope slide
  - usually PCR amplified segments of plasmids
- Complementary hybridization
  - CTAGCAGG: actual gene
  - GATCGTCC: cDNA (reverse transcriptase)
  - CUAGCAGG: mRNA
- Label 2 mRNA samples with 2 different colors of fluorescent dye: control vs. experimental
- Mix two labeled mRNAs and hybridize to the chip
- Make two scans, one for each color
- Combine the images to calculate ratios of amounts of each mRNA that bind to each spot

Spotted Microarray Process
[Figure: control (CTRL) and test (TEST) samples]

cDNA Array Experiment Movie
http://www.bio.davidson.edu/courses/genomics/chip/chip.html

Affymetrix
- Uses 25 base oligos synthesized in place on a chip (20 pairs of oligos for each gene)
- cRNA labeled and scanned in a single “color”
  - one sample per chip
- Can have as many as 760,000 probes on a chip
- Arrays get smaller every year (more genes)
- Chips are expensive
- Proprietary system: “black box” software, can only use their chips

Affymetrix GeneChip® Probe Arrays
[Figure: GeneChip probe array (label: 1.28 cm) with probe cells (label: 24~50 µm); single-stranded, fluorescently labeled cRNA target bound to an oligonucleotide probe in a hybridized probe cell; image of a hybridized probe array]
- Each probe cell or feature contains millions of copies of a specific oligonucleotide probe

Affymetrix Genome Arrays

[Figure: cDNA probes and labeled cRNA targets (B); non-specific binding vs. specific binding; post-hybridization washes; labels S and FL]

[Figure: labeled targets (B) bound by streptavidin (S)]

Microarray Data Analysis
- Data processing and visualization
- Supervised learning
- Feature selection
- Machine learning approaches
- Unsupervised learning
- Clustering and pattern detection
- Infer gene interactions in pathways and networks
- Prediction of gene regulatory regions based on co-regulated genes
- Linkage between gene expression data and gene sequence/function databases
- …

Microarrays: An Example
- Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al, Science, v.286, 1999
- 72 examples (38 train, 34 test), about 7,000 probes
- well-studied (CAMDA-2000), good test example
- ALL and AML: visually similar, but genetically very different

Normalization
- Need to scale the red sample so that the overall intensities for each chip are equivalent
[Figure: two plots; labels: control, Sample 1, Sample 2]
- What can we tell from the two plots?

Normalization
- To ensure the data are comparable, normalization attempts to correct for the following variables:
  - Number of cells in the sample
  - Total RNA isolation efficiency
  - Signal measurement sensitivity
  - …
- Can use simple/complicated math
  - Normalization by global scaling (bring each image to the same average brightness)
  - Normalization by sectors
  - Normalization to housekeeping genes
  - …
- Active research area
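Normalization by global scaling can be sketched in a few lines. This is a minimal illustration, not the method used by any particular platform: the function name `global_scale`, the choice of the mean of the chip means as the common target, and the intensity values are all assumptions made for the example.

```python
from statistics import mean

def global_scale(chips, target=None):
    """Rescale every chip so that its mean intensity equals a common target."""
    chip_means = {name: mean(values) for name, values in chips.items()}
    if target is None:
        # Assumption: use the average of the chip means as the common brightness.
        target = mean(list(chip_means.values()))
    return {
        name: [v * target / chip_means[name] for v in values]
        for name, values in chips.items()
    }

# Hypothetical intensities: the sample chip is twice as bright overall.
chips = {
    "control": [100.0, 200.0, 300.0],
    "sample": [200.0, 400.0, 600.0],
}
scaled = global_scale(chips)
for name, values in scaled.items():
    print(name, values, "mean =", mean(values))
```

After scaling, both chips have the same average brightness, so a remaining difference at a single spot is more likely to reflect expression than overall signal strength.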

AML vs ALL

Feature selection
Probe          AML1     AML2     AML3     ALL1     ALL2     ALL3
D21869_s_at    170.7    55.0     43.7     5.5      807.9    1283.5
D25233cds_at   60531.0629.2441.795.3205.6
D25543_at      2148.7   2303.0   1915.5   49.2     96.3     89.8
L03294_g_at    241.8    721.5    77.2     66.1     107.3    132.5
J03960_at      774.53439.8614.355614.412.9
M81855_at      10871283.71372.114694611.73211.8
L14936_at      212.6    2848.5   236.2    260.5    2650.9   2192.2
L19998_at      3673.2661.7629.4151193.9
L19998_g_at    65.2     56.9     29.6     434.0    719.4    565.2
AB017912_at    1813.7   9520.6   2404.3   3853.1   6039.4   4245.7
AB017912_g_at  385.4    2396.8   363.7    419.3    6191.9   5617.6
U86635_g_at    83.3     470.9    52.3     3272.5   3379.6   5174.6
…

Probe          AML1     AML2     AML3     ALL1     ALL2     ALL3     p-value
D21869_s_at    170.7    55.0     43.7     5.5      807.9    1283.5   0.243
D25233cds_at   60531.0629.2441.795.3205.6                            0.487
D25543_at      2148.7   2303.0   1915.5   49.2     96.3     89.8     0.0026
L03294_g_at    241.8    721.5    77.2     66.1     107.3    132.5    0.332
J03960_at      774.53439.8614.355614.412.9                           0.260
M81855_at      10871283.71372.114694611.73211.8                      0.178
L14936_at      212.6    2848.5   236.2    260.5    2650.9   2192.2   0.626
L19998_at      3673.2661.7629.4151193.9                              0.941
L19998_g_at    65.2     56.9     29.6     434.0    719.4    565.2    0.022
AB017912_at    1813.7   9520.6   2404.3   3853.1   6039.4   4245.7   0.963
AB017912_g_at  385.4    2396.8   363.7    419.3    6191.9   5617.6   0.236
U86635_g_at    83.3     470.9    52.3     3272.5   3379.6   5174.6   0.022
…

Hypothesis Testing

- A null hypothesis is a hypothesis set up to be nullified in order to support an alternative hypothesis.
- Hypothesis testing tests the viability of the null hypothesis for a set of experimental data.
- Example: test whether the time to respond to a tone is affected by the consumption of alcohol
  - Hypothesis: µ1 - µ2 = 0
  - µ1 is the mean time to respond after consuming alcohol
  - µ2 is the mean time to respond otherwise

Z-test
- Theorem: If $x_i$, $i = 1, \dots, n$, are independent and each has a normal distribution with mean $\mu$ and variance $\sigma^2$, then $U = \sum_i a_i x_i$ has a normal distribution with mean $E(U) = \mu \sum_i a_i$ and variance $D(U) = \sigma^2 \sum_i a_i^2$.
- In particular, $\sum_i x_i / n \sim N(\mu, \sigma^2 / n)$.
- Z test: H: $\mu = \mu_0$ ($\mu_0$ and $\sigma_0$ are known, assume $\sigma = \sigma_0$)
- What would one conclude about the null hypothesis that a sample of N = 46 with a mean of 104 could reasonably have been drawn from a population with the parameters µ = 100 and σ = 8? Use $z = (\bar{x} - \mu_0) / (\sigma_0 / \sqrt{N})$.
- Note: z follows a normal distribution N(0, 1).
- Here $z = (104 - 100) / (8 / \sqrt{46}) \approx 3.39$, which lies far in the tail of N(0, 1). Reject the null hypothesis.
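The worked example on this slide (N = 46, sample mean 104, µ = 100, σ = 8) can be checked numerically. The sketch below uses only the Python standard library; the function name `z_test` and the choice of a two-sided p-value are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma0, n):
    """Return the z statistic and the two-sided p-value under N(0, 1)."""
    z = (sample_mean - mu0) / (sigma0 / math.sqrt(n))
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p

z, p_value = z_test(sample_mean=104, mu0=100, sigma0=8, n=46)
print(f"z = {z:.3f}, two-sided p = {p_value:.5f}")
```

A z of about 3.39 gives a p-value well below the usual 0.05 threshold, which is why the null hypothesis is rejected.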

Z-test
- Z test: H: $\mu = \mu_0$ ($\mu_0$ and $\sigma_0$ are known, assume $\sigma = \sigma_0$)
- But, in practice $\sigma_0$ is often unknown.

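T-test
- $t = (\bar{x} - \bar{y}) / S_{\bar{x}-\bar{y}}$, where $S_{\bar{x}-\bar{y}}$ is the standard error of the difference
- Assuming $\sigma_1$ and $\sigma_2$ are different: $S_{\bar{x}-\bar{y}} = \sqrt{s_1^2 / n_1 + s_2^2 / n_2}$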

T-test
William Sealy Gosset (1876-1937), Guinness Brewing Company

P-value
Does a particular gene have the same expression level in ALL and AML?
Probe          AML1     AML2     AML3     ALL1     ALL2     ALL3     p-value
D25543_at      2148.7   2303.0   1915.5   49.2     96.3     89.8     0.0026
L03294_g_at    241.8    721.5    77.2     66.1     107.3    132.5    0.332
…
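To make the p-value column concrete, the sketch below runs a two-sample t-test on the D25543_at row. It assumes the unequal-variance (Welch) form of the test and a two-sided p-value; the slides do not state which variant produced the listed numbers. The function names are hypothetical, and the p-value is obtained by numerically integrating the t density with Simpson's rule so that only the standard library is needed.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for a two-sample t-test with unequal variances."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def t_density(x, df):
    """Probability density of Student's t distribution."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=20000):
    """Two-sided p-value via Simpson's rule on the density over [0, |t|]."""
    upper = abs(t)
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 1.0 - 2.0 * (total * h / 3.0)

aml = [2148.7, 2303.0, 1915.5]   # D25543_at, AML1..AML3
all_ = [49.2, 96.3, 89.8]        # D25543_at, ALL1..ALL3
t, df = welch_t(aml, all_)
p_value = two_sided_p(t, df)
print(f"t = {t:.2f}, df = {df:.2f}, p = {p_value:.4f}")
```

With only three samples per class, the degrees of freedom are close to 2, so even a very large t statistic (about 18 here) gives a p-value in the 10^-3 range and not something astronomically small.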

Data processing
- Feature selection
  - T-test
  - Based on the fold change
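Fold-change-based selection can be sketched as ranking probes by the absolute log2 ratio of the class means. The three rows below are taken from the expression table; the use of log2, the direction of the ratio (ALL over AML), and the function name `log2_fold_change` are assumptions made for illustration.

```python
import math
from statistics import mean

# (AML1..AML3, ALL1..ALL3) for three probes from the table.
expression = {
    "D25543_at": ([2148.7, 2303.0, 1915.5], [49.2, 96.3, 89.8]),
    "L03294_g_at": ([241.8, 721.5, 77.2], [66.1, 107.3, 132.5]),
    "U86635_g_at": ([83.3, 470.9, 52.3], [3272.5, 3379.6, 5174.6]),
}

def log2_fold_change(aml, all_):
    """log2 of mean(ALL) / mean(AML); positive means higher in ALL."""
    return math.log2(mean(all_) / mean(aml))

fold_changes = {probe: log2_fold_change(*groups) for probe, groups in expression.items()}
ranking = sorted(fold_changes, key=lambda probe: abs(fold_changes[probe]), reverse=True)
for probe in ranking:
    print(f"{probe}: log2 fold change = {fold_changes[probe]:+.2f}")
```

Fold change ignores within-class variability, which is one reason it is often combined with a t-test and not used alone.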

Nearest Neighbor Classification
[Figure: scatter plots of Feature 1 vs. Feature 2; M = AML, L = ALL, plus an unlabeled test sample]
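Nearest neighbor classification assigns a test sample the majority label among its closest training samples. In the sketch below, the two features are the D25543_at and U86635_g_at rows of the expression table; the test samples, the value k = 3, and the function name `knn_predict` are hypothetical choices for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training samples closest to `query`."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# (Feature 1 = D25543_at, Feature 2 = U86635_g_at), label
train = [
    ((2148.7, 83.3), "AML"),
    ((2303.0, 470.9), "AML"),
    ((1915.5, 52.3), "AML"),
    ((49.2, 3272.5), "ALL"),
    ((96.3, 3379.6), "ALL"),
    ((89.8, 5174.6), "ALL"),
]

# Two made-up test samples, one near each group.
prediction_a = knn_predict(train, (150.0, 3000.0))
prediction_b = knn_predict(train, (2000.0, 100.0))
print(prediction_a, prediction_b)
```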

Distance Issues
- Euclidean distance
- Pearson distance
[Figure: expression profiles of genes g1, g2, g3, g4]
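The two measures can disagree strongly. The sketch below assumes the common definition of Pearson distance as 1 minus the Pearson correlation coefficient (the slide does not give a formula) and uses made-up profiles: g1 and g2 have the same shape at different scales, and g3 is the mirror image of g1.

```python
import math

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - r, where r is the Pearson correlation coefficient (assumed definition)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

# Hypothetical expression profiles over four conditions.
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [10.0, 20.0, 30.0, 40.0]   # same shape as g1, ten times the magnitude
g3 = [4.0, 3.0, 2.0, 1.0]       # opposite trend to g1

print("g1 vs g2:", euclidean_distance(g1, g2), pearson_distance(g1, g2))
print("g1 vs g3:", euclidean_distance(g1, g3), pearson_distance(g1, g3))
```

By Euclidean distance g1 is much closer to g3 than to g2, while by Pearson distance g1 and g2 are essentially identical and g1 and g3 are maximally distant. Which behavior is wanted depends on whether magnitude or shape of the profile matters.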

Cross-validation
http://en.wikipedia.org/wiki/Cross-validation_(statistics)
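As a rough illustration of k-fold cross-validation, the sketch below only generates the train/test index splits; a classifier would be trained on each `train` list and scored on the matching `test` list. The function name `k_fold_splits`, the interleaved fold assignment, and the choice of 6 folds for 72 samples are assumptions, not something taken from the slides.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]  # interleaved assignment to folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

splits = list(k_fold_splits(n_samples=72, k=6))
for train, test in splits:
    print(len(train), "training samples,", len(test), "held-out samples")
```

Every sample is held out exactly once, so the averaged test accuracy uses all the data without ever scoring a sample the model was trained on.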

Microarray Data Analysis
- Data processing and visualization
- Supervised learning
- Feature selection
- Machine learning approaches
- Unsupervised learning
- Clustering and pattern detection
- Prediction of gene regulatory regions based on co-regulated genes
- Linkage between gene expression data and gene sequence/function databases
- …

Genetic Algorithm for Feature Selection
[Figure: sample (clear cell RCC, etc.), raw measurement data, features f1 f2 f3 f4 f5; feature vector = pattern]

Why Genetic Algorithm?
- Assuming 2,000 relevant genes, 20 important discriminator genes (features).
- Cost of an exhaustive search for the optimal set of features?
  - $C(n, k) = n! / (k!\,(n-k)!)$
  - $C(2000, 20) = 2000! / (20!\,1980!) \ge 100^{20} = 10^{40}$
- If it takes one femtosecond ($10^{-15}$ second) to evaluate a set of features, it takes more than $3 \times 10^{17}$ years to find the optimal solution on the computer.
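The arithmetic on this slide is easy to verify. The sketch below checks the lower bound on C(2000, 20) and converts 10^40 evaluations at one femtosecond each into years; the year length of 365.25 days is an assumption.

```python
import math

n_subsets = math.comb(2000, 20)          # exact number of 20-gene subsets
lower_bound = 100 ** 20                  # (n / k)^k = 10^40

seconds = lower_bound * 1e-15            # one femtosecond per evaluation
years = seconds / (365.25 * 24 * 3600)

print(f"C(2000, 20) has {len(str(n_subsets))} digits")
print(f"lower-bound search time: {years:.2e} years")
```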

Evolutionary Methods
- Based on the mechanics of Darwinian evolution
  - The evolution of a solution is loosely based on biological evolution
- Population of competing candidate solutions
  - Chromosomes (a set of features)
- Genetic operators (mutation, recombination, etc.)
  - generate new candidate solutions
- Selection pressure directs the search
  - those that do well survive (selection) to form the basis for the next set of solutions.

A Simple Evolutionary Algorithm
[Figure: loop of Selection, Genetic Operators, and Evaluation]

Genetic Operators
- Crossover: parents 10 30 50 70 and 20 40 60 80, split at a randomly selected crossover point: 10 30 | 50 70 and 20 40 | 60 80
- Mutation: 10 30 62 80, changed at a randomly selected mutation site
- Recombination is intended to produce promising individuals.
- Mutation maintains population diversity, preventing premature convergence.
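The two operators can be sketched as small functions on list-encoded chromosomes. The example reuses the numbers from this slide; one-point crossover, the function names, and the range of random replacement values are assumptions made for illustration.

```python
import random

def crossover(parent_a, parent_b, point=None, rng=random):
    """One-point crossover: swap the tails of two parents after `point`."""
    if point is None:
        point = rng.randrange(1, len(parent_a))   # randomly selected crossover point
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def mutate(chromosome, site=None, new_value=None, rng=random):
    """Replace the gene at one (randomly selected) mutation site."""
    if site is None:
        site = rng.randrange(len(chromosome))
    if new_value is None:
        new_value = rng.randint(0, 100)           # hypothetical range of gene values
    mutated = list(chromosome)
    mutated[site] = new_value
    return mutated

# Fixed point and site so the result is reproducible.
child_a, child_b = crossover([10, 30, 50, 70], [20, 40, 60, 80], point=2)
mutant = mutate(child_a, site=2, new_value=62)
print(child_a, child_b, mutant)
```

Passing `point`, `site`, and `new_value` explicitly keeps the example deterministic; leaving them out lets the random number generator choose, as a genetic algorithm normally would.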

Genetic Algorithm
[Figure: five-step loop over a population of chromosomes, each a set of genes such as g2 g1 g6 g3 g21. If the result is good enough, stop; if it is not good enough, produce a new population and repeat.]

GA Fitness
- At the core of any optimization approach is the function that measures the quality of a solution or optimization.
- Called:
  - Objective function
  - Fitness function
  - Error function
  - measure
  - etc.

Encoding
- Most difficult, and important, part of any GA
- Encode so that illegal solutions are not possible
- Encode to simplify the “evolutionary” processes, e.g. reduce the size of the search space
- Most GAs use a binary encoding of a solution, but other schemes are possible

Genetic Algorithm/K-Nearest Neighbor Algorithm
[Figure: Microarray Database, Feature Selection (GA), Classifier (kNN)]
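One way the GA and kNN pieces could fit together is sketched below: each chromosome is a list of distinct feature indices, and its fitness is the leave-one-out accuracy of a 1-nearest-neighbor classifier restricted to those features. This is a toy under stated assumptions, not the published GA/kNN procedure: the population size, mutation rate, survivor rule, recombination scheme, and the synthetic data are all hypothetical.

```python
import math
import random

def loocv_1nn_accuracy(samples, labels, features):
    """Leave-one-out accuracy of 1-NN using only the selected feature indices."""
    correct = 0
    for i, query in enumerate(samples):
        nearest = min(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: math.dist([query[f] for f in features],
                                    [samples[j][f] for f in features]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(samples)

def ga_select(samples, labels, subset_size, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    n_features = len(samples[0])
    fitness = lambda chrom: loocv_1nn_accuracy(samples, labels, chrom)
    # Encoding: lists of distinct indices, so illegal subsets cannot occur.
    population = [rng.sample(range(n_features), subset_size) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        if fitness(population[0]) == 1.0:          # good enough: stop
            break
        survivors = population[: pop_size // 2]    # selection
        children = []
        while len(survivors) + len(children) < pop_size:
            mom, dad = rng.sample(survivors, 2)
            pool = list(dict.fromkeys(mom + dad))  # recombination: union of parents
            child = rng.sample(pool, subset_size)
            if rng.random() < 0.3:                 # mutation: swap in an unused feature
                unused = [f for f in range(n_features) if f not in child]
                if unused:
                    child[rng.randrange(subset_size)] = rng.choice(unused)
            children.append(child)
        population = survivors + children
    best = max(population, key=fitness)
    return best, fitness(best)

# Synthetic data: feature 0 separates the classes, features 1-5 are noise.
data_rng = random.Random(1)
labels = ["AML"] * 4 + ["ALL"] * 4
samples = [
    [(10.0 if lab == "AML" else 0.0) + data_rng.random()]
    + [data_rng.random() * 10 for _ in range(5)]
    for lab in labels
]
best_features, accuracy = ga_select(samples, labels, subset_size=2)
print(best_features, accuracy)
```

Evaluating fitness by cross-validation inside the GA is expensive, so real applications cache fitness values and keep a separate test set to estimate the final accuracy.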

Microarray Data Analysis
- Data processing and visualization
- Supervised learning
- Machine learning approaches
- Unsupervised learning
- Clustering and pattern detection
- Prediction of gene regulatory regions based on co-regulated genes
- Linkage between gene expression data and gene sequence/function databases
- …

Unsupervised learning
- Supervised methods
  - Can only validate or reject hypotheses
  - Cannot lead to discovery of unexpected partitions
- Unsupervised learning
  - No prior knowledge is used
  - Explore structure of data on the basis of similarities

DEFINITION OF THE CLUSTERING PROBLEM

CLUSTER ANALYSIS YIELDS DENDROGRAM
[Figure: dendrogram; axis: T (resolution)]

BUT WHAT ABOUT THE OKAPI?

Agglomerative Hierarchical Clustering
[Figure: five points (1-5) and the resulting dendrogram; axis: distance between joined clusters]
- Initially: each point = cluster
- At each step, merge the pair of nearest clusters
- Need to define the distance between the new cluster and the other clusters.
  - Single Linkage: distance between closest pair.
  - Complete Linkage: distance between farthest pair.
  - Average Linkage (UPGMA): average distance between all pairs, or distance between cluster centers
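The merge loop and the three linkage rules can be sketched directly from these definitions. This is a deliberately naive version that recomputes every pairwise distance at each step; the function names and the five one-dimensional points are assumptions made for illustration, and average linkage here means the average over all pairs.

```python
import math
from itertools import combinations

def linkage_distance(cluster_a, cluster_b, method):
    pair_dists = [math.dist(p, q) for p in cluster_a for q in cluster_b]
    if method == "single":
        return min(pair_dists)                      # closest pair
    if method == "complete":
        return max(pair_dists)                      # farthest pair
    if method == "average":
        return sum(pair_dists) / len(pair_dists)    # average over all pairs
    raise ValueError(f"unknown linkage method: {method}")

def agglomerate(points, method="average"):
    """Return the list of merges as (cluster_a, cluster_b, distance)."""
    clusters = [[p] for p in points]                # initially: each point = cluster
    merges = []
    while len(clusters) > 1:
        # Merge the pair of nearest clusters under the chosen linkage.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        d = linkage_distance(clusters[i], clusters[j], method)
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Five hypothetical points on a line.
points = [(0.0,), (1.0,), (5.0,), (6.5,), (12.0,)]
heights = {m: [d for _, _, d in agglomerate(points, m)]
           for m in ("single", "complete", "average")}
for method, merge_heights in heights.items():
    print(method, merge_heights)
```

The first two merges are the same under all three rules, but the heights of the later merges differ, which is why the dendrogram depends on the distance update method.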

Hierarchical Clustering - Summary
- Results depend on distance update method
- Greedy iterative process
- NOT robust against noise
- No inherent measure to identify stable clusters
- Average Linkage (UPGMA): the most widely used clustering method in gene expression analysis

Cluster both genes and samples
- Samples should cluster together based on experimental design
- Often a way to catch labelling errors or heterogeneity in samples

Heat map
[Figure: heat map, breast cancer, Nature 2002]

Centroid methods – K-means
- Data points at $X_i$, $i = 1, \dots, N$
- Centroids at $Y_\nu$, $\nu = 1, \dots, K$
- Assign data point $i$ to centroid $\nu$: $S_i = \nu$
- Cost E: $E(S_1, S_2, \dots, S_N; Y_1, \dots, Y_K) = \sum_{i=1}^{N} (X_i - Y_{S_i})^2$
- Minimize E over $S_i$, $Y_\nu$

K-means “Guess” K=3

K-means, Iteration = 0: Start with random positions of centroids.

K-means, Iteration = 1: Assign each data point to the closest centroid.

K-means, Iteration = 2: Move centroids to the center of their assigned points.

K-means, Iteration = 3: Iterate until the cost is minimal.
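The assign-then-move loop shown across these slides can be sketched as follows. To keep the run reproducible, the starting centroids are passed in explicitly and are not drawn at random, and the example uses K = 2 on six made-up points; both choices, and the function name `kmeans`, are assumptions for illustration.

```python
import math

def nearest_centroid(point, centroids):
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

def kmeans(points, initial_centroids, max_iter=100):
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        # Assign each data point to the closest centroid.
        assignment = [nearest_centroid(p, centroids) for p in points]
        # Move each centroid to the center of its assigned points.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])   # keep an empty cluster's centroid
        if new_centroids == centroids:               # nothing moved: converged
            break
        centroids = new_centroids
    assignment = [nearest_centroid(p, centroids) for p in points]
    # Cost: sum of squared distances from each point to its assigned centroid.
    cost = sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))
    return assignment, centroids, cost

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centroids, cost = kmeans(points, initial_centroids=[(0.0, 0.0), (1.0, 0.0)])
print(assignment, centroids, round(cost, 4))
```

Starting both centroids inside the same group still recovers the two groups here, but in general a poor start can converge to a worse local minimum of the cost.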

K-means - Summary
- Fast algorithm: compute distances from data points to centroids
- Result depends on initial centroids’ position
- Must preset K
- Fails for “non-spherical” distributions

Issues in Cluster Analysis
- A lot of clustering algorithms
- A lot of distance/similarity metrics
- Which clustering algorithm runs faster and uses less memory?
- How many clusters after all?
- Are the clusters stable?
- Are the clusters meaningful?

Which Clustering Method Should I Use?
- What is the biological question?
- Do I have a preconceived notion of how many clusters there should be?
- How strict do I want to be? Split or join?
- Can a gene be in multiple clusters?
- Hard or soft boundaries between clusters
