Whole Genome Expression Analysis


1 Whole Genome Expression Analysis

2 In this presentation…
Part 1 – Gene Expression Microarray Data
Part 2 – Global Expression & Sequence Data Analysis
Part 3 – Proteomic Data Analysis

3 Part 1 Microarray Data

4 Examples of Problems
Gene sequence problems: given a DNA sequence, state which sections are coding or noncoding regions, which sections are promoters, etc.
Protein structure problems: given a DNA or amino acid sequence, state what structure the resulting protein takes.
Gene expression problems: given DNA/gene microarray expression data, infer either clinical or biological class labels, or the genetic machinery that gives rise to the expression data.
Protein expression problems: study the expression of proteins and their function.

5 Microarray Technology
Basic idea: the state of the cell is determined by its proteins. A gene codes for a protein, which is assembled via mRNA, so measuring the amount of a particular mRNA gives a measure of the amount of the corresponding protein; the abundance of mRNA copies is the expression level of a gene. Microarray technology allows us to measure the expression of thousands of genes at once: measure expression under different experimental conditions and ask what is different, and why.

6 Oligo vs. cDNA arrays (Lockhart and Winzeler, 2000)

7 Format
What is whole genome expression analysis?
Clustering algorithms:
- Hierarchical clustering
- K-means clustering
- Principal component analysis
- Self-organizing maps
Beyond clustering:
- Support vector machines
- Automatic discovery of regulatory patterns in promoter regions
- Bayesian network analysis

8 What is whole genome expression analysis?
Messenger RNA is only an intermediate of gene expression

9 What is whole genome expression analysis? (continued)
Why measure mRNA?

10 What is whole genome expression analysis? (continued)

11 Why Separate Feature Selection?
Most learning algorithms look for non-linear combinations of features, and can easily find many spurious combinations given a small number of records and a large number of genes. We therefore first reduce the number of genes by a linear method, e.g. T-values. Heuristic: select genes from each class. Then apply a favorite machine learning algorithm.

12 Feature selection approach
Rank genes by a measure and select the top ones.
T-test for mean difference: t = (μ1 − μ2) / sqrt(σ1²/n1 + σ2²/n2)
Signal-to-noise (S2N): (μ1 − μ2) / (σ1 + σ2)
Other: information-based measures, biological knowledge?
Almost any method works well with a good feature selection.
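The two ranking measures above can be sketched in a few lines. This is a minimal sketch assuming a samples × genes NumPy matrix and binary labels; the function name `rank_genes` is ours, not from the slides:

```python
import numpy as np

def rank_genes(X, y, method="s2n"):
    """Rank genes by a two-class separation score.

    X : (n_samples, n_genes) expression matrix
    y : binary labels (0/1)
    Returns gene indices sorted by |score|, descending.
    """
    a, b = X[y == 0], X[y == 1]
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    sd_a, sd_b = a.std(axis=0, ddof=1), b.std(axis=0, ddof=1)
    if method == "s2n":
        # Golub-style signal-to-noise: (mu1 - mu2) / (sd1 + sd2)
        score = (mu_a - mu_b) / (sd_a + sd_b)
    else:
        # Welch-style t-statistic for the mean difference
        score = (mu_a - mu_b) / np.sqrt(sd_a**2 / len(a) + sd_b**2 / len(b))
    return np.argsort(-np.abs(score))

# toy data: gene 0 separates the classes, genes 1-2 are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.array([0] * 10 + [1] * 10)
X[y == 1, 0] += 5.0            # strong class difference in gene 0
top = rank_genes(X, y)
print(top[0])                  # gene 0 should rank first
```

Selecting the "top T" genes is then just `top[:T]`.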

13 Heatmap Visualization of selected genes
[Heatmap: ALL vs. AML samples over the selected genes.] Heatmap visualization is done by normalizing each gene to mean 0, std 1 to get a picture like this. Good correlation overall; AML-related and ALL-related gene blocks are visible. Possible outlier: sample #12.
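The per-gene normalization described above is a plain z-score over each gene. A minimal sketch, assuming genes are columns of the matrix:

```python
import numpy as np

def zscore_genes(X):
    """Normalize each gene (column) to mean 0, std 1 for heatmap display."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0          # guard against constant genes
    return (X - mu) / sd

X = np.array([[1.0, 10.0],
              [3.0, 30.0],
              [5.0, 50.0]])
Z = zscore_genes(X)
print(np.allclose(Z.mean(axis=0), 0.0))   # True: each gene now has mean 0
```

The normalized matrix `Z` is what gets mapped to the heatmap color scale.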

14 Controlling False Positives
[Figure: expression of the CD37 antigen in classes 1 and 2.] Mean difference between classes: T-value = −3.25; significance: p = 0.0007.

15 Controlling False Positives with Randomization
[Figure: CD37 antigen expression after randomizing the class labels.] Randomization is less conservative and preserves the inner structure of the data. After randomizing the classes, T-value = −1.1.

16 Controlling False Positives with Randomization, II
Randomize the class labels 500 times and recompute the gene's T-value each time. The bottom 1% of the randomized T-values gives the cutoff, T = −2.08: select potentially interesting genes at the 1% level.
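The randomization procedure above amounts to a permutation test pooled across genes. A sketch under those assumptions (function names ours):

```python
import numpy as np

def t_values(X, y):
    """Per-gene Welch t-statistic for a binary labelling y."""
    a, b = X[y == 0], X[y == 1]
    num = a.mean(axis=0) - b.mean(axis=0)
    den = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return num / den

def permutation_cutoff(X, y, n_perm=500, pct=1.0, seed=0):
    """Pool t-values from label-randomized data; return the bottom-pct% cutoff."""
    rng = np.random.default_rng(seed)
    null = np.concatenate([t_values(X, rng.permutation(y)) for _ in range(n_perm)])
    return np.percentile(null, pct)

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 50))           # pure-noise toy data
y = np.array([0] * 15 + [1] * 15)
cutoff = permutation_cutoff(X, y, n_perm=100)
# genes whose real t-value falls below the null's bottom 1% are "interesting"
interesting = np.where(t_values(X, y) < cutoff)[0]
```

On pure noise, only around 1% of genes should survive the cutoff, which is exactly the false-positive rate being controlled.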

17 Part 2 Microarray Data Classification

18 Classification: desired features
- robust in the presence of false positives
- understandable
- returns confidence/probability
- fast enough
The simplest approaches are most robust.

19 Popular Classification Methods
Decision trees/rules – find the smallest gene sets, but also false positives
Neural nets – work well if the number of genes is reduced
SVM – good accuracy, does its own gene selection, hard to understand
K-nearest neighbor – robust for small numbers of genes
Bayesian nets – simple, robust

20 Model-building Methodology
Select the gene set (and other parameters) on a training set. When the final model is built, evaluate it on an independent test set that was not used during model building.

21 Selecting the best gene set
We tested 9 gene sets with 10-fold cross-validation (90 neural nets) and selected the gene set with the lowest average error. Heuristically, use at least 10 genes overall.
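The selection loop above can be sketched with any cheap classifier standing in for the slides' neural nets; here we use nearest centroid on pre-ranked genes (all names and the toy data are ours):

```python
import numpy as np

def nearest_centroid_error(X, y, n_folds=10, seed=0):
    """Average error of a nearest-centroid classifier under k-fold CV."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)
        c0 = X[train][y[train] == 0].mean(axis=0)
        c1 = X[train][y[train] == 1].mean(axis=0)
        d0 = np.linalg.norm(X[f] - c0, axis=1)
        d1 = np.linalg.norm(X[f] - c1, axis=1)
        pred = (d1 < d0).astype(int)
        errs.append(np.mean(pred != y[f]))
    return float(np.mean(errs))

# toy data with the informative genes ranked first (so X[:, :k] = top-k genes)
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 100))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, :5] += 2.0

# try several gene-set sizes; keep the one with the lowest average CV error
errors = {k: nearest_centroid_error(X[:, :k], y) for k in (2, 5, 10, 20)}
best = min(errors, key=errors.get)
```

The point is the protocol, not the classifier: every candidate gene-set size is scored only by cross-validated error on the training data.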

22 Results on the test data
Evaluating the 10-genes-per-class model on the test data (34 samples) gives 33 correct predictions (97% accuracy) and 1 error, on sample 66: actual class AML, net prediction ALL, with low net confidence.

23 Classification - other applications
Combining clinical and genetic data for outcome/treatment prediction. Age, sex, and stage of disease are useful: e.g. if the data comes from a male patient, ovarian cancer can be ruled out.

24 Classification: Multi-Class
Similar approach:
- select the top genes most correlated to each class
- select the best subset using cross-validation
- build a single model separating all classes
Advanced:
- build a separate model for each class vs. the rest
- choose the model making the strongest prediction
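The "advanced" one-vs-rest recipe can be sketched with a toy per-class model; here each class model is just a centroid pair (class vs. rest), and the margin score is our stand-in for "strongest prediction":

```python
import numpy as np

def fit_one_vs_rest(X, y):
    """One simple model per class: centroid of the class vs. centroid of the rest."""
    return {c: (X[y == c].mean(axis=0), X[y != c].mean(axis=0))
            for c in np.unique(y)}

def predict(X, models):
    """Score each class model by its margin (distance-to-rest minus
    distance-to-class) and choose the model making the strongest prediction."""
    classes = sorted(models)
    scores = np.stack([
        np.linalg.norm(X - models[c][1], axis=1)
        - np.linalg.norm(X - models[c][0], axis=1)
        for c in classes])
    return np.array(classes)[np.argmax(scores, axis=0)]

# toy 3-class data: well-separated Gaussian blobs
rng = np.random.default_rng(3)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
y = np.repeat([0, 1, 2], 20)
X = centers[y] + rng.normal(size=(60, 2))
models = fit_one_vs_rest(X, y)
acc = np.mean(predict(X, models) == y)
```

Swapping in a real per-class classifier (e.g. an SVM with its decision value as the score) keeps the same structure.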

25 Multi-class Data Example
Brain data: Pomeroy et al., Nature 415, Jan 2002. 42 examples, about 7,000 genes, 5 classes. Selected the top 100 genes most correlated to each class, then selected the best subset by testing subsets of 1, 2, …, 20 genes with leave-one-out cross-validation for each.

26 Brain data results
Optimal set: 8 genes per class (4 errors out of 42). [Chart: cross-validation error vs. number of genes (same from each class).]

27 Part 3 Classification of Cancer Microarray Data

28 Cancer Classification
38 examples of myeloid and lymphoblastic leukemias, Affymetrix human 6800 chip (7,128 genes including control genes); 34 examples to test the classifier. Results: 33/34 correct. [Figure: test data plotted by d, the perpendicular distance from the separating hyperplane.]

29 Gene expression and Coregulation

30 Nonlinear classifier

31 Nonlinear SVM
A nonlinear SVM does not help when using all genes, but does help when the top genes, ranked by signal-to-noise (Golub et al.), are removed.

32 Rejections
Golub et al. classified 29 test points correctly and rejected 5, of which 2 were errors, using 50 genes. We need to introduce the concept of rejects to the SVM. [Figure: gene space (g1, g2) partitioned into Cancer, Reject, and Normal regions.]

33 Rejections

34 Estimating a CDF

35 The Regularized Solution

36 Rejections for SVM
95% confidence, or p = .05, corresponds to d = .107. [Plot: P(c=1 | d) against 1/d, with the .95 confidence level marked.]

37 Results with rejections
Results: 31 correct, 3 rejected, of which 1 is an error. [Figure: test data plotted by distance d.]

38 Why Feature Selection SVMs as stated use all genes/features
Molecular biologists and oncologists seem convinced that only a small subset of genes is responsible for particular biological properties, so they want to know which genes are most important in discriminating. Practical reasons: a clinical device measuring thousands of genes is not financially practical. Possible performance improvement.

39 Results with Gene Selection
AML vs. ALL: with 40 genes, 34/34 correct, 0 rejects; with 5 genes, 31/31 correct, 3 rejects of which 1 is an error. B cells vs. T cells for ALL: with 10 genes, 33/33 correct, 0 rejects. [Figure: test data plotted by distance d.]

40 Leave-one-out Procedure

41 The Basic Idea
Use leave-one-out (LOO) bounds for SVMs as a criterion to select features, searching over all possible subsets of n features for the ones that minimize the bound. When such a search is impossible because of combinatorial explosion, scale each feature by a real-valued variable and compute this scaling via gradient descent on the leave-one-out bound; one can then keep the features corresponding to the largest scaling variables. The rescaling can be done in the input space or in a "principal components" space.
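For small feature counts the exhaustive LOO search described above is feasible. A sketch using nearest-centroid LOO error as a cheap stand-in for the SVM LOO bound (names and toy data ours):

```python
import numpy as np
from itertools import combinations

def loo_error(X, y):
    """Leave-one-out error of a nearest-centroid classifier, used here
    as an inexpensive proxy for an SVM leave-one-out bound."""
    errs = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        Xt, yt = X[mask], y[mask]
        c0 = Xt[yt == 0].mean(axis=0)
        c1 = Xt[yt == 1].mean(axis=0)
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        errs += pred != y[i]
    return errs / len(y)

def best_subset(X, y, k):
    """Exhaustive search over all size-k feature subsets, minimizing LOO error.
    Feasible only for a handful of features; the gradient-descent rescaling
    on the slide replaces this when the search space explodes."""
    return min(combinations(range(X.shape[1]), k),
               key=lambda s: loo_error(X[:, list(s)], y))

rng = np.random.default_rng(4)
X = rng.normal(size=(24, 6))
y = np.array([0] * 12 + [1] * 12)
X[y == 1, 2] += 4.0                     # feature 2 is the informative one
print(best_subset(X, y, 1))
```

The search correctly isolates the informative feature; with thousands of genes the same criterion must instead be optimized through the continuous scaling variables.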

