Whole Genome Expression Analysis

In this presentation…
- Part 1 – Gene Expression Microarray Data
- Part 2 – Global Expression & Sequence Data Analysis
- Part 3 – Proteomic Data Analysis

Part 1 – Microarray Data

Examples of Problems
- Gene sequence problems: given a DNA sequence, state which sections are coding or noncoding regions, which sections are promoters, etc.
- Protein structure problems: given a DNA or amino acid sequence, state what structure the resulting protein takes.
- Gene expression problems: given DNA/gene microarray expression data, infer either clinical or biological class labels, or the genetic machinery that gives rise to the expression data.
- Protein expression problems: study the expression of proteins and their function.

Microarray Technology
Basic idea: the state of the cell is determined by proteins. A gene codes for a protein, which is assembled via mRNA; measuring the amount of a particular mRNA gives a measure of the amount of the corresponding protein, and the number of mRNA copies is the expression level of a gene. Microarray technology allows us to measure the expression of thousands of genes at once: measure the expression of thousands of genes under different experimental conditions and ask what is different and why.

Oligo vs. cDNA arrays (Lockhart and Winzeler, 2000)

Format
- What is whole genome expression analysis?
- Clustering algorithms
  - Hierarchical clustering
  - K-means clustering
  - Principal component analysis
  - Self-organizing maps
- Beyond clustering
  - Support vector machines
  - Automatic discovery of regulatory patterns in promoter regions
  - Bayesian network analysis

What is whole genome expression analysis? Messenger RNA is only an intermediate of gene expression.

What is whole genome expression analysis? (continued) Why measure mRNA?

What is whole genome expression analysis? (continued)

Why separate feature selection? Most learning algorithms look for non-linear combinations of features, and can easily find many spurious combinations given a small number of records and a large number of genes. We therefore first reduce the number of genes by a linear method, e.g. T-values (heuristic: select genes from each class), and then apply a favorite machine learning algorithm.

Feature selection approach: rank genes by a measure and select the top 200-500.
- T-test for mean difference: $t = \dfrac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$
- Signal to noise: $\mathrm{S2N} = \dfrac{\mu_1 - \mu_2}{\sigma_1 + \sigma_2}$
- Other: information-based, biological?
Almost any method works well with good feature selection.
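
As a concrete illustration of this ranking step, here is a minimal Python sketch (not from the slides; the names X, y, K are placeholders, with X a samples-by-genes matrix) that scores every gene by the Welch t-statistic and by the signal-to-noise ratio and keeps the top K:

```python
# Hedged sketch: rank genes by |t| and by signal-to-noise, keep the top K.
import numpy as np
from scipy import stats

def rank_genes(X, y, K=200):
    """Return indices of the top-K genes by absolute Welch t-statistic
    and by the signal-to-noise ratio (mean difference over summed stds)."""
    g1, g2 = X[y == 0], X[y == 1]
    t, _ = stats.ttest_ind(g1, g2, equal_var=False)   # per-gene t-values
    s2n = (g1.mean(0) - g2.mean(0)) / (g1.std(0) + g2.std(0))
    top_t = np.argsort(-np.abs(t))[:K]
    top_s2n = np.argsort(-np.abs(s2n))[:K]
    return top_t, top_s2n
```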

Heatmap visualization of selected genes (ALL vs. AML). Heatmap visualization is done by normalizing each gene to mean 0, std. 1. [Figure: heatmap of the selected genes; good correlation overall, with distinct ALL-related and AML-related blocks and a possible outlier at sample #12.]
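
A minimal sketch of the per-gene normalization behind such a heatmap, assuming a genes-by-samples matrix E (the name is illustrative):

```python
# Sketch: shift each gene (row) to mean 0 and scale to std 1 before plotting.
import numpy as np
import matplotlib.pyplot as plt

def zscore_rows(E):
    """E: genes x samples expression matrix."""
    mu = E.mean(axis=1, keepdims=True)
    sd = E.std(axis=1, keepdims=True)
    sd[sd == 0] = 1.0                      # guard against constant genes
    return (E - mu) / sd

# Usage (illustrative): plt.imshow(zscore_rows(E), aspect="auto", cmap="RdBu_r")
```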

Controlling false positives. Example: CD37 antigen, expression values 178 and 105 in class 1 vs. 4174 and 7133 in class 2. Mean difference between classes: T-value = -3.25; significance: p = 0.0007.

Controlling false positives with randomization. Randomly permute the class labels over the same CD37 antigen values and recompute the statistic: the randomized T-value is only -1.1. Randomization is less conservative than classical corrections and preserves the inner structure of the data.

Controlling false positives with randomization, II. Randomize the class labels 500 times, recomputing the gene's T-value each time, to build a null distribution; the bottom 1% of that distribution falls at T-value = -2.08. Select potentially interesting genes at the 1% level.
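
A minimal sketch of this randomization test for a single gene, following the slide's setup (500 permutations, bottom-1% cutoff); the names and defaults are illustrative:

```python
# Sketch: permute class labels, recompute the t-value each time,
# and read off the bottom-1% cutoff of the null distribution.
import numpy as np
from scipy import stats

def permutation_cutoff(x, y, n_perm=500, pct=1.0, seed=0):
    """x: expression values of one gene; y: binary class labels."""
    rng = np.random.default_rng(seed)
    null_t = np.empty(n_perm)
    for i in range(n_perm):
        yp = rng.permutation(y)
        null_t[i], _ = stats.ttest_ind(x[yp == 0], x[yp == 1])
    return np.percentile(null_t, pct)      # e.g. bottom 1% of null t-values
```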

Part 2 – Microarray Data Classification

Classification, desired features:
- simplest approaches are most robust
- robust in the presence of false positives
- understandable
- returns a confidence/probability
- fast enough

Popular classification methods:
- Decision trees/rules: find the smallest gene sets, but also false positives
- Neural nets: work well if the number of genes is reduced
- SVM: good accuracy, does its own gene selection, hard to understand
- K-nearest neighbor: robust for a small number of genes
- Bayesian nets: simple, robust

Model-building methodology: select the gene set (and other parameters) on a training set. When the final model is built, evaluate it on the independent test set, which must not be used during model building.

Selecting the best gene set: we tested 9 gene sets with 10-fold cross-validation (90 neural nets in total) and selected the gene set with the lowest average error; heuristically, keep at least 10 genes overall.
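
A hedged sketch of this selection loop, substituting a linear SVM for the slides' neural nets and scikit-learn's F-test ranking for the gene measure; the candidate sizes are placeholders:

```python
# Sketch: choose the gene-set size by 10-fold cross-validation on the
# training set only; the test set never enters this loop.
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def pick_gene_set_size(X_train, y_train, sizes=(10, 20, 50, 100)):
    best_k, best_err = None, np.inf
    for k in sizes:
        # Selection happens inside the pipeline, so each CV fold
        # re-selects genes on its own training split (no leakage).
        model = make_pipeline(SelectKBest(f_classif, k=k), LinearSVC())
        acc = cross_val_score(model, X_train, y_train,
                              cv=StratifiedKFold(10)).mean()
        if 1 - acc < best_err:
            best_k, best_err = k, 1 - acc
    return best_k
```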

Results on the test data: evaluating the 10 genes per class on the test data (34 samples) gives 33 correct predictions (97% accuracy) and 1 error, on sample 66: actual class AML, net prediction ALL, with low net confidence.

Classification - other applications: combining clinical and genetic data for outcome/treatment prediction. Age, sex, and stage of disease are useful; e.g. if the data come from a male patient, ovarian cancer can be ruled out.

Classification: multi-class. Similar approach:
- select the top genes most correlated to each class
- select the best subset using cross-validation
- build a single model separating all classes
Advanced (see the sketch below):
- build a separate model for each class vs. the rest
- choose the model making the strongest prediction
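
A minimal sketch of the one-model-per-class scheme, using scikit-learn's one-vs-rest wrapper around a linear SVM (the slides do not prescribe a particular implementation):

```python
# Sketch: one binary classifier per class vs. the rest; at prediction
# time the wrapper picks the class whose model scores highest.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

clf = OneVsRestClassifier(LinearSVC())
# Usage (illustrative): clf.fit(X_train, y_train); y_pred = clf.predict(X_test)
```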

Multi-class data example: brain data (Pomeroy et al., Nature 415, Jan 2002), 42 examples, about 7,000 genes, 5 classes. We selected the top 100 genes most correlated to each class, then selected the best subset by testing subsets of 1, 2, …, 20 genes, with leave-one-out cross-validation for each.

Brain data results. [Figure: error vs. number of genes (same from each class).] Optimal set: 8 genes per class (4 errors out of 42).

Part 3 – Classification of Cancer Microarray Data

Cancer classification: 38 examples of myeloid (AML) and lymphoblastic (ALL) leukemias, on Affymetrix human 6800 arrays (7,128 genes including control genes); 34 examples to test the classifier. Results: 33/34 correct. [Figure: test data plotted by d, the perpendicular distance from the separating hyperplane.]
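
A small sketch of how the plotted quantity d could be computed for a fitted linear SVM; the helper name is hypothetical:

```python
# Sketch: signed perpendicular distance of each sample from the
# SVM's separating hyperplane (decision value scaled by ||w||).
import numpy as np
from sklearn.svm import LinearSVC

def signed_distances(clf, X):
    """clf: a fitted LinearSVC; returns geometric distances to the hyperplane."""
    return clf.decision_function(X) / np.linalg.norm(clf.coef_)
```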

Gene expression and Coregulation

Nonlinear classifier

Nonlinear SVM: a nonlinear SVM does not help when using all genes, but does help when the top genes, ranked by signal to noise, are removed (Golub et al.).
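
For comparison, a minimal sketch of a linear vs. a nonlinear (RBF-kernel) SVM in scikit-learn; the kernel settings are illustrative defaults, not the slides' choices:

```python
# Sketch: the two classifiers being compared on the same features.
from sklearn.svm import SVC

linear_svm = SVC(kernel="linear")
nonlinear_svm = SVC(kernel="rbf", gamma="scale")
# With all genes both perform similarly; the nonlinear kernel mainly
# helps after the top signal-to-noise genes have been removed.
```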

Rejections: Golub et al. classified 29 test points correctly and rejected 5, of which 2 were errors, using 50 genes. We need to introduce the concept of rejects to the SVM. [Figure: two-gene plot (g1, g2) with Cancer, Reject, and Normal regions.]

Rejections

Estimating a CDF

The Regularized Solution

Rejections for SVM: at 95% confidence (p = 0.05), the rejection threshold is d = 0.107. [Figure: P(c = 1 | d) plotted against 1/d, with the 0.95 confidence level marked.]
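
A minimal sketch of such a rejection rule, using the slide's d = 0.107 threshold as the example; the function name is hypothetical:

```python
# Sketch: abstain whenever the sample lies too close to the hyperplane.
import numpy as np

def predict_with_reject(d, threshold=0.107):
    """d: signed distances from the hyperplane; returns +1, -1, or 0 (reject)."""
    labels = np.sign(d).astype(int)
    labels[np.abs(d) < threshold] = 0      # too close to the boundary
    return labels
```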

Results with rejections: 31 correct, 3 rejected, of which 1 is an error. [Figure: test data plotted by distance d from the hyperplane.]

Why feature selection?
- SVMs as stated use all genes/features.
- Molecular biologists and oncologists seem to be convinced that only a small subset of genes is responsible for particular biological properties, so they want to know which genes are most important in discriminating.
- Practical reasons: a clinical device testing thousands of genes is not financially practical.
- Possible performance improvement.

Results with gene selection:
- AML vs. ALL: with 40 genes, 34/34 correct, 0 rejects; with 5 genes, 31/31 correct, 3 rejects, of which 1 is an error.
- B vs. T cells for ALL: with 10 genes, 33/33 correct, 0 rejects.
[Figure: test data plotted by distance d from the hyperplane.]

Leave-one-out Procedure

The Basic Idea: use leave-one-out (LOO) bounds for SVMs as a criterion to select features, by searching over all possible subsets of n features for the ones that minimize the bound. When such a search is impossible because of combinatorial explosion, scale each feature by a real-valued variable and compute this scaling via gradient descent on the leave-one-out bound; one can then keep the features corresponding to the largest scaling variables. The rescaling can be done in the input space or in a "Principal Components" space.
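
A simplified sketch of the criterion, scoring a candidate feature subset by its empirical leave-one-out error (standing in for the analytic LOO bounds the slide refers to); the names are illustrative:

```python
# Sketch: leave-one-out error of a linear SVM restricted to a feature subset.
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import LinearSVC

def loo_error(X, y, feature_idx):
    """X: samples x genes; feature_idx: indices of the candidate subset."""
    acc = cross_val_score(LinearSVC(), X[:, feature_idx], y,
                          cv=LeaveOneOut()).mean()
    return 1.0 - acc

# Search over subsets (or, as the slide suggests, scale features by a
# learned weight vector and keep those with the largest weights) to
# minimize this criterion.
```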