Molecular Diagnosis Florian Markowetz & Rainer Spang Courses in Practical DNA Microarray Analysis.

Questions in medical research: Basic research: What role does gene A play in disease B? Clinical routine: What consequences does the expression status X of gene A have for patient Y?

Yesterday the focus was on basic research questions. We investigated genes: differentially expressed genes, coexpressed genes (clustering). Today the focus is on patients: molecular diagnosis, predicting survival / therapy response. Personalized medicine.

Diagnostic question: Which disease does Ms. Smith have? Disease A or disease B. Therapeutic question: Which treatment should Ms. Smith get? Treatment A or treatment B. Pharmacological question: Will Ms. Smith develop side effects from drug X? Yes or no.

What do the 3 questions have in common? 1. They all refer to an individual. 2. They all address predictive problems. 3. They are directly linked to decisions. Yesterday's questions are of a different kind: they refer to populations and aim at increasing general knowledge.

Predictive data analysis is very different from explanatory data analysis! No testing, no clustering, different types of regression. Of course, preprocessing stays the same.

(Slide figure: Ms. Smith, the DNA chip of Ms. Smith, and the resulting expression profile of Ms. Smith)

The expression profile is a list of 30,000 numbers... that are all properties of Ms. Smith... some of them reflect her health problem (a tumor)... the profile is a digital image of Ms. Smith's tumor. How can these numbers tell us (predict) whether Ms. Smith has tumor type A or tumor type B?

By comparing her profile to the profiles of patients with tumor type A and to those of patients with tumor type B. Where does Ms. Smith belong?

The setup for predictive data analysis: there are patients with known outcome (the training samples), and there are patients with unknown outcome (the "new" samples), such as Ms. Smith.

The challenge of predictive data analysis: use the training samples to learn how to predict the "new" samples, such as Ms. Smith.

How can we find out whether we have really learned how to predict the outcome? Take some patients from the original training samples and blind their outcome; these are now called test samples. Only the remaining samples are still training samples: use them to learn how to predict. Then predict the test samples and compare the predicted outcome to the true outcome (ok, ok, mistake, ...).
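
A minimal sketch of this hold-out idea in Python (not the course's own code; the expression matrix X, the outcomes y, and the k-nearest-neighbor classifier are placeholder assumptions):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))          # hypothetical expression matrix: 60 patients x 500 genes
y = np.repeat(["A", "B"], 30)           # hypothetical known outcomes

# Blind the outcome of some patients: they become test samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Learn to predict from the remaining training samples only.
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Predict the blinded test samples and compare to the true outcome.
y_pred = clf.predict(X_test)
print("test accuracy:", np.mean(y_pred == y_test))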

We will proceed in 4 Steps - Prediction with 1 gene - Prediction with 2 genes - Prediction with a small number of genes - Prediction with the microarray

Prediction with 1 gene: color-coded expression levels of the training samples for types A and B. Depending on her expression value, Ms. Smith is assigned to type A, assigned to type B, or lies on the borderline. Which color shade is a good decision boundary?

Approach: use the decision boundary with the fewest misclassifications on the training samples ("smallest training error"). Zero training error is not possible! A more schematic illustration: the distributions of expression values in type A and in type B overlap around the decision boundary, and the overlapping region is the training error.
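
As an illustration of the "smallest training error" rule for a single gene, a brute-force threshold search (the expression values below are simulated, not the course data):

import numpy as np

rng = np.random.default_rng(1)
expr = np.concatenate([rng.normal(5.0, 1.0, 25),    # type A, lower expression on average
                       rng.normal(7.0, 1.0, 25)])   # type B, higher expression on average
label = np.array(["A"] * 25 + ["B"] * 25)

# Candidate decision boundaries: midpoints between neighboring sorted expression values.
s = np.sort(expr)
candidates = (s[:-1] + s[1:]) / 2

# Training error of the rule "expression > boundary => type B" for each candidate.
errors = [np.sum((expr > t) != (label == "B")) for t in candidates]
best = candidates[int(np.argmin(errors))]
print("chosen boundary:", round(float(best), 2), "training errors:", int(min(errors)))
# Because the two distributions overlap, zero training error is usually not achievable.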

What about the test samples? The decision boundary was chosen to minimize the training error. The two distributions of expression values for types A and B will be similar but not identical in the test data. We cannot adjust the decision boundary, because we do not know the outcome of the test samples. Test errors are on average bigger than training errors. This phenomenon is called overfitting.

Prediction with 1 gene: the gene is differentially expressed. Prediction with 2 genes: both genes are differentially expressed. These genes are not differentially expressed. Can they be of any use?

Leukemia data (ALL): hyperdiploid vs. pseudodiploid, plotted with gene NONO on the x-axis and gene SLC25A3 on the y-axis. The genes are not differentially expressed, but their difference is. This is a bivariate expression signature.
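
A small simulated sketch of such a bivariate signature (the variable names nono and slc25a3 only mirror the axis labels of the slide; the numbers are made up, not the ALL data):

import numpy as np

rng = np.random.default_rng(2)
n = 40
group = np.array(["hyperdiploid"] * (n // 2) + ["pseudodiploid"] * (n // 2))

# Both genes vary strongly between patients (shared variation 'base'),
# so neither is differentially expressed on its own...
base = rng.normal(0.0, 3.0, n)
nono    = base + np.where(group == "pseudodiploid", 1.0, 0.0) + rng.normal(0.0, 0.1, n)
slc25a3 = base + np.where(group == "hyperdiploid",  1.0, 0.0) + rng.normal(0.0, 0.1, n)

# ... but their difference separates the two groups: a bivariate signature.
signature = nono - slc25a3
prediction = np.where(signature > 0, "pseudodiploid", "hyperdiploid")
print("training errors of the bivariate signature:", int(np.sum(prediction != group)))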

(Slide figure: Ms. Smith's profile placed in the two-gene plot)

A decision boundary can be defined by a weighted sum (linear combination) of expression values: a separating signature.
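
A hedged sketch of such a separating signature as a function (in practice the weights and the threshold would be learned from the training samples; the values below are purely illustrative):

import numpy as np

def linear_signature(profile, weights, threshold):
    """Classify one expression profile by a weighted sum (linear combination) of its values."""
    score = float(np.dot(weights, profile))
    return "A" if score < threshold else "B"

# Illustrative two-gene signature, e.g. the 'difference of two genes' from the previous slide.
w = np.array([1.0, -1.0])
b = 0.0
print(linear_signature(np.array([5.2, 6.8]), w, b))   # score = -1.6 < 0, so class "A"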

Problem 1: no separating line. Problem 2: too many separating lines. Why is this a problem?

What about Ms. Smith? This problem is also related to overfitting... more on this soon.

Prediction with more genes: with differentially expressed genes or with multivariate signatures.

The gap between training error and test error becomes wider. The test error can increase!

With more genes we have more information on the patients, so one would expect that we make fewer errors... but the opposite is true: overfitting becomes a more serious problem.

Prediction with 30,000 genes: with the microarray we have more genes than patients. Think about this in three dimensions. There are three genes, two patients with known diagnosis (red and yellow), and Ms. Smith (green). There is always one plane separating red and yellow with Ms. Smith on the yellow side, and a second separating plane with Ms. Smith on the red side. If all points fall onto one line it does not always work; however, for measured values this is very unlikely and never happens in practice.
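
This purely geometric effect is easy to reproduce: with far more genes than patients, a linear classifier separates even completely random labels perfectly. A sketch with simulated noise (LinearSVC and the sizes chosen here are illustrative assumptions):

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5000))        # 30 'patients', 5000 'genes' of pure noise
y = rng.integers(0, 2, size=30)        # diagnoses assigned completely at random

# A weakly regularized linear classifier fits the training data as well as it can.
clf = LinearSVC(C=1e6, max_iter=100000).fit(X, y)
print("training accuracy on random labels:", clf.score(X, y))   # typically 1.0: a 'perfect' separating signature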

The overfitting disaster: from the data alone we can neither decide which genes are important for the diagnosis, nor give a reliable diagnosis for a new patient. This has little to do with medicine; it is a geometrical problem.

The most important consequence of understanding the overfitting disaster: if you find a separating signature, it does not mean (yet) that you have a top publication. In most cases it means nothing.

More important consequences of understanding the overfitting disaster: there always exist separating signatures caused by overfitting (meaningless signatures). Hopefully there is also a separating signature caused by a disease mechanism (a meaningful signature). We need to learn how to find and validate meaningful signatures.

How to distinguish a meaningful signature from a meaningless signature? The meaningless signature might be separating (small training error)... but it will not be predictive (large test error). The aim is not a separating signature but a predictive signature: good performance in clinical practice! More later...
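
The practical tool for estimating predictive performance is the test error, for example via cross-validation. A sketch (data and classifier are placeholders; with real microarray data, any gene selection would have to be repeated inside each fold):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 1000))        # placeholder expression matrix
y = np.repeat([0, 1], 30)              # placeholder outcomes

# 10-fold cross-validation: every patient serves as a blinded test sample exactly once.
scores = cross_val_score(NearestCentroid(), X, y, cv=10)
print("estimated predictive accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))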

Strategies for finding meaningful signatures? Later we will discuss 2 possible approaches: 1. gene selection followed by discriminant analysis (QDA, LDA, DLDA) and the PAM program; 2. Support Vector Machines. What is the basis for these methods?

Gene selection: when considering all possible linear planes for separating the patient groups, we always find one that fits perfectly, without any biological reason for this. When considering only planes that depend on at most 20 genes, it is not guaranteed that we find a well-fitting signature. If it nevertheless exists, chances are good that it reflects a transcriptional disorder.
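
A minimal sketch of this first strategy, in the spirit of gene selection plus a simple discriminant rule (this is not the PAM program itself; the data, the F-test filter, and the nearest-centroid classifier are stand-ins):

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 2000))        # placeholder expression matrix: 60 patients x 2000 genes
y = np.repeat([0, 1], 30)

# Signature restricted to at most 20 genes (univariate F-test ranking),
# followed by a centroid-based discriminant rule on those genes.
signature = make_pipeline(SelectKBest(f_classif, k=20), NearestCentroid())

# The pipeline repeats the gene selection inside every fold,
# so the blinded test samples never influence which genes are chosen.
scores = cross_val_score(signature, X, y, cv=5)
print("cross-validated accuracy:", round(float(scores.mean()), 2))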

Fat planes: with an infinitely thin plane the data can always be separated correctly, but not necessarily with a fat one. Again, if a large-margin separation exists, chances are good that we have found something relevant. Large-margin classifiers: Support Vector Machines.
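
A corresponding sketch of the second strategy, a linear support vector machine: it looks for the separating plane with the largest margin, and the parameter C controls how strictly that "fat plane" must separate the training data (the profiles below are placeholders):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X_train = rng.normal(size=(50, 2000))         # placeholder training profiles
y_train = np.repeat(["A", "B"], 25)
x_new = rng.normal(size=(1, 2000))            # a 'new' patient such as Ms. Smith

svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)   # large-margin linear classifier
print("predicted class:", svm.predict(x_new)[0])
print("number of support vectors:", int(svm.support_vectors_.shape[0]))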

Regularization: both gene selection and Support Vector Machines confine the set of a priori possible signatures, but they use different strategies. Gene selection wants a small number of genes in the signature (sparse models); SVMs want some minimal distance between the data points and the separating plane (large-margin models). There is more you could do...
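
Regularization can also be built directly into the fitting procedure. As one example (an illustration under assumed placeholder data, not part of the course material), an L1-penalized logistic regression drives most gene weights to exactly zero and thus yields a sparse signature:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 1000))        # placeholder expression matrix
y = np.repeat([0, 1], 30)

# L1 penalty (LASSO-type): most coefficients become exactly zero.
sparse_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("genes with non-zero weight:", int(np.sum(sparse_model.coef_ != 0)))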

Learning theory: ridge regression, LASSO, kernel-based methods, additive models, classification trees, bagging, boosting, neural nets, relevance vector machines, nearest neighbors, transduction, etc.

Questions?

Coffee