Copyright © 2004 Limsoon Wong. CS2220: Computation Foundation in Bioinformatics. Limsoon Wong, Institute for Infocomm Research. Lecture slides for 3 February.


Copyright © 2004 Limsoon Wong
CS2220: Computation Foundation in Bioinformatics
Limsoon Wong, Institute for Infocomm Research
Lecture slides for 3 February 2004
For written notes on this lecture, please read chapter 14 of The Practical Bioinformatician.

Course Plan

Background on Microarrays

What’s a Microarray?
Contains a large number of DNA molecules spotted on glass slides, nylon membranes, or silicon wafers
Measures the expression of thousands of genes simultaneously

Affymetrix GeneChip Array

Making Affymetrix GeneChip
Quartz is washed to ensure uniform hydroxylation across its surface and to attach linker molecules
Exposed linkers become deprotected and are available for nucleotide coupling

Gene Expression Measurement by GeneChip

A Sample Affymetrix GeneChip File (U95A)

Some Advice on Affymetrix GeneChip Data
Ignore AFFX genes
– these genes are control genes
Ignore genes with “Abs Call” equal to “A” or “M”
– measurement quality is suspect
Apply an upper bound of 40000 and a lower bound of 100
– limits of the laser scanner's accuracy
Deal with missing values
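The advice above can be sketched as a small filtering step. This is a minimal illustration, not the lecture's own code: the `(probe_id, signal, abs_call)` tuple layout and the function name `clean_probes` are assumptions, and dropping missing values is only one of several possible ways to "deal with" them.

```python
LOWER_BOUND = 100.0    # lower bound from the slide
UPPER_BOUND = 40000.0  # upper bound from the slide


def clean_probes(rows):
    """Filter and clip (probe_id, signal, abs_call) tuples (hypothetical layout)."""
    cleaned = {}
    for probe_id, signal, abs_call in rows:
        if probe_id.startswith("AFFX"):
            continue  # control genes
        if abs_call in ("A", "M"):
            continue  # measurement quality is suspect
        if signal is None:
            continue  # assumption: missing values are simply dropped here
        # Clip the signal into the range the scanner measures reliably.
        cleaned[probe_id] = min(max(signal, LOWER_BOUND), UPPER_BOUND)
    return cleaned


rows = [
    ("AFFX-BioB-5_at", 500.0, "P"),
    ("38319_at", 52000.0, "P"),
    ("1000_at", 30.0, "P"),
    ("1001_at", 800.0, "A"),
    ("1002_at", None, "P"),
]
result = clean_probes(rows)
```

Imputing missing values (for example from the mean of the remaining samples) would be an alternative to dropping them; the slides leave that choice open.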

A Motivating Application
Diagnosis of Childhood Acute Lymphoblastic Leukemia and Optimization of Risk-Benefit Ratio of Therapy

Childhood ALL, A Heterogeneous Disease
Major subtypes are
– T-ALL
– E2A-PBX1
– TEL-AML1
– MLL genome rearrangements
– Hyperdiploid>50
– BCR-ABL

Risk-Stratified Therapy
Different subtypes respond differently to the same treatment intensity
⇒ Match patient to the optimum treatment intensity for the patient's subtype & prognosis
[Figure: subtype groups (BCR-ABL, MLL; TEL-AML1, Hyperdiploid>50; T-ALL, E2A-PBX1) placed on a scale from “generally good-risk, lower intensity” to “generally high-risk, higher intensity”]

Treatment Failure
Overly intensive treatment leads to
– Development of secondary cancers
– Reduction of IQ
Insufficiently intensive treatment leads to
– Relapse

Risk Assignment
The major subtypes look similar
Conventional diagnosis requires
– Immunophenotyping
– Cytogenetics
– Molecular diagnostics

Mission
Conventional risk assignment procedure requires difficult, expensive tests and the collective judgement of multiple specialists
Generally available only in major advanced hospitals
⇒ Can we have a single-test, easy-to-use platform instead?

Single-Test Platform of Microarray & Machine Learning

Overall Strategy
Diagnosis of subtype ⇒ Subtype-dependent prognosis ⇒ Risk-stratified treatment intensity
For each subtype, select genes to develop a classification model for diagnosing that subtype
For each subtype, select genes to develop a prediction model for prognosis of that subtype

Subtype Diagnosis by PCL
Gene expression data collection
Gene selection by χ²
Classifier training by emerging pattern
Classifier tuning (optional for some machine learning methods)
Apply classifier for diagnosis of future cases by PCL

Childhood ALL Subtype Diagnosis Workflow
A tree-structured diagnostic workflow was recommended by our doctor collaborator

Training and Testing Sets

Signal Selection Basic Idea
Choose a signal with low intra-class distance
Choose a signal with high inter-class distance
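One way to make this idea concrete is a score that grows with the gap between class means and shrinks with the spread inside each class. The sketch below is only an illustration of the principle; the name `separation_score` and the exact formula are assumptions, not the measure used in the lecture.

```python
from statistics import mean, pstdev


def separation_score(class_a, class_b):
    """Gap between class means divided by the summed within-class spread."""
    gap = abs(mean(class_a) - mean(class_b))        # inter-class distance
    spread = pstdev(class_a) + pstdev(class_b)      # intra-class distance
    if spread == 0:
        return float("inf") if gap > 0 else 0.0
    return gap / spread


# A well-separated signal scores much higher than an overlapping one.
good = separation_score([1, 2, 3], [10, 11, 12])
poor = separation_score([1, 5, 9], [2, 6, 10])
```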

Signal Selection by χ²
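The body of this slide did not survive extraction, so the following is a generic sketch of the standard χ² statistic for a gene discretized at a single cut point, not a reproduction of the slide. The function name `chi_square` and the single-cut-point setup are assumptions made for illustration.

```python
def chi_square(values, labels, cut):
    """Chi-square statistic of the intervals (<= cut, > cut) against class labels."""
    classes = sorted(set(labels))
    # observed[i][j]: number of samples in interval i that belong to class j
    observed = [[0] * len(classes) for _ in range(2)]
    for value, label in zip(values, labels):
        observed[int(value > cut)][classes.index(label)] += 1

    n = len(values)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]

    score = 0.0
    for i in range(2):
        for j in range(len(classes)):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                score += (observed[i][j] - expected) ** 2 / expected
    return score


# A cut point that separates the classes perfectly gives a high score.
separating = chi_square([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"], cut=5)
uninformative = chi_square([1, 10, 2, 11], ["a", "a", "b", "b"], cut=5)
```

Genes can then be ranked by this score and the top-ranked ones kept, in the spirit of the "20 genes selected by χ²" mentioned later in the lecture.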

Emerging Patterns
An emerging pattern is a set of conditions
– usually involving several features
– that most members of a class satisfy
– but none or few of the other class satisfy
A jumping emerging pattern is an emerging pattern that
– some members of a class satisfy
– but no members of the other class satisfy
We use only jumping emerging patterns
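The definitions above can be checked mechanically. In this minimal sketch a pattern is represented as a mapping from gene name to a half-open interval `[low, high)`; that representation, the gene names `g1` and `g2`, and the toy values are all assumptions for illustration.

```python
def satisfies(sample, pattern):
    """True if the sample meets every condition (gene in [low, high)) of the pattern."""
    return all(low <= sample[gene] < high for gene, (low, high) in pattern.items())


def support(samples, pattern):
    """Fraction of samples in a class that satisfy the pattern."""
    return sum(satisfies(s, pattern) for s in samples) / len(samples)


def is_jumping_ep(pattern, home_class, other_class):
    """Some members of the home class satisfy it, no member of the other class does."""
    return support(home_class, pattern) > 0 and support(other_class, pattern) == 0


home = [{"g1": 50, "g2": 900}, {"g1": 80, "g2": 700}, {"g1": 400, "g2": 950}]
other = [{"g1": 500, "g2": 100}, {"g1": 60, "g2": 200}]

two_gene_pattern = {"g1": (0, 100), "g2": (600, float("inf"))}
one_gene_pattern = {"g1": (0, 100)}
```

Here the two-gene pattern is a jumping emerging pattern of the home class, while the one-gene pattern is not, because one sample of the other class also satisfies it.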

PCL: Prediction by Collective Likelihood

Accuracy of PCL (vs. other classifiers)
The classifiers are all applied to the 20 genes selected by χ² at each level of the tree

Understandability of PCL
E.g., for T-ALL vs. OTHERS, one ideally discriminatory gene, 38319_at, was found, inducing these 2 EPs
These give us the diagnostic rule

Multidimensional Scaling Plot
Subtype Diagnosis

Multidimensional Scaling Plot
Subtype-Dependent Prognosis
Similar computational analysis was carried out to predict relapse and/or secondary AML in a subtype-specific manner
>97% accuracy achieved

Is there a new subtype?
Hierarchical clustering of gene expression profiles reveals a novel subtype of childhood ALL

Childhood ALL Cure Rates
Conventional risk assignment procedure requires difficult, expensive tests and the collective judgement of multiple specialists
⇒ Not available in less advanced ASEAN countries

Childhood ALL Treatment Cost
Treatment for childhood ALL over 2 yrs
– Intermediate intensity: US$60k
– Low intensity: US$36k
– High intensity: US$72k
Treatment for relapse: US$150k
Cost for side-effects: Unquantified

Current Situation (2000 new cases/yr in ASEAN)
Intermediate intensity conventionally applied in less advanced ASEAN countries
⇒ Over-intensive for 50% of patients, thus more side effects
⇒ Under-intensive for 10% of patients, thus more relapse
⇒ 5-20% cure rates
US$120m (US$60k * 2000) for intermediate intensity treatment
US$30m (US$150k * 2000 * 10%) for relapse treatment
Total US$150m/yr plus unquantified costs for dealing with side effects

Using Our Platform
Low intensity applied to 50% of patients
Intermediate intensity to 40% of patients
High intensity to 10% of patients
⇒ Reduced side effects
⇒ Reduced relapse
⇒ 75-80% cure rates
US$36m (US$36k * 2000 * 50%) for low intensity
US$48m (US$60k * 2000 * 40%) for intermediate intensity
US$14.4m (US$72k * 2000 * 10%) for high intensity
Total US$98.4m/yr
⇒ Save US$51.6m/yr
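The cost comparison on this slide and the previous one can be reproduced with a few lines of integer arithmetic. The sketch below only restates the slides' own figures; the variable and function names are illustrative. Like the slide, the platform total leaves out any remaining relapse treatment and all side-effect costs.

```python
CASES_PER_YEAR = 2000
COST_USD = {"low": 36_000, "intermediate": 60_000, "high": 72_000}
RELAPSE_COST_USD = 150_000


def treatment_cost(mix_percent):
    """Yearly cost for a mix given as {intensity: percent of patients}."""
    return sum(
        COST_USD[level] * CASES_PER_YEAR * percent // 100
        for level, percent in mix_percent.items()
    )


# Current situation: everyone gets intermediate intensity, 10% relapse.
current = treatment_cost({"intermediate": 100})
current += RELAPSE_COST_USD * CASES_PER_YEAR * 10 // 100

# With the platform: 50% low, 40% intermediate, 10% high intensity.
platform = treatment_cost({"low": 50, "intermediate": 40, "high": 10})

savings = current - platform
```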

Some Caveats
Study was performed on Americans
May not be applicable to Singaporeans, Malaysians, Indonesians, etc.
Large-scale study on local populations currently in the works

Proteomic Profile Classification
Same thing works on proteomic profiles also

Proteomic Profiling by Mass Spec

Ovarian Cancer Data
Petricoin et al., Lancet 359: , 2002
Identify proteomic patterns in serum that distinguish ovarian cancer from non-cancer
release
91 non-cancer samples
162 cancer samples
features
Each feature is the amplitude of an ion (aka M/Z identities)

A sample proteomic profile

Typical Procedure in Analysing Proteomic Profiles for Diagnosis
Proteomic data collection
Ion (M/Z) values selection
Classifier training
Classifier tuning (optional for some machine learning methods)
Apply classifier for diagnosis of future cases

Accuracy
Errors from 10-fold cross validation using the n M/Z identities of lowest entropy
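Ranking features by entropy can be sketched as follows: for each feature, find the binary cut point that leaves the purest class partitions and keep the n features with the lowest resulting entropy. This is a simplified stand-in, under stated assumptions, for the entropy-based discretization used in the lecture; the feature names `mz_a` and `mz_b` and the helper names are hypothetical.

```python
from math import log2


def entropy(labels):
    """Class entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum(
        (labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels)
    )


def best_cut_entropy(values, labels):
    """Lowest weighted class entropy over all single cut points of a feature."""
    pairs = sorted(zip(values, labels))
    best = entropy(labels)  # entropy with no cut at all
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no cut point between equal values
        left = [label for _, label in pairs[:k]]
        right = [label for _, label in pairs[k:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        best = min(best, weighted)
    return best


def lowest_entropy_features(table, labels, n):
    """Names of the n features (table: name -> values) with the lowest entropy."""
    return sorted(table, key=lambda name: best_cut_entropy(table[name], labels))[:n]


labels = ["cancer", "cancer", "normal", "normal"]
table = {"mz_a": [1, 2, 8, 9], "mz_b": [1, 8, 2, 9]}
selected = lowest_entropy_features(table, labels, n=1)
```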

Gene Interaction Prediction

Beyond Classification of Gene Expression Profiles
After identifying the candidate genes by feature selection, do we know which ones are causal genes and which ones are surrogates?
[Figure: heat map of diagnostic ALL BM samples (n=327) against genes for class distinction (n=271); colour scale from -3σ to 3σ, where σ = std deviation from mean; sample groups: TEL-AML1, BCR-ABL, Hyperdiploid>50, E2A-PBX1, MLL, T-ALL, Novel]

Gene Regulatory Circuits
Genes are “connected” in a “circuit” or network
Expression of a gene in a network depends on expression of some other genes in the network
Can we reconstruct the gene network from gene expression data?

Key Questions
For each gene in the network:
Which genes affect it?
How do they affect it?
– Positively?
– Negatively?
– More complicated ways?

Some Techniques
Bayesian Networks
– Friedman et al., JCB 7: , 2000
Boolean Networks
– Akutsu et al., PSB 2000, pages
Differential equations
– Chen et al., PSB 1999, pages
Classification-based method
– Soinov et al., “Towards reconstruction of gene network from expression data by supervised learning”, Genome Biology 4:R6.1-9, 2003

A Classification-based Technique
Soinov et al., Genome Biology 4:R6.1-9, Jan 2003
Given a gene expression matrix $X$
– each row is a gene
– each column is a sample
– each element $x_{ij}$ is the expression of gene $i$ in sample $j$
Find the average value $a_i$ of each gene $i$
Denote $s_{ij}$ as the state of gene $i$ in sample $j$,
– $s_{ij} = \text{up}$ if $x_{ij} > a_i$
– $s_{ij} = \text{down}$ if $x_{ij} \le a_i$
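The state assignment defined on this slide translates directly into code. In this minimal sketch the matrix is a list of rows (one row per gene, one column per sample); the function name `gene_states` is an assumption.

```python
from statistics import mean


def gene_states(X):
    """Map each expression value to "up" or "down" relative to its gene's mean."""
    states = []
    for row in X:  # one row per gene i
        a_i = mean(row)
        # "up" only if strictly above the gene's average, otherwise "down"
        states.append(["up" if x_ij > a_i else "down" for x_ij in row])
    return states


X = [
    [1.0, 3.0, 2.0],  # gene 0: mean 2.0
    [5.0, 5.0, 5.0],  # gene 1: constant, so never above its mean
]
states = gene_states(X)
```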

A Classification-based Technique
Soinov et al., Genome Biology 4:R6.1-9, Jan 2003
To see whether the state of gene $g$ is determined by the state of other genes
– we see whether $\{ s_{ij} \mid i \ne g \}$ can predict $s_{gj}$
– if it can predict with high accuracy, then “yes”
– any classifier can be used, such as C4.5, PCL, SVM, etc.
To see how the state of gene $g$ is determined by the state of other genes
– apply C4.5 (or PCL or other “rule-based” classifiers) to predict $s_{gj}$ from $\{ s_{ij} \mid i \ne g \}$
– and extract the decision tree or rules used
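The procedure above can be sketched with a deliberately tiny rule-based classifier. The code below swaps in a single-gene rule learner (a "1R" style stump) for C4.5 or PCL, purely to keep the example self-contained, and estimates accuracy by leave-one-out; the function names and the toy state matrix are assumptions.

```python
from collections import Counter


def one_rule(states, g, train):
    """Best single-gene rule predicting gene g's state, fitted on sample indices in train."""
    best = None
    for i in range(len(states)):
        if i == g:
            continue  # only the other genes may be used as predictors
        # For each state of gene i, count the states of gene g seen with it.
        votes = {"up": Counter(), "down": Counter()}
        for j in train:
            votes[states[i][j]][states[g][j]] += 1
        # Assumption: an unseen predictor state defaults to "down".
        rule = {s: (c.most_common(1)[0][0] if c else "down") for s, c in votes.items()}
        hits = sum(rule[states[i][j]] == states[g][j] for j in train)
        if best is None or hits > best[0]:
            best = (hits, i, rule)
    return best[1], best[2]


def loo_accuracy(states, g):
    """Leave-one-out accuracy of predicting gene g's state from the other genes."""
    n = len(states[0])
    correct = 0
    for held_out in range(n):
        train = [j for j in range(n) if j != held_out]
        i, rule = one_rule(states, g, train)
        correct += rule[states[i][held_out]] == states[g][held_out]
    return correct / n


states = [
    ["up", "up", "down", "down", "up", "down"],   # gene 0
    ["down", "down", "up", "up", "down", "up"],   # gene 1: opposite of gene 0
    ["up", "down", "up", "down", "up", "down"],   # gene 2: unrelated
]
accuracy = loo_accuracy(states, g=1)
predictor, rule = one_rule(states, g=1, train=range(6))
```

In this toy example the extracted rule reads "gene 1 is down when gene 0 is up, and up when gene 0 is down", which is the kind of explicit, negative relationship the method is meant to surface.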

Advantages of this method
Can identify genes affecting a target gene
Don’t need discretization thresholds
Each data sample is treated as an example
Explicit rules can be extracted from the classifier (assuming C4.5 or PCL)
Generalizable to time series

Deriving Treatment Plan
A pure speculation!

Can we do more with EPs?
Detect gene groups that are significantly related to a disease
Derive coordinated gene expression patterns from these groups
Derive “treatment plan” based on these patterns

Colon Tumour Dataset
Alon et al., PNAS 96: , 1999
We use the colon tumour dataset above to illustrate our ideas
– 22 normal samples
– 40 colon tumour samples

Detect Gene Groups
Feature Selection
– Use entropy method
– 35 genes have cut points
Generate EPs
– 9450 EPs in normals
– 1008 EPs in tumours
EPs with largest support are gene groups significantly correlated to disease

Top 20 EPs

Observation
Some EPs contain a large number of genes and still have high frequency
E.g., {25, 33, 37, 41, 43, 57, 59, 69} has frequency 77.27% (17 of the 22 samples) in normal and 0% in cancer samples
I.e., most normal cells’ gene expression values satisfy all conditions implied by these 8 items

Treatment Plan Idea
Increase/decrease expression level of particular genes in a cancer cell so that
– it has the common EPs of normal cells
– it has no common EPs of cancer cells

Treatment Plan Example
From the EP {25, 33, 37, 41, 43, 57, 59, 69}
– 77% of normal cells express the 8 genes (M16937, H51015, R10066, T57619, R84411, T47377, X53586, U09587) in the corresponding intervals
– a cancer cell never expresses all 8 genes in the same way
– if the expression level of improperly expressed genes can be adjusted, the cancer cell can have one common EP of normal cells
– a cancer cell can then be iteratively converted into a normal one

Choosing Genes to Adjust
Consider tumour cell T1
77% of normal cells have this EP
If H51015, R84411, T47377, X53586, U09587 in T1 can be down-regulated so that T1 now contains the EP above, then T1 will have one more common property of normal cells
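The adjustment step can be sketched as moving each out-of-range gene to the nearest value inside its EP interval. This is a speculative illustration in the same spirit as the slides: the interval bounds and expression values below are invented for the example (the real cut points were on the original slide and are not in this text), and the function name `adjust_to_pattern` is an assumption.

```python
def adjust_to_pattern(sample, pattern):
    """Move each gene that violates its [low, high] interval to the nearest bound."""
    adjusted = dict(sample)
    changed = []
    for gene, (low, high) in pattern.items():
        value = sample[gene]
        if value < low or value > high:
            adjusted[gene] = min(max(value, low), high)
            changed.append(gene)
    return adjusted, changed


# Hypothetical intervals and values, for illustration only.
normal_ep = {"H51015": (0.0, 300.0), "M16937": (0.0, 100.0), "R84411": (0.0, 500.0)}
tumour_t1 = {"H51015": 900.0, "M16937": 50.0, "R84411": 700.0}

adjusted_t1, adjusted_genes = adjust_to_pattern(tumour_t1, normal_ep)
```

Only the genes that violate the pattern are touched, which mirrors the slide's point that a subset of the 8 genes needs to be down-regulated for T1 to acquire the normal-cell EP.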

Doing more adjustments...

Next, eliminate common EPs of cancer cells in T1

“Treatment Plan” Validation
“Adjustments” were made to the 40 colon tumour samples based on EPs as described
Classifiers trained on original samples were applied to the adjusted samples
It works!

A Big But...
Effective means for identifying mechanisms and pathways through which to modulate gene expression of selected genes need to be developed

Notes

Exercises
Try the gene expression profile classification methods on some example data sets
Try Soinov’s gene network reconstruction method on some example data sets
Visit for some example data sets
Suggest some approaches to deal with missing values
Summarise key points of this lecture

References
E.-J. Yeoh et al., “Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling”, Cancer Cell, 1: , 2002
E.F. Petricoin et al., “Use of proteomic patterns in serum to identify ovarian cancer”, Lancet, 359: , 2002
U. Alon et al., “Broad patterns of gene expression revealed by clustering analysis of tumor colon tissues probed by oligonucleotide arrays”, PNAS 96: , 1999

References
J. Li, L. Wong, “Geography of differences between two classes of data”, Proc. 6th European Conf. on Principles of Data Mining and Knowledge Discovery, pp , 2002
J. Li, L. Wong, “Identifying good diagnostic genes or gene groups from gene expression data by using the concept of emerging patterns”, Bioinformatics, 18: , 2002
J. Li et al., “A comparative study on feature selection and classification methods using a large set of gene expression profiles”, GIW, 13:51-60, 2002

References
M. A. Hall, “Correlation-based feature selection for machine learning”, PhD thesis, Dept of Comp. Sci., Univ. of Waikato, New Zealand, 1998
U. M. Fayyad, K. B. Irani, “Multi-interval discretization of continuous-valued attributes”, IJCAI 13: , 1993
H. Liu, R. Setiono, “Chi2: Feature selection and discretization of numeric attributes”, IEEE Intl. Conf. Tools with Artificial Intelligence 7: , 1995
L.D. Miller et al., “Optimal gene expression analysis by microarrays”, Cancer Cell 2: , 2002

Acknowledgements
Huiqing Liu
Allen Yeoh
Jinyan Li