Linking Genetic Profiles to Biological Outcome Paul Fogel Consultant, Paris S. Stanley Young National Institute of Statistical Sciences NISS, NMF Workshop.

Presentation transcript:

Linking Genetic Profiles to Biological Outcome. Paul Fogel, Consultant, Paris. S. Stanley Young, National Institute of Statistical Sciences (NISS). NMF Workshop, February 23, ‘07

Scotch whisky database. Original matrix = Mixing levels (weights) × Prototypical flavor patterns + Residual

How many flavor patterns? 1. Scree plot. 2. Profile likelihood (Zhu and Ghodsi). 3. Volume filled (determinant).
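The first two criteria on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: `scree` returns the singular values and their share of total squared variation, and `profile_loglik` follows the general idea of Zhu and Ghodsi's profile likelihood (split the ordered values into two groups with separate means and a common variance, then pick the split with the highest likelihood). The use of the maximum-likelihood pooled variance, the function names, and the example values are all assumptions made for illustration.

```python
import numpy as np

def scree(X):
    """Singular values of X and the share of total squared variation each explains."""
    s = np.linalg.svd(np.asarray(X, dtype=float), compute_uv=False)
    return s, s**2 / np.sum(s**2)

def profile_loglik(d):
    """Profile log-likelihood for each candidate split q = 1..p-1 of ordered values d."""
    d = np.asarray(d, dtype=float)
    p = d.size
    ll = np.empty(p - 1)
    for q in range(1, p):
        top, rest = d[:q], d[q:]
        # Sum of squares around the two group means (common-variance Gaussian model).
        ss = ((top - top.mean()) ** 2).sum() + ((rest - rest.mean()) ** 2).sum()
        var = max(ss / p, 1e-12)  # pooled MLE variance; floor avoids log(0)
        ll[q - 1] = -0.5 * p * np.log(2 * np.pi * var) - 0.5 * ss / var
    return ll

def choose_rank(d):
    """Number of leading values kept by the profile-likelihood criterion."""
    return int(np.argmax(profile_loglik(d))) + 1

# Hypothetical scree values with an obvious elbow after the third one.
d_example = np.array([10.0, 9.5, 9.0, 1.0, 0.9, 0.8, 0.7, 0.6])
best_q = choose_rank(d_example)
print("chosen number of patterns:", best_q)
```

For non-negative data the first singular value is often much larger than the rest, so in practice the criterion may need to be applied with care (for example after centering); the slide does not say how the authors handled this.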

AnCnoc: Floral, Sweetness, Fruity, Malty, Nutty

Balmenach: Winey, Body, Honey, Sweetness, Nutty, Malty

GlenGarioch: Spicy, Fruity, Sweetness, Body, Malty

Lagavulin & Laphroaig: Medicinal, Smoky, Body

Statistical Issues 1. Massive testing: hundreds of “omic” predictors and several questions per sample. 2. Family-wise versus false discovery error rates. 3. Missing data, outliers. Don’t fool yourself.
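The contrast in point 2 can be made concrete with a small sketch. Bonferroni is used here as a standard family-wise procedure and Benjamini-Hochberg as a standard false discovery rate procedure; the slide names neither, so both choices, and the example p-values, are assumptions for illustration.

```python
def bonferroni(pvals, alpha=0.05):
    """Family-wise control: reject only p-values at or below alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """FDR control: reject the k smallest p-values, where k is the largest
    rank with p_(k) <= k * alpha / m."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected

# Hypothetical p-values for eight predictors.
pvals = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.9]
n_fwer = sum(bonferroni(pvals))
n_fdr = sum(benjamini_hochberg(pvals))
print("Bonferroni rejections:", n_fwer)
print("Benjamini-Hochberg rejections:", n_fdr)
```

On these values the false discovery procedure rejects more hypotheses than the family-wise one, which is the usual trade-off: more discoveries in exchange for tolerating a controlled fraction of false ones.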

Matrix Factorization Methods 1. Principal component analysis. 2. Singular value decomposition. 3. Non-negative matrix factorization. 4. Independent component analysis. 5. Robust MF. Area of active research.

Key Papers 1. Good (1969) Technometrics – SVD. 2. Liu et al. (2003) PNAS – rSVD. 3. Lee and Seung (1999) Nature – NMF. 4. Kim and Tidor (2003) Genome Research. 5. Brunet et al. (2004) PNAS – microarray. SVD eigenvectors come from a composite of mechanisms. NMF commits one vector to each mechanism.

NMF Algorithm. $A = WH + E$, where A is the data matrix (samples along one dimension, genes or compounds along the other). Green are the “spectra”. Red are the “weights”. Start with random elements in red and green. Optimize so that $\sum_{i,j} (a_{ij} - (WH)_{ij})^2$ is minimized.
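A minimal sketch of this procedure, assuming the multiplicative update rules commonly attributed to Lee and Seung for the squared-error objective. The slide does not say which optimizer the authors used, so the update rule, the function name `nmf`, the iteration count, and the toy matrix are all assumptions.

```python
import numpy as np

def nmf(A, k, n_iter=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix A (n x m) into W (n x k) and H (k x m)
    by multiplicative updates that reduce the squared error ||A - WH||^2."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    # Start with random non-negative elements in both factors.
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H stay non-negative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy example: an exactly rank-2 non-negative matrix.
rng = np.random.default_rng(1)
A = rng.random((20, 2)) @ rng.random((2, 10))
W, H = nmf(A, k=2)
rel_err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
print("relative reconstruction error:", rel_err)
```

Because the result depends on the random start, it is common to run the factorization from several seeds and keep the best fit; that step is left out here for brevity.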

Inference Test each variable sequentially within an ordered set. Each set corresponds to a particular eigenvector, which has been ordered by decreasing values. Increase in statistical power. Genomic example. Simulation.
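One way to read this slide is as a fixed-sequence procedure: the overall error rate is split across the sets, and within each set the variables are tested in order of decreasing loading until the first non-rejection. The sketch below follows that reading. The equal split of alpha, the stopping rule, and the example genes, loadings, and p-values are assumptions, not details taken from the slides.

```python
def sequential_group_tests(groups, alpha=0.05):
    """groups: list of sets, each a list of (name, loading, pvalue) tuples.
    Returns the names rejected by fixed-sequence testing within each set."""
    alpha_g = alpha / len(groups)  # equal share of the error rate per set
    rejected = []
    for members in groups:
        # Order the set by decreasing loading on its component.
        ordered = sorted(members, key=lambda t: t[1], reverse=True)
        for name, _loading, p in ordered:
            if p > alpha_g:
                break  # stop at the first non-rejection; later variables go untested
            rejected.append(name)
    return rejected

# Hypothetical sets derived from two components.
groups = [
    [("g3", 0.4, 0.2), ("g1", 0.9, 0.001), ("g4", 0.1, 0.001), ("g2", 0.7, 0.01)],
    [("g6", 0.5, 0.04), ("g5", 0.8, 0.02), ("g7", 0.2, 0.5)],
]
hits = sequential_group_tests(groups)
print("rejected:", hits)
```

Each variable is tested at the full level of its set rather than at a level divided by the total number of variables, which is where the gain in power comes from when the ordering is informative. The price is that a small p-value late in the sequence (g4 here) is never reached.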

Micro Array Example. Group AML: patients with acute myeloid leukemia. Group ALL: patients with acute lymphoblastic leukemia –Subgroup ALL-T: T cell subtypes –Subgroup ALL-B: B cell subtypes. Golub, T.R. et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.

Clustering NMF clusters samples correctly. Brunet et al (2004). PNAS vol. 101 no –4169 Additional subgroup of ALL-B.


Sequential testing. Figure labels: Cluster 3, ALL-B2 (169 genes); Cluster 1, ALL-B1 (33 genes). Gene categories: Immune Response 10 genes (p= ); Cell Growth and Proliferation 61 genes; RNA Processing 11 genes (P = ); Cell Cycle 12 genes; Transcription 16 genes; DNA Repair and Replication 11 genes (P = ); MHC class II 5 genes; MHC class I & II 6 genes (P = ); Proteasome 7 genes (P = ); Immune Response 28 genes (p= ). Upregulation in ALL-B2 genes: higher rate of transcription and replication processes. More, compared with ALL-B1: proliferative nature; proteasomal activity; energy production.

Simulation

Genes 1-5: up-regulated by T1. Genes 6-10: up-regulated by T2. Genes 11-20: up-regulated by T1 and T2. Intragroup correlation structure.
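A data set with this layout can be generated as in the sketch below. The slide gives only the gene groups and the existence of intragroup correlation, so the sample size, effect size, correlation `rho`, baseline level, additive combination of T1 and T2 for genes 11-20, and the balanced two-by-two design are all assumptions chosen for illustration.

```python
import numpy as np

def simulate(n=200, effect=1.0, rho=0.5, baseline=8.0, seed=0):
    """Simulate n samples x 20 genes with three correlated gene groups."""
    rng = np.random.default_rng(seed)
    # Balanced 2x2 design for the two treatment indicators.
    t1 = np.arange(n) % 2
    t2 = (np.arange(n) // 2) % 2
    groups = [range(0, 5), range(5, 10), range(10, 20)]  # genes 1-5, 6-10, 11-20
    drivers = [t1, t2, t1 + t2]                          # what up-regulates each group
    X = np.empty((n, 20))
    for idx, drive in zip(groups, drivers):
        idx = list(idx)
        # A shared factor per group gives pairwise noise correlation rho within the group.
        shared = rng.standard_normal((n, 1))
        own = rng.standard_normal((n, len(idx)))
        noise = np.sqrt(rho) * shared + np.sqrt(1 - rho) * own
        X[:, idx] = baseline + effect * drive[:, None] + noise
    return X, t1, t2

X, t1, t2 = simulate()
print("shape:", X.shape)
print("mean shift of genes 1-5 under T1:",
      X[t1 == 1, :5].mean() - X[t1 == 0, :5].mean())
```

Running a testing procedure on many such data sets, with and without a true effect, gives empirical estimates of power and false discovery rate.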

Simulation results: increased power at the same level of FDR. For more details see paper.

Summary. The strategy is conceptually simple: –Non-negative matrix factorization is used to create groups of genes that are moving together in the dataset. –The error rate to be controlled is allocated over these groups. –Within each group, genes are tested sequentially. The strategy should be effective if there are sets of genes moving together so that group formation reflects biological reality. Areas of research: robust algorithms; multiblock NMF (e.g. relate active motifs with differentially expressed genes); speed.

Contact Information. Paul Fogel, independent consultant. Stan Young, National Institute of Statistical Sciences. Literature. Software.