Advanced Gene Selection Algorithms Designed for Microarray Datasets

Presentation transcript:

Advanced Gene Selection Algorithms Designed for Microarray Datasets
Limitation of current feature selection methods:
–They ignore gene/gene interaction: single gene based discriminative scores, correlation (redundancy) based algorithms
Virtual Gene Algorithm
–Using correlations between genes
Gene Ontology Based Gene Selection
–Integrating domain knowledge
Boost Selection
–Feature selection based on bootstraps

Virtual Gene Algorithm Gene-to-gene correlations are generally ignored by feature selection algorithms. In this work, we examine exploiting such correlations, rather than ignoring them, for the purpose of gene selection. Motivating examples from both synthetic and real datasets are shown in the next two slides.

Virtual Gene: Motivating Example

Virtual Gene Algorithm The expression level of any single gene does not capture the class label distinction. However, the combination of the expression levels of two genes captures the class label distinction well. Virtual gene: a linear combination of genes.
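The motivating idea can be illustrated with a minimal sketch on made-up data. The weights (+1, -1), the t-like score, and the toy expression values are all assumptions chosen for illustration; the slides do not state which discriminative score or weights were used.

```python
from statistics import mean, stdev

def t_score(values, labels):
    """Absolute two-sample t-like score; larger means better class separation.
    (Assumed score for illustration only.)"""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    pooled = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    return abs(mean(a) - mean(b)) / (pooled + 1e-9)

def virtual_gene(g1, g2, w1=1.0, w2=-1.0):
    """A virtual gene as a linear combination w1*g1 + w2*g2 (weights are assumed)."""
    return [w1 * x + w2 * y for x, y in zip(g1, g2)]

# Toy data: in class 0, g1 < g2; in class 1, g1 > g2.
# Each gene alone spans the same range in both classes.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
g1 = [1.0, 3.0, 5.0, 7.0, 2.0, 4.0, 6.0, 8.0]
g2 = [2.0, 4.5, 6.0, 8.5, 1.0, 2.5, 5.0, 6.5]
vg = virtual_gene(g1, g2)

single_best = max(t_score(g1, labels), t_score(g2, labels))
combined = t_score(vg, labels)
print(f"best single-gene score: {single_best:.2f}, virtual gene score: {combined:.2f}")
```

In this toy case neither gene separates the classes on its own, while their difference does, which is the situation the slide describes.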

Virtual gene definitions

Systematically examining all possible virtual genes The number of possible virtual genes that can be constructed from a set of n genes grows exponentially with n. Pairwise virtual genes are virtual genes whose constituent gene set is limited to size 2, which reduces computation enormously. A clustering algorithm is further used to reduce the number of gene pairs to be considered: it identifies genes that potentially interact or share similar functions.
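To give a feel for the size of the reduction, the following sketch compares candidate counts. It assumes, purely for illustration, that one candidate is counted per subset of two or more constituent genes; the function names and the cluster sizes are hypothetical.

```python
from math import comb

def count_all_subsets(n):
    """Subsets with at least two genes (assumed way of counting candidates)."""
    return 2 ** n - n - 1

def count_pairs(n):
    """Pairwise virtual genes: one candidate per unordered gene pair."""
    return comb(n, 2)

def count_pairs_within_clusters(cluster_sizes):
    """Pairs restricted to genes that fall in the same cluster."""
    return sum(comb(size, 2) for size in cluster_sizes)

n = 20
print(count_all_subsets(n))                       # all subsets of size >= 2
print(count_pairs(n))                             # all pairs
print(count_pairs_within_clusters([5, 5, 5, 5]))  # pairs inside 4 equal clusters
```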

Pairwise virtual gene algorithm Our experiments show that limiting pairwise virtual gene computation to genes in the same cluster greatly reduces computational complexity while preserving classification accuracy.

Pairwise virtual gene algorithm The pairwise virtual gene algorithm runs in three stages:
1. Cluster genes into gene clusters using the k-means algorithm.
2. Compute pairwise virtual genes within clusters, along with their virtual gene expressions and their discriminative power.
3. Select the top-ranked virtual gene, then degrade the discriminative power of the remaining virtual genes that share constituent genes or a cluster with it, using α and β (parameters supplied by the user).
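Stages 1 and 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the plain Lloyd-style k-means, the difference-based virtual gene (weights +1 and -1), the t-like score, and the toy data are all assumptions.

```python
from itertools import combinations
from statistics import mean, stdev

def kmeans(rows, init_centers, iters=20):
    """Plain k-means over gene expression profiles; returns one cluster id per gene."""
    centers = [list(c) for c in init_centers]
    assign = [0] * len(rows)
    for _ in range(iters):
        # Assign every gene to its nearest center (squared Euclidean distance).
        assign = [min(range(len(centers)),
                      key=lambda c: sum((x - m) ** 2 for x, m in zip(row, centers[c])))
                  for row in rows]
        # Move each center to the mean profile of its members.
        for c in range(len(centers)):
            members = [rows[i] for i, a in enumerate(assign) if a == c]
            if members:
                centers[c] = [mean(col) for col in zip(*members)]
    return assign

def t_score(values, labels):
    """Assumed discriminative score: absolute two-sample t-like statistic."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    pooled = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    return abs(mean(a) - mean(b)) / (pooled + 1e-9)

def pairwise_virtual_genes(rows, labels, assign):
    """Score the virtual gene g_i - g_j for every gene pair inside the same cluster."""
    scored = []
    for i, j in combinations(range(len(rows)), 2):
        if assign[i] != assign[j]:
            continue  # pairs across clusters are never evaluated
        vg = [a - b for a, b in zip(rows[i], rows[j])]
        scored.append((t_score(vg, labels), i, j))
    return sorted(scored, reverse=True)

labels = [0, 0, 0, 0, 1, 1, 1, 1]
rows = [
    [1.0, 3.0, 5.0, 7.0, 2.0, 4.0, 6.0, 8.0],          # gene 0
    [2.0, 4.5, 6.0, 8.5, 1.0, 2.5, 5.0, 6.5],          # gene 1, similar profile to gene 0
    [20.0, 21.0, 20.0, 21.0, 20.0, 21.0, 20.0, 21.0],  # gene 2
    [21.0, 20.0, 21.0, 20.0, 21.0, 20.0, 21.0, 20.0],  # gene 3
]
assign = kmeans(rows, init_centers=[rows[0], rows[2]])
ranked = pairwise_virtual_genes(rows, labels, assign)
print(assign, [(i, j) for _, i, j in ranked])
```

Only two of the six possible pairs are scored here, because the other four cross cluster boundaries.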

Pairwise virtual gene algorithm Parameters of the pairwise virtual gene algorithm:
–α: in the range [0,1], the likelihood that virtual genes with the same constituent genes are selected
–β: in the range [0,1], the likelihood that virtual genes whose constituent genes come from the same cluster are selected
–k: the number of virtual genes to be selected
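One plausible reading of how α and β act during selection is sketched below. It is an assumption, not the documented procedure: after each pick, the score of every remaining candidate that shares a constituent gene with the pick is multiplied by α, and the score of every other candidate from the same cluster is multiplied by β. The function name and the toy candidates are hypothetical.

```python
def select_virtual_genes(candidates, cluster_of, k, alpha, beta):
    """Greedy selection of k virtual genes.

    candidates: list of (score, gene_i, gene_j) with both genes in one cluster.
    cluster_of: mapping from gene id to cluster id.
    """
    pool = [list(c) for c in candidates]
    selected = []
    while pool and len(selected) < k:
        pool.sort(key=lambda c: c[0], reverse=True)
        _, i, j = pool.pop(0)
        selected.append((i, j))
        for cand in pool:
            if {cand[1], cand[2]} & {i, j}:
                cand[0] *= alpha  # shares a constituent gene with the pick
            elif cluster_of[cand[1]] == cluster_of[i]:
                cand[0] *= beta   # different genes, but same cluster as the pick
    return selected

cluster_of = {0: 0, 1: 0, 2: 0, 5: 0, 6: 0, 3: 1, 4: 1}
candidates = [(10.0, 0, 1), (9.0, 0, 2), (8.5, 5, 6), (8.0, 3, 4)]

diverse = select_virtual_genes(candidates, cluster_of, k=3, alpha=0.5, beta=0.9)
greedy = select_virtual_genes(candidates, cluster_of, k=3, alpha=1.0, beta=1.0)
print(diverse)
print(greedy)
```

Under this reading, α = β = 1 reduces to plain top-k ranking, while smaller values push the selection toward virtual genes built from different genes and different clusters.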

Experiments: Virtual Gene Extensive experiments are performed on three publicly available datasets: colon cancer, leukemia and multi-class cancer. We briefly discuss the performance on these datasets and report more detailed results on the colon cancer dataset. Performance is measured by cross-validation using three classifiers (SVM, KNN, DLD). The performance of four FSS algorithms is compared.

Experiments: Virtual Gene Summary of classification performance of virtual gene algorithm.

Experiments: Virtual Gene More detailed results on the colon cancer dataset:
–Study how the choice of the number of clusters in the pairwise virtual gene algorithm affects classification performance.
–Study how the choice of initial cluster centers in the pairwise virtual gene algorithm affects gene selection performance.

Experiments: Virtual Gene, number of clusters

Experiments: Virtual Gene, initial cluster centers

The limit of the pairwise virtual gene algorithm A biological process can involve more than two genes at a time, so the pairwise virtual gene algorithm may be too restrictive in this sense. Our goal is to investigate the relative expression values of biologically related genes. Using domain knowledge enables us to do that, to some degree.

Different levels of feature selection Single gene based discriminative scores ignore feature correlations completely. Exhaustive search of the power set is too slow. The GO based virtual gene algorithm uses domain knowledge to decide intelligently which gene sets to explore.

More on GO and GO annotation The Gene Ontology (GO) consists of GO terms, which form a shared biological vocabulary. GO terms are connected by is-a or is-part-of relationships. Together, the GO terms and the relationships between them form a DAG (directed acyclic graph). Genes are annotated with GO terms by GO collaborators. Gene annotations are assumed to be transitive in this thesis: if a gene is annotated by a GO term, it is also considered to be annotated by all the parent GO terms of that GO term.
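The transitivity assumption amounts to propagating each annotation up the DAG. A minimal sketch, using made-up term labels rather than real GO identifiers and hypothetical function names:

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links in the DAG."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def propagate(annotations, parents):
    """Extend every gene's annotation set with the ancestors of its terms."""
    return {gene: set(terms) | {a for t in terms for a in ancestors(t, parents)}
            for gene, terms in annotations.items()}

# Hypothetical ontology fragment: child term -> list of parent terms.
parents = {"D": ["C"], "C": ["A", "B"], "A": ["ROOT"], "B": ["ROOT"]}
annotations = {"gene1": {"D"}, "gene2": {"B"}}

full = propagate(annotations, parents)
print(sorted(full["gene1"]), sorted(full["gene2"]))
```

Because a term may have several parents, the traversal tracks visited terms so that shared ancestors such as `ROOT` are added only once.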

Domain knowledge in form of gene ontology annotations

Some definitions

Explanation of Definitions The GO distance between genes measures how close two genes are according to the information embedded in GO annotations. The gene connectivity graph shows the overall gene affinity. We want to examine correlations in gene expression between tightly related genes. Our algorithm is best demonstrated using the graph in the next slide.

GO based virtual gene algorithm First, GO distances between genes are computed. Genes that are close to each other are identified by finding cliques in gene connectivity graph. Each small gene clique is used to create a virtual gene. Virtual genes are then ranked using single gene based discriminative scores.
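The first two steps can be sketched as below. The specifics are assumptions made for illustration: GO distance is taken to be the Jaccard distance between annotation sets, two genes are connected when their distance is at most a chosen threshold, and cliques are enumerated with a basic Bron-Kerbosch search. The step that turns a clique into a virtual gene is left out because the transcript does not specify it.

```python
from itertools import combinations

def go_distance(terms_a, terms_b):
    """Assumed GO distance: Jaccard distance between two annotation sets."""
    union = terms_a | terms_b
    if not union:
        return 1.0
    return 1.0 - len(terms_a & terms_b) / len(union)

def connectivity_graph(annotated, threshold):
    """Connect two genes when their GO distance is at most `threshold`."""
    graph = {gene: set() for gene in annotated}
    for a, b in combinations(annotated, 2):
        if go_distance(annotated[a], annotated[b]) <= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def maximal_cliques(graph):
    """Enumerate maximal cliques (Bron-Kerbosch without pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & graph[v], excluded & graph[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(graph), set())
    return sorted(cliques)

# Hypothetical genes with made-up (already propagated) annotation sets.
annotated = {
    "g1": {"A", "B", "C"},
    "g2": {"A", "B", "C", "D"},
    "g3": {"A", "B", "D"},
    "g4": {"X", "Y"},
}
loose = maximal_cliques(connectivity_graph(annotated, threshold=0.5))
tight = maximal_cliques(connectivity_graph(annotated, threshold=0.3))
print(loose)
print(tight)
```

A tighter threshold yields smaller cliques, which matters because each clique becomes the constituent gene set of one virtual gene.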

Experiment Setup Two publicly available microarray expression data sets are used: colon cancer and leukemia. Three gene ontology branches are used separately. Three classifiers are used. GO annotations are extracted from Stanford's online database SOURCE.

Experiments: GO Virtual Gene Experiment result on Colon Cancer data set.

Experiment: GO Virtual Gene Experiment result on Leukemia data set.

Conclusion: GO Virtual Gene Using the domain knowledge embedded in GO annotations enables us to examine expression correlations among a large set of genes. The GO based virtual gene algorithm sometimes improves gene selection performance significantly.