Microarray Data Set

The microarray data set we are dealing with is represented as a 2D numerical array.

Characteristics of Microarray Data

- High dimensionality of gene space, low dimensionality of sample space: thousands to tens of thousands of genes, but only tens to hundreds of samples.
- Feature (gene) correlation: genes collaborate to function, and gene correlation characterizes how the system works.
- A plethora of domain knowledge: a large body of knowledge has accumulated about the genes in question.

Microarray Data Analysis

Analysis from two angles:
- sample as object, gene as attribute
- gene as object, sample/condition as attribute

Here we map the samples and the genes into 2-dimensional space. As we can see, the genes form some dense areas; if we remove the outliers and zoom in on a dense area, we find finer dense areas and further outliers. So the gene distribution has a hierarchical dense structure. The samples, by contrast, are very sparse in the high-dimensional space. Even when they are mapped into 2-dimensional space, no class structure can be detected. We can partition the samples with many hyperplanes, but cannot judge which partition is better. So the techniques that are effective for gene-based analysis are not adequate for analyzing samples. Effective and efficient sample-based analysis remains a challenging problem.
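The two viewing angles can be illustrated with a small sketch. The slide does not say how the 2-dimensional maps were produced; PCA is assumed here only as one common choice, and the data, `project_2d`, and the matrix orientation (genes as rows, samples as columns) are hypothetical.

```python
import numpy as np

def project_2d(rows):
    """Project each row of `rows` onto its first two principal components."""
    centered = rows - rows.mean(axis=0)
    # SVD of the centered matrix; u[:, i] * s[i] gives the score on component i.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :2] * s[:2]

rng = np.random.default_rng(0)
expr = rng.normal(size=(500, 20))  # synthetic stand-in: 500 genes x 20 samples

genes_2d = project_2d(expr)        # gene as object, sample/condition as attribute
samples_2d = project_2d(expr.T)    # sample as object, gene as attribute
```

Transposing the matrix is all that changes between the two views: the same projection treats whatever is in the rows as the objects.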

Supervised Analysis

1. Select training samples (hold out…)
2. Sort genes (t-test, ranking…)
3. Select informative genes (top 50 ~ 200)
4. Cluster based on informative genes

[Figure: genes g1, g2, …, g4131, g4132 against samples in Class 1 and Class 2, with binary (1/0) patterns marking the class split.]

The existing methods of selecting informative genes to cluster samples fall into two major categories: supervised analysis and unsupervised analysis. The supervised approach assumes that additional information is attached to some (or all) of the data, for example, that biological samples are labeled as diseased vs. normal. The best-known supervised method is neighborhood analysis, introduced in a Science paper published in 1999, which stimulated research on sample phenotype detection. Other supervised methods include tree harvesting, support vector machines, decision trees, genetic algorithms, artificial neural networks, and a variety of ranking-based methods. The basic steps of these supervised methods are as follows: first, select a subset of samples as the training set; then, using the phenotypes as a reference, select a small percentage of informative genes that manifest the phenotype partition within the training samples. Finally, the whole set of samples is grouped according to the selected informative genes.
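The four steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the method of any cited paper: the Welch t-statistic, the every-other-sample hold-out, the tiny 2-means routine, and all names and sizes are assumptions made for the example.

```python
import numpy as np

def welch_t(a, b):
    """Per-gene Welch t-statistic between two groups (genes x samples arrays)."""
    va = a.var(axis=1, ddof=1) / a.shape[1]
    vb = b.var(axis=1, ddof=1) / b.shape[1]
    return (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(va + vb)

def two_means(points, n_iter=20):
    """Tiny 2-means clustering; rows of `points` are the objects."""
    # Deterministic init: the first point and the point farthest from it.
    c0 = points[0]
    c1 = points[np.argmax(((points - c0) ** 2).sum(axis=1))]
    centers = np.stack([c0, c1])
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for k in range(2):
            if np.any(assign == k):
                centers[k] = points[assign == k].mean(axis=0)
    return assign

# Synthetic data: 1000 genes x 30 samples, the first 50 genes differ by class.
rng = np.random.default_rng(0)
n_genes, n_samples, n_informative = 1000, 30, 50
labels = np.repeat([0, 1], n_samples // 2)
expr = rng.normal(size=(n_genes, n_samples))
expr[:n_informative, labels == 1] += 4.0

# 1. Select training samples (here: every other sample is used for training).
train = np.arange(0, n_samples, 2)
tr_expr, tr_lab = expr[:, train], labels[train]

# 2. Sort genes by the absolute t-statistic computed on the training samples.
t = welch_t(tr_expr[:, tr_lab == 0], tr_expr[:, tr_lab == 1])
ranked = np.argsort(-np.abs(t))

# 3. Select the top-ranked genes as informative genes.
top = ranked[:50]

# 4. Cluster the whole sample set using only the informative genes.
assign = two_means(expr[top].T)
agreement = max(np.mean(assign == labels), np.mean(assign != labels))
```

Cluster ids are arbitrary, so `agreement` compares the partition to the labels under both possible id assignments.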

Phenotype Structure Mining

[Figure: expression matrix over samples 1 to 10, with genes listed as gene1, gene6, gene7, gene2, gene4, gene5, gene3 and labeled as informative genes and non-informative genes.]

An informative gene is a gene which manifests the samples' phenotype distinction.
Phenotype structure: sample partition + informative genes.

Existing Feature Selection and Extraction Algorithms

The characteristics of microarray data sets make feature selection a critical process: too many features, too few samples.

Existing feature selection/extraction algorithms include:
- Single-gene-based discriminative scores, such as the t-test score, S2N (signal-to-noise), etc.
- Redundancy-removal-based FSS algorithms.
- General feature selection algorithms (Relief family, floating selection, etc.).
- General feature extraction algorithms: PCA, SVD, FLD, etc.

We haven't seen feature extraction algorithms designed specifically for microarray data.
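As a rough illustration of the first two families, the sketch below scores genes with the signal-to-noise ratio (difference of class means divided by the sum of class standard deviations) and then applies a simple greedy correlation filter. The filter is a hypothetical stand-in for redundancy removal, not a specific published FSS algorithm, and the data, thresholds, and function names are assumptions.

```python
import numpy as np

def s2n(expr, labels):
    """Signal-to-noise score per gene: (mean0 - mean1) / (std0 + std1)."""
    a, b = expr[:, labels == 0], expr[:, labels == 1]
    return (a.mean(axis=1) - b.mean(axis=1)) / (
        a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1)
    )

def drop_redundant(expr, ranked, max_corr=0.9, k=10):
    """Walk genes in ranked order and keep a gene only if its absolute
    Pearson correlation with every gene kept so far is at most max_corr."""
    kept = []
    for g in ranked:
        if all(abs(np.corrcoef(expr[g], expr[j])[0, 1]) <= max_corr for j in kept):
            kept.append(int(g))
        if len(kept) == k:
            break
    return kept

# Synthetic data: 200 genes x 20 samples; gene 1 nearly duplicates gene 0.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 10)
expr = rng.normal(size=(200, 20))
expr[0, labels == 1] += 5.0
expr[1] = expr[0] + 0.01 * rng.normal(size=20)

ranked = np.argsort(-np.abs(s2n(expr, labels)))
kept = drop_redundant(expr, ranked)
```

Genes 0 and 1 both score highly, but only the higher-ranked of the two survives the filter, which is the point of combining a single-gene score with redundancy removal.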