Introduction to Machine Learning BMI/IBGP 730 Kun Huang Department of Biomedical Informatics The Ohio State University.


Machine Learning Statistical learning Artificial intelligence Pattern recognition Data mining

Machine Learning Supervised Unsupervised Semi-supervised Regression

Clustering and Classification
- Preprocessing
- Distance measures
- Popular algorithms (not necessarily the best ones)
- More sophisticated ones
- Evaluation
- Data mining

- Clustering or classification?
- Is training data available?
- What domain-specific knowledge can be applied?
- What preprocessing of data is needed?
- Log / data scale and numerical stability
- Filtering / denoising
- Nonlinear kernel
- Feature selection (do I need to use all the data?)
- Is the dimensionality of the data too high?

- Accuracy vs. generality
- Overfitting
- Model selection
[Figure: prediction error vs. model complexity for the training sample and the testing sample; reproduced from Hastie et al.]

How do we process microarray data (clustering)?
- Feature selection: genes, transformations of expression levels.
- Genes discovered in the class comparison (t-test). Risk: missing genes.
- Iterative approach: select genes under different p-value cutoffs, then keep the gene set that performs well under cross-validation.
- Principal components (pros and cons).
- Discriminant analysis (e.g., LDA).
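The t-test based gene selection step can be sketched as follows. This is a minimal illustration, not the course's own code: it assumes two sample classes labeled 0 and 1, uses Welch's t-statistic (one common variant of the t-test), and the gene names and expression values are made up. The cross-validation loop over cutoffs is not shown.

```python
import math

def welch_t(a, b):
    """Welch's t-statistic for two groups of expression values."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def rank_genes(expr, labels):
    """Rank genes by |t| between class-0 and class-1 samples, largest first."""
    scores = {}
    for gene, values in expr.items():
        g0 = [v for v, y in zip(values, labels) if y == 0]
        g1 = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(welch_t(g0, g1))
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: three genes measured on six samples.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "geneA": [1.0, 1.2, 0.9, 3.1, 2.9, 3.3],  # strongly different between classes
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # no difference
    "geneC": [5.0, 4.0, 6.0, 6.5, 5.5, 7.0],  # weak difference
}
ranked = rank_genes(expr, labels)
print(ranked)
```

A stricter cutoff keeps only the top of this ranking, which is where the "missing genes" risk noted on the slide comes from.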

- Dimensionality Reduction
- Principal component analysis (PCA)
- Singular value decomposition (SVD)
- Karhunen-Loeve transform (KLT)
[Figure labels: basis for P; SVD]
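As a rough illustration of PCA, the sketch below finds the first principal component as the leading eigenvector of the sample covariance matrix, using power iteration. This is an assumption-laden teaching sketch: in practice the components are usually obtained from an SVD of the centered data matrix, and the function name and toy data here are hypothetical.

```python
import math

def first_principal_component(rows, iters=200):
    """Leading eigenvector of the sample covariance matrix via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Start from an all-ones vector; this fails only if that vector happens
    # to be orthogonal to the leading eigenvector.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical data lying exactly on the line y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
pc1 = first_principal_component(data)
print(pc1)  # direction proportional to (1, 2)
```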

- Principal Component Analysis (PCA)
- Other things to consider
- Numerical balance / data normalization
- Noisy direction
- Continuous vs. discrete data
- Principal components are orthogonal to each other; biological data, however, are not
- Principal components are linear combinations of the original data
- Prior knowledge is important
- PCA is not clustering!

Visualization of Microarray Data
Multidimensional scaling (MDS)
- High-dimensional coordinates unknown
- Distances between the points are known
- The distance may not be Euclidean, but the embedding maintains the distance in a Euclidean space
- Try different dimensions (from one to ???)
- At each dimension, perform optimal embedding to minimize embedding error
- Plot embedding error (residual) vs. dimension
- Pick the knee point
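The embedding error that gets plotted against dimension can be sketched as a normalized residual between the known distances and the distances in the embedding. The formula below is one reasonable choice (similar in spirit to Kruskal's stress), not necessarily the one used in the lecture, and the embeddings are hand-picked for illustration rather than optimized.

```python
import math

def embedding_error(target, coords):
    """sqrt(sum (d_ij - e_ij)^2 / sum d_ij^2) over all point pairs, where d is
    the known distance and e is the Euclidean distance in the embedding."""
    num = den = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            e_ij = math.dist(coords[i], coords[j])
            num += (target[i][j] - e_ij) ** 2
            den += target[i][j] ** 2
    return math.sqrt(num / den)

# Known pairwise distances of a 3-4-5 right triangle.
target = [[0, 3, 4],
          [3, 0, 5],
          [4, 5, 0]]

err_2d = embedding_error(target, [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)])
err_1d = embedding_error(target, [(0.0,), (3.0,), (-4.0,)])  # hand-picked, not optimal
print(err_1d, err_2d)
```

Two dimensions reproduce the distances exactly, while one dimension cannot, which is the kind of drop that marks the knee point.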


Distance Measure (Metric?)
- What do you mean by “similar”?
- Euclidean
- Uncentered correlation
- Pearson correlation

Distance Metric
- Euclidean
[Table: expression values for the Lip1 and Ap1s1 probe sets]
d_E(Lip1, Ap1s1) = 12883

Distance Metric
- Pearson Correlation
[Table: expression values for the Lip1 and Ap1s1 probe sets]
d_P(Lip1, Ap1s1) = 0.904

Distance Metric
- Pearson Correlation
[Figure: example scatter plots for r = 1 and r = -1]
Ranges from 1 to -1.

Distance Metric
- Uncentered Correlation
[Table: expression values for the Lip1 and Ap1s1 probe sets]
d_u(Lip1, Ap1s1) = [value missing]; an angle of about 33.4°

Distance Metric
- Difference between Pearson correlation and uncentered correlation
[Table: expression values for the Lip1 and Ap1s1 probe sets]
- Pearson correlation: baseline expression possible
- Uncentered correlation: all are considered signals

Distance Metric -Difference between Euclidean and correlation

Distance Metric
- PCC means similarity; how can we transform it to a distance?
- 1-PCC
- Negative correlation may also mean “close” in a signaling pathway (1-|PCC|, 1-PCC^2)
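The three measures and the correlation-to-distance conversions above can be sketched in a few lines. The function names and the toy profiles are illustrative assumptions. Uncentered correlation is computed as the cosine of the angle between the two profiles, and Pearson correlation is the same quantity after subtracting each profile's mean.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def uncentered_correlation(x, y):
    """Cosine of the angle between x and y (no mean subtraction)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson(x, y):
    """Uncentered correlation of the mean-centered profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return uncentered_correlation([a - mx for a in x], [b - my for b in y])

def correlation_distances(r):
    """Three ways to turn a correlation r into a distance."""
    return {"1-r": 1 - r, "1-|r|": 1 - abs(r), "1-r^2": 1 - r * r}

# Hypothetical profiles: y = 10 - 2x, so the two are perfectly anti-correlated.
x = [1.0, 2.0, 3.0, 4.0]
y = [8.0, 6.0, 4.0, 2.0]
r = pearson(x, y)
u = uncentered_correlation(x, y)
dists = correlation_distances(r)
print(euclidean(x, y), r, u, dists)
```

Here Pearson correlation is -1 while uncentered correlation is positive, because the baseline level is treated as signal. Under 1-PCC the profiles are maximally far apart, while 1-|PCC| and 1-PCC^2 treat them as close.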

Supervised Learning Perceptron – neural networks
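A minimal sketch of the classic perceptron learning rule, assuming labels coded as +1 and -1 and linearly separable data. The function names, learning rate, and the toy AND data are illustrative choices, not taken from the slides.

```python
def train_perceptron(samples, labels, epochs=100, lr=1.0):
    """Perceptron rule: on each misclassified sample, move the weights
    toward (label +1) or away from (label -1) that sample."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # wrong side of (or on) the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                errors += 1
        if errors == 0:  # every sample classified correctly
            break
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy problem: logical AND, which is linearly separable.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
labels = [-1, -1, -1, 1]
w, b = train_perceptron(samples, labels)
print([predict(w, b, x) for x in samples])
```

On data that is not linearly separable (XOR, for example) this loop never reaches zero errors and simply stops after `epochs` passes.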


-Supervised Learning -Support vector machines (SVM) and Kernels -Only (binary) classifier, no data model

- Supervised Learning
- Naïve Bayesian classifier
- Bayes rule
- Maximum a posteriori (MAP)
[Equation labels: prior probability, conditional probability]
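The slide's equation did not survive extraction. For reference, the standard textbook form of Bayes rule and the naïve Bayes MAP decision is given below; the notation (class $C_k$, feature vector $x = (x_1, \dots, x_d)$) is assumed here and may differ from the original slide.

```latex
% Bayes rule: posterior = conditional probability x prior / evidence
P(C_k \mid x) = \frac{P(x \mid C_k)\, P(C_k)}{P(x)}

% Naive assumption: features are conditionally independent given the class
P(x \mid C_k) = \prod_{i=1}^{d} P(x_i \mid C_k)

% MAP decision: choose the class with the largest posterior
\hat{k} = \arg\max_{k} \; P(C_k) \prod_{i=1}^{d} P(x_i \mid C_k)
```

The evidence $P(x)$ is the same for every class, so it can be dropped from the maximization.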

- Dimensionality reduction: linear discriminant analysis (LDA)
[Figure: classes A and B with projection direction w; from S. Wu’s website]

Linear Discriminant Analysis
[Figure: classes A and B with projection direction w; from S. Wu’s website]
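A minimal two-class, two-dimensional sketch of Fisher's linear discriminant, in which the projection direction is w = S_w^{-1} (m_A - m_B), with S_w the within-class scatter matrix and m_A, m_B the class means. The helper names and toy points are hypothetical, and the 2x2 inverse is written out by hand to avoid dependencies.

```python
def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, m):
    """2x2 scatter matrix: sum of outer products of deviations from the mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class_a, class_b):
    """w = Sw^{-1} (mean_a - mean_b); assumes Sw is invertible."""
    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

# Hypothetical classes A and B.
class_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (4.0, 5.0)]
class_b = [(4.0, 1.0), (5.0, 2.0), (6.0, 2.0), (7.0, 4.0)]
w = fisher_direction(class_a, class_b)

# Project each point onto w; the two classes separate along this axis.
proj_a = [w[0] * p[0] + w[1] * p[1] for p in class_a]
proj_b = [w[0] * p[0] + w[1] * p[1] for p in class_b]
print(w, proj_a, proj_b)
```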

-Supervised Learning - Support vector machines (SVM) and Kernels -Kernel – nonlinear mapping
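The "kernel as nonlinear mapping" idea can be illustrated with a degree-2 polynomial kernel: evaluating k(x, y) = (x . y)^2 in the original 2-D space gives the same number as an ordinary dot product after an explicit nonlinear map into 3-D. This is a standard textbook example chosen for illustration; the slides do not say which kernel was discussed.

```python
import math

def poly_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel k(x, y) = (x . y)^2 for 2-D inputs."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def feature_map(x):
    """Explicit map phi such that phi(x) . phi(y) = (x . y)^2."""
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]

x = [1.0, 2.0]
y = [3.0, -1.0]
k_value = poly_kernel(x, y)
explicit = sum(a * b for a, b in zip(feature_map(x), feature_map(y)))
print(k_value, explicit)
```

The kernel never builds the mapped vectors, which is why an SVM can work in a high-dimensional feature space at the cost of a dot product in the original space.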

How do we use microarray?
- Profiling
- Clustering
- Cluster to detect patient subgroups
- Cluster to detect gene clusters and regulatory networks

How do we process microarray data (clustering)? - Unsupervised Learning – Hierarchical Clustering

How do we process microarray data (clustering)? -Unsupervised Learning – Hierarchical Clustering Single linkage: The linking distance is the minimum distance between two clusters.

How do we process microarray data (clustering)? -Unsupervised Learning – Hierarchical Clustering Complete linkage: The linking distance is the maximum distance between two clusters.

How do we process microarray data (clustering)? -Unsupervised Learning – Hierarchical Clustering Average linkage/UPGMA: The linking distance is the average of all pair-wise distances between members of the two clusters. Since all genes and samples carry equal weight, the linkage is an Unweighted Pair Group Method with Arithmetic Means (UPGMA).

How do we process microarray data (clustering)?
- Unsupervised Learning – Hierarchical Clustering
- Single linkage: prone to chaining and sensitive to noise
- Complete linkage: tends to produce compact clusters
- Average linkage: sensitive to distance metric
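The three linkage rules differ only in how the distance between two clusters is computed from the pairwise point distances. The sketch below is a naive agglomerative procedure for illustration, assuming Euclidean distance and made-up points; production code would normally use a library routine.

```python
import math

def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters under the chosen linkage rule."""
    d = [math.dist(p, q) for p in cluster_a for q in cluster_b]
    if method == "single":    # minimum pairwise distance
        return min(d)
    if method == "complete":  # maximum pairwise distance
        return max(d)
    if method == "average":   # mean of all pairwise distances (UPGMA)
        return sum(d) / len(d)
    raise ValueError(f"unknown linkage: {method}")

def agglomerate(points, n_clusters, method="average"):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = linkage_distance(clusters[i], clusters[j], method)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for method in ("single", "complete", "average"):
    print(method, agglomerate(points, 2, method))
```

On well-separated data like this all three rules agree. They diverge on elongated or noisy data, where single linkage tends to chain.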


Dendrograms
- Distance: the height of each horizontal line represents the distance between the two groups it merges.
- Order: open-source R uses the convention that the tighter clusters are on the left. Others have proposed using expression values, loci on chromosomes, and other ranking criteria.

- Unsupervised Learning - K-means
- Vector quantization
- K-D trees
- Need to try different K; sensitive to initialization

- Unsupervised Learning - K-means
[cidx, ctrs] = kmeans(yeastvalueshighexp, 4, 'dist', 'corr', 'rep', 20);  % 4 is K; 'corr' is the metric

- Unsupervised Learning - K-means
- Number of classes K needs to be specified
- Does not always converge
- Sensitive to initialization
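A minimal sketch of Lloyd's k-means with random restarts, which is one common way to reduce the sensitivity to initialization. It assumes Euclidean distance rather than the correlation metric from the MATLAB example, and the function names and toy data are illustrative.

```python
import math
import random

def assign(points, centers):
    """Group points by their nearest center."""
    groups = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        groups[nearest].append(p)
    return groups

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm starting from k randomly chosen data points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = assign(points, centers)
        # New center = mean of the group; an empty group keeps its old center.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:  # assignments are stable
            break
        centers = new_centers
    return centers, assign(points, centers)

def wcss(centers, groups):
    """Within-cluster sum of squared distances."""
    return sum(math.dist(p, c) ** 2 for c, g in zip(centers, groups) for p in g)

def kmeans_restarts(points, k, restarts=20):
    """Run k-means from several seeds and keep the tightest result."""
    runs = [kmeans(points, k, seed=s) for s in range(restarts)]
    return min(runs, key=lambda run: wcss(*run))

# Hypothetical data: two square blobs.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centers, groups = kmeans_restarts(points, 2)
print(sorted(centers))
```

The `restarts` argument plays the same role as the replicate count in the MATLAB call, and K still has to be chosen by the user.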


- Unsupervised Learning
- Self-organized maps (SOM)
- Neural network based method
- Originally used as a method for visualizing (embedding) high-dimensional data
- Also related to vector quantization
- The idea is to map close data points to the same discrete level

- Issues
- Lack of consistency or representative features (5.3 TP PTEN doesn’t make sense)
- Data structure is missing
- Not robust to outliers and noise
D’Haeseleer 2005 Nat. Biotechnol 23(12):

-Model-based clustering methods (Han) Pan et al. Genome Biology :research doi: /gb research0009

-Structure-based clustering methods

- Data Mining is searching for knowledge in data
- Knowledge mining from databases
- Knowledge extraction
- Data/pattern analysis
- Data dredging
- Knowledge Discovery in Databases (KDD)

- The process of discovery: interactive + iterative → scalable approaches

Popular Data Mining Techniques
- Clustering: the most dominant technique in use for gene expression analysis in particular and bioinformatics in general.
  - Partition data into groups of similarity
- Classification:
  - Supervised version of clustering → technique to model class membership → can subsequently classify unseen data.
- Frequent Pattern Analysis
  - A method for identifying frequently recurring patterns (structural and transactional).
- Temporal/Sequence Analysis
  - Model temporal data → wavelets, FFT etc.
- Statistical Methods
  - Regression, discriminant analysis

Summary
- A good clustering method will produce high-quality clusters with
  - high intra-class similarity
  - low inter-class similarity
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
- Other metrics include: density, information entropy, statistical variance, radius/diameter
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
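One simple way to put numbers on "high intra-class similarity, low inter-class similarity" is to compare the mean pairwise distance within clusters against the mean pairwise distance between clusters. This is only an illustrative measure under a Euclidean-distance assumption, with hypothetical function names and toy data. Each cluster is assumed to hold at least two points.

```python
import math
from itertools import combinations

def mean_intra_distance(clusters):
    """Mean distance between pairs of points that share a cluster."""
    d = [math.dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(d) / len(d)

def mean_inter_distance(clusters):
    """Mean distance between pairs of points in different clusters."""
    d = [math.dist(p, q)
         for a, b in combinations(clusters, 2) for p in a for q in b]
    return sum(d) / len(d)

# Two ways of clustering the same four hypothetical points.
good = [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
bad = [[(0, 0), (5, 5)], [(0, 1), (5, 6)]]

for name, clustering in (("good", good), ("bad", bad)):
    print(name, mean_intra_distance(clustering), mean_inter_distance(clustering))
```

The good clustering has the smaller within-cluster distance and the larger between-cluster distance.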

Recommended Literature
1. Bioinformatics – The Machine Learning Approach by P. Baldi & S. Brunak, 2nd edition, The MIT Press
2. Data Mining – Concepts and Techniques by J. Han & M. Kamber, Morgan Kaufmann Publishers
3. Pattern Classification by R. Duda, P. Hart and D. Stork, 2nd edition, John Wiley & Sons
4. The Elements of Statistical Learning by T. Hastie, R. Tibshirani, J. Friedman, Springer-Verlag, 2001