Research in Computational Molecular Biology , Vol (2008)

Slides:

Advertisements

Similar presentations

Context-based object-class recognition and retrieval by generalized correlograms by J. Amores, N. Sebe and P. Radeva Discussion led by Qi An Duke University.

Advertisements

Cluster Analysis: Basic Concepts and Algorithms

Learning Trajectory Patterns by Clustering: Comparative Evaluation Group D.

Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.

Data Mining Feature Selection. Data reduction: Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same.

Principal Component Analysis (PCA) for Clustering Gene Expression Data K. Y. Yeung and W. L. Ruzzo.

Online Social Networks and Media. Graph partitioning The general problem – Input: a graph G=(V,E) edge (u,v) denotes similarity between u and v weighted.

Introduction to Bioinformatics

1 Multivariate Statistics ESM 206, 5/17/05. 2 WHAT IS MULTIVARIATE STATISTICS? A collection of techniques to help us understand patterns in and make predictions.

Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities By Kevin Chen, Lior Pachter PLoS Computational Biology, 2005 David Kelley.

Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol

Prénom Nom Document Analysis: Data Analysis and Clustering Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.

Efficient Nearest-Neighbor Search in Large Sets of Protein Conformations Fabian Schwarzer Itay Lotan.

Laurent Itti: CS599 – Computational Architectures in Biological Vision, USC Lecture 7: Coding and Representation 1 Computational Architectures in.

Utilizing Fuzzy Logic for Gene Sequence Construction from Sub Sequences and Characteristic Genome Derivation and Assembly.

Facial Recognition CSE 391 Kris Lord.

Principal Component Analysis (PCA) for Clustering Gene Expression Data K. Y. Yeung and W. L. Ruzzo.

Genetic network inference: from co-expression clustering to reverse engineering Patrik D’haeseleer,Shoudan Liang and Roland Somogyi.

Computational Metagenomics: Algorithms for Understanding the "Unculturable" Microbial Majority Computational Metagenomics: Algorithms for Understanding.

Data Reduction. 1.Overview 2.The Curse of Dimensionality 3.Data Sampling 4.Binning and Reduction of Cardinality.

Graph-based Text Classification: Learn from Your Neighbors Ralitsa Angelova ， Gerhard Weikum : Max Planck Institute for Informatics Stuhlsatzenhausweg.

CompostBin : A DNA composition based metagenomic binning algorithm Sourav Chatterji *, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen UC Davis

Metabolic Network Inference from Multiple Types of Genomic Data Yoshihiro Yamanishi Centre de Bio-informatique, Ecole des Mines de Paris.

A NOVEL METHOD FOR COLOR FACE RECOGNITION USING KNN CLASSIFIER

Computational Approaches for Biomarker Discovery SubbaLakshmiswetha Patchamatla.

Support Vector Machines and Gene Function Prediction Brown et al PNAS. CS 466 Saurabh Sinha.

MACHINE LEARNING 7. Dimensionality Reduction. Dimensionality of input Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)

A comparative approach for gene network inference using time-series gene expression data Guillaume Bourque* and David Sankoff *Centre de Recherches Mathématiques,

Computational Biology Group. Class prediction of tumor samples Supervised Clustering Detection of Subgroups in a Class.

1 Microarray Clustering. 2 Outline Microarrays Hierarchical Clustering K-Means Clustering Corrupted Cliques Problem CAST Clustering Algorithm.

Example Apply hierarchical clustering with d min to below data where c=3. Nearest neighbor clustering d min d max will form elongated clusters!

A Robust and Accurate Binning Algorithm for Metagenomic Sequences with Arbitrary Species Abundance Ratio Zainab Haydari Dr. Zelikovsky Summer 2011.

Canadian Bioinformatics Workshops

Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models Arthur Brady and Steven L. Salzberg Nature Methods 6(9):

MEGAN analysis of metagenomic data Daniel H. Huson, Alexander F. Auch, Ji Qi, et al. Genome Res

Mustafa Gokce Baydogan, George Runger and Eugene Tuv INFORMS Annual Meeting 2011, Charlotte A Bag-of-Features Framework for Time Series Classification.

Clustering [Idea only, Chapter 10.1, 10.2, 10.4].

Gene prediction in metagenomic fragments: A large scale machine learning approach Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern.

Graph-based WSD の続き DMLA /7/10 小町守.

Flow cytometry data analysis: SPADE for cell population identification and sample clustering Narahara.

Naifan Zhuang, Jun Ye, Kien A. Hua

Big data classification using neural network

Principal Component Analysis (PCA)

Cohesive Subgraph Computation over Large Graphs

Metagenomic Species Diversity.

Compressive Coded Aperture Video Reconstruction

Preprocessing Data Rob Schmieder.

Quality Control & Preprocessing of Metagenomic Data

Metagenomic assembly Cedric Notredame

Table 1. Advantages and Disadvantages of Traditional DM/ML Methods

Prioritize Organism Selection for the Genomic Encyclopedia Project to Optimize Phylogenetic Diversity Dongying Wu April 10, 2007.

کاربرد نگاشت با حفظ تنکی در شناسایی چهره

Machine Learning Basics

3.1 Clustering Finding a good clustering of the points is a fundamental issue in computing a representative simplicial complex. Mapper does not place any.

K Nearest Neighbor Classification

Metagenomics Image: Iverson et al. 2012, Science.

Historic Document Image De-Noising using Principal Component Analysis (PCA) and Local Pixel Grouping (LPG) Han-Yang Tang1, Azah Kamilah Muda1, Yun-Huoy.

Learning with information of features

X.1 Principal component analysis

SEG5010 Presentation Zhou Lanjun.

Multidimensional Scaling

Chapter 19 Molecular Phylogenetics

CS 394C: Computational Biology Algorithms

Example usage of mockrobiota MC resource for marker gene and metagenome sequencing pipelines. Example usage of mockrobiota MC resource for marker gene.

Algorithms for Inferring the Tree of Life

Data Pre-processing Lecture Notes for Chapter 2

Toward Accurate and Quantitative Comparative Metagenomics

Presentation transcript:

Research in Computational Molecular Biology , Vol. 4955 (2008) CompostBin: A DNA composition-based algorithm for binning environmental shotgun reads Sourav Chatterji et al. Research in Computational Molecular Biology , Vol. 4955 (2008)

Outline Background CompostBin Principle Component Analysis CompostBin Algorithm Feature Extraction by PCA Bisection by Normalized Cuts Computation of Similarity Measure Semi-supervision Using Phylogenetic Markers Normalized Cut and its approximation Generalization to Multiple Bins Test Data Sets Discussion

Background A metagenomic project generally starts by using Whole Genome Shotgun sequencing followed by assembling sequence reads, gene prediction, functional annotation and metabolic pathway construction A necessary step in metagenomics is called binning. The binning process sorts contigs and scaffolds of multiple species into phylogenetically related groups (bins)

Background The taxonomic resolution achievable by this grouping depends on both the binning method and the complexity of the community A variety of approaches have been developed for binning: Assembly Phylogenetic analysis Database search Alignment with reference genome DNA composition metrics

Background Most current binning methods suffer from two major limitations They perform poorly on short sequences To overcome, almost all current binning methods are applied to assembled contigs They require closely related reference genomes for training/alignment To overcome, CompostBin

CompostBin Can bin raw sequence reads into taxon-specific bins with high accuracy Does not require training on currently available genomes Like other composition-based methods, it seeks to distinguish different genomes based on their characteristic DNA compositional patterns, termed “signature” Kmers

CompostBin Poor performance of many composition-based binning algorithms in shorter fragments is caused by the noise associated with the high dimensionality of the feature space 4K possible oligonucleotides (dimensions) of length K in DNA sequences For instance, if one looks at the frequency of hexamers in 2kb fragments, the dimensionality of the feature space is twice the length of the sequenced fragments -> sparse vector space

CompostBin CompostBin uses Principal Component Analysis (PCA) to deal with the high dimensionality of the feature vector. Instead of treating all components of the noisy feature space equally, we extract the most “important” directions and use these components for distinguishing between taxa Binning is further guided by information from phylogenetic markers

CompostBin This study demonstrates that shrewdly analyzed Kmer frequency data from short sequences can also provide a signature

Principle Component Analysis A mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components PCA supplies the user with a lower-dimensional picture, a "shadow" of the original dataset

CompostBin Algorithm Inputs Raw sequence reads Mate pair information Taxonomic assignment of reads containing phylogenetic markers Number of bins in the output Mate pairs are joined together and treated as a single sequence because they are highly likely to have originated from the same organism

CompostBin Algorithm - Feature Extraction by PCA Nx4096 feature matrix A: Each sequence being analyzed is initially represented as a 4096-dimensional feature vector PCA is then used to decrease the noise inherent in this high-dimensional data set by identifying the principal components of the feature matrix A Use the first three principal components

CompostBin Algorithm - Feature Extraction by PCA Determining the number of principal components required for analysis is crucial to the success of the algorithm Too few components and some important information may be lost Too many components increases the noise in the data unnecessarily Authors claim that when using PCA to bin sequences, use of just the first three principal components is adequate to separate sequences from different species

CompostBin Algorithm - Bisection by Normalized Cuts

CompostBin Algorithm - Computation of Similarity Measure 6-nearest neighbor graph G(V,E,W) The vertices in V correspond to the sequences An edge (v1,v2)ЄE between two sequences v1 and v2 exists only if one of the sequences is a 6-nearest neighbor of the other in Euclidean space W is the set of weights of edges

CompostBin Algorithm - Semi-supervision Using Phylogenetic Markers Taxonomic information from 31 phylogenetic markers is used as a label for each input sequence, with each label corresponding to a single taxonomic group Two vertices v1 and v2 are connected with w(v1,v2)=1 if the corresponding sequences are from the same taxonomic group, and the edge is removed, e.g. w(v1,v2)=0, if they are from different groups

CompostBin Algorithm - Normalized Cut and its approximation Association between two subsets X and Y of V The normalized cut algorithm bisects V into two disjoint subsets U and Ū such that the association within each cluster is large while the association between clusters is small

CompostBin Algorithm - Normalized Cut and its approximation The exact solution to this problem is NP-hard. An approximate solution is used

CompostBin Algorithm - Generalization to Multiple Bins To generalize the algorithm for more than two bins Uses PCA and the normalized cut algorithm iteratively

Test Data Sets ReadSim is used to simulate Sanger sequences from known genomes Took into account several variables that affect the difficulty of binning the number of species in the sample their relative abundance their phylogenetic diversity the differences in GC content between genomes One well accepted metagenomic data set that contains sequence reads obtained from gut bacteriocytes of the glassy-winged sharpshooter, Homalodisca coagulata

Discussion The frequencies of hexamers (oligonucleotides of length 6) are used as the metric for analysis Since the length of the feature vector for analyzing Kmers is O(4K), both the memory and the CPU requirements of the algorithm become infeasible for large data sets when K is greater than 6 Using hexamers is biologically advantageous in that, being the length of two codons, their frequencies can capture biases in codon usage

Discussion CompostBin is based primarily on DNA composition metrics and, like all such methods, it cannot distinguish between organisms unless their DNA compositions are sufficiently divergent Parameters Number of bins Number of nearest neighbors Number of principle components