A Robust and Accurate Binning Algorithm for Metagenomic Sequences with Arbitrary Species Abundance Ratio Zainab Haydari Dr. Zelikovsky Summer 2011.



Metagenomics (environmental genomics). Metagenomes: genetic material recovered directly from environmental samples.

Metagenomics (Data Analysis) A major difficulty of metagenomics is that most bacteria (up to 99%) found in environmental samples are unknown and cannot be cultivated and isolated under laboratory conditions. One possible solution is to directly sequence DNA fragments of multiple species obtained from the mixed environmental DNA sample. This requires identification and taxonomic characterization of the DNA fragments that result from sequencing a sample of mixed species → Binning: group DNA fragments from similar species.

History Similarity-based methods: align each DNA fragment to known reference genomes. Limited by the availability of known microorganism genomes (<1%).

History Composition-based methods: group DNA fragments using genetic features such as genome structure or composition. Low availability and reliability of taxonomic markers. Some species may share multiple markers with other species.

What’s the best? The more promising method is to use unsupervised binning algorithms based on the occurrence frequencies of l-mers. They rely on the observation that the l-mer distributions of fragments from the same genome are more similar to each other than the l-mer distributions of fragments from two unrelated species.

l-mer Based Methods: Tetra (Teeling et al., 2004); MetaCluster (Yang et al., 2009); MetaCluster 2.0 (Yang et al., 2010); AbundanceBin (Wu and Ye, 2010); MetaCluster 3.0 (Yang et al., 2011)

MetaCluster 3.0 Based on l-mer frequencies. Two steps: top-down clustering, then bottom-up merging.

MetaCluster 3.0

l-mer frequency The DNA composition features of each DNA fragment are represented by the l-mer frequencies of the fragment → 4^l kinds of l-mers. DNA feature vector: [f_1, f_2, …, f_n(l)], where f_w is the frequency of each l-mer w and n(l) is the number of different l-mers.
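
A minimal Python sketch of how such a feature vector might be built. It assumes overlapping windows, the alphabet A, C, G, T, lexicographic ordering of the entries, and n(l) = 4^l; the function name `lmer_frequencies` is illustrative and not from the slides. Some binning tools also merge an l-mer with its reverse complement, which this sketch does not do.

```python
from itertools import product

def lmer_frequencies(fragment, l=4):
    """Return the l-mer frequency vector of a DNA fragment.

    Entries follow the lexicographic order of all 4**l l-mers over ACGT.
    Windows containing other symbols (for example N) are skipped.
    """
    alphabet = "ACGT"
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=l))}
    counts = [0] * len(index)
    for start in range(len(fragment) - l + 1):
        pos = index.get(fragment[start:start + l].upper())
        if pos is not None:
            counts[pos] += 1
    total = sum(counts)
    # Normalize counts to relative frequencies (all zeros if no valid window).
    return [c / total for c in counts] if total else counts

vec = lmer_frequencies("ACGTACGTAC", l=2)
```

Here `vec` has 16 entries (one per 2-mer) that sum to 1.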

l-mer frequency l=4 is best for DNA fragment sizes of 1000 to 10,000 (Chor et al & Zhou et al). Observation: the l-mer distributions of DNA substrings (fragments) from the same genome are similar → compare the l-mer distributions of reads → Spearman distance distribution.

Spearman distance distribution The difference between the l-mer distributions of two fragments, A: (a_1, a_2, …, a_n(l)) and B: (b_1, b_2, …, b_n(l)). Let r(a_i) be the rank of a_i in the sorted list of the a_i’s and r(b_i) be the rank of b_i in the sorted list of the b_i’s. The smaller the value of the metric, the more similar the vectors are. For vectors of size k, the distance value can range between 0 and k(k+1).
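
The distance formula itself did not survive in this transcript. As an illustration only, the sketch below assumes a footrule-style rank distance (the sum of absolute rank differences), which is one common way to compare two vectors by rank; the slide's exact definition, and therefore its range, may differ. The helper names and the tie-breaking rule are assumptions.

```python
def ranks(values):
    """Rank of each entry (1 = smallest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0] * len(values)
    for r, i in enumerate(order, start=1):
        rank[i] = r
    return rank

def rank_distance(a, b):
    """Sum of absolute rank differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(ranks(a), ranks(b)))
```

Because only ranks are compared, a single unusually large entry moves the distance no more than any other change in ordering would.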

Spearman distance distribution Benefits: compared with other distance metrics that rely on the exact value of each entry in the feature vectors, it is less sensitive to entries with unexpectedly large values and gives a more global view of the distance between two feature vectors with respect to all the entries.

Spearman distance distribution Observations from an empirical study of 1000 genomes: → The distributions of Spearman distances between the l-mer distributions of two fragments from the same species, and of two fragments from species of different families, can each be approximated by a normal distribution. → There is a significant difference between these two distributions.

Spearman distance distribution

Top-down clustering K-median algorithm: cluster the fragments into k’ clusters. Repeatedly assign each feature vector to the closest cluster and select a feature vector as the center with the following objective function:

Top-down clustering K-median algorithm → greedy algorithm. It is repeated several dozen times with different initial cluster centers → select the result with the minimum objective function value.
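
The restart strategy can be sketched as follows. The objective function is not preserved in this transcript, so the sketch assumes a common choice: the sum of distances from each vector to its assigned center, with each center restricted to be one of the input vectors. The function names, the placeholder L1 distance, and the default of 30 restarts are assumptions for illustration.

```python
import random

def l1(u, v):
    """Placeholder distance; swap in the rank-based distance used for binning."""
    return sum(abs(x - y) for x, y in zip(u, v))

def assign(vectors, centers, dist):
    """Label each vector with the index of its closest center."""
    return [min(range(len(centers)), key=lambda c: dist(v, vectors[centers[c]]))
            for v in vectors]

def k_median(vectors, k, dist=l1, restarts=30, max_iter=100, seed=0):
    """Greedy k-median with random restarts; returns (cost, labels, centers)."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        centers = rng.sample(range(len(vectors)), k)  # indices of center vectors
        for _ in range(max_iter):
            labels = assign(vectors, centers, dist)
            new_centers = []
            for c in range(k):
                members = [i for i, lab in enumerate(labels) if lab == c]
                if not members:  # keep the old center for an empty cluster
                    new_centers.append(centers[c])
                else:  # member with the smallest total distance to the others
                    new_centers.append(min(
                        members,
                        key=lambda m: sum(dist(vectors[m], vectors[j]) for j in members)))
            if new_centers == centers:
                break
            centers = new_centers
        labels = assign(vectors, centers, dist)
        cost = sum(dist(v, vectors[centers[lab]]) for v, lab in zip(vectors, labels))
        if best is None or cost < best[0]:
            best = (cost, labels, centers)
    return best

points = [[0, 0], [0, 1], [10, 10], [10, 11]]
cost, labels, centers = k_median(points, k=2)
```

Each restart is a local search that can stall in a poor solution, which is why keeping the lowest-cost run over many starts helps.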

Top-down clustering (k’ determination) Distance between each fragment and the center from the same species: Distance between each fragment and the center from a different species:

Top-down clustering (k’ determination) The expected number of false positives: Since the expected number of false positives decreases as k’ increases, MetaCluster 3.0 increases k’ until the expected number of false positives in a cluster is ≤ tn/k’. Setting t = 5% gives an expected accuracy over 95% for the first phase → k’ can be much larger than the number of species → bottom-up merging.

Bottom-up merging Goal: merge clusters from the same species into one cluster. Merging is based on an intercluster similarity measure, the intercluster distance: the average of all distances between pairs of DNA fragments A in C_1 and B in C_2.

Bottom-up merging Known k: greedily merge the pair of clusters with the minimum intercluster distance until k clusters remain.
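
A small sketch of this greedy merge, assuming the intercluster distance is the average over all cross-cluster pairs as described on the earlier slide. The function names are illustrative, and `dist` stands for whatever fragment distance is in use.

```python
from itertools import combinations

def intercluster_distance(c1, c2, dist):
    """Average distance over all pairs (a, b) with a in c1 and b in c2."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def merge_to_k(clusters, k, dist):
    """Repeatedly merge the closest pair of clusters until k remain."""
    clusters = [list(c) for c in clusters]  # work on a copy
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: intercluster_distance(clusters[p[0]], clusters[p[1]], dist))
        clusters[i].extend(clusters[j])
        del clusters[j]  # safe because i < j
    return clusters

merged = merge_to_k([[0.0], [0.5], [10.0], [10.4]], 2, lambda a, b: abs(a - b))
```

With the toy one-dimensional data, the two points near 10 are merged first, then the two points near 0.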

Bottom-up merging Unknown k: This case relies on the observation that the Spearman distance between two fragments from the same species is smaller than that between fragments from different species. Let the average intracluster distances of C_1 and C_2 be d_1 and d_2. Merge the two clusters if and only if the intercluster distance dist(C_1, C_2) is similar to d_1 and d_2: α·dist(C_1, C_2) ≤ average(d_1, d_2) for some threshold α.
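
The merge test can be written directly from the inequality. This sketch assumes the intracluster distance is the average over all pairs within a cluster (0 for a single-element cluster); the value α = 0.9 in the usage lines is a made-up number for illustration, not the threshold derived in the slides.

```python
from itertools import combinations

def intercluster_distance(c1, c2, dist):
    """Average distance over all pairs (a, b) with a in c1 and b in c2."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def intracluster_distance(cluster, dist):
    """Average distance over all pairs inside one cluster (0.0 for singletons)."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def should_merge(c1, c2, dist, alpha):
    """True when alpha * dist(C_1, C_2) <= average(d_1, d_2)."""
    d1 = intracluster_distance(c1, dist)
    d2 = intracluster_distance(c2, dist)
    return alpha * intercluster_distance(c1, c2, dist) <= (d1 + d2) / 2

absdiff = lambda a, b: abs(a - b)
same = should_merge([0.0, 1.0], [0.5, 1.5], absdiff, alpha=0.9)
different = should_merge([0.0, 1.0], [10.0, 11.0], absdiff, alpha=0.9)
```

In the toy data, the two overlapping clusters pass the test while the two far-apart clusters do not.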

Bottom-up merging Calculating threshold

Results Source: 240 randomly selected bacterial genomes (complete references from NCBI) were used to generate 1080 test datasets. Comparison between MetaCluster 2.0, MetaCluster 3.0 and AbundanceBin. Different species with abundance ratios varying from 1:1 to 1:24.

Result Overall performance of all datasets

Result i) Family: DNA fragments from the same order but different families ii) Order: DNA fragments from the same class but different orders iii) Class: DNA fragments from different classes

Result Performance for Class, Order and Family datasets

A novel method Based on the Shannon entropy. Shannon entropy: the information content of a signal. Entropy is a measure of “disorder” in a signal and reflects its “diversity”.

Algorithm For each read: calculate the Shannon entropy of its k-mers for k starting from 1, then choose the largest Shannon entropy. Sort the reads by their Shannon entropy. Cluster the reads based on the entropy.

Example A DNA read: CACGACACGCCATTGACTAGCAGTGTCTGATGCAGAAACC The entropy is calculated for l-mers: S_1, S_2, S_3, S_4, … Reads that come from the same species are expected to have similar entropy distributions.
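
The per-read steps can be sketched on the read above. This is an illustration under stated assumptions, not the author's implementation: l-mers are taken from overlapping windows, probabilities are relative l-mer frequencies, the logarithm is base 2, and the profile stops at l = 4. All function names are hypothetical.

```python
from collections import Counter
from math import log2

def shannon_entropy(read, l):
    """Shannon entropy (in bits) of the l-mer distribution of one read."""
    lmers = [read[i:i + l] for i in range(len(read) - l + 1)]
    if not lmers:
        return 0.0
    n = len(lmers)
    return -sum(c / n * log2(c / n) for c in Counter(lmers).values())

def entropy_profile(read, max_l=4):
    """Entropies S_1, S_2, ..., S_max_l of a read."""
    return [shannon_entropy(read, l) for l in range(1, max_l + 1)]

read = "CACGACACGCCATTGACTAGCAGTGTCTGATGCAGAAACC"
profile = entropy_profile(read)

# Sort a collection of reads by their largest entropy value, as in the
# algorithm slide; the second read is a made-up low-complexity example.
reads = [read, "AAAAAAAAAAAAAAAAAAAA"]
reads_sorted = sorted(reads, key=lambda r: max(entropy_profile(r)))
```

A low-complexity read such as a run of a single base has entropy 0 for every l, so it sorts before the example read.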

Implementation I have implemented the algorithm. The code calculates the entropy for different l-mers automatically. The entropy is defined as:
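
The formula is missing from this transcript. The standard Shannon entropy, applied to the l-mers of a read, would read as follows; the base-2 logarithm, the use of overlapping windows, and the symbols S_l, p_w, c_w and L are assumptions chosen to match the S_1, S_2, … notation of the example slide.

```latex
% c_w : number of occurrences of the l-mer w in the read
% L   : read length, so the read contains L - l + 1 overlapping l-mers
p_w = \frac{c_w}{L - l + 1},
\qquad
S_l = -\sum_{w \,:\, c_w > 0} p_w \log_2 p_w
```

For instance, a read in which A, C, G and T each make up a quarter of the bases has S_1 = -4 · (1/4) · log2(1/4) = 2 bits, the maximum possible for l = 1.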

Results I used part of a dataset that contains 2000 DNA fragments from "Acinetobacter_baumannii_SDF". The length of each DNA fragment is 1000 bp.

Result The Shannon entropy for different k for 2 reads from the same species: