Analysis and comparison of very large metagenomes with fast clustering and functional annotation. Weizhong Li, BMC Bioinformatics 2009. Presented by Chuan-Yih Yu.


Analysis and comparison of very large metagenomes with fast clustering and functional annotation Weizhong Li, BMC Bioinformatics 2009 Presented by Chuan-Yih Yu

Outline Rapid Analysis of Multiple Metagenomes with a Clustering and Annotation Pipeline (RAMMCAP) – Goal – Methodology – Metagenome comparison – Conclusion Discussion

Goal Reduce computation time – Global Ocean Survey (GOS): 1 M CPU hours ≈ 114 yrs on a single CPU Discover novel genes or protein families – Metagenomic Profiling of Nine Biomes (BIOME): ~90% of sequences unknown – GOS: doubled the number of protein families Compare metagenome data – Clustering-based – Protein family-based

RAMMCAP

RNA

RAMMCAP

Meta_RNA & tRNAscan-SE High sensitivity, low specificity (except 16S) "Identification of ribosomal RNA genes in metagenomic fragments.", Huang, Y., Gilna, P. & Li, W. Z. Bioinformatics "tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence.", Lowe, T.M. and Eddy, S.R. Nucleic Acids Res

CLUSTERING CD-HIT

RAMMCAP

CD-HIT Greedy incremental clustering algorithm Avoids full pairwise alignment where possible Short-word (2~5 residues) index table "Clustering of highly homologous sequences to reduce the size of large protein database", Weizhong Li, et al. Bioinformatics, (2001) "Tolerating some redundancy significantly speeds up clustering of large protein databases", Weizhong Li, et al. Bioinformatics, (2002) "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li, et al. Bioinformatics, (2006).
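The greedy incremental idea with a short-word filter can be sketched roughly as follows. This is an illustration only, not CD-HIT's actual code: the `identity` function is a toy ungapped, position-by-position comparison (an assumption made to keep the example short), and the word-count bound holds only under that toy comparison. All function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count the overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def identity(a, b):
    """Toy identity: matching positions over the shorter length, no gaps."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


def greedy_cluster(seqs, threshold=0.9, k=2):
    """Greedy incremental clustering, longest sequence first.

    Each sequence joins the first representative it matches;
    otherwise it becomes a new representative.
    """
    reps = []      # (representative, its word counts)
    clusters = {}  # representative -> members
    for s in sorted(seqs, key=len, reverse=True):
        words = kmer_counts(s, k)
        # Each mismatch destroys at most k words, so a true match must
        # share at least this many words with the representative.
        max_mismatch = len(s) - int(len(s) * threshold)
        min_shared = max(len(s) - k + 1 - k * max_mismatch, 0)
        for rep, rep_words in reps:
            if sum((words & rep_words).values()) < min_shared:
                continue  # word filter: skip the expensive comparison
            if identity(s, rep) >= threshold:
                clusters[rep].append(s)
                break
        else:
            reps.append((s, words))
            clusters[s] = [s]
    return clusters
```

The `break` on the first match is the greedy step: a sequence never checks whether a later representative would have been a better fit.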

Limitations of CD-HIT Evenly distributed mismatches Greedy issue – A sequence is grouped into the first cluster it meets

CD-HIT Performance

ORFS CLUSTERING

RAMMCAP

Why Cluster ORFs Function studies Finding novel genes

ORF Prediction ORF_finder MetaGene

ORF Prediction Performance MetaSim – Average 100, 200, 400, 800 bp, 1 million reads True ORF (sensitivity) – Overlap 30 AA with NCBI annotated ORF Predicted ORF (specificity) – 50% overlap with true ORF

ORF Clustering Run 1 clustering – 90~95% identity Run 2 clustering – 60% identity over 80% of length (454) – 30% identity over 80% of length (Sanger) Merge the results of runs 1 & 2
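One way to picture the two-run scheme is to cluster everything at the strict threshold, cluster only the run-1 representatives at the looser threshold, and then expand each run-2 cluster back to the original members. The sketch below assumes that reading of "merge"; the identity measure is a toy ungapped comparison, the 80%-of-length coverage condition is left out, input sequences are assumed distinct, and all names are hypothetical.

```python
def identity(a, b):
    """Toy identity: matching positions over the shorter length, no gaps."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


def cluster(seqs, threshold):
    """Greedy clustering: representative -> list of members."""
    clusters = {}
    for s in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if identity(s, rep) >= threshold:
                clusters[rep].append(s)
                break
        else:
            clusters[s] = [s]
    return clusters


def two_run_cluster(seqs, first=0.95, second=0.6):
    """Run 1 on all sequences, run 2 on run-1 representatives, then merge."""
    run1 = cluster(seqs, first)
    run2 = cluster(list(run1), second)
    merged = {}
    for rep2, reps1 in run2.items():
        # Replace each run-1 representative by its full member list.
        merged[rep2] = [m for rep1 in reps1 for m in run1[rep1]]
    return merged
```

Run 2 only ever compares representatives, so its cost depends on the number of run-1 clusters rather than on the number of input sequences.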

Clustering Evaluation Test sets – GOS-ORF (30%), BIOME (95%), BIOME-ORF (60%)

BIOME Microbiomes & Viromes Microbial sequences are more conserved than viral sequences.

Clustering Quality Need a conservative threshold Use only Pfam sequences >30 AA Discard the shorter sequence when Pfam sequences overlap Placed into different clusters – Sequences in the same Pfam family may be placed into different clusters.

Clustering Validation Generate clusters whose sequences come from the same Pfam family Minimize the number of clusters Good clusters: >95% of members from the same Pfam family – >97% of sequences are in good clusters – ~30 times more than bad clusters [Figure: number of sequences and number of clusters by cluster size]
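The "good cluster" criterion can be checked with a few lines of code. The sketch below assumes each cluster is given as a non-empty list of Pfam labels, one per member sequence; the function names and the output layout are hypothetical.

```python
from collections import Counter


def purity(labels):
    """Fraction of members carrying the most common Pfam label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def summarize(clusters, cutoff=0.95):
    """Count good and bad clusters and the share of sequences in good ones.

    A cluster is good when more than `cutoff` of its members share one label.
    """
    good = bad = seqs_in_good = total = 0
    for labels in clusters:
        total += len(labels)
        if purity(labels) > cutoff:
            good += 1
            seqs_in_good += len(labels)
        else:
            bad += 1
    return {
        "good": good,
        "bad": bad,
        "fraction_in_good": seqs_in_good / total if total else 0.0,
    }
```

Note the strict inequality: a cluster at exactly 95% purity counts as bad, following the ">95%" wording on the slide.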

RAMMCAP

Protein Family Annotation Pfam (24.0, Oct. 2009, families) – textual descriptions, other resources and literature references TIGRFAMs (9.0, Nov. 2009, 3808 models) – GO, Pfam and InterPro models COG (2003, 4873 clusters of orthologous groups) – 3 lineages and ancient conserved domain – RPS-BLAST (reverse PSI-BLAST), E-values ≤ 0.001

Novel Protein Families Discovery ORFs that look spurious but fall in large clusters without homology matches may represent novel protein families. In GOS, only 1.3% of clusters with cluster size ≥ 10 map to 93% of true ORFs In BIOME, only 1.0% of clusters with cluster size ≥ 5 map to 28% of true ORFs

METAGENOME COMPARISON

Statistical Comparison of Metagenomics Occurrence profile coefficient z score, why? (Rodriguez-Brito's method requires 10^5 simulated samples) Low occurrence cutoff: H_A = 4 (0.95), z = 1.96; H_A = 7 (0.99), z = Criteria: 1. z > cutoff 2. P_A ≥ f × P_B
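The slide does not spell out the z formula, so the sketch below assumes the standard pooled two-proportion z statistic, z = (P_A − P_B) / sqrt(P(1 − P)(1/N_A + 1/N_B)) with P = (n_A + n_B)/(N_A + N_B). The default `f=2.0` is a hypothetical value, treating the low-occurrence cutoff as a minimum count in sample A is an interpretation, and the function names are illustrative only.

```python
import math


def z_score(n_a, size_a, n_b, size_b):
    """Pooled two-proportion z statistic for occurrence counts n_a and n_b."""
    p_a = n_a / size_a
    p_b = n_b / size_b
    pooled = (n_a + n_b) / (size_a + size_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / size_a + 1 / size_b))
    return (p_a - p_b) / se if se > 0 else 0.0


def is_enriched(n_a, size_a, n_b, size_b,
                z_cutoff=1.96, f=2.0, min_occurrence=4):
    """Apply the three conditions: occurrence cutoff, z > cutoff, P_A >= f * P_B."""
    if n_a < min_occurrence:
        return False  # too few occurrences in A to test
    p_a = n_a / size_a
    p_b = n_b / size_b
    return z_score(n_a, size_a, n_b, size_b) > z_cutoff and p_a >= f * p_b
```

For example, 40 occurrences among 1000 sequences in A against 10 among 1000 in B passes all three conditions, while 3 against 0 is rejected by the occurrence cutoff before any z score is computed. Unlike a resampling approach, each comparison is a closed-form calculation.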

Comparison between Rodriguez-Brito's method and z test method.

Clustering-based Comparison GOS ORF clusters [Figure: r_AB versus number of clusters]

Clustering-based Comparison BIOME samples are more diverse than GOS BIOME clusters

Protein Family-based Comparison Merge Pfam, TIGRFAMs and COG into superfamilies – Pfam clans, TIGRFAMs role categories, and COG functional classes Compare samples within a specific superfamily
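Merging family-level hits into superfamilies amounts to a lookup followed by a count. The sketch below uses placeholder family and superfamily names rather than real Pfam, TIGRFAMs or COG identifiers, and sending unmapped families to an "unassigned" bin is an assumption.

```python
from collections import Counter


def superfamily_counts(family_hits, family_to_super):
    """Aggregate per-family hit counts into per-superfamily counts."""
    counts = Counter()
    for family, n in family_hits.items():
        counts[family_to_super.get(family, "unassigned")] += n
    return counts


# Hypothetical mapping: two families share one superfamily.
mapping = {"famA": "super1", "famB": "super1", "famC": "super2"}
sample_hits = {"famA": 12, "famB": 3, "famC": 7, "famD": 1}
sample_counts = superfamily_counts(sample_hits, mapping)
```

The per-superfamily counts of two samples can then be compared with the same statistical test used for clusters.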

Protein Family-based Comparison (a) GOS on COG Class F, (b) GOS on COG Class T, (c) BIOME on COG Class F, (d) BIOME on COG Class T

Conclusion RAMMCAP improves performance – CD-HIT – z test Novel protein family discovery – ORF clustering Metagenome comparison – Cluster-based – Protein family-based

Discussion How much improvement is gained by applying RNA prediction to the raw reads first? How should the significance factor f be determined? – P_A ≥ f × P_B (f > 1)