Analyzing Time Course Data: How can we pick the disappearing needle across multiple haystacks? IEEE-HPEC Bioinformatics Challenge Day Dr. C. Nicole Rosenzweig.


Analyzing Time Course Data: How can we pick the disappearing needle across multiple haystacks? IEEE-HPEC Bioinformatics Challenge Day Dr. C. Nicole Rosenzweig 10 September 2012 UNCLASSIFIED//APPROVED FOR PUBLIC RELEASE

Problem Statement
Environmental sampling is computationally complex for several reasons:
– Reference databases contain limited coverage of plant, fungal, and other material.
– Organism representation changes over time and across distance in ways that we have not adequately modeled or captured.
To respond to these limitations, time course studies can be used to identify the genomic component of interest.

Objective
Invent a novel analytical method to identify changes observed among samples.
Subgoals:
– Changes observed need some measure of confidence.
– Changes can be clustered in confidence “windows”.
– Generate a time series representation of multiple datasets.
– Compress data in order to handle huge datasets efficiently.
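
One way to picture the first and third subgoals is the minimal sketch below. It is not the method the challenge asks participants to invent: it assumes reads have already been assigned to named features, uses made-up counts for three days, and swaps in a standard pooled two-proportion z-test as the confidence measure. The function names (`relative_abundance`, `time_series`, `two_proportion_p_value`) and the feature names are hypothetical.

```python
from math import erf, sqrt


def relative_abundance(counts_by_day):
    """Convert raw counts {day: {feature: count}} into per-day fractions."""
    table = {}
    for day, counts in counts_by_day.items():
        total = sum(counts.values())  # assumed to be non-zero for every day
        table[day] = {feature: n / total for feature, n in counts.items()}
    return table


def time_series(table, feature):
    """Return one feature's fraction at each timepoint, in the order the days were given."""
    return [fractions.get(feature, 0.0) for fractions in table.values()]


def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value of a pooled two-proportion z-test (n1 and n2 must be > 0)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # both proportions are exactly 0 or exactly 1: no detectable change
    z = (x1 / n1 - x2 / n2) / se
    return 1 - erf(abs(z) / sqrt(2))


# Illustrative counts only; these are not results from the challenge data.
counts_by_day = {
    "day1": {"featureA": 120, "featureB": 880},
    "day2": {"featureA": 60, "featureB": 940},
    "day3": {"featureA": 5, "featureB": 995},
}
table = relative_abundance(counts_by_day)
series_a = time_series(table, "featureA")
p_day1_vs_day3 = two_proportion_p_value(120, 1000, 5, 1000)
```

A small p-value here only says that the change in one feature between two days is unlikely to be sampling noise. Testing many features at once would call for a multiple-testing correction, which this sketch leaves out.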

Environmental Sample Described
Metagenomic datasets (FASTQ format) from environmental samples taken at 3 timepoints.
– Environmental sample taken once a day for 3 days.
– Each day, biological material and particulates were captured in a buffer that is not DNA-free.
– Particulates (inorganic or plant material) were removed by centrifugation, DNA was extracted, and the material was sequenced on an Illumina HiSeq 2000.

Issues to Consider
BLAST is computationally expensive and time consuming. For many time-course samples, the system will be bogged down in this analysis.
– What can be done to simplify the dataset and reduce the computational burden of BLAST/MegaBLAST?
Bacteria, viruses, and fungi that do not cause human disease are poorly represented in reference databases. Therefore, much of the genomic data will appear as ‘unknown’.
– How can we cluster or categorize genomic data without a good reference?
Confidence assessment is a difficult problem.
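
As one illustration of the first question, identical reads can be collapsed before any search so that each distinct sequence is queried only once. The sketch below is a simple assumption about what "simplifying the dataset" could mean, not an approach prescribed by the slides. It assumes plain four-line FASTQ records with no blank lines; the helper names and the `uniqN;size=M` header convention are hypothetical.

```python
from collections import Counter
from io import StringIO


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:  # end of input (a blank line also stops parsing)
            break
        sequence = handle.readline().strip()
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().strip()
        yield header[1:], sequence, quality


def collapse_reads(handle):
    """Count identical read sequences so each distinct sequence is searched once."""
    return Counter(sequence for _, sequence, _ in read_fastq(handle))


def to_fasta(counts):
    """Render distinct sequences as FASTA, keeping the copy number in the header."""
    lines = []
    for index, (sequence, copies) in enumerate(counts.most_common(), start=1):
        lines.append(f">uniq{index};size={copies}")
        lines.append(sequence)
    return "\n".join(lines)


# Three toy reads, two of them identical.
example = StringIO(
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nACGTACGT\n+\nIIIIIIII\n"
    "@r3\nTTGGACAG\n+\nIIIIIIII\n"
)
counts = collapse_reads(example)
fasta = to_fasta(counts)
```

Keeping the copy number in the header means abundance can be restored after the search. Exact-match collapsing does not merge reads that differ by a sequencing error, so the reduction on real data may be modest.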

Tools to provide preliminary analysis: Improved speed with lower accuracy
Tools to analyze genomic data that do not rely on BLAST:
Analytically equivalent (almost), but computationally improved approaches
– Reference mapping tools, such as Bowtie (SourceForge)

Tools to provide preliminary analysis: Improved speed with lower accuracy
Tools to analyze genomic data that do not rely on BLAST:
Heuristics
– 16S/23S rDNA analysis. Because these genomic sequences are highly conserved, organisms that have not been sequenced can still be included in a relative quantitation of bacterial species in a sample.
– Pro: The 16S database is smaller than NCBI RefSeq. Also, this approach could be joined with other heuristics to selectively evaluate a subset of reads within a large dataset.
– Con: In the best of cases, the granularity of analysis is low. Additional analysis would be required based on the output, and this would likely rely on BLAST.

Tools to provide preliminary analysis: Improved speed with lower accuracy
Tools to analyze genomic data that do not rely on BLAST:
Heuristics
– CloVR, from the Institute for Genome Sciences, University of Maryland: Completes 16S classification and alignment to capture diversity. Developed for 16S ribosomal RNA amplicon sequencing.
– The CloVR-16S pipeline employs several well-known phylogenetic tools and protocols:
1. QIIME – a Python-based workflow package, allowing for sequence processing and phylogenetic analysis using different methods including the phylogenetic distance metric UniFrac, UCLUST, PyNAST and the RDP Bayesian classifier;
2. UCHIME – a tool for rapid identification of chimeric 16S sequence fragments;
3. Mothur – a C++-based software package for 16S analysis;
4. Metastats and custom R scripts used to generate additional statistical and graphical evaluations.

Tools to provide preliminary analysis: Improved speed with lower accuracy

Tools to provide preliminary analysis: Improved speed with lower accuracy
Tools to analyze genomic data that do not rely on BLAST:
Pattern-matching programs:
– K-mer analysis: K-mer distributions have been observed to be well preserved among related strains and species. Sequences could be clustered into groups by these distributions, allowing for directed post-k-mer analysis.
– Amino acid k-mers can be used to identify homologous genes.
– Microbes are present everywhere, but the reference material available in NCBI is not a uniform representation of existing flora and fauna.
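
The k-mer idea above can be sketched as follows: build a normalized k-mer frequency profile for each sequence, compare profiles, and group sequences whose profiles are close. This is a minimal illustration, not a tool named in the slides. The choice of cosine similarity, the greedy grouping, the threshold, and all function names are assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def kmer_profile(sequence, k=4):
    """Return normalized k-mer frequencies, skipping k-mers with ambiguous bases."""
    sequence = sequence.upper()
    counts = Counter(
        sequence[i:i + k]
        for i in range(len(sequence) - k + 1)
        if set(sequence[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles (0.0 if either is empty)."""
    dot = sum(value * q.get(kmer, 0.0) for kmer, value in p.items())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def group_by_profile(sequences, k=4, threshold=0.9):
    """Greedy grouping: a sequence joins the first group whose seed profile is similar enough."""
    groups = []  # each entry is (seed_profile, [member names])
    for name, sequence in sequences.items():
        profile = kmer_profile(sequence, k)
        for seed, members in groups:
            if cosine_similarity(profile, seed) >= threshold:
                members.append(name)
                break
        else:
            groups.append((profile, [name]))
    return [members for _, members in groups]


# Toy sequences; k=2 keeps the profiles small enough to check by hand.
toy = {"a": "ACACACACAC", "b": "ACACACACAA", "c": "GGTTGGTTGG"}
groups = group_by_profile(toy, k=2, threshold=0.9)
```

Because no reference database is involved, grouping of this kind can be applied to reads that would otherwise come back as ‘unknown’. The greedy pass depends on input order, and short reads give noisy profiles, so a real analysis would likely need longer contigs or a more careful clustering method.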

Objective
Invent a novel analytical method to identify changes observed among samples.
Subgoals:
– Changes observed need some measure of confidence.
– Changes can be clustered in confidence “windows”.
– Generate a time series representation of multiple datasets.
– Compress data in order to handle huge datasets efficiently.