Analyzing Time Course Data: How can we pick the disappearing needle across multiple haystacks? IEEE-HPEC Bioinformatics Challenge Day Dr. C. Nicole Rosenzweig.

1 Analyzing Time Course Data: How can we pick the disappearing needle across multiple haystacks? IEEE-HPEC Bioinformatics Challenge Day Dr. C. Nicole Rosenzweig 10 September 2012 UNCLASSIFIED//APPROVED FOR PUBLIC RELEASE

2 Problem Statement
Environmental sampling is computationally complex for several reasons:
– The reference database contains limited coverage of plant, fungal, and other material.
– Organism representation changes over time and across distance in ways that we have not adequately modeled or captured.
To respond to these limitations, time course studies can be used to identify the genomic component of interest.

3 Objective
Invent novel analytical method to identify changes observed among samples.
Subgoals:
– Changes observed need some measure of confidence.
– Changes can be clustered in confidence “windows”.
– Generate time series representation of multiple datasets.
– Compress data in order to efficiently handle huge datasets.
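One way to picture the "time series representation" subgoal is sketched below. It is an illustration under stated assumptions, not the method the challenge asks for: each time point is assumed to have already been reduced to a table of feature counts (a feature could be a taxon, a read cluster, or a k-mer), and the `min_count` threshold is only a crude, hypothetical stand-in for a real confidence measure.

```python
from collections import Counter

def relative_abundance(counts):
    """Convert raw feature counts for one time point into fractions of the total."""
    total = sum(counts.values())
    return {f: n / total for f, n in counts.items()} if total else {}

def time_series(samples):
    """Map each feature to its relative abundance at every time point, in order.

    samples: list of Counter objects, one per time point (assumed input shape).
    """
    profiles = [relative_abundance(c) for c in samples]
    features = sorted(set().union(*[p.keys() for p in profiles]))
    return {f: [p.get(f, 0.0) for p in profiles] for f in features}

def disappearing(samples, min_count=5):
    """Features with at least min_count observations at the first time point
    and none at the last. min_count is an illustrative support threshold."""
    first, last = samples[0], samples[-1]
    return sorted(f for f, n in first.items()
                  if n >= min_count and last.get(f, 0) == 0)

# Toy data for three daily samples; the feature names are made up.
days = [
    Counter({"featA": 10, "featB": 10}),
    Counter({"featA": 5, "featB": 15}),
    Counter({"featB": 20}),
]
series = time_series(days)
vanished = disappearing(days)
```

Here `series["featA"]` traces the feature's share of each day's sample, and `vanished` lists the candidates for the "disappearing needle". A real analysis would need a statistical test in place of the fixed threshold.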

4 Environmental Sample Described
Metagenomic datasets (FASTQ format) from environmental samples taken at 3 timepoints.
Environmental sample taken once a day for 3 days.
Each day, biological material and particulates captured in a buffer that is not DNA-free.
Particulates (inorganic or plant material) removed by centrifugation, DNA extracted, and material sequenced on Illumina HiSeq 2000.
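For readers unfamiliar with FASTQ, a minimal reader is sketched below. It assumes the common four-lines-per-record layout (header starting with `@`, sequence, `+` separator, quality string) with unwrapped sequences, which is typical of Illumina output but is not guaranteed by the slides. The demo data is made up.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:            # end of file
            return
        if not header.strip():    # tolerate blank lines between records
            continue
        seq = handle.readline().rstrip("\n")
        handle.readline()         # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:].strip(), seq, qual

# In-memory stand-in for one day's file; a real run would open one file per time point.
demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2 extra\nGG\n+\n##\n")
records = list(read_fastq(demo))
```

Because the function is a generator, it streams records instead of loading a whole HiSeq run into memory.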

5 Issues to Consider
BLAST is computationally expensive and time consuming. For many time-course samples, the system will be bogged down in this analysis.
– What can be done to simplify the dataset to reduce the computational burden of BLAST/MegaBLAST?
Bacteria, viruses, and fungi that do not cause human disease are poorly represented in reference databases. Therefore, much of the genomic data will appear as ‘unknown’.
– How can we cluster or categorize genomic data without a good reference?
Confidence assessment is a difficult problem.
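One simple way to shrink the input before BLAST, offered only as an illustration and not as an approach proposed in the slides, is to collapse exact duplicate reads and carry their multiplicity along. The `uniqN;count=M` header format below is a hypothetical convention, not something BLAST interprets.

```python
from collections import Counter

def collapse_reads(sequences):
    """Collapse exact duplicate reads (case-insensitive) into (sequence, count) pairs,
    most frequent first."""
    counts = Counter(s.upper() for s in sequences)
    return counts.most_common()

def to_fasta(collapsed):
    """Render collapsed reads as FASTA text, recording each count in the header."""
    return "".join(
        f">uniq{i};count={n}\n{seq}\n"
        for i, (seq, n) in enumerate(collapsed, start=1)
    )

reads = ["ACGT", "acgt", "GGCC", "ACGT"]   # toy input
collapsed = collapse_reads(reads)
fasta_text = to_fasta(collapsed)
```

Each unique sequence is then searched once, and the stored count restores abundance afterwards. Reads that differ by a single sequencing error are not merged by this sketch.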

6 Tools to provide preliminary analysis: Improved speed with lower accuracy
Tools to analyze genomic data that do not rely on BLAST:
Analytically equivalent (almost), but computationally improved approaches
– Reference mapping tools, such as Bowtie (SourceForge)

7 Tools to provide preliminary analysis: Improved speed with lower accuracy
Tools to analyze genomic data that do not rely on BLAST: Heuristics
– 16S/23S rDNA analysis. Because these genomic sequences are highly conserved, organisms that have not been sequenced can still be included in a relative quantitation of bacterial species in a sample.
– Pro: The 16S database is smaller than NCBI RefSeq. Also, this approach could be joined with other heuristics to selectively evaluate a subset of reads within a large dataset.
– Con: In the best of cases, the granularity of analysis is low. Additional analysis would be required based on the output, and this would likely rely on BLAST.

8 Tools to provide preliminary analysis: Improved speed with lower accuracy
Tools to analyze genomic data that do not rely on BLAST: Heuristics
– CloVR, from the Institute for Genome Sciences, University of Maryland: Completes 16S classification and alignment to capture diversity. Developed for 16S ribosomal RNA amplicon sequencing.
– The CloVR-16S pipeline employs several well-known phylogenetic tools and protocols:
1. QIIME: a Python-based workflow package, allowing for sequence processing and phylogenetic analysis using different methods including the phylogenetic distance metric UniFrac, UCLUST, PyNAST and the RDP Bayesian classifier;
2. UCHIME: a tool for rapid identification of chimeric 16S sequence fragments;
3. Mothur: a C++-based software package for 16S analysis;
4. Metastats and custom R scripts used to generate additional statistical and graphical evaluations.

9 Tools to provide preliminary analysis: Improved speed with lower accuracy
Figure: http://clovr.org/wp-content/uploads/2010/07/clovrfig11.png

10 Tools to provide preliminary analysis: Improved speed with lower accuracy
Tools to analyze genomic data that do not rely on BLAST: Pattern-matching programs
– K-mer analysis: K-mer distributions have been observed to be well preserved among related strains/species. These could be clustered into groups, allowing for directed post-k-mer analysis.
– Amino acid k-mers can be used to identify homologous genes.
– Microbes are present everywhere, but the reference material available in NCBI is not a uniform representation of existing flora and fauna.
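The k-mer idea can be sketched in a few lines. This is a minimal illustration assuming nucleotide 4-mers and cosine similarity as the comparison. Both choices are assumptions made here for brevity, and the slides do not prescribe a value of k or a distance measure.

```python
from collections import Counter
from math import sqrt

def kmer_counts(seq, k=4):
    """Count overlapping k-mers in a sequence, skipping any k-mer with a
    non-ACGT character (for example an ambiguous base N)."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts

def cosine_similarity(a, b):
    """Cosine similarity between two k-mer count profiles (1.0 means identical direction)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

profile = kmer_counts("ACGTACGT")                 # toy sequence
same = cosine_similarity(profile, profile)
unrelated = cosine_similarity(kmer_counts("AAAA", k=2), kmer_counts("CCCC", k=2))
```

Pairwise similarities like these could feed any standard clustering routine to group reads or samples without a reference database, which is the "directed post-k-mer analysis" step the slide alludes to.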

11 Objective
Invent novel analytical method to identify changes observed among samples.
Subgoals:
– Changes observed need some measure of confidence.
– Changes can be clustered in confidence “windows”.
– Generate time series representation of multiple datasets.
– Compress data in order to efficiently handle huge datasets.

