DESIGNING THE MICROBIAL RESEARCH COMMONS: AN INTERNATIONAL SYMPOSIUM NATIONAL ACADEMY OF SCIENCES, WASHINGTON, DC, 8-9 OCTOBER 2009 Paul Gilna, B.Sc.,

DESIGNING THE MICROBIAL RESEARCH COMMONS: AN INTERNATIONAL SYMPOSIUM
NATIONAL ACADEMY OF SCIENCES, WASHINGTON, DC, 8-9 OCTOBER 2009
Paul Gilna, B.Sc., Ph.D.
California Institute for Telecommunications & Information Technology (Calit2)
University of California, San Diego
Large-Scale Microbial Ecology Cyberinfrastructure (CAMERA)

Global Scientific Research Cyber-Community

3,100 users in 70 countries

CAMERA 2.0 Objectives
CAMERA serves as one representation of a specific research community’s need for a system to:
- Provide a metadata-rich family of scalable databases and make them available to the community
- Collect and reference increasing metadata relevant to environmental metagenome datasets
- Exploit the power of querying on metadata across multiple geospatial locations
- Provide a facility that allows for a diversity of software tools to be easily integrated into the system (and sufficient compute resources to support these analyses)

The Semantically Aware DB Schema
Some key features of the semantically aware DB schema:
- Environmental parameters: Modeled more generally, to accommodate any environment and any parameter within an environment
- Sequence: Separate “registries” for DNA, rRNA, mRNA, viral segments, reference genomes, etc. Sequence annotations are independently searchable.
- Workflow connection: Every computed property is associated with the workflow instance that created it.
- Associated data: Data not produced in CAMERA but often used for analysis and comparison
- Ontologies: All metadata, measured and observed parameters are connected to ontologies, whenever possible.
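
The schema features listed on this slide can be pictured with a small data model. This is a minimal sketch, not CAMERA's actual schema: every class, field, and function name below is hypothetical and chosen only to mirror the bullet points (generic environmental parameters, per-type sequence registries, computed properties tied to a workflow run, optional ontology links).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class OntologyTerm:
    """Reference to a controlled-vocabulary term (identifiers are illustrative)."""
    ontology: str  # e.g. "EnvO"
    term_id: str   # placeholder identifier, not a real accession
    label: str


@dataclass
class EnvironmentalParameter:
    """Generic (name, value, unit) triple so any environment can be described."""
    name: str
    value: float
    unit: str
    term: Optional[OntologyTerm] = None  # linked to an ontology whenever possible


@dataclass
class ComputedProperty:
    """A derived value that always records the workflow run that produced it."""
    name: str
    value: str
    workflow_run_id: str


@dataclass
class SequenceRecord:
    registry: str  # "DNA", "rRNA", "mRNA", "viral", "reference_genome", ...
    sequence_id: str
    parameters: List[EnvironmentalParameter] = field(default_factory=list)
    computed: List[ComputedProperty] = field(default_factory=list)


def find_by_parameter(records, name, low, high):
    """Return records that have parameter `name` with a value in [low, high]."""
    return [
        r for r in records
        if any(p.name == name and low <= p.value <= high for p in r.parameters)
    ]


records = [
    SequenceRecord("DNA", "seq-1", [EnvironmentalParameter("temperature", 18.5, "degC")]),
    SequenceRecord("DNA", "seq-2", [EnvironmentalParameter("temperature", 4.0, "degC")]),
]
warm = find_by_parameter(records, "temperature", 15.0, 25.0)
```

Storing parameters as generic triples rather than fixed columns is what lets one query function work for any environment, at the cost of weaker type checking per parameter.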

Integration of External Data
Warehousing
- Reference genomes
- Homologs, COG clusters
- Raster data from slow/complex servers
Remote Data
- KEGG pathways
- NASA MODIS data
- World Ocean Atlas
- Other data that come as “data sets” that do not conform to the schema

NASA Aqua-MODIS satellite data
Metadata: beyond data collected at sampling site
- Sea Surface Temp
- Chlorophyll
MODIS images covering GOS sites #8-12, mid-November 2003

Integration of Enhanced Metadata

Integrate and browse additional sources of microbial data

CAMERA 2.0 (Data Submission) Growing the CAMERA Community and Resource…

GBMF Data Acquisition Pipeline: A New Data Submission Paradigm-Metadata First!
1. Investigator submits proposal to GBMF
2. Investigator submits metadata to CAMERA
3. CAMERA sends acknowledgement to Investigator, Seq. Group, GBMF
4. Seq. Group sends barcoded sample “kit” to investigators
5. Sample is received and sequenced
6. Seq. Group uploads data to CAMERA (& Investigator)
7. Data & metadata released in six months
Notes:
- Metadata now collected before sequence data: GSC-compliant
- Project-ID serves as acceptance-proof
Webb Miller and Stephan C. Schuster, and Roche / 454 Genome Sequencer
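
The "metadata first" rule in this pipeline amounts to an ordering constraint: sequence data cannot be uploaded until metadata has been submitted and acknowledged. The sketch below models that constraint as a simple state machine. It is an illustration only; the `Stage` names, the `Submission` class, and the project ID format are assumptions, not part of the CAMERA system.

```python
from enum import Enum, auto


class Stage(Enum):
    PROPOSED = auto()
    METADATA_SUBMITTED = auto()
    ACKNOWLEDGED = auto()
    KIT_SENT = auto()
    SEQUENCED = auto()
    DATA_UPLOADED = auto()
    RELEASED = auto()


# A submission may only advance to the next stage in declaration order.
ORDER = list(Stage)


class Submission:
    def __init__(self, investigator):
        self.investigator = investigator
        self.stage = Stage.PROPOSED
        self.project_id = None  # issued at acknowledgement, used as acceptance proof

    def advance(self, target, project_id=None):
        position = ORDER.index(self.stage)
        if position + 1 >= len(ORDER):
            raise ValueError("submission is already released")
        expected = ORDER[position + 1]
        if target is not expected:
            raise ValueError(
                f"cannot go from {self.stage.name} to {target.name}; "
                f"expected {expected.name}"
            )
        if target is Stage.ACKNOWLEDGED:
            if project_id is None:
                raise ValueError("acknowledgement requires a project ID")
            self.project_id = project_id
        self.stage = target


submission = Submission("example-investigator")
submission.advance(Stage.METADATA_SUBMITTED)
submission.advance(Stage.ACKNOWLEDGED, project_id="P-0001")  # hypothetical ID format
```

Trying to jump straight to `Stage.DATA_UPLOADED` from an earlier stage raises `ValueError`, which is the point of the paradigm: no sequence data enters the system without its metadata.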

Data Standards
Minimum Information about a (Meta)Genome Sequence: MIGS/MIMS
A metadata standard, developed by the Genomics Standards Consortium
- Controlled vocabularies, e.g. EnvO, PATO
- Common language: GCDML
Submissions shall comply with a MIMS/MIGS core, but any metadata can be entered via keywords and free text
Different metadata submission forms for different habitats (water, soil, air, hosts)
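
The "required core plus free-form extras" policy described here could be checked along the following lines. This is a sketch under stated assumptions: the field names in `CORE_FIELDS` are placeholders for illustration and are not the GSC's MIGS/MIMS checklist, and the function names are hypothetical.

```python
# Hypothetical core checklist; the real MIGS/MIMS core is defined by the GSC.
CORE_FIELDS = {"project_name", "collection_date", "latitude", "longitude", "habitat"}

# Habitat-specific forms named on the slide.
HABITATS = {"water", "soil", "air", "host"}


def validate_submission(record):
    """Return a list of problems; an empty list means the core is satisfied."""
    missing = CORE_FIELDS - set(record)
    problems = [f"missing core field: {name}" for name in sorted(missing)]
    habitat = record.get("habitat")
    if habitat is not None and habitat not in HABITATS:
        problems.append(f"unknown habitat: {habitat}")
    return problems


def split_core_and_extra(record):
    """Separate required core fields from free keywords and free text."""
    core = {k: v for k, v in record.items() if k in CORE_FIELDS}
    extra = {k: v for k, v in record.items() if k not in CORE_FIELDS}
    return core, extra


example = {
    "project_name": "demo",
    "collection_date": "2009-10-08",
    "latitude": 32.88,
    "habitat": "water",
    "notes": "free text is allowed alongside the core",
}
issues = validate_submission(example)
core, extra = split_core_and_extra(example)
```

Here `issues` reports only the missing `longitude` field, while the `notes` entry passes through untouched as an extra.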

User Friendly Compute Environment

CAMERA 2.0 (Computation) From simple job submission to community developed and published workflows…

RAMMCAP: Rapid clustering and functional annotation for metagenomic sequences
Features
- Very fast (10-100x) as compared to BLAST-based methods
- Effective tools: CD-HIT, HMMERHEAD, meta_RNA, and RPS-BLAST
- Focused functional annotation via curated protein families
Pipeline diagram (labels as shown on the slide)
- Input: metagenomic raw reads
- RNA finding/filtering (1. tRNA scan, 2. rRNA scan, 3. meta_RNA): tRNAs, rRNAs
- DNA clustering (CD-HIT-EST, 95%): DNA clusters, unique DNA sequences; taxonomy / population analysis
- ORF calling (1. ORF_finder, 2. Metagene): ORFs
- ORF clustering (CD-HIT, 90-95%; CD-HIT, 60 or 30%): non-redundant ORFs, unique sequences, protein clusters, representative sequences, protein families
- ORF annotation and cluster annotation (HMMER, HMMERHEAD, RPS-BLAST) against Pfam, Tigrfam, COG, etc.
- More in-depth analysis and further annotation
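
The clustering stages rely on CD-HIT, which uses greedy incremental clustering: sequences are processed longest first, and each one either joins the first existing cluster whose representative it matches above a threshold or becomes a new representative. The toy sketch below shows only that greedy loop. The `identity` function is a stand-in built on Python's `difflib`, not CD-HIT's short-word filter and alignment, and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Toy identity: matched characters divided by the shorter length."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(sequences, threshold):
    """Greedy incremental clustering, longest sequence first."""
    clusters = []  # list of (representative, members) pairs
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if identity(representative, seq) >= threshold:
                members.append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append((seq, [seq]))
    return clusters


reads = ["ACGTACGTAC", "ACGTACGTA", "TTTTGGGGCC"]
clusters = greedy_cluster(reads, threshold=0.9)
```

Because each sequence is compared only against representatives rather than against every other sequence, the number of comparisons grows with the number of clusters, which is one reason this style of clustering scales to large read sets.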

Annotation workflow
- A green box is called an ‘actor’, which performs a task.
- This special actor represents an annotation component, such as a BLAST search.
- Workflow parameters, which can be specified by users in the portal, are passed to workflow components.
- Data flow is divided.
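
The actor idea on this slide (each box performs one task, and user-specified parameters are passed down to the components) can be sketched in a few lines. This is not the workflow engine's API: the `Actor` class, `run_workflow`, and the two example tasks are hypothetical, and the annotation step is a placeholder rather than a real BLAST call.

```python
class Actor:
    """A workflow step: consumes input data and produces output data."""

    def __init__(self, name, task):
        self.name = name
        self.task = task  # callable(data, params) -> data

    def fire(self, data, params):
        return self.task(data, params)


def run_workflow(actors, data, params):
    """Pass data through each actor in turn, sharing one parameter set."""
    for actor in actors:
        data = actor.fire(data, params)
    return data


def filter_short(reads, params):
    return [r for r in reads if len(r) >= params["min_length"]]


def annotate(reads, params):
    # Placeholder for an annotation component such as a BLAST search.
    return [{"read": r, "database": params["database"]} for r in reads]


workflow = [Actor("filter", filter_short), Actor("annotate", annotate)]
user_params = {"min_length": 3, "database": "demo_db"}  # as if set in the portal
result = run_workflow(workflow, ["ACGT", "AC"], user_params)
```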

Provenance of Workflow Related Data
Provenance: a concept from art history and library science
- Inputs, outputs, intermediate results, workflow design, workflow run
Collected information
- Can be used in a number of ways: validation, reproducibility, fault tolerance, etc.
- Linked to the semantic database
- Viewable and searchable from CAMERA 2.0
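
One way to picture the collected provenance is as a trace that records, for each step of a run, a fingerprint of what went in and what came out. The sketch below is an assumption about how such a record might look, not CAMERA's implementation; `fingerprint`, `run_with_provenance`, and the trace layout are all hypothetical.

```python
import hashlib
import json


def fingerprint(value):
    """Short, stable hash of any JSON-serializable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_with_provenance(run_id, steps, data):
    """Run (name, function) steps in order and record inputs and outputs."""
    trace = {"run_id": run_id, "input": fingerprint(data), "steps": []}
    for name, func in steps:
        result = func(data)
        trace["steps"].append(
            {"step": name, "input": fingerprint(data), "output": fingerprint(result)}
        )
        data = result
    trace["output"] = fingerprint(data)
    return data, trace


steps = [
    ("uppercase", lambda reads: [r.upper() for r in reads]),
    ("deduplicate", lambda reads: sorted(set(reads))),
]
output, trace = run_with_provenance("run-001", steps, ["acgt", "ACGT"])
```

Because each step's output fingerprint equals the next step's input fingerprint, the trace can be replayed to validate a result or to find the step where a rerun diverged.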