VectorBase PopBio Introduction NIH/NIAID VectorBase site visit March 2015.

What is PopBio?
Flexible database for sample and assay metadata for field- or lab-derived population biology data.
● collection event & location (GeoData)
● basic sample information
● assays
  o species identification
  o phenotypes (host species [e.g. from blood meal], insecticide resistance, ...)
  o genotypes
  o manipulations (sampleA + sampleB -> sampleC)

What is it for?
● Allows integration of individual studies (e.g. insecticide resistance studies conducted in individual countries).
● Enables meta-analysis of community data.

Data sources
Legacy:
● IRbase
● UC Davis/UCLA (but updates planned)
Recent:
● Bulk imports (e.g. Malaria Atlas Project surveillance data)
● Publications (typically with extra data direct from authors)
● MalariaGen & 16 Anopheles
● Other unpublished/in progress

Future data sources
● ICEMRs
● National/international IR surveillance
● MalariaGen
● Partners (Vestergaard, Oxford University MAP)
● Smaller published and unpublished datasets

Data model
● GMOD Chado schema
● Heavy reliance on CVs/ontologies → flexibility → computability
● Vastly oversimplified explanation of schema: Projects have samples have assays have results
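The "projects have samples have assays have results" summary can be pictured as a small containment hierarchy. The sketch below is only an illustration in Python: the class names, field names and the placeholder accession are assumptions made for this example and do not reflect the actual Chado tables or the PopBio Perl API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CvTerm:
    """A controlled-vocabulary/ontology term: human-readable name plus accession."""
    name: str
    accession: str


@dataclass
class Result:
    """One outcome of an assay, typed by an ontology term."""
    term: CvTerm
    value: Optional[str] = None


@dataclass
class Assay:
    assay_type: str
    results: List[Result] = field(default_factory=list)


@dataclass
class Sample:
    name: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Project:
    name: str
    samples: List[Sample] = field(default_factory=list)

    def count_assays(self) -> int:
        """Total number of assays across all samples in the project."""
        return sum(len(sample.assays) for sample in self.samples)


# Hypothetical example data; the accession is a placeholder, not a real ID.
inversion = CvTerm("inversion", "SO:placeholder")
karyotyping = Assay("genotype assay", [Result(inversion, "2La/a")])
sample = Sample("example sample", [karyotyping])
project = Project("example project", [sample])
```

Typing every result with an ontology term, rather than free text, is what the slide means by flexibility (new kinds of result need no schema change) and computability (results can be grouped by term).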

Ontologies
● VectorBase ontologies: insecticide resistance, malaria, dengue & anatomy
● Third party ontologies: sample properties, genomic variation types, placenames, phenotypic qualities

Curation and data import
ISA-Tab spreadsheet format
● Investigation - Study - Assay
● Widely used for 'omics metadata
● Ontology-based annotation is well supported
● Ontology term suggestion tools available in Google Spreadsheets
Challenges
● consistent representation of data and choice of ontology terms by curator(s) through time
● too complex for casual submitters
ISA-Tab's Study and its associated list of samples maps to PopBio's project and samples, while Assay maps to… assay!
High level "object relational mapper" Perl API handles storage into and retrieval from Chado database for consistency and maintainability.
Example: a sample may have several species identification assays. Our API provides a method for the sample object which returns the best single species term to summarise those results.
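The species-summary example above could work along the following lines. This is a Python sketch rather than the Perl API itself, and the selection rule is an assumption: if all assay results lie on one lineage, take the most specific term; otherwise fall back to the nearest common ancestor. The `PARENT` table is a hypothetical toy taxonomy, and `lineage` and `best_species` are names invented for this illustration.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy (child term -> parent term), for illustration only.
PARENT: Dict[str, Optional[str]] = {
    "Anopheles arabiensis": "Anopheles gambiae sensu lato",
    "Anopheles gambiae sensu stricto": "Anopheles gambiae sensu lato",
    "Anopheles gambiae sensu lato": "Anopheles",
    "Anopheles": None,
}


def lineage(term: str) -> List[str]:
    """Return the term followed by its ancestors, most specific first."""
    chain: List[str] = []
    current: Optional[str] = term
    while current is not None:
        chain.append(current)
        current = PARENT.get(current)
    return chain


def best_species(assay_results: List[str]) -> Optional[str]:
    """Pick one species term to summarise several identification results.

    Assumed rule: if every result is the deepest term or one of its
    ancestors, return the deepest term; otherwise return the nearest
    ancestor shared by all results (None if there is none).
    """
    if not assay_results:
        return None
    deepest = max(assay_results, key=lambda t: len(lineage(t)))
    deepest_lineage = lineage(deepest)
    if all(term in deepest_lineage for term in assay_results):
        return deepest
    # Conflicting results: keep only the terms common to every lineage.
    shared = deepest_lineage
    for term in assay_results:
        other = set(lineage(term))
        shared = [t for t in shared if t in other]
    return shared[0] if shared else None
```

Under this assumed rule, a morphological call of "Anopheles gambiae sensu lato" combined with a PCR call of "Anopheles arabiensis" summarises to the more specific PCR result, while two conflicting species-level calls fall back to the complex they share.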

Updating existing data
1. Edit ISA-Tab, delete project and reload project from new ISA-Tab (stable IDs for project, samples and assays are retained)
2. Edit ISA-Tab but apply simple SQL updates or an API script to modify the database (as delete+reload can be slow)
No database → ISA-Tab route at present.

Scalability (storage + maintenance)
Current size: 121 projects, 57,637 samples, 172,636 assays (of which 4,387 are IR)
API overhead ⇒ some tasks take overnight
● loading for sample datasets
● search index generation
No issues yet with maintenance (e.g. backup and transfer of databases)

Scalability (web-based retrieval)
"Dumb" API-based retrieval for "smart" web client (see next slide) is too slow on its own.
Currently using pre-filled RAM-based cache to speed up API requests for web-users.
Not necessarily scalable. Still not very fast! See future plans...
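The pre-filled RAM cache idea can be sketched as follows. This is a minimal Python illustration, not the production setup: `slow_fetch` is a hypothetical stand-in for a slow API call, and `PrefilledCache` is a name invented here.

```python
import time
from typing import Callable, Dict, Iterable


def slow_fetch(sample_id: str) -> dict:
    """Hypothetical stand-in for a slow API request that builds a sample record."""
    time.sleep(0.01)  # simulate database and object-mapping overhead
    return {"id": sample_id, "type": "sample"}


class PrefilledCache:
    """In-memory cache populated up front, before any web request arrives."""

    def __init__(self, fetch: Callable[[str], dict], ids: Iterable[str]) -> None:
        self._fetch = fetch
        # Pay the slow retrieval cost once, at start-up, for every known ID.
        self._store: Dict[str, dict] = {i: fetch(i) for i in ids}
        self.misses = 0

    def get(self, key: str) -> dict:
        """Serve from RAM; fall back to the slow path only for unknown keys."""
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._fetch(key)
        return self._store[key]


cache = PrefilledCache(slow_fetch, ["S1", "S2"])
record = cache.get("S1")   # served from memory, no slow call
late = cache.get("S3")     # not pre-filled, so this one takes the slow path
```

Memory use and warm-up time both grow with the number of records, which is one way to read the slide's "not necessarily scalable" caveat.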

{"sample_manipulations":[], "name":"G ", "species_identification_assays":[{"result_summary":" Anopheles arabiensis (PCR-based species identification)", "name":"G species", "description":null, "props":[{"cvterms":[{"name":"species assay result", "accession":"VBcv: "}, {"name":"Anopheles arabiensis", "accession":"VBsp: "}]}], "protocols":[{"props":[], "name":"VBA :PROTO2", "type":{"name":"PCR-based species identification", "accession":"MIRO: "}, "description":"Mosquito DNA was extracted from the carcass and identified to species and molecular form using rDNA-based PCR assays.", "uri":""}], "performers":[], "id":"VBA ", "type":"species identification assay"}], "species":{"name":"Anopheles arabiensis", "accession":"VBsp: "}, "description":null, "genotype_assays":[{"result_summary":"inversion: 2La/a; inversion: 2Rjb/b (cytological chromosome examination)", "genome_browser_path":null, "name":"G karyotyping", "description":null, "genotypes":[{"uniquename":"VBA :2La/a", "props":[{"value":"2La/a", "cvterms":[{"name":"inversion", "accession":"SO: "}]}, {"value":"2L", "cvterms":[{"name":"chromosome_arm", "accession":"SO: "}]}], "name":"2La/a", "type":{"name":"paracentric_inversion", "accession":"SO: "}, "description":"inversion: 2La/a"}, {"uniquename":"VBA :2Rjb/b", "props":[{"value":"2Rjb/b", "cvterms":[{"name":"inversion", "accession":"SO: "}]}, {"value":"2R", "cvterms":[{"name":"chromosome_arm", "accession":"SO: "}]}], "name":"2Rjb/b", "type":{"name":"paracentric_inversion", "accession":"SO: "}, "description":"inversion: 2Rjb/b"}], "vcf_file":null, "props":[], "protocols":[{"props":[{"value":"microscope manufacturer: Olympus", "cvterms":[{"name":"protocol component", "accession":"VBcv:autocreated:protocol component"}]}, {"cvterms":[{"name":"protocol component", "accession":"VBcv:autocreated:protocol component"}, {"name":"Giemsa staining", "accession":"IDOMAL: "}]}], "name":"VBA :PROTO3", "type":{"name":"cytological chromosome examination", "accession":"MIRO: "}, 
"description":"Ovaries were prepared for karyotype analysis according to standard procedures. The banding pattern was observed under a phase-contrast microscope (400×) and interpreted with reference to the chromosomal map and nomenclature of Coluzzi and colleagues. ", "uri":""}], "performers":[], "type":"genotype assay", "id":"VBA "}], "props":[{"cvterms":[{"name":"sex", "accession":"EFO: "}, {"name":"female", "accession":"PATO: "}]}, {"cvterms":[{"name":"developmental stage", "accession":"EFO: "}, {"name":"adult", "accession":"IDOMAL: "}]}], "field_collections":[{"result_summary":"Burkina Faso (pyrethrum spray catch)", "name":"G collect", "description":null, "geolocation":{"longitude":" ", "props":[{"cvterms":[{"name":"collection site", "accession":"VBcv: "}, {"name":"Burkina Faso", "accession":"GAZ: "}]}, {"value":"Bonsse", "cvterms":[{"name":"location", "accession":"VBcv: "}]}, {"value":"Burkina Faso", "cvterms":[{"name":"country", "accession":"VBcv: "}]}], "latitude":" ", "geodetic_datum":"WGS 84", "name":"Burkina Faso", "altitude":null}, "props":[{"value":" ", "cvterms":[{"name":"date", "accession":"VBcv: "}]}], "protocols":[{"props":[], "name":"VBA :PROTO1", "type":{"name":"pyrethrum spray catch", "accession":"MIRO: "}, "description":"Freshly-fed female An. gambiae s.l. were collected in the morning while resting inside human dwellings by manual aspiration with the aid of electrical aspirators. Mosquitoes were kept in small cages wrapped in wet towels and stored inside cool boxes. Additionally, indoor insecticide space-sprays were carried out in the early afternoon.", "uri":"\n"}], "performers":[], "type":"field collection", "id":"VBA "}], "species_qualifications":[{"name":"unambiguous", "accession":"VBcv:autocreated:unambiguous"}], "type":{"name":"individual", "accession":"EFO: "}, "id":"VBS ", "phenotype_assays":[]}

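A client might consume the JSON record above as sketched below, in Python. The excerpt is a trimmed copy of a few fields from that record (identifiers left blank as in the transcript), and `summarise` is a helper name invented for this example, not part of any VectorBase API.

```python
import json

# Trimmed excerpt of the sample record shown above; identifiers are left
# blank exactly as they appear in the transcript.
EXCERPT = """
{"name": "G ",
 "species": {"name": "Anopheles arabiensis", "accession": "VBsp: "},
 "species_identification_assays": [
   {"result_summary": " Anopheles arabiensis (PCR-based species identification)",
    "type": "species identification assay"}],
 "genotype_assays": [
   {"result_summary": "inversion: 2La/a; inversion: 2Rjb/b (cytological chromosome examination)",
    "type": "genotype assay"}],
 "field_collections": [
   {"result_summary": "Burkina Faso (pyrethrum spray catch)",
    "type": "field collection"}],
 "phenotype_assays": []}
"""


def summarise(sample: dict) -> dict:
    """Collect the species name and the per-assay result summaries."""
    assay_keys = sorted(
        key for key in sample
        if key.endswith("_assays") or key == "field_collections"
    )
    return {
        "species": sample["species"]["name"],
        "summaries": {
            key: [assay["result_summary"].strip() for assay in sample[key]]
            for key in assay_keys
        },
    }


summary = summarise(json.loads(EXCERPT))
```

Because each assay carries its own `result_summary`, a client can render a one-line overview per assay without walking the nested `props` and `cvterms` arrays.
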
Web interface PopBio browser: A good example project page: New entry page currently in development:

Web interface Plan to develop or modify something similar to MalariaGen's Panoptes with richer/more flexible metadata capabilities:

Plans
● Map interface: delivery for June (VB ) release and present/demo at Kolymbari, ICEMR meetings
● Spreadsheet submission wizard development scheduled for Fall
● Year 2: Sample x genotype browser development, including e! REST and variation Solr work.
● Year 2: Refactor project pages with scalable (but still flexible) data transfer (probably also Solr-driven) & update graphics.