Databases & Applications

1 Databases & Applications
J. da Silva 2/16/2019 NC BioGrid
Jack da Silva, PhD, Bioinformatics Specialist, NCSC
The BioGrid is all about easy access to bioinformatics resources, namely databases and applications.

2 Overview
Molecular Biology Databases; Bioinformatics Applications; User Interfaces; Research & Development; Summary
I will be describing the types of databases and applications that should be available to BioGrid users. Related to this is the question of what type of interface users will encounter when accessing these resources. Finally, I’ll mention some opportunities for collaborative research and development that will arise from this effort.

3 Molecular Biology Databases
Public Domain; NC Initiatives; Rest of the World; Commercial; NC BioGrid Database Service
There are numerous molecular biology databases in the public domain and in the commercial sector. Any of these can, and most should, be available on the BioGrid.

4 NCSC Public-Domain Databases
High-Performance Bioinformatics Initiative. Major sequence repositories: GenBank, EMBL, DDBJ, etc.; formatted for GCG & BLAST. ExPASy (Expert Protein Analysis System) mirror site: peptide databases & associated tools; SWISS-PROT Knowledgebase.
At the North Carolina Supercomputing Center we house copies of the data from the major sequence repositories, as do many other groups in the State and elsewhere. These comprise all of the publicly available nucleotide sequences and their peptide translations. We actually have two copies, one formatted for the GCG Wisconsin Package of bioinformatics programs, and another formatted for BLAST searches. We are also a mirror site for ExPASy, a web portal to a highly regarded collection of peptide databases and associated tools, featuring the SWISS-PROT Knowledgebase.

5 Specialized Public-Domain Databases & NC Initiatives
Value-added: highly annotated (e.g., interactions); organism specific (e.g., human); molecule specific (e.g., protein); data specific (e.g., gene expression). North Carolina Initiatives: please come forward.
SWISS-PROT is an example of a specialized database. These are value-added databases. They tend to be highly annotated, and some are organism specific, molecule specific, or data specific. There are a number of database projects of this type in North Carolina, and these clearly would be a valuable resource on the BioGrid. So, if you are an owner of one of these databases and would like to provide access through the BioGrid, please contact us.

6 Commercial Databases
Celera Genomics: assembled & annotated human & mouse genome databases +. DoubleTwist: assembled & annotated human genome database. LabBook: OSU Annotated Human Genome Database; free to academia. Incyte Genomics: human transcript database +.
A minority of specialized databases are commercial. I show a few examples of commercial databases containing human genome data.

7 Molecular Biology Databases Around the World (335)
Major Seq. Repositories (7); Comparative Genomics (7); Gene Expression (19); Gene ID & Structure (31); Genetic & Physical Maps (9); Genomic (49); Intermolecular Interactions (5); Metabolic Pathways & Cellular Regulation (12); Mutation (34); Pathology (8); Protein (51); Protein Sequence Motifs (18); Proteome Resources (8); Retrieval Systems & DB Structure (3); RNA Sequences (26); Structure (32); Transgenics (2); Varied Biomedical (18)
I’ve mentioned just a tiny fraction of the molecular biology databases in existence: at last count there are 335. The vast majority of these are in the public domain, and apart from the 7 major sequence repositories, these are specialized databases. Databases are shown here categorized by type of data. These categories include the major sequence repositories and many other sequence databases, such as whole genome databases, but also include non-sequence data, such as gene expression, intermolecular interactions, and metabolic pathways.
Baxevanis, A.D. Nucleic Acids Research 30: 1-12.

8
This diagram from Lion bioscience shows how some of these databases cross-reference each other.

9 NC BioGrid Database Service
Establish service: housing & updating data; public-domain & commercial; virtual data federation. Collaborative effort. High-bandwidth network environment (NCREN).
One goal of the BioGrid project is to make these databases available to its users. One way to achieve this goal is to establish an NC BioGrid Database Service that would house and update databases. Apart from providing easy access to data, another advantage of having many databases locally is that they would be accessible on a high-bandwidth network environment. However, there remains the problem of providing a simple and unified view of the data contained in these heterogeneous databases.

10 Federated Databases
Provide uniform access/view of heterogeneous databases. IBM DiscoveryLink: “Provides a single-format virtual database view of multiple heterogeneous data sources”. Lion bioscience SRS: “The power of SRS lies in its ability to effectively integrate heterogeneous data sources behind a single interface and integration framework.” Data standards development (e.g., XML).
There exist commercial solutions to this problem that deserve some attention. The two shown here, IBM’s DiscoveryLink and Lion bioscience’s SRS, take the approach of federating databases. This means that they use sophisticated querying methods to make it appear to the user that there is one large database. This problem is also related to the issue of data standards, such as Extensible Markup Language (XML), which allows the use of a single data format to move data between different applications.
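The idea of federation can be illustrated with a minimal sketch. It is not how DiscoveryLink or SRS are implemented; the two toy sources, the `Record` schema, and every function name below are hypothetical. Each source keeps its own native layout, a small adapter translates its hits into one common record format, and the user searches a single interface.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Record:
    """Common record format shown to the user (hypothetical schema)."""
    source: str
    accession: str
    description: str


# Two toy sources with deliberately different native layouts (made-up data).
SEQUENCE_DB = {"SEQ1": {"definition": "kinase mRNA"},
               "SEQ2": {"definition": "globin mRNA"}}
PROTEIN_DB = [("PROT1", "protein kinase"), ("PROT2", "hemoglobin beta")]


def query_sequence_db(term: str) -> Iterable[Record]:
    """Adapter for the dictionary-based source."""
    for accession, entry in SEQUENCE_DB.items():
        if term in entry["definition"]:
            yield Record("sequence_db", accession, entry["definition"])


def query_protein_db(term: str) -> Iterable[Record]:
    """Adapter for the tuple-based source."""
    for accession, description in PROTEIN_DB:
        if term in description:
            yield Record("protein_db", accession, description)


class Federation:
    """Presents several heterogeneous sources as one searchable database."""

    def __init__(self, adapters: List[Callable[[str], Iterable[Record]]]):
        self.adapters = list(adapters)

    def search(self, term: str) -> List[Record]:
        hits: List[Record] = []
        for adapter in self.adapters:  # fan the query out to every source
            hits.extend(adapter(term))
        return hits


federation = Federation([query_sequence_db, query_protein_db])
hits = federation.search("kinase")
for hit in hits:
    print(hit.source, hit.accession, hit.description)
```

The user never sees the native layouts; adding another source would only require another adapter.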

11 I3C Workflow Demo
Interoperable Informatics Infrastructure Consortium. Demo uses XML-in, XML-out paradigm.
The Interoperable Informatics Infrastructure Consortium deals with this issue and has developed a demonstration using XML in which users query databases, use the results in analyses provided by analysis services, and use the new results to conduct further searches of databases. The BioGrid should have representation on the I3C.
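A rough sketch of the XML-in, XML-out idea follows. It does not use the I3C schemas: the element names (`query`, `result`, `report`), the toy database, and both service functions are assumptions made for illustration. Each service accepts an XML string and returns an XML string, so the output of a database query can be passed directly to an analysis service.

```python
import xml.etree.ElementTree as ET

# Toy sequence database (made-up identifiers and sequences).
TOY_DB = {"seq1": "ATGGCC", "seq2": "ATGAAATTT", "seq3": "CCCGGG"}


def database_service(request_xml: str) -> str:
    """XML in: <query><prefix>...</prefix></query>. XML out: matching sequences."""
    request = ET.fromstring(request_xml)
    prefix = request.findtext("prefix", default="")
    result = ET.Element("result")
    for seq_id, seq in sorted(TOY_DB.items()):
        if seq.startswith(prefix):
            ET.SubElement(result, "sequence", id=seq_id).text = seq
    return ET.tostring(result, encoding="unicode")


def analysis_service(result_xml: str) -> str:
    """XML in: a <result> document. XML out: GC fraction of each sequence."""
    result = ET.fromstring(result_xml)
    report = ET.Element("report")
    for node in result.findall("sequence"):
        seq = node.text or ""
        gc = sum(base in "GC" for base in seq) / len(seq) if seq else 0.0
        ET.SubElement(report, "gc", id=node.get("id")).text = f"{gc:.2f}"
    return ET.tostring(report, encoding="unicode")


# Chain the two services: the query result feeds the analysis unchanged.
query_xml = "<query><prefix>ATG</prefix></query>"
report_xml = analysis_service(database_service(query_xml))
print(report_xml)
```

Because every step speaks the same format, a further database search could be driven by the report in the same way.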

12 Bioinformatics Applications
Grid-Unaware; Grid-Aware; NC BioGrid Application Service
Databases are not much use without applications with which to extract and analyze data. For our purposes, these applications can be divided into those that are not aware that they are on a grid, and those that are.

13 Grid-Unaware Applications
Any application can run on a grid server: NCSC High-Performance Bioinformatics apps; public-domain apps on other NC servers; commercial apps on NC servers.
By grid-unaware, I mean any regular application sitting on a server that is part of the grid, because any such app can be run by a user of the grid. This includes all existing bioinformatics applications on NC servers.

14 NCSC Applications
High-Performance Bioinformatics: parallel applications optimized for parallel supercomputers; Accelrys GCG Wisconsin Package (commercial); BLAST & HT-BLAST; Parallel Clustal & HT-Clustal; parallel molecular systematics apps. ExPASy tools. High-performance molecular modeling packages (commercial).
At the NCSC, we have a unique collection of parallel applications as part of our High-Performance Bioinformatics Initiative. These are applications optimized to run in parallel on multiprocessor supercomputers. They include GCG, BLAST & HT-BLAST, parallel versions of Clustal and HT-Clustal, and parallel applications for evolutionary tree reconstruction. We also have the proteomics tools that are part of ExPASy, and various high-performance molecular modeling packages.

15 Public & Commercial Apps on NC Servers
Any public-domain application: open source, “freeware”. Commercial apps will vary in licensing from restrictive to relatively unrestrictive. Please come forward with suggestions.
There are many other applications on servers in NC that could be accessible via the BioGrid. These may be in the public domain or commercial. In the case of commercial software, grid administrative software would be used to restrict access according to the terms of each license. If you have applications that you would like to make available on the grid, or know of applications that you would like to see on the grid, please contact us.

16 FEATURE
“Grid-unaware”, public-domain application from the Russ Altman Lab, Stanford. Identifies functional or structural sites of interest in a protein. FEATURE is serial! Multiple instances run concurrently on NPACI-net LEGION grid test bed. Scanned entire PDB (10,911 structures) in ~10 hrs (177 hrs or 1 wk sequentially).
As an example of a grid-unaware application that has been used successfully on a grid, I would like to describe FEATURE from the Russ Altman lab at Stanford. FEATURE identifies functional or structural sites on a protein. Although FEATURE runs on only one processor, it was parallelized by running multiple instances concurrently on different nodes of the NPACI-net Legion grid test bed, with each instance scanning a different part of a large database of protein structures. This enabled a scan of the entire Protein Data Bank in about 10 hours, rather than the one week it would take on a single processor.
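As a back-of-the-envelope check, the figures quoted on this slide (177 hours sequentially, about 10 hours on the grid, 10,911 structures) imply roughly the following speedup and per-structure cost. The symbols are introduced here only for illustration.

```latex
\[
S = \frac{T_{\text{sequential}}}{T_{\text{grid}}} \approx \frac{177\ \text{h}}{10\ \text{h}} \approx 17.7,
\qquad
\frac{177\ \text{h}}{24\ \text{h/day}} \approx 7.4\ \text{days},
\qquad
\frac{177 \times 3600\ \text{s}}{10{,}911\ \text{structures}} \approx 58\ \text{s per structure}.
\]
```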

17 FEATURE Analysis
Briefly, FEATURE is trained to recognize protein sites of interest, and then the statistical model of sites is used to scan a structure database. The lower panel shows a graphical user interface for viewing the results of the scan.

18 FEATURE & the Grid
Compiled FEATURE code on LEGION for Intel Linux, DEC Alpha Linux, & Sun Solaris. Registered binaries into “LEGION space”. Provided file specifying where to find input and deposit output. Used legion_run_multi command to spawn multiple instances of FEATURE (np = 50) across nodes, each scanning a single file from the PDB.
These are the steps that were taken to run this serial application in parallel on a grid of heterogeneous computers. FEATURE was compiled for the various computer platforms on the grid. The compiled binaries were registered on the grid. A file was written that specified where on the grid input files were located and where output should be written. The legion_run_multi command was issued to spawn multiple instances of FEATURE across nodes, with each instance scanning a different Protein Data Bank file.
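The pattern described here, many independent instances of a serial program with one input file each, can be sketched on a single machine. This is only an analogy: it uses a local thread pool from Python's standard library rather than Legion, and `scan_structure`, `run_multi`, and the toy "structures" are hypothetical stand-ins, not FEATURE or legion_run_multi.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_structure(item):
    """Serial scan of one input; stands in for a single FEATURE instance.

    The toy 'site score' simply counts histidine (H) residues.
    """
    name, residues = item
    return name, residues.count("H")


# Made-up inputs: one entry per structure file to be scanned.
STRUCTURES = {"struct_a": "MHHK", "struct_b": "GAVL", "struct_c": "HCHH"}


def run_multi(scan, inputs, workers):
    """Run one serial scan per input concurrently and collect the outputs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(scan, inputs.items()))


results = run_multi(scan_structure, STRUCTURES, workers=3)
print(results)
```

No instance needs to communicate with any other, which is why a serial program can be spread across nodes without modifying its code.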

19 Grid-Aware Applications
Not many: production grids don’t exist. TurboBLAST (TurboGenomics): commercial; not marketed specifically to grids; distributes BLAST search over heterogeneous network of computers.
What about grid-aware applications? These are applications that automate to some extent what I’ve just described for FEATURE’s implementation on a grid. Well, there aren’t many, since production grids don’t exist yet. One grid-aware application that I’m aware of is TurboBLAST. This is a commercial application that’s not explicitly marketed to grids, but is designed to run on a network of heterogeneous computers.

20 TurboBLAST
TurboBLAST works on a grid in much the same way that I described for FEATURE, except that most of it is automated. TurboBLAST takes a master database to be queried and distributes subsets over the grid. A master node then distributes requests over individual nodes, each node running an instance of NCBI BLAST, thus parallelizing the search.
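The split-search-merge pattern described for TurboBLAST can be sketched as follows. This is not TurboBLAST's implementation and the scoring is not BLAST: the shared 3-mer count is a toy stand-in, the sequences are made up, and all function names are hypothetical. The master database is partitioned, each subset is searched independently (here in a loop, on a grid by separate nodes), and the hits are merged and ranked.

```python
def kmers(seq, k=3):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def split_database(records, n_nodes):
    """Round-robin partition of the master database into n_nodes subsets."""
    return [records[i::n_nodes] for i in range(n_nodes)]


def search_subset(query, subset):
    """Stand-in for one search instance: score = number of shared 3-mers."""
    query_kmers = kmers(query)
    return [(name, len(query_kmers & kmers(seq))) for name, seq in subset]


def distributed_search(query, records, n_nodes):
    """Search every subset independently, then merge and rank the hits."""
    hits = []
    for subset in split_database(records, n_nodes):  # one subset per node
        hits.extend(search_subset(query, subset))
    positive = [hit for hit in hits if hit[1] > 0]
    return sorted(positive, key=lambda hit: (-hit[1], hit[0]))


MASTER_DB = [("s1", "ATGGCCA"), ("s2", "TTTTTTT"),
             ("s3", "GGCCATG"), ("s4", "ATGGC")]
hits = distributed_search("ATGGCC", MASTER_DB, n_nodes=2)
print(hits)
```

Because each subset is scored independently, the merged ranking is the same however many nodes the database is split across.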

21 TurboBLAST
This is the web-browser user interface to TurboBLAST.

22 NC BioGrid Application Service
Establish service: housing & updating binaries, source code, documentation; public-domain & commercial. Collaborative effort. High-bandwidth network environment (NCREN). Cross-referenced to databases (NCBDS).
There are many grid-unaware applications, which are nonetheless useful resources on a grid, and which should be made available to users. One way of doing this is to establish an application service that would house, update, and register executables for various platforms on the BioGrid. These could then be cross-referenced to databases, which would suggest one possible form of user interface to the BioGrid.

23 One View of the User’s Views
Database centric: cross-referenced to appropriate applications. Application centric: cross-referenced to appropriate databases. Analysis centric: references appropriate databases & applications; suggests workflows.
A user could then be given the option of a database-centric view of the grid, with databases cross-referenced to relevant applications, or an application-centric view of the grid, with applications cross-referenced to relevant databases. A third option would be an analysis-centric view, where a user could choose an analysis type, which would then reference relevant databases and applications. This view could also suggest appropriate workflows for the analysis.
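One possible way to back all three views with a single cross-reference table is sketched below. The pairings and analysis types are illustrative assumptions, not a catalogue of what the BioGrid would offer, and the function names are hypothetical.

```python
from collections import defaultdict

# Illustrative (application, database) pairings; assumptions for this sketch.
CROSS_REFS = [
    ("BLAST", "GenBank"),
    ("BLAST", "SWISS-PROT"),
    ("Clustal", "SWISS-PROT"),
    ("GCG", "GenBank"),
]

# Illustrative mapping from analysis type to the applications that perform it.
ANALYSES = {
    "similarity search": ["BLAST"],
    "multiple alignment": ["Clustal"],
}


def database_view():
    """Database-centric view: each database with its relevant applications."""
    view = defaultdict(list)
    for app, db in CROSS_REFS:
        view[db].append(app)
    return {db: sorted(apps) for db, apps in view.items()}


def application_view():
    """Application-centric view: each application with its relevant databases."""
    view = defaultdict(list)
    for app, db in CROSS_REFS:
        view[app].append(db)
    return {app: sorted(dbs) for app, dbs in view.items()}


def analysis_view(analysis):
    """Analysis-centric view: applications and databases for one analysis type."""
    apps = sorted(ANALYSES.get(analysis, []))
    by_app = application_view()
    dbs = sorted({db for app in apps for db in by_app.get(app, [])})
    return {"applications": apps, "databases": dbs}


print(database_view())
print(application_view())
print(analysis_view("multiple alignment"))
```

All three views are derived from the same table, so registering one new pairing updates every view at once.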

24 User Interfaces to the BioGrid
Single sign-on. Simple, graphical. Allow user to “see” everything on grid. Give the impression that resources are on user’s desktop.
This naturally brings us to user interfaces to the BioGrid. These must require only a single sign-on, be simple and graphical, and allow the user to see anything on the grid, giving the impression that it’s all sitting there on her desktop computer.

25 UNICORE Grid Technology
This is an example. It’s the interface to a grid based on UNICORE grid technology, the technology used by the EuroGrid project.

26 European DataGrid Simulator
This is an interface to a simulation of the European DataGrid. It shows the name of the input file, the database being queried and the algorithm being used. It also shows what each node on the grid is doing and a cartographic representation of what’s going on. For instance, it shows that CERN and CPPM are sending results to the central node in Dusseldorf, and that LPC is receiving its jobs.

27 Vanet LEGION Grid Test Bed (US Nodes)
Finally, this is an example from the Vanet Legion grid project, showing the status of its US nodes.

28 Research & Development Opportunities
Uniform access/view of data. “Gridize” applications. Database & application services. User interface development. Collaboration required, spanning the academic-commercial boundary.
Accomplishing all this will require considerable effort and provides many research and development opportunities. There is the problem of data standards, which can be pursued through the I3C. There is the need to “gridize” applications, to harness the power of grids. There is the possibility of establishing database and application services as part of the BioGrid. There is the need to develop a simple user interface. This will probably require a great deal of collaboration that will likely span the academic-commercial boundary in many cases.

29 Summary
The NC BioGrid aims to provide easy, high-bandwidth access to: databases; applications. Opportunities for collaborative R&D.
To summarize, the BioGrid is all about providing easy, fast access to a great variety of growing databases and the applications needed to analyze the data. Achieving this will require considerable collaborative research & development.

