1 biology.sdsc.edu SAN DIEGO SUPERCOMPUTER CENTER NIGMS The Biology Workbench – a community tool for teaching and research Mark A. Miller Principal Investigator, Biology San Diego Supercomputer Center

2 SDSC Mission: To serve as a premier resource for design, development, and deployment of cyberinfrastructure for the national scientific community.

3 What is Cyberinfrastructure anyway? [Diagram labels: Compute Resources; DataBases; Wet Labs; Clinical Labs; Production; Research] Then, after many months or years of struggle…

4 Cyberinfrastructure (We Think) Life (and Other) Scientists Need [Diagram labels: Compute Resources; DataBases; Research; Global Data Providers; Wet Labs; Clinical Labs; Grid Resources; Grid Services; Web Services; Personal Electronic Notebook; Discovery Portal; Structure Tools; Sequence Tools; Microarray Tools; D.L.; Workflow; Data Capture Portals; Integration Software; Data Deposition Portals; Development; Production]

5 SDSC Production Resources for HEC and Grid Computing. Tools we provide to the community for U.S. NSF: Allocations on large architectures via NRAC: DataStar; TeraGrid; Blue Gene. Allocations for data collection storage: 1 PB of online disk space; 12 PB of tape space. User Services: Allocation awards are accompanied by personal service to get you going. Everyone receives courteous advice and assistance! Development allocations are awarded on request. Software Services: Rocks cluster management tools; Storage Resource Broker (SRB); the Kepler Workflow Tool. http://www.sdsc.edu/user_services/allocations/

6 What is the Next Generation Tools for Biology Group? Use the Resources of SDSC to Focus on: Both research and development. Activities that can be uniquely conducted at SDSC. Activities that partner with other institutions. Activities that are community-building. Science is the driver.

7 Overview: Next Generation Tools for Biology at SDSC. Current Projects at SDSC: IBM Institute for Innovation in Biomedical Simulations and Imaging (IBM-I3). Cyberinfrastructure for Phylogenetic Research (CIPRES). The Next Generation Biology Workbench.

11 Next Generation Tools for Biology. Current Products: CIPRES middleware; CIPRES portal; CIPRES/Kepler workflow; Biology Workbench.

12 CIPRES middleware: SDK/libraries for Win/Mac/Linux. CORBA service architecture allows interactive access to tools across platforms. Currently supports tree inference/improvement. Can be accessed through Mesquite.

13 Portal for Tree Inference. Supports: Parsimony (PAUP); Max Likelihood (RAxML, GARLI). Coming Soon: User configurability (via applet); MrBayes; POY; Sate.

14 CIPRES/Kepler workflow. http://www.phylo.org/sub_sections/kepler_workflow/help/creation.htm Status: Proof of Concept Systematics Feature Set; In Usability Development. Supports: Iteration; Check-pointing; Data Forking; Data Transfer and deposition; Web services; Provenance Tracking.

15 The (current) Biology Workbench: Created 1996-1997 at NCSA by Shankar Subramaniam, Eric Jakobsson, Roger Unwin, Brian Saunders, Mark Stupar, Dawn Cotter, Jim Fenton, Curt Jamison, Brad Mills, George Pappas, David Tcheng.

16 The original concept behind BWB: “Wouldn't it be nice if there was a web site that would let me run BLAST, CLUSTALW, etc. on my collection of sequences, or a collection of sequence alignments, and let me store the results?”

17 Current Workbench Properties. From a single browser interface, one can access: 66 individual tools; sequences from 33 databases. Individual login password security provided. Data storage area provided for results. No required plug-ins or downloads. Can be (and is) used over phone modem. All calculations provided by the Workbench Server.

18 Annual WB usage, ’00–’03 [Chart series: Users; Jobs]

20 Some BW user statistics: 71% of the user base is domestic: 44% are academic, 15% noncommercial, 11% commercial, 1% government. The 29% international user population represents over 40 countries. 50% of present users employ the BW for government-funded research programs. 48% of BW users are involved in education.

21 Cyberinfrastructure Provided by the Workbench [Diagram labels: Grid Resources; Wet Labs; Clinical Labs; Grid Services; Web Services; Personal Electronic Notebook; Discovery Portal; Data Capture Portals; Data Deposition Portals; Global Data Providers; Compute Resources; Structure Tools; Sequence Tools; Microarray Tools; D.L.; Workflow; Integration Software; Workbench DataBase; Data Integration; Data storage area; Tools]

22 Overall Architecture of the Biology Workbench [Diagram labels: Browser; Web Server; bw.cgi; Ndjinn; Wrapper; html.pl; Software Tools; Databases; Session Storage; User Data Storage; Indexing]

23 Current Data Integration System [Diagram labels: Public DBs; Cron job: ftp download; Parser; Flat file Swissprot Database; Flat file GenBank Database; Databases; Lookup Table; NDJINN; Web Server; ?; User Data Storage]
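
The flat-file-plus-lookup-table arrangement on this slide can be illustrated with a minimal sketch. This is not Ndjinn's code: the function names `build_lookup` and `fetch_record` are hypothetical, the sample records are made up, and the sketch assumes Swiss-Prot style flat files in which an `AC` line lists accession numbers and a `//` line ends each record.

```python
import io

def build_lookup(handle):
    """Map each accession number to the byte offset where its record starts."""
    lookup = {}
    record_start = handle.tell()
    while True:
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"AC"):
            # An AC line looks like "AC   X00001; X00009;"
            for acc in line[2:].decode("ascii").split(";"):
                acc = acc.strip()
                if acc:
                    lookup[acc] = record_start
        elif line.startswith(b"//"):
            # "//" closes a record, so the next record starts right after it.
            record_start = handle.tell()
    return lookup

def fetch_record(handle, lookup, accession):
    """Seek straight to a record instead of scanning the whole flat file."""
    handle.seek(lookup[accession])
    lines = []
    for line in iter(handle.readline, b""):
        lines.append(line.decode("ascii"))
        if line.startswith(b"//"):
            break
    return "".join(lines)

# Made-up sample data in the assumed layout (not real database entries).
SAMPLE = (
    b"ID   REC1_TEST\n"
    b"AC   X00001;\n"
    b"DE   Made-up example record one.\n"
    b"//\n"
    b"ID   REC2_TEST\n"
    b"AC   X00002;\n"
    b"DE   Made-up example record two.\n"
    b"//\n"
)

flat_file = io.BytesIO(SAMPLE)
table = build_lookup(flat_file)
print(fetch_record(flat_file, table, "X00002"))
```

The point of the lookup table is that a search result only needs an offset to retrieve the full record; the trade-off, noted later in the talk, is that the record body stays unstructured text.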

24 Ndjinn Multiple Database Search: The "Ndjinn Multiple Database Search" allows the user to specify the databases to be searched.

25 Constructing Queries: User-selected databases may be searched for text. Permitted text searches are "Contains", "Begins With", "Ends With", or "Exact Match". Boolean operators "AND", "NOT", or "OR" may also be used; search order is controlled by parentheses. Example: (myoglobin AND human) OR orangutan
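
To make the query rules concrete, here is a small sketch of how such a Boolean text query could be evaluated against one record. It is an illustration only, not the Workbench's search code. Two things are assumptions: `NOT` is treated as a prefix operator, and without parentheses `NOT` binds tighter than `AND`, which binds tighter than `OR`. The function names are hypothetical.

```python
import re

def term_matches(term, text, mode="contains"):
    """Apply one of the four match modes, ignoring case."""
    term, text = term.lower(), text.lower()
    if mode == "contains":
        return term in text
    if mode == "begins":
        return text.startswith(term)
    if mode == "ends":
        return text.endswith(term)
    if mode == "exact":
        return text == term
    raise ValueError(f"unknown mode: {mode}")

def evaluate(query, text, mode="contains"):
    """Evaluate a query such as '(a AND b) OR c' against a text field."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        value = parse_and()
        while peek() == "OR":
            advance()
            rhs = parse_and()  # always parse the right side to consume tokens
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "AND":
            advance()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() == "NOT":
            advance()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        tok = advance()
        if tok == "(":
            value = parse_or()
            if advance() != ")":
                raise ValueError("expected ')'")
            return value
        if tok in (")", "AND", "OR", "NOT"):
            raise ValueError(f"unexpected token: {tok}")
        return term_matches(tok, text, mode)

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result

query = "(myoglobin AND human) OR orangutan"
print(evaluate(query, "Myoglobin, human"))       # True
print(evaluate(query, "Hemoglobin, orangutan"))  # True
print(evaluate(query, "Myoglobin, horse"))       # False
```

The parentheses in the slide's example matter: they force the `AND` to be resolved before the `OR`, so an orangutan record matches whether or not it mentions myoglobin.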

26 Introducing SWAMI: The Next Generation Biology Workbench (www.ngbw.org)

27 biology.sdsc.edu SAN DIEGO SUPERCOMPUTER CENTER NIGMS Why SWAMI? SWAMI = Master We'll all be planning out a route We're gonna take real soon We're waxing down our surfboards We can't wait for June We'll all be gone for the summer We're on surfari to stay Tell the teacher we're surfin' Surfin' U.S.A. Haggerties and Swamies Pacific Palisades San Onofre and Sunset Redondo Beach L.A. All over La Jolla At Waimia Bay Everybody's gone surfin' Surfin' U.S.A.

28 The User Says: "There should be a new Biology Workbench web site that can provide better search tools, support protein structure investigations, and allow my students to share files…"

29 The Developer Hears: "There should be a web site that can host all the user's biological data (not just sequences) and allow them to analyze it using any modern tool they choose."

30 New Workbench Architecture Ideas: Take 1. Web Services [Diagram labels: Grid Services; Web Services; Structure Tools; Sequence Tools; Microarray Tools; Compute Resources; Global Data Providers; D.L.; Local DataBases; Integration Software; Personal Electronic Notebook; Discovery Portal; Data Deposition Portals; Workflow; Wet Labs; Clinical Labs; Registry/Discovery] Computing and data management are handled at remote sites.

31 New Workbench Architecture Ideas: Take 1. Web Services

32 New Workbench Architecture Ideas: Take 1. Web Services. Issues: Tools: No control over tool availability. Published tool registries are weak. Robust tool descriptions (UDDI) pose enormous overhead. Data: Can’t query across all data sources. Unknown bandwidth and reliability of remote data sources. API of remote data sources can change without warning. This approach is too loosely coupled!

33 Reality Strikes: Priorities must be ordered. "There should be a web site that can host all the user's biological data (not just sequences) and allow them to analyze it using any modern tool they choose."

34 The Developer Concludes: "There should be a web site that can host all users' biological data (not just sequences) and allow them to analyze it using any modern tool they choose, with as many tools as possible, with enterprise-class stability…"

35 New Workbench Architecture Ideas: Take 2. Enterprise Solution [Diagram labels: D.L.; Structure Tools; Sequence Tools; Microarray Tools; Compute Resources; Global Data Providers; Local Data Warehouse; Integration Software; Personal Electronic Notebook; Discovery Portal; Data Deposition Portals; Workflow; Wet Labs; Clinical Labs] Computing and data management are handled locally.

36 New Workbench Architecture Ideas: Take 2. EJB

38 New Workbench Architecture Ideas: Take 2. EJB. Issues: Architecture has 8 separate modules. A change in any module breaks 1-7 others. Only a developer who can get zen with EJB can contribute to the development. Modifying a web page becomes a task that a web artist cannot manage alone. After 12 months of development, we can log in? This approach has too much overhead!

39 Reality Strikes Again: Priorities must be re-ordered. "There should be a web site that can host all users' biological data (not just sequences) and allow them to analyze it using any modern tool they choose, with as many tools as possible, with enterprise-class stability…"

40 The User Re-states: "There should be a web site that can provide better search tools, allow my students to share files, and allow me to analyze my data using any modern tool I choose, with as many tools as possible, with enterprise-class stability and with enough stability so I can teach reliably… as soon as is humanly possible…"

41 New Workbench Architecture Ideas: Take 3. Integrated, Stable Solution: Tomcat/Java/Struts2/Hibernate/MySQL/Lucene. This approach is just right?

42 Lesson Number 1: Get the user requirements right in the beginning.

43 The NEW Workbench will improve on the existing functionalities: Improved Data Handling [Diagram labels: Grid Resources; Wet Labs; Clinical Labs; Grid Services; Web Services; Personal Electronic Notebook; Discovery Portal; Data Capture Portals; Data Deposition Portals; Global Data Providers; Compute Resources; Structure Tools; Sequence Tools; Sequencing Tools; D.L.; Workflow; Integration Software; Workbench Data Warehouse]

44 Improved Data Handling [Diagram labels: Browser; Web Server; bw.cgi; Ndjinn; Wrapper; html.pl; Databases; Session Storage; User Data Storage; Flat files; Data Providers; Indexing] The toolkit is limited by the ability to handle only sequences and alignments. The ability to search is limited by storing data as free (unstructured) text.

45 Improved Data Handling: Improve Search Techniques. Lucene indexing allows us to replace the single text match string with the ability to search on specific fields.
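
The difference between a single text match and a field-specific search can be shown with a toy inverted index. This sketch does not use Lucene or imitate its API; the class `FieldIndex`, the field names, and the sample records are all hypothetical and exist only to illustrate the idea.

```python
from collections import defaultdict

class FieldIndex:
    """Toy inverted index keyed by (field, term)."""

    def __init__(self):
        self._postings = defaultdict(set)  # (field, term) -> record ids

    def add(self, record_id, fields):
        """Index every whitespace-separated term under its field name."""
        for field, value in fields.items():
            for term in value.lower().split():
                self._postings[(field, term)].add(record_id)

    def search(self, field, term):
        """Field-specific search: the term must occur in that field."""
        return set(self._postings.get((field, term.lower()), set()))

    def search_any_field(self, term):
        """Free-text style search: the term may occur in any field."""
        term = term.lower()
        hits = set()
        for (_, indexed_term), ids in self._postings.items():
            if indexed_term == term:
                hits |= ids
        return hits

# Made-up records for illustration only.
index = FieldIndex()
index.add("rec1", {"description": "myoglobin", "organism": "human"})
index.add("rec2", {"description": "human growth factor", "organism": "mouse"})

print(sorted(index.search_any_field("human")))    # ['rec1', 'rec2']
print(sorted(index.search("organism", "human")))  # ['rec1']
```

The free-text search returns the mouse record because "human" appears in its description; restricting the search to the organism field removes that false hit.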

46 Improved Data Handling: User data stored in RDB. Allow user to import and annotate data of many types, including a generic, unknown type. User-entered sequences and results are stored and annotated along with other user-selected sequences. Use of the RDB makes it possible to repurpose data easily.

47 The NEW Workbench will improve on the existing functionalities: Improved Tool Selection [Diagram labels: Grid Resources; Wet Labs; Clinical Labs; Grid Services; Web Services; Personal Electronic Notebook; Discovery Portal; Data Capture Portals; Data Deposition Portals; Global Data Providers; Compute Resources; Structure Tools; Sequence Tools; Sequencing Tools; D.L.; Workflow; Integration Software; Workbench Data Warehouse]

48 New Discovery Portal Step 1. Improved User Access to Tools [Diagram labels: Browser; Web Server; bw.cgi; Wrapper; html.pl; Software Tools; Tool Broker Service; SWAMI; XML.jsp; PISE XML; Session Storage; User Data Storage] PISE currently has 300+ interfaces. Lesson Number 2: Software development is incredibly expensive. Build nothing you can steal. Steal from the best.
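
The appeal of driving tools from XML descriptions is that one generic broker can serve many tools without per-tool wrapper code. The sketch below illustrates that idea only. The XML layout is a simplified, hypothetical schema and not the actual PISE format; the tool `demo_align`, its flags, and the function `build_command` are likewise invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical tool description; real PISE XML is structured differently.
TOOL_XML = """
<tool name="demo_align" command="demo_align">
  <param name="infile" flag="--in=" required="true"/>
  <param name="type" flag="--type=" default="protein"/>
  <param name="outfile" flag="--out="/>
</tool>
"""

def build_command(tool_xml, values):
    """Turn a tool description plus user-supplied values into an argv list."""
    tool = ET.fromstring(tool_xml.strip())
    argv = [tool.get("command")]
    for param in tool.findall("param"):
        name = param.get("name")
        # Prefer the user's value, then the declared default (may be None).
        value = values.get(name, param.get("default"))
        if value is None:
            if param.get("required") == "true":
                raise ValueError(f"missing required parameter: {name}")
            continue  # optional parameter left unset
        argv.append(f"{param.get('flag')}{value}")
    return argv

print(build_command(TOOL_XML, {"infile": "seqs.fasta"}))
# ['demo_align', '--in=seqs.fasta', '--type=protein']
```

Adding a tool under this design means adding a description file rather than writing new code, which is in the spirit of the slide's "build nothing you can steal".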

49 The NEW Workbench will improve on the existing functionalities: Improved Portal [Diagram labels: Grid Resources; Wet Labs; Clinical Labs; Grid Services; Web Services; Personal Electronic Notebook; Discovery Portal; Data Capture Portals; Data Deposition Portals; Global Data Providers; Compute Resources; Structure Tools; Sequence Tools; Sequencing Tools; D.L.; Workflow; Integration Software; Workbench Data Warehouse]

50 User-Requested ToolKits: Structural Biology: Tools to visualize protein structures. Molecular Biology: Tools to assemble contigs. Tools to visualize sequencer output. Role-Based Logins: Licensed tools can be mounted for individual users. Instructors and students have separate roles. Folder sharing for collaborative work. NO BROWSER PLUGINS. NO SUDDEN CHANGES.

51 Sneak Preview: http://snooker.sdsc.edu/web

52 New Discovery Portal Next Steps: Improved User Access to Data

54 Tools to assemble contigs. Tools to visualize sequencer output. The AMOS consortium at TIGR produces: BAMBUS, a genome sequence scaffolding program; AutoEditor, a tool for correcting sequencing and basecaller errors using sequence alignment and chromatogram data; Assembler, a tool for assembly of large sets of overlapping sequence data such as ESTs, BACs, or small genomes; LUCY, a sequence cleanup program that prepares raw DNA sequence fragments for sequence assembly.

55 The NEW Workbench will also create new infrastructure: Pipelining Capabilities [Diagram labels: Grid Resources; Wet Labs; Clinical Labs; Grid Services; Web Services; Personal Electronic Notebook; Discovery Portal; Data Capture Portals; Data Deposition Portals; Global Data Providers; Compute Resources; Structure Tools; Sequence Tools; Microarray Tools; D.L.; Workflow; Integration Software; Workbench DataBase]

56 Web-Based Workflow Capability [Diagram labels: Input; Tool 1; Output; Send output to: Tool 2 through Tool 10]
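
The "send output to the next tool" idea can be sketched as a simple chain in which each tool's output becomes the next tool's input. This is an illustration under assumptions, not the Workbench's workflow engine: `run_pipeline` is a hypothetical name, and the two "tools" are trivial stand-ins for real analysis programs.

```python
def remove_gaps(sequence):
    """Stand-in tool: strip alignment gap characters."""
    return sequence.replace("-", "")

def to_upper(sequence):
    """Stand-in tool: normalize the sequence to upper case."""
    return sequence.upper()

def run_pipeline(data, tools):
    """Feed the data through each tool in order, keeping every intermediate."""
    history = [("input", data)]
    for tool in tools:
        data = tool(data)
        history.append((tool.__name__, data))
    return data, history

result, steps = run_pipeline("ac-gt", [remove_gaps, to_upper])
print(result)  # ACGT
for name, value in steps:
    print(name, value)
```

Keeping the intermediate results, rather than only the final one, matches the Workbench habit of storing each tool's output so the user can branch from any step.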

57 Notebook Capability: The Notebook will feature a local database to store results of computations and searches, notify you when new updates are available, and enable peer-to-peer data sharing. http://www.notebookproject.org

58 We Need YOU! Suggest features you need at customerservice@ngbw.org. Look at and provide feedback on our pre-alpha at http://snooker.sdsc.edu/web

59 Who Did the Work? Current WB: Brian Saunders, Shankar Subramaniam, Andrea Maer. Current NGBW: Roger Unwin, Rami Rifaieh, Hannes Niedner, Jeremy Carver, Ashton Taylor. “The BOSS”: Celeste Brown (University of Idaho, Moscow). NGBW Alumni: Andy Zhang, Kevin Fowler. CIPRES Team: Mark Holder (Kansas), Terri Liebowitz, Paul Hoover, Lucie Chan, Peter Midford, Rutger Vos. Kepler Project: Ilkay Altintas, Zhijie Guan.

