Bioinformatics workflow management Thoughts and case studies from industry. Mark Schreiber, Bioinformatics Research Investigator WWWFG, 5-7 June 2007
Outline
- Integration and workflows
- Early attempts
- Case studies and examples
- What does the future hold?
- Conclusions

Bioinformatics at NITD
- BI combines data gathering, data storage and knowledge management with analytical tools to present complex and competitive information to planners and decision makers.
- Hypothesis generation and validation.
- Providing the right information at the right time.
- Decision support.

Data Sources: Heterogeneity
The most significant research is done when heterogeneous data sources can be combined in one analysis.
Data source and how it is accessed:
- Webpages / Services: data scrapers, CGI-Bin, WS-clients
- Flatfiles: parsers (one per format), BioJava/BioPerl
- XML: parser frameworks
- Images / Video: image analysis
- Relational DB: SQL, JDBC/ODBC, J2EE, .NET
- Instrument: API

Applications (Services): Yet more heterogeneity
- RDBMS: Oracle, MySQL, PostgreSQL, etc.
- Open source: usually just a command-line interface
- Commercial software: API, scripting engine, web service
- Web services and web resources
- Integration is rarely seamless

Productivity vs. Innovation: Finding a balance
- Development and manufacturing prioritize productivity
- Research requires more innovation
- Standardization increases productivity
- Standardization limits innovation at the level it is applied
- Standardization promotes innovation at higher levels
- Workflows give a nice balance

What is a workflow? In bioinformatics
- A data-driven procedure consisting of one or more transformation processes (nodes).
- Can be represented as a directed graph. Direction is time: the order of transformations.
- A set of transformation rules.
- A flow of data from its source to a destination (or result) via a series of merges, joins, manipulations and interconnected tools (services).
- A specification designed in a Workflow Design System (modeling component) and run by a Workflow Management System (execution component).

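The directed-graph view of a workflow can be sketched in a few lines of Python. This is only an illustration of the definition, not the design of any particular workflow system: the function `run_workflow`, the node names and the toy transformations are all assumptions made up for the example. Each node receives the outputs of its upstream nodes, and the standard-library `graphlib.TopologicalSorter` supplies an execution order that respects the direction of the graph.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(nodes, edges, source_data):
    """Run every node once, in dependency order, and return all outputs.

    nodes: node name -> function taking a dict of named inputs
    edges: node name -> set of upstream node names it depends on
    """
    results = {}
    for name in TopologicalSorter(edges).static_order():
        deps = edges.get(name, ())
        if deps:
            inputs = {dep: results[dep] for dep in deps}
        else:
            # Source nodes have no upstream node, so they read the raw input.
            inputs = {"source": source_data}
        results[name] = nodes[name](inputs)
    return results


# Hypothetical four-node workflow: one source, two branches, one merge.
nodes = {
    "read": lambda inp: inp["source"].split(),
    "upper": lambda inp: [s.upper() for s in inp["read"]],
    "lengths": lambda inp: [len(s) for s in inp["read"]],
    "merge": lambda inp: list(zip(inp["upper"], inp["lengths"])),
}
edges = {
    "read": set(),
    "upper": {"read"},
    "lengths": {"read"},
    "merge": {"upper", "lengths"},
}
results = run_workflow(nodes, edges, "acgt ttg")
print(results["merge"])  # [('ACGT', 4), ('TTG', 3)]
```

The two branches (`upper` and `lengths`) do not depend on each other, which is the property a real management system could exploit to run them in parallel.
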
The UNIX Philosophy: Analogy to workflows
- Write programs that do one thing and do it well
- Write programs that work together
- Write programs to handle text streams, because that is the universal interface
  - Text formatted as XML
- Do one thing and do it well
  - A workflow is made up of nodes that do one thing and do it well
  - So is a Service Oriented Architecture (SOA)

An early attempt: Polymer. Unix shell scripts + Biojava objects
- Biojava is a large API of Java objects that are useful for bioinformatics.
- Biojava objects can be assembled into mini-programs that ‘do one thing and do it well’.
- Polymer combines these mini-programs into a very simple workflow using Unix shell scripts, much like Unix piping.
- Unfortunately it instantiates multiple JVMs.
- It lacks management and logging systems.

How could Polymer have been better?
- Provide an execution class and allow it to execute a script. This would mean only one JVM is launched and could allow for threading of branches in the script.
- Use Groovy script instead of Unix shell script. But Groovy hadn’t been invented at the time.
- At the same time, workflow management systems were emerging, which made Polymer redundant.

A production example: Drug Target Identification. Rational bioinformatics prioritization
- In collaboration with biologists, identify desirable characteristics of a drug target
- Integrate relevant data from large datasets
- Combine data and score each target based on the presence or absence of desirable characteristics
- Prioritize targets based on their overall score

A production example: Drug Target Identification. Rational bioinformatics prioritization
[Diagram: AssessDrugTarget]
- Diagram labels: Homology, Essentiality, Expression, Druggable domains, Structure, Pathways, Competitive advantage, Legal position, Literature, Biological feasibility, DB, Epidemiology, Assayability
- Steps: scientist defines desirable criteria; assign weights; produce a score for each gene; select targets for promotion to D1
Hasan S, Daugelat S, Rao PSS, Schreiber M (2006) Prioritizing genomic drug targets in pathogens: Application to Mycobacterium tuberculosis. PLoS Comput Biol 2(6):e61

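The "assign weights, produce a score for each gene, prioritize" steps can be illustrated with a minimal weighted presence/absence score. This is a sketch under stated assumptions, not the scoring scheme of AssessDrugTarget or of the cited paper: the criterion names, the weights and the genes `geneA` to `geneC` are all invented for the example.

```python
def score_target(features, weights):
    """Sum the weights of the desirable criteria that a target satisfies."""
    return sum(w for criterion, w in weights.items() if features.get(criterion, False))


def prioritize(targets, weights):
    """Return (gene, score) pairs sorted from highest to lowest score."""
    scored = {gene: score_target(f, weights) for gene, f in targets.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weights chosen by a scientist (illustrative values only).
weights = {
    "essential": 3,
    "druggable_domain": 2,
    "no_human_homologue": 2,
    "has_structure": 1,
}

# Hypothetical presence/absence data for three made-up genes.
targets = {
    "geneA": {"essential": True, "druggable_domain": True},
    "geneB": {"essential": True, "no_human_homologue": True, "has_structure": True},
    "geneC": {"has_structure": True},
}

ranking = prioritize(targets, weights)
print(ranking)  # [('geneB', 6), ('geneA', 5), ('geneC', 1)]
```

Keeping the weights in a separate table means the scientist can change the criteria without touching the scoring code.
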
Workflow Management System: Controlling the workflow
- A WMS should provide a means to execute a workflow in a controlled way.
- Ideally it will also provide:
  - Logging
  - Messaging
  - Security and provenance management
  - Scheduling and load balancing
  - Exception handling
  - Resource pooling (e.g. DB connections)
- Much of the above is easily accessible from a JEE/.NET application server (e.g. JBoss, GlassFish)

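Two of the listed services, logging and exception handling, can be sketched as a thin wrapper around each node. This is a toy illustration only; a real WMS or application server provides far more, and the function `run_node`, its `retries` parameter and the `flaky` node are hypothetical names introduced for the example.

```python
import logging
import time

logger = logging.getLogger("workflow")


def run_node(name, func, data, retries=1):
    """Run one node with logging, timing and basic exception handling.

    The node is retried up to `retries` times before the error is re-raised.
    """
    for attempt in range(1, retries + 2):
        start = time.perf_counter()
        try:
            result = func(data)
        except Exception:
            logger.exception("node %s failed (attempt %d)", name, attempt)
            if attempt > retries:
                raise  # out of retries: let the caller decide what to do
        else:
            elapsed = time.perf_counter() - start
            logger.info("node %s finished in %.3fs", name, elapsed)
            return result


# Hypothetical node that fails once (e.g. a dropped connection), then succeeds.
calls = []

def flaky(x):
    calls.append(x)
    if len(calls) == 1:
        raise ValueError("transient failure")
    return x * 2

answer = run_node("double", flaky, 21)
print(answer)  # 42
```
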
Workflow Design System: Building the workflow
- Many WMS are also a WDS, e.g. Taverna, Pipeline Pilot, InforSense
- A GUI that allows rapid workflow development
  - Increases productivity and encourages experimentation
  - Drag-and-drop assembly of a workflow
- Provides an API or scripting interface to allow the design of new nodes
- A simple scripting interface would also be an alternative to using a GUI for design

Simple Data Mining Workflow
- Each node has a discrete function.
- Internally the processing can be complex (e.g. a decision tree), but input and output are simple and generic.
- Self-documenting.
- Can be run by other users.

Workflows become nodes: Standing on the shoulders of giants
- Elements of workflows that are frequently re-used should become nodes.
- Workflow re-use; object-oriented workflows

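The idea that a frequently re-used workflow should itself become a node can be sketched by making a workflow callable, so that it can be dropped into another workflow like any other step. The `Workflow` class and the sequence-handling steps are assumptions made up for this illustration, and the sketch only handles a linear chain of steps.

```python
class Workflow:
    """A linear chain of steps; callable, so it can be a step in another workflow."""

    def __init__(self, *steps):
        self.steps = steps

    def __call__(self, data):
        for step in self.steps:
            data = step(data)
        return data


def count_gc(sequence):
    """Count G and C characters in a sequence string."""
    return sequence.count("G") + sequence.count("C")


# A small, re-usable workflow: clean up a raw sequence string.
normalize = Workflow(str.strip, str.upper)

# The whole `normalize` workflow is used as a single node here.
gc_pipeline = Workflow(normalize, count_gc)

print(gc_pipeline("  acgtgc \n"))  # 4
```
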
Example: From Arrays to Pathways. Using whole workflows as nodes
- Process an array and find the over-represented KEGG pathways and NCBI processes.

Workflow design systems promote rapid development
- Finding orthologues and paralogues using whole-genome pairwise BLAST.
- Development of the workflow took about 5 minutes.

Workflow design systems promote experimentation
- Mind map data analysis

Integration Via Ontology
- Workflows in bioinformatics typically do a lot of integration before and/or after analysis.
- Integration is normally done using joins and filters, using equality and Boolean operations.
  - E.g. type = protease OR type = serine protease …
- Joins and filters should be able to be evaluated using an ontology.
  - E.g. filtering for proteases would include all subconcepts automatically.
- Data sets could be quickly mapped using custom ontologies.

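An ontology-aware filter can be sketched as "expand the term to all of its subconcepts, then filter as usual". The tiny parent-to-children table below is a made-up stand-in for a real ontology, and the function names and record layout are assumptions for illustration only.

```python
def subconcepts(term, children):
    """Return `term` plus every concept below it in the ontology."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, ()))
    return found


def filter_by_concept(records, term, children, key="type"):
    """Keep records whose `key` is the term or any of its subconcepts."""
    wanted = subconcepts(term, children)
    return [r for r in records if r.get(key) in wanted]


# Hypothetical toy ontology: parent concept -> direct child concepts.
children = {
    "protease": ["serine protease", "cysteine protease"],
    "serine protease": ["trypsin-like protease"],
}

# Hypothetical records to be filtered.
records = [
    {"id": "p1", "type": "serine protease"},
    {"id": "p2", "type": "kinase"},
    {"id": "p3", "type": "trypsin-like protease"},
    {"id": "p4", "type": "protease"},
]

hits = filter_by_concept(records, "protease", children)
print([r["id"] for r in hits])  # ['p1', 'p3', 'p4']
```

Compared with the explicit form `type = protease OR type = serine protease …`, the filter stays the same when new subconcepts are added to the ontology.
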
Simplifying Service Integration: Expose an API
- All programs likely to be called by a workflow management system should publish a web service or expose a scripting API.
  - Easier to learn than a full Java or C API.
  - Should be based on an existing scripting language, not a new one: Python, Groovy, Ruby or Perl.
- While you are at it, expose your stack via the scripting language.
  - Imagine what could be done with BLAST if the stack could be manipulated via scripting.

Web Services and Service Oriented Architecture: ‘Outsourcing your processing’
- Web services
  - Services can reside on different servers
  - Platform independent
  - HTTP protocol
  - CGI, REST, XML-RPC, SOAP
  - SOAP is the easiest to generically connect to and parse
  - Results are available as XML
- Service Oriented Architecture
  - Usually implies web services
  - SOA promotes re-use and simplifies maintenance
  - Bottleneck shifts from CPU time to network availability

Resource Oriented Architecture: Outsourcing your data warehouse
- Bioinformatics is very resource intensive
- ROA simplifies maintenance and removes the need for synchronization.
- Many resources are now accessible by web services in XML format

Resource Oriented Architecture: The challenges
- Network latency can become a major problem
  - Intelligent caching and increased network speed are a must
- Requires resource discovery and cross referencing
  - RDF and ontology will play an increasingly important role
  - Workflow management systems will need to understand these
- Increasingly, workflows will make use of loosely coupled, interoperable resources and services.

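The simplest form of caching against a remote resource can be sketched with the standard-library `functools.lru_cache`. The function `fetch_remote` is a hypothetical stand-in for a slow web-service call (no network is used here), and the accession `ACC0001` is invented. Note that this sketch never expires entries; "intelligent" caching would also need to decide when a cached record is stale.

```python
from functools import lru_cache

remote_calls = {"count": 0}  # counts how often the "remote" resource is hit


def fetch_remote(accession):
    """Stand-in for a slow web-service request (hypothetical; no network I/O)."""
    remote_calls["count"] += 1
    return f"record for {accession}"


@lru_cache(maxsize=1024)
def fetch_cached(accession):
    """Return the record, contacting the remote resource only on a cache miss."""
    return fetch_remote(accession)


first = fetch_cached("ACC0001")
second = fetch_cached("ACC0001")  # served from the local cache
print(remote_calls["count"])  # 1
```
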
Business Processes: From proactive to reactive
- Business processes are long-running, asynchronous processes
- Typically they react to events, e.g. a change in a stock price.
  - ‘Push’ vs ‘Pull’ model of data access.
- Known as ‘programming in the large’
- Defined using BPEL with very heavy use of SOA and ROA
- Currently, most workflows are explicitly executed, ‘short running’, synchronous processes
- Bioinformatics will increasingly use business processes
  - React to streaming machine data
  - Continuously process literature or database updates

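The ‘push’ model can be illustrated with a minimal publish/subscribe sketch: instead of a user explicitly running a workflow to pull data, a handler is registered once and runs whenever an event arrives. The `EventBus` class, the topic names and the accession are assumptions made up for this example. It is synchronous and in-process, so it only shows the reactive shape of the idea, not a BPEL engine or a long-running asynchronous process.

```python
class EventBus:
    """Tiny in-process publish/subscribe hub."""

    def __init__(self):
        self._handlers = {}

    def subscribe(self, topic, handler):
        """Register a handler to be called for every event on `topic`."""
        self._handlers.setdefault(topic, []).append(handler)

    def publish(self, topic, payload):
        """Push an event to every handler subscribed to `topic`."""
        for handler in self._handlers.get(topic, []):
            handler(payload)


processed = []
bus = EventBus()

# The "workflow" reacts to database updates; nobody has to start it by hand.
bus.subscribe("database_update", lambda acc: processed.append(f"re-annotated {acc}"))

bus.publish("database_update", "ACC0001")
bus.publish("literature_update", "new abstract")  # no subscriber, so ignored
print(processed)  # ['re-annotated ACC0001']
```
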
Web Service Choreography: Will it be relevant to bioinformatics?
- Business processes and workflows are ‘orchestrations’
  - Scope is limited to one participant
  - The BP or the workflow talks to other participants but doesn’t care how they do their job or how they are managed.
- Choreography involves the management of several loosely coupled BPs
  - A network of long-running asynchronous BPs that react to the behavior of their peers.
- Choreography of workflows would require a standard workflow description or exposure of a workflow as a business process
[Diagram: Web Service, BP, Choreography compared with Node, Workflow, ???; one to many]

Conclusions: Design and management
- Workflows are created using a workflow design system and executed on a workflow management system
- A well-designed workflow management system can considerably increase productivity
  - Promotes workflow re-use and helps organize a multi-user environment
- A good design system allows rapid development of a workflow
- A good design system promotes experimentation and data exploration

Conclusions: The future
- Ontology will play an increasing role in data integration
  - Join and Filter operations that can reason over an ontology model
- Business processes and web choreography will become more relevant to bioinformatics
  - ‘Live’ data favors programming ‘in the large’
  - Workflows exposed as business processes
  - Network speed and optimal caching are key
- All of these approaches have been used before
  - Used and proven in business intelligence
- Bioinformatics needs to acquaint itself with modern IT practice and stop re-inventing technology